<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Find unique characters &#x2F; lines]]></title><description><![CDATA[<p dir="auto">I have a list of ~4000 characters. About 99% of the entries are duplicates, but there are 5 unique characters that don’t repeat. I need a way to find those 5 characters.<br />
Example -<br />
The list looks something like this:<br />
不<br />
不<br />
与<br />
与<br />
且<br />
世<br />
且<br />
I need to find that unique line/character (世) and either extract it or remove everything else,<br />
so that the end result looks like this:</p>
<p dir="auto">世</p>
<p dir="auto">I spent some time searching and honestly couldn’t find a solution. The closest thing I found is removing all the duplicates, but that leaves a list of ~2000 distinct characters, which looks something like this:<br />
不<br />
与<br />
世<br />
且</p>
<p dir="auto">So is there a way to do this?</p>
]]></description><link>https://community.notepad-plus-plus.org/topic/14279/find-unique-characters-lines</link><generator>RSS for Node</generator><lastBuildDate>Tue, 14 Apr 2026 14:55:56 GMT</lastBuildDate><atom:link href="https://community.notepad-plus-plus.org/topic/14279.rss" rel="self" type="application/rss+xml"/><pubDate>Sun, 06 Aug 2017 13:08:00 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to Find unique characters &#x2F; lines on Mon, 07 Aug 2017 17:17:28 GMT]]></title><description><![CDATA[<p dir="auto">Hello, <a class="plugin-mentions-user plugin-mentions-a" href="https://community.notepad-plus-plus.org/uid/9453">@カヒノビチアレクセイ</a></p>
<p dir="auto">Not very difficult, indeed !</p>
<p dir="auto">If you don’t mind your <strong>unique CJK</strong> characters ending up <strong>sorted</strong>, here is a <strong>quick</strong> way to achieve it :-))</p>
<p dir="auto">First of all, back up your <strong>original</strong> list ( A <strong>safe</strong> habit to adopt, in any case ! )</p>
<p dir="auto">Now, let’s suppose you have the following list of <strong>CJK</strong> characters. After a space, I added the <strong>Unicode code-point</strong> of each character :</p>
<pre><code class="language-diff">丰 4E30
不 4E0D
丆 4E06
与 4E0E
不 4E0D
丰 4E30
且 4E14
世 4E16
中 4E2D
且 4E14
与 4E0E
丰 4E30
丟 4E1F
中 4E2D
与 4E0E
中 4E2D
丆 4E06
丰 4E30
</code></pre>
<p dir="auto">First, perform a <strong>classical</strong> sort, with the menu option <strong>Edit &gt; Line Operations &gt; Sort lines Lexicographically Ascending</strong>, so that <strong>identical</strong> lines become <strong>adjacent</strong>. We immediately get the <strong>sorted</strong> text, below :</p>
<pre><code class="language-diff">丆 4E06
丆 4E06
不 4E0D
不 4E0D
与 4E0E
与 4E0E
与 4E0E
且 4E14
且 4E14
世 4E16
丟 4E1F
中 4E2D
中 4E2D
中 4E2D
丰 4E30
丰 4E30
丰 4E30
丰 4E30
</code></pre>
<p dir="auto">Now :</p>
<ul>
<li>
<p dir="auto">Move back to the <strong>very beginning</strong> of your file ( <strong><code>Ctrl + Home</code></strong> )</p>
</li>
<li>
<p dir="auto">Open the <strong>Replace</strong> dialog ( <strong><code>Ctrl + H</code></strong> )</p>
</li>
<li>
<p dir="auto">In the <strong>Find what:</strong> zone, paste or type the regex <strong><code>(?-s)^(.+\R)\1+</code></strong></p>
</li>
<li>
<p dir="auto">Leave the <strong>Replace with:</strong> zone <strong><code>EMPTY</code></strong></p>
</li>
<li>
<p dir="auto">Select the <strong>Regular expression</strong> search mode</p>
</li>
<li>
<p dir="auto">Click on the <strong>Replace All</strong> button</p>
</li>
</ul>
<p dir="auto">=&gt; You should get, only, the <strong>two</strong> lines, below :</p>
<pre><code class="language-diff">世 4E16
丟 4E1F
</code></pre>
<p dir="auto">Et voilà !! Only the <strong>two unique</strong> characters of the <strong>original</strong> list remain :-))</p>
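<p dir="auto">For anyone who prefers a script, the same sort-then-replace idea can be sketched in Python. This is only a minimal sketch under a few assumptions : the function name <code>unique_lines_sorted</code> is hypothetical, the sample omits the code-point column, and, because Python’s <code>re</code> module has no <code>\R</code>, the lines are re-joined with <code>\n</code>, which then stands in for the EOL in the regex.</p>
<pre><code class="language-python">import re


def unique_lines_sorted(text):
    """Return the lines that occur exactly once, in sorted order."""
    # Sort so that identical lines become adjacent (sorted() orders by code point).
    lines = sorted(text.splitlines())
    # Re-join with "\n" after every line, so the last line also ends with an EOL.
    joined = "".join(line + "\n" for line in lines)
    # (?m) lets ^ match at each line start; \1+ consumes the adjacent repeats.
    remaining = re.sub(r"(?m)^(.+\n)\1+", "", joined)
    return remaining.splitlines()


# Same characters as the example list above, without the code-point column.
sample = "丰\n不\n丆\n与\n不\n丰\n且\n世\n中\n且\n与\n丰\n丟\n中\n与\n中\n丆\n丰\n"
print(unique_lines_sorted(sample))  # ['世', '丟']
</code></pre>
<p dir="auto">As in the Notepad++ steps, the result comes out sorted, and empty lines are left alone because <code>.+</code> needs at least one character.</p>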
<hr />
<p dir="auto"><strong>Notes</strong> :</p>
<ul>
<li>
<p dir="auto">The first part <strong><code>(?-s)</code></strong> is a modifier which implies that any <strong>dot</strong> will match a <strong>single standard</strong> character and not <strong>EOL</strong> characters</p>
</li>
<li>
<p dir="auto">Then, the <strong><code>^</code></strong> symbol is a <strong>zero-length</strong> assertion, which means <strong>beginning</strong> of line</p>
</li>
<li>
<p dir="auto">Now, the part <strong><code>(.+\R)</code></strong> represents a <strong>non-empty</strong> range of <strong>consecutive standard</strong> characters, followed by its <strong>EOL</strong> character(s). As the current complete line is <strong>enclosed</strong> in parentheses, it’s stored as <strong>group 1</strong></p>
</li>
<li>
<p dir="auto">Finally, the part <strong><code>\1+</code></strong>, is a <strong>repeated back-reference</strong> to <strong><code>group 1</code></strong>, which looks for any <strong>non-empty</strong> range of <strong>consecutive</strong> lines, <strong>identical</strong> to the <strong>first</strong> one !</p>
</li>
<li>
<p dir="auto">As the replacement zone is <strong><code>EMPTY</code></strong>, each run of <strong>two or more</strong> identical lines, <strong>first</strong> occurrence included, is simply <strong>deleted</strong>, so only the lines which occur <strong>once</strong> survive !</p>
</li>
</ul>
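<p dir="auto">Note that the back-reference only catches <strong>consecutive</strong> duplicates, which is why the list has to be sorted first. If the <strong>original order</strong> must be kept, a different technique, counting the occurrences of each line, gives the same characters without any sort. Here is a minimal Python sketch of that alternative ( the function name <code>unique_lines_in_order</code> is hypothetical and this is not the regex method described above ) :</p>
<pre><code class="language-python">from collections import Counter


def unique_lines_in_order(text):
    """Return the lines that occur exactly once, in their original order."""
    lines = text.splitlines()
    # Count how many times each distinct line appears.
    counts = Counter(lines)
    # Keep a line only if it was seen exactly once.
    return [line for line in lines if counts[line] == 1]


# The short list from the original question.
sample = "不\n不\n与\n与\n且\n世\n且\n"
print(unique_lines_in_order(sample))  # ['世']
</code></pre>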
<p dir="auto">Best Regards,</p>
<p dir="auto">guy038</p>
]]></description><link>https://community.notepad-plus-plus.org/post/26155</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/26155</guid><dc:creator><![CDATA[guy038]]></dc:creator><pubDate>Mon, 07 Aug 2017 17:17:28 GMT</pubDate></item></channel></rss>