# Find unique characters / lines

• So, I have a list of ~4000 characters. 99% of the list is duplicates, but there are 5 unique characters that don’t repeat. I need a way to find those 5 characters.
Example -
The list looks something like this:

I need to find that unique line/character (世) and either take it out of there or remove everything besides it,
so at the end it will look something like this:

I spent some time searching and honestly couldn’t find a solution. The closest thing I could find is removing all the duplicates, but that keeps one copy of every character and leaves a list of ~2000 characters, which looks something like this:

So is there a way to do this?

• Hello, @カヒノビチアレクセイ

Not very difficult, indeed !

If you don’t mind your unique CJK characters ending up sorted, here is a way to achieve it, very quickly :-))

First of all, just back up your original list ( A safe habit to adopt, in any case ! )

Now, let’s suppose you have the following list of CJK characters. I just added, after a space, the Unicode code-point of each character

``````
丰 4E30
不 4E0D
丆 4E06
与 4E0E
不 4E0D
丰 4E30
且 4E14
世 4E16
中 4E2D
且 4E14
与 4E0E
丰 4E30
丟 4E1F
中 4E2D
与 4E0E
中 4E2D
丆 4E06
丰 4E30
``````

First, perform a classic sort with the menu option Edit > Line Operations > Sort Lines Lexicographically Ascending. We immediately get the sorted text below, where identical lines are now adjacent :

``````
丆 4E06
丆 4E06
不 4E0D
不 4E0D
与 4E0E
与 4E0E
与 4E0E
且 4E14
且 4E14
世 4E16
丟 4E1F
中 4E2D
中 4E2D
中 4E2D
丰 4E30
丰 4E30
丰 4E30
丰 4E30
``````

Now :

• Move back to the very beginning of your file ( `Ctrl + Home` )

• Open the Replace dialog ( `Ctrl + H` )

• In the Find what: zone, paste or type the regex `(?-s)^(.+\R)\1+`

• Leave the Replace with: zone `EMPTY`

• Select the Regular expression search mode

• Click on the Replace All button

=> You should get only the two lines below :

``````
世 4E16
丟 4E1F
``````

Et voilà !! Only the two unique characters of the original list remain :-))
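For anyone who prefers to script this, here is a minimal Python sketch of the same two steps (sort, then delete every run of identical lines). It is only an illustration, not part of Notepad++ : the function name `keep_unique_lines` is made up, and since Python's `re` module has no `\R`, the sketch assumes every line ends with a plain `\n`.

```python
import re

# Sample list from this post, one "character code-point" pair per line.
lines = """丰 4E30
不 4E0D
丆 4E06
与 4E0E
不 4E0D
丰 4E30
且 4E14
世 4E16
中 4E2D
且 4E14
与 4E0E
丰 4E30
丟 4E1F
中 4E2D
与 4E0E
中 4E2D
丆 4E06
丰 4E30
""".splitlines()


def keep_unique_lines(lines):
    """Return the lines that occur exactly once, in sorted order (hypothetical helper)."""
    # Step 1: sort so that identical lines become adjacent,
    # and give every line, including the last one, a trailing "\n".
    sorted_text = "".join(line + "\n" for line in sorted(lines))
    # Step 2: delete each run of two or more identical consecutive lines.
    # (?m) lets ^ match at every line start; "." excludes "\n" by default,
    # which plays the role of (?-s); "\n" stands in for \R.
    remaining = re.sub(r"(?m)^(.+\n)\1+", "", sorted_text)
    return remaining.splitlines()


print(keep_unique_lines(lines))  # ['世 4E16', '丟 4E1F']
```

As in the Notepad++ procedure, the result comes out sorted, not in the original order of the list.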

Notes :

• The first part `(?-s)` is a modifier which implies that any dot will match a single standard character and not EOL characters

• Then, the `^` symbol is a zero-length assertion, which means beginning of line

• Now, the part `(.+\R)` represents a non-empty range of consecutive standard characters, followed by its EOL character(s). As the current complete line is enclosed in parentheses, it’s stored as group 1

• Finally, the part `\1+`, is a repeated back-reference to `group 1`, which looks for any non-empty range of consecutive lines, identical to the first one !

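• As the replacement zone is `EMPTY`, each whole run of identical lines ( the first occurrence and all its repetitions ) is simply deleted ! A line which occurs only once is never followed by a copy of itself, so it cannot match and is left untouched. Note that the last line of the file must end with a line break too, otherwise it cannot be matched by `\R`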
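To see what group 1 and the repeated back-reference actually capture, the small Python sketch below lists each run the pattern matches, together with the number of copies in it. It is an assumption-laden illustration: `\n` replaces `\R` ( not available in Python's `re` ), and the short `sorted_text` sample is made up from lines of the list above.

```python
import re

# A short, already sorted sample; every line ends with "\n".
sorted_text = "丆 4E06\n丆 4E06\n世 4E16\n中 4E2D\n中 4E2D\n中 4E2D\n"

pattern = re.compile(r"(?m)^(.+\n)\1+")

runs = []
for m in pattern.finditer(sorted_text):
    first_line = m.group(1)   # group 1: one complete line with its EOL
    whole_run = m.group(0)    # the first line plus all its identical copies
    copies = len(whole_run) // len(first_line)
    runs.append((first_line.rstrip("\n"), copies))

print(runs)  # [('丆 4E06', 2), ('中 4E2D', 3)]
```

The line `世 4E16` does not appear in the output: it has no identical neighbour, so the pattern never matches it and a Replace All would leave it in place.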

Best Regards,

guy038