Find unique characters / lines
-
So, I have a list of ~4000 characters. About 99% of the list consists of duplicates, but there are 5 unique characters that don't repeat. I need a way to find those 5 characters.
Example -
The list looks something like this:
不
不
与
与
且
世
且
I need to find that unique line/character (世) and either extract it or remove everything besides it, so that at the end the list looks something like this:
世
I spent some time searching and I honestly couldn't find a solution. The closest thing I could find is removing all the duplicates, but that leaves a list of ~2000 distinct characters, which looks something like this:
不
与
世
且
So is there a way to do this?
-
Hello, @カヒノビチアレクセイ
Not very difficult, indeed !
If you don't mind your unique CJK characters ending up sorted, here is a way to achieve it, very quickly :-))
First of all, just back up your original list ( A safe behaviour to adopt, in any case ! )
Now, let's suppose you have the following list of CJK characters. After a space, I just added the Unicode code point of each character :
丰 4E30
不 4E0D
丆 4E06
与 4E0E
不 4E0D
丰 4E30
且 4E14
世 4E16
中 4E2D
且 4E14
与 4E0E
丰 4E30
丟 4E1F
中 4E2D
与 4E0E
中 4E2D
丆 4E06
丰 4E30
Next, perform a classical sort, with the menu option Edit > Line Operations > Sort Lines Lexicographically Ascending. We immediately get the sorted text below :
丆 4E06
丆 4E06
不 4E0D
不 4E0D
与 4E0E
与 4E0E
与 4E0E
且 4E14
且 4E14
世 4E16
丟 4E1F
中 4E2D
中 4E2D
中 4E2D
丰 4E30
丰 4E30
丰 4E30
丰 4E30
Now :
- Move back to the very beginning of your file ( Ctrl + Home )
- Open the Replace dialog ( Ctrl + H )
- In the Find what: zone, paste or type the regex (?-s)^(.+\R)\1+
- Leave the Replace with: zone EMPTY
- Select the Regular expression search mode
- Click on the Replace All button
=> You should get only the two lines below :
世 4E16
丟 4E1F
Et voilà !! Only the two unique characters of the original list remain :-))
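For readers who would rather script this outside Notepad++, here is a minimal Python sketch. It is not the procedure above: it swaps the sort-plus-regex approach for a plain occurrence count, which also keeps the original order of the list. The function name unique_lines and the sample string are made up for the illustration.

```python
from collections import Counter


def unique_lines(text):
    """Return the lines that occur exactly once, in their original order."""
    # Ignore blank lines so a trailing line break does not count as a line.
    lines = [line for line in text.splitlines() if line.strip()]
    counts = Counter(lines)
    return [line for line in lines if counts[line] == 1]


# Same characters as the sample list, one per line (code points omitted).
sample = "\n".join("丰不丆与不丰且世中且与丰丟中与中丆丰")
print(unique_lines(sample))  # ['世', '丟']
```

Counting occurrences avoids the sort altogether, so it may be the simpler choice when the order of the original list matters.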
Notes :
- The first part (?-s) is a modifier which means that any dot will match a single standard character and not EOL characters
- Then, the ^ symbol is a zero-length assertion, which means beginning of line
- Now, the part (.+\R) represents a non-empty range of consecutive standard characters, followed by its EOL character(s). As the current complete line is enclosed in parentheses, it's stored as group 1
- Finally, the part \1+ is a repeated back-reference to group 1, which looks for any non-empty range of consecutive lines, identical to the first one !
- As the replacement zone is EMPTY, every line which occurs more than once ( > 1 ) is simply deleted, its first occurrence included !
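The behaviour described in these notes can be checked with a small Python sketch. It rests on some assumptions: Python's re module has no \R, so "\n" stands in for it, and re.MULTILINE replaces the Notepad++ default in which ^ matches at every line start. The name drop_repeated_lines is hypothetical.

```python
import re

# Assumption: lines end with "\n" (Python's re module has no \R).
# re.MULTILINE lets ^ match at every line start, and the dot excludes
# newlines by default, which is what (?-s) enforces in the regex above.
REPEATED_RUN = re.compile(r"^(.+\n)\1+", re.MULTILINE)


def drop_repeated_lines(text):
    """Sort the lines, then delete every run of two or more identical lines."""
    lines = sorted(line for line in text.splitlines() if line)
    # Terminate every line, the last one included, so that a repeat at the
    # very end of the text still matches the back-reference.
    sorted_text = "".join(line + "\n" for line in lines)
    return REPEATED_RUN.sub("", sorted_text)


# Same characters as the sample list, one per line (code points omitted).
sample = "\n".join("丰不丆与不丰且世中且与丰丟中与中丆丰")
print(drop_repeated_lines(sample))
```

Because group 1 includes the line break, a final line without one cannot match the back-reference. The same reasoning suggests that, in Notepad++, the file should end with a line break before the replacement is run.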
Best Regards,
guy038
-