Find unique characters / lines
-
So, I have a list of ~4000 characters. 99% of the list consists of duplicates, but there are 5 unique characters that don't repeat. I need a way to find those 5 characters.
Example:
The list looks something like this:
不
不
与
与
且
世
且
I need to find that unique line/character (世) and either extract it or remove everything besides it, so that at the end it looks something like this:
世
I spent some time searching and I honestly couldn't find a solution. The closest thing I could find is removing all the duplicates, but that leaves a list of ~2000 distinct characters which looks something like this:
不
与
世
且
So is there a way to do this?
-
Hello, @カヒノビチアレクセイ
Not very difficult, indeed!
If you don't mind your unique CJK characters ending up sorted, here is a very quick way to achieve it :-))
First of all, back up your original list (a safe habit to adopt in any case!)
Now, let's suppose you have the following list of CJK characters. After a space, I added the Unicode code point of each character:
丰 4E30
不 4E0D
丆 4E06
与 4E0E
不 4E0D
丰 4E30
且 4E14
世 4E16
中 4E2D
且 4E14
与 4E0E
丰 4E30
丟 4E1F
中 4E2D
与 4E0E
中 4E2D
丆 4E06
丰 4E30
First, perform a classical sort, with the menu option Edit > Line Operations > Sort Lines Lexicographically Ascending. We get, immediately, the sorted text below:
丆 4E06
丆 4E06
不 4E0D
不 4E0D
与 4E0E
与 4E0E
与 4E0E
且 4E14
且 4E14
世 4E16
丟 4E1F
中 4E2D
中 4E2D
中 4E2D
丰 4E30
丰 4E30
丰 4E30
丰 4E30
Now:
- Move back to the very beginning of your file (Ctrl + Home)
- Open the Replace dialog (Ctrl + H)
- In the Find what: zone, paste or type the regex (?-s)^(.+\R)\1+
- Leave the Replace with: zone EMPTY
- Select the Regular expression search mode
- Click on the Replace All button
=> You should get only the two lines below:
世 4E16
丟 4E1F
Et voilà!! Only the two unique characters of the original list remain :-))
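For anyone who wants to check the idea outside Notepad++, here is a minimal Python sketch of the same sort-then-delete procedure. It is only an illustration: the function name unique_lines and the sample string are made up for this example, and the pattern is adapted because Python's re module has no \R and does not accept (?-s) at the start of a pattern (its dot already skips line breaks by default).

```python
import re

def unique_lines(text):
    """Return the lines of `text` that occur exactly once, in sorted order."""
    # Sort so that identical lines become adjacent, like the Sort step above.
    lines = sorted(line for line in text.splitlines() if line)
    # Give every line, including the last one, a trailing newline,
    # otherwise the final line could never be matched by the pattern.
    sorted_text = "".join(line + "\n" for line in lines)
    # Adapted pattern: \R is written as \n, and re.MULTILINE lets ^ match
    # at the start of each line. Each run of identical lines is deleted whole.
    remaining = re.sub(r"^(.+\n)\1+", "", sorted_text, flags=re.MULTILINE)
    return remaining.splitlines()

# Hypothetical sample: the characters from the list above, without code points.
sample = "丰\n不\n丆\n与\n不\n丰\n且\n世\n中\n且\n与\n丰\n丟\n中\n与\n中\n丆\n丰"
print(unique_lines(sample))  # ['世', '丟']
```

Python sorts strings by code point, which gives the same order as the sorted list above for this sample.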
Notes:
- The first part, (?-s), is a modifier which means that any dot matches a single standard character and not EOL characters
- Then, the ^ symbol is a zero-length assertion which matches at the beginning of a line
- Now, the part (.+\R) represents a non-empty range of consecutive standard characters, followed by its EOL character(s). As the current complete line is enclosed in parentheses, it is stored as group 1
- Finally, the part \1+ is a repeated back-reference to group 1, which looks for any non-empty range of consecutive lines identical to the first one!
- As the replacement zone is EMPTY, every line that occurs more than once is simply deleted, the first occurrence together with its copies!
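If the sorted result is a problem, a different technique (counting occurrences, not the regex method described here) keeps the original order. The sketch below is only an illustration in Python; the function name unique_in_order and the sample list are assumptions made for the example.

```python
from collections import Counter

def unique_in_order(lines):
    """Return the items that occur exactly once, keeping their original order."""
    # Count how many times each line appears in the whole list.
    counts = Counter(lines)
    # Keep only the lines whose count is exactly one.
    return [line for line in lines if counts[line] == 1]

# Hypothetical sample: the unsorted characters from the list above.
sample = ["丰", "不", "丆", "与", "不", "丰", "且", "世", "中",
          "且", "与", "丰", "丟", "中", "与", "中", "丆", "丰"]
print(unique_in_order(sample))  # ['世', '丟']
```

No sort is needed here, because the counting pass does not depend on identical lines being adjacent.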
Best Regards,
guy038