Find unique characters / lines



  • I have a list of ~4000 characters. About 99% of the entries are duplicates, but 5 characters are unique and never repeat. I need a way to find those 5 characters.
    Example -
    The list looks something like this:







    I need to find that unique line/character (世) and either take it out of there or remove everything besides it,
    so at the end it will look something like this:

    I spent some time searching and honestly couldn't find a solution. The closest thing I found is removing all the duplicates, but that keeps one copy of every character and leaves a list of ~2000 characters, which looks something like this:



    So is there a way to do this?



  • Hello, @カヒノビチアレクセイ

    Not very difficult, indeed !

    If you don't mind your unique CJK characters ending up sorted, here is a way to achieve it, very quickly :-))

    First of all, just back up your original list ( a safe habit to adopt, in any case ! )

    Now, let's suppose you have the following list of CJK characters. After each character and a space, I just added its Unicode code-point

    丰 4E30
    不 4E0D
    丆 4E06
    与 4E0E
    不 4E0D
    丰 4E30
    且 4E14
    世 4E16
    中 4E2D
    且 4E14
    与 4E0E
    丰 4E30
    丟 4E1F
    中 4E2D
    与 4E0E
    中 4E2D
    丆 4E06
    丰 4E30
    

    First, perform a classical sort, with the menu option Edit > Line Operations > Sort lines Lexicographically Ascending. We get, immediately, the sorted text, below :

    丆 4E06
    丆 4E06
    不 4E0D
    不 4E0D
    与 4E0E
    与 4E0E
    与 4E0E
    且 4E14
    且 4E14
    世 4E16
    丟 4E1F
    中 4E2D
    中 4E2D
    中 4E2D
    丰 4E30
    丰 4E30
    丰 4E30
    丰 4E30
    

    Now :

    • Move back to the very beginning of your file ( Ctrl + Home )

    • Open the Replace dialog ( Ctrl + H )

    • In the Find what: zone, paste or type the regex (?-s)^(.+\R)\1+

    • Leave the Replace with: zone EMPTY

    • Select the Regular expression search mode

    • Click on the Replace All button

    => You should get only the two lines below :

    世 4E16
    丟 4E1F
    

    Et voilà !! Only the two unique characters of the original list remain :-))
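If the sorted order does matter, the same result can be reached outside Notepad++ by counting lines instead of sorting them. The following is only a minimal Python sketch, not part of the method above: the helper name `lines_occurring_once` is hypothetical, and it assumes the list is already loaded as one string per line.

```python
from collections import Counter


def lines_occurring_once(lines):
    """Return the lines that appear exactly once, in their original order.

    Hypothetical helper for illustration; `lines` is assumed to be a list
    of strings, one per line of the original file.
    """
    counts = Counter(lines)  # how many times each distinct line occurs
    return [line for line in lines if counts[line] == 1]


# The sample list from this post, in its original (unsorted) order.
sample = [
    "丰 4E30", "不 4E0D", "丆 4E06", "与 4E0E", "不 4E0D", "丰 4E30",
    "且 4E14", "世 4E16", "中 4E2D", "且 4E14", "与 4E0E", "丰 4E30",
    "丟 4E1F", "中 4E2D", "与 4E0E", "中 4E2D", "丆 4E06", "丰 4E30",
]

for line in lines_occurring_once(sample):
    print(line)
```

Because no sort is involved, the surviving lines keep the order they had in the original list.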


    Notes :

    • The first part (?-s) is a modifier which means that any dot will match a single standard character and never an EOL character

    • Then, the ^ symbol is a zero-length assertion, which means beginning of line

    • Now, the part (.+\R) represents a non-empty range of consecutive standard characters, followed by its EOL character(s). As the current complete line is enclosed in parentheses, it’s stored as group 1

    • Finally, the part \1+ is a repeated back-reference to group 1, which matches one or more consecutive lines identical to the first one !

    • As the replacement zone is EMPTY, every run of two or more identical lines is deleted entirely, first occurrence included, so only the lines that occur exactly once survive !
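To see the regex at work outside Notepad++, here is a small Python sketch of the same sort-then-replace idea. It rests on a few assumptions: Python's `re` module has no `\R`, so `\n` stands in for the line break; the dot already excludes `\n` by default, which plays the role of `(?-s)`; and `re.MULTILINE` is needed so that `^` matches at the start of every line. Every line, including the last one, is given a trailing line break.

```python
import re

# The sample list from this post, in its original (unsorted) order.
lines = [
    "丰 4E30", "不 4E0D", "丆 4E06", "与 4E0E", "不 4E0D", "丰 4E30",
    "且 4E14", "世 4E16", "中 4E2D", "且 4E14", "与 4E0E", "丰 4E30",
    "丟 4E1F", "中 4E2D", "与 4E0E", "中 4E2D", "丆 4E06", "丰 4E30",
]

# Step 1: sort, so that identical lines become adjacent.
# Each line ends with "\n", so the last duplicate can also be matched.
sorted_text = "".join(line + "\n" for line in sorted(lines))

# Step 2: delete every run of two or more identical consecutive lines.
# "\n" replaces \R here; group 1 is one whole line with its line break.
pattern = re.compile(r"^(.+\n)\1+", re.MULTILINE)
unique_text = pattern.sub("", sorted_text)

print(unique_text, end="")
```

On this sample, the only lines left are the ones for 世 and 丟, the same two lines the Replace All step leaves in Notepad++.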

    Best Regards,

    guy038

