    Find unique characters / lines

    • カヒノビチアレクセイ

      So, I have a list of ~4000 characters. About 99% of the entries are duplicates, but 5 characters are unique and never repeat. I need a way to find those 5 characters.
      Example -
      The list looks something like this:
      不
      不
      与
      与
      且
      世
      且
      I need to find the unique line/character (世) and either extract it or remove everything except it,
      so that at the end it looks something like this:

      世

      I spent some time searching and honestly couldn’t find a solution. The closest thing I could find is removing all the duplicates, but that leaves a list of ~2000 distinct characters, which looks something like this:
      不
      与
      世
      且

      So is there a way to do this?

      • guy038

        Hello, @カヒノビチアレクセイ

        Not very difficult, indeed !

        If you don’t mind your unique CJK characters ending up sorted, here is a very quick way to achieve it :-))

        First of all, back up your original list ( a safe habit to adopt, in any case ! )

        Now, let’s suppose you have the following list of CJK characters. After each character I have added a space and its Unicode code point, for reference :

        丰 4E30
        不 4E0D
        丆 4E06
        与 4E0E
        不 4E0D
        丰 4E30
        且 4E14
        世 4E16
        中 4E2D
        且 4E14
        与 4E0E
        丰 4E30
        丟 4E1F
        中 4E2D
        与 4E0E
        中 4E2D
        丆 4E06
        丰 4E30
        

        First, sort the lines with the menu option Edit > Line Operations > Sort Lines Lexicographically Ascending. This immediately gives the sorted text below :

        丆 4E06
        丆 4E06
        不 4E0D
        不 4E0D
        与 4E0E
        与 4E0E
        与 4E0E
        且 4E14
        且 4E14
        世 4E16
        丟 4E1F
        中 4E2D
        中 4E2D
        中 4E2D
        丰 4E30
        丰 4E30
        丰 4E30
        丰 4E30
        

        Now :

        • Move back to the very beginning of your file ( Ctrl + Home )

        • Open the Replace dialog ( Ctrl + H )

        • In the Find what: zone, paste or type the regex (?-s)^(.+\R)\1+

        • Leave the Replace with: zone EMPTY

        • Select the Regular expression search mode

        • Click on the Replace All button

        => You should get only the two lines below :

        世 4E16
        丟 4E1F
        

        Et voilà !! Only the two unique characters of the original list remain :-))
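For anyone who wants to check the result outside Notepad++, here is a minimal Python sketch of the same two steps (sort, then delete every run of identical lines). It is only an approximation of the procedure above, under these assumptions: Python's `re` module has no `\R`, so the sketch normalises every line ending to `\n`; `sorted()` orders lines by code point, which gives the same order as shown above for this sample but is not guaranteed to match the Notepad++ sort on other input; and the function name `keep_unique_lines` is made up for the example.

```python
import re

# Sample list from the post: character, space, code point, in the original order.
text = """丰 4E30
不 4E0D
丆 4E06
与 4E0E
不 4E0D
丰 4E30
且 4E14
世 4E16
中 4E2D
且 4E14
与 4E0E
丰 4E30
丟 4E1F
中 4E2D
与 4E0E
中 4E2D
丆 4E06
丰 4E30
"""


def keep_unique_lines(text: str) -> str:
    """Sort the lines, then delete every run of two or more identical lines."""
    lines = text.splitlines()
    # Rebuild the text so that EVERY line, including the last one, ends with "\n".
    sorted_text = "".join(line + "\n" for line in sorted(lines))
    # (?m) lets ^ match at each line start. In Python "." never matches "\n"
    # unless DOTALL is set, which plays the role of (?-s) in the original regex.
    return re.sub(r"(?m)^(.+\n)\1+", "", sorted_text)


print(keep_unique_lines(text), end="")
```

With the sample above, only the `世 4E16` and `丟 4E1F` lines should be left.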


        Notes :

        • The first part (?-s) is a modifier which means that any dot matches a single standard character and never an EOL character

        • Then, the ^ symbol is a zero-length assertion which matches the beginning of a line

        • Now, the part (.+\R) matches a non-empty range of consecutive standard characters, followed by the EOL character(s) of the line. As this complete line is enclosed in parentheses, it’s stored as group 1

        • Finally, the part \1+ is a repeated back-reference to group 1, which matches one or more consecutive lines identical to the first one. This is why the list must be sorted first : identical lines have to be adjacent

        • As the replacement zone is EMPTY, each run of identical lines, including its first occurrence, is simply deleted, and the lines which occur only once are left untouched. Note that \R needs a real line-break, so the last line of the file should end with an EOL
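If the original order of the list matters, the same idea (keep only the lines that occur exactly once) can also be sketched without any sort or regex, by counting the lines first. This is an alternative technique, not the Notepad++ method described above; the helper name `unique_lines` is hypothetical, and the sample is the short list from the question.

```python
from collections import Counter


def unique_lines(lines):
    """Return the lines that occur exactly once, in their original order."""
    counts = Counter(lines)  # maps each line to its number of occurrences
    return [line for line in lines if counts[line] == 1]


# Short example list from the question.
sample = ["不", "不", "与", "与", "且", "世", "且"]
print(unique_lines(sample))  # ['世']
```

As nothing is sorted here, duplicates do not need to be adjacent, and a missing EOL on the last line makes no difference.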

        Best Regards,

        guy038

        The Community of users of the Notepad++ text editor.