Find unique characters / lines

    カヒノビチアレクセイ
    Aug 6, 2017, 1:08 PM

    So, I have a list of ~4000 characters, one per line. About 99% of the entries are duplicates, but 5 characters appear only once. I need a way to find those 5 characters.
    Example -
    The list looks something like this:
    不
    不
    与
    与
    且
    世
    且
    I need to find that unique line/character (世) and either extract it or remove everything else,
    so that the end result looks something like this:

    世

    I spent some time searching and honestly couldn't find a solution. The closest thing I found is removing all the duplicates, but that leaves a list of ~2000 distinct characters, which looks something like this:
    不
    与
    世
    且

    So is there a way to do this?

      guy038
      Aug 7, 2017, 5:15 PM, last edited by guy038 Aug 7, 2017, 5:17 PM

      Hello, @カヒノビチアレクセイ

      This is not very difficult, indeed!

      If you don't mind the unique CJK characters ending up sorted, here is a quick way to do it :-))

      First of all, back up your original list (a safe habit in any case!)

      Now, let's suppose you have the following list of CJK characters. After a space, I have added the Unicode code point of each character:

      丰 4E30
      不 4E0D
      丆 4E06
      与 4E0E
      不 4E0D
      丰 4E30
      且 4E14
      世 4E16
      中 4E2D
      且 4E14
      与 4E0E
      丰 4E30
      丟 4E1F
      中 4E2D
      与 4E0E
      中 4E2D
      丆 4E06
      丰 4E30
      

      First, perform a classical sort with the menu option Edit > Line Operations > Sort Lines Lexicographically Ascending. This immediately gives the sorted text below, in which identical lines are now adjacent:

      丆 4E06
      丆 4E06
      不 4E0D
      不 4E0D
      与 4E0E
      与 4E0E
      与 4E0E
      且 4E14
      且 4E14
      世 4E16
      丟 4E1F
      中 4E2D
      中 4E2D
      中 4E2D
      丰 4E30
      丰 4E30
      丰 4E30
      丰 4E30
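
For readers who want to reproduce the sort step outside the editor, here is a minimal Python sketch. It assumes that sorting by Unicode code point, which is what Python's `sorted` does for strings, gives the same grouping as the menu option for this sample; the variable names are illustrative only.

```python
# Sample list from the post: each character followed by its code point.
chars = "丰不丆与不丰且世中且与丰丟中与中丆丰"
lines = [f"{ch} {ord(ch):04X}" for ch in chars]

# Python compares strings code point by code point, so identical
# lines always end up next to each other after sorting.
sorted_lines = sorted(lines)

print("\n".join(sorted_lines))
```

The only property the next step relies on is that duplicates become adjacent; the exact ordering between different characters does not matter.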
      

      Now :

      • Move back to the very beginning of your file ( Ctrl + Home )

      • Open the Replace dialog ( Ctrl + H )

      • In the Find what: field, paste or type the regex (?-s)^(.+\R)\1+

      • Leave the Replace with: field EMPTY

      • Select the Regular expression search mode

      • Click on the Replace All button

      => You should be left with only the two lines below:

      世 4E16
      丟 4E1F
      

      Et voilà! Only the two unique characters of the original list remain :-))
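
The same replacement can be tried in Python as a rough sketch. Two adaptations are assumptions of this example, not part of the Notepad++ recipe: Python's standard `re` module has no `\R`, so the line ending is written as `\r?\n`, and `re.MULTILINE` is needed so that `^` matches at the start of every line. The dot already excludes `\n` by default, so nothing equivalent to `(?-s)` is required.

```python
import re

# Sorted sample from the post, every line terminated by "\n".
chars = "丆丆不不与与与且且世丟中中中丰丰丰丰"
sorted_text = "".join(f"{ch} {ord(ch):04X}\n" for ch in chars)

# Group 1 is one whole line including its line ending; \1+ requires
# at least one immediate repeat of that exact line.
pattern = re.compile(r"^(.+\r?\n)\1+", re.MULTILINE)

# Replacing with the empty string mirrors the empty "Replace with" field.
unique_only = pattern.sub("", sorted_text)

print(unique_only, end="")
# 世 4E16
# 丟 4E1F
```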


      Notes:

      • The first part, (?-s), is a modifier which ensures that the dot matches any single standard character but never an EOL character

      • Then, the ^ symbol is a zero-length assertion which matches the beginning of a line

      • Next, the part (.+\R) matches a non-empty range of consecutive standard characters, followed by the line's EOL character(s). As the whole line, EOL included, is enclosed in parentheses, it is stored as group 1

      • Finally, the part \1+ is a repeated back-reference to group 1, which matches one or more consecutive lines identical to the first one. This is why the list must be sorted first: duplicates have to be adjacent

      • As the replacement field is EMPTY, every run of two or more identical lines is deleted as a whole, first occurrence included, so only the lines that occur exactly once remain
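
The notes above can be mirrored piece by piece in a commented pattern. This is only a Python sketch under stated assumptions: `DUPLICATE_RUN` and `keep_unique_lines` are hypothetical names, line endings are assumed to be LF or CRLF, and `[^\r\n]+` with `\r?\n` stands in for `.+\R` because Python's `re` module does not provide `\R`.

```python
import re

DUPLICATE_RUN = re.compile(
    r"""
    ^               # start of a line (requires re.MULTILINE)
    (               # group 1: one whole line ...
      [^\r\n]+      #   one or more characters that are not EOL characters
      \r?\n         #   ... followed by its line ending
    )
    \1+             # the same line again, one or more times
    """,
    re.MULTILINE | re.VERBOSE,
)


def keep_unique_lines(sorted_text: str) -> str:
    """Delete every run of two or more identical adjacent lines."""
    return DUPLICATE_RUN.sub("", sorted_text)


# Every line ends with an EOL: only "b" survives.
print(repr(keep_unique_lines("a\na\nb\nc\nc\nc\n")))

# The final "c" has no EOL, so it is not identical to group 1
# ("c" plus its line ending) and is left behind.
print(repr(keep_unique_lines("a\na\nb\nc\nc\nc")))
```

The second call illustrates a consequence of storing the EOL inside group 1: a last line without a trailing line break cannot be matched as a repeat. Ending the list with a line break before running the replacement avoids this in the sketch, and the same precaution seems sensible in the editor.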

      Best Regards,

      guy038

      The Community of users of the Notepad++ text editor.