    Collect duplicates in doc before sign

    • Richie

      Good day. Can you help me solve an issue? I need to collect all lines whose text before the # sign is duplicated.
      This is part of my doc. Thank you guys for any help, I appreciate it.

      Umbrella#17:30
      T-shirt#11:43
      T-shirt#12:04 Polo(M)
      T-shirt#14:32
      T-shirt#15:27
      Cap#12:47
      Jeans#10:43 LEVIS
      Jeans#12:42 Coll
      Jeans#15:27
      Gloves#14:41 Kids
      Coat#11:23 YD523(M)
      Coat#12:54 YD523(L)
      Jacket#14:41

      This is what I need

      T-shirt#11:43
      T-shirt#12:04 Polo(M)
      T-shirt#14:32
      T-shirt#15:27
      Jeans#10:43 LEVIS
      Jeans#12:42 Coll
      Jeans#15:27
      Coat#11:23 YD523(M)
      Coat#12:54 YD523(L)

      • Mark Olson

        @Richie
        How many lines of data do you have? The optimal solution for this problem can vary a lot depending on how much data you have.

        Also, are all the entries with duplicate values before the # consecutive? For example, are all T-shirt entries grouped together, or are there non-T-shirt entries between the first T-shirt entry and the last?

        Assuming the entries with duplicate values before # are grouped together, a reasonable solution (even for a large amount of data) would be as follows:

        1. Go to the find/replace form, Mark tab (Ctrl+M with default hotkeys)
        2. With Bookmark line selected and using Regular expression as search mode, enter (?-s)(^[^#\r\n]+)#.*\R(\1#.*$\R?)+ into the Find what box, and hit Mark all.
          • Notes on this regular expression (general resources available here):
          • (?-s) means that the . metacharacter won’t match newlines.
          • (^[^#\r\n]+)# matches, at the start of a line, a run of characters that are neither # nor a line-end character, followed by #. The text before the # is stored as capture group 1.
          • .*\R matches any number of non-newline characters, then a newline.
          • (\1#.*$\R?)+ matches one or more following lines that start with the same text as capture group 1, followed immediately by #, then any number of characters up to the end of the line ($), then an optional newline (\R? is optional because the last line of the file may not end with a newline).
        3. You will see all the lines with a duplicate entry before # marked.
        4. Use Search->Bookmark->Remove Non-Bookmarked lines from the main menu.
        5. Now only the entries with a duplicate value before the # will remain.
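
        If you would rather check the regex outside Notepad++, here is a minimal Python sketch of the same idea. It is an adaptation, not the exact expression above: Python's re module has no \R, so I assume line breaks are \n or \r\n and write them out explicitly, and (?-s) is dropped because . never matches \n by default in Python. The names GROUPED_DUPLICATES and keep_grouped_duplicates are made up for this example.

```python
import re

# Assumption: line breaks are "\n" or "\r\n"; Python's re has no \R.
# "." does not match "\n" by default, so (?-s) is not needed.
GROUPED_DUPLICATES = re.compile(
    r"^([^#\r\n]+)#.*\r?\n(?:\1#.*$(?:\r?\n)?)+",
    re.MULTILINE,  # make ^ and $ match at every line start/end
)

def keep_grouped_duplicates(text: str) -> str:
    """Keep only runs of consecutive lines that share the same text before '#'."""
    return "".join(match.group(0) for match in GROUPED_DUPLICATES.finditer(text))

sample = (
    "Umbrella#17:30\n"
    "T-shirt#11:43\n"
    "T-shirt#12:04 Polo(M)\n"
    "Cap#12:47\n"
    "Coat#11:23 YD523(M)\n"
    "Coat#12:54 YD523(L)\n"
    "Jacket#14:41\n"
)
print(keep_grouped_duplicates(sample), end="")
```

        On this shortened sample it should print only the two T-shirt lines and the two Coat lines. Like the Notepad++ steps, it only works when the duplicates sit on consecutive lines.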

        Note that the Remove Non-Bookmarked lines operation can be rather slow if the number of lines to be removed is large (say, 10 thousand or more). If that’s a problem, you can use Copy Marked Text (the third button under the Mark all and Clear all marks buttons in the Mark form) instead to copy the text that you marked in step 2, then paste it into another buffer (or select the entire original file and paste over it). You’d have to do some simple regex-based postprocessing of the result after that, but I’ll leave that as an exercise for you.

        If the entries with duplicate values are not grouped together, you will need to sort the entries by the value before the #, and then follow the steps above. This sorting is difficult to achieve without a custom script, but fortunately I and fellow forum regular AlanKilborn have already written such a script.
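
        To be clear, the following is not the script mentioned above, just a rough Python sketch of one way to handle ungrouped entries without sorting first: count how often each value before the # occurs, then keep the lines whose value occurs more than once. The function name keep_duplicate_keys is hypothetical, and I am assuming a line without any # is treated as its own key.

```python
from collections import Counter

def keep_duplicate_keys(lines):
    """Keep lines whose text before the first '#' occurs on more than one line."""
    # A line with no '#' uses the whole line as its key (an assumption).
    keys = [line.split("#", 1)[0] for line in lines]
    counts = Counter(keys)
    return [line for line, key in zip(lines, keys) if counts[key] > 1]

lines = [
    "T-shirt#11:43",
    "Cap#12:47",
    "Jeans#10:43 LEVIS",
    "T-shirt#14:32",
    "Jeans#15:27",
]
for line in keep_duplicate_keys(lines):
    print(line)
```

        The original line order is preserved. If you also want the entries grouped, passing the result through sorted() with the same split("#", 1)[0] key would do it, since Python's sort is stable.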

        The Community of users of the Notepad++ text editor.