
    Remove duplicates if only part of the string matches

    13 Posts 5 Posters 921 Views
    • Viktor Serdiuk

      Thank you so much!

      • Viktor Serdiuk

        The only thing is that if the right part of the row is missing, it is not checked.
        part of row #1:part of row #20
        part of row #1:
        part of row #3:part of row #41.
        part of row #3:
        part of row #3:part of row #43
        part of row #5:part of row #60

        I get the result :
        part of row #1:part of row #20
        part of row #1:
        part of row #3:
        part of row #3:part of row #41.
        part of row #5:part of row #60

        • guy038

          Hi, @viktor-serdiuk, @alan-kilborn and All,

          Ah, yes…, we forgot this case ! No problem, just two modifications to allow the part after the colon to be empty :

          SEARCH (?-s)^((.+?):.*\R)(?s:.*)^\2.*\R?

          REPLACE ${1}

          Thus, given your INPUT text :

          part of row #1:part of row #20
          part of row #1:
          part of row #3:part of row #41.
          part of row #3:
          part of row #3:part of row #43
          part of row #5:part of row #60
          

          You should get the following OUTPUT text :

          part of row #1:part of row #20
          part of row #3:part of row #41.
          part of row #5:part of row #60
          
          

          Notes :

          • When using :.+\R, the regex expects, at least, one standard character, between : and the line-break \R, as the + quantifier is a shortcut for the {1,} quantifier

          • When using :.*\R, the regex allows the part between : and the line-break \R to be empty (that is, absent), as the * quantifier is a shortcut for the {0,} quantifier

          • And same reason for the part between \2 and \R

          • However, note that the zone between the beginning of line ^ and the : must contain at least one standard character, because of the +. The additional ? makes the quantifier lazy, so the (.+?) sub-regex normally stops at the first colon and contains no colon itself; it is re-used later, as group 2, with the \2 syntax
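To try the difference between .+ and .* outside Notepad++, here is a minimal Python sketch. It is only an approximation of the searches above: Python's re module has no \R and no standalone (?-s), so "\n" stands in for the line-break and the default behaviour (dot does not match a newline) stands in for (?-s). The names WITH_PLUS, WITH_STAR, replace_all and the EMPTY_FIRST sample are made up for this illustration.

```python
import re

# Assumed translation of the two Notepad++ searches: "\n" replaces \R, and
# re.MULTILINE lets ^ match at the start of every line.
WITH_PLUS = re.compile(r"^((.+?):.+\n)(?s:.*)^\2.+\n?", re.MULTILINE)
WITH_STAR = re.compile(r"^((.+?):.*\n)(?s:.*)^\2.*\n?", re.MULTILINE)


def replace_all(pattern, text):
    """Mimic REPLACE ${1}: keep only the first line of each matched block."""
    return pattern.sub(r"\1", text)


# The INPUT text quoted in the post above.
THREAD_INPUT = (
    "part of row #1:part of row #20\n"
    "part of row #1:\n"
    "part of row #3:part of row #41.\n"
    "part of row #3:\n"
    "part of row #3:part of row #43\n"
    "part of row #5:part of row #60\n"
)

# Hypothetical input whose first line has nothing after the colon.
EMPTY_FIRST = "a:\na:1\nb:2\n"

print(replace_all(WITH_STAR, THREAD_INPUT), end="")
print(replace_all(WITH_PLUS, EMPTY_FIRST) == EMPTY_FIRST)  # .+ cannot start a match on "a:"
print(replace_all(WITH_STAR, EMPTY_FIRST) == "a:\nb:2\n")  # .* accepts the empty right part
```

In this sketch the two variants only differ when the line that should be kept has an empty right part, which is the case the .* version was written for.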

          Best Regards,

          guy038

          • Alan Kilborn @guy038

            @guy038 said in Remove duplicates if only part of the string matches:

            we forgot this case

            No, this case was not cited by the OP originally.
            We aren’t mind readers here.

            • Viktor Serdiuk

              @Alan-Kilborn said in Remove duplicates if only part of the string matches:

              No, this case was not cited by the OP originally.
              We aren’t mind readers here.

              Agree!
              THANK YOU again!

              • Alan Kilborn @guy038

                @guy038 said :

                By putting the ?s modifier inside its own group, this means that, by default, the whole regex considers the ?-s modifier. Thus, no need to repeat the ?-s, near the end

                Minor quibble about this: For the given data, I think doing it this way might make it more confusing – the syntax requires an extra :, which could be confused with the literal : used in the problem statement, and earlier in the expression.

                • Viktor Serdiuk

                  @guy038 said in Remove duplicates if only part of the string matches:

                  Hello, @viktor-serdiuk, @alan-kilborn and All,

                  Alan, just a small variant of your search regex :

                  SEARCH (?-s)^((.+?):.+\R)(?s:.*)^\2.+\R?

                  REPLACE ${1}


                  Notes :

                  • By putting the ?s modifier inside its own group, this means that, by default, the whole regex considers the ?-s modifier. Thus, no need to repeat the ?-s, near the end

                  • By adding the question mark at the end of the regex, it covers the case of a last line of current file without any line-break !


                  BTW, it would be nice if the two N++ options Remove Duplicate Lines and Remove Consecutive Duplicate Lines asked us about the contiguous zone to consider when removing duplicates :-) For example, from column 5 to 20 or from column 30 to end of line and, of course, the entire line, by default, if nothing is typed !

                  Best Regards,

                  guy038

                  It doesn’t work if it starts with numbers, for example :

                  1part of row #1:part of row #20
                  1part of row #1:part of row #21
                  3part of row #3:part of row #41
                  3part of row #3:part of row #42
                  3part of row #3:part of row #43
                  1part of row #1:part of row #60

                  • PeterJones @Viktor Serdiuk

                    @Viktor-Serdiuk said in Remove duplicates if only part of the string matches:

                    It doesn’t work if it starts with numbers, for example :

                    It’s not because it starts with a number. It’s because it’s non-contiguous. The sixth line cannot be merged with the rest of the first group because there are other lines between. You could see this yourself by seeing that

                    1part of row #1:part of row #20
                    1part of row #1:part of row #21
                    1part of row #1:part of row #60
                    3part of row #3:part of row #41
                    3part of row #3:part of row #42
                    3part of row #3:part of row #43
                    

                    does work.

                    None of your examples showed that you wanted to be able to split and have the prefixes out of order, so Guy didn’t develop the regex to be able to handle that edge case. Unfortunately, I cannot think of an easy way to change his regex to meet your new requirements. Hopefully for you, he’ll have an idea when he comes back.
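The contiguity point can be checked with a rough Python approximation of Guy's search (an assumption-laden sketch, not a statement about Notepad++'s Boost engine: "\n" replaces \R, the default dot behaviour replaces (?-s), and the names PATTERN, SCATTERED and GROUPED are invented here). With the scattered input, the greedy (?s:.*) runs up to the last line that starts with the first prefix, so the first match covers every line in between.

```python
import re

# Assumed Python translation of the search used at this point in the thread.
PATTERN = re.compile(r"^((.+?):.+\n)(?s:.*)^\2.+\n?", re.MULTILINE)

# Prefixes out of order: the last "1part" line comes after the "3part" lines.
SCATTERED = (
    "1part of row #1:part of row #20\n"
    "1part of row #1:part of row #21\n"
    "3part of row #3:part of row #41\n"
    "3part of row #3:part of row #42\n"
    "3part of row #3:part of row #43\n"
    "1part of row #1:part of row #60\n"
)

# Same lines, with each prefix kept in one contiguous group.
GROUPED = (
    "1part of row #1:part of row #20\n"
    "1part of row #1:part of row #21\n"
    "1part of row #1:part of row #60\n"
    "3part of row #3:part of row #41\n"
    "3part of row #3:part of row #42\n"
    "3part of row #3:part of row #43\n"
)

first_match = PATTERN.search(SCATTERED)
lines_in_first_match = first_match.group(0).count("\n")
print(lines_in_first_match)                    # how many lines the first match spans

print(PATTERN.sub(r"\1", SCATTERED), end="")   # the "3part" group is swallowed
print(PATTERN.sub(r"\1", GROUPED), end="")     # one line per prefix survives
```

In this approximation the scattered input collapses to its first line only, while the grouped input keeps one line per prefix.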

                    While waiting, I suggest you avail yourself of the following advice, which someone should have pointed you to previously:

                    • Please Read Before Posting
                    • Template for Search/Replace Questions
                    • Formatting Forum Posts
                    • Notepad++ Online User Manual: Searching/Regex
                    • FAQ: Where to find other regular expressions (regex) documentation
                    • Jim Dailey

                      @Viktor-Serdiuk: I would like to add to @PeterJones's suggestions that you also consider using a scripting language to perform tasks like this.

                      While @guy038 and others in this forum are able to accomplish some amazing feats using regular expressions in Notepad++, I think many times the tasks could be accomplished more easily using Python, Perl, or my scripting language of choice, GAWK.

                      If I understand your desire correctly, I believe this GAWK script would do the trick. The code in the END{} is not needed, but shows how easily you can display statistics about the processed data:

                      {
                          split($0, /:/, Parts)       # Parts[1] <- text before the ":"
                          if (Parts[1] in Prefixes) { # If we've seen this prefix ...
                              next                    # ... skip this line.
                          }
                          print Parts[1]              # Print the prefix
                          Prefixes[Parts[1]]++        # Add it to the Prefixes[] array
                      }
                      END {
                          for (p in Prefixes) {       # Print # of times we saw each
                              printf("Prefix %s appeared %n times.\n", p, Prefixes[p]
                          }
                      }
                      

                      Not to start a language war or get too far off topic, but …

                      I prefer AWK (GNU’s version being GAWK) to newer scripting languages due to its smaller installation footprint and its relative simplicity. Admittedly, its simplicity causes traditional AWK scripts to look quite different from ones written in most other languages (you don’t see any code above that reads the input file because AWK does it for you), but once one understands how AWK reads the input on one's behalf, writing simple scripts becomes extremely easy.

                      • Jim Dailey @Jim Dailey

                        I made some silly mistakes (several syntax errors) in the AWK script above. Also, the additional code in the END block won’t print the total number of times each prefix appeared as I intended. This script does, however:

                        {
                            split($0, Parts, /:/)          # Parts[1] <- text before the ":".
                            if (!(Parts[1] in Prefixes)) { # If we've NOT seen this prefix ...
                                print Parts[1]             # ... print it.
                            }
                            Prefixes[Parts[1]]++           # Count this prefix.
                        }
                        END {
                            for (p in Prefixes) {          # Print # of times we saw each one.
                                printf("Prefix '%s' appeared %d times.\n", p, Prefixes[p])
                            }
                        }
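
For readers who lean towards Python, a rough counterpart might look like the sketch below. It is not a translation of the script above: it assumes the goal is to keep the whole first line for each prefix (the AWK script prints only the prefix), and the function name dedupe_by_prefix and its sep parameter are invented for the example. A line with no colon is treated as being all prefix.

```python
def dedupe_by_prefix(lines, sep=":"):
    """Return (kept_lines, counts): the first line seen for each prefix,
    plus how many times each prefix appeared, regardless of line order."""
    counts = {}
    kept = []
    for line in lines:
        prefix = line.split(sep, 1)[0]   # text before the first separator
        if prefix not in counts:         # first time we see this prefix
            kept.append(line)
        counts[prefix] = counts.get(prefix, 0) + 1
    return kept, counts


# The out-of-order example from earlier in the thread.
lines = [
    "1part of row #1:part of row #20",
    "1part of row #1:part of row #21",
    "3part of row #3:part of row #41",
    "3part of row #3:part of row #42",
    "3part of row #3:part of row #43",
    "1part of row #1:part of row #60",
]

kept, counts = dedupe_by_prefix(lines)
for line in kept:
    print(line)
for prefix, n in counts.items():
    print(f"Prefix '{prefix}' appeared {n} times.")
```

Because the prefixes are tracked in a dictionary rather than matched by a regex, the duplicates do not need to be contiguous.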
                        
                        
                        The Community of users of the Notepad++ text editor.