    Remove duplicates if only part of the string matches

    • Viktor Serdiuk

      There is a list of rows with duplicates to the left of the colon

      part of row #1:part of row #20
      part of row #1:part of row #21
      part of row #3:part of row #41
      part of row #3:part of row #42
      part of row #3:part of row #43
      part of row #5:part of row #60

      I need to remove every line whose text before the colon duplicates that of an earlier line, keeping only the first line of each group.
      Only the left part of the line (#1 #3 #5) should be compared.
      Result:
      part of row #1:part of row #20
      part of row #3:part of row #41
      part of row #5:part of row #60

      TextFX deletes a line only if the whole line matches.
      Thanks in advance!

      • Alan Kilborn

        Perhaps something like this could work as a starting point:

        Find: (?-s)^((.+?):.+\R)(?s).*^\2(?-s).+\R
        Replace: ${1}
        Search mode: Regular expression
        Options: Wrap around
        Action: Replace all

        • guy038

          Hello, @viktor-serdiuk, @alan-kilborn and All,

          Alan, just a small variant of your search regex :

          SEARCH (?-s)^((.+?):.+\R)(?s:.*)^\2.+\R?

          REPLACE ${1}


          Notes :

          • Putting the ?s modifier inside its own group, (?s:.*), confines it to that group, so the rest of the regex keeps the ?-s setting. Thus, there is no need to repeat the ?-s near the end

          • The question mark added at the end of the regex covers the case where the last line of the current file has no line-break !
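
For anyone who wants to check the scoping described in the first note outside Notepad++, here is a minimal sketch using Python's re module, which also accepts the scoped (?s:...) form. The sample strings are made up for illustration, and Notepad++'s Boost regex engine may behave differently in other respects.

```python
import re

# Made-up two- and three-line samples, purely for illustration.
two_lines = "a:1\nb:2"
three_lines = "a:1\nb:2\nc:3"

# By default "." does not match a newline, so this cannot span two lines.
no_match = re.search(r"a:.*b:", two_lines)

# Inside (?s:...) the dot may cross newlines; outside the group it may not,
# so the trailing ".*" stops at the end of the "b:" line.
scoped = re.search(r"a:(?s:.*)b:.*", three_lines)

print(no_match)               # None
print(repr(scoped.group(0)))  # 'a:1\nb:2'
```

The second search shows why the modifier does not need to be switched off again: the dot after the group is back to its single-line behaviour.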


          BTW, it would be nice if the two N++ options, Remove Duplicate Lines and Remove Consecutive Duplicate Lines, asked us which contiguous zone to consider when removing duplicates :-) For example, from column 5 to 20, or from column 30 to end of line, with the entire line as the default if nothing is typed !

          Best Regards,

          guy038

          • Viktor Serdiuk

            Thank you so much!

            • Viktor Serdiuk

              The only thing is that if the right part of the row is missing, it is not checked.
              part of row #1:part of row #20
              part of row #1:
              part of row #3:part of row #41.
              part of row #3:
              part of row #3:part of row #43
              part of row #5:part of row #60

              I get the result :
              part of row #1:part of row #20
              part of row #1:
              part of row #3:
              part of row #3:part of row #41.
              part of row #5:part of row #60

              • guy038

                Hi, @viktor-serdiuk, @alan-kilborn and All,

                Ah, yes…, we forgot this case ! No problem, just two modifications to allow the part after the colon to be empty :

                SEARCH (?-s)^((.+?):.*\R)(?s:.*)^\2.*\R?

                REPLACE ${1}

                Thus, given your INPUT text :

                part of row #1:part of row #20
                part of row #1:
                part of row #3:part of row #41.
                part of row #3:
                part of row #3:part of row #43
                part of row #5:part of row #60
                

                You should get the following OUTPUT text :

                part of row #1:part of row #20
                part of row #3:part of row #41.
                part of row #5:part of row #60
                
                

                Notes :

                • With :.+\R, the regex expects at least one standard character between : and the line-break \R, as the + quantifier is a shortcut for the {1,} quantifier

                • With :.*\R, the regex allows the part between : and the line-break \R to be empty ( i.e. absent ), as the * quantifier is a shortcut for the {0,} quantifier

                • The same reasoning applies to the part between \2 and \R

                • However, note that the zone between the beginning of line ^ and the : must contain at least one standard char, because of the +. It also normally contains no colon, because the additional ? char makes the (.+?) sub-regex lazy, so it stops at the first colon. This sub-regex is stored as group 2 and re-used later with the \2 syntax
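
The same search/replace can be tried from a script. The following is only a sketch of a Python adaptation, not the Notepad++ regex itself: Python's re module has no \R, so \n is assumed as the line-break, and the leading (?-s) is dropped because the dot already excludes newlines by default. The function name collapse_contiguous is hypothetical. Like the Notepad++ version, it assumes that lines sharing a prefix are adjacent.

```python
import re

# Adapted pattern: group 1 is the first line of a run, group 2 its prefix
# (the text before the first colon). (?s:.*) may span lines; the final
# "^\2.*\n?" is the last line that starts with the same prefix.
PATTERN = re.compile(r"^((.+?):.*\n)(?s:.*)^\2.*\n?", re.MULTILINE)


def collapse_contiguous(text: str) -> str:
    """Replace each run of same-prefix lines with the run's first line."""
    return PATTERN.sub(r"\1", text)


sample = (
    "part of row #1:part of row #20\n"
    "part of row #1:\n"
    "part of row #3:part of row #41.\n"
    "part of row #3:\n"
    "part of row #3:part of row #43\n"
    "part of row #5:part of row #60\n"
)

print(collapse_contiguous(sample), end="")
```

With the INPUT text above, this prints the three lines of the expected OUTPUT text. Changing the two .* back to .+ reproduces the earlier behaviour, where a line with nothing after the colon cannot start a run.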

                Best Regards,

                guy038

                • Alan Kilborn @guy038

                  @guy038 said in Remove duplicates if only part of the string matches:

                  we forgot this case

                  No, this case was not cited by the OP originally.
                  We aren’t mind readers here.

                  • Viktor Serdiuk

                    @Alan-Kilborn said in Remove duplicates if only part of the string matches:

                    No, this case was not cited by the OP originally.
                    We aren’t mind readers here.

                    Agree!
                    THANK YOU again!

                    • Alan Kilborn @guy038

                      @guy038 said :

                      By putting the ?s modifier inside its own group, this means that, by default, the whole regex considers the ?-s modifier. Thus, no need to repeat the ?-s, near the end

                      Minor quibble about this: For the given data, I think doing it this way might make it more confusing – the syntax requires an extra :, which could be confused with the literal : used in the problem statement, and earlier in the expression.

                      • Viktor Serdiuk

                        @guy038 said in Remove duplicates if only part of the string matches:

                        SEARCH (?-s)^((.+?):.+\R)(?s:.*)^\2.+\R?

                        REPLACE ${1}

                        It doesn’t work if it starts with numbers, for example :

                        1part of row #1:part of row #20
                        1part of row #1:part of row #21
                        3part of row #3:part of row #41
                        3part of row #3:part of row #42
                        3part of row #3:part of row #43
                        1part of row #1:part of row #60

                        • PeterJones @Viktor Serdiuk

                          @Viktor-Serdiuk said in Remove duplicates if only part of the string matches:

                          It doesn’t work if it starts with numbers, for example :

                          It’s not because it starts with a number. It’s because the duplicates are non-contiguous. The sixth line cannot be merged with the rest of the first group because there are other lines in between. You can see this for yourself by noting that

                          1part of row #1:part of row #20
                          1part of row #1:part of row #21
                          1part of row #1:part of row #60
                          3part of row #3:part of row #41
                          3part of row #3:part of row #42
                          3part of row #3:part of row #43
                          

                          does work.

                          None of your examples showed that you wanted to be able to split and have the prefixes out of order, so Guy didn’t develop the regex to be able to handle that edge case. Unfortunately, I cannot think of an easy way to change his regex to meet your new requirements. Hopefully for you, he’ll have an idea when he comes back.
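
For the out-of-order case, a script that remembers which prefixes it has already seen avoids the contiguity restriction altogether. Here is a minimal Python sketch of that idea; it is not a change to the regex above, the function name dedupe_by_prefix is hypothetical, and a line without any colon is assumed to count as its own prefix.

```python
def dedupe_by_prefix(lines, sep=":"):
    """Keep only the first line seen for each prefix (text before sep)."""
    seen = set()
    kept = []
    for line in lines:
        # partition() returns the whole line as the prefix if sep is absent.
        prefix = line.partition(sep)[0]
        if prefix not in seen:
            seen.add(prefix)
            kept.append(line)
    return kept


lines = [
    "1part of row #1:part of row #20",
    "1part of row #1:part of row #21",
    "3part of row #3:part of row #41",
    "3part of row #3:part of row #42",
    "3part of row #3:part of row #43",
    "1part of row #1:part of row #60",
]

for line in dedupe_by_prefix(lines):
    print(line)
```

On the six-line example, only the first "1part" line and the first "3part" line are kept; the sixth line is dropped even though other lines separate it from its group.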

                          While waiting, I suggest you avail yourself of the following advice, which someone should have pointed you to previously:

                          • Please Read Before Posting
                          • Template for Search/Replace Questions
                          • Formatting Forum Posts
                          • Notepad++ Online User Manual: Searching/Regex
                          • FAQ: Where to find other regular expressions (regex) documentation
                          • Jim Dailey

                            @Viktor-Serdiuk: I would like to add to @PeterJones's suggestions that you also consider a scripting language for tasks like this.

                            While @guy038 and others in this forum are able to accomplish some amazing feats using regular expressions in Notepad++, I think many times the tasks could be accomplished more easily using Python, Perl, or my scripting language of choice, GAWK.

                            If I understand your desire correctly, I believe this GAWK script would do the trick. The code in the END{} is not needed, but shows how easily you can display statistics about the processed data:

                            {
                                split($0, /:/, Parts)       # Parts[1] <- text before the ":"
                                if (Parts[1] in Prefixes) { # If we've seen this prefix ...
                                    next                    # ... skip this line.
                                }
                                print Parts[1]              # Print the prefix
                                Prefixes[Parts[1]]++        # Add it to the Prefixes[] array
                            }
                            END {
                                for (p in Prefixes) {       # Print # of times we saw each
                                    printf("Prefix %s appeared %n times.\n", p, Prefixes[p]
                                }
                            }
                            

                            Not to start a language war or get too far off topic, but …

                            I prefer AWK (GNU’s version being GAWK) to newer scripting languages due to its smaller installation footprint and its relative simplicity. Admittedly, its simplicity causes traditional AWK scripts to look quite different from ones written in most other languages (you don’t see any code above that reads the input file because AWK does it for you), but once you understand how AWK reads the input on your behalf, writing simple scripts becomes extremely easy.

                            • Jim Dailey @Jim Dailey

                              I made some silly mistakes (several syntax errors) in the AWK script above. Also, the additional code in the END block won’t print the total number of times each prefix appeared as I intended. This script does, however:

                              {
                                  split($0, Parts, /:/)          # Parts[1] <- text before the ":".
                                  if (!(Parts[1] in Prefixes)) { # If we've NOT seen this prefix ...
                                      print Parts[1]             # ... print it.
                                  }
                                  Prefixes[Parts[1]]++           # Count this prefix.
                              }
                              END {
                                  for (p in Prefixes) {          # Print # of times we saw each one.
                                      printf("Prefix '%s' appeared %d times.\n", p, Prefixes[p])
                                  }
                              }
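
Since Python was mentioned as an alternative, here is a rough Python counterpart of the corrected script, offered as a sketch rather than an exact port. The function name count_prefixes is hypothetical; it returns the prefixes in first-seen order together with the counts instead of printing while reading.

```python
from collections import Counter


def count_prefixes(lines, sep=":"):
    """Return (prefixes in first-seen order, Counter of prefix occurrences)."""
    counts = Counter()
    first_seen = []
    for line in lines:
        prefix = line.split(sep, 1)[0]  # text before the first ":"
        if prefix not in counts:        # not seen before: remember its order
            first_seen.append(prefix)
        counts[prefix] += 1             # count this prefix
    return first_seen, counts


lines = [
    "part of row #1:part of row #20",
    "part of row #1:part of row #21",
    "part of row #3:part of row #41",
    "part of row #3:part of row #42",
    "part of row #3:part of row #43",
    "part of row #5:part of row #60",
]

prefixes, counts = count_prefixes(lines)
for p in prefixes:
    print(f"Prefix '{p}' appeared {counts[p]} times.")
```

Unlike the for (p in Prefixes) loop in AWK, whose traversal order is not guaranteed, the list of first-seen prefixes keeps the report in input order.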
                              
                              