    Find Duplicate lines by the part of line and keep one of them

    13 Posts 6 Posters 6.5k Views
    • Roman ArtiukhinR
      Roman Artiukhin
      last edited by

      Back up your computer :) Tick “. matches newline” and “Wrap around”, then try this one:
      ^([^:]+?:[^:]+?:).+?$(?=.+?^\1.+?$)
      and leave “Replace with” empty.

      It removes every occurrence that has a later duplicate, so only the last one of each group is kept.

      • Bill DavisB
        Bill Davis
        last edited by

        I have a similar inquiry but I am not as educated in code as many of you.

        I am trying to compare 2 groups of numbers using the Compare plugin, but it compares the sequences literally. For example:

        Set 1
        SS 00 01 03 14 SS 00 05 12 06 SS 00 07 07 05 SS 00 08 04 05
        SS 00 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15
        SS 00 61 09 15 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07
        SS 41 51 10 09 SS 41 63 06 11 IH 10 01 09 86 SS 05 18 07 92
        Set 2
        SS 00 01 04 14 SS 00 05 12 06 SS 00 07 07 05 SS 00 08 04 05
        SS 00 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15

        SS 00 61 09 16 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07
        SS 41 51 10 09 SS 91 63 06 11 IH 10 01 09 86 SS 05 18 07 92

        Note that I changed a couple of numbers in set 2. If I use the Compare plugin, it compares the lines, not the data. So SS 00 61 09 16 would be flagged as a change when it is not. Can anyone tell me how to set this up to actually find repeats in the sequence?

        • tobelyanT
          tobelyan @Roman Artiukhin
          last edited by

          @Roman-Artiukhin Nope, this is not working. It reports “1 occurrence was replaced” but does not replace anything.

          • Roman ArtiukhinR
            Roman Artiukhin
            last edited by

            Well it works for me with your sample. See http://g.recordit.co/0woUi0bDIs.gif

            • Roman ArtiukhinR
              Roman Artiukhin
              last edited by Roman Artiukhin

              Can your English text contain “:”? If yes try this one instead: ^(.*?es.\s*:).+?$(?=.*?^\1.+?$)

              • tobelyanT
                tobelyan @Roman Artiukhin
                last edited by

                @Roman-Artiukhin Still not working. Is it possible to contact you privately, so I can show you the real data? And do you speak Russian? :)

                • tobelyanT
                  tobelyan
                  last edited by

                  Or at least you can contact me and I will write back. My email is my forum login name with @list.ru added; I am just not writing it out publicly, to avoid spam from bots :)

                  • guy038G
                    guy038
                    last edited by

                    Hello, @tobelyan,

                    I think that the shortest regex S/R to achieve what you want is:

                    SEARCH (?-s)^.*("en":".+","es":).*\R(?s).*\K(?-s)^.*\1.*\R

                    REPLACE Leave EMPTY !

                    Remarks: I am making some assumptions:

                    • The search is case sensitive. If NOT, just replace the first part (?-s) with the syntax (?i-s)

                    • The text to search for is preceded by the literal string "en":"

                    • The text to search for is followed by the literal string ","es":

                    • The initial string "en":" may begin a line

                    • The random text after the string ","es": may be present or not

                    Notes :

                    • Starting from the beginning of the text, this regex first searches for a line, followed by the greatest possible range of lines, up to the last line that contains the same text ( group 1 ) as the first one

                    • Due to the \K syntax, the match is then reset, so the final match is this last line only, which is deleted because the replacement zone is empty!


                    So, let’s start, for instance, with the original text below, with a line break after the last line:

                    '{"en":"Text Five","es": (Copyright (C)2016)
                    '{"en":"Text Two","es": (software; you may)
                    '{"en":"Text One","es": (GNU General Public)
                    '{"en":"Text One","es":
                    '{"en":"Text Three","es": (below. This guarantees)
                    '{"en":"Text Two","es": (this software under)
                    '{"en":"Text Two","es": (Note that we consider)
                    '"en":"Text One","es": (for the purpose of)
                    '{"en":"Text Four","es": (Notepad++ into a)
                    '{"en":"Text Five","es": (produced by InstallShielf)
                    '{"en":"Text Three","es": (This program is distributed)
                    '{"en":"Text One","es": (WITHOUT ANY WARRANTY)
                    '{"en":"Text Three","es": (MERCHANTABILITY or)
                    '{"en":"Text Five","es": (GNU General Public)
                    '{"en":"Text One","es": (A copy of the GNU)
                    
                    • Now, move back to the very beginning of your file ( Ctrl + Home )

                    • Open the Replace dialog ( Ctrl + H )

                    • Uncheck the Wrap around option

                    • Select, of course, the Regular expression search mode

                    • Fill the Find what: and Replace with: boxes, as specified above

                    • Click the Replace All button SEVERAL times, until the message Replace All: 0 occurrences were replaced appears!

                    You should obtain the simplified text, which also keeps the original order of the lines:

                    '{"en":"Text Five","es": (Copyright (C)2016)
                    '{"en":"Text Two","es": (software; you may)
                    '{"en":"Text One","es": (GNU General Public)
                    '{"en":"Text Three","es": (below. This guarantees)
                    '{"en":"Text Four","es": (Notepad++ into a)
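
For anyone who would rather script this than click Replace All several times, here is a minimal Python sketch that should give the same result on this sample. It is not the regex above: Python's re module has neither \K nor \R, so the sketch swaps in a plain loop that remembers which keys it has already seen. The names KEY and keep_first_by_key are my own, and it assumes each line holds at most one "en":"...","es": part, as in the sample.

```python
import re

# Same text that group 1 captures in the regex above (assumed: one per line).
KEY = re.compile(r'"en":".+","es":')


def keep_first_by_key(text):
    """Keep the first line for each key, drop later lines with the same key."""
    seen = set()
    kept = []
    for line in text.splitlines(keepends=True):
        match = KEY.search(line)
        if match is None:
            kept.append(line)  # lines without a key are left untouched
            continue
        key = match.group(0)
        if key in seen:
            continue  # an earlier line with this key was already kept
        seen.add(key)
        kept.append(line)
    return "".join(kept)


SAMPLE = """\
'{"en":"Text Five","es": (Copyright (C)2016)
'{"en":"Text Two","es": (software; you may)
'{"en":"Text One","es": (GNU General Public)
'{"en":"Text One","es":
'{"en":"Text Three","es": (below. This guarantees)
'{"en":"Text Two","es": (this software under)
'{"en":"Text Two","es": (Note that we consider)
'"en":"Text One","es": (for the purpose of)
'{"en":"Text Four","es": (Notepad++ into a)
'{"en":"Text Five","es": (produced by InstallShielf)
'{"en":"Text Three","es": (This program is distributed)
'{"en":"Text One","es": (WITHOUT ANY WARRANTY)
'{"en":"Text Three","es": (MERCHANTABILITY or)
'{"en":"Text Five","es": (GNU General Public)
'{"en":"Text One","es": (A copy of the GNU)
"""

print(keep_first_by_key(SAMPLE), end="")
```

Because it works in a single pass, there is nothing to repeat; the order of the remaining lines is unchanged.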
                    

                    Et voilà !

                    Best Regards,

                    guy038

                    • tobelyanT
                      tobelyan @guy038
                      last edited by

                      @guy038 thank you very much, worked perfectly

                      • guy038G
                        guy038
                        last edited by guy038

                        Hello, @bill-davis,

                        So, let’s imagine that you have these two original sets of data :

                        Set 1
                        
                        SS 00 01 03 14 SS 00 05 12 06 SS 00 07 07 05 SS 00 08 04 05
                        SS 00 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15
                        SS 00 61 09 15 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07
                        SS 41 51 10 09 SS 41 63 06 11 IH 10 01 09 86 SS 05 18 07 92
                        
                        Set 2
                        
                        SS 00 01 04 14 SS 00 05 12 06 SS 00 07 07 06 SS 00 08 04 05
                        SS 99 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15
                        SS 00 61 09 16 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07
                        SS 41 51 10 09 SS 91 63 06 11 IH 10 01 09 86 SS 05 18 07 93
                        

                        If you are able to join the corresponding lines ( the second version right below the first one, each pair followed by at least one empty line ), as below :

                        SS 00 01 03 14 SS 00 05 12 06 SS 00 07 07 05 SS 00 08 04 05
                        SS 00 01 04 14 SS 00 05 12 06 SS 00 07 07 06 SS 00 08 04 05
                        
                        SS 00 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15
                        SS 99 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15
                        
                        SS 00 61 09 15 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07
                        SS 00 61 09 16 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07
                        
                        SS 41 51 10 09 SS 41 63 06 11 IH 10 01 09 86 SS 05 18 07 92
                        SS 41 51 10 09 SS 91 63 06 11 IH 10 01 09 86 SS 05 18 07 93
                        
                        

                        ( I have already thought about a way to get this new arrangement !! )
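
One possible way to build this arrangement outside the editor, as a small Python sketch. This is only an assumption about how it could be done, not necessarily the method hinted at above. The function name interleave_sets is hypothetical, and both sets are assumed to have the same number of lines.

```python
def interleave_sets(set1, set2):
    """Put each line of set2 right below the matching line of set1,
    with one empty line after every pair."""
    lines1 = set1.splitlines()
    lines2 = set2.splitlines()
    if len(lines1) != len(lines2):
        raise ValueError("both sets must have the same number of lines")
    out = []
    for first, second in zip(lines1, lines2):
        out.extend([first, second, ""])  # "" becomes the empty separator line
    return "\n".join(out) + "\n"


# Tiny usage example with the first two lines of each set.
set1 = ("SS 00 01 03 14 SS 00 05 12 06 SS 00 07 07 05 SS 00 08 04 05\n"
        "SS 00 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15")
set2 = ("SS 00 01 04 14 SS 00 05 12 06 SS 00 07 07 06 SS 00 08 04 05\n"
        "SS 99 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15")
print(interleave_sets(set1, set2), end="")
```

The output uses \n line endings, so it corresponds to the Unix variant of the file.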

                        Then, the regex (\d\d)(?-s)(?=.*\R.+)(?s)(?!.{59}\1) would match any two-digit number in the first line of a pair which is NOT found at the same position in the following line !

                        So, the 3rd and 12th numbers of line 1, the 1st number of line 4, the 4th number of line 7, the 5th and 16th numbers of line 10 would be found, or marked with the Search > Mark… dialog

                        If your file is a Unix file, with only the \n EOL character, the correct regex is (\d\d)(?-s)(?=.*\R.+)(?s)(?!.{58}\1)

                        Notes :

                        • The idea is that, with the new organization of the data, any two-digit number is separated from its counterpart on the next line by exactly 59 standard or EOL characters ( or 58, in case of Unix files )

                        • So, we’re looking for a two-digit number (\d\d), stored as group 1, but only if two conditions are also true :

                          • After the two-digit number, there is a single line-break ( \R ) at the end of the current line, followed by a non-empty line => The positive look-ahead (?-s)(?=.*\R.+)

                          • After the two-digit number AND 59 characters ( standard or EOL ), another identical two-digit number cannot be found => The negative look-ahead (?s)(?!.{59}\1)
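
To try the idea outside Notepad++, here is a hedged Python sketch. Python's re module has no \R and does not accept a flag change in the middle of a pattern, so the pattern is rewritten with explicit character classes and assumes \n line endings ( hence 58, not 59 ). The names DIFF and changed_numbers are my own, and the data is the joined arrangement from this post.

```python
import re

# (\d\d)              a two-digit number, stored as group 1
# (?=[^\n]*\n[^\n]+)  the current line is followed by a non-empty line
# (?![\s\S]{58}\1)    58 characters further on, the same number is NOT found
DIFF = re.compile(r"(\d\d)(?=[^\n]*\n[^\n]+)(?![\s\S]{58}\1)")


def changed_numbers(joined_text):
    """Return (line_number, ordinal) for every number in the first line of a
    pair that differs from the number at the same position one line below."""
    hits = []
    for m in DIFF.finditer(joined_text):
        line_start = joined_text.rfind("\n", 0, m.start()) + 1
        line_no = joined_text.count("\n", 0, m.start()) + 1
        # Ordinal = how many two-digit numbers precede it on the line, plus one.
        ordinal = len(re.findall(r"\d\d", joined_text[line_start:m.start()])) + 1
        hits.append((line_no, ordinal))
    return hits


JOINED = "\n".join([
    "SS 00 01 03 14 SS 00 05 12 06 SS 00 07 07 05 SS 00 08 04 05",
    "SS 00 01 04 14 SS 00 05 12 06 SS 00 07 07 06 SS 00 08 04 05",
    "",
    "SS 00 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15",
    "SS 99 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15",
    "",
    "SS 00 61 09 15 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07",
    "SS 00 61 09 16 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07",
    "",
    "SS 41 51 10 09 SS 41 63 06 11 IH 10 01 09 86 SS 05 18 07 92",
    "SS 41 51 10 09 SS 91 63 06 11 IH 10 01 09 86 SS 05 18 07 93",
    "",
])

for line_no, ordinal in changed_numbers(JOINED):
    print(f"line {line_no}, number {ordinal}")
```

The fixed offset only works because every line has the same length ( 59 characters here ); with lines of varying length the numbers would have to be compared token by token instead.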

                        Best Regards,

                        guy038

                        • Kosmos HuynhK
                          Kosmos Huynh @guy038
                          last edited by

                          @guy038 and all

                          I am happy to see this topic, but it does not work in my case. Could you please do me a favor?

                          My example is as follows:
                          Chương 335: Nghiêm trọng
                          Chương 385: Nghiêm trọng
                          Ma Thần nhạc viên Chương 348: Nghiêm trọng

                          I wanted to delete the last two lines. Then, I applied your instruction with “Chương” as key string but it did not work.
                          (?-s)^.*(Chương ).*\R(?s).*\K(?-s)^.*\1.*\R

                          Many thanks in advance!

                          • Terry RT
                            Terry R
                            last edited by

                            @Kosmos-Huynh said in Find Duplicate lines by the part of line and keep one of them:

                            but it does not work in my case

                            I’m not surprised the regex you showed didn’t work. The example data is just a bit too different and unless you know what each part of the regex does you could well find it doing more damage to your data than good.

                            As this conversation is 3 years old and your request is different enough can you start a new post? By all means reference back to this if you want but in reality it needs dealing with as a separate conversation.

                            Also when including sample data please use the </> button (which you will see above the window where you type) around the examples, this prevents any characters typed from being altered by the interpreter in which you type. Please include more examples as what you have is insufficient to help describe the reason why 2 lines should be deleted when all 3 have the same word but all 3 have different numbers following. Describe more fully the requirement a line must have before it can be described as a “duplicate” and therefore deleted along with the line before it.

                            Terry
