Find duplicate lines by a part of the line and keep one of them



  • Hello, I need a regex to find lines that match on part of their text.
    My file looks like this:

    '{“en”:“Text One”,“es”: (any random text)
    '{“en”:“Text Two”,“es”: (any random text)
    '{“en”:“Text Three”,“es”: (any random text)
    '{“en”:“Text One”,“es”: (any random text)
    '{“en”:“Text One”,“es”: (any random text)
    '{“en”:“Text Four”,“es”: (any random text)
    '{“en”:“Text Three”,“es”: (any random text)
    What I want to do is find the text between “en”: and “es”: and, whenever several lines share that text, keep only one of them. The result would be:

    '{“en”:“Text One”,“es”: (any random text)
    '{“en”:“Text Two”,“es”: (any random text)
    '{“en”:“Text Three”,“es”: (any random text)
    '{“en”:“Text Four”,“es”: (any random text)

    Thanks



  • Back up your computer :) Tick “. matches newline” and “Wrap Around” and try this one:
    ^([^:]+?:[^:]+?:).+?$(?=.+?^\1.+?$)
    and leave “Replace with” empty.

    It will remove the earlier occurrences in the text, so the last one of each group is kept.



  • I have a similar inquiry, but I am not as experienced with code as many of you.

    I am trying to compare two groups of numbers using the Compare plugin, but it compares the sequences literally. For example:

    Set 1
    SS 00 01 03 14 SS 00 05 12 06 SS 00 07 07 05 SS 00 08 04 05
    SS 00 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15
    SS 00 61 09 15 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07
    SS 41 51 10 09 SS 41 63 06 11 IH 10 01 09 86 SS 05 18 07 92
    Set 2
    SS 00 01 04 14 SS 00 05 12 06 SS 00 07 07 05 SS 00 08 04 05
    SS 00 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15

    SS 00 61 09 16 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07
    SS 41 51 10 09 SS 91 63 06 11 IH 10 01 09 86 SS 05 18 07 92

    Note that I changed a couple of numbers in set 2. If I use the Compare plugin, it compares the lines, not the data. So SS 00 61 09 16 would be flagged as a change when it is not. Can anyone tell me how to set this up to actually find repeats in the sequence?



  • @Roman-Artiukhin Nope, this is not working. It reports “1 occurrence was replaced”, but nothing is actually replaced



  • Well it works for me with your sample. See http://g.recordit.co/0woUi0bDIs.gif



  • Can your English text contain “:”? If yes try this one instead: ^(.*?es.\s*:).+?$(?=.*?^\1.+?$)



  • @Roman-Artiukhin Still not working. Is it possible to contact you somewhere in private, so I can show you the real data? And do you speak Russian? :)



  • Or at least you can contact me and I will write back. My email is my forum login name, just add @list.ru. I am not writing my email out publicly, to avoid spam from bots :)



  • Hello, @tobelyan,

    I think that a shorter regex S/R, to achieve what you want, is :

    SEARCH (?-s)^.*("en":".+","es":).*\R(?s).*\K(?-s)^.*\1.*\R

    REPLACE Leave EMPTY !

    Remarks : I am assuming the following :

    • The search is case sensitive. If NOT, just replace the first (?-s) with (?i-s)

    • The text to search for is preceded by the literal string "en":"

    • The text to search for is followed by the literal string ","es":

    • The initial string "en":" may begin a line

    • The random text after the string ","es": may be present or not

    Notes :

    • From the beginning of the text, this regex first searches for a line, then for the greatest possible range of lines after it, up to the last line that contains the same text ( group 1 ) as the first one

    • Due to the \K syntax, the match is then reset, so the final match is that last line only, which is deleted because the replacement is empty !
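    To see why several passes are needed, here is a minimal Python sketch of the same idea. It is an adaptation, not the Notepad++ regex itself: Python's re module has neither \K nor \R, so a capturing group stands in for \K and \n stands in for \R. The name drop_later_duplicates and the short sample are made up for this illustration.

```python
import re

# Group 1 = everything to keep (plays the role of \K);
# group 2 = the "en":"...","es": part that identifies a duplicate.
PATTERN = re.compile(
    r'(^[^\n]*("en":"[^\n]+","es":)[^\n]*\n.*)^[^\n]*\2[^\n]*\n',
    re.MULTILINE | re.DOTALL,
)


def drop_later_duplicates(text):
    """Repeat the replacement until nothing changes, like clicking Replace All again."""
    passes = 0
    while True:
        text, count = PATTERN.subn(r"\1", text)
        if count == 0:
            return text, passes
        passes += 1


sample = """\
'{"en":"Text One","es": (a)
'{"en":"Text Two","es": (b)
'{"en":"Text One","es": (c)
'{"en":"Text One","es": (d)
"""

result, passes = drop_later_duplicates(sample)
print(result, end="")
print("passes that changed something:", passes)
```

    Each pass only removes the last duplicate that follows a given line, which is why the loop (or the repeated Replace All) has to run until no match is left.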


    So, let’s start, for instance, with the original text below, which ends with a line break after the last line :

    '{"en":"Text Five","es": (Copyright (C)2016)
    '{"en":"Text Two","es": (software; you may)
    '{"en":"Text One","es": (GNU General Public)
    '{"en":"Text One","es":
    '{"en":"Text Three","es": (below. This guarantees)
    '{"en":"Text Two","es": (this software under)
    '{"en":"Text Two","es": (Note that we consider)
    '"en":"Text One","es": (for the purpose of)
    '{"en":"Text Four","es": (Notepad++ into a)
    '{"en":"Text Five","es": (produced by InstallShielf)
    '{"en":"Text Three","es": (This program is distributed)
    '{"en":"Text One","es": (WITHOUT ANY WARRANTY)
    '{"en":"Text Three","es": (MERCHANTABILITY or)
    '{"en":"Text Five","es": (GNU General Public)
    '{"en":"Text One","es": (A copy of the GNU)
    
    • Now, move back to the very beginning of your file ( Ctrl + Home )

    • Open the Replace dialog ( Ctrl + H )

    • Uncheck the Wrap around option

    • Select, of course, the Regular expression search mode

    • Fill in the Find what: and Replace with: boxes, as specified above

    • Click the Replace All button SEVERAL times, until the message Replace All: 0 occurrences were replaced appears !

    You should obtain the simplified text, which also keeps the original order of the lines :

    '{"en":"Text Five","es": (Copyright (C)2016)
    '{"en":"Text Two","es": (software; you may)
    '{"en":"Text One","es": (GNU General Public)
    '{"en":"Text Three","es": (below. This guarantees)
    '{"en":"Text Four","es": (Notepad++ into a)
    

    Et voilà !
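    If the file can be processed outside Notepad++, the same result can probably be reached in a single pass. The following is only a sketch under that assumption: it remembers each "en" text it has seen and keeps the first line for it. The helper name keep_first_per_key is hypothetical, and lines without an "en":"…","es": part are simply kept.

```python
import re

# Captures the text between "en":" and ","es": (same delimiters as the regex above).
KEY = re.compile(r'"en":"(.+)","es":')


def keep_first_per_key(lines):
    """Keep the first line for each "en" text, preserving the original order."""
    seen = set()
    kept = []
    for line in lines:
        match = KEY.search(line)
        if match is None:
            kept.append(line)  # no key on this line: leave it alone
            continue
        key = match.group(1)
        if key in seen:
            continue  # a line with this "en" text was already kept
        seen.add(key)
        kept.append(line)
    return kept


sample = """\
'{"en":"Text Five","es": (Copyright (C)2016)
'{"en":"Text Two","es": (software; you may)
'{"en":"Text Five","es": (GNU General Public)
'{"en":"Text Two","es":
""".splitlines()

for line in keep_first_per_key(sample):
    print(line)
```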

    Best Regards,

    guy038



  • @guy038 thank you very much, worked perfectly



  • Hello, @bill-davis,

    So, let’s imagine that you have these two original sets of data :

    Set 1
    
    SS 00 01 03 14 SS 00 05 12 06 SS 00 07 07 05 SS 00 08 04 05
    SS 00 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15
    SS 00 61 09 15 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07
    SS 41 51 10 09 SS 41 63 06 11 IH 10 01 09 86 SS 05 18 07 92
    
    Set 2
    
    SS 00 01 04 14 SS 00 05 12 06 SS 00 07 07 06 SS 00 08 04 05
    SS 99 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15
    SS 00 61 09 16 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07
    SS 41 51 10 09 SS 91 63 06 11 IH 10 01 09 86 SS 05 18 07 93
    

    If you are able to pair up the corresponding lines ( each line of set 2 right below its counterpart from set 1, followed by at least one empty line ), as below :

    SS 00 01 03 14 SS 00 05 12 06 SS 00 07 07 05 SS 00 08 04 05
    SS 00 01 04 14 SS 00 05 12 06 SS 00 07 07 06 SS 00 08 04 05
    
    SS 00 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15
    SS 99 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15
    
    SS 00 61 09 15 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07
    SS 00 61 09 16 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07
    
    SS 41 51 10 09 SS 41 63 06 11 IH 10 01 09 86 SS 05 18 07 92
    SS 41 51 10 09 SS 91 63 06 11 IH 10 01 09 86 SS 05 18 07 93
    
    

    ( I have already thought about a way to get this new arrangement !! )
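    One possible way to build that arrangement outside Notepad++ is a few lines of Python. This is just a sketch, not the method alluded to above: it assumes both sets have already been read into lists of lines, and the helper name interleave is made up.

```python
def interleave(set1, set2):
    """Put line i of set2 right below line i of set1, with a blank line between pairs."""
    # zip() stops at the shorter list, so unmatched trailing lines are dropped.
    return "\n".join(f"{first}\n{second}\n" for first, second in zip(set1, set2))


set1 = [
    "SS 00 01 03 14 SS 00 05 12 06 SS 00 07 07 05 SS 00 08 04 05",
    "SS 00 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15",
]
set2 = [
    "SS 00 01 04 14 SS 00 05 12 06 SS 00 07 07 06 SS 00 08 04 05",
    "SS 99 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15",
]

print(interleave(set1, set2), end="")
```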

    Then, the regex (\d\d)(?-s)(?=.*\R.+)(?s)(?!.{59}\1) would match any two-digit number which is present in the first line of a pair and NOT at the same position in the following line !

    So, the 3rd and 12th numbers of line 1, the 1st number of line 4, the 4th number of line 7, and the 5th and 16th numbers of line 10 would be found, or marked with the Search > Mark… dialog

    If your file is a Unix file, with only the \n EOL character, the correct regex is (\d\d)(?-s)(?=.*\R.+)(?s)(?!.{58}\1)

    Notes :

    • The idea is that, with the new organization of the data, any two-digit number is separated from its counterpart on the next line by exactly 59 characters, EOL characters included ( or 58, in the case of Unix files ). Each line is 59 characters long, so the gap is 59 - 2 = 57 standard characters, plus 2 EOL characters ( \r\n ) or 1 ( \n )

    • So, we’re looking for a two-digit number (\d\d), stored as group 1, but only if two conditions are also true :

      • The current line ends with a line-break ( \R ) and is followed by a non-empty line => The positive look-ahead (?-s)(?=.*\R.+)

      • Exactly 59 characters ( standard or EOL ) after the two-digit number, the identical two-digit number cannot be found => The negative look-ahead (?s)(?!.{59}\1)
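    The same look-ahead logic can be tried in Python, as a rough sketch. Python's re module has no \R, so the text below uses \n line endings, [^\n] replaces the (?-s) dot and [\s\S] replaces the (?s) dot; the gap is therefore 58. The function name find_mismatches is hypothetical.

```python
import re

# LF-only text, so the gap is 58 (the value 59 assumes \r\n line endings).
MISMATCH = re.compile(r"(\d\d)(?=[^\n]*\n[^\n]+)(?![\s\S]{58}\1)")


def find_mismatches(text):
    """Return (line number, n-th number on that line) for each match, both 1-based."""
    found = []
    for m in MISMATCH.finditer(text):
        line_no = text.count("\n", 0, m.start()) + 1
        line_start = text.rfind("\n", 0, m.start()) + 1
        nth = len(re.findall(r"\d\d", text[line_start:m.start()])) + 1
        found.append((line_no, nth))
    return found


paired = "\n".join([
    "SS 00 01 03 14 SS 00 05 12 06 SS 00 07 07 05 SS 00 08 04 05",
    "SS 00 01 04 14 SS 00 05 12 06 SS 00 07 07 06 SS 00 08 04 05",
    "",
    "SS 00 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15",
    "SS 99 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15",
    "",
    "SS 00 61 09 15 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07",
    "SS 00 61 09 16 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07",
    "",
    "SS 41 51 10 09 SS 41 63 06 11 IH 10 01 09 86 SS 05 18 07 92",
    "SS 41 51 10 09 SS 91 63 06 11 IH 10 01 09 86 SS 05 18 07 93",
]) + "\n"

print(find_mismatches(paired))
# → [(1, 3), (1, 12), (4, 1), (7, 4), (10, 5), (10, 16)]
```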

    Best Regards,

    guy038

