Find Duplicate lines by the part of line and keep one of them

tobelyan

Hello, I need a regex to find lines that share the same part of the text.
My file looks like this:

'{“en”:“Text One”,“es”: (any random text)
'{“en”:“Text Two”,“es”: (any random text)
'{“en”:“Text Three”,“es”: (any random text)
'{“en”:“Text One”,“es”: (any random text)
'{“en”:“Text One”,“es”: (any random text)
'{“en”:“Text Four”,“es”: (any random text)
'{“en”:“Text Three”,“es”: (any random text)
What I want to do is find the text between “en”: and “es”: and, whenever several lines share that text, keep only one of them. So the result will be:

'{“en”:“Text One”,“es”: (any random text)
'{“en”:“Text Two”,“es”: (any random text)
'{“en”:“Text Three”,“es”: (any random text)
'{“en”:“Text Four”,“es”: (any random text)

Thanks

Roman Artiukhin

Back up your computer :) Tick “. matches newline” and “Wrap Around”, then try this one:
^([^:]+?:[^:]+?:).+?$(?=.+?^\1.+?$)
and leave “Replace with” empty.

It will remove the first occurrences in the text.
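
For anyone who wants to try the idea outside the editor, here is a minimal sketch of the same pattern with Python's re module. It assumes the real file uses straight quotes and the layout from the question; the suffixes (text 1) to (text 7) and the function name are made up for illustration. re.DOTALL stands in for “. matches newline” and re.MULTILINE makes ^ and $ work per line, so behaviour inside Notepad++ may differ in details.

```python
import re

# Same pattern as above; DOTALL ~ ". matches newline", MULTILINE ~ per-line ^ and $.
ROMAN = re.compile(r"^([^:]+?:[^:]+?:).+?$(?=.+?^\1.+?$)", re.DOTALL | re.MULTILINE)

# Hypothetical sample: same keys as in the question, distinct suffixes so we
# can see which copy of each duplicate survives.
LINES = [
    "'{\"en\":\"Text One\",\"es\": (text 1)",
    "'{\"en\":\"Text Two\",\"es\": (text 2)",
    "'{\"en\":\"Text Three\",\"es\": (text 3)",
    "'{\"en\":\"Text One\",\"es\": (text 4)",
    "'{\"en\":\"Text One\",\"es\": (text 5)",
    "'{\"en\":\"Text Four\",\"es\": (text 6)",
    "'{\"en\":\"Text Three\",\"es\": (text 7)",
]
SAMPLE = "\n".join(LINES)


def remove_earlier_duplicates(text):
    """Empty every line whose prefix (up to the second colon) starts a later line."""
    emptied = ROMAN.sub("", text)
    # The pattern removes the line content only, so drop the blank lines left behind.
    return [line for line in emptied.split("\n") if line]


if __name__ == "__main__":
    for line in remove_earlier_duplicates(SAMPLE):
        print(line)
```

On this sample the earlier duplicates are emptied (blank lines stay behind until they are filtered out) and the last occurrence of each key survives.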

Bill Davis

I have a similar inquiry but I am not as educated in code as many of you.

I am trying to compare 2 groups of numbers using the Compare plugin, but it compares the sequence literally. For example:

Set 1
SS 00 01 03 14 SS 00 05 12 06 SS 00 07 07 05 SS 00 08 04 05
SS 00 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15
SS 00 61 09 15 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07
SS 41 51 10 09 SS 41 63 06 11 IH 10 01 09 86 SS 05 18 07 92
Set 2
SS 00 01 04 14 SS 00 05 12 06 SS 00 07 07 05 SS 00 08 04 05
SS 00 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15

SS 00 61 09 16 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07
SS 41 51 10 09 SS 91 63 06 11 IH 10 01 09 86 SS 05 18 07 92

Note that I changed a couple of numbers in Set 2. If I use the Compare plugin, it compares the lines, not the data. So SS 00 61 09 16 would be flagged as a change when it is not. Can anyone tell me how to set this up to actually find repeats in the sequence?

tobelyan

@Roman-Artiukhin Nope, this is not working. It shows “1 occurrence was replaced” but does not replace anything.

Roman Artiukhin

Well it works for me with your sample. See http://g.recordit.co/0woUi0bDIs.gif

Roman Artiukhin

Can your English text contain “:”? If yes, try this one instead: ^(.*?es.\s*:).+?$(?=.*?^\1.+?$)

tobelyan

@Roman-Artiukhin Still not working. Is it possible to contact you somewhere in private, so I can show you the real data? And do you speak Russian? :)

tobelyan

Or at least you can contact me and I will contact you back. My email is my login name on the forum, just add @list.ru. I am not writing my email publicly so as not to get spam from bots :)

guy038

Hello, @tobelyan,

I think that the shortest regex S/R to achieve what you want is :

SEARCH (?-s)^.*("en":".+","es":).*\R(?s).*\K(?-s)^.*\1.*\R

REPLACE Leave EMPTY !

Remarks : I make the following assumptions :

The search is case sensitive. If NOT, just change the first part (?-s) to (?i-s)
The text to search for is preceded by the literal string "en":"
The text to search for is followed by the literal string ","es":
The initial string "en":" may begin a line
The random text after the string ","es": may be present or not

Notes :

Starting from the beginning of the text, this regex first searches for a line, followed by the greatest range of lines, up to the last line containing the same text ( group 1 ) as the first one
Due to the \K syntax, the search is then reset, so the final match is this last line only, which is deleted because the replacement zone is empty !

So, let’s start, for instance, with the original text below, which ends with a line break after the last line :

'{"en":"Text Five","es": (Copyright (C)2016)
'{"en":"Text Two","es": (software; you may)
'{"en":"Text One","es": (GNU General Public)
'{"en":"Text One","es":
'{"en":"Text Three","es": (below. This guarantees)
'{"en":"Text Two","es": (this software under)
'{"en":"Text Two","es": (Note that we consider)
'"en":"Text One","es": (for the purpose of)
'{"en":"Text Four","es": (Notepad++ into a)
'{"en":"Text Five","es": (produced by InstallShielf)
'{"en":"Text Three","es": (This program is distributed)
'{"en":"Text One","es": (WITHOUT ANY WARRANTY)
'{"en":"Text Three","es": (MERCHANTABILITY or)
'{"en":"Text Five","es": (GNU General Public)
'{"en":"Text One","es": (A copy of the GNU)

Now, move back to the very beginning of your file ( Ctrl + Home )
Open the Replace dialog ( Ctrl + H )
Uncheck the Wrap around option
Select, of course, the Regular expression search mode
Fill the Find what: and Replace with: boxes, as specified above
Click SEVERAL times on the Replace All button, until the message Replace All: 0 occurrences were replaced appears !

You should obtain the simplified text, which also keeps the original order of the lines :

'{"en":"Text Five","es": (Copyright (C)2016)
'{"en":"Text Two","es": (software; you may)
'{"en":"Text One","es": (GNU General Public)
'{"en":"Text Three","es": (below. This guarantees)
'{"en":"Text Four","es": (Notepad++ into a)

Et voilà !

Best Regards,

guy038

tobelyan

@guy038 thank you very much, worked perfectly

guy038

Hello, @bill-davis,

So, let’s imagine that you have these two original sets of data :

Set 1

SS 00 01 03 14 SS 00 05 12 06 SS 00 07 07 05 SS 00 08 04 05
SS 00 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15
SS 00 61 09 15 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07
SS 41 51 10 09 SS 41 63 06 11 IH 10 01 09 86 SS 05 18 07 92

Set 2

SS 00 01 04 14 SS 00 05 12 06 SS 00 07 07 06 SS 00 08 04 05
SS 99 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15
SS 00 61 09 16 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07
SS 41 51 10 09 SS 91 63 06 11 IH 10 01 09 86 SS 05 18 07 93

If you are able to pair up the corresponding lines ( the second version right below the first one, followed by at least one empty line ), as below :

SS 00 01 03 14 SS 00 05 12 06 SS 00 07 07 05 SS 00 08 04 05
SS 00 01 04 14 SS 00 05 12 06 SS 00 07 07 06 SS 00 08 04 05

SS 00 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15
SS 99 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15

SS 00 61 09 15 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07
SS 00 61 09 16 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07

SS 41 51 10 09 SS 41 63 06 11 IH 10 01 09 86 SS 05 18 07 92
SS 41 51 10 09 SS 91 63 06 11 IH 10 01 09 86 SS 05 18 07 93

( I have already thought about a way to get this new arrangement !! )
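
One possible way to build that arrangement outside the editor is a short Python script. This is just a sketch under the assumption that both sets have the same number of lines and are already available as lists; the names SET1, SET2 and pair_up are made up for the example.

```python
SET1 = [
    "SS 00 01 03 14 SS 00 05 12 06 SS 00 07 07 05 SS 00 08 04 05",
    "SS 00 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15",
    "SS 00 61 09 15 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07",
    "SS 41 51 10 09 SS 41 63 06 11 IH 10 01 09 86 SS 05 18 07 92",
]
SET2 = [
    "SS 00 01 04 14 SS 00 05 12 06 SS 00 07 07 06 SS 00 08 04 05",
    "SS 99 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15",
    "SS 00 61 09 16 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07",
    "SS 41 51 10 09 SS 91 63 06 11 IH 10 01 09 86 SS 05 18 07 93",
]


def pair_up(first, second):
    """Put each line of `second` right below its counterpart in `first`,
    with one empty line between consecutive pairs."""
    if len(first) != len(second):
        raise ValueError("both sets need the same number of lines")
    blocks = [f"{a}\n{b}\n" for a, b in zip(first, second)]
    return "\n".join(blocks)


if __name__ == "__main__":
    print(pair_up(SET1, SET2), end="")
```

The output uses \n line endings, so it corresponds to the Unix-file case.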

Then, the regex (\d\d)(?-s)(?=.*\R.+)(?s)(?!.{59}\1) would match any two-digit number which is present in the first line and NOT at the same position in the following line !

So, the 3rd and 12th numbers of line 1, the 1st number of line 4, the 4th number of line 7, the 5th and 16th numbers of line 10 would be found or marked with the Search > Mark… dialog

If your file is a Unix file, with only the \n EOL character, the correct regex is (\d\d)(?-s)(?=.*\R.+)(?s)(?!.{58}\1)

Notes :

The idea is that, with the new organization of the data, any two-digit number is separated from its counterpart on the next line by exactly 59 standard or EOL characters ( or 58, in case of Unix files )
So, we’re looking for a two-digit number (\d\d), stored as group 1, but only if two conditions are also true :
- After the two-digit number, there is a single line break ( \R ) at the end of the current line, followed by a non-empty line => The positive look-ahead (?-s)(?=.*\R.+)
- After the two-digit number AND 59 characters ( standard or EOL ), an identical two-digit number cannot be found => The negative look-ahead (?s)(?!.{59}\1)
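
To see the two look-aheads at work outside Notepad++, here is a rough Python adaptation. Python's re module has no \R, so this sketch assumes \n line endings and therefore uses the 58-character variant, with (?s:...) (Python 3.6+) playing the role of the (?s) switch. The names PAIRED, PATTERN and mismatches are made up for the example.

```python
import re

SET1 = [
    "SS 00 01 03 14 SS 00 05 12 06 SS 00 07 07 05 SS 00 08 04 05",
    "SS 00 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15",
    "SS 00 61 09 15 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07",
    "SS 41 51 10 09 SS 41 63 06 11 IH 10 01 09 86 SS 05 18 07 92",
]
SET2 = [
    "SS 00 01 04 14 SS 00 05 12 06 SS 00 07 07 06 SS 00 08 04 05",
    "SS 99 38 04 04 SS 01 72 03 92 SS 10 16 01 16 SS 00 60 09 15",
    "SS 00 61 09 16 SS 04 38 09 09 SS 40 93 07 05 SS 41 12 12 07",
    "SS 41 51 10 09 SS 91 63 06 11 IH 10 01 09 86 SS 05 18 07 93",
]

# Paired layout: each Set 2 line right below its Set 1 line, blank line between pairs.
PAIRED = "\n\n".join(f"{a}\n{b}" for a, b in zip(SET1, SET2)) + "\n"

# (\d\d)         a two-digit number, stored as group 1
# (?=.*\n.+)     the current line is followed by a non-empty line
# (?!.{58}\1)    58 characters further on, the same number is NOT found
PATTERN = re.compile(r"(\d\d)(?=.*\n.+)(?s:(?!.{58}\1))")


def mismatches(text):
    """Return (line number, position among the numbers of that line, number)."""
    hits = []
    for m in PATTERN.finditer(text):
        line_no = text.count("\n", 0, m.start()) + 1
        line_start = text.rfind("\n", 0, m.start()) + 1
        position = len(re.findall(r"\d\d", text[line_start:m.start()])) + 1
        hits.append((line_no, position, m.group(1)))
    return hits


if __name__ == "__main__":
    for hit in mismatches(PAIRED):
        print(hit)
```

On this data the script reports the 3rd and 12th numbers of line 1, the 1st number of line 4, the 4th number of line 7, and the 5th and 16th numbers of line 10.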

Best Regards,

guy038

Kosmos Huynh

@guy038 and all

I am happy to see this topic, but it does not work in my case. Could you please do me a favor?

My example is as follows:
Chương 335: Nghiêm trọng
Chương 385: Nghiêm trọng
Ma Thần nhạc viên Chương 348: Nghiêm trọng

I want to delete the last two lines. I applied your instructions with “Chương” as the key string, but it did not work.
(?-s)^.(Chương ).\R(?s).\K(?-s)^.\1.*\R

Many thanks in advance!

Terry R

@Kosmos-Huynh said in Find Duplicate lines by the part of line and keep one of them:

but it does not work in my case

I’m not surprised the regex you showed didn’t work. The example data is just a bit too different and unless you know what each part of the regex does you could well find it doing more damage to your data than good.

As this conversation is 3 years old and your request is different enough can you start a new post? By all means reference back to this if you want but in reality it needs dealing with as a separate conversation.

Also when including sample data please use the </> button (which you will see above the window where you type) around the examples, this prevents any characters typed from being altered by the interpreter in which you type. Please include more examples as what you have is insufficient to help describe the reason why 2 lines should be deleted when all 3 have the same word but all 3 have different numbers following. Describe more fully the requirement a line must have before it can be described as a “duplicate” and therefore deleted along with the line before it.

Terry