Delete text that repeats in the same line

Vivianjenylord

Hello friends, I have a document in which the same text is repeated within a line. Can the repeated text be removed with Notepad++ regular expressions?

The document is like this:

123.45607894.165@abcd;aba 123.45607894.165@abcd;aba
9871.001@fab:9782581afa xx9871.001@fab:9782581afa 9871.001@fab:9782581afa
00040 jhjhjdsadj2 00040 jhjhjdsadj2 ""00040 jhjhjdsadj2
journal… xxx journal
journal… @ journal
the same 1234 the same

And I need this:

123.45607894.165@abcd;aba
9871.001@fab:9782581afa xx
00040 jhjhjdsadj2 ""
journal… xxx
journal… @
the same 1234

I would be grateful if someone could help me solve this.
Thank you

Blafulous Crassley

I don’t know, but you can try things out with regex at https://regex101.com/, a free service. You can even sign up for an account and save your tests.

guy038

Hi, vivianjenylord,

I found a solution with regexes, which also needs other text manipulations, such as sorting and column numbering

However, I’m not satisfied because the method is a bit complicated and I’m still wondering whether it’s worth posting !

Please, one question. What about the following case, with the line :

abcdefghij 12345 abcdefghij abcdefghij xyz abcdefghij

Must we keep :

The shortest item ( abcdefghij )
The longest item ( abcdefghij 12345 )
The last item, sorted alphabetically ascending ( abcdefghij xyz )

Best Regards,

guy038

Vivianjenylord

@guy038
Thank you for responding. In my text, I order them alphabetically. Regarding your question,
I would like to obtain this result:
abcdefghij 12345 xyz

If that is too complicated for me to understand (I’m just a web designer), then keep
the last item, sorted alphabetically ascending (abcdefghij xyz)

Friend, thank you very much for your selfless help

guy038

Hi, vivianjenylord, and All,

Thanks for your reply. On my side, I’ve managed to simplify the main regex :-) The method needs a lot of steps, although none of them is difficult to carry out ;-))

Well, let’s go !

I’ll use the text below as the working file. It corresponds to your sample text, with four more lines… … and some blank chars and blank lines, in order to cover every case :-))

123.45607894.165@abcd;aba 123.45607894.165@abcd;aba
9871.001@fab:9782581afa xx9871.001@fab:9782581afa 9871.001@fab:9782581afa

00040 jhjhjdsadj2 00040 jhjhjdsadj2 ""00040 jhjhjdsadj2

journal… xxx journal
       journal… @ journal
the same 1234 the same
abcde 12345 abcde abcdexyz tuvabcde


fghij 12345fghij fghij xyzfghij xyz  tuv
PQR 12345 PQR PQRxyz tuvPQR               
Last TEST 12345Last TEST Last TEST xyzLast TEST xyz     tuvLast TESTxyz ijkLastTEST

First, paste this text into a new N++ tab

Now, we’re going to :

Delete possible blank lines, pure or not
Trim possible blank characters, at beginning and/or end of each line
Insert a character, not yet used in your file, at beginning of each line, to act as a separator

I chose the # symbol but any single character would be appropriate. However, note that if this character is a meta-character of regular expressions, don’t forget to escape it with the \ char, in order to use it literally !

So:

Open the Replace dialog ( Ctrl + H )
SEARCH ^\h*\R|^\h+|\h+$|^(.)
REPLACE ?1#\1
Select the Regular expression search mode
Tick the Wrap around option
Click, once, on the Replace All button

You should obtain the following text :

#123.45607894.165@abcd;aba 123.45607894.165@abcd;aba
#9871.001@fab:9782581afa xx9871.001@fab:9782581afa 9871.001@fab:9782581afa
#00040 jhjhjdsadj2 00040 jhjhjdsadj2 ""00040 jhjhjdsadj2
#journal… xxx journal
#journal… @ journal
#the same 1234 the same
#abcde 12345 abcde abcdexyz tuvabcde
#fghij 12345fghij fghij xyzfghij xyz  tuv
#PQR 12345 PQR PQRxyz tuvPQR
#Last TEST 12345Last TEST Last TEST xyzLast TEST xyz     tuvLast TESTxyz ijkLastTEST
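
For readers who would rather script this clean-up outside N++, the same three operations can be sketched in Python. This is only an illustration, not part of the method above: the function name clean_and_mark and its sep parameter are made up for the example, and it treats only spaces and tabs as blank characters, which may be narrower than what \h matches.

```python
def clean_and_mark(text, sep="#"):
    """Drop blank lines, trim each line, and prefix it with a separator.

    Hypothetical helper: only spaces and tabs count as blank characters here.
    """
    marked = []
    for line in text.splitlines():
        stripped = line.strip(" \t")   # trim leading and trailing blanks
        if stripped:                   # skip lines that were empty or blank-only
            marked.append(sep + stripped)
    return marked


if __name__ == "__main__":
    sample = "journal… xxx journal\n       journal… @ journal\n\n \t\nthe same 1234 the same   \n"
    for line in clean_and_mark(sample):
        print(line)
```

As with the regex version, the separator should be a character that never occurs in the data.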

Then :

Place the cursor/caret at the very beginning ( line 1, column 1 )
Open the Column editor ( Alt + C )
Select the option Number to Insert
Type 1 in both the Initial number and Increase by fields
Tick the Leading zeros box
If necessary, select the Dec format
Click on the OK button
Delete the extra number 11, which lands on the empty line at the end

You’ll get the text, below :

01#123.45607894.165@abcd;aba 123.45607894.165@abcd;aba
02#9871.001@fab:9782581afa xx9871.001@fab:9782581afa 9871.001@fab:9782581afa
03#00040 jhjhjdsadj2 00040 jhjhjdsadj2 ""00040 jhjhjdsadj2
04#journal… xxx journal
05#journal… @ journal
06#the same 1234 the same
07#abcde 12345 abcde abcdexyz tuvabcde
08#fghij 12345fghij fghij xyzfghij xyz  tuv
09#PQR 12345 PQR PQRxyz tuvPQR
10#Last TEST 12345Last TEST Last TEST xyzLast TEST xyz     tuvLast TESTxyz ijkLastTEST
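
The numbering step can also be sketched in Python, assuming the lines already carry the # separator. The leading zeros matter because the later sort compares text, not numbers: without padding, 10# would sort before 2#. The helper name number_lines is hypothetical.

```python
def number_lines(lines):
    """Prefix each line with a zero-padded, 1-based counter.

    The padding width is derived from the number of lines, so every
    counter has the same length and a plain text sort keeps them in order.
    """
    width = len(str(len(lines)))
    return [f"{i:0{width}d}{line}" for i, line in enumerate(lines, start=1)]


if __name__ == "__main__":
    for line in number_lines(["#journal… xxx journal", "#the same 1234 the same"]):
        print(line)
```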

Now, we’re going to use the main regex, which :

Cuts each line into several lines, each beginning with the repeated word
Adds the correct numbering to each split line, for a future sort action

So, open the Replace dialog , again

SEARCH (?-is)^((\d+#)(.{1,}[^ \r\n]).*?)\x20?(?=\3)
- By default, I assumed a case sensitive search… If you prefer an insensitive search, use the modifiers (?i-s) at the beginning instead
REPLACE \1\r\n\2
Keep the same options, as above
Click on the Replace All button, repeatedly ( or use the Alt + A shortcut ), until you get the message Replace All: 0 occurrences were replaced ( 6 hits for this example ! )

You should obtain this 32-line text :

01#123.45607894.165@abcd;aba
01#123.45607894.165@abcd;aba
02#9871.001@fab:9782581afa xx
02#9871.001@fab:9782581afa
02#9871.001@fab:9782581afa
03#00040 jhjhjdsadj2
03#00040 jhjhjdsadj2 ""
03#00040 jhjhjdsadj2
04#journal… xxx
04#journal
05#journal… @
05#journal
06#the same 1234
06#the same
07#abcde 12345
07#abcde
07#abcdexyz tuv
07#abcde
08#fghij 12345
08#fghij
08#fghij xyz
08#fghij xyz  tuv
09#PQR 12345
09#PQR
09#PQRxyz tuv
09#PQR
10#Last TEST 12345
10#Last TEST
10#Last TEST xyz
10#Last TEST xyz     tuv
10#Last TESTxyz ijk
10#Last TEST
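
The repeated Replace All can be imitated in Python with a loop that runs the substitution until nothing changes. This is a sketch under a few assumptions: the pattern is the one given above with the leading (?-is) left out, since Python's defaults are already case sensitive with a dot that stops at line breaks; \n is used instead of \r\n; and the name split_repeats is made up for the example. The number of passes may differ from the number of clicks needed in N++.

```python
import re

# Same pattern as the main regex above, minus the leading (?-is) modifiers.
SPLIT = re.compile(r"^((\d+#)(.{1,}[^ \r\n]).*?)\x20?(?=\3)", re.MULTILINE)


def split_repeats(text):
    """Split each numbered line before every repeat of its leading text.

    Each pass splits a line at most once, so the substitution is repeated
    until a pass replaces nothing.
    """
    while True:
        text, count = SPLIT.subn(r"\1\n\2", text)
        if count == 0:
            return text


if __name__ == "__main__":
    print(split_repeats("06#the same 1234 the same"))
```

Every split produces two lines that are both shorter than the line they came from, so the loop always ends.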

Ah, almost finished ! Now, we perform a classical N++ sort, using the option :

Edit > Line Operations > Sort Lines Lexicographically Ascending

After the sort, don’t forget to add at least one pure blank line after the sorted results ( IMPORTANT )

Hence, the sorted text :

01#123.45607894.165@abcd;aba
01#123.45607894.165@abcd;aba
02#9871.001@fab:9782581afa
02#9871.001@fab:9782581afa
02#9871.001@fab:9782581afa xx
03#00040 jhjhjdsadj2
03#00040 jhjhjdsadj2
03#00040 jhjhjdsadj2 ""
04#journal
04#journal… xxx
05#journal
05#journal… @
06#the same
06#the same 1234
07#abcde
07#abcde
07#abcde 12345
07#abcdexyz tuv
08#fghij
08#fghij 12345
08#fghij xyz
08#fghij xyz  tuv
09#PQR
09#PQR
09#PQR 12345
09#PQRxyz tuv
10#Last TEST
10#Last TEST
10#Last TEST 12345
10#Last TEST xyz
10#Last TEST xyz     tuv
10#Last TESTxyz ijk
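
To see why the sort brings the bare repeated word to the top of each numbered group, here is a small Python sketch. It assumes a plain character-by-character comparison, which is what Python's sorted does for strings; whether N++ applies exactly the same ordering to every character is not confirmed here. The name sort_lines is hypothetical.

```python
def sort_lines(lines):
    """Sort lines in ascending character order.

    A line that is a prefix of another always comes first, so within one
    numbered group the bare repeated word precedes its longer variants.
    """
    return sorted(lines)


if __name__ == "__main__":
    group = ["07#abcde 12345", "07#abcde", "07#abcdexyz tuv", "07#abcde"]
    for line in sort_lines(group):
        print(line)
```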

Finally, for each line number, we must keep the last item, only. So :

For the last time, open the Replace dialog
SEARCH ^(?-s)(.+)\R(\1.*\R)+
REPLACE \2
Keep the same options, as above
Click, once, on the Replace All button

Almost the final text expected !

01#123.45607894.165@abcd;aba
02#9871.001@fab:9782581afa xx
03#00040 jhjhjdsadj2 ""
04#journal… xxx
05#journal… @
06#the same 1234
07#abcdexyz tuv
08#fghij xyz  tuv
09#PQRxyz tuv
10#Last TESTxyz ijk
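
The same reduction can be sketched in Python, with \n standing in for \R. The helper name keep_last is made up for the example. It appends a final line break when one is missing, which plays the role of the blank line added after the sort: the pattern only matches lines that are followed by a line break, the last one included.

```python
import re


def keep_last(text):
    """Collapse each run of lines that all start with the run's first line.

    Only the last line of each run is kept, mirroring the SEARCH/REPLACE
    pair above (group 2 holds the last repetition).
    """
    if not text.endswith("\n"):
        text += "\n"   # every line, including the last, must end with a break
    return re.sub(r"^(.+)\n(\1.*\n)+", r"\2", text, flags=re.MULTILINE)


if __name__ == "__main__":
    print(keep_last("06#the same\n06#the same 1234"), end="")
```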

To end, we just have to get rid of the numbering, at beginning of each line. No problem with the simple regex :

SEARCH (?-s)^.+#
REPLACE Leave EMPTY
Keep the same options, as above
Click, once, on the Replace All button

Here we are ! A bit of work but a correct result, isn’t it ?

123.45607894.165@abcd;aba
9871.001@fab:9782581afa xx
00040 jhjhjdsadj2 ""
journal… xxx
journal… @
the same 1234
abcdexyz tuv
fghij xyz  tuv
PQRxyz tuv
Last TESTxyz ijk
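
For completeness, the last replacement looks like this in Python (strip_numbering is a hypothetical name). The sketch also shows why the separator must be a character that is absent from the data: the greedy .+ runs up to the last # on the line.

```python
import re


def strip_numbering(text):
    """Remove everything up to and including the last '#' on each line."""
    return re.sub(r"^.+#", "", text, flags=re.MULTILINE)


if __name__ == "__main__":
    print(strip_numbering("01#journal… xxx\n10#Last TESTxyz ijk"))
    # A '#' inside the data is swallowed too, so only "b" is left here:
    print(strip_numbering("01#a#b"))
```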

I just hope, that results will be correct, too, with your real data ;-))

See you later

Cheers,

guy038

Vivianjenylord

@guy038
guy038, I am very grateful to you. You are a great person for your selfless help and for taking the time to write such an excellent explanation. I was able to solve my problem with the text.
Thank you