Delete text that repeats in the same line
-
Hello friends, I have a document in which the same text repeats within a line. Can the repeated text be removed with Notepad++ regular expressions?
The document is like this:
123.45607894.165@abcd;aba 123.45607894.165@abcd;aba
9871.001@fab:9782581afa xx9871.001@fab:9782581afa 9871.001@fab:9782581afa
00040 jhjhjdsadj2 00040 jhjhjdsadj2 ""00040 jhjhjdsadj2
journal… xxx journal
journal… @ journal
the same 1234 the same
And I need this:
123.45607894.165@abcd;aba
9871.001@fab:9782581afa xx
00040 jhjhjdsadj2 ""
journal… xxx
journal… @
the same 1234
If someone helps me solve this, I will be grateful.
Thank you
-
I don’t know, but you can experiment with regexes at https://regex101.com/, a free service. You can even sign up for an account and save your tests.
-
Hi, vivianjenylord,
I found a solution with regexes, which also needs other text manipulations, such as sorting and column numbering.
However, I’m not satisfied, because the method is a bit complicated, and I’m still wondering whether it’s worth posting!
Please, one question. What about the following case, with the line :
abcdefghij 12345 abcdefghij abcdefghij xyz abcdefghij
Must we keep:
- The shortest item (abcdefghij)
- The longest item (abcdefghij 12345)
- The last item, sorted alphabetically ascending (abcdefghij xyz)
Best Regards,
guy038
-
@guy038
Thank you for responding. In my text I order them alphabetically. Regarding your question:
I would like to obtain as a result
abcdefghij 12345 xyz
If that is too complicated for me to understand (I’m just a web designer), keep
the last item, sorted alphabetically ascending (abcdefghij xyz)
Friend, thank you very much for your selfless help
-
Hi, vivianjenylord, and All,
Thanks for your reply. On my side, I’ve managed to simplify the main regex :-) The method needs a lot of steps, although none of them is difficult to carry out ;-))
Well, let’s go!
I’ll use the text below as a working file. It corresponds to your sample text, plus four more lines, and some blank characters and blank lines in order to cover every case :-))
123.45607894.165@abcd;aba 123.45607894.165@abcd;aba
9871.001@fab:9782581afa xx9871.001@fab:9782581afa 9871.001@fab:9782581afa
00040 jhjhjdsadj2 00040 jhjhjdsadj2 ""00040 jhjhjdsadj2
journal… xxx journal
journal… @ journal
the same 1234 the same
abcde 12345 abcde abcdexyz tuvabcde
fghij 12345fghij fghij xyzfghij xyz tuv
PQR 12345 PQR PQRxyz tuvPQR
Last TEST 12345Last TEST Last TEST xyzLast TEST xyz tuvLast TESTxyz ijkLast TEST
First, paste this text into a new N++ tab
Now, we’re going to:
- Delete possible blank lines, pure or not
- Trim possible blank characters at the beginning and/or end of each line
- Insert a character, not yet used in your file, at the beginning of each line, to act as a separator
I chose the # symbol, but any single character would be appropriate. However, note that if this character is a regular-expression meta-character, don’t forget to escape it with the \ character in order to use it literally!
So:
- Open the Replace dialog (Ctrl + H)
- SEARCH ^\h*\R|^\h+|\h+$|^(.)
- REPLACE ?1#\1
- Select the Regular expression search mode
- Tick the Wrap around option
- Click once on the Replace All button
You should obtain the following text :
#123.45607894.165@abcd;aba 123.45607894.165@abcd;aba
#9871.001@fab:9782581afa xx9871.001@fab:9782581afa 9871.001@fab:9782581afa
#00040 jhjhjdsadj2 00040 jhjhjdsadj2 ""00040 jhjhjdsadj2
#journal… xxx journal
#journal… @ journal
#the same 1234 the same
#abcde 12345 abcde abcdexyz tuvabcde
#fghij 12345fghij fghij xyzfghij xyz tuv
#PQR 12345 PQR PQRxyz tuvPQR
#Last TEST 12345Last TEST Last TEST xyzLast TEST xyz tuvLast TESTxyz ijkLast TEST
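For anyone who would rather script this first step outside N++, a rough Python equivalent might look like the sketch below. It is not part of the method above: the function name prepare_lines and its sep parameter are made up for illustration, and it only treats spaces and tabs as blank characters, whereas \h in N++ covers more.

```python
def prepare_lines(text: str, sep: str = "#") -> str:
    """Drop blank lines, trim spaces/tabs, and prefix each line with sep."""
    prepared = []
    for line in text.splitlines():
        stripped = line.strip(" \t")  # assumption: only spaces and tabs count as blanks
        if stripped:                  # skip lines that are empty or whitespace-only
            prepared.append(sep + stripped)
    # End every line, including the last one, with a line break.
    return "".join(item + "\n" for item in prepared)
```

Calling prepare_lines on the working text should give the same kind of result as the single Replace All above, as long as the separator does not already occur in the data.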
Then:
- Place the cursor/caret at the very beginning (line 1, column 1)
- Open the Column editor (Alt + C)
- Select the option Number to Insert
- Type in 1 in the initial number and increase by fields
- Tick the Leading zeros box
- If necessary, select the Dec format
- Click on the OK button
- Delete the number 11, at the end
You’ll get the text, below :
01#123.45607894.165@abcd;aba 123.45607894.165@abcd;aba
02#9871.001@fab:9782581afa xx9871.001@fab:9782581afa 9871.001@fab:9782581afa
03#00040 jhjhjdsadj2 00040 jhjhjdsadj2 ""00040 jhjhjdsadj2
04#journal… xxx journal
05#journal… @ journal
06#the same 1234 the same
07#abcde 12345 abcde abcdexyz tuvabcde
08#fghij 12345fghij fghij xyzfghij xyz tuv
09#PQR 12345 PQR PQRxyz tuvPQR
10#Last TEST 12345Last TEST Last TEST xyzLast TEST xyz tuvLast TESTxyz ijkLast TEST
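The numbering done with the Column editor can be sketched in Python as well. This is only an illustration under my own assumptions: number_lines is a hypothetical helper name, and the padding width is simply taken from the number of lines, which mimics the Leading zeros box for this example.

```python
def number_lines(text: str) -> str:
    """Prefix every line with a zero-padded, 1-based line number."""
    lines = text.splitlines()
    width = len(str(len(lines)))  # e.g. 10 lines -> 2 digits -> 01 .. 10
    return "".join(
        f"{number:0{width}d}{line}\n"
        for number, line in enumerate(lines, start=1)
    )
```

Because the lines already start with the # separator, the output has the same NN#text shape as the text above, and there is no trailing number to delete by hand.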
Now, we’re going to use the main regex, which:
- Cuts the text into several lines, each containing the repeated word
- Adds the correct numbering to each split line, for a later sort
So, open the Replace dialog again
- SEARCH (?-is)^((\d+#)(.{1,}[^ \r\n]).*?)\x20?(?=\3)
  - By default I assumed a case-sensitive search. If you prefer a case-insensitive search, use the modifiers (?i-s) at the beginning instead
- REPLACE \1\r\n\2
- Keep the same options as above
- Click on the Replace All button repeatedly (or use the Alt + A shortcut) until you get the message Replace All: 0 occurrences were replaced (6 hits for this example!)
You should obtain this 32-line text:
01#123.45607894.165@abcd;aba
01#123.45607894.165@abcd;aba
02#9871.001@fab:9782581afa xx
02#9871.001@fab:9782581afa
02#9871.001@fab:9782581afa
03#00040 jhjhjdsadj2
03#00040 jhjhjdsadj2 ""
03#00040 jhjhjdsadj2
04#journal… xxx
04#journal
05#journal… @
05#journal
06#the same 1234
06#the same
07#abcde 12345
07#abcde
07#abcdexyz tuv
07#abcde
08#fghij 12345
08#fghij
08#fghij xyz
08#fghij xyz tuv
09#PQR 12345
09#PQR
09#PQRxyz tuv
09#PQR
10#Last TEST 12345
10#Last TEST
10#Last TEST xyz
10#Last TEST xyz tuv
10#Last TESTxyz ijk
10#Last TEST
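To see what the repeated Replace All is doing, here is a minimal Python sketch of the same loop. It is a translation under assumptions, not the N++ regex itself: Python's re module is case-sensitive and keeps the dot from matching a line feed by default, so the (?-is) prefix is dropped, the replacement uses \n instead of \r\n, and the input is assumed to use \n line endings. The name split_repeats is hypothetical.

```python
import re

# Group 2: the "NN#" prefix. Group 3: the longest leading text that occurs
# again later on the same line. Group 1: everything up to that next occurrence.
SPLIT_RE = re.compile(r"^((\d+#)(.{1,}[^ \r\n]).*?)\x20?(?=\3)", re.MULTILINE)


def split_repeats(text: str) -> str:
    """Split numbered lines before each repeat until nothing is left to split."""
    while True:
        # One pass corresponds to one click on Replace All.
        text, count = SPLIT_RE.subn(r"\1\n\2", text)
        if count == 0:  # same stop condition as "0 occurrences were replaced"
            return text
```

Each pass splits a line at most once, which is why the button has to be clicked several times in N++ and why the sketch loops.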
Ah, almost finished ! Now, we perform a classical N++ sort, using the option :
Edit > Line Operations > Sort Lines Lexicographically Ascending
After the sort, don’t forget to add at least one pure blank line after the sorted results (IMPORTANT)
Hence, the sorted text :
01#123.45607894.165@abcd;aba
01#123.45607894.165@abcd;aba
02#9871.001@fab:9782581afa
02#9871.001@fab:9782581afa
02#9871.001@fab:9782581afa xx
03#00040 jhjhjdsadj2
03#00040 jhjhjdsadj2
03#00040 jhjhjdsadj2 ""
04#journal
04#journal… xxx
05#journal
05#journal… @
06#the same
06#the same 1234
07#abcde
07#abcde
07#abcde 12345
07#abcdexyz tuv
08#fghij
08#fghij 12345
08#fghij xyz
08#fghij xyz tuv
09#PQR
09#PQR
09#PQR 12345
09#PQRxyz tuv
10#Last TEST
10#Last TEST
10#Last TEST 12345
10#Last TEST xyz
10#Last TEST xyz tuv
10#Last TESTxyz ijk
Finally, for each line number, we must keep only the last item. So:
- For the last time, open the Replace dialog
- SEARCH ^(?-s)(.+)\R(\1.*\R)+
- REPLACE \2
- Keep the same options as above
- Click once on the Replace All button
Almost the final text expected !
01#123.45607894.165@abcd;aba
02#9871.001@fab:9782581afa xx
03#00040 jhjhjdsadj2 ""
04#journal… xxx
05#journal… @
06#the same 1234
07#abcdexyz tuv
08#fghij xyz tuv
09#PQRxyz tuv
10#Last TESTxyz ijk
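The sort followed by the "keep the last item" replacement can be sketched in Python too. Again, this is an approximation with assumptions of my own: keep_last_items is a hypothetical name, sorted() orders by code point (which I assume is close enough to the N++ lexicographic sort for this data), and \R is replaced by \n. Ending every line with \n plays the role of the extra blank line added after the sort.

```python
import re

# A line, followed by one or more lines that begin with that same line.
# Group 2 is captured again on each repetition, so it ends up holding the last one.
KEEP_LAST_RE = re.compile(r"^(.+)\n(\1.*\n)+", re.MULTILINE)


def keep_last_items(text: str) -> str:
    """Sort the split lines, then keep only the last line of each run."""
    ordered = "".join(line + "\n" for line in sorted(text.splitlines()))
    return KEEP_LAST_RE.sub(r"\2", ordered)
```

Since every split line of a group begins with the shortest item of that group, the sort puts that shortest item first and the regex collapses the whole run onto its final line.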
To finish, we just have to get rid of the numbering at the beginning of each line. No problem with this simple regex:
- SEARCH (?-s)^.+#
- REPLACE Leave EMPTY
- Keep the same options as above
- Click once on the Replace All button
Here we are ! A bit of work but a correct result, isn’t it ?
123.45607894.165@abcd;aba
9871.001@fab:9782581afa xx
00040 jhjhjdsadj2 ""
journal… xxx
journal… @
the same 1234
abcdexyz tuv
fghij xyz tuv
PQRxyz tuv
Last TESTxyz ijk
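For completeness, the last clean-up could look like this in Python. It is a small sketch with a made-up name, strip_numbering. Like the N++ regex, it is greedy and removes everything up to the last # on each line, so it relies on the separator not occurring in the data itself.

```python
import re


def strip_numbering(text: str) -> str:
    """Remove the leading "NN#" prefix from every line."""
    # Without re.DOTALL the dot stays on one line, like (?-s) in N++.
    return re.sub(r"^.+#", "", text, flags=re.MULTILINE)
```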
I just hope that the results will be correct with your real data, too ;-))
See you later
Cheers,
guy038
-
-
@guy038
guy038, I am very grateful to you. You are a great person for your selfless help and for taking the time to write such an excellent explanation of the subject. I was able to solve my problem with the text.
Thank you