Delete text that repeats in the same line



  • Hello friends, I have a document in which the same word is repeated within a line. Can the repeated word be removed with regular expressions in Notepad++?

    The document looks like this:

    123.45607894.165@abcd;aba 123.45607894.165@abcd;aba
    9871.001@fab:9782581afa xx9871.001@fab:9782581afa 9871.001@fab:9782581afa
    00040 jhjhjdsadj2 00040 jhjhjdsadj2 ""00040 jhjhjdsadj2
    journal… xxx journal
    journal… @ journal
    the same 1234 the same

    And I need this:

    123.45607894.165@abcd;aba
    9871.001@fab:9782581afa xx
    00040 jhjhjdsadj2 ""
    journal… xxx
    journal… @
    the same 1234

    I’d be grateful if someone could help me solve this.
    Thank you.



  • I don’t know, but you can experiment with regexes at https://regex101.com/, a free service. You can even sign up for an account and save your tests.



  • Hi, vivianjenylord,

    I found a solution with regexes, which also needs other text manipulations, such as sorting and column numbering.

    However, I’m not satisfied, because the method is a bit complicated, and I’m still wondering whether it’s worth posting!


    One question, please. What about the following case, with this line:

    abcdefghij 12345 abcdefghij abcdefghij xyz abcdefghij

    Should we keep:

    • The shortest item ( abcdefghij )
    • The longest item ( abcdefghij 12345 )
    • The last item, sorted alphabetically ascending ( abcdefghij xyz )

    Best Regards,

    guy038



  • @guy038
    Thank you for responding. In my text, I order them alphabetically. Regarding your question, I would like to obtain this result:
    abcdefghij 12345 xyz

    If that is too complicated for me to understand (I’m just a web designer), keep
    the last item, sorted alphabetically ascending (abcdefghij xyz).

    Thank you very much, friend, for your selfless help.



  • Hi, vivianjenylord, and All,

    Thanks for your reply. On my side, I’ve managed to simplify the main regex :-) The method needs a lot of steps, although none of them is difficult to carry out ;-))

    Well, let’s go!


    I’ll use the text below as a working file. It corresponds to your sample text, with four more lines, plus some blank characters and blank lines, in order to cover every case :-))

    123.45607894.165@abcd;aba 123.45607894.165@abcd;aba
    9871.001@fab:9782581afa xx9871.001@fab:9782581afa 9871.001@fab:9782581afa
    
    00040 jhjhjdsadj2 00040 jhjhjdsadj2 ""00040 jhjhjdsadj2
    
    journal… xxx journal
           journal… @ journal
    the same 1234 the same
    abcde 12345 abcde abcdexyz tuvabcde
    
    
    fghij 12345fghij fghij xyzfghij xyz  tuv
    PQR 12345 PQR PQRxyz tuvPQR               
    Last TEST 12345Last TEST Last TEST xyzLast TEST xyz     tuvLast TESTxyz ijkLastTEST
    

    First, paste this text into a new N++ tab.

    Now, we’re going to :

    • Delete possible blank lines, pure or not

    • Trim possible blank characters, at beginning and/or end of each line

    • Insert a character, not yet used in your file, at beginning of each line, to act as a separator

    I chose the # symbol but any single character would be appropriate. However, note that if this character is a meta-character of regular expressions, don’t forget to escape it with the \ char, in order to use it literally !

    So:

    • Open the Replace dialog ( Ctrl + H )

    • SEARCH ^\h*\R|^\h+|\h+$|^(.)

    • REPLACE ?1#\1

    • Select the Regular expression search mode

    • Tick the Wrap around option

    • Click, once, on the Replace All button

    You should obtain the following text :

    #123.45607894.165@abcd;aba 123.45607894.165@abcd;aba
    #9871.001@fab:9782581afa xx9871.001@fab:9782581afa 9871.001@fab:9782581afa
    #00040 jhjhjdsadj2 00040 jhjhjdsadj2 ""00040 jhjhjdsadj2
    #journal… xxx journal
    #journal… @ journal
    #the same 1234 the same
    #abcde 12345 abcde abcdexyz tuvabcde
    #fghij 12345fghij fghij xyzfghij xyz  tuv
    #PQR 12345 PQR PQRxyz tuvPQR
    #Last TEST 12345Last TEST Last TEST xyzLast TEST xyz     tuvLast TESTxyz ijkLastTEST
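
    For readers who prefer scripting, the three clean-up actions above (delete blank lines, trim each line, insert a separator) can be sketched in Python. This is only a minimal sketch and not a translation of the N++ regex: it works line by line, and the helper name clean_and_mark is an assumption made for illustration.

```python
def clean_and_mark(text, sep="#"):
    """Drop blank lines, trim spaces/tabs, and prefix each line with `sep`.

    `clean_and_mark` is a hypothetical helper name, not part of Notepad++.
    """
    marked = []
    for raw in text.splitlines():
        line = raw.strip(" \t")      # trim leading and trailing blanks
        if line:                     # skip empty or whitespace-only lines
            marked.append(sep + line)
    return "\n".join(marked)


sample = "the same 1234 the same\n   \n       journal… @ journal\nPQR 12345 PQR PQRxyz tuvPQR      \n"
print(clean_and_mark(sample))
```

    As in the post, the separator should be a character that does not occur anywhere in the data.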
    

    Then :

    • Place the cursor/caret at the very beginning ( line 1, column 1 )

    • Open the Column editor ( Alt + C )

    • Select the option Number to Insert

    • Type 1 in both the Initial number and Increase by fields

    • Tick the Leading zeros box

    • If necessary, select the Dec format

    • Click on the OK button

    • Delete the number 11, which is inserted on the empty line at the end

    You’ll get the text, below :

    01#123.45607894.165@abcd;aba 123.45607894.165@abcd;aba
    02#9871.001@fab:9782581afa xx9871.001@fab:9782581afa 9871.001@fab:9782581afa
    03#00040 jhjhjdsadj2 00040 jhjhjdsadj2 ""00040 jhjhjdsadj2
    04#journal… xxx journal
    05#journal… @ journal
    06#the same 1234 the same
    07#abcde 12345 abcde abcdexyz tuvabcde
    08#fghij 12345fghij fghij xyzfghij xyz  tuv
    09#PQR 12345 PQR PQRxyz tuvPQR
    10#Last TEST 12345Last TEST Last TEST xyzLast TEST xyz     tuvLast TESTxyz ijkLastTEST
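
    The numbering produced by the Column editor can be approximated in Python as follows. It is a rough sketch under the assumption that the lines are already held in a list; number_lines is a hypothetical helper name, and the padding width is simply derived from the number of lines.

```python
def number_lines(lines):
    """Prefix each line with a zero-padded, 1-based counter.

    `number_lines` is a hypothetical helper, shown for illustration only.
    """
    width = len(str(len(lines)))      # e.g. 10 lines need 2 digits
    return [f"{i:0{width}d}{line}" for i, line in enumerate(lines, start=1)]
```

    The zero padding matters because the later sort is lexicographic: without it, line 10 would sort before line 2.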
    

    Now, we’re going to use the main regex, which :

    • Cuts the text into several lines, each containing the repeated word

    • Adds the correct numbering to each split line, for a future sort action

    So, open the Replace dialog again

    • SEARCH (?-is)^((\d+#)(.{1,}[^ \r\n]).*?)\x20?(?=\3)

      • By default, I assumed a case-sensitive search… If you prefer a case-insensitive search, use the modifiers (?i-s) at the beginning instead
    • REPLACE \1\r\n\2

    • Keep the same options, as above

    • Click on the Replace All button repeatedly ( or use the Alt + A shortcut ) until you get the message Replace All: 0 occurrences were replaced ( 6 clicks for this example ! )

    You should obtain this 32-line text :

    01#123.45607894.165@abcd;aba
    01#123.45607894.165@abcd;aba
    02#9871.001@fab:9782581afa xx
    02#9871.001@fab:9782581afa
    02#9871.001@fab:9782581afa
    03#00040 jhjhjdsadj2
    03#00040 jhjhjdsadj2 ""
    03#00040 jhjhjdsadj2
    04#journal… xxx
    04#journal
    05#journal… @
    05#journal
    06#the same 1234
    06#the same
    07#abcde 12345
    07#abcde
    07#abcdexyz tuv
    07#abcde
    08#fghij 12345
    08#fghij
    08#fghij xyz
    08#fghij xyz  tuv
    09#PQR 12345
    09#PQR
    09#PQRxyz tuv
    09#PQR
    10#Last TEST 12345
    10#Last TEST
    10#Last TEST xyz
    10#Last TEST xyz     tuv
    10#Last TESTxyz ijk
    10#Last TEST
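
    The repeated Replace All step can be imitated with Python’s re module. The sketch below is an approximation under a few assumptions: \x20 is written as a literal space, the leading (?-is) is dropped because Python’s defaults are already case-sensitive with a dot that excludes newlines, and \r\n is simplified to \n. The names SPLIT_RE and split_repeats are made up for this example.

```python
import re

# Port of the SEARCH regex from the post; one split per line per pass.
SPLIT_RE = re.compile(r"^((\d+#)(.{1,}[^ \r\n]).*?) ?(?=\3)", re.MULTILINE)


def split_repeats(text):
    """Apply the replacement repeatedly until no occurrence is left.

    `split_repeats` is a hypothetical helper name, for illustration only.
    """
    while True:
        # \1 = text up to the repeat, \2 = the "NN#" prefix for the new line
        text, hits = SPLIT_RE.subn(r"\1\n\2", text)
        if hits == 0:
            return text


print(split_repeats("07#abcde 12345 abcde abcdexyz tuvabcde"))
```

    The loop plays the role of clicking Replace All until the 0 occurrences message appears.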
    

    Ah, almost finished ! Now, we perform a classic N++ sort, using the option :

    Edit > Line Operations > Sort Lines Lexicographically Ascending

    After the sort, don’t forget to add at least one pure blank line after the sorted results ( IMPORTANT ): the next regex expects a line break after the last line

    Hence, the sorted text :

    01#123.45607894.165@abcd;aba
    01#123.45607894.165@abcd;aba
    02#9871.001@fab:9782581afa
    02#9871.001@fab:9782581afa
    02#9871.001@fab:9782581afa xx
    03#00040 jhjhjdsadj2
    03#00040 jhjhjdsadj2
    03#00040 jhjhjdsadj2 ""
    04#journal
    04#journal… xxx
    05#journal
    05#journal… @
    06#the same
    06#the same 1234
    07#abcde
    07#abcde
    07#abcde 12345
    07#abcdexyz tuv
    08#fghij
    08#fghij 12345
    08#fghij xyz
    08#fghij xyz  tuv
    09#PQR
    09#PQR
    09#PQR 12345
    09#PQRxyz tuv
    10#Last TEST
    10#Last TEST
    10#Last TEST 12345
    10#Last TEST xyz
    10#Last TEST xyz     tuv
    10#Last TESTxyz ijk
    

    Finally, for each line number, we must keep the last item, only. So :

    • For the last time, open the Replace dialog

    • SEARCH ^(?-s)(.+)\R(\1.*\R)+

    • REPLACE \2

    • Keep the same options, as above

    • Click, once, on the Replace All button

    This gives almost the expected final text !

    01#123.45607894.165@abcd;aba
    02#9871.001@fab:9782581afa xx
    03#00040 jhjhjdsadj2 ""
    04#journal… xxx
    05#journal… @
    06#the same 1234
    07#abcdexyz tuv
    08#fghij xyz  tuv
    09#PQRxyz tuv
    10#Last TESTxyz ijk
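
    The sort and the keep-the-last-item replacement can be sketched together in Python. This is an illustration under stated assumptions rather than an exact replica: \R is simplified to \n, Python’s sorted() orders by code point (which may differ from the N++ sort for non-ASCII characters), and the names KEEP_LAST_RE and sort_and_keep_last are hypothetical.

```python
import re

# Port of SEARCH ^(?-s)(.+)\R(\1.*\R)+ ; group 2 ends up holding the last
# line of each run, because a repeated group keeps its final iteration.
KEEP_LAST_RE = re.compile(r"^(.+)\n(\1.*\n)+", re.MULTILINE)


def sort_and_keep_last(lines):
    """Sort the numbered lines, then keep only the last line of each run."""
    # The trailing "\n" mirrors the blank line added after the sort.
    text = "\n".join(sorted(lines)) + "\n"
    return KEEP_LAST_RE.sub(r"\2", text).splitlines()


split_lines = [
    "06#the same 1234", "06#the same",
    "07#abcde 12345", "07#abcde", "07#abcdexyz tuv", "07#abcde",
]
print(sort_and_keep_last(split_lines))
```

    Each run starts with the shortest item for a given number, so every following line of that run begins with it, which is what the \1 backreference relies on.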
    

    To finish, we just have to get rid of the numbering at the beginning of each line. No problem, with this simple regex :

    • SEARCH (?-s)^.+#

    • REPLACE Leave EMPTY

    • Keep the same options, as above

    • Click, once, on the Replace All button


    Here we are ! A bit of work but a correct result, isn’t it ?

    123.45607894.165@abcd;aba
    9871.001@fab:9782581afa xx
    00040 jhjhjdsadj2 ""
    journal… xxx
    journal… @
    the same 1234
    abcdexyz tuv
    fghij xyz  tuv
    PQRxyz tuv
    Last TESTxyz ijk
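
    The final clean-up can also be written as a short Python sketch. Note one deliberate difference, stated here as an assumption: the pattern below anchors on the digits of the counter, which is slightly stricter than the post’s ^.+# (that one removes everything up to the last # on the line). The name strip_numbering is hypothetical.

```python
import re


def strip_numbering(text, sep="#"):
    """Remove the leading counter and separator from every line.

    `strip_numbering` is a hypothetical helper name, for illustration only.
    """
    # re.escape keeps the separator literal even if it is a regex meta-character.
    return re.sub(r"^\d+" + re.escape(sep), "", text, flags=re.MULTILINE)


print(strip_numbering("06#the same 1234\n07#abcdexyz tuv"))
```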
    

    I just hope that the results will be correct with your real data, too ;-))

    See you later

    Cheers,

    guy038



  • @guy038
    guy038, I am very grateful to you. You are a great person for helping so selflessly and for taking the time to explain the subject so well. I was able to solve my problem with the text.
    Thank you

