Delete text that repeats in the same line

  • V
    Vivianjenylord
    last edited by Oct 31, 2018, 6:46 AM

    Hello friends, I have a document in which the same word is repeated within a line. Can Notepad++ regular expressions remove the repeated word?

    The document is like this:

    123.45607894.165@abcd;aba 123.45607894.165@abcd;aba
    9871.001@fab:9782581afa xx9871.001@fab:9782581afa 9871.001@fab:9782581afa
    00040 jhjhjdsadj2 00040 jhjhjdsadj2 ""00040 jhjhjdsadj2
    journal… xxx journal
    journal… @ journal
    the same 1234 the same

    And I need so:

    123.45607894.165@abcd;aba
    9871.001@fab:9782581afa xx
    00040 jhjhjdsadj2 ""
    journal… xxx
    journal… @
    the same 1234

    If someone can help me solve this, I’ll be grateful.
    Thank you

    • B
      Blafulous Crassley
      last edited by Oct 31, 2018, 11:30 AM

      I don’t know, but you can experiment with regexes at https://regex101.com/, a free service. You can even sign up for an account and save your tests.

      • G
        guy038
        last edited by guy038 Nov 19, 2022, 1:53 AM Oct 31, 2018, 3:36 PM

        Hi, vivianjenylord,

        I found a solution with regexes, which also needs other text manipulations, such as sorting and column numbering

        However, I’m not satisfied, because the method is a bit complicated and I’m still wondering whether it’s worth posting !


        Please, one question. What about the following case, with the line :

        abcdefghij 12345 abcdefghij abcdefghij xyz abcdefghij

        Must we keep :

        • The shortest item ( abcdefghij )
        • The longest item ( abcdefghij 12345 )
        • The last item, sorted alphabetically ascending ( abcdefghij xyz )
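
To make the three options concrete, here is a small Python sketch (not part of the Notepad++ method). The `items` list is written by hand and is an assumption: it is the sample line above, cut in front of each repetition of `abcdefghij`.

```python
# Hand-written pieces of the sample line, cut before each repeated "abcdefghij"
# (an illustrative assumption, not produced by any Notepad++ command).
items = ["abcdefghij 12345", "abcdefghij", "abcdefghij xyz", "abcdefghij"]

shortest = min(items, key=len)      # first of the shortest pieces
longest = max(items, key=len)       # first of the longest pieces
last_sorted = sorted(items)[-1]     # last piece after an ascending sort

print(shortest)
print(longest)
print(last_sorted)
```

The three variables correspond, in order, to the three bullet points above.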

        Best Regards,

        guy038

        • V
          Vivianjenylord @guy038
          last edited by Vivianjenylord Oct 31, 2018, 4:22 PM Oct 31, 2018, 4:21 PM

          @guy038
          Thank you for responding. In my text, I sort them alphabetically. Regarding your question:
          I would like to obtain as a result
          abcdefghij 12345 xyz

          If that is too complicated for me to understand (I’m just a web designer), keep
          the last item, sorted alphabetically ascending (abcdefghij xyz)

          Friend, thank you very much for your selfless help

          • G
            guy038
            last edited by guy038 Nov 1, 2018, 1:16 AM Nov 1, 2018, 12:42 AM

            Hi, vivianjenylord, and All,

            Thanks for your reply. On my side, I’ve managed to simplify the main regex :-) The method needs a lot of steps, although none of them is difficult to carry out ;-))

            Well, Let’s go !


            I’ll use the text below as the working file. It corresponds to your sample text, with four more lines, plus some blank characters and blank lines, in order to cover every case :-))

            123.45607894.165@abcd;aba 123.45607894.165@abcd;aba
            9871.001@fab:9782581afa xx9871.001@fab:9782581afa 9871.001@fab:9782581afa
            
            00040 jhjhjdsadj2 00040 jhjhjdsadj2 ""00040 jhjhjdsadj2
            
            journal… xxx journal
                   journal… @ journal
            the same 1234 the same
            abcde 12345 abcde abcdexyz tuvabcde
            
            
            fghij 12345fghij fghij xyzfghij xyz  tuv
            PQR 12345 PQR PQRxyz tuvPQR               
            Last TEST 12345Last TEST Last TEST xyzLast TEST xyz     tuvLast TESTxyz ijkLast TEST
            

            First, paste this text into a new N++ tab

            Now, we’re going to :

            • Delete any blank lines, whether truly empty or containing only blank characters

            • Trim any blank characters at the beginning and/or end of each line

            • Insert a character, not yet used in your file, at the beginning of each line, to act as a separator

            I chose the # symbol, but any single character would be appropriate. However, note that if this character is a regular expression meta-character, don’t forget to escape it with the \ char, in order to use it literally !

            So:

            • Open the Replace dialog ( Ctrl + H )

            • SEARCH ^\h*\R|^\h+|\h+$|^(.)

            • REPLACE ?1#\1

            • Select the Regular expression search mode

            • Tick the Wrap around option

            • Click, once, on the Replace All button

            You should obtain the following text :

            #123.45607894.165@abcd;aba 123.45607894.165@abcd;aba
            #9871.001@fab:9782581afa xx9871.001@fab:9782581afa 9871.001@fab:9782581afa
            #00040 jhjhjdsadj2 00040 jhjhjdsadj2 ""00040 jhjhjdsadj2
            #journal… xxx journal
            #journal… @ journal
            #the same 1234 the same
            #abcde 12345 abcde abcdexyz tuvabcde
            #fghij 12345fghij fghij xyzfghij xyz  tuv
            #PQR 12345 PQR PQRxyz tuvPQR
            #Last TEST 12345Last TEST Last TEST xyzLast TEST xyz     tuvLast TESTxyz ijkLast TEST
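
For readers who would rather script this clean-up step, here is a rough Python equivalent. It is only a sketch: Python's `re` module has no `\h` or `\R`, so plain string methods are used instead of the regex above, and the function name `clean_and_mark` is made up for this example.

```python
def clean_and_mark(text: str, sep: str = "#") -> list[str]:
    """Drop blank lines, trim spaces/tabs, and prefix each line with sep."""
    marked = []
    for raw in text.splitlines():
        stripped = raw.strip(" \t")   # trim blanks at both ends of the line
        if stripped:                  # skip empty or whitespace-only lines
            marked.append(sep + stripped)
    return marked


sample = "  the same 1234 the same  \n\n \t \n       journal… @ journal\n"
for line in clean_and_mark(sample):
    print(line)
```

As in the Notepad++ version, `sep` must be a character that does not occur anywhere in the file.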
            

            Then :

            • Place the cursor/caret at the very beginning ( line 1, column 1 )

            • Open the Column editor ( Alt + C )

            • Select the option Number to Insert

            • Type 1 in both the Initial number and Increase by fields

            • Tick the Leading zeros box

            • If necessary, select the Dec format

            • Click on the OK button

            • Delete the extra number 11, at the very end

            You’ll get the text, below :

            01#123.45607894.165@abcd;aba 123.45607894.165@abcd;aba
            02#9871.001@fab:9782581afa xx9871.001@fab:9782581afa 9871.001@fab:9782581afa
            03#00040 jhjhjdsadj2 00040 jhjhjdsadj2 ""00040 jhjhjdsadj2
            04#journal… xxx journal
            05#journal… @ journal
            06#the same 1234 the same
            07#abcde 12345 abcde abcdexyz tuvabcde
            08#fghij 12345fghij fghij xyzfghij xyz  tuv
            09#PQR 12345 PQR PQRxyz tuvPQR
            10#Last TEST 12345Last TEST Last TEST xyzLast TEST xyz     tuvLast TESTxyz ijkLast TEST
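
The numbering done with the Column editor can be sketched in Python as well. This is an illustration only: `number_lines` is a hypothetical helper, and it assumes the lines already carry the # separator from the previous step.

```python
def number_lines(lines: list[str]) -> list[str]:
    """Prefix each line with its 1-based number, zero-padded to equal width."""
    width = len(str(len(lines)))      # digits needed for the largest number
    return [f"{i:0{width}d}{line}" for i, line in enumerate(lines, start=1)]


for line in number_lines(["#the same 1234 the same", "#journal… @ journal"]):
    print(line)
```

The zero padding matters for the later sort: without it, line 10 would sort before line 2.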
            

            Now, we’re going to use the main regex, which :

            • Cuts the text into several lines, each beginning with the repeated word

            • Adds the correct numbering to each split line, for a future sort action

            So, open the Replace dialog again

            • SEARCH (?-is)^((\d+#)(.{1,}[^ \r\n]).*?)\x20?(?=\3)

              • By default, I assumed a case-sensitive search. If you prefer a case-insensitive search, use the modifiers (?i-s) at the beginning instead
            • REPLACE \1\r\n\2

            • Keep the same options, as above

            • Click on the Replace All button, repeatedly ( or use the Alt + A shortcut ), until you get the message Replace All: 0 occurrences were replaced ( 6 clicks for this example ! )

            You should obtain this 32-line text :

            01#123.45607894.165@abcd;aba
            01#123.45607894.165@abcd;aba
            02#9871.001@fab:9782581afa xx
            02#9871.001@fab:9782581afa
            02#9871.001@fab:9782581afa
            03#00040 jhjhjdsadj2
            03#00040 jhjhjdsadj2 ""
            03#00040 jhjhjdsadj2
            04#journal… xxx
            04#journal
            05#journal… @
            05#journal
            06#the same 1234
            06#the same
            07#abcde 12345
            07#abcde
            07#abcdexyz tuv
            07#abcde
            08#fghij 12345
            08#fghij
            08#fghij xyz
            08#fghij xyz  tuv
            09#PQR 12345
            09#PQR
            09#PQRxyz tuv
            09#PQR
            10#Last TEST 12345
            10#Last TEST
            10#Last TEST xyz
            10#Last TEST xyz     tuv
            10#Last TESTxyz ijk
            10#Last TEST
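
The repeated Replace All can be mimicked in Python with a loop. This is a sketch under a few assumptions: the `(?-is)` modifiers are dropped because case-sensitive matching and a dot that stops at line breaks are already the defaults of Python's `re`, `re.MULTILINE` is added so that `^` matches at each line start, and `\n` replaces `\r\n`. The name `split_repeats` is invented for this example.

```python
import re

# Same pattern as the SEARCH field above, minus the (?-is) modifiers.
SPLIT = re.compile(r"^((\d+#)(.{1,}[^ \r\n]).*?)\x20?(?=\3)", re.MULTILINE)


def split_repeats(text: str) -> str:
    """Apply the replacement again and again until nothing is replaced."""
    while True:
        # Each pass cuts every line at most once, just before the next repeat,
        # and starts the new line with the same number prefix (group 2).
        text, count = SPLIT.subn(r"\1\n\2", text)
        if count == 0:
            return text


print(split_repeats("07#abcde 12345 abcde abcdexyz tuvabcde"))
```

Each loop iteration plays the role of one click on the Replace All button.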
            

            Ah, almost finished ! Now, we perform a classical N++ sort, using the option :

            Edit > Line Operations > Sort Lines Lexicographically Ascending

            After the sort, don’t forget to add at least one pure blank line after the sorted results ( IMPORTANT ), so that the last line ends with a line break, which the next regex requires

            Hence, the sorted text :

            01#123.45607894.165@abcd;aba
            01#123.45607894.165@abcd;aba
            02#9871.001@fab:9782581afa
            02#9871.001@fab:9782581afa
            02#9871.001@fab:9782581afa xx
            03#00040 jhjhjdsadj2
            03#00040 jhjhjdsadj2
            03#00040 jhjhjdsadj2 ""
            04#journal
            04#journal… xxx
            05#journal
            05#journal… @
            06#the same
            06#the same 1234
            07#abcde
            07#abcde
            07#abcde 12345
            07#abcdexyz tuv
            08#fghij
            08#fghij 12345
            08#fghij xyz
            08#fghij xyz  tuv
            09#PQR
            09#PQR
            09#PQR 12345
            09#PQRxyz tuv
            10#Last TEST
            10#Last TEST
            10#Last TEST 12345
            10#Last TEST xyz
            10#Last TEST xyz     tuv
            10#Last TESTxyz ijk
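
Why the sort helps can be seen in a tiny Python sketch. Python's `sorted` compares strings by code point, which I assume gives the same order as the N++ sort for this ASCII sample. A string always sorts before any longer string that starts with it, so the bare repeated word lands first in its group.

```python
# Pieces of line 07, in the order produced by the split step.
pieces = ["07#abcde 12345", "07#abcde", "07#abcdexyz tuv", "07#abcde"]

ordered = sorted(pieces)   # ascending, character by character
for line in ordered:
    print(line)
```

The number prefix keeps the pieces of each original line together, and the last piece of each group is the one the next step keeps.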
            

            Finally, for each line number, we must keep only the last item. So :

            • For the last time, open the Replace dialog

            • SEARCH ^(?-s)(.+)\R(\1.*\R)+

            • REPLACE \2

            • Keep the same options, as above

            • Click, once, on the Replace All button

            This is almost the expected final text !

            01#123.45607894.165@abcd;aba
            02#9871.001@fab:9782581afa xx
            03#00040 jhjhjdsadj2 ""
            04#journal… xxx
            05#journal… @
            06#the same 1234
            07#abcdexyz tuv
            08#fghij xyz  tuv
            09#PQRxyz tuv
            10#Last TESTxyz ijk
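
The same "keep the last item of each group" replacement can be tried in Python. This is a sketch with some adaptations: `\R` does not exist in Python's `re`, so `\n` is used, and the `(?-s)` modifier is dropped because it is already the default. The helper name `keep_last_per_group` is made up. Appending the final `"\n"` plays the role of the blank line added after the sort.

```python
import re

# A line, followed by one or more lines that start with that whole line.
KEEP_LAST = re.compile(r"^(.+)\n(\1.*\n)+", re.MULTILINE)


def keep_last_per_group(sorted_lines: list[str]) -> list[str]:
    """Collapse each run of lines sharing the first line as prefix to its last line."""
    text = "\n".join(sorted_lines) + "\n"   # every line must end with a line break
    # A repeated group keeps its last iteration, so \2 is the last line of the run.
    return KEEP_LAST.sub(r"\2", text).splitlines()


sorted_lines = [
    "06#the same",
    "06#the same 1234",
    "07#abcde",
    "07#abcde",
    "07#abcde 12345",
    "07#abcdexyz tuv",
]
for line in keep_last_per_group(sorted_lines):
    print(line)
```

Because a different number prefix can never start with the previous one, a run never crosses into the next original line.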
            

            To finish, we just have to get rid of the numbering at the beginning of each line. No problem, with this simple regex :

            • SEARCH (?-s)^.+#

            • REPLACE Leave EMPTY

            • Keep the same options, as above

            • Click, once, on the Replace All button


            Here we are ! A bit of work but a correct result, isn’t it ?

            123.45607894.165@abcd;aba
            9871.001@fab:9782581afa xx
            00040 jhjhjdsadj2 ""
            journal… xxx
            journal… @
            the same 1234
            abcdexyz tuv
            fghij xyz  tuv
            PQRxyz tuv
            Last TESTxyz ijk
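
This last replacement translates to Python almost directly. In this sketch the hypothetical helper `strip_numbering` uses the same `^.+#` pattern, without the `(?-s)` modifier, which is the default in Python's `re`.

```python
import re


def strip_numbering(lines: list[str]) -> list[str]:
    """Remove everything up to and including the last # of each line."""
    return [re.sub(r"^.+#", "", line) for line in lines]


for line in strip_numbering(["06#the same 1234", "10#Last TESTxyz ijk"]):
    print(line)
```

Since `.+` is greedy, the match runs up to the last # of the line, which is one more reason to pick a separator that does not occur in the data.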
            

            I just hope that the results will be correct with your real data, too ;-))

            See you later

            Cheers,

            guy038

            • V
              Vivianjenylord @guy038
              last edited by Nov 1, 2018, 5:41 AM

              @guy038
              guy038, I am very grateful to you. You are a great person for your selfless help and for taking the time to write such an excellent explanation of the subject. I was able to solve my problem with the text.
              Thank you
