    Delete text that repeats in the same line

    • Vivianjenylord

      Hello friends, I have a document in which the same word is repeated within a line. Can the repeated word be removed with Notepad++ regular expressions?

      The document is like this:

      123.45607894.165@abcd;aba 123.45607894.165@abcd;aba
      9871.001@fab:9782581afa xx9871.001@fab:9782581afa 9871.001@fab:9782581afa
      00040 jhjhjdsadj2 00040 jhjhjdsadj2 ""00040 jhjhjdsadj2
      journal… xxx journal
      journal… @ journal
      the same 1234 the same

      And I need so:

      123.45607894.165@abcd;aba
      9871.001@fab:9782581afa xx
      00040 jhjhjdsadj2 ""
      journal… xxx
      journal… @
      the same 1234

      I would be grateful if someone could help me solve this.
      Thank you.

      • Blafulous Crassley

        I don’t know the answer, but you can experiment with regexes at https://regex101.com/, a free service. You can even sign up for an account and save your tests.

        • guy038

          Hi, vivianjenylord,

          I found a solution with regexes, but it also needs other text manipulations, such as sorting and column numbering.

          However, I’m not satisfied, because the method is a bit complicated, and I’m still wondering whether it’s worth posting!


          One question, please. What should happen with the following line?

          abcdefghij 12345 abcdefghij abcdefghij xyz abcdefghij

          Should we keep:

          • The shortest item ( abcdefghij )
          • The longest item ( abcdefghij 12345 )
          • The last item, sorted alphabetically ascending ( abcdefghij xyz )

          Best Regards,

          guy038

          • Vivianjenylord @guy038

            @guy038
            Thank you for responding. In my text, I sort them alphabetically. Regarding your question,
            the result I would like to obtain is:
            abcdefghij 12345 xyz

            If that is too complicated for me to follow (I’m just a web designer), then keep
            the last item, sorted alphabetically ascending (abcdefghij xyz).

            Thank you very much, friend, for your selfless help.

            • guy038

              Hi, vivianjenylord, and All,

              Thanks for your reply. On my side, I’ve managed to simplify the main regex :-) The method needs quite a few steps, although none of them is difficult to carry out ;-))

              Well, let’s go!


              I’ll use the text below as the working file. It corresponds to your sample text, plus four more lines, some blank characters and some blank lines, in order to cover every case :-))

              123.45607894.165@abcd;aba 123.45607894.165@abcd;aba
              9871.001@fab:9782581afa xx9871.001@fab:9782581afa 9871.001@fab:9782581afa
              
              00040 jhjhjdsadj2 00040 jhjhjdsadj2 ""00040 jhjhjdsadj2
              
              journal… xxx journal
                     journal… @ journal
              the same 1234 the same
              abcde 12345 abcde abcdexyz tuvabcde
              
              
              fghij 12345fghij fghij xyzfghij xyz  tuv
              PQR 12345 PQR PQRxyz tuvPQR               
              Last TEST 12345Last TEST Last TEST xyzLast TEST xyz     tuvLast TESTxyz ijkLast TEST
              

              First, paste this text into a new N++ tab.

              Now, we’re going to:

              • Delete any blank lines, whether truly empty or containing only blank characters

              • Trim any blank characters at the beginning and/or end of each line

              • Insert a character not yet used in your file at the beginning of each line, to act as a separator

              I chose the # symbol, but any single character would do. However, if this character is a regex meta-character, don’t forget to escape it with the \ character, in order to use it literally!
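
Why the escaping matters can be illustrated with a small Python sketch. It uses Python's `re` module rather than the Boost engine inside Notepad++, so it is only an analogy. The `|` separator and the helper name `remove_marker` are hypothetical choices made for this example.

```python
import re

def remove_marker(line: str, sep: str, escape: bool = True) -> str:
    """Remove one leading separator character from the line."""
    # re.escape turns a metacharacter such as | into the literal \|
    pattern = "^" + (re.escape(sep) if escape else sep)
    return re.sub(pattern, "", line)

line = "|journal xxx journal"
# Unescaped, the pattern "^|" is an alternation with an empty branch: nothing is removed.
print(remove_marker(line, "|", escape=False))
# Escaped, the pattern "^\|" matches the literal bar at the start of the line.
print(remove_marker(line, "|"))
```

With `#` as the separator, as chosen here, the escaped and unescaped patterns behave the same.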

              So:

              • Open the Replace dialog ( Ctrl + H )

              • SEARCH ^\h*\R|^\h+|\h+$|^(.)

              • REPLACE ?1#\1

              • Select the Regular expression search mode

              • Tick the Wrap around option

              • Click, once, on the Replace All button

              You should obtain the following text :

              #123.45607894.165@abcd;aba 123.45607894.165@abcd;aba
              #9871.001@fab:9782581afa xx9871.001@fab:9782581afa 9871.001@fab:9782581afa
              #00040 jhjhjdsadj2 00040 jhjhjdsadj2 ""00040 jhjhjdsadj2
              #journal… xxx journal
              #journal… @ journal
              #the same 1234 the same
              #abcde 12345 abcde abcdexyz tuvabcde
              #fghij 12345fghij fghij xyzfghij xyz  tuv
              #PQR 12345 PQR PQRxyz tuvPQR
              #Last TEST 12345Last TEST Last TEST xyzLast TEST xyz     tuvLast TESTxyz ijkLast TEST
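
For anyone who would rather script this clean-up, here is a minimal Python sketch of the same step. It is an approximation under stated assumptions: Python's `re` module does not support `\h` or `\R`, so `[ \t]` and `\n` stand in for them, and the work is split into three passes because `re.sub` scans the original string only once. The function name `clean_and_mark` is invented for the example.

```python
import re

def clean_and_mark(text: str, sep: str = "#") -> str:
    """Trim every line, drop blank lines, and prefix each remaining line with sep."""
    # Pass 1: trim blank characters at the start and end of every line.
    text = re.sub(r"^[ \t]+|[ \t]+$", "", text, flags=re.MULTILINE)
    # Pass 2: delete the lines that are now empty.
    text = re.sub(r"^\n", "", text, flags=re.MULTILINE)
    # Pass 3: insert the separator in front of every remaining line.
    # A function is used as the replacement so that sep is always taken literally.
    return re.sub(r"^(?=.)", lambda m: sep, text, flags=re.MULTILINE)

sample = "the same 1234 the same\n\n       journal @ journal\nPQR 12345 PQR   \n"
print(clean_and_mark(sample), end="")
```

The input is assumed to use `\n` line endings and to end with a line break.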
              

              Then :

              • Place the cursor/caret at the very beginning ( line 1, column 1 )

              • Open the Column editor ( Alt + C )

              • Select the option Number to Insert

              • Type 1 in both the Initial number and Increase by fields

              • Tick the Leading zeros box

              • If necessary, select the Dec format

              • Click on the OK button

              • Delete the extra number 11 that appears at the very end

              You’ll get the text, below :

              01#123.45607894.165@abcd;aba 123.45607894.165@abcd;aba
              02#9871.001@fab:9782581afa xx9871.001@fab:9782581afa 9871.001@fab:9782581afa
              03#00040 jhjhjdsadj2 00040 jhjhjdsadj2 ""00040 jhjhjdsadj2
              04#journal… xxx journal
              05#journal… @ journal
              06#the same 1234 the same
              07#abcde 12345 abcde abcdexyz tuvabcde
              08#fghij 12345fghij fghij xyzfghij xyz  tuv
              09#PQR 12345 PQR PQRxyz tuvPQR
              10#Last TEST 12345Last TEST Last TEST xyzLast TEST xyz     tuvLast TESTxyz ijkLast TEST
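
The numbering step can be sketched in Python as well. This assumes the zero padding should be as wide as the largest line number, which is what the two-digit result above shows. The name `number_lines` is hypothetical.

```python
def number_lines(text: str) -> str:
    """Prefix every line with its 1-based number, zero-padded to a common width."""
    lines = text.splitlines()
    width = len(str(len(lines)))  # e.g. 2 digits when there are 10 lines
    return "".join(f"{i:0{width}d}{line}\n" for i, line in enumerate(lines, start=1))

print(number_lines("#the same 1234 the same\n#PQR 12345 PQR\n"), end="")
```

Unlike the Column editor, this sketch never numbers an empty last line, so there is no stray number to delete afterwards.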
              

              Now, we’re going to use the main regex, which:

              • Cuts the text into several lines, each containing the repeated word

              • Adds the correct numbering to each split line, for the later sort

              So, open the Replace dialog again

              • SEARCH (?-is)^((\d+#)(.{1,}[^ \r\n]).*?)\x20?(?=\3)

                • By default, I assumed a case-sensitive search. If you prefer a case-insensitive search, use the modifiers (?i-s) at the beginning instead
              • REPLACE \1\r\n\2

              • Keep the same options, as above

              • Click on the Replace All button repeatedly (or use the Alt + A shortcut) until you get the message Replace All: 0 occurrences were replaced (6 clicks for this example!)

              You should obtain this 32-lines text :

              01#123.45607894.165@abcd;aba
              01#123.45607894.165@abcd;aba
              02#9871.001@fab:9782581afa xx
              02#9871.001@fab:9782581afa
              02#9871.001@fab:9782581afa
              03#00040 jhjhjdsadj2
              03#00040 jhjhjdsadj2 ""
              03#00040 jhjhjdsadj2
              04#journal… xxx
              04#journal
              05#journal… @
              05#journal
              06#the same 1234
              06#the same
              07#abcde 12345
              07#abcde
              07#abcdexyz tuv
              07#abcde
              08#fghij 12345
              08#fghij
              08#fghij xyz
              08#fghij xyz  tuv
              09#PQR 12345
              09#PQR
              09#PQRxyz tuv
              09#PQR
              10#Last TEST 12345
              10#Last TEST
              10#Last TEST xyz
              10#Last TEST xyz     tuv
              10#Last TESTxyz ijk
              10#Last TEST
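
The repeated Replace All can be imitated in Python with a loop that stops when nothing is replaced any more. This is a sketch, not a faithful port: the leading `(?-is)` is left out because Python's defaults already are case-sensitive with a dot that does not match a line break, and `\n` replaces `\r\n` on the assumption that the text uses Unix line endings. The names `SPLIT` and `split_on_repeats` are invented for the example.

```python
import re

# Group 2 is the "NN#" prefix, group 3 the repeated word, group 1 everything
# up to (but not including) the next occurrence of that word.
SPLIT = re.compile(r"^((\d+#)(.{1,}[^ \r\n]).*?)\x20?(?=\3)", re.MULTILINE)

def split_on_repeats(text: str) -> str:
    """Split each line before every repetition of its leading word."""
    while True:
        # One pass splits each line at most once, like one Replace All click.
        text, hits = SPLIT.subn(r"\1\n\2", text)
        if hits == 0:
            return text

print(split_on_repeats("06#the same 1234 the same\n"), end="")
# prints:
# 06#the same 1234
# 06#the same
```

The loop is guaranteed to stop, because every replacement turns one line into two strictly shorter lines.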
              

              Ah, almost finished! Now, we perform a classic N++ sort, using the option:

              Edit > Line Operations > Sort Lines Lexicographically Ascending

              After the sort, don’t forget to add at least one pure blank line after the sorted results (IMPORTANT): the next regex needs every line, including the last one, to end with a line break.

              Hence, the sorted text :

              01#123.45607894.165@abcd;aba
              01#123.45607894.165@abcd;aba
              02#9871.001@fab:9782581afa
              02#9871.001@fab:9782581afa
              02#9871.001@fab:9782581afa xx
              03#00040 jhjhjdsadj2
              03#00040 jhjhjdsadj2
              03#00040 jhjhjdsadj2 ""
              04#journal
              04#journal… xxx
              05#journal
              05#journal… @
              06#the same
              06#the same 1234
              07#abcde
              07#abcde
              07#abcde 12345
              07#abcdexyz tuv
              08#fghij
              08#fghij 12345
              08#fghij xyz
              08#fghij xyz  tuv
              09#PQR
              09#PQR
              09#PQR 12345
              09#PQRxyz tuv
              10#Last TEST
              10#Last TEST
              10#Last TEST 12345
              10#Last TEST xyz
              10#Last TEST xyz     tuv
              10#Last TESTxyz ijk
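
In Python, the sort could be sketched as below. It assumes that plain code-point ordering, which is what `sorted` applies to strings, agrees with the N++ lexicographic sort for this kind of data; that holds for the sample above but is not checked for every input. The name `sort_lines` is hypothetical.

```python
def sort_lines(text: str) -> str:
    """Sort the non-empty lines and end the result with a line break."""
    lines = [ln for ln in text.splitlines() if ln]
    # A shorter line sorts before any longer line that starts with it,
    # so the bare repeated word comes first within each number.
    return "\n".join(sorted(lines)) + "\n"

print(sort_lines("07#abcde 12345\n07#abcde\n07#abcdexyz tuv\n07#abcde\n"), end="")
```

The trailing line break plays the role of the blank line added by hand after the N++ sort.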
              

              Finally, for each line number, we must keep only the last item. So:

              • For the last time, open the Replace dialog

              • SEARCH ^(?-s)(.+)\R(\1.*\R)+

              • REPLACE \2

              • Keep the same options, as above

              • Click, once, on the Replace All button

              This is almost the expected final text!

              01#123.45607894.165@abcd;aba
              02#9871.001@fab:9782581afa xx
              03#00040 jhjhjdsadj2 ""
              04#journal… xxx
              05#journal… @
              06#the same 1234
              07#abcdexyz tuv
              08#fghij xyz  tuv
              09#PQRxyz tuv
              10#Last TESTxyz ijk
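
The same reduction can be tried in Python with an equivalent pattern, where `\n` stands in for `\R` (again assuming Unix line endings). The sketch relies on two properties of the regex: group 1 always ends up holding a whole line, and a repeated group keeps only its last iteration. The names `KEEP_LAST` and `keep_last_item` are made up for the example.

```python
import re

# A line, followed by one or more lines that begin with that same line.
KEEP_LAST = re.compile(r"^(.+)\n(\1.*\n)+", re.MULTILINE)

def keep_last_item(sorted_text: str) -> str:
    """Collapse each run of lines sharing a leading item to its last line."""
    if not sorted_text.endswith("\n"):
        sorted_text += "\n"  # every line must end with a line break to match
    # \2 refers to the last line captured by the repeated group.
    return KEEP_LAST.sub(r"\2", sorted_text)

block = "06#the same\n06#the same 1234\n07#abcde\n07#abcde\n07#abcde 12345\n07#abcdexyz tuv\n"
print(keep_last_item(block), end="")
# prints:
# 06#the same 1234
# 07#abcdexyz tuv
```

A line with no repetition has no follower that starts with it, so it is left untouched.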
              

              To end, we just have to get rid of the numbering at the beginning of each line. No problem with this simple regex:

              • SEARCH (?-s)^.+#

              • REPLACE Leave EMPTY

              • Keep the same options, as above

              • Click, once, on the Replace All button


              Here we are! A bit of work, but a correct result, isn’t it?

              123.45607894.165@abcd;aba
              9871.001@fab:9782581afa xx
              00040 jhjhjdsadj2 ""
              journal… xxx
              journal… @
              the same 1234
              abcdexyz tuv
              fghij xyz  tuv
              PQRxyz tuv
              Last TESTxyz ijk
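
The last replacement translates almost directly to Python. The sketch below also shows why the separator must be a character that does not occur in the data: the greedy `.+` runs up to the last `#` on the line. The function name `strip_numbering` is hypothetical.

```python
import re

def strip_numbering(text: str) -> str:
    """Remove everything up to and including the last '#' on each line."""
    return re.sub(r"^.+#", "", text, flags=re.MULTILINE)

print(strip_numbering("06#the same 1234\n07#abcdexyz tuv\n"), end="")
# A '#' inside the data would be swallowed too: "01#x#y" becomes "y".
print(strip_numbering("01#x#y\n"), end="")
```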
              

              I just hope that the results will be correct with your real data, too ;-))

              See you later

              Cheers,

              guy038

              • Vivianjenylord @guy038

                @guy038
                guy038, I am very grateful to you. You are a great person for helping so selflessly and for taking the time to explain the subject so well. I was able to solve my problem with the text.
                Thank you

                The Community of users of the Notepad++ text editor.