    Delete duplicated expressions with keeping the first one

    • Ziad AboramiZ
      Ziad Aborami
      last edited by Ziad Aborami

      Hello,

      I have a content like this:
      …
      <some thing>
      Cat
      Explain
      <some thing>
      Dog
      Another explain
      <some thing>
      Dog
      Information
      <some thing>
      Bird
      Other explaing
      <some thing>
      Bird
      Different explain
      <some thing>
      Bird
      New explain
      …

      What I want to do
      Match the duplicated expressions in this pattern:

      <some thing>\R(.*?)$

      In this example:

      <some thing>
      Bird
      <some thing>
      Dog

      And then keep only the first expression of each group
      The result is:

      <some thing>
      Cat
      Explain
      <some thing>
      Dog
      Another explain

      Information
      <some thing>
      Bird
      Other explaing

      Different explain

      New explain


      This way I can merge the explanations that belong to the same item.
      …
      I have tried:
      FIND:
      <some thing>\R(.+?)$\s+?^(?=.*^\1$)

      REPLACE:
      (nothing)

      Problem 1: this matches every duplicate in each group except the last one,
      so after the replacement I got:

      <some thing>
      Cat
      Explain
      <some thing>

      <some thing>
      Dog
      Information

      Other explaing

      Different explain
      <some thing>
      Bird
      New explain
      …
      and I want something that matches every duplicate except the first one, to get the wanted result.
      Problem 2: when the content is complex, this search
      <some thing>\R(.+?)$\s+?^(?=.*^\1$)
      matches the whole content.

      I hope that there is a helpful way

      Thanks in advance

      • Ziad AboramiZ
        Ziad Aborami
        last edited by Ziad Aborami

        @Ziad-Aborami said:
        Correction: the result was:

        <some thing>
        Cat
        Explain

        Another explain “this belongs to dog”
        <some thing>
        Dog
        Information

        Other explaing “this belongs to bird”

        Different explain “this belongs to bird”
        <some thing>
        Bird
        New explain

        • Ziad AboramiZ
          Ziad Aborami
          last edited by Ziad Aborami

          I think that is not impossible.
          What I want is to keep only the first result of each group of duplicated expressions with this pattern
          <some thing>\R(.*?)$
          Help please

          • PeterJonesP
            PeterJones
            last edited by

            I think this matches your description.

            • Find = (?s)(<some thing>\R[^\r\n]*?)(.*?)\1(.*?)(?=^<some thing>|\Z)
              • this sets .-matches-newline, regardless of the checkbox
              • find <some thing> on one line followed by a single line (ie, the “Dog” or “Bird” line), save in group \1
              • save the next line(s) into group \2
              • must find group \1 a second time (ie, a second instance of <some thing> followed by the same Dog/Bird/whatever)
              • consume the next line(s)
              • stop consuming when it looks ahead and sees either a line beginning with <some thing>, or it reaches the end of the document
            • Replace = \1\2
              • take all of that match (except the “look ahead”) and replace with just groups 1 and 2 (the <some thing> and dog/bird/whatever, and the line(s) following that)
            • mode = regular expression

            -----
            FYI:

            This forum is formatted using Markdown, with a help link buried on the little grey ? in the COMPOSE window/pane when writing your post. For more about how to use Markdown in this forum, please see @Scott-Sumner’s post in the “how to markdown code on this forum” topic, and my updates near the end. It is very important that you use these formatting tips – using single backtick marks around small snippets, and using code-quoting for pasting multiple lines from your example data files – because otherwise, the forum will change normal quotes ("") to curly “smart” quotes (“”), will change hyphens to dashes, will sometimes hide asterisks (or if your text is c:\folder\*.txt, it will show up as c:\folder*.txt, missing the backslash). If you want to clearly communicate your text data to us, you need to properly format it.

            If you have further search-and-replace (“matching”, “marking”, “bookmarking”, regular expression, “regex”) needs, study this FAQ and the documentation it points to. Before asking a new regex question, understand that for future requests, many of us will expect you to show what data you have (exactly), what data you want (exactly), what regex you already tried (to show that you’re showing effort), why you thought that regex would work (to prove it wasn’t just something randomly typed), and what data you’re getting with an explanation of why that result is wrong. When you show that effort, you’ll see us bend over backward to get things working for you. If you need help formatting, see the paragraph above.

            Please note that for all regex and related queries, it is best if you are explicit about what needs to match, and what shouldn’t match, and have multiple examples of both in your example dataset. Often, what shouldn’t match helps define the regular expression as much or more than what should match.

            (yes, you did give examples, and you showed what you tried; I appreciate that. I include these because it contains useful links and additional information that you might not be aware of)

            • Ziad AboramiZ
              Ziad Aborami
              last edited by

              Thanks very much.

              That doesn’t work.
              The first problem is that it marks the whole text between one <some thing> and the next if the word after <some thing> is duplicated.
              The second problem is that it marks only the first result, together with its whole text, whereas I want to mark all the other results except the first one.

              What I want is to match only the word, not the whole explanation:
              <some thing>
              Bird
              Explain related to Bird
              <some thing>
              Bird

              Another explain related to Bird
              <some thing>
              Bird

              Informations about Bird
              <some thing>
              Cat
              Explain about cat
              <some thing>
              Tree
              Explain about tree
              <some thing>
              Tree

              Another explain related to tree
              <some thing>

              I want to delete the Items in Bold

              I have tried
              <some thing>\R(\w+?)$^(?=.*^\1$)

              But that matches all the duplicates except the last one in each group.
              What I want is to match all the duplicates except the first one in each group.

              • PeterJonesP
                PeterJones
                last edited by

                Sorry that I didn’t understand you wanted to keep the extra descriptions

                @Ziad-Aborami said:

                <some thing>\R(\w+?)$^(?=.*^\1$)

                Did you try playing around with my regular expression as the starting point, rather than starting from yours? Because you knew mine just matched too much, and I’d already told you what each of the sections of mine did. You could have gotten closer by just trying to remove sections of mine. Have you started reading the FAQ I pointed you to?

                By using that procedure, and then knowing that \K will “reset” the match, so it allows getting groups from earlier in the document, but only matching/replacing what comes after \K, I was able to get something that is as close to your description as I can come:

                • Find = (?s)(<some thing>\R[^\r\n]*?\R)(.*?)\K\1
                  • I included the final newline in the group \1
                  • the \K resets the match, so it’s only matching the second instance of the group \1 which follows
                  • from my original regex, I removed the line slurp after the \1, and removed the lookahead, because all you wanted to match was the duplicate of \1
                  • this will only match the duplicate of the \1
                • Replace = empty
                • Mode = regular expression
                • Might have to hit Replace All multiple times (that’s the way to take care of the three instances of the bird, for example)

                Note: when I tried single-stepping through using Find Next then Replace, it matched correctly, but didn’t replace with empty; when I hit replace all, it did replace with empty. I don’t know why, or whether that’s repeatable for others.
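
Outside Notepad++, the effect of one Replace All can be approximated with Python's `re` module. This is only a sketch under stated assumptions: `re` supports neither `\K` nor `\R`, so groups 1 and 2 are written back instead of resetting the match, `\n` stands in for `\R` (LF line endings assumed), and `one_pass` is a hypothetical helper name.

```python
import re

# Assumption: Python's re has no \K or \R. Groups 1 and 2 are re-inserted
# in place of \K, and "\n" replaces \R (LF line endings only).
DUP = re.compile(r"(?s)(<some thing>\n[^\r\n]*?\n)(.*?)\1")

def one_pass(text):
    """Emulate a single Replace All: drop the next repeat of each heading pair."""
    return DUP.sub(r"\1\2", text)

sample = "<some thing>\nBird\nA\n<some thing>\nBird\nB\n<some thing>\nBird\nC\n"
first = one_pass(sample)    # two Bird headings remain after this pass
second = one_pass(first)    # one Bird heading remains after this pass
```

Matches found in a single pass cannot overlap, so three copies of the same heading need two passes. That is consistent with having to hit Replace All more than once.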

                Starting with

                <some thing>
                Bird
                Explain related to Bird
                <some thing>
                Bird
                Another explain related to Bird
                <some thing>
                Bird
                Informations about Bird
                <some thing>
                Cat
                Explain about cat
                <some thing>
                Tree
                Explain about tree
                <some thing>
                Tree
                Another explain related to tree
                <some thing>
                

                The first Replace All will get it to

                <some thing>
                Bird
                Explain related to Bird
                Another explain related to Bird
                <some thing>
                Bird
                Informations about Bird
                <some thing>
                Cat
                Explain about cat
                <some thing>
                Tree
                Explain about tree
                Another explain related to tree
                <some thing>
                

                (where there are still two birds), and the second replace all will get it to

                <some thing>
                Bird
                Explain related to Bird
                Another explain related to Bird
                Informations about Bird
                <some thing>
                Cat
                Explain about cat
                <some thing>
                Tree
                Explain about tree
                Another explain related to tree
                <some thing>
                

                If you want extra newlines separating the second and third description, instead of an empty replace, use \r\n as the replace; after hitting replace all a few times, it got down to:

                <some thing>
                Bird
                Explain related to Bird
                
                Another explain related to Bird
                
                Informations about Bird
                <some thing>
                Cat
                Explain about cat
                <some thing>
                Tree
                Explain about tree
                
                Another explain related to tree
                <some thing>
                

                If this is closer, but still not quite right, you’ll have to try to modify starting from my regex, to add or subtract blocks that you think are still not properly matched. Then if you cannot get it to work, you’ll have to explain (like I did above) what you think each phrase in the modified regex is doing, and explain why the results aren’t right, giving examples in markdown formatting (to do exact quotes, rather than letting the forum interpret some characters we may be missing; my FYI above linked you to where you would find out how I’m making the exact-text blocks), so that we know exactly what you have, and what you want, and what you’re getting, and you need to explain in detail why what you’re getting is wrong.

                If what I have provided isn’t enough, you will need to give more data examples of what you want, and what you don’t want. Also, rather than starting from scratch, or from your previous attempts, please try to extend or edit my regex (which I’ve explained on a phrase-by-phrase basis), and include an explanation of what you think each of your changed phrases will do (and why you think that phrase will help). When you start from yours, which didn’t seem to come close for you, rather than from mine, which came close but deleted too much, and when you throw out a regex without any explanation of what you think that exact regex will do, it tells me you’re ignoring everything I’ve said, and that you haven’t tried to learn from my explanations; this makes me more reluctant to keep helping.

                • guy038G
                  guy038
                  last edited by guy038

                  Hello, @ziad-aborami, @peterjones and All,

                  I have thought of a possible solution, which uses only one regex S/R that you must repeat until the S/R process tells you that no more occurrences were found ;-))

                  Assuming that the <some thing> expression does represent the literal expression <some thing>,

                  Let’s start with your initial text, below :

                  <some thing>
                  Bird
                  Explain related to Bird
                  <some thing>
                  Bird
                  Another explain related to Bird
                  <some thing>
                  Bird
                  Informations about Bird
                  <some thing>
                  Cat
                  Explain about cat
                  <some thing>
                  Tree
                  Explain about tree
                  <some thing>
                  Tree
                  Another explain related to tree
                  

                  Then :

                  • Open the Replace dialog ( Ctrl + H )

                  • SEARCH (((?-s)^<some thing>\R.+\R)(?s).+?)\2

                  • REPLACE \1

                  • Check the Wrap around option and/or move to the very beginning of your file

                  • Select the Regular expression search mode

                  • Click on the Replace All button, repeatedly, till the message Replace All: 0 occurrences were replaced occurs, at bottom of the dialog

                  You should obtain your expected text :

                  <some thing>
                  Bird
                  Explain related to Bird
                  Another explain related to Bird
                  Informations about Bird
                  <some thing>
                  Cat
                  Explain about cat
                  <some thing>
                  Tree
                  Explain about tree
                  Another explain related to tree
                  

                  Note :

                  • You may instead click the Replace button until no more occurrences are found. Of course, in that case, more clicks are needed than with the Replace All button
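
The "repeat until 0 occurrences" procedure can also be sketched as a loop in plain Python. This is an approximation, not the Notepad++ (Boost) regex itself: `\n` stands in for `\R` (LF line endings assumed), the scoped `(?s:...)` group replaces the `(?-s)`/`(?s)` switches, and `merge_duplicates` is a hypothetical helper name.

```python
import re

# Assumption: "\n" replaces \R; (?s:...) lets only the lazy middle part span lines.
PATTERN = re.compile(r"((^<some thing>\n.+\n)(?s:.+?))\2", re.MULTILINE)

def merge_duplicates(text):
    """Repeat the substitution until a pass replaces nothing."""
    while True:
        text, count = PATTERN.subn(r"\1", text)
        if count == 0:
            return text

sample = (
    "<some thing>\nBird\nExplain related to Bird\n"
    "<some thing>\nBird\nAnother explain related to Bird\n"
    "<some thing>\nBird\nInformations about Bird\n"
    "<some thing>\nCat\nExplain about cat\n"
)
merged = merge_duplicates(sample)
```

The loop stops on the first pass that replaces nothing, which corresponds to the "0 occurrences were replaced" message in the dialog.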

                  If this syntax is OK for you, I could give you some explanations of this regex S/R, next time !

                  Best Regards,

                  guy038

                  P.S. :

                  Sorry, Peter, I haven’t got time to read your answer, yet, ( almost simultaneous ! )

                  • Ziad AboramiZ
                    Ziad Aborami
                    last edited by Ziad Aborami

                    @PeterJones
                    Thanks very much. That works in this example, but when I used it on a large document with more than 100,000 lines, it deleted the whole content.
                    I didn’t understand why this happens!
                    (?s)(<some thing>\R[^\r\n]?\R)(.?)\K\1
                    In the simple example this matches the wanted items, but in my text of thousands of lines it matched the whole content.

                    • Ziad AboramiZ
                      Ziad Aborami
                      last edited by

                      @guy038
                      Thanks very much. That does work in this example and in short texts, but in my text with more than 100,000 lines it shrinks the text and deletes something else.
                      I tried this with a shorter section of my text and it works, but I don’t understand why it didn’t work on the whole long text!

                      • PeterJonesP
                        PeterJones
                        last edited by

                        @Ziad-Aborami ,

                        @guy038’s solution and mine are quite similar, and they both should be non-greedy (both use the ? modifier). However, we have noticed in the past that very long files sometimes mess up the regex engine. Guy probably remembers better than I what the details of the length-limitation are.

                        As a formatting side note: did you notice in the preview window, when you wrote the regex, that the * character disappeared, and some of the string became italic? That’s because you aren’t properly using markdown, as was explained above, to format raw strings. It’s best to put regex in between `` , so all the regex characters embed properly.

                        • Your regex (?s)(<some thing>\R[^\r\n]*?\R)(.*?)\K\1 rendered incorrectly as (?s)(<some thing>\R[^\r\n]?\R)(.?)\K\1
                        • With `` notation, `(?s)(<some thing>\R[^\r\n]*?\R)(.*?)\K\1` renders correctly as (?s)(<some thing>\R[^\r\n]*?\R)(.*?)\K\1
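
The disappearing asterisks can be reproduced with a very rough model of what Markdown does with a pair of `*` characters. The sketch below is only an approximation (a real Markdown renderer applies many more rules), and `strip_emphasis` is a hypothetical helper name.

```python
import re

def strip_emphasis(raw):
    """Crude stand-in for Markdown emphasis: drop each pair of '*', keep the text between."""
    return re.sub(r"\*(.+?)\*", r"\1", raw)

posted = r"(?s)(<some thing>\R[^\r\n]*?\R)(.*?)\K\1"
rendered = strip_emphasis(posted)
print(rendered)  # prints (?s)(<some thing>\R[^\r\n]?\R)(.?)\K\1
```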

                        @guy038,

                        No problem on the similar post: I like that yours avoids the \K-reset.

                        Also, please chime in on whether my memory is right on long files messing up regex, especially if you have links to the previous conversation(s), or know where the length-limit hits.

                        • guy038G
                          guy038
                          last edited by guy038

                          Hi, @ziad-aborami,

                          If your file is not personal nor confidential, and if you don’t mind, you could send me your file, by e-mail, to :

                          Maybe a specific configuration occurs in some parts of your file and breaks the regex logic! Be confident: there is probably an appropriate regex to get the job done on your entire document ;-))

                          See you later,

                          Cheers,

                          guy038

                          • Ziad AboramiZ
                            Ziad Aborami
                            last edited by

                            @PeterJones
                            Thanks for the note. I will use the code notation next time, and thanks for the help.

                            @guy038
                            I will send you the file by e-mail, and I hope that you can find a solution for a large txt file. Thanks in advance.

                            • Alan KilbornA
                              Alan Kilborn
                              last edited by

                              Maybe I’m a rogue but I would hit it with Python or Pythonscript:

                              new_content = []
                              flag = 0
                              d_dict = {}
                              for line in editor.getCharacterPointer().splitlines():
                                  line = line.rstrip()
                                  if flag == 1:
                                      if not line in d_dict:
                                          d_dict[line] = 1
                                          new_content.append('<some thing>')
                                          new_content.append(line)
                                      flag = 0
                                  else:
                                      if line == '<some thing>':
                                          flag = 1
                                      else:
                                          new_content.append(line)
                              editor.beginUndoAction()
                              editor.setText('\r\n'.join(new_content))
                              editor.endUndoAction()
                              
                              • Ziad AboramiZ
                                Ziad Aborami
                                last edited by Ziad Aborami

                                @Alan-Kilborn

                                Thanks for your suggestion

                                I got an error while running the script (Win10):

                                File "D:\script.py", line 4, in <module>
                                for line in editor.getCharacterPointer().splitlines():
                                NameError: name 'editor' is not defined

                                What is wrong?

                                • guy038G
                                  guy038
                                  last edited by guy038

                                  Hello, @ziad-aborami,

                                  EDIT on 03/02/19 :
                                  I slightly modified the parts A , B and C, in order to respect the <tr class="content">........</tr> multi-lines blocks structure !


                                  I did get your e-mail and downloaded your news.txt file of 350,578 lines. This dictionary file contains 38,783 entries, and 8,117 of these entries have duplicates, on which a global S/R must occur!

                                  Now, I’m going to give you 3 parts of your file which contain duplicate entries; just tell me the final text that you expect for each of the 3 examples!

                                  Part A, regarding the Abaddon entry, in 3 copies :

                                  <tr class="content">
                                  </>
                                  Abaddon
                                  <div style="margin-left:1em"><span style="color:darkblue"><b>Abaddon - <i>(der)</i></b></span>
                                  Herkunft: hebr.<br><br>Beschreibung:<br>Substantiv<br><br></span>
                                  Abaddon - <i>(der)</i></a>
                                  </td>
                                  <td class="content" width="50%" valign="top">Abgrund</td>
                                  </tr>
                                  <tr class="content">
                                  </>
                                  Abaddon
                                  <div style="margin-left:1em"><span style="color:darkblue"><b>Abaddon - <i>(der)</i></b></span>
                                  Herkunft: hebr.<br><br>Beschreibung:<br>Substantiv<br><br></span>
                                  Abaddon - <i>(der)</i></a>
                                  </td>
                                  <td class="content" width="50%" valign="top">Unterwelt</td>
                                  </tr>
                                  <tr class="content">
                                  </>
                                  Abaddon
                                  <div style="margin-left:1em"><span style="color:darkblue"><b>Abaddon - <i>(der)</i></b></span>
                                  Herkunft: hebr.<br><br>Beschreibung:<br>Substantiv<br><br></span>
                                  Abaddon - <i>(der)</i></a>
                                  </td>
                                  <td class="content" width="50%" valign="top">jüd. Todesengel</td>
                                  </tr>
                                  

                                  Part B, regarding the abaissieren entry, in 2 copies :

                                  <tr class="content">
                                  </>
                                  abaissieren
                                  <div style="margin-left:1em"><span style="color:darkblue"><b>abaissieren</b></span>
                                  Herkunft: franz.<br><br>Beschreibung:<br>Verb<br><br></span>
                                  abaissieren</a>
                                  </td>
                                  <td class="content" width="50%" valign="top">demütigen</td>
                                  </tr>
                                  <tr class="content">
                                  </>
                                  abaissieren
                                  <div style="margin-left:1em"><span style="color:darkblue"><b>abaissieren</b></span>
                                  Herkunft: franz.<br><br>Beschreibung:<br>Verb<br><br></span>
                                  abaissieren</a>
                                  </td>
                                  <td class="content" width="50%" valign="top">erniedrigen</td>
                                  </tr>
                                  

                                  Part C, regarding the abgeschmackt entry, in 5 copies :

                                  <tr class="content">
                                  </>
                                  abgeschmackt
                                  <div style="margin-left:1em"><span style="color:darkblue"><b>abgeschmackt</b></span> <i class="p" style="color:green"><n></i>
                                  <span><b>banal</b><br>
                                  Herkunft: franz.<br><br>Beschreibung:<br>Adjektiv<br><br>Synonym:<br>abgedroschen, einfach, alltäglich, fade, gewöhnlich</span>
                                  banal</a>
                                  </td>
                                  </tr>
                                  <tr class="content">
                                  </>
                                  abgeschmackt
                                  <div style="margin-left:1em"><span style="color:darkblue"><b>abgeschmackt</b></span> <i class="p" style="color:green"><n></i>
                                  <span><b>fad(e)</b><br>
                                  Herkunft: franz.<br><br>Beschreibung:<br>Adjektiv<br><br>Synonym:<br>abgedroschen, abgegriffen, langweilig, läppisch, ohne Geschmack</span>
                                  fad(e)</a>
                                  </td>
                                  </tr>
                                  <tr class="content">
                                  </>
                                  abgeschmackt
                                  <div style="margin-left:1em"><span style="color:darkblue"><b>abgeschmackt</b></span> <i class="p" style="color:green"><n></i>
                                  <span><b>insipid</b><br>
                                  Herkunft: lat.<br><br>Beschreibung:<br>Adjektiv<br></span>
                                  insipid</a>
                                  </td>
                                  </tr>
                                  <tr class="content">
                                  </>
                                  abgeschmackt
                                  <div style="margin-left:1em"><span style="color:darkblue"><b>abgeschmackt</b></span> <i class="p" style="color:green"><n></i>
                                  <span><b>insipid(e)</b><br>
                                  Herkunft: lat.<br><br>Beschreibung:<br>Adjektiv<br><br>Synonym:<br>albern, schal, töricht, uninteressant</span>
                                  insipid(e)</a>
                                  </td>
                                  </tr>
                                  <tr class="content">
                                  </>
                                  abgeschmackt
                                  <div style="margin-left:1em"><span style="color:darkblue"><b>abgeschmackt</b></span> <i class="p" style="color:green"><n></i>
                                  <span><b>maussade</b><br>
                                  Herkunft: franz.<br><br>Beschreibung:<br>Adjektiv<br><br>Synonym:<br>schal, mürrisch</span>
                                  maussade</a>
                                  </td>
                                  </tr>
                                  

                                  In your answer, simply place the expected multi-lines text of each part A, B and C, between the two lines :

                                  ~~~html

                                  ~~~

                                  I suppose that, with these 3 expected blocks of text, it’ll be enough to determine the right regex ;-))

                                  Cheers,

                                  guy038

                                  • PeterJonesP
                                    PeterJones
                                    last edited by

                                    @Ziad-Aborami said:

                                    NameError: name 'editor' is not defined

                                    That happens because the editor object was not imported into the Python script.

                                    You can add the line

                                    from Npp import *
                                    

                                    to the top of the script you got from @Alan-Kilborn. This will define that object for this one script.

                                    Or, if you want it available to all scripts, you can go to Plugins > Python Script > Scripts, then Ctrl+Click on startup, and add that line just after the import sys. I also recommend (though it’s not necessary for this) that you go to Plugins > Python Script > Configuration… and change the “Initialisation” setting from LAZY to ATSTARTUP. Then you can save/close startup.py, exit Notepad++ and reload. From then on, PythonScript scripts will always have the editor/editor1/editor2/notepad objects available; the ATSTARTUP setting makes the PythonScript initialization happen right away, which is useful if you’ve got Python scripts adding callback hooks or similar. (This is the setup I use.)

                                    • Ziad AboramiZ
                                      Ziad Aborami
                                      last edited by

                                      @PeterJones
                                      Thanks for the help.
                                      Now I get:

                                      ModuleNotFoundError: No module named 'Npp'

                                      I added from Npp import * as the first line.

                                      • Ziad AboramiZ
                                        Ziad Aborami
                                        last edited by Ziad Aborami

                                        @guy038
                                        Thanks again!
                                        I appreciate your help so much!

                                        • PeterJonesP
                                          PeterJones
                                          last edited by PeterJones

                                          @Ziad-Aborami said:

                                          ModuleNotFoundError: No module named 'Npp'

                                          Out of curiosity: are you using the PythonScript plugin in Notepad++, or are you using a normal Python interpreter, running it with python.exe, python2.exe, or python3.exe? The script has to be run by the PythonScript plugin in order for this to work.

                                          Also, where did you save the script that @Alan-Kilborn provided? The right way is to go to Plugins > Python Script > New Script, then give it a name (such as YourNameHere.py) in the directory that it defaults to. You then run the script using the Plugins > Python Script > Scripts > YourNameHere. For example, here’s a Hello World

                                          from Npp import *
                                          console.show()
                                          console.write("Hello World")
                                          

                                          I created the file as helloWorld.py using the above sequence, pasted in that text, then ran Plugins > Python Script > Scripts > helloWorld. This displayed Hello World in the PythonScript console sub-window in Notepad++.
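
For anyone who would rather run a standalone Python interpreter, here is a sketch of the same idea as the script posted earlier in this topic, rewritten to work on a plain string so that it does not need the `editor` object. The function name `drop_repeated_headers` and its `marker` parameter are hypothetical, and a set replaces the dictionary; how the text is read from and written back to a file is left out.

```python
def drop_repeated_headers(text, marker="<some thing>"):
    """Keep the first marker+name pair per name; drop later repeats, keep other lines."""
    seen = set()          # names that already have a heading in the output
    out = []
    expect_name = False   # True when the previous line was the marker
    for line in text.splitlines():
        line = line.rstrip()
        if expect_name:
            if line not in seen:
                seen.add(line)
                out.append(marker)
                out.append(line)
            expect_name = False
        elif line == marker:
            expect_name = True
        else:
            out.append(line)
    return "\n".join(out)

sample = (
    "<some thing>\nBird\nExplain related to Bird\n"
    "<some thing>\nBird\nAnother explain related to Bird\n"
    "<some thing>\nCat\nExplain about cat\n"
)
result = drop_repeated_headers(sample)
```

This makes a single pass over the lines, so it avoids the repeated Replace All. Like the original script, it drops a marker line that has no name line after it.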

                                          • Alan KilbornA
                                            Alan Kilborn
                                            last edited by

                                            Regarding the import error with the Pythonscript presented: my bad, but I refuse to be bothered with putting those standard imports at the top of every script I ever do (thus I forget about it when posting scripts). I make sure (one time) they are in my startup.py and then I forget about it. Maybe the Pythonscript implementers should consider adding those imports at the top of a file when the New Script option is chosen from the Pythonscript menus.

                                            The Community of users of the Notepad++ text editor.