Community

    Delete duplicated expressions with keeping the first one

    • PeterJones
      PeterJones last edited by

      @Ziad-Aborami ,

      @guy038’s solution and mine are quite similar, and they both should be non-greedy (both use the ? modifier). However, we have noticed in the past that very long files sometimes mess up the regex engine. Guy probably remembers better than I what the details of the length-limitation are.

      As a formatting side note: did you notice in the preview window, when you wrote the regex, that the * character disappeared, and some of the string became italic? That’s because you aren’t properly using markdown, as was explained above, to format raw strings. It’s best to put regex in between `` , so all the regex characters embed properly.

      • Your regex (?s)(<some thing>\R[^\r\n]*?\R)(.*?)\K\1 rendered incorrectly as (?s)(<some thing>\R[^\r\n]?\R)(.?)\K\1
      • With `` notation, `(?s)(<some thing>\R[^\r\n]*?\R)(.*?)\K\1` renders correctly as (?s)(<some thing>\R[^\r\n]*?\R)(.*?)\K\1

      @guy038,

      No problem on the similar post: I like that yours avoids the \K-reset.

      Also, please chime in on whether my memory is right on long files messing up regex, especially if you have links to the previous conversation(s), or know where the length-limit hits.

      • guy038
        guy038 last edited by guy038

        Hi, @ziad-aborami,

        If your file is not personal nor confidential, and if you don’t mind, you could send me your file, by e-mail, to :

        Maybe a specific configuration occurs in some parts of your file and breaks the regex logic ! Rest assured, there is probably an appropriate regex to get the job done on your entire document ;-))

        See you later,

        Cheers,

        guy038

        • Ziad Aborami
          Ziad Aborami last edited by

          @PeterJones
          Thanks for the note. I will use the backtick notation next time, and thanks for the help.

          @guy038
          I will send you the file by e-mail. I hope you can find a solution that works on a large text file. Thanks in advance.

          • Alan Kilborn
            Alan Kilborn last edited by

            Maybe I’m a rogue but I would hit it with Python or Pythonscript:

            new_content = []
            flag = 0      # 1 when the previous line was the '<some thing>' marker
            d_dict = {}   # lines already seen right after a marker
            for line in editor.getCharacterPointer().splitlines():
                line = line.rstrip()
                if flag == 1:
                    if line not in d_dict:  # first occurrence: keep marker + line
                        d_dict[line] = 1
                        new_content.append('<some thing>')
                        new_content.append(line)
                    flag = 0
                else:
                    if line == '<some thing>':
                        flag = 1  # hold the marker back until the next line is checked
                    else:
                        new_content.append(line)
            editor.beginUndoAction()
            editor.setText('\r\n'.join(new_content))
            editor.endUndoAction()
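
The same filtering logic can be tried outside Notepad++ on a plain list of lines. This is only a sketch: `drop_repeated_pairs` and its `marker` parameter are hypothetical names, not part of PythonScript, and reading or writing the actual file is left out.

```python
def drop_repeated_pairs(lines, marker='<some thing>'):
    """Keep the first marker+line pair for each distinct line; drop later repeats."""
    seen = set()            # lines already seen right after a marker
    kept = []
    pending_marker = False  # True when the previous line was the marker
    for raw in lines:
        line = raw.rstrip()
        if pending_marker:
            if line not in seen:
                seen.add(line)
                kept.append(marker)
                kept.append(line)
            pending_marker = False
        elif line == marker:
            pending_marker = True
        else:
            kept.append(line)
    return kept


sample = ['<some thing>', 'alpha', 'x', '<some thing>', 'alpha', 'y']
print(drop_repeated_pairs(sample))
```

As in the script above, a marker is only written back once the line after it turns out to be new, so a repeated pair disappears completely while the unrelated lines around it are kept.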
            
            • Ziad Aborami
              Ziad Aborami last edited by Ziad Aborami

              @Alan-Kilborn

              Thanks for your suggestion

              I got an error while running the script: (win10)

              File "D:\script.py", line 4, in <module>
              for line in editor.getCharacterPointer().splitlines():
              NameError: name 'editor' is not defined

              What is wrong?

              • guy038
                guy038 last edited by guy038

                Hello, @ziad-aborami,

                EDIT on 03/02/19 :
                I slightly modified parts A, B and C, in order to respect the <tr class="content">........</tr> multi-line block structure !


                I did get your e-mail and downloaded your news.txt file of 350,578 lines. This dictionary file contains 38,783 entries, and 8,117 of these entries have duplicates, on which a global S/R must occur !

                Now, I'm going to give you 3 parts of your file, which contain duplicate entries. Just tell me the final text that you expect for each of the 3 examples !

                Part A, regarding the Abaddon entry, in 3 copies :

                <tr class="content">
                </>
                Abaddon
                <div style="margin-left:1em"><span style="color:darkblue"><b>Abaddon - <i>(der)</i></b></span>
                Herkunft: hebr.<br><br>Beschreibung:<br>Substantiv<br><br></span>
                Abaddon - <i>(der)</i></a>
                </td>
                <td class="content" width="50%" valign="top">Abgrund</td>
                </tr>
                <tr class="content">
                </>
                Abaddon
                <div style="margin-left:1em"><span style="color:darkblue"><b>Abaddon - <i>(der)</i></b></span>
                Herkunft: hebr.<br><br>Beschreibung:<br>Substantiv<br><br></span>
                Abaddon - <i>(der)</i></a>
                </td>
                <td class="content" width="50%" valign="top">Unterwelt</td>
                </tr>
                <tr class="content">
                </>
                Abaddon
                <div style="margin-left:1em"><span style="color:darkblue"><b>Abaddon - <i>(der)</i></b></span>
                Herkunft: hebr.<br><br>Beschreibung:<br>Substantiv<br><br></span>
                Abaddon - <i>(der)</i></a>
                </td>
                <td class="content" width="50%" valign="top">jüd. Todesengel</td>
                </tr>
                

                Part B, regarding the abaissieren entry, in 2 copies :

                <tr class="content">
                </>
                abaissieren
                <div style="margin-left:1em"><span style="color:darkblue"><b>abaissieren</b></span>
                Herkunft: franz.<br><br>Beschreibung:<br>Verb<br><br></span>
                abaissieren</a>
                </td>
                <td class="content" width="50%" valign="top">demütigen</td>
                </tr>
                <tr class="content">
                </>
                abaissieren
                <div style="margin-left:1em"><span style="color:darkblue"><b>abaissieren</b></span>
                Herkunft: franz.<br><br>Beschreibung:<br>Verb<br><br></span>
                abaissieren</a>
                </td>
                <td class="content" width="50%" valign="top">erniedrigen</td>
                </tr>
                

                Part C, regarding the abgeschmackt entry, in 5 copies :

                <tr class="content">
                </>
                abgeschmackt
                <div style="margin-left:1em"><span style="color:darkblue"><b>abgeschmackt</b></span> <i class="p" style="color:green"><n></i>
                <span><b>banal</b><br>
                Herkunft: franz.<br><br>Beschreibung:<br>Adjektiv<br><br>Synonym:<br>abgedroschen, einfach, alltäglich, fade, gewöhnlich</span>
                banal</a>
                </td>
                </tr>
                <tr class="content">
                </>
                abgeschmackt
                <div style="margin-left:1em"><span style="color:darkblue"><b>abgeschmackt</b></span> <i class="p" style="color:green"><n></i>
                <span><b>fad(e)</b><br>
                Herkunft: franz.<br><br>Beschreibung:<br>Adjektiv<br><br>Synonym:<br>abgedroschen, abgegriffen, langweilig, läppisch, ohne Geschmack</span>
                fad(e)</a>
                </td>
                </tr>
                <tr class="content">
                </>
                abgeschmackt
                <div style="margin-left:1em"><span style="color:darkblue"><b>abgeschmackt</b></span> <i class="p" style="color:green"><n></i>
                <span><b>insipid</b><br>
                Herkunft: lat.<br><br>Beschreibung:<br>Adjektiv<br></span>
                insipid</a>
                </td>
                </tr>
                <tr class="content">
                </>
                abgeschmackt
                <div style="margin-left:1em"><span style="color:darkblue"><b>abgeschmackt</b></span> <i class="p" style="color:green"><n></i>
                <span><b>insipid(e)</b><br>
                Herkunft: lat.<br><br>Beschreibung:<br>Adjektiv<br><br>Synonym:<br>albern, schal, töricht, uninteressant</span>
                insipid(e)</a>
                </td>
                </tr>
                <tr class="content">
                </>
                abgeschmackt
                <div style="margin-left:1em"><span style="color:darkblue"><b>abgeschmackt</b></span> <i class="p" style="color:green"><n></i>
                <span><b>maussade</b><br>
                Herkunft: franz.<br><br>Beschreibung:<br>Adjektiv<br><br>Synonym:<br>schal, mürrisch</span>
                maussade</a>
                </td>
                </tr>
                

                In your answer, simply place the expected multi-line text of each part A, B and C, between the two lines :

                ~~~html

                ~~~

                I suppose that, with these 3 expected blocks of text, it’ll be enough to determine the right regex ;-))

                Cheers,

                guy038

                • PeterJones
                  PeterJones last edited by

                  @Ziad-Aborami said:

                  NameError: name 'editor' is not defined

                  That happens because the editor object was not imported into the Python script.

                  You can add the line

                  from Npp import *
                  

                  to the top of the script you got from @Alan-Kilborn. This will define that object for this one script.

                  Or, if you want it available to all scripts, go to Plugins > Python Script > Scripts, Ctrl+Click on startup, and add that line just after the import sys line. I also recommend (though it is not necessary for this) going to Plugins > Python Script > Configuration… and changing the “Initialisation” setting from LAZY to ATSTARTUP. Then save and close startup.py, exit Notepad++, and start it again. From then on, PythonScript scripts will always have the editor/editor1/editor2/notepad objects available. ATSTARTUP makes the PythonScript initialization happen right away, which is useful if you have scripts that add callback hooks or similar. (This is the setup I use.)

                  • Ziad Aborami
                    Ziad Aborami last edited by

                    @PeterJones
                    Thanks for help.
                    I have got:

                    ModuleNotFoundError: No module named 'Npp'

                    I added from Npp import * as the first line.

                    • Ziad Aborami
                      Ziad Aborami last edited by Ziad Aborami

                      @guy038
                      Thanks again!
                      I appreciate your help so much!

                      • PeterJones
                        PeterJones last edited by PeterJones

                        @Ziad-Aborami said:

                        ModuleNotFoundError: No module named 'Npp'

                        Out of curiosity: are you using the PythonScript plugin in Notepad++, or are you using a normal Python interpreter, running the script with python.exe, python2.exe, or python3.exe? The script needs to run inside the PythonScript plugin for this to work.

                        Also, where did you save the script that @Alan-Kilborn provided? The right way is to go to Plugins > Python Script > New Script, then give it a name (such as YourNameHere.py) in the directory that it defaults to. You then run the script using Plugins > Python Script > Scripts > YourNameHere. For example, here's a Hello World:

                        from Npp import *
                        console.show()
                        console.write("Hello World")
                        

                        I created the file as helloWorld.py using the above sequence, pasted in that text, then ran Plugins > Python Script > Scripts > helloWorld. This displayed Hello World in the PythonScript console sub-window in Notepad++.
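
A script can also check for itself whether it is running inside the plugin. The following is only a sketch: `load_editor` is a hypothetical helper, and the one assumption taken from this thread is that the Npp module is importable only when the script runs under PythonScript.

```python
def load_editor():
    """Return PythonScript's editor object, or None outside Notepad++."""
    try:
        # Npp is provided by the PythonScript plugin, not by a standalone Python
        from Npp import editor
    except ImportError:
        return None
    return editor


if load_editor() is None:
    print("Npp is not available: run this from Plugins > Python Script > Scripts")
```

Run with python.exe, this prints the hint instead of failing with ModuleNotFoundError, which is a subclass of ImportError.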

                        • Alan Kilborn
                          Alan Kilborn last edited by

                          Regarding the import error with the Pythonscript presented. My bad, but I refuse to be bothered with putting those standard imports at the top of every script I ever do (thus I forget about it when posting scripts). I make sure (one time) they are in my startup.py and then I forget about it. Maybe the Pythonscript implementers should consider adding those imports at the top of a file when the New Script option is chosen from the Pythonscript menus.

                          • Ziad Aborami
                            Ziad Aborami last edited by

                            @Alan-Kilborn @PeterJones
                            Thanks so much, the python method works well!

                            @PeterJones
                            I was running the script from cmd with python.exe.
                            I created a new script and saved it in the Notepad++ scripts folder.
                            When I ran it from the Scripts menu in Notepad++, it finished in 2 seconds without lagging.
                            Finally, the problem is solved!

                            • Ekopalypse
                              Ekopalypse last edited by

                              Importing from Npp is only needed if the current file isn't the main script being executed,
                              for example if you create a module or package.

                              • guy038
                                guy038 last edited by guy038

                                Hi, @ziad-aborami, and All,

                                While waiting for your post, I noticed some errors in your file ! Not a lot : only 7 :-))

                                Here are the corrections to be made to the new.txt file that you attached to your e-mail. These modifications are listed in reverse order, so that the line numbering does not change during the manual modifications !

                                • First this simple change to respect the <tr class="content">........</tr> multi-lines block, for each entry

                                  • Add a line <tr class="content">, before the first line of your file

                                  • Save your file, which is, now, 350,579 lines long

                                • Then, here are the 7 modifications to do :

                                  • Add the line <tr class="content">, BEFORE the line 346,171 ( </> )

                                  • Change the entire line 346,149 ( <tr class>"content"> ) into the line <tr class="content">

                                  • Change the entire line 343,625 ( </> ) into the line <tr class="content">

                                  • Add the line </tr>, AFTER the line 250,404 ( <td class="content" width="50%" valign="top">Herabsetzung</td> )

                                  • Add the two lines <tr class="content"> and </>, BEFORE the line 250,399 ( Rabaisssement )

                                  • Add the line <tr class="content">, BEFORE the line 48,800 ( </> )

                                  • Add the line </tr>, AFTER the line 48,799 ( </td> )

                                Your file should be, now, 350,585 lines long

                                If you count the number of its <tr class="content">........</tr> multi-lines blocks, with the regex, below :

                                SEARCH (?-s)^<tr class="content">\R</>\R(?s).+?</tr>

                                You should get 38784 blocks or entries, in your German-German dictionary !
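
The same count can be cross-checked outside Notepad++. This is a rough sketch rather than an exact port: Python's re module has no \R, so the text is normalized to "\n" line endings first, and re.MULTILINE / re.DOTALL stand in for the Notepad++ line anchoring and the (?s) switch. `count_blocks` is a hypothetical helper name.

```python
import re

# Assumed translation of the Notepad++ regex: ^ anchors at each line start,
# .+? may run across lines, and "\n" replaces \R.
BLOCK = re.compile(r'^<tr class="content">\n</>\n.+?</tr>', re.MULTILINE | re.DOTALL)


def count_blocks(text):
    """Count <tr class="content"> ... </tr> blocks that start with a </> line."""
    normalized = text.replace("\r\n", "\n")
    return len(BLOCK.findall(normalized))


sample = '<tr class="content">\n</>\nAbaddon\n<div>Abaddon</div>\n</td>\n</tr>\n'
print(count_blocks(sample * 3))
```

A block whose opening line is not followed by a </> line is not counted, which is the kind of malformed block the manual corrections above are meant to repair.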


                                Although I haven’t got any clue about the exact text that you would like to keep, let me give it a blind try !

                                I’m starting with the file being modified, as above !

                                Now :

                                • Open the Replace dialog ( Ctrl + H )

                                • SEARCH (?i-s)^(<tr class="content">\R</>\R(.+)\R(?:.+\R){4,70}?)</tr>\R<tr class="content">\R</>\R\2\R

                                • REPLACE \1</>\r\n

                                • Check the Wrap around option and/or move to the very beginning of your file

                                • Select the Regular expression search mode

                                • Click on the Replace All button, repeatedly, till the message Replace All: 0 occurrences were replaced occurs, at bottom of the dialog

                                Hence, the results :

                                After a 1ST click on the "Replace All" button  =>  10,583 replacements done, in 12s
                                After a 2ND click on the "Replace All" button  =>   4,119 replacements done, in 12s
                                After a 3RD click on the "Replace All" button  =>     912 replacements done, in 11s
                                

                                But the 4th replacement was incorrect ! It reported 2 replacements but, actually, only the first was OK, as the second wrongly selected all the file contents :-((. So I went back one step with Ctrl + Z

                                Then I marked all the remaining dictionary entries and placed them in a new tab. After detecting the duplicate entries and deleting the others, 20 entries remained, in 2 or 3 copies, listed below :

                                Abischnitt			Abischnitt
                                Erneuerung			Erneuerung
                                großartig			großartig
                                hervorragend		hervorragend
                                Muster				Muster
                                Verbindung			Verbindung
                                vornehm				vornehm
                                Vorrang				Vorrang
                                Wiederherstellung	Wiederherstellung
                                Überbleibsel		Überbleibsel
                                Übereinstimmung		Übereinstimmung			Übereinstimmung
                                Überlegenheit		Überlegenheit
                                Überlegung			Überlegung
                                übereinstimmen		übereinstimmen			übereinstimmen
                                übereinstimmend		übereinstimmend			übereinstimmend
                                überlegen			überlegen				überlegen
                                übersinnlich		übersinnlich	
                                überspannt			überspannt
                                übertrieben			übertrieben
                                üppig				üppig
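
A list like the one above could also be produced with a short script. This is a minimal sketch with assumptions that are not from the thread: "\n" line endings, an entry name defined as the line right after the <tr class="content"> and </> lines, and a case-sensitive comparison. `repeated_entries` is a hypothetical helper.

```python
import re
from collections import Counter

# Captures the entry name that follows the two header lines of a block.
ENTRY = re.compile(r'^<tr class="content">\n</>\n(.+)$', re.MULTILINE)


def repeated_entries(text):
    """Map each entry name that heads more than one block to its block count."""
    counts = Counter(ENTRY.findall(text))
    return {name: n for name, n in counts.items() if n > 1}
```

Unlike the search regex, this also reports copies that are not adjacent in the file.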
                                

                                Finally, for each of these 20 entries, I simply changed the lines :

                                </td>
                                </tr>
                                <tr class="content">
                                </>
                                üppig
                                <div style="margin-left:1em"><span style="color:darkblue"><b>üppig</b></span> <i class="p" style="color:green"><n></i>
                                

                                as below :

                                </td>
                                </>
                                <div style="margin-left:1em"><span style="color:darkblue"><b>üppig</b></span> <i class="p" style="color:green"><n></i>
                                

                                Best Regards,

                                guy038

                                IMPORTANT :

                                My first attempt of the search regex, against the corrected file, was :

                                SEARCH (?-s)^(<tr class="content">\R</>\R(.+)\R(?s).+?)</tr>\R<tr class="content">\R</>\R\2\R

                                But it did not work at all ! It just modified 8 entries and wrongly selected all the file contents, at the end !

                                Then, I decided to drop the smallest range of characters (?s).+? between the entry (.+)\R and the closing tag </tr>, and to replace it with the syntax (?:.+\R)+?, which grabs the smallest range of non-empty lines, in a non-capturing group, giving the regex :

                                SEARCH (?-s)^(<tr class="content">\R</>\R(.+)\R(?:.+\R)+?)</tr>\R<tr class="content">\R</>\R\2\R

                                It was even worse ! This time, I just got 1 result : the whole file contents, again :-((

                                Then, I checked how many lines were located after an entry, up to the nearest </tr> closing tag. It turns out that this initial range is between 4 and 11 lines. However, as the goal is to gather duplicate entries, each replacement increases the size of the merged multi-line areas <tr class="content">...........</tr> !
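
That measurement can be sketched in Python as follows. `body_line_counts` is a hypothetical helper, and it assumes every block starts with the two lines <tr class="content"> and </> followed by the entry name.

```python
def body_line_counts(text):
    """For each block, count the lines between the entry name and the nearest </tr>."""
    lines = text.splitlines()
    counts = []
    i = 0
    while i + 2 < len(lines):
        if lines[i] == '<tr class="content">' and lines[i + 1] == '</>':
            j = i + 3  # first line after the entry name
            while j < len(lines) and lines[j] != '</tr>':
                j += 1
            if j < len(lines):  # a closing </tr> was found
                counts.append(j - (i + 3))
            i = j
        else:
            i += 1
    return counts
```

Taking min() and max() of the returned list gives the kind of range quoted above.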

                                So I decided on a compromise, using the range {4,70} in the search regex below, which correctly detects and replaces 99.9% of all the occurrences, leaving only 20 entries to modify manually :-))

                                SEARCH (?i-s)^(<tr class="content">\R</>\R(.+)\R(?:.+\R){4,70}?)</tr>\R<tr class="content">\R</>\R\2\R
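
For readers who prefer a script, the repeated Replace All passes can be mimicked in Python. This is a rough translation under stated assumptions, not the regex engine Notepad++ uses: the text is assumed to have "\n" line endings (Python's re has no \R), re.MULTILINE stands in for the line-anchored ^, and `merge_duplicates` is a hypothetical helper name.

```python
import re

# Same structure as the SEARCH regex above, with "\n" instead of \R.
DUPLICATE_HEAD = re.compile(
    r'^(<tr class="content">\n</>\n(.+)\n(?:.+\n){4,70}?)'
    r'</tr>\n<tr class="content">\n</>\n\2\n',
    re.IGNORECASE | re.MULTILINE,
)


def merge_duplicates(text):
    """Repeat the replacement until nothing matches.

    Returns the merged text and the number of replacements made in each pass.
    """
    counts = []
    while True:
        text, n = DUPLICATE_HEAD.subn(r'\1</>\n', text)
        if n == 0:
            return text, counts
        counts.append(n)
```

Each pass merges at most one neighbouring copy into a given block, which is why an entry present in three copies needs two passes, in the same way that Replace All had to be clicked several times.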

                                The initial file, 350,585 lines long, contained 38,784 entries, some of which were duplicates. It now contains 23,146 unique entries, in only 303,669 lines !

                                • Ziad Aborami
                                  Ziad Aborami last edited by

                                  @guy038
                                  Thanks so much!
                                  I have fixed the errors.
                                  The replacement command worked fine!
                                  Thanks a lot for your time
                                  You are generous!
