    Delete duplicated expressions with keeping the first one

    • Ziad Aborami

      @Alan-Kilborn @PeterJones
      Thanks so much, the python method works well!

      @PeterJones
      I was running the Python script from cmd with python.exe.
      I have now created a new script and saved it in the Notepad++ scripts folder.
      When I ran it from the scripts menu in Notepad++, it worked well, in 2 seconds, without lagging.
      Finally, the problem is solved!

      • Ekopalypse

        You only need to import from Npp if the current file isn't the main file being executed,
        for example if you create a module or package.

        • guy038

          Hi, @ziad-aborami, and All,

          While waiting for your post, I noticed some errors in your file ! Not a lot : only 7 :-))

          Here are the corrections to make to the new.txt file that you attached to your e-mail. These modifications are listed in reverse order, so that the line numbering does not change during the manual modifications !

          • First, this simple change, so that each entry respects the <tr class="content">........</tr> multi-line block structure

            • Add a line <tr class="content">, before the first line of your file

            • Save your file, which is, now, 350,579 lines long

          • Then, here are the 7 modifications to do :

            • Add the line <tr class="content">, BEFORE the line 346,171 ( </> )

            • Change the entire line 346,149 ( <tr class>“content”> ) into the line <tr class="content">

            • Change the entire line 343,625 ( </> ) into the line <tr class="content">

            • Add the line </tr>, AFTER the line 250,404 ( <td class=“content” width=“50%” valign=“top”>Herabsetzung</td> )

            • Add the two lines <tr class="content"> and </>, BEFORE the line 250,399 ( Rabaisssement )

            • Add the line <tr class="content">, BEFORE the line 48,800 ( </> )

            • Add the line </tr>, AFTER the line 48,799 ( </td> )

          Your file should be, now, 350,585 lines long

          If you count the number of its <tr class="content">........</tr> multi-lines blocks, with the regex, below :

          SEARCH (?-s)^<tr class="content">\R</>\R(?s).+?</tr>

          You should get 38784 blocks or entries, in your German-German dictionary !
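
For readers who want to cross-check the block count outside Notepad++, here is a minimal Python sketch of the same idea. It is an approximation, not the regex above verbatim: Python's `re` module has no `\R`, so the sketch assumes line endings can be normalised to `\n`, and it uses a scoped `(?s:...)` group instead of the mid-pattern `(?s)` switch. The `sample` text and the `count_blocks` name are made up for illustration.

```python
import re

# Assumption: "\r\n" is normalised to "\n", since Python's re has no \R.
BLOCK = re.compile(r'^<tr class="content">\n</>\n(?s:.+?)</tr>', re.MULTILINE)

def count_blocks(text):
    """Count the <tr class="content"> ... </tr> multi-line blocks."""
    return len(BLOCK.findall(text.replace("\r\n", "\n")))

# Hypothetical two-entry sample, not taken from the real dictionary file.
sample = (
    '<tr class="content">\r\n</>\r\nMuster\r\n<td>a</td>\r\n</tr>\r\n'
    '<tr class="content">\r\n</>\r\nüppig\r\n<td>b</td>\r\n</tr>\r\n'
)
print(count_blocks(sample))
```

On the real file, the figure to compare against is the 38784 blocks quoted above.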


          Although I haven't got any clue about the exact text that you would like to keep, let me give it a blind try !

          I'm starting from the file as modified above !

          Now :

          • Open the Replace dialog ( Ctrl + H )

          • SEARCH (?i-s)^(<tr class="content">\R</>\R(.+)\R(?:.+\R){4,70}?)</tr>\R<tr class="content">\R</>\R\2\R

          • REPLACE \1</>\r\n

          • Check the Wrap around option and/or move to the very beginning of your file

          • Select the Regular expression search mode

          • Click on the Replace All button, repeatedly, till the message Replace All: 0 occurrences were replaced occurs, at bottom of the dialog
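
The "click Replace All until nothing changes" loop can also be sketched in Python, which may help to see why several passes are needed: one pass merges a duplicate into the entry before it, and a third copy only becomes adjacent to the merged entry on the next pass. This is a hedged translation, not the exact Notepad++ behaviour: `\R` is replaced by `\n` after normalising line endings, and `merge_duplicates`, `entry` and the sample entries are hypothetical.

```python
import re

# Assumption: line endings are normalised to "\n", because Python's re has no \R.
DUPLICATE = re.compile(
    r'^(<tr class="content">\n</>\n(.+)\n(?:.+\n){4,70}?)'
    r'</tr>\n<tr class="content">\n</>\n\2\n',
    re.IGNORECASE | re.MULTILINE,
)

def merge_duplicates(text):
    """Repeat the replacement until a pass changes nothing.

    Returns the merged text and the number of replacements made in each pass.
    """
    text = text.replace("\r\n", "\n")
    per_pass = []
    while True:
        text, n = DUPLICATE.subn(r"\1</>\n", text)
        if n == 0:
            return text, per_pass
        per_pass.append(n)

def entry(headword, body_lines=4):
    """Build one hypothetical dictionary entry for the demo."""
    body = "".join(f"<td>{headword} {i}</td>\n" for i in range(body_lines))
    return f'<tr class="content">\n</>\n{headword}\n{body}</tr>\n'

sample = entry("üppig") + entry("üppig") + entry("üppig") + entry("Muster")
merged, per_pass = merge_duplicates(sample)
print(per_pass)
print(merged.count('<tr class="content">'))
```

In this small sample the three copies of the same headword need two passes to collapse into one entry, mirroring the repeated clicks described in the steps above.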

          Hence, the results :

          After a 1ST click on the "Replace All" button  =>  10,583 replacements done, in 12s
          After a 2ND click on the "Replace All" button  =>   4,119 replacements done, in 12s
          After a 3RD click on the "Replace All" button  =>     912 replacements done, in 11s
          

          But the 4th replacement was incorrect ! It reported 2 replacements but, actually, only the first was OK, as the second wrongly selected all the file contents :-((. So I went back one step, with Ctrl + Z

          Then I marked all the remaining entries of the dictionary and placed them in a new tab. After detecting the duplicate entries and deleting the others, 20 entries remained, in 2 or 3 copies, listed below :

          Abischnitt			Abischnitt
          Erneuerung			Erneuerung
          großartig			großartig
          hervorragend		hervorragend
          Muster				Muster
          Verbindung			Verbindung
          vornehm				vornehm
          Vorrang				Vorrang
          Wiederherstellung	Wiederherstellung
          Überbleibsel		Überbleibsel
          Übereinstimmung		Übereinstimmung			Übereinstimmung
          Überlegenheit		Überlegenheit
          Überlegung			Überlegung
          übereinstimmen		übereinstimmen			übereinstimmen
          übereinstimmend		übereinstimmend			übereinstimmend
          überlegen			überlegen				überlegen
          übersinnlich		übersinnlich	
          überspannt			überspannt
          übertrieben			übertrieben
          üppig				üppig
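
A list like the one above could also be produced with a short script. The following is only a sketch of one possible approach, not the method used in this post: it assumes the headword is always the third line of a block, that line endings can be normalised to `\n`, and it compares headwords case-sensitively. The function name and the sample data are hypothetical.

```python
import re
from collections import Counter

# Assumption: the headword is the line right after '<tr class="content">' and '</>'.
HEADWORD = re.compile(r'^<tr class="content">\n</>\n(.+)\n', re.MULTILINE)

def duplicated_headwords(text):
    """Return {headword: number of copies} for headwords seen more than once."""
    counts = Counter(HEADWORD.findall(text.replace("\r\n", "\n")))
    return {word: n for word, n in counts.items() if n > 1}

# Hypothetical sample with one headword present three times.
sample = "".join(
    f'<tr class="content">\n</>\n{word}\n<td>text</td>\n</tr>\n'
    for word in ["Muster", "üppig", "Muster", "vornehm", "Muster"]
)
print(duplicated_headwords(sample))
```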
          

          Finally, for each of these 20 entries, I simply changed the lines :

          </td>
          </tr>
          <tr class="content">
          </>
          üppig
          <div style="margin-left:1em"><span style="color:darkblue"><b>üppig</b></span> <i class="p" style="color:green"><n></i>
          

          as below :

          </td>
          </>
          <div style="margin-left:1em"><span style="color:darkblue"><b>üppig</b></span> <i class="p" style="color:green"><n></i>
          

          Best Regards,

          guy038

          IMPORTANT :

          My first attempt at the search regex, against the corrected file, was :

          SEARCH (?-s)^(<tr class="content">\R</>\R(.+)\R(?s).+?)</tr>\R<tr class="content">\R</>\R\2\R

          But it did not work at all ! It just modified 8 entries and wrongly selected all the file contents, at the end !

          Then, I decided to drop the smallest range of any characters, (?s).+?, between the entry (.+)\R and the closing tag </tr>, and to replace it with the syntax (?:.+\R)+?, which grabs the smallest range of non-empty lines, in a non-capturing group, giving the regex :

          SEARCH (?-s)^(<tr class="content">\R</>\R(.+)\R(?:.+\R)+?)</tr>\R<tr class="content">\R</>\R\2\R

          It was even worse ! This time, I just got 1 result : the whole file contents, again :-((

          Then, I checked how many lines are located after an entry, up to the nearest </tr> closing tag. It turns out that this initial range is between 4 and 11 lines. However, as the goal is to gather duplicate entries, the replacements increase the size of the <tr class="content">...........</tr> multi-line areas !
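
The measurement described here (how many lines sit between a headword and the nearest `</tr>`) can be sketched in Python as follows. This is an illustrative reconstruction under the assumption that every entry starts with the two lines `<tr class="content">` and `</>` followed by the headword; `body_line_counts`, `entry` and the sample are hypothetical names and data.

```python
def body_line_counts(text):
    """For each entry, count the lines between the headword and the nearest </tr>."""
    lines = text.replace("\r\n", "\n").split("\n")
    counts = []
    i = 0
    while i + 2 < len(lines):
        if lines[i] == '<tr class="content">' and lines[i + 1] == "</>":
            j = i + 3  # first line after the headword
            while j < len(lines) and lines[j] != "</tr>":
                j += 1
            if j < len(lines):  # ignore an entry with no closing </tr>
                counts.append(j - (i + 3))
            i = j
        else:
            i += 1
    return counts

def entry(headword, body_lines):
    """Build one hypothetical dictionary entry for the demo."""
    body = "".join(f"<td>line {k}</td>\n" for k in range(body_lines))
    return f'<tr class="content">\n</>\n{headword}\n{body}</tr>\n'

sample = entry("Muster", 4) + entry("üppig", 11)
counts = body_line_counts(sample)
print(min(counts), max(counts))
```

Running such a count again after each round of replacements would show how far the merged entries grow, which is what motivates an upper bound well above 11.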

          So I settled on a compromise, using the range {4,70} in the search regex below, which correctly detects and replaces 99.9% of all the occurrences, leaving only 20 entries to modify manually :-))

          SEARCH (?i-s)^(<tr class="content">\R</>\R(.+)\R(?:.+\R){4,70}?)</tr>\R<tr class="content">\R</>\R\2\R

          The initial file, 350,585 lines long, contained 38,784 entries, some of which were duplicates. It now contains 23,146 unique entries, in only 303,669 lines !
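
As a sanity check, the entry counts quoted in this post are consistent with each other. The short calculation below assumes that each automated replacement removes exactly one duplicate entry and that the manual step removes every extra copy of the 20 listed headwords (4 of them are shown in 3 copies, the other 16 in 2).

```python
initial_entries = 38_784
replace_all_passes = [10_583, 4_119, 912]  # the three successful "Replace All" runs

# From the list of 20 leftover headwords: 4 appear in 3 copies, the other 16 in 2.
three_copies = 4
two_copies = 20 - three_copies
manual_removals = two_copies * 1 + three_copies * 2  # extra copies deleted by hand

remaining = initial_entries - sum(replace_all_passes) - manual_removals
print(remaining)  # 23146
```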

          • Ziad Aborami

            @guy038
            Thanks so much!
            I have fixed the errors.
            The replacement command worked fine!
            Thanks a lot for your time
            You are generous!
