Delete duplicated expressions, keeping the first one
-
Maybe I’m a rogue but I would hit it with Python or Pythonscript:
    new_content = []
    flag = 0
    d_dict = {}
    for line in editor.getCharacterPointer().splitlines():
        line = line.rstrip()
        if flag == 1:
            if not line in d_dict:
                d_dict[line] = 1
                new_content.append('<some thing>')
                new_content.append(line)
            flag = 0
        else:
            if line == '<some thing>':
                flag = 1
            else:
                new_content.append(line)
    editor.beginUndoAction()
    editor.setText('\r\n'.join(new_content))
    editor.endUndoAction()

-
Thanks for your suggestion
I got an error while running the script: (win10)
    File "D:\script.py", line 4, in <module>
    for line in editor.getCharacterPointer().splitlines():
    NameError: name 'editor' is not defined

What is wrong?
-
Hello, @ziad-aborami,
EDIT on 03/02/19:
I slightly modified parts A, B and C, in order to respect the `<tr class="content">........</tr>` multi-line block structure!

I did get your e-mail and downloaded your news.txt file of 350,578 lines. This dictionary file contains 38,783 entries, and 8,117 of these entries have duplicate entries, on which a global S/R must occur!

Now, I'm going to give you 3 parts of your file which contain duplicate entries; just tell me the final text that you expect for each of the 3 examples!

Part A, regarding the Abaddon entry, in 3 copies:

    <tr class="content">
    </>
    Abaddon
    <div style="margin-left:1em"><span style="color:darkblue"><b>Abaddon - <i>(der)</i></b></span>
    Herkunft: hebr.<br><br>Beschreibung:<br>Substantiv<br><br></span>
    Abaddon - <i>(der)</i></a>
    </td>
    <td class="content" width="50%" valign="top">Abgrund</td>
    </tr>
    <tr class="content">
    </>
    Abaddon
    <div style="margin-left:1em"><span style="color:darkblue"><b>Abaddon - <i>(der)</i></b></span>
    Herkunft: hebr.<br><br>Beschreibung:<br>Substantiv<br><br></span>
    Abaddon - <i>(der)</i></a>
    </td>
    <td class="content" width="50%" valign="top">Unterwelt</td>
    </tr>
    <tr class="content">
    </>
    Abaddon
    <div style="margin-left:1em"><span style="color:darkblue"><b>Abaddon - <i>(der)</i></b></span>
    Herkunft: hebr.<br><br>Beschreibung:<br>Substantiv<br><br></span>
    Abaddon - <i>(der)</i></a>
    </td>
    <td class="content" width="50%" valign="top">jüd. Todesengel</td>
    </tr>

Part B, regarding the abaissieren entry, in 2 copies:

    <tr class="content">
    </>
    abaissieren
    <div style="margin-left:1em"><span style="color:darkblue"><b>abaissieren</b></span>
    Herkunft: franz.<br><br>Beschreibung:<br>Verb<br><br></span>
    abaissieren</a>
    </td>
    <td class="content" width="50%" valign="top">demütigen</td>
    </tr>
    <tr class="content">
    </>
    abaissieren
    <div style="margin-left:1em"><span style="color:darkblue"><b>abaissieren</b></span>
    Herkunft: franz.<br><br>Beschreibung:<br>Verb<br><br></span>
    abaissieren</a>
    </td>
    <td class="content" width="50%" valign="top">erniedrigen</td>
    </tr>

Part C, regarding the abgeschmackt entry, in 4 copies:

    <tr class="content">
    </>
    abgeschmackt
    <div style="margin-left:1em"><span style="color:darkblue"><b>abgeschmackt</b></span>
    <i class="p" style="color:green"><n></i>
    <span><b>banal</b><br>
    Herkunft: franz.<br><br>Beschreibung:<br>Adjektiv<br><br>Synonym:<br>abgedroschen, einfach, alltäglich, fade, gewöhnlich</span>
    banal</a>
    </td>
    </tr>
    <tr class="content">
    </>
    abgeschmackt
    <div style="margin-left:1em"><span style="color:darkblue"><b>abgeschmackt</b></span>
    <i class="p" style="color:green"><n></i>
    <span><b>fad(e)</b><br>
    Herkunft: franz.<br><br>Beschreibung:<br>Adjektiv<br><br>Synonym:<br>abgedroschen, abgegriffen, langweilig, läppisch, ohne Geschmack</span>
    fad(e)</a>
    </td>
    </tr>
    <tr class="content">
    </>
    abgeschmackt
    <div style="margin-left:1em"><span style="color:darkblue"><b>abgeschmackt</b></span>
    <i class="p" style="color:green"><n></i>
    <span><b>insipid</b><br>
    Herkunft: lat.<br><br>Beschreibung:<br>Adjektiv<br></span>
    insipid</a>
    </td>
    </tr>
    <tr class="content">
    </>
    abgeschmackt
    <div style="margin-left:1em"><span style="color:darkblue"><b>abgeschmackt</b></span>
    <i class="p" style="color:green"><n></i>
    <span><b>insipid(e)</b><br>
    Herkunft: lat.<br><br>Beschreibung:<br>Adjektiv<br><br>Synonym:<br>albern, schal, töricht, uninteressant</span>
    insipid(e)</a>
    </td>
    </tr>
    <tr class="content">
    </>
    abgeschmackt
    <div style="margin-left:1em"><span style="color:darkblue"><b>abgeschmackt</b></span>
    <i class="p" style="color:green"><n></i>
    <span><b>maussade</b><br>
    Herkunft: franz.<br><br>Beschreibung:<br>Adjektiv<br><br>Synonym:<br>schal, mürrisch</span>
    maussade</a>
    </td>
    </tr>

In your answer, simply place the expected multi-line text of each part A, B and C between the two lines:

    ~~~html
    ~~~

I suppose that these 3 expected blocks of text will be enough to determine the right regex ;-))

Cheers,
guy038
-
@Ziad-Aborami said:
NameError: name ‘editor’ is not defined
That's because the `editor` object was not imported into the Python script.

You can add the line `from Npp import *` to the top of the script you got from @Alan-Kilborn. This will define that object for this one script.

Or, if you want it available to all scripts, you can go to Plugins > Python Script > Scripts, then Ctrl+Click on `startup`, and add that line just after the `import sys`. I also recommend (though it's not necessary for this) that you go to Plugins > Python Script > Configuration… and change the "Initialisation" setting from `LAZY` to `ATSTARTUP`. Then you can save and close `startup.py`, exit Notepad++ and reload. From then on, PythonScript scripts will always have the `editor`/`editor1`/`editor2`/`notepad` objects available; ATSTARTUP makes the PythonScript initialization happen right away, which is useful if you've got Python scripts adding callback hooks or similar. (This is the setup I use.)
-
@PeterJones
Thanks for help.
I got: ModuleNotFoundError: No module named 'Npp'
I have added `from Npp import *` as the first line.
-
@guy038
Thanks again!
I appreciate your help so much!
-
@Ziad-Aborami said:
ModuleNotFoundError: No module named ‘Npp’
Out of curiosity: are you using the PythonScript plugin in Notepad++, or are you using a normal Python interpreter, running it with `python.exe`, `python2.exe`, or `python3.exe`? The script needs to be run by the PythonScript plugin in order for this to work.

Also, where did you save the script that @Alan-Kilborn provided? The right way is to go to Plugins > Python Script > New Script, then give it a name (such as `YourNameHere.py`) in the directory that it defaults to. You then run the script using Plugins > Python Script > Scripts > YourNameHere. For example, here's a Hello World:

    from Npp import *
    console.show()
    console.write("Hello World")

I created the file as `helloWorld.py` using the above sequence, pasted in that text, then ran Plugins > Python Script > Scripts > helloWorld. This displayed Hello World in the PythonScript console sub-window in Notepad++.
-
Regarding the import error with the Pythonscript presented. My bad, but I refuse to be bothered with putting those standard imports at the top of every script I ever do (thus I forget about it when posting scripts). I make sure (one time) they are in my startup.py and then I forget about it. Maybe the Pythonscript implementers should consider adding those imports at the top of a file when the New Script option is chosen from the Pythonscript menus.
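For readers who prefer a script that carries its own import, here is a minimal sketch of the earlier script with the de-duplication logic pulled out into a plain function. The function name `dedupe_after_marker` and the guarded import are illustrative assumptions, not part of the original script; `'<some thing>'` is still the placeholder marker from the original post.

```python
def dedupe_after_marker(lines, marker="<some thing>"):
    """Keep the first marker/value pair for each value; drop later repeats."""
    seen = set()
    kept = []
    expect_value = False  # True when the previous line was the marker
    for raw in lines:
        line = raw.rstrip()
        if expect_value:
            if line not in seen:
                seen.add(line)
                kept.append(marker)
                kept.append(line)
            expect_value = False
        elif line == marker:
            expect_value = True
        else:
            kept.append(line)
    return kept


try:
    # Only importable when run by the PythonScript plugin inside Notepad++.
    from Npp import editor
except ImportError:
    editor = None  # plain python.exe: the function above is still usable

if editor is not None:
    new_content = dedupe_after_marker(editor.getCharacterPointer().splitlines())
    editor.beginUndoAction()
    editor.setText("\r\n".join(new_content))
    editor.endUndoAction()
```

Run from the PythonScript menu, this should behave like the original script; run with a normal interpreter, it only defines the function, which makes the logic easy to try on a small list of lines.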
-
@Alan-Kilborn @PeterJones
Thanks so much, the Python method works well!

@PeterJones
I was running the script from cmd with `python.exe`.
I have now created a new script and saved it in the Notepad++ scripts folder.
When I ran it from the Scripts menu in Notepad++, it worked well, in 2 seconds and without lagging.
Finally the problem is solved!
-
You only need to import from Npp if the current file isn't the main file being executed, for example if you create a module or package.
-
Hi, @ziad-aborami, and All,
While waiting for your post, I noticed some errors in your file! Not a lot: only 7 :-))

Here are the corrections to make to the new.txt file that you attached to your e-mail. These modifications are listed in reverse order, so that the line numbering does not change during the manual modifications!

- First, this simple change to respect the `<tr class="content">........</tr>` multi-line block, for each entry:
  - Add a line `<tr class="content">` before the first line of your file
  - Save your file, which is now 350,579 lines long
- Then, here are the 7 modifications to do:
  - Add the line `<tr class="content">` BEFORE line 346,171 ( `</>` )
  - Change the entire line 346,149 ( `<tr class>"content">` ) into the line `<tr class="content">`
  - Change the entire line 343,625 ( `</>` ) into `<tr class="content">`
  - Add the line `</tr>` AFTER line 250,404 ( `<td class="content" width="50%" valign="top">Herabsetzung</td>` )
  - Add the two lines `<tr class="content">` and `</>` BEFORE line 250,399 ( Rabaisssement )
  - Add the line `<tr class="content">` BEFORE line 48,800 ( `</>` )
  - Add the line `</tr>` AFTER line 48,799 ( `</td>` )
- Your file should now be 350,585 lines long

If you count the number of its `<tr class="content">........</tr>` multi-line blocks with the regex below:

SEARCH `(?-s)^<tr class="content">\R</>\R(?s).+?</tr>`

you should get 38,784 blocks, or entries, in your German-German dictionary!
Although I haven't got any clue about the exact text that you would like to keep, let me give it a blind try!
I'm starting with the file modified as above!
Now:
- Open the Replace dialog (Ctrl + H)
- SEARCH `(?i-s)^(<tr class="content">\R</>\R(.+)\R(?:.+\R){4,70}?)</tr>\R<tr class="content">\R</>\R\2\R`
- REPLACE `\1</>\r\n`
- Check the Wrap around option and/or move to the very beginning of your file
- Select the Regular expression search mode
- Click on the Replace All button, repeatedly, till the message "Replace All: 0 occurrences were replaced" occurs at the bottom of the dialog
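The "click Replace All until nothing changes" procedure above can be sketched as a loop. This is an approximation under stated assumptions: it uses Python's `re` rather than Notepad++'s regex engine, assumes `\n` line endings, and the helper names `merge_duplicates` and `entry` plus the shortened sample lines are made up for illustration.

```python
import re

# Python rendering of the SEARCH regex: \R is written as \n, and the
# (?i-s) switch becomes IGNORECASE with the default "dot excludes newline".
DUP_RE = re.compile(
    r'^(<tr class="content">\n</>\n(.+)\n(?:.+\n){4,70}?)'
    r'</tr>\n<tr class="content">\n</>\n\2\n',
    re.MULTILINE | re.IGNORECASE,
)


def merge_duplicates(text):
    """Repeat the replacement until a pass changes nothing.

    Returns the merged text and the number of replacements made per pass.
    """
    passes = []
    while True:
        # Same effect as REPLACE \1</>\r\n, with \n as the line ending here.
        text, n = DUP_RE.subn(lambda m: m.group(1) + "</>\n", text)
        if n == 0:
            return text, passes
        passes.append(n)


def entry(word, meaning):
    """Build a shortened, made-up entry with 5 lines after the headword."""
    return (
        '<tr class="content">\n</>\n' + word + "\n"
        "<div>" + word + "</div>\nHerkunft: hebr.\n" + word + "</a>\n</td>\n"
        "<td>" + meaning + "</td>\n</tr>\n"
    )


sample = (
    entry("Abaddon", "Abgrund")
    + entry("Abaddon", "Unterwelt")
    + entry("Abaddon", "Todesengel")
)
merged, passes = merge_duplicates(sample)
print(passes)
print(merged)
```

Each pass can only merge an entry with the copy that immediately follows it, which is why several passes are needed when an entry has more than two copies.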
Hence, the results:

    After a 1ST click on the "Replace All" button => 10,583 replacements done, in 12s
    After a 2ND click on the "Replace All" button =>  4,119 replacements done, in 12s
    After a 3RD click on the "Replace All" button =>    912 replacements done, in 11s

But the 4th replacement was incorrect! It reported 2 replacements but, actually, only the first was OK, as the second wrongly selected all the file contents :-((. So I went back one step with Ctrl + Z.

Then I marked all the remaining entries of the dictionary and placed them in a new tab. After detecting the duplicate entries and deleting the others, 20 entries remained, in 2 or 3 copies, listed below:

    Abischnitt Abischnitt
    Erneuerung Erneuerung
    großartig großartig
    hervorragend hervorragend
    Muster Muster
    Verbindung Verbindung
    vornehm vornehm
    Vorrang Vorrang
    Wiederherstellung Wiederherstellung
    Überbleibsel Überbleibsel
    Übereinstimmung Übereinstimmung Übereinstimmung
    Überlegenheit Überlegenheit
    Überlegung Überlegung
    übereinstimmen übereinstimmen übereinstimmen
    übereinstimmend übereinstimmend übereinstimmend
    überlegen überlegen überlegen
    übersinnlich übersinnlich
    überspannt überspannt
    übertrieben übertrieben
    üppig üppig

Finally, for each of these 20 entries, I simply changed the lines:

    </td>
    </tr>
    <tr class="content">
    </>
    üppig
    <div style="margin-left:1em"><span style="color:darkblue"><b>üppig</b></span>
    <i class="p" style="color:green"><n></i>

as below:

    </td>
    </>
    <div style="margin-left:1em"><span style="color:darkblue"><b>üppig</b></span>
    <i class="p" style="color:green"><n></i>

Best Regards,
guy038
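The step of listing the entries that still exist in several copies was done by hand in a new tab; a possible scripted equivalent is sketched here. It assumes `\n` line endings and compares headwords exactly (case-sensitively), which is stricter than the `(?i)` search above. The name `remaining_duplicates` is hypothetical.

```python
import re
from collections import Counter

# Captures the headword line that follows '<tr class="content">' and '</>'.
HEAD_RE = re.compile(r'^<tr class="content">\n</>\n(.+)\n', re.MULTILINE)


def remaining_duplicates(text):
    """Return {headword: number of copies} for headwords seen more than once."""
    counts = Counter(HEAD_RE.findall(text))
    return {word: n for word, n in counts.items() if n > 1}


block = '<tr class="content">\n</>\n{}\n</td>\n</tr>\n'
sample = block.format("üppig") * 2 + block.format("Muster")
print(remaining_duplicates(sample))
```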
IMPORTANT:

My first attempt at the search regex, against the corrected file, was:

SEARCH `(?-s)^(<tr class="content">\R</>\R(.+)\R(?s).+?)</tr>\R<tr class="content">\R</>\R\2\R`

But it did not work at all! It just modified 8 entries and wrongly selected all the file contents at the end!

Then I decided to drop the "smallest range of characters" part `(?s).+?` between the entry `(.+)\R` and the closing tag `</tr>`, and to replace it with the syntax `(?:.+\R)+?`, which grabs the smallest range of non-empty lines, in a non-capturing group, giving the regex:

SEARCH `(?-s)^(<tr class="content">\R</>\R(.+)\R(?:.+\R)+?)</tr>\R<tr class="content">\R</>\R\2\R`

It was even worse! This time I just got 1 result: the whole file contents, again :-((

Then I checked how many lines are located after an entry, up to the nearest `</tr>` closing tag. It happens that this initial range is between 4 and 11 lines. However, as the goal is to gather duplicate entries, each round of replacements increases the size of the multi-line areas `<tr class="content">...........</tr>`!

So I decided on a compromise, using the range `{4,70}` in the search regex below, which correctly detects and replaces 99.9% of all the occurrences, with a manual modification of 20 entries only :-))

SEARCH `(?i-s)^(<tr class="content">\R</>\R(.+)\R(?:.+\R){4,70}?)</tr>\R<tr class="content">\R</>\R\2\R`

The initial file, 350,585 lines long, which contained 38,784 entries, some of them duplicates, now contains 23,146 unique entries, in only 303,669 lines!
-
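The check of how many lines separate a headword from its closing `</tr>` (the basis for the `{4,70}` range) could be scripted along these lines. This is a sketch assuming `\n` line endings and the three-line entry header shown in this thread; the function name `body_line_counts` and the sample lines are hypothetical.

```python
def body_line_counts(text):
    """For each entry, count the lines between the headword and its </tr>."""
    counts = []
    lines = text.split("\n")
    i = 0
    while i + 2 < len(lines):
        if lines[i] == '<tr class="content">' and lines[i + 1] == "</>":
            j = i + 3  # index of the first line after the headword
            while j < len(lines) and lines[j] != "</tr>":
                j += 1
            counts.append(j - (i + 3))
            i = j  # resume scanning at the closing tag
        i += 1
    return counts


sample = (
    '<tr class="content">\n</>\nword1\na\nb\nc\nd\n</tr>\n'
    '<tr class="content">\n</>\nword2\na\nb\nc\nd\ne\nf\n</tr>\n'
)
counts = body_line_counts(sample)
print(min(counts), max(counts))
```

Running this after each round of replacements would show how far the upper bound has grown, which is the reason a fixed upper limit such as 70 is a compromise rather than a guarantee.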
-
@guy038
Thanks so much!
I have fixed the errors.
The replacement command worked fine!
Thanks a lot for your time
You are generous!