How to: Delete all lines in a .txt-document that occur in another .txt-document

Iskandar The Pupsi

Hello,

dunno if this can be done easily but I will try to ask anyway:

I have a list of words in a foreign language and there might be a few English words
in there as well…these I want to delete.
My idea was: I have already found a list of the 5000 most common English words.
So now my question: how can I delete those entries/lines in my list that also
occur in the English language list?

To make it clearer:
I have a list called “foreign_language_vocab.txt” and a list called “English_vocab.txt”.
Now I want to delete all lines in “foreign_language_vocab.txt” that also occur in “English_vocab.txt”. Thank you for the help!

Best,
Iskandar

dinkumoil

Is there always one word per line in both files? If yes, a while ago I wrote a script for the NppExec plugin that does exactly what you want.

Which version of Notepad++ do you use? If it is a version prior to v7.6, you can install the NppExec plugin using Plugin Manager. If you use v7.6, you can use the new built-in Plugin Admin.

Once you have installed the plugin, come back for further instructions.

guy038

Hello, @iskandar-the-pupsi, @dinkumoil and All,

Nothing is impossible with regular expressions ;-))

So, in a new N++ tab ( Ctrl + N ) :

Paste in all the contents of the foreign_language_vocab.txt file
Add a line of at least three tilde characters ( ~~~ )
Paste in all the contents of the English_vocab.txt file

Below is an example, with a mix of French and American English words in the first part :

# foreign_language_vocab.txt
table
church
poisson
girl
couteau
maison
orange
town
world
day
école
garçon
car
lit
plate
voiture
star
~~~~~~~~~~~~~~~~~~~~
# English_vocab.txt
table
man
church
girl
knife
town
fork
world
country
car
house
plate
road
light
hammer
box
paper
book
vegetable
orange
castle
forest
wood
bed
desk
water
glass
cat
farm

Now :

Open the Replace dialog
SEARCH (?-s)^(.+)\R(?s)(?=.+^\1$)|~~~.+
REPLACE Leave EMPTY
Tick the Wrap around option
Select the Regular expression search mode
Click on the Replace All button

Et voilà ;-)) You get the expected result, below :

# foreign_language_vocab.txt
poisson
couteau
maison
day
école
garçon
lit
voiture
star
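
For readers who want to try the same idea outside Notepad++, here is a minimal sketch of the search above ported to Python's `re` module. It is an adaptation, not the exact Notepad++ expression: Python's `re` has no `\R`, so the sketch assumes the text uses plain `\n` line endings, and the inline `(?-s)` / `(?s)` switches are replaced by the scoped group `(?s:...)`. The names `PATTERN` and `strip_shared_lines` are made up for illustration.

```python
import re

# Port of the search pattern, assuming "\n" line endings:
#   ^(.+)\n             one complete line, captured as group 1
#   (?=(?s:.+)^\1$)     only if the same complete line occurs again further down
#   |~~~(?s:.+)         or: the tilde separator and everything after it
PATTERN = re.compile(r"^(.+)\n(?=(?s:.+)^\1$)|~~~(?s:.+)", re.MULTILINE)


def strip_shared_lines(combined: str) -> str:
    """Delete lines that reappear later in the buffer, then the separator and the rest."""
    return PATTERN.sub("", combined)


# Foreign words first, a tilde separator, then the English list.
sample = "poisson\ngirl\ncouteau\ncar\n~~~\ngirl\nknife\ncar\n"
print(strip_shared_lines(sample))
```

With this sample, only "poisson" and "couteau" survive. As with the original expression, the lookahead scans the rest of the buffer for every line, so this is probably best kept to modestly sized lists.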

Remarks :

The data in the two parts does not need to be sorted first !
If a word has the same spelling in the two languages, it is removed ! ( case of words “table” and “orange” )
If a word in the foreign list is not part of the English_vocab.txt file, it is not removed, even if it is an English word ( case of the remaining words “day” and “star” in the foreign_language_vocab.txt file )
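
The remarks above boil down to an exact, whole-line membership test, which can also be sketched without regular expressions. The following is only an illustration of that behaviour in Python, assuming one word per line; the function name `remove_common_lines` is hypothetical, and stripping surrounding whitespace is an added assumption that the regex approach does not make.

```python
def remove_common_lines(foreign_lines, english_lines):
    """Keep the foreign lines whose exact text is absent from the English list."""
    # A set gives fast lookups and makes the order of either list irrelevant.
    english = {line.strip() for line in english_lines}
    return [line for line in foreign_lines if line.strip() not in english]


foreign = ["table", "poisson", "orange", "day", "lit", "star"]
english = ["orange", "table", "light", "bed"]
print(remove_common_lines(foreign, english))
# prints ['poisson', 'day', 'lit', 'star']
```

Here "table" and "orange" are dropped because they appear in both lists, while "day" and "star" stay because the English list does not contain them, and "lit" stays because only complete lines are compared. To run it on the real files, each list could be read with something like `open(path, encoding="utf-8").read().splitlines()`, the encoding being a guess about how the files were saved.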

Best Regards

guy038

Scott Sumner

@guy038 said:

Nothing is impossible with regular expressions

There should be a qualifier: …unless your regular expression happens to select all the text in your document. :-)

guy038

Hi, @scott-sumner and All,

Note that I did not say "Nothing is impossible with N++ regular expressions" ;-))

Cheers,

guy038