How to: Delete all lines in a .txt-document that occur in another .txt-document



  • Hello,

    I don't know if this can be done easily, but I will ask anyway:

    I have a list of words in a foreign language, and there might be a few English words
    in there as well. These I want to delete.
    My idea was: I have already found a list of the 5000 most common English words.
    So now my question: how can I delete those entries/lines in my list that also
    occur in the English language list?

    To make it clearer:
    I have a list called “foreign_language_vocab.txt” and a list called “English_vocab.txt”.
    Now I want to delete all lines in “foreign_language_vocab.txt” that also occur in “English_vocab.txt”. Thank you for the help!

    Best,
    Iskandar



  • Is there always one word per line in both files? If yes, a while ago I wrote a script for the NppExec plugin that does exactly what you want.

    Which version of Notepad++ do you use? If it is a version prior to v7.6, you can install the NppExec plugin using Plugin Manager. If you use v7.6, you can use the new built-in Plugin Admin.

    Once you have installed the plugin, come back for further instructions.



  • Hello, @iskandar-the-pupsi, @dinkumoil and All,

    Nothing is impossible with regular expressions ;-))

    So, in a new N++ tab ( Ctrl + N ) :

    • Paste the entire contents of the foreign_language_vocab.txt file

    • Add a line of at least 3 dash characters ( --- )

    • Paste the entire contents of the English_vocab.txt file

    Below is an example, with a mix of French and English words in the first part:

    # foreign_language_vocab.txt
    table
    church
    poisson
    girl
    couteau
    maison
    orange
    town
    world
    day
    école
    garçon
    car
    lit
    plate
    voiture
    star
    ------------
    # English_vocab.txt
    table
    man
    church
    girl
    knife
    town
    fork
    world
    country
    car
    house
    plate
    road
    light
    hammer
    box
    paper
    book
    vegetable
    orange
    castle
    forest
    wood
    bed
    desk
    water
    glass
    cat
    farm
    

    Now :

    • Open the Replace dialog

    • SEARCH (?-s)(^.+\R)(?s)(?=.+^\1)|---.+

    • REPLACE Leave EMPTY

    • Tick the Wrap around option

    • Select the Regular expression search mode

    • Click on the Replace All button
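For readers who want to see what each part of the search expression does, here is a rough Python transcription with comments. It is only a sketch, not the Notepad++ regex itself: Python's `re` module has no `\R`, so `\n` is assumed as the line break, and the name `remove_listed_lines` is made up for this illustration.

```python
import re

# Approximate Python equivalent of (?-s)(^.+\R)(?s)(?=.+^\1)|---.+
# Assumption: lines end with "\n" (Python's re has no \R).
PATTERN = re.compile(
    r"""
    ^(.+\n)            # group 1: one whole line, including its line break
    (?=(?s:.+)^\1)     # only match if the very same line starts again further down
    |                  # otherwise
    ---(?s:.+)         # match the dash separator and everything after it
    """,
    re.MULTILINE | re.VERBOSE,
)

def remove_listed_lines(combined: str) -> str:
    """Delete lines repeated further down, then the separator and the second list."""
    return PATTERN.sub("", combined)

sample = "table\npoisson\ngirl\nmaison\n---\ntable\nman\ngirl\n"
print(remove_listed_lines(sample))
```

Note that the last line of the second list needs a trailing line break, because group 1 includes the line break of the line it captured.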

    Et voilà ;-)) You get the expected result, below :

    # foreign_language_vocab.txt
    poisson
    couteau
    maison
    day
    école
    garçon
    lit
    voiture
    star
    

    Remarks :

    • The data in the two parts does not need to be sorted first!

    • If a word has the same spelling in the two languages, it is removed! ( case of the words “table” and “orange” )

    • If a foreign word is not part of the English_vocab.txt file, it is not removed ( case of the remaining words “day” and “star” in the foreign_language_vocab.txt file )
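For very long lists, the same filtering can also be done outside the editor. The following is a minimal Python sketch, assuming one word per line and UTF-8 encoded files. The function names `filter_lines` and `filter_file` and the output path are hypothetical, chosen only for this example.

```python
from pathlib import Path

def filter_lines(foreign_lines, english_lines):
    """Return the foreign lines that do not occur in the English list."""
    # A set makes each membership test fast, so neither list needs sorting.
    english = {line.strip() for line in english_lines if line.strip()}
    return [line for line in foreign_lines if line.strip() not in english]

def filter_file(foreign_path, english_path, output_path, encoding="utf-8"):
    """Read both word lists and write the filtered foreign list to output_path."""
    foreign = Path(foreign_path).read_text(encoding=encoding).splitlines()
    english = Path(english_path).read_text(encoding=encoding).splitlines()
    kept = filter_lines(foreign, english)
    Path(output_path).write_text("\n".join(kept) + "\n", encoding=encoding)

# Small in-memory check using a few words from the example above
print(filter_lines(["table", "poisson", "orange", "day"], ["table", "orange", "man"]))
```

To process the real files, one could call `filter_file("foreign_language_vocab.txt", "English_vocab.txt", "filtered_vocab.txt")`. As with the regex, the comparison is case-sensitive and matches whole lines only.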

    Best Regards

    guy038



  • @guy038 said:

    Nothing is impossible with regular expressions

    There should be a qualifier: …unless your regular expression happens to select all the text in your document. :-)



  • Hi, @scott-sumner and All,

    Note that I did not say "Nothing is impossible with N++ regular expressions" ;-))

    Cheers,

    guy038

