Find matching words between two text files



  • Hi guys, I need some help here.
    Let's say I have 2 text files, each with hundreds of lines: text file A and text file B.
    I want to match all words in A against B, so that any word from A that is found in B gets highlighted.
    Example: let's say file A has the word CAR and file B has the word CARPOOL.
    It would match the word CAR and highlight it (so only CAR would be highlighted, not the whole word CARPOOL).
    Or all matching words could be saved to a new file; either would be great.
    So file A is the "source": match any/all words from file A in file B (if they exist).
    I tried Compare, but it shows the differences; I need the matches.
    Thanks in advance for your help.
    Frederick



  • Hi @Frederick Smith
    Your problem is interesting. It is very similar in nature to one solved by @guy038, namely:
    https://notepad-plus-plus.org/community/topic/16335/multiline-replace-multiple-hosts-in-hostsfile
    In that instance the question was how to remove lines when duplicates were found. In essence, though, the search method here works very much like that one.

    I’m going to assume that file A contains 1 word per line; if not, we need to get file A into that format (when you copy the lines later, you ONLY want the word that is duplicated, not additional words on the same line). Open file A first, then add a “---” line (three hyphens) at the bottom, making it the last line. Then add the contents of file B below it.

    Open the Mark function and use the following:
    Find What: (?is)^(.+)\R(?=.*---.*\1)
    You need search mode set to regular expression (very important) and wrap around ticked. Also tick Bookmark Lines, this will help later.

    Put the cursor at the very start of the file (the top of the file A contents), otherwise the result will be unpredictable. You only need to click the Mark All button once. Any file A line that also appears in the file B area (below the “---” line) will be marked, and the line will also be bookmarked (blue circle in the margin). The “---” line stops attempts to find duplicates within the file B area itself.

    Now use the “Search” menu option, select “Bookmark”, then “Copy Bookmarked Lines”. Put the copied lines elsewhere, which is what you requested.

    My regex includes the (?is) modifiers. The s means the dot also matches line-break characters (CR and LF), just like ALL other characters; the i means a case-insensitive search, so “CAR” would also find “car”, “Car”, “cAr”, etc.
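
For anyone who would rather script this than click through the Mark dialog, the same idea can be sketched in Python. This is only a sketch under assumptions: the word lists are the small sample used in this thread, the function name `lines_of_a_found_in_b` is made up for illustration, and Python's `re` module is not the regex engine Notepad++ uses, so `\R` is replaced by `\n` and `(?m)` is added to make `^` match at each line start.

```python
import re

# Sample data: file A (one word per line), the "---" separator, then file B.
file_a = ["car", "apple", "beach", "hello", "down", "sun", "question"]
file_b = ["city", "whatever", "carpool", "san", "beachcity", "cornel", "downpillow"]
combined = "\n".join(file_a + ["---"] + file_b)

# Same pattern as in the Mark dialog, adapted for Python's re module:
# \R becomes \n, and (?m) lets ^ match at the start of every line.
pattern = re.compile(r"(?ism)^(.+)\n(?=.*---.*\1)")

def lines_of_a_found_in_b(text):
    """Return the file A lines that occur somewhere below the --- separator."""
    return [m.group(1) for m in pattern.finditer(text)]

print(lines_of_a_found_in_b(combined))
# ['car', 'beach', 'down']
```

The returned list plays the role of "Copy Bookmarked Lines": it could be written to a new file with an ordinary `open(..., "w")` call.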

    I hope this helps, otherwise come back with more info including samples of actual file A and B contents if you can.

    Terry



  • Hi Terry,
    Thanks a lot for taking the time and responding to my question.
    First, you’re correct in your assumption.
    ALMOST THERE…
    My first try didn’t work. Then, looking at your regex, I realized it calls for “---” (3 hyphens), not a single “-”, and once I changed that it WORKED!
    With one exception!
    The only thing is that it marks the file A part, not the file B part
    (and I would need the file B part to be marked).

    • I tried flipping the files around, but that didn’t work.

    These are not the real files, just a sample to illustrate…

    This is file A:
    car
    apple
    beach
    hello
    down
    sun
    question

    This is file B:
    city
    whatever
    carpool
    san
    beachcity
    cornel
    downpillow

    I opened file A and turned it into this:
    car
    apple
    beach
    hello
    down
    sun
    question
    ---
    city
    whatever
    carpool
    san
    beachcity
    cornel
    downpillow

    So, instead of marking: car, beach, down
    I would need it to mark: carpool, beachcity, downpillow
    so that “car” is highlighted in “carpool”.

    So how do I change the “Mark” regex to get that result?

    Thanks again Terry!



  • Hello, @frederick-Smith, @terry-r and All,

    Of course, with your additional information, it becomes easier to come up with a suitable regex! I hope that Terry won’t mind if I reply to you first ;-))


    Actually, you have two files: File_A contains a list of strings which may be substrings of some of the words contained in the File_B list!

    Then, we’re going to reverse the logic :

    • First, in a new N++ tab, copy/paste the File_B.txt contents

    • Add the single line ---

    • Then, under this line, insert the File_A contents

    • Open the Mark dialog

    • Use the regex search :

    (?si)(.+)(?=.*^---\R.*^\1$)

    • Preferably, tick the Purge for each search option

    • Click on the Mark All button


    So, given File_B contents, below :

    city
    whatever
    carpool
    san
    beachcity
    cornel
    downpillow
    

    and File_A contents, below :

    car
    cornel
    apple
    beach
    hello
    ever
    down
    sun
    it
    question
    

    Just note that I added 3 words, ever, cornel and it, to show that a “subset-word” can also be marked in the middle or at the end of a whole word, and that an entire word can be highlighted!

    Now, we add, in a new tab, the following text :

    city
    whatever
    carpool
    san
    beachcity
    cornel
    downpillow
    ---
    car
    cornel
    apple
    beach
    hello
    ever
    down
    sun
    it
    question
    

    Finally, using the Mark dialog and the regex (?si)(.+)(?=.*^---\R.*^\1$), it should highlight the parts shown in square brackets below :-))

    c[it]y
    what[ever]
    [car]pool
    san
    [beach]c[it]y
    [cornel]
    [down]pillow

    Notes :

    • As usual, the (?si) modifiers mean a case-insensitive search and that any dot ( . ) will match any single character ( standard or EOL )

    • Then, the main part (.+) tries to match the longest non-empty range of characters, possibly spanning several lines, stored as group 1, but ONLY IF the positive look-ahead (?=.*^---\R.*^\1$) is TRUE. That is to say, IF it detects :

      • A range of any character, possibly empty, .* ,

      • followed with a line with, only, 3 dashes and its line-break, ^---\R ,

      • followed, again, with the longest range, possibly null, of any character, .* ,

      • and ended with the contents of group 1, alone on its line, ^\1$
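
To see the look-ahead at work outside Notepad++, here is a minimal Python sketch using the sample lists from this post. It rests on a few assumptions: Python's `re` module is a different engine from the one in Notepad++, so `\R` is written as `\n` and `(?m)` is added so that `^` and `$` match at line boundaries; the helper name `bracket_marks` is made up for illustration; and square brackets stand in for the highlighting that the Mark dialog would apply.

```python
import re

# Sample data from this post: File_B first, then the --- line, then File_A.
file_b = ["city", "whatever", "carpool", "san", "beachcity", "cornel", "downpillow"]
file_a = ["car", "cornel", "apple", "beach", "hello", "ever", "down", "sun", "it", "question"]
combined = "\n".join(file_b + ["---"] + file_a)

# \R is replaced by \n because Python's re does not support \R, and (?m) is
# added so that ^ and $ match at line boundaries.
pattern = re.compile(r"(?sim)(.+)(?=.*^---\n.*^\1$)")

def bracket_marks(text):
    """Wrap every fragment the regex matches in square brackets."""
    return pattern.sub(lambda m: "[" + m.group(1) + "]", text)

# Keep only the File_B part (everything above the --- line).
marked_b = bracket_marks(combined).split("\n---\n")[0].split("\n")
print(marked_b)
# ['c[it]y', 'what[ever]', '[car]pool', 'san', '[beach]c[it]y', '[cornel]', '[down]pillow']
```

Nothing below the --- line gets bracketed, because the look-ahead needs the separator to come after the match.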


    Remark : if you prefer a case-sensitive search, simply start the regex with (?s-i) instead of (?si) !
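
The effect of the case modifier can be checked with a tiny Python sketch. The two-line sample below is hypothetical, and the pattern is adapted for Python's `re` module (`\n` instead of `\R`, plus `(?m)`). Python does not accept a leading `(?s-i)`, so the case-sensitive variant here simply leaves the `i` flag out.

```python
import re

# Hypothetical sample: "CarPool" is the File_B part, "car" the File_A part.
sample = "CarPool\n---\ncar"

# Case-insensitive variant, the counterpart of (?si) in the Mark dialog.
insensitive = re.compile(r"(?sim)(.+)(?=.*^---\n.*^\1$)")
# Case-sensitive variant, the counterpart of (?s-i): the i flag is omitted.
sensitive = re.compile(r"(?sm)(.+)(?=.*^---\n.*^\1$)")

print([m.group(1) for m in insensitive.finditer(sample)])  # ['Car']
print([m.group(1) for m in sensitive.finditer(sample)])    # []
```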

    Cheers,

    guy038



  • @Frederick-Smith said:

    It would match the word: CAR - and highlight it. (so only CAR would be highlighted - not: word: CARPOOL.

    I interpreted that as the word in file A being highlighted, but what you really meant was that the letters CAR in carpool would be highlighted because CAR also exists in file A. Sorry about that, and about the confusion over the 3 “-”; sometimes characters don’t display well, and it’s the interpreter (behind the compose window) that causes most of the issues. As @guy038 has given you another solution that fits your requirements, I’ll let it be.

    Be sure to come back if you need anyone to elaborate or help further.

    Terry



  • Hi @terry-r, @guy038 and All

    First I want to thank you both, @terry-r and @guy038, for taking the time to help me.

    Both solutions work, maybe a bit differently, but both give the results I was looking for.

    Let me say how much I appreciate the community. Thank you!

    Thanks again guys!

