Is there a way to find words in one document that are not in the other?



  • I tried the Compare plugin, but it’s not quite what I’m looking for.

    Essentially, I want something to show me words that are in one document that are not in another. For example, if you have two documents full of a bunch of movie titles, I’d like to see the ones that are in one document but not the other.

    For example, sometimes I notice that something got deleted from a list without my realizing it. Comparing the list to an older saved version, to see whether the old version has something the new version lacks, would be very useful.



  • Hello, @lemmy-westin,

    It’s 8:49 am in France right now, and I’ll be away from home all day this Sunday. But, this evening, I can give you a solution.

    Basically, I would merge the two documents into a single file, with a known boundary between the two contents.

    Then, with a regular expression, it would be possible to search for text which is in the first part of this temporary file and NOT in the second part of the file, after the boundary !

    See you later,

    Best Regards,

    guy038



  • That sounds promising and interesting, thanks! I have not used regular expressions before, just looking at a wiki explanation of what that is at the moment.



    Of course, it would be great if such help were available within npp.

    However, I think it would be more convenient to use a text tokenizer program. T-Stat, at http://tstat.polito.it/, would be a good choice.
    You enter one or more txt/html/doc(x)/OpenOffice Writer files,
    and it gives you a complete word list in one click, sorted by text or by frequency.

    You do that for the two files to create their word lists,
    and then compare those in Excel, or maybe in npp using the Compare plugin.

    Having such a tokenizer within npp, one that creates a complete word list of the current text file, would also be great.
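
    The word-list comparison described above can be sketched in a few lines of Python. This is only an illustration, not how T-Stat itself works: the names `word_list` and `missing_words` are made up for the example, a "word" is assumed to be any run of letters, digits or underscores, and the comparison ignores case.

```python
import re
from collections import Counter


def word_list(text):
    """Count every word in `text`, ignoring case (hypothetical helper)."""
    return Counter(re.findall(r"\w+", text.lower()))


def missing_words(old_text, new_text):
    """Return the words found in old_text but not in new_text, sorted."""
    return sorted(set(word_list(old_text)) - set(word_list(new_text)))


if __name__ == "__main__":
    old = "Alien, Brazil, Casablanca, Alien"
    new = "alien, casablanca"
    print(word_list(old).most_common())  # word list sorted by frequency
    print(missing_words(old, new))       # words that disappeared: ['brazil']
```

    Because the lists are plain sets of words, a multi-word title such as a movie name is split into its individual words; comparing whole lines would need a different split.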

    Thanks.



    tstat has versions 3.0 to 3.1.1, but those are not for Windows.
    For Windows, the latest you can use is textstat-2.9c.zip.
    A 3.0 release for Windows exists, but that site doesn’t seem to be working today.



  • Hi, @Lemmy-westin,

    Sorry, it took me more time than I, first, thought !! But I did it, yeah !

    We’ll need a dummy character, repeated a couple of times, as a boundary, between the contents of the two files to compare. Of course, this special character must NOT be already present in your two files.

    I, personally, chose the # character. However, any other symbol may be used. Be aware that if you choose a special regex character, such as the + sign, you’ll need to escape it ( \+ ) in the regexes, so that the regex engine treats it as a literal character !
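
    The escaping rule can be illustrated with a small Python sketch. It uses Python's `re` module rather than the Notepad++ regex engine, and `boundary_regex` is a hypothetical helper name, so take it as an analogy only.

```python
import re


def boundary_regex(char):
    """Regex source for a run of `char`, with `char` taken literally."""
    # re.escape adds the backslash for characters that are special in
    # regexes, such as "+"; Python also escapes "#", which is harmless.
    return re.escape(char) + "+"


if __name__ == "__main__":
    print(boundary_regex("+"))  # \++
    print(bool(re.fullmatch(boundary_regex("+"), "+++++")))  # True
    print(bool(re.fullmatch(boundary_regex("#"), "#####")))  # True
```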

    So, following the method, explained in my previous post :

    • In a new tab, paste the contents of the first file to compare

    • Add a single line, with some # characters, which represents the boundary between the contents of the two files

    • Paste, after this boundary line, the contents of the second file to compare

    • To detect any word, which exists, before the boundary and does NOT exist, after the boundary, use the regex, below :

    SEARCH (?si)(?<=\W)(\w+)(?=\W.*#+(?!.*\W\1(\W|\z)))

    • To detect any word, which exists, before the boundary and ALSO exists, after the boundary, use the regex, below :

    SEARCH (?si)(?<=\W)(\w+)(?=\W.*#+(?=.*\W\1(\W|\z)))
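
    For readers who want to try the idea outside Notepad++, here is a minimal Python sketch of the same two searches. It is an adaptation, not the original: Python's `re` module stands in for the Notepad++ engine, `\z` is written `\Z`, and the helper names `merge` and `find_words` are invented for the example. It also assumes that neither text contains a # character.

```python
import re

# The two word-level regexes above, with \z written as Python's \Z.
ONLY_IN_FIRST = re.compile(r"(?si)(?<=\W)(\w+)(?=\W.*#+(?!.*\W\1(\W|\Z)))")
IN_BOTH = re.compile(r"(?si)(?<=\W)(\w+)(?=\W.*#+(?=.*\W\1(\W|\Z)))")


def merge(first, second, boundary="#####"):
    """Join the two texts around a boundary line (assumes neither has '#')."""
    # The leading newline gives the very first word a non-word character
    # to its left, which the look-behind (?<=\W) requires.
    return "\n" + first + "\n" + boundary + "\n" + second + "\n"


def find_words(pattern, first, second):
    """List the words of `first` that `pattern` matches in the merged text."""
    return [m.group(1) for m in pattern.finditer(merge(first, second))]


if __name__ == "__main__":
    old = "Alien, Brazil, Casablanca"
    new = "alien, casablanca"
    print(find_words(ONLY_IN_FIRST, old, new))  # ['Brazil']
    print(find_words(IN_BOTH, old, new))        # ['Alien', 'Casablanca']
```

    As in Notepad++, the nested look-aheads rescan the rest of the text for every candidate word, so this sketch is only practical for small inputs.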


    You may give a try with the license.txt file, in N++ folder.

    • Open license.txt

    • Add a new line #####, somewhere in the file.

    • Go back to the very beginning

    • Open the Find dialog

    • Select the Regular expression search mode

    • Type one of the two regexes above in the Find what: zone

    • Click on the Find next button, repeatedly

    Et voilà !

    REMARKS :

    • This search could be slow on “out of date” computers ( like my own configuration ) and/or for big files !

    • Due to some bugs in the regex engine, related to look-behind assertions, it’s better to start the search at a location that is followed by a non-word character, or on a blank line above the text. That way, if a matched word begins a line, it will be correctly found !

    • With these regexes, the backward search does NOT work. The opposite would have been very amazing :-))

    • The general template, of these regexes, is :

    [Modifiers][Positive Look-Behind][Regex to Search][Positive Look-Ahead[Negative Look-Ahead]]
    
    [Modifiers][Positive Look-Behind][Regex to Search][Positive Look-Ahead[Positive Look-Ahead]]
    
    • I’ll give you better explanations, next time !

    Now, @lemmy-westin, I suppose that my example does not exactly match what you would like ! Maybe your files look more like a simple list, with one or more words per line. In that case, the practical goal would be to detect :

    • A line, in the FIRST part, which does NOT exist, in the SECOND part, of current file

    • A line which exists in BOTH the FIRST and SECOND parts of the current file

    So, if you don’t mind, in order to “tune” these regexes, could you give me some examples of texts we have to search through ?

    TIA,

    Best Regards,

    guy038

    P.S. : I surely already answered your question, or a similar one, some time ago ! I just have to find out where, among all my postings !!



  • Thank you guy038, much much thanks! This works like a charm. Thanks for putting that time into putting this together, you know your stuff!

    On what the text being searched is like, it’s generally just lines of things separated by commas, take movie titles or bands for example. This is a great way to see if something from an older version of anything is missing from a new version in general, so this has other uses too. Thanks again!

    And thanks for the other token idea thingy too V S, rock on helpful folks here!



  • Hi, @Lemmy-westin, and All,

    Thinking again about your problem, I succeeded in building a general method and the corresponding regexes !

    So, let’s suppose you have a text, separated in TWO parts by a single line, built from some # characters.

    Then, you may like to search for :

    • Case D1 : Lines, which lie, ONLY, in the FIRST part of the text ( BEFORE the ###### line )

    • Case E1 : Lines, which lie, BOTH, in the TWO parts of the text ( BEFORE and AFTER the ###### line )

    • Case D2 : Parts of line, which lie, ONLY, in the FIRST part of the text ( BEFORE the ###### line )

    • Case E2 : Parts of line, which lie, BOTH, in the TWO parts of the text ( BEFORE and AFTER the ###### line )

    • Case D3 : Single words, which lie, ONLY, in the FIRST part of the text ( BEFORE the ###### line )

    • Case E3 : Single words, which lie, BOTH, in the TWO parts of the text ( BEFORE and AFTER the ###### line )

    Remark :

    If you want to search for ranges, in the SECOND part of text, exclusively, just swap the two parts of text and use, either, the case D1, D2 or D3 !


    To, correctly, define these three ranges of text, we’ll use a start boundary and an end boundary. They will be used, in the look-behind and look-ahead structures, and will NEVER be part of the regex to search for !

    • For cases D1 and E1 :

      • Start boundary = ^ ( Beginning of line ) OR \R ( End of Line characters of previous line )

      • End boundary = \R ( End of line character(s) = \r\n in Windows files or \n in Unix files )

      • Searched regex = .+ ( All standard characters of any NON-blank line )

    • For cases D2 and E2 :

      • Start boundary = % ( Another dummy character, NOT already used in current text )

      • End boundary = % ( The same character, as above )

      • Searched regex = .+ ( Any NON-null range of standard characters, between the two % excluded limits )

    • For cases D3 and E3 :

      • Start boundary = \W ( A NON-word character, so, any character different from [0-9A-Za-z_] and from all accented characters. This, also, includes the End of Line characters )

      • End boundary = \W ( A NON-word character, as above )

      • Searched regex = (\w+) ( A complete single word, of any length, between two excluded NON-word characters )


    Now, here are the regexes to achieve these different searches :

    Case D1 : (?i)^(.+)(?s)(?=\R.*#+(?!.*\R\1(\R|\z))) OR (?i)^(.+)(?s)(?=\R.*#+)(?!.*#+.*\R\1(\R|\z))

    Case E1 : (?i)^(.+)(?s)(?=\R.*#+(?=.*\R\1(\R|\z))) OR (?i)^(.+)(?s)(?=\R.*#+.*\R\1(\R|\z))

    You may test the D1 and E1 regexes with, for instance, the text, below, in a NEW tab :

    When we speak of free
    software, we are referring to
     freedom, not price. Our General
    When we speak of free
    software, we are referring to
    make sure that you have the
    freedom to distribute copies
    This is a simple test
    #########################################
    This IS A simple TEST
    When we SPEAK of free
     freedom, not price. Our General
    make sure that you have the
     freedom, not price. Our General
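
    A rough Python rendering of the nested D1 and E1 regexes is sketched below, for experimenting outside Notepad++. It is an adaptation under stated assumptions: Python's `re` needs `(?m)` for ^ to match at each line start, has no `\R` or `\z` (so `\n` and `\Z` are used, which assumes Unix line endings), and does not accept a mid-pattern `(?s)`, so the scoped group `(?s:.*)` replaces it. The name `find_lines` is invented for the example.

```python
import re

# Python renderings of the nested D1 and E1 regexes. Differences from the
# Notepad++ versions: (?m) makes ^ match at every line start, \R becomes \n,
# \z becomes \Z, and the mid-pattern (?s) becomes the scoped group (?s:...).
LINES_ONLY_IN_FIRST = re.compile(
    r"(?im)^(.+)(?=\n(?s:.*)#+(?!(?s:.*)\n\1(?:\n|\Z)))"
)
LINES_IN_BOTH = re.compile(
    r"(?im)^(.+)(?=\n(?s:.*)#+(?=(?s:.*)\n\1(?:\n|\Z)))"
)


def find_lines(pattern, first, second, boundary="#####"):
    """Return the lines of `first` matched by `pattern` (LF line ends assumed)."""
    merged = first + "\n" + boundary + "\n" + second + "\n"
    return [m.group(1) for m in pattern.finditer(merged)]


if __name__ == "__main__":
    old = "Alien\nBlade Runner\nCasablanca"
    new = "ALIEN\ncasablanca\nBlade"
    print(find_lines(LINES_ONLY_IN_FIRST, old, new))  # ['Blade Runner']
    print(find_lines(LINES_IN_BOTH, old, new))        # ['Alien', 'Casablanca']
```

    Note that "Blade" in the second part does not count as a match for "Blade Runner", because the back-reference must be followed by a line end.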
    

    Case D2 : (?i)(?<=%)(.+)(?s)(?=%.*#+(?!.*%\1%)) OR (?i)(?<=%)(.+)(?s)(?=%.*#+)(?!%.*#+.*%\1%)

    Case E2 : (?i)(?<=%)(.+)(?s)(?=%.*#+(?=.*%\1%)) OR (?i)(?<=%)(.+)(?s)(?=%.*#+.*%\1%)

    You may test the D2 and E2 regexes with, for instance, the text, below, in a NEW tab :

    111 %When we speak of free% 111
    222,%software, we are referring to%,222
    333      % freedom, not price. Our General%        333
    abc %When we speak of free% abc
    xyz,%software, we are referring to%,xyz
    %make sure that you have the%
    555       %freedom to distribute copies%        555
    666:%This is a simple test%:666
    #####################################################################
    777|||%This is A simple TEST%|||777
    888----%When we SPEAK of free%----888
    999% freedom, not price. Our General%999
    abc     %make sure that you have the%      abc
    000000000% freedom, not price. Our General%0000000000000000
       -------------   %make sure that you have the%   ------------     
    

    Case D3 : (?si)(?<=\W)(\w+)(?=\W.*#+(?!.*\W\1(\W|\z))) OR (?si)(?<=\W)(\w+)(?=\W.*#+)(?!.*#+.*\W\1(\W|\z))

    Case E3 : (?si)(?<=\W)(\w+)(?=\W.*#+(?=.*\W\1(\W|\z))) OR (?si)(?<=\W)(\w+)(?=\W.*#+.*\W\1(\W|\z))

    You may test the D3 and E3 regexes with, for instance, the text, below, in a NEW tab :

    software
    price
    freedom
    SOFtware
    prICE
    General
    Public
    This is a simple test to find out identical / different words inside that text 
    ##########################################################################################
    This, is A test in order to know the same / different words of the text
    SoftwarE
    freeDOM
    genERal
    FREEDOM
    

    Notes :

    • The last cases, D3 and E3, are the ones discussed in my previous post

    • All the regexes above are case-insensitive. If the searches must be case-sensitive, just change the (?i) syntaxes into (?-i) and the (?si) syntaxes into (?s-i)

    • Remember that your text must contain just ONE line with at least one # character

    • Regarding the D1, D2 and D3 equivalent regexes, their general templates are :

      • [Modifiers][Positive Look-Behind][Regex to Search][Positive Look-Ahead[Negative Look-Ahead]], with nested look-aheads

      • [Modifiers][Positive Look-Behind][Regex to Search][Positive Look-Ahead][Negative Look-Ahead], with juxtaposed look-aheads

    • Regarding the E1, E2 and E3 equivalent regexes, their general templates are :

      • [Modifiers][Positive Look-Behind][Regex to Search][Positive Look-Ahead[Positive Look-Ahead]], with nested look-aheads

      • [Modifiers][Positive Look-Behind][Regex to Search][Positive Look-Ahead], with 1 look-ahead, only

    • Just notice that a positive look-ahead, nested in another positive look-ahead, may be merged into a single look-ahead. But it’s impossible to merge a negative look-ahead, nested in a positive look-ahead !

    • Of course, as usual, you may replace, delete, mark or bookmark the different matches, for further modifications !
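
    The equivalence between the nested and the juxtaposed (or merged) forms can be checked with a short Python sketch. It uses the D3 and E3 regexes from above with `\z` written as `\Z`, since Python's `re` module is standing in for the Notepad++ engine here; the helper name `matches` and the sample text are made up for the example.

```python
import re

# D3 ("only in the first part"), written both ways. \z becomes \Z in Python.
D3_NESTED = re.compile(r"(?si)(?<=\W)(\w+)(?=\W.*#+(?!.*\W\1(\W|\Z)))")
D3_JUXTAPOSED = re.compile(r"(?si)(?<=\W)(\w+)(?=\W.*#+)(?!.*#+.*\W\1(\W|\Z))")

# E3 ("in both parts"): nested look-aheads, then the merged single look-ahead.
E3_NESTED = re.compile(r"(?si)(?<=\W)(\w+)(?=\W.*#+(?=.*\W\1(\W|\Z)))")
E3_MERGED = re.compile(r"(?si)(?<=\W)(\w+)(?=\W.*#+.*\W\1(\W|\Z))")


def matches(pattern, text):
    """Return the matched words together with their start offsets."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(text)]


if __name__ == "__main__":
    text = "\nsoftware price Public\n#####\nSoftwarE freedom\n"
    print(matches(D3_NESTED, text) == matches(D3_JUXTAPOSED, text))  # True
    print(matches(E3_NESTED, text) == matches(E3_MERGED, text))      # True
    print([word for _, word in matches(D3_NESTED, text)])  # ['price', 'Public']
```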

    Cheers,

    guy038

