Is there a way to find words in one document that are not in the other?

Lemmy Westin

I tried the Compare plugin, but it’s not quite what I’m looking for.

Essentially, I want something to show me words that are in one document that are not in another. For example, if you have two documents full of a bunch of movie titles, I’d like to see the ones that are in one document but not the other.

For example sometimes I’ll notice something got deleted on a list of some kind, that I didn’t realize had been deleted, so comparing something to an older saved version to see if the old version has something the new version does not have, this would be very useful.

guy038

Hello, @lemmy-westin,

It’s 8:49 am in France right now, and I’ll be away from home all day this Sunday. But, this evening, I can give you a solution.

Basically, I would merge the two documents in a single file, with a known boundary, between the two contents.

Then, with a regular expression, it would be possible to search for text which is in the first part of this temporary file and NOT in the second part of the file, after the boundary !

See you later,

Best Regards,

guy038

Lemmy Westin

That sounds promising and interesting, thanks! I have not used regular expressions before, just looking at a wiki explanation of what that is at the moment.

V S Rawat

Of course, it would be great if such help were available within npp.

However, I think it would be more convenient to use a text tokenizer program. T-Stat, at http://tstat.polito.it/, would be a good choice.
You enter one or more txt/html/doc(x)/OpenOffice Writer files,
and it will give you a complete word list in a click, sorted by text or by frequency.

You do that for the two files to create their word lists,
and then compare those in Excel, or maybe in npp using the Compare plugin.

Having such a tokenizer within npp would also be great: it would create a complete word list of the current text file.
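As a rough illustration of the word-list idea, here is a minimal Python sketch. It does not use T-Stat or the Compare plugin: it builds a word list for each text with a simple `\w+` tokenizer and then takes the set difference. The helper name `word_counts` and the sample titles are made up for the example.

```python
import re
from collections import Counter

def word_counts(text):
    """Return a Counter of lower-cased words found in text."""
    return Counter(w.lower() for w in re.findall(r"\w+", text))

old = "Alien, Blade Runner, Brazil, Dune"
new = "Alien, Brazil, Dune"

# Words that occur in the old text but nowhere in the new one
missing = sorted(set(word_counts(old)) - set(word_counts(new)))
print(missing)  # ['blade', 'runner']
```

The Counter also keeps frequencies, so `word_counts(old).most_common()` gives the list sorted by frequency.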

Thanks.

V S Rawat

tstat has versions 3.0 to 3.1.1, but those are not for Windows.
For Windows, the latest you can use is textstat-2.9c.zip.
Version 3.0 was released for Windows, but that site doesn’t seem to be working today.

guy038

Hi, @Lemmy-westin,

Sorry, it took me more time than I first thought !! But I did it, yeah !

We’ll need a dummy character, repeated a couple of times, as a boundary, between the contents of the two files to compare. Of course, this special character must NOT be already present in your two files.

I, personally, chose the # character. However, any other symbol may be used. Be aware that if you choose a special regex character such as, for instance, the + sign, you’ll need to escape it ( \+ ) in the regexes, so that the regex engine treats it as a literal sign !

So, following the method, explained in my previous post :

- In a new tab, paste the contents of the first file to compare
- Add a single line, with some # characters, which represents the boundary between the contents of the two files
- Paste, after this boundary line, the contents of the second file to compare

To detect any word, which exists, before the boundary and does NOT exist, after the boundary, use the regex, below :

SEARCH (?si)(?<=\W)(\w+)(?=\W.*#+(?!.*\W\1(\W|\z)))

To detect any word, which exists, before the boundary and ALSO exists, after the boundary, use the regex, below :

SEARCH (?si)(?<=\W)(\w+)(?=\W.*#+(?=.*\W\1(\W|\z)))

You may give a try with the license.txt file, in N++ folder.

- Open license.txt
- Add a new line #####, somewhere in the file
- Go back to the very beginning
- Open the Find dialog
- Select the Regular expression search mode
- Type one of the two regexes above in the Find what: zone
- Click on the Find next button, repeatedly

Et voilà !
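For anyone who wants to try the idea outside Notepad++, here is a minimal sketch of the same two searches in Python. It is an adaptation under stated assumptions, not the exact regexes above: `\Z` is used for the end of text (which Python’s `re` module accepts), the inner group is made non-capturing, and the function names and sample titles are made up. A newline is added before the first text so that its first word has a non-word character in front of it.

```python
import re

BOUNDARY = "#####"  # assumed: '#' appears in neither text

# Python adaptations of the two word regexes (\Z instead of \z)
ONLY_FIRST = re.compile(r"(?si)(?<=\W)(\w+)(?=\W.*#+(?!.*\W\1(?:\W|\Z)))")
IN_BOTH = re.compile(r"(?si)(?<=\W)(\w+)(?=\W.*#+(?=.*\W\1(?:\W|\Z)))")

def merge(first, second):
    """Join the two texts around a single boundary line."""
    return "\n" + first + "\n" + BOUNDARY + "\n" + second + "\n"

def words_only_in_first(first, second):
    return [m.group(1) for m in ONLY_FIRST.finditer(merge(first, second))]

def words_in_both(first, second):
    return [m.group(1) for m in IN_BOTH.finditer(merge(first, second))]

old = "Alien, Brazil, Dune"
new = "ALIEN, dune"
print(words_only_in_first(old, new))  # ['Brazil']
print(words_in_both(old, new))        # ['Alien', 'Dune']
```

Swapping the two arguments gives the words that exist only in the second text.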

REMARKS :

This search could be slow, on “out of date” computers ( like my own configuration), and/or for big size files !
Due to some bugs of the regex engine, relative to backward assertions, it’s better to begin searching at a location, which is followed by a non-word character or in a blank line, located above. By that means, if a matched word begins a line, it will be correctly found !
With these regexes, the backward search does NOT work. The opposite would have been very amazing :-))
The general template, of these regexes, is :

[Modifiers][Positive Look-Behind][Regex to Search][Positive Look-Ahead[Negative Look-Ahead]]

[Modifiers][Positive Look-Behind][Regex to Search][Positive Look-Ahead[Positive Look-Ahead]]

I’ll give you better explanations, next time !

Now, @lemmy-westin, I suppose that my example does not exactly match what you would like to do ! Maybe your files look more like a simple list, of one or more words per line. In that case, the practical goal would be to detect :

- A line, in the FIRST part, which does NOT exist, in the SECOND part, of current file
- A line which exists in BOTH the FIRST and SECOND parts of current file

So, if you don’t mind, in order to “tune” these regexes, could you give me some examples of texts we have to search through ?

TIA,

Best Regards,

guy038

P.S. : I surely already answered your question, or a similar one, some time ago ! I just have to find out where, among all my postings !!

Lemmy Westin

Thank you guy038, much much thanks! This works like a charm. Thanks for putting that time into putting this together, you know your stuff!

On what the text being searched is like, it’s generally just lines of things separated by commas, take movie titles or bands for example. This is a great way to see if something from an older version of anything is missing from a new version in general, so this has other uses too. Thanks again!

And thanks for the other token idea thingy too V S, rock on helpful folks here!

guy038

Hi, @Lemmy-westin, and All,

Thinking again about your problem, I succeeded in building a general method and the corresponding regexes !

So, let’s suppose you have a text, separated in TWO parts, by a single line, built of some # characters.

Then, you may like to search for :

Case D1 : Lines, which lie, ONLY, in the FIRST part of the text ( BEFORE the ###### line )
Case E1 : Lines, which lie, BOTH, in the TWO parts of the text ( BEFORE and AFTER the ###### line )
Case D2 : Parts of line, which lie, ONLY, in the FIRST part of the text ( BEFORE the ###### line )
Case E2 : Parts of line, which lie, BOTH, in the TWO parts of the text ( BEFORE and AFTER the ###### line )
Case D3 : Single words, which lie, ONLY, in the FIRST part of the text ( BEFORE the ###### line )
Case E3 : Single words, which lie, BOTH, in the TWO parts of the text ( BEFORE and AFTER the ###### line )

Remark :

If you want to search for ranges, in the SECOND part of text, exclusively, just swap the two parts of text and use, either, the case D1, D2 or D3 !

To, correctly, define these three ranges of text, we’ll use a start boundary and an end boundary. They will be used, in the look-behind and look-ahead structures, and will NEVER be part of the regex to search for !

For cases D1 and E1 :
- Start boundary = ^ ( Beginning of line ) OR \R ( End of Line characters of previous line )
- End boundary = \R ( End of line character(s) = \r\n in Windows files or \n in Unix files )
- Searched regex = .+ ( All standard characters of any NON-blank line )
For cases D2 and E2 :
- Start boundary = % ( Another dummy character, NOT already used in current text )
- End boundary = % ( The same character, as above )
- Searched regex = .+ ( Any NON-null range of standard characters, between the two % excluded limits )
For cases D3 and E3 :
- Start boundary = \W ( A NON-word character, so, any character different from [0-9A-Za-z_] and from all accented characters. This, also, includes the End of Line characters )
- End boundary = \W ( A NON-word character, as above )
- Searched regex = (\w+) ( A complete single word, of any length, between two excluded NON-word characters )

Now, here are the regexes to achieve these different searches :

Case D1 : (?i)^(.+)(?s)(?=\R.*#+(?!.*\R\1(\R|\z))) OR (?i)^(.+)(?s)(?=\R.*#+)(?!.*#+.*\R\1(\R|\z))

Case E1 : (?i)^(.+)(?s)(?=\R.*#+(?=.*\R\1(\R|\z))) OR (?i)^(.+)(?s)(?=\R.*#+.*\R\1(\R|\z))

You may test the D1 and E1 regexes with, for instance, the text, below, in a NEW tab :

When we speak of free
software, we are referring to
 freedom, not price. Our General
When we speak of free
software, we are referring to
make sure that you have the
freedom to distribute copies
This is a simple test
#########################################
This IS A simple TEST
When we SPEAK of free
 freedom, not price. Our General
make sure that you have the
 freedom, not price. Our General
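The line-level searches can also be sketched in Python, with a few adaptations that are assumptions of this example rather than part of the regexes above: `\n` stands in for `\R` (so Unix line ends are assumed), `\Z` stands in for `\z`, `[^\n]+` replaces `.+` because the `s` flag is set for the whole pattern, and the function name and sample list are made up.

```python
import re

# Adapted D1 / E1: "\n" replaces \R and \Z replaces \z
LINES_ONLY_FIRST = re.compile(r"(?sim)^([^\n]+)(?=\n.*#+(?!.*\n\1(?:\n|\Z)))")
LINES_IN_BOTH = re.compile(r"(?sim)^([^\n]+)(?=\n.*#+.*\n\1(?:\n|\Z))")

def compare_lines(first, second, boundary="#####"):
    """Return (lines only in first, lines in both), ignoring case."""
    merged = first + "\n" + boundary + "\n" + second
    only_first = [m.group(1) for m in LINES_ONLY_FIRST.finditer(merged)]
    in_both = [m.group(1) for m in LINES_IN_BOTH.finditer(merged)]
    return only_first, in_both

old_list = "Alien\nBlade Runner\nBrazil"
new_list = "brazil\nALIEN"
print(compare_lines(old_list, new_list))
# (['Blade Runner'], ['Alien', 'Brazil'])
```

A line that is duplicated in the first part is reported once per occurrence, just as repeated clicks on Find next would stop on each copy.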

Case D2 : (?i)(?<=%)(.+)(?s)(?=%.*#+(?!.*%\1%)) OR (?i)(?<=%)(.+)(?s)(?=%.*#+)(?!%.*#+.*%\1%)

Case E2 : (?i)(?<=%)(.+)(?s)(?=%.*#+(?=.*%\1%)) OR (?i)(?<=%)(.+)(?s)(?=%.*#+.*%\1%)

You may test the D2 and E2 regexes with, for instance, the text, below, in a NEW tab :

111 %When we speak of free% 111
222,%software, we are referring to%,222
333      % freedom, not price. Our General%        333
abc %When we speak of free% abc
xyz,%software, we are referring to%,xyz
%make sure that you have the%
555       %freedom to distribute copies%        555
666:%This is a simple test%:666
#####################################################################
777|||%This is A simple TEST%|||777
888----%When we SPEAK of free%----888
999% freedom, not price. Our General%999
abc     %make sure that you have the%      abc
000000000% freedom, not price. Our General%0000000000000000
   -------------   %make sure that you have the%   ------------
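A small Python sketch of the %-delimited searches, again as an adaptation and not the exact regexes above: `[^%\n]+` is used instead of `.+` so that a match stays inside one %...% pair on one line, and the sample lines and variable names are made up. It assumes one %...% pair per line, as in the test text above.

```python
import re

# Adapted D2 / E2: the searched text sits between two % characters
SEG_ONLY_FIRST = re.compile(r"(?si)(?<=%)([^%\n]+)(?=%.*#+(?!.*%\1%))")
SEG_IN_BOTH = re.compile(r"(?si)(?<=%)([^%\n]+)(?=%.*#+.*%\1%)")

merged = "\n".join([
    "1 %red fox% 1",
    "2,%Blue Jay%,2",
    "#####",
    "x %blue jay% x",
    "y---%grey owl%---y",
])

seg_only_first = [m.group(1) for m in SEG_ONLY_FIRST.finditer(merged)]
seg_in_both = [m.group(1) for m in SEG_IN_BOTH.finditer(merged)]
print(seg_only_first)  # ['red fox']
print(seg_in_both)     # ['Blue Jay']
```

Whatever surrounds the % pair on each line is ignored, which is the point of cases D2 and E2.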

Case D3 : (?si)(?<=\W)(\w+)(?=\W.*#+(?!.*\W\1(\W|\z))) OR (?si)(?<=\W)(\w+)(?=\W.*#+)(?!.*#+.*\W\1(\W|\z))

Case E3 : (?si)(?<=\W)(\w+)(?=\W.*#+(?=.*\W\1(\W|\z))) OR (?si)(?<=\W)(\w+)(?=\W.*#+.*\W\1(\W|\z))

You may test the D3 and E3 regexes with, for instance, the text, below, in a NEW tab :

software
price
freedom
SOFtware
prICE
General
Public
This is a simple test to find out identical / different words inside that text 
##########################################################################################
This, is A test in order to know the same / different words of the text
SoftwarE
freeDOM
genERal
FREEDOM

Notes :

The last cases, D3 and E3, are the ones discussed in my previous post
All the regexes above are case insensitive. If searches must be case sensitive, just change the (?i) syntaxes into (?-i) and the (?si) syntaxes into (?s-i)
Remember that your text must contain just ONE line with, at least, one # character
Regarding the D1, D2 and D3 equivalent regexes, their general templates are :
- [Modifiers][Positive Look-Behind][Regex to Search][Positive Look-Ahead[Negative Look-Ahead]], with nested look-aheads
- [Modifiers][Positive Look-Behind][Regex to Search][Positive Look-Ahead][Negative Look-Ahead], with juxtaposed look-aheads
Regarding the E1, E2 and E3 equivalent regexes, their general templates are :
- [Modifiers][Positive Look-Behind][Regex to Search][Positive Look-Ahead[Positive Look-Ahead]], with nested look-aheads
- [Modifiers][Positive Look-Behind][Regex to Search][Positive Look-Ahead], with 1 look-ahead, only
Just notice that a positive look-ahead, nested in another positive look-ahead, may be merged into a single look-ahead. But it’s impossible to merge a negative look-ahead, nested in a positive look-ahead !
Of course, as usual, you may replace, delete, mark or bookmark the different matches, for further modifications !
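The equivalence between the two forms of each template can be checked on a tiny sample. This is only a sketch in Python, with `\Z` standing in for `\z` and a made-up test string; it compares the matches of the nested and juxtaposed forms for the "different" search, and of the nested and merged forms for the "equal" search.

```python
import re

text = "\nalpha Beta gamma\n#####\nBETA delta\n"

def found(pattern):
    """List the words matched by pattern in the sample text."""
    return [m.group(1) for m in re.finditer(pattern, text)]

# "Different" search: nested vs. juxtaposed look-aheads
d_nested = found(r"(?si)(?<=\W)(\w+)(?=\W.*#+(?!.*\W\1(?:\W|\Z)))")
d_juxtaposed = found(r"(?si)(?<=\W)(\w+)(?=\W.*#+)(?!.*#+.*\W\1(?:\W|\Z))")

# "Equal" search: nested look-aheads vs. one merged look-ahead
e_nested = found(r"(?si)(?<=\W)(\w+)(?=\W.*#+(?=.*\W\1(?:\W|\Z)))")
e_merged = found(r"(?si)(?<=\W)(\w+)(?=\W.*#+.*\W\1(?:\W|\Z))")

print(d_nested == d_juxtaposed, d_nested)  # True ['alpha', 'gamma']
print(e_nested == e_merged, e_nested)      # True ['Beta']
```

Agreement on one sample is not a proof, of course, but it is a quick way to catch a typo when rewriting one form into the other.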

Cheers,

guy038