Ian,
In my last post, I chose to add the mark 1 OR 2 so that you could easily spot, in the final File 3 file, the lines that :
are in File 1 and NOT in File 2 ( The ones, which end with the mark 1 )
are in File 2 and NOT in File 1 ( The ones, which end with the mark 2 )
However, as it seems that :
ONLY your File 2 has about 950 more lines than File 1
NO line of File 1 is missing from File 2
We DON’T, therefore, have to add any mark in the two files :-) Necessarily, the remaining lines will only be the additional lines of File 2 !!
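If you want to double-check the assumption that NO line of File 1 is missing from File 2, before dropping the marks, here is a minimal Python sketch. It works outside Notepad++, and the function name missing_from_second and the two small sample lists are just illustrations of mine :

```python
from collections import Counter

def missing_from_second(first, second):
    """Return the lines of `first` that `second` does not contain (counting repeats)."""
    # Counter subtraction keeps only the lines that occur more often in `first`
    remaining = Counter(first) - Counter(second)
    return sorted(remaining.elements())

# Hypothetical sample data, standing in for the real files
file1 = ["Line A", "Line C", "Line D", "Line F", "Line G"]
file2 = ["Line A", "Line B", "Line C", "Line D", "Line E", "Line F", "Line G", "Line H"]

# An empty list means every line of File 1 is also present in File 2
print(missing_from_second(file1, file2))  # prints []
```

If the list is NOT empty, the simplified method does not apply as is, and the marks of my previous post are still needed.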
So, the list of steps to perform becomes much simpler !
So, we start with two files, File 1, below :
Line A
Line C
Line D
Line F
Line G
and File 2, below :
Line A
Line B
Line C
Line D
Line E
Line F
Line G
Line H
Merge the contents of the two files, File 1 and File 2, into a File 3 file. The contents of File 3 are, then, as below :
Line A
Line C
Line D
Line F
Line G
Line A
Line B
Line C
Line D
Line E
Line F
Line G
Line H
Sort the contents of File 3, with the menu option Edit - Line Operations - Sort Lines Lexicographically Ascending. File 3 is, now, changed, as below :
Line A
Line A
Line B
Line C
Line C
Line D
Line D
Line E
Line F
Line F
Line G
Line G
Line H
Add an empty line at the very end of File 3, so that its last line ends with an EOL ( just for the next S/R to work correctly, in all cases )
Now, we just have to run a single S/R on the contents of File 3 :
Open the Replace dialog ( CTRL + H )
Go back to the very beginning of the file ( CTRL + Home )
SEARCH ^(.+\R)\1
REPLACE NOTHING
Check the Regular expression search mode
Uncheck the . matches newline option
Click on the Replace All button
-> The final state of File 3 should be, as you would expect, as below :
Line B
Line E
Line H
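For those who prefer to check the whole method outside Notepad++, the steps above can be sketched in Python, as below. This is only an approximation, under a few assumptions : Python's re module does not know the \R syntax, so I use \n instead, and the function name lines_only_in_second is just an illustration of mine :

```python
import re

file1 = ["Line A", "Line C", "Line D", "Line F", "Line G"]
file2 = ["Line A", "Line B", "Line C", "Line D", "Line E", "Line F", "Line G", "Line H"]

def lines_only_in_second(first, second):
    # Merge the two files, then sort the merged lines
    merged = sorted(first + second)
    # Make sure the very last line also ends with an EOL
    text = "\n".join(merged) + "\n"
    # Delete every pair of consecutive identical lines ( \n stands in for \R )
    remaining = re.sub(r"^(.+\n)\1", "", text, flags=re.MULTILINE)
    return remaining.splitlines()

print(lines_only_in_second(file1, file2))  # prints ['Line B', 'Line E', 'Line H']
```

As with the S/R in Notepad++, this relies on every line of File 1 being present in File 2, and on each file containing no repeated line of its own.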
Notes :
This regex deletes every pair of identical lines, one coming from File 1 AND the other from File 2
Compared to my previous post, the regex is simpler, because we don’t have to take the mark character, at the end of each line, into account
The part .+ represents all the standard characters of each non-empty line
The syntax \R represents the EOL character(s) of each non-empty line
The round brackets (....), surrounding the part .+\R, store the entire line, with its EOL, as group 1
The form \1 stands for an entire line, identical to the previous one ( group 1 )
As the Replace field is empty, the two consecutive identical lines are, therefore, deleted
In order to simplify the regex, I didn’t try to delete the possible empty lines
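To see what group 1 and \1 capture, and why the last line needs an EOL, here is a small Python sketch. Again, it is only an illustration : \n replaces \R, which Python's re module does not support, and the names PAIR, with_final_eol and without_final_eol are mine :

```python
import re

# Same regex as in the S/R, with \n standing in for \R
PAIR = re.compile(r"^(.+\n)\1", flags=re.MULTILINE)

with_final_eol = "Line F\nLine G\nLine G\n"
without_final_eol = "Line F\nLine G\nLine G"

match = PAIR.search(with_final_eol)
print(repr(match.group(1)))  # group 1, one entire line with its EOL : 'Line G\n'
print(repr(match.group(0)))  # whole match, the two identical lines : 'Line G\nLine G\n'

# Without a final EOL, the last line is not identical to group 1, so nothing matches
print(PAIR.search(without_final_eol))  # prints None

# Empty lines are never matched, as .+ needs, at least, one character
print(PAIR.search("\n\n"))  # prints None
```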
I hope that this new version of the method will be clearer !
Cheers,
guy038
P.S. :
When you say :
If this is so then it would be an impossible task, there are several thousand lines
I just don’t understand ?! Indeed, the previous S/R, to add a mark :
SEARCH .$
REPLACE $0\t1
takes almost the same time to run on a file of only 10 lines OR on a huge file with 100,000 lines !
As for me, on my old Win XP laptop, with only 1 GB of RAM, and N++ v6.8.8, this S/R takes about 8 seconds on a 100,000-line file :-)))
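If you are curious, the same marking S/R can be tried outside Notepad++, with the rough Python sketch below. Python's re module is a different regex engine, so the measured time is NOT comparable to the figure above. The function name add_mark and the generated sample text are assumptions of mine :

```python
import re
import time

def add_mark(text, mark):
    # Same idea as SEARCH .$ and REPLACE $0\t1 :
    # append a tabulation and the mark after the last character of each non-empty line
    return re.sub(r".$", lambda m: m.group(0) + "\t" + mark, text, flags=re.MULTILINE)

# Hypothetical sample : 100,000 short lines
sample = "".join(f"Line {n}\n" for n in range(100_000))

start = time.perf_counter()
marked = add_mark(sample, "1")
elapsed = time.perf_counter() - start

print(repr(marked.splitlines()[0]))  # prints 'Line 0\t1'
print(f"{elapsed:.3f} seconds for {sample.count(chr(10))} lines")
```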