Save a list of missing files

Ian Burrow

Hi

I have just run a comparison on 2 text files that should be identical, but one has approximately 1,000 more entries than the other.

Does anyone know how I can save a list of the files that are missing from the second text file?

Thanks

guy038

Hello Ian,

Easy enough ! It just takes three simple S/R, in regex mode, and a sort !

So, let’s imagine that we open, in N++, a file, named File 1, containing the lines below :

Line A
Line C
Line D
Line F
Line G

We, then, put a mark, at the end of all the non empty lines of the File 1 file :

Open the Replace dialog ( CTRL + H )
Get back to the very beginning of the file ( CTRL + Home )

SEARCH .$

REPLACE $0\t1

Check the Regular expression search mode
Uncheck the . matches newline option
Click on the Replace All button

-> The File 1 file is, now, changed, as below :

Line A	1
Line C	1
Line D	1
Line F	1
Line G	1
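
For readers who prefer a script, here is a minimal Python sketch of the same marking step. It is only an illustration, not part of the Notepad++ procedure: the function name mark_lines is hypothetical, and it assumes the text uses plain \n line endings (as Python gives you when a file is read in text mode).

```python
import re

def mark_lines(text: str, tag: str) -> str:
    """Append a tab and `tag` to every non-empty line of `text`."""
    # ".$" with re.MULTILINE matches the last character of each non-empty line.
    # m.group(0) plays the role of $0 (the whole match) in the Replace field.
    return re.sub(r".$", lambda m: m.group(0) + "\t" + tag, text,
                  flags=re.MULTILINE)

if __name__ == "__main__":
    print(mark_lines("Line A\nLine C\n\nLine D\n", "1"))
```

Empty lines are left alone, because the dot needs at least one character to match.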

Now, let’s imagine that we open, in N++, another file, named File 2, with the lines below :

Line A
Line B
Line C
Line D
Line E
Line G
Line H

Again, we put a mark, at the end of all the non empty lines of the File 2 file :

Get back to the very beginning of the file ( CTRL + Home )

SEARCH .$

REPLACE $0\t2

Check the Regular expression search mode
Uncheck the . matches newline option
Click on the Replace All button

-> The File 2 file is, now, changed, as below :

Line A	2
Line B	2
Line C	2
Line D	2
Line E	2
Line G	2
Line H	2

Well, it’s almost done !

Merge the contents of the two files File 1 and File 2, in a File 3 file. So, here are the contents of the File 3 file, below :

Line A	1
Line C	1
Line D	1
Line F	1
Line G	1

Line A	2
Line B	2
Line C	2
Line D	2
Line E	2
Line G	2
Line H	2

Perform a sort, on the contents of the File 3 file, with the menu option Edit - Line Operations - Sort Lines Lexicographically Ascending. The File 3 file is changed, as below :

Line A	1
Line A	2
Line B	2
Line C	1
Line C	2
Line D	1
Line D	2
Line E	2
Line F	1
Line G	1
Line G	2
Line H	2

Now, we just have to run a third and last S/R, on the contents of the File 3 file :

Open the Replace dialog ( CTRL + H )
Get back to the very beginning of the file ( CTRL + Home )

SEARCH ^(.+).\R\1.\R|^\R

REPLACE NOTHING

Check the Regular expression search mode
Uncheck the . matches newline option
Click on the Replace All button

-> The final state of the File 3 file should be, as below :

Line B	2
Line E	2
Line F	1
Line H	2

Notes :

This regex deletes every pair of identical lines, that is to say the lines which come from both File 1 AND File 2
I added the alternative |^\R, to my regex, in order to delete any empty line, as well

As you can see, only the orphan lines remain, that is to say :

Lines, which are in File 1 and NOT in File 2, as the line Line F, with the mark 1 at the end
Lines, which are in File 2 and NOT in File 1, as the lines Line B, Line E and Line H, with the mark 2, at the end
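
The whole mark / merge / sort / delete sequence can also be sketched in Python. This is a hedged illustration only: the function name orphan_lines is hypothetical, \n stands in for \R (Python's re module has no \R), and it assumes that no line is repeated inside a single file, otherwise two copies from the same file would cancel each other out.

```python
import re

def orphan_lines(file1_text: str, file2_text: str) -> str:
    """Return the lines found in only one of the two texts, with their mark."""
    def mark(text, tag):
        # Skip empty lines, append a tab and the file mark to the others.
        return [f"{line}\t{tag}" for line in text.splitlines() if line]

    # Merge the two marked lists and sort them, so twin lines become adjacent.
    merged = sorted(mark(file1_text, "1") + mark(file2_text, "2"))
    text = "".join(line + "\n" for line in merged)

    # Same idea as the SEARCH field: two consecutive lines that are identical
    # except for their last character (the mark) are deleted together.
    return re.sub(r"^(.+).\n\1.\n|^\n", "", text, flags=re.MULTILINE)

if __name__ == "__main__":
    file1 = "Line A\nLine C\nLine D\nLine F\nLine G\n"
    file2 = "Line A\nLine B\nLine C\nLine D\nLine E\nLine G\nLine H\n"
    print(orphan_lines(file1, file2))
```

With the two sample files of this post, the script should print the same four orphan lines as the final state of File 3.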

Nice, isn’t it !?

To end with, it shouldn’t be very difficult to remove either the lines which come from File 1 OR the lines which come from File 2, with the two menu options Search - Mark… and Search - Bookmark - Remove Bookmarked Lines

Of course, the length of the lines doesn’t matter, at all. I just chose lines, of equal length, for readability !

Best Regards,

guy038

DaveyD

Hi guy038
You’re amazing with these regex ideas! I would never think that such things can be done with regex (I immediately thought about a python script! :) )
It took me a while to figure this one out, but I got it in the end.
(I would write the explanation, but I can’t do nearly as good a job as you!)

Thanks
Davey

Ian Burrow

Many thanks for the replies.

That looks very complicated to me, I’m certainly no programmer!

Am I understanding you correctly, you say put a mark at the end of each line that appears in both files?

If this is so then it would be an impossible task, there are several thousand lines and about 950 more in one file than the other.

It may help to explain more what I am trying to do. I have 2 copies of what should be the same hard drive. They should be identical but one copy is missing approximately 950 files. I have created a text list of each hard drive using a DOS command, and I have used Notepad++ to search and identify the missing files / lines. I now need to capture just the lines that are missing.

Thanks again

guy038

Ian,

In my last post, I preferred to add the mark 1 OR 2, in order to easily notice, in the final File 3 file, the lines that :

are in File 1 and NOT in File 2 ( The ones, which end with the mark 1 )
are in File 2 and NOT in File 1 ( The ones, which end with the mark 2 )

However, as it seems that :

ONLY your File 2 has extra lines ( about 950 more than File 1 )
NO line, in File 1, is missing, in File 2

We DON’T have, therefore, to add any mark, in the two files :-) Necessarily, the remaining lines will be, only, the additional lines of the File 2 file !!

Then, the list of the different steps to do is quite simplified !

So, we start with two files, File 1, below :

Line A
Line C
Line D
Line F
Line G

and File 2, below :

Line A
Line B
Line C
Line D
Line E
Line F
Line G
Line H

Merge the contents of the two files File 1 and File 2, in a File 3 file. So, here are the contents of the File 3 file, below :

Line A
Line C
Line D
Line F
Line G
Line A
Line B
Line C
Line D
Line E
Line F
Line G
Line H

Perform a sort, on the contents of the File 3 file, with the menu option Edit - Line Operations - Sort Lines Lexicographically Ascending. The File 3 file is, now, changed, as below :

Line A
Line A
Line B
Line C
Line C
Line D
Line D
Line E
Line F
Line F
Line G
Line G
Line H

Add an empty line, at the very end of the File 3 file ( just so that the next S/R works correctly, in all cases )

Now, we just have to run a single S/R, on the contents of the File 3 file :
- Open the Replace dialog ( CTRL + H )
- Get back to the very beginning of the file ( CTRL + Home )
- SEARCH ^(.+\R)\1
- REPLACE NOTHING
- Check the Regular expression search mode
- Uncheck the . matches newline option
- Click on the Replace All button

-> The final state of the File 3 file, as you would expect, should be, as below :

Line B
Line E
Line H

Notes :

This regex deletes all the two same lines, which come, both, from File 1 AND File 2 files
Compared to my previous post, the regex is simpler, because we don’t have to take into account the mark character, at the end of each line
- The part .+ represents all the standard characters, of each non empty line
- The syntax \R represents the EOL character(s), of each non empty line
- The round brackets (....) , surrounding the part .+\R, store an entire line, with its EOL character(s), as group 1
- The form \1 stands for an entire line, identical to the previous one ( group 1 )
- As the Replace field is empty, the two consecutive identical lines are, therefore, deleted
In order to simplify the regex, I didn’t try to delete the possible empty lines
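
For comparison, the simplified method can be sketched in Python as well. Again, this is only an illustration under stated assumptions: the function name extra_lines is hypothetical, \n replaces \R, every line of File 1 is assumed to exist in File 2, and no line is assumed to be repeated inside a single file.

```python
import re

def extra_lines(file1_text: str, file2_text: str) -> str:
    """Return the lines that appear only once after merging the two texts."""
    # Merge both texts, drop empty lines, then sort so twins become adjacent.
    lines = [line for line in file1_text.splitlines() + file2_text.splitlines()
             if line]
    # Ending every line with \n mirrors the "add an empty line" step above.
    text = "".join(line + "\n" for line in sorted(lines))

    # Group 1 holds one whole line with its EOL; \1 requires an identical
    # line right after it, and the pair is replaced with nothing.
    return re.sub(r"^(.+\n)\1", "", text, flags=re.MULTILINE)

if __name__ == "__main__":
    file1 = "Line A\nLine C\nLine D\nLine F\nLine G\n"
    file2 = ("Line A\nLine B\nLine C\nLine D\n"
             "Line E\nLine F\nLine G\nLine H\n")
    print(extra_lines(file1, file2))
```

Note that a line present three times in the merged text would leave one copy behind, exactly as it would with the S/R in Notepad++.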

I hope that this new version, of the method, will be clearer !

Cheers,

guy038

P.S. :

When you say :

If this is so then it would be an impossible task, there are several thousand lines

I just don’t understand ?! Indeed, the previous S/R, to add a mark :

SEARCH .$

REPLACE $0\t1

takes, almost, the same time, to be performed on a file of 10 lines only OR on a huge file with 100,000 lines !

As for me, on my old Win XP laptop, with 1 GB of RAM, only, and N++ v6.8.8, this S/R takes about 8 seconds, on a 100,000 lines file :-)))