Comparing two txt files. Finding differences.

Rafal Jonca

Hello, I have a question:

Can this be done with Notepad++?

I have two txt files.

The first looks like this:

A.

http://ftpcdd.cnig.es/LIDAR/2010_LOTE1_ARAGON_NORTE/LAZ/Huso_30/PNOA_2010_LOTE1_ARA-NORTE_594-4650_ORT-CLA-COL.LAZ 14.62
http://ftpcdd.cnig.es/LIDAR/2010_LOTE5_CyL_NE_RIOJA/LAZ/Huso_30/PNOA_2010_Lote5_CYL-RIO_492-4682_ORT-CLA-COL.LAZ 51.23
http://ftpcdd.cnig.es/LIDAR/2010_LOTE5_CyL_NE_RIOJA/LAZ/Huso_30/PNOA_2010_Lote5_CYL-RIO_492-4694_ORT-CLA-COL.LAZ 34.60
http://ftpcdd.cnig.es/LIDAR/2010_LOTE5_CyL_NE_RIOJA/LAZ/Huso_30/PNOA_2010_Lote5_CYL-RIO_490-4694_ORT-CLA-COL.LAZ 31.20

The second looks like this:

B.

PNOA_2010_LOTE1_ARA-NORTE_594-4650_ORT-CLA-COL.LAZ
PNOA_2010_Lote5_CYL-RIO_492-4682_ORT-CLA-COL.LAZ
PNOA_2010_Lote5_CYL-RIO_492-4694_ORT-CLA-COL.LAZ
PNOA_2010_Lote5_CYL-RIO_490-4694_ORT-CLA-COL.LAZ

The first is just a list of URLs to download, and the second is a list of the files that have already been downloaded.

Is it possible to compare these two lists and find the URLs of the files that haven’t been downloaded? Then it would be quite easy to download the missing files manually.

I have many lists like these :) And re-downloading produces further errors, and so on.

Thank you,

Scott Sumner

@Rafal-Jonca

Don’t think in terms of “comparing” the files. Although that can be made to work, there is an easier way.

Try the following:

Combine the contents of the two files into one file, in the order you’ve shown them (“A” first at the top of the new file, “B” at the bottom of the new file).

Invoke the Mark… feature (Search menu) and set up the following:

Find what zone: ([\w-]+\.LAZ)(?s)(?=.*?^\1)
Mark line checkbox: ticked
Wrap around checkbox: ticked
Search mode radio-button: Regular expression

Press the Mark All button.

This will highlight in red and will bookmark all of the occurrences of the files that you have already downloaded. It is a simple matter from there to delete the bookmarked lines (Search (menu) -> Bookmark -> Remove Bookmarked Lines) to get the list of URLs yet to download.
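For anyone who wants to check the behaviour outside Notepad++, here is a minimal Python sketch of the same mark-and-remove idea. It is an approximation, not what Notepad++ does internally: the function name remove_marked_lines, the constant MARK_PATTERN and the example.invalid URLs are made up for illustration, and Python needs the "dot matches newline" setting passed as a flag instead of an inline (?s) in the middle of the expression.

```python
import re

# Same expression as in the Mark dialog; (?s) is replaced by re.DOTALL and
# re.MULTILINE makes ^ match at the start of every line (assumed equivalent).
MARK_PATTERN = re.compile(r"([\w-]+\.LAZ)(?=.*?^\1)", re.DOTALL | re.MULTILINE)


def remove_marked_lines(combined: str) -> str:
    """Drop every line that contains a match, like Remove Bookmarked Lines."""
    marked = set()
    for match in MARK_PATTERN.finditer(combined):
        # Line index of the match = number of newlines before its start.
        marked.add(combined.count("\n", 0, match.start()))
    lines = combined.split("\n")
    return "\n".join(line for i, line in enumerate(lines) if i not in marked)


# "A" list (URLs) on top, "B" list (downloaded file names) at the bottom.
combined_text = (
    "http://example.invalid/dir/a_1.LAZ 14.62\n"
    "http://example.invalid/dir/b_2.LAZ 51.23\n"
    "a_1.LAZ\n"
)
remaining = remove_marked_lines(combined_text)
print(remaining)
```

In this sketch the lines of the "B" list are never marked, so they stay at the bottom of the result, below the URLs that still need downloading.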

If this (or ANY posting on the Notepad++ Community site) is useful, don’t reply with a “thanks”, simply up-vote ( click the ^ in the ^ 0 v area on the right ).

Sample of the marking: [image hosted on Imgur]

Rafal Jonca

Hmm, it works excellently with all sets, with the exception of one set of URLs.

Every time, it gets to line 1013, as you can see, and then marks everything below it in red.

https://www.sendspace.com/file/3ytn9k

Scott Sumner

@Rafal-Jonca

Hmmm…well, I’m not opposed to using new (to me) hosting sites, but sendspace thinks I’m going to give it a credit card number that it “won’t charge”, so, ah, No, sorry… Suggest putting your file on a different hosting site (e.g. http://textuploader.com/) and I’ll have a look.

There was some discussion in another thread about this general technique causing all the text in the document to be redmarked, so I guess I’m now starting to question this technique, or at least my usage of it (maybe the regular expression is not restrictive enough).

Rafal Jonca

??? Sendspace is free for everyone?

OK, I know how to use Imgur now. It looks like this:

https://imgur.com/a/Afvof

Scott Sumner

@Rafal-Jonca

Okay, I guess I did the wrong thing on sendspace…oops. :-)

I see the redmarking but to diagnose further I think I need the WHOLE file if you can share it as TEXT, not an image…

Rafal Jonca

I think the suffix “000.LAZ” is causing my problems :) It is different on line 1014.

I will check it carefully and let you know.

Rafal Jotski

Yes, I confirm. The URLs ending in 000.LAZ were causing the problems.

That is because I was later converting the URLs to the <a href="http_ form, and I had broken links at these points. As a result, all the ones with 000.LAZ were left out.

So your method helped me find the error spots :) It is working excellently now.

Could I ask you for a detailed explanation of how ([\w-]+\.LAZ)(?s)(?=.*?^\1) works?

Scott Sumner

@Rafal-Jotski

Could I ask you for detailed explanation…

Sure.

Look for any string of one or more word characters (defined as A-Z, a-z, 0-9, or _) or a -, followed by a literal .LAZ. The wrapping parentheses cause the matching string to be remembered as capture group #1. The (?s) means that any following . characters in the expression can match across line boundaries (usually a line boundary stops the match). Next comes a partial expression that starts with (?= and ends a bit later with ). This is merely an assertion that whatever is inside it occurs at some point later in the document. In this case what is inside that wrapper is .*?, which skips ahead over as little text as necessary, then a ^, which means “start of a line”, followed by \1, which is the same text as matched earlier (your xxxx.LAZ).

Since what occurs inside the (?= and ) is just an assertion it must match but does not contribute to the match, thus it isn’t colored red.
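The pieces described above can be poked at in a short Python sketch. This is only an illustration under assumptions: Python's re module stands in for the Notepad++ regex engine, the (?s) is expressed as the re.DOTALL flag, and the sample text and variable names are invented.

```python
import re

# One URL line followed by one already-downloaded file name (invented data).
text = "http://example.invalid/x/tile_7.LAZ 14.62\ntile_7.LAZ\n"

expression = r"([\w-]+\.LAZ)(?=.*?^\1)"

# With DOTALL the . inside the lookahead may cross line boundaries.
across_lines = re.compile(expression, re.DOTALL | re.MULTILINE)
# Without DOTALL the lookahead is stuck on the current line.
single_line = re.compile(expression, re.MULTILINE)

m = across_lines.search(text)
matched_text = m.group(0)   # what would be coloured red
captured = m.group(1)       # capture group #1, reused by \1
first_line_end = text.index("\n")
no_match = single_line.search(text)

print(matched_text, captured, m.end() <= first_line_end, no_match)
```

The match ends on the first line even though the assertion had to look at the second line to succeed, which is why only the file name itself is coloured. Without the dot-matches-newline setting the assertion can never reach a later line, so nothing is found.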

I think this may be fairly easy to understand, but maybe not to write from ground up, and it definitely isn’t easy to describe as per the above. I hope this helps in some way…

chcg

https://github.com/pnedev/compare-plugin might help you for simple ordered file lists, as might standalone diff programs such as KDiff3 or WinMerge.
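If a small script is an option, the same question can also be answered without any diff tool. The following is a rough Python sketch, not something taken from the posts above: the function name missing_urls and the sample data are hypothetical, and it assumes each line of the first list is a URL optionally followed by a size, as in the question.

```python
import posixpath
from urllib.parse import urlparse


def missing_urls(url_lines, downloaded_names):
    """Return the URLs whose file name is not in the downloaded list."""
    done = {name.strip() for name in downloaded_names if name.strip()}
    missing = []
    for line in url_lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        url = parts[0]  # ignore the trailing size column
        filename = posixpath.basename(urlparse(url).path)
        if filename not in done:
            missing.append(url)
    return missing


# Invented sample data in the same shape as lists "A" and "B".
list_a = [
    "http://example.invalid/d/a_1.LAZ 14.62",
    "http://example.invalid/d/b_2.LAZ 51.23",
    "",
]
list_b = ["a_1.LAZ\n"]
todo = missing_urls(list_a, list_b)
print(todo)
```

The comparison here is exact and case-sensitive, so a file saved under a differently cased name would be reported as missing.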