question about compare with additional special chars and wildcard

Daniel B. 0

Hello,

I have two text documents that I would like to synchronize, but I have no idea how.

The files are called exist.txt and download.txt. exist.txt contains folder names, one per line, and I would like to match them line by line against download.txt. The conditions are that the match must use a wildcard and that the folder name must have the ASCII character \x02 before and after it.

as an example

content of exist.txt

my.folder1
my.folder2
my.folder3

content of download.txt

anything\x02my.folder1\x02whatever
dream\x02my.folder2\x02country

If a match is found, the matching line should be removed from download.txt.

I would be happy to receive ideas or tips. Thanks in advance.

guy038

Hello, @daniel-b-0 and All,

Not difficult with regexes ! Just follow the road map below :

First, rename your download.txt file as download_SVG.txt
Open your two files exist.txt and download_SVG.txt in Notepad++
Now, open a new file in Notepad++
Append the contents of your download_SVG.txt file in this new file
Then, at the very end of the new file, append a line of some equal signs
Finally, append the contents of your exist.txt file, right below the line of equal signs
Save this new file as download.txt

Thus, for example, your new download.txt file would temporarily look like below ( the \x02 control characters around each folder name are not visible here ) :

anythingmy.folder1whatever
dreammy.folder2country
dreammy.folder3
anythingmy.folder4whatever
anythingmy.folder5whatever
=====================================
my.folder1
my.folder3
my.folder5

Open the Replace dialog ( Ctrl + H )
SEARCH (?-si)^.+?\x02(.+)\x02.*\R(?=(?s).+?\1)|(?s)^=+.+
REPLACE Leave EMPTY
Check the Wrap around option
Select the Regular expression mode
Click on the Replace All button

=> Here you are : all the lines whose folder was present twice in the file are deleted. So only the folders not downloaded yet remain :

dreammy.folder2country
anythingmy.folder4whatever

Re-save your final download.txt file

Maybe, when you said :

… and ascii \x02 before and after.

you meant the literal four-character string \x02, not the control character.

In that case, the S/R above must be changed to :

SEARCH (?-si)^.+?\\x02(.+)\\x02.*\R(?=(?s).+?\1)|(?s)^=+.+
REPLACE Leave EMPTY
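
For readers who prefer a script, here is a minimal Python sketch of the same idea, assuming the delimiter is the real \x02 control character. The function name remove_existing and the sample data are hypothetical, chosen to mirror the example above; this is not part of the Notepad++ procedure itself.

```python
import re

# Zone between the first and the last \x02 (STX) character of a line.
ZONE_RE = re.compile(r"\x02(.+)\x02")

def remove_existing(download_lines, exist_names):
    """Return the download lines whose delimited zone is NOT in exist_names."""
    existing = set(exist_names)
    kept = []
    for line in download_lines:
        match = ZONE_RE.search(line)
        if match and match.group(1) in existing:
            continue  # folder already exists: drop the line
        kept.append(line)
    return kept

# Hypothetical sample data mirroring the example above.
download = [
    "anything\x02my.folder1\x02whatever",
    "dream\x02my.folder2\x02country",
    "dream\x02my.folder3\x02",
    "anything\x02my.folder4\x02whatever",
    "anything\x02my.folder5\x02whatever",
]
exist = ["my.folder1", "my.folder3", "my.folder5"]
result = remove_existing(download, exist)
# result keeps only the my.folder2 and my.folder4 lines
```

Note one difference in behaviour : the look-ahead .+?\1 in the regex accepts the folder name anywhere later in the file, even as part of a longer name, whereas this sketch compares whole names only.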

Best Regards

guy038

Daniel B. 0

thank you very much! @guy038 i am really amazed that regex can be so versatile. it does exactly what it is supposed to do!

guy038

Hi, @daniel-b-0,

Just for info :

Did you mean the C0 control code \x02 ( STX ) or the literal expression \x02 ?

BR

guy038

Daniel B. 0

Hi, @guy038,

It was about the control code, and your solution works very well! Unfortunately, Notepad++ is very, very slow with more than 4000 lines.

BR

Daniel

guy038

Hi, @daniel-b-0 and All,

Last UPDATED on 2024/05/22 : In the first version of this post, I showed some real names of my personal photos. On reflection, I decided, for confidentiality, to change them and only show non-personal data !!

I understand that my method cannot be used safely with large files. So, I’m going to present a second method which should work in all cases !

I tested this new method with real data : A USB key of mine, containing 8,186 photos, collected over a period from 2004 to 2023

( Don’t worry, these photos are also stored on two external hard drives. In all circumstances, we must imitate Mother Nature, which uses RNA to code proteins and, NEVER, DNA itself for this purpose !! )

The general organisation of my USB drive is :

G:\_PHOTOS\2004\06_11-22_xxxxxxx - xxxxxxxxx - xxxxxxxxxxxxxx \01.jpg
G:\_PHOTOS\2004\06_11-22_xxxxxxx - xxxxxxxxx - xxxxxxxxxxxxxx \02.jpg
G:\_PHOTOS\2004\06_11-22_xxxxxxx - xxxxxxxxx - xxxxxxxxxxxxxx \03.jpg
G:\_PHOTOS\2004\06_11-22_xxxxxxx - xxxxxxxxx - xxxxxxxxxxxxxx \03_ORG.jpg
G:\_PHOTOS\2004\06_11-22_xxxxxxx - xxxxxxxxx - xxxxxxxxxxxxxx \04.jpg
G:\_PHOTOS\2005\01_24-29_SKI_xxxx xxxx xxxxx\01.jpg
G:\_PHOTOS\2005\01_24-29_SKI_xxxx xxxx xxxxx\02.jpg
G:\_PHOTOS\2005\01_24-29_SKI_xxxx xxxx xxxxx\03.jpg
G:\_PHOTOS\2005\01_24-29_SKI_xxxx xxxx xxxxx\04.jpg
G:\_PHOTOS\2005\01_24-29_SKI_xxxx xxxx xxxxx\05.jpg
G:\_PHOTOS\2005\01_24-29_SKI_xxxx xxxx xxxxx\06.jpg
G:\_PHOTOS\2005\01_24-29_SKI_xxxx xxxx xxxxx\07.jpg
G:\_PHOTOS\2005\01_24-29_SKI_xxxx xxxx xxxxx\08.jpg
G:\_PHOTOS\2005\01_24-29_SKI_xxxx xxxx xxxxx\09.jpg
G:\_PHOTOS\2005\01_24-29_SKI_xxxx xxxx xxxxx\10.jpg
G:\_PHOTOS\2005\03_22_SKI_xx xxxxxxx\01.jpg
G:\_PHOTOS\2005\03_22_SKI_xx xxxxxxx\02.jpg
G:\_PHOTOS\2005\03_22_SKI_xx xxxxxxx\03.jpg
G:\_PHOTOS\2005\08_22_xxxx xxxxxx\01.jpg
G:\_PHOTOS\2006\01_07_xxxxxxx xxxxxxxxxxx\01.jpg
...
...
...
G:\_PHOTOS\2023\10_01_xxxxx_xxxxx.jpg
G:\_PHOTOS\2023\10_01_xxxxx_xxxxx.jpg
G:\_PHOTOS\2023\10_08xxxxx xxxxx xxxxxxxxxxxx\01.jpg
G:\_PHOTOS\2023\10_22_xxxxx_xxxxx_xxxxx\01.jpg
G:\_PHOTOS\2023\12_02_xxxx_xxxxxx_xxxxxx\01.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\01.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\02.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\03.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\04.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\05.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\06.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\07.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\08.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\09.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\10.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\11.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\12.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\13.jpg
G:\_PHOTOS\2023\12_26_xxxxx xxxxxxxxx xx xxxx xxxxxxx\01.jpg
G:\_PHOTOS\2023\12_26_xxxxx xxxxxxxxx xx xxxx xxxxxxx\02.jpg
G:\_PHOTOS\2023\12_26_xxxxx xxxxxxxxx xx xxxx xxxxxxx\03.jpg
G:\_PHOTOS\2023\12_31_xxxxxx - xxxxxxxx\01.jpg

So, the files are sorted by year, then by motif ( month_day[-day]_location_reason or, sometimes, month_day[-day]_reason_location ) and finally by photo number, sometimes followed by the initial of the person who took the photo ( -A for Annie, my sister, -X for unknown, etc. )

In order to mimic your download.txt file, I placed the \x02 delimiters right after the G:\_PHOTOS\ part and right before the \xx.jpg part, giving this format ( the \x02 characters are not visible in the listing below ) :

G:\_PHOTOS\2004\06_11-22_xxxxxxx - xxxxxxxxx - xxxxxxxxxxxxxx \01.jpg
G:\_PHOTOS\2004\06_11-22_xxxxxxx - xxxxxxxxx - xxxxxxxxxxxxxx \02.jpg
G:\_PHOTOS\2004\06_11-22_xxxxxxx - xxxxxxxxx - xxxxxxxxxxxxxx \03.jpg
G:\_PHOTOS\2004\06_11-22_xxxxxxx - xxxxxxxxx - xxxxxxxxxxxxxx \03_ORG.jpg
G:\_PHOTOS\2004\06_11-22_xxxxxxx - xxxxxxxxx - xxxxxxxxxxxxxx \04.jpg
G:\_PHOTOS\2005\01_24-29_SKI_xxxx xxxx xxxxx\01.jpg
G:\_PHOTOS\2005\01_24-29_SKI_xxxx xxxx xxxxx\02.jpg
G:\_PHOTOS\2005\01_24-29_SKI_xxxx xxxx xxxxx\03.jpg
G:\_PHOTOS\2005\01_24-29_SKI_xxxx xxxx xxxxx\04.jpg
G:\_PHOTOS\2005\01_24-29_SKI_xxxx xxxx xxxxx\05.jpg
G:\_PHOTOS\2005\01_24-29_SKI_xxxx xxxx xxxxx\06.jpg
G:\_PHOTOS\2005\01_24-29_SKI_xxxx xxxx xxxxx\07.jpg
G:\_PHOTOS\2005\01_24-29_SKI_xxxx xxxx xxxxx\08.jpg
G:\_PHOTOS\2005\01_24-29_SKI_xxxx xxxx xxxxx\09.jpg
G:\_PHOTOS\2005\01_24-29_SKI_xxxx xxxx xxxxx\10.jpg
G:\_PHOTOS\2005\03_22_SKI_xx xxxxxxx\01.jpg
G:\_PHOTOS\2005\03_22_SKI_xx xxxxxxx\02.jpg
G:\_PHOTOS\2005\03_22_SKI_xx xxxxxxx\03.jpg
G:\_PHOTOS\2005\08_22_xxxx xxxxxx\01.jpg
G:\_PHOTOS\2006\01_07_xxxxxxx xxxxxxxxxxx\01.jpg
...
...
...
G:\_PHOTOS\2023\10_01_xxxxx_xxxxx.jpg
G:\_PHOTOS\2023\10_01_xxxxx_xxxxx.jpg
G:\_PHOTOS\2023\10_08xxxxx xxxxx xxxxxxxxxxxx\01.jpg
G:\_PHOTOS\2023\10_22_xxxxx_xxxxx_xxxxx\01.jpg
G:\_PHOTOS\2023\12_02_xxxx_xxxxxx_xxxxxx\01.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\01.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\02.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\03.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\04.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\05.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\06.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\07.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\08.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\09.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\10.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\11.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\12.jpg
G:\_PHOTOS\2023\12_15_xxxxxx xxxxxxx xxxxxxxx xxx\13.jpg
G:\_PHOTOS\2023\12_26_xxxxx xxxxxxxxx xx xxxx xxxxxxx\01.jpg
G:\_PHOTOS\2023\12_26_xxxxx xxxxxxxxx xx xxxx xxxxxxx\02.jpg
G:\_PHOTOS\2023\12_26_xxxxx xxxxxxxxx xx xxxx xxxxxxx\03.jpg
G:\_PHOTOS\2023\12_31_xxxxxx - xxxxxxxx\01.jpg

In this way, we are sure that the zones between delimiters are unique, even when the same folder name occurs under two different years, like, for instance :

G:\_PHOTOS\2010\00_abcde_fghij\01.jpg
...
...
G:\_PHOTOS\2011\00_abcde_fghij\01.jpg

Then, I randomized this file, using the N++ option :

Edit > Line Operations > Sort Lines Randomly

So my download.txt file looks like :

G:\_PHOTOS\2014\08_01_xxxxxxxx xxxxxxxxxxxx\009_G.jpg
G:\_PHOTOS\2010\03_06_SKI_xxxxxxxxxx-xxxxxxx\14.jpg
G:\_PHOTOS\2011\01_15_SKI_xxxxxxxxx-xxxxxxx\06.jpg
G:\_PHOTOS\2014\02_21-22_xxxxxxxxxx_xxxxxxxxxx xxxxxx\07.jpg
G:\_PHOTOS\2012\08_07-22_xxxxxxxx xxxxxxxxx\034_X.jpg
G:\_PHOTOS\2010\05_29_xxxxxxxxx xxxxxxx_xxxxxxxx\14.jpg
...
...
...
G:\_PHOTOS\2014\09_13_xxxxxxxxxx_xxxxxxxxxx\023.jpg
G:\_PHOTOS\2017\08_10-28_xx xxxx\013.jpg
G:\_PHOTOS\2010\10_30-31_xxxxxx_xxxxxxxxxxxx xxxxx\076_X.jpg
G:\_PHOTOS\2022\07_13-08_27_xx_xxxx\099_A.jpg
G:\_PHOTOS\2016\03_05-07_SKI_xxxxxxxxxxxx\006.jpg
G:\_PHOTOS\2014\03_24_SKI_xxxxxxx-xxxxxxxx\44.jpg

Secondly, I created an exist.txt file, made of all the different zones between the STX delimiters. I obtained a file of 366 lines, from which I randomly deleted 45, giving a final exist.txt file with 321 lines. So, at the end of the new method, we should get a file of all the lines containing one of the 45 missing zones !

Important :

For this to work correctly, you must use the latest version of Notepad++ ( v8.6.5 at the time of writing ), which improves the multi-selection process !
In all the search/replacements, listed below :
- The Wrap around option is checked
- The Regular expression search mode is checked
- All the other options are un-checked

Let’s go :

First, copy your download.txt file as mark.txt
Open the mark.txt file in N++
Open the Replace dialog ( Ctrl + H )
SEARCH (?-s)^.*\x02(.+)\x02.*
REPLACE $1
Click on the Replace All button

=> We just keep the zones between delimiters

Now, use the menu option Edit > Line Operations > Sort Lines Lexicographically Ascending
Re-open the Replace dialog ( Ctrl + H )
SEARCH (?-s)^(.+\R)\K\1+
REPLACE Leave EMPTY
Click on the Replace All button

=> The duplicate lines are deleted and your mark.txt file should have decreased drastically ! In my case, I did get a mark.txt file with only 366 different lines

Then, append your exist.txt at the end of the mark.txt file. In my case, the file contains 366 + 321 so 687 lines
Again, use the menu option Edit > Line Operations > Sort Lines Lexicographically Ascending
Re-open the Replace dialog ( Ctrl + H )
SEARCH (?-s)^(.+\R)\1
REPLACE Leave EMPTY
Click on the Replace All button

=> The mark.txt file should have decreased and now contains only the zones which require downloading. In my case, it contains, as expected, 45 lines / zones !
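
The steps up to this point ( extract the zones, remove duplicates, cancel out the zones also listed in exist.txt ) can be sketched in Python with a set difference. This is only an illustration, assuming real \x02 delimiters; the function name zones_to_download and the sample strings are hypothetical.

```python
import re

# Same shape as the SEARCH regex above: capture the zone between the delimiters.
ZONE_RE = re.compile(r"^.*\x02(.+)\x02.*$")

def zones_to_download(download_lines, exist_zones):
    """Return the sorted zones found in download_lines but absent from exist_zones."""
    zones = set()  # a set removes the duplicate zones automatically
    for line in download_lines:
        match = ZONE_RE.match(line)
        if match:
            zones.add(match.group(1))
    return sorted(zones - set(exist_zones))

# Hypothetical sample: two files in zone "z1", one file in zone "z2".
sample = ["a\x02z1\x02b", "c\x02z1\x02d", "e\x02z2\x02f"]
missing = zones_to_download(sample, ["z1"])
# missing is ["z2"]
```

Strictly speaking, the sort + (?-s)^(.+\R)\1 replacement keeps the lines unique to either file. A plain difference is used here on the assumption, true in my test, that every zone of exist.txt also occurs in download.txt.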

If the last line of the mark.txt file ends with an EOL, delete the EOL characters of this last line

Note :

If all or some lines contain sub-folders, you’ll have to replace any \ character with the literal \\ string
Now, on column 1, do a zero-length COLUMN selection of all the lines ( indication N × 0 in the status bar )
Type in a | pipe character
Hit the Home key
Hit the Backspace key

=> The file is changed into a one-line file

Hit the Home key, again
Delete the first | character
Finally, save the mark.txt file, now a single-line file

Remark :

If the entire line contains more than 2,000 characters, split this long line into several parts, right before a | char, and delete any | remaining at the beginning and/or end of the lines

For example :

abc|def|.......................|uvw|xyz
01|23|.........................|67|89

Of course, in this case, you'll have to REPEAT the MARK operation, described below, for each CREATED line
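
As an illustration of this joining and splitting step, here is a small Python sketch. The name build_patterns and the max_len parameter are hypothetical, and re.escape escapes every regex metacharacter ( not only the \ character ), so the resulting patterns are meant for Python's re module, not for pasting into the Notepad++ Mark dialog.

```python
import re

def build_patterns(zones, max_len=2000):
    """Join the escaped zones with '|' into patterns of at most max_len characters.

    A single zone longer than max_len still gets a pattern of its own.
    """
    patterns = []
    current = ""
    for zone in zones:
        piece = re.escape(zone)  # escapes '\', '.', etc.
        candidate = piece if not current else current + "|" + piece
        if current and len(candidate) > max_len:
            patterns.append(current)  # close the current pattern before a '|'
            current = piece
        else:
            current = candidate
    if current:
        patterns.append(current)
    return patterns

# With a tiny limit, the list is split into two patterns.
parts = build_patterns(["abc", "def", "uvw"], max_len=7)
# parts is ["abc|def", "uvw"]
```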

Now, copy your download.txt file as to_do.txt
Switch to the mark.txt tab, containing, most of the time, just a single line
Select all the text ( Ctrl + A )
Open the Mark dialog ( Ctrl + M )

=> The text should be automatically inserted in the dialog

Check the Bookmark line and Purge for each search options ( IMPORTANT )
Switch back to the to_do.txt tab
Click on the Mark All button

=> Message of the dialog Mark: xxx matches in entire file ( 876, in my case )

In the Bookmark margin, select, with the right-click button, the option Remove Unmarked Lines or use the menu option Search > Bookmark > Remove Unmarked Lines
Click on the Clear all marks button of the Mark dialog
Finally, save the to_do.txt file

=> You should get all the files that require downloading. In my test case, from the 45 zones to take into account, I got a list of 876 files / lines to “download” ;-))
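
The Mark All + Remove Unmarked Lines operation amounts to keeping every line matched by at least one of the patterns. A minimal Python sketch of that filter follows; the name keep_marked_lines and the sample data are hypothetical, and the patterns are assumed to be valid for Python's re module.

```python
import re

def keep_marked_lines(lines, patterns):
    """Keep the lines matched by at least one pattern (like bookmarked lines)."""
    compiled = [re.compile(p) for p in patterns]
    return [line for line in lines if any(c.search(line) for c in compiled)]

# Hypothetical sample: zones z2 and z9 still need downloading.
lines = ["a\x02z1\x02b", "c\x02z2\x02d", "e\x02z9\x02f"]
to_do = keep_marked_lines(lines, ["z2|z9"])
# to_do keeps the z2 and z9 lines
```

As with the Mark dialog, the search is a substring search, so a zone that is a prefix of another zone would also mark the lines of the longer one.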

Best Regards,

guy038

P.S. :

Here’s a tip to sum a list of numbers :

Do a multi-column selection of all these numbers, located anywhere in your current file
Paste them in a new tab
Do a zero-length COLUMN selection of all these numbers
Hit the + sign
Hit the Home key
Hit the Backspace key
Hit the End key
Insert the = sign
Copy all contents of this single line ( Ctrl + C )
Open calc.exe
Paste the contents of the clipboard ( Ctrl + V )

=> Here you are : the Windows calculator should show you the total of your list of numbers ;-)) No possibility of errors and a quick result !

You may even sum numbers in other bases !
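
For comparison, the same total can be obtained with a few lines of Python. This is just a sketch; the name sum_numbers is hypothetical, and the base parameter covers the "other bases" case.

```python
def sum_numbers(tokens, base=10):
    """Sum a list of number strings written in the given base."""
    return sum(int(token, base) for token in tokens)

total = sum_numbers(["366", "321"])           # the two line counts used earlier
hex_total = sum_numbers(["ff", "1"], base=16)
# total is 687 and hex_total is 256
```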