[nsfw] Help extracting links from page source code

Ravi K

[NSFW]
Hi Guys, I have this source code of a page.
https://workupload.com/file/mGszjxtb7vB

What I want to achieve.

Extract all direct image file links like this one from the source code text file above:

h ttps://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/amy038SRS_232716004.jpg

and discard rest of source code.

Then add /3000/ before the jpg file name in the link, so the link above
becomes

h ttps://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716004.jpg

All links on separate lines. And finally, save!
I want to do this on multiple text files I have.
I failed at the first step. Regex is my weakness.

/a/ can be any letter in other files; /amy038/ and /232716/ will also be different for other files.

Could anyone help me create a macro for this?

The cherry on top would be if the final text file were saved under the name amy038-232716.txt

Thanks

guy038

Hello, @ravi-k and All,

After downloading your page.txt file, I could identify, with regexes and sorting, 763 links to a .jpg picture, found on three kinds of lines and divided into four classes :

254 lines <a href="https ............... .jpg ...............> <img src="http ..... /thumbs/ ........ .jpg .............></a>, so containing 2 x 254 = 508 links
254 lines Content-Location: http ... cdn04 .............. .jpg ....., so 254 links
1 line https .... www.atkpetites.com .................... .jpg, so 1 link

After sorting, we get :

<a href="https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/amy038SRS_232716001.jpg
....
....  (A)
....
<a href="https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/amy038SRS_232716254.jpg



<img src="http://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/thumbs/amy038SRS_232716001.jpg
....
....  (B)
....
<img src="http://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/thumbs/amy038SRS_232716254.jpg



Content-Location: http://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/thumbs/amy038SRS_232716001.jpg
....
....  (C)
....
Content-Location: http://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/thumbs/amy038SRS_232716254.jpg



Content-Location: https://www.atkpetites.com/css/images/america_flag.jpg  (D)

My question is : in order to get the right search regex and isolate the right links, which kind of links are you looking for, among the four lists A, B, C and D above ?

Best Regards,

guy038

Ravi K

@guy038 A.

and adding /3000/ before jpg.

https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716001.jpg

guy038

Hi, @ravi-k and All,

Assuming that :

All your files are located in a specific folder
The 3 digits, before the string .jpg, are reserved for numbering
You want to add the 3000 folder in any link, right before the picture name

Here is the road map :

Duplicate this folder. So the duplicated folder contains exactly the same files as the initial folder
Open the Find in Files dialog ( Ctrl + Shift + F )
- SEARCH (?s-i)(?:(\A)|).+?<a\x20href="(?-s)(https?[^>\r\n]+/)((.+?)...\.jpg)|(?s).+
- REPLACE (?1\4.txt\r\n\r\n)?2\23000/\3\r\n
- FILTERS *.txt
- DIRECTORY The absolute path to the duplicated folder
- Select the Regular expression search mode
- Click on the Replace in Files button and accept the confirmation dialog

From your downloaded example, you should get this kind of output, which should look similar for all the files of your duplicated folder :

amy038SRS_232716.txt

https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716001.jpg
https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716002.jpg
https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716003.jpg
https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716004.jpg
.....
.....
.....
https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716251.jpg
https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716252.jpg
https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716253.jpg
https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716254.jpg
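
For readers who would rather script this step than run it from the Find in Files dialog, here is a minimal Python sketch of the same idea. It is not the regex S/R above: it uses a simplified pattern and assumes that every wanted link sits inside an <a href="..."> attribute and ends in .jpg, and that the last three characters before .jpg are the picture number. The names LINK_RE, extract_links and list_title are made up for this illustration.

```python
import re

# Assumption: wanted links (list A) are the href value of an <a> tag and end in ".jpg".
# Group 1 = URL up to and including the last "/", group 2 = picture file name.
LINK_RE = re.compile(r'<a\s+href="(https?://[^"\s>]+/)([^"/\s>]+\.jpg)')


def extract_links(page_source, folder="3000"):
    """Return the list-A links with `folder` inserted right before the file name."""
    return [f"{base}{folder}/{name}" for base, name in LINK_RE.findall(page_source)]


def list_title(page_source):
    """Build a file name from the first link: picture name minus 'NNN.jpg', plus '.txt'."""
    match = LINK_RE.search(page_source)
    if match is None:
        return None
    # Drop the three numbering characters and the ".jpg" extension (7 characters).
    return match.group(2)[:-7] + ".txt"
```

Joining the result of extract_links with "\n" and writing it under the name returned by list_title would give a file close to the output shown above. Links found in <img src="..."> attributes or Content-Location headers are ignored on purpose, since only list A is wanted.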

Now, the remaining goal is to rename each file of your duplicated folder, with the Python, Lua or NppExec script plugin, to the name located in the very first line of that file !

This seems fairly easy and I bet that some scripting gurus of the N++ community will find a solution very soon !
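
As one possible starting point, here is a rough sketch in plain Python (run outside Notepad++, not through a plugin API). It assumes that the regex S/R has already been applied, so that the first line of each .txt file holds the wanted file name. The function name rename_by_first_line is hypothetical, and the safety checks are my own guesses at what is sensible.

```python
from pathlib import Path


def rename_by_first_line(folder):
    """Rename every *.txt file in `folder` to the name found on its first line.

    Returns a list of (old_name, new_name) pairs for the files actually renamed.
    """
    renamed = []
    # Build the list first, so files renamed below are not visited a second time.
    for path in sorted(Path(folder).glob("*.txt")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            first_line = handle.readline().strip()
        # Assumption: a valid first line is a bare "something.txt" name, not a path.
        if not first_line.endswith(".txt") or "/" in first_line or "\\" in first_line:
            continue
        target = path.with_name(first_line)
        # Never overwrite an existing file, and skip files already correctly named.
        if target.exists():
            continue
        path.rename(target)
        renamed.append((path.name, target.name))
    return renamed
```

The name stays on the first line of the renamed file. Try it on a copy of the folder first.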

However, test my regex S/R against all your files first, to check for possible issues and/or improvements !

Best Regards,

guy038

P.S. :

BTW, I noticed that your sample file contains, both :

Some lines with Windows line endings ( CRLF )
Some lines with Unix line endings ( LF )

Fortunately, this does not perturb the regex S/R. If you prefer to deal with Unix files only, simply change the replacement regex to :

REPLACE (?1\4.txt\n\n)?2\23000/\3\n
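
If the mixed line endings ever become a problem in a script, they can be unified beforehand. The following is only a small Python sketch of that idea, and the helper name to_unix_line_endings is invented for this example.

```python
def to_unix_line_endings(text):
    """Return `text` with every CRLF or lone CR replaced by a single LF."""
    # Convert CRLF first, so that the lone-CR pass cannot double any line break.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

When reading the file, open it with newline="" so that Python hands over the original line endings unchanged.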

Ravi K

@guy038 This is exactly what I wanted. Thanks for that complex RegEx.
No issues.