[nsfw] Help extracting links from page source code
-
[NSFW]
Hi Guys, I have this source code of a page.
https://workupload.com/file/mGszjxtb7vB
What I want to achieve:
- Extract all direct image file links like this from source code text file above.
h ttps://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/amy038SRS_232716004.jpg
and discard rest of source code.
- Then add /3000/ before the jpg file name in the link, so the above link
will become
h ttps://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716004.jpg
All links on separate lines. And finally, save!
I want to do this on multiple text files I have.
I failed at the first step; regex is my weakness. I want to do the same on multiple files.
/a/ can be any letter in other files, and /amy038/ and /232716/ will also be different for other files. If anyone can help me create a macro for this.
Now, the cherry on top would be if the final text file is saved with the name amy038-232716.txt
Thanks
-
Hello, @ravi-k and All,
After downloading your page.txt file, I could identify, with regexes and sorting, 763 links to a .jpg picture, divided into four classes:
- 254 lines <a href="https ............... .jpg ...............> <img src="http ..... /thumbs/ ........ .jpg .............></a>, so containing 2 x 254 = 508 links
- 254 lines Content-Location: http ... cdn04 .............. .jpg ....., so 254 links
- 1 line https .... www.atkpetites.com .................... .jpg, so 1 link
After sorting, we get :
<a href="https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/amy038SRS_232716001.jpg
....
....    (A)
....
<a href="https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/amy038SRS_232716254.jpg
<img src="http://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/thumbs/amy038SRS_232716001.jpg
....
....    (B)
....
<img src="http://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/thumbs/amy038SRS_232716254.jpg
Content-Location: http://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/thumbs/amy038SRS_232716001.jpg
....
....    (C)
....
Content-Location: http://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/thumbs/amy038SRS_232716254.jpg
Content-Location: https://www.atkpetites.com/css/images/america_flag.jpg    (D)
My question is: in order to get the right search regex and isolate the right links, which kind of links are you looking for, among the four lists A, B, C or D above?
Best Regards,
guy038
-
-
@guy038 A.
and adding /3000/ before jpg.
https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716001.jpg
-
Hi, @ravi-k and All,
Assuming that:
- All your files are located in a specific folder
- The 3 digits, before the string .jpg, are reserved for numbering
- You want to add the 3000 folder in any link, right before the picture name
Here is the road map:
- Duplicate this folder. So the duplicated folder contains exactly the same files as the initial folder
- Open the Find in Files dialog (Ctrl + Shift + F)
- SEARCH (?s-i)(?:(\A)|).+?<a\x20href="(?-s)(https?[^>\r\n]+/)((.+?)...\.jpg)|(?s).+
- REPLACE (?1\4.txt\r\n\r\n)?2\23000/\3\r\n
- FILTERS *.txt
- DIRECTORY The absolute path to the duplicated folder
- Select the Regular expression search mode
- Click on the Replace in Files button and validate the confirmation dialog
From your downloaded example, you should get this kind of output, which should be similar for all files of your duplicated folder:
amy038SRS_232716.txt

https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716001.jpg
https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716002.jpg
https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716003.jpg
https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716004.jpg
.....
.....
.....
https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716251.jpg
https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716252.jpg
https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716253.jpg
https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716254.jpg
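For readers who would rather script the transformation than run it through Find in Files, the search and replacement described above can be approximated in Python. This is only a minimal sketch, not the regex from the post: the names LINK_RE and rewrite_links are hypothetical, and it assumes every wanted link sits in an <a href="..."> attribute and ends with three digits followed by .jpg. It also builds the output name the original question asked for (amy038-232716.txt) from the last two folders of the URL, rather than the amy038SRS_232716.txt header produced by the S/R.

```python
import re

# Hypothetical pattern, loosely mirroring the S/R above:
#   group 1 = URL up to and including the last "/"
#   group 2 = picture name without its numbering
#   group 3 = the 3 digits reserved for numbering
LINK_RE = re.compile(r'<a\s+href="(https?://[^">\s]+/)([^"/>\s]+?)(\d{3})\.jpg')


def rewrite_links(source, folder="3000"):
    """Return (output_name, links) for one page source.

    Each link found in an <a href="..."> attribute gets `folder`
    inserted right before the picture name. output_name is None
    when no link is found.
    """
    links = []
    output_name = None
    for match in LINK_RE.finditer(source):
        base, name, counter = match.groups()
        if output_name is None:
            # Assumption: the last two folders of the URL identify the set,
            # e.g. ".../amy038/232716/" gives "amy038-232716.txt".
            parts = base.rstrip("/").split("/")
            output_name = f"{parts[-2]}-{parts[-1]}.txt"
        links.append(f"{base}{folder}/{name}{counter}.jpg")
    return output_name, links
```

To process a folder, one could read each .txt file, call rewrite_links on its contents, and write "\n".join(links) to a new file named output_name. The <img src=...> thumbnails and Content-Location lines are ignored because the pattern only accepts <a href="...">.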
Now, the goal is, with the Python, Lua or NppExec script plugin, to replace the current name of each file in your duplicated folder with the name located in the very first line of that file!
This seems fairly easy, and I bet that some script gurus in the N++ community will find a solution very soon!
However, test my regex S/R against all your files first, to check for possible issues and/or improvements!
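As an illustration of that renaming step, here is a minimal sketch in plain Python (run outside Notepad++, not through a plugin API). The function name rename_by_first_line is hypothetical, and the sketch assumes each processed file starts with a line such as amy038SRS_232716.txt, as the S/R above produces. Files whose first line does not look like a plain .txt name, or whose target name already exists, are left alone.

```python
from pathlib import Path


def rename_by_first_line(folder):
    """Rename every *.txt file in `folder` to the name on its first line.

    Returns a list of (old_name, new_name) pairs for the files renamed.
    """
    renamed = []
    # sorted() materialises the listing before any file is renamed.
    for path in sorted(Path(folder).glob("*.txt")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            first_line = handle.readline().strip()
        # Skip anything that is not a bare "something.txt" file name.
        if not first_line.endswith(".txt") or Path(first_line).name != first_line:
            continue
        target = path.with_name(first_line)
        # Never overwrite an existing file, and skip no-op renames.
        if target == path or target.exists():
            continue
        path.rename(target)
        renamed.append((path.name, target.name))
    return renamed
```

The header line is kept inside the file; if it is unwanted, it would have to be removed in a separate step. As with the S/R itself, it is safer to try this on the duplicated folder first.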
Best Regards,
guy038
P.S.:
BTW, I noticed that your sample file contains both:
- Some lines with Windows line endings (CRLF)
- Some lines with Unix line endings (LF)
Fortunately, this does not perturb the regex S/R. If you prefer to deal with Unix files only, simply change the replacement regex to:
REPLACE (?1\4.txt\n\n)?2\23000/\3\n
-
-
@guy038 This is exactly what I wanted. Thanks for that complex RegEx.
No issues.