[nsfw] Help extracting links from page source code



  • [NSFW]
    Hi Guys, I have this source code of a page.
    https://workupload.com/file/mGszjxtb7vB

    What I want to achieve.

    1. Extract all direct image file links, like the one below, from the source code text file above.

    h ttps://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/amy038SRS_232716004.jpg

    and discard the rest of the source code.

    2. Then add /3000/ before the jpg file name in the link, so the above link
      will become

    h ttps://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716004.jpg

    All links on separate lines. And finally, save!
    I want to do this on the multiple text files I have.
    I failed at the first step. Regex is my weakness.

    I want to do the same on multiple files.
    /a/ can be any letter in other files, and /amy038/ and /232716/ will also be different for other files.

    Could anyone help me create a macro for this?

    The cherry on top would be if the final text file were saved with the name amy038-232716.txt

    Thanks



  • Hello, @ravi-k and All,

    After downloading your page.txt file, I could identify, with regexes and sorting, 763 links to a .jpg picture, spread over three kinds of lines and falling into four classes :

    • 254 lines <a href="https ............... .jpg ...............> <img src="http ..... /thumbs/ ........ .jpg .............></a>, so containing 2 x 254 = 508 links

    • 254 lines Content-Location: http ... cdn04 .............. .jpg ....., so 254 links

    • 1 line https .... www.atkpetites.com .................... .jpg, so 1 link

    After sorting, we get :

    <a href="https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/amy038SRS_232716001.jpg
    ....
    ....  (A)
    ....
    <a href="https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/amy038SRS_232716254.jpg
    
    
    
    <img src="http://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/thumbs/amy038SRS_232716001.jpg
    ....
    ....  (B)
    ....
    <img src="http://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/thumbs/amy038SRS_232716254.jpg
    
    
    
    Content-Location: http://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/thumbs/amy038SRS_232716001.jpg
    ....
    ....  (C)
    ....
    Content-Location: http://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/thumbs/amy038SRS_232716254.jpg
    
    
    
    Content-Location: https://www.atkpetites.com/css/images/america_flag.jpg  (D)
    

    My question is : in order to build the right search regex and isolate the right links, which kind of links are you looking for, among the four lists A, B, C and D above ?

    Best Regards,

    guy038



  • @guy038 A.

    and adding /3000/ before jpg.

    https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716001.jpg
    


  • Hi, @ravi-k and All,

    Assuming that :

    • All your files are located in a specific folder

    • The 3 digits, before the string .jpg, are reserved for numbering

    • You want to add the 3000 folder in any link, right before the picture name

    Here is the road map :

    • Duplicate this folder. So the duplicated folder contains exactly the same files as the initial folder

    • Open the Find in Files dialog ( Ctrl + Shift + F )

      • SEARCH (?s-i)(?:(\A)|).+?<a\x20href="(?-s)(https?[^>\r\n]+/)((.+?)...\.jpg)|(?s).+

      • REPLACE (?1\4.txt\r\n\r\n)?2\23000/\3\r\n

      • FILTERS *.txt

      • DIRECTORY The absolute path to the duplicated folder

      • Select the Regular expression search mode

      • Click on the Replace in Files button and confirm the dialog

    From your downloaded example, you should get this kind of output, and the result should be similar for all the files of your duplicated folder :

    amy038SRS_232716.txt
    
    https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716001.jpg
    https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716002.jpg
    https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716003.jpg
    https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716004.jpg
    .....
    .....
    .....
    https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716251.jpg
    https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716252.jpg
    https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716253.jpg
    https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716254.jpg
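
    For anyone who prefers to check the result outside Notepad++, the same transformation can be sketched in plain Python 3. This is only a minimal sketch under assumptions: the pattern is a simplified stand-in for the regex above (not an exact port of it), the function names extract_links and output_name are made up for illustration, and it assumes that every wanted link starts with <a href=" and ends with .jpg, as in list (A). The output_name helper builds the amy038-232716.txt style name requested in the first post from the two folders of the link.

```python
import re

# Simplified pattern for the class (A) links only: an <a href="..."> target ending in .jpg.
# Group 1 = the URL up to and including the last "/", group 2 = the picture file name.
LINK_RE = re.compile(r'<a\s+href="(https?://[^"\s>]+/)([^"/\s>]+\.jpg)')


def extract_links(source, folder="3000"):
    """Return the class (A) links, with `folder` inserted right before the file name."""
    return [f"{base}{folder}/{name}" for base, name in LINK_RE.findall(source)]


def output_name(link):
    """Build a name like amy038-232716.txt from the two folders preceding /3000/."""
    parts = link.split("/")
    # parts[-1] is the picture name, parts[-2] is the inserted folder,
    # parts[-3] and parts[-4] are the numeric folder and the name folder.
    return f"{parts[-4]}-{parts[-3]}.txt"
```

    Since <img src="..."> and Content-Location: lines do not start with <a href=", the thumbnail links of lists (B), (C) and (D) are left out by construction.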
    

    Now, the goal is, with the Python, Lua or NppExec script plugin, to rename each file of your duplicated folder with the name located on the very first line of that file !

    This seems fairly easy and I bet that some scripting gurus, in the N++ community, will find a solution, very soon !

    However, test my regex S/R against all your files, first, to verify possible issues and/or improvements !
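
    As a possible starting point for that renaming step, here is a minimal sketch, assuming a standalone Python 3 interpreter rather than a Notepad++ plugin. The function name rename_by_first_line is hypothetical, and the sketch assumes that the regex S/R has already been run, so that the first line of each .txt file holds the wanted file name. Files whose first line does not look like a plain .txt name, or whose target name already exists, are skipped.

```python
from pathlib import Path


def rename_by_first_line(folder):
    """Rename every .txt file in `folder` to the name found on its first line."""
    renamed = []
    # sorted() materializes the file list before any rename happens
    for path in sorted(Path(folder).glob("*.txt")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            new_name = handle.readline().strip()
        # Skip files with no usable name, or a name containing path separators
        if not new_name.endswith(".txt") or "/" in new_name or "\\" in new_name:
            continue
        target = path.with_name(new_name)
        if target == path or target.exists():
            continue  # already named correctly, or the name is already taken
        path.rename(target)
        renamed.append((path.name, target.name))
    return renamed
```

    Calling rename_by_first_line on the duplicated folder returns the list of (old name, new name) pairs, which makes it easy to review what was changed.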

    Best Regards,

    guy038

    P.S. :

    BTW, I noticed that your sample file contains, both :

    • Some lines with Windows line endings ( CRLF )

    • Some lines with Unix line endings ( LF )

    Fortunately, this does not disturb the regex S/R. If you prefer to deal with Unix files, only, simply change the replacement regex to :

    REPLACE (?1\4.txt\n\n)?2\23000/\3\n



  • @guy038 This is exactly what I wanted. Thanks for that complex RegEx.
    No issues.

