
    [nsfw] Help extracting links from page source code

    • Ravi K

      [NSFW]
      Hi Guys, I have this source code of a page.
      https://workupload.com/file/mGszjxtb7vB

      What I want to achieve:

      1. Extract all direct image file links like the one below from the source code text file above,

      h ttps://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/amy038SRS_232716004.jpg

      and discard the rest of the source code.

      2. Then add /3000/ before the jpg file name in the link, so that the above link becomes

      h ttps://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716004.jpg

      All links should be on separate lines, and the result should then be saved.
      I want to do this on multiple text files I have. I failed at the first step; regex is my weakness.

      In other files, the /a/ part can be any letter, and /amy038/ and /232716/ will also be different.

      Can anyone help me create a macro for this?

      The cherry on top would be if the final text file were saved with the name amy038-232716.txt

      Thanks

      • guy038

        Hello, @ravi-k and All,

        After downloading your page.txt file, I could identify, with regexes and sorting, 763 links to a .jpg picture. They come from three kinds of lines and fall into four classes :

        • 254 lines <a href="https ............... .jpg ...............> <img src="http ..... /thumbs/ ........ .jpg .............></a>, so containing 2 x 254 = 508 links

        • 254 lines Content-Location: http ... cdn04 .............. .jpg ....., so 254 links

        • 1 line https .... www.atkpetites.com .................... .jpg, so 1 link

        After sorting, we get :

        <a href="https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/amy038SRS_232716001.jpg
        ....
        ....  (A)
        ....
        <a href="https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/amy038SRS_232716254.jpg
        
        
        
        <img src="http://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/thumbs/amy038SRS_232716001.jpg
        ....
        ....  (B)
        ....
        <img src="http://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/thumbs/amy038SRS_232716254.jpg
        
        
        
        Content-Location: http://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/thumbs/amy038SRS_232716001.jpg
        ....
        ....  (C)
        ....
        Content-Location: http://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/thumbs/amy038SRS_232716254.jpg
        
        
        
        Content-Location: https://www.atkpetites.com/css/images/america_flag.jpg  (D)
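For anyone who wants to reproduce this kind of count outside Notepad++, here is a minimal Python 3 sketch. It is not the method used above (that was done with regexes and sorting in the editor). The names `LINK`, `classify` and `count_classes` are hypothetical, and the patterns are assumptions inferred only from the four lists shown above.

```python
import re
from collections import Counter

# One generic pattern: the text introducing the link, then the URL up to ".jpg".
# Assumption: links appear only after these three prefixes, as in lists A to D.
LINK = re.compile(r'(<a href="|<img src="|Content-Location: )(https?://[^"\s>]+?\.jpg)')


def classify(prefix, url):
    """Map one match to class A, B, C or D, following the lists above."""
    if prefix.startswith("<a"):
        return "A"  # full-size picture link
    if prefix.startswith("<img"):
        return "B"  # thumbnail shown inside the <a> element
    # Content-Location lines: thumbnails are class C, anything else is class D
    return "C" if "/thumbs/" in url else "D"


def count_classes(text):
    """Return a Counter with the number of links found in each class."""
    return Counter(classify(m.group(1), m.group(2)) for m in LINK.finditer(text))
```

Calling `count_classes()` on the text of a saved page gives one total per class, which can be compared with the figures listed above.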
        

        My question is : in order to get the right search regex and isolate the right links, which kind of links are you looking for, among the four lists A, B, C and D above ?

        Best Regards,

        guy038

        • Ravi K @guy038

          @guy038 List A, with /3000/ added before the jpg file name, like this:

          https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716001.jpg
          
          • guy038

            Hi, @ravi-k and All,

            Assuming that :

            • All your files are located in a specific folder

            • The 3 digits, before the string .jpg, are reserved for numbering

            • You want to add the 3000 folder in any link, right before the picture name

            Here is the road map :

            • Duplicate this folder. So the duplicated folder contains exactly the same files as the initial folder

            • Open the Find in Files dialog ( Ctrl + Shift + F )

              • SEARCH (?s-i)(?:(\A)|).+?<a\x20href="(?-s)(https?[^>\r\n]+/)((.+?)...\.jpg)|(?s).+

              • REPLACE (?1\4.txt\r\n\r\n)?2\23000/\3\r\n

              • FILTERS *.txt

              • DIRECTORY The absolute path to the duplicated folder

              • Select the Regular expression search mode

              • Click on the Replace in Files button and validate the confirmation dialog

            From your downloaded example, you should get this kind of output, and the result should be similar for all the files of your duplicated folder :

            amy038SRS_232716.txt
            
            https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716001.jpg
            https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716002.jpg
            https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716003.jpg
            https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716004.jpg
            .....
            .....
            .....
            https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716251.jpg
            https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716252.jpg
            https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716253.jpg
            https://cdn04.atkingdom-network.com/secure/content/a/amy038/232716/3000/amy038SRS_232716254.jpg
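As a cross-check outside Notepad++, the same transformation can be sketched in plain Python 3. This is only an approximation of the regex S/R above, not a replacement for it: `A_LINK` and `rewrite_links` are hypothetical names, and the pattern assumes that the three characters before .jpg are digits, which is slightly stricter than the `...` used in the search regex.

```python
import re

# An <a href="..."> link: everything up to the last slash, then the picture name.
# Assumption: the picture name ends with exactly three digits followed by ".jpg".
A_LINK = re.compile(r'<a href="(https?://[^">\r\n]+/)(([^/">\r\n]+?)\d{3}\.jpg)')


def rewrite_links(text, folder="3000"):
    """Return (base_name, links) for the class-A links found in text.

    base_name is the picture name without its 3-digit counter, taken from
    the first link; links are the URLs with folder/ inserted before the name.
    """
    base_name = None
    links = []
    for m in A_LINK.finditer(text):
        prefix, picture, stem = m.group(1), m.group(2), m.group(3)
        if base_name is None:
            base_name = stem
        links.append(f"{prefix}{folder}/{picture}")
    return base_name, links
```

With the sample file, `base_name` would be expected to match the name on the first line of the output above (without the .txt extension), and `links` the lines that follow it.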
            

            Now, the goal is to use the Python, Lua or NppExec script plugin to rename each file of your duplicated folder with the name located in the very first line of that file !

            This seems fairly easy and I bet that some script gurus in the N++ community will find a solution very soon !

            However, first test my regex S/R against all your files, to check for possible issues and/or improvements !
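In the meantime, here is a minimal sketch of the renaming step. It is written for a standalone Python 3 interpreter, not for the PythonScript plugin, and `rename_from_first_line` is a hypothetical name. It assumes that the regex S/R has already been run, so that the first line of each file is the wanted name ending in .txt.

```python
from pathlib import Path


def rename_from_first_line(folder):
    """Rename each .txt file in folder to the name found on its first line.

    Returns a list of (old_name, new_name) pairs for the files renamed.
    """
    renamed = []
    # sorted() builds the full list first, so renaming does not disturb the loop
    for path in sorted(Path(folder).glob("*.txt")):
        with path.open(encoding="utf-8", errors="replace") as fh:
            first = fh.readline().strip()
        # Skip files without a usable name line, or with a path separator in it
        if not first.endswith(".txt") or "/" in first or "\\" in first:
            continue
        target = path.with_name(first)
        if target == path or target.exists():
            continue  # never overwrite an existing file
        path.rename(target)
        renamed.append((path.name, target.name))
    return renamed
```

Run it on the duplicated folder only, for example `rename_from_first_line(r"C:\path\to\duplicated_folder")`, where the path is a placeholder. The name line itself is left in the file.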

            Best Regards,

            guy038

            P.S. :

            BTW, I noticed that your sample file contains, both :

            • Some lines with Windows line endings ( CRLF )

            • Some lines with Unix line endings ( LF )

            Fortunately, this does not disturb the regex S/R. If you prefer to deal with Unix files only, simply change the replacement regex to :

            REPLACE (?1\4.txt\n\n)?2\23000/\3\n
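To see which kind of line endings a given file contains before choosing between the two replacements, a small Python 3 sketch like the following could be used. `count_line_endings` is a hypothetical helper, not a Notepad++ feature, and it works on the raw bytes of the file.

```python
def count_line_endings(data: bytes):
    """Count Windows (CRLF), Unix (LF) and lone CR line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # LF not preceded by CR
    cr = data.count(b"\r") - crlf  # CR not followed by LF
    return {"CRLF": crlf, "LF": lf, "CR": cr}
```

For example, `count_line_endings(open("page.txt", "rb").read())` reports both a CRLF and an LF total, and a file with mixed endings shows non-zero values for both.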

            • Ravi K @guy038

              @guy038 This is exactly what I wanted. Thanks for that complex RegEx.
              No issues.

              The Community of users of the Notepad++ text editor.