regex: Parsing html tags in other tags / links and titles



  • <ul id="myNavigation">
        <li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>
        <li><a href="https://my-website.com/page-2.html" title="Page 2">Page 2 (29)</a></li>
        <li><a href="https://my-website.com/page-3.html" title="Page-3">Page 3 (11)</a></li>
    </ul>
    

    MUST BECOME:

    <div class="categories-name">
       <a href="https://my-website.com/page-66.html" title="Page 66">
       <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 66 <span>27</span> </p>
      </a>
    </div>
    <div class="categories-name">
       <a href="https://my-website.com/page-67.html" title="Page 67">
       <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 67 <span>24</span> </p>
       </a>
    </div>
    <div class="categories-name">
       <a href="https://my-website.com/page-68.html" title="Page 68">
       <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 68 <span>07</span> </p>
       </a>
    </div>
    

    Can something like this be done with regex?



  • Hello, @robin-cruise,

    In your first text, there are these numbers :

        >Page 1 (34)</a>
        >Page 2 (29)</a>
        >Page 3 (11)</a>
    

    And, in your second text, there are these other numbers :

       > Page 66 <span>27</span>
       > Page 67 <span>24</span>
       > Page 68 <span>07</span>
    

    How are all these numbers connected ?!

    BR

    guy038



  • Regex writing service continues…

    I hope something NEW and INTERESTING about N++ regex comes out of this, otherwise it belongs on a regex forum and not this one.



  • Hello, @robin-cruise, @alan-kilborn and All,

    I totally support @alan-kilborn’s statement ! That’s why I tried, in my previous post, to ask only about the main problem, in a few words !

    So, I will probably reply to you only if you can provide very clear explanations about your desired S/R. Otherwise, we just have to remember that this forum is a Notepad++ forum, dealing with the general functioning of N++ !

    BR

    guy038



  • I’m not against new posters asking about data manipulation questions that will involve regex. How would they know this except by asking here and being started down a learning path? I certainly provide some of these types of responses myself, and am glad to do so.

    But those that refuse to learn (or are incapable) are a different story. These are the “TAKERS”. One of the examples is the OP in this thread.

    So Guy, if you’re willing to be a regex writing service for them, I suggest you “take it offline” and have a back and forth email exchange where you solve all of their problems for them. I don’t want to “sign you up” for special work, but if this is agreeable, I encourage it as it reduces the bandwidth of the forum for what is, in the usual case, a general regex topic and not a N++ specific topic.

    But, if something interesting/peculiar to some NEW facet of regex and especially N++ regex comes out of it, do post and share knowledge. But for basic regex solve-my-problem-for-me from the repetitive askers, please spare the forum of the mundane.



  • @guy038 said in regex: Parsing html tags in other tags / links and titles:

    Hello, @robin-cruise,

    In your first text, there are these numbers :

        >Page 1 (34)</a>
        >Page 2 (29)</a>
        >Page 3 (11)</a>
    

    And, in your second text, there are these other numbers :

       > Page 66 <span>27</span>
       > Page 67 <span>24</span>
       > Page 68 <span>07</span>
    

    How are all these numbers connected ?!

    BR

    guy038

    Hello, sir. The numbers represent the number of articles included in each section. The page names, such as Page 1 or Page 66, are fictitious. In fact, the real names are Management, Finance, Maths, etc.

    Yes, the numbers are not the same in the first text and in the second text.

    @Alan-Kilborn good day, sir. I wrote this topic because there are thousands of regex topics here, and many of them have a solution. Please be kind and write a message to all the thousands of people who have opened discussions about regex. And also copy-paste your message for the next thousand who will write in the future. Tell them to go to another place…

    Can you do that, so that everyone can understand? Or, are you discriminating and only thinking about me?



  • @Robin-Cruise said in regex: Parsing html tags in other tags / links and titles:

    Or, are you discriminating and only thinking about me?

    Nope, you’re just one of the worst offenders, i.e. “takers”.
    But you’ve helped me to try to set a strategy here.
    Hopefully it will be fruitful.



  • Hi, @robin-cruise and All,

    So, globally, you have an initial text like below :

    <ul id="myNavigation">
    ¤¤¤¤<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
    ¤¤¤¤<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
    ¤¤¤¤<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
    ....
    ....
    ....
    ¤¤¤¤<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
    </ul>
    

    which should be changed into :

    <div class="categories-name">
    ¤¤¤¤<a href="https://my-website.com/••••.html" title="••••">
    ¤¤¤¤<p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> •••• <span>••••</span> </p>
    ¤¤¤¤</a>
    </div>
    <div class="categories-name">
    ¤¤¤¤<a href="https://my-website.com/••••.html" title="••••">
    ¤¤¤¤<p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> •••• <span>••••</span> </p>
    ¤¤¤¤</a>
    </div>
    <div class="categories-name">
    ¤¤¤¤<a href="https://my-website.com/••••.html" title="••••">
    ¤¤¤¤<p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> •••• <span>••••</span> </p>
    ¤¤¤¤</a>
    </div>
    ....
    ....
    ....
    <div class="categories-name">
    ¤¤¤¤<a href="https://my-website.com/••••.html" title="••••">
    ¤¤¤¤<p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> •••• <span>••••</span> </p>
    ¤¤¤¤</a>
    </div>
    

    where :

    • All the zones ••••, both in the search part and in the expected replacement, are arbitrary and not connected to each other

    • All the zones ¤¤¤¤ are leading space characters, whatever their number


    More generally, you would like that any line, as below, in a <ul .............</ul> section :

    ¤¤¤¤<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
    

    is changed into :

    <div class="categories-name">
    ¤¤¤¤<a href="https://my-website.com/••••.html" title="••••">
    ¤¤¤¤<p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> •••• <span>••••</span> </p>
    ¤¤¤¤</a>
    </div>
    

    If so, I think it should be possible to create an appropriate S/R, with this important consequence : AFTER the global replacement, which I will provide in a later post :

    • You’ll have to replace any literal •••• zone with the real data for each of them

    • You’ll have to replace any literal ¤¤¤¤ zone with the exact number of leading space chars required

    I cannot think of a better solution. It’s just your data, anyway ! What are your feelings about it ?

    Best Regards,

    guy038



  • @Robin-Cruise said in regex: Parsing html tags in other tags / links and titles:

    <a href="https://my-website.com/page-66.html" title="Page 66">

    Hello. So, basically, this line is almost the same in both parts: both have <a href and title=, even though the link and the title differ. For this line, a parsing can be done:

    <a href="https://my-website.com/page-1.html" title="Page 1">
    with this
    <a href="https://my-website.com/page-66.html" title="Page 66">

    The last part is a little tricky: I have to import the values PAGE and NUMBER OF THE PAGE from one place to another:

    Page 1 (34)</a> must become </i> Page 66 <span>

    and

    (34)</a></li> should become <span>27</span> </p> </a>



  • Hi, @robin-cruise,

    Sorry, but I still do not see what you want to achieve ??!!

    How, on earth, do you think a regular expression can make the link between Page 1 and Page 66 or between numbers (34) and 27 ??

    BR

    guy038



  • …and then some of the takers don’t even think through the logistics of their problem first, before asking for help, so the net result is that they waste the time of the helpers and any of the thread’s readers. :-(



  • Simple, with the formula:

    FIND: (FROM-HERE(.*?)TO-HERE.*?)(FROM-HERE).*?(TO-HERE)

    REPLACE BY \1\3\2\4

    for example, the first part:

    <ul id="myNavigation">
        <li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>
    </ul>
    

    and this:

    <div class="categories-name">
       <a href="https://my-website.com/page-66.html" title="Page 66">
       <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 66 <span>27</span> </p>
      </a>
    </div>
    

    Please test this formula, and you will see that the link from the first text is parsed and copied into the second text.

    FIND: (<li><a href="https://my-website.com/(.*?)" title=".*?)(\h+<a href="https://my-website.com/).*?(" title=")

    REPLACE BY: \1\3\2\4

    Now, the problem is that I must replace each part differently.
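
    For readers who want to check this formula outside Notepad++, here is a minimal Python sketch of the same idea. It is only an illustration under assumptions: Python's re module has no \h, so [ \t] is used as a stand-in, the re.DOTALL flag is assumed to play the role of the ". matches newline" option, the dot of ".com" is escaped, and the names sample, pattern and result are made up for the example.

```python
import re

# Hypothetical sample: the <ul> part followed by one <div> block, as in the post.
sample = '''<ul id="myNavigation">
    <li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>
</ul>
<div class="categories-name">
   <a href="https://my-website.com/page-66.html" title="Page 66">
   <p class="font-16"><i class="mr-1"></i> Page 66 <span>27</span> </p>
  </a>
</div>
'''

# Group 2 captures the file name of the FIRST link; groups 3 and 4 frame the
# file name of the SECOND link, which is matched but not captured.
pattern = re.compile(
    r'(<li><a href="https://my-website\.com/(.*?)" title=".*?)'
    r'([ \t]+<a href="https://my-website\.com/).*?(" title=")',
    re.DOTALL,
)

# \1\3\2\4 rewrites everything unchanged, except that the second file name
# is replaced with a copy of group 2.
result = pattern.sub(r'\1\3\2\4', sample, count=1)
print(result)
```

    As in the N++ version, one match pairs a single <li> line with the first indented <a ...> line that follows it, so this sketch copies one link only.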



  • also, the same basic formula, for another part:

    FIND: (title="(.*?)">.*?)(mr-1"></i> ).*?(<span>)

    REPLACE BY: \1\3\2 \4

    Check these 2 regex formulas I wrote in the last 2 posts, and see what has changed.



  • Hi, @Robin-cruise and All,

    When I said, in my previous post :

    More generally, you would like that any line, as below, in a <ul .............</ul> section :

    ¤¤¤¤<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
    

    is changed into :

    <div class="categories-name">
    ¤¤¤¤<a href="https://my-website.com/••••.html" title="••••">
    ¤¤¤¤<p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> •••• <span>••••</span> </p>
    ¤¤¤¤</a>
    </div>
    

    This can be achieved with the following regex S/R, without any group :

    SEARCH ^\h*<li><a href="https://my-website\.com/.+?\.html" title=".+?">.+?</a></li>

    REPLACE <div class="categories-name">\r\n <a href="https://my-website.com/••••.html" title="••••">\r\n <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>\r\n </a>\r\n</div>

    So, from any initial line, inside a <ul .............</ul> section, as below :

                <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
    

    You get these output lines :

    <div class="categories-name">
        <a href="https://my-website.com/••••.html" title="••••">
        <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
        </a>
    </div>
    

    Just try it out !


    Now, the problem is HOW to perform this S/R, ONLY on lines inside these <ul> sections, below :

    <ul id="myNavigation">
    ....
    ....
    ....
    </ul>
    

    To do that, we’ll use, again, the generic regex, already discussed, with :

    BSR = <ul id="myNavigation">
    ESR = (?!</ul>)
    SR = ^\h*<li><a href="https://my-website\.com/.+?\.html" title=".+?">.+?</a></li>
    RR = <div class="categories-name">\r\n <a href="https://my-website.com/••••.html" title="••••">\r\n <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> •••• <span>••••</span> </p>\r\n </a>\r\n</div>

    Leading to this general S/R :

    SEARCH (?s-i)(<ul id="myNavigation">|(?!\A)\G)((?!</ul>).)*?\K^\h*<li><a href="https://my-website\.com/.+?\.html" title=".+?">.+?</a></li>

    REPLACE <div class="categories-name">\r\n <a href="https://my-website.com/••••.html" title="••••">\r\n <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>\r\n </a>\r\n</div>

    Notes :

    • You must use N++ v7.9.1 or a later release, which correctly handles the \A assertion !

    • In the replacement, everything is literal except the \r\n parts, which stand for the Windows EOL

    • Of course, AFTER the replacement, you’ll have to change any •••• zone with its exact contents !
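
    For comparison, the same "replace only inside a section" behaviour can be sketched in Python. Python's re module supports neither \G nor \K, so this sketch swaps in a different technique: an outer substitution finds each <ul id="myNavigation">...</ul> section and an inner one rewrites the <li> lines inside it. The names SECTION, ITEM, TEMPLATE and convert_sections are hypothetical, and the 4-space indentation and "\n" line endings in TEMPLATE are assumptions.

```python
import re

# One whole navigation section, from its opening tag to the first </ul>.
SECTION = re.compile(r'<ul id="myNavigation">.*?</ul>', re.DOTALL)

# One <li> line, with any amount of leading spaces or tabs.
ITEM = re.compile(
    r'^[ \t]*<li><a href="https://my-website\.com/.+?\.html" title=".+?">.+?</a></li>',
    re.MULTILINE,
)

# Literal replacement; the •••• zones still have to be filled in afterwards.
TEMPLATE = (
    '<div class="categories-name">\n'
    '    <a href="https://my-website.com/••••.html" title="••••">\n'
    '    <p class="font-16 color-grey text-capitalize">'
    '<i class="fa fa-angle-right font-14 color-blue mr-1"></i> •••• <span>••••</span> </p>\n'
    '    </a>\n'
    '</div>'
)


def convert_sections(text):
    """Rewrite <li> lines, but only those inside a myNavigation section."""
    def convert(section):
        # A function is used as replacement so TEMPLATE is inserted verbatim.
        return ITEM.sub(lambda m: TEMPLATE, section.group(0))
    return SECTION.sub(convert, text)


sample = (
    '<ul id="myNavigation">\n'
    '    <li><a href="https://my-website.com/a.html" title="A">A (1)</a></li>\n'
    '</ul>\n'
    '    <li><a href="https://my-website.com/b.html" title="B">B (2)</a></li>\n'
)
converted = convert_sections(sample)
print(converted)
```

    In this sample, only the first <li> line is rewritten; the one after </ul> is left alone.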


    Let’s try this final regex S/R :

    • Place this sample text, below, in a N++ new tab :
    ...
    ...
    <ul id="myNavigation">
        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
    ....
    ....
    ....
        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
    </ul>
    
        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
    
    <ul id="myNavigation">
        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
    ....
    ....
    ....
        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
    </ul>
    
    
        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
    
    • Perform the last regex S/R, above

    • You’re left with the expected text :

    ...
    ...
    <ul id="myNavigation">
    <div class="categories-name">
        <a href="https://my-website.com/••••.html" title="••••">
        <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
        </a>
    </div>
    <div class="categories-name">
        <a href="https://my-website.com/••••.html" title="••••">
        <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
        </a>
    </div>
    <div class="categories-name">
        <a href="https://my-website.com/••••.html" title="••••">
        <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
        </a>
    </div>
    ....
    ....
    ....
    <div class="categories-name">
        <a href="https://my-website.com/••••.html" title="••••">
        <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
        </a>
    </div>
    </ul>
    
        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
    
    <ul id="myNavigation">
    <div class="categories-name">
        <a href="https://my-website.com/••••.html" title="••••">
        <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
        </a>
    </div>
    ....
    ....
    ....
    <div class="categories-name">
        <a href="https://my-website.com/••••.html" title="••••">
        <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
        </a>
    </div>
    <div class="categories-name">
        <a href="https://my-website.com/••••.html" title="••••">
        <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
        </a>
    </div>
    <div class="categories-name">
        <a href="https://my-website.com/••••.html" title="••••">
        <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
        </a>
    </div>
    </ul>
    
    
        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
    

    You can verify that the remaining lines <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li> are only those outside a <ul id="myNavigation">............</ul> multi-line section, and so were not replaced at all !

    Best Regards,

    guy038

    Oops !! I also forgot to delete the lines <ul id="myNavigation"> and </ul>. Quite easy with :

    SEARCH (?-i)<(ul id="myNavigation"|/ul)>\R

    REPLACE Leave EMPTY
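
    A minimal Python sketch of this clean-up step, for anyone scripting it. Python's re module has no \R, so an explicit alternation of line endings is assumed instead; WRAPPER_LINE and strip_wrappers are hypothetical names. As far as one can tell from the pattern, it removes every </ul> tag followed by a line break, not only those closing a myNavigation list, so it is probably best run on files that contain no other lists.

```python
import re

# Matches the opening or closing tag of the list, plus the line break after it.
WRAPPER_LINE = re.compile(r'<(?:ul id="myNavigation"|/ul)>(?:\r\n|\n|\r)')


def strip_wrappers(text):
    """Delete the <ul id="myNavigation"> and </ul> lines, keeping everything else."""
    return WRAPPER_LINE.sub('', text)


cleaned = strip_wrappers('<ul id="myNavigation">\n<div>x</div>\n</ul>\n')
print(cleaned)
```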



  • @guy038 yes, sir. Thank you. You succeeded in transforming part 1 into part 2.

    My problem, or maybe I didn’t understand too well, is those ••••

    How can I change those many •••• into different words and numbers ? If I have 200 lines, I need 2 months to change those ••••

    I return to my regex,

    please tick ". matches newline" and run these 2 regexes

    FIND: (<li><a href="https://my-website.com/(.*?)" title=".*?)(\h+<a href="https://my-website.com/).*?(" title=")

    REPLACE BY: \1\3\2\4

    and

    FIND: (title="(.*?)">.*?)(mr-1"></i> ).*?(<span>)

    REPLACE BY: \1\3\2 \4

    With these 2 regexes, I manage to do a parsing: I copy information from part 1 to part 2. Of course, it is an example of how the formula should work.



  • Hello @Robin-cruise,

    If you want me to test your two regex S/R ( Let’s say respectively, S/R A and S/R B )

    • I need the exact raw text which must be processed with S/R A

    • I need to know of any additional action to be done after processing S/R A and before processing S/R B

    • I need the exact raw text which must be processed with S/R B ( Of course, it may be the result of S/R A on the initial text )

    BR

    guy038



  • @guy038 @Alan-Kilborn This is the solution. Yes, it can be done with regex, but it must be integrated in PowerShell. It works great !

    $file1 = 'c:\Folder1\file1.html'
    $file2 = 'c:\Folder1\file2.html'
    $result = 'c:\Folder1\result.html'
    $link = @()
    $title = @()
    $number = @()
    # Collect the link, title and article count of every <li> item in file1
    Get-Content -Path $file1 -Delimiter '</li>' | ForEach-Object {
        # $Matches[0] is the matched text itself
        if ($_ -match '(?<=href=").+?(?=")') { $link += $Matches[0] }
        if ($_ -match '(?<=title=").+?(?=")') { $title += $Matches[0] }
        if ($_ -match '(?<=\()\d+(?=\))') { $number += $Matches[0] }
    }
    # Split file2 into <div> blocks; the n-th block receives the n-th item of file1
    $content = @(Get-Content -Path $file2 -Delimiter '</div>')
    for ($i = 0; $i -lt $content.Count; $i++) {
        # -replace treats its first operand as a regex, so escape the old literal values
        if ($content[$i] -match '(?<=href=").+?(?=")') { $link2 = [regex]::Escape($Matches[0]) }
        if ($content[$i] -match '(?<=title=").+?(?=")') { $title2 = [regex]::Escape($Matches[0]) }
        # Anchor the old number to its <span> so that class names such as font-16 stay untouched
        if ($content[$i] -match '(?<=<span>)\d+(?=</span>)') { $number2 = '(?<=<span>)' + $Matches[0] + '(?=</span>)' }
        $content[$i] -replace $link2, $link[$i] -replace $title2, $title[$i] -replace $number2, $number[$i] | Out-File -FilePath $result -Append
    }
    

    Source: https://docs.microsoft.com/en-us/answers/questions/307621/powershell-copy-strings-from-some-html-tags-to-ano.html
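
    The pairing logic of this script (the n-th <li> of the first file feeds the n-th <div> block of the second file) can also be sketched in Python. This is not the script from the linked answer, only a rough equivalent working on strings; the names extract_items and fill_blocks are hypothetical, and the sample strings are invented from the data shown at the top of the thread.

```python
import re

LINK = re.compile(r'(?<=href=").+?(?=")')
TITLE = re.compile(r'(?<=title=").+?(?=")')
COUNT_IN_PARENS = re.compile(r'(?<=\()\d+(?=\))')
COUNT_IN_SPAN = re.compile(r'(?<=<span>)\d+(?=</span>)')


def extract_items(source_html):
    """Return one (link, title, count) tuple per <li> chunk of the first file."""
    items = []
    for chunk in source_html.split('</li>'):
        link = LINK.search(chunk)
        title = TITLE.search(chunk)
        count = COUNT_IN_PARENS.search(chunk)
        if link and title and count:
            items.append((link.group(), title.group(), count.group()))
    return items


def fill_blocks(target_html, items):
    """Overwrite the n-th <div> block of the second file with the n-th item."""
    blocks = target_html.split('</div>')
    for i, (link, title, count) in enumerate(items[:len(blocks)]):
        block = blocks[i]
        old_link, old_title = LINK.search(block), TITLE.search(block)
        if not (old_link and old_title):
            continue  # not a category block, leave it as it is
        # str.replace works on literal text, so no regex escaping is needed.
        block = block.replace(old_link.group(), link)
        block = block.replace(old_title.group(), title)
        # Only the number inside <span>...</span> is touched.
        block = COUNT_IN_SPAN.sub(lambda m: count, block)
        blocks[i] = block
    return '</div>'.join(blocks)


source = ('<ul id="myNavigation">\n'
          '<li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>\n'
          '</ul>')
target = ('<div class="categories-name">\n'
          '<a href="https://my-website.com/page-66.html" title="Page 66">\n'
          '<p class="font-16"><i class="mr-1"></i> Page 66 <span>27</span> </p>\n'
          '</a>\n'
          '</div>\n')
filled = fill_blocks(target, extract_items(source))
print(filled)
```

    Because the items are paired by position only, the sketch assumes that both files list their entries in the same order.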



  • @Robin-Cruise said in regex: Parsing html tags in other tags / links and titles:

    This is the solution. Yes, it can be done with regex, but it must be integrated in PowerShell.

    Ok… but this is really off-topic for a Notepad++ forum.
    Whatever you’re doing could I’m sure be done in many different ways.
    This isn’t a general “data manipulation” forum.



  • Hi, @Robin-cruise,

    Nice ! So you, finally, got a working solution.


    Out of curiosity, could you provide me with a short sample of your initial file1.html, with several <ul......</ul> sections, of file2.html, and of the corresponding result.html file created by your PowerShell script ?

    This could help me understand what you really wanted to achieve and why I could not think of suitable and useful regexes !

    Thanks for your cooperation !

    Best Regards,

    guy038

