    regex: Parsing html tags in other tags / links and titles

    Help wanted · 32 posts · 3 posters · 2.1k views
    • Alan Kilborn @Robin Cruise

      @Robin-Cruise said in regex: Parsing html tags in other tags / links and titles:

      Or, are you discriminating and only thinking about me?

      Nope, you’re just one of the worst offenders, i.e. “takers”.
      But you’ve helped me to try to set a strategy here.
      Hopefully it will be fruitful.

      • guy038

        Hi, @robin-cruise and All,

        So, globally, you have initial text like the one below :

        <ul id="myNavigation">
        ¤¤¤¤<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
        ¤¤¤¤<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
        ¤¤¤¤<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
        ....
        ....
        ....
        ¤¤¤¤<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
        </ul>
        

        which should be changed into :

        <div class="categories-name">
        ¤¤¤¤<a href="https://my-website.com/••••.html" title="••••">
        ¤¤¤¤<p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> •••• <span>••••</span> </p>
        ¤¤¤¤</a>
        </div>
        <div class="categories-name">
        ¤¤¤¤<a href="https://my-website.com/••••.html" title="••••">
        ¤¤¤¤<p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> •••• <span>••••</span> </p>
        ¤¤¤¤</a>
        </div>
        <div class="categories-name">
        ¤¤¤¤<a href="https://my-website.com/••••.html" title="••••">
        ¤¤¤¤<p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> •••• <span>••••</span> </p>
        ¤¤¤¤</a>
        </div>
        ....
        ....
        ....
        <div class="categories-name">
        ¤¤¤¤<a href="https://my-website.com/••••.html" title="••••">
        ¤¤¤¤<p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> •••• <span>••••</span> </p>
        ¤¤¤¤</a>
        </div>
        

        where :

        • All the •••• zones, in both the search part and the expected replacement, are arbitrary and not connected to each other

        • All the zones ¤¤¤¤ are leading space characters, whatever their number


        More generally, you would like any line like the one below, inside a <ul .............</ul> section :

        ¤¤¤¤<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
        

        is changed into :

        <div class="categories-name">
        ¤¤¤¤<a href="https://my-website.com/••••.html" title="••••">
        ¤¤¤¤<p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> •••• <span>••••</span> </p>
        ¤¤¤¤</a>
        </div>
        

        If so, I think it should be possible to create an appropriate S/R, with one important consequence : AFTER the global replacement, which I will provide in a later post :

        • You’ll have to replace every literal •••• zone with its real data

        • You’ll have to replace every literal ¤¤¤¤ zone with the exact number of leading space chars required

        I cannot think of a better solution. It’s your data, anyway ! What are your feelings about it ?

        Best Regards,

        guy038

        • Robin Cruise

          @Robin-Cruise said in regex: Parsing html tags in other tags / links and titles:

          <a href="https://my-website.com/page-66.html" title="Page 66">

          Hello. So, basically, this line is almost the same in both parts: each has <a href and title=, even though the link and the title are different. A parsing can be done for this:

          <a href="https://my-website.com/page-1.html" title="Page 1">
          with this
          <a href="https://my-website.com/page-66.html" title="Page 66">

          The last part is a little tricky: I have to import the values PAGE and NUMBER OF THE PAGE from one place to another:

          Page 1 (34)</a> must become </i> Page 66 <span>

          and

          (34)</a></li> should become <span>27</span> </p> </a>

          • guy038

            Hi, @robin-cruise,

            Sorry, but I still do not see what you want to achieve ??!!

            How, on earth, do you think a regular expression can make the link between Page 1 and Page 66 or between numbers (34) and 27 ??

            BR

            guy038

            • Alan Kilborn

              …and then some of the takers don’t even think through the logistics of their problem first, before asking for help, so the net result is that they waste the time of the helpers and any of the thread’s readers. :-(

              • Robin Cruise

                Simple, with the formula:

                FIND: (FROM-HERE(.*?)TO-HERE.*?)(FROM-HERE).*?(TO-HERE)

                REPLACE BY: \1\3\2\4

                for example, the first part:

                <ul id="myNavigation">
                    <li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>
                </ul>
                

                and this:

                <div class="categories-name">
                   <a href="https://my-website.com/page-66.html" title="Page 66">
                   <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 66 <span>27</span> </p>
                  </a>
                </div>
                

                Please test this formula, and you will see that the link from the first text is parsed and copied into the second text.

                FIND: (<li><a href="https://my-website.com/(.*?)" title=".*?)(\h+<a href="https://my-website.com/).*?(" title=")

                REPLACE BY: \1\3\2\4

                Now, the problem is that I must replace each part differently.
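
The swap formula above can also be tried outside Notepad++. Here is a minimal Python sketch, assuming the two sample snippets are joined into one text. Python's `re` module has no `\h`, so `[ \t]` stands in for it, and `re.S` plays the role of the ". matches newline" option. The names `text`, `pattern` and `result` are illustrative only.

```python
import re

# The two sample snippets from the post, joined into one text.
text = (
    '<ul id="myNavigation">\n'
    '    <li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>\n'
    '</ul>\n'
    '<div class="categories-name">\n'
    '   <a href="https://my-website.com/page-66.html" title="Page 66">\n'
    '   <p class="font-16 color-grey text-capitalize">'
    '<i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 66 <span>27</span> </p>\n'
    '  </a>\n'
    '</div>\n'
)

# Group 2 captures the file name of the first link; groups 3 and 4 frame
# the file name of the second link, which gets overwritten by group 2.
pattern = re.compile(
    r'(<li><a href="https://my-website.com/(.*?)" title=".*?)'
    r'([ \t]+<a href="https://my-website.com/).*?(" title=")',
    re.S,
)

result = pattern.sub(r'\1\3\2\4', text, count=1)
print(result)
```

Only the link is copied here: the title attribute and the visible label of the second block still say Page 66, which is why a second formula is needed for them.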

                • Robin Cruise

                  also, the same basic formula, for another part:

                  FIND: (title="(.*?)">.*?)(mr-1"></i> ).*?(<span>)

                  REPLACE BY: \1\3\2 \4

                  Check these 2 regex formulas I wrote in the last 2 posts, and see what changes.
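
As a rough check of this second formula, here is a minimal Python sketch run on the same kind of sample text (one `<li>` line followed by one `<div>` block). It assumes `re.S` as the equivalent of ". matches newline"; the variable names are illustrative.

```python
import re

text = (
    '<ul id="myNavigation">\n'
    '    <li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>\n'
    '</ul>\n'
    '<div class="categories-name">\n'
    '   <a href="https://my-website.com/page-66.html" title="Page 66">\n'
    '   <p class="font-16 color-grey text-capitalize">'
    '<i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 66 <span>27</span> </p>\n'
    '  </a>\n'
    '</div>\n'
)

# Group 2 captures the first title; the text between group 3 and group 4
# (the visible label of the second block) is replaced by that title.
pattern = re.compile(r'(title="(.*?)">.*?)(mr-1"></i> ).*?(<span>)', re.S)

result = pattern.sub(r'\1\3\2 \4', text, count=1)
print(result)
```

The visible label becomes Page 1, while the number inside `<span>` and the title attribute of the second block are left as they were.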

                  • guy038

                    Hi, @Robin-cruise and All,

                    When I said, in my previous post :

                    More generally, you would like any line like the one below, inside a <ul .............</ul> section :

                    ¤¤¤¤<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                    

                    is changed into :

                    <div class="categories-name">
                    ¤¤¤¤<a href="https://my-website.com/••••.html" title="••••">
                    ¤¤¤¤<p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> •••• <span>••••</span> </p>
                    ¤¤¤¤</a>
                    </div>
                    

                    This can be achieved with the following regex S/R, without any group :

                    SEARCH ^\h*<li><a href="https://my\-website.com/.+?\.html" title=".+?">.+?</a></li>

                    REPLACE <div class="categories-name">\r\n <a href="https://my-website.com/••••.html" title="••••">\r\n <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>\r\n </a>\r\n</div>

                    So, from any initial line, inside a <ul .............</ul> section, as below :

                                <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                    

                    You get these output lines :

                    <div class="categories-name">
                        <a href="https://my-website.com/••••.html" title="••••">
                        <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
                        </a>
                    </div>
                    

                    Just try it out !


                    Now, the problem is HOW to perform this S/R, ONLY on lines inside these <ul> sections, below :

                    <ul id="myNavigation">
                    ....
                    ....
                    ....
                    </ul>
                    

                    To do that, we’ll use, again, the generic regex, already discussed, with :

                    BSR = <ul id="myNavigation">
                    ESR = (?!</ul>)
                    SR = ^\h*<li><a href="https://my\-website.com/.+?\.html" title=".+?">.+?</a></li>
                    RR = <div class="categories-name">\r\n <a href="https://my-website.com/••••.html" title="••••">\r\n <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> •••• <span>••••</span> </p>\r\n </a>\r\n</div>

                    Leading to this general S/R :

                    SEARCH (?s-i)(<ul id="myNavigation">|(?!\A)\G)((?!</ul>).)*?\K^\h*<li><a href="https://my\-website.com/.+?\.html" title=".+?">.+?</a></li>

                    REPLACE <div class="categories-name">\r\n <a href="https://my-website.com/••••.html" title="••••">\r\n <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>\r\n </a>\r\n</div>

                    Notes :

                    • You must use N++ v7.9.1 or a later release, which correctly handles the \A assertion !

                    • In the replacement, everything is literal except the \r\n parts, which stand for the Windows EOL

                    • Of course, AFTER the replacement, you’ll have to replace every •••• zone with its exact contents !
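
The region-restricted S/R above relies on `\G` and `\K`, which Python's standard `re` module does not support. The same idea can be sketched in two steps instead, as a swapped-in technique rather than a translation of the regex: first find each `<ul id="myNavigation">...</ul>` region, then run the per-line replacement only inside it. The names `REGION`, `LINE`, `TEMPLATE` and `replace_inside_regions` are hypothetical, and `\n` is used here instead of `\r\n`.

```python
import re

# Begin/end of the region (BSR / ESR) and the per-line search (SR).
REGION = re.compile(r'<ul id="myNavigation">.*?</ul>', re.S)
LINE = re.compile(
    r'^[ \t]*<li><a href="https://my-website\.com/.+?\.html" title=".+?">.+?</a></li>',
    re.M,
)
# Literal replacement text (RR); the •••• zones are filled in afterwards.
TEMPLATE = (
    '<div class="categories-name">\n'
    '    <a href="https://my-website.com/••••.html" title="••••">\n'
    '    <p class="font-16 color-grey text-capitalize">'
    '<i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>\n'
    '    </a>\n'
    '</div>'
)


def replace_inside_regions(text):
    """Replace <li> lines only when they sit inside a myNavigation <ul> section."""
    def fix_region(match):
        # A function is used as replacement so TEMPLATE is inserted literally.
        return LINE.sub(lambda _m: TEMPLATE, match.group(0))
    return REGION.sub(fix_region, text)


sample = (
    '<ul id="myNavigation">\n'
    '    <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>\n'
    '</ul>\n'
    '    <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>\n'
)
result = replace_inside_regions(sample)
print(result)
```

In this sample, the `<li>` line inside the section is replaced and the one after `</ul>` is left untouched.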


                    Let’s try this final regex S/R :

                    • Place this sample text, below, in a N++ new tab :
                    ...
                    ...
                    <ul id="myNavigation">
                        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                    ....
                    ....
                    ....
                        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                    </ul>
                    
                        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                    
                    <ul id="myNavigation">
                        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                    ....
                    ....
                    ....
                        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                    </ul>
                    
                    
                        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                    
                    • Perform the last regex S/R, above

                    • You’re left with the expected text :

                    ...
                    ...
                    <ul id="myNavigation">
                    <div class="categories-name">
                        <a href="https://my-website.com/••••.html" title="••••">
                        <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
                        </a>
                    </div>
                    <div class="categories-name">
                        <a href="https://my-website.com/••••.html" title="••••">
                        <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
                        </a>
                    </div>
                    <div class="categories-name">
                        <a href="https://my-website.com/••••.html" title="••••">
                        <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
                        </a>
                    </div>
                    ....
                    ....
                    ....
                    <div class="categories-name">
                        <a href="https://my-website.com/••••.html" title="••••">
                        <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
                        </a>
                    </div>
                    </ul>
                    
                        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                    
                    <ul id="myNavigation">
                    <div class="categories-name">
                        <a href="https://my-website.com/••••.html" title="••••">
                        <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
                        </a>
                    </div>
                    ....
                    ....
                    ....
                    <div class="categories-name">
                        <a href="https://my-website.com/••••.html" title="••••">
                        <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
                        </a>
                    </div>
                    <div class="categories-name">
                        <a href="https://my-website.com/••••.html" title="••••">
                        <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
                        </a>
                    </div>
                    <div class="categories-name">
                        <a href="https://my-website.com/••••.html" title="••••">
                        <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
                        </a>
                    </div>
                    </ul>
                    
                    
                        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                        <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
                    

                    You can verify that the remaining lines <li><a href="https://my-website.com/••••.html" title="••••">••••</a></li> are all outside any <ul id="myNavigation">............</ul> multi-line section, so they are not replaced at all !

                    Best Regards,

                    guy038

                    Oops !! I also forgot to delete the lines <ul id="myNavigation"> and </ul>. That is quite easy with :

                    SEARCH (?-i)<(ul id="myNavigation"|/ul)>\R

                    REPLACE Leave EMPTY
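
For comparison, this clean-up step can be sketched in Python as well. The standard `re` module has no `\R`, so `\r?\n` is assumed here as a stand-in; the names `WRAPPER`, `text` and `cleaned` are illustrative.

```python
import re

# Matches the opening or closing wrapper line, including its line break.
WRAPPER = re.compile(r'<(?:ul id="myNavigation"|/ul)>\r?\n')

text = (
    '<ul id="myNavigation">\n'
    '<div class="categories-name">\n'
    '</div>\n'
    '</ul>\n'
)
cleaned = WRAPPER.sub('', text)
print(cleaned)
```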

                      • Robin Cruise

                        @guy038 Yes, sir. Thank you. You succeeded in transforming part 1 into part 2.

                        My problem, or maybe what I didn’t understand well, is those •••• zones.

                        How can I replace all those •••• with different words and numbers ? If I have 200 lines, I would need 2 months to change them by hand.

                        Coming back to my regexes :

                        please tick “. matches newline” and run these 2 regexes

                        FIND: (<li><a href="https://my-website.com/(.*?)" title=".*?)(\h+<a href="https://my-website.com/).*?(" title=")

                        REPLACE BY: \1\3\2\4

                        and

                        FIND: (title="(.*?)">.*?)(mr-1"></i> ).*?(<span>)

                        REPLACE BY: \1\3\2 \4

                        With these 2 regexes, I manage to do the parsing : I copy information from part 1 to part 2. Of course, this is just an example of how the formula should work.

                        • guy038

                          Hello @Robin-cruise,

                          If you want me to test your two regex S/R ( let’s call them S/R A and S/R B, respectively ) :

                          • I need the exact raw text which must be processed with S/R A

                          • I need to know any additional action which is to be done after processing S/R A and before processing S/R B

                          • I need the exact raw text which must be processed with S/R B ( Of course, it may be the result of S/R A on the initial text )

                          BR

                          guy038

                          • Robin Cruise

                            @guy038 @Alan-Kilborn This is the solution. Yes, it can be done with regex, but it must be integrated into PowerShell. It works great !

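                            $file1 = 'c:\Folder1\file1.html'    # source: the <li> list
                            $file2 = 'c:\Folder1\file2.html'    # target: the <div> blocks
                            $result = 'c:\Folder1\result.html'  # output file (appended to, so start from an empty file)
                            $link = @()
                            $title = @()
                            $number = @()
                            # Read file1 in chunks split at each </li> and collect link, title and number
                            Get-Content -Path $file1 -Delimiter '</li>' | ForEach-Object {
                                if ($_ -match '(?<=href=").+?(?=")') { $link += $Matches[0] }
                                if ($_ -match '(?<=title=").+?(?=")') { $title += $Matches[0] }
                                if ($_ -match '(?<=\()\d+(?=\))') { $number += $Matches[0] }
                            }
                            # Read file2 in chunks split at each </div>
                            $content = Get-Content -Path $file2 -Delimiter '</div>'
                            for ($i = 0; $i -lt $content.Count; $i++) {
                                $block = $content[$i]
                                # Replace each value found in this block with the i-th value collected from file1.
                                # [regex]::Escape keeps characters such as "." in the old value from acting as regex syntax.
                                if ($block -match '(?<=href=").+?(?=")') { $block = $block -replace [regex]::Escape($Matches[0]), $link[$i] }
                                if ($block -match '(?<=title=").+?(?=")') { $block = $block -replace [regex]::Escape($Matches[0]), $title[$i] }
                                if ($block -match '(?<=<span>)\d+(?=</span>)') { $block = $block -replace [regex]::Escape($Matches[0]), $number[$i] }
                                $block | Out-File -FilePath $result -Append
                            }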
                            

                            Source: https://docs.microsoft.com/en-us/answers/questions/307621/powershell-copy-strings-from-some-html-tags-to-ano.html
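
For readers without PowerShell, the same approach (collect link, title and number from each `<li>`, then write them into the matching `<div>` block by position) might look like this in Python. This is only a sketch and not the script from the linked answer: the names `ITEM`, `BLOCK` and `transfer` are hypothetical, it works on strings rather than files, and it assumes the n-th `<li>` belongs to the n-th block.

```python
import re

# One <li> entry of file1: link, title and the number in parentheses.
ITEM = re.compile(
    r'<li><a href="(?P<link>[^"]+)" title="(?P<title>[^"]+)">'
    r'.*?\((?P<number>\d+)\)</a></li>'
)
# One <div class="categories-name"> block of file2.
BLOCK = re.compile(r'<div class="categories-name">.*?</div>', re.S)


def transfer(source_html, target_html):
    """Copy link, title and number from the n-th <li> into the n-th <div> block."""
    items = ITEM.finditer(source_html)

    def fill(match):
        block = match.group(0)
        item = next(items, None)
        if item is None:  # more blocks than items: leave the block untouched
            return block
        old_title = re.search(r'(?<=title=")[^"]+', block)
        block = re.sub(r'(?<=href=")[^"]+', lambda _m: item['link'], block, count=1)
        if old_title:
            # The old title text also appears as the visible label.
            block = block.replace(old_title.group(0), item['title'])
        block = re.sub(r'(?<=<span>)\d+(?=</span>)',
                       lambda _m: item['number'], block, count=1)
        return block

    return BLOCK.sub(fill, target_html)


source = (
    '<ul id="myNavigation">\n'
    '    <li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>\n'
    '</ul>\n'
)
target = (
    '<div class="categories-name">\n'
    '   <a href="https://my-website.com/page-66.html" title="Page 66">\n'
    '   <p class="font-16 color-grey text-capitalize">'
    '<i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 66 <span>27</span> </p>\n'
    '   </a>\n'
    '</div>\n'
)
result = transfer(source, target)
print(result)
```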

                            • Alan Kilborn @Robin Cruise

                              @Robin-Cruise said in regex: Parsing html tags in other tags / links and titles:

                              This is the solution. Yes, it can be done with regex, but it must be integrated into PowerShell.

                              Ok… but this is really off-topic for a Notepad++ forum.
                              Whatever you’re doing could I’m sure be done in many different ways.
                              This isn’t a general “data manipulation” forum.

                              • guy038

                                Hi, @Robin-cruise,

                                Nice ! So you, finally, got a working solution.


                                Out of curiosity, could you provide me with a short sample of your initial files, file1.html ( with several <ul......</ul> sections ) and file2.html, and the corresponding result.html file created by your PowerShell script ?

                                This could help me to understand what you really wanted to achieve and why I could not come up with suitable and useful regexes !

                                Thanks for your cooperation !

                                Best Regards,

                                guy038

                                • Robin Cruise

                                  File1.html

                                  <ul id="myNavigation">
                                      <li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>
                                      <li><a href="https://my-website.com/page-2.html" title="Page 2">Page 2 (29)</a></li>
                                      <li><a href="https://my-website.com/page-3.html" title="Page-3">Page 3 (11)</a></li>
                                  </ul>
                                  

                                  File2.html

                                  <div class="categories-name">
                                     <a href="https://my-website.com/page-66.html" title="Page 66">
                                     <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 66 <span>27</span> </p>
                                    </a>
                                  </div>
                                  <div class="categories-name">
                                     <a href="https://my-website.com/page-67.html" title="Page 67">
                                     <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 67 <span>24</span> </p>
                                     </a>
                                  </div>
                                  <div class="categories-name">
                                     <a href="https://my-website.com/page-68.html" title="Page 68">
                                     <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 68 <span>07</span> </p>
                                     </a>
                                  </div>
                                  

                                  @guy038 @Alan-Kilborn
                                  Thank you. Please understand that finding this solution was not easy.

                                  First, I knew that @guy038 is one of the best REGEX developers on earth. He had already given a solution, in another post, for parsing HTML, so I thought he would succeed.

                                  Second, in another post, someone else gave a PowerShell solution to a different problem. I remembered that post too.

                                  So, I knew that this problem could be solved, but I had to try both options.

                                  Now, maybe there will be a better way to handle regexes and files, by integrating Notepad++ with another program such as PowerShell.

                                  • guy038

                                    Hi, @robin-cruise,

                                    But, could you provide, also, the result.html file ?

                                    Thanks

                                    BR

                                    guy038

                                    • Robin Cruise

                                      @Alan-Kilborn said in regex: Parsing html tags in other tags / links and titles:

                                      but this is really off-topic for a Notepad++ forum

                                      No, sir. It is an opportunity to understand that a problem has many solutions, even if people tell you that it isn’t possible.

                                      By the way, the same problem can be solved on Linux much more easily, with a bash script, but that’s another story.

                                      • Robin Cruise @guy038

                                        @guy038 result.html is just a blank file, placed in the same folder as File1.html and File2.html

                                        After you run the code in PowerShell, the output is written to result.html

                                        Please test the PowerShell code. You will see in result.html the HTML structure of File2.html, but with the values extracted from File1.html

                                        • guy038

                                          Hi, @robin-cruise,

                                          OK, but could you provide the result.html file AFTER running the PowerShell script ?

                                          BR

                                          guy038

                                          Sorry, my old XP SP3 machine is a bit out of date !

                                          • Robin Cruise

                                            @guy038 this is the result.html after running the PowerShell code

                                            [image: result.html after running the script]
