    regex: Parsing html tags in other tags / links and titles

    • Robin Cruise

      @guy038 @Alan-Kilborn This is the solution. Yes, it can be done with regex, but the regex must be integrated into PowerShell. It works great!

      # Input files and output file
      $file1 = 'c:\Folder1\file1.html'
      $file2 = 'c:\Folder1\file2.html'
      $result = 'c:\Folder1\result.html'
      $link=@()
      $title=@()
      $number=@()
      # file1: read one <li> item at a time and collect its link, title and number
      Get-Content -Path $file1 -Delimiter '</li>'|ForEach-Object{
          if($_ -match '(?<=href=").+?(?=")'){$link += $Matches[0]}
          if($_ -match '(?<=title=").+?(?=")'){$title += $Matches[0]}
          if($_ -match '(?<=\()\d+(?=\))'){$number += $Matches[0]}
      }
      # file2: read one <div> block at a time
      $content = Get-Content -Path $file2 -Delimiter '</div>'
      for($i=0;$i -lt $content.Count;$i++){
          # Find the values currently in this block; escape them so that -replace
          # treats them as literal text rather than as regex patterns
          if($content[$i] -match '(?<=href=").+?(?=")'){$link2 = [regex]::Escape($Matches[0])}
          if($content[$i] -match '(?<=title=").+?(?=")'){$title2 = [regex]::Escape($Matches[0])}
          if($content[$i] -match '(?<=<span>)\d+(?=</span>)'){$number2 = [regex]::Escape($Matches[0])}
          # Swap in the values of item $i from file1 and append the block to the result
          $content[$i] -replace $link2, $link[$i] -replace $title2, $title[$i] -replace $number2, $number[$i] | Out-File -FilePath $result -Append
      }
      

      Source: https://docs.microsoft.com/en-us/answers/questions/307621/powershell-copy-strings-from-some-html-tags-to-ano.html
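
For readers who do not use PowerShell, the pairing logic of the script above can be sketched in Python. This is only an illustration under assumptions: the inline sample strings stand in for file1.html and file2.html, the function names (`extract_items`, `fill_block`, `merge`) are made up, and the substitution is slightly stricter than in the script (it replaces the first `href` and the first `<span>` number only).

```python
import re

# Hypothetical inline samples standing in for file1.html and file2.html.
FILE1 = '''<ul id="myNavigation">
    <li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>
    <li><a href="https://my-website.com/page-2.html" title="Page 2">Page 2 (29)</a></li>
</ul>
'''

FILE2 = '''<div class="categories-name">
   <a href="https://my-website.com/page-66.html" title="Page 66">
   <p class="font-16"><i class="fa"></i> Page 66 <span>27</span> </p>
   </a>
</div>
<div class="categories-name">
   <a href="https://my-website.com/page-67.html" title="Page 67">
   <p class="font-16"><i class="fa"></i> Page 67 <span>24</span> </p>
   </a>
</div>
'''

ITEM = re.compile(r'<li><a href="([^"]+)" title="([^"]+)">.*?\((\d+)\)</a></li>')
BLOCK = re.compile(r'<div class="categories-name">.*?</div>', re.S)


def extract_items(html):
    """Return a (link, title, number) tuple for every <li> item."""
    return ITEM.findall(html)


def fill_block(block, link, title, number):
    """Use one <div> block as a template and put one item's values into it."""
    old_title = re.search(r'title="([^"]*)"', block).group(1)
    # Replacement functions avoid any special treatment of backslashes.
    block = re.sub(r'(?<=href=")[^"]*', lambda m: link, block, count=1)
    block = block.replace(old_title, title)
    block = re.sub(r'(?<=<span>)\d+(?=</span>)', lambda m: number, block, count=1)
    return block


def merge(file1, file2):
    """Pair the i-th item of file1 with the i-th block of file2."""
    blocks = BLOCK.findall(file2)
    return "\n".join(fill_block(block, *item)
                     for block, item in zip(blocks, extract_items(file1)))


RESULT = merge(FILE1, FILE2)
print(RESULT)
```

The output keeps the `<div>` structure of the second sample but carries the links, titles and counts of the first one.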

      • Alan Kilborn (in reply to Robin Cruise)

        @Robin-Cruise said in regex: Parsing html tags in other tags / links and titles:

        This is the solution. Yes, it can be done with regex, but the regex must be integrated into PowerShell.

        Ok… but this is really off-topic for a Notepad++ forum.
        Whatever you’re doing could I’m sure be done in many different ways.
        This isn’t a general “data manipulation” forum.

        • guy038

          Hi, @Robin-cruise,

          Nice ! So you, finally, got a working solution.


          Out of curiosity, could you provide me with a short sample of your initial files, file1.html (with several <ul......</ul> sections) and file2.html, and the corresponding result.html file created by your PowerShell script ?

          This could help me to understand what you really wanted to achieve and why I could not come up with suitable and useful regexes !

          Thanks for your cooperation !

          Best Regards,

          guy038

          • Robin Cruise

            File1.html

            <ul id="myNavigation">
                <li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>
                <li><a href="https://my-website.com/page-2.html" title="Page 2">Page 2 (29)</a></li>
                <li><a href="https://my-website.com/page-3.html" title="Page-3">Page 3 (11)</a></li>
            </ul>
            

            File2.html

            <div class="categories-name">
               <a href="https://my-website.com/page-66.html" title="Page 66">
               <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 66 <span>27</span> </p>
              </a>
            </div>
            <div class="categories-name">
               <a href="https://my-website.com/page-67.html" title="Page 67">
               <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 67 <span>24</span> </p>
               </a>
            </div>
            <div class="categories-name">
               <a href="https://my-website.com/page-68.html" title="Page 68">
               <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 68 <span>07</span> </p>
               </a>
            </div>
            

            @guy038 @Alan-Kilborn
            Thank you. Please understand that finding this solution was not easy.

            First, I knew that @guy038 is one of the best regex developers on earth. He had already given a solution for parsing HTML in another post, so I thought he would succeed.

            Second, in another post, someone else gave a PowerShell solution to a different problem. I remembered that post too.

            So I knew that this problem could be solved, but I had to try both options.

            Now, maybe there is a better way to handle regexes and files: integrating Notepad++-style regexes into another program such as PowerShell.

            • guy038

              Hi, @robin-cruise,

              But could you also provide the result.html file ?

              Thanks

              BR

              guy038

              • Robin Cruise

                @Alan-Kilborn said in regex: Parsing html tags in other tags / links and titles:

                but this is really off-topic for a Notepad++ forum

                No, sir. It is an opportunity to understand that a problem has many solutions, even if people tell you that it isn’t possible.

                By the way, the same problem can be solved much more easily on Linux with a bash script, but that’s another story.

                • Robin Cruise (in reply to guy038)

                  @guy038 result.html is just a blank file, placed in the same folder as File1.html and File2.html.

                  After you run the code in PowerShell, the output will appear in result.html.

                  Please test the PowerShell code. In result.html you will see the HTML structure from File2.html, but with the values extracted from File1.html.

                  • guy038

                    Hi, @robin-cruise,

                    OK, but could you provide the result.html file AFTER running the PowerShell script ?

                    BR

                    guy038

                    Sorry, my old XP SP3 machine is a bit out of date !

                    • Robin Cruise

                      @guy038 this is result.html after running the PowerShell code:

                      [image]

                      • Robin Cruise

                        The only problem with PowerShell is that not every regex that works in Notepad++ works there. It seems that PowerShell doesn’t like \K or \G, or replacements such as (?2\2).
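
Two of the Notepad++ features mentioned above have common workarounds in engines that lack them. Here is a minimal sketch in Python, whose standard `re` module also has no `\K` and no conditional replacement syntax: the prefix is kept in a capture group and written back (instead of `\K`), and a replacement function decides what to emit (instead of `(?2\2)`). The sample strings and function names are invented for illustration. As a side note, the .NET regex documentation does list a `\G` anchor, so that part of the remark may depend on how it was used.

```python
import re


def replace_after_prefix(text):
    # Notepad++ style:  SEARCH title="\K[^"]+   REPLACE New title
    # Without \K, capture the prefix and write it back in the replacement.
    return re.sub(r'(title=")[^"]+', r'\1New title', text)


def keep_group_if_present(text):
    # Notepad++ style:  SEARCH (<b>)|(<i>)   REPLACE (?2\2)
    # That deletes <b> and keeps <i>. A replacement function makes the
    # same choice explicitly.
    def choose(match):
        return match.group(2) if match.group(2) is not None else ""
    return re.sub(r'(<b>)|(<i>)', choose, text)


A = replace_after_prefix('<a title="Page 66">')
B = keep_group_if_present('<b>bold <i>italic')
print(A)
print(B)
```

The same two ideas (a capture group for the kept prefix, and a callback for the conditional part) should carry over to other engines that accept a function or script block as the replacement.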

                        • guy038

                          Hello, @robin-cruise, @alan-kilborn and All,

                          OMG, now I understand the whole story ! In fact, you did not express your needs correctly in your first post ! The File-2.html file is useless and we can simply go from the result_1.html file to the result.html file !

                          Indeed, in your first post, you said :

                          <ul id="myNavigation">
                              <li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>
                              <li><a href="https://my-website.com/page-2.html" title="Page 2">Page 2 (29)</a></li>
                              <li><a href="https://my-website.com/page-3.html" title="Page-3">Page 3 (11)</a></li>
                          </ul>
                          

                          MUST BECOME:

                          <div class="categories-name">
                             <a href="https://my-website.com/page-66.html" title="Page 66">
                             <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 66 <span>27</span> </p>
                            </a>
                          </div>
                          <div class="categories-name">
                             <a href="https://my-website.com/page-67.html" title="Page 67">
                             <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 67 <span>24</span> </p>
                             </a>
                          </div>
                          <div class="categories-name">
                             <a href="https://my-website.com/page-68.html" title="Page 68">
                             <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 68 <span>07</span> </p>
                             </a>
                          </div>
                          

                          can something like this be done with regex?

                          But… you SHOULD have written :

                          <ul id="myNavigation">
                              <li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>
                              <li><a href="https://my-website.com/page-2.html" title="Page 2">Page 2 (29)</a></li>
                              <li><a href="https://my-website.com/page-3.html" title="Page-3">Page 3 (11)</a></li>
                          </ul>
                          

                          MUST BECOME

                          <div class="categories-name">
                             <a href="https://my-website.com/page-1.html" title="Page 1">
                             <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 1 <span>34</span> </p>
                            </a>
                          </div>
                          <div class="categories-name">
                             <a href="https://my-website.com/page-2.html" title="Page 2">
                             <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 2 <span>29</span> </p>
                             </a>
                          </div>
                          <div class="categories-name">
                             <a href="https://my-website.com/page-3.html" title="Page-3">
                             <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page-3 <span>11</span> </p>
                             </a>
                          </div>
                          

                          can something like this be done with regex?

                          And yes, this transformation can, indeed, be solved with ONE regex S/R only !!!

                          What misled me was the fact that your output file contained values other than the ones indicated in the initial file (Page 66 <span>27</span>, Page 67 <span>24</span>, Page 68 <span>07</span>...), which are completely out of scope :-(( I looked in vain for a link between, for example, the strings "Page 1">Page 1 (34) and Page 66 <span>27</span> !


                          So, Robin, now, it’s fairly easy to get the right regex S/R !

                          From this initial text

                          ===========================================================================================================================================
                          <ul id="myNavigation">
                              <li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>
                              <li><a href="https://my-website.com/page-2.html" title="Page 2">Page 2 (29)</a></li>
                              <li><a href="https://my-website.com/page-3.html" title="Page-3">Page 3 (11)</a></li>
                          </ul>
                          ===========================================================================================================================================
                          

                          with the following regex S/R :

                          SEARCH (?-i)^\h*<li><a href="https://my\-website.com/(.+?)" title="(.+?)">.+?\((.+?)\)</a></li>|<(ul id="myNavigation"|/ul)>\R

                          REPLACE ?4:<div class="categories-name">\r\n\x20\x20\x20<a href="https://my-website.com/\1" title="\2">\r\n\x20\x20\x20<p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>\x20\2\x20<span>\3</span> </p>\r\n\x20\x20\x20</a>\r\n</div>

                          we do get the expected text, totally identical to your result.html contents ! Moreover, you do not need the PowerShell script any more !

                          ===========================================================================================================================================
                          <div class="categories-name">
                             <a href="https://my-website.com/page-1.html" title="Page 1">
                             <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 1 <span>34</span> </p>
                             </a>
                          </div>
                          <div class="categories-name">
                             <a href="https://my-website.com/page-2.html" title="Page 2">
                             <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 2 <span>29</span> </p>
                             </a>
                          </div>
                          <div class="categories-name">
                             <a href="https://my-website.com/page-3.html" title="Page-3">
                             <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page-3 <span>11</span> </p>
                             </a>
                          </div>
                          ===========================================================================================================================================
                          

                          Et voilà !


                          Notes :

                          • The groups used are :

                            • Group 1 : the part after website.com/ till the double quote, not included

                            • Group 2 : the title part

                            • Group 3 : the string between parentheses, before </a>

                            • Group 4 : either the literal string <ul id="myNavigation"> or the literal string </ul>, which must be deleted, in replacement

                          • The regex string \x20\x20\x20 occurs three times in the replacement regex and corresponds to the leading spaces needed in each <div ......</div> section. Change it as desired !
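
The single-pass S/R and its four groups can also be sketched outside Notepad++. The following Python version is an approximation, not the Boost regex itself: `[ \t]` stands in for `\h`, `\r?\n` for `\R`, a replacement function plays the role of the `?4:` conditional, and the output uses `\n` rather than `\r\n`. The names `SEARCH`, `TEMPLATE` and `replace` are illustrative only.

```python
import re

SAMPLE = '''<ul id="myNavigation">
    <li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>
    <li><a href="https://my-website.com/page-2.html" title="Page 2">Page 2 (29)</a></li>
    <li><a href="https://my-website.com/page-3.html" title="Page-3">Page 3 (11)</a></li>
</ul>
'''

# Same four groups as in the notes: path, title, count, and the <ul>/</ul> tag.
SEARCH = re.compile(
    r'^[ \t]*<li><a href="https://my-website\.com/(.+?)" title="(.+?)">'
    r'.+?\((.+?)\)</a></li>'
    r'|<(ul id="myNavigation"|/ul)>\r?\n',
    re.M,
)

TEMPLATE = (
    '<div class="categories-name">\n'
    '   <a href="https://my-website.com/{path}" title="{title}">\n'
    '   <p class="font-16 color-grey text-capitalize">'
    '<i class="fa fa-angle-right font-14 color-blue mr-1"></i>'
    ' {title} <span>{count}</span> </p>\n'
    '   </a>\n'
    '</div>'
)


def replace(match):
    # Group 4 matched: this is <ul ...> or </ul>, so delete it.
    if match.group(4) is not None:
        return ""
    # Otherwise rebuild the <li> line as a <div> section.
    return TEMPLATE.format(path=match.group(1), title=match.group(2),
                           count=match.group(3))


RESULT = SEARCH.sub(replace, SAMPLE)
print(RESULT)
```

The three leading spaces in `TEMPLATE` correspond to the `\x20\x20\x20` runs of the replacement and can be changed in the same way.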

                          Best Regards,

                          guy038

                          P.S. :

                          And, if you want to change the <li>.........</li> lines ONLY inside the <ul...............</ul> sections, use this second regex S/R, derived from the generic regex S/R discussed in other topics :

                          SEARCH (?s-i)(?:<ul id="myNavigation">|(?!\A)\G)(?:(?!</ul>).)*?\K^\h*<li><a href="https://my\-website.com/(.+?)" title="(.+?)">.+?\((.+?)\)</a></li>

                          REPLACE <div class="categories-name">\r\n\x20\x20\x20<a href="https://my-website.com/\1" title="\2">\r\n\x20\x20\x20<p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>\x20\2\x20<span>\3</span> </p>\r\n\x20\x20\x20</a>\r\n</div>


                          Now, finish the job by getting rid of the literal strings <ul id="myNavigation"> and </ul> with, simply :

                          SEARCH (?-i)<(ul id="myNavigation"|/ul)>\R

                          REPLACE Leave EMPTY

                          We need two consecutive regex S/R because the strings <ul id="myNavigation"> and </ul> are used as anchors in the first, generic S/R and must not be deleted during that first pass !
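
Restricting the change to the inside of the `<ul>` sections can be sketched in Python too, but with a different technique: Python's `re` has neither `\G` nor `\K`, so this sketch uses a nested substitution instead of the generic regex. An outer pattern finds each section, an inner pattern rewrites the `<li>` lines within it, and the wrapper tags disappear in the same pass. The sample text (including the `<li>` line placed outside the section) and all names are assumptions made for the example.

```python
import re

SAMPLE = '''<li><a href="https://my-website.com/outside.html" title="Outside">Outside (1)</a></li>
<ul id="myNavigation">
    <li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>
    <li><a href="https://my-website.com/page-2.html" title="Page 2">Page 2 (29)</a></li>
</ul>
'''

# Outer pattern: one whole <ul id="myNavigation"> ... </ul> section.
SECTION = re.compile(r'<ul id="myNavigation">\r?\n(.*?)</ul>\r?\n', re.S)
# Inner pattern: one <li> line, with the same three groups as before.
ITEM = re.compile(
    r'^[ \t]*<li><a href="https://my-website\.com/(.+?)" title="(.+?)">'
    r'.+?\((.+?)\)</a></li>',
    re.M,
)


def convert_item(match):
    path, title, count = match.groups()
    return (
        '<div class="categories-name">\n'
        f'   <a href="https://my-website.com/{path}" title="{title}">\n'
        '   <p class="font-16 color-grey text-capitalize">'
        '<i class="fa fa-angle-right font-14 color-blue mr-1"></i>'
        f' {title} <span>{count}</span> </p>\n'
        '   </a>\n'
        '</div>'
    )


def convert_section(match):
    # Only the text between <ul ...> and </ul> is rewritten; the wrapper
    # tags are dropped because they are not part of the returned string.
    return ITEM.sub(convert_item, match.group(1))


RESULT = SECTION.sub(convert_section, SAMPLE)
print(RESULT)
```

The `<li>` line outside the section is left untouched, which is the behaviour the anchored two-step S/R is meant to guarantee.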

                          • Alan Kilborn (in reply to guy038)

                            @guy038

                            Hi Guy… it is too funny.
                            Now, before solving a problem, you have to solve another problem: What the problem actually is!
                            It’s good that you have the talent and perseverance to do that; well, it’s good for the OP I suppose.

                            • Robin Cruise

                              @guy038 thank you, I knew that you could do it. I believe you are the best regex developer in the entire world.

                              I understood one thing: never say “it can’t be done”. Even if it can’t be done one way, you can always find another way. Be open to other programs.

                              Much more important: if you give someone a new perspective, that person will be able to see the problem clearly and find the solution much more easily.

                              So, my friends, thanks a lot. From now on, I know that many people also understand the power of PowerShell, and that it can often be used to modify files by integrating regexes.

                              • guy038

                                Hi, @robin-cruise and All,

                                You said :

                                I believe you are the best regex developer in the entire world.

                                I would say definitely not the best !

                                For instance, just think of this man here and here.

                                Jan Goyvaerts :

                                • Created the fantastic regular-expressions.info site and still maintains it !

                                • Developed all these valuable regex tools : PowerGREP, RegexBuddy, RegexMagic and EditPad Pro


                                This is just an example of the many developers, involved in regular expressions !

                                BR

                                guy038

                                • Robin Cruise

                                  Of course there are others, but only @guy038 helped me and the whole Notepad++ community with such hard-to-find solutions.

                                  And I am using Notepad++ and grepWin (and for batch processing I use a great tool named TextCrawler).

                                  The Community of users of the Notepad++ text editor.