    Regex: How to get off the connecting line from the title of a hyperlink?

    • Hellena Crainicu

      I have several lines with this kind of hyperlink:

      <p class="mb-40px"><a href="my-name-is-prince.html">My-name-is-prince</a></p>

      I want to use a regex to remove the hyphens (the connecting lines) from the link title.

      The Output should be:

      <p class="mb-40px"><a href="my-name-is-prince.html">My name is prince</a></p>

      • Hellena Crainicu @Hellena Crainicu

        I found the solution:

        FIND: (?-s)(\G(?!^)|html">)((?!</a).)*?\K[-]

        REPLACE BY: \x20
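
For readers who want to check the transformation outside Notepad++, here is a minimal Python sketch. It is not the regex above: Python's standard `re` module supports neither `\G` nor `\K`, so the sketch swaps in a callback that rewrites only the captured link text. The name `dashes_to_spaces` is hypothetical, and the pattern assumes that each link sits entirely on one line.

```python
import re

# Captures the opening tag, the link text and the closing tag separately.
LINK = re.compile(r'(<a href="[^"]*">)(.*?)(</a>)')

def dashes_to_spaces(line: str) -> str:
    """Replace dashes with spaces in the link text only (hypothetical helper)."""
    def fix(match: re.Match) -> str:
        opening, title, closing = match.groups()
        # The href value lives in `opening`, so its dashes are left untouched.
        return opening + title.replace("-", " ") + closing
    return LINK.sub(fix, line)

line = '<p class="mb-40px"><a href="my-name-is-prince.html">My-name-is-prince</a></p>'
print(dashes_to_spaces(line))
```

Capturing the three parts separately plays the role that `\K` plays in the Notepad++ regex: only the middle part is ever modified.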

        • guy038

          Hello, @hellena-crainicu and All,

          You said :

          I find a solution :

          FIND: (?-s)(\G(?!^)|html">)((?!</a).)*?\K[-]

          REPLACE BY: \x20

          I was a bit intrigued, so I tried to dig into your solution a little :

          • First, no need to place the dash between square brackets

          • Secondly, to be rigorous, it would be better to use the exact <a href=".......">........</a> definition and place it as the first alternative. In addition, if we wrap the alternatives in a non-capturing group carrying the case-sensitive modifier (?-i:...), this leads to the equivalent search regex :

          (?-s)(?-i:<a\x20href=".+?">|\G(?!^))((?!</a>).)*?\K-

          • Thirdly, as you’re using the (?-s) modifier, the dot cannot cross the EOL char(s) to reach the next line, so, after the last match on a line, the \G assertion will not be true on the following line. So, from the beginning of each line, we’ll have to find a <a href.... definition first. In this case, the negative look-ahead (?!</a>), which defines the end of the region, is useless, provided that no dash follows </a> on the same line

          So, your regex could be simplified as :

          (?-s)(?-i:<a\x20href=".+?">|\G(?!^)).*?\K-

          However, note that if you use the (?s) single-line modifier, you must use the negative look-ahead (?!</a>) to limit the action of your multi-line search :

          (?s)(?-i:<a\x20href=".+?">|\G(?!^))((?!</a>).)*?\K-


          Now, in the topic below, we already tried to normalize this kind of regex !

          https://community.notepad-plus-plus.org/topic/20728/changing-data-inside-xml-element/15?_=1645313706435

          • Let FR (Find Regex ) be the regex which defines the char, string or expression to be searched

          • Let RR (Replacement Regex ) be the regex which defines the char, string or expression which must replace the FR expression

          • Let BSR ( Begin Search-region Regex ) be the regex which defines the beginning of the area where the search for FR, must start

          • Let ESR ( End Search-region Regex) be the regex which defines, implicitly, the area where the search for FR, must end

          Then, the generic regex can be expressed :

          SEARCH (?-i:BSR|(?!\A)\G)(?s:(?!ESR).)*?\K(?-i:FR)

          REPLACE RR

          So I was curious to compare our previous syntax with yours, which is :

          SEARCH (?-i:BSR|\G(?!^))(?s:(?!ESR).)*?\K(?-i:FR)

          REPLACE RR
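
The BSR / ESR / FR / RR scheme can also be sketched in Python. This is only an approximation under stated assumptions: Python's standard `re` module has no `\G` and no `\K`, so the sketch first captures each region (from a BSR match up to the next ESR match, or up to the end of the text) and then replaces FR inside the captured body. The function name `replace_in_regions` is hypothetical, RR is treated as a literal string rather than as a replacement expression, and the `(?-i:...)` modifiers are left out because Python regexes are case-sensitive by default.

```python
import re

def replace_in_regions(text: str, bsr: str, esr: str, fr: str, rr: str) -> str:
    """Replace every FR match by the literal string RR, but only between a
    BSR match and the next ESR match (or the end of the text)."""
    # (?s) lets the region body span several lines, like (?s:(?!ESR).)*?
    region = re.compile(rf"(?s)(?P<begin>{bsr})(?P<body>(?:(?!{esr}).)*)")
    target = re.compile(fr)

    def fix(match: re.Match) -> str:
        # RR is inserted verbatim: no back-references, no escape processing.
        body = target.sub(lambda _m: rr, match.group("body"))
        return match.group("begin") + body

    return region.sub(fix, text)

sample = 'a-b <a href="x-y.html">My-\nname-is</a> c-d'
print(replace_in_regions(sample, r'<a href=".+?">', r"</a>", r"-", " "))
```

In this sketch, the dashes outside the link and inside the href value are left alone, because only the captured body is passed to the second substitution.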

          After some tests, I must say that your syntax \G(?!^), which can also be expressed as (?!^)\G, seems more accurate and practical than (?!\A)\G. Let me explain :


          When you perform a Replace All or a Mark All operation, you simply have to tick the Wrap around option to get the correct results / replacements !

          But, if you just use the Find Next button :

          • With the (?!\A)\G syntax, you need to move the caret to the very beginning of the file in order to get a correct match. ELSE, you may match some incorrect FR

          • With the (?!^)\G syntax, you need to move the caret to the beginning of any line in order to get a correct match. ELSE, any start from a column > 1 may match some incorrect FR

          In other words :

          • With the @hellena-crainicu syntax, associated to \G, if you are at beginning of any line, a first hit on the Find Next button will always give you a correct match

          • With our previous syntax, associated to \G, you must be at the very beginning of the file in order that a first hit on the Find Next button gives you a correct match

          To be convinced :

          • Select the Mark dialog ( Ctrl + M )

          • Untick the Wrap around option ( IMPORTANT )

          • Tick the Purge for each search option

          • Move the caret to the beginning, or to any other position, of the first line or of a subsequent line ( a FR part must be present in some lines to see the differences ! )

          • For each case, note all the matches after a click on the Mark All button, for both methods :

          (?-i:BSR|(?!\A)\G)(?s:(?!ESR).)*?\K(?-i:FR)    and    (?-i:BSR|(?!^)\G)(?s:(?!ESR).)*?\K(?-i:FR)

          Best Regards,

          guy038

          • Hellena Crainicu @guy038

            @guy038 THANKS

            • guy038

              Hi, @hellena-crainicu and All,

              Let me expand on my previous post. Here is a real example, based on the @hellena-crainicu problem !

              In this example, I supposed that @hellena-crainicu wanted to search for any dash symbol contained in the • region of the tag
              <a href="...........">••••••••••••••</a>, in a multi-line text, so using the (?s) single-line modifier.

              In a new tab, paste the 23-line text below :

              This-is
              --
              
              a-
              test
              
              <p class="mb-40px"><a href="my-na
              me-is-prince.html">My-
              name
              
              
              -
              is---pr
              ince</a></p>
              
              <p class="mb-40px"><a href="
              my-name-is-prince.
              html">M
              y-name
              --
              
              is-prince
              </a></p>
              

              Now, we must detect the differences between the two regexes :

              • Regex A : (?s)(?-i:<a\x20href=".+?">|\G(?!\A))((?!</a>).)*?\K- ( The used syntax, up to now )

              and

              • Regex B : (?s)(?-i:<a\x20href=".+?">|\G(?!^))((?!</a>).)*?\K- ( The @hellena-crainicu’s syntax )

              • Open the Mark dialog ( Ctrl + M )

              • Untick all options

              • Tick the Purge for each search AND Wrap around options

              • Select the Regular expression search mode

              • Click on the Mark All button

              => Message Mark: 9 matches in entire file, corresponding to the 9 dashes between the > and </a>, in the two multi-line blocks beginning with <p class. This is correct !

              In the same way, if the Wrap around is ticked, a replacement of each dash by a space char would correctly give the message Replace All: 9 occurrences were replaced in entire file


              Now, let’s see the differences when using the Mark dialog, with the Wrap around option unticked and the Purge for each search option still ticked

              Here are, below, some results depending on the caret’s position ( Line x, column y ), right before a click on the Mark All button :

                  •--------------------•--------------------•--------------------•--------------•--------------•----------------------------------•
                  |   Caret position   |      Regex  A      |      Regex  B      |   Regex  A   |   Regex  B   |           Observations           |
                  •--------------------•--------------------•--------------------•--------------•--------------•----------------------------------•
                  |  Line 1, column 1  |      9  matches    |      9  matches    |      OK      |      OK      |  Beginning of **file** and line  |
                  |  Line 1, column 2  |     17  matches    |     17  matches    |      ko      |      ko      |                                  |
                  •--------------------•--------------------•--------------------•--------------•--------------•----------------------------------•
                  |  Line 2, column 1  |     16  matches    |      9  matches    |      ko      |      OK      |  Beginning of line               |
                  |  Line 2, column 2  |     15  matches    |     15  matches    |      ko      |      ko      |                                  |
                  •--------------------•--------------------•--------------------•--------------•--------------•----------------------------------•
                  |  Line 3, column 1  |     14  matches    |      9  matches    |      ko      |      OK      |  Beginning of **empty** line     |
                  •--------------------•--------------------•--------------------•--------------•--------------•----------------------------------•
                  |  Line 4, column 1  |     14  matches    |      9  matches    |      ko      |      OK      |  Beginning of line               |
                  |  Line 4, column 2  |     14  matches    |     14  matches    |      ko      |      ko      |                                  |
                  •--------------------•--------------------•--------------------•--------------•--------------•----------------------------------•
                  |  Line 5, column 1  |     13  matches    |      9  matches    |      ko      |      OK      |  Beginning of line               |
                  |  Line 5, column 2  |     13  matches    |     13  matches    |      ko      |      ko      |                                  |
                  •--------------------•--------------------•--------------------•--------------•--------------•----------------------------------•
                  |  Line 6, column 1  |     13  matches    |      9  matches    |      ko      |      OK      |  Beginning of **empty** line     |
                  •--------------------•--------------------•--------------------•--------------•--------------•----------------------------------•
                  |  Line 7, column 1  |     13  matches    |      9  matches    |      ko      |      OK      |  Beginning of line               |
                  |  Line 7, column 2  |     13  matches    |     13  matches    |      ko      |      ko      |                                  |
                  •--------------------•--------------------•--------------------•--------------•--------------•----------------------------------•
              

              Note that the exact message is : Mark: xx matches from caret to end-of-file

              It’s easy to notice that the @hellena-crainicu syntax ( Regex B ) gives correct results more often than the previous one ( Regex A ), when the Wrap around option is not checked ;-))
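
The caret-position table can be explored with a small Python emulation. This is a sketch, not Notepad++ itself, and it rests on several assumptions: LF line endings, `\G` being true only at the caret position for the first attempt and at the end of the previous match afterwards, and the first alternative being tried before the `\G` alternative at every position. The names `offset` and `count_marks` are hypothetical. Under these assumptions, the counts should agree with the rows of the table.

```python
import re

LINES = [
    "This-is", "--", "", "a-", "test", "",
    '<p class="mb-40px"><a href="my-na', 'me-is-prince.html">My-', "name", "", "",
    "-", "is---pr", "ince</a></p>", "",
    '<p class="mb-40px"><a href="', "my-name-is-prince.", 'html">M', "y-name",
    "--", "", "is-prince", "</a></p>",
]
TEXT = "\n".join(LINES) + "\n"   # assumption: LF line endings

# Each alternative of the search regex, followed by the common tail ((?!</a>).)*?-
VIA_BSR = re.compile(r'(?s)<a href=".+?">(?:(?!</a>).)*?-')
VIA_G = re.compile(r'(?s)(?:(?!</a>).)*?-')

def offset(line: int, column: int) -> int:
    """Convert a 1-based ( line, column ) caret position into a string index."""
    return sum(len(text) + 1 for text in LINES[:line - 1]) + column - 1

def count_marks(start: int, forbid_line_start: bool) -> int:
    r"""Count the dashes marked from `start` to the end, Wrap around unticked.

    forbid_line_start=False emulates \G(?!\A) ( Regex A ),
    forbid_line_start=True emulates \G(?!^) ( Regex B ).
    """
    pos = anchor = start          # `anchor` is where \G is assumed to be true
    count = 0
    while pos < len(TEXT):
        match = VIA_BSR.match(TEXT, pos)
        if match is None and pos == anchor:
            at_file_start = pos == 0
            at_line_start = pos == 0 or TEXT[pos - 1] == "\n"
            blocked = at_line_start if forbid_line_start else at_file_start
            if not blocked:
                match = VIA_G.match(TEXT, pos)
        if match:
            count += 1
            pos = anchor = match.end()   # the next search starts after the dash
        else:
            pos += 1
    return count

for line, column in [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1)]:
    start = offset(line, column)
    print(line, column, count_marks(start, False), count_marks(start, True))
```

Each printed row shows the line, the column, then the emulated count for Regex A and for Regex B.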

              Best Regards

              guy038

              • guy038

                Hi, @hellena-crainicu and All,

                I did additional tests and, sorry Hellena, but using your negative look-ahead (?!^), instead of (?!\A), may miss matches in some cases, too !

                Indeed, imagine that the searched string would just be the EOL char(s) with the following regex :

                SEARCH (?s)(?-i:<a\x20href=".+?">|\G(?!^))((?!</a>).)*?\K\R

                Then, the part \G(?!^)((?!</a>).)*? could never match before the next line-ending chars : after each EOL match, the \G position is the beginning of a line, which is exactly the location forbidden by the \G(?!^) syntax !


                Finally, the present (?!\A) syntax is preferable. We do not even need to bother about the status of the Wrap around option. Just ONE rule :

                • Move to the very beginning of the current file, with the Ctrl + Home shortcut, before applying this specific S/R !

                You may test the regex :

                (?s)(?-i:<a\x20href=".+?">|\G(?!\A))((?!</a>).)*?\K\R    ( and your version (?s)(?-i:<a\x20href=".+?">|\G(?!^))((?!</a>).)*?\K\R )

                against the 23-line text of my previous post, to see the obvious differences !
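
The line-ending case can be emulated in Python in the same spirit. Again, this is only a sketch under assumptions: LF line endings ( so `\R` is replaced by `\n` ), a search started at the very beginning of the file, and `\G` being true only at the end of the previous match. The name `count_eol_matches` is hypothetical. With the `(?!^)` variant, the emulation stops after the first line ending that follows each opening tag, which is the behaviour described above.

```python
import re

LINES = [
    "This-is", "--", "", "a-", "test", "",
    '<p class="mb-40px"><a href="my-na', 'me-is-prince.html">My-', "name", "", "",
    "-", "is---pr", "ince</a></p>", "",
    '<p class="mb-40px"><a href="', "my-name-is-prince.", 'html">M', "y-name",
    "--", "", "is-prince", "</a></p>",
]
TEXT = "\n".join(LINES) + "\n"   # assumption: LF endings, so \R becomes \n

# Each alternative of the search regex, followed by the tail ((?!</a>).)*?\n
VIA_BSR = re.compile(r'(?s)<a href=".+?">(?:(?!</a>).)*?\n')
VIA_G = re.compile(r'(?s)(?:(?!</a>).)*?\n')

def count_eol_matches(forbid_line_start: bool) -> int:
    r"""Count the line endings matched inside the link regions.

    forbid_line_start=False emulates \G(?!\A), True emulates \G(?!^).
    """
    pos = anchor = 0              # search started with Ctrl + Home
    count = 0
    while pos < len(TEXT):
        match = VIA_BSR.match(TEXT, pos)
        if match is None and pos == anchor:
            # (?!\A) forbids index 0 only; (?!^) forbids every line start.
            blocked = pos == 0 or (forbid_line_start and TEXT[pos - 1] == "\n")
            if not blocked:
                match = VIA_G.match(TEXT, pos)
        if match:
            count += 1
            pos = anchor = match.end()   # right after the matched line ending
        else:
            pos += 1
    return count

print("(?!\\A) variant:", count_eol_matches(False))
print("(?!^) variant:", count_eol_matches(True))
```

The key point is that the position right after a matched line ending is always the beginning of a line, so the `(?!^)` variant can never continue its chain there.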

                BR

                guy038

                P.S. I’m about to send an e-mail to @peterjones to know where this specific S/R should be placed. Probably, at this location :

                Developing generic regex sequences

                • Vasile Caraus @guy038

                  @guy038 I use https://chat.openai.com/ to find different solutions. ChatGPT learns everything. In about 5 seconds, it generates another 4 solutions.

                  I just gave it your regex as an example and asked ChatGPT to write me another 4 solutions. It is the most intelligent tool ever. Artificial intelligence.

                  Search: (?-s)(\G(?!^)|html">)((?!</a>).)*?\K-
                  Replace: \x20

                  Search: (?-s)(\G(?!^)|html">)((?!</a>).)*?\K-
                  Replace: \x20

                  Search: (?-s)(\G(?!^)|html">)((?!</a>).)*?\K-
                  Replace: \x20

                  Search: (?-s)(\G(?!^)|html">)((?!</a>).)*?\K-
                  Replace: \x20

                  The Community of users of the Notepad++ text editor.