Regex: How to remove the hyphens from the title of a hyperlink?

Hellena Crainicu

I have several lines with hyperlinks of this kind:

<p class="mb-40px"><a href="my-name-is-prince.html">My-name-is-prince</a></p>

I want to use a regex to remove the hyphens from the link title.

The Output should be:

<p class="mb-40px"><a href="my-name-is-prince.html">My name is prince</a></p>

Hellena Crainicu

I found the solution:

FIND: (?-s)(\G(?!^)|html">)((?!</a).)*?\K[-]

REPLACE BY: \x20
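For readers who want to check the expected output outside Notepad++, here is a minimal Python sketch. It is not the regex above: Python's standard `re` module supports neither `\G` nor `\K`, so the sketch swaps in a replacement callback instead. The names `LINK` and `dehyphenate_link_text` are made up for this illustration, and the pattern assumes each link sits on a single line, as in the question.

```python
import re

# Hypothetical helper pattern: group 1 = opening tag, group 2 = link text,
# group 3 = closing tag. Assumes the whole link is on one line.
LINK = re.compile(r'(<a href="[^"]*">)(.*?)(</a>)')

def dehyphenate_link_text(line: str) -> str:
    """Replace each dash by a space, but only inside the link text."""
    return LINK.sub(
        lambda m: m.group(1) + m.group(2).replace("-", " ") + m.group(3),
        line,
    )

line = '<p class="mb-40px"><a href="my-name-is-prince.html">My-name-is-prince</a></p>'
print(dehyphenate_link_text(line))
```

The dashes inside the href value and inside the class attribute are left alone, because only group 2 is rewritten.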

guy038

Hello, @hellena-crainicu and All,

You said :

I found the solution:

FIND: (?-s)(\G(?!^)|html">)((?!</a).)*?\K[-]

REPLACE BY: \x20

I was a bit intrigued, so I tried to dig into your solution a little.

First, there is no need to place the dash between square brackets.
Secondly, to be rigorous, it would be better to use the exact <a href=".......">........</a> definition and place it as the first alternative. In addition, if we wrap the alternation in a non-capturing group carrying the case-sensitive modifier (?-i:...), this leads to this equivalent search regex :

(?-s)(?-i:<a\x20href=".+?">|\G(?!^))((?!</a>).)*?\K-

Thirdly, as you’re using the (?-s) modifier, the dot cannot cross the EOL char(s), so a match continued from the \G assertion never reaches the next line. So, on each line, we’ll have to find a <a href.... definition first. In this case, it’s useless to add the negative look-ahead (?!</a>) that marks the end of the region.

So, your regex could be simplified as :

(?-s)(?-i:<a\x20href=".+?">|\G(?!^)).*?\K-

However, note that if you use the (?s) single-line modifier, you must use the negative look-ahead (?!</a>) to limit the action of your multi-line search :

(?s)(?-i:<a\x20href=".+?">|\G(?!^))((?!</a>).)*?\K-

Now, in the topic below, we have already tried to standardize this kind of regex :

https://community.notepad-plus-plus.org/topic/20728/changing-data-inside-xml-element/15?_=1645313706435

Let FR (Find Regex) be the regex which defines the char, string or expression to be searched

Let RR (Replacement Regex) be the regex which defines the char, string or expression which must replace the FR expression

Let BSR (Begin Search-region Regex) be the regex which defines the beginning of the area where the search for FR must start

Let ESR (End Search-region Regex) be the regex which defines, implicitly, the area where the search for FR must end

Then, the generic regex can be expressed :

SEARCH (?-i:BSR|(?!\A)\G)(?s:(?!ESR).)*?\K(?-i:FR)

REPLACE RR
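To make the roles of BSR, ESR, FR and RR concrete, here is a rough Python approximation of the generic S/R. It is only a sketch under assumptions: Python's standard `re` module has no `\G` or `\K`, so the function (a hypothetical name, `replace_in_regions`) scans each region explicitly, treats RR as a literal string rather than a replacement expression, and ignores edge cases such as an FR match that would overlap the ESR.

```python
import re

def replace_in_regions(text, bsr, esr, fr, rr):
    """Replace every FR match located between a BSR match and the next ESR match.

    Hypothetical helper: approximates the generic S/R with an explicit scan.
    RR is inserted literally; BSR must not match the empty string.
    """
    begin, end, find = re.compile(bsr), re.compile(esr), re.compile(fr)
    out, pos = [], 0
    while True:
        b = begin.search(text, pos)
        if b is None:
            out.append(text[pos:])          # no further region: copy the rest
            return "".join(out)
        if b.end() == b.start():
            raise ValueError("BSR must not match the empty string")
        e = end.search(text, b.end())
        stop = e.start() if e else len(text)  # no ESR: region runs to end of text
        out.append(text[pos:b.end()])         # text before the region, BSR included
        out.append(find.sub(lambda m: rr, text[b.end():stop]))
        pos = stop

sample = 'x-y <a href="a-b.html">c-d</a> e-f'
print(replace_in_regions(sample, r'<a\x20href=".+?">', r'</a>', r'-', ' '))
```

Only the dash between the opening tag and </a> is replaced. For a multi-line search, a leading `(?s)` can be added to the BSR so that its dot also matches line breaks.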

So I was curious to compare our previous syntax with yours, which is :

SEARCH (?-i:BSR|\G(?!^))(?s:(?!ESR).)*?\K(?-i:FR)

REPLACE RR

After some tests, I must say that your syntax \G(?!^), which can also be expressed as (?!^)\G, seems more accurate and practical than (?!\A)\G. Let me explain :

When you perform a Replace All or a Mark All operation, you simply have to tick the Wrap around option to get the correct results / replacements !

But, if you just use the Find Next button :

With the (?!\A)\G syntax, you need to move the caret to the very beginning of the file in order to get a correct match; otherwise, you may match an incorrect FR
With the (?!^)\G syntax, you need to move the caret to the beginning of any line in order to get a correct match; otherwise, a start from any column > 1 may match an incorrect FR

In other words :

With the @hellena-crainicu syntax, associated with \G, if you are at the beginning of any line, a first hit on the Find Next button will always give you a correct match
With our previous syntax, associated with \G, you must be at the very beginning of the file for a first hit on the Find Next button to give you a correct match

To be convinced :

Select the Mark dialog ( Ctrl + M )
Untick the Wrap around option ( IMPORTANT )
Tick the Purge for each search option
Move the caret to the beginning, or elsewhere, of the first line or of a subsequent line ( an FR part must be present in some lines to see the differences ! )
For each case, note all the matches after a click on the Mark All button, for both methods :

(?-i:BSR|(?!\A)\G)(?s:(?!ESR).)*?\K(?-i:FR) and (?-i:BSR|(?!^)\G)(?s:(?!ESR).)*?\K(?-i:FR)

Best Regards,

guy038

Hellena Crainicu

@guy038 THANKS

guy038

Hi, @hellena-crainicu and All,

Let me expand on my previous post. Here is a real example, based on @hellena-crainicu’s problem !

In this example, I assumed that @hellena-crainicu wanted to search for any dash symbol contained in the • region of the tag
<a href="...........">••••••••••••••</a>, in a multi-line text, hence the (?s) single-line modifier.

In a new tab, paste the 23-line text below :

This-is
--

a-
test

<p class="mb-40px"><a href="my-na
me-is-prince.html">My-
name


-
is---pr
ince</a></p>

<p class="mb-40px"><a href="
my-name-is-prince.
html">M
y-name
--

is-prince
</a></p>

Now, we must detect the differences between the two regexes :

Regex A : (?s)(?-i:<a\x20href=".+?">|\G(?!\A))((?!</a>).)*?\K- ( The syntax used up to now )

and

Regex B : (?s)(?-i:<a\x20href=".+?">|\G(?!^))((?!</a>).)*?\K- ( @hellena-crainicu’s syntax )

Open the Mark dialog ( Ctrl + M )
Untick all options
Tick the Purge for each search AND Wrap around options
Select the Regular expression search mode
Click on the Mark All button

=> Message Mark: 9 matches in entire file, corresponding to the 9 dashes between the > and </a>, in the two multi-line blocks beginning with <p class. This is correct !

In the same way, if the Wrap around option is ticked, a replacement of each dash by a space char would correctly give the message Replace All: 9 occurrences were replaced in entire file
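As a cross-check outside Notepad++, the count of 9 can be reproduced with a short Python sketch. It does not run Regex A or Regex B (Python's standard `re` module lacks `\G` and `\K`); it simply extracts the text between each opening tag and </a> and counts the dashes there. `SAMPLE` is the 23-line text above joined with `\n`, and `LINK_TEXT` is a name made up for this illustration.

```python
import re

# The 23-line sample text, one list item per line.
SAMPLE = "\n".join([
    "This-is", "--", "", "a-", "test", "",
    '<p class="mb-40px"><a href="my-na',
    'me-is-prince.html">My-',
    "name", "", "", "-", "is---pr", "ince</a></p>", "",
    '<p class="mb-40px"><a href="',
    "my-name-is-prince.", 'html">M', "y-name", "--", "", "is-prince", "</a></p>",
])

# DOTALL lets the link text span several lines; [^"]* already crosses line breaks.
LINK_TEXT = re.compile(r'<a href="[^"]*">(.*?)</a>', re.DOTALL)

per_link = [m.group(1).count("-") for m in LINK_TEXT.finditer(SAMPLE)]
print(per_link, sum(per_link))
```

The per-link counts add up to the 9 matches reported by the Mark dialog; every other dash in the sample lies outside a link text.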

Now, let’s see the differences when using the Mark dialog, with the Wrap around option unticked and the Purge for each search option still ticked

Here is, below, some results depending on the caret’s position ( Line x, column y ), right before a click on the Mark All button :

    •--------------------•--------------------•--------------------•--------------•--------------•----------------------------------•
    |   Caret position   |      Regex  A      |      Regex  B      |   Regex  A   |   Regex  B   |           Observations           |
    •--------------------•--------------------•--------------------•--------------•--------------•----------------------------------•
    |  Line 1, column 1  |      9  matches    |      9  matches    |      OK      |      OK      |  Beginning of **file** and line  |
    |  Line 1, column 2  |     17  matches    |     17  matches    |      ko      |      ko      |                                  |
    •--------------------•--------------------•--------------------•--------------•--------------•----------------------------------•
    |  Line 2, column 1  |     16  matches    |      9  matches    |      ko      |      OK      |  Beginning of line               |
    |  Line 2, column 2  |     15  matches    |     15  matches    |      ko      |      ko      |                                  |
    •--------------------•--------------------•--------------------•--------------•--------------•----------------------------------•
    |  Line 3, column 1  |     14  matches    |      9  matches    |      ko      |      OK      |  Beginning of **empty** line     |
    •--------------------•--------------------•--------------------•--------------•--------------•----------------------------------•
    |  Line 4, column 1  |     14  matches    |      9  matches    |      ko      |      OK      |  Beginning of line               |
    |  Line 4, column 2  |     14  matches    |     14  matches    |      ko      |      ko      |                                  |
    •--------------------•--------------------•--------------------•--------------•--------------•----------------------------------•
    |  Line 5, column 1  |     13  matches    |      9  matches    |      ko      |      OK      |  Beginning of line               |
    |  Line 5, column 2  |     13  matches    |     13  matches    |      ko      |      ko      |                                  |
    •--------------------•--------------------•--------------------•--------------•--------------•----------------------------------•
    |  Line 6, column 1  |     13  matches    |      9  matches    |      ko      |      OK      |  Beginning of **empty** line     |
    •--------------------•--------------------•--------------------•--------------•--------------•----------------------------------•
    |  Line 7, column 1  |     13  matches    |      9  matches    |      ko      |      OK      |  Beginning of line               |
    |  Line 7, column 2  |     13  matches    |     13  matches    |      ko      |      ko      |                                  |
    •--------------------•--------------------•--------------------•--------------•--------------•----------------------------------•

Note that the exact message is : Mark: xx matches from caret to end-of-file

It’s easy to notice that the @hellena-crainicu syntax ( Regex B ) gives correct results more often than the previous one ( Regex A ), when the Wrap around option is not checked ;-))

Best Regards

guy038

guy038

Hi, @hellena-crainicu and All,

I did additional tests and, sorry Hellena, using your negative look-ahead (?!^) instead of (?!\A) may miss matches in some cases, too !

Indeed, imagine that the searched string is just the EOL char(s), with the following regex :

SEARCH (?s)(?-i:<a\x20href=".+?">|\G(?!^))((?!</a>).)*?\K\R

Then, the part \G(?!^)((?!</a>).)*? could never match right after a previous match of line-ending chars : the range after \G would have to start at the beginning of a line, which is exactly what the \G(?!^) syntax forbids !
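The difference between the two anchors can be seen in a tiny Python sketch. This is only an analogy: Python's standard `re` module has no `\G`, so the sketch just lists where `^` (in multi-line mode) and `\A` can match, and compares those positions with the end of an EOL match, which is where `\G` would sit in Notepad++. The two-line `text` is an invented example.

```python
import re

text = "ab\ncd"  # invented two-line example, no trailing line break

# Positions where each anchor can match.
line_starts = [m.start() for m in re.finditer(r"^", text, re.MULTILINE)]
file_starts = [m.start() for m in re.finditer(r"\A", text)]

# End of the first EOL match: the spot a following \G would be anchored to.
eol_end = re.search(r"\n", text).end()

print(line_starts, file_starts, eol_end)
print(eol_end in line_starts)  # True: (?!^) would reject this position
print(eol_end in file_starts)  # False: (?!\A) would accept it
```

Right after an EOL match, the current position is always the beginning of a line but never the beginning of the file, which is why (?!^) blocks the continuation and (?!\A) does not.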

In the end, the existing (?!\A) syntax is preferable. We do not even need to bother about the state of the Wrap around option. Just ONE rule :

Move to the very beginning of the current file, with the Ctrl + Home shortcut, before applying this specific S/R !

You may test the regex :

(?s)(?-i:<a\x20href=".+?">|\G(?!\A))((?!</a>).)*?\K\R ( and your version (?s)(?-i:<a\x20href=".+?">|\G(?!^))((?!</a>).)*?\K\R )

against the 23-line text of my previous post, to see the obvious differences !

BR

guy038

P.S. I’m about to send an e-mail to @peterjones to ask where this specific S/R should be placed. Probably, at this location :

Developing generic regex sequences

Vasile Caraus

@guy038 I use https://chat.openai.com/ to find different solutions. ChatGPT learns everything. In about 5 seconds it generates another 4 solutions.

I just gave it your regex as an example and asked ChatGPT to write me another 4 solutions. It is the most intelligent tool ever. Artificial intelligence.

SEARCH: (?-s)(\G(?!^)|html">)((?!</a>).)*?\K-
REPLACE: \x20
