Regex: Select only the first instance of search results / first match

PeterJones

You’ve already gotten the freebie from me on this question. You need to put more thought and effort into it.

Did you know, when I am helping anyone in this forum by coming up with a regular expression for their problem, that I don’t automatically know the solution off the top of my head? Do you know what I do? I break the problem down into little pieces, then translate each of those pieces into regex syntax, then I try the regex out; if it doesn’t work, I try to figure out why, and see if I can tweak my initial guess until it does work.

This same process will work for you, if you give it a try. I will even give you a boost, by stating your problem in the little pieces I would initially try:

find <tr> to </tr>, without any contained <tr>
followed by anything that’s not a <tr>
until the end of the file

Also, although it’s not highly Notepad++ related: if you are working on a large website, where you’re going to be frequently changing the template (I assume you are changing content in boilerplate that surrounds your HTML), then I highly recommend a system with a templating language, often implemented as a CMS (content management system), so that you don’t have to be modifying the same code multiple times. In the long run, that will be more efficient than coming up with a new regex for every template change.

I hope you take to heart the advice I’ve given you here.

Vasile Caraus

@PeterJones it is easy for a developer to understand code, but is not easy for a painter to understand the code… :)

Alan Kilborn

@Vasile-Caraus

it is easy for a developer to understand code, but is not easy for a painter to understand the code…

If you can’t stand the heat, get out of the kitchen.

Q: What’s the meaning of the phrase ‘If you can’t stand the heat, get out of the kitchen’?

A: Don’t persist with a task if the pressure of it is too much for you. The implication being that, if you can’t cope, you should leave the work to someone who can.

It’s getting embarrassing for you.
It’s almost an uncomfortable thing to witness.
:-(

PeterJones

@Vasile-Caraus said in Regex: Select only the first instance of search results / first match:

@PeterJones it is easy for a developer to understand code, but is not easy for a painter to understand the code… :)

The whole point of CMS is to make it easier for painters to make a website, almost literally. If you cannot learn the CMS, then you should probably hire an expert.

Conversely, to your statement, if I go to my painter friend and ask him to paint me a picture to hang on the wall, he might do it for free one time, but after that, he’ll tell me “either learn to paint yourself, or pay me”.

You know enough about regex to come up with some guesses, so you already know a lot of the code, and you have been pointed to the documentation to be able to learn more. Try to piece together the bits you know in the right order for the problem you have. I have given you hints as to the right order. But at some point, you’re going to have to choose to learn, or find someone you can pay to do the development for your website.

Because expecting us to be your personal free regular-expression writers for 4.5 years, without ever contributing anything else to this forum, is … Never mind.

Good luck.

Terry R

@Vasile-Caraus said in Regex: Select only the first instance of search results / first match:

your (?s)\b<tr>.+?</tr>\b is not working :(

Now that I’m on a PC, I can see that your (recent) original post, which stated that your regex was “finding both <tr>…</tr>”, was most likely false. Copying the examples, I see that the \b actually prevented the regex from matching the example text at all.

So my first solution, adding a ? to your regex, would have fixed the issue IF your regex had actually been capturing too much. My later suggestion to remove the \b was in fact the correct move and should have resolved the issue. However, your statement about capturing only the “first” “<tr>…</tr>” was a bit troubling, as you didn’t have the \A anchor that @PeterJones stated was needed. I assumed (possibly wrongly) that you had a large HTML file in which you wanted to find (and replace?) the first “<tr>…</tr>”, and that using ONLY the “Find Next” or “Replace” button with the cursor at the start of the open file would have found the first set.
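Terry's observation about \b can be checked outside Notepad++ with a small sketch. It uses Python's re module, which is an assumption (Notepad++ uses the Boost engine, but \b follows the same rule in this case), and the sample text is hypothetical.

```python
import re

# The regex from the quoted post, and the same regex without the \b anchors.
pattern_with_b = r'(?s)\b<tr>.+?</tr>\b'
pattern_without_b = r'(?s)<tr>.+?</tr>'

# Hypothetical sample: each row is on its own line, so "<" follows a newline.
text = "<table>\n<tr>first</tr>\n<tr>second</tr>\n</table>\n"

# \b needs a word character on exactly one side of the position.
# "\n" and "<" are both non-word characters, so there is no word
# boundary in front of "<tr>", and the whole regex fails.
with_b = re.search(pattern_with_b, text)
without_b = re.search(pattern_without_b, text)

print(with_b)              # None
print(without_b.group(0))  # <tr>first</tr>
```

With the \b removed, a single search from the top of the text already stops at the first row, which is why the cursor position mattered in the discussion above.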

I guess I have learnt a valuable lesson: don’t provide answers when not able to independently confirm the OP’s statements. I was on an Android tablet late in the evening and about to go to bed when I saw this post, and I assumed it was an easy fix given the statement about it selecting too much text.

Terry

PS thanks Alan for coming to my rescue ;-))

Vasile Caraus

@PeterJones said in Regex: Select only the first instance of search results / first match:

<tr>

If someone knows how to select only the last instance in my request, please write the solution, not for me, but for the 36k people who have read this topic and those who will read it from now on. Don’t do it for me; do it for others who search for this solution.

PeterJones

Future readers,

Since there historically have been a lot of reads on this discussion, it will be good to have both the “first match” and “last match” versions of the <tr> question posed earlier, since it was brought up. Also, I had forgotten that there was an advanced concept needed for the find-the-last that wasn’t needed in the regex for find-the-first, which even after reading the docs isn’t obvious. So for teaching purposes, here is how I went about solving the problem.

I started by saying, “I want from <tr> to </tr>, then as few characters as possible between the </tr> and the end.” I knew that wasn’t enough, but it was a starting point for my regex. I translated that into (?s)<tr>.*?</tr>.*?\z – which searches for literal text, then as few as possible of any character, then literal text, then as few as possible of any character followed by end of file. That gets you close, but isn’t quite there, because “as few as possible” doesn’t guarantee that it won’t contain other <tr>: the regex engine accepts the leftmost starting point that can succeed, so the match begins at the first <tr> and the lazy .*? simply keeps growing until it reaches the end of the file.
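To see why this starting-point regex grabs too much, here is a minimal sketch in Python. It rests on two assumptions: Python's re module stands in for the Boost engine used by Notepad++, and Python's \Z replaces \z as the "very end of the text" anchor. The three-row sample is hypothetical.

```python
import re

# Hypothetical three-row sample.
text = (
    "<p>intro</p>\n"
    "<tr>first</tr>\n"
    "<tr>second</tr>\n"
    "<tr>last</tr>\n"
    "<p>outro</p>\n"
)

# Starting-point regex; \Z is Python's spelling of the end-of-text anchor.
naive = re.compile(r'(?s)<tr>.*?</tr>.*?\Z')

m = naive.search(text)

# The engine tries the leftmost start first. The lazy .*? is allowed to
# grow across other rows, so the attempt at the FIRST <tr> succeeds.
print(m.start() == text.index('<tr>'))  # True
print(m.group(0).count('<tr>'))         # 3
```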

There’s a useful idiom of “multiple characters in a row, as long as they don’t contain some unwanted string”. It actually took me a while to re-find this, because I lost my bookmark (and don’t use it often): ((?!UNWANTED).)*, which can be described as: do a lookahead (?!...) where the text after here cannot match what’s inside here – in this case, the literal UNWANTED, but it could be an arbitrary regex itself – but that lookahead doesn’t consume any characters, so now we want to match exactly one character (I might call that, “capture one character if the UNWANTED lookahead isn’t found”); repeat this sequence 0 or more times. For the purposes of the dummy example proposed by the previous questioner, ((?!<tr>).)* says “find as many characters as you can, as long as they don’t contain <tr>”.
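A small sketch of the idiom on its own may help. It is written for Python's re module (an assumption; the construct itself is the same in the Boost engine), and the input strings are made up for illustration.

```python
import re

# "Any run of characters that does not contain <tr>".
# (?s) lets the dot match newlines as well.
no_tr_run = re.compile(r'(?s)((?!<tr>).)*')

# The run stops right before the unwanted string.
m = no_tr_run.match("abc\ndef<tr>ghi")
print(repr(m.group(0)))  # 'abc\ndef'

# Where <tr> starts immediately, the lookahead fails at once,
# so the run is empty (zero repetitions are allowed by *).
m_empty = no_tr_run.match("<tr>ghi")
print(repr(m_empty.group(0)))  # ''
```

Note that the lookahead is tested before every single character, which is what makes the run stop exactly at the unwanted string.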

We can use this in place of both of the .* in the regex above: (?s)<tr>((?!<tr>).)*?</tr>((?!<tr>).)*?\z. This gets us really close…

But we’ve matched all of the text from the beginning of the final <tr> to the end of the document. That will make it harder to replace just the contents of the <tr>...</tr> pair. It could be done with groups, but in this nested mass of parentheses, getting the right group number will be confusing. Instead, I will convert the first <tr> to a lookbehind (?<=<tr>), which says that <tr> must come immediately before our match, but not be in the match. And I convert the ending </tr> as well as everything that comes after that into a lookahead, so it’s not included in our match, either. That brings our expression to (?s)(?<=<tr>)((?!</tr>).)*?(?=</tr>((?!<tr>).)+?\z) – which, at first glance, looks really confusing; but when it was built up step-by-step, it’s not so bad.

If I add in the (?x) (or, (?sx) to combine it), I can add some space and document it:

(?xs)   (?<=<tr>)      ((?!</tr>).)*?      (?=</tr>((?!<tr>).)+?\z)
        ^^^^^^^^^      ^^^^^^^^^^^^^^      ^^^^^^^^^^^^^^^^^^^^^^^^
        |              |                   `- this must come after our match
        |              `- this is what we really want to match; 
        |                 the </tr> is literal, the rest is syntax
        `- this must come before our match; the <tr> is literal, 
           the rest is syntax

With this FIND expression, what you put in REPLACE just needs to be the literal new content for your final <tr> entry.

so with the text

<other>stuff</other>
<tr>first</tr>
<tr>second</tr>
<tr>third</tr>
<tr>4</tr>
<tr>5</tr>
<tr>6</tr>
<tr>last</tr>
<other>stuff</other>

FIND NEXT will just select last, and if you do a REPLACE (or REPLACE ALL), it will just replace that content.
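For readers who want to try this outside Notepad++, here is a rough Python equivalent. It assumes Python's re module, where \Z plays the role of \z; the lookbehind and lookaheads are otherwise written exactly as in the FIND expression above. The replacement text NEW is arbitrary.

```python
import re

text = """<other>stuff</other>
<tr>first</tr>
<tr>second</tr>
<tr>third</tr>
<tr>4</tr>
<tr>5</tr>
<tr>6</tr>
<tr>last</tr>
<other>stuff</other>
"""

# Same expression as above, with Python's \Z in place of \z.
last_tr = re.compile(r'(?s)(?<=<tr>)((?!</tr>).)*?(?=</tr>((?!<tr>).)+?\Z)')

m = last_tr.search(text)
print(m.group(0))  # last

# Only one position satisfies both lookarounds, so replacing every
# match still changes just the final row.
result = last_tr.sub('NEW', text)
print(result.count('NEW'))  # 1
```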

Personally, I would not expect a fresh newbie to regexes to get very far on this problem, but someone who has been using regular expressions for a few years, I would expect to get at least as far as (?s)<tr>.*?</tr>.*?\z, so their question would be “I want to find the last <tr>...</tr> pair in a document and replace its contents; I tried (?s)<tr>.*?</tr>.*?\z because I thought that would find as little as possible between the open and close tag, and as little as possible between the close tag and the end of file, but it’s still grabbing all the <tr> pairs rather than just the last one; can you help me find what I’m doing wrong?”.

It all looks difficult… but the more you use regular expressions, and the more you try to figure out new combinations of the pieces you know, the more you will understand how the bits and pieces work together. Every time you come across a need for a new complicated search-and-replace expression, start with the pieces you know, look at the docs, try to add in the new bits you are trying to learn; ask specific questions (“here’s what I tried and why, but it didn’t work” or “here’s what I’ve tried so far, but I don’t know how to edit this regex so that XXXX” (where XXXX is specific) is infinitely better than “here’s my text; make a regex for me”), and every time you learn something new, put it in your mental bag of tricks. (Sometimes, even for people like me who use regex a lot, the bag of tricks overflows, and you’ll have to re-figure out something you’ve done before; sometimes, the end result will look the same, but other times it will end up looking quite different… which is good, because you’ve learned a new way of doing it that you didn’t know before… and maybe this one will stick with you better than the one you forgot before.)

Alan Kilborn

@PeterJones

Nice treatment, Peter. +1 (or more).
It’s an analysis by a thinking human being that has demonstrated he is capable of learning, adapting, and growing – wouldn’t it be great if everyone was like that?
Thank you on behalf of 32.6K readers (if I may be so bold as to speak for them) for this and your other thoughtful and thorough contributions to the Notepad++ user Community, and of course the N++ user manual.

Alan Kilborn

@PeterJones

BTW, I solved it on my own as well, for my own “pleasure”.
But I wasn’t going to post it, punishing, I guess, the 32.6K readers that were on the edge of their seat waiting for it – at the expense of those that I didn’t really want to have it. But since you let the cat out of the bag, perhaps it is instructive to see a different approach:

(?s)<tr>.*</tr>.*?<tr>\K.+?(?=</tr>.*?\z)

It seems to work; maybe there are holes.
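One way to probe for holes is to run the same idea in Python. This is only a sketch: Python's re module has no \K, so capture groups stand in for it, \Z replaces \z, and both sample texts are hypothetical.

```python
import re

# Group 1 is everything that \K would have discarded;
# group 2 is the content of the last row.
alan = re.compile(r'(?s)(<tr>.*</tr>.*?<tr>)(.+?)(?=</tr>.*?\Z)')

rows = "<tr>first</tr>\n<tr>second</tr>\n<tr>last</tr>\n"
m = alan.search(rows)
print(m.group(2))  # last

# One possible hole: the expression needs an earlier <tr>...</tr>
# before the final <tr>, so a table with a single row is not matched.
one_row = "<table>\n<tr>only</tr>\n</table>\n"
single = alan.search(one_row)
print(single)  # None
```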

Terry R

@Alan-Kilborn said in Regex: Select only the first instance of search results / first match:

BTW, I solved it on my own as well, for my own “pleasure”.

I had also solved it, using @PeterJones’s solution for the “first” instance and removing JUST 1 character. Maybe mine also has holes.
(?s)\A.*<tr>\s*\K.*?(\s*</tr>)
So it turns a non-greedy regex into a greedy one: it first grabs everything, then backs up until the <tr>…</tr> sequence matches. Even the \A could be removed IF the cursor were at the very start of the open file.
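The one-character difference can be sketched in Python. The assumptions: Python's re module replaces the Boost engine, an extra capture group replaces \K (which re does not support), and the sample text and replacement are made up.

```python
import re

# Lazy .*? stops at the FIRST <tr>; greedy .* backs up to the LAST one.
# Group 1 replaces \K, group 2 is the old content, group 3 the closing tag.
first_tr = re.compile(r'(?s)\A(.*?<tr>\s*)(.*?)(\s*</tr>)')
last_tr = re.compile(r'(?s)\A(.*<tr>\s*)(.*?)(\s*</tr>)')

text = "<tr>first</tr>\n<tr>second</tr>\n<tr>last</tr>\n"

# Keep groups 1 and 3, swap out only the content in between.
template = r'\g<1>new contents\g<3>'
first_result = first_tr.sub(template, text, count=1)
last_result = last_tr.sub(template, text, count=1)

print(first_result.splitlines()[0])  # <tr>new contents</tr>
print(last_result.splitlines()[2])   # <tr>new contents</tr>
```

In Notepad++ itself the \K version needs only $1 in the replacement, because the text before \K is never part of the match.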

Terry

PeterJones

@Terry-R and @Alan-Kilborn ,

Those are so much simpler than mine! Congrats! 🎉👏👍

Anyway, I am still glad I presented my solution, as it hopefully shows future readers a thought process that can arrive at a working regex, even if it’s not the simplest or most efficient.

Alan Kilborn

@PeterJones said in Regex: Select only the first instance of search results / first match:

…so much simpler…

Well, maybe.
But nothing is going to beat your discussion of your thought process.
An important factor in a good solution.

I’ve always thought of the ((?!UNWANTED).)* construct as somewhat “expensive”, but maybe that’s just because it “feels” complicated, but it would take a true regex genius like @guy038 to discuss that.

@Terry-R

Nice one as well!

Alan Kilborn

@Terry-R

I was experimenting with your regex a bit and I noticed that not only did it match the text inside the final <tr></tr> pair, but it also matched the </tr> tag as well?

Peter’s and my regexes only matched what was inside; not sure if you were solving something Vasile wanted or not with that – not going back to read/revisit it! – but I took the liberty of tweaking yours a bit so it matches what ours does:

(?s)\A.*<tr>\K.+?(?=</tr>)

and that appears to be the shortest matching regex thus far.

Terry R

@Alan-Kilborn said in Regex: Select only the first instance of search results / first match:

I was experimenting with your regex a bit and I noticed that not only did it match the text inside the final <tr></tr> pair, but it also matched the </tr> tag as well?

As I said it was from @PeterJones solution for the first instance. Thus in his post:

FIND = (?s)\A.*?<tr>\s*\K.*?(\s*</tr>)
REPLACE = new contents$1
MODE = regular expression
REPLACE ALL
then I get

So the replacement text would have been new contents$1, again same as the first instance solution. Sorry forgot to mention that.

Terry

Vasile Caraus

So, in conclusion, I have collected all the regexes from the last conversation:

Select and replace the first instance:

SEARCH: (?s)\A.*?<tr>\s*\K.*?(\s*</tr>)(?=$)
REPLACE BY: NEW CONTENT $1

or

SEARCH: (?s)\A.*?<tr>\s*\K.*?(\s*</tr>)
REPLACE BY: NEW CONTENT $1

Select and replace the last instance:

SEARCH: (?s)<tr>.*</tr>.*?<tr>\K.+?(?=</tr>.*?\z)
REPLACE BY: \r NEW CONTENT $1 \r

or

SEARCH: (?s)\A.*<tr>\K.+?(?=</tr>)
REPLACE BY: \r NEW CONTENT $1 \r

WORKS. Thanks a lot friends.

Alan Kilborn

This all seems rather “special case”.
This <tr> and </tr> junk…

To be generic, that is, a roadmap for other interested parties to use, why not specify it like this:

Match only the first occurrence in a file of a regular expression RE:

(?s)\A.*?\KRE

Match the last occurrence of a regular expression RE:

(?s)\A.*(RE).*?\K\1

Of course, clearly the RE has to be something a bit more specific than (example) .., but these seem to mostly work to achieve the goal.
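The two generic recipes, and the caveat about RE being specific enough, can be sketched in Python. The helper names first_match and last_match are hypothetical, a capture group replaces \K (which Python's re lacks), and the last-occurrence helper uses the plain greedy form (?s)\A.*(RE) rather than the backreference variant. Because (?s) is applied to the whole expression, a dot inside RE also matches newlines here.

```python
import re

def first_match(pattern, text):
    # First occurrence: lazy .*? gives up as early as possible.
    m = re.search(r'(?s)\A.*?(' + pattern + ')', text)
    return m.group(1) if m else None

def last_match(pattern, text):
    # Last occurrence: greedy .* backs up from the end of the text.
    m = re.search(r'(?s)\A.*(' + pattern + ')', text)
    return m.group(1) if m else None

text = "id=12 id=345 id=6789"

print(first_match(r'id=\d+', text))  # id=12
print(last_match(r'id=\d+', text))   # id=6789

# If RE is too loose, greedy .* swallows part of the intended match:
# only the final digit is left over for \d+.
print(last_match(r'\d+', text))      # 9
```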

guy038

Hello, @vasile-caraus, @Terry-R, @alan-kilborn, @peterjones and All,

IMPORTANT : I wrote this post, after reading posts from the banner 4 YEARS LATER till the @peterjones’s post, below :

https://community.notepad-plus-plus.org/post/62964

But I am going to add a second post, after reading the most recent solutions ! Sorry for my incomplete work !

First, @vasile-caraus, I totally agree with @alan-kilborn’s comment on your attitude ! Not very fair and nice to @Terry-r, who was trying to help you :-((

Seemingly, you know quite well, by now, the power of regexes, regarding text manipulations. And if you had studied, seriously, some regex tutorials, you would not have spoken about that regex (?s)\z.*?<tr>\s*\K.*?(\s*</tr>) which is complete nonsense !

For instance, from the two pages of the Regular-expressions.info site, below, you would have understood, at once, that the \z syntax always comes at the very end of a regex expression or, possibly, before an alternation symbol | !!

https://www.regular-expressions.info/anchors.html

https://www.regular-expressions.info/refanchors.html

Now, I slightly simplified the @peterjones’s search regex, which searches for the first element <tr> ••••• </tr>, of an HTML page :

SEARCH (?s-i)\A.*?<tr>\K.*?(?=</tr>)

Then, depending on your replacement expression :

If the expression is Here is the NEW text, you’ll get the simple text :

<tr>Here is the NEW text</tr>

If the expression is \r\nHere is the NEW text\r\n, the output text will be :

<tr>
Here is the NEW text
</tr>

Tick the Wrap around option
Click on the Replace All button, exclusively !

Now, to search for the last element <tr> ••••• </tr>, of an HTML page, use the following regex :

SEARCH (?s-i)<tr>\K((?!<tr>).)*?(?=</tr>((?!<tr>).)*?\z)

Note that I use exactly the scheme proposed by @Peterjones :


- find from <tr> to </tr> ( NOT included )          =>    (?s-i)<tr>\K •••••••••• (?=</tr> •••••••••• )
                                                                           ^                 ^    ^
                                                                           |                 |    |
- WITHOUT any contained <tr>                        =>    ((?!<tr>).)*? ---•                 |    |
                                                                                             |    |
- FOLLOWED by anything that’s NOT a <tr>            =>    ((?!<tr>).)*? ---------------------•    |
                                                                                                  |
- until the VERY END of the file                    =>    \z -------------------------------------•

To All :

You could ask me : why the regex to search for the last <tr> ••••• </tr> block is more complicated than the one to search for the first one ?

This is because of the general direction used by the regex engine : from LEFT to RIGHT !

Indeed, when we search for (?s-i)\A.*?<tr>, part of the first regex, the range of any char (?s).* with the lazy quantifier ? is then extended to the first occurrence of the string <tr> and means that, necessarily, this range cannot contain any <tr> inside !
Similarly, the regex (?s).*?(?=</tr>) would search for any range of any char, possibly empty, till the nearest string </tr>, meaning, implicitly, that this range of chars cannot contain a </tr> string
Whereas, when searching the last <tr> ••••• </tr> block, as our reference is the anchor \z ( very end of current file ), we must build up the regex, using a kind of back-propagation method :
- Starting from the very end of file
- Moving back, through characters without any <tr> string
- Till a </tr> string
- Moving back, again, through characters without any <tr> string
- Till a <tr> string

Of course, I assume that any <tr> correctly ends with </tr> !

Test these two regexes against this sample, derived from Peter’s one, which contains 4 blocks <tr> •••• </tr> :

<html><body>
<table>
<tr>
get rid of stuff, in case of \A anchor, including <embedded/> <tags/>
</tr>
<tr>
keep stuff including <embedded/> <tags/>
</tr>
<tr>
keep stuff including <embedded/> <tags/>
</tr>
<tr>
get rid of stuff, in case of \z anchor, including <embedded/> <tags/>
</tr>
</table>
</body>
</html>

The first regex, with the \A syntax, should replace the first block only, and the last regex, with the \z syntax, should replace the fourth and last <tr> block.
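The same test can be run as a script. This is only a sketch under stated assumptions: Python's re module replaces the Boost engine, (?s) is used alone because re has no global (?s-i) form and is case-sensitive by default, a capture group or lookbehind stands in for \K, \Z replaces \z, and the replacement uses \n rather than \r\n.

```python
import re

# The sample above; a raw string keeps the literal \A and \z in the text.
sample = r"""<html><body>
<table>
<tr>
get rid of stuff, in case of \A anchor, including <embedded/> <tags/>
</tr>
<tr>
keep stuff including <embedded/> <tags/>
</tr>
<tr>
keep stuff including <embedded/> <tags/>
</tr>
<tr>
get rid of stuff, in case of \z anchor, including <embedded/> <tags/>
</tr>
</table>
</body>
</html>
"""

NEW = "\nHere is the NEW text\n"

# First block: group 1 holds what \K would have dropped from the match.
first_block = re.compile(r'(?s)\A(.*?<tr>).*?(?=</tr>)')
# Last block: a lookbehind replaces <tr>\K.
last_block = re.compile(r'(?s)(?<=<tr>)((?!<tr>).)*?(?=</tr>((?!<tr>).)*?\Z)')

after_first = first_block.sub(lambda m: m.group(1) + NEW, sample, count=1)
after_last = last_block.sub(lambda m: NEW, sample, count=1)

print(r"\A anchor" in after_first, r"\z anchor" in after_first)  # False True
print(r"\A anchor" in after_last, r"\z anchor" in after_last)    # True False
```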

Best Regards,

guy038

P.S. :

@vasile-caraus, note that I, and probably everyone involved in this discussion, am willing to help you if you have difficulty understanding a specific part of a regex tutorial that you have decided to study. A different perspective will certainly be very useful to you … and others ;-))

guy038

Hi, @vasile-caraus, @Terry-R, @alan-kilborn, @peterjones and All,

My God !! Of course, the @terry-r’s regex is just magic and so simple ! Congratulations, Terry ;-)) How could we not think of it ??

If I adapt Terry’s concept to the regexes of my previous post, everything becomes crystal clear :

SEARCH (?s-i)\A.*?<tr>\K.*?(?=</tr>) to search ( and replace ) the first <tr> ••••• </tr> block

SEARCH (?s-i)\A.*<tr>\K.*?(?=</tr>) to search ( and replace ) the last <tr> ••••• </tr> block

As usual, tick the Regular expression and Wrap around options and click on the Replace All button, exclusively

@vasile-caraus, this demonstrates, in a masterful way, that things can be skillfully solved by other people than me and moreover… by @terry-r !!

Now, @alan-kilborn you said :

Match the last occurrence of a regular expression RE:

(?s)\A.*(RE).*?\K\1

But, unless I’m mistaken, doesn’t this regex, below, do the same search ?

(?s)\A.*\KRE
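The question can at least be explored with a sketch in Python. It does not settle the matter for Notepad++, since Python's re module is a different engine with no \K; here capture groups stand in for \K, and the helper names last_simple and last_with_backref are hypothetical.

```python
import re

def last_simple(pattern, text):
    # (?s)\A.*\KRE, with group 1 standing in for the text after \K.
    m = re.search(r'(?s)\A.*(' + pattern + ')', text)
    return m.span(1) if m else None

def last_with_backref(pattern, text):
    # (?s)\A.*(RE).*?\K\1, with group 2 standing in for the text after \K.
    m = re.search(r'(?s)\A.*(' + pattern + r').*?(\1)', text)
    return m.span(2) if m else None

# When the same text occurs twice, both forms land on the last occurrence.
print(last_simple('foo', 'foo bar foo'))        # (8, 11)
print(last_with_backref('foo', 'foo bar foo'))  # (8, 11)

# The backreference needs an earlier copy of the identical text, so a
# single occurrence, or occurrences that differ, are not found by it.
print(last_simple('foo', 'foo bar'))            # (0, 3)
print(last_with_backref('foo', 'foo bar'))      # None
print(last_simple(r'\d', 'a1 b2'))              # (4, 5)
print(last_with_backref(r'\d', 'a1 b2'))        # None
```

In this Python translation, then, the two forms agree only when the matched text is repeated verbatim earlier in the file; the shorter form also handles the other cases.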

Best regards,

guy038