How to find and highlight a specific occurrence of a symbol?

Scott Sumner

Terry R is heading in the direction I was going, but he posted first. I would slightly tweak his regex (which works) to get this:

(?s)(?:.+?===========){5}.+?\K===========.*

The tweaks:

Use (?s) at the start to avoid having the “newline” box be a dependency
Remove some unnecessary group-capturing parentheses
Add .* at the end to prevent matching at every multiple-of-six grouping (it will highlight/mark from the sixth grouping of = thru the end-of-file…maybe this is undesirable but I think in a large file it would make it easier to find where the highlighting/marking begins!)

Note that the OP said he wanted to highlight the match, so I take this to mean using the Mark tab of the Find window for this.

Terry R said:

It might not work if there are any sequences of 2 lines together with “=” in them

It WILL work in such a case!

Here’s an explanation of the regex:

Use these options for the whole regular expression (?s)
- Dot matches line breaks s
Match the regular expression below (?:.+?===========){5}
- Exactly 5 times {5}
- Match any single character .+?
  - Between one and unlimited times, as few times as possible, expanding as needed (lazy) +?
- Match the character string “===========” literally ===========
Match any single character .+?
- Between one and unlimited times, as few times as possible, expanding as needed (lazy) +?
Keep the text matched so far out of the overall regex match \K
Match the character string “===========” literally ===========
Match any single character .*
- Between zero and unlimited times, as many times as possible, giving back as needed (greedy) *

Created with RegexBuddy
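
For readers who want to try the idea outside Notepad++, here is a minimal Python sketch. It is a translation under assumptions, not the Notepad++ (Boost) regex itself: Python's re module has no \K, so the wanted separator is captured in group 1 instead. The helper name nth_separator_offset and the sample text are made up for illustration; the sample deliberately contains two consecutive separator lines.

```python
import re

SEP = "==========="  # separator string copied from the thread


def nth_separator_offset(text, n=6):
    """Return the offset of the n-th SEP in text, or None (hypothetical helper)."""
    sep = re.escape(SEP)
    # (?s) lets '.' cross line breaks, as in the Notepad++ regex.
    # Group 1 stands in for \K: only its span is the "highlighted" part.
    pattern = re.compile(
        r"(?s)(?:.+?" + sep + r"){" + str(n - 1) + r"}.+?(" + sep + r")"
    )
    match = pattern.search(text)
    return match.start(1) if match else None


# Made-up sample: two consecutive separator lines, then six ordinary blocks.
sample = "intro\n" + SEP + "\n" + SEP + "\n" + "".join(
    "block %d\n%s\n" % (i, SEP) for i in range(6)
)
print(nth_separator_offset(sample))
```

The lazy .+? between separators is what lets the newline between two adjacent separator lines count as the "text in between", which is why consecutive separator lines do not break the count.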

Terry R

Thanks Scott for those tweaks.

I have a question about why the \A didn’t work in this instance. Had it done so, then it would ALWAYS start at the start of the file, and even if run multiple times it would still select only the nth occurrence. As well, your .* at the end of the regex wouldn’t then be required.

I do continue to forget the (?s) and (?i) at the start, often using the tick boxes over this approach. Of course using these allows for a standard approach to the regex (not having to change tick boxes for every expression) and allows each regex to stand on its own.

I hadn’t tested the occurrence of 2 consecutive lines of '='s hence my disclaimer, that was only added into the reply as I was typing it (my testing wasn’t exhaustive).

Terry

(The day you stop learning is the day you die!)

guy038

Hi, @rumi-balkhi, @scott-sumner, @terry-r, and All,

Terry, the \A feature is broken in the N++ implementation of the Boost Regex library. For an alternate regex engine, see the last part of the updated FAQ topic, below :

https://notepad-plus-plus.org/community/topic/15765/faq-desk-where-to-find-regex-documentation

However, I found a very simple way to prevent the regex from matching at every sixth =========== line !

Just add, at the very beginning of the file, a specific expression that does not exist anywhere else in the current file. Then, Scott, we simply put that expression at the start of your regex :-)) For instance :

I added the ### mark on top of the Rumi’s text
And I used the modified regex (?s)###(?:.+?===========){5}.+?\K===========

Voilà !
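
The effect of the ### marker can be sketched in Python. This is only an approximation under stated assumptions: group 1 replaces \K (which Python's re lacks), the 18-block sample text is invented, and count_matches is a hypothetical helper that mimics a "find all".

```python
import re

SEP = "==========="
BODY = r"(?:.+?" + re.escape(SEP) + r"){5}.+?(" + re.escape(SEP) + r")"

PLAIN = re.compile(r"(?s)" + BODY)      # Scott's regex without the trailing .*
MARKED = re.compile(r"(?s)###" + BODY)  # variant that requires the ### marker


def count_matches(pattern, text):
    """Count successive non-overlapping matches, like a 'find all'."""
    return sum(1 for _ in pattern.finditer(text))


# Invented sample: 18 blocks, each followed by a separator line.
body_text = "".join("block %d\n%s\n" % (i, SEP) for i in range(18))

print(count_matches(PLAIN, body_text))             # one hit per group of six
print(count_matches(MARKED, "###\n" + body_text))  # the marker occurs only once
```

Because the marker appears only at the top of the text, the second regex has nowhere else to start a match.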

Best Regards,

guy038

Scott Sumner

@guy038 said:

I added the ### mark on top of the Rumi’s text

Well…sure, but when one answers a regex question here, one sort of assumes that changing the OP’s source text to solve a problem is not allowed. :-)

Of course, doing that when solving a problem that requires “table-building”…well, then…maybe in that case we bend the rules, eh? :-D

Terry R

I think I may have found another regex (building on what we already have) that will ALWAYS select the nth occurrence, no matter where in the file the cursor is. I used a string check that I’d provided some weeks ago to someone who originally stated that \A didn’t work for them. My latest rendition is:
Find: (?s)(?<!\x0A)^((.+?)===========){5}(.+?)\K(===========)

So the (?<!\x0A)^ should (hopefully) look for a line-start position that has no line feed/carriage return immediately before it (actually I only test for the line feed portion). So far my tests have shown it ALWAYS selects only the 6th occurrence (in the example), and even with a second click it does not change position. With the cursor placed in various positions within the file it still correctly locates the 6th occurrence.

This negates the need to be careful where the cursor is in the file before looking for the nth occurrence and also the need to include the additional .* at the end to grab the rest of the content, thus preventing a double click going to the nth * 2 occurrence.
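
The intent of the (?<!\x0A)^ anchor can be illustrated with a small Python sketch. Note the assumptions: Python's re is not the Boost engine used by Notepad++, its ^ only matches after "\n" when (?m) is set, and the three-line text is invented.

```python
import re

# (?m) makes Python's ^ match after every "\n", roughly like ^ in Notepad++.
LINE_STARTS = re.compile(r"(?m)^")
FIRST_LINE_ONLY = re.compile(r"(?m)(?<!\x0A)^")  # Terry's anchor

text = "one\ntwo\nthree"

print([m.start() for m in LINE_STARTS.finditer(text)])      # every line start
print([m.start() for m in FIRST_LINE_ONLY.finditer(text)])  # start of text only
```

Every line start except the very first one is preceded by a line feed, so the negative lookbehind rules all of them out.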

Since the \A seems very problematic I’ve now added (?<!\x0A)^ to my arsenal!

Terry

Scott Sumner

@Terry-R

Nice one…hopefully a downside to this is NOT found. I’m not “with” Notepad++ or RegexBuddy right now, but maybe this also works?: (?<!\R)^ Maybe not, though as lookbehinds must be of constant length and \R could be of length 1 (in the case of \r or \n) or length 2 (for \r\n)…would be nice if it did though because I think it is nicer on the eyes than \x0A. :-)
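
The constant-length concern is easy to observe in Python, with the caveat that Python's re is a different engine from the Boost library in Notepad++, so this says nothing definitive about (?<!\R)^ there. Python has no \R at all, so the sketch spells the alternatives out by hand; lookbehind_compiles is a hypothetical helper.

```python
import re


def lookbehind_compiles(pattern):
    """Return True if Python's re accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Every alternative is exactly one character wide.
print(lookbehind_compiles(r"(?<!\n|\r)^"))
# Alternatives are two or one characters wide, like \R would be.
print(lookbehind_compiles(r"(?<!\r\n|\n|\r)^"))
```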

@guy038, I think there’s a(nother) regex sheriff in town, and his name is @Terry-R…

Terry R

Sorry, not a sheriff, maybe a deputy.

I’m quite enjoying helping out (where I can) although I do need to curb my enthusiasm somewhat. And thanks to you Scott Sumner and @guy038 for your support.

And yes, Scott, I agree the \R or \r\n would probably look better, I just haven’t tested that yet. As my dad always said, measure twice, cut once. So I need to test, refine, test again before presenting!

Terry

guy038

Hello @terry-r, @scott-sumner, @rumi-balkhi and All,

Very clever deduction, Terry. If we generalize to any kind of EOL characters, it gives (?<!\n|\r|\f)^.

Note that I added the \f syntax ( Control character Form Feed, \x0C or 12 decimal ) because, given, for instance, the text abcdefghij, with the Form Feed char between the strings abcde and fghij, the regex (?-s)^. would also match the letter f, right after the FF char !

So, to be short, the regex (?<!\n|\r|\f)^ seems a very nice work-around to emulate the bugged \A feature of the N++ regex engine :-))

I used the verb seems and not the verb is because, unfortunately, there are still some problems with that syntax :-((

Let’s work on that sample text, below, that you will copy on a new tab :

Notepad++ v7.5.8 bug-fixes:
This is
a simple
text
12345><67890
to test
the \A
feature

Note : For all the tests, below, the options Regular expression and Wrap around are ticked !

First problem :

Let’s suppose that your cursor is located between the > and < characters, on the 5th line. Using the regex (?<!\n|\r|\f)^(?-s). ( which should stand for \A(?-s). ), it does find the letter N of Notepad++, on top of the text and any other click on the Find Next button does not find anything else. Nice !.

Now place the cursor at the beginning of the 5th line, right before the 1 digit, without any selection, and re-run the (?<!\n|\r|\f)^(?-s). regex. This time, the first click on the Find Next button wrongly matches the 1 digit. The second click finds the letter N as expected, and any subsequent clicks do nothing.

Second problem :

Let’s apply the new regex (?<!\n|\r|\f)^(?-s).*\R.* ( which should be a work-around of \A(?-s).*\R.* ) against our sample text. The result is just identical to what I described in the point, just above. That is to say :

If cursor was between the > and < characters, it matches all contents of the 1st line, with its EOL chars and all contents of the 2nd line, without its EOL chars
If cursor was at beginning of the 5th line , then :
- After a first click, it matches all contents of the 5th line, with its EOL chars + all contents of the 6th line, without its EOL chars
- After a second click, it matches all contents of the 1st line, with its EOL chars + all contents of the 2nd line, without its EOL chars

Now, let’s slightly change the regex, adding an \R syntax, at the end of the regex, which becomes :

(?<!\n|\r|\f)^(?-s).*\R.*\R

Now, even if we place the cursor between the > and < characters, each click on the Find Next button will match, successively, two consecutive lines ( the 1st and the 2nd, then the 3rd and the 4th, and so on ) :-((

Just because the end of this regex matches a \r or \n character !

Anyway, Terry, don’t be sad ! Logically, your (?<!\n|\r|\f)^ regex should work as a work-around of the \A syntax. It’s simply because our present regex engine does not handle backward assertions properly, either ! I didn’t test it, yet, but I suppose that your regex should work in some regex testers on the Web :-))

And I agree, with Scott : You, certainly, are a “regex sheriff” !

Cheers,

guy038

Scott Sumner

@guy038 and @Terry-R :

So after some thinking about this, I’ve decided that I don’t think the \A syntax is broken in Notepad++, and I don’t think that lookbehind assertions (either positive or negative) are broken in Notepad++ either. One just has to fully understand how Notepad++ searching works. And @guy038, in your post just above, where you talk about a “first problem” and a “second problem” and beyond…I have NO problem with how Notepad++ works in these cases, given my new thinking!

Every Notepad++ search has a “starting-point” when the user initiates a search, or that Notepad++ itself initiates after a successful match in the case where one of the “find all” searches (or a Find in Files) is conducted. Each starting-point has exactly NOTHING before it. YOU may SEE data before your caret when you initiate a Find Next, but Notepad++ doesn’t. And that, IMO, isn’t necessarily a broken search feature, it is just “the way it works”. To Notepad++, each starting-point appears the same as a start-of-file does–no data (aka NOTHING) comes before it.

At the beginning of your regex, an \A assertion or ANY valid negative lookbehind assertion will match the NOTHING right at a starting-point (i.e., that part of the regex will always succeed). Note that a negative lookbehind assertion doesn’t match the real data to the left of your caret, it matches the NOTHING. Example: Have 12345678 (only) in your buffer and your caret between the 4 and the 5. Do a regex search for (?<!4)5 and see that it matches the 5. The (?<!4) in this case allows the match because NOTHING is not 4. Some would call this regex behavior “broken”…because YOU the user can see the darned 4 right there!

While a search is “in-progress” on a buffer, however, no one would call negative lookbehind assertions broken. Example: Use the same buffer as before but have your caret between the 1 and the 2. Do a regex search for (?<!4)5 and see that there is no match. In this case there is a 4 before the 5 (the search is “underway” at this point and Notepad++ is looking at the buffer…and this time the 4 is a part of that buffer) so the assertion fails the overall match.
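
The two 12345678 examples can be imitated with a toy Python model. This is only a sketch of the "NOTHING before the starting-point" idea, not how Notepad++ is implemented: the hypothetical find_next helper slices the buffer at the caret, so the regex engine genuinely has no data to the left of the starting-point.

```python
import re


def find_next(buffer, regex, caret):
    """Toy model: search only the text from the caret on, as if nothing preceded it."""
    match = re.search(regex, buffer[caret:])
    return None if match is None else caret + match.start()


buffer = "12345678"
print(find_next(buffer, r"(?<!4)5", 4))  # caret between the 4 and the 5
print(find_next(buffer, r"(?<!4)5", 1))  # caret between the 1 and the 2
```

With the caret between the 4 and the 5, the slice starts at the 5 and the lookbehind has nothing to reject. With the caret between the 1 and the 2, the 4 is part of the searched text and the assertion fails.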

Side note: With “Wrap around” ticked during a search, if no match is found between the current caret point and the end-of-file, a second (internal) search is done by Notepad++, with a starting-point at the start-of-file. This is an additional opportunity for a match to happen at a “starting point”. For more info on the “2nd search”, see here…quite far into that thread… Although that thread discusses replacement, find is a part of replacement, so it works the same way.

So that brings us back to the regex under discussion: (?<!\r|\n)^.

It will work like an “unbroken” \A in most cases. A possible problem with it comes when doing “multiple” searches with it (or when the user-starting-point isn’t the start-of-file). In those cases, if the previous action with it leaves the next starting-point at the start of any line, the next search will match right there – not only at start-of-file – which may not be what the user desires. (Again, it will match that case because the (?<!\r|\n) part will match the NOTHING and the ^ will hit).

The best way to use it as an \A-equivalent is to have the end of your regex not leave the next starting-point at any start-of-line.

So let’s go back to Terry’s original regex (or very close to it):

(?s)(?<!\r|\n)^((.+?)===========){5}(.+?)\K(===========)

When run on the OP’s sample data (duplicated a few times so that there are many more than 9 lines of =========== data) this will find exactly ONE match when a “find all” is done. That is because the end of the regex sets up the follow-on starting-point to NOT be able to match a start-of-line (needed by the ^ assertion).

Changing the regex slightly at the end:

(?s)(?<!\r|\n)^((.+?)===========){5}(.+?)\K(===========)\R

This one will result in multiple matches using a “find all” in the OP’s (extended) data, for reasons which should now be apparent. Thus it is worth pointing out in such a case that the (?<!\r|\n)^ regex is NOT what one normally thinks \A should be doing. So while it can be a \A substitute, it still has to be used with some amount of caution, and of course, understanding. :-)
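
The difference between the two regexes can be reproduced with a toy "find all" in Python. Everything here is an assumption made for illustration: it models the starting-point behaviour as described in this post, not Notepad++'s actual code. Under that model the (?<!\r|\n)^ anchor can only succeed exactly at a starting-point that is also a real line start (anywhere later, ^ needs an EOL before it while the lookbehind forbids one), so the hypothetical find_all helper checks that condition by hand. Group 1 replaces \K and EOL replaces \R, neither of which Python's re supports; form feeds are ignored.

```python
import re

SEP = "==========="
EOL = r"(?:\r\n|\n|\r)"  # stand-in for \R
CORE = r"(?s)(?:.+?" + re.escape(SEP) + r"){5}.+?(" + re.escape(SEP) + r")"


def find_all(text, body):
    """Toy 'find all' for the anchored regex under the starting-point model."""
    hits, start = [], 0
    while True:
        # The anchor fails unless the starting-point is a real line start.
        if start > 0 and text[start - 1] not in "\r\n":
            break
        match = body.match(text, start)
        if match is None:
            break
        hits.append(match.start(1))  # group 1 replaces \K
        start = match.end()          # next starting-point
    return hits


# Invented sample: 18 blocks, each followed by a separator line.
sample = "".join("block %d\n%s\n" % (i, SEP) for i in range(18))

print(len(find_all(sample, re.compile(CORE))))        # regex ending at the separator
print(len(find_all(sample, re.compile(CORE + EOL))))  # regex ending with the EOL
```

The only thing that changes between the two runs is where the previous match leaves the next starting-point: mid-line in the first case, at a line start in the second.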

Back to \A: You can be the judge of whether or not it is broken: The Boost documentation for \A says “Matches at the start of a buffer only” – does one consider the “buffer” to be the entirety of the Notepad++ editor tab data, or the starting-point(s) through a later point for a search? Your call. :-)

And now onto \G. In this thread, there are 3 conditions specified for where the \G assertion can match. I believe there is only ONE place it can match; hint: the “starting-point” of a search. :-)
