Find - Replace

PeterJones

First, to solve your problem:

I can replicate your problem by using the same options:

But if I turn off “match whole words only”, then it finds it easily:

This is because "1000" is not the “whole word”; price="1000"/> is the “whole word”.

the error at the bottom is showing double quotes?

Because that error bar takes whatever is in the FIND box and puts it between quotes to display the text. If you had said Find What: gobbeldygook, the error message would say Find: Can't find the text "gobbeldygook", as shown here:

To reiterate the main solution: your search did not work because you told it to match whole words only, but then tried to match against text that wasn’t a “whole word”.

------
see https://npp-user-manual.org/docs/searching/#find-replace-tabs

Alan Kilborn

@peterjones said in Find - Replace:

price="1000"/> is the “whole word”.

Can you elaborate on why this is?
Aside from “it works”? :-)

Reading the fine manual HERE doesn’t really shed light on it, for me.

Note that I know how to use the option, and would never have used it like OP did, but it never hurts to know deeper meanings in things, so that maybe I can use a function better.

PeterJones

@alan-kilborn ,

I don’t have insight into how the non-regex “word” is defined in the code.

However, at least in my brief experimentation, the “normal mode + match whole word only” seems to agree with “regex mode” and \b.*?\b.

For example, because the spot between the = and the " will not match a word boundary \b, a “whole word only” match will not match if just the " is included, but it will if the match starts with =" or if it starts at the 1000.

Maybe this will show it better: If you are searching the text price="1000"/>:

| looking for text | regex version | normal+whole word matches | regex matches | notes |
|---|---|---|---|---|
| `1000` | `\b1000\b` | YES | YES | the zero-width position between `"1` is a word boundary, as is `0"` |
| `"1000` | `\b"1000\b` | NO | NO | the zero-width position between `="` is not a word boundary, so it fails |
| `="1000` | `\b="1000\b` | YES | YES | the zero-width position between `e=` is a boundary |
| `="1000"` | `\b="1000"\b` | ~~YES~~ | NO | ERROR: `"/` is not a word boundary, so the regex fails, but the normal+whole somehow matches |
| `price="1000"` | `\bprice="1000"\b` | NO | NO | including `price` before the `=` seems to change the normal+whole definition of “whole word”… weird. |

Unfortunately, with experimentation, my theory broke down. I don’t know enough about the underlying details to explain exactly how it matches – someone with more insight into the source code would need to comment.

But I think a good general rule is, “if it doesn’t also match regex=\bXXX\b, then normal+word=XXX probably won’t work, though there are subtle exceptions”. For normal+word, I would stick to words that are obviously word units, like the 1000 or price (with no spaces or punctuation), rather than trying to get normal+word to go across words or word boundaries. If you want to search across multiple words, or want mixed words and punctuation, normal+word will not always work as you expect.
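The regex column of the table above can be checked outside Notepad++ with a short Python sketch. It assumes that Python's `re` module treats `\b` the same way as Notepad++'s regex engine for plain ASCII text, and it says nothing about the normal+whole-word column. The names `whole_word_regex` and `regex_matches` are hypothetical helpers made up for this illustration.

```python
import re

TEXT = 'price="1000"/>'

def whole_word_regex(needle):
    # Wrap a literal search string in \b assertions; re.escape quotes any
    # regex metacharacters so the needle is matched as plain text.
    return r"\b" + re.escape(needle) + r"\b"

def regex_matches(needle, text=TEXT):
    # True if the \b-wrapped literal is found anywhere in the text.
    return re.search(whole_word_regex(needle), text) is not None

CANDIDATES = ['1000', '"1000', '="1000', '="1000"', 'price="1000"']
RESULTS = {needle: regex_matches(needle) for needle in CANDIDATES}

if __name__ == "__main__":
    for needle, matched in RESULTS.items():
        print(f"{needle:<14} {'YES' if matched else 'NO'}")
```

Running the same candidates through another engine is a cheap way to separate "what `\b` does" from "what Match whole word only does" before drawing conclusions about the latter.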

guy038

Hello, @kendall-demott, @peterjones, @alan-kilborn and All,

Well, I would say :

For an ANSI file :
- If a string of word chars is immediately surrounded, both before and after, by one of the characters [\x00 - \x2F] , [\x3A - \x40] , [\x5B - \x5E] , \x60 or [ \x7B - \x7F], that string will match when the Match whole word only option is ticked
- Conversely, if a string has, immediately before or after it, at least one word char, in the strict range [0-9A-Z_a-z] or any char in range [\x80-\xFF], that string will not match when the Match whole word only option is ticked

For a NON-ANSI file ( so any encoding other than ANSI ) :
- If a string of word chars is immediately surrounded, both before and after, by a Unicode non-word character, recognized by Notepad++, that string will match when the Match whole word only option is ticked
- Conversely, if a string has, immediately before or after it, at least one Unicode word char, recognized by Notepad++, that string will not match when the Match whole word only option is ticked
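The ANSI rule can be written down as a small Python sketch. This is only a model of the rule as stated in this post, not Notepad++'s actual code. The byte ranges are copied from the text; treating the start and end of the data as acceptable neighbours is an assumption, and `is_whole_word` and `find_whole_words` are hypothetical names.

```python
from typing import List

# Non-word byte ranges quoted in the post (ANSI file).
NON_WORD_RANGES = [(0x00, 0x2F), (0x3A, 0x40), (0x5B, 0x5E), (0x60, 0x60), (0x7B, 0x7F)]
NON_WORD = {b for lo, hi in NON_WORD_RANGES for b in range(lo, hi + 1)}

# Everything else counts as a word byte: [0-9A-Z_a-z] plus all of 0x80-0xFF.
WORD = set(range(256)) - NON_WORD

def is_whole_word(data: bytes, start: int, end: int) -> bool:
    # Assumption: the very start and very end of the data count as non-word.
    before_ok = start == 0 or data[start - 1] in NON_WORD
    after_ok = end == len(data) or data[end] in NON_WORD
    return before_ok and after_ok

def find_whole_words(data: bytes, needle: bytes) -> List[int]:
    # Collect the offsets of every occurrence that passes the neighbour check.
    hits = []
    pos = data.find(needle)
    while pos != -1:
        if is_whole_word(data, pos, pos + len(needle)):
            hits.append(pos)
        pos = data.find(needle, pos + 1)
    return hits
```

For example, `find_whole_words(b'price="1000"/>', b'1000')` reports the occurrence after the opening quote, while the same needle directly after a letter or a byte in 0x80-0xFF is rejected.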

Now, regarding the regex \b zero-width assertion, it represents, either :

- The position between the very beginning of current file and a word character
- The position between a non-word character and a word character
- The position between a word character and a non-word character
- The position between a word character and the very end of current file

Note also that the \n and/or \r line-ending chars are always considered as non-word chars
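The four cases can be turned into a few lines of Python and compared against a real regex engine. This is a sketch that assumes an ASCII-only notion of word characters and uses Python's `re` module (with `re.ASCII`) as the reference engine, not Notepad++'s own. `word_boundaries` is a hypothetical name.

```python
import re
import string

WORD_CHARS = set(string.ascii_letters + string.digits + "_")

def word_boundaries(text):
    # Return every index where exactly one side is a word character.
    found = []
    for i in range(len(text) + 1):
        # Positions outside the text (before the start, after the end)
        # behave like non-word characters.
        left = text[i - 1] in WORD_CHARS if i > 0 else False
        right = text[i] in WORD_CHARS if i < len(text) else False
        if left != right:
            found.append(i)
    return found

SAMPLE = 'price="1000"/>\r\nabcd'
MANUAL = word_boundaries(SAMPLE)
ENGINE = [m.start() for m in re.finditer(r"\b", SAMPLE, flags=re.ASCII)]

if __name__ == "__main__":
    print(MANUAL)
    print(ENGINE)
```

The sample includes a `\r\n` pair so that the line-ending case is exercised as well: the boundary before `abcd` exists only because `\n` is a non-word char.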

Best Regards,

guy038

Alan Kilborn

More on the subject from @guy038 in this old post: https://community.notepad-plus-plus.org/post/20424

Peter, could the user manual be better in this regard?

PeterJones

@guy038 said in Find - Replace:

If the string to search for is, itself, surrounded with non-word characters, that string will match when the Match whole word only option is ticked ONLY IF surrounded with the \n or \r chars

That’s not accurate.

If the document is

<a price="1000"/> x
<a price="1000"/>x

then FIND = ="1000"/> will match both those lines, even though it’s got an e to the left and either a space or an x to the right.

—

Also, I originally said that ="1000" matched normal+whole word in the document price="1000"/>, but it does not… so apparently my test was wrong yesterday. And with NORMAL=="1000" and REGEX=\b="1000"\b actually agreeing that it doesn’t match, I am back to thinking that for a “normal+whole word” FIND=☒☒☒, it is equivalent to a regex FIND=\b☒☒☒\b (or, I should say \b\Q☒☒☒\E\b, because ☒ might be a regex special character, so it needs to be escaped in the regex-equivalent). I haven’t been able to find an exception to this. If anyone can show me different, let me know.

Kendall DeMott

Peter, Thank You, unticking that box solved my issue.

guy038

Hi, @kendall-demott, @peterjones, @alan-kilborn and All,

I said, in my previous post ( now deleted ) :

If the string to search for is, itself, surrounded with non-word characters, that string will match when the Match whole word only option is ticked ONLY IF surrounded with the \n or \r chars

Actually, I really misspoke ! I meant to say :

Any string, containing word and/or non-word characters, at any location, will match, when the Match whole word only option is ticked, IF this string is surrounded with nothing, a \n char or a \r char

Now, Peter, you said in your last post :

I am back to thinking that for a “normal+whole word” FIND=☒☒☒, it is equivalent to a regex FIND=\b☒☒☒\b …

So I created a file, containing all Unicode characters of the BMP, only ( so 63,454 characters with code-point < U+FFFF ), in the form below :

NULabcd¤
SOHabcd¤
...
...
...
abcd¤
�abcd¤

And it happens that :

The search of the string abcd, in Normal mode, with the Match whole word only option ticked, returns 12,561 matches
The search of the regex string \babcd\b in Regular expression mode, returns 15,424 matches

So, obviously, these two kinds of searches are not equivalent at all !

For instance, let’s insert the string ¼abcd¤ in a new tab, whatever its encoding

First note that both the ¼ and the ¤ characters are non-word characters. To be convinced, just look for \w in Regular expression mode : only the four letters are matched

However, the search of abcd, in Normal search mode, with the Match whole word only ticked, gives : NO match
Luckily, the search of \babcd\b, in Regular expression search mode, does give the correct answer : MATCH

Unfortunately, the general template \bString of Word chars\b is not exact either, in numerous cases :

Let’s consider, for instance :

The Ԩ Unicode character. It’s the CYRILLIC CAPITAL LETTER EN WITH LEFT HOOK with code-point U+0528
The ᏹ Unicode character. It’s the CHEROKEE SMALL LETTER YI, with code-point U+13F9
The ⴭ Unicode character. It’s the GEORGIAN SMALL LETTER AEN, with code-point U+2D2D

Although all these chars are regarded as true letters by the Unicode Consortium, they are not yet considered word chars by our N++ regex engine :((. Thus, the search of \babcd\b, in Regular expression mode, will wrongly match the string abcd in the examples below :

Ԩabcd¤
ᏹabcd¤
ⴭabcd¤

Conclusion :

Although the search of a whole word with the regex \b....\b seems more accurate and will give correct results with usual chars, it may fail with many less common Unicode chars !
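The effect of an engine not knowing that a character is a letter can be imitated in Python. This is only an analogy: Python's `re` is not the Notepad++ engine, and `re.ASCII` is used here as a stand-in for "the engine's word-character table does not include these letters". The result for the Unicode-aware case depends on the Unicode database shipped with the Python version in use.

```python
import re
import unicodedata

# The three example strings from this post, written with escapes:
# U+0528, U+13F9 and U+2D2D in front of "abcd", U+00A4 (the currency sign) behind.
SAMPLES = ["\u0528abcd\u00a4", "\u13f9abcd\u00a4", "\u2d2dabcd\u00a4"]

def whole_word_hit(text, flags=0):
    # True if \babcd\b matches somewhere in the text.
    return re.search(r"\babcd\b", text, flags) is not None

# Engine that treats the leading letter as a word char: no boundary before "a".
UNICODE_AWARE = [whole_word_hit(s) for s in SAMPLES]

# Engine limited to [A-Za-z0-9_]: the letter counts as non-word, so \b holds.
ASCII_ONLY = [whole_word_hit(s, re.ASCII) for s in SAMPLES]

if __name__ == "__main__":
    for s, aware, ascii_only in zip(SAMPLES, UNICODE_AWARE, ASCII_ONLY):
        print(unicodedata.category(s[0]), "unicode-aware:", aware, "ascii-only:", ascii_only)
```

The same pattern and the same text give opposite answers depending only on which characters the engine classifies as word chars.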

Best Regards,

guy038

P.S. :

Note that the \b regex assertion may give correct but rather surprising results ! For instance, the regex \b\Q^!:/@?$\E\b matches the part ^!:/@?$ of the string A^!:/@?$Z, because a \b assertion can be the location between a word char and a non-word char ! So, definitely, the \b assertion in regexes and the Match whole word only option in Normal mode are not equivalent !
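The P.S. example can be reproduced in Python as a quick sanity check. Python's `re` module does not support `\Q...\E`, so `re.escape` plays that role in this sketch. The second test string, with spaces around the punctuation, is an addition for contrast and is not from the post.

```python
import re

NEEDLE = "^!:/@?$"

# \b + literal needle + \b; re.escape stands in for \Q...\E
PATTERN = r"\b" + re.escape(NEEDLE) + r"\b"

# Punctuation squeezed between two letters: both \b assertions hold.
between_letters = re.search(PATTERN, "A^!:/@?$Z")

# Same punctuation between spaces: no word char on either side, so no boundary.
between_spaces = re.search(PATTERN, " ^!:/@?$ ")

if __name__ == "__main__":
    print(between_letters)
    print(between_spaces)
```

With a purely non-word needle, `\b` demands word characters on the outside, which is the opposite of what "whole word" suggests.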

PeterJones

@guy038 ,

Thanks for the experiment. Basically, it boils down to “Unicode complicates things for whole word only”. ;-)

The phrasing I am considering for the user manual:

For ASCII text

if the left and right characters of your search string are both “word characters” (letters, numbers, underscore, and optionally additional characters set by your preferences), then “match whole word only” will only allow a match if the characters to the left and right of the match are non-word-characters or spaces or the beginning or ending of the line

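if the left and right characters of your search string are both non-word characters (so not letters, numbers, underscore, and optionally additional characters set by your preferences), then “match whole word only” will only allow a match if the characters to the left and right of the match are word characters or spaces or the beginning or ending of the line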

if the left of your search string is a word character and the right is not (or vice versa), then the characters to the left and right must be of the opposite type, or spaces, or beginning/ending of line.

For non-ASCII text, the general concepts are the same; however, some edge cases may behave differently than you expect, and with thousands of possible Unicode characters and millions of combinations of pairs of Unicode characters, this manual cannot contain a full description.

Either way, if you want full control of what counts as a “word” or a “word boundary”, use Search Mode = Regular Expression instead of Normal with Match Whole Word Only, which allows you full and precise control of what is allowed before and after what you consider a “whole word”.

And yes, I did verify that Settings > Preferences > Delimiter > add your character as part of a word does affect whether Match whole word only matches.
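The draft wording for ASCII text can be read as one rule per side of the match: the neighbouring character must be of the opposite type to the edge of the search string, or a space, or the line boundary. The Python sketch below encodes that reading. It is an interpretation of the draft text (with the both-non-word case read the same "opposite type" way as the mixed case), not Notepad++'s implementation. `edge_ok`, `whole_word_matches` and the `extra_word_chars` parameter are hypothetical names, the last one standing in for the Delimiter preference.

```python
import string

DEFAULT_WORD_CHARS = set(string.ascii_letters + string.digits + "_")

def edge_ok(edge_char, neighbour, word_chars):
    # Line start/end (neighbour is None), line breaks and spaces always pass.
    if neighbour is None or neighbour in " \r\n":
        return True
    # Otherwise the neighbour must be of the opposite type to the edge char.
    return (edge_char in word_chars) != (neighbour in word_chars)

def whole_word_matches(line, needle, extra_word_chars=""):
    # Return the offsets of every occurrence that passes the check on both sides.
    if not needle:
        return []
    word_chars = DEFAULT_WORD_CHARS | set(extra_word_chars)
    hits = []
    pos = line.find(needle)
    while pos != -1:
        end = pos + len(needle)
        left = line[pos - 1] if pos > 0 else None
        right = line[end] if end < len(line) else None
        if edge_ok(needle[0], left, word_chars) and edge_ok(needle[-1], right, word_chars):
            hits.append(pos)
        pos = line.find(needle, pos + 1)
    return hits
```

Under this reading, `="1000"/>` is accepted in both `<a price="1000"/> x` and `<a price="1000"/>x`, and `="1000"` is rejected in `price="1000"/>`, which agrees with the behaviour reported earlier in the thread. Passing `extra_word_chars="-"` makes `1000` in `a-1000` stop matching, mirroring the effect of adding a character in the Delimiter preference.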

PeterJones

The phrasing I am considering for the user manual:

issue #349 => PR #350

It should be in the next release of the user manual