"Whole Word Only" Option in Combination with Non-Alphanumeric Characters
-
When searching for the string “Notepad++” with the "Match whole word only" option in a CSV file, there are inconsistencies in the search results based on the placement of non-alphanumeric characters.
Examples and Outcomes:
-
Search Term:
Notepad++
Text: Notepad,Notepad++,Notepad
Result: No match found.
Explanation: There is no match even though “Notepad++” is present, because the trailing + sign is immediately followed by a comma. -
Search Term:
Notepad++
Text: Notepad,Notepad++ ,Notepad
Result: Match in the 2nd column.
Explanation: A match occurs here because there is a space after “Notepad++”, which seems to affect the search behavior. -
Search Term:
Notepad
Text: Notepad,Notepad,Notepad
Result: Hits in all columns.
Explanation: Searches without non-alphanumeric characters at the ends behave as expected.
It appears that the placement of non-alphanumeric characters at the beginning or end of the search string affects the outcome. This might be initially perceived as a bug.
Is it possible to adjust the behavior in the Scintilla component for
SCI_SEARCHINTARGET
to handle search terms uniformly regardless of surrounding characters? I’m looking into this for the MultiReplace Plugin and would appreciate any insights or suggestions. -
-
Have you read the fine user manual about
Match whole word only
? -
@Alan-Kilborn
Thanks for the hint. I checked the help documentation, and it seems this behavior is normal and something I’ll need to get used to. -
I don’t know if searching in regex mode works “better” or not:
e.g.
\b\QNotepad++\E\b
-
@Alan-Kilborn
I was just curious about this when I first recognized it in the plugin and thought it was a bug. However, after realizing that this behavior is normal, I found it a bit odd. Since this behavior cannot be changed, it’s just part of the feature set of the plugin then. -
Based on the Scintilla documentation:
https://www.scintilla.org/ScintillaDoc.html#searchFlags
the plain text search with whole word enabled should be equivalent to:
(?<!\w)\QNotepad++\E(?!\w)
The implementation says otherwise:
“Whole word” effectively implies a word boundary; so it behaves like @Alan-Kilborn’s suggestion:
\b\QNotepad++\E\b
and not like the documentation indicates.
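To make the difference between the two patterns concrete, here is a small Python sketch. It rests on a few assumptions: Python's re module stands in for the Boost regex engine that Notepad++ uses, and re.escape replaces \Q...\E, which Python's re does not support. The two sample strings are the ones from the first post.

```python
import re

term = "Notepad++"
escaped = re.escape(term)  # stands in for \Q...\E, which Python's re lacks

# Pattern suggested by the Scintilla search-flag description (lookarounds).
lookaround = re.compile(r"(?<!\w)" + escaped + r"(?!\w)")
# Pattern using plain word boundaries.
boundary = re.compile(r"\b" + escaped + r"\b")

samples = [
    "Notepad,Notepad++,Notepad",   # '+' directly followed by a comma
    "Notepad,Notepad++ ,Notepad",  # '+' followed by a space
]

# Map each sample to (lookaround matched?, word-boundary matched?).
results = {
    text: (bool(lookaround.search(text)), bool(boundary.search(text)))
    for text in samples
}

for text, (la, wb) in results.items():
    print(f"{text!r}: lookaround={la}, word-boundary={wb}")
```

In this sketch the lookaround form matches both samples, while the \b form matches neither, including the sample with the trailing space, where the plain-text Whole Word search was reported to match. So \b looks like an approximation of the Whole Word behavior, not an exact equivalent.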
-
@Coises ,
I respectfully disagree.
With the text Notepad,Notepad++,Notepad, search for Notepad++ with Whole Word checkmarked. It won’t be found, because there is no character-class difference between the + at the end of the word and the comma (,) after it that’s not part of the search string. Thus, the “Check that the given range is has transitions between character classes at both” comment that was in the source-code link is not fulfilled (+ to , is not a character class transition).
And the manual says, “If the left of your search string is a word character and the right is not (or vice versa), then the characters to the left and right must be of the opposite type, or be spaces, or be the beginning/ending of a line.” The left of the search string is N, so a word character; the right is +, so punctuation; thus, it would have to be non-word to the left of the N (it is) and word or space to the right of the + (it is a comma, so punctuation, which is neither). Thus the Manual correctly describes that Whole Word will not match for that.
The Whole Word search for Notepad++ in the string Notepad,Notepad++,Notepad is behaving as described in both the comments of the source code and in the User Manual. -
@PeterJones said in "Whole Word Only" Option in Combination with Non-Alphanumeric Characters:
The Whole Word search for Notepad++ in the string Notepad,Notepad++,Notepad is behaving as described in both the comments of the source code and in the User Manual.
Indeed, it does.
The documentation I looked at was the Scintilla documentation for the search flags. (Probably because I was thinking more as a plugin developer than as an end user.)
The Notepad++ User Manual documentation describes the actual behavior correctly.
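For plugin authors who want to predict the outcome, the rule discussed above (a character-class transition is required at both ends of the hit) can be sketched in Python. This is only a model of the behavior described in this thread, not the Scintilla or Notepad++ code. The three-way classification in char_class and the helper names is_whole_word and find_whole_word are assumptions made for illustration; the real set of word characters may differ.

```python
def char_class(ch):
    """Rough three-way classification (assumed, not Scintilla's real tables)."""
    if ch.isspace():
        return "space"
    if ch.isalnum() or ch == "_":
        return "word"
    return "punct"


def is_whole_word(text, start, end):
    """True if text[start:end] has a class transition at both ends.

    The start and the end of the text count as transitions.
    """
    starts_ok = start == 0 or char_class(text[start - 1]) != char_class(text[start])
    ends_ok = end == len(text) or char_class(text[end - 1]) != char_class(text[end])
    return starts_ok and ends_ok


def find_whole_word(text, term):
    """Return the start offsets of all whole-word occurrences of term."""
    if not term:
        return []
    hits = []
    pos = text.find(term)
    while pos != -1:
        if is_whole_word(text, pos, pos + len(term)):
            hits.append(pos)
        pos = text.find(term, pos + 1)
    return hits


print(find_whole_word("Notepad,Notepad++,Notepad", "Notepad++"))   # comma after '+'
print(find_whole_word("Notepad,Notepad++ ,Notepad", "Notepad++"))  # space after '+'
print(find_whole_word("Notepad,Notepad,Notepad", "Notepad"))
```

Under these assumptions the model reproduces the three results from the first post: no hit when the + is followed by a comma, one hit when it is followed by a space, and three hits for the plain Notepad search.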
-
@Coises said in "Whole Word Only" Option in Combination with Non-Alphanumeric Characters:
The documentation I looked at was the Scintilla documentation
Ah, okay, I misunderstood which “documentation” you were referring to. I agree that Scintilla’s description doesn’t cover the edge cases, though it probably should. (Who knows if they’ve even bothered to learn their own edge cases; I get the feeling that Notepad++ and its associated plugin authors push Scintilla in ways that the Scintilla developers never expected; though presumably other apps that use Scintilla push things in different directions than we do.)
-
I’ve been thinking about how the “Match Whole Word Only” search option might function:
- Text Segmentation: The entire text is divided into chunks by separating at non-word characters, ensuring symbols like ‘+’ are not included in these chunks.
- Search Within Chunks: The search then strictly focuses on these separated chunks.
Interestingly, spaces seem to act as primary separators, overruling other non-word characters if they are next to these characters and including adjacent non-word characters into the chunks.
This chunk preparation, which happens without analyzing the search string, likely makes the search process faster, especially if the text is pre-prepared. Finally, the search focuses only on these separated chunks.
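One possible way to make the chunk idea concrete, purely as an illustration: treat a chunk as a maximal run of characters of the same class (word, space, or other), and accept a hit only if it starts and ends on chunk borders. This is a variant of the idea above and an assumption, not how Notepad++ or Scintilla is known to work; char_class, chunks, chunk_boundaries and aligned_hits are hypothetical names.

```python
from itertools import groupby


def char_class(ch):
    """Rough three-way classification (assumed for this sketch)."""
    if ch.isspace():
        return "space"
    if ch.isalnum() or ch == "_":
        return "word"
    return "punct"


def chunks(text):
    """Split text into maximal runs of characters sharing one class."""
    return ["".join(run) for _, run in groupby(text, key=char_class)]


def chunk_boundaries(text):
    """Return the set of offsets at which a chunk starts or ends."""
    offsets = {0}
    pos = 0
    for chunk in chunks(text):
        pos += len(chunk)
        offsets.add(pos)
    return offsets


def aligned_hits(text, term):
    """Return offsets where term occurs and both ends sit on chunk borders."""
    if not term:
        return []
    bounds = chunk_boundaries(text)
    hits = []
    pos = text.find(term)
    while pos != -1:
        if pos in bounds and pos + len(term) in bounds:
            hits.append(pos)
        pos = text.find(term, pos + 1)
    return hits


for sample in ("Notepad,Notepad++,Notepad", "Notepad,Notepad++ ,Notepad"):
    print(chunks(sample), aligned_hits(sample, "Notepad++"))
```

In this model the comma merges with the plus signs into a single chunk "++,", so Notepad++ ends in the middle of a chunk and is rejected. With a space in between, "++" becomes a chunk of its own and the hit is accepted, which would explain why spaces seem to overrule other non-word characters.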
-
Interesting idea. If you’re a scripter, maybe mockup some demo with a script and show it here?
-
@PeterJones said in "Whole Word Only" Option in Combination with Non-Alphanumeric Characters:
I agree that Scintilla’s description doesn’t cover the edge cases, though it probably should.
Having just discovered — to some horror — that Notepad++ uses a modified version of Scintilla (context here) I am no longer inclined to “blame” Scintilla for anything without first doing a lot of investigation.
I somehow just assumed that modifying Scintilla would be “off limits” for the Notepad++ project.
-
@Coises said in "Whole Word Only" Option in Combination with Non-Alphanumeric Characters:
that modifying Scintilla would be “off limits” for the Notepad++ project
It mostly is.
But I think in some areas it was judged to be something that “had to be done”.