Search with quantifier failed

PeterJones

Dagger † U+2020 as seen in the Extended Mode Docs:

It may be a crippled mode, but it should still be correctly documented. The docs as written don’t clarify well enough that \d requires three decimal digits, nor that it is not the same as the \d from regex mode. Other entries in the list use the dagger notation to reference the note about being different from the regex syntax that looks similar.

datatraveller1

@Alan-Kilborn I use the extended mode for hexadecimal search (\x), so for me it is still useful in some cases.

Alan Kilborn

@datatraveller1 said in Search with quantifier failed:

I use the extended mode for hexadecimal search (\x), so for me it is still useful in some cases.

For the most part, you can also use Regular expression mode for that. But, as I’ve recently cautioned another poster, if you find you are searching for hex sequences often, you are probably doing something wrong with how you are approaching text editing. Of course, that’s said with no info about what you are doing.

Terry R

Firstly, apologies to @Dr-N for possibly steering the posts in a wrong direction.

Secondly, sorry to the rest of you. I had not actually tested the Extended mode (wasn’t on a PC when I posted), rather just read from the manual (which has been verified as missing important information). It also appeared (on the surface) to explain the OP’s issue.

I have never used Extended because of the same reasoning as @Alan-Kilborn. I’m even more convinced now that it is a rubbish mode. If it’s intended to give users a leg up to full blown regex then (I believe) it completely misses the mark.

Given that the manual didn’t fully explain its use, if someone new were to try it, the metacharacters, which so closely resemble regex syntax, would only serve to confuse users of that mode. I cannot see a reason for retaining this mode.

My 2c worth
Terry

Alan Kilborn

@Terry-R said in Search with quantifier failed:

My 2c worth

So here’s my take on it.

The history: Notepad++, pre version 6.0, had no regex search/replace mode. The author wanted to give some capability, and didn’t want to tackle a regex implementation himself (fun fact: in truth it was done by the author of the PythonScript plugin and some others). Thus “extended” mode was born.

When regex mode DID come along, “extended” mode really wasn’t necessary, but features aren’t usually removed, because someone may be using them (and complain). Thus, it lives. And maybe some still cling to it because regular expressions terrify them. :-)

datatraveller1

@Alan-Kilborn I rarely use the hexadecimal search, e.g. to search for the annoying character “Non-breaking space, HEX A0, DEC 160”.
-> Notepad++, extended search for \xA0
… but you are right, this also works with the regular expression mode.
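As a rough illustration outside Notepad++ (plain Python, not anything the editor runs), the same hunt for the non-breaking space can be sketched as below. The helper name `find_char` and the sample string are made up for the example.

```python
import unicodedata

def find_char(text, ch="\xa0"):
    """Return the offsets at which `ch` (default: U+00A0) occurs in `text`."""
    return [i for i, c in enumerate(text) if c == ch]

# Hypothetical sample: a non-breaking space hides between "10" and "km".
sample = "10\xa0km and 5 km"
print(find_char(sample))
print(ord("\xa0"), unicodedata.name("\xa0"))
```

The point is only that HEX A0 and DEC 160 name the same code point, which is why either notation finds it.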

BTW, I was a bit confused by the \X paragraph in the manual
https://npp-user-manual.org/docs/searching:
“For example, the letter ǭ̳̚, with four combining characters after the o, can be found either with the regex (?-i)o\x{0304}\x{0328}\x{031a}\x{0333} or with the shorter regex \X.” -> I miss the example for the shorter regex \X?

PeterJones

@datatraveller1 said in Search with quantifier failed:

I miss the example for the shorter regex \X

The entire regex is \X – it matches one letter plus all the combining characters that come after, hence it would match the o and the four combining characters shown.

guy038

Hello @dr-n, @alan-kilborn, @terry-r, @peterjones, @datatraveller1 and All,

I tried to merge the two posts below, in order to get a complete summary of the Extended search mode feature !

https://community.notepad-plus-plus.org/post/45753

https://community.notepad-plus-plus.org/post/24236

I hope I have not forgotten anything important !

Peter, if you get some spare time, just check here and see if some points of this post could be added / improved !

In the Extended search mode, in addition to the search/replacement of standard characters and the 5 specific characters, below :

Character	Syntax
Tabulation	`\t`
New Line	`\n`
Carriage Return	`\r`
Backslash	`\\`
Null	`\0`

Within a Unicode encoded file, any single character of code-point U+xxxx may be written, in the Find what: and the Replace With: zones, with one of the five syntaxes below :

Type	From	To	Character Range
Decimal	`\d000`	`\d999`	`[0-9]`
Octal	`\o000`	`\o777`	`[0-7]`
Binary	`\b00000000`	`\b11111111`	`[0-1]`
Hexadecimal	`\x00`	`\xFF`	`[0-9A-Fa-f]`
Unicode	`\u0000`	`\uFFFF`	`[0-9A-Fa-f]`

Consequence :

The character with the greatest Unicode code-point which can be searched and/or replaced, in Extended mode, is :

\d999, so the Unicode character ϧ ( COPTIC SMALL LETTER KHEI ), with code-point = \u03e7, in the decimal representation
\o777, so the Unicode character ǿ ( LATIN SMALL LETTER O WITH STROKE AND ACUTE ), with code-point = \u01ff, in the octal representation
\b11111111, so the Unicode character ÿ ( LATIN SMALL LETTER Y WITH DIAERESIS ), with code-point = \u00ff, in the binary representation
\xFF, so the Unicode character ÿ ( LATIN SMALL LETTER Y WITH DIAERESIS ), with code-point = \u00ff, in the hexadecimal representation
\uFFFD, so the Unicode character � ( REPLACEMENT CHARACTER ), with code-point = \ufffd, in the Unicode representation
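The limits for the fixed-width numeric notations follow directly from the number of digits and the base. Here is a small Python sketch of that arithmetic; it is only a check of the numbers above, not a description of how Notepad++ computes anything, and the names `LIMITS` and `largest` are invented for the example.

```python
import unicodedata

# Largest value each fixed-width notation can express: escape -> (digits, base)
LIMITS = {
    r"\d999": ("999", 10),
    r"\o777": ("777", 8),
    r"\b11111111": ("11111111", 2),
    r"\xFF": ("FF", 16),
}

def largest(digits, base):
    """Convert the all-maximum digit string to a code point and its Unicode name."""
    cp = int(digits, base)
    return cp, unicodedata.name(chr(cp))

for escape, (digits, base) in LIMITS.items():
    cp, name = largest(digits, base)
    print(f"{escape:12} -> U+{cp:04X} {name}")
```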

Within an ANSI encoded file, any single character of code-point U+00xx, may be written, in the Find what: and the Replace With: zones, with one of the four syntaxes below :

Type	From	To	Character Range
Decimal	`\d000`	`\d255`	`[0-9]`
Octal	`\o000`	`\o377`	`[0-7]`
Binary	`\b00000000`	`\b11111111`	`[0-1]`
Hexadecimal	`\x00`	`\xFF`	`[0-9A-Fa-f]`

Remarks :

In all cases, the character with the greatest Unicode code-point which can be searched and/or replaced is, either, \d255 or \o377 or \b11111111 or \xFF which refers to the Unicode character ÿ ( LATIN SMALL LETTER Y WITH DIAERESIS )
A Unicode character, of code-point U+00xx, can be found ONLY IF xx belongs to the range [00-7F] OR to the range [A0-FF]. When xx lies between 80 and 9F, the search generally looks for the question mark ( ? ) instead, as it refers to a Unicode char whose code-point is not handled by the ANSI encoding ! Only the 5 characters U+0081, U+008D, U+008F, U+0090 and U+009D, without any glyph, are correctly searched !
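One way to see where the 80-9F gap comes from, assuming the ANSI code page is Windows-1252 (an assumption: the ANSI code page depends on the system locale), is to look at what those byte values decode to. The sketch below uses Python's `cp1252` codec, which is not the conversion Notepad++ itself performs; the function name `cp1252_gap` is invented for the example.

```python
def cp1252_gap():
    """Classify bytes 0x80-0x9F under Python's cp1252 codec (assumed ANSI page)."""
    shifted, undefined = [], []
    for byte in range(0x80, 0xA0):
        try:
            ch = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            # Python's codec leaves this byte unassigned.
            undefined.append(byte)
        else:
            # The byte is used for a character that lives elsewhere in Unicode.
            shifted.append((byte, ord(ch)))
    return shifted, undefined

shifted, undefined = cp1252_gap()
print([f"U+{cp:04X}" for _, cp in shifted])
print([f"0x{b:02X}" for b in undefined])
```

In this codec every assigned byte in that range maps to a character above U+00FF (the euro sign, curly quotes and so on), so none of U+0080 to U+009F has a byte of its own. The bytes the codec leaves unassigned are 81, 8D, 8F, 90 and 9D, which is consistent with the five exceptions listed above. Outside that range, byte value and code point coincide.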

Examples ( With the Match case option ticked and the Match whole word only option UN-ticked ) :

If you search for the uppercase letter A, you can choose, either, the syntax \d065 or \o101 or \b01000001 or \x41 or \u0041
And if you look for the character with decimal code 201 ( É ), type in, either, the syntax \d201 or \o311 or \b11001001 or \xC9 or \u00C9
Of course, you may mix all these representations, in both the Search and Replace zones. For instance, the text \d065\o102\b01000011Z\x44\u0045 represents the simple string ABCZDE
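To make the mixed example concrete, here is a minimal Python model of the escapes described in this post. It is a sketch under stated assumptions, not Notepad++'s actual implementation: the names `decode_extended`, `NUMERIC` and `SIMPLE` are invented, and it assumes each notation takes exactly the digit count shown in the tables, leaving anything else as literal text.

```python
# Hypothetical model of the Extended-mode escapes listed in this post.
# prefix letter -> (exact number of digits, base, allowed digit characters)
NUMERIC = {
    "d": (3, 10, "0123456789"),
    "o": (3, 8, "01234567"),
    "b": (8, 2, "01"),
    "x": (2, 16, "0123456789abcdefABCDEF"),
    "u": (4, 16, "0123456789abcdefABCDEF"),
}
SIMPLE = {"t": "\t", "n": "\n", "r": "\r", "0": "\0", "\\": "\\"}

def decode_extended(pattern: str) -> str:
    """Translate the escapes in `pattern` into the characters they stand for."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\" and i + 1 < len(pattern):
            kind = pattern[i + 1]
            if kind in SIMPLE:
                out.append(SIMPLE[kind])
                i += 2
                continue
            if kind in NUMERIC:
                width, base, alphabet = NUMERIC[kind]
                digits = pattern[i + 2 : i + 2 + width]
                if len(digits) == width and all(c in alphabet for c in digits):
                    out.append(chr(int(digits, base)))
                    i += 2 + width
                    continue
        # Not a recognised escape (assumption): keep the character as-is.
        out.append(pattern[i])
        i += 1
    return "".join(out)

print(decode_extended(r"\d065\o102\b01000011Z\x44\u0045"))
print(decode_extended(r"\d201") == decode_extended(r"\xC9"))
```

Each escape is just the same code point written in a different base, which is why the notations can be mixed freely within one string.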

Remark : Depending on the End of Line character(s) used in your current file, indicated in the status bar ( \r\n for a Windows file, \n for a Unix file, and \r for a Mac file ), you can search and/or replace text containing line break(s). For instance :

The search, in the Extended or Regular expression search mode, of the string ABC\r\n123 and the replacement by the string Word\r\nNumber, in a Windows file, would change the two lines :

ABC
123

as the text :

Word
Number

This same S/R, in a Unix file, could be performed with the searched string ABC\n123 and the replaced string Word\nNumber
With the following Windows file :

Line_1
Line_2
Line_3
Line_4
Line_5
Line_6
Line_7
Line_8
Line_9

You could, perfectly, in Extended or Regular expression mode, use the following S/R :

SEARCH Line_1\r\nLine_2\r\nLine_3\r\nLine_4\r\nLine_5\r\nLine_6\r\nLine_7\r\nLine_8\r\nLine_9\r\n
REPLACE Modified Line #1\r\nModified Line #2\r\nModified Line #3\r\nModified Line #4\r\nModified Line #5\r\nModified Line #6\r\nModified Line #7\r\nModified Line #8\r\nModified Line #9\r\n

And get the text :

Modified Line #1
Modified Line #2
Modified Line #3
Modified Line #4
Modified Line #5
Modified Line #6
Modified Line #7
Modified Line #8
Modified Line #9

The nice trick, with the search dialog, is that you DON’T need to separate the text of each line, with the End of Line characters \r\n :

Select, first, the original 9-lines text
Open the Replace dialog ( Ctrl + H )

=> The entire searched text is automatically filled

Unfortunately, you CANNOT use this same work-around, for the replacement dialog -:(( So, you’ll still have to type all the text, below :

Modified Line #1\r\nModified Line #2\r\nModified Line #3\r\nModified Line #4\r\nModified Line #5\r\nModified Line #6\r\nModified Line #7\r\nModified Line #8\r\nModified Line #9\r\n
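If typing that by hand is too tedious, the one-line replacement string can be generated outside the editor and pasted in. A small Python sketch follows; the helper name `to_extended` is invented, and it targets the Extended-mode syntax only (a Regular expression replacement has further special characters that this does not escape).

```python
def to_extended(text: str) -> str:
    """Write backslashes, line breaks and tabs as Extended-mode escapes."""
    return (
        text.replace("\\", "\\\\")  # must come first, before new backslashes are added
        .replace("\r", "\\r")
        .replace("\n", "\\n")
        .replace("\t", "\\t")
    )

# Rebuild the nine replacement lines with Windows line endings, then escape them.
lines = [f"Modified Line #{n}" for n in range(1, 10)]
replacement = to_extended("".join(line + "\r\n" for line in lines))
print(replacement)
print(len(replacement))
```

Printing the length is a quick way to confirm that the result stays under the field limit mentioned below.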

Note that, WHATEVER the search mode used :

Do not exceed 2046 characters for, both, the Search and the Replace zones. Anyway, any surplus character is simply ignored !
It may be worth checking the Match case option, in order to differentiate between upper and lower case letters
I strongly advise you to uncheck the Match whole word only option, especially when the searched string begins and/or ends with a NON-word character
The search of individual bytes of a UTF-8 or UCS-2 encoded character is not allowed !
The replacement zone may contain any char, except for the NUL char ( \0 ), whatever its representation ( \0, \d000, \o000, \b00000000, \x00 or \u0000 )

Best Regards,

guy038

P.S. :

Personally, I think that the only advantage of using the Extended mode is when using the \dxxx syntax, where xxx represents the decimal code of the character :
- Between 000 and 255 ( so in range U+0000 - U+00FF) within a UTF-8 or UCS-2 encoded file
- Between 000 and 127 or between 160 and 255 ( so in ranges U+0000 - U+007F or U+00A0 - U+00FF ) within an ANSI file

In all other cases, just prefer the Regular expression search mode ;-))

For information, about the Extended search mode, you may also refer to this old article, in N++ Wiki, via the web.archive site :

https://web.archive.org/web/20190609210114/http://docs.notepad-plus-plus.org/index.php/Searching_And_Replacing#Escape_sequences_supported_in_extended_mode

Finally, @peterjones, if you need to refer to the old N++ Wiki, here are some links, via the WayBack Machine site :

https://web.archive.org/web/20190719202854/http://docs.notepad-plus-plus.org/index.php/Main_Page

https://web.archive.org/web/20190719202854/http://docs.notepad-plus-plus.org/index.php/Category:Keywords

https://web.archive.org/web/20190719202854/http://docs.notepad-plus-plus.org/index.php/Category:Short_Title(All)

datatraveller1

@PeterJones said in Search with quantifier failed:

The entire regex is \X – it matches one letter plus all the combining characters that come after, hence it would match the o and the four combining characters shown.

Sorry, but I still don’t understand the manual text:
“For example, the letter ǭ̳̚, with four combining characters after the o, can be found either with the regex (?-i)o\x{0304}\x{0328}\x{031a}\x{0333} or with the shorter regex \X.”

-> \X seems to find any letter but the text implies \X is an alternative to find exactly ǭ̳̚?

… So is \X a better alternative than a dot in a regular expression to find a letter, because a dot finds the four combining characters and \X the one whole letter?

PeterJones

Correction: I earlier said from memory,

it matches one letter plus all the combining characters that come after

when I should have said

it matches one character plus all the combining characters that come after

It was evening, I was tired, and I was typing on my phone from memory without the manual or a copy of Notepad++ in front of me.

@datatraveller1 said in Search with quantifier failed:

\X seems to find any letter but the text implies \X is an alternative to find exactly ǭ̳̚?

You read it differently than it was intended. The manual literally says “Matches a single non-combining character followed by any number of combining characters” and that’s exactly what it matches. It doesn’t matter whether that character is o or a or Z or whatever. It matches one character, along with all the modifiers that come next. Just like \u matches one uppercase letter, or \l matches one lowercase letter, or \R matches either \r or \n or \r\n, the \X regex will match a character plus all the combining characters that come next.

In this example text, I have o followed by those four modifiers, a followed by those four modifiers, and _ followed by those

ǭ̳̚

ą̳̄̚

_̨̳̄̚

:̨̳̄̚

You will see that it matches all four of those sequences.

It says 20, because it also matches each of the bytes of the newlines between, each of which are 1 character followed by 0 modifying characters.

So is \X a better alternative than a dot in a regular expression to find a letter, because a dot finds the four combining characters and \X the one whole letter?

Depending on how one interprets your phrasing, that’s either right, or literally the opposite of what happens: if you mean that ǭ̳̚ with all the modifiers is “one whole letter” and that “dot finds the four combining characters” means that the dot matches each combining character independently with four separate matches, then yes, you are right. If you meant it the way I first read it, where it means “the single dot matches all four combining characters at once, whereas the \X just finds the letter”, then it’s exactly opposite of what happens.

Dot matches a single character – this could be the initial character, or any of the four modifiers. \X matches the character plus all the modifiers in one unit. So the screenshot above with the \X showed 20 matches, whereas the . will show 36 (one for each character).
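The difference can be modelled outside Notepad++ as well. Python's built-in `re` module has no \X, so the sketch below uses a hand-written stand-in (the function name `clusters` is invented) that groups each character with the combining marks following it, using Unicode general category M as the test for "combining". The sample contains only the four sequences with no line breaks, so its counts are not meant to reproduce the 20 and 36 from the test file above.

```python
import re
import unicodedata

def clusters(text):
    """Rough stand-in for \\X: a character plus the combining marks after it."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch      # combining mark: attach to the current cluster
        else:
            out.append(ch)     # anything else starts a new cluster
    return out

marks = "\u0304\u0328\u031a\u0333"   # the four combining characters from the manual's regex
sample = "".join(base + marks for base in "oa_:")

print(len(re.findall(".", sample)))  # dot: one match per code point
print(len(clusters(sample)))         # cluster model: one match per sequence
```

With four sequences of five code points each, the dot sees every base character and every modifier separately, while the cluster model treats each sequence as a single unit.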

I really didn’t think there was ambiguity in the phrasing, but I will try to clarify it some more.

datatraveller1

@PeterJones Thank you very much indeed!