RegExpr search for utf8 chars not working anymore?

Pasquale Cerullo

Hi, as of today the search for words including utf8 chars isn’t working as expected.
The following search string:

^([a-z \u])

applied to, say ‘Bayern München’, now returns something like:

Bayern M

instead of the full text as it did until a week ago. The same search string works as expected in C# (with \u = \p{L}, of course!). Did you change something related to regular expressions?

MAPJe71

Boost::RegEx \u: An “uppercase character” (any uppercase letter in the active code page).
C# \p{L}: A character from the Unicode category “letter” (any kind of letter from any language).
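
To make the difference concrete, here is a minimal Python sketch (not Boost and not C#). It assumes that Python's str.isupper() can stand in for an "uppercase character" test and that a Unicode general category starting with "L" can stand in for \p{L}. The helper names is_unicode_letter and classify are made up for this illustration.

```python
import unicodedata

def is_unicode_letter(ch):
    # General categories Lu, Ll, Lt, Lm and Lo all start with "L",
    # which is roughly what \p{L} means in C#.
    return unicodedata.category(ch).startswith("L")

def classify(text):
    # For each character: (char, is it uppercase?, is it any kind of letter?)
    return [(ch, ch.isupper(), is_unicode_letter(ch)) for ch in text]

for ch, upper, letter in classify("Mü"):
    print(ch, "uppercase:", upper, "letter:", letter)
```

The lowercase ü is a letter but not an uppercase character, so a class built from a-z plus "uppercase" does not cover it when case matters.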

Pasquale Cerullo

How would you write \p{L} for notepad++ regex engine?

Pasquale Cerullo

Unchecking the option to distinguish between uppercase and lowercase letters solved my issue.

@MAPJe71 : Without the \u identifier the text search stops at unicode chars, so…

MAPJe71

Have a look here and/or here.

guy038

Hello, @pasquale-cerullo, and All :

Here, below, are two tables that recapitulate the different “Escape” characters ( Shorthand character classes ) available with the C++ Boost library v1.55.0, in Notepad++ :

A) BOOST syntaxes, of a single character, IN and/or OUT of Character Classes ( The Match case is SET ) :

•===================•===========================•==========•==========================================================•
|   IN Class [...]  |      OUT Class [...]      | IN / OUT |   Character Class Contents for "Windows-1252" Encoding   |
•===================•=============•=======•=====•==========•==========================================================•
| [:space:] | [:s:] | \p{space}   | \p{s} | \ps |    \s    | [\t\n\x0B\f\r\x20\x85\xA0]                               |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:digit:] | [:d:] | \p{digit}   | \p{d} | \pd |    \d    | [0-9¹²³]                                                 |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:lower:] | [:l:] | \p{lower}   | \p{l} | \pl |    \l    | (?-i)[a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]        |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:upper:] | [:u:] | \p{upper}   | \p{u} | \pu |    \u    | (?-i)[A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]             |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:word:]  | [:w:] | \p{word}    | \p{w} | \pw |    \w    | [_\d\l\u]                                                |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:blank:] | [:h:] | \p{blank}   | \p{h} | \ph |    \h    | [\t\x20\xA0]                                             |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|      [:v:]        | \p{v}       | \p{v} | \pv |    \v    | [\n\x0B\f\r\x85]                                         |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|    [:alnum:]      | \p{alnum}   |       |     |          | [\d\l\u]                                                 |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|    [:alpha:]      | \p{alpha}   |       |     |          | [\l\u]                                                   |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|    [:cntrl:]      | \p{cntrl}   |       |     |          | [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]                      |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|    [:graph:]      | \p{graph}   |       |     |          | [^\x00-\x20\x7F-\x81ˆ\x8D\x8F\x90˜™\x9D\xA0]             |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|    [:print:]      | \p{print}   |       |     |          | [\s[:graph:]]                                            |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|    [:punct:]      | \p{punct}   |       |     |          | []\x21-\x2F:;<=>?@\\^_`{|}~‚„…†‡‰‹‘’“”•–—›\xA1-\xBF×÷[]  |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|    [:xdigit:]     | \p{xdigit}  |       |     |          | [0-9A-Fa-f]                                              |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|    [:unicode:]    | \p{unicode} |       |     |          | (?-i)[\x{0100}-\x{FFFF}]                                 |
•===================•=============•=======•=====•==========•==========================================================•
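
As a cross-check of the [:upper:] row, the following Python sketch decodes every Windows-1252 byte and keeps the characters that Python considers uppercase. This is an assumption-laden illustration: Python's str.isupper() follows Unicode rules, not Boost's code-page logic, so agreement with the table is not guaranteed for every class. The function name cp1252_uppercase is made up.

```python
def cp1252_uppercase():
    # Decode each byte value on its own; the few bytes that are undefined
    # in Windows-1252 decode to an empty string and are skipped.
    chars = []
    for b in range(256):
        ch = bytes([b]).decode("cp1252", errors="ignore")
        if ch and ch.isupper():
            chars.append(ch)
    return "".join(chars)

upper = cp1252_uppercase()
print(len(upper))
print(upper)
```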

B) BOOST syntaxes, of a single character, IN and/or OUT of NEGATED Character Classes ( The Match case is SET ) :

•===================•===========================•==========•==========================================================•
|   IN Class [^..]  |      OUT Class [^..]      | IN / OUT |   Character Class Contents for "Windows-1252" Encoding   |
•===================•=============•=======•=====•==========•==========================================================•
| [:space:] | [:s:] | \P{space}   | \P{s} | \Ps |    \S    | [^\t\n\x0B\f\r\x20\xa0\x{2028}\x{2029}]                  |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:digit:] | [:d:] | \P{digit}   | \P{d} | \Pd |    \D    | [^0-9¹²³]                                                |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:lower:] | [:l:] | \P{lower}   | \P{l} | \Pl |    \L    | (?-i)[^a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]       |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:upper:] | [:u:] | \P{upper}   | \P{u} | \Pu |    \U    | (?-i)[^A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]            |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:word:]  | [:w:] | \P{word}    | \P{w} | \Pw |    \W    | [^_\d\l\u]                                               |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:blank:] | [:h:] | \P{blank}   | \P{h} | \Ph |    \H    | [^\t\x20\xA0]                                            |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|      [:v:]        | \P{v}       | \P{v} | \Pv |    \V    | [^\n\x0B\f\r\x85\x{2028}\x{2029}]                        |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|    [:alnum:]      | \P{alnum}   |       |     |          | [^\d\l\u] = [\D\L\U]                                     |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|    [:alpha:]      | \P{alpha}   |       |     |          | [^\l\u]   = [\L\U]                                       |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|    [:cntrl:]      | \P{cntrl}   |       |     |          | [^\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]                     |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|    [:graph:]      | \P{graph}   |       |     |          | [\x00-\x20\x7F-\x81ˆ\x8D\x8F\x90˜™\x9D\xA0]              |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|    [:print:]      | \P{print}   |       |     |          | [\x00-\x08\x0E-\x1f\x7F-\x81ˆ\x8D\x8F\x90˜™\x9D]         |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|    [:punct:]      | \P{punct}   |       |     |          | [^]\x21-\x2F:;<=>?@\\^_`{|}~‚„…†‡‰‹‘’“”•–—›\xA1-\xBF×÷[] |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|    [:xdigit:]     | \P{xdigit}  |       |     |          | [^0-9A-Fa-f]                                             |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|    [:unicode:]    | \P{unicode} |       |     |          | (?-i)[^\x{0100}-\x{FFFF}]                                |
•===================•=============•=======•=====•==========•==========================================================•

So, your regex ^([a-z \u]) matches any single character, at the beginning of a line, which is either :

A lowercase letter between a and z, a Space char, or an uppercase letter ( accented or not ), if the Match case option is set
A letter between a and z or between A and Z, a Space char, or any letter ( accented or not ), if the Match case option is unset
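
The effect of the Match case option can be imitated with a small Python sketch. Assumptions, all of them mine: Python's re module has no \u class, so the [:upper:] contents from the Windows-1252 table are spelled out; a trailing + is added so that a whole run is matched (the posted pattern shows no quantifier); and re.IGNORECASE stands in for unchecking Match case. The name leading_run is made up.

```python
import re

# Contents of the [:upper:] row of the table, used here in place of \u.
UPPER = "A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ"

# Assumed "+" quantifier: match the whole leading run, not one character.
PATTERN = "^([a-z " + UPPER + "]+)"

def leading_run(text, match_case):
    # match_case=True  -> case-sensitive search
    # match_case=False -> case-insensitive search
    flags = 0 if match_case else re.IGNORECASE
    m = re.search(PATTERN, text, flags)
    return m.group(1) if m else None

print(leading_run("Bayern München", match_case=True))   # Bayern M
print(leading_run("Bayern München", match_case=False))  # Bayern München
```

With case sensitivity on, the lowercase ü is in neither a-z nor the uppercase list, so the match stops at "Bayern M". With case sensitivity off, ü matches through its uppercase partner Ü.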

As for me, I would simply rewrite it as ^([\w ]) or, more strictly, ^([[:alpha:]\x20])

And, in order to match any list of words ( runs of word characters ) separated by exactly one horizontal blank character, use the regex (\w+\h)*\w+, which finds each of the groups below :

Bayern

Bayern München

           Bayern	München

Test_123_END Test

Bayern_München

	München
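
For anyone who wants to try this outside Notepad++, here is a rough Python equivalent. Python's re module has no \h, so the [:blank:] contents from the table ( [\t\x20\xA0] ) are written out, and Python's \w follows Unicode rules rather than the active code page, so this is only an approximation of the Boost behaviour. The names WORD_GROUPS, SAMPLES and first_group are made up.

```python
import re

# \h is not available in Python's re; use tab, space and no-break space.
H = r"[\t \xa0]"
WORD_GROUPS = re.compile(r"(?:\w+" + H + r")*\w+")

SAMPLES = [
    "Bayern",
    "Bayern München",
    "           Bayern\tMünchen",   # leading spaces, tab between the words
    "Test_123_END Test",
    "Bayern_München",
    "\tMünchen",                    # leading tab
]

def first_group(line):
    # Return the first run of words separated by single blanks, if any.
    m = WORD_GROUPS.search(line)
    return m.group(0) if m else None

for line in SAMPLES:
    print(repr(first_group(line)))
```

Leading blanks are skipped because every match has to start with a word character.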

Best Regards,

guy038

Esboutique

@MAPJe71 Hey mapje! Is there any possibility of getting help with an assignment in Notepad++? I am facing a deadline and see no way out. Regards

MAPJe71

@Esboutique
General remarks:

This is a forum that is used worldwide so please keep any communication in English :-)
Any help requested here should (IMO) be related to Notepad++ i.e. its functionality or its source;
Prior to asking for help make sure you have searched the forum and/or the GitHub repository for similar and related issues;
Keep posts in a topic related to that topic i.e. don’t piggyback;

Remarks on questions addressed to me directly:

I’m not going to do any work you are supposed to do free-of-charge;
I’m not going to do your homework (you’re supposed to learn something from doing the work yourself so go do it!);
For most Regular Expression related questions there are more suitable forums and informational websites e.g. Regular-Expressions.info
For Regular Expressions related to FunctionList, just open a topic and ask the question.

Thnx

Esboutique

@MAPJe71 Hi, you are quite right! Sorry for being such an IT-nitwit. But I will remember this for future questions, so thanks.

It wasn’t my intention to get anyone to do my job, but I have spent so many days searching for how to work with Notepad++ that I have run out of time (so I put in my time and effort to get it down, but I cannot get acquainted with npp so quickly).

I am really not good at forums and asking for help. I guess this was just a random panic reaction (also speaking the same language did it for me).
I just found out that I can delete my messages so I will do so in order not to disturb the topic that I am posting this on. But just for getting this out to you, I quickly put it here.

Thanks again and sorry for messing this topic up a little!

Bye