RegExpr search for utf8 chars not working anymore?
-
Hi, as of today, searching for words that include UTF-8 characters no longer works as expected.
The following search string: ^([a-z \u])
applied to, say, ‘Bayern München’, now returns something like:
Bayern M
instead of the full text, as it did until a week ago. The same search string works as expected in C# (with \u replaced by \p{L}, of course!). Did you change something related to regular expressions?
-
Boost::RegEx \u : An “uppercase character” (any uppercase letter in the active code page).
C# \p{L} : A character from the Unicode category “letter” (any kind of letter from any language).
-
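To make the difference between the two definitions above concrete, here is a minimal Python sketch. It is only an approximation: it uses Unicode general categories from the standard `unicodedata` module, whereas Boost's \u depends on the active code page. The helper names `is_letter` and `is_uppercase_letter` are hypothetical, chosen for this illustration.

```python
import unicodedata

def is_letter(ch):
    # Any Unicode letter category (Lu, Ll, Lt, Lm, Lo): roughly what \p{L} covers.
    return unicodedata.category(ch).startswith("L")

def is_uppercase_letter(ch):
    # Only category Lu: a Unicode-based stand-in for an "uppercase character".
    return unicodedata.category(ch) == "Lu"

for ch in "Mü ":
    print(repr(ch), is_letter(ch), is_uppercase_letter(ch))
```

Under these assumptions, "ü" counts as a letter but not as an uppercase letter, which is consistent with a case-sensitive search stopping at it.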
How would you write \p{L} for the Notepad++ regex engine?
-
Unchecking the option to distinguish between uppercase and lowercase letters solved my issue.
@MAPJe71 : Without the \u identifier the text search stops at Unicode chars, so…
-
-
Hello, @pasquale-cerullo, and All:
Below are two tables that recapitulate the different “escape” sequences (shorthand character classes) available with the C++ Boost library v1.55.0 in Notepad++:
A) BOOST syntaxes of a single character, IN and/or OUT of character classes (the Match case option is SET):
•===================•===========================•==========•==========================================================•
|  IN Class [...]   |      OUT Class [...]      | IN / OUT |   Character Class Contents for "Windows-1252" Encoding   |
•===================•=============•=======•=====•==========•==========================================================•
| [:space:] | [:s:] |  \p{space}  | \p{s} | \ps |    \s    | [\t\n\x0B\f\r\x20\x85\xA0]                               |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:digit:] | [:d:] |  \p{digit}  | \p{d} | \pd |    \d    | [0-9¹²³]                                                 |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:lower:] | [:l:] |  \p{lower}  | \p{l} | \pl |    \l    | (?-i)[a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]        |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:upper:] | [:u:] |  \p{upper}  | \p{u} | \pu |    \u    | (?-i)[A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]             |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:word:]  | [:w:] |  \p{word}   | \p{w} | \pw |    \w    | [_\d\l\u]                                                |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:blank:] | [:h:] |  \p{blank}  | \p{h} | \ph |    \h    | [\t\x20\xA0]                                             |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|       [:v:]       |    \p{v}    | \p{v} | \pv |    \v    | [\n\x0B\f\r\x85]                                         |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|     [:alnum:]     |  \p{alnum}  |       |     |          | [\d\l\u]                                                 |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|     [:alpha:]     |  \p{alpha}  |       |     |          | [\l\u]                                                   |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|     [:cntrl:]     |  \p{cntrl}  |       |     |          | [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]                      |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|     [:graph:]     |  \p{graph}  |       |     |          | [^\x00-\x20\x7F-\x81ˆ\x8D\x8F\x90˜™\x9D\xA0]             |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|     [:print:]     |  \p{print}  |       |     |          | [\s[:graph:]]                                            |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|     [:punct:]     |  \p{punct}  |       |     |          | []\x21-\x2F:;<=>?@\\^_`{|}~‚„…†‡‰‹‘’“”•–—›\xA1-\xBF×÷[]  |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|    [:xdigit:]     | \p{xdigit}  |       |     |          | [0-9A-Fa-f]                                              |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|    [:unicode:]    | \p{unicode} |       |     |          | (?-i)[\x{0100}-\x{FFFF}]                                 |
•===================•=============•=======•=====•==========•==========================================================•
B) BOOST syntaxes of a single character, IN and/or OUT of NEGATED character classes (the Match case option is SET):
•===================•===========================•==========•==========================================================•
|  IN Class [^..]   |      OUT Class [^..]      | IN / OUT |   Character Class Contents for "Windows-1252" Encoding   |
•===================•=============•=======•=====•==========•==========================================================•
| [:space:] | [:s:] |  \P{space}  | \P{s} | \Ps |    \S    | [^\t\n\x0B\f\r\x20\xa0\x{2028}\x{2029}]                  |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:digit:] | [:d:] |  \P{digit}  | \P{d} | \Pd |    \D    | [^0-9¹²³]                                                |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:lower:] | [:l:] |  \P{lower}  | \P{l} | \Pl |    \L    | (?-i)[^a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]       |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:upper:] | [:u:] |  \P{upper}  | \P{u} | \Pu |    \U    | (?-i)[^A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]            |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:word:]  | [:w:] |  \P{word}   | \P{w} | \Pw |    \W    | [^_\d\l\u]                                               |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:blank:] | [:h:] |  \P{blank}  | \P{h} | \Ph |    \H    | [^\t\x20\xA0]                                            |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|       [:v:]       |    \P{v}    | \P{v} | \Pv |    \V    | [^\n\x0B\f\r\x85\x{2028}\x{2029}]                        |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|     [:alnum:]     |  \P{alnum}  |       |     |          | [^\d\l\u] = [\D\L\U]                                     |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|     [:alpha:]     |  \P{alpha}  |       |     |          | [^\l\u] = [\L\U]                                         |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|     [:cntrl:]     |  \P{cntrl}  |       |     |          | [^\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]                     |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|     [:graph:]     |  \P{graph}  |       |     |          | [\x00-\x20\x7F-\x81ˆ\x8D\x8F\x90˜™\x9D\xA0]              |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|     [:print:]     |  \P{print}  |       |     |          | [\x00-\x08\x0E-\x1f\x7F-\x81ˆ\x8D\x8F\x90˜™\x9D]         |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|     [:punct:]     |  \P{punct}  |       |     |          | [^]\x21-\x2F:;<=>?@\\^_`{|}~‚„…†‡‰‹‘’“”•–—›\xA1-\xBF×÷[] |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|    [:xdigit:]     | \P{xdigit}  |       |     |          | [^0-9A-Fa-f]                                             |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|    [:unicode:]    | \P{unicode} |       |     |          | (?-i)[^\x{0100}-\x{FFFF}]                                |
•===================•=============•=======•=====•==========•==========================================================•
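As a cross-check of the [:upper:] row of table A, the uppercase letters of Windows-1252 can be listed with a short Python sketch. It assumes that Python's `cp1252` codec and the Unicode category Lu are a fair stand-in for what Boost treats as uppercase in that code page. The function name `cp1252_uppercase` is hypothetical.

```python
import unicodedata

def cp1252_uppercase():
    chars = []
    for byte in range(256):
        # A few bytes are undefined in Python's cp1252 codec;
        # errors="ignore" turns them into an empty string, which is skipped.
        ch = bytes([byte]).decode("cp1252", errors="ignore")
        if ch and unicodedata.category(ch) == "Lu":
            chars.append(ch)
    return "".join(chars)

print(cp1252_uppercase())
```

The characters come out in byte order, so the result can be compared directly with the contents column of the [:upper:] row.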
So, your regex ^([a-z \u]) matches any single character, at the beginning of a line, which is either:
- A lowercase letter between a and z, or a space char, or an uppercase letter (accented or not), if the Match case option is set
- A letter between a and z or between A and Z, or a space char, or any letter (accented or not), if the Match case option is unset
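The effect of the Match case option described above can be imitated in Python. This is a rough sketch, not the Notepad++ engine: Python's `re` has no \u class, so the uppercase ranges A-Z, À-Ö and Ø-Þ are spelled out by hand (Š, Œ, Ž and Ÿ are left out for brevity), `re.IGNORECASE` stands in for an unset Match case option, and the trailing + is an assumption added so that a whole run of characters is returned.

```python
import re

# Hand-written approximation of [a-z \u] for Windows-1252 text.
PATTERN = r"^[a-z A-ZÀ-ÖØ-Þ]+"

text = "Bayern München"

# "Match case" set: ü is lowercase and outside a-z, so the match stops before it.
case_sensitive = re.match(PATTERN, text).group(0)

# "Match case" unset: ü is accepted as the lowercase counterpart of Ü.
case_insensitive = re.match(PATTERN, text, re.IGNORECASE).group(0)

print(repr(case_sensitive))
print(repr(case_insensitive))
```

Under these assumptions, the case-sensitive search gives the truncated result reported at the top of this topic, and the case-insensitive one gives the full text.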
As for me, I would simply rewrite it as ^([\w ]) or, more strictly, ^([[:alpha:]\x20])
And, in order to match any list of word characters separated by only one horizontal blank character, use the regex (\w+\h)*\w+ , which finds any group below:
Bayern Bayern München Bayern München Test_123_END Test Bayern_München München
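For anyone who wants to try the same idea outside Notepad++, here is a minimal Python sketch of that last regex. Python's `re` has no \h, so the class [ \t\xa0] is used as a stand-in for the horizontal blanks listed in table A. The name `word_runs` is hypothetical.

```python
import re

# (?:...) replaces the capturing group so that findall returns whole matches.
# For str patterns, \w is Unicode-aware in Python and therefore accepts "ü".
WORD_RUN = re.compile(r"(?:\w+[ \t\xa0])*\w+")

def word_runs(line):
    # Return every run of words separated by exactly one horizontal blank.
    return WORD_RUN.findall(line)

print(word_runs("Bayern München Test_123_END"))
print(word_runs("Bayern  München"))
```

A single blank keeps the words in one match, while two consecutive blanks split the line into separate matches.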
Best Regards,
guy038
-
-
@MAPJe71 Hey mapje! Is there any way to get help with an assignment in Notepad++? I'm up against a deadline and see no way out. Regards
-
@Esboutique
General remarks:
- This is a forum that is used worldwide, so please keep all communication in English :-)
- Any help requested here should (IMO) be related to Notepad++, i.e. its functionality or its source;
- Before asking for help, make sure you have searched the forum and/or the GitHub repository for similar and related issues;
- Keep posts in a topic related to that topic, i.e. don’t piggyback.
Remarks on questions addressed to me directly:
- I’m not going to do, free of charge, any work you are supposed to do yourself;
- I’m not going to do your homework (you’re supposed to learn something from doing the work yourself so go do it!);
- For most regular expression questions there are more suitable forums and informational websites, e.g. Regular-Expressions.info;
- For regular expressions related to FunctionList, just open a topic and ask the question.
Thnx
-
@MAPJe71 Hi, you are quite right! Sorry for being such an IT-nitwit. I will remember this for future questions, so thanks.
It wasn’t my intention to get anyone to do my job, but I have spent so many days figuring out how to work with Notepad++ that I have run out of time (I did put in the time and effort, but I cannot get acquainted with npp that quickly).
I am really not good at forums and asking for help. I guess this was just a random panic reaction (speaking the same language also did it for me).
I just found out that I can delete my messages, so I will do so in order not to disturb the topic I am posting in. I just wanted to get this out to you quickly. Thanks again, and sorry for messing this topic up a little!
Bye