RegExpr search for utf8 chars not working anymore?
-
Hi, as of today the search for words including utf8 chars isn’t working as expected.
The following search string: ^([a-z \u])
applied to, say ‘Bayern München’, now returns something like:
Bayern M
instead of the full text, as it did until a week ago. The same search string works as expected in C# (with \u = \p{L}, of course!). Did you change something related to regular expressions?
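The symptom can be reproduced outside Notepad++. The sketch below uses Python's `re` module, which is not the Boost engine that Notepad++ uses, so it only illustrates the general effect: a class limited to ASCII letters and spaces stops at the first accented character. The explicit `A-Z` range, the trailing `+` quantifier and the name `ascii_only` are assumptions added for the illustration.

```python
import re

text = "Bayern München"

# ASCII letters and spaces only; "ü" lies outside both ranges.
ascii_only = re.compile(r"^([a-zA-Z ]+)")

match = ascii_only.match(text)
print(match.group(1))  # Bayern M
```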
-
Boost::RegEx
\u: An “uppercase character” (any uppercase letter in the active code page).
C#
\p{L}: A character from the Unicode category “letter” (any kind of letter from any language).
-
How would you write \p{L} for notepad++ regex engine?
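For comparison only: Python's standard `re` module has no \p{L} either, and a common approximation there is "a word character that is neither a digit nor an underscore". The sketch below shows that idiom; it is not Notepad++ syntax, it is not an exact equivalent of \p{L}, and the names `letter` and `leading_text` are hypothetical.

```python
import re

# Approximation of "any letter": a word character that is neither a
# decimal digit nor an underscore. Not an exact equivalent of \p{L}.
letter = r"[^\W\d_]"
letters_and_spaces = re.compile(rf"^((?:{letter}| )+)")


def leading_text(s: str) -> str:
    """Return the leading run of letters and spaces, or an empty string."""
    m = letters_and_spaces.match(s)
    return m.group(1) if m else ""


print(leading_text("Bayern München"))  # Bayern München
```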
-
Unchecking the option to distinguish between uppercase and lowercase letters solved my issue.
@MAPJe71 : Without the \u identifier the text search stops at unicode chars, so…
-
-
Hello, @pasquale-cerullo, and All :
Below are two tables that summarize the different “escape” characters (shorthand character classes) available with the C++ Boost library v1.55.0 in Notepad++:
A) BOOST syntaxes, of a single character, IN and/or OUT of Character Classes (the Match case option is SET):
•===================•===========================•==========•==========================================================•
| IN Class          | OUT Class [...]           | IN / OUT | Character Class Contents for "Windows-1252" Encoding     |
•===================•=============•=======•=====•==========•==========================================================•
| [:space:] | [:s:] | \p{space}   | \p{s} | \ps |    \s    | [\t\n\x0B\f\r\x20\x85\xA0]                               |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:digit:] | [:d:] | \p{digit}   | \p{d} | \pd |    \d    | [0-9¹²³]                                                 |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:lower:] | [:l:] | \p{lower}   | \p{l} | \pl |    \l    | (?-i)[a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]        |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:upper:] | [:u:] | \p{upper}   | \p{u} | \pu |    \u    | (?-i)[A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]             |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:word:]  | [:w:] | \p{word}    | \p{w} | \pw |    \w    | [_\d\l\u]                                                |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:blank:] | [:h:] | \p{blank}   | \p{h} | \ph |    \h    | [\t\x20\xA0]                                             |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|             [:v:] | \p{v}       | \p{v} | \pv |    \v    | [\n\x0B\f\r\x85]                                         |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:alnum:]         | \p{alnum}   |       |     |          | [\d\l\u]                                                 |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:alpha:]         | \p{alpha}   |       |     |          | [\l\u]                                                   |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:cntrl:]         | \p{cntrl}   |       |     |          | [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]                      |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:graph:]         | \p{graph}   |       |     |          | [^\x00-\x20\x7F-\x81ˆ\x8D\x8F\x90˜™\x9D\xA0]             |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:print:]         | \p{print}   |       |     |          | [\s[:graph:]]                                            |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:punct:]         | \p{punct}   |       |     |          | []\x21-\x2F:;<=>?@\\^_`{|}~‚„…†‡‰‹‘’“”•–—›\xA1-\xBF×÷[]  |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:xdigit:]        | \p{xdigit}  |       |     |          | [0-9A-Fa-f]                                              |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:unicode:]       | \p{unicode} |       |     |          | (?-i)[\x{0100}-\x{FFFF}]                                 |
•===================•=============•=======•=====•==========•==========================================================•
B) BOOST syntaxes, of a single character, IN and/or OUT of NEGATED Character Classes (the Match case option is SET):
•===================•===========================•==========•==========================================================•
| IN Class [^..]    | OUT Class [^..]           | IN / OUT | Character Class Contents for "Windows-1252" Encoding     |
•===================•=============•=======•=====•==========•==========================================================•
| [:space:] | [:s:] | \P{space}   | \P{s} | \Ps |    \S    | [^\t\n\x0B\f\r\x20\xa0\x{2028}\x{2029}]                  |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:digit:] | [:d:] | \P{digit}   | \P{d} | \Pd |    \D    | [^0-9¹²³]                                                |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:lower:] | [:l:] | \P{lower}   | \P{l} | \Pl |    \L    | (?-i)[^a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]       |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:upper:] | [:u:] | \P{upper}   | \P{u} | \Pu |    \U    | (?-i)[^A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]            |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:word:]  | [:w:] | \P{word}    | \P{w} | \Pw |    \W    | [^_\d\l\u]                                               |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:blank:] | [:h:] | \P{blank}   | \P{h} | \Ph |    \H    | [^\t\x20\xA0]                                            |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
|             [:v:] | \P{v}       | \P{v} | \Pv |    \V    | [^\n\x0B\f\r\x85\x{2028}\x{2029}]                        |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:alnum:]         | \P{alnum}   |       |     |          | [^\d\l\u] = [\D\L\U]                                     |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:alpha:]         | \P{alpha}   |       |     |          | [^\l\u] = [\L\U]                                         |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:cntrl:]         | \P{cntrl}   |       |     |          | [^\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]                     |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:graph:]         | \P{graph}   |       |     |          | [\x00-\x20\x7F-\x81ˆ\x8D\x8F\x90˜™\x9D\xA0]              |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:print:]         | \P{print}   |       |     |          | [\x00-\x08\x0E-\x1f\x7F-\x81ˆ\x8D\x8F\x90˜™\x9D]         |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:punct:]         | \P{punct}   |       |     |          | [^]\x21-\x2F:;<=>?@\\^_`{|}~‚„…†‡‰‹‘’“”•–—›\xA1-\xBF×÷[] |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:xdigit:]        | \P{xdigit}  |       |     |          | [^0-9A-Fa-f]                                             |
•-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
| [:unicode:]       | \P{unicode} |       |     |          | (?-i)[^\x{0100}-\x{FFFF}]                                |
•===================•=============•=======•=====•==========•==========================================================•
So, your regex ^([a-z \u]) matches any single character, at the beginning of a line, which is either:
- A lowercase letter between a and z, a space char, or an uppercase letter (accented or not), if the Match case option is set
- A letter between a and z or between A and Z, a space char, or any letter (accented or not), if the Match case option is unset
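The two cases can be modelled character by character. This is only a rough sketch in Python, not the Boost engine: `str.isupper` and `str.islower` stand in for the code-page-based uppercase class, the functions `in_class` and `matched_prefix` are hypothetical names, and repeating the class over a whole prefix is an assumption (the regex as quoted matches a single character).

```python
def in_class(ch: str, match_case: bool) -> bool:
    # Rough per-character model of the class described above;
    # Python's str methods stand in for the Boost uppercase class.
    if ch == " ":
        return True
    if match_case:
        # Lowercase a-z, or any uppercase letter (accented or not).
        return "a" <= ch <= "z" or ch.isupper()
    # Case-insensitive: any cased letter qualifies.
    return ch.isupper() or ch.islower()


def matched_prefix(text: str, match_case: bool) -> str:
    # Longest run of qualifying characters from the start of the text.
    end = 0
    while end < len(text) and in_class(text[end], match_case):
        end += 1
    return text[:end]


print(matched_prefix("Bayern München", match_case=True))   # Bayern M
print(matched_prefix("Bayern München", match_case=False))  # Bayern München
```

Under this model the lowercase "ü" is rejected only when case is significant, which is consistent with the report that unchecking the Match case option made the full text match again.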
As for me, I would simply rewrite it as ^([\w ]) or, more strictly, ^([[:alpha:]\x20])
And, in order to match any list of word characters separated by only one horizontal blank character, use the regex (\w+\h)*\w+, which finds any of the groups below:
Bayern
Bayern München
Bayern München Test_123_END
Test
Bayern_München
München
Best Regards,
guy038
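The last regex can be tried outside Notepad++ with a small sketch. Python's `re` module has no \h, so the class [\t\x20\xA0] from the table above is substituted for it; Python's \w is also not guaranteed to cover exactly the same characters as the Boost \w. The names `H` and `word_groups` are hypothetical.

```python
import re

# [\t\x20\xA0] stands in for the Boost \h shorthand (horizontal blank).
H = r"[\t\x20\xA0]"
word_groups = re.compile(rf"(?:\w+{H})*\w+")

samples = [
    "Bayern",
    "Bayern München",
    "Bayern München Test_123_END",
    "Bayern_München",
]

for line in samples:
    # Each sample is matched as one whole group.
    print(repr(line), bool(word_groups.fullmatch(line)))

# Two consecutive blanks break the group after the first word.
print(word_groups.search("Bayern  München").group(0))  # Bayern
```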
-
-
@MAPJe71 Hey mapje! Is there any way to get help with an assignment in Notepad++? I'm up against a deadline and see no way out. Regards
-
@Esboutique
General remarks:
- This is a forum that is used worldwide, so please keep any communication in English :-)
- Any help requested here should (IMO) be related to Notepad++ i.e. its functionality or its source;
- Prior to asking for help make sure you have searched the forum and/or the GitHub repository for similar and related issues;
- Keep posts in a topic related to that topic i.e. don’t piggyback;
Remarks on questions addressed to me directly:
- I’m not going to do any work you are supposed to do free-of-charge;
- I’m not going to do your homework (you’re supposed to learn something from doing the work yourself so go do it!);
- For most Regular Expression related questions there are more suitable forums and informational websites e.g. Regular-Expressions.info
- Regular Expressions related to FunctionList, just open a topic and ask the question.
Thnx
-
@MAPJe71 Hi, you are quite right! Sorry for being such an IT-nitwit. But I will remember this for future questions, so thanks.
It wasn’t my intention to get anyone to do my job, but I have spent so many days searching for how to work with Notepad++ that I have run out of time (so I did put in my time and effort to get it done, but I cannot get acquainted with npp so quickly).
I am really not good at forums and asking for help. I guess this was just a random panic reaction (also speaking the same language did it for me).
I just found out that I can delete my messages, so I will do so in order not to disturb the topic that I am posting this on. But just to get this out to you, I quickly put it here.
Thanks again and sorry for messing this topic up a little!
Bye