RegExpr search for utf8 chars not working anymore?



  • Hi, as of today the search for words including utf8 chars isn’t working as expected.
    The following search string:

    ^([a-z \u])

    applied to, say ‘Bayern München’, now returns something like:

    Bayern M

    instead of the full text as it did until a week ago. The same search string works as expected in C# (with \u = \p{L}, of course!). Did you change something related to regular expressions?



  • Boost::RegEx \u: An “uppercase character” (any uppercase letter in the active code page).
    C# \p{L}: A character from the Unicode category “letter” (any kind of letter from any language).
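
    To make the distinction above concrete, here is a minimal Python sketch. It is only an illustration, not how Boost or .NET implement these classes: the helper name is_letter is hypothetical, and Python's unicodedata module stands in for the Unicode "letter" category that \p{L} refers to in C#.

```python
import unicodedata

def is_letter(ch: str) -> bool:
    # Rough analogue of C#'s \p{L}: any Unicode general category starting with "L".
    return unicodedata.category(ch).startswith("L")

# 'M' is an uppercase letter, 'ü' is a letter but not uppercase, ' ' is neither.
for ch in "Mü ":
    print(repr(ch), unicodedata.category(ch), is_letter(ch), ch.isupper())
```

    An "uppercase only" class therefore rejects 'ü' in München, while a "any letter" class accepts it.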



  • How would you write \p{L} for notepad++ regex engine?



  • Unchecking the option to distinguish between uppercase and lowercase letters solved my issue.

    @MAPJe71 : Without the \u identifier the text search stops at unicode chars, so…



  • Have a look here and/or here.



  • Hello, @pasquale-cerullo, and All :

    Below are two tables that summarize the different “escape” sequences (shorthand character classes) available with the C++ Boost library v1.55.0 in Notepad++:


    A) BOOST syntaxes for a single character, IN and/or OUT of character classes (the Match case option is SET):

    •===================•===========================•==========•==========================================================•
    |   IN Class [...]  |      OUT Class [...]      | IN / OUT |   Character Class Contents for "Windows-1252" Encoding   |
    •===================•=============•=======•=====•==========•==========================================================•
    | [:space:] | [:s:] | \p{space}   | \p{s} | \ps |    \s    | [\t\n\x0B\f\r\x20\x85\xA0]                               |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    | [:digit:] | [:d:] | \p{digit}   | \p{d} | \pd |    \d    | [0-9¹²³]                                                 |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    | [:lower:] | [:l:] | \p{lower}   | \p{l} | \pl |    \l    | (?-i)[a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]        |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    | [:upper:] | [:u:] | \p{upper}   | \p{u} | \pu |    \u    | (?-i)[A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]             |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    | [:word:]  | [:w:] | \p{word}    | \p{w} | \pw |    \w    | [_\d\l\u]                                                |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    | [:blank:] | [:h:] | \p{blank}   | \p{h} | \ph |    \h    | [\t\x20\xA0]                                             |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    |      [:v:]        | \p{v}       | \p{v} | \pv |    \v    | [\n\x0B\f\r\x85]                                         |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    |    [:alnum:]      | \p{alnum}   |       |     |          | [\d\l\u]                                                 |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    |    [:alpha:]      | \p{alpha}   |       |     |          | [\l\u]                                                   |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    |    [:cntrl:]      | \p{cntrl}   |       |     |          | [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]                      |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    |    [:graph:]      | \p{graph}   |       |     |          | [^\x00-\x20\x7F-\x81ˆ\x8D\x8F\x90˜™\x9D\xA0]             |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    |    [:print:]      | \p{print}   |       |     |          | [\s[:graph:]]                                            |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    |    [:punct:]      | \p{punct}   |       |     |          | []\x21-\x2F:;<=>?@\\^_`{|}~‚„…†‡‰‹‘’“”•–—›\xA1-\xBF×÷[]  |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    |    [:xdigit:]     | \p{xdigit}  |       |     |          | [0-9A-Fa-f]                                              |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    |    [:unicode:]    | \p{unicode} |       |     |          | (?-i)[\x{0100}-\x{FFFF}]                                 |
    •===================•=============•=======•=====•==========•==========================================================•
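
    As a rough cross-check of the [:upper:] row, the sketch below walks all 256 byte values, decodes them as Windows-1252 and keeps those that Python considers uppercase. This is an assumption-laden comparison: Python applies Unicode rules, not Boost's locale tables, so agreement on this row does not guarantee agreement on the others. The helper name cp1252_chars is hypothetical.

```python
def cp1252_chars():
    # Decode every byte value as Windows-1252; the five undefined bytes
    # (0x81, 0x8D, 0x8F, 0x90, 0x9D) decode to an empty string and are skipped.
    for b in range(256):
        yield bytes([b]).decode("cp1252", errors="ignore")

# Uppercase letters of the code page, in byte order.
upper = "".join(ch for ch in cp1252_chars() if ch.isupper())
print(upper)
```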
    

    B) BOOST syntaxes for a single character, IN and/or OUT of NEGATED character classes (the Match case option is SET):

    •===================•===========================•==========•==========================================================•
    |   IN Class [^..]  |      OUT Class [^..]      | IN / OUT |   Character Class Contents for "Windows-1252" Encoding   |
    •===================•=============•=======•=====•==========•==========================================================•
    | [:space:] | [:s:] | \P{space}   | \P{s} | \Ps |    \S    | [^\t\n\x0B\f\r\x20\xa0\x{2028}\x{2029}]                  |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    | [:digit:] | [:d:] | \P{digit}   | \P{d} | \Pd |    \D    | [^0-9¹²³]                                                |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    | [:lower:] | [:l:] | \P{lower}   | \P{l} | \Pl |    \L    | (?-i)[^a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]       |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    | [:upper:] | [:u:] | \P{upper}   | \P{u} | \Pu |    \U    | (?-i)[^A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]            |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    | [:word:]  | [:w:] | \P{word}    | \P{w} | \Pw |    \W    | [^_\d\l\u]                                               |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    | [:blank:] | [:h:] | \P{blank}   | \P{h} | \Ph |    \H    | [^\t\x20\xA0]                                            |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    |      [:v:]        | \P{v}       | \P{v} | \Pv |    \V    | [^\n\x0B\f\r\x85\x{2028}\x{2029}]                        |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    |    [:alnum:]      | \P{alnum}   |       |     |          | [^\d\l\u] = [\D\L\U]                                     |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    |    [:alpha:]      | \P{alpha}   |       |     |          | [^\l\u]   = [\L\U]                                       |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    |    [:cntrl:]      | \P{cntrl}   |       |     |          | [^\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]                     |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    |    [:graph:]      | \P{graph}   |       |     |          | [\x00-\x20\x7F-\x81ˆ\x8D\x8F\x90˜™\x9D\xA0]              |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    |    [:print:]      | \P{print}   |       |     |          | [\x00-\x08\x0E-\x1f\x7F-\x81ˆ\x8D\x8F\x90˜™\x9D]         |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    |    [:punct:]      | \P{punct}   |       |     |          | [^]\x21-\x2F:;<=>?@\\^_`{|}~‚„…†‡‰‹‘’“”•–—›\xA1-\xBF×÷[] |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    |    [:xdigit:]     | \P{xdigit}  |       |     |          | [^0-9A-Fa-f]                                             |
    •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
    |    [:unicode:]    | \P{unicode} |       |     |          | (?-i)[^\x{0100}-\x{FFFF}]                                |
    •===================•=============•=======•=====•==========•==========================================================•
    

    So your regex ^([a-z \u]) matches a single character at the beginning of a line, which is either:

    • A lowercase letter between a and z, a space, or an uppercase letter (accented or not), if the Match case option is set

    • A letter between a and z or between A and Z, a space, or any letter (accented or not), if the Match case option is unset
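
    The effect of the Match case option can be reproduced with a small Python sketch. Several things here are assumptions: Python's re module has no \u class, so UPPER is a hand-written stand-in built from the [:upper:] row of the table; a + quantifier is added so the match can run past one character, as the reported result "Bayern M" suggests; and leading_match is a hypothetical helper name.

```python
import re

# Stand-in for Boost's \u under Windows-1252 (uppercase letters from the table).
UPPER = "A-ZŠŒŽŸÀ-ÖØ-Þ"
# The "+" is an assumption; the pattern quoted in the thread has no quantifier.
PATTERN = rf"^([a-z {UPPER}]+)"

def leading_match(text: str, match_case: bool) -> str:
    # match_case=False mimics unchecking "Match case" in the Find dialog.
    flags = 0 if match_case else re.IGNORECASE
    m = re.match(PATTERN, text, flags)
    return m.group(1) if m else ""

print(leading_match("Bayern München", match_case=True))
print(leading_match("Bayern München", match_case=False))
```

    With case sensitivity on, the lowercase 'ü' belongs to neither a-z nor the uppercase set, so the match stops after "Bayern M"; with it off, 'ü' is accepted as the counterpart of 'Ü'.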


    Personally, I would simply rewrite it as ^([\w ]) or, more strictly, ^([[:alpha:]\x20]), since \w also accepts digits and the underscore.

    And to match any sequence of words, each separated by exactly one horizontal blank character, use the regex (\w+\h)*\w+, which finds each of the groups below:

    Bayern
    
    Bayern München
    
               Bayern	München
    
    Test_123_END Test
    
    Bayern_München
    
    	München
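
    For readers who want to try this outside Notepad++, here is a hedged Python approximation. Python's re has no \h, so the class H (tab, space, no-break space, following the [:blank:] row of the table) is substituted for it. Python's \w is also Unicode-aware rather than code-page based, so edge cases may differ from Boost. The helper name first_group is hypothetical, and a non-capturing group is used in place of the original capturing one.

```python
import re

# Approximation of Boost's \h: tab, space and no-break space.
H = r"[\t\x20\xA0]"
PATTERN = re.compile(rf"(?:\w+{H})*\w+")

def first_group(line: str) -> str:
    # Return the first run of words separated by single horizontal blanks.
    m = PATTERN.search(line)
    return m.group(0) if m else ""

samples = [
    "Bayern",
    "Bayern München",
    "           Bayern\tMünchen",
    "Test_123_END Test",
    "Bayern_München",
    "\tMünchen",
]
for line in samples:
    print(repr(first_group(line)))
```

    Leading blanks are skipped because the match must start on a word character.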
    

    Best Regards,

    guy038



  • @MAPJe71 Hey mapje! Is there any possibility of getting help with an assignment in Notepad++? I am facing a deadline and see no way out. Regards



  • @Esboutique
    General remarks:

    1. This is a forum that is used worldwide so please keep any communication in English :-)
    2. Any help requested here should (IMO) be related to Notepad++ i.e. its functionality or its source;
    3. Prior to asking for help make sure you have searched the forum and/or the GitHub repository for similar and related issues;
    4. Keep posts in a topic related to that topic i.e. don’t piggyback;

    Remarks on questions addressed to me directly:

    1. I’m not going to do any work you are supposed to do free-of-charge;
    2. I’m not going to do your homework (you’re supposed to learn something from doing the work yourself so go do it!);
    3. For most Regular Expression related questions there are more suitable forums and informational websites e.g. Regular-Expressions.info;
    4. For Regular Expressions related to FunctionList, just open a topic and ask the question.

    Thnx



  • @MAPJe71 Hi, you are quite right! Sorry for being such an IT-nitwit. But I will remember this for future questions, so thanks.

    It wasn’t my intention to get anyone to do my job, but I have spent so many days searching for how to work with Notepad++ that I have run out of time (I did put in my time and effort to get it down, but I cannot get acquainted with npp so quickly).

    I am really not good at forums and asking for help. I guess this was just a random panic reaction (also speaking the same language did it for me).
    I just found out that I can delete my messages so I will do so in order not to disturb the topic that I am posting this on. But just for getting this out to you, I quickly put it here.

    Thanks again and sorry for messing this topic up a little!

    Bye

