    RegExpr search for utf8 chars not working anymore?

    • Pasquale Cerullo

      Hi, as of today the search for words including utf8 chars isn’t working as expected.
      The following search string:

      ^([a-z \u])

      applied to, say ‘Bayern München’, now returns something like:

      Bayern M

      instead of the full text, as it did until a week ago. The same search string works as expected in C# (with \u = \p{L}, of course!). Did you change something related to regular expressions?

      • MAPJe71

        Boost::RegEx \u: An “uppercase character” (any uppercase letter in the active code page).
        C# \p{L}: A character from the Unicode category “letter” (any kind of letter from any language).
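
To make the distinction concrete, here is a minimal Python sketch. It is neither Boost nor C#: it approximates Boost's \u with the Unicode category "Lu" and C#'s \p{L} with any category starting with "L". Both mappings are assumptions made for illustration, since Boost's \u really depends on the active code page.

```python
import unicodedata

def is_uppercase_letter(ch: str) -> bool:
    # Assumed stand-in for Boost's \u: Unicode category "Lu" (uppercase letter).
    return unicodedata.category(ch) == "Lu"

def is_any_letter(ch: str) -> bool:
    # Assumed stand-in for C#'s \p{L}: any Unicode letter category (Lu, Ll, Lt, Lm, Lo).
    return unicodedata.category(ch).startswith("L")

word = "M\u00fcnchen"  # "München"

uppercase_only = [ch for ch in word if is_uppercase_letter(ch)]
all_letters = all(is_any_letter(ch) for ch in word)

print(uppercase_only)  # only the leading "M" is an uppercase letter
print(all_letters)     # every character, including "ü", is a letter
```

So a class built from a-z plus "uppercase" has no place for a lowercase "ü", whereas a "letter" class accepts it.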

        • Pasquale Cerullo

          How would you write \p{L} for the Notepad++ regex engine?

          • Pasquale Cerullo

            Unchecking the option to distinguish between uppercase and lowercase letters solved my issue.

            @MAPJe71 : Without the \u identifier the text search stops at unicode chars, so…

            • MAPJe71

              Have a look here and/or here.

              • guy038

                Hello, @pasquale-cerullo, and All :

                Below are two tables that summarize the different “Escape” characters ( shorthand character classes ) available with the C++ Boost library v1.55.0 in Notepad++ :


                A) BOOST syntaxes, of a single character, IN and/or OUT of Character Classes ( The Match case is SET ) :

                •===================•===========================•==========•==========================================================•
                |   IN Class [...]  |      OUT Class [...]      | IN / OUT |   Character Class Contents for "Windows-1252" Encoding   |
                •===================•=============•=======•=====•==========•==========================================================•
                | [:space:] | [:s:] | \p{space}   | \p{s} | \ps |    \s    | [\t\n\x0B\f\r\x20\x85\xA0]                               |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                | [:digit:] | [:d:] | \p{digit}   | \p{d} | \pd |    \d    | [0-9¹²³]                                                 |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                | [:lower:] | [:l:] | \p{lower}   | \p{l} | \pl |    \l    | (?-i)[a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]        |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                | [:upper:] | [:u:] | \p{upper}   | \p{u} | \pu |    \u    | (?-i)[A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]             |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                | [:word:]  | [:w:] | \p{word}    | \p{w} | \pw |    \w    | [_\d\l\u]                                                |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                | [:blank:] | [:h:] | \p{blank}   | \p{h} | \ph |    \h    | [\t\x20\xA0]                                             |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |      [:v:]        | \p{v}       | \p{v} | \pv |    \v    | [\n\x0B\f\r\x85]                                         |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:alnum:]      | \p{alnum}   |       |     |          | [\d\l\u]                                                 |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:alpha:]      | \p{alpha}   |       |     |          | [\l\u]                                                   |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:cntrl:]      | \p{cntrl}   |       |     |          | [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]                      |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:graph:]      | \p{graph}   |       |     |          | [^\x00-\x20\x7F-\x81ˆ\x8D\x8F\x90˜™\x9D\xA0]             |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:print:]      | \p{print}   |       |     |          | [\s[:graph:]]                                            |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:punct:]      | \p{punct}   |       |     |          | []\x21-\x2F:;<=>?@\\^_`{|}~‚„…†‡‰‹‘’“”•–—›\xA1-\xBF×÷[]  |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:xdigit:]     | \p{xdigit}  |       |     |          | [0-9A-Fa-f]                                              |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:unicode:]    | \p{unicode} |       |     |          | (?-i)[\x{0100}-\x{FFFF}]                                 |
                •===================•=============•=======•=====•==========•==========================================================•
                

                B) BOOST syntaxes, of a single character, IN and/or OUT of NEGATED Character Classes ( The Match case is SET ) :

                •===================•===========================•==========•==========================================================•
                |   IN Class [^..]  |      OUT Class [^..]      | IN / OUT |   Character Class Contents for "Windows-1252" Encoding   |
                •===================•=============•=======•=====•==========•==========================================================•
                | [:space:] | [:s:] | \P{space}   | \P{s} | \Ps |    \S    | [^\t\n\x0B\f\r\x20\xa0\x{2028}\x{2029}]                  |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                | [:digit:] | [:d:] | \P{digit}   | \P{d} | \Pd |    \D    | [^0-9¹²³]                                                |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                | [:lower:] | [:l:] | \P{lower}   | \P{l} | \Pl |    \L    | (?-i)[^a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]       |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                | [:upper:] | [:u:] | \P{upper}   | \P{u} | \Pu |    \U    | (?-i)[^A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]            |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                | [:word:]  | [:w:] | \P{word}    | \P{w} | \Pw |    \W    | [^_\d\l\u]                                               |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                | [:blank:] | [:h:] | \P{blank}   | \P{h} | \Ph |    \H    | [^\t\x20\xA0]                                            |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |      [:v:]        | \P{v}       | \P{v} | \Pv |    \V    | [^\n\x0B\f\r\x85\x{2028}\x{2029}]                        |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:alnum:]      | \P{alnum}   |       |     |          | [^\d\l\u] = [\D\L\U]                                     |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:alpha:]      | \P{alpha}   |       |     |          | [^\l\u]   = [\L\U]                                       |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:cntrl:]      | \P{cntrl}   |       |     |          | [^\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]                     |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:graph:]      | \P{graph}   |       |     |          | [\x00-\x20\x7F-\x81ˆ\x8D\x8F\x90˜™\x9D\xA0]              |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:print:]      | \P{print}   |       |     |          | [\x00-\x08\x0E-\x1f\x7F-\x81ˆ\x8D\x8F\x90˜™\x9D]         |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:punct:]      | \P{punct}   |       |     |          | [^]\x21-\x2F:;<=>?@\\^_`{|}~‚„…†‡‰‹‘’“”•–—›\xA1-\xBF×÷[] |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:xdigit:]     | \P{xdigit}  |       |     |          | [^0-9A-Fa-f]                                             |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:unicode:]    | \P{unicode} |       |     |          | (?-i)[^\x{0100}-\x{FFFF}]                                |
                •===================•=============•=======•=====•==========•==========================================================•
                

                So, your regex ^([a-z \u]) matches any single character, at the beginning of a line, which is either :

                • A lowercase letter between a and z, a space char, or an uppercase letter ( accented or not ), if the Match case option is set

                • A letter between a and z or between A and Z, a space char, or any letter, uppercase or lowercase ( accented or not ), if the Match case option is unset
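
The two cases above can be imitated with a rough Python sketch. This is not the Boost engine: the UPPER range is my own stand-in for \u, limited to the Latin-1 capitals, and the trailing + is an assumption added so that a run of characters is matched rather than a single one.

```python
import re

# Assumed stand-in for Boost's \u: ASCII capitals plus the accented Latin-1 capitals
# (U+00C0..U+00D6 and U+00D8..U+00DE).
UPPER = "A-Z\u00c0-\u00d6\u00d8-\u00de"

# The "+" quantifier is an assumption; the regex discussed above matches one character.
PATTERN = "^[a-z " + UPPER + "]+"

text = "Bayern M\u00fcnchen"  # "Bayern München"

# "Match case" set: the lowercase "ü" is neither in a-z nor uppercase, so the match stops.
case_sensitive = re.match(PATTERN, text).group(0)

# "Match case" unset: the uppercase class also accepts lowercase letters.
case_insensitive = re.match(PATTERN, text, re.IGNORECASE).group(0)

print(case_sensitive)
print(case_insensitive)
```

With case sensitivity on, the match ends at "Bayern M", which is the truncated result reported at the top of the thread; with it off, the whole text is matched.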


                As for me, I would simply rewrite it as ^([\w ]) or, more strictly, ^([[:alpha:]\x20])

                And, to match a sequence of words in which consecutive words are separated by exactly one horizontal blank character, use the regex (\w+\h)*\w+, which finds each of the groups below :

                Bayern
                
                Bayern München
                
                           Bayern	München
                
                Test_123_END Test
                
                Bayern_München
                
                	München
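
As a quick check of that last regex, here is a small Python sketch run against the same sample lines. Python's re module has no \h, so the horizontal blank is spelled out as [\t\x20\xA0] (taken from the \h row of table A); that substitution, and the use of Python's Unicode-aware \w instead of Boost's, are assumptions of this sketch.

```python
import re

# Python has no \h, so a horizontal blank is written out: tab, space, no-break space.
H_BLANK = r"[\t\x20\xA0]"
WORDS = re.compile(r"(?:\w+" + H_BLANK + r")*\w+")

samples = [
    "Bayern",
    "Bayern M\u00fcnchen",
    "           Bayern\tM\u00fcnchen",  # leading spaces, tab between the words
    "Test_123_END Test",
    "Bayern_M\u00fcnchen",
    "\tM\u00fcnchen",                   # leading tab
]

# First match on each line; leading blanks are skipped because a match must start with \w.
matches = [WORDS.search(line).group(0) for line in samples]

for found in matches:
    print(repr(found))
```

Each line yields its words joined by their single blank, with any leading blanks left out of the match.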
                

                Best Regards,

                guy038

                • Esboutique @MAPJe71

                  @MAPJe71 Hey mapje! Is there any way to get help with an assignment in Notepad++? I am facing a deadline and see no way out. Regards

                  • MAPJe71 @Esboutique

                    @Esboutique
                    General remarks:

                    1. This is a forum that is used worldwide so please keep any communication in English :-)
                    2. Any help requested here should (IMO) be related to Notepad++ i.e. its functionality or its source;
                    3. Prior to asking for help make sure you have searched the forum and/or the GitHub repository for similar and related issues;
                    4. Keep posts in a topic related to that topic i.e. don’t piggyback;

                    Remarks on questions addressed to me directly:

                    1. I’m not going to do any work you are supposed to do free-of-charge;
                    2. I’m not going to do your homework (you’re supposed to learn something from doing the work yourself so go do it!);
                    3. For most Regular Expression related questions there are more suitable forums and informational websites e.g. Regular-Expressions.info;
                    4. Regular Expressions related to FunctionList, just open a topic and ask the question.

                    Thnx

                    • Esboutique @MAPJe71

                      @MAPJe71 Hi, you are quite right! Sorry for being such an IT-nitwit. But I will remember this for future questions, so thanks.

                      It wasn’t my intention to get anyone to do my job, but I have spent so many days searching how to work with Notepad++ that I have run out of time (so I put in my time and effort to get it done, but I cannot get acquainted with npp so quickly).

                      I am really not good at forums and asking for help. I guess this was just a random panic reaction (also speaking the same language did it for me).
                      I just found out that I can delete my messages so I will do so in order not to disturb the topic that I am posting this on. But just for getting this out to you, I quickly put it here.

                      Thanks again and sorry for messing this topic up a little!

                      Bye

                      The Community of users of the Notepad++ text editor.