    RegExpr search for utf8 chars not working anymore?

    • Pasquale Cerullo

      Hi, as of today the search for words including utf8 chars isn’t working as expected.
      The following search string:

      ^([a-z \u])

      applied to, say ‘Bayern München’, now returns something like:

      Bayern M

      instead of the full text, as it did until a week ago. The same search string works as expected in C# (with \p{L} in place of \u, of course!). Did you change something related to regular expressions?

      • MAPJe71

        Boost::RegEx \u: An “uppercase character” (any uppercase letter in the active code page).
        C# \p{L}: A character from the Unicode category “letter” (any kind of letter from any language).
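The difference between the two classes can be sketched in Python. This is only an approximation: it uses the Unicode general category from the standard `unicodedata` module rather than Boost's code-page lookup, and the helper names `is_upper_letter` and `is_any_letter` are made up for this illustration.

```python
import unicodedata

def is_upper_letter(ch: str) -> bool:
    """Rough stand-in for Boost's \\u: an uppercase letter (category Lu)."""
    return unicodedata.category(ch) == "Lu"

def is_any_letter(ch: str) -> bool:
    """Rough stand-in for C#'s \\p{L}: any letter (categories Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")

# 'M' is an uppercase letter; 'ü' is a letter, but not an uppercase one.
for ch in "Mü":
    print(ch, is_upper_letter(ch), is_any_letter(ch))
```

So a class built from `a-z`, a space and `\u` has no member that covers a lowercase `ü` when case matters, whereas `\p{L}` does.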

        • Pasquale Cerullo

          How would you write \p{L} for the Notepad++ regex engine?

          • Pasquale Cerullo

            Unchecking the option to distinguish between uppercase and lowercase letters solved my issue.

            @MAPJe71 : Without the \u identifier the text search stops at unicode chars, so…

            • MAPJe71

              Have a look here and/or here.

              • guy038

                Hello, @pasquale-cerullo, and All :

                Below are two tables that summarize the different escape sequences (shorthand character classes) available with the C++ Boost library v1.55.0 in Notepad++:


                A) Boost syntaxes for a single character, inside or outside character classes (with the Match case option set):

                •===================•===========================•==========•==========================================================•
                |   IN Class [...]  |      OUT Class [...]      | IN / OUT |   Character Class Contents for "Windows-1252" Encoding   |
                •===================•=============•=======•=====•==========•==========================================================•
                | [:space:] | [:s:] | \p{space}   | \p{s} | \ps |    \s    | [\t\n\x0B\f\r\x20\x85\xA0]                               |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                | [:digit:] | [:d:] | \p{digit}   | \p{d} | \pd |    \d    | [0-9¹²³]                                                 |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                | [:lower:] | [:l:] | \p{lower}   | \p{l} | \pl |    \l    | (?-i)[a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]        |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                | [:upper:] | [:u:] | \p{upper}   | \p{u} | \pu |    \u    | (?-i)[A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]             |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                | [:word:]  | [:w:] | \p{word}    | \p{w} | \pw |    \w    | [_\d\l\u]                                                |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                | [:blank:] | [:h:] | \p{blank}   | \p{h} | \ph |    \h    | [\t\x20\xA0]                                             |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |      [:v:]        | \p{v}       | \p{v} | \pv |    \v    | [\n\x0B\f\r\x85]                                         |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:alnum:]      | \p{alnum}   |       |     |          | [\d\l\u]                                                 |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:alpha:]      | \p{alpha}   |       |     |          | [\l\u]                                                   |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:cntrl:]      | \p{cntrl}   |       |     |          | [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]                      |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:graph:]      | \p{graph}   |       |     |          | [^\x00-\x20\x7F-\x81ˆ\x8D\x8F\x90˜™\x9D\xA0]             |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:print:]      | \p{print}   |       |     |          | [\s[:graph:]]                                            |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:punct:]      | \p{punct}   |       |     |          | []\x21-\x2F:;<=>?@\\^_`{|}~‚„…†‡‰‹‘’“”•–—›\xA1-\xBF×÷[]  |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:xdigit:]     | \p{xdigit}  |       |     |          | [0-9A-Fa-f]                                              |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:unicode:]    | \p{unicode} |       |     |          | (?-i)[\x{0100}-\x{FFFF}]                                 |
                •===================•=============•=======•=====•==========•==========================================================•
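The non-ASCII part of the `[:upper:]` row can be cross-checked with a short Python sketch. It assumes that Python's `cp1252` codec and `str.isupper()` classify characters the same way as the Windows-1252 locale that Boost consults, which holds for this row but is not guaranteed for every class; `cp1252_chars` is a helper name invented here.

```python
def cp1252_chars():
    """Yield every character that Windows-1252 defines for bytes 0x80-0xFF."""
    for byte in range(0x80, 0x100):
        # A few byte values are undefined in Python's cp1252 codec;
        # errors="ignore" turns them into an empty string, which is skipped.
        ch = bytes([byte]).decode("cp1252", errors="ignore")
        if ch:
            yield ch

# Uppercase letters beyond A-Z that exist in Windows-1252, in byte order.
upper_extra = "".join(ch for ch in cp1252_chars() if ch.isupper())
print(upper_extra)
```

The printed string can be compared with the characters listed after `A-Z` in the `[:upper:]` row.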
                

                B) Boost syntaxes for a single character, inside or outside NEGATED character classes (with the Match case option set):

                •===================•===========================•==========•==========================================================•
                |   IN Class [^..]  |      OUT Class [^..]      | IN / OUT |   Character Class Contents for "Windows-1252" Encoding   |
                •===================•=============•=======•=====•==========•==========================================================•
                | [:space:] | [:s:] | \P{space}   | \P{s} | \Ps |    \S    | [^\t\n\x0B\f\r\x20\xa0\x{2028}\x{2029}]                  |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                | [:digit:] | [:d:] | \P{digit}   | \P{d} | \Pd |    \D    | [^0-9¹²³]                                                |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                | [:lower:] | [:l:] | \P{lower}   | \P{l} | \Pl |    \L    | (?-i)[^a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]       |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                | [:upper:] | [:u:] | \P{upper}   | \P{u} | \Pu |    \U    | (?-i)[^A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]            |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                | [:word:]  | [:w:] | \P{word}    | \P{w} | \Pw |    \W    | [^_\d\l\u]                                               |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                | [:blank:] | [:h:] | \P{blank}   | \P{h} | \Ph |    \H    | [^\t\x20\xA0]                                            |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |      [:v:]        | \P{v}       | \P{v} | \Pv |    \V    | [^\n\x0B\f\r\x85\x{2028}\x{2029}]                        |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:alnum:]      | \P{alnum}   |       |     |          | [^\d\l\u] = [\D\L\U]                                     |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:alpha:]      | \P{alpha}   |       |     |          | [^\l\u]   = [\L\U]                                       |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:cntrl:]      | \P{cntrl}   |       |     |          | [^\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]                     |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:graph:]      | \P{graph}   |       |     |          | [\x00-\x20\x7F-\x81ˆ\x8D\x8F\x90˜™\x9D\xA0]              |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:print:]      | \P{print}   |       |     |          | [\x00-\x08\x0E-\x1f\x7F-\x81ˆ\x8D\x8F\x90˜™\x9D]         |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:punct:]      | \P{punct}   |       |     |          | [^]\x21-\x2F:;<=>?@\\^_`{|}~‚„…†‡‰‹‘’“”•–—›\xA1-\xBF×÷[] |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:xdigit:]     | \P{xdigit}  |       |     |          | [^0-9A-Fa-f]                                             |
                •-------------------•-------------•-------•-----•----------•----------------------------------------------------------•
                |    [:unicode:]    | \P{unicode} |       |     |          | (?-i)[^\x{0100}-\x{FFFF}]                                |
                •===================•=============•=======•=====•==========•==========================================================•
                

                So, your regex ^([a-z \u]) matches any single character at the beginning of a line which is either:

                • A lowercase letter between a and z, a space character, or an uppercase letter (accented or not), if the Match case option is set

                • A letter between a and z or between A and Z, a space character, or any letter (accented or not), if the Match case option is unset
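The two cases above can be imitated with a small Python sketch. It is not Boost: the uppercase set is copied from the `[:upper:]` row of table A, case-insensitive matching is modelled as "either case of a member is accepted", and the class is applied repeatedly from the start of the line, as the result reported in the first post suggests. `in_class` and `matched_prefix` are hypothetical helper names.

```python
import string

# Uppercase letters from the [:upper:] row of table A (Windows-1252).
UPPER = set(string.ascii_uppercase + "ŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ")

def in_class(ch: str, match_case: bool) -> bool:
    """Emulate the class [a-z \\u] for a single character."""
    if not match_case:
        # Assumption: without Match case, either case of a member is accepted.
        return in_class(ch.lower(), True) or in_class(ch.upper(), True)
    return "a" <= ch <= "z" or ch == " " or ch in UPPER

def matched_prefix(text: str, match_case: bool) -> str:
    """Return the longest prefix made only of characters in the class."""
    n = 0
    while n < len(text) and in_class(text[n], match_case):
        n += 1
    return text[:n]

print(matched_prefix("Bayern München", match_case=True))
print(matched_prefix("Bayern München", match_case=False))
```

With Match case set, the lowercase `ü` is neither in `a-z` nor uppercase, so the prefix stops in front of it; with Match case unset, `ü` is accepted through its uppercase counterpart `Ü`.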


                I would simply rewrite it as ^([\w ]) or, more strictly, ^([[:alpha:]\x20])

                To match any run of words separated by exactly one horizontal blank character, use the regex (\w+\h)*\w+, which matches each of the groups below:

                Bayern
                
                Bayern München
                
                           Bayern	München
                
                Test_123_END Test
                
                Bayern_München
                
                	München
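The same pattern can be tried outside Notepad++ with Python's `re` module, as a rough check only. Python has no `\h`, so the sketch substitutes an explicit class with tab, space and no-break space (the `\h` row of table A), and Python's `\w` is Unicode-aware rather than tied to a code page. The name `WORDS` is made up here.

```python
import re

# (\w+\h)*\w+ with \h replaced by an explicit class: tab, space, no-break space.
WORDS = re.compile(r"(\w+[\t \xa0])*\w+")

samples = [
    "Bayern",
    "Bayern München",
    "Bayern\tMünchen",
    "Test_123_END Test",
    "Bayern_München",
]

for line in samples:
    # group() returns the whole match, not just the last captured word.
    print(repr(WORDS.search(line).group()))

# Two consecutive blanks break the run, so only the first word is matched.
print(repr(WORDS.search("Bayern  München").group()))
```

Leading blanks, as in some of the sample lines, are simply skipped by `search`, because the match can only begin at a word character.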
                

                Best Regards,

                guy038

                • Esboutique

                  @MAPJe71 [translated from Dutch] Hey mapje! Is there any chance of getting help with an assignment in Notepad++? I'm up against a deadline and can't see a way out. Regards

                  • MAPJe71

                    @Esboutique
                    General remarks:

                    1. This is a forum that is used worldwide, so please keep any communication in English :-)
                    2. Any help requested here should (IMO) be related to Notepad++, i.e. its functionality or its source;
                    3. Prior to asking for help, make sure you have searched the forum and/or the GitHub repository for similar and related issues;
                    4. Keep posts in a topic related to that topic, i.e. don’t piggyback.

                    Remarks on questions addressed to me directly:

                    1. I’m not going to do any work you are supposed to do free-of-charge;
                    2. I’m not going to do your homework (you’re supposed to learn something from doing the work yourself so go do it!);
                    3. For most Regular Expression related questions there are more suitable forums and informational websites e.g. Regular-Expressions.info
                    4. For Regular Expressions related to FunctionList, just open a topic and ask the question.

                    Thnx

                    • Esboutique

                      @MAPJe71 Hi, you are quite right! Sorry for being such an IT-nitwit. But I will remember this for future questions, so thanks.

                      It wasn’t my intention to get anyone to do my job, but I have spent so many days figuring out how to work with Notepad++ that I have run out of time (I did put in the time and effort, but I cannot get acquainted with npp that quickly).

                      I am really not good at forums and asking for help. I guess this was just a random panic reaction (also speaking the same language did it for me).
                      I just found out that I can delete my messages so I will do so in order not to disturb the topic that I am posting this on. But just for getting this out to you, I quickly put it here.

                      Thanks again and sorry for messing this topic up a little!

                      Bye
