    notepad++ url processing cyrillic symbols

    • Александр Корженевский

      Can you tell me where I can read about how to open an issue on GitHub?

      • Claudia Frank @Александр Корженевский

        @Александр-Корженевский

        it has already been addressed by someone recently
        https://github.com/notepad-plus-plus/notepad-plus-plus/issues/2746

        Cheers
        Claudia

        • Александр Корженевский

          https://github.com/notepad-plus-plus/notepad-plus-plus/issues/2746
          That issue is not about Cyrillic URLs.
          It is about “Don’t check at launch time” not working.
          Can you give me the right link?

          • Claudia Frank @Александр Корженевский

            @Александр-Корженевский

            I have to apologize twice:
            first, I copied a link that belonged to another post, not yours, and
            second, I forgot to follow up on this.

            The issue is within the regex being used at the moment.

            [A-Za-z]+://[A-Za-z0-9_\-\+~.:?&@=/%#,;\{\}\(\)\[\]\|\*\!\\]+
            

            I’m currently trying to find out if a more simplified version like

            [A-Za-z]+://.+?(?= )
            

            can be used. The latter wouldn’t care about the chars after ://,
            it would treat everything as part of a URL until a literal space appears.

            Currently, I don’t see why it can’t be used. If someone knows why it shouldn’t
            be used, I would appreciate a hint.

            Cheers
            Claudia

            • guy038

              Hello, Александр Корженевский and Claudia

              Claudia, the regex being used for Internet addresses, below, given in your last post :

              [A-Za-z]+://[A-Za-z0-9_\-\+~.:?&@=/%#,;\{\}\(\)\[\]\|\*\!\\]+
              

              could be nicely shortened to the regex :

              [A-Za-z]+://[]A-Za-z_!#%&(-;=?@[\\{|}~-]+
              
              • I didn’t change the first part [A-Za-z]+://

              • I just simplified the character class []A-Za-z_!#%&(-;=?@[\\{|}~-]+ :

                • I put the closing square bracket, ], as the first character of the class, [....]. So, I do not need to escape it !

                • In the same way, I put the dash sign, -, as the last character of the class. So the backslash is not needed there either !

                • The part (-; is the range, by ascending code-point, of all the characters between the opening round bracket, (, and the semicolon sign, ;, which includes the digits 0 to 9 !

                • Only the backslash symbol needs to be escaped, as \\ !


              Of course, Claudia, your regex would work ! But I think we would rather use the \w syntax, which matches any word character, from any Unicode script ( Greek, Cyrillic, Arabic, Hebrew,… )

              Then, this regex would become, as below :

              [A-Za-z]+://[]\w!#%&(-;=?@[\\{|}~-]+
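A minimal sketch of the difference between the two character classes, run with Python's `re` module rather than with the engine Notepad++ uses. It assumes that Python's `\w`, which is Unicode-aware for text patterns, behaves like the `\w` described above; the sample text and the `example.org` address are made up for illustration.

```python
import re

# The ASCII-only class from this post and the variant that uses \w.
ASCII_URL = re.compile(r"[A-Za-z]+://[]A-Za-z_!#%&(-;=?@[\\{|}~-]+")
WORD_URL = re.compile(r"[A-Za-z]+://[]\w!#%&(-;=?@[\\{|}~-]+")

# Hypothetical sample text, not taken from the thread.
text = "see https://example.org/?q=ссылки+русские for details"

ascii_hit = ASCII_URL.search(text).group(0)
word_hit = WORD_URL.search(text).group(0)

print(ascii_hit)  # ends in front of the first Cyrillic letter
print(word_hit)   # runs up to the space that follows the address
```

With the ASCII-only class the match is cut at the first Cyrillic letter, which is the behaviour reported at the start of this topic.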
              

              Cheers,

              guy038

              • Claudia Frank @guy038

                @guy038

                thank you for giving an alternative - tried it with the calls npp uses and it works as well.
                Just out of interest, what is the benefit of using this regex?

                Cheers
                Claudia

                • guy038

                  Hi, Claudia,

                  I didn’t understand exactly what you meant ! Were you asking about :

                  • A) The benefit of adding the \w syntax, instead of A-Za-z0-9_ ?

                  • B) The benefit of shortening the original regex, that you gave :

                    [A-Za-z]+://[A-Za-z0-9_\-\+~.:?&@=/%#,;\{\}\(\)\[\]\|\*\!\\]+

                  to my regex version ?

                  [A-Za-z]+://[]A-Za-z_!#%&(-;=?@[\\{|}~-]+
                  

                  • Concerning the first option, the benefit would be that any non-latin character could be part of a Net address !

                  • As for the second option, it’s just that it “hurts” my eyes to see such a regex, which is far from being irreducible :!! But, anyway, these two regexes match, exactly, the same set of characters.
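The claim that both classes accept the same characters can be checked mechanically. The sketch below uses Python's `re` module, which is an assumption of this example (Notepad++ uses a different engine), and only looks at the 128 ASCII code points, since neither class contains anything else.

```python
import re

# One-character versions of the two classes discussed above.
ORIGINAL = re.compile(r"[A-Za-z0-9_\-\+~.:?&@=/%#,;\{\}\(\)\[\]\|\*\!\\]")
SHORTENED = re.compile(r"[]A-Za-z_!#%&(-;=?@[\\{|}~-]")


def accepted(pattern):
    """Return the set of ASCII characters matched by a one-character class."""
    return {chr(cp) for cp in range(128) if pattern.fullmatch(chr(cp))}


original_set = accepted(ORIGINAL)
shortened_set = accepted(SHORTENED)

print(original_set == shortened_set)
print("".join(sorted(original_set)))
```

Note that neither class contains `$` or `'`, although RFC 3986 lists both among the sub-delims.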

                  Cheers,

                  guy038

                  • Claudia Frank

                    Hi Guy,

                    no, it was the third option ;-)

                    [A-Za-z]+://.+?(?= )  vs.  [A-Za-z]+://[]A-Za-z_!#%&(-;=?@[\\{|}~-]+
                    

                    Cheers
                    Claudia

                    • Claudia Frank @guy038

                      @guy038

                      after doing some tests I figured out that my version

                      [A-Za-z]+://.+?(?= )
                      

                      has an issue: it expects a space as separator at the end, but what if the line ends?
                      So I assume it should be

                      [A-Za-z]+://.*?(?=\s)
                      

                      which works well for the scenarios I’ve tested so far.
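The difference between the two lookaheads can be reproduced with a short sketch in Python's `re` module (an assumption of this example, not the engine Notepad++ uses). The sample text is hypothetical; the URL is the last thing on its line, so no literal space follows it.

```python
import re

SPACE_ONLY = re.compile(r"[A-Za-z]+://.+?(?= )")   # first version
ANY_BLANK = re.compile(r"[A-Za-z]+://.*?(?=\s)")   # corrected version

# Hypothetical text: a line break, not a space, follows the URL.
text = "link: http://example.org/page\nnext"

space_hit = SPACE_ONLY.search(text)
blank_hit = ANY_BLANK.search(text)

print(space_hit)            # no match, because "." does not cross the line break
print(blank_hit.group(0))   # the URL, delimited by the line break
```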

                      Cheers
                      Claudia

                      • guy038

                        Hi, Александр Корженевский and Claudia,

                        • From the contents of the Internet Standard document ( January 2005 ), below :

                        https://tools.ietf.org/html/rfc3986

                        • from the end of its sections :

                        https://tools.ietf.org/html/rfc3986#section-2.3

                        https://tools.ietf.org/html/rfc3986#section-2.5

                        • From the section :

                        https://tools.ietf.org/html/rfc3986#section-2.4

                        • And from the Wikipedia description of the URI syntax, below :

                        https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Syntax


                        I understand that, to check the validity of a Uniform Resource Identifier, it is necessary to :

                        • 1) Firstly, change any “per-cent” encoded character, < %80, which is an unreserved character, into its corresponding unreserved form. For example, %61 should be rewritten a and %7E should be rewritten ~

                        • 2) Secondly, “per-cent” encode any character, with code > \x007f, which is NOT an unreserved character, according to its UTF-8 format. For example, the À would be “per-cent” encoded %C3%80 and the ア ( KATAKANA LETTER A ) would be “per-cent” encoded %E3%82%A2

                        ( At this point, every character of a URI should be a true ASCII character, with code-point < \x0080 ! )

                        • 3) Thirdly, verify that the resulting address is a valid URI, according to the different rules, below, that describe the generic syntax of a Uniform Resource Identifier ( URI )
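Steps 1 and 2 can be sketched with Python's standard library. The helper names `decode_unreserved` and `encode_non_ascii` are hypothetical and exist only for this illustration; `quote` from `urllib.parse` does the UTF-8 based per-cent encoding.

```python
import re
import string
from urllib.parse import quote

# The "unreserved" set of RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def decode_unreserved(uri):
    """Step 1: turn %XX back into the character when it is unreserved."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        # Other escapes are kept, with upper-case hex digits.
        return char if char in UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", replace, uri)


def encode_non_ascii(uri):
    """Step 2: per-cent encode every character above U+007F via UTF-8."""
    return re.sub(r"[^\x00-\x7f]", lambda m: quote(m.group(0)), uri)


print(decode_unreserved("%61%2f%7E"))   # a%2F~
print(encode_non_ascii("À"))            # %C3%80
print(encode_non_ascii("ア"))           # %E3%82%A2
```

Step 3, the validation against the grammar, is not covered by this sketch.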

                        So, given the definition of the following entities, below :

                        ALPHA          =  "A" / "B" / "C" / ..... / "X" / "Y" / "Z" / "a" / "b" / "c" / ..... / "x" / "y" / "z"
                        
                        
                        DIGIT          =  "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9"
                        
                        
                        HEXDIG         =  DIGIT / "A" / "B" / "C" / "D" / "E" / "F" / "a" / "b" / "c" / "d" / "e" / "f"
                        
                        
                        dec-octet      =  DIGIT  /  %x31-39 DIGIT  /  "1" DIGIT DIGIT  /  "2" %x30-34 DIGIT  /  "25" %x30-35
                        ;                  0-9         10-99             100-199              200-249             250-255
                        
                        
                        h16            =  1*4HEXDIG                                          ;  16 bits of address represented in hexadecimal
                        
                        
                        ls32           =  ( h16 ":" h16 ) / IPv4address                      ;  least-significant 32 bits of address
                        
                        
                        IPv6address    =                               6( h16 ":" ) ls32
                                                          /                       "::" 5( h16 ":" ) ls32
                                                          / [               h16 ] "::" 4( h16 ":" ) ls32
                                                          / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
                                                          / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
                                                          / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
                                                          / [ *4( h16 ":" ) h16 ] "::"              ls32
                                                          / [ *5( h16 ":" ) h16 ] "::"              h16
                                                          / [ *6( h16 ":" ) h16 ] "::"
                        
                        
                        pct-encoded    =  "%" HEXDIG HEXDIG      ; Percent encoding
                        
                        
                        unreserved     =  ALPHA / DIGIT / "-" / "." / "_" / "~"
                        
                        
                        gen-delims     =  ":" / "/" / "?" / "#" / "[" / "]" / "@"
                        
                        
                        sub-delims     =  "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
                        
                        
                        reserved       =  gen-delims / sub-delims
                        
                        
                        IPvFuture      =  "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
                        
                        
                        pchar          =  unreserved / pct-encoded / sub-delims / ":" / "@"
                        
                        
                        segment        =  *pchar
                        
                        
                        segment-nz     =  1*pchar
                        
                        
                        segment-nz-nc  =  1*( unreserved / pct-encoded / sub-delims / "@" )  ;  NON-zero-length segment, without any colon ":"
                        

                        Then, the generic syntax, of any URI ( Uniform Resource Identifier ), is :

                        URI = scheme ":" [ "//" authority ] path [ "?" query ] [ "#" fragment ]

                        with :

                        scheme                 =  ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
                        
                        
                        authority              =  [ userinfo "@" ] host [ ":" port ]
                        
                        
                            userinfo           =  *( unreserved / pct-encoded / sub-delims / ":" )
                        
                        
                            host               =  IP-literal / IPv4address / reg-name
                        
                        
                                IP-literal     =  "[" ( IPv6address / IPvFuture  ) "]"
                        
                        
                                IPv4address    =  dec-octet "." dec-octet "." dec-octet "." dec-octet
                        
                        
                                reg-name       =  *( unreserved / pct-encoded / sub-delims )
                        
                        
                            port               =  *DIGIT
                        
                        
                        path                   =  path-abempty / path-absolute / path-noscheme / path-rootless / path-empty
                        
                        
                            path-abempty       =  *( "/" segment )                      ;  begins with a "/" or is EMPTY. Used only if Authority is PRESENT
                        
                        
                            path-absolute      =  "/" [ segment-nz *( "/" segment ) ]   ;  begins with a "/", but NOT "//"
                        
                        
                            path-noscheme      =  segment-nz-nc *( "/" segment )        ;  begins with a NON-colons segment
                        
                        
                            path-rootless      =  segment-nz *( "/" segment )           ;  begins with a SEGMENT
                        
                        
                            path-empty         =  0<pchar>                              ;  ZERO characters
                        
                        
                        query                  =  *( pchar / "/" / "?" )
                        
                        
                        fragment               =  *( pchar / "/" / "?" )
                        

                        From these different rules, above, and from the most popular schemes used in URIs, I ended up with the awful regex, below ( which is far from being complete, anyway ! )

                        (https?|ftp|mailto|file|data|irc):(//([A-Za-z0-9_.~!$&'()*+,;=-]|%[0-9A-Fa-f][0-9A-Fa-f])+(:[0-9]+)?(/([A-Za-z0-9_.~!$&'()*+,;=:@-]|%[0-9A-Fa-f][0-9A-Fa-f])*)*|/?(([A-Za-z0-9_.~!$&'()*+,;=:@-]|%[0-9A-Fa-f][0-9A-Fa-f])+(/([A-Za-z0-9_.~!$&'()*+,;=:@-]|%[0-9A-Fa-f][0-9A-Fa-f])+)*)?)(\?([A-Za-z0-9_.~!$&'()*+,;=:@/?-]|%[0-9A-Fa-f][0-9A-Fa-f])+)?(#([A-Za-z0-9_.~!$&'()*+,;=:@/?-]|%[0-9A-Fa-f][0-9A-Fa-f])+)?

                        In practice, the normal use of such a validation regex would be quite ridiculous ! So we need to find another, practical regex, in order to help Notepad++ recognize and properly underline an Internet address !


                        Note that the default Notepad++ behaviour, about underlining Internet addresses, just follows the URI standards !

                        For instance, the first address that Александр Корженевский gave, in his first post :

                        https://www.google.ru/?gws_rd=ssl#newwindow=1&q=notepad%2B%2B+ссылки+русские+символы
                        

                        should first be rewritten as

                        https://www.google.ru/?gws_rd=ssl#newwindow=1&q=notepad%2B%2B+%D1%81%D1%81%D1%8B%D0%BB%D0%BA%D0%B8+%D1%80%D1%83%D1%81%D1%81%D0%BA%D0%B8%D0%B5+%D1%81%D0%B8%D0%BC%D0%B2%D0%BE%D0%BB%D1%8B
                        

                        Referring to the table, below :

                        *------------*--------------*---------------------*
                        |  Cyrillic  |   Unicode    |  Per-Cent Encoding  |
                        |  Character |  Code-Point  |   based on UTF-8    |
                        *------------*--------------*---------------------*
                        |     с      |    0441      |      %D1%81         |
                        |     с      |    0441      |      %D1%81         |
                        |     ы      |    044b      |      %D1%8B         |
                        |     л      |    043b      |      %D0%BB         |
                        |     к      |    043a      |      %D0%BA         |
                        |     и      |    0438      |      %D0%B8         |
                        |            |              |                     |
                        |     р      |    0440      |      %D1%80         |
                        |     у      |    0443      |      %D1%83         |
                        |     с      |    0441      |      %D1%81         |
                        |     с      |    0441      |      %D1%81         |
                        |     к      |    043a      |      %D0%BA         |
                        |     и      |    0438      |      %D0%B8         |
                        |     е      |    0435      |      %D0%B5         |
                        |            |              |                     |
                        |     с      |    0441      |      %D1%81         |
                        |     и      |    0438      |      %D0%B8         |
                        |     м      |    043c      |      %D0%BC         |
                        |     в      |    0432      |      %D0%B2         |
                        |     о      |    043e      |      %D0%BE         |
                        |     л      |    043b      |      %D0%BB         |
                        |     ы      |    044b      |      %D1%8B         |
                        *------------*--------------*---------------------*
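The rows of the table can be recomputed with a few lines of Python; this is only a cross-check, and the helper name `table_rows` is made up for the example.

```python
from urllib.parse import quote


def table_rows(word):
    """Yield (character, code point in hex, UTF-8 per-cent encoding)."""
    for char in word:
        yield char, f"{ord(char):04x}", quote(char)


for word in ("ссылки", "русские", "символы"):
    for char, code_point, encoded in table_rows(word):
        print(f"{char}  {code_point}  {encoded}")
    print()
```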
                        

                        You’ll easily notice, that, once pasted in Notepad++ :

                        • The first address does not underline all the Cyrillic characters

                        • The second address is totally underlined, due to the per-cent encoding of all these Cyrillic characters

                        However, we should force the first address to be a valid one, by allowing any word character to be part of an address, as most addresses are not “well-formed” with the per-cent mechanism !

                        But this condition is NOT sufficient ! Indeed, contrary to what I said, in my previous post, we need to match NON-word characters, too ! Just imagine the simple address below :

                        https://www.google.fr/?gws_rd=ssl#newwindow=1&q=€
                        

                        When you paste this address in a new tab, Notepad++ underlines the whole link, except for the single € sign. Nevertheless, if you select the whole link, including the Euro sign, and paste it, for instance, in the Firefox address field, it correctly displays the Google results for the Euro sign ( if, of course, Google is your default search site in Firefox )

                        So, my regex is wrong. In conclusion, Claudia, we could merge the exact regex for the Scheme, the first component of a URI, with your general regex for the remainder of a URI, giving the final regex :

                        (?-s)[A-Za-z][A-Za-z0-9+.-]+://.*?(?=\s)

                        We could, also, use the more restrictive Wikipedia form :

                        (?-s)(https?|ftp|mailto|file|data|irc)://.*?(?=\s)

                        => These regexes should be used, in N++ code, to underline any Internet address :-))

                        Notes :

                        • The use of \s is the best syntax, as it matches, either, any horizontal or any vertical blank character, as the limit of the address

                        • However, Claudia and All, just be aware that this regex is, really, NOT restrictive, for the four last components of a URI ( Authority, Path, Query and Fragment ) !!
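The Euro sign case can be replayed with a small sketch in Python's `re` module, which is an assumption of this example and not the engine used by Notepad++. Python does not accept a global `(?-s)`, and its `.` already excludes line breaks by default, so the prefix is left out. The surrounding sample text is hypothetical.

```python
import re

# Final regex of this post, without the (?-s) prefix (see the note above).
FINAL_URL = re.compile(r"[A-Za-z][A-Za-z0-9+.-]+://.*?(?=\s)")
# The \w based class proposed in the earlier post.
WORD_URL = re.compile(r"[A-Za-z]+://[]\w!#%&(-;=?@[\\{|}~-]+")

text = "open https://www.google.fr/?gws_rd=ssl#newwindow=1&q=€ in a browser"

final_hit = FINAL_URL.search(text).group(0)
word_hit = WORD_URL.search(text).group(0)

print(final_hit)  # includes the Euro sign
print(word_hit)   # ends in front of the Euro sign, which is not a word character
```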

                        Best Regards,

                        guy038

                        P.S.

                        If you would like to break down a WELL-formed URI reference, in order to find its five components, you can use the S/R, below.

                        Why is this S/R so simple, compared to the enormous regex, used for Internet address validation ? Well, because we, simply, suppose that the matched address is a well-formed URI and that we just want to split it, in some parts !

                        So, the S/R, below, replaces any correct Internet address with the description of its five main components. Note that some parts may be undefined.

                        SEARCH ^([^\r\n:/?#]+):(?://([^\r\n/?#]*))?([^?#]*)(?:\?([^\r\n#]*))?(?:#([^\r\n]*))?

                        REPLACE Scheme = \1\r\nAuthority = \2\r\nPath = \3\r\nQuery = \4\r\nFragment = \5
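The same search pattern can be tried outside Notepad++. The sketch below applies it with Python's `re` module; the function name `split_uri` and the dictionary keys are choices made for this example only.

```python
import re

# The SEARCH pattern of the S/R above, unchanged.
SPLIT_URI = re.compile(
    r"^([^\r\n:/?#]+):(?://([^\r\n/?#]*))?([^?#]*)"
    r"(?:\?([^\r\n#]*))?(?:#([^\r\n]*))?"
)

NAMES = ("scheme", "authority", "path", "query", "fragment")


def split_uri(uri):
    """Split a well-formed URI; components that are absent come back as None."""
    match = SPLIT_URI.match(uri)
    if match is None:
        return None
    return dict(zip(NAMES, match.groups()))


parts = split_uri("https://www.google.fr/?gws_rd=ssl#newwindow=1&q=notepad")
for name in NAMES:
    print(f"{name.capitalize()} = {parts[name]}")
```

As in the S/R, nothing is validated here: the input is simply assumed to be a well-formed URI on a single line.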

                        • Claudia Frank @guy038

                          Hello to everyone,

                          thank you, guy, for doing this research, good as always.
                          Yesterday I did some checks and created a Python script
                          which lets me test every Unicode code point in the range
                          of 0x0 to 0x10FFFF.
                          I will redo the tests with the new regex and see how it behaves.

                          Two comments:

                          a) we cannot use the more restrictive version, as there are many more schemes available.
                          See https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml.

                          b) the regex needs a $ at the end, as the URL can be the last entry in a document
                          and then we do not have any whitespace char at all.

                          Keep you updated.

                          Cheers
                          Claudia

                          • Claudia Frank

                            Hello,

                            I did some further tests to identify which Unicode chars would “break” the regex.
                            “Break” means that the match stops where the char appears.

                            The obvious ones

                            0x0    = NULL
                            0x9    = TAB
                            0xA    = LineFeed
                            0xB    = VerticalTab
                            0xC    = FormFeed
                            0xD    = CarriageReturn
                            0x20   = Space
                            0x85   = Next Line 
                            0xa0   = No-Break Space
                            0x1680 = Ogham Space Mark
                            0x2000 = En Quad 
                            0x2001 = Em Quad 
                            0x2002 = En Space
                            0x2003 = Em Space 
                            0x2004 = Three-Per-Em Space 
                            0x2005 = Four-Per-Em Space 
                            0x2006 = Six-Per-Em Space 
                            0x2007 = Figure Space
                            0x2008 = Punctuation Space
                            0x2009 = Thin Space 
                            0x200A = Hair Space
                            0x200C = Zero Width Non-Joiner
                            0x200D = Zero Width Joiner
                            0x200E = Left-To-Right Mark
                            0x200F = Right-To-Left Mark
                            0x2028 = Line Separator
                            0x2029 = Paragraph Separator 
                            0x202f = Narrow No-Break Space
                            0x205f = Medium Mathematical Space 
                            0x3000 = Ideographic Space
                            

                            and some unusual ones.
                            To be more precise, a strange pattern showed up.
                            Every 0x?0085, 0x?2028 and 0x?2029 would “break” the regex.

                            0x10085 LINEAR B IDEOGRAM B105M STALLION
                            0x12028 Cuneiform Sign Al Times Ush     
                            0x12029 Cuneiform Sign Alan             
                            
                            0x20085 CJK UNIFIED IDEOGRAPH  =  𠂅       
                            0x22028 CJK UNIFIED IDEOGRAPH  =  𢀨
                            0x22029 CJK UNIFIED IDEOGRAPH  =  𢀩
                            

                            and code points reported as Unknown - Unknown Script by unicode.org
                            (I guess this means that those are valid values but reserved for future use)

                            0x30085, 0x32028, 0x32029, 
                            0x40085, 0x42028, 0x42029,
                            0x50085, 0x52028, 0x52029,
                            0x60085, 0x62028, 0x62029,
                            0x70085, 0x72028, 0x72029,
                            0x80085, 0x82028, 0x82029,
                            0x90085, 0x92028, 0x92029,
                            0xa0085, 0xa2028, 0xa2029,
                            0xb0085, 0xb2028, 0xb2029,
                            0xc0085, 0xc2028, 0xc2029,
                            0xd0085, 0xd2028, 0xd2029,
                            0xe0085, 0xe2028, 0xe2029,
                            0xf0085, 0xf2028, 0xf2029,
                            0x100085, 0x102028, 0x102029
                            

                            I can’t really explain why this happened.
                            Maybe someone has an idea or insight?
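The pattern itself can at least be written down arithmetically: every code point in the lists above has the same low 16 bits as U+0085, U+2028 or U+2029. The sketch below only restates that observation; whether the regex engine really ignores the upper bits is an assumption that this thread does not confirm. The names `BREAKERS` and `reported_code_points` are made up for the example.

```python
# Low 16 bits shared by all the "unusual" code points listed above.
BREAKERS = (0x0085, 0x2028, 0x2029)


def reported_code_points(first_plane=1, last_plane=16):
    """Build the code points 0x?0085, 0x?2028 and 0x?2029 for a range of planes."""
    return [
        (plane << 16) | low
        for plane in range(first_plane, last_plane + 1)
        for low in BREAKERS
    ]


all_hits = reported_code_points()
unassigned_hits = reported_code_points(first_plane=3)

print([hex(cp) for cp in all_hits[:6]])
print(len(unassigned_hits), hex(unassigned_hits[0]), hex(unassigned_hits[-1]))
```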

                            Also, are these three Han script symbols valid, in the sense of actually being used in text?

                            Nevertheless, keeping in mind that, currently,
                            no “unicode” url gets formatted as link,
                            I would still ask for replacing the currently used regex with this one

                            (?-s)[A-Za-z][A-Za-z0-9+.-]+://.*?(?=\s|$)
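The effect of the added `$` alternative can be sketched in Python's `re` module (an assumption of this example; the `(?-s)` prefix is dropped because Python's `.` excludes line breaks by default). The sample text is hypothetical and ends directly after the URL, without any whitespace.

```python
import re

WITHOUT_END = re.compile(r"[A-Za-z][A-Za-z0-9+.-]+://.*?(?=\s)")
WITH_END = re.compile(r"[A-Za-z][A-Za-z0-9+.-]+://.*?(?=\s|$)")

# Hypothetical document whose last characters belong to the URL.
text = "last line: https://example.org/путь"

without_end_hit = WITHOUT_END.search(text)
with_end_hit = WITH_END.search(text)

print(without_end_hit)           # no match, no whitespace follows the URL
print(with_end_hit.group(0))     # the complete URL
```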
                            

                            A couple of further tests outstanding - hope to get it done in the next days.

                            Cheers
                            Claudia

                            • Claudia Frank

                              The following tests have been successfully passed
                              Tested within the range of 0x0-0x10FFFF.

                              start of file tests
                              -url at start of file (no additional text) (also end of file test)
                              -url at start of file (followed by tab)
                              -url at start of file (followed by space)
                              -url at start of file (followed by eol)
                              -url at start of file (followed by tab and text)
                              -url at start of file (followed by space and text)
                              -url at start of file (followed by eol and text)

                              end of file tests
                              -url at end of file (preceded by tab)
                              -url at end of file (preceded by space)
                              -url at end of file (preceded by eol)
                              -url at end of file (preceded by text and tab)
                              -url at end of file (preceded by text and space)
                              -url at end of file (preceded by text and eol)

                              in the middle of a file tests
                              -url in the middle of a file (preceded and followed by tab)
                              -url in the middle of a file (preceded and followed by space)
                              -url in the middle of a file (preceded and followed by eol)
                              -url in the middle of a file (preceded and followed by text and tab)
                              -url in the middle of a file (preceded and followed by text and space)
                              -url in the middle of a file (preceded and followed by text and eol)

                              From my point of view it looks ok.
                              I’m going to open an enhancement request at github.

                              Cheers
                              Claudia

                              • rddim

                                I have the same problem with Cyrillic. Please open an issue on GitHub: go to https://github.com/notepad-plus-plus/notepad-plus-plus/issues, sign in with your account and create a new issue. It is also good to include the URL of this discussion.

                                • Claudia Frank @rddim

                                  @rddim

                                  :-) has already been done ;-)
                                  https://github.com/notepad-plus-plus/notepad-plus-plus/issues/2798

                                  Cheers
                                  Claudia

                                  • guy038

                                    Hi Claudia and All,

                                    Reminder :

                                    Unicode is organized in 17 planes, each composed of 65536 code-points => 1,114,112 possible values ! Only SIX planes are defined. These are :

                                    - The BMP  ( BASIC MULTILINGUAL Plane )             =  Plane 0, from code-point   U+0000 to code-point   U+FFFF
                                    
                                    - The SMP  ( SUPPLEMENTARY MULTILINGUAL Plane )     =  Plane 1, from code-point  U+10000 to code-point  U+1FFFF
                                    
                                    - The SIP  ( SUPPLEMENTARY IDEOGRAPHIC Plane )      =  Plane 2, from code-point  U+20000 to code-point  U+2FFFF
                                    
                                    - The SSP  ( SUPPLEMENTARY SPECIAL-PURPOSE Plane )  = Plane 14, from code-point  U+E0000 to code-point  U+EFFFF
                                    
                                    - The SPUA-A ( SUPPLEMENTARY PRIVATE USE Area-A )   = Plane 15, from code-point  U+F0000 to code-point  U+FFFFF

                                    - The SPUA-B ( SUPPLEMENTARY PRIVATE USE Area-B )   = Plane 16, from code-point U+100000 to code-point U+10FFFF
                                    

                                    Up to now, even with the recent Unicode 9.0 version, all the other planes, from 3 to 13, are NOT used and all the corresponding code-points, from U+30000 to U+DFFFF, are NOT assigned, except for the last two code-points of each plane, which are assigned as NON-characters
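The plane arithmetic above is easy to reproduce. A minimal sketch, with the hypothetical helper names `plane_of` and `plane_range`:

```python
PLANE_SIZE = 0x10000   # 65536 code-points per plane
PLANE_COUNT = 17       # planes 0 to 16


def plane_of(code_point):
    """Return the plane number (0 to 16) of a Unicode code point."""
    if not 0 <= code_point < PLANE_SIZE * PLANE_COUNT:
        raise ValueError("not a Unicode code point")
    return code_point >> 16


def plane_range(plane):
    """Return the first and last code point of a plane."""
    first = plane * PLANE_SIZE
    return first, first + PLANE_SIZE - 1


print(PLANE_SIZE * PLANE_COUNT)                    # 1114112
print(plane_of(0x22028))                           # 2
print([hex(cp) for cp in plane_range(14)])         # ['0xe0000', '0xeffff']
```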

                                    So, Claudia :

                                    • From your first list : the range \x{0000}, \x{0009}… \x{205F}, \x{3000} ( 30 values )

                                    • From the second one : the values U+10085, U+12028, U+12029, U+20085, U+22028 and U+22029 ( 6 values )

                                    • From your last list : the range U+30085…U+102029 ( 42 values )

                                    I built a test file, containing all these characters, preceded by the letter a and followed by the letter z


                                    Then, I tried to determine all the 3-character strings aXz that were matched by the regex a\sz. After some tests, I can affirm that the \s regex, in a file with UNICODE encoding, matches any single character of the following list, ONLY :

                                    - TABULATION              ( \t )
                                    
                                    - NEW LINE                ( \n )
                                    
                                    - VERTICAL TABULATION     ( \x0B )
                                    
                                    - FORM FEED               ( \f )
                                    
                                    - CARRIAGE RETURN         ( \r )
                                    
                                    - SPACE                   ( \x20 )
                                    
                                    - NEXT LINE               ( \x85 )
                                    
                                    - NO BREAK SPACE          ( \xA0 )
                                    
                                    - OGHAM SPACE MARK        ( \x{1680} )
                                    
                                    - EN QUAD                 ( \x{2000} )
                                    
                                    - EM QUAD                 ( \x{2001} )
                                    
                                    - EN SPACE                ( \x{2002} )
                                    
                                    - EM SPACE                ( \x{2003} )
                                    
                                    - THREE-PER-EM SPACE      ( \x{2004} )
                                    
                                    - FOUR-PER-EM SPACE       ( \x{2005} )
                                    
                                    - SIX-PER-EM SPACE        ( \x{2006} )
                                    
                                    - FIGURE SPACE            ( \x{2007} )
                                    
                                    - PUNCTUATION SPACE       ( \x{2008} )
                                    
                                    - THIN SPACE              ( \x{2009} )
                                    
                                    - HAIR SPACE              ( \x{200A} )
                                    
                                    - LINE SEPARATOR          ( \x{2028} )
                                    
                                    - PARAGRAPH SEPARATOR     ( \x{2029} )
                                    
                                    - NARROW NO-BREAK SPACE   ( \x{202F} )
                                    
                                    - IDEOGRAPHIC SPACE       ( \x{3000} )
                                    

                                    And, except for the MEDIUM MATHEMATICAL SPACE ( \x{205F} ), which is NOT matched by the \s regex, this list is identical to the list of characters that the UNICODE Consortium considers as White_Space characters. Refer to the link below :

                                    http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
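
                                    For anyone who wants to repeat the aXz experiment outside Notepad++, here is a minimal Python 3 sketch. Please note the assumptions : it uses Python's own re module, which is a different engine from the one inside Notepad++, so its answers need not agree with the results reported above. The names WHITE_SPACE and matches_whitespace are only illustrative; the code-points are the 25 White_Space entries of PropList.txt.

```python
import re
import unicodedata

# The 25 code-points carrying the White_Space property in PropList.txt.
WHITE_SPACE = (
    list(range(0x0009, 0x000E))      # TAB, LF, VT, FF, CR
    + [0x0020, 0x0085, 0x00A0, 0x1680]
    + list(range(0x2000, 0x200B))    # EN QUAD .. HAIR SPACE
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)

# Same test regex as in the post, compiled with Python's `re` (not Notepad++).
pattern = re.compile(r"a\sz")


def matches_whitespace(code_point):
    """True if the 3-character string a<X>z is matched in full by a\\sz."""
    return pattern.fullmatch("a" + chr(code_point) + "z") is not None


# Collect the White_Space code-points that this engine does NOT treat as \s.
unmatched = [cp for cp in WHITE_SPACE if not matches_whitespace(cp)]

for cp in WHITE_SPACE:
    # Control characters have no name, hence the "<control>" default.
    name = unicodedata.name(chr(cp), "<control>")
    print(f"U+{cp:04X}  {name:28}  {matches_whitespace(cp)}")
print("not matched by \\s:", [f"U+{cp:04X}" for cp in unmatched])
```

                                    Running the same list through two engines is an easy way to spot a disagreement on a single character such as U+205F.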


                                    UPDATE on 02-17-2018 : just look at the definitive list of Unicode BLANK characters, below :

                                    https://notepad-plus-plus.org/community/topic/15279/unicode-blank-characters-and-the-regexes-h-v-and-s/1


                                    Finally, as most of these “White_Space” characters are quite exotic and very rarely used in normal writing, the idea of using the \s syntax, in a look-ahead, as the limit of an Internet address, seems quite pertinent !


                                    Claudia, the new regex, to determine all the contents of an address, could, also, be written :

                                    (?-s)[A-Za-z][A-Za-z0-9+.-]+://.*?(?=\s|\z)

                                    Indeed, the \s alternative of the look-ahead always applies, except when an Internet address ends the last line of a file that has no final line-break ! That specific case is matched by the second alternative, \z ;-)
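
                                    To illustrate the difference with the character-class regex quoted earlier in this thread, here is a small Python 3 sketch. It is only an approximation under stated assumptions : Python's re module is not the engine used by Notepad++, the leading (?-s) is dropped because the dot already excludes line-breaks by default in Python, and \Z is used in place of \z. The names OLD_URL, NEW_URL and first_url, as well as the sample addresses, are made up for the example.

```python
import re

# The character-class regex quoted earlier in the thread (ASCII-only URL body).
OLD_URL = re.compile(
    r"[A-Za-z]+://[A-Za-z0-9_\-\+~.:?&@=/%#,;\{\}\(\)\[\]\|\*\!\\]+"
)

# Python port of the proposed regex: "." does not match line-breaks by
# default (so "(?-s)" is omitted) and "\Z" stands in for "\z".
NEW_URL = re.compile(r"[A-Za-z][A-Za-z0-9+.-]+://.*?(?=\s|\Z)")


def first_url(pattern, text):
    """Return the first address found by `pattern` in `text`, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None


sample = "see https://ru.wikipedia.org/wiki/Россия for details"
print(first_url(OLD_URL, sample))  # stops in front of the first Cyrillic letter
print(first_url(NEW_URL, sample))  # runs up to the next white-space character

# Address ending the text without any line-break: handled by the \Z alternative.
print(first_url(NEW_URL, "http://пример.рф/путь"))
```

                                    The port keeps the two look-ahead alternatives of the proposal : white space after the address, or the very end of the text.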

                                    Best Regards,

                                    guy038

                                    P.S. :

                                    Claudia, I haven’t found any spare time, yet, to have a look at your new version of the RegexTexter script, with the Time regex test option. Just be patient for a couple of days :-)

                                    • Claudia FrankC
                                      Claudia Frank
                                      last edited by

                                      Hi Guy,

                                      thank you for doing and researching this and the confirmation about the test.
                                      But I don’t get the same result for \x205f

                                      So, as you see, I used a Python script to add the char

                                      editor.appendText('a'+unichr(0x205f)+'z')
                                      

                                      and it looks like it matched as well.

                                      In regards to the time regex option, take your time, you don’t even have to waste your time doing it - if you find it useful, use it, otherwise chuck it into the bin. ;-)

                                      Cheers
                                      Claudia

                                      • Александр КорженевскийА
                                        Александр Корженевский
                                        last edited by

                                        Please explain what I need to do with the regexp
                                        so that Notepad++ handles Cyrillic characters in a URL?
                                        https://lh3.googleusercontent.com/-Rcx51vbIw0U/WGphx4PJ_MI/AAAAAAAAEV0/znXcaeFVKZE/s0/screenshot%25202017-01-02%2520001.jpg
                                        thanks in advance.
                                        sorry for the stupid question.
                                        :-)

                                        • Claudia FrankC
                                          Claudia Frank @Александр Корженевский
                                          last edited by

                                          @Александр-Корженевский

                                          You can’t do anything. It was just a discussion between guy038 and me about a possible new regex.
                                          An issue has been raised on GitHub, and now it is up to Don to decide whether it gets changed or not.
                                          Or, if you are familiar with C/C++ and Visual Studio, you could compile npp yourself with the changed regex.

                                          Cheers
                                          Claudia

                                          • Александр КорженевскийА
                                            Александр Корженевский
                                            last edited by

                                            I hope these corrections will be made
                                            Cheers
                                            Alexandr
