    notepad++ url processing cyrillic symbols

    • guy038

      Hi Claudia and All,

      Reminder :

      Unicode is organized into 17 planes, each composed of 65,536 code-points => 1,114,112 possible values ! Only SIX planes are defined. These are :

      - The BMP  ( BASIC MULTILINGUAL Plane )             =  Plane 0, from code-point   U+0000 to code-point   U+FFFF
      
      - The SMP  ( SUPPLEMENTARY MULTILINGUAL Plane )     =  Plane 1, from code-point  U+10000 to code-point  U+1FFFF
      
      - The SIP  ( SUPPLEMENTARY IDEOGRAPHIC Plane )      =  Plane 2, from code-point  U+20000 to code-point  U+2FFFF
      
      - The SSP  ( SUPPLEMENTARY SPECIAL-PURPOSE Plane )  = Plane 14, from code-point  U+E0000 to code-point  U+EFFFF
      
      - The SPUA-A ( SUPPLEMENTARY PRIVATE USE Area-A )   = Plane 15, from code-point  U+F0000 to code-point  U+FFFFF
      
      - The SPUA-B ( SUPPLEMENTARY PRIVATE USE Area-B )   = Plane 16, from code-point U+100000 to code-point U+10FFFF
      

      Up to now, even with the recent Unicode 9.0 version, all the other planes, from 3 to 13, are NOT used and all the corresponding code-points, from U+30000 to U+DFFFF, are NOT assigned, except for the last two code-points of each plane, which are designated as noncharacters.
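
The plane layout above can be checked with a little arithmetic. The following is a minimal Python sketch, not part of the original post: the names `PLANE_NAMES`, `plane_of` and `is_noncharacter_at_plane_end` are made up for illustration, and the labels SPUA-A / SPUA-B are the usual names for planes 15 and 16.

```python
# Named planes as listed above; planes 3 to 13 are left out on purpose,
# since the post describes them as unused (as of Unicode 9.0).
PLANE_NAMES = {
    0: "BMP",
    1: "SMP",
    2: "SIP",
    14: "SSP",
    15: "SPUA-A",
    16: "SPUA-B",
}

PLANE_SIZE = 0x10000  # 65,536 code points per plane
PLANE_COUNT = 17      # 17 * 65,536 = 1,114,112 code points in total


def plane_of(code_point):
    """Return the plane number (0..16) that contains the given code point."""
    if not 0 <= code_point < PLANE_COUNT * PLANE_SIZE:
        raise ValueError("not a Unicode code point: %#x" % code_point)
    return code_point // PLANE_SIZE


def is_noncharacter_at_plane_end(code_point):
    """True for the last two code points of a plane (U+xxFFFE, U+xxFFFF)."""
    return code_point % PLANE_SIZE >= 0xFFFE


if __name__ == "__main__":
    for cp in (0x205F, 0x10085, 0x22029, 0x102029):
        plane = plane_of(cp)
        print("U+%04X -> plane %d (%s)" % (cp, plane, PLANE_NAMES.get(plane, "unused")))
```

Dividing by 0x10000 is all that is needed, because every plane has exactly the same size.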

      So, Claudia :

      • From your first list : the range \x{0000}, \x{0009}… \x{205F}, \x{3000} ( 30 values )

      • From the second one : the values U+10085, U+12028, U+12029, U+20085, U+22028 and U+22029 ( 6 values )

      • From your last list : the range U+30085…U+102029 ( 42 values )

      I built a test file, containing all these characters, preceded by the letter a and followed by the letter z


      Then, I tried to determine every 3-character string aXz matched by the regex a\sz. After some tests, I can affirm that, in a file with UNICODE encoding, the \s regex matches any single character of the following list, and ONLY these :

      - TABULATION              ( \t )
      
      - NEW LINE                ( \n )
      
      - VERTICAL TABULATION     ( \x0B )
      
      - FORM FEED               ( \f )
      
      - CARRIAGE RETURN         ( \r )
      
      - SPACE                   ( \x20 )
      
      - NEXT LINE               ( \x85 )
      
      - NO BREAK SPACE          ( \xA0 )
      
      - OGHAM SPACE MARK        ( \x{1680} )
      
      - EN QUAD                 ( \x{2000} )
      
      - EM QUAD                 ( \x{2001} )
      
      - EN SPACE                ( \x{2002} )
      
      - EM SPACE                ( \x{2003} )
      
      - THREE-PER-EM SPACE      ( \x{2004} )
      
      - FOUR-PER-EM SPACE       ( \x{2005} )
      
      - SIX-PER-EM SPACE        ( \x{2006} )
      
      - FIGURE SPACE            ( \x{2007} )
      
      - PUNCTUATION SPACE       ( \x{2008} )
      
      - THIN SPACE              ( \x{2009} )
      
      - HAIR SPACE              ( \x{200A} )
      
      - LINE SEPARATOR          ( \x{2028} )
      
      - PARAGRAPH SEPARATOR     ( \x{2029} )
      
      - NARROW NO-BREAK SPACE   ( \x{202F} )
      
      - IDEOGRAPHIC SPACE       ( \x{3000} )
      

      And, except for the MEDIUM MATHEMATICAL SPACE ( \x{205F} ), which is NOT matched by the \s regex, this list is identical to the list of characters that the UNICODE Consortium considers as White_Space characters. Refer to the link below :

      http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
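
The same kind of aXz test can be scripted. The sketch below uses Python 3's own `re` module, which is NOT the Boost engine behind Notepad++, so it only illustrates the method and cannot confirm or refute the list above. The names `CANDIDATES` and `matched_by_backslash_s` are made up for this example.

```python
import re

# Code points from the list above, plus U+205F (MEDIUM MATHEMATICAL SPACE).
CANDIDATES = (
    [0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20, 0x85, 0xA0, 0x1680]
    + list(range(0x2000, 0x200B))          # U+2000 .. U+200A
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)


def matched_by_backslash_s(code_points):
    """Return the code points X for which the string aXz matches a\\sz."""
    pattern = re.compile(r"a\sz")
    return [cp for cp in code_points if pattern.fullmatch("a" + chr(cp) + "z")]


if __name__ == "__main__":
    for cp in matched_by_backslash_s(CANDIDATES):
        print("U+%04X" % cp)
```

Python's engine treats U+205F as whitespace, so it appears in the output here; engines differ on exactly this kind of detail, which is why the test has to be run inside Notepad++ itself to be conclusive.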


      UPDATE on 02-17-2018 : for the definitive list of Unicode BLANK characters, just look at the topic below :

      https://notepad-plus-plus.org/community/topic/15279/unicode-blank-characters-and-the-regexes-h-v-and-s/1


      Finally, as most of these “White_Space” characters are quite exotic and very rarely used in normal writing, the idea of using the \s syntax in a look-ahead, as the limit of an Internet address, seems quite pertinent !


      Claudia, the new regex, to determine all the contents of an address, could, also, be written :

      (?-s)[A-Za-z][A-Za-z0-9+.-]+://.*?(?=\s|\z)

      Indeed, the case (?=\s) always applies, except when an Internet address ends the last line of a file, without any line-break ! And this specific case is matched by the second alternative, (?=\z) ;-)
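
To see both branches of the look-ahead at work, here is a small Python sketch. It is a port, not the Notepad++ behaviour itself: Python's `re` does not accept a bare `(?-s)`, so it is dropped (the dot already excludes line breaks by default), and `\z` is written `\Z`. The name `find_urls` and the sample addresses are made up for illustration.

```python
import re

# Python port of the regex above. "(?-s)" is omitted because "." does not
# match a newline unless re.DOTALL is set, and "\z" is spelled "\Z".
URL_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]+://.*?(?=\s|\Z)")


def find_urls(text):
    """Return every address found in text, each one stopping at whitespace
    or at the very end of the text."""
    return URL_RE.findall(text)


if __name__ == "__main__":
    # The first address is followed by a space, the second one ends the
    # text without any line-break.
    sample = "go to http://пример.рф/путь now\nand https://example.org/a?b=c"
    for url in find_urls(sample):
        print(url)
```

The first address is delimited by the `\s` branch and the second one by the end-of-text branch, which is the case described above.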

      Best Regards,

      guy038

      P.S. :

      Claudia, I haven’t found any spare time, yet, to have a look at your new version of the RegexTexter script, with the Time regex test option. Just be patient for a couple of days :-)

      • Claudia Frank

        Hi Guy,

        thank you for researching this and for confirming the test.
        But I don’t get the same result for \x{205F}.

        I used a Python script to add the character:

        editor.appendText('a'+unichr(0x205f)+'z')
        

        and it looks like it matched as well.

        In regards to the time regex option, take your time, you don’t even have to waste your time doing it - if you find it useful, use it, otherwise chuck it into the bin. ;-)

        Cheers
        Claudia

        • Александр Корженевский

          Please explain what I need to do with the regexp
          so that Notepad++ handles Cyrillic characters in a URL.
          https://lh3.googleusercontent.com/-Rcx51vbIw0U/WGphx4PJ_MI/AAAAAAAAEV0/znXcaeFVKZE/s0/screenshot%25202017-01-02%2520001.jpg
          Thanks in advance.
          Sorry for the stupid question. :-)

          • Claudia Frank @Александр Корженевский

            @Александр-Корженевский

            There is nothing you can do at the moment. It was just a discussion between guy038 and me about a possible new regex.
            An issue has been opened on GitHub, and now it is up to Don to decide whether it gets changed or not.
            Or, if you are familiar with C/C++ and Visual Studio, you could compile npp yourself with the changed regex.

            Cheers
            Claudia

            • Александр Корженевский

              I hope these corrections will be made
              Cheers
              Alexandr

              • Александр Корженевский

                Please give instructions on how to compile Notepad++ with support for Cyrillic symbols in URL processing.
                Thanks in advance.

                • Claudia Frank @Александр Корженевский

                  @Александр-Корженевский

                  How to build Notepad++ is described here. Please use Visual Studio 2015 or 2017, as a recent commit changed the required version.
                  In the …\notepad-plus-plus\PowerEditor\src\Notepad_plus.h source file you need to replace

                  #define URL_REG_EXPR "[A-Za-z]+://[A-Za-z0-9_\\-\\+~.:?&@=/%#,;\\{\\}\\(\\)\\[\\]\\|\\*\\!\\\\]+"
                  

                  with a different regex, like the one from here. Make sure you do proper escaping.
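
“Proper escaping” here means that the regex ends up inside a C/C++ string literal, so every backslash has to be doubled and every double quote escaped. As a rough illustration, this Python sketch produces such a line; the helper name `to_c_define` is made up, and the sample input is the regex guy038 proposed earlier in this thread, not necessarily the one behind the link above.

```python
def to_c_define(name, regex):
    """Render a regex as a C/C++ #define holding a double-quoted string."""
    # Double the backslashes first, so that the backslashes added in front
    # of the double quotes are not doubled again afterwards.
    escaped = regex.replace("\\", "\\\\").replace('"', '\\"')
    return '#define %s "%s"' % (name, escaped)


if __name__ == "__main__":
    # Raw string: the regex exactly as it would be typed in the Find dialog.
    print(to_c_define("URL_REG_EXPR", r"(?-s)[A-Za-z][A-Za-z0-9+.-]+://.*?(?=\s|\z)"))
```

Only the backslashes change in this example: \s and \z become \\s and \\z in the header file.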

                  So the steps needed are

                  1. Install Visual Studio 2015 or 2017 and the SDK (Software Development Kit)
                  2. Install git
                  3. Clone the repo from https://github.com/notepad-plus-plus/notepad-plus-plus.git
                  4. Modify the Notepad_plus.h file using Visual Studio
                  5. Follow the instructions on the GitHub page to compile npp
                  6. Copy the scilexer.dll from an official distribution (otherwise the integrity check will fail)
                  7. Cross fingers.

                  Hope I didn’t forget anything.

                  Cheers
                  Claudia

                  • Александр Корженевский

                    Please tell me the exact line, ready for replacement,
                    so that Notepad++ accepts Russian characters in a URL.
                    Sorry for the stupid question. :-)
                    Why can’t the creators add fixes to the code for all?

                    • Claudia Frank @Александр Корженевский

                      @Александр-Корженевский

                      Open the file Notepad_plus.h and change the following line :

                      //#define URL_REG_EXPR "[A-Za-z]+://[A-Za-z0-9_\\-\\+~.:?&@=/%#,;\\{\\}\\(\\)\\[\\]\\|\\*\\!\\\\]+"
                      #define URL_REG_EXPR "(?-s)[A-Za-z][A-Za-z0-9+.-]+://[^\\s]+?(?=\\s|\\z)"
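
To illustrate the difference between the two definitions, here is a small Python comparison. It is only a sketch: Python's `re` is not the engine used by Notepad++, the C string escaping has been undone, `(?-s)` is omitted and `\z` is written `\Z`. The sample sentence and the names `OLD_URL_RE`, `NEW_URL_RE` and `first_match` are made up for this example.

```python
import re

# Old definition: only the listed ASCII characters may follow "://".
OLD_URL_RE = re.compile(
    r"[A-Za-z]+://[A-Za-z0-9_\-\+~.:?&@=/%#,;\{\}\(\)\[\]\|\*\!\\]+"
)
# New definition: anything that is not whitespace may follow "://".
NEW_URL_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]+://[^\s]+?(?=\s|\Z)")

# Hypothetical sample text containing a URL with Cyrillic characters.
SAMPLE = "see https://ru.wikipedia.org/wiki/Россия for details"


def first_match(pattern, text):
    """Return the first substring matched by pattern, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    print("old:", first_match(OLD_URL_RE, SAMPLE))
    print("new:", first_match(NEW_URL_RE, SAMPLE))
```

The old pattern stops right before the first Cyrillic letter, whereas the new one runs up to the next whitespace.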
                      

                      Why can’t the creators add fixes to the code for all?

                      It is still only an open issue, so, as long as no one makes a proper pull request,
                      there is little chance that it gets implemented. Unfortunately, my working agreements
                      do not allow me to share code on GitHub, SourceForge …, so I can’t do it, at least
                      for the moment.

                      Cheers
                      Claudia

                      • Александр Корженевский

                        Maybe the developers can make the correction?
                        What about moving the definition of this regexp to a config file?
                        Then anybody who needs to could change it without recompiling!
                        And update the FAQ on how to add support for national symbols to URL recognition.
                        I very much hope that the correction will be made.
