    Unicode BLANK characters and the regexes \h , \v and \s

    • guy038

      Hi, All,

      As I’m about to reply to someone about some regex explanations involving the \s syntax, I have compiled, below, the definitive list of blank characters in Unicode 10.0, and the way to match these blank characters with the \h, \v and \s regexes in the Boost regex engine currently used by Notepad++


      The regex \h matches any of the 19 HORIZONTAL blank characters below :

      - TABULATION                   ( \x{0009} - \t )
      
      - SPACE                        ( \x{0020} )
      
      - NO BREAK SPACE               ( \x{00A0} )
      
      - OGHAM SPACE MARK             ( \x{1680} )
      
      - MONGOLIAN VOWEL SEPARATOR    ( \x{180E} )
      
      - EN QUAD                      ( \x{2000} )
      
      - EM QUAD                      ( \x{2001} )
      
      - EN SPACE                     ( \x{2002} )
      
      - EM SPACE                     ( \x{2003} )
      
      - THREE-PER-EM SPACE           ( \x{2004} )
      
      - FOUR-PER-EM SPACE            ( \x{2005} )
      
      - SIX-PER-EM SPACE             ( \x{2006} )
      
      - FIGURE SPACE                 ( \x{2007} )
      
      - PUNCTUATION SPACE            ( \x{2008} )
      
      - THIN SPACE                   ( \x{2009} )
      
      - HAIR SPACE                   ( \x{200A} )
      
      - NARROW NO-BREAK SPACE        ( \x{202F} )
      
      - MEDIUM MATHEMATICAL SPACE    ( \x{205F} )
      
      - IDEOGRAPHIC SPACE            ( \x{3000} )
      

      The regex \v matches any of the 7 VERTICAL blank characters below :

      - NEW LINE                     ( \x{000A} - \n )
      
      - VERTICAL TABULATION          ( \x{000B} )
      
      - FORM FEED                    ( \x{000C} - \f )
      
      - CARRIAGE RETURN              ( \x{000D} - \r )
      
      - NEXT LINE                    ( \x{0085} )
      
      - LINE SEPARATOR               ( \x{2028} )
      
      - PARAGRAPH SEPARATOR          ( \x{2029} )
      

      Finally, the regex \s matches any of the 26 blank characters listed above ( the 19 horizontal ones plus the 7 vertical ones )
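
As a quick cross-check of the counts, here is a minimal sketch in Python rather than in Notepad++ itself. Python's `re` module has no \h escape, so the sketch spells out each code point from the two lists above in an explicit character class. The names `HORIZONTAL`, `VERTICAL` and `make_class` are made up for this illustration, and the resulting classes only mimic what the post reports for the Boost engine.

```python
import re

# Code points copied from the two lists in the post.
HORIZONTAL = [
    0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
    0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
    0x2006, 0x2007, 0x2008, 0x2009, 0x200A,
    0x202F, 0x205F, 0x3000,
]
VERTICAL = [0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029]


def make_class(code_points):
    """Build a regex character class that lists each code point explicitly."""
    return "[" + "".join(f"\\u{cp:04X}" for cp in code_points) + "]"


# Explicit stand-ins for \h, \v and \s as described in the post
# (an assumption of this sketch, not the Boost implementation).
H_CLASS = re.compile(make_class(HORIZONTAL))
V_CLASS = re.compile(make_class(VERTICAL))
S_CLASS = re.compile(make_class(HORIZONTAL + VERTICAL))

# Print the three counts: horizontal, vertical, and both together.
print(len(HORIZONTAL), len(VERTICAL), len(HORIZONTAL + VERTICAL))
```

Listing the code points one by one avoids relying on any engine-specific shorthand, at the cost of a longer pattern.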

      REMARK : The regex \s is equivalent to the regex (\h|\v), but is different from the regex [\h\v] !


      In practice, the regex \s mostly matches a single blank character from the list below :

      - TABULATION                   ( \x{0009} - \t )
      
      - SPACE                        ( \x{0020} )
      
      - NEW LINE                     ( \x{000A} - \n )
      
      - CARRIAGE RETURN              ( \x{000D} - \r )
      

      Best Regards,

      guy038

      • MAPJe71

        The regex \s is equivalent to the regex (\h|\v), but is different from the regex [\h\v] !

        Please explain.

        • guy038

          Hello, @mapje71, and All,

          Ah yes, I apologize : I should have given additional information in my last post !

          I simply noticed that the regex [\v] doesn’t match the same characters as the simple \v regex does :-((

          • The \v regex, as said in my first post, matches any of the 7 vertical blank characters

          • The [\v] regex matches ONLY the vertical tabulation control character ( VT ), whose Unicode code point is \x{000B} ( or \x{0B} or \x0B )


          I don’t know if it’s a bug in the Boost regex engine used by N++, or if it’s a normal restriction on \v when it is used in a character class ! I should look into it on the http://www.regular-expressions.info/ site ;-))

          Cheers,

          guy038

          • guy038

            Hi, @mapje71, and All,

            On the web page below :

            http://www.regular-expressions.info/refcharclass.html

            It is said that the regex [\v] adds the “vertical tab” control character (ASCII 0x0B) to the character class, without adding any other vertical whitespace, which is confirmed by the given example !

            So, seemingly, it’s a known restriction of the \v regex when it is used inside a character class !
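
One possible way around this restriction is to list the seven vertical characters explicitly inside the class. The sketch below illustrates the idea in Python, where \v always means only the vertical tab, so it reproduces the narrow [\v] meaning discussed above. The names `VT_ONLY` and `ALL_VERTICAL` are hypothetical. A Notepad++ pattern along the lines of [\x{000A}-\x{000D}\x{0085}\x{2028}\x{2029}] would be the analogous form, but that is an untested assumption here.

```python
import re

# In Python's re, \v denotes only the vertical tab (U+000B), which mirrors
# the behaviour of [\v] described in the thread.
VT_ONLY = re.compile(r"[\v]")

# Hypothetical workaround: spell out all seven vertical characters.
# The range \n-\r covers U+000A..U+000D (LF, VT, FF, CR).
ALL_VERTICAL = re.compile(r"[\n-\r\x85\u2028\u2029]")

sample = "a\nb\x0bc\u2028d"

# Only the vertical tab is found by the narrow class.
print(VT_ONLY.findall(sample))

# The explicit class finds the line feed, the vertical tab and U+2028.
print(len(ALL_VERTICAL.findall(sample)))
```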

            Best Regards,

            guy038
