Unicode BLANK characters and the regexes \h, \v and \s



  • Hi, All,

    As I’m about to reply to someone about some regex explanations involving the \s syntax, I created, below, the definitive list of blank characters in Unicode 10.0, and the way to match these blank characters with the \h, \v and \s regexes in the Boost regex engine presently used in Notepad++


    The regex \h matches any of these 19 HORIZONTAL Blank characters, below :

    - TABULATION                   ( \x{0009} - \t )
    
    - SPACE                        ( \x{0020} )
    
    - NO BREAK SPACE               ( \x{00A0} )
    
    - OGHAM SPACE MARK             ( \x{1680} )
    
    - MONGOLIAN VOWEL SEPARATOR    ( \x{180E} )
    
    - EN QUAD                      ( \x{2000} )
    
    - EM QUAD                      ( \x{2001} )
    
    - EN SPACE                     ( \x{2002} )
    
    - EM SPACE                     ( \x{2003} )
    
    - THREE-PER-EM SPACE           ( \x{2004} )
    
    - FOUR-PER-EM SPACE            ( \x{2005} )
    
    - SIX-PER-EM SPACE             ( \x{2006} )
    
    - FIGURE SPACE                 ( \x{2007} )
    
    - PUNCTUATION SPACE            ( \x{2008} )
    
    - THIN SPACE                   ( \x{2009} )
    
    - HAIR SPACE                   ( \x{200A} )
    
    - NARROW NO-BREAK SPACE        ( \x{202F} )
    
    - MEDIUM MATHEMATICAL SPACE    ( \x{205F} )
    
    - IDEOGRAPHIC SPACE            ( \x{3000} )
    

    The regex \v matches any of these 7 VERTICAL Blank characters :

    - NEW LINE                     ( \x{000A} - \n )
    
    - VERTICAL TABULATION          ( \x{000B} )
    
    - FORM FEED                    ( \x{000C} - \f )
    
    - CARRIAGE RETURN              ( \x{000D} - \r )
    
    - NEXT LINE                    ( \x{0085} )
    
    - LINE SEPARATOR               ( \x{2028} )
    
    - PARAGRAPH SEPARATOR          ( \x{2029} )
    

    Finally, the regex \s matches any of the 26 Blank characters listed above ( the 19 horizontal ones and the 7 vertical ones )
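
    As a rough cross-check of the counts above, here is a minimal Python sketch. It is not the Boost engine used by Notepad++ : Python's re module has no \h and treats \v as the vertical tab only, so the sketch builds explicit character classes from the code points in the two lists. The names HORIZONTAL, VERTICAL, char_class, H_RE, V_RE and S_RE are illustrative only.

```python
import re

# Code points copied from the two lists above.
HORIZONTAL = [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
              0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
              0x2006, 0x2007, 0x2008, 0x2009, 0x200A,
              0x202F, 0x205F, 0x3000]
VERTICAL = [0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029]


def char_class(code_points):
    """Build an explicit character class of the form [\\uXXXX\\uXXXX...]."""
    return "[" + "".join("\\u%04X" % cp for cp in code_points) + "]"


# Stand-ins for what the posts say \h, \v and \s match in Boost.
H_RE = re.compile(char_class(HORIZONTAL))
V_RE = re.compile(char_class(VERTICAL))
S_RE = re.compile(char_class(HORIZONTAL + VERTICAL))

print(len(HORIZONTAL), len(VERTICAL), len(HORIZONTAL + VERTICAL))
# prints: 19 7 26
```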

    REMARK : The regex \s is equivalent to the regex (\h|\v), but is different from the regex [\h\v] !


    In practice, the regex \s mostly matches a single blank character from the list below :

    - TABULATION                   ( \x{0009} - \t )
    
    - SPACE                        ( \x{0020} )
    
    - NEW LINE                     ( \x{000A} - \n )
    
    - CARRIAGE RETURN              ( \x{000D} - \r )
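
    To check whether a text contains blank characters other than these four usual ones, something like the following Python sketch could be used. It only relies on the 26 code points listed in this post; the names COMMON, ALL_BLANKS and uncommon_blanks are hypothetical helpers, not part of Notepad++ or Boost.

```python
# The four blank characters most often met in practice.
COMMON = {"\t", " ", "\n", "\r"}

# All 26 blank characters from the lists in this post.
ALL_BLANKS = {chr(cp) for cp in [
    0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
    0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
    0x2006, 0x2007, 0x2008, 0x2009, 0x200A,
    0x202F, 0x205F, 0x3000,
    0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029,
]}


def uncommon_blanks(text):
    """Return (index, 'U+XXXX') for each blank outside the four usual ones."""
    return [(i, "U+%04X" % ord(ch))
            for i, ch in enumerate(text)
            if ch in ALL_BLANKS and ch not in COMMON]


print(uncommon_blanks("a\u00a0b c\u2028"))
# prints: [(1, 'U+00A0'), (5, 'U+2028')]
```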
    

    Best Regards,

    guy038



  • The regex \s is equivalent to the regex (\h|\v), but is different from the regex [\h\v] !

    Please explain.



  • Hello, @mapje71, and All,

    Ah yes, I apologize : I should have given more information in my last post !

    I simply noticed that the regex [\v] doesn’t match the same characters as the simple \v regex does :-((

    • The \v regex, as said in my first post, matches any of the 7 vertical blank characters

    • The [\v] regex matches ONLY the vertical tabulation control character ( VT ), of Unicode code point \x{000B} ( or \x{0B} or \x0B )
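
    The difference between the two bullets above can be sketched in Python. This is an emulation, not Boost : Python's re module always treats \v as the vertical tab, so both behaviours are written out as explicit classes. V_SHORTHAND and V_IN_CLASS are illustrative names.

```python
import re

# What \v matches outside a class, per the first post (7 characters).
V_SHORTHAND = re.compile("[\n\x0b\x0c\r\x85\u2028\u2029]")

# What [\v] matches, per the post above: the vertical tab only.
V_IN_CLASS = re.compile("[\x0b]")

# One line feed, one vertical tab and one line separator.
sample = "a\nb\x0bc\u2028d"

print(len(V_SHORTHAND.findall(sample)))  # prints: 3
print(len(V_IN_CLASS.findall(sample)))   # prints: 1
```

    This is why, following the remark in the first post, the alternation (\h|\v) is the form to use when the full set of 26 blank characters is wanted, rather than the class [\h\v].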


    I don’t know if it’s a bug in the Boost regex engine used by N++, or a normal restriction of this regex when used in a character class ! I should investigate on the http://www.regular-expressions.info/ site ;-))

    Cheers,

    guy038



  • Hi, @mapje71, and All,

    On the web page below :

    http://www.regular-expressions.info/refcharclass.html

    It is said that the regex [\v] adds the “vertical tab” control character (ASCII 0x0B) to the character class, without adding any other vertical whitespace, which is confirmed by the given example !

    So, seemingly, it’s a known restriction of the \v regex when used in a character class !

    Best Regards,

    guy038

