    Columns++ version 1.2: better Unicode search

    Notepad++ & Plugin Development
    • Coises @guy038

      @guy038 said in Columns++ version 1.2: better Unicode search:

      It’s just the start of the cycle among the matches which is different !

      While testing things, I kept making the mistake of placing the caret just before something I wanted to check, then opening search, clicking Find (not noticing that it said Find First and not Find Next) and having it bounce to the start of the document. I figured if it’s counter-intuitive to me, it’s surely surprising to everyone else. Losing one’s place in a large document seems much more annoying than having to press Ctrl+Home if you want to start from the beginning, so I figured this to be a change that will do more good than harm.

      Now, may I request one useful improvement ? The font used in the two drop-down lists Find what : and Replace with is visibly a proportional font. To be convinced of this fact, enter the string WWWWWIIIII in the Find what zone !

      To my mind, it would be nice, as within Notepad++, to be able to choose a mono-spaced font instead ( maybe as an option ! ).

      A second possibility would be to allow the selection, from a drop-down list, of any of the installed fonts.

      A third possibility would be to have an option, in the dialog, to enlarge these two zones, temporarily or not. I suppose that this last solution would be more difficult to implement !

      All good ideas. I hadn’t thought about the monospaced font. (I forgot that Notepad++ has that option — I remember that I liked it except that it takes up more space, so I can see less of what I’ve typed without making the dialog obscure even more of the document.)

      A thought I’ve had for some time is to have a button that opens a second dialog, or an extended “pane” attached to the search dialog, that’s just for entering a regular expression or a replacement. My “vision” (and it’s only that — I’ve done no coding or even a mock-up yet) is that the expression entry areas would be Scintilla controls which would, at least by default, reflect the font and size used in the document; they could contain multiple lines and possibly have appropriate syntax highlighting. Ideally there would be some kind of a “builder” to help people who are less familiar with regular expressions know what they can enter (escapes, class names, symbolic character names, quantifiers — and those formulas I process in the replacement), and an area where users could save frequently-used expressions.

      I’ve also wondered if search should be a dockable panel — so results of a find don’t get hidden behind the dialog, which I find annoying. Dockable dialogs are kind of strange, though, and from what I’ve seen (I’m still learning), some of the control one has with an ordinary dialog is lost when it becomes dockable (for example, height and width constraints don’t seem to work, even when the dialog is undocked).

      Either of those ideas gets so far from the nominal purposes of Columns++, though, that it would really be time to make a separate plugin. (Yes, @Alan-Kilborn, hoping someday it could be part of the main program. But far less “aggressive” changes have caused consternation when made to Notepad++; at the least, I think anything so dramatic should have a considerable test period to demonstrate its value and stability before I would dream of suggesting it as a replacement for existing functionality.)

    • guy038

        Hi, @Coises and All,

        Luckily, I do not need the Microsoft magnifier in my everyday work on my Windows-10 laptop !! But sometimes, as your search dialog font seems a bit small, it helped me to see clearly which kind of regex I had typed in, during the tests of your experimental versions. For the present post, however, I just use the N++ default zoom !

        Note that regular expressions use a lot of chars that are not easy to distinguish, like the . char, the ( and ) chars, the [ and ] chars, the { and } chars, and so on…, which look very thin in the present proportional font !

        So, whatever you plan to do in the future, regarding my request, it should be better than the present situation. No doubt about it !

        Best Regards,

        guy038

        • Alan Kilborn @Coises

          @Coises said:

          it seems it would really be time to make a separate plugin

          I would go so far as to suggest something that looks and operates like the Notepad++ Find dialog and its tabs.
          Then someone using the plugin would not have to learn anything new, and would feel “right at home”.
          You know why I suggest this, right? :-)

          • Coises @Alan Kilborn

            @Alan-Kilborn said in Columns++ version 1.2: better Unicode search:

            I would go so far as to suggest something that looks and operates like the Notepad++ Find dialog and its tabs.
            Then someone using the plugin would not have to learn anything new, and would feel “right at home”.
            You know why I suggest this, right? :-)

            I think I do, but to be honest, if and when I take on such a project, non-trivial user interface changes would be the whole point. Given that, I’m not sure I’d want to tie myself to recreating a legacy user interface and using it as an underlying model. Familiarity would be a plus, but I am unlikely to impose it on myself as a constraint.

            This is all far enough down the road that someone else might well get to it before I do, anyway. I have at least two other self-assignments that would come first, and that’s just in the realm of computer programming.

            • guy038

              Hello, @coises and All,

              Refer to this FAQ that I’ve just updated with references to your last Columns++-1.2 release :

              https://community.notepad-plus-plus.org/topic/15765/faq-where-to-find-regular-expressions-regex-documentation

              Best Regards,

              guy038

              • Coises @guy038

                @guy038 said in Columns++ version 1.2: better Unicode search:

                Refer to this FAQ that I’ve just updated with references to your last Columns++-1.2 release :

                https://community.notepad-plus-plus.org/topic/15765/faq-where-to-find-regular-expressions-regex-documentation

                Thank you for mentioning Columns++. Might I suggest a couple things?

                • Columns++ version 1.2 is available using Plugins Admin in Notepad++ 8.7.8; there is no need to do the more complicated installation procedure.

                • I fear that your FAQ might make some readers expect that the search in Columns++ is a replacement for Notepad++ search, when it really isn’t. There are many things Notepad++ search can do (finding in all open files, finding and replacing in multiple files, etc.) that Columns++ search does not do and almost certainly never will. Its original, and still primary, reason for existence is to make it possible to find and replace within a rectangular selection — something Notepad++ search cannot do. There is also the extension of using mathematical formulas in replacements. I would recommend perhaps a link to the online help file sections about Search and Regular Expressions to clarify when Columns++ might be useful.

                • It might be unclear that while the progress dialog change applies to all Count, Select and Replace All actions in Columns++ search, the other changes you mentioned apply only to searches in Unicode files. Searches in “ANSI” files work the same as always. (The ability to search in regions based on a rectangular or multiple selection also applies to all searches, and the ability to use formulas in replacements applies to all regular expression searches.)

                • guy038

                  Hello, @coises and All,

                  You said :

                  Columns++ version 1.2 is available using Plugins Admin in Notepad++ 8.7.8; there is no need to do the more complicated installation procedure.

                  Well, I updated the regex documentation with the N++ release v8.7.6 and, at that time, Columns++ did not seem to be in the plugins list !?


                  You said :

                  I fear that your FAQ might make some readers expect that the search in Columns++ is a replacement for Notepad++ search, when it really isn’t.

                  I agree that I did not present your plugin the right way. So, I did some modifications and I hope you’ll agree with the new phrasing !


                  You said :

                  … the other changes you mentioned apply only to searches in Unicode files. Searches in “ANSI” files work the same as always.

                  Well, your assertion is a bit paradoxical, regarding the title of this post ! Indeed, your title says :

                  Columns++ version 1.2: better Unicode search !!

                  And anyway, against an ANSI file, any search for a UNICODE property triggers an Invalid Regex message ! So, the benefit of this improved version is not so obvious for ANSI files. However, I did add a mention which clearly says that search and replace are correct with ANSI files, too.

                  However, I noticed an odd thing :

                  • Write these five characters ,¼½¾, in a new UTF-8 tab

                  • Ask Columns++ to select all the punctuation characters with the [[:punct:]] regex

                  => It correctly finds the two commas only, as the fractions have the UNICODE \p{other Number} property and are not punctuation chars

                  • Now, convert this UTF-8 file to an ANSI file, with the Encoding > Convert to ANSI option

                  • Re-try the [[:punct:]] regex against this now-ANSI file

                  => This time, the five characters are selected !?

                  If you try the \p{other Number} regex, it returns, as expected, an error message !
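As an aside, the Unicode classification behind this result can be checked independently with Python's unicodedata module (not something Columns++ itself uses; just a quick sanity check):

```python
import unicodedata

# ',' is general category Po (punctuation, other);
# the vulgar fractions are No (Number, other), hence not punctuation.
for ch in ',¼½¾':
    print(ch, unicodedata.category(ch))
```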


                  In your documentation, regarding your last sequence [[.x80.]]–[[.xff.]], right before the Search file section :

                  At first sight, it’s not a class of characters and it seems to be an invalid regular expression. Surprisingly, it’s not ! Actually, it’s a sequence of three consecutive characters :

                  • An invalid \x80 UTF-8 byte

                  • An EN DASH character ( \x{2013} )

                  • An invalid \xff UTF-8 byte

                  So, @Coises, just modify this regex as [[.x80.]-[.xff.]], with a Hyphen-Minus character, which, indeed, finds any invalid UTF-8 character !
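A quick way to see the difference between the two dashes is to check the code points involved; only HYPHEN-MINUS acts as the range operator inside a character class, while EN DASH is just a literal:

```python
# EN DASH is a literal character inside a class; HYPHEN-MINUS is the
# range operator, so only the hyphen-minus version expresses a range.
print(hex(ord('-')))   # 0x2d    HYPHEN-MINUS
print(hex(ord('–')))   # 0x2013  EN DASH
```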

                  BTW, I like your two buttons at the bottom right of your documentation ( txt- TXT+ ), which allow us to zoom in or out. It will surely help a lot of people !


                  Best Regards

                  guy038

                  P.S. :

                  In the first of my three consecutive posts, which ended my testing period of your plugin ( https://community.notepad-plus-plus.org/post/100087 ), I wrote :

                  \p{Ascii} = (?s)\o => 128 when applied against my Total_Chars.txt file ! Now, I understand that the (?s) modifier does not change anything for the Count results. Indeed, the (?s) or (?-s) modifiers are ONLY needed if there is, at least, one . regex character in the entire regex !

                  So, if we want to omit the \r and \n in the above regex, we must use the (?![\r\n])\p{Ascii} or the (?![\r\n])\o syntax, which correctly return 126 matches

                  Note that this is only true for a NON-ANSI file. For an ANSI file :

                  • The regex (?![\r\n])\p{Ascii} is invalid, as explained above.

                  • The regex (?![\r\n])\o does work but returns just one match : the lower-case letter o !! ( the Match case option was set )
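The 128-versus-126 counts for a Unicode file can be reproduced with any lookahead-capable engine. Here is a quick sketch using Python's re module, with the class [\x00-\x7f] standing in for \p{Ascii} (which Python's re does not support):

```python
import re

# A string holding all 128 ASCII characters:
ascii_chars = ''.join(chr(i) for i in range(128))

# 128 matches without the lookahead, 126 once \r and \n are excluded:
print(len(re.findall(r'[\x00-\x7f]', ascii_chars)))            # 128
print(len(re.findall(r'(?![\r\n])[\x00-\x7f]', ascii_chars)))  # 126
```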

                  • Coises @guy038

                    @guy038 said in Columns++ version 1.2: better Unicode search:

                    Well, I updated the regex documentation with the N++ release v8.7.6 and, at that time, Columns++ did not seem to be in the plugins list !?

                    The previous “stable” version was there, but I got the pull request to update to version 1.2 in just barely in time for it to make it into Notepad++ 8.7.8.

                    I did some modifications and I hope you’ll agree with the new phrasing !

                    Thank you. I like that. I just didn’t want people to install it and then be disappointed that it’s no help if they want to use one of the many features of Notepad++ search that Columns++ does not attempt to replicate.

                    However, I noticed an odd thing :

                    • Write these five characters ,¼½¾, in a new UTF-8 tab

                    • Ask Columns++ to select all the punctuation characters with the [[:punct:]] regex

                    => It correctly finds the two commas only, as the fractions have the UNICODE \p{other Number} property and are not punctuation chars

                    • Now, convert this UTF-8 file to an ANSI file, with the Encoding > Convert to ANSI option

                    • Re-try the [[:punct:]] regex against this now-ANSI file

                    => This time, the five characters are selected !?

                    Yes, that is something I don’t like about my own work: there are now inconsistencies between ANSI and UTF-8, because I changed nothing about ANSI regular expressions. For example, (?i)\u still matches all alphabetic characters in ANSI files. (For obscure technical reasons involving C++ template specialization and how Boost::regex is implemented, it may prove to be more difficult to make the corresponding changes to ANSI than it was to make them to Unicode. So far, I haven’t even tried.)

                    In your documentation, regarding your last sequence [[.x80.]]–[[.xff.]], right before the Search file section :

                    At first sight, it’s not a class of characters and it seems to be an invalid regular expression. Surprisingly, it’s not ! Actually, it’s a sequence of three consecutive characters

                    It looks like I failed to convey what I meant in that entry. What I was trying to say was that you can use [[.xhh.]] as a symbolic character reference to find an invalid byte; so that, for example, [[.xB2.]] will find any byte 0xB2 that is part of an invalid UTF-8 sequence. (There is no way to isolate bytes 0xB2 that are parts of valid UTF-8 sequences, though; for that, you’d have to reinterpret as — not convert to — ANSI.) I added those as I was updating the documentation, because I thought it was less confusing than telling people they could use expressions like \x{DCB2} to find specific invalid bytes. This mirrors how control and invisible characters have symbolic names that match the way Scintilla displays them.
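The \x{DCB2} notation mirrors a convention used elsewhere as well: Python's surrogateescape error handler, for example, maps an invalid byte 0xB2 to the lone surrogate U+DCB2 when decoding. This is shown only as an analogy, not as how Columns++ is implemented internally:

```python
# 0xB2 on its own is not valid UTF-8; surrogateescape keeps the byte
# recoverable by mapping it to the lone surrogate U+DCB2.
raw = b'ok \xb2 end'
text = raw.decode('utf-8', errors='surrogateescape')
print(hex(ord(text[3])))  # 0xdcb2

# The mapping round-trips back to the original bytes:
assert raw == text.encode('utf-8', errors='surrogateescape')
```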

                    • guy038

                      Hi, @coises,

                      You said :

                      It looks like I failed to convey what I meant in that entry

                      Ah…, now I understand what you meant ! Thus, maybe the two following entries would convey what you expected :

                      •-----------------------------•-----------•------------------------------------------------------•
                      | From [[.x00.]] to [[.xff.]] | [[.x##.]] | The invalid UTF-8 byte [[.x##.]]                     |
                      | [[.x80.]-[.xff.]]           |           | Any invalid UTF-8 byte                               | 
                      •-----------------------------•-----------•------------------------------------------------------•
                      

                      Like you, I’m a bit bothered by the differences in behavior of your Columns++ plugin between ANSI and UNICODE files. So, I will do additional tests to narrow down where these differences occur ! Like my Total_Chars.txt UNICODE file, I’ll create, for this purpose, an ANSI file containing the 256 characters of the Windows-1252 encoding !

                      https://en.wikipedia.org/wiki/Windows-1252

                      See you later,

                      BR

                      guy038

                      • guy038

                        Hello, @coises and All,

                        I’ve decided to use the same layout to describe results with an ANSI file as I did for results with a UNICODE file. This description will spread over two posts !

                        So, I first created this ANSI file, named Total_ANSI.txt :

                        •---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------•
                        |     Range     |  Description    |   Status   |  COUNT / MARK of ALL chars  |  # Chars  |  ANSI Encoding  |  # Bytes  |
                        •---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------•
                        |  0000 - 007F  |  PLANE 0 - BMP  |  Included  |  [\x00-\x7F]                |      128  |                 |      128  |
                        |               |                 |            |                             |           |     1 Byte      |           |
                        |  0080 - 00FF  |  PLANE 0 - BMP  |  Included  |  [\x80-\xFF]                |      128  |                 |      128  |
                        •---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------•
                        

                        Against this file, the following results are correct :

                            [\x00-\xFF]    =>  256 chars, coded with one byte = TOTAL of characters
                        
                            [[:unicode:]]  =>    0 char                       = Total chars OVER \x{00FF}
                        

                        I tried some expressions with look-aheads and look-behinds, containing overlapping zones !

                        For instance, against this text aaaabaaababbbaabbabb, pasted in a new ANSI tab with a final line-break, all the regexes below give the correct number of matches :

                        ba*(?=a)   =>  4 matches
                        ba*(?!a)   =>  9 matches
                        ba*(?=b)   =>  8 matches
                        ba*(?!b)   =>  5 matches
                        
                        (?<=a)ba*  =>  5 matches
                        (?<!b)ba*  =>  5 matches
                        
                        (?<=b)ba*  =>  4 matches
                        (?<!a)ba*  =>  4 matches
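These counts are not specific to Boost::regex; the same eight patterns give the same totals in, for example, Python's re module (a cross-check with the text alone, since the final line-break does not change any of the counts):

```python
import re

text = 'aaaabaaababbbaabbabb'
patterns = [r'ba*(?=a)', r'ba*(?!a)', r'ba*(?=b)', r'ba*(?!b)',
            r'(?<=a)ba*', r'(?<!b)ba*', r'(?<=b)ba*', r'(?<!a)ba*']
for pat in patterns:
    # findall counts non-overlapping matches, like Count in the dialog
    print(pat, '=>', len(re.findall(pat, text)), 'matches')
```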
                        

                        But, on the other hand, the search of the regex :

                        [[.NUL.][.SOH.][.STX.][.ETX.][.EOT.][.ENQ.][.ACK.][.BEL.][.BS.][.HT.][.VT.][.FF.][.SO.][.SI.][.DLE.][.DC1.][.DC2.][.DC3.][.DC4.][.NAK.][.SYN.][.ETB.][.CAN.][.EM.][.SUB.][.ESC.][.FS.][.GS.][.RS.][.US.][.DEL.][.HOP.][.RI.][.SS3.][.DCS.][.OCS.][.SHY.]]

                        leads to an Invalid Regex message. Logical, as this kind of search concerns Unicode files only.


                        Now, against the Total_ANSI.txt file, all the following results are correct :

                        (?s).        =  [\x00-\xFF]      =>  256     Total =  256
                        
                        (?-s).       =  [^\x0A\x0C\x0D]  =>  253
                        
                        
                        \p{Unicode}  =  [[:Unicode:]]    =>    0  |
                                                                  |  Total =  256
                        \P{Unicode}  =  [[:^Unicode:]]   =>  256  |
                        
                        
                        \X                               =>  256  |
                                                                  |  Total =  256
                        (?!\X).                          =>    0  |
                        

                        Here are the correct results, concerning all the Posix character classes, against the Total_ANSI.txt file

                        [[:unicode:]]  =  \p{unicode}                                     an OVER  \x{00FF}         character        0  =  [^\x00-\xFF]
                        
                        [[:space:]]   =  \p{space}  =  [[:s:]]  =  \p{s}  =  \ps  =  \s   a             WHITE-SPACE character        7  =  [\t\n\x0B\f\r\x20\xA0]
                                                       [[:h:]]  =  \p{h}  =  \ph  =  \h   an HORIZONTAL white space character        3  =  [\t\x20\xA0]
                        [[:blank:]]   =  \p{blank}                                        a  BLANK                  character        3  =  [\t\x20\xA0]
                                                       [[:v:]]  =  \p{v}  =  \pv  =  \v   a  VERTICAL   white space character        4  =  [\n\x0B\f\r]
                        
                        [[:cntrl:]]   =  \p{cntrl}                                        a  CONTROL code           character       39  =  [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D\xAD]
                        
                        [[:upper:]]   =  \p{upper}  =  [[:u:]]  =  \p{u}  =  \pu  =  \u   an  UPPER case    letter                  60  =  [A-ZŠŒŽÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞß]
                        [[:lower:]]   =  \p{lower}  =  [[:l:]]  =  \p{l}  =  \pl  =  \l   a   LOWER case    letter                  65  =  [a-zƒšœžŸªµºàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]
                        [[:digit:]]   =  \p{digit}  =  [[:d:]]  =  \p{d}  =  \pd   = \d   a   DECIMAL       number                  13  =  [0-9²³¹]
                         _            =  \x{005F}                                         the LOW_LINE      character                1
                                                                                                                                  -------
                        [[:word:]]    =  \p{word}   =  [[:w:]]  =  \p{w}  =  \pw   = \w   a   WORD                  character      139  =  [[:alnum:]]|\x5F  =  \p{alnum}|\x5F
                        
                        
                        (?i)[[:upper:]]  = (?i)[[:lower:]]                                a   LETTER, whatever its CASE            125  =  (?-i)[[:upper:][:lower:]]
                        
                        
                        [[:alnum:]]   =  \p{alnum}                                        an  ALPHANUMERIC          character      138  =  (?-i)[[:upper:][:lower:][:digit:]]
                        
                        [[:alpha:]]   =  \p{alpha}                                        any LETTER                character      125  =  (?-i)[[:upper:][:lower:]]
                        
                        
                        [[:graph:]]   =  \p{graph}                                        any VISIBLE               character      212  =  [^\x00-\x1F\x20\x7F\x80\x81\x88\x8D\x8F\x90\x98\x99\x9D\xA0]
                        
                        [[:print:]]   =  \p{print}                                        any PRINTABLE             character      219  =  [[:graph:]]|\s
                        
                        
                        [[:punct:]]   =  \p{punct}                                        any PUNCTUATION or SYMBOL character       80  =  [\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E\x82\x84-\x87\x89\x8B\x91-\x97\x9B\xA1-\xBF\xD7\xF7]
                        
                        
                        [[:xdigit:]]                                                      an HEXADECIMAL            character       22  =  [0-9A-Fa-f]  =  (?i)[0-9A-F]
                        

                        NO results for the Unicode character classes against the Total_ANSI.txt file, because any such class logically returns an Invalid Regular Expression message

                        Remark :

                        • A negative POSIX character class can be expressed as [^[:........:]] or [[:^........:]]

                        No INVALID UTF-8 chars can be found as we’re dealing with an ANSI file !


                        I tested the whole Equivalence classes feature :

                        You can use any equivalent character of the letter a to get the 15 matches ( for instance : [[=ª=]], [[=Å=]], [[=ã=]], … )

                        Below is the list of all the equivalences of any char of the Windows-1252 code-page, from \x00 till \xDE, against the Total_ANSI.txt file. Note that I did not consider the equivalence classes which return only one match !

                        [[==]]                           =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]                           =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]                           =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]                           =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]                           =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]                           =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]     =   [[=alert=]]       =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]     =   [[=backspace=]]   =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        
                        [[==]]                           =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]                           =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]                           =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]                           =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]                           =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]                           =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]                           =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]                           =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]                           =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]                           =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]                           =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]                           =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]                           =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]                           =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]     =   [[=IS4=]]         =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]     =   [[=IS3=]]         =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]     =   [[=IS2=]]         =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[==]]     =   [[=IS1=]]         =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        
                        [[='=]]    =   [[=apostrophe=]]  =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[=-=]]    =   [[=hyphen=]]      =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        
                        [[==]]                           =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[=–=]]                          =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[=—=]]                          =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        [[=­=]]                          =>    33   [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
                        
                        [[=1=]]    =   [[=one=]]         =>     2   [1¹]
                        [[=2=]]    =   [[=two=]]         =>     2   [2²]
                        [[=3=]]    =   [[=three=]]       =>     2   [3³]
                        
                        [[=A=]]                          =>    15   [AaªÀÁÂÃÄÅàáâãäå]
                        [[=B=]]                          =>     2   [Bb]
                        [[=C=]]                          =>     4   [CcÇç]
                        [[=D=]]                          =>     4   [DdÐð]
                        [[=E=]]                          =>    10   [EeÈÉÊËèéêë]
                        [[=F=]]                          =>     3   [Ffƒ]
                        [[=G=]]                          =>     2   [Gg]
                        [[=H=]]                          =>     2   [Hh]
                        [[=I=]]                          =>    10   [IiÌÍÎÏìíîï]
                        [[=J=]]                          =>     2   [Jj]
                        [[=K=]]                          =>     2   [Kk]
                        [[=L=]]                          =>     2   [Ll]
                        [[=M=]]                          =>     2   [Mm]
                        [[=N=]]                          =>     4   [NnÑñ]
                        [[=O=]]                          =>    15   [OoºÒÓÔÕÖØòóôõöø]
                        [[=P=]]                          =>     2   [Pp]
                        [[=Q=]]                          =>     2   [Qq]
                        [[=R=]]                          =>     2   [Rr]
                        [[=S=]]                          =>     4   [SsŠš]
                        [[=T=]]                          =>     2   [Tt]
                        [[=U=]]                          =>    10   [UuÙÚÛÜùúûü]
                        [[=V=]]                          =>     2   [Vv]
                        [[=W=]]                          =>     2   [Ww]
                        [[=X=]]                          =>     2   [Xx]
                        [[=Y=]]                          =>     6   [YyÝýÿŸ]
                        [[=Z=]]                          =>     4   [ZzŽž]
                        
                        [[=^=]]    =  [[=circumflex=]]   =>     2   [^ˆ]
                        [[=Œ=]]                          =>     2   [Œœ]
                        [[=Þ=]]                          =>     2   [Þþ]
                        

                        Some double-letter combinations have equivalence classes which give you the right single character to use, instead of the two separate letters :

                        [[=AE=]] = [[=Ae=]] = [[=ae=]]   =>   2   [Ææ]
                        [[=SS=]] = [[=Ss=]] = [[=ss=]]   =>   1   [ß]
                        

                        An example : let’s suppose that we run the regex (?-i)[A-F[:lower:]] against my Total_ANSI.txt file. It gives 71 matches, that is 6 UPPER letters + 65 LOWER letters

                        As, in an ANSI file, the Match case option or the (?i) modifier is effective for POSIX character classes, running the same regex in an insensitive way, as (?i)[A-F[:lower:]], returns, this time, 125 matches.

                        And note that both the regexes (?-i)[[:upper:][:lower:]] and (?i)[[:upper:][:lower:]] act as an insensitive regex and return 125 matches ( so 60 UPPER letters + 65 LOWER letters )

                        The regexes (?-i)\u(?<=\l) and (?-i)(?=\l)\u do not find any match, which implies that the sets of UPPER and LOWER letters are totally disjoint.
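This disjointness can be cross-checked with a short Python sketch. It is only an approximation: it uses the Unicode general categories Lu/Ll from the standard unicodedata module rather than the Windows classification that Boost::regex relies on, which is why it finds 63 lower-case letters instead of 65 (by the counts above, Windows presumably also treats ª and º, Unicode category Lo, as lower case):

```python
import unicodedata

def cp1252_chars():
    """All 256 cp1252 bytes as Unicode characters; the five bytes
    cp1252 leaves undefined (0x81 0x8D 0x8F 0x90 0x9D) fall back to
    the C1 control code points of the same value, as Windows maps them."""
    chars = []
    for b in range(256):
        try:
            chars.append(bytes([b]).decode('cp1252'))
        except UnicodeDecodeError:
            chars.append(chr(b))
    return chars

upper = {c for c in cp1252_chars() if unicodedata.category(c) == 'Lu'}
lower = {c for c in cp1252_chars() if unicodedata.category(c) == 'Ll'}

print(len(upper), len(lower), upper & lower)   # -> 60 63 set()
```

The empty intersection confirms that no character of the Win-1252 repertoire is both upper and lower case.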


                        Finally, for ANSI files, the regex syntax \X is rather useless. Indeed, the UNICODE block of Combining Diacritical Marks cannot be used anyway, and the Emoji, being UNICODE characters, are totally inaccessible to ANSI files. Thus, the \X regex is just equivalent to the simple regex (?s).


                        So, from this set of ANSI results, which ones seem quite odd, compared with the UNICODE results ?

                        • Maybe, the regex (?-s). should just be equal to [^\x0A\x0D] and return 254 matches

                        • The [[:cntrl:]] or \p{cntrl} class should be equal to [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D] and return 38 characters or, maybe, to [\x00-\x1F\x7F], so 33 chars only
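The first alternative (38 characters) is what you get if the five bytes that Win-1252 leaves undefined are mapped, as Windows does, to the C1 control code points of the same value. A small Python sketch using the standard unicodedata module illustrates this count (a cross-check against Unicode categories, not Boost's actual behaviour):

```python
import unicodedata

def ansi_to_unicode(b):
    """cp1252 byte -> Unicode char; the undefined bytes
    (0x81 0x8D 0x8F 0x90 0x9D) fall back to same-valued C1 controls."""
    try:
        return bytes([b]).decode('cp1252')
    except UnicodeDecodeError:
        return chr(b)

# Bytes whose Unicode general category is Cc (control)
ctrl = [b for b in range(256)
        if unicodedata.category(ansi_to_unicode(b)) == 'Cc']

print(len(ctrl))      # -> 38 : \x00-\x1F, \x7F and the five C1 fallbacks
print(0xAD in ctrl)   # -> False : the soft hyphen is Cf, not Cc
```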


                        Regarding the [[:graph:]] character class, I created an identical UTF-8 file, named Total_UTF-8.txt. Here are the results for the characters between \x80 and \xBF, in both files ( all the other chars being identical ) :

                        •---------•---------•--------------------•
                        |  ANSI   |  UTF-8  |  UNICODE Category  |
                        •---------•---------•--------------------•
                        |         |    €    |   Sc
                        |    ‚    |    ‚    |
                        |    ƒ    |    ƒ    |
                        |    „    |    „    |
                        |    …    |    …    |
                        |    †    |    †    |
                        |    ‡    |    ‡    |
                        |         |    ˆ    |   Lm
                        |    ‰    |    ‰    |
                        |    Š    |    Š    |
                        |    ‹    |    ‹    |
                        |    Œ    |    Œ    |
                        |    Ž    |    Ž    |
                        |    ‘    |    ‘    |
                        |    ’    |    ’    |
                        |    “    |    “    |
                        |    ”    |    ”    |
                        |    •    |    •    |
                        |    –    |    –    |
                        |    —    |    —    |
                        |         |    ˜    |   Sk
                        |         |    ™    |   So
                        |    š    |    š    |
                        |    ›    |    ›    |
                        |    œ    |    œ    |
                        |    ž    |    ž    |
                        |    Ÿ    |    Ÿ    |
                        |    ¡    |    ¡    |
                        |    ¢    |    ¢    |
                        |    £    |    £    |
                        |    ¤    |    ¤    |
                        |    ¥    |    ¥    |
                        |    ¦    |    ¦    |
                        |    §    |    §    |
                        |    ¨    |    ¨    |
                        |    ©    |    ©    |
                        |    ª    |    ª    |
                        |    «    |    «    |
                        |    ¬    |    ¬    |
                        |    ­    |         |   Cf
                        |    ®    |    ®    |
                        |    ¯    |    ¯    |
                        |    °    |    °    |
                        |    ±    |    ±    |
                        |    ²    |    ²    |
                        |    ³    |    ³    |
                        |    ´    |    ´    |
                        |    µ    |    µ    |
                        |    ¶    |    ¶    |
                        |    ·    |    ·    |
                        |    ¸    |    ¸    |
                        |    ¹    |    ¹    |
                        |    º    |    º    |
                        |    »    |    »    |
                        |    ¼    |    ¼    |
                        |    ½    |    ½    |
                        |    ¾    |    ¾    |
                        |    ¿    |    ¿    |
                        •---------•---------•--------------------•
                        

                        Surprisingly, the ANSI chars \x80, \x88, \x98 and \x99 are not considered part of the [[:graph:]] class, which represents the set of visible characters !?

                        So, to harmonize the results, the rule should be :

                        • When using the [[:graph:]] POSIX character class, against an ANSI file :

                          • The [\x80\x88\x98\x99] ANSI list of characters ( corresponding to the [\x{20AC}\x{02C6}\x{02DC}\x{2122}] UTF-8 list ) should be included in that class

                          • The \xAD character ( or \x{00AD} ) should be excluded from that class !
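That proposed rule can be sanity-checked in Python by approximating [[:graph:]] as "Unicode general category neither C* (controls/format) nor Z* (separators)" — a reading of what a visible character should be, not Boost's implementation:

```python
import unicodedata

def ansi_to_unicode(b):
    # cp1252 byte -> Unicode char; undefined bytes fall back to
    # same-valued C1 controls, as Windows maps them
    try:
        return bytes([b]).decode('cp1252')
    except UnicodeDecodeError:
        return chr(b)

def is_graph(c):
    # visible character: not a control/format/separator/unassigned
    return unicodedata.category(c)[0] not in ('C', 'Z')

for b in (0x80, 0x88, 0x98, 0x99, 0xAD):
    c = ansi_to_unicode(b)
    print(f'\\x{b:02X} -> U+{ord(c):04X} {unicodedata.category(c)}'
          f' graph={is_graph(c)}')
```

The four ANSI bytes \x80, \x88, \x98 and \x99 come out as graphic (categories Sc, Lm, Sk, So) while \xAD (the soft hyphen, category Cf) does not, which agrees with the two bullet points above.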

                        Now, as the [[:print:]] POSIX character class is simply identical to the regex [[:graph:]]|\s, there is no need to investigate that character class !

                        See next post

                        • guy038G
                          guy038
                          last edited by guy038

                          Hi @Coises and All,

                          End of my reply :

                          In the same way, regarding the [[:punct:]] character class, here are the results for both the Total_ANSI.txt and Total_UTF-8.txt files :

                          •---------•---------•--------------------•
                          |  ANSI   |  UTF-8  |  UNICODE Category  |
                          •---------•---------•--------------------•
                          |    !    |    !    |    Po
                          |    "    |    "    |    Po
                          |    #    |    #    |    Po
                          |    $    |    $    |    Sc
                          |    %    |    %    |    Po
                          |    &    |    &    |    Po
                          |    '    |    '    |    Po
                          |    (    |    (    |    Ps
                          |    )    |    )    |    Pe
                          |    *    |    *    |    Po
                          |    +    |    +    |    Sm
                          |    ,    |    ,    |    Po
                          |    -    |    -    |    Pd
                          |    .    |    .    |    Po
                          |    /    |    /    |    Po
                          |    :    |    :    |    Po
                          |    ;    |    ;    |    Po
                          |    <    |    <    |    Sm
                          |    =    |    =    |    Sm
                          |    >    |    >    |    Sm
                          |    ?    |    ?    |    Po
                          |    @    |    @    |    Po
                          |    [    |    [    |    Ps
                          |    \    |    \    |    Po
                          |    ]    |    ]    |    Pe
                          |    ^    |    ^    |    Sk
                          |    _    |    _    |    Pc
                          |    `    |    `    |    Sk
                          |    {    |    {    |    Ps
                          |    |    |    |    |    Sm
                          |    }    |    }    |    Pe
                          |    ~    |    ~    |    Sm
                          •---------•---------•--------------------•
                          |         |    €    |    Sc
                          |    ‚    |    ‚    |    Ps
                          |    „    |    „    |    Ps
                          |    …    |    …    |    Po
                          |    †    |    †    |    Po
                          |    ‡    |    ‡    |    Po
                          |    ‰    |    ‰    |    Po
                          |    ‹    |    ‹    |    Pi
                          |    ‘    |    ‘    |    Pi
                          |    ’    |    ’    |    Pf
                          |    “    |    “    |    Pi
                          |    ”    |    ”    |    Pf
                          |    •    |    •    |    Po
                          |    –    |    –    |    Pd
                          |    —    |    —    |    Pd
                          |         |    ˜    |    Sk
                          |         |    ™    |    So
                          |    ›    |    ›    |    Pf
                          |    ¡    |    ¡    |    Po
                          |    ¢    |    ¢    |    Sc
                          |    £    |    £    |    Sc
                          |    ¤    |    ¤    |    Sc
                          |    ¥    |    ¥    |    Sc
                          |    ¦    |    ¦    |    So
                          |    §    |    §    |    Po
                          |    ¨    |    ¨    |    Sk
                          |    ©    |    ©    |    So
                          |    ª    |         |    Lo
                          |    «    |    «    |    Pi
                          |    ¬    |    ¬    |    Sm
                          |    ­    |         |    Cf
                          |    ®    |    ®    |    So
                          |    ¯    |    ¯    |    Sk
                          |    °    |    °    |    So
                          |    ±    |    ±    |    Sm
                          |    ²    |         |    No
                          |    ³    |         |    No
                          |    ´    |    ´    |    Sk
                          |    µ    |         |    Ll
                          |    ¶    |    ¶    |    Po
                          |    ·    |    ·    |    Po
                          |    ¸    |    ¸    |    Sk
                          |    ¹    |         |    No
                          |    º    |         |    Lo
                          |    »    |    »    |    Pf
                          |    ¼    |         |    No
                          |    ½    |         |    No
                          |    ¾    |         |    No
                          |    ¿    |    ¿    |    Po
                          |    ×    |    ×    |    Sm
                          |    ÷    |    ÷    |    Sm
                          •---------•---------•--------------------•
                          

                          And, as we know that the [[:punct:]] POSIX character class is the union of the TWO Unicode classes \p{P*} and \p{S*}, this means that all the [[:punct:]] characters found in Total_UTF-8.txt are correct !

                          However, it’s obvious that this is not the case for the [[:punct:]] characters found in Total_ANSI.txt :

                          So, again, to harmonize the results, the rule should be :

                          • When using the [[:punct:]] POSIX character class, against an ANSI file :

                            • The [\xAA\xAD\xB2\xB3\xB5\xB9\xBA\xBC\xBD\xBE] list of characters should be excluded from that class !

                            • The [\x80\x98\x99] ANSI list of characters ( corresponding to the [\x{20AC}\x{02DC}\x{2122}] UTF-8 list ) should be included in that class

                          And this result would confirm that the POSIX [[:punct:]] character class is equal to the \p{P*}|\p{S*} regex, in all cases !
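That rule is easy to verify with the standard unicodedata module, treating [[:punct:]] as "Unicode general category P* or S*" as stated above (again, a cross-check against Unicode data, not against Boost itself):

```python
import unicodedata

def ansi_to_unicode(b):
    # cp1252 byte -> Unicode char; undefined bytes fall back to
    # same-valued C1 controls, as Windows maps them
    try:
        return bytes([b]).decode('cp1252')
    except UnicodeDecodeError:
        return chr(b)

# ANSI bytes whose Unicode general category is P* or S*
punct = {b for b in range(256)
         if unicodedata.category(ansi_to_unicode(b))[0] in ('P', 'S')}

excluded = {0xAA, 0xAD, 0xB2, 0xB3, 0xB5, 0xB9, 0xBA, 0xBC, 0xBD, 0xBE}
included = {0x80, 0x98, 0x99}   # €, ˜, ™

print(punct & excluded)    # -> set() : none of them is P* or S*
print(included <= punct)   # -> True  : all three belong to the class
```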


                          • Regarding the Equivalence classes whose results are presently 33, the rule should be :

                            • All the Control codes should just match their own character. For example [[==]] should return 1 match [\x7F]

                            • [[='=]] = [[=apostrophe=]] should return 1 match [\x27]

                            • [[=-=]] = [[=hyphen=]] should return 1 match [\x2D]

                            • [[=–=]] should return 1 match [\x96]

                            • [[=—=]] should return 1 match [\x97]

                            • [[=­=]] should return 1 match [\xAD]


                          Now, when doing tests with UNICODE files, I forgot the equivalence classes of the Control C0/C1 and Control Format characters ! So the results, against my Total_Chars.txt UTF-8 file, are :

                          [[=nul=]]                            => 3,309  [\x{0000}\x{00AD}....]       Cc
                          
                          [[=soh=]]                            =>     1  [\x{0001}]                   Cc
                          [[=stx=]]                            =>     1  [\x{0002}]                   Cc
                          [[=etx=]]                            =>     1  [\x{0003}]                   Cc
                          [[=eot=]]                            =>     1  [\x{0004}]                   Cc
                          [[=enq=]]                            =>     1  [\x{0005}]                   Cc
                          [[=ack=]]                            =>     1  [\x{0006}]                   Cc
                          [[=bel=]]  =  [[=alert=]]            =>     1  [\x{0007}]                   Cc
                          [[=bs=]]   =  [[=backspace=]]        =>     1  [\x{0008}]                   Cc
                          [[=ht=]]   =  [[=tab=]]              =>     1  [\x{0009}]                   Cc
                          [[=lf=]]   =  [[=newline=]]          =>     1  [\x{000A}]                   Cc
                          [[=vt=]]   =  [[=vertical-tab=]]     =>     1  [\x{000B}]                   Cc
                          [[=ff=]]   =  [[=form-feed=]]        =>     1  [\x{000C}]                   Cc
                          [[=cr=]]   =  [[=carriage-return=]]  =>     1  [\x{000D}]                   Cc
                          [[=so=]]                             =>     1  [\x{000E}]                   Cc
                          [[=si=]]                             =>     1  [\x{000F}]                   Cc
                          [[=dle=]]                            =>     1  [\x{0010}]                   Cc
                          [[=dc1=]]                            =>     1  [\x{0011}]                   Cc
                          [[=dc2=]]                            =>     1  [\x{0012}]                   Cc
                          [[=dc3=]]                            =>     1  [\x{0013}]                   Cc
                          [[=dc4=]]                            =>     1  [\x{0014}]                   Cc
                          [[=nak=]]                            =>     1  [\x{0015}]                   Cc
                          [[=syn=]]                            =>     1  [\x{0016}]                   Cc
                          [[=etb=]]                            =>     1  [\x{0017}]                   Cc
                          [[=can=]]                            =>     1  [\x{0018}]                   Cc
                          [[=em=]]                             =>     1  [\x{0019}]                   Cc
                          [[=sub=]]                            =>     1  [\x{001A}]                   Cc
                          [[=esc=]]                            =>     1  [\x{001B}]                   Cc
                          [[=fs=]]                             =>     1  [\x{001C}]                   Cc
                          [[=gs=]]                             =>     1  [\x{001D}]                   Cc
                          [[=rs=]]                             =>     1  [\x{001E}]                   Cc
                          [[=us=]]                             =>     1  [\x{001F}]                   Cc
                          
                          [[= =]]                              =>     3  [\x{0020}\x{205F}\x{3000}]   Zs
                          
                          [[=del=]]                            =>     1  [\x{007F}]                   Cc
                          [[=pad=]]                            =>     1  [\x{0080}]                   Cc
                          [[=hop=]]                            =>     1  [\x{0081}]                   Cc
                          [[=bph=]]                            =>     1  [\x{0082}]                   Cc
                          [[=nbh=]]                            =>     1  [\x{0083}]                   Cc
                          [[=ind=]]                            =>     1  [\x{0084}]                   Cc
                          [[=nel=]]                            =>     1  [\x{0085}]                   Cc
                          [[=ssa=]]                            =>     1  [\x{0086}]                   Cc
                          [[=esa=]]                            =>     1  [\x{0087}]                   Cc
                          [[=hts=]]                            =>     1  [\x{0088}]                   Cc
                          [[=htj=]]                            =>     1  [\x{0089}]                   Cc
                          [[=lts=]]                            =>     1  [\x{008A}]                   Cc
                          [[=pld=]]                            =>     1  [\x{008B}]                   Cc
                          [[=plu=]]                            =>     1  [\x{008C}]                   Cc
                          [[=ri=]]                             =>     1  [\x{008D}]                   Cc
                          [[=ss2=]]                            =>     1  [\x{008E}]                   Cc
                          [[=ss3=]]                            =>     1  [\x{008F}]                   Cc
                          [[=dcs=]]                            =>     1  [\x{0090}]                   Cc
                          [[=pu1=]]                            =>     1  [\x{0091}]                   Cc
                          [[=pu2=]]                            =>     1  [\x{0092}]                   Cc
                          [[=sts=]]                            =>     1  [\x{0093}]                   Cc
                          [[=cch=]]                            =>     1  [\x{0094}]                   Cc
                          [[=mw=]]                             =>     1  [\x{0095}]                   Cc
                          [[=spa=]]                            =>     1  [\x{0096}]                   Cc
                          [[=epa=]]                            =>     1  [\x{0097}]                   Cc
                          [[=sos=]]                            =>     1  [\x{0098}]                   Cc
                          [[=sgci=]]                           =>     1  [\x{0099}]                   Cc
                          [[=sci=]]                            =>     1  [\x{009A}]                   Cc
                          [[=csi=]]                            =>     1  [\x{009B}]                   Cc
                          [[=st=]]                             =>     1  [\x{009C}]                   Cc
                          [[=osc=]]                            =>     1  [\x{009D}]                   Cc
                          [[=pm=]]                             =>     1  [\x{009E}]                   Cc
                          [[=apc=]]                            =>     1  [\x{009F}]                   Cc
                          [[=nbsp=]]                           =>     1  [\x{00A0}]                   Cc
                          
                          [[=shy=]]                            => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=alm=]]                            => 3,309  [\x{0000}\x{00AD}....]       Cf
                          
                          [[=sam=]]                            =>     2  [\x{070F}\x{2E1A}]           Po
                          [[=ospm=]]                           =>     1  [\x{1680}]                   Zs
                          [[=mvs=]]                            =>     1  [\x{180E}]                   Cf
                          [[=nqsp=]]                           =>     2  [\x{2000}\x{2002}]           Zs
                          [[=mqsp=]]                           =>     2  [\x{2001}\x{2003}]           Zs
                          [[=ensp=]]                           =>     2  [\x{2000}\x{2002}]           Zs
                          [[=emsp=]]                           =>     2  [\x{2001}\x{2003}]           Zs
                          [[=3/msp=]]                          =>     1  [\x{2004}]                   Zs
                          [[=4/msp=]]                          =>     1  [\x{2005}]                   Zs
                          [[=6/msp=]]                          =>     1  [\x{2006}]                   Zs
                          [[=fsp=]]                            =>     1  [\x{2007}]                   Zs
                          [[=psp=]]                            =>     1  [\x{2008}]                   Zs
                          [[=thsp=]]                           =>     1  [\x{2009}]                   Zs
                          [[=hsp=]]                            =>     1  [\x{200A}]                   Zs
                          [[=zwsp=]]                           =>     1  [\x{200B}]                   Cf
                          
                          [[=zwnj=]]                           => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=zwj=]]                            => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=lrm=]]                            => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=rlm=]]                            => 3,309  [\x{0000}\x{00AD}....]       Cf
                          
                          [[=ls=]]                             =>     2  [\x{2028}\x{FE47}]           Zl
                          [[=ps=]]                             =>     1  [\x{2029}]                   Zp
                          
                          [[=lre=]]                            => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=rle=]]                            => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=pdf=]]                            => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=lro=]]                            => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=rlo=]]                            => 3,309  [\x{0000}\x{00AD}....]       Cf
                          
                          [[=nnbsp=]]                          =>     1  [\x{202F}]                   Zs
                          [[=mmsp=]]                           =>     3  [\x{0020}\x{205F}\x{3000}]   Zs
                          
                          [[=wj=]]                             => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=(fa)=]]                           => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=(it)=]]                           => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=(is)=]]                           => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=(ip)=]]                           => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=lri=]]                            => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=rli=]]                            => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=fsi=]]                            => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=pdi=]]                            => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=iss=]]                            => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=ass=]]                            => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=iafs=]]                           => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=aafs=]]                           => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=nads=]]                           => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=nods=]]                           => 3,309  [\x{0000}\x{00AD}....]       Cf
                          
                          [[=idsp=]]                           =>     3  [\x{0020}\x{205F}\x{3000}]   Zs
                          
                          [[=zwnbsp=]]                         => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=iaa=]]                            => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=ias=]]                            => 3,309  [\x{0000}\x{00AD}....]       Cf
                          [[=iat=]]                            => 3,309  [\x{0000}\x{00AD}....]       Cf
                          
                          [[=sflo=]]                           =>     1  [\x{1BCA0}]                  Cf
                          [[=sfco=]]                           =>     1  [\x{1BCA1}]                  Cf
                          [[=sfds=]]                           =>     1  [\x{1BCA2}]                  Cf
                          [[=sfus=]]                           =>     1  [\x{1BCA3}]                  Cf
                          

                          As you can see, a lot of Format characters give the erroneous result of 3,309 occurrences. But we’re not going to bother about these wrong equivalence classes, as long as the similar collating names, with the [[.XXX.]] syntax, are totally correct !

                          Luckily, all the other equivalence classes are quite correct, except for [[=ls=]] which returns 2 matches \x{2028} and \x{FE47} ??

                          Also a detail !

                          Best Regards,

                          guy038

                          • CoisesC
                            Coises @guy038
                            last edited by

                            @guy038 said in Columns++ version 1.2: better Unicode search:

                            But, on the other hand, the search of the regex :

                            [[.NUL.][.SOH.][.STX.][.ETX.][.EOT.][.ENQ.][.ACK.][.BEL.][.BS.][.HT.][.VT.][.FF.][.SO.][.SI.][.DLE.][.DC1.][.DC2.][.DC3.][.DC4.][.NAK.][.SYN.][.ETB.][.CAN.][.EM.][.SUB.][.ESC.][.FS.][.GS.][.RS.][.US.][.DEL.][.HOP.][.RI.][.SS3.][.DCS.][.OCS.][.SHY.]]

                            Leads to an Invalid Regex message. Logical, as this kind of search concerns Unicode files, only.

                            There is a typo in that. The next-to-last symbolic name should be OSC, not OCS. (See list at the end of this help section.)

                            However, it still won’t work in ANSI search, because ANSI search only supports these POSIX symbolic names as defined by Boost::regex.

                            The regular expression language for ANSI files is exactly the same as it is in Notepad++ search, because I have not changed the underlying Boost::regex engine’s behavior for ANSI files. I only changed the way the engine works for UTF-8 files.

                            Some things, like stepwise find and replace with \K, formulas in replacement strings and counting null matches (my Count counts them, Notepad++’s doesn’t) differ for both ANSI and UTF-8 because I changed the surrounding code that uses the Boost::regex engine; but the matching itself is unchanged for ANSI.

                             This is why the character classes behave differently as well. Boost::regex relies on GetStringTypeExA (which is similar to GetStringTypeExW except for the third argument being char* instead of wchar_t*) to classify 8-bit characters according to the Ctype 1 list here. The classification depends on the current locale (which should imply the system default code page, which is the only code page Notepad++ ever uses as ANSI — documents in other code pages are converted to UTF-8). ANSI regular expressions, per Boost::regex design, use whatever information Windows gives them.

                            • guy038G
                              guy038
                              last edited by guy038

                              Hi, @coises and All,

                              I think this will be the last answer concerning your Columns++_v1.2 plugin !

                              Here is the recapitulation of the way to access the invisible characters, whatever the file type :


                              For ANSI files : just one possible syntax for these collating names :

                               [[.NUL.][.SOH.][.STX.][.ETX.][.EOT.][.ENQ.][.ACK.][.alert.][.backspace.][.tab.][.newline.][.vertical-tab.][.form-feed.][.carriage-return.][.SO.][.SI.][.DLE.][.DC1.][.DC2.][.DC3.][.DC4.][.NAK.][.SYN.][.ETB.][.CAN.][.EM.][.SUB.][.ESC.][.IS4.][.IS3.][.IS2.][.IS1.][.DEL.]] which returns 33 matches, against the Total_ANSI.txt file, which contains the 256 characters of the Win-1252 encoding

                               • Note that the lowercase syntax is NOT allowed, in ANSI files, for ANY collating name presently in UPPER case

                               • Note also that the four chars, from \x1c to \x1f, must be referred to as IS4 to IS1, in UPPER case ( and NOT as fs to us ! )


                              For UTF-8 files : two possible syntaxes for these collating names :

                               [[.nul.][.soh.][.stx.][.etx.][.eot.][.enq.][.ack.][.bel.][.bs.][.ht.][.lf.][.vt.][.ff.][.cr.][.so.][.si.][.dle.][.dc1.][.dc2.][.dc3.][.dc4.][.nak.][.syn.][.etb.][.can.][.em.][.sub.][.esc.][.fs.][.gs.][.rs.][.us.][.del.][.pad.][.hop.][.bph.][.nbh.][.ind.][.nel.][.ssa.][.esa.][.hts.][.htj.][.lts.][.pld.][.plu.][.ri.][.ss2.][.ss3.][.dcs.][.pu1.][.pu2.][.sts.][.cch.][.mw.][.spa.][.epa.][.sos.][.sgci.][.sci.][.csi.][.st.][.osc.][.pm.][.apc.][.nbsp.][.shy.][.alm.][.sam.][.ospm.][.mvs.][.nqsp.][.mqsp.][.ensp.][.emsp.][.3/msp.][.4/msp.][.6/msp.][.fsp.][.psp.][.thsp.][.hsp.][.zwsp.][.zwnj.][.zwj.][.lrm.][.rlm.][.ls.][.ps.][.lre.][.rle.][.pdf.][.lro.][.rlo.][.nnbsp.][.mmsp.][.wj.][.(fa).][.(it).][.(is).][.(ip).][.lri.][.rli.][.fsi.][.pdi.][.iss.][.ass.][.iafs.][.aafs.][.nads.][.nods.][.idsp.][.zwnbsp.][.iaa.][.ias.][.iat.][.sflo.][.sfco.][.sfds.][.sfus.]] which returns 120 matches, against the Total_Chars.txt file

                              [[.nul.][.soh.][.stx.][.etx.][.eot.][.enq.][.ack.][.alert.][.backspace.][.tab.][.newline.][.vertical-tab.][.form-feed.][.carriage-return.][.so.][.si.][.dle.][.dc1.][.dc2.][.dc3.][.dc4.][.nak.][.syn.][.etb.][.can.][.em.][.sub.][.esc.][.fs.][.gs.][.rs.][.us.][.del.][.pad.][.hop.][.bph.][.nbh.][.ind.][.nel.][.ssa.][.esa.][.hts.][.htj.][.lts.][.pld.][.plu.][.ri.][.ss2.][.ss3.][.dcs.][.pu1.][.pu2.][.sts.][.cch.][.mw.][.spa.][.epa.][.sos.][.sgci.][.sci.][.csi.][.st.][.osc.][.pm.][.apc.][.nbsp.][.shy.][.alm.][.sam.][.ospm.][.mvs.][.nqsp.][.mqsp.][.ensp.][.emsp.][.3/msp.][.4/msp.][.6/msp.][.fsp.][.psp.][.thsp.][.hsp.][.zwsp.][.zwnj.][.zwj.][.lrm.][.rlm.][.ls.][.ps.][.lre.][.rle.][.pdf.][.lro.][.rlo.][.nnbsp.][.mmsp.][.wj.][.(fa).][.(it).][.(is).][.(ip).][.lri.][.rli.][.fsi.][.pdi.][.iss.][.ass.][.iafs.][.aafs.][.nads.][.nods.][.idsp.][.zwnbsp.][.iaa.][.ias.][.iat.][.sflo.][.sfco.][.sfds.][.sfus.]] which returns 120 matches, against the Total_Chars.txt file

                              • Note that the Uppercase syntax is allowed, in UTF-8 files, for ANY collating name, presently in LOWER case


                              Finally, for an ANSI file, containing the 256 chars of the Win-1252 encoding and converted to a UTF-8 file ( Encoding > Convert to UTF-8 ), two syntaxes are possible :

                              [[.nul.][.soh.][.stx.][.etx.][.eot.][.enq.][.ack.][.bel.][.bs.][.ht.][.lf.][.vt.][.ff.][.cr.][.so.][.si.][.dle.][.dc1.][.dc2.][.dc3.][.dc4.][.nak.][.syn.][.etb.][.can.][.em.][.sub.][.esc.][.fs.][.gs.][.rs.][.us.][.del.][.pad.][.hop.][.ri.][.ss3.][.dcs.][.osc.][.nbsp.][.shy.]] which returns 40 matches, against the Total_UTF-8.txt file

                              [[.nul.][.soh.][.stx.][.etx.][.eot.][.enq.][.ack.][.alert.][.backspace.][.tab.][.newline.][.vertical-tab.][.form-feed.][.carriage-return.][.so.][.si.][.dle.][.dc1.][.dc2.][.dc3.][.dc4.][.nak.][.syn.][.etb.][.can.][.em.][.sub.][.esc.][.fs.][.gs.][.rs.][.us.][.del.][.hop.][.ri.][.ss3.][.dcs.][.osc.][.nbsp.][.shy.]] which returns 40 matches, against the Total_UTF-8.txt file


                              Now, against the Total_ANSI.txt file, containing the first 256 UNICODE characters, we get these results :

                              (?s).                          ANY character                              =>  256
                              
                              (?-s).                         ANY character different from LINE-BREAKS   =>  253  =  [^\x0A\x0C\x0D]
                              
                              [[:unicode:]]  =  \p{unicode}  an  OVER  \x{00FF}        character        =>    0  =  [^\x00-\xFF]
                              
                              [[:cntrl:]]    =  \p{cntrl}    a   CONTROL code          character        =>   39  =  [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D\xAD]
                              
                              
                              [[:space:]]    =  \p{space}    a   WHITE-SPACE           character        =>    7  =  [\t\n\x0B\f\r\x20\xA0]
                              [[:blank:]]    =  \p{blank}    a   BLANK                 character        =>    3  =  [\t\x20\xA0]
                              
                              [[:upper:]]    =  \p{upper}    an  UPPER case            letter           =>   60  =  [A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]
                              [[:lower:]]    =  \p{lower}    a   LOWER case            letter           =>   65  =  [a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]
                              [[:digit:]]    =  \p{digit}    a   DECIMAL               number           =>   13  =  [0-9²³¹]
                              
                              [[:word:]]     =  \p{word}     a   WORD                  character        =>  139  =  [[:alnum:]]|\x5F  =  \p{alnum}|\x5F
                              
                              [[:punct:]]    =  \p{punct}    any PUNCTUATION or SYMBOL character        =>   80  =  [\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E\x82\x84-\x87\x89\x8B\x91-\x97\x9B\xA1-\xBF\xD7\xF7]
                              
                              [[:alpha:]]    =  \p{alpha}    any LETTER                character        =>  125  =  (?-i)[[:upper:][:lower:]]
                              [[:alnum:]]    =  \p{alnum}    an  ALPHANUMERIC          character        =>  138  =  (?-i)[[:upper:][:lower:][:digit:]]
                              
                              [[:graph:]]    =  \p{graph}    any VISIBLE               character        =>  212  =  [^\x00-\x1F\x20\x7F\x80\x81\x88\x8D\x8F\x90\x98\x99\x9D\xA0]
                              
                              [[:print:]]    =  \p{print}    any PRINTABLE             character        =>  219  =  [[:graph:][:space:]]  =  [^\x00-\x1F\x20\x7F\x80\x81\x88\x8D\x8F\x90\x98\x99\x9D\xA0]|[[:space:]]
                              
                              [[:xdigit:]]                   an  HEXADECIMAL           character        =>   22  =  [0-9A-Fa-f]  =  (?i)[0-9A-F]
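                              By the way, the last equivalence is easy to cross-check outside Notepad++ : Python's stdlib re module does not support the POSIX [[:...:]] classes of the Boost engine, but the (?i)[0-9A-F] form uses only standard syntax. A minimal sketch, counting matches over the same 256 characters :

```python
import re

# The 256 characters \x00 .. \xFF, as in the Total_ANSI.txt test file
chars = "".join(chr(i) for i in range(256))

# [[:xdigit:]] is equivalent to (?i)[0-9A-F] : ten digits, A-F and a-f
print(len(re.findall(r"(?i)[0-9A-F]", chars)))   # 22
```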
                              

                              Remark : the [[:unicode:]] class, for characters OVER \x{00FF}, must correspond to the C1_DEFINED type from the Ctype 1 list here.


                              From this same article, and after I realized that the POSIX classes are not totally independent, I deduced this layout :

                              C1_DEFINED  Other characters     0
                              C1_CNTRL    Control characters  39
                              C1_SPACE    Space characters     2  ( only the SPACE and NBSP chars, OUT of 7, as ALL the others are ALREADY included in the CNTRL chars class )
                              C1_UPPER    Uppercase           60
                              C1_LOWER    Lowercase           65
                              C1_DIGIT    Decimal digits      13
                              C1_PUNCT    Punctuation         73  ( and NOT 80, because the \xAD char            is  ALREADY included in the CNTRL chars class,
                                                                                because the \xAA, \xB5 and \xBA chars are ALREADY included in the LOWER chars class,
                                                                                because the \xB2, \xB3 and \xB9 chars are ALREADY included in the DIGIT chars class )
                                                            -----
                                                  TOTAL :    252 chars
                              

                              So, if I exclude, from my Total_ANSI.txt file, all the following classes with the S/R :

                              FIND [[:cntrl:][:space:][:upper:][:lower:][:digit:][:punct:]]

                              REPLACE Leave EMPTY

                              Either with your plugin or with native N++, 4 characters remain ( 256 - 252 ), which are the € ( \x{20AC} ), ˆ ( \x{02C6} ), ˜ ( \x{02DC} ) and ™ ( \x{2122} ) characters

                              Moreover, absolutely no POSIX character class and no UNICODE character class, of course, can find these 4 characters !

                              Thus, the only way to find one of these 4 characters, in an ANSI file, is to use the regex [\x80\x88\x98\x99] or to use the characters themselves :-((
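                              Those four byte values can also be cross-checked with any tool that knows the Win-1252 table ; a short Python sketch, using the stdlib cp1252 codec, shows which Unicode characters the bytes \x80, \x88, \x98 and \x99 decode to :

```python
# Decode the four Windows-1252 bytes that no POSIX class matches
for b in (0x80, 0x88, 0x98, 0x99):
    ch = bytes([b]).decode("cp1252")
    print(f"\\x{b:02X}  ->  U+{ord(ch):04X}  {ch}")
# \x80  ->  U+20AC  €
# \x88  ->  U+02C6  ˆ
# \x98  ->  U+02DC  ˜
# \x99  ->  U+2122  ™
```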


                              In this article, it is also said :

                              Printable | Graphic characters and blanks (all C1_* types except C1_CNTRL). Thus …

                              So, from the previous total of 252 chars in my Total_ANSI.txt file, the [[:print:]] class should detect 252 - 39, so 213 matches.

                              Thus, as [[:graph:]] = [[:print:]] - [[:space:]], this means that [[:graph:]] should be : 213 - 2, so 211 matches.

                              But the current result is 212 matches. The difference of one unit comes from the \xAD char which is, both, part of the [[:cntrl:]] and [[:graph:]] POSIX character classes !

                              If we remember the 4 missing chars, which, obviously, are visible and printable, this means that [[:graph:]] and [[:print:]] should return, respectively, 215 ( 211 + 4 ) and 217 ( 213 + 4 ) matches, for ANSI files.

                              And it is easy to verify that [[:print:]] + [[:cntrl:]] = 217 + 39 = 256 !
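                              The whole counting argument can be condensed into a few lines of arithmetic, using only the numbers already established above :

```python
defined    = 252   # chars covered by the seven C1_* types ( table above )
cntrl      = 39    # C1_CNTRL
space_only = 2     # SPACE and NBSP ( the 5 other white-spaces are already CNTRL )
missing    = 4     # €  ˆ  ˜  ™  :  in no class at all

printable = defined - cntrl + missing    # 217
graph     = printable - space_only       # 215

assert printable + cntrl == 256          # every one of the 256 chars accounted for
print(graph, printable)                  # 215 217
```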


                              Just for info : from the Total_UTF-8.txt file, containing these same chars, we get these results :

                              (?s).                          ANY character                              =>  256
                              
                              (?-s).                         ANY character different from LINE-BREAKS   =>  254  =  [^\x0A\x0D]
                              
                              [[:ascii:]]                    an  UNDER \x{0080}        character        =>  128  =  [\x{0000}-\x{007F}]  =  \p{ascii}
                              [[:unicode:]]  =  \p{unicode}  an  OVER  \x{00FF}        character        =>   27  =  [^\x00-\xFF]  =  [\x{20AC}\x{201A}\x{0192}\x{201E}\x{2026}\x{2020}\x{2021}\x{02C6}\x{2030}\x{0160}\x{2039}\x{0152}\x{017D}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{02DC}\x{2122}\x{0161}\x{203A}\x{0153}\x{017E}\x{0178}]
                              
                              [[:cntrl:]]    =  \p{cntrl}    a  CONTROL code character                  =>   38  =  [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]  =  \p{Cc}
                              
                              [[:space:]]    =  \p{space}    a WHITE-SPACE character                    =>    7  =  [\t\n\x0B\f\r\x20\xA0]
                              [[:blank:]]    =  \p{blank}    a   BLANK                 character        =>    3  =  [\t\x{0020}\x{00A0}]  =  \p{Zs}|\t
                              
                              [[:upper:]]    =  \p{upper}    an  UPPER case    letter                   =>   60  =  [A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]  =  \p{Lu}
                              [[:lower:]]    =  \p{lower}    a   LOWER case    letter                   =>   63  =  [a-zƒšœžµßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]  =  \p{Ll}
                              [[:digit:]]    =  \p{digit}    a   DECIMAL       number                   =>   10  =  [0-9]  =  \p{Nd}
                              
                              [[:word:]]     =  \p{word}     a   WORD                  character        =>  137  =  \p{L*}|\p{Nd}|_
                              
                              [[:graph:]]    =  \p{graph}    any VISIBLE     character                  =>  215  =  [^\x00-\x1F\x20\x7F\x81\x8D\x8F\x90\x9D\xA0\xAD]  =  (?![\x20\xA0\xAD])\P{Cc}
                              
                              [[:print:]]    =  \p{print}    any PRINTABLE   character                  =>  222  =  [[:graph:][:space:]] = [^\x00-\x1F\x20\x7F\x81\x8D\x8F\x90\x9D\xA0\xAD]|[[:space:]]
                              
                              [[:punct:]]    =  \p{punct}    any PUNCTUATION or SYMBOL character        =>   73  =  \p{P*}|\p{S*}  =  [\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E\x{20AC}\x{201A}\x{201E}\x{2026}\x{2020}\x{2021}\x{2030}\x{2039}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{02DC}\x{2122}\x{203A}\xA1-\xA9\xAB\xAC\xAE-\xB1\xB4\xB6-\xB8\xBB\xBF\xD7\xF7]
                              
                              [[:alpha:]]    =  \p{alpha}    any LETTER                character        =>  126  =  \p{L*}  =  \p{Lu}|\p{Ll}|[ˆªº]
                              [[:alnum:]]    =  \p{alnum}    an  ALPHANUMERIC          character        =>  136  =  \p{L*}|\p{Nd}
                              
                              [[:xdigit:]]                   an  HEXADECIMAL           character        =>   22  =  [0-9A-Fa-f]  =  (?i)[0-9A-F]
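                              Several of these Unicode equivalences ( \p{Lu}, \p{Ll}, \p{Nd} and the [[:unicode:]] count of 27 ) can be cross-checked with Python's stdlib unicodedata module, by rebuilding the 256 Win-1252 characters as Unicode, i.e. the content of the Total_UTF-8.txt file. A sketch, assuming that the 5 undefined Win-1252 bytes pass through as the C1 controls U+0081, U+008D, U+008F, U+0090 and U+009D :

```python
import unicodedata

# The 256 Windows-1252 characters as Unicode ; the 5 undefined bytes
# ( \x81 \x8D \x8F \x90 \x9D ) are kept as C1 control characters
chars = []
for b in range(256):
    try:
        chars.append(bytes([b]).decode("cp1252"))
    except UnicodeDecodeError:
        chars.append(chr(b))

cats = [unicodedata.category(c) for c in chars]
print(cats.count("Lu"))                   # 60  =  [[:upper:]]  =  \p{Lu}
print(cats.count("Ll"))                   # 63  =  [[:lower:]]  =  \p{Ll}
print(cats.count("Nd"))                   # 10  =  [[:digit:]]  =  \p{Nd}
print(sum(ord(c) > 0xFF for c in chars))  # 27  =  [[:unicode:]]
```

Note that ª and º are of Unicode category Lo, not Ll, which is why the UTF-8 [[:lower:]] count is 63 and not the 65 found for the ANSI file.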
                              

                              Best regards,

                              guy038
