
    Columns++ version 1.3: All Unicode, all the time

    Notepad++ & Plugin Development
    • Coises

      Columns++ version 1.3 brings the enhancements for regular expressions in Unicode documents to ANSI documents as well:

      • Regular expressions now match based on Unicode code points in all documents, so the syntax and semantics of regular expressions are no longer dependent on the underlying representation in Scintilla. The features added in version 1.2 for Unicode documents now work in all documents.

      • Regular expressions did not work properly in ANSI documents for the system default code pages 932 (Japanese, Shift-JIS), 936 (Chinese Simplified, GB2312), 949 (Korean, Windows-949 / Unified Hangul Code) or 950 (Chinese Traditional, Big5) in version 1.2. Regular expressions now match these documents based on Unicode code points.

      • Character information tables for regular expressions were updated to reflect Unicode 17.0.0 (released on 2025-09-09).
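
      As a rough illustration of why byte-oriented matching can misbehave on the double-byte code pages listed above, here is a minimal Python sketch. It is not taken from Columns++ or Notepad++, and the post does not say what exactly went wrong in version 1.2; it only shows a general hazard of matching on bytes, using Python's `re` module and its `cp932` codec as stand-ins.

```python
import re

# Katakana "so" (U+30BD) is two bytes in code page 932, and the
# second byte happens to be 0x5C, the ASCII backslash.
so = "\u30bd"
sjis_so = so.encode("cp932")

# A byte-oriented search for a backslash finds a false match
# inside the double-byte character.
byte_hit = re.search(rb"\\", sjis_so)

# A code-point-oriented search sees one character and no backslash.
char_hit = re.search(r"\\", so)

# "." consumes one byte per match on bytes, one code point on str.
sample = "\u3042a"  # hiragana "a" followed by ASCII "a"
byte_units = re.findall(rb".", sample.encode("cp932"))
char_units = re.findall(r".", sample)

print(sjis_so.hex(), byte_hit is not None, char_hit is not None)
print(len(byte_units), len(char_units))
```

      Matching on decoded code points, as the first bullet describes, sidesteps this class of problem regardless of how the document is stored.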

      I haven’t yet set this as “stable” (and of course it isn’t yet in the plugins admin list). Anyone feeling adventuresome is welcome to try it and see if it behaves. There is documentation describing how Unicode-based matching in Columns++ differs from Notepad++ matching.

      If anyone reading this routinely has one of the CJK code pages as their system default code page, I would be very interested to know if regular expressions in this version appear to work as expected on your locale’s ANSI files.

      • guy038

        Hello, @coises and All,

        I’ve just tried your latest ColumnsPlusPlus v1.3 release and, indeed, the search now behaves as a true Unicode search, whatever the encoding of each file!

        Let’s consider this simple UTF-8 text :

        This ‟ is a † very • small ‰ text ‱ for › test
           201F    2020   2022   2030    2031    203A  in Unicode UTF-8 encoding
        

        And this ANSI text :

        This ? is a † very • small ‰ text ? for › test
             ?     0086   0095    0089    ?    009B   in Windows-1252 encoding
        

        IMPORTANT: when this second text is opened in N++, don’t forget to run the Encoding > Convert to ANSI option first!


        Now, we can create the following table, which summarizes the non-ASCII characters used in my examples:

            •--------•-----------------•-----------------•
            |        |   Windows-1252  |     Unicode     |
            |        •--------•--------•--------•--------•
            |  Char  |   Dec  |   Hex  |   Dec  |   Hex  |
            •--------•--------•--------•--------•--------•
            |   ‟    |   ?    |   ?    |  8223  |  201F  |
            |        |        |        |        |        |
            |   †    |  0134  |  0086  |  8224  |  2020  |
            |        |        |        |        |        |
            |   •    |  0149  |  0095  |  8226  |  2022  |
            |        |        |        |        |        |
            |   ‰    |  0137  |  0089  |  8240  |  2030  |
            |        |        |        |        |        |
            |   ‱    |   ?    |   ?    |  8241  |  2031  |
            |        |        |        |        |        |
            |   ›    |  0155  |  009B  |  8250  |  203A  |
            •--------•--------•--------•--------•--------•
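
        The values in this table can be cross-checked with a short script. This is only a sketch in Python (the thread itself contains no code); it assumes Python's `cp1252` codec is an acceptable stand-in for the Windows-1252 "ANSI" encoding, and the helper name `cp1252_code` is made up for the example.

```python
# The six non-ASCII characters from the sample text, in table order.
chars = "\u201f\u2020\u2022\u2030\u2031\u203a"

def cp1252_code(ch):
    """Return the Windows-1252 byte value of ch, or None if it has none."""
    try:
        return ch.encode("cp1252")[0]
    except UnicodeEncodeError:
        return None

# One (character, Windows-1252 value or None, Unicode code point) row each.
rows = [(ch, cp1252_code(ch), ord(ch)) for ch in chars]

for ch, ansi, uni in rows:
    # Characters with no Windows-1252 equivalent show "?" as in the table.
    ansi_txt = "?" if ansi is None else f"{ansi:04d} / {ansi:04X}"
    print(f"U+{uni:04X}  cp1252: {ansi_txt:<11}  unicode dec: {uni}")
```

        The two characters reported as `?` are the ones that cannot be stored in a Windows-1252 file at all, which is why they appear as `?` in the ANSI sample.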
        

        • In Notepad++ :

          • Within an ANSI file, the regexes [†-‰] or [\x86-\x89] would find only the characters † and ‰, but not the •, whose Win-1252 code ( \x95 ) comes after \x89

          • Within a UTF-8 file, the regexes [†-‰] or [\x{2020}-\x{2030}] would find the characters † and ‰ and also the •, whose Unicode code point ( 2022 ) lies between 2020 and 2030

        • In Columns++ :

          • Within an ANSI file, the regexes [†-‰] or [\x{2020}-\x{2030}] would find the characters † and ‰ and also the •, whose Unicode code point ( 2022 ) lies between 2020 and 2030

          • Within a UTF-8 file, the regexes [†-‰] or [\x{2020}-\x{2030}] would find the characters † and ‰ and also the •, whose Unicode code point ( 2022 ) lies between 2020 and 2030

        Note that with the wider range [†-›] within an ANSI file, an N++ search for the • char would have been successful, because its Win-1252 code ( \x95 ) lies within the \x86 to \x9B range!
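
        The difference between the two matching models can be reproduced outside the editor. The sketch below uses Python's `re` module, which is not the regex engine of Notepad++ or Columns++; it only assumes that a search on the cp1252-encoded bytes is a fair model of byte-value matching, and that a search on the decoded string is a fair model of code-point matching.

```python
import re

# The part of the sample text that exists in Windows-1252.
text = "This is a \u2020 very \u2022 small \u2030 text for \u203a test"
ansi_bytes = text.encode("cp1252")

# Code-point model: the range runs from U+2020 to U+2030,
# so the bullet (U+2022) falls inside it.
code_point_hits = re.findall("[\u2020-\u2030]", text)

# Byte-value model: the same two endpoints are 0x86 and 0x89,
# so the bullet (0x95) falls outside the range.
byte_hits = re.findall(rb"[\x86-\x89]", ansi_bytes)

# Wider range dagger..angle quote is 0x86..0x9B in bytes,
# which does include the bullet at 0x95.
wide_byte_hits = re.findall(rb"[\x86-\x9B]", ansi_bytes)

print([f"U+{ord(c):04X}" for c in code_point_hits])
print([b.hex() for b in byte_hits])
print([b.hex() for b in wide_byte_hits])
```

        In the byte-value model, whether the • matches depends on where the range endpoints happen to sit in Windows-1252; in the code-point model the result is the same for any encoding that can store the characters.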


        Now, @coises, I cannot easily test the CJK behaviour of your new search engine, as I obviously do not have a default CJK code page, which such a study needs! However, I do not see why your new search behaviour couldn’t be applied to any kind of Unicode chars ;-)

        Best Regards,

        guy038
