Columns++ version 1.3: All Unicode, all the time

Coises

Columns++ version 1.3 brings the enhancements for regular expressions in Unicode documents to ANSI documents as well:

- Regular expressions now match based on Unicode code points in all documents, so the syntax and semantics of regular expressions are no longer dependent on the underlying representation in Scintilla. The features added in version 1.2 for Unicode documents now work in all documents.
- In version 1.2, regular expressions did not work properly in ANSI documents when the system default code page was 932 (Japanese, Shift-JIS), 936 (Chinese Simplified, GB2312), 949 (Korean, Windows-949 / Unified Hangul Code) or 950 (Chinese Traditional, Big5). Regular expressions now match these documents based on Unicode code points.
- Character information tables for regular expressions were updated to reflect Unicode 17.0.0 (released on 2025-09-09).

I haven’t yet set this as “stable” (and of course it isn’t yet in the plugins admin list). Anyone feeling adventuresome is welcome to try it and see if it behaves. There is documentation describing how Unicode-based matching in Columns++ differs from Notepad++ matching.

If anyone reading this routinely has one of the CJK code pages as their system default code page, I would be very interested to know if regular expressions in this version appear to work as expected on your locale’s ANSI files.

guy038

Hello, @coises and All,

I’ve just tried your latest ColumnsPlusPlus v1.3 release and, indeed, the search now behaves as a true Unicode search, whatever the encoding of each individual file !

Let’s consider this simple UTF-8 text :

This ‟ is a † very • small ‰ text ‱ for › test
   201F    2020   2022   2030    2031    203A  in Unicode UTF-8 encoding

And this ANSI text :

This ? is a † very • small ‰ text ? for › test
     ?     0086   0095    0089    ?    009B   in Windows-1252 encoding

IMPORTANT : When this second text is opened in N++, don’t forget to run the Encoding > Convert to ANSI option first !

Now, we can create the following table, which summarizes the non-ASCII characters used in my examples :

    •--------•-----------------•-----------------•
    |        |   Windows-1252  |     Unicode     |
    |        •--------•--------•--------•--------•
    |  Char  |   Dec  |   Hex  |   Dec  |   Hex  |
    •--------•--------•--------•--------•--------•
    |   ‟    |   ?    |   ?    |  8223  |  201F  |
    |        |        |        |        |        |
    |   †    |  0134  |  0086  |  8224  |  2020  |
    |        |        |        |        |        |
    |   •    |  0149  |  0095  |  8226  |  2022  |
    |        |        |        |        |        |
    |   ‰    |  0137  |  0089  |  8240  |  2030  |
    |        |        |        |        |        |
    |   ‱    |   ?    |   ?    |  8241  |  2031  |
    |        |        |        |        |        |
    |   ›    |  0155  |  009B  |  8250  |  203A  |
    •--------•--------•--------•--------•--------•
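
The values in this table can be cross-checked with a few lines of Python. This is only a sketch: the helper name `describe` is made up for illustration, and it assumes that Python's `cp1252` codec corresponds to the Windows-1252 ANSI code page used above.

```python
def describe(ch):
    """Return (Win-1252 dec, Win-1252 hex, Unicode dec, Unicode hex) for one character.

    `describe` is a hypothetical helper, not part of Notepad++ or Columns++.
    Characters that do not exist in Windows-1252 are reported as "?".
    """
    code_point = ord(ch)
    try:
        byte = ch.encode("cp1252")[0]   # single-byte code page: one byte per character
        ansi_dec, ansi_hex = f"{byte:04d}", f"{byte:04X}"
    except UnicodeEncodeError:          # no Windows-1252 equivalent
        ansi_dec = ansi_hex = "?"
    return ansi_dec, ansi_hex, f"{code_point}", f"{code_point:04X}"


for ch in "‟†•‰‱›":
    print(ch, *describe(ch))
```

The two characters that show `?` ( ‟ and ‱ ) are exactly the ones that are lost when the text is converted to ANSI.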

In Notepad++ :
- Within an ANSI file, the regexes [†-‰] or [\x86-\x89] would only find the characters † and ‰ but not the •, whose Win-1252 code ( \x95 ) is after \x89
- Within a UTF-8 file, the regexes [†-‰] or [\x{2020}-\x{2030}] would find the characters † and ‰ and also the •, whose Unicode code-point is between 2020 and 2030
In Columns++ :
- Within an ANSI file, the regexes [†-‰] or [\x{2020}-\x{2030}] would find the characters † and ‰ and also the •, whose Unicode code-point is between 2020 and 2030
- Within a UTF-8 file, the regexes [†-‰] or [\x{2020}-\x{2030}] would find the characters † and ‰ and also the •, whose Unicode code-point is between 2020 and 2030

Note that, using the wider range [†-›] within an ANSI file, a N++ search would also have found the • char, as its Win-1252 code ( 0095 ) lies within the 0086 to 009B range !
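
The difference between the two kinds of range can be reproduced outside of the editor. The sketch below uses Python's `re` module purely as a stand-in: a bytes pattern imitates matching on Windows-1252 byte values, and a str pattern imitates matching on Unicode code points. It is not the engine used by Notepad++ or Columns++, and the variable names are only illustrative.

```python
import re

# The ANSI sample text from above (the two "?" are the characters lost in Windows-1252)
text = "This ? is a † very • small ‰ text ? for › test"
ansi = text.encode("cp1252")            # assumed equivalent of the ANSI buffer


def byte_range_matches(pattern):
    """Match on raw byte values, then decode the hits for display."""
    return [hit.decode("cp1252") for hit in re.findall(pattern, ansi)]


narrow_bytes = byte_range_matches(rb"[\x86-\x89]")   # † (86) to ‰ (89): • (95) is outside
wide_bytes = byte_range_matches(rb"[\x86-\x9B]")     # † (86) to › (9B): • (95) is inside

# Matching on code points: U+2020 to U+2030 includes • (U+2022) but not › (U+203A)
code_points = re.findall("[\u2020-\u2030]", text)

print(narrow_bytes)
print(wide_bytes)
print(code_points)
```

With byte values, whether • is found depends on where \x95 falls relative to the range limits; with code points, the result is the same whatever the encoding of the file.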

Now, @coises, I cannot easily test the CJK behaviour of your new search engine, as I obviously do not have a default CJK code page, which is needed for such a study ! However, I do not see why your new search behaviour couldn’t be applied to any kind of Unicode chars ;-)

Best Regards,

guy038