Columns++: Where regex meets Unicode (Here there be dragons!)
-
There is still much to do, but I decided it was time for another experimental release:
Columns++ version 1.1.5.2-experimental
Comments, observations and suggestions are most welcome!
Here is what is expected to be true of regular expressions in this release:
Matching is based on Unicode code points. (In Notepad++ search, matching is based on UTF-16 code units.)
You can use hexadecimal numbers for code points outside the basic multilingual plane (e.g., \x{1F642} for 🙂). This works in both the find and the replace fields.
The character classes documented for Unicode work, with the exception of Cs/Surrogate. (Unpaired surrogates cannot yield valid UTF-8; Scintilla displays attempts to encode them, aka WTF-8, as three invalid bytes, and this regular expression implementation treats them the same way.)
These escapes are added:
\i - matches invalid UTF-8 characters. (You can also use [:invalid:].)
\m - matches “marks”: Unicode characters that combine graphically with the previous character.
\o - matches ASCII characters (code points 0-127).
\y - matches defined characters (all except unassigned, invalid and private use).
\I, \M, \O and \Y match the complements of those classes.
\X should now always match exactly one graphical character. It is equivalent to \M\m*. (Notepad++ search supports \X, but it does not work well for characters outside the basic multilingual plane.)
The period matches any single Unicode code point except new line and return; since those characters, and only those characters, end a visible line in Notepad++, this makes . match all characters within a single visible line and no line breaks, which is consistent with the documentation. (In Notepad++ search, despite the documentation, . does not match form feed, next line, line separator, paragraph separator, new line or carriage return.)
Unicode no longer classifies the Mongolian vowel separator as a white space character. The escapes \h and \s and the [:space:] character class do not match it. (In Notepad++ search they do match.)
The [:blank:] character class matches horizontal white space, equivalent to \h. (In Notepad++, [:blank:] matches tab, space, non-breaking space, ideographic space and zero-width no-break space, but not other horizontal white space.)
All the control character and non-printing character abbreviations that are shown (depending on View | Show Symbol settings) in reverse colors can be used as symbolic character names: e.g., [[.NBSP.]] will find non-breaking spaces.
Things that still don’t work:
Equivalence classes ([[=x=]]) don’t work for characters outside the basic multilingual plane.
Unicode character names (e.g., [[.CYRILLIC SMALL LETTER RHA.]]) don’t work.
Attempting Select All (dropdown from Count) on something that matches tens of thousands of times or more will hang the application. (It will eventually complete, but can take as long as several minutes.) For reasons not yet known to me, this is much slower than a replace of an equivalent number of matches; which, in turn, is slower than the same action in Notepad++ search.
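As an illustration of the code point versus code unit distinction and of the WTF-8 remark in the post above, here is a minimal Python sketch. It is not code from Columns++ or Scintilla; it only uses Python's standard codecs, and the `is_valid_utf8` helper is a hypothetical name introduced for this example.

```python
# U+1F642 lies outside the basic multilingual plane.
smiley = "\U0001F642"

code_points = len(smiley)                            # Python strings count code points
utf16_units = len(smiley.encode("utf-16-le")) // 2   # two bytes per UTF-16 code unit
utf8_bytes = len(smiley.encode("utf-8"))

# An unpaired surrogate cannot be encoded as valid UTF-8. Forcing the
# encoding with Python's "surrogatepass" handler yields the three-byte
# form that the post refers to as WTF-8.
lone_surrogate = "\ud83d"
wtf8 = lone_surrogate.encode("utf-8", "surrogatepass")


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(code_points, utf16_units, utf8_bytes, len(wtf8), is_valid_utf8(wtf8))
```

A search that counts UTF-16 code units sees the smiley as two items, while a search based on code points sees one, which is the difference the post describes.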
-
Hello, @coises and All,
Since 11h till now ( 23h30, in France ), I’ve been testing your second version and, so far, everything works correctly, as you explained in your documentation ;-))
I’m going to stop for a bite to eat and resume the tests immediately afterwards.
( Note that my wife is staying away, with her mother for eight days. So, I’m enjoying my freedom !! )
Best Regards,
guy038
-
@guy038 said in Columns++: Where regex meets Unicode (Here there be dragons!):
I’ve been testing
Thank you so much for helping with this! I really appreciate it.
I’m working now on better case insensitive matching (I think it can’t be working outside the BMP in 1.1.5.2, but I’m a bit out of my depth since I know nothing about non-Latin alphabets, outside of knowing Greek lower-case sigma has a different form when it’s at the end of a word), better “equivalence” ([[=x=]]) outside the BMP (as far as I can tell, that is locale dependent, which again leaves me with no idea how to tell if it’s working correctly; I’m just a dumb American), and speed.
-
Hello, @coises,
I’ve just finished testing and, after dinner, I’ll elaborate my reply. But, rest assured : you did awesome work in this second experimental version ;-))
Best Regards,
guy038
-
Hello, @coises and All,
Presently, with the default Boost regex engine, 26 collating elements ONLY can be found :
[[.NUL.][.SOH.][.STX.][.ETX.][.EOT.][.ENQ.][.ACK.][.SO.][.SI.][.DLE.][.DC1.][.DC2.][.DC3.][.DC4.][.NAK.][.SYN.][.ETB.][.CAN.][.EM.][.SUB.][.ESC.][.IS4.][.IS3.][.IS2.][.IS1.][.DEL.]]
With your second experimental version of Columns++, 114 collating elements can be found. Whaouh !
[[.NUL.][.SOH.][.STX.][.ETX.][.EOT.][.ENQ.][.ACK.][.BEL.][.BS.][.HT.][.VT.][.FF.][.SO.][.SI.][.DLE.][.DC1.][.DC2.][.DC3.][.DC4.][.NAK.][.SYN.][.ETB.][.CAN.][.EM.][.SUB.][.ESC.][.FS.][.GS.][.RS.][.US.][.DEL.][.PAD.][.HOP.][.BPH.][.NBH.][.IND.][.NEL.][.SSA.][.ESA.][.HTS.][.HTJ.][.LTS.][.PLD.][.PLU.][.RI.][.SS2.][.SS3.][.DCS.][.PU1.][.PU2.][.STS.][.CCH.][.MW.][.SPA.][.EPA.][.SOS.][.SGCI.][.SCI.][.CSI.][.ST.][.OSC.][.PM.][.APC.][.NBSP.][.SHY.][.ALM.][.SAM.][.OSPM.][.MVS.][.NQSP.][.MQSP.][.ENSP.][.EMSP.][.3/MSP.][.4/MSP.][.6/MSP.][.FSP.][.PSP.][.THSP.][.HSP.][.ZWSP.][.ZWNJ.][.ZWJ.][.LRM.][.RLM.][.LS.][.PS.][.LRE.][.RLE.][.PDF.][.LRO.][.RLO.][.NNBSP.][.MMSP.][.WJ.][.(FA).][.(IT).][.(IS).][.(IP).][.LRI.][.RLI.][.FSI.][.PDI.][.ISS.][.ASS.][.IAFS.][.AAFS.][.NADS.][.NODS.][.IDSP.][.ZWNBSP.][.IAA.][.IAS.][.IAT.]]
However, the following FOUR ones cannot be reached although they are format characters ( NOT important )
| 1BCA0 | SHORTHAND FORMAT LETTER OVERLAP | [.SFLO.] |
| 1BCA1 | SHORTHAND FORMAT CONTINUING OVERLAP | [.SFCO.] |
| 1BCA2 | SHORTHAND FORMAT DOWN STEP | [.SFDS.] |
| 1BCA3 | SHORTHAND FORMAT UP STEP | [.SFUS.] |
Now, against the Total_Chars.txt file, all the following results are totally correct :

(?s). = \I = \p{Any} = [\x{0000}-\x{EFFFD}] => 325,590

\p{Unicode} = [[:Unicode:]] => 325,334
\P{Unicode} = [[:^Unicode:]] => 256
Total = 325,590

\p{Ascii} = (?s)\o => 128
\P{Ascii} = \O => 325,462
Total = 325,590

\X = \M\m* => \X = \M => 323,089
\m => 2,501
Total = 325,590

[\x{E000}-\x{F8FF}]|\y = \p{Assigned} => 161,463
(?![\x{E000}-\x{F8FF}])\Y = \p{Not Assigned} => 164,127
Total = 325,590
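The \X = \M\m* line above says that the number of \X matches equals the number of non-mark characters. A rough Python sketch of that idea follows. It is only an approximation built on the standard `unicodedata` module, not the plugin's implementation; `is_mark` and `clusters` are names invented for this example.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """True for general categories Mn, Mc and Me (what the thread calls \\m)."""
    return unicodedata.category(ch).startswith("M")


def clusters(text: str) -> list[str]:
    """Group each non-mark with the marks that follow it, like \\M\\m*.

    Assumption of this sketch: a mark with no preceding base character
    starts its own cluster, whereas the regex \\M\\m* would not match it.
    """
    result: list[str] = []
    for ch in text:
        if is_mark(ch) and result:
            result[-1] += ch      # a mark attaches to the previous cluster
        else:
            result.append(ch)     # a non-mark starts a new cluster
    return result


sample = "e\u0301a\u0308\u0323z"  # e + acute, a + diaeresis + dot below, z
parts = clusters(sample)
non_marks = sum(1 for ch in sample if not is_mark(ch))
marks = sum(1 for ch in sample if is_mark(ch))
print(len(parts), non_marks, marks)
```

In this sample the cluster count equals the non-mark count, and non-marks plus marks add up to the total number of code points, which mirrors the 323,089 + 2,501 = 325,590 split reported above.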
Regarding \m, for example, the regexes (?=[\x{0300}-\x{036F}])\m or (?=\m)[\x{0300}-\x{036F}] would return 112 occurrences, i.e. all Mark characters of the COMBINING DIACRITICAL MARKS Unicode block ( refer https://www.unicode.org/charts/PDF/U0300.pdf )
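The figure of 112 can be cross-checked outside Notepad++. The sketch below uses Python's `unicodedata` module as an independent source of general categories; it assumes nothing about how Columns++ computes \m.

```python
import unicodedata

# COMBINING DIACRITICAL MARKS block: U+0300..U+036F
block = range(0x0300, 0x0370)

# Keep every code point whose general category is Mn, Mc or Me.
marks_in_block = [
    cp for cp in block if unicodedata.category(chr(cp)).startswith("M")
]
print(len(marks_in_block))
```

Every code point in the block is a mark, so the count is simply the size of the block.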
Here are the correct results, concerning all the Posix character classes, against the Total_Chars.txt file :

[[:ascii:]]   an UNDER \x{0080} character   128   = [\x{0000}-\x{007F}] = \p{ascii}
[[:unicode:]] = \p{unicode}   an OVER \x{00FF} character   325,334   = [\x{0100}-\x{EFFFD}] ( in 'Total_Chars.txt' )
[[:space:]] = \p{space} = [[:s:]] = \p{s} = \ps = \s   a WHITE-SPACE character   25 (26)   = [\t\n\x{000B}\f\r \x{0085}\x{00A0}\x{1680}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}]
[[:h:]] = \p{h} = \ph = \h   an HORIZONTAL white space character   18 (19)   = [\t \x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}] = \p{Zs}|\t
[[:blank:]] = \p{blank}   a BLANK character   18 ( 5)   = [\t \x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}] = \p{Zs}|\t
[[:v:]] = \p{v} = \pv = \v   a VERTICAL white space character   7   = [\n\x{000B}\f\r\x{0085}\x{2028}\x{2029}]
[[:cntrl:]] = \p{cntrl}   a CONTROL code character   235 (99)   = [\x{0000}-\x{001F}\x{007F}\x{0080}-\x{009F}\x{00AD}....]   Should be 65 like \p{Cc}
[[:upper:]] = \p{upper} = [[:u:]] = \p{u} = \pu = \u   an UPPER case letter   1,858 (927 + 31)   = \p{Lu}
[[:lower:]] = \p{lower} = [[:l:]] = \p{l} = \pl = \l   a LOWER case letter   2,258 (1,216 + 31)   = \p{Ll}
a DI-GRAPHIC letter   31 (0)   = \p{Lt}
a MODIFIER letter   404   = \p{Lm}
an OTHER letter + SYLLABLES / IDEOGRAPHS   136,477   = \p{Lo}
[[:digit:]] = \p{digit} = [[:d:]] = \p{d} = \pd = \d   a DECIMAL number   760 (313)   = \p{Nd}
_ = \x{005F}   the LOW_LINE character   1
-----------
[[:word:]] = \p{word} = [[:w:]] = \p{w} = \pw = \w   a WORD character   141,789 (48,031)   = \p{L*}|\p{nd}|_
[[:alnum:]] = \p{alnum}   an ALPHANUMERIC character   141,788 (48,030)   = \p{L*}|\p{nd}
[[:alpha:]] = \p{alpha}   any LETTER character   141,028 (47,717)   = \p{L*}
[[:graph:]] = \p{graph}   any VISIBLE character   154,809 (62,671)
[[:print:]] = \p{print}   any PRINTABLE character   154,834 (48,579)   = [[:graph:]]|\s
[[:punct:]] = \p{punct}   any PUNCTUATION character   9,369 (528)
[[:xdigit:]]   an HEXADECIMAL character   22   = [0-9A-Fa-f]
Note that, between parentheses, I indicated the present Boost results, which are mostly erroneous !
BTW, there are 31 di-graph characters, which can be considered both as upper case and lower case letters, and which can be found with the Unicode character class \p{Lt}. Our present Boost regex engine correctly counts them, both, as upper and lower case letters !
However, an odd thing is the result of the [[:cntrl:]] character class : normally, as I said above, it should be 65, like \p{Cc} ( 32 for the C0 controls + DEL + 32 for the C1 control codes ) !
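The two counts mentioned above (31 title case letters and 65 control codes) can be checked against Python's copy of the Unicode database. This is only a cross-check sketch; `code_points_in` is a hypothetical helper, and the results reflect the Unicode version bundled with the Python interpreter, not the tables used by Boost or Columns++.

```python
import sys
import unicodedata


def code_points_in(category: str) -> list[int]:
    """Hypothetical helper: all code points with the given general category."""
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == category
    ]


titlecase = code_points_in("Lt")   # the "di-graph" letters, e.g. U+01C5
controls = code_points_in("Cc")

c0 = [cp for cp in controls if cp <= 0x1F]            # C0 controls
c1 = [cp for cp in controls if 0x80 <= cp <= 0x9F]    # C1 controls
has_del = 0x7F in controls

# Python's own str methods treat a title case letter as neither
# upper case nor lower case.
sample = chr(titlecase[0])
print(len(titlecase), len(controls), len(c0), len(c1), has_del)
print(sample.isupper(), sample.islower(), sample.istitle())
```

Whether a regex engine should count the \p{Lt} characters as upper case, lower case, both or neither is a design decision; the sketch only shows what one other implementation does.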
And here are the correct results regarding the Unicode character classes, against the Total_Chars.txt file :

\p{Any}   any character   325,590   = (?s). = \I = [\x{0000}-\x{EFFFD}]
\p{Ascii}   a character UNDER \x80   128
\p{Assigned}   an ASSIGNED character   161,463

\p{Cc} \p{Control}   a C0 or C1 CONTROL code character   65
\p{Cf} \p{Format}   a FORMAT CONTROL character   170
\p{Cn} \p{Not Assigned}   an UNASSIGNED or NON-CHARACTER character   164,127   ( 'Total_Chars.txt' does NOT contain the 66 NON-CHARACTER chars )
\p{Co} \p{Private Use}   a PRIVATE-USE character   6,400
\p{Cs} \p{Surrogate}   a SURROGATE character   ( ERROR) ( 2,048 )   ( 'Total_Chars.txt' does NOT contain the 2,048 SURROGATE chars )
-----------
\p{C*} \p{Other}   170,762   = \p{Cc}|\p{Cf}|\p{Cn}|\p{Co}

\p{Lu} \p{Uppercase Letter}   an UPPER case letter   1,858
\p{Ll} \p{Lowercase Letter}   a LOWER case letter   2,258
\p{Lt} \p{Titlecase}   a DI-GRAPHIC letter   31
\p{Lm} \p{Modifier Letter}   a MODIFIER letter   404
\p{Lo} \p{Other Letter}   OTHER LETTER, including SYLLABLES and IDEOGRAPHS   136,477
-----------
\p{L*} \p{Letter}   141,028   = \p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}

\p{Mc} \p{Spacing Combining Mark}   a SPACING COMBINING mark (POSITIVE advance width)   468
\p{Me} \p{Enclosing Mark}   an ENCLOSING COMBINING mark   13
\p{Mn} \p{Non-Spacing Mark}   a NON-SPACING COMBINING mark (ZERO advance width)   2,020
---------
\p{M*} \p{Mark}   2,501   = \p{Mc}|\p{Me}|\p{Mn} = \m

\p{Nd} \p{Decimal Digit Number}   a DECIMAL number character   760
\p{Nl} \p{Letter Number}   a LETTERLIKE numeric character   236
\p{No} \p{Other Number}   OTHER NUMERIC character   915
---------
\p{N*} \p{Number}   1,911   = \p{Nd}|\p{Nl}|\p{No}

\p{Pd} \p{Dash Punctuation}   a DASH or HYPHEN punctuation mark   27
\p{Ps} \p{Open Punctuation}   an OPENING PUNCTUATION mark in a pair   79
\p{Pc} \p{Connector Punctuation}   a CONNECTING PUNCTUATION mark   10
\p{Pe} \p{Close Punctuation}   a CLOSING PUNCTUATION mark in a pair   77
\p{Pi} \p{Initial Punctuation}   an INITIAL QUOTATION mark   12
\p{Pf} \p{Final Punctuation}   a FINAL QUOTATION mark   10
\p{Po} \p{Other Punctuation}   OTHER PUNCTUATION mark   640
-------
\p{P*} \p{Punctuation}   855   = \p{Pd}|\p{Ps}|\p{Pc}|\p{Pe}|\p{Pi}|\p{Pf}|\p{Po}

\p{Sm} \p{Math Symbol}   a MATHEMATICAL symbol character   950
\p{Sc} \p{Currency Symbol}   a CURRENCY character   63
\p{Sk} \p{Modifier Symbol}   a NON-LETTERLIKE MODIFIER character   125
\p{So} \p{Other Symbol}   OTHER SYMBOL character   7,376
-------
\p{S*} \p{Symbol}   8,514   = \p{Sm}|\p{Sc}|\p{Sk}|\p{So}

\p{Zs} \p{Space Separator}   a NON-ZERO width SPACE character   17   = [\x{0020}\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}] = (?!\t)\h
\p{Zl} \p{Line Separator}   the LINE SEPARATOR character   1   = \x{2028}
\p{Zp} \p{Paragraph Separator}   the PARAGRAPH SEPARATOR character   1   = \x{2029}
------
\p{Z*} \p{Separator}   19   = \p{Zs}|\p{Zl}|\p{Zp}
Note that the total of the \p{M*} mark characters is exactly the result given by the \m regex !
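Some rows of the table above do not depend on the very latest Unicode version and are easy to cross-check. The sketch below counts the separator categories with Python's `unicodedata` module and looks up the Mongolian vowel separator discussed earlier in the thread. It assumes a Python build whose Unicode database is version 6.3 or later; `MAX_CP` and `count` are names chosen for this example.

```python
import unicodedata

# Upper bound taken from the thread's [\x{0000}-\x{EFFFD}] range.
MAX_CP = 0xEFFFD


def count(category: str) -> int:
    """Number of code points up to MAX_CP with the given general category."""
    return sum(
        1
        for cp in range(MAX_CP + 1)
        if unicodedata.category(chr(cp)) == category
    )


separators = {cat: count(cat) for cat in ("Zs", "Zl", "Zp")}
total_separators = sum(separators.values())

# MONGOLIAN VOWEL SEPARATOR: a format character, no longer a space.
mvs_category = unicodedata.category("\u180e")
print(separators, total_separators, mvs_category)
```

The separator total matches the \p{Z*} row, and the category of U+180E is consistent with \h, \s and [:space:] no longer matching it.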
Now, if you follow the procedure explained in the last part of this post :
https://community.notepad-plus-plus.org/post/99844
The regexes [\x{DC80}-\x{DCFF}] or \i or [[:invalid:]] do give 134 occurrences, which is the exact number of invalid characters of this example !
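The range [\x{DC80}-\x{DCFF}] used above has a close analogue in Python, whose `surrogateescape` error handler maps each undecodable byte 0x80..0xFF to U+DC80..U+DCFF. The sketch below uses that handler to count invalid bytes. It illustrates the convention only and makes no claim about how Columns++ represents invalid bytes internally; the sample bytes are made up for the example.

```python
import re

# Sample data: three bytes (\xff, \xfe, \x80) are not valid UTF-8,
# while \xc3\xa9 is a valid two-byte sequence.
raw = b"ok \xff\xfe caf\xc3\xa9 \x80"

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")

# Same range as the regex [\x{DC80}-\x{DCFF}] quoted in the post.
invalid = re.findall("[\udc80-\udcff]", text)

# The mapping is reversible, so the original bytes can be restored.
round_trip = text.encode("utf-8", errors="surrogateescape")
print(len(invalid), round_trip == raw)
```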
In a nutshell, hats off to you ! No problem detected, so far. It’s a major version ! From your last post, I understood that you’re still working for some improvements !
At the end, I’ll put a new version of my Unicode.zip archive, in my Google Drive account, referring to your latest experimental version of ColumnsPlusPlus, which should highly simplify the regex syntax, in order to count or mark all chars of Unicode ranges !
In a next post, I’ll expose two points concerning, more specifically, ColumnsPlusPlus !
Best Regards,
guy038
P.S. :
- A negative POSIX character class can be expressed as [^[:........:]] or [[:^........:]]
- A negative UNICODE character class can be expressed as \P{..}
-
Hi, @coises,
Two points :
- Seemingly, if I select all the text of a regex, it does not appear automatically in the Find What : zone and I need a Ctrl + C / Ctrl + V operation. Is this on purpose ?
- You may enter a very long line of text in the Find What : zone. I verified that you can add up to 30,000 chars if it does not contain any line-break. So my question is : Is it possible to enter a multi-lines text in the search zone of ColumnsPlusPlus ?
TIA,
BR
guy038
-
@guy038 said in Columns++: Where regex meets Unicode (Here there be dragons!):
With your second experimental version of Columns++, 114 collating elements can be found. Whaouh !
There are actually 116 in that group, since [.LF.] and [.CR.] are also included.
However, the following FOUR ones cannot be reached although they are format characters ( NOT important )
| 1BCA0 | SHORTHAND FORMAT LETTER OVERLAP | [.SFLO.] |
| 1BCA1 | SHORTHAND FORMAT CONTINUING OVERLAP | [.SFCO.] |
| 1BCA2 | SHORTHAND FORMAT DOWN STEP | [.SFDS.] |
| 1BCA3 | SHORTHAND FORMAT UP STEP | [.SFUS.] |
I’ll add those.
BTW, there are 31 di-graph characters, which can be considered both as upper case and lower case letters, and which can be found with the Unicode character class \p{Lt}. Our present Boost regex engine correctly counts them, both, as upper and lower case letters !
I see that indeed, when using Notepad++, (?-i)\u(?<=\l) matches 31 characters. I’m not yet convinced that is desirable, though. Shouldn’t \l / \u, [:lower:] / [:upper:] and [:Ll:] / [:Lu:] all be the same? The title case characters are [:Lt:].
However, an odd thing is the result of the [[:cntrl:]] character class : normally, as I said above, it should be 65, like \p{Cc} ( 32 for the C0 controls + DEL + 32 for the C1 control codes ) !
It wasn’t clear to me how the POSIX [:cntrl:] definition should be applied to Unicode. Notepad++ search appears to include most of the Cc and Cf characters in the basic multilingual plane, so I made it Cc + Cf. I’ll change that to Cc only.
Thank you so much for looking so closely at all of this!
-
@guy038 said in Columns++: Where regex meets Unicode (Here there be dragons!):
- Seemingly, if I select all the text of a regex, it does not appear automatically in the Find What : zone and I need a Ctrl + C / Ctrl + V operation. Is this on purpose ?
It is. My original motivation for including a search function in Columns++ was to make it possible to search in rectangular selections, something Notepad++ search will not do. Thus, I expected that the most common way of using it would be to select the rectangular block in which you want to search, then open the dialog. (In a rectangular selection, ^ and $ match the beginning and end of the selection in each row. I attempt to explain the whole thing here.)
After some feedback, I made it so that the initial selection is used to set an indicator to the region to be searched. That made sequential finds make a lot more sense.
My search has been subject to “mission creep” as I added formulas in replacement text, the Select options on the Count drop-down, the ability to convert multiple selections to search regions, and now 32-bit Unicode searching. Some day I might make a separate search plugin (or try to make a case for adding these features to Notepad++); for now, the one in Columns++ will be first oriented toward working conveniently with rectangular selections.
- You may enter a very long line of text in the Find What : zone. I verified that you can add up to 30,000 chars if it does not contain any line-break. So my question is : Is it possible to enter a multi-lines text in the search zone of ColumnsPlusPlus ?
At present, no. It’s a good idea, though. I’ve thought of making it possible to open a separate window in which to enter search and replacement expressions, allowing more space and maybe containing a feature to pin frequently-used searches and/or a “builder” that would guide novices in the construction of regular expressions.
That’s getting so complex, though, that I think it might have to wait for that apocryphal day when I build a separate plugin that’s just for search.
-
Hello, @coises and All,
Thanks for adding the 4 remaining elements : so we’ll get a round number of collating elements : 120 !
You said :
Notepad++ search appears to include most of the Cc and Cf characters in the basic multilingual plane, so I made it Cc + Cf. I’ll change that to Cc only
I confirm that, in your second version, [[:cntrl:]] = \p{Cc} + \p{Cf} = 65 + 170 = 235, and thanks for the future modification
You said :
Shouldn’t \l / \u, [:lower:] / [:upper:] and [:Ll:] / [:Lu:] all be the same? The title case characters are [:Lt:].
What do you want to say ? Presently, in your second version, it’s just the case, as shown below or may I miss something obvious !?
[[:upper:]] = \p{upper} = [[:u:]] = \p{u} = \pu = \u = \p{Lu} = \p{Uppercase Letter} = [[:Lu:]]   an UPPER case letter = 1,858
[[:lower:]] = \p{lower} = [[:l:]] = \p{l} = \pl = \l = \p{Ll} = \p{Lowercase Letter} = [[:Ll:]]   a LOWER case letter = 2,258
BTW, I didn’t know that the syntax of an Unicode character class \p{Xy} could also be expressed as [[:Xy:]] !
Best Regards,
guy038
-
@guy038 said in Columns++: Where regex meets Unicode (Here there be dragons!):
You said :
Shouldn’t \l / \u, [:lower:] / [:upper:] and [:Ll:] / [:Lu:] all be the same? The title case characters are [:Lt:].
What do you want to say ? Presently, in your second version, it’s just the case, as shown below or may I miss something obvious !?
I don’t think you missed anything. I think I might have misunderstood you. I thought you were saying that [:lower:] and [:upper:] and/or \l and \u should match the [:Lt:] characters, so that those 31 characters are both upper case and lower case. Perhaps we are agreed that they are neither.
BTW, I didn’t know that the syntax of an Unicode character class \p{Xy} could also be expressed as [[:Xy:]] !
Boost::regex is built such that \p{whatever} and [[:whatever:]] are the same. It also “delegates” backslash lower case letter escapes that don’t have any other meaning to classes with the same name, and upper case escapes without another meaning to the complements; so \s is internally “defined” as [[:s:]] and \S as [^[:s:]]. That’s how I was able to define \i, \m, \o and \y. It’s also why we have to write \p{L*} instead of \p{L}: class names are case insensitive, and “l” already defines \l as lower case. For consistency, all the Unicode general category groups use the asterisk notation.
-
I’m hopeful that the real end goal for all of this is integration into native Notepad++, and that the plugin is really just a “testbed” for what you’re doing. Columns++ is great, but this is about core unicode searching, and as such really belongs in the standard product.
It’s great that a person has finally been found that’s capable of (and interested in) doing this stuff, and it would be a shame if Notepad++ moves forward without the benefits of this work.
Thank you for your work.