Columns++: Where regex meets Unicode (Here there be dragons!)
-
There is still much to do, but I decided it was time for another experimental release:
Columns++ version 1.1.5.2-experimental
Comments, observations and suggestions are most welcome!
Here is what is expected to be true of regular expressions in this release:
Matching is based on Unicode code points. (In Notepad++ search, matching is based on UTF-16 code units.)
You can use hexadecimal numbers for code points outside the basic multilingual plane (e.g., \x{1F642} for 🙂). This works in both the find and the replace fields.
The character classes documented for Unicode work, with the exception of Cs/Surrogate. (Unpaired surrogates cannot yield valid UTF-8; Scintilla displays attempts to encode them — aka WTF-8 — as three invalid bytes, and this regular expression implementation treats them the same way.)
These escapes are added:
\i - matches invalid UTF-8 characters. (You can also use [:invalid:].)
\m - matches “marks”: Unicode characters that combine graphically with the previous character.
\o - matches ASCII characters (code points 0-127).
\y - matches defined characters (all except unassigned, invalid and private use).
\I, \M, \O and \Y match the complements of those classes.
\X should now always match exactly one graphical character. It is equivalent to \M\m*. (Notepad++ search supports \X, but it does not work well for characters outside the basic multilingual plane.)
The period matches any single Unicode code point except new line and return; since those characters, and only those characters, end a visible line in Notepad++, this makes . match all characters within a single visible line and no line breaks, which is consistent with the documentation. (In Notepad++ search, despite the documentation, . does not match form feed, next line, line separator, paragraph separator, new line or carriage return.)
Unicode no longer classifies the Mongolian vowel separator as a white space character. The escapes \h and \s and the [:space:] character class do not match it. (In Notepad++ search they do match.) The [:blank:] character class matches horizontal white space, equivalent to \h. (In Notepad++, [:blank:] matches tab, space, non-breaking space, ideographic space and zero-width no-break space, but not other horizontal white space.)
All the control character and non-printing character abbreviations that are shown (depending on View | Show Symbol settings) in reverse colors can be used as symbolic character names: e.g., [[.NBSP.]] will find non-breaking spaces.
Things that still don’t work:
Equivalence classes ([[=x=]]) don’t work for characters outside the basic multilingual plane.
Unicode character names (e.g., [[.CYRILLIC SMALL LETTER RHA.]]) don’t work.
Attempting Select All (dropdown from Count) on something that matches tens of thousands of times or more will hang the application. (It will eventually complete, but can take as long as several minutes.) For reasons not yet known to me, this is much slower than a replace of an equivalent number of matches; which, in turn, is slower than the same action in Notepad++ search.
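For anyone who wants to experiment with this kind of matching outside the plugin, here is a minimal, hypothetical sketch using Boost.Regex’s ICU interface (boost::u32regex). Columns++ does not use ICU; it implements its own UTF-32 handling, so this only illustrates the general idea of matching a code point above the BMP with \x{...}:

// Hedged sketch only: Columns++ implements its own UTF-32 matching and does
// not use ICU; this just shows the same \x{...} idea with Boost.Regex's
// documented ICU interface (requires linking Boost.Regex and ICU).
#include <boost/regex/icu.hpp>
#include <iostream>
#include <string>

int main() {
    std::string text = "before \xF0\x9F\x99\x82 after";         // UTF-8 bytes of U+1F642
    boost::u32regex re = boost::make_u32regex("\\x{1F642}");     // code point above the BMP
    boost::smatch m;
    if (boost::u32regex_search(text, m, re))
        std::cout << "matched " << m.length(0) << " bytes at offset " << m.position(0) << "\n";
    return 0;
}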
-
Hello, @coises and All,
Since 11h till now ( 23h30, in France ), I’ve been testing your second version and, so far, everything works correctly, as you explained in your documentation ;-))
I’m going to stop for a bite to eat and resume the tests immediately afterwards.
( Note that my wife is staying away, with her mother for eight days. So, I’m enjoying my freedom !! )
Best Regards,
guy038
-
@guy038 said in Columns++: Where regex meets Unicode (Here there be dragons!):
I’ve been testing
Thank you so much for helping with this! I really appreciate it.
I’m working now on better case insensitive matching (I think it can’t be working outside the BMP in 1.1.5.2, but I’m a bit out of my depth since I know nothing about non-Latin alphabets, outside of knowing Greek lower-case sigma has a different form when it’s at the end of a word), better “equivalence” ([[=x=]]) outside the BMP (as far as I can tell, that is locale dependent, which again leaves me with no idea how to tell if it’s working correctly — I’m just a dumb American), and speed.
-
Hello, @coises,
I’ve just finished testing and, after dinner, I’ll elaborate my reply. But, rest assured: you did an awesome job in this second experimental version ;-))
Best Regards,
guy038
-
Hello, @coises and All,
Presently, with the default Boost regex engine, ONLY 26 collating elements can be found:
[[.NUL.][.SOH.][.STX.][.ETX.][.EOT.][.ENQ.][.ACK.][.SO.][.SI.][.DLE.][.DC1.][.DC2.][.DC3.][.DC4.][.NAK.][.SYN.][.ETB.][.CAN.][.EM.][.SUB.][.ESC.][.IS4.][.IS3.][.IS2.][.IS1.][.DEL.]]
With your second experimental version of Columns++, 114 collating elements can be found. Wow!
[[.NUL.][.SOH.][.STX.][.ETX.][.EOT.][.ENQ.][.ACK.][.BEL.][.BS.][.HT.][.VT.][.FF.][.SO.][.SI.][.DLE.][.DC1.][.DC2.][.DC3.][.DC4.][.NAK.][.SYN.][.ETB.][.CAN.][.EM.][.SUB.][.ESC.][.FS.][.GS.][.RS.][.US.][.DEL.][.PAD.][.HOP.][.BPH.][.NBH.][.IND.][.NEL.][.SSA.][.ESA.][.HTS.][.HTJ.][.LTS.][.PLD.][.PLU.][.RI.][.SS2.][.SS3.][.DCS.][.PU1.][.PU2.][.STS.][.CCH.][.MW.][.SPA.][.EPA.][.SOS.][.SGCI.][.SCI.][.CSI.][.ST.][.OSC.][.PM.][.APC.][.NBSP.][.SHY.][.ALM.][.SAM.][.OSPM.][.MVS.][.NQSP.][.MQSP.][.ENSP.][.EMSP.][.3/MSP.][.4/MSP.][.6/MSP.][.FSP.][.PSP.][.THSP.][.HSP.][.ZWSP.][.ZWNJ.][.ZWJ.][.LRM.][.RLM.][.LS.][.PS.][.LRE.][.RLE.][.PDF.][.LRO.][.RLO.][.NNBSP.][.MMSP.][.WJ.][.(FA).][.(IT).][.(IS).][.(IP).][.LRI.][.RLI.][.FSI.][.PDI.][.ISS.][.ASS.][.IAFS.][.AAFS.][.NADS.][.NODS.][.IDSP.][.ZWNBSP.][.IAA.][.IAS.][.IAT.]]
However, the following FOUR ones cannot be reached although they are format characters ( NOT important )
| 1BCA0 | SHORTHAND FORMAT LETTER OVERLAP     | [.SFLO.] |
| 1BCA1 | SHORTHAND FORMAT CONTINUING OVERLAP | [.SFCO.] |
| 1BCA2 | SHORTHAND FORMAT DOWN STEP          | [.SFDS.] |
| 1BCA3 | SHORTHAND FORMAT UP STEP            | [.SFUS.] |
Now, against the Total_Chars.txt file, all the following results are totally correct:

(?s). = \I = \p{Any} = [\x{0000}-\x{EFFFD}]    =>  325,590

\p{Unicode} = [[:Unicode:]]                    =>  325,334 |
                                                           | Total = 325,590
\P{Unicode} = [[:^Unicode:]]                   =>      256 |

\p{Ascii} = (?s)\o                             =>      128 |
                                                           | Total = 325,590
\P{Ascii} = \O                                 =>  325,462 |

\X = \M\m*  =>  \X = \M                        =>  323,089 |
                                                           | Total = 325,590
\m                                             =>    2,501 |

[\x{E000}-\x{F8FF}]|\y = \p{Assigned}          =>  161,463 |
                                                           | Total = 325,590
(?![\x{E000}-\x{F8FF}])\Y = \p{Not Assigned}   =>  164,127 |
Regarding \m, for example, the regexes (?=[\x{0300}-\x{036F}])\m or (?=\m)[\x{0300}-\x{036F}] would return 112 occurrences, i.e. all Mark characters of the COMBINING DIACRITICAL MARKS Unicode block ( refer https://www.unicode.org/charts/PDF/U0300.pdf )
Here are the correct results, concerning all the POSIX character classes, against the Total_Chars.txt file:
[[:ascii:]]                                            an UNDER \x{0080} character          128        = [\x{0000}-\x{007F}] = \p{ascii}
[[:unicode:]] = \p{unicode}                            an OVER \x{00FF} character       325,334        = [\x{0100}-\x{EFFFD}] ( in 'Total_Chars.txt' )
[[:space:]] = \p{space} = [[:s:]] = \p{s} = \ps = \s   a WHITE-SPACE character               25 (26)    = [\t\n\x{000B}\f\r \x{0085}\x{00A0}\x{1680}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}\x{3000}]
[[:h:]] = \p{h} = \ph = \h                             a HORIZONTAL white space character    18 (19)    = [\t \x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}] = \p{Zs}|\t
[[:blank:]] = \p{blank}                                a BLANK character                     18 ( 5)    = [\t \x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}] = \p{Zs}|\t
[[:v:]] = \p{v} = \pv = \v                             a VERTICAL white space character       7         = [\n\x{000B}\f\r\x{0085}\x{2028}\x{2029}]
[[:cntrl:]] = \p{cntrl}                                a CONTROL code character             235 (99)    = [\x{0000}-\x{001F}\x{007F}\x{0080}-\x{009F}\x{00AD}....]   Should be 65 like \p{Cc}
[[:upper:]] = \p{upper} = [[:u:]] = \p{u} = \pu = \u   an UPPER case letter               1,858 (927 + 31)   = \p{Lu}
[[:lower:]] = \p{lower} = [[:l:]] = \p{l} = \pl = \l   a LOWER case letter                2,258 (1,216 + 31)  = \p{Ll}
                                                       a DI-GRAPHIC letter                   31 (0)     = \p{Lt}
                                                       a MODIFIER letter                    404         = \p{Lm}
                                                       an OTHER letter + SYLLABLES / IDEOGRAPHS   136,477    = \p{Lo}
[[:digit:]] = \p{digit} = [[:d:]] = \p{d} = \pd = \d   a DECIMAL number                     760 (313)    = \p{Nd}
_ = \x{005F}                                           the LOW_LINE character                 1
-----------
[[:word:]] = \p{word} = [[:w:]] = \p{w} = \pw = \w     a WORD character                 141,789 (48,031)   = \p{L*}|\p{nd}|_
[[:alnum:]] = \p{alnum}                                an ALPHANUMERIC character        141,788 (48,030)   = \p{L*}|\p{nd}
[[:alpha:]] = \p{alpha}                                any LETTER character             141,028 (47,717)   = \p{L*}
[[:graph:]] = \p{graph}                                any VISIBLE character            154,809 (62,671)
[[:print:]] = \p{print}                                any PRINTABLE character          154,834 (48,579)   = [[:graph:]]|\s
[[:punct:]] = \p{punct}                                any PUNCTUATION character          9,369 (528)
[[:xdigit:]]                                           a HEXADECIMAL character               22           = [0-9A-Fa-f]
Note that, between parentheses, I indicated the present Boost results, which are mostly erroneous !
BTW, there are 31 di-graph characters, which are considered as both upper case and lower case letters, and which can be found with the Unicode character class \p{Lt}. With our present Boost regex engine, each of them is correctly counted both as an upper case and as a lower case letter!
However, an odd thing is the result of the [[:cntrl:]] character class: normally, as I said above, it should be 65, i.e. \p{Cc} ( 32 for the C0 controls + DEL + 32 for the C1 control codes )!
And here are the correct results regarding the Unicode character classes, against the Total_Chars.txt file:
\p{Any}                               any character                                       325,590   = (?s). = \I = [\x{0000}-\x{EFFFD}]
\p{Ascii}                             a character UNDER \x80                                  128
\p{Assigned}                          an ASSIGNED character                               161,463
\p{Cc}  \p{Control}                   a C0 or C1 CONTROL code character                        65
\p{Cf}  \p{Format}                    a FORMAT CONTROL character                              170
\p{Cn}  \p{Not Assigned}              an UNASSIGNED or NON-CHARACTER character            164,127   ( 'Total_Chars.txt' does NOT contain the 66 NON-CHARACTER chars )
\p{Co}  \p{Private Use}               a PRIVATE-USE character                               6,400
\p{Cs}  \p{Surrogate}                 a SURROGATE character                               ( ERROR)  ( 2,048 ) ( 'Total_Chars.txt' does NOT contain the 2,048 SURROGATE chars )
-----------
\p{C*}  \p{Other}                                                                         170,762   = \p{Cc}|\p{Cf}|\p{Cn}|\p{Co}
\p{Lu}  \p{Uppercase Letter}          an UPPER case letter                                  1,858
\p{Ll}  \p{Lowercase Letter}          a LOWER case letter                                   2,258
\p{Lt}  \p{Titlecase}                 a DI-GRAPHIC letter                                      31
\p{Lm}  \p{Modifier Letter}           a MODIFIER letter                                       404
\p{Lo}  \p{Other Letter}              OTHER LETTER, including SYLLABLES and IDEOGRAPHS     136,477
-----------
\p{L*}  \p{Letter}                                                                        141,028   = \p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}
\p{Mc}  \p{Spacing Combining Mark}    a SPACING COMBINING mark (POSITIVE advance width)        468
\p{Me}  \p{Enclosing Mark}            an ENCLOSING COMBINING mark                               13
\p{Mn}  \p{Non-Spacing Mark}          a NON-SPACING COMBINING mark (ZERO advance width)      2,020
---------
\p{M*}  \p{Mark}                                                                            2,501   = \p{Mc}|\p{Me}|\p{Mn} = \m
\p{Nd}  \p{Decimal Digit Number}      a DECIMAL number character                              760
\p{Nl}  \p{Letter Number}             a LETTERLIKE numeric character                          236
\p{No}  \p{Other Number}              OTHER NUMERIC character                                 915
---------
\p{N*}  \p{Number}                                                                          1,911   = \p{Nd}|\p{Nl}|\p{No}
\p{Pd}  \p{Dash Punctuation}          a DASH or HYPHEN punctuation mark                        27
\p{Ps}  \p{Open Punctuation}          an OPENING PUNCTUATION mark in a pair                    79
\p{Pc}  \p{Connector Punctuation}     a CONNECTING PUNCTUATION mark                            10
\p{Pe}  \p{Close Punctuation}         a CLOSING PUNCTUATION mark in a pair                     77
\p{Pi}  \p{Initial Punctuation}       an INITIAL QUOTATION mark                                12
\p{Pf}  \p{Final Punctuation}         a FINAL QUOTATION mark                                   10
\p{Po}  \p{Other Punctuation}         OTHER PUNCTUATION mark                                  640
-------
\p{P*}  \p{Punctuation}                                                                       855   = \p{Pd}|\p{Ps}|\p{Pc}|\p{Pe}|\p{Pi}|\p{Pf}|\p{Po}
\p{Sm}  \p{Math Symbol}               a MATHEMATICAL symbol character                         950
\p{Sc}  \p{Currency Symbol}           a CURRENCY character                                     63
\p{Sk}  \p{Modifier Symbol}           a NON-LETTERLIKE MODIFIER character                     125
\p{So}  \p{Other Symbol}              OTHER SYMBOL character                                7,376
\p{S*}  \p{Symbol}                                                                          8,514   = \p{Sm}|\p{Sc}|\p{Sk}|\p{So}
\p{Zs}  \p{Space Separator}           a NON-ZERO width SPACE character                         17   = [\x{0020}\x{00A0}\x{1680}\x{2000}-\x{200A}\x{202F}\x{205F}\x{3000}] = (?!\t)\h
\p{Zl}  \p{Line Separator}            the LINE SEPARATOR character                              1   = \x{2028}
\p{Zp}  \p{Paragraph Separator}       the PARAGRAPH SEPARATOR character                         1   = \x{2029}
------
\p{Z*}  \p{Separator}                                                                          19   = \p{Zs}|\p{Zl}|\p{Zp}
Note that the total of the \p{M*} mark characters is exactly the result given by the \m regex!
Now, if you follow the procedure explained in the last part of this post :
https://community.notepad-plus-plus.org/post/99844
The regexes [\x{DC80}-\x{DCFF}] or \i or [[:invalid:]] do give 134 occurrences, which is the exact number of invalid characters of this example!
In a nutshell, hats off to you! No problem detected, so far. It’s a major version! From your last post, I understood that you’re still working on some improvements!
At the end, I’ll put a new version of my Unicode.zip archive in my Google Drive account, referring to your latest experimental version of ColumnsPlusPlus, which should greatly simplify the regex syntax needed to count or mark all chars of Unicode ranges!
In a later post, I’ll raise two points concerning, more specifically, ColumnsPlusPlus!
Best Regards,
guy038
P.S. :
-
A negative POSIX character class can be expressed as [^[:........:]] or [[:^........:]]
-
A negative UNICODE character class can be expressed as \P{..}
-
-
Hi, @coises,
Two points :
-
Seemingly, if I select all the text of a regex, it does not appear automatically in the Find What : zone and I need a Ctrl + C / Ctrl + V operation. Is this on purpose ?
-
You may enter a very long line of text in the Find What : zone. I verified that you can add up to 30,000 chars if it does not contain any line-break. So my question is:
Is it possible to enter multi-line text in the search zone of ColumnsPlusPlus?
TIA,
BR
guy038
-
-
@guy038 said in Columns++: Where regex meets Unicode (Here there be dragons!):
With your second experimental version of Columns++, 114 collating elements can be found. Wow!
There are actually 116 in that group, since [.LF.] and [.CR.] are also included.
However, the following FOUR ones cannot be reached although they are format characters ( NOT important )
| 1BCA0 | SHORTHAND FORMAT LETTER OVERLAP     | [.SFLO.] |
| 1BCA1 | SHORTHAND FORMAT CONTINUING OVERLAP | [.SFCO.] |
| 1BCA2 | SHORTHAND FORMAT DOWN STEP          | [.SFDS.] |
| 1BCA3 | SHORTHAND FORMAT UP STEP            | [.SFUS.] |
I’ll add those.
BTW, there are 31 di-graph characters, which are considered as both upper case and lower case letters, and which can be found with the Unicode character class \p{Lt}. With our present Boost regex engine, each of them is correctly counted both as an upper case and as a lower case letter!
I see that indeed, when using Notepad++, (?-i)\u(?<=\l) matches 31 characters. I’m not yet convinced that is desirable, though. Shouldn’t \l/\u, [:lower:]/[:upper:] and [:Ll:]/[:Lu:] all be the same? The title case characters are [:Lt:].
However, an odd thing is the result of the [[:cntrl:]] character class: normally, as I said above, it should be 65, i.e. \p{Cc} ( 32 for the C0 controls + DEL + 32 for the C1 control codes )!
It wasn’t clear to me how the POSIX [:cntrl:] definition should be applied to Unicode. Notepad++ search appears to include most of the Cc and Cf characters in the basic multilingual plane, so I made it Cc + Cf. I’ll change that to Cc only.
Thank you so much for looking so closely at all of this!
-
@guy038 said in Columns++: Where regex meets Unicode (Here there be dragons!):
- Seemingly, if I select all the text of a regex, it does not appear automatically in the Find What : zone and I need a Ctrl + C / Ctrl + V operation. Is this on purpose ?
It is. My original motivation for including a search function in Columns++ was to make it possible to search in rectangular selections — something Notepad++ search will not do. Thus, I expected that the most common way of using it would be to select the rectangular block in which you want to search, then open the dialog. (In a rectangular selection, ^ and $ match the beginning and end of the selection in each row. I attempt to explain the whole thing here.)
After some feedback, I made it so that the initial selection is used to set an indicator marking the region to be searched. That made sequential finds make a lot more sense.
My search has been subject to “mission creep” as I added formulas in replacement text, the Select options on the Count drop-down, the ability to convert multiple selections to search regions, and now 32-bit Unicode searching. Some day I might make a separate search plugin (or try to make a case for adding these features to Notepad++); for now, the one in Columns++ will be first oriented toward working conveniently with rectangular selections.
- You may enter a very long line of text in the Find What : zone. I verified that you can add up to 30,000 chars if it does not contain any line-break. So my question is:
Is it possible to enter multi-line text in the search zone of ColumnsPlusPlus?
At present, no. It’s a good idea, though. I’ve thought of making it possible to open a separate window in which to enter search and replacement expressions, allowing more space and maybe containing a feature to pin frequently-used searches and/or a “builder” that would guide novices in the construction of regular expressions.
That’s getting so complex, though, that I think it might have to wait for that apocryphal day when I build a separate plugin that’s just for search.
-
Hello, @coises and All,
Thanks for adding the 4 remaining elements: so we’ll get a round number of collating elements: 120!
You said:
Notepad++ search appears to include most of the Cc and Cf characters in the basic multilingual plane, so I made it Cc + Cf. I’ll change that to Cc only
I confirm that, in your second version, [[:cntrl:]] = \p{Cc} + \p{Cf} = 65 + 170 = 235, and thanks for the future modification.
You said :
Shouldn’t \l/\u, [:lower:]/[:upper:] and [:Ll:]/[:Lu:] all be the same? The title case characters are [:Lt:].
What do you mean? Presently, in your second version, that is exactly the case, as shown below, or am I missing something obvious!?
[[:upper:]] = \p{upper} = [[:u:]] = \p{u} = \pu = \u = \p{Lu} = \p{Uppercase Letter} = [[:Lu:]]   an UPPER case letter = 1,858
[[:lower:]] = \p{lower} = [[:l:]] = \p{l} = \pl = \l = \p{Ll} = \p{Lowercase Letter} = [[:Ll:]]   a LOWER case letter  = 2,258
BTW, I didn’t know that the syntax of a Unicode character class \p{Xy} could also be expressed as [[:Xy:]]!
Best Regards,
guy038
-
@guy038 said in Columns++: Where regex meets Unicode (Here there be dragons!):
You said :
Shouldn’t \l/\u, [:lower:]/[:upper:] and [:Ll:]/[:Lu:] all be the same? The title case characters are [:Lt:].
What do you mean? Presently, in your second version, that is exactly the case, as shown below, or am I missing something obvious!?
I don’t think you missed anything. I think I might have misunderstood you. I thought you were saying that [:lower:] and [:upper:] and/or \l and \u should match the [:Lt:] characters, so that those 31 characters are both upper case and lower case. Perhaps we are agreed that they are neither.
BTW, I didn’t know that the syntax of a Unicode character class \p{Xy} could also be expressed as [[:Xy:]]!
Boost::regex is built such that \p{whatever} and [[:whatever:]] are the same. It also “delegates” backslash lower case letter escapes that don’t have any other meaning to classes with the same name, and upper case escapes without another meaning to the complements; so \s is internally “defined” as [[:s:]] and \S as [^[:s:]]. That’s how I was able to define \i, \m, \o and \y. It’s also why we have to write \p{L*} instead of \p{L}: class names are case insensitive, and “l” already defines \l as lower case. For consistency, all the Unicode general category groups use the asterisk notation.
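To make that delegation mechanism concrete, here is a compilable toy sketch of the idea. The names, mask values and simplified signatures are mine, not the actual Columns++ traits class, and a real Boost.Regex traits class needs many more members than shown:

#include <cstdint>
#include <iostream>
#include <string>

struct unicode_traits_sketch {
    typedef std::uint64_t char_class_type;
    static constexpr char_class_type class_invalid = 1ull << 40;   // made-up mask values
    static constexpr char_class_type class_mark    = 1ull << 41;
    static constexpr char_class_type class_ascii   = 1ull << 42;
    static constexpr char_class_type class_defined = 1ull << 43;

    // Boost.Regex asks the traits class to turn a class name into a mask.
    // Because \i is treated like [[:i:]] (and \I like [^[:i:]]), recognising
    // the one-letter names is enough to create the new escapes.
    char_class_type lookup_classname(const std::u32string& name) const {
        if (name == U"i" || name == U"invalid") return class_invalid;
        if (name == U"m") return class_mark;
        if (name == U"o") return class_ascii;
        if (name == U"y") return class_defined;
        return 0;   // unknown here: a real traits class falls back to the standard names
    }

    // ...and to test whether a character belongs to the class.
    bool isctype(char32_t c, char_class_type mask) const {
        if (mask & class_ascii) return c <= 0x7F;
        // the invalid / mark / defined classes would consult Unicode property tables
        return false;
    }
};

int main() {
    unicode_traits_sketch tr;
    auto ascii = tr.lookup_classname(U"o");                               // what \o delegates to
    std::cout << tr.isctype(U'A', ascii) << "\n";                         // 1
    std::cout << tr.isctype(static_cast<char32_t>(0x1F642), ascii) << "\n";   // 0
}
-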
I’m hopeful that the real end goal for all of this is integration into native Notepad++, and that the plugin is really just a “testbed” for what you’re doing. Columns++ is great, but this is about core unicode searching, and as such really belongs in the standard product.
It’s great that a person has finally been found that’s capable of (and interested in) doing this stuff, and it would be a shame if Notepad++ moves forward without the benefits of this work.
Thank you for your work.
-
@Alan-Kilborn said in Columns++: Where regex meets Unicode (Here there be dragons!):
I’m hopeful that the real end goal for all of this is integration into native Notepad++, and that the plugin is really just a “testbed” for what you’re doing. Columns++ is great, but this is about core unicode searching, and as such really belongs in the standard product.
For now I’m focusing on making the search in Columns++ as good as I can make it within the bounds of what I’ve intended search in Columns++ to accomplish. I don’t know that I can get this to where I’m comfortable calling it “stable” before the plugins list for the just-announced Notepad++ 8.7.8 release is frozen, but that’s the limit of my ambition at this point.
I do hope that once it has been in use for a time, it can serve as a proof of viability — and maybe a bit of pressure — to incorporate better Unicode searching into Notepad++. That would be a massive code change, though, and unfortunately not everything can be simply copied from the way I’m doing it. (Columns++ uses Boost::regex directly; Notepad++ integrates Boost::regex with Scintilla and then uses the upgraded Scintilla search. Most of the same principles should apply, but details, details… details are where the bugs live.)
There will also surely be a repeat of the same question I faced: whether to use ad hoc code or somehow incorporate ICU, which Boost::regex can use. And since Windows 10 version 1703 (but changing in 1709 and again in 1903), Windows incorporates a stripped-down version of ICU. It appears that Boost::regex can’t use that, but perhaps Boost will fix that someday, or perhaps I or someone else will find a way to connect them. By the time this could be considered for Notepad++, it might be plausible to limit new versions to Win 10 version 1903 or later. Avoiding bespoke code would minimize the possibility of future maintenance burdens for Notepad++. So there will be a lot to consider.
Thank you for your kind words and encouragement, Alan.
-
I’ve posted Columns++ for Notepad++ version 1.1.5.3-Experimental.
Changes:
-
Search in Columns++ shows a progress dialog when it estimates that a count, select or replace all operation will take more than two seconds. That should make apparent freezes (which were observed when attempting select all for expressions that make tens or hundreds of thousands of separate matches) far less likely to happen. (Note that this is not connected to the “Expression too complex” situation; this happens when the expression is reasonable, but there are an extremely high number of matches.)
-
[[:cntrl:]] matches only Unicode General Category Cc characters. Mnemonics for formatting characters [[.sflo.]], [[.sfco.]], [[.sfds.]] and [[.sfus.]] work.
-
I corrected an error that would have caused equivalence classes (e.g., [[=a=]]) to fail for characters U+10000 and above. However, I don’t know if there are any working equivalence classes for characters U+10000 and above, anyway. (Present support for those is dependent on a Windows function; it appears to me that it might not process surrogate pairs in a useful way.)
-
There were other organizational changes.
As always, comments, observations and suggestions are most welcome. My aim is for this to be the last “experimental” release in this series, if nothing awful happens… in which case the major remaining thing to be done before a normal release is documentation.
-
-
Hi, @coises and All,
First, here is the summary of the contents of the Total_Chars.txt file:
•----------------•---------•------------------------------------------•-----------•-------------------------------------------------•-----------•
|     Range      |  Plane  |     COUNT / MARK of ALL characters       |  # Chars  |    COUNT / MARK of ALL UNASSIGNED characters    |  # Unas.  |
•----------------•---------•------------------------------------------•-----------•-------------------------------------------------•-----------•
|  0000...FFFD   |    0    | [\x{0000}-\x{FFFD}]                      |    63,454 | (?=[\x{0000}-\x{D7FF}]|[\x{F900}-\x{FFFD}])\Y   |     1,398 |
|  10000..1FFFD  |    1    | [\x{10000}-\x{1FFFD}]                    |    65,534 | (?=[\x{10000}-\x{1FFFD}])\Y                     |    37,090 |
|  20000..2FFFD  |    2    | [\x{20000}-\x{2FFFD}]                    |    65,534 | (?=[\x{20000}-\x{2FFFD}])\Y                     |     4,039 |
|  30000..3FFFD  |    3    | [\x{30000}-\x{3FFFD}]                    |    65,534 | (?=[\x{30000}-\x{3FFFD}])\Y                     |    56,403 |
|  E0000..EFFFD  |   14    | [\x{E01F0}-\x{EFFFD}]                    |    65,534 | (?=[\x{E0000}-\x{EFFFD}])\Y                     |    65,197 |
•----------------•---------•------------------------------------------•-----------•-------------------------------------------------•-----------•
|  00000..EFFFD  |         | (?s).   \I   \p{Any}   [\x0-\x{EFFFD}]   |   325,590 | (?![\x{E000}-\x{F8FF}])\Y   \p{Not Assigned}    |   164,127 |
•----------------•---------•------------------------------------------•-----------•-------------------------------------------------•-----------•
Indeed, I cannot post my new Unicode_Col++.txt file, in its entirety, with the detail of all the Unicode blocks ( too large! ). However, it will be part of my future Unicode.zip archive that I’ll post on my Google Drive account!
Now, I tested your third experimental version of Columns++ and everything works as you surely expect it to!!
You said:
Search in Columns++ shows a progress dialog when it estimates that a count, select or replace all operation will take more than two seconds…
I’m pleased to tell you that with this new feature, my laptop did not hang any more! For example, I tried to select all the matches of the regex (?s)., against my Total_Chars.txt file, and, with the progress dialog, on my HP ProBook 450 G8 / Windows 10 Pro 64 / Version 21H1 / Intel® Core™ i7 / RAM 32 GB DDR4-3200 MHz, after 8 m 21 s the green zone was complete and it said: 325 590 matches selected! I even copied all this selection to a new tab and, after removing all \r\n line-breaks, the ComparePlus plugin did not find any difference between Total_Chars.txt and this new tab!
You said :
[[:cntrl:]] matches only Unicode General Category Cc characters. Mnemonics for formatting characters [[.sflo.]], [[.sfco.]], [[.sfds.]] and [[.sfus.]] work.
I confirm that these two changes are effective.
Now, I particularly tested the Equivalence classes feature. You can refer to the following link:
https://unicode.org/charts/collation/index.html
And also consult the help at :
https://unicode.org/charts/collation/help.html
For the letter a, it detects 160 equivalences of the a letter.
However, against the Total_Chars.txt file, the regex [[=a=]] returns 86 matches. So we can deduce that:
-
A lot of equivalences are not found with the [[=a=]] regex
-
Some equivalents, not shown at this link, can be found with the [[=a=]] regex. It’s the case with the \x{249C} character ( PARENTHESIZED LATIN SMALL LETTER A )!
This situation happens with any character: for example, the regex [[=1=]] finds 54 matches but, on the site, it shows 209 equivalences to the digit 1.
Now, with your experimental UTF-32 version, you can use any other equivalent character of the a letter to get the 86 matches ( [[=Ⱥ=]], [[=ⱥ=]], [[=Ɐ=]], … ). Note that, with our present Boost regex engine, some equivalences do not return the 86 matches. It’s the case for the regexes: [[=ɐ=]], [[=ɑ=]], [[=ɒ=]], [[=ͣ=]], [[=ᵃ=]], [[=ᵄ=]], [[=ⱥ=]], [[=Ɑ=]], [[=Ɐ=]], [[=Ɒ=]]
Thus, your version is more coherent, as it does give the same result, whatever the char used in the equivalence class regex !
Here is below the list of all the equivalences of any char of the Windows-1252 code-page, from \x{0020} till \x{00DE}. Note that, except for the DEL character, as an example, I did not consider the equivalence classes which return only one match!
I also confirm that I did not find any character over \x{FFFF} which would be part of a regex equivalence class, either with our Boost engine or with your Columns++
experimental version ![[= =]] = [[=space=]] => 3 ( ) [[=!=]] = [[=exclamation-mark=]] => 2 ( !! ) [[="=]] = [[=quotation-mark=]] => 3 ( "⁍" ) [[=#=]] = [[=number-sign=]] => 4 ( #؞⁗# ) [[=$=]] = [[=dollar-sign=]] => 3 ( $⁒$ ) [[=%=]] = [[=percent-sign=]] => 3 ( %⁏% ) [[=&=]] = [[=ampersand=]] => 3 ( &⁋& ) [[='=]] = [[=apostrophe=]] => 2 ( '' ) [[=(=]] = [[=left-parenthesis=]] => 4 ( (⁽₍( ) [[=)=]] = [[=right-parenthesis=]] => 4 ( )⁾₎) ) [[=*=]] = [[=asterisk=]] => 2 ( ** ) [[=+=]] = [[=plus-sign=]] => 6 ( +⁺₊﬩﹢+ ) [[=,=]] = [[=comma=]] => 2 ( ,, ) [[=-=]] = [[=hyphen=]] => 3 ( -﹣- ) [[=.=]] = [[=period=]] => 3 ( .․. ) [[=/=]] = [[=slash=]] => 2 ( // ) [[=0=]] = [[=zero=]] => 48 ( 0٠۟۠۰߀०০੦૦୦୵௦౦౸೦൦๐໐༠၀႐០᠐᥆᧐᪀᪐᭐᮰᱀᱐⁰₀↉⓪⓿〇㍘꘠ꛯ꠳꣐꤀꧐꩐꯰0 ) [[=1=]] = [[=one=]] => 54 ( 1¹١۱߁१১੧૧୧௧౧౹౼೧൧๑໑༡၁႑፩១᠑᥇᧑᧚᪁᪑᭑᮱᱁᱑₁①⑴⒈⓵⚀❶➀➊〡㋀㍙㏠꘡ꛦ꣑꤁꧑꩑꯱1 ) [[=2=]] = [[=two=]] => 54 ( 2²ƻ٢۲߂२২੨૨୨௨౨౺౽೨൨๒໒༢၂႒፪២᠒᥈᧒᪂᪒᭒᮲᱂᱒₂②⑵⒉⓶⚁❷➁➋〢㋁㍚㏡꘢ꛧ꣒꤂꧒꩒꯲2 ) [[=3=]] = [[=three=]] => 53 ( 3³٣۳߃३৩੩૩୩௩౩౻౾೩൩๓໓༣၃႓፫៣᠓᥉᧓᪃᪓᭓᮳᱃᱓₃③⑶⒊⓷⚂❸➂➌〣㋂㍛㏢꘣ꛨ꣓꤃꧓꩓꯳3 ) [[=4=]] = [[=four=]] => 51 ( 4٤۴߄४৪੪૪୪௪౪೪൪๔໔༤၄႔፬៤᠔᥊᧔᪄᪔᭔᮴᱄᱔⁴₄④⑷⒋⓸⚃❹➃➍〤㋃㍜㏣꘤ꛩ꣔꤄꧔꩔꯴4 ) [[=5=]] = [[=five=]] => 53 ( 5Ƽƽ٥۵߅५৫੫૫୫௫౫೫൫๕໕༥၅႕፭៥᠕᥋᧕᪅᪕᭕᮵᱅᱕⁵₅⑤⑸⒌⓹⚄❺➄➎〥㋄㍝㏤꘥ꛪ꣕꤅꧕꩕꯵5 ) [[=6=]] = [[=six=]] => 52 ( 6٦۶߆६৬੬૬୬௬౬೬൬๖໖༦၆႖፮៦᠖᥌᧖᪆᪖᭖᮶᱆᱖⁶₆ↅ⑥⑹⒍⓺⚅❻➅➏〦㋅㍞㏥꘦ꛫ꣖꤆꧖꩖꯶6 ) [[=7=]] = [[=seven=]] => 50 ( 7٧۷߇७৭੭૭୭௭౭೭൭๗໗༧၇႗፯៧᠗᥍᧗᪇᪗᭗᮷᱇᱗⁷₇⑦⑺⒎⓻❼➆➐〧㋆㍟㏦꘧ꛬ꣗꤇꧗꩗꯷7 ) [[=8=]] = [[=eight=]] => 50 ( 8٨۸߈८৮੮૮୮௮౮೮൮๘໘༨၈႘፰៨᠘᥎᧘᪈᪘᭘᮸᱈᱘⁸₈⑧⑻⒏⓼❽➇➑〨㋇㍠㏧꘨ꛭ꣘꤈꧘꩘꯸8 ) [[=9=]] = [[=nine=]] => 50 ( 9٩۹߉९৯੯૯୯௯౯೯൯๙໙༩၉႙፱៩᠙᥏᧙᪉᪙᭙᮹᱉᱙⁹₉⑨⑼⒐⓽❾➈➒〩㋈㍡㏨꘩ꛮ꣙꤉꧙꩙꯹9 ) [[=:=]] = [[=colon=]] => 2 ( :: ) [[=;=]] = [[=semicolon=]] => 3 ( ;;; ) [[=<=]] = [[=less-than-sign=]] => 3 ( <﹤< ) [[===]] = [[=equals-sign=]] => 5 ( =⁼₌﹦= ) [[=>=]] = [[=greater-than-sign=]] => 3 ( >﹥> ) [[=?=]] = [[=question-mark=]] => 2 ( ?? ) [[=@=]] = [[=commercial-at=]] => 2 ( @@ ) [[=A=]] => 86 ( AaªÀÁÂÃÄÅàáâãäåĀāĂ㥹ǍǎǞǟǠǡǺǻȀȁȂȃȦȧȺɐɑɒͣᴀᴬᵃᵄᶏᶐᶛᷓḀḁẚẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặₐÅ⒜ⒶⓐⱥⱭⱯⱰAa ) [[=B=]] => 29 ( BbƀƁƂƃƄƅɃɓʙᴃᴮᴯᵇᵬᶀḂḃḄḅḆḇℬ⒝ⒷⓑBb ) [[=C=]] => 40 ( CcÇçĆćĈĉĊċČčƆƇƈȻȼɔɕʗͨᴄᴐᵓᶗᶜᶝᷗḈḉℂ℃ℭ⒞ⒸⓒꜾꜿCc ) [[=D=]] => 44 ( DdÐðĎďĐđƊƋƌƍɗʤͩᴅᴆᴰᵈᵭᶁᶑᶞᷘᷙḊḋḌḍḎḏḐḑḒḓⅅⅆ⒟ⒹⓓꝹꝺDd ) [[=E=]] => 82 ( EeÈÉÊËèéêëĒēĔĕĖėĘęĚěƎƏǝȄȅȆȇȨȩɆɇɘəɚͤᴇᴱᴲᵉᵊᶒᶕḔḕḖḗḘḙḚḛḜḝẸẹẺẻẼẽẾếỀềỂểỄễỆệₑₔ℮ℯℰ⅀ⅇ⒠ⒺⓔⱸⱻEe ) [[=F=]] => 22 ( FfƑƒᵮᶂᶠḞḟ℉ℱℲⅎ⒡ⒻⓕꜰꝻꝼꟻFf ) [[=G=]] => 45 ( GgĜĝĞğĠġĢģƓƔǤǥǦǧǴǵɠɡɢɣɤʛˠᴳᵍᵷᶃᶢᷚᷛḠḡℊ⅁⒢ⒼⓖꝾꝿꞠꞡGg ) [[=H=]] => 41 ( HhĤĥĦħȞȟɥɦʜʰʱͪᴴᶣḢḣḤḥḦḧḨḩḪḫẖₕℋℌℍℎℏ⒣ⒽⓗⱧⱨꞍHh ) [[=I=]] => 61 ( IiÌÍÎÏìíîïĨĩĪīĬĭĮįİıƖƗǏǐȈȉȊȋɨɩɪͥᴉᴵᵎᵢᵻᵼᶖᶤᶥᶦᶧḬḭḮḯỈỉỊịⁱℐℑⅈ⒤ⒾⓘꟾIi ) [[=J=]] => 23 ( JjĴĵǰȷɈɉɟʄʝʲᴊᴶᶡᶨⅉ⒥ⒿⓙⱼJj ) [[=K=]] => 38 ( KkĶķĸƘƙǨǩʞᴋᴷᵏᶄᷜḰḱḲḳḴḵₖK⒦ⓀⓚⱩⱪꝀꝁꝂꝃꝄꝅꞢꞣKk ) [[=L=]] => 56 ( LlĹĺĻļĽľĿŀŁłƚƛȽɫɬɭɮʟˡᴌᴸᶅᶩᶪᶫᷝᷞḶḷḸḹḺḻḼḽₗℒℓ⅂⅃⒧ⓁⓛⱠⱡⱢꝆꝇꝈꝉꞀꞁLl ) [[=M=]] => 33 ( MmƜɯɰɱͫᴍᴟᴹᵐᵚᵯᶆᶬᶭᷟḾḿṀṁṂṃₘℳ⒨ⓂⓜⱮꝳꟽMm ) [[=N=]] => 47 ( NnÑñŃńŅņŇňʼnƝƞǸǹȠɲɳɴᴎᴺᴻᵰᶇᶮᶯᶰᷠᷡṄṅṆṇṈṉṊṋⁿₙℕ⒩ⓃⓝꞤꞥNn ) [[=O=]] => 106 ( OoºÒÓÔÕÖØòóôõöøŌōŎŏŐőƟƠơƢƣǑǒǪǫǬǭǾǿȌȍȎȏȪȫȬȭȮȯȰȱɵɶɷͦᴏᴑᴒᴓᴕᴖᴗᴼᵒᵔᵕᶱṌṍṎṏṐṑṒṓỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợₒℴ⒪ⓄⓞⱺꝊꝋꝌꝍOo ) [[=P=]] => 33 ( PpƤƥɸᴘᴾᵖᵱᵽᶈᶲṔṕṖṗₚ℘ℙ⒫ⓅⓟⱣⱷꝐꝑꝒꝓꝔꝕꟼPp ) [[=Q=]] => 16 ( QqɊɋʠℚ℺⒬ⓆⓠꝖꝗꝘꝙQq ) [[=R=]] => 64 ( RrŔŕŖŗŘřƦȐȑȒȓɌɍɹɺɻɼɽɾɿʀʁʳʴʵʶͬᴙᴚᴿᵣᵲᵳᶉᷢᷣṘṙṚṛṜṝṞṟℛℜℝ⒭ⓇⓡⱤⱹꝚꝛꝜꝝꝵꝶꞂꞃRr ) [[=S=]] => 47 ( SsŚśŜŝŞşŠšƧƨƩƪȘșȿʂʃʅʆˢᵴᶊᶋᶘᶳᶴᷤṠṡṢṣṤṥṦṧṨṩₛ⒮ⓈⓢⱾꜱSs ) [[=T=]] => 46 ( TtŢţŤťƫƬƭƮȚțȶȾʇʈʧʨͭᴛᵀᵗᵵᶵᶿṪṫṬṭṮṯṰṱẗₜ⒯ⓉⓣⱦꜨꜩꝷꞆꞇTt ) [[=U=]] => 82 ( UuÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųƯưƱǓǔǕǖǗǘǙǚǛǜȔȕȖȗɄʉʊͧᴜᵁᵘᵤᵾᵿᶙᶶᶷᶸṲṳṴṵṶṷṸṹṺṻỤụỦủỨứỪừỬửỮữỰự⒰ⓊⓤUu ) [[=V=]] => 29 ( VvƲɅʋʌͮᴠᵛᵥᶌᶹᶺṼṽṾṿỼỽ⒱ⓋⓥⱱⱴⱽꝞꝟVv ) [[=W=]] => 28 ( WwŴŵƿǷʍʷᴡᵂẀẁẂẃẄẅẆẇẈẉẘ⒲ⓌⓦⱲⱳWw ) [[=X=]] => 15 ( XxˣͯᶍẊẋẌẍₓ⒳ⓍⓧXx ) [[=Y=]] => 36 ( YyÝýÿŶŷŸƳƴȲȳɎɏʎʏʸẎẏẙỲỳỴỵỶỷỸỹỾỿ⅄⒴ⓎⓨYy ) [[=Z=]] => 41 ( ZzŹźŻżŽžƵƶȤȥɀʐʑᴢᵶᶎᶻᶼᶽᷦẐẑẒẓẔẕℤ℥ℨ⒵ⓏⓩⱫⱬⱿꝢꝣZz ) [[=[=]] = [[=left-square-bracket=]] => 2 ( [[ ) [[=\=]] = [[=backslash=]] => 2 ( \\ ) 
[[=]=]] = [[=right-square-bracket=]] => 2 ( ]] ) [[=^=]] = [[=circumflex=]] => 3 ( ^ˆ^ ) [[=_=]] = [[=underscore=]] => 2 ( __ ) [[=`=]] = [[=grave-accent=]] => 4 ( `ˋ`` ) [[={=]] = [[=left-curly-bracket=]] => 2 ( {{ ) [[=|=]] = [[=vertical-line=]] => 2 ( || ) [[=}=]] = [[=right-curly-bracket=]] => 2 ( }} ) [[=~=]] = [[=tilde=]] => 2 ( ~~ ) [[==]] = [[=DEL=]] => 1 ( ) [[=Œ=]] => 2 ( Œœ ) [[=¢=]] => 3 ( ¢《¢ ) [[=£=]] => 3 ( £︽£ ) [[=¤=]] => 2 ( ¤》 ) [[=¥=]] => 3 ( ¥︾¥ ) [[=¦=]] => 2 ( ¦¦ ) [[=¬=]] => 2 ( ¬¬ ) [[=¯=]] => 2 ( ¯ ̄ ) [[=´=]] => 2 ( ´´ ) [[=·=]] => 2 ( ·· ) [[=¼=]] => 4 ( ¼୲൳꠰ ) [[=½=]] => 6 ( ½୳൴༪⳽꠱ ) [[=¾=]] => 4 ( ¾୴൵꠲ ) [[=Þ=]] => 6 ( ÞþꝤꝥꝦꝧ )
Some double-letter characters give some equivalences which allow you to get the right single char to use, instead of the two trivial letters :
[[=AE=]] = [[=Ae=]] = [[=ae=]]  =>  11  ( ÆæǢǣǼǽᴁᴂᴭᵆᷔ )
[[=CH=]] = [[=Ch=]] = [[=ch=]]  =>   0  ( ? )
[[=DZ=]] = [[=Dz=]] = [[=dz=]]  =>   6  ( DŽDždžDZDzdz )
[[=LJ=]] = [[=Lj=]] = [[=lj=]]  =>   3  ( LJLjlj )
[[=LL=]] = [[=Ll=]] = [[=ll=]]  =>   2  ( Ỻỻ )
[[=NJ=]] = [[=Nj=]] = [[=nj=]]  =>   3  ( NJNjnj )
[[=SS=]] = [[=Ss=]] = [[=ss=]]  =>   2  ( ßẞ )
However, the use of these di-graph characters is quite delicate! Let’s consider these 7 di-graph collating elements, below, with various cases:
[[.AE.]]  [[.Ae.]]  [[.ae.]]   ( European Ligature )
[[.CH.]]  [[.Ch.]]  [[.ch.]]   ( Spanish )
[[.DZ.]]  [[.Dz.]]  [[.dz.]]   ( Hungarian, Polish, Slovakian, Serbo-Croatian )
[[.LJ.]]  [[.Lj.]]  [[.lj.]]   ( Serbo-Croatian )
[[.LL.]]  [[.Ll.]]  [[.ll.]]   ( Spanish )
[[.NJ.]]  [[.Nj.]]  [[.nj.]]   ( Serbo-Croatian )
[[.SS.]]  [[.Ss.]]  [[.ss.]]   ( German )
As we know that :
LJ  01C7  LATIN CAPITAL LETTER LJ
Lj  01C8  LATIN CAPITAL LETTER L WITH SMALL LETTER J
lj  01C9  LATIN SMALL LETTER LJ
DZ  01F1  LATIN CAPITAL LETTER DZ
Dz  01F2  LATIN CAPITAL LETTER D WITH SMALL LETTER Z
dz  01F3  LATIN SMALL LETTER DZ
If we apply the regex [[.dz.]-[.lj.][=dz=][=lj=]] against the text bcddzdzefghiijjklljljmn, pasted in a new tab, Columns++ would find 12 matches:
dz dz e f g h i j k l lj lj
To sum up, @coises, the key points of your third experimental version are:
-
A major regex engine, implemented in UTF-32, which correctly handles all the Unicode characters, from \x{0} to \x{0010FFFF}, and correctly manages all the Unicode character classes \p{Xy} or [[:Xy:]]
-
Additional features such as \i, \m, \o and \y and their complements
-
The \X regex feature ( \M\m* ) correctly works for characters OVER the BMP
-
The invalid UTF-8 characters may be kept, replaced or deleted ( FIND \i+ , REPLACE ABC $1 XYZ )
-
The NUL character can be placed in a replacement ( FIND ABC\x00XYZ , REPLACE \x0--$0--\x{00} )
-
Correct handling of case replacements, even in case of accented characters ( FIND (?-s). REPLACE \U$0 )
-
The \K feature ALSO works in a step-by-step replacement with the Replace button ( FIND ^.{20}\K(.+) , REPLACE --\1-- )
To end, @coises, do you think it’s worth testing some regex examples with possible replacements? I could test some tricky regexes to check the robustness of your final UTF-32 version, if necessary?
Best Regards,
guy038
-
-
@guy038 said:
The \K feature ALSO works in a step-by-step replacement with the Replace button
That’s major. Perhaps whatever change allows that could be factored out and put into native Notepad++?
(Again, I’m not one to say often that functionality that’s in a plugin should “go native”…but, when we’re talking about important find/replace functionality…it should).
I could test some tricky regexes to check the robustness of your final UTF-32 version
First, I’d encourage this further testing.
Second, is there a reason to mention UTF-32, specifically?
-
@guy038 said in Columns++: Where regex meets Unicode (Here there be dragons!):
To end, @coises, do you think it’s worth testing some regex examples with possible replacements? I could test some tricky regexes to check the robustness of your final UTF-32 version, if necessary?
It would surely be helpful; but I now know there will be at least one more experimental version, so you might as well wait for that. I no longer expect to have this ready in time to be included in the plugins list for the next Notepad++ release.
It turns out that \X does not work correctly. Consider this text:
👍👍🏻👍🏼👍🏽👍🏾👍🏿
There are six “graphical characters” there, but \X finds eleven (if you copy without a line ending). It turns out the rules for identifying grapheme cluster breaks are complex, and Boost::regex does not implement them correctly. (As far as I can tell, Scintilla is agnostic about this. Selections go by code point — stepping with the right arrow key, you can see the cursor move to the middle of any character comprised of multiple code points. I think Scintilla depends on the fonts and the display engine to render grapheme clusters properly, but I haven’t verified that.)
So I’m working on making that work properly. I think I’ve found a way, but work is still in progress.
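To illustrate with a toy example: the skin-tone modifiers U+1F3FB-U+1F3FF carry the Extend grapheme-break property, so each thumbs-up plus modifier pair is a single extended grapheme cluster even though it is two code points. The classification below is hard-coded for just this text and is only my sketch of the relevant UAX #29 idea, not the fix being worked on:

#include <cstddef>
#include <iostream>
#include <string>

// Hard-coded stand-in for the Grapheme_Cluster_Break=Extend property,
// covering only what this example needs.
static bool is_extend(char32_t c) {
    return (c >= 0x1F3FB && c <= 0x1F3FF)    // emoji skin-tone modifiers
        || (c >= 0x0300 && c <= 0x036F);     // combining diacritical marks
}

// Count extended grapheme clusters: never break before an Extend character (rule GB9).
static std::size_t count_graphemes(const std::u32string& s) {
    std::size_t clusters = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (i == 0 || !is_extend(s[i]))
            ++clusters;
    return clusters;
}

int main() {
    // 👍 👍🏻 👍🏼 👍🏽 👍🏾 👍🏿 : eleven code points, six grapheme clusters
    std::u32string text = U"\U0001F44D"
                          U"\U0001F44D\U0001F3FB" U"\U0001F44D\U0001F3FC"
                          U"\U0001F44D\U0001F3FD" U"\U0001F44D\U0001F3FE"
                          U"\U0001F44D\U0001F3FF";
    std::cout << count_graphemes(text) << "\n";   // prints 6
}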
I will also look at the equivalence classes problems you identified. Thank you for that information! It will help greatly.
I’ve had a couple thoughts, and I’m wondering what others think:
-
I find the character class matching when Match case is not checked (or (?i) is used) absurd. Boost::regex makes \l and \u match all [[:alpha:]] characters (not just cased letters), and the Unicode classes become entirely erratic. I can’t think of any named character classes that would be less useful if case insensitivity were ignored when matching them. If it’s possible to do that — so that, for example, \u still matches only upper case characters even when Match case is not checked — would others find that an improvement? Would anyone find it problematic? (This wouldn’t affect classes specified with explicit characters, like [aeiou] or [A-F]: Match case would still control how those match. If I can accomplish this as I intend, only the Unicode “General Category” character classes, \l, \u, [:lower:], [:upper:] and obvious correlates would be changed to ignore case insensitivity and always test the document text as written.)
-
Should there be an option to make the POSIX classes and their escapes (such as \s, \w, [[:alnum:]], [[:punct:]]) match only ASCII characters? Unfortunately, I don’t see any reasonable way to make that an in-expression switch like (?i); if it were done at all, it would have to be a checkbox that would apply to the entire expression. Would this help anyone, or just add complication for little value?
-
Does anyone care much about having Unicode script properties available as regex properties (e.g., \p{Greek}, \p{Hebrew}, \p{Latin})?
-
Does anyone care much about having Unicode character names available (e.g., [[.GREEK SMALL LETTER FINAL SIGMA.]] equivalent to \x{03C2})? My thought is that including those will make the module much larger, and that by the time you’ve looked up the exact way the name has to be given, you could just look up the hexadecimal code point anyway.
-
-
@Alan-Kilborn said in Columns++: Where regex meets Unicode (Here there be dragons!):
@guy038 said:
The \K feature ALSO works in a step-by-step replacement with the Replace button
That’s major. Perhaps whatever change allows that could be factored out and put into native Notepad++?
When doing a Replace, the first step is to check that the selection matches the search expression. In general, this fails when \K is used because an expression using \K doesn’t select starting from where it matched.
What Columns++ does is to remember the starting position for the last successful find (whether it was from Find or as part of a Replace). Then, when Replace is clicked, it checks starting from there rather than from the selection.
It’s been a while since I wrote that code, and I don’t remember exactly why, but I put in a number of checks to be sure the remembered starting point is still valid. In particular, if focus leaves the Search dialog (meaning the user might have changed the selection or the text), the position memory is marked invalid and a normal search starting from the beginning of the selection is used. I think the reason was to be sure that if the user intends to start a new search (by clicking in a new position or changing the selection), it would be important not to start from some remembered (and now meaningless) point.
Off hand, I don’t see any reason the same principle couldn’t be applied to Notepad++ search. It wouldn’t be a matter of just copying, though; someone would have to think through the logic from scratch in the context of how Notepad++ implements search.
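A rough, hypothetical sketch of that logic (my own names and structure, not the actual Columns++ code):

#include <iostream>
#include <optional>

struct FindState {
    std::optional<long> lastFindStart;            // set on every successful Find
    void invalidate() { lastFindStart.reset(); }  // e.g., when focus leaves the dialog
};

// Where should Replace re-check the match? From the remembered find start if it
// is still valid, otherwise from the start of the selection (the usual behaviour).
long replaceCheckPosition(const FindState& state, long selectionStart) {
    return state.lastFindStart ? *state.lastFindStart : selectionStart;
}

int main() {
    FindState st;
    st.lastFindStart = 100;      // the last Find matched starting at position 100,
    long selection = 120;        // but \K left the selection starting at 120
    std::cout << replaceCheckPosition(st, selection) << "\n";   // 100: the \K replace still verifies correctly
    st.invalidate();             // focus left the dialog: fall back to the selection
    std::cout << replaceCheckPosition(st, selection) << "\n";   // 120
}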
-
Hello, @coises and All,
You said :
It turns out that \X does not work correctly. Consider this text:
Ah…, indeed, I spoke too quickly and/or did not test this part thoroughly! Thanks for your investigations in this matter!
You said :
Should there be an option to make the POSIX classes and their escapes (such as \s, \w, [[:alnum:]], [[:punct:]]) match only ASCII characters ?
I do not think it’s necessary as we can provide the same behaviour with the following regexes :
-
(?-i)(?=[[:ascii:]])\p{punct} or (?-i)(?=\p{punct})[[:ascii:]] gives 32 matches
-
(?-i)(?=[[:ascii:]])\u or (?-i)(?=\u)[[:ascii:]] gives 26 matches
-
(?-i)(?=[[:ascii:]])\l or (?-i)(?=\l)[[:ascii:]] gives 26 matches
However, note that the insensitive regexes (?i)(?=[[:ascii:]])\u or (?i)(?=\u)[[:ascii:]] or (?i)(?=[[:ascii:]])\l or (?i)(?=\l)[[:ascii:]] return a wrong result of 54 matches! But, luckily, the sensitive regexes (?-i)(?=[[:ascii:]])[\u\l] or (?-i)(?=[\u\l])[[:ascii:]] do return 52 matches.
See, right after, my opinion on the sensitive vs insensitive ways:
You said :
I find the character class matching when Match case is not checked (or (?i) is used) absurd. …
For example, let’s suppose that we run this regex (?-i)[A-F[:lower:]] against my Total_Chars.txt file. It does give 2264 matches, so 6 UPPER letters + 2258 LOWER letters.
Now, if we run this same regex in an insensitive way, the (?i)[A-F[:lower:]] regex returns 141029 matches. Of course, this result is erroneous but, first oddity, why 141029 instead of 141028 ( the total number of letters )?
Well, the ˮ character ( \x{02EE} ) is the last lowercase letter of the SPACING MODIFIER LETTERS block. As, within my file, this Unicode block is followed by the COMBINING DIACRITICAL MARKS block, it happens that an additional \x{0345} combining diacritical mark is tied to that \x{02EE} character ( don’t know why!? )
(?i)[A-F[:lower:]]
should be modified in the sensitive regex(?-i)[A-Fa-f[:upper:][:lower:]]
which, in turn, is identical to the regex(?-i)[[:upper:][:lower:]]
and correctly returns4,116
matches ( So1,858
UPPER letters +2,258
LOWER letters )So, as you cannot check the
Match Case
option of your own accord, I think that the more simple way would be, as long as theRegular expression
radio button is checked :- When a
(?i)
modifier is found within the regex
or
- When the
Match Case
option is unchecked
To show a message, saying :
The given regex may produce wrong results, particularly, if replacement is involved. Try to refactor this **insensitive** regex in a **sensitive** way !
You said :
Does anyone care much about having
Unicode script
properties available as regex properties`` (e.g.,\p{Greek}
,\p{Hebrew}
,\p{Latin})?
It might be useful, sometimes, to differentiate Unicode characters of a text, according to their scripts ( regions ). But it’s up to you : this should not be a primary goal !
You said :
Does anyone care much about having Unicode character names available (e.g.,
[[.GREEK SMALL LETTER FINAL SIGMA.]]
equivalent to\x{03C2}
) ? …I agree to your reasoning and I think that it would have little interest in doing so. So, let this module fast enough, regarding the included UNICODE features, so far !
During my tests, I once searched for the
[\p{L*}]
regex and I was surprised to get5
matches. Note that an Unicode character class CANNOT be part of a usual class character !. Thus my regex[\p{L*}]
poorly matched the five characters ‘*
’, ‘L
’, ‘p
’, ‘{
’ and ‘}
Regarding the 31 characters of the Title Case Unicode category ( \p{Lt} ), you said, in a previous post, that you saw this particularity, with our present Boost engine, thanks to the regex (?-i)\u(?<=\l) or also (?-i)(?=\l)\u
This is possible because, presently, these chars are both included as UPPER and LOWER chars. However, with theColumns++
plugin, these two regexes correctly return0
match because the\p[Lt}
! class is not concerned !
I wish you good development of the next, and probably last, version of your experimental
Columns++
plugin !Best Regards,
guy038
P.S. :
When the Match case option is unchecked, in your Columns++ plugin, the following POSIX classes return a wrong number of occurrences, when applied against the Total_Chars file: [[:ascii:]], [[:unicode:]], [[:upper:]], [[:lower:]], [[:word:]], [[:alnum:]] and [[:alpha:]]
-
-
@guy038 said in Columns++: Where regex meets Unicode (Here there be dragons!):
Now, if we run this same regex in an insensitive way, the (?i)[A-F[:lower:]] regex returns 141029 matches. Of course, this result is erroneous but, first oddity, why 141029 instead of 141028 ( the total number of letters )?
Well, the ˮ character ( \x{02EE} ) is the last lowercase letter of the SPACING MODIFIER LETTERS block. As, within my file, this Unicode block is followed by the COMBINING DIACRITICAL MARKS block, it happens that an additional \x{0345} combining diacritical mark is tied to that \x{02EE} character ( don’t know why!? )
This is a combination of the way Boost::regex handles case-insensitivity for character classes and a peculiarity of U+0345.
In general, case insensitivity makes use of “case folding.” (This is subtly different from just lower casing; for example, Greek capital sigma, small sigma and small final sigma all case fold to small sigma; that way, a case insensitive search for capital sigma will match both small sigma and small final sigma. But it also means a case-insensitive search for either small sigma or small final sigma will match both.) Boost::regex supports explicitly specifying the case folding algorithm for a custom character type, and I’m using the “simple case folding” defined by Unicode.
That file (CaseFolding.txt) includes the line:
U+0345; C; 03B9; # COMBINING GREEK YPOGEGRAMMENI
which says that U+0345 should case fold to U+03B9. U+0345 is a combining diacritical mark; U+03B9 is a lowercase letter. (Presumably this is because both uppercase to U+0399, Greek Capital Letter Iota. Since I don’t know Greek, I’d be speculating as to why it works this way, but the Unicode people probably know what they’re doing.)
Boost::regex does the case folding translation when matching character classes. This makes sense for [A-F], but it makes for nonsense when applied to [[:lower:]]. (Further confusion results from the fact that Boost::regex adds [[:alpha:]] to [[:lower:]] and [[:upper:]] when in case-insensitive mode. So all three match any code point which case folds to a letter.)
Behavior for the Unicode classes is even more bizarre, since not everything that changes under case folding changes to lowercase. (?i)\p{Lu} finds 644 matches.
I haven’t yet deeply investigated whether it is practical to change this behavior. You can perhaps see, though, why I think it should be changed.
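As a toy illustration of that class-matching behavior, here is a sketch in which the fold table and the lowercase test are tiny stand-ins I made up for CaseFolding.txt and the full Ll category; it only shows why (?i)[[:lower:]] ends up matching U+0345:

#include <iostream>

static char32_t simple_fold(char32_t c) {
    switch (c) {
        case 0x0399: return 0x03B9;   // GREEK CAPITAL LETTER IOTA      -> small iota
        case 0x0345: return 0x03B9;   // COMBINING GREEK YPOGEGRAMMENI  -> small iota
        case 0x03A3: return 0x03C3;   // GREEK CAPITAL LETTER SIGMA     -> small sigma
        case 0x03C2: return 0x03C3;   // GREEK SMALL LETTER FINAL SIGMA -> small sigma
        default:     return c;
    }
}

static bool is_lower(char32_t c) {            // stand-in for the real Ll test
    return c == 0x03B9 || c == 0x03C3 || (c >= U'a' && c <= U'z');
}

// Case-insensitive class matching folds the candidate first, as described above.
static bool matches_lower(char32_t c, bool caseInsensitive) {
    return caseInsensitive ? is_lower(simple_fold(c)) : is_lower(c);
}

int main() {
    std::cout << matches_lower(0x0345, false) << "\n";  // 0: a combining mark, not a letter
    std::cout << matches_lower(0x0345, true)  << "\n";  // 1: it folds to U+03B9, which is lowercase
}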
Note that a Unicode character class CANNOT be part of a usual character class!
This is a Boost::regex characteristic (peculiarity): the \p{...} escapes do not work inside square brackets. However, in all cases, \p{something} is equivalent to [[:something:]] and you can combine that class as usual; so (?-i)[A-F[:Ll:]] will work. It is equivalent to (?-i)[A-F[:lower:]] — but the case-insensitive versions are not equivalent (because Boost::regex silently adds [:alpha:] to case-insensitive [:lower:], but it has no knowledge of or special behavior for the Unicode classes).