Search for character classes but not replace them
-
@guy038 said in Search for character classes but not replace them:
correct searches and replacements in all circumstances. What’s your feeling about it ?
I can add this note, for what (if anything) it’s worth.
At some point in the development of Columns++, I realized that to get around some limitations in the Scintilla search interface I’d need to use Boost Regex directly. I really wanted, as part of that, to handle Unicode properly, as Unicode characters instead of as UTF-16 code units. Boost Regex includes support for Unicode, but to do that it depends on ICU.
I could not figure out how to include the necessary dependencies (whatever they are) from ICU as part of a DLL compilation. All instructions discussed installing it at the operating system level. I didn’t want to tell users they had to install something separate system-wide. I gave up on that approach.
So then I thought I could at least write a proper iterator for UTF-32 instead of wchar_t. And ran into character traits. I thought seriously of trying to leverage the traits for wchar_t and “guess” at what to do outside the BMP. (Looking into this made it clear why Boost relies on ICU instead of doing it themselves.) I eventually gave up and implemented UTF-16/wchar_t, essentially what Notepad++ does. It works reasonably well with Windows (which is also UTF-16 as wchar_t) when searching for specific character sequences and/or working with characters in the BMP.
Full and proper Unicode support, as best I can figure out, involves a large amount of detail, which is continuously being updated. (For those who don’t know: not every Unicode character is a single Unicode code point. And unlike the UTF-8/16/32 relationship, there’s no fixed algorithm to tell you which code points combine with others. Then there’s knowing what’s a capital letter, what’s a lower case letter, which letters are equal when case ignored… none of it follows a formula.) If there’s a more compact, contained implementation than ICU, that would be great, but I couldn’t find one. (The C++ standards committee has punted and deprecated the little bit of Unicode support C++ ever had. There are types defined, but nothing that does anything useful with them.)
I did, however, discover after reading this thread that my search doesn’t handle [[:unicode:]] the way Notepad++ does. There must be something clever hidden in the Notepad++ implementation that I missed which lets it “understand” characters outside the basic multilingual plane.
-
@mkupper said:
which can be searched using…
Here’s how to search for them using surrogate pairs
Clearly you see why this isn’t a good answer to the original query?
I don’t want to search specifically, I want to search generically.

I started with (?s). as the simplest thing from this thread, as it was stated earlier that it “works”.
I showed (using some specific characters) that this generic search didn’t work.

Sure, I can try [[:unicode:]] for what I’m trying to do, and see what else – problemwise – I run into.
-
@Alan-Kilborn said in Search for character classes but not replace them:
Sure, I can try [[:unicode:]] for what I’m trying to do, and see what else – problemwise – I run into.
I did an experiment with searching for [[:unicode:]] on @guy038’s Total_Chars.txt file and learned the following:
- It does not match \x{0000} to \x{00FF}
- It matches \x{0100} to \x{0177}
- It does not match Ÿ, which is \x{0178}
- It matches \x{0179} to \x{FFFF}
Starting at U+10000 it gets weird. I made a UTF-8 encoded test file that has 78343 lines where each line starts with a Unicode character starting at U+10000 and running up to U+10FFFF. Each character is followed by a tab and then notes about the character. For example line 15125 has:
🌵 U+1F335 \x{D83C}\x{DF35} \xF0\x9F\x8C\xB5
It lets me know the Unicode code point, the surrogate pairs, and the UTF-8 encoding for that character.
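For readers who want to reproduce such a note line, here is a minimal Python sketch of the standard UTF-16 surrogate arithmetic and the UTF-8 encoding of one code point. The function names `to_surrogates` and `describe` are made up for this illustration; they are not part of Notepad++ or of the test file's tooling.

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into its UTF-16 (high, low) words."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = cp - 0x10000              # 20 significant bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits    -> D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> DC00..DFFF
    return high, low


def describe(cp):
    """Build a note line: code point, surrogate pair, UTF-8 bytes."""
    high, low = to_surrogates(cp)
    utf8 = "".join("\\x%02X" % b for b in chr(cp).encode("utf-8"))
    return "U+%04X \\x{%04X}\\x{%04X} %s" % (cp, high, low, utf8)


print(describe(0x1F335))  # the cactus used as the example above
```

The same function gives \x{D800}\x{DC00} for U+10000 and \x{DBFF}\x{DFFF} for U+10FFFF, the two ends of the range covered by the test file.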
- A count for [[:unicode:]] says 78343, which is the number of lines.
- A search for ^[[:unicode:]] or \R[[:unicode:]] gets zero hits.
- A search for [[:unicode:]]\t gets 78343 hits.
It seems that [[:unicode:]] is matching the second word of the surrogate pair but not the first. The first word of the pairs ranges from \x{D800} to \x{DBFF} while the second word is always in the range \x{DC00} to \x{DFFF}. The weird thing is that [[:unicode:]] matches orphan words in the range \x{D800} to \x{DBFF} and also matches orphans in the range \x{DC00} to \x{DFFF}. It’s possible that Notepad++ does something special with those orphans, as you are not supposed to have them as orphans; plus there are intentional gaps in the UTF-8 encoding system so they can’t be encoded as UTF-8 … if you follow the rules.
-
Hello, @alan-kilborn, @mkupper, @coises and All,
First, @mkupper, you made the same mistake that I did when we spoke about the LS and PS characters and for which you had given me the solution !

- Indeed, the regex (?i)[[:unicode:]] does not match the \x{0178} character
- Luckily, the regexes (?-i)[[:unicode:]], even (?s-i)[[:unicode:]], do match the \x{0178} character as well as any character over \x{00FF}

Oh…, My God : regarding the Total_Chars.txt file, I’m really confused because I’d completely forgotten that this file was accessible, among some others, on my Google Drive account ! So, for people interested, simply click on the link below :

https://drive.google.com/file/d/1kYtbIGPRLdypY7hNMI-vAJXoE7ilRMOC/view?usp=sharing

As a security measure, once the Total_Chars.txt file is loaded in Notepad++, you can right-click on its tab and choose the Read-Only option

Thank you, @mkupper, for refreshing my memory ;-))
Best Regards
guy038
-
-
@mkupper said in Search for character classes but not replace them:
It’s possible that Notepad++ does something special with those orphans as you not supposed to have them as orphans plus there are intentional gaps in the UTF-8 encoding system so they can’t be encoded as UTF-8 … if you follow the rules.
I’ve been making attempts to follow this under debug in Visual Studio, but so far… I’m lost in the murky depths of Boost regex.
The iterator for UTF-8 documents is implemented in these files:
UTF8DocumentIterator.h
UTF8DocumentIterator.cxx

and you can see there how UTF-8 sequences are mapped to wchar_t/UTF-16.

But why . matches one of a surrogate pair but [[:unicode:]] matches both escapes me. (In my search in Columns++ both only match a single wchar_t. I don’t use the same iterator code, but I don’t know what I do that would produce different results, other than handling invalid UTF-8 differently.)

To make sense of invalid UTF-16, we’d have to look at the process by which Notepad++ loads UTF-16 and transforms it into UTF-8. I think there is some method of encoding wchar_t sequences that don’t represent valid UTF-16 as invalid, but still round-trip-able, UTF-8.
If you uncover a clue, I would welcome one.
-
@guy038 said in Search for character classes but not replace them:
(?-i)[[:unicode:]]
Thank you for doing that test as I was thinking about doing something similar. I had seen that Ÿ - \x{0178} was the upper case form of ÿ - \x{00FF} and wondered if the failure to match was a one-off edge error. The failure to match still seems like a bug to me unless the rule for (?-i)[[:unicode:]] is that it only matches if both the upper and lower case forms of a letter have a character code of \x{0100} or higher. FWIW, Notepad++'s convert case functions work on ÿŸ.

I did a search for other letters where one letter case was \x{0000} to \x{00FF} and the other was \x{0100} or higher and found

ß \x{00DF} \xC3\x9F LATIN SMALL LETTER SHARP S
ẞ \x{1E9E} \xE1\xBA\x9E LATIN CAPITAL LETTER SHARP S

(?i)[[:unicode:]] matches ẞ (U+1E9E) as expected. However, I also see that Notepad++'s case conversion functions fail to convert that letter to its upper or lower case version. A search using (?-i)ß or (?-i)ẞ also fails to match both cases of that letter. According to the U+00DF and U+1E9E pages on fileformat.info, that pair should be case-convertible.
-
@Coises, The UTF8DocumentIterator code seems straightforward and does more or less mindless conversion. It barely cares about invalid codes, etc. The logic silently allows overlong encoding where, for example, a 3-byte UTF-8 sequence is used to encode a value from 0x00 to 0x7F, which is normally a 1-byte sequence, or 0x0080 to 0x07FF, which is normally a 2-byte sequence.

The logic also silently allows 4-byte UTF-8 sequences that encode 0x110000 to 0x1FFFFF, which is beyond the range assigned to Unicode. It will attempt to convert those values into surrogate pairs. The first word of the pair will overflow the 0xD800 to 0xDBFF range assigned to the first word. The second word is ok and will be a value in the range 0xDC00 to 0xDFFF, which is correct for the second word of the pair. I’d have to trace a bit more carefully, but the code also seems to silently allow for 5 and 6 byte long encodings that either contain underlong values or will overflow the first word of the surrogate pairs. Overall, it’s not a huge issue: it results in garbage in, garbage out, but it should not crash the editor unless something is unhappy about orphan parts of surrogate pairs.
I’m now wondering if the internal storage is UTF-16. That would explain some of the search behavior.
-
@Coises said in Search for character classes but not replace them:
But why . matches one of a surrogate pair but [[:unicode:]] matches both escapes me. (In my search in Columns++ both only match a single wchar_t. I don’t use the same iterator code, but I don’t know what I do that would produce different results, other than handling invalid UTF-8 differently.)
This reminds me of an issue where (IIRC; don’t have a computer with Notepad++ on it in front of me right now) you get weird things like .* matching any number of emojis, but (.)* does not match any emojis at all.
-
@mkupper said in Search for character classes but not replace them:
I’m now wondering if the internal storage is UTF-16. That would explain some of the search behavior.
That part I can tell you with certainty, because the search I implemented in Columns++ accesses the Scintilla buffer directly using SCI_GETRANGEPOINTER.
Scintilla’s storage for document text is either “ANSI” (current system code page) or UTF-8. However, the character type template parameter for boost::regex search in UTF-8 documents is wchar_t because of the problems I mentioned earlier. (I think part of this is that the std::char_traits specialization on Windows for wchar_t is apt for UTF-16. In any case, boost::regex can handle wchar_t reasonably by itself, but something else, either ICU or a std::char_traits specialization, would be needed to handle 32-bit Unicode.) So a custom iterator is needed to translate the UTF-8 bytes into wchar_t holding UTF-16.

Notepad++ and Columns++ both do that translation, though we go about it in somewhat different ways.

I still haven’t figured out how [[:unicode:]], being a character class which should always match either nothing or a single “character” in the regex sense of character, manages to match a surrogate pair in Notepad++.
-
@Coises said in Search for character classes but not replace them:
I still haven’t figured out how [[:unicode:]], being a character class which should always match either nothing or a single “character” in the regex sense of character, manages to match a surrogate pair in Notepad++.
I don’t think [[:unicode:]] matches surrogate pairs. It seems to match the second word of the pair but not the first. Let’s say we have tab🌵tab on a line:

	🌵	

- The cactus is U+1F335 and we normally use the surrogate pair \x{D83C}\x{DF35} to search for it.
- \x{D83C} reports a zero length match.
- \x{DF35} selects the cactus and seems to select all of the encoded data for the cactus.
- \t\x{D83C} selects the leading tab but not the cactus. You can verify this by pressing Del and the tab goes away but the cactus remains.
- \x{DF35}\t selects the cactus and the trailing tab. It seems to be including both words in the selection as I can copy/paste it and get the cactus+tab.
- \t[[:unicode:]] says it can’t find the text.
- [[:unicode:]]\t selects the cactus and the trailing tab.
- \x{D83C}[[:unicode:]]\t selects the cactus and the trailing tab.
- [[:unicode:]][[:unicode:]]\t says it can’t find the text.
- \t.\t says it can’t find the text.
- \t..\t selects the tab+cactus+tab.

It seems that [[:unicode:]] matches the second word. I suspect this was done so that you don’t need to use [[:unicode:]][[:unicode:]] the way you need to use dot-dot to match these characters.

I did a search/replace using [[:unicode:]]\t with \x{DF36}\t. The cactus changes into an empty box. I was hoping to see a 🌶 hot pepper by swapping out the second word. The UTF-8 data is 09 f3 9d a0 89, meaning the leading tab is still there but the trailing tab is gone, and it’s a hot mess as that last 89 is a continuation byte that holds a tab.
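The bytes f3 9d a0 89 are at least consistent with one possible explanation, which is only a guess and not taken from the Notepad++ source: the replacement \x{DF36} followed by the tab (0x0009) is combined as if the two words were a surrogate pair, keeping just the low 10 bits of each word. A minimal Python sketch of that assumption (`combine_by_masking` is a made-up name):

```python
def combine_by_masking(first, second):
    """Combine two 16-bit words like a surrogate pair, keeping only the
    low 10 bits of each and doing no validity check (an assumption)."""
    return 0x10000 + ((first & 0x3FF) << 10) + (second & 0x3FF)


cp = combine_by_masking(0xDF36, 0x0009)  # \x{DF36} followed by a tab
encoded = chr(cp).encode("utf-8")
print("U+%X" % cp, encoded.hex(" "))
```

Under that assumption the result is U+DD809, whose UTF-8 form is exactly the four bytes reported after the leading 09, with the tab's value ending up in the last continuation byte.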
-
All,
There’s a really good discussion going on here. I’m listening. :-)
Meanwhile I’m getting discouraged that, with the tools available to me, I’m not going to be able to do the task I set out to do. I haven’t provided details on that, but it should be clear by now that I want an expression to match any single UTF-8 “character”. Maybe I’m being silly because perhaps that isn’t a solid concept. :-(
-
@alan-kilborn, @mkupper, @coises, @mark-olson and All,
As promised, here is my feature request :
https://github.com/notepad-plus-plus/notepad-plus-plus/issues/16153
Best Regards
guy038
-
@guy038 said in Search for character classes but not replace them:
https://github.com/notepad-plus-plus/notepad-plus-plus/issues/16153
While you mentioned \x{0001F600} in the introduction, I noticed that the macro won’t transform it into a surrogate pair. Maybe you’d want to first normalize the value:

Search: (?-i)(?<=\\x\{)0+(?i)(?=(?:[1-9A-F]|10)[[:xdigit:]]{4}\})
Replace: (nothing)

It’s possible you could add that to your first S/R which appends the \x1F and then would not need to test for a trailing \} in your later S/R.
-
Hello, @alan-kilborn, @mkupper, @coises, @mark-olson and All,
Thanks @mkupper for your observation. At first I did not think about possible leading zeros. So, I improved my macro to this version, which, I hope, will be the last one !
<Macro name="Surrogates Pairs in Selection" Ctrl="no" Alt="no" Shift="no" Key="0">
    <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
    <Action type="3" message="1601" wParam="0" lParam="0" sParam="(?-i)\\x\{0*((?:10|[[:xdigit:]])[[:xdigit:]]{4})(?=\})" />
    <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
    <Action type="3" message="1602" wParam="0" lParam="0" sParam="\\x{$1\x1F" />
    <Action type="3" message="1702" wParam="0" lParam="640" sParam="" />
    <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
    <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
    <Action type="3" message="1601" wParam="0" lParam="0" sParam="(?i)(?:(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(A)|(B)|(C)|(D)|(E)|(F)|(10))(?=[[:xdigit:]]{4}\x1F\})|(?:(0)|(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(A)|(B)|(C)|(D)|(E)|(F))(?=[[:xdigit:]]{0,3}\x1F\})" />
    <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
    <Action type="3" message="1602" wParam="0" lParam="0" sParam="(?{1}0000)(?{2}0001)(?{3}0010)(?{4}0011)(?{5}0100)(?{6}0101)(?{7}0110)(?{8}0111)(?{9}1000)(?{10}1001)(?{11}1010)(?{12}1011)(?{13}1100)(?{14}1101)(?{15}1110)(?{16}1111)(?{17}0000)(?{18}0001)(?{19}0010)(?{20}0011)(?{21}0100)(?{22}0101)(?{23}0110)(?{24}0111)(?{25}1000)(?{26}1001)(?{27}1010)(?{28}1011)(?{29}1100)(?{30}1101)(?{31}1110)(?{32}1111)" />
    <Action type="3" message="1702" wParam="0" lParam="640" sParam="" />
    <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
    <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
    <Action type="3" message="1601" wParam="0" lParam="0" sParam="([01]{10})([01]{10})(?=\x1F)" />
    <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
    <Action type="3" message="1602" wParam="0" lParam="0" sParam="110110\1\x1F}\\x{110111\2" />
    <Action type="3" message="1702" wParam="0" lParam="640" sParam="" />
    <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
    <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
    <Action type="3" message="1601" wParam="0" lParam="0" sParam="(?:(0000)|(0001)|(0010)|(0011)|(0100)|(0101)|(0110)|(0111)|(1000)|(1001)|(1010)|(1011)|(1100)|(1101)|(1110)|(1111))(?=[[:xdigit:]]*\x1F\})|\x1F" />
    <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
    <Action type="3" message="1602" wParam="0" lParam="0" sParam="(?{1}0)(?{2}1)(?{3}2)(?{4}3)(?{5}4)(?{6}5)(?{7}6)(?{8}7)(?{9}8)(?{10}9)(?11A)(?12B)(?13C)(?14D)(?15E)(?16F)" />
    <Action type="3" message="1702" wParam="0" lParam="640" sParam="" />
    <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
</Macro>
As a test :
- Select all the text below :
\x0f \xff \x{0f} \x{ff} \x{000f} \x{00ff} \x{0fff} \x{ffff} \x{10000} \x{010000} \x{0010000} \x{00010000} \x{1f600} \x{01f600} \x{001f600} \x{0001f600} \x{10ffff} \x{010ffff} \x{0010ffff} \x{110000} \x{1fffff}
- Run the Surrogates Pairs in Selection macro once
You should get the expected text, below :
\x0f \xff \x{0f} \x{ff} \x{000f} \x{00ff} \x{0fff} \x{ffff} \x{D800}\x{DC00} \x{D800}\x{DC00} \x{D800}\x{DC00} \x{D800}\x{DC00} \x{D83D}\x{DE00} \x{D83D}\x{DE00} \x{D83D}\x{DE00} \x{D83D}\x{DE00} \x{DBFF}\x{DFFF} \x{DBFF}\x{DFFF} \x{DBFF}\x{DFFF} \x{110000} \x{1fffff}
As you can see, values up to \x{FFFF} are not changed, and the last \x{110000} and \x{1fffff} code-points are not converted either, as they are not Unicode characters at all !

Best Regards,
guy038
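For comparison, the transformation the macro aims at can be sketched outside Notepad++ in a few lines of Python. This is not a translation of the macro: it decides by numeric value instead of by digit pattern, and the names `ESCAPE` and `to_surrogate_escapes` are invented for the example.

```python
import re

# \x{...} escapes with 1 to 8 hex digits; the value is checked numerically below.
ESCAPE = re.compile(r"\\x\{([0-9A-Fa-f]{1,8})\}")


def to_surrogate_escapes(text):
    """Rewrite \\x{HHHHH} escapes above FFFF as a pair of surrogate escapes."""
    def replace(match):
        cp = int(match.group(1), 16)
        if not 0x10000 <= cp <= 0x10FFFF:
            return match.group(0)      # BMP or beyond Unicode: leave untouched
        offset = cp - 0x10000
        high = 0xD800 + (offset >> 10)
        low = 0xDC00 + (offset & 0x3FF)
        return "\\x{%04X}\\x{%04X}" % (high, low)
    return ESCAPE.sub(replace, text)


sample = r"\x{ffff} \x{0ffff} \x{10000} \x{0001f600} \x{10ffff} \x{110000}"
print(to_surrogate_escapes(sample))
```

Because the decision is numeric, leading zeros need no separate normalization step, and values such as \x{0ffff} or \x{110000} are simply left alone.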
-
@guy038 Your first expression allows \x{0ffff} and the remaining expressions end up mangling that into \x{7FFF1}. That is why I had used (?:[1-9A-F]|10) earlier as I knew I could not and should not attempt to translate BMP code points into surrogate pairs.

At the time I wrote (?:[1-9A-F]|10) I wondered if (?:10|[1-9A-F]) was better and would result in less backtracking for common values. Mental gymnastics in the shower led me to think that (?:10|[1-9A-F]) was better, but I then forgot about it and did not do any testing to see how common it was for Unicode code point values to start with 10. I noticed you used (?:10|[[:xdigit:]]) but don’t know if that was based on testing, experience, or habit.
-
@Alan-Kilborn said in Search for character classes but not replace them:
Meanwhile I’m getting discouraged that, with the tools available to me, I’m not going to be able to do the task I set out to do. I haven’t provided details on that, but it should be clear by now that I want an expression to match any single UTF-8 “character”. Maybe I’m being silly because perhaps that isn’t a solid concept. :-(
It’s not unreasonable. The concept of a Unicode code point is well-defined, and the UTF-8 representation of a code point is well-defined.
The “character” problem arises with combining marks and variant forms. If you assume normalization to composed form, most characters are single code points. I suspect you would be content with an expression which matched any character represented by the UTF-8 sequence for a single Unicode code point. (That includes all the ones in your example.)
I’m still trying to make sense of what happens. I found an error in my Columns++ code which prevents me (until I publish a fix) from demonstrating this, but what you want is theoretically possible (with the caveat above about code points vs characters) with the expression [\x{d800}-\x{dbff}]?[^\x{d800}-\x{dbff}]. If you try that by itself, it looks like it works in Notepad++; but if you try putting anything else to match in front of it, it fails. The match with a low surrogate never happens, while the high surrogate match visibly selects the whole character, but internally matches only the second half of the surrogate pair.
-
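The idea behind that expression can be tried outside Notepad++. The Python sketch below simulates a sequence of UTF-16 code units (one item per wchar_t) and applies the same pattern with Python's `re` module, which is a different engine from Boost, so it only shows the intended behaviour and says nothing about what Notepad++ or Columns++ actually do. `utf16_units` and `ONE_CODE_POINT` are names made up here.

```python
import re
import struct


def utf16_units(text):
    """Return a str whose items are the UTF-16 code units of text,
    so each half of a surrogate pair is a separate item."""
    raw = text.encode("utf-16-le")
    units = struct.unpack("<%dH" % (len(raw) // 2), raw)
    return "".join(chr(u) for u in units)


# An optional high surrogate followed by any unit that is not a high surrogate.
ONE_CODE_POINT = re.compile(r"[\ud800-\udbff]?[^\ud800-\udbff]")

units = utf16_units("a\U0001F335\t\u00FF")   # a, cactus, tab, y-diaeresis
matches = ONE_CODE_POINT.findall(units)
print([len(m) for m in matches])
```

Each match covers exactly one code point: one unit for BMP characters and two units (high plus low surrogate) for the cactus.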
@Alan-Kilborn said in Search for character classes but not replace them:
I haven’t provided details on that, but it should be clear by now that I want an expression to match any single UTF-8 “character”. Maybe I’m being silly because perhaps that isn’t a solid concept. :-(
By the way… based on something I found while working on finding a better way to do this… I think that you’ll find the regex search in PythonScript doesn’t have this problem. I believe a period there will just match one character of any sort. (I haven’t tried it because I don’t know Python… I base my statement on the support code, which has given me an idea for how to make similar improvements to the search in Columns++.) I think you’ll also be able to use actual Unicode values above 0xffff rather than needing to split them into surrogates.
-
@Coises said:
…given me an idea for how to make similar improvements to the search in Columns++.
I believe that you are talking about what you said HERE.
I think that you’ll find the regex search in PythonScript doesn’t have this problem. I believe a period there will just match one character of any sort
It appears to be true; see code below.
I think you’ll also be able to use actual Unicode values above 0xffff rather than needing to split them into surrogates.
This does NOT appear to be true; see code below.
The aforementioned code (note that it is Python3):
from Npp import *
# 💙<-- we'll be searching for the blue heart character
print('using character:', editor.findText(FINDOPTION.REGEXP, 0, editor.getLength(), '💙'))
print('using surrogate pairs:', editor.findText(FINDOPTION.REGEXP, 0, editor.getLength(), r'\x{D83D}\x{DC99}'))
print('using codepoint above FFFF:', editor.findText(FINDOPTION.REGEXP, 0, editor.getLength(), r'\x{1F499}'))
print('using .:', editor.findText(FINDOPTION.REGEXP, 0, editor.getLength(), r'.(?=<--)'))
and results when running it:
using character: (21, 25)
using surrogate pairs: (21, 25)
using codepoint above FFFF: (-2, 7)
using .: (21, 25)
The -2 returned for the first position of the range match is an indication of “invalid regular expression”. The other results, (21, 25), are correct for the position range of the character being searched for – it’s a character with 4-byte encoding.
-
@Alan-Kilborn @Ekopalypse @guy038 @mkupper
If anyone is curious, an experimental version of Columns++ in which regular expressions process each Unicode code point as a single regex character is available here:
https://github.com/Coises/ColumnsPlusPlus/releases/tag/v1.1.5.1-Experimental
It appears to be working reasonably well, but there hasn’t been a lot of testing yet. Feedback from anyone who cares to try it would be much appreciated.
-
Hello, @coises and All,
This morning, from your post above mine, I downloaded your experimental release of Columns++ ( ColumnsPlusPlus-1.1.5.1-Experimental-x64.zip ) and installed your plugin on my portable N++ v8.7.6. I did not even need your Quick Installer !

So I decided to test this new version of our regex engine and to do some tests against my Total_Chars.txt file that you may download from my Drive Account :

https://drive.google.com/file/d/1kYtbIGPRLdypY7hNMI-vAJXoE7ilRMOC/view?usp=sharing

Here is an updated version of the contents of the Total_Chars.txt file, as I corrected two typos and I also split the range 0080 - 07FF in two ranges 0080 - 00FF and 0100 - 07FF in order to easily highlight the [[:unicode:]]
range of chars :

•--------------------•-------------------•------------•---------------------------•----------------•-------------------•
|       Range        |    Description    |   Status   |      Number of Chars      | UTF-8 Encoding |  Number of Bytes  |
•--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
|    0000 - 007F     |  PLANE 0 - BMP    |  Included  |             |        128  |     1 Byte     |              128  |
•--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
|    0080 - 00FF     |  PLANE 0 - BMP    |  Included  |             |      + 128  |    2 Bytes     |              256  |
|    0100 - 07FF     |  PLANE 0 - BMP    |  Included  |             |    + 1,792  |    2 Bytes     |            3,584  |
•--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
|    0800 - D7FF     |  PLANE 0 - BMP    |  Included  |             |   + 53,248  |    3 Bytes     |          159,744  |
|    D800 - DFFF     |  SURROGATES zone  |  EXCLUDED  |    - 2,048  |             |                |                   |
|    E000 - F8FF     |  PLANE 0 - PUA    |  Included  |             |    + 6,400  |    3 Bytes     |           19,200  |
|    F900 - FDCF     |  PLANE 0 - BMP    |  Included  |             |    + 1,232  |    3 Bytes     |            3,696  |
|    FDD0 - FDEF     |  NON-characters   |  EXCLUDED  |       - 32  |             |                |                   |
|    FDF0 - FFFD     |  PLANE 0 - BMP    |  Included  |             |      + 526  |    3 Bytes     |            1,578  |
|    FFFE - FFFF     |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |
•--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
|                    |  Plane 0 - BMP    | SUB-Totals |    - 2,082  |   + 63,454  |                |          188,186  |
•--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
|   10000 - 1FFFD    |  PLANE 1 - SMP    |  Included  |             |   + 65,534  |    4 Bytes     |          262,136  |
|   1FFFE - 1FFFF    |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |
•--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
|   20000 - 2FFFD    |  PLANE 2 - SIP    |  Included  |             |   + 65,534  |    4 Bytes     |          262,136  |
|   2FFFE - 2FFFF    |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |
•--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
|   30000 - 3FFFD    |  PLANE 3 - TIP    |  Included  |             |   + 65,534  |    4 Bytes     |          262,136  |
|   3FFFE - 3FFFF    |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |
•--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
|   40000 - DFFFF    |  PLANES 4 to 13   |  NOT USED  |  - 655,360  |             |    4 Bytes     |                   |
•--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
|   E0000 - EFFFD    |  PLANE 14 - SPP   |  Included  |             |   + 65,534  |    4 Bytes     |          262,136  |
|   EFFFE - EFFFF    |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |
•--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
|   F0000 - FFFFD    |  PLANE 15 - SPUA  |  NOT USED  |   - 65,534  |             |    4 Bytes     |                   |
|   FFFFE - FFFFF    |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |
•--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
|  100000 - 10FFFD   |  PLANE 16 - SPUA  |  NOT USED  |   - 65,534  |             |    4 Bytes     |                   |
|  10FFFE - 10FFFF   |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |
•--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
|                    GRAND Totals                     |  - 788,522  |  + 325,590  |                |        1,236,730  |
|                Byte Order Mark - BOM                |             |             |                |                3  |
•-----------------------------------------------------•-------------•-------------•----------------•-------------------•
|                                                     |  1,114,112 Unicode chars  |                |  Size 1,236,733   |
•-----------------------------------------------------•---------------------------•----------------•-------------------•

The [[:unicode:]] range spans the rows from 0100 to EFFFD.
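The totals in this table can be cross-checked with a short script. The sketch below is only an independent recomputation under the table's own inclusion rules (planes 0 to 3 and 14, minus surrogates and non-characters); the helper names `included` and `utf8_length` are made up, and nothing here reads the actual Total_Chars.txt file.

```python
def included(cp):
    """True if the code point is counted as 'Included' in the table."""
    plane = cp >> 16
    if plane not in (0, 1, 2, 3, 14):
        return False                   # planes 4 to 13, 15 and 16 are left out
    if 0xD800 <= cp <= 0xDFFF:
        return False                   # surrogates zone
    if 0xFDD0 <= cp <= 0xFDEF:
        return False                   # non-characters inside the BMP
    return (cp & 0xFFFF) < 0xFFFE      # last two code points of each plane


def utf8_length(cp):
    """Number of bytes UTF-8 needs for one code point."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


chars = [cp for cp in range(0x110000) if included(cp)]
total_chars = len(chars)
total_bytes = sum(utf8_length(cp) for cp in chars)
over_ff = sum(1 for cp in chars if cp > 0xFF)

print(total_chars, total_bytes, over_ff)
```

The three numbers correspond to the grand total of characters, the byte count without the 3-byte BOM, and the number of characters above \x{00FF}.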
And I’m pleased to tell you that, globally, this UTF-32 version works nicely ! Against my Total_Chars.txt, I got these results :

[\x{0000}-\x{007F}]    =>      128 chars  ( OK )
[\x{0080}-\x{00FF}]    =>      128 chars  ( OK )
[\x{0100}-\x{07FF}]    =>    1,792 chars  ( OK )
[\x{0800}-\x{D7FF}]    =>   53,248 chars  ( OK )
[\x{E000}-\x{F8FF}]    =>    6,400 chars  ( OK )
[\x{F900}-\x{FDCF}]    =>    1,232 chars  ( OK )
[\x{FDF0}-\x{FFFD}]    =>      526 chars  ( OK )
[\x{10000}-\x{1FFFD}]  =>   65,534 chars  ( OK )
[\x{20000}-\x{2FFFD}]  =>   65,534 chars  ( OK )
[\x{30000}-\x{3FFFD}]  =>   65,534 chars  ( OK )
[\x{E0000}-\x{EFFFD}]  =>   65,534 chars  ( OK )

[\x{0000}-\x{007F}]    =>      128 chars  ( OK )  coded with 1 byte
[\x{0080}-\x{07FF}]    =>    1,920 chars  ( OK )  coded with 2 bytes
[\x{0800}-\x{FFFD}]    =>   61,406 chars  ( OK )  coded with 3 bytes
[\x{10000}-\x{EFFFD}]  =>  262,136 chars  ( OK )  coded with 4 bytes

[\x{0000}-\x{EFFFD}]   =>  325,590 chars  ( OK )  Total of characters
[\x{0100}-\x{EFFFD}]   =>  325,334 chars  ( OK )  Total chars OVER \x{00FF}
[[:unicode:]]          =>  323,286 chars  ( KO )  Should be 325,334 chars
Regarding this error, note that 323,286 + 2,048 does give the right value 325,334 !

The 2,048 value seems linked to the Surrogate pairs mechanism. However, my Total_Chars.txt file does NOT contain any Surrogate character at all. So what !?
Note also that the regex (?s). returns, as expected, 325,590 chars and that the simple . regex returns 325,572 chars. The difference concerns the 18 characters below, which are considered as line separators :

LF  000A   LINE FEED
FF  000C   FORM FEED
CR  000D   CARRIAGE RETURN
NEL 0085   NEXT LINE
LS  2028   LINE SEPARATOR
PS  2029   PARAGRAPH SEPARATOR

𐂅   10085  LINEAR B IDEOGRAM B105M STALLION
𒀨   12028  CUNEIFORM SIGN AL TIMES USH
𒀩   12029  CUNEIFORM SIGN ALAN

𠂅   20085  [CJK Unified Ideographs Extension B]
𢀨   22028  [CJK Unified Ideographs Extension B]
𢀩   22029  [CJK Unified Ideographs Extension B]

𰂅   30085  [CJK Unified Ideographs Extension G]
𲀨   32028  [CJK Unified Ideographs Extension H]
𲀩   32029  [CJK Unified Ideographs Extension H]

    E0085  UNASSIGNED code-point
    E2028  UNASSIGNED code-point
    E2029  UNASSIGNED code-point

Note that, normally, ONLY the first six chars should be seen as line separators. So, it's quite funny that the chars with the same low-order digits as \x{0085}, \x{2028} and \x{2029}, in the other planes included in my file, are also considered as line separators !
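One possible explanation, offered purely as a guess and not checked against the Columns++ source, is that the line-separator test looks only at the low 16 bits of each code point. The Python sketch below lists every code point in the supplementary planes present in the file whose low 16 bits collide with NEL, LS or PS; the names `SEPARATORS`, `PLANES` and `lookalikes` are invented for the illustration.

```python
SEPARATORS = {0x0085, 0x2028, 0x2029}   # NEL, LS, PS
PLANES = (1, 2, 3, 14)                  # supplementary planes present in the file

# Code points above U+FFFF whose low 16 bits equal one of the separators.
lookalikes = [
    cp
    for plane in PLANES
    for cp in range(plane << 16, (plane + 1) << 16)
    if (cp & 0xFFFF) in SEPARATORS
]

print(len(lookalikes))
print(" ".join("%X" % cp for cp in lookalikes))
```

The sketch yields 12 code points, the same twelve listed above in addition to the six genuine separators.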
I also ran the test regarding the NUL character, described in this post :

https://community.notepad-plus-plus.org/post/99736

FIND ABC\x00XYZ
REPLACE \x0--$0--\x{000}

Which works and gives the correct result !
I also tried some expressions with look-aheads and look-behinds, containing overlapping zones !
For instance, against this text aaaabaaababbbaabbabb, pasted in a new tab, with a final line-break, all the regexes below give the correct number of matches :

ba*(?=a)    =>  4 matches
ba*(?!a)    =>  9 matches
ba*(?=b)    =>  8 matches
ba*(?!b)    =>  5 matches
(?<=a)ba*   =>  5 matches
(?<!b)ba*   =>  5 matches
(?<=b)ba*   =>  4 matches
(?<!a)ba*   =>  4 matches
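These counts can be cross-checked with another regex engine. The sketch below uses Python's `re` module, which is not Boost, so agreement is only a sanity check on the expected numbers and not a test of Columns++ itself.

```python
import re

text = "aaaabaaababbbaabbabb\n"   # the sample line with its final line-break

patterns = [
    r"ba*(?=a)", r"ba*(?!a)", r"ba*(?=b)", r"ba*(?!b)",
    r"(?<=a)ba*", r"(?<!b)ba*", r"(?<=b)ba*", r"(?<!a)ba*",
]

# Count non-overlapping matches of each pattern, scanning left to right.
counts = {p: sum(1 for _ in re.finditer(p, text)) for p in patterns}

for pattern, count in counts.items():
    print(pattern, "=>", count, "matches")
```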
However, I could not test this irritating problem :

Backward regex searches, for NON ANSI files, stop as soon as they match a character with code-point over \x{007F}

Just because you do not allow backward searches when choosing the Regular expression search mode ! Maybe you could add it among all the Columns++ options ?
In the end, I can say that this experimental release is a valuable improvement and should be adopted in your Columns++ plugin, as well as in the N++ Boost regex engine ;-))

@coises, Many thanks for your efforts, so far !
Best Regards,
guy038
P.S. :
Just a small drawback :
- For the four regexes [\x{10000}-\x{EFFFD}], [\x{0000}-\x{EFFFD}], [\x{0100}-\x{EFFFD}] and [[:unicode:]], which match more than 200,000 occurrences, the Select All option of the Count button leads to the message that the program does not respond :-((