Search for character classes but not replace them

Coises

@mkupper said in Search for character classes but not replace them:

I’m now wondering if the internal storage is UTF-16. That would explain some of the search behavior.

That part I can tell you with certainty, because the search I implemented in Columns++ accesses the Scintilla buffer directly using SCI_GETRANGEPOINTER.

Scintilla’s storage for document text is either “ANSI” (current system code page) or UTF-8. However, the character type template parameter for boost::regex search in UTF-8 documents is wchar_t because of the problems I mentioned earlier. (I think part of this is that the std::char_traits specialization on Windows for wchar_t is apt for UTF-16. In any case, boost::regex can handle wchar_t reasonably by itself, but something else — either ICU or a std::char_traits specialization — would be needed to handle 32-bit Unicode.) So a custom iterator is needed to translate the UTF-8 bytes into wchar_t holding UTF-16.
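
As a rough illustration of the translation described above, the following sketch turns UTF-8 bytes into UTF-16 code units. It is not the iterator used by Notepad++ or Columns++ (both are C++); the function name `utf8_to_utf16_units` is made up for this example, and Python's built-in decoder does the UTF-8 parsing.

```python
def utf8_to_utf16_units(data):
    """Decode UTF-8 bytes and return the equivalent UTF-16 code units."""
    units = []
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        if cp < 0x10000:
            units.append(cp)                     # BMP: one 16-bit unit
        else:
            cp -= 0x10000                        # 20 significant bits remain
            units.append(0xD800 + (cp >> 10))    # high (lead) surrogate
            units.append(0xDC00 + (cp & 0x3FF))  # low (trail) surrogate
    return units


sample = "\t\U0001F335\t".encode("utf-8")        # tab, cactus, tab
print([f"{u:04X}" for u in utf8_to_utf16_units(sample)])
# ['0009', 'D83C', 'DF35', '0009']
```

A regex engine working on these units sees the cactus as two "characters", which is why patterns have to spell it as a surrogate pair.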

Notepad++ and Columns++ both do that translation, though we go about it in somewhat different ways.

I still haven’t figured out how [[:unicode:]], being a character class which should always match either nothing or a single “character” in the regex sense of character, manages to match a surrogate pair in Notepad++.

mkupper

@Coises said in Search for character classes but not replace them:

I still haven’t figured out how [[:unicode:]], being a character class which should always match either nothing or a single “character” in the regex sense of character, manages to match a surrogate pair in Notepad++.

I don’t think [[:unicode:]] matches surrogate pairs. It seems to match the second word of the pair but not the first. Let’s say we have tab🌵tab on a line

	🌵

The cactus is U+1F335 and we normally use the surrogate pair \x{D83C}\x{DF35} to search for it.
\x{D83C} reports a zero length match
\x{DF35} selects the cactus and seems to select all of the encoded data for the cactus.
\t\x{D83C} selects the leading tab but not the cactus. You can verify this by pressing Del and the tab goes away but the cactus remains.
\x{DF35}\t selects the cactus and the trailing tab. It seems to be including both words in the selection as I can copy/paste it and get the cactus+tab.
\t[[:unicode:]] says it can’t find the text.
[[:unicode:]]\t selects the cactus and the trailing tab.
\x{D83C}[[:unicode:]]\t selects the cactus and the trailing tab.
[[:unicode:]][[:unicode:]]\t says it can’t find the text.
\t.\t says it can’t find the text.
\t..\t selects the tab+cactus+tab.

It seems that [[:unicode:]] matches the second word. I suspect this was done so that you don’t need to use [[:unicode:]][[:unicode:]] the way you need to use dot-dot to match these characters.

I did a search/replace using [[:unicode:]]\t with \x{DF36}\t. The cactus changes into an empty box. I was hoping to see a 🌶 hot pepper by swapping out the second word. The UTF-8 data is 09 f3 9d a0 89, meaning the leading tab is still there but the trailing tab is gone, and it’s a hot mess, as that last 89 is a continuation byte that holds a tab.
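
For reference, a small sketch of what well-formed UTF-8 looks like for the two characters involved, and of the fact that a lone low surrogate has no valid UTF-8 encoding at all. This only shows what a standard encoder does; it does not reproduce the exact bytes Notepad++ wrote in the experiment above. The helper names are made up for this example.

```python
def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex."""
    return text.encode("utf-8").hex(" ")


def is_encodable(text):
    """True if text can be encoded as well-formed UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print(utf8_hex("\U0001F335"))   # cactus     -> f0 9f 8c b5
print(utf8_hex("\U0001F336"))   # hot pepper -> f0 9f 8c b6
print(is_encodable("\udf36"))   # lone low surrogate -> False
```

The two characters differ only in the last byte, but that byte can only be produced by encoding the whole code point, not by writing half of a surrogate pair on its own.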

Alan Kilborn

All,

There’s a really good discussion going on here. I’m listening. :-)

Meanwhile I’m getting discouraged that, with the tools available to me, I’m not going to be able to do the task I set out to do. I haven’t provided details on that, but it should be clear by now that I want an expression to match any single UTF-8 “character”. Maybe I’m being silly because perhaps that isn’t a solid concept. :-(

guy038

@alan-kilborn, @mkupper, @coises, @mark-olson and All,

As promised, here is my feature request :

https://github.com/notepad-plus-plus/notepad-plus-plus/issues/16153

Best Regards

guy038

mkupper

@guy038 said in Search for character classes but not replace them:

https://github.com/notepad-plus-plus/notepad-plus-plus/issues/16153

While you mentioned \x{0001F600} in the introduction, I noticed that the macro won’t transform it into a surrogate pair. Maybe you’d want to first normalize the value:
Search: (?-i)(?<=\\x\{)0+(?i)(?=(?:[1-9A-F]|10)[[:xdigit:]]{4}\})
Replace: (nothing)

It’s possible you could add that to your first S/R which appends the \x1F and then would not need to test for a trailing \} in your later S/R.
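
A sketch of the same normalization in Python, for anyone who wants to try it outside Notepad++. Python's `re` has no `[[:xdigit:]]` class and no mid-pattern `(?-i)`/`(?i)` switches, so the pattern below spells out both letter cases instead; that translation is an assumption of this example, not part of the original search expression.

```python
import re

# Leading zeros inside \x{...}, but only when what follows is a 5- or
# 6-digit value in the range 10000..10FFFF (first digit 1-9/A-F, or "10").
LEADING_ZEROS = re.compile(
    r"(?<=\\x\{)0+(?=(?:[1-9A-Fa-f]|10)[0-9A-Fa-f]{4}\})"
)


def strip_leading_zeros(text):
    """Remove leading zeros from supplementary-plane \\x{...} escapes."""
    return LEADING_ZEROS.sub("", text)


print(strip_leading_zeros(r"\x{0001F600}"))   # \x{1F600}
print(strip_leading_zeros(r"\x{0ffff}"))      # \x{0ffff} (BMP value, untouched)
```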

guy038

Hello, @alan-kilborn, @mkupper, @coises, @mark-olson and All,

Thanks @mkupper for your observation. At first I did not think about possible leading zeros. So, I improved my macro to this version, which, I hope, will be the last one !

        <Macro name="Surrogates Pairs in Selection" Ctrl="no" Alt="no" Shift="no" Key="0">
            <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
            <Action type="3" message="1601" wParam="0" lParam="0" sParam="(?-i)\\x\{0*((?:10|[[:xdigit:]])[[:xdigit:]]{4})(?=\})" />
            <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
            <Action type="3" message="1602" wParam="0" lParam="0" sParam="\\x{$1\x1F" />
            <Action type="3" message="1702" wParam="0" lParam="640" sParam="" />
            <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
            <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
            <Action type="3" message="1601" wParam="0" lParam="0" sParam="(?i)(?:(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(A)|(B)|(C)|(D)|(E)|(F)|(10))(?=[[:xdigit:]]{4}\x1F\})|(?:(0)|(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(A)|(B)|(C)|(D)|(E)|(F))(?=[[:xdigit:]]{0,3}\x1F\})" />
            <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
            <Action type="3" message="1602" wParam="0" lParam="0" sParam="(?{1}0000)(?{2}0001)(?{3}0010)(?{4}0011)(?{5}0100)(?{6}0101)(?{7}0110)(?{8}0111)(?{9}1000)(?{10}1001)(?{11}1010)(?{12}1011)(?{13}1100)(?{14}1101)(?{15}1110)(?{16}1111)(?{17}0000)(?{18}0001)(?{19}0010)(?{20}0011)(?{21}0100)(?{22}0101)(?{23}0110)(?{24}0111)(?{25}1000)(?{26}1001)(?{27}1010)(?{28}1011)(?{29}1100)(?{30}1101)(?{31}1110)(?{32}1111)" />
            <Action type="3" message="1702" wParam="0" lParam="640" sParam="" />
            <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
            <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
            <Action type="3" message="1601" wParam="0" lParam="0" sParam="([01]{10})([01]{10})(?=\x1F)" />
            <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
            <Action type="3" message="1602" wParam="0" lParam="0" sParam="110110\1\x1F}\\x{110111\2" />
            <Action type="3" message="1702" wParam="0" lParam="640" sParam="" />
            <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
            <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
            <Action type="3" message="1601" wParam="0" lParam="0" sParam="(?:(0000)|(0001)|(0010)|(0011)|(0100)|(0101)|(0110)|(0111)|(1000)|(1001)|(1010)|(1011)|(1100)|(1101)|(1110)|(1111))(?=[[:xdigit:]]*\x1F\})|\x1F" />
            <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />
            <Action type="3" message="1602" wParam="0" lParam="0" sParam="(?{1}0)(?{2}1)(?{3}2)(?{4}3)(?{5}4)(?{6}5)(?{7}6)(?{8}7)(?{9}8)(?{10}9)(?11A)(?12B)(?13C)(?14D)(?15E)(?16F)" />
            <Action type="3" message="1702" wParam="0" lParam="640" sParam="" />
            <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
        </Macro>

As a test :

Select all the text below :

\x0f
\xff

\x{0f}
\x{ff}

\x{000f}
\x{00ff}
\x{0fff}
\x{ffff}

\x{10000}
\x{010000}
\x{0010000}
\x{00010000}

\x{1f600}
\x{01f600}
\x{001f600}
\x{0001f600}

\x{10ffff}
\x{010ffff}
\x{0010ffff}

\x{110000}
\x{1fffff}

Run the Surrogates Pairs in Selection macro once

You should get the expected text, below :

\x0f
\xff

\x{0f}
\x{ff}

\x{000f}
\x{00ff}
\x{0fff}
\x{ffff}

\x{D800}\x{DC00}
\x{D800}\x{DC00}
\x{D800}\x{DC00}
\x{D800}\x{DC00}

\x{D83D}\x{DE00}
\x{D83D}\x{DE00}
\x{D83D}\x{DE00}
\x{D83D}\x{DE00}

\x{DBFF}\x{DFFF}
\x{DBFF}\x{DFFF}
\x{DBFF}\x{DFFF}

\x{110000}
\x{1fffff}

As you can see, values up to \x{FFFF} are left unchanged, and the last two values, \x{110000} and \x{1fffff}, are ignored too, as they are not Unicode code points at all !
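
For comparison, here is a sketch of the same conversion done arithmetically in Python rather than through the macro's hex-to-binary-to-hex passes. It is not a translation of the macro: the function names are made up, and the pattern uses `[1-9A-F]` for the leading digit (as in @mkupper's earlier expression) so that only values from 10000 to 10FFFF are touched.

```python
import re

# \x{...} escapes holding a value in 10000..10FFFF, with optional leading zeros.
SUPPLEMENTARY = re.compile(r"\\x\{0*((?:10|[1-9A-Fa-f])[0-9A-Fa-f]{4})\}")


def _to_pair(match):
    cp = int(match.group(1), 16) - 0x10000   # 20-bit offset into planes 1..16
    high = 0xD800 + (cp >> 10)               # top 10 bits
    low = 0xDC00 + (cp & 0x3FF)              # bottom 10 bits
    return "\\x{%04X}\\x{%04X}" % (high, low)


def to_surrogate_pairs(text):
    """Rewrite supplementary-plane \\x{...} escapes as surrogate pairs."""
    return SUPPLEMENTARY.sub(_to_pair, text)


print(to_surrogate_pairs(r"\x{0001f600}"))   # \x{D83D}\x{DE00}
print(to_surrogate_pairs(r"\x{10ffff}"))     # \x{DBFF}\x{DFFF}
print(to_surrogate_pairs(r"\x{110000}"))     # \x{110000} (unchanged)
```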

Best Regards,

guy038

mkupper

@guy038 Your first expression allows \x{0ffff} and the remaining expressions end up mangling that into \x{7FFF1}. That is why I had used (?:[1-9A-F]|10) earlier as I knew I could not and should not attempt to translate BMP code points into surrogate pairs.

At the time I wrote (?:[1-9A-F]|10) I wondered if (?:10|[1-9A-F]) was better and would result in less backtracking for common values. Mental gymnastics in the shower led me to conclude that (?:10|[1-9A-F]) was better, but I then forgot about it and did not do any testing to see how common it was for Unicode code point values to start with 10. I noticed you used (?:10|[[:xdigit:]]) but don’t know if that was based on testing, experience, or habit.

Coises

@Alan-Kilborn said in Search for character classes but not replace them:

Meanwhile I’m getting discouraged that, with the tools available to me, I’m not going to be able to do the task I set out to do. I haven’t provided details on that, but it should be clear by now that I want an expression to match any single UTF-8 “character”. Maybe I’m being silly because perhaps that isn’t a solid concept. :-(

It’s not unreasonable. The concept of a Unicode code point is well-defined, and the UTF-8 representation of a code point is well-defined.

The “character” problem arises with combining marks and variant forms. If you assume normalization to composed form, most characters are single code points. I suspect you would be content with an expression which matched any character represented by the UTF-8 sequence for a single Unicode code point. (That includes all the ones in your example.)

I’m still trying to make sense of what happens. I found an error in my Columns++ code which prevents me (until I publish a fix) from demonstrating this, but what you want is theoretically possible (with the caveat above about code points vs characters) with the expression [\x{d800}-\x{dbff}]?[^\x{d800}-\x{dbff}]. If you try that by itself, it looks like it works in Notepad++; but if you try putting anything else to match in front of it, it fails. The match with a low surrogate never happens, while the high surrogate match visibly selects the whole character, but internally matches only the second half of the surrogate pair.
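
To show what that expression is meant to do on a well-behaved engine, here is a sketch that runs it over a sequence of UTF-16 code units. It uses Python's `re`, not Boost, and represents each 16-bit unit as one Python character; the helper name `utf16_units_as_str` is made up for this example. It says nothing about why the expression misbehaves in Notepad++.

```python
import re
import struct


def utf16_units_as_str(text):
    """Return a string whose characters are the UTF-16 code units of text."""
    raw = text.encode("utf-16-le")
    units = struct.unpack("<%dH" % (len(raw) // 2), raw)
    return "".join(map(chr, units))


# Optional high surrogate, then any unit that is not a high surrogate.
ONE_CODE_POINT = re.compile(r"[\ud800-\udbff]?[^\ud800-\udbff]")

units = utf16_units_as_str("\tA\U0001F335\t")     # tab, A, cactus, tab
matches = ONE_CODE_POINT.findall(units)
print([len(m) for m in matches])                  # [1, 1, 2, 1]
```

Each match covers exactly one code point: one unit for BMP characters and two units for the cactus.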

Coises

@Alan-Kilborn said in Search for character classes but not replace them:

I haven’t provided details on that, but it should be clear by now that I want an expression to match any single UTF-8 “character”. Maybe I’m being silly because perhaps that isn’t a solid concept. :-(

By the way… based on something I found while working on finding a better way to do this… I think that you’ll find the regex search in PythonScript doesn’t have this problem. I believe a period there will just match one character of any sort. (I haven’t tried it because I don’t know Python… I base my statement on the support code, which has given me an idea for how to make similar improvements to the search in Columns++.) I think you’ll also be able to use actual Unicode values above 0xffff rather than needing to split them into surrogates.

Alan Kilborn

@Coises said:

…given me an idea for how to make similar improvements to the search in Columns++.

I believe that you are talking about what you said HERE.

I think that you’ll find the regex search in PythonScript doesn’t have this problem. I believe a period there will just match one character of any sort

It appears to be true; see code below.

I think you’ll also be able to use actual Unicode values above 0xffff rather than needing to split them into surrogates.

This does NOT appear to be true; see code below.

The aforementioned code (note that it is Python3):

from Npp import *
# 💙<-- we'll be searching for the blue heart character
print('using character:', editor.findText(FINDOPTION.REGEXP, 0, editor.getLength(), '💙'))
print('using surrogate pairs:', editor.findText(FINDOPTION.REGEXP, 0, editor.getLength(), r'\x{D83D}\x{DC99}'))
print('using codepoint above FFFF:', editor.findText(FINDOPTION.REGEXP, 0, editor.getLength(), r'\x{1F499}'))
print('using .:', editor.findText(FINDOPTION.REGEXP, 0, editor.getLength(), r'.(?=<--)'))

and results when running it:

using character: (21, 25)
using surrogate pairs: (21, 25)
using codepoint above FFFF: (-2, 7)
using .: (21, 25)

The -2 returned for the first position of the range match is an indication of “invalid regular expression”. The other results, (21,25), are correct for the position range of the character being searched for – it’s a character with 4-byte encoding.
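
The (21, 25) range can be checked by hand, since the positions are byte offsets into the UTF-8 document. The sketch below assumes the script is the whole document, starts at offset 0, and uses Windows (CRLF) line endings; those assumptions are not stated in the post but are consistent with the reported numbers.

```python
# First two lines of the script, as they would sit in the editor buffer
# (CRLF line ending assumed).
script = "from Npp import *\r\n# \U0001F499<-- we'll be searching"
data = script.encode("utf-8")

heart = "\U0001F499".encode("utf-8")   # blue heart, 4 bytes in UTF-8
start = data.find(heart)
end = start + len(heart)

print(len(heart))      # 4
print((start, end))    # (21, 25)
```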

Coises

@Alan-Kilborn @Ekopalypse @guy038 @mkupper

If anyone is curious, an experimental version of Columns++ in which regular expressions process each Unicode code point as a single regex character is available here:

https://github.com/Coises/ColumnsPlusPlus/releases/tag/v1.1.5.1-Experimental

It appears to be working reasonably well, but there hasn’t been a lot of testing yet. Feedback from anyone who cares to try it would be much appreciated.

guy038

Hello, @coises and All,

This morning, from your post above mine, I downloaded your experimental release of Columns++ ( ColumnsPlusPlus-1.1.5.1-Experimental-x64.zip ) and installed your plugin on my portable N++ v8.7.6. I did not even need your Quick Installer !

So I decided to test this new version of our regex engine and to do some tests against my Total_Chars.txt file that you may download from my Drive Account :

https://drive.google.com/file/d/1kYtbIGPRLdypY7hNMI-vAJXoE7ilRMOC/view?usp=sharing

Here is an updated version of the contents of the Total_Chars.txt file, as I corrected two typo errors and I also split the range 0080 - 07FF in two ranges 0080 - 00FF and 0100 - 07FF in order to easily highlight the [[:unicode:]] range of chars

    •--------------------•-------------------•------------•---------------------------•----------------•-------------------•
    |       Range        |    Description    |   Status   |      Number of Chars      | UTF-8 Encoding |  Number of Bytes  |
    •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
    |    0000  -   007F  |  PLANE 0 - BMP    |  Included  |             |        128  |    1 Byte      |              128  |
    •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
    |    0080  -   00FF  |  PLANE 0 - BMP    |  Included  |             |    +   128  |                |              256  |
    |                    |                   |            |             |             |    2 Bytes     |                   |-------
    |    0100  -   07FF  |  PLANE 0 - BMP    |  Included  |             |    + 1,792  |                |            3,584  |      \
    •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•      |
    |    0800  -   D7FF  |  PLANE 0 - BMP    |  Included  |             |   + 53,248  |                |          159,744  |      |
    |                    |                   |            |             |             |                |                   |      |
    |    D800  -   DFFF  |  SURROGATES zone  |  EXCLUDED  |    - 2,048  |             |                |                   |      |
    |                    |                   |            |             |             |                |                   |      |
    |    E000  -   F8FF  |  PLANE 0 - PUA    |  Included  |             |    + 6,400  |                |           19,200  |      |
    |                    |                   |            |             |             |                |                   |      |
    |    F900  -   FDCF  |  PLANE 0 - BMP    |  Included  |             |    + 1,232  |    3 Bytes     |            3,696  |      |
    |                    |                   |            |             |             |                |                   |      |
    |    FDD0  -   FDEF  |  NON-characters   |  EXCLUDED  |       - 32  |             |                |                   |      |
    |                    |                   |            |             |             |                |                   |      |
    |    FDF0  -   FFFD  |  PLANE 0 - BMP    |  Included  |             |      + 526  |                |            1,578  |      |
    |                    |                   |            |             |             |                |                   |      |==>  [[:unicode:]]
    |    FFFE  -   FFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |      |
    •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•      |
    |                       Plane 0 - BMP    | SUB-Totals |    - 2,082  |   + 63,454  |                |          188,186  |      |
    •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•      |
    |   10000  -  1FFFD  |  PLANE 1 - SMP    |  Included  |             |   + 65,534  |                |          262,136  |      |
    |                    |                   |            |             |             |                |                   |      |
    |   1FFFE  -  1FFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |      |
    •--------------------•-------------------•------------•-------------•-------------•                •-------------------•      |
    |   20000  -  2FFFD  |  PLANE 2 - SIP    |  Included  |             |   + 65,534  |                |          262,136  |      |
    |                    |                   |            |             |             |    4 Bytes     |                   |      |
    |   2FFFE  -  2FFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |      |
    •--------------------•-------------------•------------•-------------•-------------•                •-------------------•      |
    |   30000  -  3FFFD  |  PLANE 3 - TIP    |  Included  |             |   + 65,534  |                |          262,136  |      |
    |                    |                   |            |             |             |                |                   |      |
    |   3FFFE  -  3FFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |      |
    •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•      |
    |   40000  -  DFFFF  |  PLANES 4 to 13   |  NOT USED  |  - 655,360  |             |    4 Bytes     |                   |      |
    •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•      |
    |   E0000  -  EFFFD  |  PLANE 14 - SPP   |  Included  |             |   + 65,534  |                |          262,136  |      /
    |                    |                   |            |             |             |                |                   |
    |   EFFFE  -  EFFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |
    •--------------------•-------------------•------------•-------------•-------------•                •-------------------•
    |   F0000  -  FFFFD  |  PLANE 15 - SPUA  |  NOT USED  |   - 65,534  |             |                |                   |
    |                    |                   |            |             |             |                |                   |
    |   FFFFE  -  FFFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |
    •--------------------•-------------------•------------•-------------•-------------•    4 Bytes     •-------------------•
    |  100000  - 10FFFD  |  PLANE 16 - SPUA  |  NOT USED  |   - 65,534  |             |                |                   |
    |                    |                   |            |             |             |                |                   |
    |  10FFFE  - 10FFFF  |  NON-characters   |  EXCLUDED  |        - 2  |             |                |                   |
    •--------------------•-------------------•------------•-------------•-------------•----------------•-------------------•
    |                                       GRAND Totals  |  - 788,522  |  + 325,590  |                |        1,236,730  |
    |                                                     |             |             |                |                   |
    |                              Byte Order Mark - BOM  |             |             |                |                3  |
    •-----------------------------------------------------•-------------•-------------•                •-------------------•
    |                                                     |  1,114,112 Unicode chars  |                |  Size  1,236,733  |
    •-----------------------------------------------------•---------------------------•----------------•-------------------•
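
The totals in the table can be re-derived from the "Included" ranges alone. The sketch below does that arithmetic; the names `INCLUDED` and `utf8_len` are made up for this example, and the ranges are copied from the table.

```python
# "Included" ranges from the table (inclusive bounds).
INCLUDED = [
    (0x0000, 0x007F), (0x0080, 0x00FF), (0x0100, 0x07FF), (0x0800, 0xD7FF),
    (0xE000, 0xF8FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0x1FFFD), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
    (0xE0000, 0xEFFFD),
]


def utf8_len(cp):
    """Number of bytes UTF-8 uses for code point cp."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


# Every range above lies within a single UTF-8 length class, so the length
# of its first code point applies to the whole range.
chars = sum(hi - lo + 1 for lo, hi in INCLUDED)
size = sum((hi - lo + 1) * utf8_len(lo) for lo, hi in INCLUDED)
excluded = 0x110000 - chars

print(chars)       # 325590
print(excluded)    # 788522
print(size + 3)    # 1236733 (with the 3-byte BOM)
```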

And I’m pleased to tell you that, globally, this UTF-32 version works nicely ! Against my Total_Chars.txt, I got these results :

     [\x{0000}-\x{007F}]   =>      128 chars ( OK )
     [\x{0080}-\x{00FF}]   =>      128 chars ( OK )
     [\x{0100}-\x{07FF}]   =>    1,792 chars ( OK )
     [\x{0800}-\x{D7FF}]   =>   53,248 chars ( OK )
     [\x{E000}-\x{F8FF}]   =>    6,400 chars ( OK )
     [\x{F900}-\x{FDCF}]   =>    1,232 chars ( OK )
     [\x{FDF0}-\x{FFFD}]   =>      526 chars ( OK )

    [\x{10000}-\x{1FFFD}]  =>   65,534 chars ( OK )
    [\x{20000}-\x{2FFFD}]  =>   65,534 chars ( OK )
    [\x{30000}-\x{3FFFD}]  =>   65,534 chars ( OK )
    [\x{E0000}-\x{EFFFD}]  =>   65,534 chars ( OK )

     [\x{0000}-\x{007F}]   =>      128 chars ( OK ) coded with 1 byte
     [\x{0080}-\x{07FF}]   =>    1,920 chars ( OK ) coded with 2 bytes
     [\x{0800}-\x{FFFD}]   =>   61,406 chars ( OK ) coded with 3 bytes
    [\x{10000}-\x{EFFFD}]  =>  262,136 chars ( OK ) coded with 4 bytes

     [\x{0000}-\x{EFFFD}]  =>  325,590 chars ( OK ) Total of characters
    
    [\x{0100}-\x{EFFFD}]   =>  325,334 chars ( OK ) Total chars OVER \x{00FF}
    
    [[:unicode:]]          =>  323,286 chars ( KO ) Should be 325,334 chars

Regarding this error, note that 323,286 + 2,048 does give the right value 325,334 !

The 2,048 value seems linked to the Surrogate pairs mechanism. However, my Total_Chars.txt file does NOT contain any Surrogate character at all. So what !?

Note also that the regex (?s). returns, as expected, 325,590 chars and that the simple . regex returns 325,572 chars. The difference concerns the 18 characters below, which are treated as line separators :

LF 000A LINE FEED
000C FORM FEED
CR 000D CARRIAGE RETURN

0085 NEXT LINE
  2028 LINE SEPARATOR
  2029 PARAGRAPH SEPARATOR

𐂅 10085 LINEAR B IDEOGRAM B105M STALLION
𒀨 12028 CUNEIFORM SIGN AL TIMES USH
𒀩 12029 CUNEIFORM SIGN ALAN

𠂅 20085 [CJK Unified Ideographs Extension B]
𢀨 22028 [CJK Unified Ideographs Extension B]
𢀩 22029 [CJK Unified Ideographs Extension B]

𰂅 30085 [CJK Unified Ideographs Extension G]
𲀨 32028 [CJK Unified Ideographs Extension H]
𲀩 32029 [CJK Unified Ideographs Extension H]

󠂅 E0085 UNASSIGNED code-point
󢀨 E2028 UNASSIGNED code-point
󢀩 E2029 UNASSIGNED code-point

Note that, normally, ONLY the first six chars should be seen as line separators. So, it’s quite funny that the counterparts of \x{0085}, \x{2028} and \x{2029} in the other planes included in my file are also treated as line separators !
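
One rule that reproduces exactly these 18 characters is: compare \n, \f and \r as full values, but compare 0085, 2028 and 2029 using only the low 16 bits of the code point. The sketch below applies that rule to the planes present in the file. The rule is an inference from the list above, not a quotation of the Boost source, and the names are made up for this example.

```python
FILE_PLANES = (0, 1, 2, 3, 14)   # planes with characters in Total_Chars.txt


def looks_like_separator(cp):
    """Hypothesized check: exact match for LF/FF/CR, low 16 bits for the rest."""
    if cp in (0x0A, 0x0C, 0x0D):
        return True
    return (cp & 0xFFFF) in (0x0085, 0x2028, 0x2029)


hits = [
    cp
    for plane in FILE_PLANES
    for cp in range(plane << 16, (plane << 16) + 0x10000)
    if looks_like_separator(cp)
]

print(len(hits))                      # 18
print([f"{cp:04X}" for cp in hits])
```

Six hits fall in plane 0 and three in each of planes 1, 2, 3 and 14, matching the list in the post.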

I also ran the test regarding the NUL character, described in this post :

https://community.notepad-plus-plus.org/post/99736

FIND ABC\x00XYZ

REPLACE \x0--$0--\x{000}

Which works and gives the correct result !

I also tried some expressions with look-aheads and look-behinds, containing overlapping zones !

For instance, against this text aaaabaaababbbaabbabb, pasted in a new tab, with a final line-break, all the regexes, below, give the correct number of matches :

ba*(?=a)   =>  4 matches
ba*(?!a)   =>  9 matches
ba*(?=b)   =>  8 matches
ba*(?!b)   =>  5 matches

(?<=a)ba*  =>  5 matches
(?<!b)ba*  =>  5 matches

(?<=b)ba*  =>  4 matches
(?<!a)ba*  =>  4 matches
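
These counts can be cross-checked with any backtracking engine that supports look-arounds. The sketch below uses Python's `re`, which is a different engine from Boost, so it only confirms that the expected numbers are right, not that Columns++ produces them.

```python
import re

TEXT = "aaaabaaababbbaabbabb\n"   # the test line plus its final line-break

EXPECTED = {
    r"ba*(?=a)": 4,
    r"ba*(?!a)": 9,
    r"ba*(?=b)": 8,
    r"ba*(?!b)": 5,
    r"(?<=a)ba*": 5,
    r"(?<!b)ba*": 5,
    r"(?<=b)ba*": 4,
    r"(?<!a)ba*": 4,
}

# Count non-overlapping matches, scanning left to right.
counts = {pattern: len(re.findall(pattern, TEXT)) for pattern in EXPECTED}

for pattern, n in counts.items():
    print(f"{pattern:<12} => {n} matches")
```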

However, I could not test this irritating problem :

Backward regex searches, for NON ANSI files, stop as soon as they match a character with a code-point over \x{007F}

Just because you do not allow backward searches when choosing the Regular expression search mode ! Maybe you could add it among all the Columns++ options ?

In the end, I can say that this experimental release is a valuable improvement and should be adopted in your Columns++ plugin, as well as in the N++ Boost regex engine ;-))

@coises, Many thanks for your efforts, so far !

Best Regards,

guy038

P.S. :

Just a small drawback :

For the four regexes [\x{10000}-\x{EFFFD}], [\x{0000}-\x{EFFFD}], [\x{0100}-\x{EFFFD}] and [[:unicode:]] which match more than 200,000 occurrences, the Select All option of the Count button leads to the message the program does not respond :-((

Coises

@guy038 said in Search for character classes but not replace them:

So I decided to test this new version of our regex engine and to do some tests against my Total_Chars.txt file

Thank you!

The tests you’ve run and the file will be very helpful in finding and fixing errors. I really appreciate your help.

Coises

@guy038 said in Search for character classes but not replace them:

Note also that the regex (?s). returns, as expected, 325,590 chars and that the simple . regex returns 325,572 chars. The difference concerns the 18 characters below, which are treated as line separators :

LF 000A LINE FEED
000C FORM FEED
CR 000D CARRIAGE RETURN

0085 NEXT LINE
  2028 LINE SEPARATOR
  2029 PARAGRAPH SEPARATOR

𐂅 10085 LINEAR B IDEOGRAM B105M STALLION
𒀨 12028 CUNEIFORM SIGN AL TIMES USH
𒀩 12029 CUNEIFORM SIGN ALAN

𠂅 20085 [CJK Unified Ideographs Extension B]
𢀨 22028 [CJK Unified Ideographs Extension B]
𢀩 22029 [CJK Unified Ideographs Extension B]

𰂅 30085 [CJK Unified Ideographs Extension G]
𲀨 32028 [CJK Unified Ideographs Extension H]
𲀩 32029 [CJK Unified Ideographs Extension H]

󠂅 E0085 UNASSIGNED code-point
󢀨 E2028 UNASSIGNED code-point
󢀩 E2029 UNASSIGNED code-point

Note that, normally, ONLY the first six chars should be seen as line separators. So, it’s quite funny that the counterparts of \x{0085}, \x{2028} and \x{2029} in the other planes included in my file are also treated as line separators !

I found the cause of this (a default template in boost::regex that required specialization for 32-bit characters) and “fixed” it in my local copy… but that leads me to a question.

The boost::regex documentation says the dot (without (?s)) matches anything but a newline.

The Notepad++ documentation says it matches anything but \r or \n.

The code (allowing for its being written for 16-bit characters) says anything but the first six you listed (line feed, form feed, carriage return, next line, line separator, paragraph separator).

The web site Regular-Expressions.info notes that there is great variety in how this is handled, observing that Boost is unusual in including the Form Feed. Examining your Total_Chars.txt file, I notice that . matches the vertical tab but skips the form feed, though both are displayed by Notepad++ as control characters and do not create a line break. Apparently Scintilla can, under some conditions, recognize “Unicode line endings,” but this is dependent on the active lexer; the documentation does not say what “Unicode line endings” are and in any case, Notepad++ never enables them.

So, since I have to override the Boost default anyway… what should be excluded by . without (?s)?

My thought is that it should be the same things Scintilla recognizes as line breaks and the Notepad++ documentation states: just \n and \r. The character classes \s and \v and their complements are independent of this and match as expected (with \v matching the first six you listed plus the vertical tab).

Coises

@guy038 said in Search for character classes but not replace them:

Regarding this error, note that 323,286 + 2,048 does give the right value 325,334 !

The 2,048 value seems linked to the Surrogate pairs mechanism. However, my Total_Chars.txt file does NOT contain any Surrogate character at all. So what !?

It turned out to be linked in that I made a typo in the code that was supposed to recognize the surrogate range, which caused it to treat U+D000 - U+D7FF as invalid Unicode characters (which they are not). Thank you for catching that! It will be fixed in the next experimental version.
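
The arithmetic behind that explanation can be sketched as follows. The exact form of the typo is not given in the thread, so `is_surrogate_typo` below is a hypothetical stand-in that simply starts the range at D000 instead of D800.

```python
def is_surrogate(cp):
    """Correct surrogate test: U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_surrogate_typo(cp):
    """Hypothetical mistyped test that starts the range at U+D000."""
    return 0xD000 <= cp <= 0xDFFF


wrongly_rejected = [
    cp for cp in range(0x110000)
    if is_surrogate_typo(cp) and not is_surrogate(cp)
]

print(len(wrongly_rejected))             # 2048
print(325334 - len(wrongly_rejected))    # 323286, the count that was reported
```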

Note also that the regex (?s). returns, as expected, 325,590 chars and that the simple . regex returns 325,572 chars. The difference concerns the 18 characters below, which are treated as line separators

As described in more detail in my previous post, this turned out to be easy to fix, but I’m now wondering what is the best way to fix it. Your thoughts on that would be very welcome.

However, I could not test this irritating problem :

Backward regex searches, for NON ANSI files, stop as soon as they match a character with a code-point over \x{007F}

Just because you do not allow backward searches when choosing the Regular expression search mode ! Maybe you could add it among all the Columns++ options ?

I’m unlikely to do that. The search in Notepad++ is “managed” by Scintilla — Scintilla is built to allow integration with a search engine like Boost, while Scintilla still controls the process. The search in Columns++ doesn’t work that way: I use the Boost engine directly (mainly because I needed access to some information Scintilla doesn’t expose to implement my numeric calculation extensions). There’s nothing I see in boost::regex that supports backward searching; I’d have to implement that from the ground up. I’m not even sure what it’s supposed to mean: if you have a variable-length search, how much do you “back up” after a match? Just one character? Do you refuse to match until the new match doesn’t overlap the old one? What if there’s a match that doesn’t overlap, but it isn’t the first/best match? (You can’t just change the end of the searchable region because the expression might have look-aheads which should be able to overlap.)
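
To make the ambiguity concrete, here is a sketch of two plausible definitions of "the previous match", written with Python's `re` rather than Boost. Both function names are made up for this example, and neither is how Notepad++ or Scintilla defines a backward search.

```python
import re


def last_forward_match(pattern, text, caret):
    """Scan forward from 0; keep the last match that ends at or before caret."""
    found = None
    for m in re.finditer(pattern, text):
        if m.end() <= caret:
            found = m.span()
    return found


def nearest_start_match(pattern, text, caret):
    """Try start positions from caret-1 down to 0; return the first that matches.

    Passing caret as endpos hides the text after the caret from the engine,
    so look-aheads cannot see past it (the problem noted in the post).
    """
    rx = re.compile(pattern)
    for start in range(caret - 1, -1, -1):
        m = rx.match(text, start, caret)
        if m:
            return m.span()
    return None


print(last_forward_match(r"a+", "aaab", 3))    # (0, 3)
print(nearest_start_match(r"a+", "aaab", 3))   # (2, 3)
```

The same pattern, text and caret position give two different answers, so any implementation would have to pick one and document it.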

For the four regexes [\x{10000}-\x{EFFFD}], [\x{0000}-\x{EFFFD}], [\x{0100}-\x{EFFFD}] and [[:unicode:]] which match more than 200,000 occurrences, the Select All option of the Count button leads to the message the program does not respond :-((

I have to think about how to do this, but indeed there should be a way to interrupt that process. (It does complete eventually.) The reason it takes so long is that Select All makes a multiple selection with one selection for each match: so you are asking Scintilla to make a multiple selection with over 200,000 component selections. Either I need a sanity check on the number of matches or a progress dialog with a Cancel button that gets raised after more than a couple of seconds have passed.
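
The "sanity check on the number of matches" idea could look roughly like this. Columns++ is written in C++, so this Python sketch only shows the shape of the check; the function name and the limit are hypothetical.

```python
import re
from itertools import islice


def spans_up_to(pattern, text, limit):
    """Collect at most `limit` match spans; report whether more were available."""
    spans = (m.span() for m in re.finditer(pattern, text))
    taken = list(islice(spans, limit + 1))   # look one past the limit
    truncated = len(taken) > limit
    return taken[:limit], truncated


spans, truncated = spans_up_to(r"a", "aaaaa", 3)
print(spans)        # [(0, 1), (1, 2), (2, 3)]
print(truncated)    # True
```

Because the matches are produced lazily, the search stops as soon as the limit is exceeded, and the caller can then decide whether to warn the user instead of building a huge multiple selection.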

Alan Kilborn

@guy038 said :

https://drive.google.com/file/d/1kYtbIGPRLdypY7hNMI-vAJXoE7ilRMOC/view?usp=sharing

Here is an updated version of the contents of the Total_Chars.txt file, as I corrected…

When one downloads from that link, there is nothing updated, as the latest file in the archive is from 2023.

guy038

Hi, @alan-kilborn and All,

Alan, maybe my message was a bit ambiguous !

The total message said explicitly :

Here is an updated version of the contents of the Total_Chars.txt file, as I corrected two typo errors and I also split the range 0080 - 07FF in two ranges 0080 - 00FF and 0100 - 07FF in order to easily highlight the [[:unicode:]] range of chars

Thus, the contents of the Total_chars.txt file have NOT changed since the date I published it on my Drive account.

It's only the description of the zones of this file which has been updated!!

The part :

    |    0080  -   0FFF  |  PLANE 0 - BMP    |  Included  |             |    + 1,920  |    2 Bytes     |            3,840  |

Has been changed into :

    |    0080  -   00FF  |  PLANE 0 - BMP    |  Included  |             |    +   128  |                |              256  |
    |                    |                   |            |             |             |    2 Bytes     |                   |
    |    0100  -   07FF  |  PLANE 0 - BMP    |  Included  |             |    + 1,792  |                |            3,584  |

And this line :

    |    F900  -   FDFC  |  PLANE 0 - BMP    |  Included  |             |    + 1,232  |    3 Bytes     |            3,696  |

Has been changed into :

    |    F900  -   FDCF  |  PLANE 0 - BMP    |  Included  |             |    + 1,232  |    3 Bytes     |            3,696  |

So, you can see the two typos:

The end of range 0FFF, which should be 07FF
The end of range FDFC, which should be FDCF

I saw these errors when I calculated the number of chars of these zones, in @coises’s post !
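
For anyone who wants to re-check these figures, here is a short Python sketch (the helper name `zone_stats` is made up for this example). It computes, for a range of code points, the number of characters and the total number of UTF-8 bytes, and it reproduces the corrected rows above.

```python
def zone_stats(first, last):
    """Return (number of code points, total UTF-8 bytes) for first..last inclusive."""
    count = last - first + 1
    total_bytes = sum(len(chr(cp).encode("utf-8")) for cp in range(first, last + 1))
    return count, total_bytes

print(zone_stats(0x0080, 0x00FF))  # (128, 256)
print(zone_stats(0x0100, 0x07FF))  # (1792, 3584)
print(zone_stats(0xF900, 0xFDCF))  # (1232, 3696)

# The mistyped range ends do not give the counts shown in the table:
print(zone_stats(0x0080, 0x0FFF)[0])  # 3968, not 1,920
print(zone_stats(0xF900, 0xFDFC)[0])  # 1277, not 1,232
```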

Best Regards,

guy038

Alan Kilborn

@guy038 said :

the contents of the Total_chars.txt file have NOT changed since the date I published it on my Drive account… It's only the description of the zones of this file which has been updated!!

So… what you’re saying is that if one wants to use the file (and appreciate it fully) they really have to piece together all of its documentation, e.g. the “description of the zones”, from comments you’ve made about it in this and other Community postings that refer to the file?

OK…but it’s rather traditional when you publish something, to have everything all in one place.

Full admission: I haven’t really dug into the file yet, but I plan to…
EDIT: OK, I’ve looked a little bit now, and the fact that Total_chars.txt only contains 3 lines, one of which is quite long, had me wishing for more of a “separated” approach, like the author of these data files provided: https://github.com/bits/UTF-8-Unicode-Test-Documents

Hopefully this discussion is still considered on-topic for N++; it really needs to be an editor that can handle Unicode well.

guy038

Hi, @coises and All,

You said,

My thought is that it should be the same things Scintilla recognizes as line breaks and the Notepad++ documentation states: just \n and \r.

I think that this reasoning is the right one! Moreover, note that we use the same reasoning when we want to find all chars but a specific one within each single line: we use the regex [^c\r\n], where c is the character we want to exclude.

Thus, against my Total_Chars.txt file, the regex (?s). should return 325,590 occurrences and the regex (?-s). should return 325,588 occurrences
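
A difference of exactly 2 between the two counts is what one would expect if the three lines of the file are separated by two single-character line breaks (that is an inference from the numbers, not something checked against the file). The small Python sketch below shows the same effect on a made-up three-line sample. One caveat: in Python's `re`, a dot without DOTALL excludes only \n, so the class [^\r\n] is used here to model "any character except \r and \n".

```python
import re

# Hypothetical three-line sample: 7 ordinary characters and 2 line breaks.
sample = "ab\u00e9\ncd\U0001F335\ne"

all_chars = re.findall(r"(?s).", sample)       # every character, line breaks included
no_breaks = re.findall(r"[^\r\n]", sample)     # every character except \r and \n
not_c     = re.findall(r"[^c\r\n]", sample)    # same, but also excluding 'c'

print(len(all_chars), len(no_breaks), len(not_c))  # 9 7 6
```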

Now, regarding my question :

Just because you do not allow backward searches when choosing the Regular expression search mode! Maybe you could add it to the Columns++ options?

I do understand all the reasons why you are not inclined to do so! However, as someone who regularly uses the regexBackward4PowerUser="yes" option, in the FindHistory node of the config.xml file, I can assure you that many regexes, though not all, can be processed in the backward direction. Unfortunately, with our present Boost regex engine, there is a limitation that you can verify yourself:

Backward regex searches, in NON-ANSI files, stop as soon as they match a character with a code point over \x{007F}

I also tested the search of invalid UTF-8 bytes. To do so :

Open a new N++ tab. ( I assume that its current encoding is UTF-8 ! )
Run the Encoding > Convert to ANSI menu option
Paste the text below, in this new ANSI tab


ABC íŸ¿ XYZ   \x{D7FF}  ED 9F BF  LAST  valid char BEFORE Surrogates range
ABC í € XYZ   \x{D800}  ED A0 80  FIRST SURROGATE char
ABC í¿¿ XYZ   \x{DFFF}  ED BF BF  LAST  SURROGATE char
ABC î€€ XYZ   \x{E000}  EE 80 80  First valid char AFTER  Surrogates range

ABC € XYZ
ABC  XYZ
ABC ‚ XYZ
ABC ƒ XYZ
ABC „ XYZ
ABC … XYZ
ABC † XYZ
ABC ‡ XYZ
ABC ˆ XYZ
ABC ‰ XYZ
ABC Š XYZ
ABC ‹ XYZ
ABC Œ XYZ
ABC  XYZ
ABC Ž XYZ
ABC  XYZ
ABC  XYZ
ABC ‘ XYZ
ABC ’ XYZ
ABC “ XYZ
ABC ” XYZ
ABC • XYZ
ABC – XYZ
ABC — XYZ
ABC ˜ XYZ
ABC ™ XYZ
ABC š XYZ
ABC › XYZ
ABC œ XYZ
ABC  XYZ
ABC ž XYZ
ABC Ÿ XYZ
ABC   XYZ
ABC ¡ XYZ
ABC ¢ XYZ
ABC £ XYZ
ABC ¤ XYZ
ABC ¥ XYZ
ABC ¦ XYZ
ABC § XYZ
ABC ¨ XYZ
ABC © XYZ
ABC ª XYZ
ABC « XYZ
ABC ¬ XYZ
ABC  XYZ
ABC ® XYZ
ABC ¯ XYZ
ABC ° XYZ
ABC ± XYZ
ABC ² XYZ
ABC ³ XYZ
ABC ´ XYZ
ABC µ XYZ
ABC ¶ XYZ
ABC · XYZ
ABC ¸ XYZ
ABC ¹ XYZ
ABC º XYZ
ABC » XYZ
ABC ¼ XYZ
ABC ½ XYZ
ABC ¾ XYZ
ABC ¿ XYZ
ABC À XYZ
ABC Á XYZ
ABC Â XYZ
ABC Ã XYZ
ABC Ä XYZ
ABC Å XYZ
ABC Æ XYZ
ABC Ç XYZ
ABC È XYZ
ABC É XYZ
ABC Ê XYZ
ABC Ë XYZ
ABC Ì XYZ
ABC Í XYZ
ABC Î XYZ
ABC Ï XYZ
ABC Ð XYZ
ABC Ñ XYZ
ABC Ò XYZ
ABC Ó XYZ
ABC Ô XYZ
ABC Õ XYZ
ABC Ö XYZ
ABC × XYZ
ABC Ø XYZ
ABC Ù XYZ
ABC Ú XYZ
ABC Û XYZ
ABC Ü XYZ
ABC Ý XYZ
ABC Þ XYZ
ABC ß XYZ
ABC à XYZ
ABC á XYZ
ABC â XYZ
ABC ã XYZ
ABC ä XYZ
ABC å XYZ
ABC æ XYZ
ABC ç XYZ
ABC è XYZ
ABC é XYZ
ABC ê XYZ
ABC ë XYZ
ABC ì XYZ
ABC í XYZ
ABC î XYZ
ABC ï XYZ
ABC ð XYZ
ABC ñ XYZ
ABC ò XYZ
ABC ó XYZ
ABC ô XYZ
ABC õ XYZ
ABC ö XYZ
ABC ÷ XYZ
ABC ø XYZ
ABC ù XYZ
ABC ú XYZ
ABC û XYZ
ABC ü XYZ
ABC ý XYZ
ABC þ XYZ
ABC ÿ XYZ

Now, choose the Encoding > UTF-8 menu option, so that all the bytes of this ANSI file are re-interpreted as if they were UTF-8 chars

=> You should see, between the strings ABC and XYZ :

The last VALID UTF-8 char ( ED 9F BF ) before the SURROGATE range
The 3-byte sequence of the first SURROGATE char, which is an INVALID sequence
The 3-byte sequence of the last SURROGATE char, which is an INVALID sequence
The first VALID UTF-8 char ( EE 80 80 ) after the SURROGATE range

Then, a list of the 128 INVALID UTF-8 characters, as the UTF-8 encoding does NOT allow any 1-byte character OVER \x{007F} !
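
The valid / invalid status of these byte sequences can be cross-checked outside of Notepad++. The sketch below uses Python's strict UTF-8 decoder, which is independent of Scintilla and Boost; `is_valid_utf8` is just a helper name chosen for this example.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """True if raw decodes as strict UTF-8 (encoded surrogates are rejected)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

checks = {
    "ED 9F BF": is_valid_utf8(b"\xed\x9f\xbf"),  # U+D7FF, last char before surrogates
    "ED A0 80": is_valid_utf8(b"\xed\xa0\x80"),  # would be U+D800
    "ED BF BF": is_valid_utf8(b"\xed\xbf\xbf"),  # would be U+DFFF
    "EE 80 80": is_valid_utf8(b"\xee\x80\x80"),  # U+E000, first char after surrogates
}
print(checks)

# Every byte from 80 to FF is invalid when it stands alone.
lone_invalid = [b for b in range(0x80, 0x100) if not is_valid_utf8(bytes([b]))]
print(len(lone_invalid))  # 128
```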

Now :

Move the caret to the first empty line
Run the option Plugins > Columns++ > Search...
Enter the range [\x{DC80}-\x{DCFF}] in the Find what: field
Click on the Find First button

=>

The Search region is set to the entire document
The first INVALID byte \xED is selected
Click on the Find Next button => It will select, one after another, all the other INVALID UTF-8 characters of this new tab !
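
The range [\x{DC80}-\x{DCFF}] suggests that each invalid byte XX is presented to the regex as the code point DCXX. That is only inferred from the test above, not from the Columns++ source. Python's `surrogateescape` error handler uses the same convention, so this sketch can serve as an analogy for one of the test lines:

```python
import re

line = b"ABC \xed\xa0\x80 XYZ"  # the "FIRST SURROGATE" bytes from the test text

# Each undecodable byte XX becomes the lone surrogate U+DCXX.
decoded = line.decode("utf-8", errors="surrogateescape")

hits = re.findall("[\udc80-\udcff]", decoded)
print([hex(ord(c)) for c in hits])  # ['0xdced', '0xdca0', '0xdc80']

# The mapping is reversible, so the original bytes are not lost.
assert decoded.encode("utf-8", errors="surrogateescape") == line
```

The first hit corresponds to the byte ED, which agrees with the first selection reported above.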

So, @coises, your new implementation works correctly with regard to the INVALID UTF-8 chars, and I’m looking forward to your second experimental version ;-))

Best Regards,

guy038