Columns++ version 1.2: better Unicode search
-
Hi, @Coises,
Many thanks for your new
Columns++ version 1.2
. So, you just anticipated my last reply, which confirmed, to my mind, that your last experimental release was mature ;-))
I was a bit confused by your last sentence :
the search now starts from the caret position instead of from the beginning or end of the document.
I was initially afraid that it would just, for example, count from the caret position to the end of the file. But I understood, by comparing your previous version and the present one, that the results are identical, as long as no previous selection occurred and the
Auto set
option was checked. It’s just the start of the cycle among the matches which is different !
Now, may I request one useful improvement ? The font used in the two drop-down lists
Find what :
and Replace with
, is visibly a proportional font. To be convinced of this fact, enter the string WWWWWIIIII in the Find what
zone ! To my mind, it would be nice, like within Notepad++, to use a mono-spaced font instead ( maybe as an option ! ).
A second possibility would be to allow the selection from a drop-down list of all the installed fonts ?
A third possibility would be to have an option, in the dialog, to enlarge, temporarily or not, these two zones. I suppose that this last solution would be more difficult to implement !
As for now, I just use the Microsoft magnifier feature (
300 %
) to solve this problem !
Best Regards,
guy038
-
@guy038 said in Columns++ version 1.2: better Unicode search:
It’s just the start of the cycle among the matches which is different !
While testing things, I kept making the mistake of placing the caret just before something I wanted to check, then opening search, clicking Find (not noticing that it said Find First and not Find Next) and having it bounce to the start of the document. I figured if it’s counter-intuitive to me, it’s surely surprising to everyone else. Losing one’s place in a large document seems much more annoying than having to press Ctrl+Home if you want to start from the beginning, so I figured this to be a change that will do more good than harm.
Now, may I request one useful improvement ? The font used in the two drop-down lists
Find what :
and Replace with
, is visibly a proportional font. To be convinced of this fact, enter the string WWWWWIIIII in the Find what
zone ! To my mind, it would be nice, like within Notepad++, to use a mono-spaced font instead ( maybe as an option ! ).
A second possibility would be to allow the selection from a drop-down list of all the installed fonts ?
A third possibility would be to have an option, in the dialog, to enlarge, temporarily or not, these two zones. I suppose that this last solution would be more difficult to implement !
All good ideas. I hadn’t thought about the monospaced font. (I forgot that Notepad++ has that option — I remember that I liked it except that it takes up more space, so I can see less of what I’ve typed without making the dialog obscure even more of the document.)
A thought I’ve had for some time is to have a button that opens a second dialog, or an extended “pane” attached to the search dialog, that’s just for entering a regular expression or a replacement. My “vision” (and it’s only that — I’ve done no coding or even a mock-up yet) is that the expression entry areas would be Scintilla controls which would, at least by default, reflect the font and size used in the document; they could contain multiple lines and possibly have appropriate syntax highlighting. Ideally there would be some kind of a “builder” to help people who are less familiar with regular expressions know what they can enter (escapes, class names, symbolic character names, quantifiers — and those formulas I process in the replacement), and an area where users could save frequently-used expressions.
I’ve also wondered if search should be a dockable panel — so results of a find don’t get hidden behind the dialog, which I find an annoying occurrence. Dockable dialogs are kind of strange, though, and from what I’ve seen (I’m still learning), some of the control one has with an ordinary dialog is lost when it becomes dockable (such as that setting height and width constraints don’t seem to work, even when the dialog is undocked).
Either of those ideas are getting so far from the nominal purposes of Columns++, though, that it seems it would really be time to make a separate plugin. (Yes, @Alan-Kilborn, hoping someday it could be part of the main program. But far less “aggressive” changes have caused consternation when made to Notepad++; at the least, I think anything so dramatic should have a considerable test period to demonstrate its value and stability before I would dream of suggesting it as a replacement for existing functionality.)
-
Hi, @Coises and All,
Luckily, I do not need the Microsoft magnifier in my everyday work on my Windows-10 laptop !! But sometimes, as the size of your search dialog font seems a bit small, it helped me to clearly see which kind of regex I typed in, during the tests of your experimental versions. However, for example, I just use the N++ default zoom to prepare this post !
Note that regular expressions use a lot of chars not easy to distinguish, like the
.
char, the ( and ) chars, the [ and ] chars, the { and } chars
chars, and so on…, which look very thin with the present proportional font !
So, whatever you plan to do in the future regarding my request, it should be better than the present situation. No doubt about it !
Best Regards,
guy038
-
@Coises said:
it seems it would really be time to make a separate plugin
I would go so far as to suggest something that looks and operates like the Notepad++ Find dialog and its tabs.
Then someone using the plugin would not have to learn anything new, and would feel “right at home”.
You know why I suggest this, right? :-)
-
@Alan-Kilborn said in Columns++ version 1.2: better Unicode search:
I would go so far as to suggest something that looks and operates like the Notepad++ Find dialog and its tabs.
Then someone using the plugin would not have to learn anything new, and would feel “right at home”.
You know why I suggest this, right? :-)
I think I do, but to be honest, if and when I take on such a project, non-trivial user interface changes would be the whole point. Given that, I’m not sure I’d want to tie myself to recreating a legacy user interface and using it as an underlying model. Familiarity would be a plus, but I am unlikely to impose it on myself as a constraint.
This is all far enough down the road that someone else might well get to it before I do, anyway. I have at least two other self-assignments that would come first, and that’s just in the realm of computer programming.
-
Hello, @coises and All,
Refer to this FAQ that I’ve just updated with references to your last
Columns++-1.2
release :
Best Regards,
guy038
-
@guy038 said in Columns++ version 1.2: better Unicode search:
Refer to this FAQ that I’ve just updated with references to your last
Columns++-1.2
release :
Thank you for mentioning Columns++. Might I suggest a couple of things?
-
Columns++ version 1.2 is available using Plugins Admin in Notepad++ 8.7.8; there is no need to do the more complicated installation procedure.
-
I fear that your FAQ might make some readers expect that the search in Columns++ is a replacement for Notepad++ search, when it really isn’t. There are many things Notepad++ search can do (finding in all open files, finding and replacing in multiple files, etc.) that Columns++ search does not do and almost certainly never will. Its original, and still primary, reason for existence is to make it possible to find and replace within a rectangular selection — something Notepad++ search cannot do. There is also the extension of using mathematical formulas in replacements. I would recommend perhaps a link to the online help file sections about Search and Regular Expressions to clarify when Columns++ might be useful.
-
It might be unclear that while the progress dialog change applies to all Count, Select and Replace All actions in Columns++ search, the other changes you mentioned apply only to searches in Unicode files. Searches in “ANSI” files work the same as always. (The ability to search in regions based on a rectangular or multiple selection also applies to all searches, and the ability to use formulas in replacements applies to all regular expression searches.)
-
-
Hello, @coises and All,
You said :
Columns++ version 1.2 is available using Plugins Admin in Notepad++ 8.7.8; there is no need to do the more complicated installation procedure.
Well, I updated the regex documentation with the N++ release
v8.7.6
and, at that time, Columns++
did not seem to be in the plugins list !?
You said :
I fear that your FAQ might make some readers expect that the search in Columns++ is a replacement for Notepad++ search, when it really isn’t.
I agree that I did not present your plugin the right way. So, I made some modifications and I hope you’ll agree with the new phrasing !
You said :
… the other changes you mentioned apply only to searches in Unicode files. Searches in “ANSI” files work the same as always.
Well, your assertion is a bit paradoxical, regarding the title of this post ! Indeed, your title says :
Columns++ version 1.2: better Unicode search !!
And anyway, against an
ANSI
file, any search of a UNICODE property triggers an Invalid Regex message
! So, the benefit of this improved version is not so obvious for ANSI
files. However, I did add a mention which clearly says that search and replace are correct with ANSI
files, too.
However, I noticed an odd thing :
-
Write these five characters
,¼½¾,
in a new UTF-8
tab -
Ask Columns++ to select all the punctuation characters with the
[[:punct:]]
regex
=> It correctly finds the two commas only, as the fractions have the UNICODE
\p{Other Number}
property and are not punctuation chars
-
Now, convert this
UTF-8
file to an ANSI
file, with the Encoding > Convert to ANSI
option -
Re-try the
[[:punct:]]
regex against this, from now on, ANSI
file
=> This time, the five characters are selected !?
If you try the
\p{other Number}
regex it returns, as expected, an error message !
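As an independent cross-check of that category claim (outside Notepad++, using Python's standard unicodedata module rather than anything in Columns++):

```python
import unicodedata

# The two commas are Po (Other Punctuation); the vulgar fractions are
# No (Other Number), so a Unicode-aware [[:punct:]] should skip them.
for ch in ',¼½¾':
    print(ch, unicodedata.category(ch))
```

Running this shows `Po` for the comma and `No` for each of the three fractions, which is consistent with the Columns++ result in a UTF-8 file.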
In your documentation, regarding your last sequence
[[.x80.]]–[[.xff.]]
, right before the Search
file section :
At first sight, it’s not a class of characters and it seems to be an invalid regular expression. Surprisingly, it’s not ! Actually, it’s a sequence of three consecutive characters :
-
An invalid
\x80
UTF-8 byte -
An EN DASH character (
\x{2013}
) -
An invalid
\xff
UTF-8 byte
So, @Coises, just modify this regex as
[[.x80.]-[.xff.]]
, with a Hyphen-Minus character which, indeed, finds any invalid UTF-8
character !
BTW, I like your two buttons, at the bottom right of your documentation (
txt- TXT+
), which allow us to zoom in or out. They surely help a lot of people !
Best Regards
guy038
P.S. :
In the first of my three consecutive posts, which ended my test period of your plugin ( https://community.notepad-plus-plus.org/post/100087 ), I wrote :
\p{Ascii}
=(?s)\o
=>128
when applied against my Total_Chars.txt
file ! Now, I understand that the (?s)
modifier does not change anything for the Count results. Indeed, the (?s)
or (?-s)
modifiers are ONLY needed if there is, at least, one.
regex character in the entire regex !
So, if we want to omit the
\r
and\n
in the above regex, we must use the (?![\r\n])\p{Ascii}
or the (?![\r\n])\o
syntax, which correctly return 126
matches
Note that this is only true for a NON
ANSI
file. For an ANSI file :-
The regex
(?![\r\n])\p{Ascii}
is invalid, as explained above. -
The regex
(?![\r\n])\o
does work but returns just one match : the lower-case letter o
!! ( the Match case
option was set )
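That 126 figure is easy to sanity-check with any regex engine; here is a quick sketch using Python's re module, purely as a neutral reference for the expected count:

```python
import re

# All 128 ASCII characters in one string.
ascii_chars = ''.join(map(chr, range(128)))

# ASCII characters that are neither CR nor LF: 128 - 2 = 126.
matches = re.findall(r'(?![\r\n])[\x00-\x7F]', ascii_chars)
print(len(matches))  # 126
```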
-
-
@guy038 said in Columns++ version 1.2: better Unicode search:
Well, I updated the regex documentation with the N++ release
v8.7.6
and, at that time,Columns++
did not seem to be in the plugins list !?
The previous “stable” version was there, but I got the pull request to update to version 1.2 just barely in time to make it into Notepad++ 8.7.8.
I did some modifications and I hope you’ll agree with the new phrasing !
Thank you. I like that. I just didn’t want people to install it and then be disappointed that it’s no help if they want to use one of the many features of Notepad++ search that Columns++ does not attempt to replicate.
However, I noticed an odd thing :
-
Write these five characters
,¼½¾,
in a new UTF-8
tab -
Ask Columns++ to select all the punctuation characters with the
[[:punct:]]
regex
=> It correctly finds the two commas only, as the fractions have the UNICODE
\p{Other Number} property
and are not punctuation chars-
Now, convert this
UTF-8
file to an ANSI
file, with the Encoding > Convert to ANSI
option -
Re-try the
[[:punct:]]
regex against this, from now on, ANSI
file
=> This time, the five characters are selected !?
Yes, that is something I don’t like about my own work: there are now inconsistencies between ANSI and UTF-8, because I changed nothing about ANSI regular expressions. For example,
(?i)\u
still matches all alphabetic characters in ANSI files. (For obscure technical reasons involving C++ template specialization and how Boost::regex is implemented, it may prove to be more difficult to make the corresponding changes to ANSI than it was to make them to Unicode. So far, I haven’t even tried.)
In your documentation, regarding your last sequence
[[.x80.]]–[[.xff.]]
, right before the Search
file section :
At first sight, it’s not a class of characters and it seems to be an invalid regular expression. Surprisingly, it’s not ! Actually, it’s a sequence of three consecutive characters
It looks like I failed to convey what I meant in that entry. What I was trying to say was that you can use [[.xhh.]] as a symbolic character reference to find an invalid byte; so that, for example,
[[.xB2.]]
will find any byte 0xB2 that is part of an invalid UTF-8 sequence. (There is no way to isolate bytes 0xB2 that are parts of valid UTF-8 sequences, though; for that, you’d have to reinterpret as — not convert to — ANSI.) I added those as I was updating the documentation, because I thought it was less confusing than telling people they could use expressions like \x{DCB2}
to find specific invalid bytes. This mirrors how control and invisible characters have symbolic names that match the way Scintilla displays them. -
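For readers curious about the \x{DCB2} convention: it matches the widely used "surrogateescape" scheme, in which an invalid byte 0xhh is carried through decoding as the lone surrogate U+DChh. Python implements the same scheme, which makes a handy illustration (an analogy only, not the plugin's actual code):

```python
# An invalid UTF-8 byte 0xB2 is smuggled through decoding as the
# lone surrogate U+DCB2, per the surrogateescape error handler.
raw = b'\xb2'
text = raw.decode('utf-8', errors='surrogateescape')
print(hex(ord(text)))  # 0xdcb2

# Encoding with the same handler round-trips back to the original byte.
assert text.encode('utf-8', errors='surrogateescape') == raw
```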
-
Hi, @coises,
You said :
It looks like I failed to convey what I meant in that entry
Ah…, now I understand what you meant ! Thus, maybe the two following entries would mean just what you expected :
•-----------------------------•-----------•------------------------------------------------------•
| From [[.x00.]] to [[.xff.]] | [[.x##.]] | The invalid UTF-8 byte [[.x##.]]                     |
| [[.x80.]-[.xff.]]           |           | Any invalid UTF-8 byte                               |
•-----------------------------•-----------•------------------------------------------------------•
Like you, I’m a bit upset about the differences in behavior of your
Columns++
plugin between ANSI
and UNICODE
files. So, I will do additional tests to narrow down where these differences occur ! Like my Total_Chars.txt
UNICODE file, I’ll create an ANSI file containing the 256
characters of the Windows-1252
encoding for this purpose !
https://en.wikipedia.org/wiki/Windows-1252
See you later,
BR
guy038
-
Hello, @coises and All,
I’ve decided to use the same canvas to describe results with an
ANSI
file as I did for results with a UNICODE
file. This description will spread over two posts !
So, I first created this ANSI file, named
Total_ANSI.txt
:
•---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------•
|     Range     |   Description   |   Status   |  COUNT / MARK of ALL chars  |  # Chars  |  ANSI Encoding  |  # Bytes  |
•---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------•
|  0000 - 007F  |  PLANE 0 - BMP  |  Included  |  [\x00-\x7F]                |    128    |                 |    128    |
|               |                 |            |                             |           |     1 Byte      |           |
|  0080 - 00FF  |  PLANE 0 - BMP  |  Included  |  [\x80-\xFF]                |    128    |                 |    128    |
•---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------•
Against this file, the following results are correct :
[\x00-\xFF]   => 256 chars, coded with one byte = TOTAL of characters
[[:unicode:]] =>   0 char                       = Total chars OVER \x{00FF}
I tried some expressions with look-aheads and look-behinds, containing overlapping zones !
For instance, against this text
aaaabaaababbbaabbabb
, pasted in a new ANSI
tab, with a final line-break, all the regexes below give the correct number of matches :
ba*(?=a)  =>  4 matches
ba*(?!a)  =>  9 matches
ba*(?=b)  =>  8 matches
ba*(?!b)  =>  5 matches
(?<=a)ba* =>  5 matches
(?<!b)ba* =>  5 matches
(?<=b)ba* =>  4 matches
(?<!a)ba* =>  4 matches
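A few of these counts can be replayed in Python's re module, shown here only as a cross-check of the expected figures (assuming its greedy matching and backtracking behave the same way as the Boost engine for these simple patterns):

```python
import re

s = 'aaaabaaababbbaabbabb\n'  # same text, with the final line break

# ba* followed by a lookaround: backtracking trims a* until the
# lookahead holds, and non-overlapping scanning gives these counts.
print(len(re.findall(r'ba*(?=a)', s)))   # 4
print(len(re.findall(r'ba*(?!a)', s)))   # 9
print(len(re.findall(r'(?<=a)ba*', s)))  # 5
print(len(re.findall(r'(?<!a)ba*', s)))  # 4
```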
But, on the other hand, the search of the regex :
[[.NUL.][.SOH.][.STX.][.ETX.][.EOT.][.ENQ.][.ACK.][.BEL.][.BS.][.HT.][.VT.][.FF.][.SO.][.SI.][.DLE.][.DC1.][.DC2.][.DC3.][.DC4.][.NAK.][.SYN.][.ETB.][.CAN.][.EM.][.SUB.][.ESC.][.FS.][.GS.][.RS.][.US.][.DEL.][.HOP.][.RI.][.SS3.][.DCS.][.OCS.][.SHY.]]
This leads to an Invalid Regex message. Logical, as this kind of search concerns
Unicode
files only.
Now, against the
Total_ANSI.txt
file, all the following results are correct :
(?s).        =  [\x00-\xFF]       =>  256        Total = 256
(?-s).       =  [^\x0A\x0C\x0D]   =>  253

\p{Unicode}  =  [[:Unicode:]]     =>    0  |
\P{Unicode}  =  [[:^Unicode:]]    =>  256  |  Total = 256

\X           =>  256  |
(?!\X).      =>    0  |  Total = 256
Here are the correct results concerning all the POSIX character classes, against the
Total_ANSI.txt
file :
[[:unicode:]]  =  \p{unicode}                              an OVER \x{00FF} character            0  =  [^\x00-\xFF]
[[:space:]]    =  \p{space} = [[:s:]] = \p{s} = \ps = \s   a WHITE-SPACE character               7  =  [\t\n\x0B\f\r\x20\xA0]
[[:h:]]        =  \p{h} = \ph = \h                         an HORIZONTAL white space character   3  =  [\t\x20\xA0]
[[:blank:]]    =  \p{blank}                                a BLANK character                     3  =  [\t\x20\xA0]
[[:v:]]        =  \p{v} = \pv = \v                         a VERTICAL white space character      4  =  [\n\x0B\f\r]
[[:cntrl:]]    =  \p{cntrl}                                a CONTROL code character             39  =  [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D\xAD]
[[:upper:]]    =  \p{upper} = [[:u:]] = \p{u} = \pu = \u   an UPPER case letter                 60  =  [A-ZŠŒŽÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞß]
[[:lower:]]    =  \p{lower} = [[:l:]] = \p{l} = \pl = \l   a LOWER case letter                  65  =  [a-zƒšœžŸªµºàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]
[[:digit:]]    =  \p{digit} = [[:d:]] = \p{d} = \pd = \d   a DECIMAL number                     13  =  [0-9²³¹]
_              =  \x{005F}                                 the LOW_LINE character                1
-------
[[:word:]]     =  \p{word} = [[:w:]] = \p{w} = \pw = \w    a WORD character                    139  =  [[:alnum:]]|\x5F = \p{alnum}|\x5F
(?i)[[:upper:]] = (?i)[[:lower:]]                          a LETTER, whatever its CASE         125  =  (?-i)[[:upper:][:lower:]]
[[:alnum:]]    =  \p{alnum}                                an ALPHANUMERIC character           138  =  (?-i)[[:upper:][:lower:][:digit:]]
[[:alpha:]]    =  \p{alpha}                                any LETTER character                125  =  (?-i)[[:upper:][:lower:]]
[[:graph:]]    =  \p{graph}                                any VISIBLE character               212  =  [^\x00-\x1F\x20\x7F\x80\x81\x88\x8D\x8F\x90\x98\x99\x9D\xA0]
[[:print:]]    =  \p{print}                                any PRINTABLE character             219  =  [[:graph:]]|\s
[[:punct:]]    =  \p{punct}                                any PUNCTUATION or SYMBOL character  80  =  [\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E\x82\x84-\x87\x89\x8B\x91-\x97\x9B\xA1-\xBF\xD7\xF7]
[[:xdigit:]]                                               an HEXADECIMAL character             22  =  [0-9A-Fa-f] = (?i)[0-9A-F]
NO results regarding the Unicode character classes, against the
Total_ANSI.txt
file, because any of them logically returns an Invalid Regular Expression message.
Remark :
- A negative POSIX character class can be expressed as
[^[:........:]]
or [[:^........:]]
No INVALID
UTF-8
chars can be found, as we’re dealing with an ANSI
file !
I tested ALL the
Equivalence
classes feature :
You can use any other equivalent character of the
a
letter to get the 15
matches ( for instance : [[=ª=]]
,[[=Å=]]
,[[=ã=]]
Below is the list of all the equivalences of any char of the
Windows-1252
code-page, from \x00
till \xDE
against the Total_ANSI.txt
file. Note that I did not consider the equivalence classes which returns only one match ![[==]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] = [[=alert=]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] = [[=backspace=]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] = [[=IS4=]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] = [[=IS3=]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] = [[=IS2=]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] = [[=IS1=]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[='=]] = [[=apostrophe=]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[=-=]] = [[=hyphen=]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[=–=]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[=—=]] => 33 [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[==]] => 33 
[\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\AD] [[=1=]] = [[=one=]] => 2 [1¹] [[=2=]] = [[=two=]] => 2 [2²] [[=3=]] = [[=three=]] => 2 [3³] [[=A=]] => 15 [AaªÀÁÂÃÄÅàáâãäå] [[=B=]] => 2 [Bb] [[=C=]] => 4 [CcÇç] [[=D=]] => 4 [DdÐð] [[=E=]] => 10 [EeÈÉÊËèéêë] [[=F=]] => 3 [Ffƒ] [[=G=]] => 2 [Gg] [[=H=]] => 2 [Hh] [[=I=]] => 10 [IiÌÍÎÏìíîï] [[=J=]] => 2 [Jj] [[=K=]] => 2 [Kk] [[=L=]] => 2 [Ll] [[=M=]] => 2 [Mm] [[=N=]] => 4 [NnÑñ] [[=O=]] => 15 [OoºÒÓÔÕÖØòóôõöø] [[=P=]] => 2 [Pp] [[=Q=]] => 2 [Qq] [[=R=]] => 2 [Rr] [[=S=]] => 4 [SsŠš] [[=T=]] => 2 [Tt] [[=U=]] => 10 [UuÙÚÛÜùúûü] [[=V=]] => 2 [Vv] [[=W=]] => 2 [Ww] [[=X=]] => 2 [Xx] [[=Y=]] => 6 [YyÝýÿŸ] [[=Z=]] => 4 [ZzŹź] [[=^=]] = [[=circumflex=]] => 2 [^ˆ] [[=Œ=]] => 2 [Œœ] [[=Þ=]] => 2 [Þþ]
Some double-letter characters give some equivalences which allow you to get the right single char to use, instead of the two trivial letters :
[[=AE=]] = [[=Ae=]] = [[=ae=]] => 2 [Ææ] [[=SS=]] = [[=Ss=]] = [[=ss=]] => 1 [ß]
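This matches Unicode full case folding, in which ß folds to the two-letter sequence ss; Python's str.casefold() exposes the same mapping (an illustrative aside; Boost's equivalence classes are actually derived from collation data, not from casefold()):

```python
# Full case folding maps the single character ß to the two-letter 'ss',
# which is why 'SS', 'Ss' and 'ss' are all equivalent to ß.
print('ß'.casefold())   # ss
print('Æ'.casefold())   # æ
print('SS'.casefold() == 'ß'.casefold())  # True
```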
An example : let’s suppose that we run this regex
(?-i)[A-F[:lower:]]
, against my Total_ANSI.txt
file. It does give 71
matches, so 6
UPPER letters + 65
LOWER letters.
As, in an
ANSI
file, the Match case
option or the (?i)
modifier is effective for POSIX
character classes, if we run the same regex, in an insensitive way, the (?i)[A-F[:lower:]]
regex returns, this time, 125
matches.
And note that the regex
(?-i)[[:upper:][:lower:]]
or (?i)[[:upper:][:lower:]]
acts as an insensitive regex and returns 125
matches ( so 60
UPPER letters + 65
LOWER letters ).
The regexes
(?-i)\u(?<=\l)
and(?-i)(?=\l)\u
do not find any match. This implies that the sets of UPPER and LOWER letters are totally disjoint.
Finally, for
ANSI
files, the regex syntax\X
is rather useless. Indeed, the UNICODE block of Combining diacritical
marks cannot be used anyway, and the Emoji,
which are UNICODE characters, are totally inaccessible to ANSI
files. Thus, the \X
regex is just equivalent to the simple regex (?s).
So, from this set of ANSI results, which ones seem quite odd, compared with the UNICODE results ?
-
Maybe, the regex
(?-s).
should just be equal to[^\x0A\x0D]
and return 254
matches -
The
[[:cntrl:]]
or\p{cntrl}
should be equal to[\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]
and return 38
characters or, maybe,[\x00-\x1F\x7F]
so 33
chars only
Regarding the
[[:graph:]]
character class, I created an identical UTF-8
file, named Total_UTF-8.txt
. Here are the results for the characters between \x80
and \xBF
, in both files ( all the other chars being identical ) :•---------•---------•--------------------• | ANSI | UTF-8 | UNICODE Category | •---------•---------•--------------------• | | € | Sc | ‚ | ‚ | | ƒ | ƒ | | „ | „ | | … | … | | † | † | | ‡ | ‡ | | | ˆ | Po | ‰ | ‰ | | Š | Š | | ‹ | ‹ | | Œ | Œ | | Ž | Ž | | ‘ | ‘ | | ’ | ’ | | “ | “ | | ” | ” | | • | • | | – | – | | — | — | | | ˜ | Sk | | ™ | So | š | š | | › | › | | œ | œ | | ž | ž | | Ÿ | Ÿ | | ¡ | ¡ | | ¢ | ¢ | | £ | £ | | ¤ | ¤ | | ¥ | ¥ | | ¦ | ¦ | | § | § | | ¨ | ¨ | | © | © | | ª | ª | | « | « | | ¬ | ¬ | | | | Cf | ® | ® | | ¯ | ¯ | | ° | ° | | ± | ± | | ² | ² | | ³ | ³ | | ´ | ´ | | µ | µ | | ¶ | ¶ | | · | · | | ¸ | ¸ | | ¹ | ¹ | | º | º | | » | » | | ¼ | ¼ | | ½ | ½ | | ¾ | ¾ | | ¿ | ¿ | •---------•---------•
Surprisingly, the ANSI chars
\x80
, \x88
, \x98
, \x99
are not supposed to be part of the [[:graph:]]
class, which represents the visible characters !?
So, to harmonize the results, the rule should be :
-
When using the
[[:graph:]]
POSIX character class, against an ANSI
file :-
The
[\x80\x88\x98\x99]
ANSI list of characters ( corresponding to the [\x{20AC}\x{02C6}\x{02DC}\x{2122}]
UTF-8 list ) should be included in that class -
The
\xAD
character ( or \x{00AD} ) should be excluded from that class !
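The reason those four ANSI bytes are special is that Windows-1252 maps them to higher code points which are perfectly visible characters. Python's cp1252 codec (used here only as a stand-in for the Windows code-page tables) shows the mapping and the Unicode categories:

```python
import unicodedata

# Windows-1252 maps \x80 to € (U+20AC, Sc) and \x99 to ™ (U+2122, So);
# both are visible characters, so [[:graph:]] arguably should match them.
for byte in (b'\x80', b'\x99'):
    ch = byte.decode('cp1252')
    print(byte, ch, 'U+%04X' % ord(ch), unicodedata.category(ch))
```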
-
Now, as the
[[:print:]]
POSIX character class is simply identical to the regex[[:graph:]]|\s
, no need to investigate that character class !
See next post
-
Hi @Coises and All,
End of my reply :
In the same way, regarding the
[[:punct:]]
character class, here are the results for both the Total_ANSI.txt
andTotal_UTF-8.txt
files :•---------•---------•--------------------• | ANSI | UTF-8 | UNICODE Category | •---------•---------•--------------------• | ! | ! | Po | " | " | Po | # | # | Po | $ | $ | Sc | % | % | Po | & | & | Po | ' | ' | Po | ( | ( | Ps | ) | ) | Pe | * | * | Po | + | + | Sm | , | , | Po | - | - | Pd | . | . | Po | / | / | Po | : | : | Po | ; | ; | Po | < | < | Sm | = | = | Sm | > | > | Sm | ? | ? | Po | @ | @ | Po | [ | [ | Ps | \ | \ | Po | ] | ] | Pe | ^ | ^ | Sk | _ | _ | Pc | ` | ` | Sk | { | { | Ps | | | | | Sm | } | } | Pe | ~ | ~ | Sm •---------•---------•-----------• | | € | Sc | ‚ | ‚ | Ps | „ | „ | Ps | … | … | Po | † | † | Po | ‡ | ‡ | Po | ‰ | ‰ | Po | ‹ | ‹ | Pi | ‘ | ‘ | Pi | ’ | ’ | Pf | “ | “ | Pi | ” | ” | Pf | • | • | Po | – | – | Pd | — | — | Pd | | ˜ | Sk | | ™ | So | › | › | Pf | ¡ | ¡ | Po | ¢ | ¢ | Sc | £ | £ | Sc | ¤ | ¤ | Sc | ¥ | ¥ | Sc | ¦ | ¦ | So | § | § | Po | ¨ | ¨ | Sk | © | © | So | ª | | Lo | « | « | Pi | ¬ | ¬ | Sm | | | Cf | ® | ® | So | ¯ | ¯ | Sk | ° | ° | So | ± | ± | Sm | ² | | No | ³ | | No | ´ | ´ | Sk | µ | | Ll | ¶ | ¶ | Po | · | · | Po | ¸ | ¸ | Sk | ¹ | | No | º | | Lo | » | » | Pf | ¼ | | No | ½ | | No | ¾ | | No | ¿ | ¿ | Po | × | × | Sm | ÷ | ÷ | Sm •---------•---------•-----------•
And, as we know that the
[[:punct:]]
POSIX character class is the union of the TWO Unicode classes\p{P*}
and\p{S*}
, this means that all the[[:punct:]]
characters, found inTotal_UTF-8.txt
, are exact !However, it’s obvious that it’s not the case for the
[[:punct:]]
characters found inTotal_ANSI.txt
:So, again, to harmonize the results, the rule should be :
-
When using the
[[:punct:]]
POSIX character class, against anANSI
file :-
The
[\xAA\xAD\xB2\xB3\xB5\xB9\xBA\xBC\xBD\xBE]
list of characters should be excluded from that class ! -
The
[\x80\x98\x99]
ANSI list of characters ( corresponding to the [\x{20AC}\x{02DC}\x{2122}]
UTF-8 list ) should be included in that class
-
And this result would confirm that the POSIX
[[:punct:]]
character class is equal to the\p{P*}|\p{S*}
regex, in all cases !
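That union is easy to spot-check with Python's unicodedata module (a Unicode General Category check only; it says nothing about what the Windows ANSI classification reports):

```python
import unicodedata

def is_punct_or_symbol(ch):
    # True when the Unicode General Category is P* or S*,
    # i.e. the union \p{P*}|\p{S*} discussed above.
    return unicodedata.category(ch)[0] in ('P', 'S')

assert is_punct_or_symbol('«')      # Pi, initial punctuation
assert is_punct_or_symbol('±')      # Sm, math symbol
assert not is_punct_or_symbol('µ')  # Ll, a lower-case letter
assert not is_punct_or_symbol('¼')  # No, a number
```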
-
Regarding the
Equivalence
classes whose results are presently 33
, the rule should be :-
All the
Control
codes should just match their own character. For example[[==]]
should return 1 match [\x7F]
-
[[='=]]
=[[=apostrophe=]]
should return 1 match [\x27]
-
[[=-=]]
=[[=hyphen=]]
should return 1 match [\x2D]
-
[[=–=]]
should return 1 match [\x96]
-
[[=—=]]
should return 1 match [\x97]
-
[[==]]
should return 1 match [\xAD]
-
Now, when doing tests with UNICODE files, I forgot the equivalence classes of the
Control C0/C1
and Control Format
characters ! So the results, against my Total_Chars.txt
UTF-8 file, are :[[=nul=]] => 3,309 [\x{0000}\X{00AD}....] Cc [[=soh=]] => 1 [\x{0001}] Cc [[=stx=]] => 1 [\x{0002}] Cc [[=etx=]] => 1 [\x{0003}] Cc [[=eot=]] => 1 [\x{0004}] Cc [[=enq=]] => 1 [\x{0005}] Cc [[=ack=]] => 1 [\x{0006}] Cc [[=bel=]] = [[=alert=]] => 1 [\x{0007}] Cc [[=bs=]] = [[=backspace=]] => 1 [\x{0008}] Cc [[=ht=]] = [[=tab=]] => 1 [\x{0009}] Cc [[=lf=]] = [[=newline=]] => 1 [\x{000A}] Cc [[=vt=]] = [[=vertical-tab=]] => 1 [\x{000B}] Cc [[=ff=]] = [[=form-feed=]] => 1 [\x{000C}] Cc [[=cr=]] = [[=carriage-return=]] => 1 [\x{000D}] Cc [[=so=]] => 1 [\x{000E}] Cc [[=si=]] => 1 [\x{000F}] Cc [[=dle=]] => 1 [\x{0010}] Cc [[=dc1=]] => 1 [\x{0011}] Cc [[=dc2=]] => 1 [\x{0012}] Cc [[=dc3=]] => 1 [\x{0013}] Cc [[=dc4=]] => 1 [\x{0014}] Cc [[=nak=]] => 1 [\x{0015}] Cc [[=syn=]] => 1 [\x{0016}] Cc [[=etb=]] => 1 [\x{0017}] Cc [[=can=]] => 1 [\x{0018}] Cc [[=em=]] => 1 [\x{0019}] Cc [[=sub=]] => 1 [\x{001A}] Cc [[=esc=]] => 1 [\x{001B}] Cc [[=fs=]] => 1 [\x{001C}] Cc [[=gs=]] => 1 [\x{001D}] Cc [[=rs=]] => 1 [\x{001E}] Cc [[=us=]] => 1 [\x{001F}] Cc [[= =]] => 3 [\x{0020}\x{205F}\x{3000}] Zs [[=del=]] => 1 [\x{007F}] Cc [[=pad=]] => 1 [\x{0080}] Cc [[=hop=]] => 1 [\x{0081}] Cc [[=bph=]] => 1 [\x{0082}] Cc [[=nbh=]] => 1 [\x{0083}] Cc [[=ind=]] => 1 [\x{0084}] Cc [[=nel=]] => 1 [\x{0085}] Cc [[=ssa=]] => 1 [\x{0086}] Cc [[=esa=]] => 1 [\x{0087}] Cc [[=hts=]] => 1 [\x{0088}] Cc [[=htj=]] => 1 [\x{0089}] Cc [[=lts=]] => 1 [\x{008A}] Cc [[=pld=]] => 1 [\x{008B}] Cc [[=plu=]] => 1 [\x{008C}] Cc [[=ri=]] => 1 [\x{008D}] Cc [[=ss2=]] => 1 [\x{008E}] Cc [[=ss3=]] => 1 [\x{008F}] Cc [[=dcs=]] => 1 [\x{0090}] Cc [[=pu1=]] => 1 [\x{0091}] Cc [[=pu2=]] => 1 [\x{0092}] Cc [[=sts=]] => 1 [\x{0093}] Cc [[=cch=]] => 1 [\x{0094}] Cc [[=mw=]] => 1 [\x{0095}] Cc [[=spa=]] => 1 [\x{0096}] Cc [[=epa=]] => 1 [\x{0097}] Cc [[=sos=]] => 1 [\x{0098}] Cc [[=sgci=]] => 1 [\x{0099}] Cc [[=sci=]] => 1 [\x{009A}] Cc [[=csi=]] => 1 [\x{009B}] Cc [[=st=]] => 1 [\x{009C}] Cc [[=osc=]] => 1 
[\x{009D}] Cc
[[=pm=]]     => 1      [\x{009E}]                  Cc
[[=apc=]]    => 1      [\x{009F}]                  Cc
[[=nbsp=]]   => 1      [\x{00A0}]                  Cc
[[=shy=]]    => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=alm=]]    => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=sam=]]    => 2      [\x{070F}\x{2E1A}]          Po
[[=ospm=]]   => 1      [\x{1680}]                  Zs
[[=mvs=]]    => 1      [\x{180E}]                  Cf
[[=nqsp=]]   => 2      [\x{2000}\x{2002}]          Zs
[[=mqsp=]]   => 2      [\x{2001}\x{2003}]          Zs
[[=ensp=]]   => 2      [\x{2000}\x{2002}]          Zs
[[=emsp=]]   => 2      [\x{2001}\x{2003}]          Zs
[[=3/msp=]]  => 1      [\x{2004}]                  Zs
[[=4/msp=]]  => 1      [\x{2005}]                  Zs
[[=6/msp=]]  => 1      [\x{2006}]                  Zs
[[=fsp=]]    => 1      [\x{2007}]                  Zs
[[=psp=]]    => 1      [\x{2008}]                  Zs
[[=thsp=]]   => 1      [\x{2009}]                  Zs
[[=hsp=]]    => 1      [\x{200A}]                  Zs
[[=zwsp=]]   => 1      [\x{200B}]                  Cf
[[=zwnj=]]   => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=zwj=]]    => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=lrm=]]    => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=rlm=]]    => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=ls=]]     => 2      [\x{2028}\x{FE47}]          Zl
[[=ps=]]     => 1      [\x{2029}]                  Zp
[[=lre=]]    => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=rle=]]    => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=pdf=]]    => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=lro=]]    => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=rlo=]]    => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=nnbsp=]]  => 1      [\x{202F}]                  Zs
[[=mmsp=]]   => 3      [\x{0020}\x{205F}\x{3000}]  Zs
[[=wj=]]     => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=(fa)=]]   => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=(it)=]]   => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=(is)=]]   => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=(ip)=]]   => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=lri=]]    => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=rli=]]    => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=fsi=]]    => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=pdi=]]    => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=iss=]]    => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=ass=]]    => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=iafs=]]   => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=aafs=]]   => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=nads=]]   => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=nods=]]   => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=idsp=]]   => 3      [\x{0020}\x{205F}\x{3000}]  Zs
[[=zwnbsp=]] => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=iaa=]]    => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=ias=]]    => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=iat=]]    => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=sflo=]]   => 1      [\x{1BCA0}]                 Cf
[[=sfco=]]   => 1      [\x{1BCA1}]                 Cf
[[=sfds=]]   => 1      [\x{1BCA2}]                 Cf
[[=sfus=]]   => 1      [\x{1BCA3}]                 Cf
As you can see, a lot of Format characters give the erroneous result of 3,309 occurrences. But we're not going to bother about these wrong equivalence classes, as long as the similar collating names, with the [[.XXX.]] syntax, are totally correct!
Luckily, all the other equivalence classes are quite correct, except for [[=ls=]], which returns 2 matches, \x{2028} and \x{FE47}?? Also just a detail!
Best Regards,
guy038
-
-
@guy038 said in Columns++ version 1.2: better Unicode search:
But, on the other hand, the search of the regex :
[[.NUL.][.SOH.][.STX.][.ETX.][.EOT.][.ENQ.][.ACK.][.BEL.][.BS.][.HT.][.VT.][.FF.][.SO.][.SI.][.DLE.][.DC1.][.DC2.][.DC3.][.DC4.][.NAK.][.SYN.][.ETB.][.CAN.][.EM.][.SUB.][.ESC.][.FS.][.GS.][.RS.][.US.][.DEL.][.HOP.][.RI.][.SS3.][.DCS.][.OCS.][.SHY.]]
Leads to an Invalid Regex message. Logical, as this kind of search concerns Unicode files, only.
There is a typo in that. The next-to-last symbolic name should be OSC, not OCS. (See list at the end of this help section.)
However, it still won’t work in ANSI search, because ANSI search only supports these POSIX symbolic names as defined by Boost::regex.
The regular expression language for ANSI files is exactly the same as it is in Notepad++ search, because I have not changed the underlying Boost::regex engine’s behavior for ANSI files. I only changed the way the engine works for UTF-8 files.
Some things, like stepwise find and replace with \K, formulas in replacement strings and counting null matches (my Count counts them, Notepad++’s doesn’t) differ for both ANSI and UTF-8 because I changed the surrounding code that uses the Boost::regex engine; but the matching itself is unchanged for ANSI.
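As an aside, a null (empty) match is easy to demonstrate outside the plugin. This minimal Python sketch (using the standard re module, not the Boost::regex engine the plugin wraps) shows how a pattern like a* yields empty matches that a counting feature may or may not include:

```python
import re

# "a*" can match the empty string, so it produces a null match at every
# position where no "a" is present; Python's re module (3.7+) includes
# these empty matches in its results.
matches = re.findall(r'a*', 'bab')
print(matches)  # ['', 'a', '', '']
```

Whether such empty matches are counted is exactly the kind of decision made in the code surrounding the regex engine, as described above.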
This is why the character classes behave differently as well. Boost::regex relies on GetStringTypeExA (which is similar to GetStringTypeExW, except that the third argument is a char* instead of a wchar_t*) to classify 8-bit characters according to the Ctype 1 list here. The classification depends on the current locale (which should imply the system default code page, which is the only code page Notepad++ ever uses as ANSI — documents in other code pages are converted to UTF-8). ANSI regular expressions, per Boost::regex design, use whatever information Windows gives them.
-
Hi, @Coises and All,
I think this will be my last answer concerning your Columns++ v1.2 plugin!
Here is a recapitulation of the ways to access the invisible characters, whatever the file type:
For ANSI files, there is just one possible syntax for these collating names:
[[.NUL.][.SOH.][.STX.][.ETX.][.EOT.][.ENQ.][.ACK.][.alert.][.backspace.][.tab.][.newline.][.vertical-tab.][.form-feed.][.carriage-return.][.SO.][.SI.][.DLE.][.DC1.][.DC2.][.DC3.][.DC4.][.NAK.][.SYN.][.ETB.][.CAN.][.EM.][.SUB.][.ESC.][.IS4.][.IS3.][.IS2.][.IS1.][.DEL.]]
which returns 33 matches against the Total_ANSI.txt file, which contains the 256 characters of the Win-1252 encoding.
-
Note that the lowercase syntax is NOT allowed, in ANSI files, for any of the collating names shown above in UPPER case.
-
Note also that the four chars from \x1c to \x1f must be referred to as IS4 down to IS1, in UPPER case (and NOT as fs to us!)
For UTF-8 files, there are two possible syntaxes for these collating names:
[[.nul.][.soh.][.stx.][.etx.][.eot.][.enq.][.ack.][.bel.][.bs.][.ht.][.lf.][.vt.][.ff.][.cr.][.so.][.si.][.dle.][.dc1.][.dc2.][.dc3.][.dc4.][.nak.][.syn.][.etb.][.can.][.em.][.sub.][.esc.][.fs.][.gs.][.rs.][.us.][.del.][.pad.][.hop.][.bph.][.nbh.][.ind.][.nel.][.ssa.][.esa.][.hts.][.htj.][.lts.][.pld.][.plu.][.ri.][.ss2.][.ss3.][.dcs.][.pu1.][.pu2.][.sts.][.cch.][.mw.][.spa.][.epa.][.sos.][.sgci.][.sci.][.csi.][.st.][.osc.][.pm.][.apc.][.nbsp.][.shy.][.alm.][.sam.][.ospm.][.mvs.][.nqsp.][.mqsp.][.ensp.][.emsp.][.3/msp.][.4/msp.][.6/msp.][.fsp.][.psp.][.thsp.][.hsp.][.zwsp.][.zwnj.][.zwj.][.lrm.][.rlm.][.ls.][.ps.][.lre.][.rle.][.pdf.][.lro.][.rlo.][.nnbsp.][.mmsp.][.wj.][.(fa).][.(it).][.(is).][.(ip).][.lri.][.rli.][.fsi.][.pdi.][.iss.][.ass.][.iafs.][.aafs.][.nads.][.nods.][.idsp.][.zwnbsp.][.iaa.][.ias.][.iat.][.sflo.][.sfco.][.sfds.][.sfus.]]
which returns 120 matches against the Total_Chars.txt file, and
[[.nul.][.soh.][.stx.][.etx.][.eot.][.enq.][.ack.][.alert.][.backspace.][.tab.][.newline.][.vertical-tab.][.form-feed.][.carriage-return.][.so.][.si.][.dle.][.dc1.][.dc2.][.dc3.][.dc4.][.nak.][.syn.][.etb.][.can.][.em.][.sub.][.esc.][.fs.][.gs.][.rs.][.us.][.del.][.pad.][.hop.][.bph.][.nbh.][.ind.][.nel.][.ssa.][.esa.][.hts.][.htj.][.lts.][.pld.][.plu.][.ri.][.ss2.][.ss3.][.dcs.][.pu1.][.pu2.][.sts.][.cch.][.mw.][.spa.][.epa.][.sos.][.sgci.][.sci.][.csi.][.st.][.osc.][.pm.][.apc.][.nbsp.][.shy.][.alm.][.sam.][.ospm.][.mvs.][.nqsp.][.mqsp.][.ensp.][.emsp.][.3/msp.][.4/msp.][.6/msp.][.fsp.][.psp.][.thsp.][.hsp.][.zwsp.][.zwnj.][.zwj.][.lrm.][.rlm.][.ls.][.ps.][.lre.][.rle.][.pdf.][.lro.][.rlo.][.nnbsp.][.mmsp.][.wj.][.(fa).][.(it).][.(is).][.(ip).][.lri.][.rli.][.fsi.][.pdi.][.iss.][.ass.][.iafs.][.aafs.][.nads.][.nods.][.idsp.][.zwnbsp.][.iaa.][.ias.][.iat.][.sflo.][.sfco.][.sfds.][.sfus.]]
which also returns 120 matches against the Total_Chars.txt file.
-
Note that the UPPERCASE syntax is allowed, in UTF-8 files, for any of the collating names shown above in lower case.
Finally, for an ANSI file containing the 256 chars of the Win-1252 encoding and converted to a UTF-8 file (Encoding > Convert to UTF-8), two syntaxes are possible:
[[.nul.][.soh.][.stx.][.etx.][.eot.][.enq.][.ack.][.bel.][.bs.][.ht.][.lf.][.vt.][.ff.][.cr.][.so.][.si.][.dle.][.dc1.][.dc2.][.dc3.][.dc4.][.nak.][.syn.][.etb.][.can.][.em.][.sub.][.esc.][.fs.][.gs.][.rs.][.us.][.del.][.pad.][.hop.][.ri.][.ss3.][.dcs.][.osc.][.nbsp.][.shy.]]
which returns 40 matches against the Total_UTF-8.txt file, and
[[.nul.][.soh.][.stx.][.etx.][.eot.][.enq.][.ack.][.alert.][.backspace.][.tab.][.newline.][.vertical-tab.][.form-feed.][.carriage-return.][.so.][.si.][.dle.][.dc1.][.dc2.][.dc3.][.dc4.][.nak.][.syn.][.etb.][.can.][.em.][.sub.][.esc.][.fs.][.gs.][.rs.][.us.][.del.][.hop.][.ri.][.ss3.][.dcs.][.osc.][.nbsp.][.shy.]]
which also returns 40 matches against the Total_UTF-8.txt file.
- Note that the UPPERCASE syntax is allowed, in UTF-8 files, for any of the collating names shown above in lower case.
Now, against the Total_ANSI.txt file, containing the first 256 UNICODE characters, we get these results:
(?s).                          ANY character                             =>  256
(?-s).                         ANY character different from LINE-BREAKS  =>  253  = [^\x0A\x0C\x0D]
[[:unicode:]] = \p{unicode}    an OVER \x{00FF} character                =>    0  = [^\x00-\xFF]
[[:cntrl:]]   = \p{cntrl}      a CONTROL code character                  =>   39  = [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D\xAD]
[[:space:]]   = \p{space}      a WHITE-SPACE character                   =>    7  = [\t\n\x0B\f\r\x20\xA0]
[[:blank:]]   = \p{blank}      a BLANK character                         =>    3  = [\t\x20\xA0]
[[:upper:]]   = \p{upper}      an UPPER case letter                      =>   60  = [A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]
[[:lower:]]   = \p{lower}      a LOWER case letter                       =>   65  = [a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]
[[:digit:]]   = \p{digit}      a DECIMAL number                          =>   13  = [0-9²³¹]
[[:word:]]    = \p{word}       a WORD character                          =>  139  = [[:alnum:]]|\x5F = \p{alnum}|\x5F
[[:punct:]]   = \p{punct}      any PUNCTUATION or SYMBOL character       =>   80  = [\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E\x82\x84-\x87\x89\x8B\x91-\x97\x9B\xA1-\xBF\xD7\xF7]
[[:alpha:]]   = \p{alpha}      any LETTER character                      =>  125  = (?-i)[[:upper:][:lower:]]
[[:alnum:]]   = \p{alnum}      an ALPHANUMERIC character                 =>  138  = (?-i)[[:upper:][:lower:][:digit:]]
[[:graph:]]   = \p{graph}      any VISIBLE character                     =>  212  = [^\x00-\x1F\x20\x7F\x80\x81\x88\x8D\x8F\x90\x98\x99\x9D\xA0]
[[:print:]]   = \p{print}      any PRINTABLE character                   =>  219  = [[:graph:][:space:]] = [^\x00-\x1F\x20\x7F\x80\x81\x88\x8D\x8F\x90\x98\x99\x9D\xA0]|[[:space:]]
[[:xdigit:]]                   an HEXADECIMAL character                  =>   22  = [0-9A-Fa-f] = (?i)[0-9A-F]
Remark: the [[:unicode:]] class, for characters OVER \x{00FF}, must correspond to the C1_DEFINED type from the Ctype 1 list here.
From this same article, after realizing that the POSIX classes are not totally independent, I deduced this layout:
C1_DEFINED   Other characters       0
C1_CNTRL     Control characters    39
C1_SPACE     Space characters       2   ( only the SPACE and NBSP chars, OUT of 7, as ALL the others are ALREADY included in the CNTRL chars class )
C1_UPPER     Uppercase             60
C1_LOWER     Lowercase             65
C1_DIGIT     Decimal digits        13
C1_PUNCT     Punctuation           73   ( and NOT 80, because the \xAD char is ALREADY included in the CNTRL chars class,
                                          because \xAA, \xB5 and \xBA are ALREADY included in the LOWER chars class,
                                          and because \xB2, \xB3 and \xB9 are ALREADY included in the DIGIT chars class )
                                 -----
                        TOTAL :   252 chars
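The arithmetic of this partition can be checked with a trivial Python sketch (the per-class figures are simply those from the table above):

```python
# Per-class character counts taken from the Ctype 1 partition above
counts = {
    "C1_CNTRL": 39,  # control characters
    "C1_SPACE": 2,   # SPACE and NBSP only
    "C1_UPPER": 60,
    "C1_LOWER": 65,
    "C1_DIGIT": 13,
    "C1_PUNCT": 73,
}
total = sum(counts.values())
print(total, 256 - total)  # 252 4 -> four chars left over as C1_DEFINED only
```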
So, if I exclude, from my Total_ANSI.txt file, all the following classes with this S/R:
FIND [[:cntrl:][:space:][:upper:][:lower:][:digit:][:punct:]]
REPLACE Leave EMPTY
either with your plugin or with native N++, there remain 4 characters (256 - 252), which are the € (\x{20AC}), ˆ (\x{02C6}), ˜ (\x{02DC}) and ™ (\x{2122}) characters.
Moreover, absolutely no POSIX character class and, of course, no UNICODE character class can find these 4 characters!
Thus, the only way to find one of these 4 characters, in an ANSI file, is to use the regex [\x80\x88\x98\x99] or to use the characters themselves :-((
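For reference (a sketch independent of the plugin), Python's unicodedata module shows what those four Windows-1252 bytes become in Unicode and which general categories they carry there. Windows's ANSI Ctype 1 table leaves them as C1_DEFINED only, whereas Unicode assigns them ordinary categories, which is why UTF-8 search has no trouble with them:

```python
import unicodedata

# The four Windows-1252 bytes that no ANSI POSIX class matches,
# decoded to their Unicode counterparts with their general category
info = {}
for byte in (0x80, 0x88, 0x98, 0x99):
    ch = bytes([byte]).decode('cp1252')
    info[byte] = (ch, unicodedata.category(ch))
    print(f"\\x{byte:02X} -> U+{ord(ch):04X} ({info[byte][1]})")
# \x80 -> U+20AC (Sc)  € currency symbol
# \x88 -> U+02C6 (Lm)  ˆ modifier letter
# \x98 -> U+02DC (Sk)  ˜ modifier symbol
# \x99 -> U+2122 (So)  ™ other symbol
```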
In this article, it is also said :
Printable | Graphic characters and blanks (all C1_* types except C1_CNTRL). Thus …
So, from the previous total of chars of my Total_ANSI.txt file, the [[:print:]] class should detect 252 - 39, so 213 matches.
Thus, as [[:graph:]] = [[:print:]] - [[:space:]], this means that [[:graph:]] should give 213 - 2, so 211 matches.
But the current result is 212 matches. The difference of one unit comes from the \xAD char, which is part of both the [[:cntrl:]] and [[:graph:]] POSIX character classes!
If we remember the 4 lacking chars which, obviously, are visible and printable, this means that [[:graph:]] and [[:print:]] should return, respectively, 215 (211 + 4) and 217 (213 + 4) matches, for ANSI files.
And it is easy to verify that [[:print:]] + [[:cntrl:]] = 217 + 39 = 256!
Just for info: from the Total_UTF-8.txt file, containing these same chars, we get these results:
(?s).                          ANY character                             =>  256
(?-s).                         ANY character different from LINE-BREAKS  =>  254  = [^\x0A\x0D]
[[:ascii:]]                    an UNDER \x{0080} character               =>  128  = [\x{0000}-\x{007F}] = \p{ascii}
[[:unicode:]] = \p{unicode}    an OVER \x{00FF} character                =>   27  = [^\x00-\xFF] = [\x{20AC}\x{201A}\x{0192}\x{201E}\x{2026}\x{2020}\x{2021}\x{02C6}\x{2030}\x{0160}\x{2039}\x{0152}\x{017D}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{02DC}\x{2122}\x{0161}\x{203A}\x{0153}\x{017E}\x{0178}]
[[:cntrl:]]   = \p{cntrl}      a CONTROL code character                  =>   38  = [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D] = \p{Cc}
[[:space:]]   = \p{space}      a WHITE-SPACE character                   =>    7  = [\t\n\x0B\f\r\x20\xA0]
[[:blank:]]   = \p{blank}      a BLANK character                         =>    3  = [\t\x{0020}\x{00A0}] = \p{Zs}|\t
[[:upper:]]   = \p{upper}      an UPPER case letter                      =>   60  = [A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ] = \p{Lu}
[[:lower:]]   = \p{lower}      a LOWER case letter                       =>   63  = [a-zƒšœžµßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ] = \p{Ll}
[[:digit:]]   = \p{digit}      a DECIMAL number                          =>   10  = [0-9] = \p{Nd}
[[:word:]]    = \p{word}       a WORD character                          =>  137  = \p{L*}|\p{Nd}|_
[[:graph:]]   = \p{graph}      any VISIBLE character                     =>  215  = [^\x00-\x1F\x20\x7F\x81\x8D\x8F\x90\x9D\xA0\xAD] = (?![\x20\xA0\xAD])\P{Cc}
[[:print:]]   = \p{print}      any PRINTABLE character                   =>  222  = [[:graph:][:space:]] = [^\x00-\x1F\x20\x7F\x81\x8D\x8F\x90\x9D\xA0\xAD]|[[:space:]]
[[:punct:]]   = \p{punct}      any PUNCTUATION or SYMBOL character       =>   73  = \p{P*}|\p{S*} = [\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E\x{20AC}\x{201A}\x{201E}\x{2026}\x{2020}\x{2021}\x{2030}\x{2039}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{02DC}\x{2122}\x{203A}\xA1-\xA9\xAB\xAC\xAE-\xB1\xB4\xB6-\xB8\xBB\xBF\xD7\xF7]
[[:alpha:]]   = \p{alpha}      any LETTER character                      =>  126  = \p{L*} = \p{Lu}|\p{Ll}|[ˆªº]
[[:alnum:]]   = \p{alnum}      an ALPHANUMERIC character                 =>  136  = \p{L*}|\p{Nd}
[[:xdigit:]]                   an HEXADECIMAL character                  =>   22  = [0-9A-Fa-f] = (?i)[0-9A-F]
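Two of the figures above, [[:cntrl:]] = 38 and [[:unicode:]] = 27, are easy to cross-check against Unicode data with a short Python sketch. It assumes (matching the behaviour described in this thread) that the five bytes undefined in Windows-1252 pass through unchanged as U+0081, U+008D, U+008F, U+0090 and U+009D:

```python
import unicodedata

# Rebuild the 256 characters of a Windows-1252 file converted to UTF-8.
# Assumption: the five undefined Windows-1252 bytes are kept as U+00XX,
# which is why Python's strict cp1252 codec needs a fallback here.
chars = []
for b in range(256):
    try:
        chars.append(bytes([b]).decode('cp1252'))
    except UnicodeDecodeError:
        chars.append(chr(b))

cc = sum(unicodedata.category(c) == 'Cc' for c in chars)   # [[:cntrl:]] = \p{Cc}
over_ff = sum(ord(c) > 0xFF for c in chars)                # [[:unicode:]]
print(cc, over_ff)  # 38 27
```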
Best regards,
guy038
-