Columns++ version 1.2: better Unicode search
-
@Alan-Kilborn said in Columns++ version 1.2: better Unicode search:
I would go so far as to suggest something that looks and operates like the Notepad++ Find dialog and its tabs.
Then someone using the plugin would not have to learn anything new, and would feel “right at home”.
You know why I suggest this, right? :-)

I think I do, but to be honest, if and when I take on such a project, non-trivial user interface changes would be the whole point. Given that, I'm not sure I'd want to tie myself to recreating a legacy user interface and using it as an underlying model. Familiarity would be a plus, but I am unlikely to impose it on myself as a constraint.
This is all far enough down the road that someone else might well get to it before I do, anyway. I have at least two other self-assignments that would come first, and that’s just in the realm of computer programming.
-
Hello, @coises and All,
Refer to this FAQ, which I've just updated with references to your latest `Columns++` v1.2 release.

Best Regards,
guy038
-
@guy038 said in Columns++ version 1.2: better Unicode search:
Refer to this FAQ, which I've just updated with references to your latest `Columns++` v1.2 release.

Thank you for mentioning Columns++. Might I suggest a couple of things?
-
Columns++ version 1.2 is available using Plugins Admin in Notepad++ 8.7.8; there is no need to do the more complicated installation procedure.
-
I fear that your FAQ might make some readers expect that the search in Columns++ is a replacement for Notepad++ search, when it really isn’t. There are many things Notepad++ search can do (finding in all open files, finding and replacing in multiple files, etc.) that Columns++ search does not do and almost certainly never will. Its original, and still primary, reason for existence is to make it possible to find and replace within a rectangular selection — something Notepad++ search cannot do. There is also the extension of using mathematical formulas in replacements. I would recommend perhaps a link to the online help file sections about Search and Regular Expressions to clarify when Columns++ might be useful.
-
It might be unclear that while the progress dialog change applies to all Count, Select and Replace All actions in Columns++ search, the other changes you mentioned apply only to searches in Unicode files. Searches in “ANSI” files work the same as always. (The ability to search in regions based on a rectangular or multiple selection also applies to all searches, and the ability to use formulas in replacements applies to all regular expression searches.)
-
-
Hello, @coises and All,
You said :
Columns++ version 1.2 is available using Plugins Admin in Notepad++ 8.7.8; there is no need to do the more complicated installation procedure.
Well, I updated the regex documentation with the N++ release `v8.7.6` and, at that time, `Columns++` did not seem to belong to the plugins list!?
You said :
I fear that your FAQ might make some readers expect that the search in Columns++ is a replacement for Notepad++ search, when it really isn’t.
I agree that I did not present your plugin the right way. So I made some modifications, and I hope you'll agree with the new phrasing!
You said :
… the other changes you mentioned apply only to searches in Unicode files. Searches in “ANSI” files work the same as always.
Well, your assertion is a bit paradoxical, regarding the title of this post! Indeed, your title says:
Columns++ version 1.2: better Unicode search !!
And anyway, against an `ANSI` file, any search for a Unicode property triggers an `Invalid Regex` message! So the benefit of this improved version is not so obvious for `ANSI` files. However, I did add a mention which clearly says that search and replace are correct with `ANSI` files, too.

However, I noticed an odd thing:
-

Write these five characters `,¼½¾,` in a new `UTF-8` tab

-

Ask Columns++ to select all the punctuation characters with the `[[:punct:]]` regex

=> It correctly finds the two commas only, as the fractions have the Unicode `\p{other Number}` property and are not punctuation chars

-

Now, convert this `UTF-8` file to an `ANSI` file, with the `Encoding > Convert to ANSI` option

-

Re-try the `[[:punct:]]` regex against this, from now on, `ANSI` file

=> This time, all five characters are selected!?

If you try the `\p{other Number}` regex, it returns, as expected, an error message!
In your documentation, regarding your last sequence `[[.x80.]]–[[.xff.]]`, right before the `Search` file section:

At first sight, it's not a character class and it seems to be an invalid regular expression. Surprisingly, it's not! Actually, it's a sequence of three consecutive characters:
-

An invalid `\x80` UTF-8 byte

-

An EN DASH character ( `\x{2013}` )

-

An invalid `\xff` UTF-8 byte
So, @Coises, just modify this regex to `[[.x80.]-[.xff.]]`, with a HYPHEN-MINUS character, which, indeed, finds any invalid `UTF-8` character!

BTW, I like the two buttons at the bottom right of your documentation ( `txt-` / `TXT+` ), which allow us to zoom in or out. They will surely help a lot of people!
Best Regards
guy038
P.S.:

In the first of my three consecutive posts, which ended my test period of your plugin ( https://community.notepad-plus-plus.org/post/100087 ), I wrote:

`\p{Ascii}` = `(?s)\o` => `128` when applied against my `Total_Chars.txt` file! Now, I understand that the `(?s)` modifier does not change anything for the Count results. Indeed, the `(?s)` or `(?-s)` modifiers are ONLY needed if there is at least one `.` regex character in the entire regex! ( a stand-alone cross-check follows the list below )

So, if we want to omit the `\r` and `\n` in the above regex, we must use the `(?![\r\n])\p{Ascii}` or the `(?![\r\n])\o` syntax, which correctly returns `126` matches.

Note that this is only true for a non-`ANSI` file. For an `ANSI` file:

-

The regex `(?![\r\n])\p{Ascii}` is invalid, as explained above.

-

The regex `(?![\r\n])\o` does work but returns just one match: the lower-case letter `o`!! ( the `Match case` option was set )
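A quick way to sanity-check the `(?s)` / `(?-s)` point outside the plugin is plain Boost::regex ( a minimal sketch; ordinary Boost, not Columns++ — `\o` and `\p{Ascii}` are Columns++ extensions, so `\w` stands in here for a pattern containing no `.` ):

```cpp
// Minimal sketch: show that (?s)/(?-s) only matter when the regex contains
// a '.', by counting matches with plain Boost::regex (not Columns++ itself).
#include <boost/regex.hpp>
#include <iostream>
#include <iterator>
#include <string>

static long countMatches(const std::string& text, const char* pattern) {
    boost::regex re(pattern);
    return std::distance(boost::sregex_iterator(text.begin(), text.end(), re),
                         boost::sregex_iterator());
}

int main() {
    const std::string text = "ab\r\ncd";
    std::cout << countMatches(text, "(?s).")    << '\n'; // 6: '.' also matches \r and \n
    std::cout << countMatches(text, "(?-s).")   << '\n'; // 4: '.' skips the line-break chars
    std::cout << countMatches(text, "(?s)\\w")  << '\n'; // 4: no '.', so (?s) changes nothing
    std::cout << countMatches(text, "(?-s)\\w") << '\n'; // 4: same count either way
}
```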
-
-
@guy038 said in Columns++ version 1.2: better Unicode search:
Well, I updated the regex documentation with the N++ release `v8.7.6` and, at that time, `Columns++` did not seem to belong to the plugins list!?

The previous "stable" version was there, but I got the pull request to update to version 1.2 in just barely in time to make it into Notepad++ 8.7.8.
I did some modifications and I hope you’ll agree with the new phrasing !
Thank you. I like that. I just didn’t want people to install it and then be disappointed that it’s no help if they want to use one of the many features of Notepad++ search that Columns++ does not attempt to replicate.
However, I noticed an odd thing:

-

Write these five characters `,¼½¾,` in a new `UTF-8` tab

-

Ask Columns++ to select all the punctuation characters with the `[[:punct:]]` regex

=> It correctly finds the two commas only, as the fractions have the Unicode `\p{other Number}` property and are not punctuation chars

-

Now, convert this `UTF-8` file to an `ANSI` file, with the `Encoding > Convert to ANSI` option

-

Re-try the `[[:punct:]]` regex against this, from now on, `ANSI` file

=> This time, all five characters are selected!?
Yes, that is something I don't like about my own work: there are now inconsistencies between ANSI and UTF-8, because I changed nothing about ANSI regular expressions. For example, `(?i)\u` still matches all alphabetic characters in ANSI files. (For obscure technical reasons involving C++ template specialization and how Boost::regex is implemented, it may prove to be more difficult to make the corresponding changes to ANSI than it was to make them to Unicode. So far, I haven't even tried.)
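To illustrate that separation ( a minimal sketch of ordinary Boost::regex usage, not the actual Columns++ code ): the narrow and wide engines are distinct template instantiations, each with its own character classification, so changing one leaves the other untouched.

```cpp
// Minimal sketch: boost::basic_regex<charT> is instantiated separately for
// narrow and wide characters, and each instantiation classifies characters
// on its own (per the notes later in this thread, on Windows the defaults
// consult GetStringTypeExA and GetStringTypeExW respectively).
#include <boost/regex.hpp>
#include <iostream>
#include <iterator>
#include <string>

int main() {
    std::string  narrow = "\xBC\xBD\xBE";        // ¼ ½ ¾ as Windows-1252 bytes
    std::wstring wide   = L"\u00BC\u00BD\u00BE"; // the same characters as code points

    boost::regex  narrowPunct("[[:punct:]]");    // boost::basic_regex<char>
    boost::wregex widePunct(L"[[:punct:]]");     // boost::basic_regex<wchar_t>

    std::cout << std::distance(
        boost::sregex_iterator(narrow.begin(), narrow.end(), narrowPunct),
        boost::sregex_iterator()) << '\n';
    std::cout << std::distance(
        boost::wsregex_iterator(wide.begin(), wide.end(), widePunct),
        boost::wsregex_iterator()) << '\n';
    // The two counts can differ for the very same three characters, because
    // the classification tables behind the two instantiations are independent.
}
```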
In your documentation, regarding your last sequence `[[.x80.]]–[[.xff.]]`, right before the `Search` file section: At first sight, it's not a class of characters and it seems to be an invalid regular expression. Surprisingly, it's not! Actually, it's a sequence of three consecutive characters
It looks like I failed to convey what I meant in that entry. What I was trying to say was that you can use `[[.xhh.]]` as a symbolic character reference to find an invalid byte; so that, for example, `[[.xB2.]]` will find any byte 0xB2 that is part of an invalid UTF-8 sequence. (There is no way to isolate bytes 0xB2 that are parts of valid UTF-8 sequences, though; for that, you'd have to reinterpret as — not convert to — ANSI.) I added those as I was updating the documentation, because I thought it was less confusing than telling people they could use expressions like `\x{DCB2}` to find specific invalid bytes. This mirrors how control and invisible characters have symbolic names that match the way Scintilla displays them.
-
Hi, @coises,
You said :
It looks like I failed to convey what I meant in that entry
Ah…, now I understand what you meant! Thus, maybe the two following entries would just mean what you expected:

```
•-----------------------------•-----------•------------------------------------•
| From [[.x00.]] to [[.xff.]] | [[.x##.]] | The invalid UTF-8 byte [[.x##.]]   |
| [[.x80.]-[.xff.]]           |           | Any invalid UTF-8 byte             |
•-----------------------------•-----------•------------------------------------•
```
Like you, I'm a bit upset about the differences in behavior of your `Columns++` plugin between `ANSI` and `UNICODE` files. So I will do additional tests to narrow down where these differences occur! Like my `Total_Chars.txt` UNICODE file, I'll create an ANSI file containing the `256` characters of the `Windows-1252` encoding for this purpose!

https://en.wikipedia.org/wiki/Windows-1252
See you later,
BR
guy038
-
Hello, @coises and All,
I’ve decided to use the same canvas to describe results with an
ANSIfile as I did for results with aUNICODEfile. This description will spread over two posts !So, I first created this ANSI file, named
Total_ANSI.txt:•---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------• | Range | Description | Status | COUNT / MARK of ALL chars | # Chars | ANSI Encoding | # Bytes | •---------------•-----------------•------------•-----------------------------•-----------•-----------------•- ---------• | 0000 - 007F | PLANE 0 - BMP | Included | [\x00-\x7F] | 128 | | 128 | | | | | | | 1 Byte | | | 0080 - 00FF | PLANE 0 - BMP | Included | [\x80-\xFF] | 128 | | 128 | •---------------•-----------------•------------•-----------------------------•-----------•-----------------•-----------•
Against this file, the following results are correct:

```
[\x00-\xFF]    =>  256 chars, coded with one byte  =  TOTAL of characters
[[:unicode:]]  =>    0 chars                       =  Total chars OVER \x{00FF}
```
I tried some expressions with look-aheads and look-behinds, containing overlapping zones!

For instance, against this text `aaaabaaababbbaabbabb`, pasted in a new `ANSI` tab, with a final line-break, all the regexes below give the correct number of matches ( cross-checked below ):

```
ba*(?=a)   => 4 matches
ba*(?!a)   => 9 matches
ba*(?=b)   => 8 matches
ba*(?!b)   => 5 matches

(?<=a)ba*  => 5 matches
(?<!b)ba*  => 5 matches
(?<=b)ba*  => 4 matches
(?<!a)ba*  => 4 matches
```
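These counts can be reproduced outside the plugin; a minimal sketch with plain Boost::regex ( ordinary Boost usage, not Columns++ code ):

```cpp
// Minimal sketch: count matches of the overlapping lookaround patterns
// above with plain Boost::regex, to cross-check the numbers.
#include <boost/regex.hpp>
#include <iostream>
#include <iterator>
#include <string>

int main() {
    const std::string text = "aaaabaaababbbaabbabb";
    const char* patterns[] = { "ba*(?=a)", "ba*(?!a)", "ba*(?=b)", "ba*(?!b)",
                               "(?<=a)ba*", "(?<!b)ba*", "(?<=b)ba*", "(?<!a)ba*" };
    for (const char* p : patterns) {
        boost::regex re(p);
        auto n = std::distance(boost::sregex_iterator(text.begin(), text.end(), re),
                               boost::sregex_iterator());
        std::cout << p << " => " << n << " matches\n";  // e.g. "ba*(?=a) => 4 matches"
    }
}
```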
But, on the other hand, the search of the regex:

```
[[.NUL.][.SOH.][.STX.][.ETX.][.EOT.][.ENQ.][.ACK.][.BEL.][.BS.][.HT.][.VT.][.FF.][.SO.][.SI.][.DLE.][.DC1.][.DC2.][.DC3.][.DC4.][.NAK.][.SYN.][.ETB.][.CAN.][.EM.][.SUB.][.ESC.][.FS.][.GS.][.RS.][.US.][.DEL.][.HOP.][.RI.][.SS3.][.DCS.][.OCS.][.SHY.]]
```

leads to an `Invalid Regex` message. Logical, as this kind of search concerns `Unicode` files only.
Now, against the `Total_ANSI.txt` file, all the following results are correct:

```
(?s).        =  [\x00-\xFF]      => 256     Total = 256
(?-s).       =  [^\x0A\x0C\x0D]  => 253

\p{Unicode}  =  [[:Unicode:]]    =>   0  |
                                         |  Total = 256
\P{Unicode}  =  [[:^Unicode:]]   => 256  |

\X                               => 256  |
                                         |  Total = 256
(?!\X).                          =>   0  |
```
Here are the correct results concerning all the POSIX character classes, against the `Total_ANSI.txt` file:

```
[[:unicode:]]  =  \p{unicode}                               an OVER \x{00FF} character             0  =  [^\x00-\xFF]
[[:space:]]    =  \p{space} = [[:s:]] = \p{s} = \ps = \s    a WHITE-SPACE character                7  =  [\t\n\x0B\f\r\x20\xA0]
[[:h:]]        =  \p{h} = \ph = \h                          an HORIZONTAL white space character    3  =  [\t\x20\xA0]
[[:blank:]]    =  \p{blank}                                 a BLANK character                      3  =  [\t\x20\xA0]
[[:v:]]        =  \p{v} = \pv = \v                          a VERTICAL white space character       4  =  [\n\x0B\f\r]
[[:cntrl:]]    =  \p{cntrl}                                 a CONTROL code character              39  =  [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D\xAD]
[[:upper:]]    =  \p{upper} = [[:u:]] = \p{u} = \pu = \u    an UPPER case letter                  60  =  [A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]
[[:lower:]]    =  \p{lower} = [[:l:]] = \p{l} = \pl = \l    a LOWER case letter                   65  =  [a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]
[[:digit:]]    =  \p{digit} = [[:d:]] = \p{d} = \pd = \d    a DECIMAL number                      13  =  [0-9²³¹]
_              =  \x{005F}                                  the LOW_LINE character                 1
                                                                                              -------
[[:word:]]     =  \p{word} = [[:w:]] = \p{w} = \pw = \w     a WORD character                     139  =  [[:alnum:]]|\x5F = \p{alnum}|\x5F

(?i)[[:upper:]] = (?i)[[:lower:]]                           a LETTER, whatever its CASE          125  =  (?-i)[[:upper:][:lower:]]
[[:alnum:]]    =  \p{alnum}                                 an ALPHANUMERIC character            138  =  (?-i)[[:upper:][:lower:][:digit:]]
[[:alpha:]]    =  \p{alpha}                                 any LETTER character                 125  =  (?-i)[[:upper:][:lower:]]
[[:graph:]]    =  \p{graph}                                 any VISIBLE character                212  =  [^\x00-\x1F\x20\x7F\x80\x81\x88\x8D\x8F\x90\x98\x99\x9D\xA0]
[[:print:]]    =  \p{print}                                 any PRINTABLE character              219  =  [[:graph:]]|\s
[[:punct:]]    =  \p{punct}                                 any PUNCTUATION or SYMBOL character   80  =  [\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E\x82\x84-\x87\x89\x8B\x91-\x97\x9B\xA1-\xBF\xD7\xF7]
[[:xdigit:]]                                                an HEXADECIMAL character              22  =  [0-9A-Fa-f] = (?i)[0-9A-F]
```
NO results regarding the Unicode character classes against the `Total_ANSI.txt` file, because any such class logically returns an `Invalid Regular Expression` message.

Remark:

- A negative POSIX character class can be expressed as `[^[:........:]]` or `[[:^........:]]`

No INVALID `UTF-8` chars can be found, as we're dealing with an `ANSI` file!
I tested ALL the `Equivalence` classes feature:

You can use any other equivalent character of the `a` letter to get the `15` matches ( for instance: `[[=ª=]]`, `[[=Å=]]`, `[[=ã=]]`, … )

Below is the list of all the equivalences of any char of the `Windows-1252` code-page, from `\x00` till `\xDE`, against the `Total_ANSI.txt` file. Note that I did not consider the equivalence classes which return only one match!

```
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]] = [[=alert=]]            => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]] = [[=backspace=]]        => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]] = [[=IS4=]]              => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]] = [[=IS3=]]              => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]] = [[=IS2=]]              => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]] = [[=IS1=]]              => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[='=]] = [[=apostrophe=]]      => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[=-=]] = [[=hyphen=]]          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[=–=]]                         => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[=—=]]                         => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]
[[==]]                          => 33  [\x00-\x08\x0E-\x1F\x27\x2D\x7F\x96\x97\xAD]

[[=1=]] = [[=one=]]             =>  2  [1¹]
[[=2=]] = [[=two=]]             =>  2  [2²]
[[=3=]] = [[=three=]]           =>  2  [3³]

[[=A=]]                         => 15  [AaªÀÁÂÃÄÅàáâãäå]
[[=B=]]                         =>  2  [Bb]
[[=C=]]                         =>  4  [CcÇç]
[[=D=]]                         =>  4  [DdÐð]
[[=E=]]                         => 10  [EeÈÉÊËèéêë]
[[=F=]]                         =>  3  [Ffƒ]
[[=G=]]                         =>  2  [Gg]
[[=H=]]                         =>  2  [Hh]
[[=I=]]                         => 10  [IiÌÍÎÏìíîï]
[[=J=]]                         =>  2  [Jj]
[[=K=]]                         =>  2  [Kk]
[[=L=]]                         =>  2  [Ll]
[[=M=]]                         =>  2  [Mm]
[[=N=]]                         =>  4  [NnÑñ]
[[=O=]]                         => 15  [OoºÒÓÔÕÖØòóôõöø]
[[=P=]]                         =>  2  [Pp]
[[=Q=]]                         =>  2  [Qq]
[[=R=]]                         =>  2  [Rr]
[[=S=]]                         =>  4  [SsŠš]
[[=T=]]                         =>  2  [Tt]
[[=U=]]                         => 10  [UuÙÚÛÜùúûü]
[[=V=]]                         =>  2  [Vv]
[[=W=]]                         =>  2  [Ww]
[[=X=]]                         =>  2  [Xx]
[[=Y=]]                         =>  6  [YyÝýÿŸ]
[[=Z=]]                         =>  4  [ZzŽž]
[[=^=]] = [[=circumflex=]]      =>  2  [^ˆ]
[[=Œ=]]                         =>  2  [Œœ]
[[=Þ=]]                         =>  2  [Þþ]
```
Some double-letter strings have equivalences which give you the right single char to use, instead of the two separate letters:

```
[[=AE=]] = [[=Ae=]] = [[=ae=]]  =>  2  [Ææ]
[[=SS=]] = [[=Ss=]] = [[=ss=]]  =>  1  [ß]
```
An example: let's suppose that we run the regex `(?-i)[A-F[:lower:]]` against my `Total_ANSI.txt` file. It does give `71` matches, so `6` UPPER letters + `65` LOWER letters.

As, in an `ANSI` file, the `Match case` option or the `(?i)` modifier is effective for `POSIX` character classes, if we run the same regex in an insensitive way, the `(?i)[A-F[:lower:]]` regex returns, this time, `125` matches.

And note that the regex `(?-i)[[:upper:][:lower:]]` or `(?i)[[:upper:][:lower:]]` acts as an insensitive regex and returns `125` matches ( so `60` UPPER letters + `65` LOWER letters ).

The regexes `(?-i)\u(?<=\l)` and `(?-i)(?=\l)\u` do not find any match. This implies that the sets of UPPER and LOWER letters are totally disjoint.
Finally, for `ANSI` files, the regex syntax `\X` is rather useless. Indeed, the Unicode block of `Combining Diacritical Marks` cannot be used anyway, and the `Emoji`, which are Unicode characters, are totally inaccessible to `ANSI` files. Thus, the `\X` regex is just equivalent to the simple regex `(?s).`
So, from this set of ANSI results, which ones seem quite odd, compared with the UNICODE results?

-

Maybe the regex `(?-s).` should just be equal to `[^\x0A\x0D]` and return `254` matches

-

The `[[:cntrl:]]` or `\p{cntrl}` class should be equal to `[\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D]` and return `38` characters or, maybe, `[\x00-\x1F\x7F]`, so `33` chars only
Regarding the `[[:graph:]]` character class, I created an identical `UTF-8` file, named `Total_UTF-8.txt`. Here are the results for the characters between `\x80` and `\xBF`, in both files ( all the other chars being identical ):

```
•---------•---------•--------------------•
|  ANSI   |  UTF-8  |  UNICODE Category  |
•---------•---------•--------------------•
|         |    €    |         Sc         |
|    ‚    |    ‚    |                    |
|    ƒ    |    ƒ    |                    |
|    „    |    „    |                    |
|    …    |    …    |                    |
|    †    |    †    |                    |
|    ‡    |    ‡    |                    |
|         |    ˆ    |         Po         |
|    ‰    |    ‰    |                    |
|    Š    |    Š    |                    |
|    ‹    |    ‹    |                    |
|    Œ    |    Œ    |                    |
|    Ž    |    Ž    |                    |
|    ‘    |    ‘    |                    |
|    ’    |    ’    |                    |
|    “    |    “    |                    |
|    ”    |    ”    |                    |
|    •    |    •    |                    |
|    –    |    –    |                    |
|    —    |    —    |                    |
|         |    ˜    |         Sk         |
|         |    ™    |         So         |
|    š    |    š    |                    |
|    ›    |    ›    |                    |
|    œ    |    œ    |                    |
|    ž    |    ž    |                    |
|    Ÿ    |    Ÿ    |                    |
|    ¡    |    ¡    |                    |
|    ¢    |    ¢    |                    |
|    £    |    £    |                    |
|    ¤    |    ¤    |                    |
|    ¥    |    ¥    |                    |
|    ¦    |    ¦    |                    |
|    §    |    §    |                    |
|    ¨    |    ¨    |                    |
|    ©    |    ©    |                    |
|    ª    |    ª    |                    |
|    «    |    «    |                    |
|    ¬    |    ¬    |                    |
|         |         |         Cf         |
|    ®    |    ®    |                    |
|    ¯    |    ¯    |                    |
|    °    |    °    |                    |
|    ±    |    ±    |                    |
|    ²    |    ²    |                    |
|    ³    |    ³    |                    |
|    ´    |    ´    |                    |
|    µ    |    µ    |                    |
|    ¶    |    ¶    |                    |
|    ·    |    ·    |                    |
|    ¸    |    ¸    |                    |
|    ¹    |    ¹    |                    |
|    º    |    º    |                    |
|    »    |    »    |                    |
|    ¼    |    ¼    |                    |
|    ½    |    ½    |                    |
|    ¾    |    ¾    |                    |
|    ¿    |    ¿    |                    |
•---------•---------•--------------------•
```
\x80,\x88,\x98,\x99are not supposed to be part of the[[:graph:]]which represents the class of visible characters !?So, to harmonize the results, the rule should be :
-

When using the `[[:graph:]]` POSIX character class against an `ANSI` file:

-

The `[\x80\x88\x98\x99]` ANSI list of characters ( corresponding to the `[\x{20AC}\x{02C6}\x{02DC}\x{2122}]` UTF-8 list ) should be included in that class

-

The `\xAD` character ( or `\x{00AD}` ) should be excluded from that class!

-

Now, as the `[[:print:]]` POSIX character class is simply identical to the regex `[[:graph:]]|\s`, there is no need to investigate that character class!

See next post
-
Hi @Coises and All,
End of my reply:
In the same way, regarding the `[[:punct:]]` character class, here are the results for both the `Total_ANSI.txt` and `Total_UTF-8.txt` files:

```
•---------•---------•--------------------•
|  ANSI   |  UTF-8  |  UNICODE Category  |
•---------•---------•--------------------•
|    !    |    !    |         Po         |
|    "    |    "    |         Po         |
|    #    |    #    |         Po         |
|    $    |    $    |         Sc         |
|    %    |    %    |         Po         |
|    &    |    &    |         Po         |
|    '    |    '    |         Po         |
|    (    |    (    |         Ps         |
|    )    |    )    |         Pe         |
|    *    |    *    |         Po         |
|    +    |    +    |         Sm         |
|    ,    |    ,    |         Po         |
|    -    |    -    |         Pd         |
|    .    |    .    |         Po         |
|    /    |    /    |         Po         |
|    :    |    :    |         Po         |
|    ;    |    ;    |         Po         |
|    <    |    <    |         Sm         |
|    =    |    =    |         Sm         |
|    >    |    >    |         Sm         |
|    ?    |    ?    |         Po         |
|    @    |    @    |         Po         |
|    [    |    [    |         Ps         |
|    \    |    \    |         Po         |
|    ]    |    ]    |         Pe         |
|    ^    |    ^    |         Sk         |
|    _    |    _    |         Pc         |
|    `    |    `    |         Sk         |
|    {    |    {    |         Ps         |
|    |    |    |    |         Sm         |
|    }    |    }    |         Pe         |
|    ~    |    ~    |         Sm         |
•---------•---------•--------------------•
|         |    €    |         Sc         |
|    ‚    |    ‚    |         Ps         |
|    „    |    „    |         Ps         |
|    …    |    …    |         Po         |
|    †    |    †    |         Po         |
|    ‡    |    ‡    |         Po         |
|    ‰    |    ‰    |         Po         |
|    ‹    |    ‹    |         Pi         |
|    ‘    |    ‘    |         Pi         |
|    ’    |    ’    |         Pf         |
|    “    |    “    |         Pi         |
|    ”    |    ”    |         Pf         |
|    •    |    •    |         Po         |
|    –    |    –    |         Pd         |
|    —    |    —    |         Pd         |
|         |    ˜    |         Sk         |
|         |    ™    |         So         |
|    ›    |    ›    |         Pf         |
|    ¡    |    ¡    |         Po         |
|    ¢    |    ¢    |         Sc         |
|    £    |    £    |         Sc         |
|    ¤    |    ¤    |         Sc         |
|    ¥    |    ¥    |         Sc         |
|    ¦    |    ¦    |         So         |
|    §    |    §    |         Po         |
|    ¨    |    ¨    |         Sk         |
|    ©    |    ©    |         So         |
|    ª    |         |         Lo         |
|    «    |    «    |         Pi         |
|    ¬    |    ¬    |         Sm         |
|         |         |         Cf         |
|    ®    |    ®    |         So         |
|    ¯    |    ¯    |         Sk         |
|    °    |    °    |         So         |
|    ±    |    ±    |         Sm         |
|    ²    |         |         No         |
|    ³    |         |         No         |
|    ´    |    ´    |         Sk         |
|    µ    |         |         Ll         |
|    ¶    |    ¶    |         Po         |
|    ·    |    ·    |         Po         |
|    ¸    |    ¸    |         Sk         |
|    ¹    |         |         No         |
|    º    |         |         Lo         |
|    »    |    »    |         Pf         |
|    ¼    |         |         No         |
|    ½    |         |         No         |
|    ¾    |         |         No         |
|    ¿    |    ¿    |         Po         |
|    ×    |    ×    |         Sm         |
|    ÷    |    ÷    |         Sm         |
•---------•---------•--------------------•
```
[[:punct:]]POSIX character class is the union of the TWO Unicode classes\p{P*}and\p{S*}, this means that all the[[:punct:]]characters, found inTotal_UTF-8.txt, are exact !However, it’s obvious that it’s not the case for the
[[:punct:]]characters found inTotal_ANSI.txt:So, again, to harmonize the results, the rule should be :
-

When using the `[[:punct:]]` POSIX character class against an `ANSI` file:

-

The `[\xAA\xAD\xB2\xB3\xB5\xB9\xBA\xBC\xBD\xBE]` list of characters should be excluded from that class!

-

The `[\x80\x98\x99]` ANSI list of characters ( corresponding to the `[\x{20AC}\x{02DC}\x{2122}]` UTF-8 list ) should be included in that class

-

And this result would confirm that the POSIX `[[:punct:]]` character class is equal to the `\p{P*}|\p{S*}` regex, in all cases!
-
Regarding the `Equivalence` classes whose results are presently `33`, the rule should be:

-

All the `Control` codes should just match their own character. For example, `[[==]]` ( with the DEL character between the `=` signs ) should return 1 match, `[\x7F]`

-

`[[='=]]` = `[[=apostrophe=]]` should return 1 match, `[\x27]`

-

`[[=-=]]` = `[[=hyphen=]]` should return 1 match, `[\x2D]`

-

`[[=–=]]` should return 1 match, `[\x96]`

-

`[[=—=]]` should return 1 match, `[\x97]`

-

`[[==]]` ( with the SOFT HYPHEN character between the `=` signs ) should return 1 match, `[\xAD]`
-
Now, when doing tests with UNICODE files, I forgot the equivalence classes of the `Control C0/C1` and `Control Format` characters! So the results, against my `Total_Chars.txt` UTF-8 file, are:

```
[[=nul=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cc
[[=soh=]]                          =>     1  [\x{0001}]                  Cc
[[=stx=]]                          =>     1  [\x{0002}]                  Cc
[[=etx=]]                          =>     1  [\x{0003}]                  Cc
[[=eot=]]                          =>     1  [\x{0004}]                  Cc
[[=enq=]]                          =>     1  [\x{0005}]                  Cc
[[=ack=]]                          =>     1  [\x{0006}]                  Cc
[[=bel=]] = [[=alert=]]            =>     1  [\x{0007}]                  Cc
[[=bs=]]  = [[=backspace=]]        =>     1  [\x{0008}]                  Cc
[[=ht=]]  = [[=tab=]]              =>     1  [\x{0009}]                  Cc
[[=lf=]]  = [[=newline=]]          =>     1  [\x{000A}]                  Cc
[[=vt=]]  = [[=vertical-tab=]]     =>     1  [\x{000B}]                  Cc
[[=ff=]]  = [[=form-feed=]]        =>     1  [\x{000C}]                  Cc
[[=cr=]]  = [[=carriage-return=]]  =>     1  [\x{000D}]                  Cc
[[=so=]]                           =>     1  [\x{000E}]                  Cc
[[=si=]]                           =>     1  [\x{000F}]                  Cc
[[=dle=]]                          =>     1  [\x{0010}]                  Cc
[[=dc1=]]                          =>     1  [\x{0011}]                  Cc
[[=dc2=]]                          =>     1  [\x{0012}]                  Cc
[[=dc3=]]                          =>     1  [\x{0013}]                  Cc
[[=dc4=]]                          =>     1  [\x{0014}]                  Cc
[[=nak=]]                          =>     1  [\x{0015}]                  Cc
[[=syn=]]                          =>     1  [\x{0016}]                  Cc
[[=etb=]]                          =>     1  [\x{0017}]                  Cc
[[=can=]]                          =>     1  [\x{0018}]                  Cc
[[=em=]]                           =>     1  [\x{0019}]                  Cc
[[=sub=]]                          =>     1  [\x{001A}]                  Cc
[[=esc=]]                          =>     1  [\x{001B}]                  Cc
[[=fs=]]                           =>     1  [\x{001C}]                  Cc
[[=gs=]]                           =>     1  [\x{001D}]                  Cc
[[=rs=]]                           =>     1  [\x{001E}]                  Cc
[[=us=]]                           =>     1  [\x{001F}]                  Cc
[[= =]]                            =>     3  [\x{0020}\x{205F}\x{3000}]  Zs
[[=del=]]                          =>     1  [\x{007F}]                  Cc
[[=pad=]]                          =>     1  [\x{0080}]                  Cc
[[=hop=]]                          =>     1  [\x{0081}]                  Cc
[[=bph=]]                          =>     1  [\x{0082}]                  Cc
[[=nbh=]]                          =>     1  [\x{0083}]                  Cc
[[=ind=]]                          =>     1  [\x{0084}]                  Cc
[[=nel=]]                          =>     1  [\x{0085}]                  Cc
[[=ssa=]]                          =>     1  [\x{0086}]                  Cc
[[=esa=]]                          =>     1  [\x{0087}]                  Cc
[[=hts=]]                          =>     1  [\x{0088}]                  Cc
[[=htj=]]                          =>     1  [\x{0089}]                  Cc
[[=lts=]]                          =>     1  [\x{008A}]                  Cc
[[=pld=]]                          =>     1  [\x{008B}]                  Cc
[[=plu=]]                          =>     1  [\x{008C}]                  Cc
[[=ri=]]                           =>     1  [\x{008D}]                  Cc
[[=ss2=]]                          =>     1  [\x{008E}]                  Cc
[[=ss3=]]                          =>     1  [\x{008F}]                  Cc
[[=dcs=]]                          =>     1  [\x{0090}]                  Cc
[[=pu1=]]                          =>     1  [\x{0091}]                  Cc
[[=pu2=]]                          =>     1  [\x{0092}]                  Cc
[[=sts=]]                          =>     1  [\x{0093}]                  Cc
[[=cch=]]                          =>     1  [\x{0094}]                  Cc
[[=mw=]]                           =>     1  [\x{0095}]                  Cc
[[=spa=]]                          =>     1  [\x{0096}]                  Cc
[[=epa=]]                          =>     1  [\x{0097}]                  Cc
[[=sos=]]                          =>     1  [\x{0098}]                  Cc
[[=sgci=]]                         =>     1  [\x{0099}]                  Cc
[[=sci=]]                          =>     1  [\x{009A}]                  Cc
[[=csi=]]                          =>     1  [\x{009B}]                  Cc
[[=st=]]                           =>     1  [\x{009C}]                  Cc
[[=osc=]]                          =>     1  [\x{009D}]                  Cc
[[=pm=]]                           =>     1  [\x{009E}]                  Cc
[[=apc=]]                          =>     1  [\x{009F}]                  Cc
[[=nbsp=]]                         =>     1  [\x{00A0}]                  Cc
[[=shy=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=alm=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=sam=]]                          =>     2  [\x{070F}\x{2E1A}]          Po
[[=ospm=]]                         =>     1  [\x{1680}]                  Zs
[[=mvs=]]                          =>     1  [\x{180E}]                  Cf
[[=nqsp=]]                         =>     2  [\x{2000}\x{2002}]          Zs
[[=mqsp=]]                         =>     2  [\x{2001}\x{2003}]          Zs
[[=ensp=]]                         =>     2  [\x{2000}\x{2002}]          Zs
[[=emsp=]]                         =>     2  [\x{2001}\x{2003}]          Zs
[[=3/msp=]]                        =>     1  [\x{2004}]                  Zs
[[=4/msp=]]                        =>     1  [\x{2005}]                  Zs
[[=6/msp=]]                        =>     1  [\x{2006}]                  Zs
[[=fsp=]]                          =>     1  [\x{2007}]                  Zs
[[=psp=]]                          =>     1  [\x{2008}]                  Zs
[[=thsp=]]                         =>     1  [\x{2009}]                  Zs
[[=hsp=]]                          =>     1  [\x{200A}]                  Zs
[[=zwsp=]]                         =>     1  [\x{200B}]                  Cf
[[=zwnj=]]                         => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=zwj=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=lrm=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=rlm=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=ls=]]                           =>     2  [\x{2028}\x{FE47}]          Zl
[[=ps=]]                           =>     1  [\x{2029}]                  Zp
[[=lre=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=rle=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=pdf=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=lro=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=rlo=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=nnbsp=]]                        =>     1  [\x{202F}]                  Zs
[[=mmsp=]]                         =>     3  [\x{0020}\x{205F}\x{3000}]  Zs
[[=wj=]]                           => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=(fa)=]]                         => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=(it)=]]                         => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=(is)=]]                         => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=(ip)=]]                         => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=lri=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=rli=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=fsi=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=pdi=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=iss=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=ass=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=iafs=]]                         => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=aafs=]]                         => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=nads=]]                         => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=nods=]]                         => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=idsp=]]                         =>     3  [\x{0020}\x{205F}\x{3000}]  Zs
[[=zwnbsp=]]                       => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=iaa=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=ias=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=iat=]]                          => 3,309  [\x{0000}\x{00AD}....]      Cf
[[=sflo=]]                         =>     1  [\x{1BCA0}]                 Cf
[[=sfco=]]                         =>     1  [\x{1BCA1}]                 Cf
[[=sfds=]]                         =>     1  [\x{1BCA2}]                 Cf
[[=sfus=]]                         =>     1  [\x{1BCA3}]                 Cf
```
Formatcharacters give the erroneous result of3,309occurrences. But we’re not going to bother about these wrongequivalenceclasses, as long as the similarcollatingnames, with the[[.XXX.]]syntax, are totally correct !Luckily, all the other equivalence classes are quite correct, except for
[[=ls=]]which returns2matches\x{2028}and\x{FE47}??Also a detail !
Best Regards,
guy038
-
-
@guy038 said in Columns++ version 1.2: better Unicode search:
But, on the other hand, the search of the regex :
```
[[.NUL.][.SOH.][.STX.][.ETX.][.EOT.][.ENQ.][.ACK.][.BEL.][.BS.][.HT.][.VT.][.FF.][.SO.][.SI.][.DLE.][.DC1.][.DC2.][.DC3.][.DC4.][.NAK.][.SYN.][.ETB.][.CAN.][.EM.][.SUB.][.ESC.][.FS.][.GS.][.RS.][.US.][.DEL.][.HOP.][.RI.][.SS3.][.DCS.][.OCS.][.SHY.]]
```

leads to an `Invalid Regex` message. Logical, as this kind of search concerns `Unicode` files only.
There is a typo in that. The next-to-last symbolic name should be OSC, not OCS. (See list at the end of this help section.)
However, it still won’t work in ANSI search, because ANSI search only supports these POSIX symbolic names as defined by Boost::regex.
The regular expression language for ANSI files is exactly the same as it is in Notepad++ search, because I have not changed the underlying Boost::regex engine’s behavior for ANSI files. I only changed the way the engine works for UTF-8 files.
Some things, like stepwise find and replace with `\K`, formulas in replacement strings, and counting null matches (my Count counts them, Notepad++'s doesn't), differ for both ANSI and UTF-8 because I changed the surrounding code that uses the Boost::regex engine; but the matching itself is unchanged for ANSI.
This is why the character classes behave differently as well. Boost::regex relies on GetStringTypeExA (which is similar to GetStringTypeExW except for the third argument being `char*` instead of `wchar_t*`) to classify 8-bit characters according to the Ctype 1 list here. The classification depends on the current locale (which should imply the system default code page, which is the only code page Notepad++ ever uses as ANSI — documents in other code pages are converted to UTF-8). ANSI regular expressions, per Boost::regex design, use whatever information Windows gives them.
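A minimal sketch of what that looks like from the API side ( ordinary Win32 code, not Columns++ itself ): it dumps the Ctype 1 bits Windows assigns to each high byte of the ANSI code page, which is exactly the information the narrow Boost::regex engine gets to work with.

```cpp
// Minimal sketch: ask Windows how it classifies each 8-bit character, via
// the same GetStringTypeExA call that Boost::regex relies on for ANSI text.
#include <windows.h>
#include <cstdio>

int main() {
    for (int c = 0x80; c <= 0xFF; ++c) {
        char ch = static_cast<char>(c);
        WORD type = 0;
        if (!GetStringTypeExA(LOCALE_USER_DEFAULT, CT_CTYPE1, &ch, 1, &type))
            continue;
        std::printf("0x%02X:%s%s%s%s%s%s%s\n", c,
                    (type & C1_CNTRL) ? " cntrl" : "",
                    (type & C1_PUNCT) ? " punct" : "",
                    (type & C1_UPPER) ? " upper" : "",
                    (type & C1_LOWER) ? " lower" : "",
                    (type & C1_DIGIT) ? " digit" : "",
                    (type & C1_SPACE) ? " space" : "",
                    (type & C1_ALPHA) ? " alpha" : "");
    }
    // On a Windows-1252 system this shows, for instance, 0xBC-0xBE (¼ ½ ¾)
    // flagged as punct, which is why the ANSI [[:punct:]] class matches them.
}
```

-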
Hi, @coises and All,
I think this will be the last answer concerning your `Columns++` v1.2 plugin!

Here is the recapitulation of the way to access the invisible characters, whatever the file type:

For `ANSI` files: just one possible syntax for these `collating` names:

```
[[.NUL.][.SOH.][.STX.][.ETX.][.EOT.][.ENQ.][.ACK.][.alert.][.backspace.][.tab.][.newline.][.vertical-tab.][.form-feed.][.carriage-return.][.SO.][.SI.][.DLE.][.DC1.][.DC2.][.DC3.][.DC4.][.NAK.][.SYN.][.ETB.][.CAN.][.EM.][.SUB.][.ESC.][.IS4.][.IS3.][.IS2.][.IS1.][.DEL.]]
```

which returns `33` matches against the `Total_ANSI.txt` file, which contains the `256` characters of the `Win-1252` encoding

-

Note that the lowercase syntax is NOT allowed, in `ANSI` files, for ANY of the `collating` names written above in UPPER case

-

Note also that the four chars from `\x1c` to `\x1f` must be referred to as `IS4` to `IS1`, in UPPER case ( and NOT as `fs` to `us`! )

For `UTF-8` files: two possible syntaxes for these `collating` names:

```
[[.nul.][.soh.][.stx.][.etx.][.eot.][.enq.][.ack.][.bel.][.bs.][.ht.][.lf.][.vt.][.ff.][.cr.][.so.][.si.][.dle.][.dc1.][.dc2.][.dc3.][.dc4.][.nak.][.syn.][.etb.][.can.][.em.][.sub.][.esc.][.fs.][.gs.][.rs.][.us.][.del.][.pad.][.hop.][.bph.][.nbh.][.ind.][.nel.][.ssa.][.esa.][.hts.][.htj.][.lts.][.pld.][.plu.][.ri.][.ss2.][.ss3.][.dcs.][.pu1.][.pu2.][.sts.][.cch.][.mw.][.spa.][.epa.][.sos.][.sgci.][.sci.][.csi.][.st.][.osc.][.pm.][.apc.][.nbsp.][.shy.][.alm.][.sam.][.ospm.][.mvs.][.nqsp.][.mqsp.][.ensp.][.emsp.][.3/msp.][.4/msp.][.6/msp.][.fsp.][.psp.][.thsp.][.hsp.][.zwsp.][.zwnj.][.zwj.][.lrm.][.rlm.][.ls.][.ps.][.lre.][.rle.][.pdf.][.lro.][.rlo.][.nnbsp.][.mmsp.][.wj.][.(fa).][.(it).][.(is).][.(ip).][.lri.][.rli.][.fsi.][.pdi.][.iss.][.ass.][.iafs.][.aafs.][.nads.][.nods.][.idsp.][.zwnbsp.][.iaa.][.ias.][.iat.][.sflo.][.sfco.][.sfds.][.sfus.]]
```

which returns `120` matches against the `Total_Chars.txt` file

```
[[.nul.][.soh.][.stx.][.etx.][.eot.][.enq.][.ack.][.alert.][.backspace.][.tab.][.newline.][.vertical-tab.][.form-feed.][.carriage-return.][.so.][.si.][.dle.][.dc1.][.dc2.][.dc3.][.dc4.][.nak.][.syn.][.etb.][.can.][.em.][.sub.][.esc.][.fs.][.gs.][.rs.][.us.][.del.][.pad.][.hop.][.bph.][.nbh.][.ind.][.nel.][.ssa.][.esa.][.hts.][.htj.][.lts.][.pld.][.plu.][.ri.][.ss2.][.ss3.][.dcs.][.pu1.][.pu2.][.sts.][.cch.][.mw.][.spa.][.epa.][.sos.][.sgci.][.sci.][.csi.][.st.][.osc.][.pm.][.apc.][.nbsp.][.shy.][.alm.][.sam.][.ospm.][.mvs.][.nqsp.][.mqsp.][.ensp.][.emsp.][.3/msp.][.4/msp.][.6/msp.][.fsp.][.psp.][.thsp.][.hsp.][.zwsp.][.zwnj.][.zwj.][.lrm.][.rlm.][.ls.][.ps.][.lre.][.rle.][.pdf.][.lro.][.rlo.][.nnbsp.][.mmsp.][.wj.][.(fa).][.(it).][.(is).][.(ip).][.lri.][.rli.][.fsi.][.pdi.][.iss.][.ass.][.iafs.][.aafs.][.nads.][.nods.][.idsp.][.zwnbsp.][.iaa.][.ias.][.iat.][.sflo.][.sfco.][.sfds.][.sfus.]]
```

which returns `120` matches against the `Total_Chars.txt` file

-

Note that the UPPERCASE syntax is allowed, in `UTF-8` files, for ANY `collating` name written above in lower case

Finally, for an `ANSI` file containing the `256` chars of the `Win-1252` encoding and converted to a `UTF-8` file ( `Encoding > Convert to UTF-8` ), two syntaxes are possible:

```
[[.nul.][.soh.][.stx.][.etx.][.eot.][.enq.][.ack.][.bel.][.bs.][.ht.][.lf.][.vt.][.ff.][.cr.][.so.][.si.][.dle.][.dc1.][.dc2.][.dc3.][.dc4.][.nak.][.syn.][.etb.][.can.][.em.][.sub.][.esc.][.fs.][.gs.][.rs.][.us.][.del.][.pad.][.hop.][.ri.][.ss3.][.dcs.][.osc.][.nbsp.][.shy.]]
```

which returns `40` matches against the `Total_UTF-8.txt` file

```
[[.nul.][.soh.][.stx.][.etx.][.eot.][.enq.][.ack.][.alert.][.backspace.][.tab.][.newline.][.vertical-tab.][.form-feed.][.carriage-return.][.so.][.si.][.dle.][.dc1.][.dc2.][.dc3.][.dc4.][.nak.][.syn.][.etb.][.can.][.em.][.sub.][.esc.][.fs.][.gs.][.rs.][.us.][.del.][.hop.][.ri.][.ss3.][.dcs.][.osc.][.nbsp.][.shy.]]
```

which returns `40` matches against the `Total_UTF-8.txt` file

-

Note that the UPPERCASE syntax is allowed, in `UTF-8` files, for ANY `collating` name written above in lower case
Now, against the `Total_ANSI.txt` file, containing the first `256` UNICODE characters, we get these results:

```
(?s).                          ANY character                          => 256
(?-s).                         ANY character other than LINE-BREAKS   => 253  =  [^\x0A\x0C\x0D]
[[:unicode:]]  =  \p{unicode}  an OVER \x{00FF} character             =>   0  =  [^\x00-\xFF]
[[:cntrl:]]    =  \p{cntrl}    a CONTROL code character               =>  39  =  [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D\xAD]
[[:space:]]    =  \p{space}    a WHITE-SPACE character                =>   7  =  [\t\n\x0B\f\r\x20\xA0]
[[:blank:]]    =  \p{blank}    a BLANK character                      =>   3  =  [\t\x20\xA0]
[[:upper:]]    =  \p{upper}    an UPPER case letter                   =>  60  =  [A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]
[[:lower:]]    =  \p{lower}    a LOWER case letter                    =>  65  =  [a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]
[[:digit:]]    =  \p{digit}    a DECIMAL number                       =>  13  =  [0-9²³¹]
[[:word:]]     =  \p{word}     a WORD character                       => 139  =  [[:alnum:]]|\x5F = \p{alnum}|\x5F
[[:punct:]]    =  \p{punct}    any PUNCTUATION or SYMBOL character    =>  80  =  [\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E\x82\x84-\x87\x89\x8B\x91-\x97\x9B\xA1-\xBF\xD7\xF7]
[[:alpha:]]    =  \p{alpha}    any LETTER character                   => 125  =  (?-i)[[:upper:][:lower:]]
[[:alnum:]]    =  \p{alnum}    an ALPHANUMERIC character              => 138  =  (?-i)[[:upper:][:lower:][:digit:]]
[[:graph:]]    =  \p{graph}    any VISIBLE character                  => 212  =  [^\x00-\x1F\x20\x7F\x80\x81\x88\x8D\x8F\x90\x98\x99\x9D\xA0]
[[:print:]]    =  \p{print}    any PRINTABLE character                => 219  =  [[:graph:][:space:]] = [^\x00-\x1F\x20\x7F\x80\x81\x88\x8D\x8F\x90\x98\x99\x9D\xA0]|[[:space:]]
[[:xdigit:]]                   an HEXADECIMAL character               =>  22  =  [0-9A-Fa-f] = (?i)[0-9A-F]
```

Remark: the `[[:unicode:]]` class, for characters OVER `\x{00FF}`, must correspond to the C1_DEFINED type from the Ctype 1 list here.
From this same article, and after I realized that the POSIX classes are not totally independent, I deduced this layout:

```
C1_DEFINED   Other characters       0
C1_CNTRL     Control characters    39
C1_SPACE     Space characters       2   ( only the SPACE and NBSP chars, OUT of 7, as ALL the others are
                                          ALREADY included in the CNTRL chars class )
C1_UPPER     Uppercase             60
C1_LOWER     Lowercase             65
C1_DIGIT     Decimal digits        13
C1_PUNCT     Punctuation           73   ( and NOT 80, because the \xAD char is ALREADY included in the CNTRL chars class,
                                          because the \xAA, \xB5 and \xBA are ALREADY included in the LOWER chars class,
                                          and because the \xB2, \xB3 and \xB9 are ALREADY included in the DIGIT chars class )
                                 -----
                          TOTAL :  252 chars
```

So, if I exclude from my `Total_ANSI.txt` file all the following classes, with this S/R:

FIND `[[:cntrl:][:space:][:upper:][:lower:][:digit:][:punct:]]`

REPLACE `Leave EMPTY`

either with your plugin or with native N++, there remain `4` characters ( 256 - 252 ), which are the € ( `\x{20AC}` ), ˆ ( `\x{02C6}` ), ˜ ( `\x{02DC}` ) and ™ ( `\x{2122}` ) characters.

Moreover, absolutely no POSIX character class, and of course no UNICODE character class, can find these `4` characters!

Thus, the only way to find one of these `4` characters, in an `ANSI` file, is to use the regex `[\x80\x88\x98\x99]` or to use the characters themselves :-((
In this article, it is also said:

Printable | Graphic characters and blanks (all C1_* types except C1_CNTRL)

Thus, from the previous total of chars of my `Total_ANSI.txt` file, the `[[:print:]]` class should detect `252 - 39`, so `213` matches.

And, as `[[:graph:]]` = `[[:print:]]` - `[[:space:]]`, this means that `[[:graph:]]` should return `213 - 2`, so `211` matches.

But the current result is `212` matches. The difference of one unit comes from the `\xAD` char, which is part of both the `[[:cntrl:]]` and `[[:graph:]]` POSIX character classes!

If we remember the `4` lacking chars, which, obviously, are visible and printable, this means that `[[:graph:]]` and `[[:print:]]` should return, respectively, `215` ( 211 + 4 ) and `217` ( 213 + 4 ) matches, for `ANSI` files.

And it is easy to verify that `[[:print:]]` + `[[:cntrl:]]` = 217 + 39 = `256`!
Just for info: from the `Total_UTF-8.txt` file, containing these same chars, we get these results:

```
(?s).                          ANY character                          => 256
(?-s).                         ANY character other than LINE-BREAKS   => 254  =  [^\x0A\x0D]
[[:ascii:]]                    an UNDER \x{0080} character            => 128  =  [\x{0000}-\x{007F}] = \p{ascii}
[[:unicode:]]  =  \p{unicode}  an OVER \x{00FF} character             =>  27  =  [^\x00-\xFF] = [\x{20AC}\x{201A}\x{0192}\x{201E}\x{2026}\x{2020}\x{2021}\x{02C6}\x{2030}\x{0160}\x{2039}\x{0152}\x{017D}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{02DC}\x{2122}\x{0161}\x{203A}\x{0153}\x{017E}\x{0178}]
[[:cntrl:]]    =  \p{cntrl}    a CONTROL code character               =>  38  =  [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D] = \p{Cc}
[[:space:]]    =  \p{space}    a WHITE-SPACE character                =>   7  =  [\t\n\x0B\f\r\x20\xA0]
[[:blank:]]    =  \p{blank}    a BLANK character                      =>   3  =  [\t\x{0020}\x{00A0}] = \p{Zs}|\t
[[:upper:]]    =  \p{upper}    an UPPER case letter                   =>  60  =  [A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ] = \p{Lu}
[[:lower:]]    =  \p{lower}    a LOWER case letter                    =>  63  =  [a-zƒšœžµßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ] = \p{Ll}
[[:digit:]]    =  \p{digit}    a DECIMAL number                       =>  10  =  [0-9] = \p{Nd}
[[:word:]]     =  \p{word}     a WORD character                       => 137  =  \p{L*}|\p{Nd}|_
[[:graph:]]    =  \p{graph}    any VISIBLE character                  => 215  =  [^\x00-\x1F\x20\x7F\x81\x8D\x8F\x90\x9D\xA0\xAD] = (?![\x20\xA0\xAD])\P{Cc}
[[:print:]]    =  \p{print}    any PRINTABLE character                => 222  =  [[:graph:][:space:]] = [^\x00-\x1F\x20\x7F\x81\x8D\x8F\x90\x9D\xA0\xAD]|[[:space:]]
[[:punct:]]    =  \p{punct}    any PUNCTUATION or SYMBOL character    =>  73  =  \p{P*}|\p{S*} = [\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E\x{20AC}\x{201A}\x{201E}\x{2026}\x{2020}\x{2021}\x{2030}\x{2039}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{02DC}\x{2122}\x{203A}\xA1-\xA9\xAB\xAC\xAE-\xB1\xB4\xB6-\xB8\xBB\xBF\xD7\xF7]
[[:alpha:]]    =  \p{alpha}    any LETTER character                   => 126  =  \p{L*} = \p{Lu}|\p{Ll}|[ˆªº]
[[:alnum:]]    =  \p{alnum}    an ALPHANUMERIC character              => 136  =  \p{L*}|\p{Nd}
[[:xdigit:]]                   an HEXADECIMAL character               =>  22  =  [0-9A-Fa-f] = (?i)[0-9A-F]
```

Best regards,
guy038
-