Hi, @coises and All,
I think this will be the last answer concerning your Columns++_v1.2 plugin !
Here is a recap of how to match the invisible characters, whatever the file type :
For ANSI files : just one possible syntax for these collating names :
[[.NUL.][.SOH.][.STX.][.ETX.][.EOT.][.ENQ.][.ACK.][.alert.][.backspace.][.tab.][.newline.][.vertical-tab.][.form-feed.][.carriage-return.][.SO.][.SI.][.DLE.][.DC1.][.DC2.][.DC3.][.DC4.][.NAK.][.SYN.][.ETB.][.CAN.][.EM.][.SUB.][.ESC.][.IS4.][.IS3.][.IS2.][.IS1.][.DEL.]] which returns 33 matches, against the Total_ANSI.txt file, which contains the 256 characters of the Win-1252 encoding
Note that, in ANSI files, the lowercase syntax is NOT allowed for ANY collating name shown above in UPPER case
Note also that the four chars, from \x1c to \x1f, must be referred to as IS4 to IS1, in UPPER case ( and NOT as fs to us ! )
For UTF-8 files : two possible syntaxes for these collating names :
[[.nul.][.soh.][.stx.][.etx.][.eot.][.enq.][.ack.][.bel.][.bs.][.ht.][.lf.][.vt.][.ff.][.cr.][.so.][.si.][.dle.][.dc1.][.dc2.][.dc3.][.dc4.][.nak.][.syn.][.etb.][.can.][.em.][.sub.][.esc.][.fs.][.gs.][.rs.][.us.][.del.][.pad.][.hop.][.bph.][.nbh.][.ind.][.nel.][.ssa.][.esa.][.hts.][.htj.][.lts.][.pld.][.plu.][.ri.][.ss2.][.ss3.][.dcs.][.pu1.][.pu2.][.sts.][.cch.][.mw.][.spa.][.epa.][.sos.][.sgci.][.sci.][.csi.][.st.][.osc.][.pm.][.apc.][.nbsp.][.shy.][.alm.][.sam.][.ospm.][.mvs.][.nqsp.][.mqsp.][.ensp.][.emsp.][.3/msp.][.4/msp.][.6/msp.][.fsp.][.psp.][.thsp.][.hsp.][.zwsp.][.zwnj.][.zwj.][.lrm.][.rlm.][.ls.][.ps.][.lre.][.rle.][.pdf.][.lro.][.rlo.][.nnbsp.][.mmsp.][.wj.][.(fa).][.(it).][.(is).][.(ip).][.lri.][.rli.][.fsi.][.pdi.][.iss.][.ass.][.iafs.][.aafs.][.nads.][.nods.][.idsp.][.zwnbsp.][.iaa.][.ias.][.iat.][.sflo.][.sfco.][.sfds.][.sfus.]] which returns 120 matches, against the Total_Chars.txt file
[[.nul.][.soh.][.stx.][.etx.][.eot.][.enq.][.ack.][.alert.][.backspace.][.tab.][.newline.][.vertical-tab.][.form-feed.][.carriage-return.][.so.][.si.][.dle.][.dc1.][.dc2.][.dc3.][.dc4.][.nak.][.syn.][.etb.][.can.][.em.][.sub.][.esc.][.fs.][.gs.][.rs.][.us.][.del.][.pad.][.hop.][.bph.][.nbh.][.ind.][.nel.][.ssa.][.esa.][.hts.][.htj.][.lts.][.pld.][.plu.][.ri.][.ss2.][.ss3.][.dcs.][.pu1.][.pu2.][.sts.][.cch.][.mw.][.spa.][.epa.][.sos.][.sgci.][.sci.][.csi.][.st.][.osc.][.pm.][.apc.][.nbsp.][.shy.][.alm.][.sam.][.ospm.][.mvs.][.nqsp.][.mqsp.][.ensp.][.emsp.][.3/msp.][.4/msp.][.6/msp.][.fsp.][.psp.][.thsp.][.hsp.][.zwsp.][.zwnj.][.zwj.][.lrm.][.rlm.][.ls.][.ps.][.lre.][.rle.][.pdf.][.lro.][.rlo.][.nnbsp.][.mmsp.][.wj.][.(fa).][.(it).][.(is).][.(ip).][.lri.][.rli.][.fsi.][.pdi.][.iss.][.ass.][.iafs.][.aafs.][.nads.][.nods.][.idsp.][.zwnbsp.][.iaa.][.ias.][.iat.][.sflo.][.sfco.][.sfds.][.sfus.]] which returns 120 matches, against the Total_Chars.txt file
Note that, in UTF-8 files, the uppercase syntax is also allowed for ANY collating name shown above in LOWER case
Finally, for an ANSI file, containing the 256 chars of the Win-1252 encoding and converted as an UTF-8 file ( Encoding > Convert to UTF-8 ), two syntaxes are possible :
[[.nul.][.soh.][.stx.][.etx.][.eot.][.enq.][.ack.][.bel.][.bs.][.ht.][.lf.][.vt.][.ff.][.cr.][.so.][.si.][.dle.][.dc1.][.dc2.][.dc3.][.dc4.][.nak.][.syn.][.etb.][.can.][.em.][.sub.][.esc.][.fs.][.gs.][.rs.][.us.][.del.][.pad.][.hop.][.ri.][.ss3.][.dcs.][.osc.][.nbsp.][.shy.]] which returns 40 matches, against the Total_UTF-8.txt file
[[.nul.][.soh.][.stx.][.etx.][.eot.][.enq.][.ack.][.alert.][.backspace.][.tab.][.newline.][.vertical-tab.][.form-feed.][.carriage-return.][.so.][.si.][.dle.][.dc1.][.dc2.][.dc3.][.dc4.][.nak.][.syn.][.etb.][.can.][.em.][.sub.][.esc.][.fs.][.gs.][.rs.][.us.][.del.][.hop.][.ri.][.ss3.][.dcs.][.osc.][.nbsp.][.shy.]] which returns 40 matches, against the Total_UTF-8.txt file
Note that, in UTF-8 files, the uppercase syntax is also allowed for ANY collating name shown above in LOWER case
Now, against the Total_ANSI.txt file, containing the 256 characters of the Win-1252 encoding, we get these results :
(?s). ANY character => 256
(?-s). ANY character different from LINE-BREAKS => 253 = [^\x0A\x0C\x0D]
[[:unicode:]] = \p{unicode} an OVER \x{00FF} character => 0 = [^\x00-\xFF]
[[:cntrl:]] = \p{cntrl} a CONTROL code character => 39 = [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D\xAD]
[[:space:]] = \p{space} a WHITE-SPACE character => 7 = [\t\n\x0B\f\r\x20\xA0]
[[:blank:]] = \p{blank} a BLANK character => 3 = [\t\x20\xA0]
[[:upper:]] = \p{upper} an UPPER case letter => 60 = [A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]
[[:lower:]] = \p{lower} a LOWER case letter => 65 = [a-zƒšœžªµºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]
[[:digit:]] = \p{digit} a DECIMAL number => 13 = [0-9²³¹]
[[:word:]] = \p{word} a WORD character => 139 = [[:alnum:]]|\x5F = \p{alnum}|\x5F
[[:punct:]] = \p{punct} any PUNCTUATION or SYMBOL character => 80 = [\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E\x82\x84-\x87\x89\x8B\x91-\x97\x9B\xA1-\xBF\xD7\xF7]
[[:alpha:]] = \p{alpha} any LETTER character => 125 = (?-i)[[:upper:][:lower:]]
[[:alnum:]] = \p{alnum} an ALPHANUMERIC character => 138 = (?-i)[[:upper:][:lower:][:digit:]]
[[:graph:]] = \p{graph} any VISIBLE character => 212 = [^\x00-\x1F\x20\x7F\x80\x81\x88\x8D\x8F\x90\x98\x99\x9D\xA0]
[[:print:]] = \p{print} any PRINTABLE character => 219 = [[:graph:][:space:]] = [^\x00-\x1F\x20\x7F\x80\x81\x88\x8D\x8F\x90\x98\x99\x9D\xA0]|[[:space:]]
[[:xdigit:]] an HEXADECIMAL character => 22 = [0-9A-Fa-f] = (?i)[0-9A-F]
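As a rough cross-check of the explicit byte ranges given in the list above, here is a small Python sketch. It only counts the bytes covered by the hand-written ranges; it does not test the POSIX classes themselves, which depend on the regex engine used inside N++. It assumes that Total_ANSI.txt simply holds each byte value from 0x00 to 0xFF exactly once, and the names `data` and `count` are mine.

```python
import re

# Assumption: Total_ANSI.txt holds each byte value 0x00-0xFF exactly once
data = bytes(range(256))

def count(pattern: bytes) -> int:
    """Return the number of single-byte matches of a bytes regex in data."""
    return len(re.findall(pattern, data))

# Explicit equivalents copied from the ANSI list above
cntrl = count(rb'[\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D\xAD]')
space = count(rb'[\t\n\x0B\f\r\x20\xA0]')
punct = count(rb'[\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E\x82\x84-\x87'
              rb'\x89\x8B\x91-\x97\x9B\xA1-\xBF\xD7\xF7]')
graph = count(rb'[^\x00-\x1F\x20\x7F\x80\x81\x88\x8D\x8F\x90\x98\x99\x9D\xA0]')

print(cntrl, space, punct, graph)
```

Under that assumption, the four counts should agree with the 39, 7, 80 and 212 matches reported above.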
Remark : the [[:unicode:]] class, for characters OVER \x{00FF}, must correspond to the C1_DEFINED type from the Ctype 1 list here.
From this same article, and after I realized that the POSIX classes are not totally independent, I deduced this layout :
C1_DEFINED    Other characters        0
C1_CNTRL      Control characters     39
C1_SPACE      Space characters        2   ( only the SPACE and NBSP chars, OUT of 7, as ALL the others are ALREADY included in the CNTRL chars class )
C1_UPPER      Uppercase              60
C1_LOWER      Lowercase              65
C1_DIGIT      Decimal digits         13
C1_PUNCT      Punctuation            73   ( and NOT 80, because the \xAD char is ALREADY included in the CNTRL chars class,
                                            because the \xAA, \xB5 and \xBA chars are ALREADY included in the LOWER chars class,
                                            because the \xB2, \xB3 and \xB9 chars are ALREADY included in the DIGIT chars class )
                                    -----
TOTAL :                             252 chars
So, if I exclude, from my Total_ANSI.txt file, all the following classes with the S/R :
FIND [[:cntrl:][:space:][:upper:][:lower:][:digit:][:punct:]]
REPLACE Leave EMPTY
Either with your plugin or with native N++, 4 characters remain ( 256 - 252 ), which are the € ( \x{20AC} ), ˆ ( \x{02C6} ), ˜ ( \x{02DC} ) and ™ ( \x{2122} ) characters
Moreover, absolutely no POSIX character class and no UNICODE character class, of course, can find these 4 characters !
Thus, the only way to find one of these 4 characters, in an ANSI file, is to use the regex [\x80\x88\x98\x99] or to use the characters themselves :-((
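The 252 total and the 4 leftover bytes can be reproduced outside N++ with plain set arithmetic. The sketch below is only an illustration: the byte values are my own transcription of the explicit classes listed for the ANSI file (for instance Š Œ Ž Ÿ as 0x8A 0x8C 0x8E 0x9F), and the helper `rng` is a hypothetical name.

```python
def rng(first: int, last: int) -> set:
    """Inclusive range of byte values, as a set."""
    return set(range(first, last + 1))

# Byte sets transcribed from the explicit ANSI equivalents (my transcription)
cntrl = rng(0x00, 0x1F) | {0x7F, 0x81, 0x8D, 0x8F, 0x90, 0x9D, 0xAD}
space = {0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20, 0xA0}
upper = rng(0x41, 0x5A) | {0x8A, 0x8C, 0x8E, 0x9F} | rng(0xC0, 0xD6) | rng(0xD8, 0xDE)
lower = (rng(0x61, 0x7A) | {0x83, 0x9A, 0x9C, 0x9E, 0xAA, 0xB5, 0xBA}
         | rng(0xDF, 0xF6) | rng(0xF8, 0xFF))
digit = rng(0x30, 0x39) | {0xB2, 0xB3, 0xB9}
punct = (rng(0x21, 0x2F) | rng(0x3A, 0x40) | rng(0x5B, 0x60) | rng(0x7B, 0x7E)
         | {0x82, 0x89, 0x8B, 0x9B, 0xD7, 0xF7}
         | rng(0x84, 0x87) | rng(0x91, 0x97) | rng(0xA1, 0xBF))

# The union counts each byte once, so the overlaps are removed automatically
covered = cntrl | space | upper | lower | digit | punct
leftover = sorted(rng(0x00, 0xFF) - covered)

print(len(covered), [hex(b) for b in leftover])
print(bytes(leftover).decode("cp1252"))   # the leftover bytes seen as Win-1252 text
```

With these sets, the union holds 252 bytes and the leftover bytes are 0x80, 0x88, 0x98 and 0x99, which Python's cp1252 codec decodes as € ˆ ˜ ™.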
In this article, it is also said :
Printable | Graphic characters and blanks (all C1_* types except C1_CNTRL). Thus …
So, from the previous total of chars of my Total_ANSI.txt file, the [[:print:]] class should detect 252 - 39, so 213 matches.
Thus, as [[:graph:]] = [[:print:]] - [[:space:]], this means that [[:graph:]] should return 213 - 2, so 211 matches.
But the current result is 212 matches. The difference of one unit comes from the \xAD char, which is part of both the [[:cntrl:]] and [[:graph:]] POSIX character classes !
If we remember the 4 missing chars, which are obviously visible and printable, this means that [[:graph:]] and [[:print:]] should return, respectively, 215 ( 211 + 4 ) and 217 ( 213 + 4 ) matches, for ANSI files.
And it is easy to verify that [[:print:]] + [[:cntrl:]] = 217 + 39 = 256 !
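For anyone who wants to replay this bookkeeping, here is the same arithmetic as a tiny Python sketch. All the numbers come from the C1_* layout above; the variable names are mine.

```python
# Counts taken from the C1_* layout above (classes made mutually exclusive)
defined, cntrl, space_only = 0, 39, 2
upper, lower, digit, punct = 60, 65, 13, 73
missing = 4                                   # the 4 unclassified chars of the ANSI file

classified = defined + cntrl + space_only + upper + lower + digit + punct

print_expected = classified - cntrl           # all C1_* types except C1_CNTRL
graph_expected = print_expected - space_only  # printable chars minus SPACE and NBSP

# Adding back the 4 visible, printable chars that no class reports
graph_full = graph_expected + missing
print_full = print_expected + missing

print(classified, print_expected, graph_expected, graph_full, print_full)
print(print_full + cntrl)                     # should cover the whole 256-char file
```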
Just for info : from the Total_UTF-8.txt file, containing these same chars, we get these results :
(?s). ANY character => 256
(?-s). ANY character different from LINE-BREAKS => 254 = [^\x0A\x0D]
[[:ascii:]] an UNDER \x{0080} character => 128 = [\x{0000}-\x{007F}] = \p{ascii}
[[:unicode:]] = \p{unicode} an OVER \x{00FF} character => 27 = [^\x00-\xFF] = [\x{20AC}\x{201A}\x{0192}\x{201E}\x{2026}\x{2020}\x{2021}\x{02C6}\x{2030}\x{0160}\x{2039}\x{0152}\x{017D}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{02DC}\x{2122}\x{0161}\x{203A}\x{0153}\x{017E}\x{0178}]
[[:cntrl:]] = \p{cntrl} a CONTROL code character => 38 = [\x00-\x1F\x7F\x81\x8D\x8F\x90\x9D] = \p{Cc}
[[:space:]] = \p{space} a WHITE-SPACE character => 7 = [\t\n\x0B\f\r\x20\xA0]
[[:blank:]] = \p{blank} a BLANK character => 3 = [\t\x{0020}\x{00A0}] = \p{Zs}|\t
[[:upper:]] = \p{upper} an UPPER case letter => 60 = [A-ZŠŒŽŸÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ] = \p{Lu}
[[:lower:]] = \p{lower} a LOWER case letter => 63 = [a-zƒšœžµßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ] = \p{Ll}
[[:digit:]] = \p{digit} a DECIMAL number => 10 = [0-9] = \p{Nd}
[[:word:]] = \p{word} a WORD character => 137 = \p{L*}|\p{Nd}|_
[[:graph:]] = \p{graph} any VISIBLE character => 215 = [^\x00-\x1F\x20\x7F\x81\x8D\x8F\x90\x9D\xA0\xAD] = (?![\x20\xA0\xAD])\P{Cc}
[[:print:]] = \p{print} any PRINTABLE character => 222 = [[:graph:][:space:]] = [^\x00-\x1F\x20\x7F\x81\x8D\x8F\x90\x9D\xA0\xAD]|[[:space:]]
[[:punct:]] = \p{punct} any PUNCTUATION or SYMBOL character => 73 = \p{P*}|\p{S*} = [\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E\x{20AC}\x{201A}\x{201E}\x{2026}\x{2020}\x{2021}\x{2030}\x{2039}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{02DC}\x{2122}\x{203A}\xA1-\xA9\xAB\xAC\xAE-\xB1\xB4\xB6-\xB8\xBB\xBF\xD7\xF7]
[[:alpha:]] = \p{alpha} any LETTER character => 126 = \p{L*} = \p{Lu}|\p{Ll}|[ˆªº]
[[:alnum:]] = \p{alnum} an ALPHANUMERIC character => 136 = \p{L*}|\p{Nd}
[[:xdigit:]] an HEXADECIMAL character => 22 = [0-9A-Fa-f] = (?i)[0-9A-F]
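Some of the UTF-8 counts above can be compared with the Unicode general categories, independently of N++. The Python sketch below rebuilds the 256 characters from the Win-1252 bytes and tallies their categories with the standard `unicodedata` module. One assumption to keep in mind: Python's cp1252 codec rejects the 5 undefined bytes ( 0x81, 0x8D, 0x8F, 0x90, 0x9D ), so I map them to the C1 control of the same value, which seems to be what the conversion does, since [.hop.], [.ri.], [.ss3.], [.dcs.] and [.osc.] match in Total_UTF-8.txt. The function name `win1252_to_char` is hypothetical.

```python
import unicodedata
from collections import Counter

def win1252_to_char(b: int) -> str:
    """Decode one Win-1252 byte to a Unicode character."""
    try:
        return bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        # Assumption: undefined bytes become the C1 control with the same value
        return chr(b)

chars = [win1252_to_char(b) for b in range(256)]
cats = Counter(unicodedata.category(c) for c in chars)

over_ff = sum(1 for c in chars if ord(c) > 0xFF)                   # like [[:unicode:]]
letters = sum(n for cat, n in cats.items() if cat.startswith("L")) # like \p{L*}

print(over_ff, cats["Cc"], cats["Lu"], cats["Ll"], cats["Nd"], letters)
```

With a recent Python 3, this should give 27, 38, 60, 63, 10 and 126, in line with the [[:unicode:]], \p{Cc}, \p{Lu}, \p{Ll}, \p{Nd} and \p{L*} counts listed above.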
Best regards,
guy038