Hi All,
Continuation …
From this last table, we can reasonably ignore :
The space and soft hyphen characters
Some musical characters
All the characters, specific to a language, modern or archaic
The tag characters, whose usage is strongly discouraged by the Unicode Consortium.
In other words, every character marked No in the To search column of the previous table !
As a result, we only need to take care of this restricted list of 50 characters :
•-------•---------•---------------------------------------------•----•------------------•-----•
| Code  | Abbrev. | Character Name                              | Cg | N++ Regex        | Chr |
•-------•---------•---------------------------------------------•----•------------------•-----•
| 00A0  | NBSP    | NO-BREAK SPACE                              | Zs | \x{00A0}         |     |
|       |         |                                             |    |                  |     |
| 2000  | NQSP    | EN QUAD                                     | Zs | \x{2000}         |     |
| 2001  | MQSP    | EM QUAD                                     | Zs | \x{2001}         |     |
| 2002  | ENSP    | EN SPACE                                    | Zs | \x{2002}         |     |
| 2003  | EMSP    | EM SPACE                                    | Zs | \x{2003}         |     |
| 2004  | 3/MSP   | THREE-PER-EM SPACE                          | Zs | \x{2004}         |     |
| 2005  | 4/MSP   | FOUR-PER-EM SPACE                           | Zs | \x{2005}         |     |
| 2006  | 6/MSP   | SIX-PER-EM SPACE                            | Zs | \x{2006}         |     |
| 2007  | FSP     | FIGURE SPACE                                | Zs | \x{2007}         |     |
| 2008  | PSP     | PUNCTUATION SPACE                           | Zs | \x{2008}         |     |
| 2009  | THSP    | THIN SPACE                                  | Zs | \x{2009}         |     |
| 200A  | HSP     | HAIR SPACE                                  | Zs | \x{200A}         |     |
|       |         |                                             |    |                  |     |
| 200B  | ZWSP    | ZERO WIDTH SPACE                            | Cf | \x{200B}         |     |
| 200C  | ZWNJ    | ZERO WIDTH NON-JOINER                       | Cf | \x{200C}         |     |
| 200D  | ZWJ     | ZERO WIDTH JOINER                           | Cf | \x{200D}         |     |
| 200E  | LRM     | LEFT-TO-RIGHT MARK                          | Cf | \x{200E}         |     |
| 200F  | RLM     | RIGHT-TO-LEFT MARK                          | Cf | \x{200F}         |     |
| 202A  | LRE     | LEFT-TO-RIGHT EMBEDDING                     | Cf | \x{202A}         |     |
| 202B  | RLE     | RIGHT-TO-LEFT EMBEDDING                     | Cf | \x{202B}         |     |
| 202C  | PDF     | POP DIRECTIONAL FORMATTING                  | Cf | \x{202C}         |     |
| 202D  | LRO     | LEFT-TO-RIGHT OVERRIDE                      | Cf | \x{202D}         |     |
| 202E  | RLO     | RIGHT-TO-LEFT OVERRIDE                      | Cf | \x{202E}         |     |
|       |         |                                             |    |                  |     |
| 202F  | NNBSP   | NARROW NO-BREAK SPACE                       | Zs | \x{202F}         |     |
|       |         |                                             |    |                  |     |
| 205F  | MMSP    | MEDIUM MATHEMATICAL SPACE                   | Zs | \x{205F}         |     |
|       |         |                                             |    |                  |     |
| 2060  | WJ      | WORD JOINER                                 | Cf | \x{2060}         |     |
|       |         |                                             |    |                  |     |
| 2061  | (FA)    | FUNCTION APPLICATION                        | Cf | \x{2061}         |     |
| 2062  | (IT)    | INVISIBLE TIMES                             | Cf | \x{2062}         |     |
| 2063  | (IS)    | INVISIBLE SEPARATOR                         | Cf | \x{2063}         |     |
| 2064  | (IP)    | INVISIBLE PLUS                              | Cf | \x{2064}         |     |
|       |         |                                             |    |                  |     |
| 2066  | LRI     | LEFT-TO-RIGHT ISOLATE                       | Cf | \x{2066}         |     |
| 2067  | RLI     | RIGHT-TO-LEFT ISOLATE                       | Cf | \x{2067}         |     |
| 2068  | FSI     | FIRST STRONG ISOLATE                        | Cf | \x{2068}         |     |
| 2069  | PDI     | POP DIRECTIONAL ISOLATE                     | Cf | \x{2069}         |     |
| 206A  | ISS     | INHIBIT SYMMETRIC SWAPPING                  | Cf | \x{206A}         |     |
| 206B  | ASS     | ACTIVATE SYMMETRIC SWAPPING                 | Cf | \x{206B}         |     |
| 206C  | IAFS    | INHIBIT ARABIC FORM SHAPING                 | Cf | \x{206C}         |     |
| 206D  | AAFS    | ACTIVATE ARABIC FORM SHAPING                | Cf | \x{206D}         |     |
| 206E  | NADS    | NATIONAL DIGIT SHAPES                       | Cf | \x{206E}         |     |
| 206F  | NODS    | NOMINAL DIGIT SHAPES                        | Cf | \x{206F}         |     |
|       |         |                                             |    |                  |     |
| 3000  | IDSP    | IDEOGRAPHIC SPACE                           | Zs | \x{3000}         |     |
|       |         |                                             |    |                  |     |
| FEFF  | ZWNBSP  | ZERO WIDTH NO-BREAK SPACE / BYTE ORDER MARK | Cf | \x{FEFF}         |     |
|       |         |                                             |    |                  |     |
| FFF9  | IAA     | INTERLINEAR ANNOTATION ANCHOR               | Cf | \x{FFF9}         |     |
| FFFA  | IAS     | INTERLINEAR ANNOTATION SEPARATOR            | Cf | \x{FFFA}         |     |
| FFFB  | IAT     | INTERLINEAR ANNOTATION TERMINATOR           | Cf | \x{FFFB}         |     |
|       |         |                                             |    |                  |     |
| FFFC  | OBJ     | OBJECT REPLACEMENT CHARACTER                | So | \x{FFFC}         |    |
| FFFD  | ?       | REPLACEMENT CHARACTER                       | So | \x{FFFD}         |  �  |
|       |         |                                             |    |                  |     |
| 1BCA0 | (SFLO)  | SHORTHAND FORMAT LETTER OVERLAP             | Cf | \x{D82F}\x{DCA0} |     |
| 1BCA1 | (SFCO)  | SHORTHAND FORMAT CONTINUING OVERLAP         | Cf | \x{D82F}\x{DCA1} |     |
| 1BCA2 | (SFDS)  | SHORTHAND FORMAT DOWN STEP                  | Cf | \x{D82F}\x{DCA2} |     |
| 1BCA3 | (SFUS)  | SHORTHAND FORMAT UP STEP                    | Cf | \x{D82F}\x{DCA3} |     |
•-------•---------•---------------------------------------------•----•------------------•-----•
Note that I added, to that list, the two characters Object Replacement Character \x{FFFC} and Replacement Character \x{FFFD}, which often show up in case of encoding problems !
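As a side note, the Cg column (the Unicode General Category) can be cross-checked outside Notepad++ with the standard unicodedata module of Python. This is only a minimal sketch, assuming a plain Python 3 interpreter and not the PythonScript plugin. The SAMPLE dictionary and the check_categories helper are names made up for the illustration, and only a few rows of the table are listed :

```python
import unicodedata

# A few rows of the table : code point -> General Category shown in the Cg column.
# Hypothetical sample, to be extended with the other rows as needed.
SAMPLE = {
    0x00A0: "Zs",  # NO-BREAK SPACE
    0x200B: "Cf",  # ZERO WIDTH SPACE
    0x202F: "Zs",  # NARROW NO-BREAK SPACE
    0x2060: "Cf",  # WORD JOINER
    0x3000: "Zs",  # IDEOGRAPHIC SPACE
    0xFEFF: "Cf",  # ZERO WIDTH NO-BREAK SPACE / BYTE ORDER MARK
    0xFFFC: "So",  # OBJECT REPLACEMENT CHARACTER
    0xFFFD: "So",  # REPLACEMENT CHARACTER
}


def check_categories(expected):
    """Return the (code, expected, actual) tuples where the category differs."""
    mismatches = []
    for code, category in sorted(expected.items()):
        actual = unicodedata.category(chr(code))
        if actual != category:
            mismatches.append(("U+%04X" % code, category, actual))
    return mismatches


if __name__ == "__main__":
    # An empty list means that every sampled row agrees with unicodedata
    print(check_categories(SAMPLE))
```

The result depends on the Unicode database shipped with the Python version in use, so a mismatch would first be a hint to check that version.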
Then the updated Mark regex would be :
MARK [\x{00A0}\x{2000}-\x{200F}\x{202A}-\x{202F}\x{205F}-\x{206F}\x{3000}\x{FEFF}\x{FFF9}-\x{FFFD}]|\x{D82F}[\x{DCA0}-\x{DCA3}]
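For those who want to run the same check on a file outside Notepad++, the Mark regex above can be approximated with the re module of Python. This is a rough sketch, assuming Python 3, where a string is a sequence of code points, so the four Duployan shorthand characters are written directly as \U0001BCA0 to \U0001BCA3, without surrogate pairs. The names SPECIAL_CHARS and find_special_chars are mine, and the class lists exactly the 50 characters of the table ( U+2065, which is not in the table, is left out ) :

```python
import re

# Character class covering the 50 characters of the table.
# The Python escapes are expanded in the string literal, so the class
# contains the real characters and the ranges between them.
SPECIAL_CHARS = re.compile(
    "[\u00A0"                    # NBSP
    "\u2000-\u200F"              # spaces + ZWSP ... RLM
    "\u202A-\u202F"              # LRE ... RLO + NNBSP
    "\u205F-\u2064"              # MMSP, WJ, invisible operators
    "\u2066-\u206F"              # isolates + deprecated format chars
    "\u3000\uFEFF"               # IDSP, ZWNBSP / BOM
    "\uFFF9-\uFFFD"              # interlinear annotation, OBJ, replacement char
    "\U0001BCA0-\U0001BCA3]"     # shorthand format controls
)


def find_special_chars(text):
    """Return a list of (offset, 'U+XXXX') for each special character found."""
    return [(m.start(), "U+%04X" % ord(m.group()))
            for m in SPECIAL_CHARS.finditer(text)]


if __name__ == "__main__":
    sample = "a\u00A0b\u200Bc"
    for offset, code in find_special_chars(sample):
        print(offset, code)
```

The offsets are counted in characters, not in bytes, so they may differ from the positions shown by an editor working on the encoded file.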
And the updated Python script, to be run with the PythonScript plugin, is :
    # -*- coding: utf-8 -*-
    from Npp import editor, notepad, NOTIFICATION, MENUCOMMAND

    class SRFSC(object):

        def __init__(self):
            # Re-apply the representations each time a buffer is activated
            notepad.callback(self.callback_npp_BUFFERACTIVATED, [NOTIFICATION.BUFFERACTIVATED])
            self.callback_npp_BUFFERACTIVATED(None)

        def callback_npp_BUFFERACTIVATED(self, args):
            # SPACE chars ( Zs )
            editor.setRepresentation(u'\u00A0', "NBSP")    # no-break space
            editor.setRepresentation(u'\u2000', "NQSP")    # EN quad
            editor.setRepresentation(u'\u2001', "MQSP")    # EM quad
            editor.setRepresentation(u'\u2002', "ENSP")    # EN space
            editor.setRepresentation(u'\u2003', "EMSP")    # EM space
            editor.setRepresentation(u'\u2004', "3/MSP")   # three-per-EM space
            editor.setRepresentation(u'\u2005', "4/MSP")   # four-per-EM space
            editor.setRepresentation(u'\u2006', "6/MSP")   # six-per-EM space
            editor.setRepresentation(u'\u2007', "FSP")     # figure space
            editor.setRepresentation(u'\u2008', "PSP")     # punctuation space
            editor.setRepresentation(u'\u2009', "THSP")    # thin space
            editor.setRepresentation(u'\u200A', "HSP")     # hair space
            # FORMAT chars ( Cf )
            editor.setRepresentation(u'\u200B', "ZWSP")    # zero width space
            editor.setRepresentation(u'\u200C', "ZWNJ")    # zero width non-joiner
            editor.setRepresentation(u'\u200D', "ZWJ")     # zero width joiner
            editor.setRepresentation(u'\u200E', "LRM")     # left-to-right mark
            editor.setRepresentation(u'\u200F', "RLM")     # right-to-left mark
            editor.setRepresentation(u'\u202A', "LRE")     # left-to-right embedding
            editor.setRepresentation(u'\u202B', "RLE")     # right-to-left embedding
            editor.setRepresentation(u'\u202C', "PDF")     # pop directional formatting
            editor.setRepresentation(u'\u202D', "LRO")     # left-to-right override
            editor.setRepresentation(u'\u202E', "RLO")     # right-to-left override
            # SPACE chars ( Zs )
            editor.setRepresentation(u'\u202F', "NNBSP")   # narrow no-break space
            editor.setRepresentation(u'\u205F', "MMSP")    # medium mathematical space
            # FORMAT chars ( Cf )
            editor.setRepresentation(u'\u2060', "WJ")      # word joiner ( zero width no-break space )
            editor.setRepresentation(u'\u2061', "FA")      # function application
            editor.setRepresentation(u'\u2062', "IT")      # invisible times
            editor.setRepresentation(u'\u2063', "IS")      # invisible separator
            editor.setRepresentation(u'\u2064', "IP")      # invisible plus
            editor.setRepresentation(u'\u2066', "LRI")     # left-to-right isolate
            editor.setRepresentation(u'\u2067', "RLI")     # right-to-left isolate
            editor.setRepresentation(u'\u2068', "FSI")     # first strong isolate
            editor.setRepresentation(u'\u2069', "PDI")     # pop directional isolate
            # FORMAT chars ( Cf ) DEPRECATED
            editor.setRepresentation(u'\u206A', "ISS")     # inhibit symmetric swapping
            editor.setRepresentation(u'\u206B', "ASS")     # activate symmetric swapping
            editor.setRepresentation(u'\u206C', "IAFS")    # inhibit arabic form shaping
            editor.setRepresentation(u'\u206D', "AAFS")    # activate arabic form shaping
            editor.setRepresentation(u'\u206E', "NADS")    # national digit shapes
            editor.setRepresentation(u'\u206F', "NODS")    # nominal digit shapes
            # SPACE chars ( Zs )
            editor.setRepresentation(u'\u3000', "IDSP")    # ideographic space
            # FORMAT chars ( Cf ) SPECIALS
            editor.setRepresentation(u'\uFEFF', "ZWNBSP")  # zero width no-break space : deprecated ( see U+2060 ) / byte order mark
            editor.setRepresentation(u'\uFFF9', "IAA")     # interlinear annotation anchor
            editor.setRepresentation(u'\uFFFA', "IAS")     # interlinear annotation separator
            editor.setRepresentation(u'\uFFFB', "IAT")     # interlinear annotation terminator
            # OTHER symbols ( So )
            editor.setRepresentation(u'\uFFFC', "OBJ")     # object replacement character
            editor.setRepresentation(u'\uFFFD', "<?>")     # replacement character
            # FORMAT chars ( Cf )
            # For characters OUTSIDE the BMP, with code > FFFF, we can use EITHER of these syntaxes :
            #   - editor.setRepresentation(u'\U0001BCA0', "SFLO")      TRUE "32-bits" representation
            #   - editor.setRepresentation(u'\uD82F\uDCA0', "SFLO")    The 16-bits "SURROGATES PAIR"
            editor.setRepresentation(u'\uD82F\uDCA0', "SFLO")  # shorthand format letter overlap
            editor.setRepresentation(u'\uD82F\uDCA1', "SFCO")  # shorthand format continuing overlap
            editor.setRepresentation(u'\uD82F\uDCA2', "SFDS")  # shorthand format down step
            editor.setRepresentation(u'\uD82F\uDCA3', "SFUS")  # shorthand format up step
            # Activate the character representation ( the menu command is a toggle, so it is issued twice )
            notepad.menuCommand(MENUCOMMAND.VIEW_ALL_CHARACTERS)
            notepad.menuCommand(MENUCOMMAND.VIEW_ALL_CHARACTERS)

    SRFSC()
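The surrogate pairs used for the four characters over the BMP, both in the N++ Regex column and in the script, follow the standard UTF-16 arithmetic. Here is a small sketch of that computation, assuming a plain Python 3 interpreter. The function names to_surrogate_pair and npp_regex are hypothetical helpers, not part of the PythonScript plugin :

```python
def to_surrogate_pair(code_point):
    """Split a code point over the BMP into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point: U+%04X" % code_point)
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def npp_regex(code_point):
    """Build the \\x{....} syntax of the N++ Regex column for one code point."""
    if code_point <= 0xFFFF:
        return "\\x{%04X}" % code_point
    high, low = to_surrogate_pair(code_point)
    return "\\x{%04X}\\x{%04X}" % (high, low)


if __name__ == "__main__":
    # U+1BCA0 : 0x1BCA0 - 0x10000 = 0xBCA0, top 10 bits = 0x2F, bottom 10 bits = 0xA0
    print(npp_regex(0x1BCA0))
```

For U+1BCA0, this gives the pair D82F / DCA0, which is the one found in the table and in the script.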
I did not look into the search/replace ( S/R ) side, because of the number of characters to handle ( 50 ) and because I’m just feeling… lazy for such a task !
However, with the Mark operation, which helps you locate exactly where these special characters are, and the Python script, which clearly identifies them, you should be safe with your file’s contents ;-))
Best Regards
guy038