Clean up text of non-printing characters
-
Hello everyone,
I keep stumbling around non-printing characters such as zero-width space, soft hyphen…
They are extremely annoying, especially for (permitted/legal) copy/paste actions from web content. Is there a way to convert these characters into other visible ones using Notepad++?
Is there an add-on (or a 3rd party tool) with which I can set the appropriate rules and thus “clean” text? Thank you in advance for any tips!
Regards, Martin
-
I don’t know of any “add on” that does exactly what you’re asking for.
But really, you’re just asking for a kind of “replace from a list of find/replace pairs”, and you can find many scripts and macros that do such things if you search this site. A recently active topic thread that had one was https://community.notepad-plus-plus.org/topic/23638. But there are many others.
If you feel like doing a little hacking of shortcuts.xml, you could build your own macro via text editing to do a series of replacements; example:
<Macro name="Make multiple replacements" Ctrl="no" Alt="no" Shift="no" Key="0">
    <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
    <Action type="3" message="1601" wParam="0" lParam="0" sParam="find1" />
    <Action type="3" message="1625" wParam="0" lParam="0" sParam="" />
    <Action type="3" message="1602" wParam="0" lParam="0" sParam="replace1" />
    <Action type="3" message="1702" wParam="0" lParam="768" sParam="" />
    <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
    <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
    <Action type="3" message="1601" wParam="0" lParam="0" sParam="find2" />
    <Action type="3" message="1625" wParam="0" lParam="0" sParam="" />
    <Action type="3" message="1602" wParam="0" lParam="0" sParam="replace2" />
    <Action type="3" message="1702" wParam="0" lParam="768" sParam="" />
    <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
</Macro>
Here, each replace operation is contained between the lines containing 1700 and 1701. So if you copy and paste that group of lines to below the last “Action” line, you’d define a third replacement. You’d obviously change “find1”, “replace1”, etc. to be a substitution pair that you’d want to make. You could build up a big set of replacements this way.
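For a long list of pairs, writing those Action groups by hand gets tedious. Below is a minimal Python sketch that generates the same kind of macro text from a list of find/replace pairs. The function name `build_replace_macro` and its parameters are made up for this illustration; the message numbers and attribute values are simply copied from the macro above, and the output still has to be pasted into shortcuts.xml by hand.

```python
import xml.etree.ElementTree as ET


def build_replace_macro(name, pairs, search_mode=0):
    """Return <Macro> XML text with one Replace All group per (find, replace) pair.

    Hypothetical helper: it only mirrors the Action lines of the example macro.
    search_mode is written as the lParam of the 1625 action (0 in the example).
    """
    macro = ET.Element(
        "Macro",
        {"name": name, "Ctrl": "no", "Alt": "no", "Shift": "no", "Key": "0"},
    )

    def action(message, lparam=0, sparam=""):
        # Every Action in the example uses type="3" and wParam="0".
        ET.SubElement(
            macro,
            "Action",
            {
                "type": "3",
                "message": str(message),
                "wParam": "0",
                "lParam": str(lparam),
                "sParam": sparam,
            },
        )

    for find, replace in pairs:
        action(1700)                      # first line of one replace group
        action(1601, sparam=find)         # text to find
        action(1625, lparam=search_mode)  # search mode
        action(1602, sparam=replace)      # replacement text
        action(1702, lparam=768)          # value copied from the example
        action(1701, lparam=1609)         # last line of the group

    ET.indent(macro)  # pretty-printing; needs Python 3.9 or later
    return ET.tostring(macro, encoding="unicode")


if __name__ == "__main__":
    # Zero-width space removed, soft hyphen turned into a visible hyphen.
    print(build_replace_macro("Strip invisibles", [("\u200b", ""), ("\u00ad", "-")]))
```

Letting an XML library write the attributes means special characters in a find or replace string (quotes, ampersands, angle brackets) are escaped for you.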
-
Just to add to @Alan-Kilborn's suggestion: depending on the nature of the characters that you need to find/replace, you may need to change the 1625 messages to use lParam="2" so that regular expressions are used. Other advanced details of this Action code can be found here.
-
@mathlete2 said in Clean up text of non-printing characters:
you may need to change the 1625 messages to use lParam="2"
I was not going to complicate a technique that’s already a bit esoteric with something like THAT, especially since the need for it wasn’t in what the OP expressed. For the most part, I try to stick to proposing solutions that solve the stated problem, not some “somebody might need this” stretch.
-
@Alan-Kilborn said in Clean up text of non-printing characters:
the need for it wasn’t in what the OP expressed
It’s true that the OP didn’t explicitly state that regex support was needed, but it’s also true that it didn’t explicitly state that regex wasn’t needed. Since you have already directed the user to the Macro code in shortcuts.xml, it seemed worthwhile to mention a simple tweak that is commonly used in these sorts of situations.
-
Arguably “match case” or “whole word” is even more important than the search mode.
And again, I didn’t mention those either because the need just wasn’t there…
-
@Alan-Kilborn said in Clean up text of non-printing characters:
Arguably “match case” or “whole word” is even more important than the search mode.
Agreed, which is one of the reasons why I added a link to the Action code page; it’s the one that documents these sorts of things.
I specifically mentioned the regex configuration because it’s a very useful one that users might not think to look for. Even if they do, they may find it difficult to find the mentions of it when they search through the page manually; I certainly do, so I thought users would appreciate the explicit instructions for implementing them.
-
@m-fessler said in Clean up text of non-printing characters:
I keep stumbling around non-printing characters such as zero-width space, soft hyphen…
They are extremely annoying, especially for (permitted/legal) copy/paste actions from web content. Is there a way to convert these characters into other visible ones using Notepad++?
If you just want to see them, so you can clean them up manually, select View | Show Symbol | Show Non-Printing Characters (or Show All Characters).
Otherwise, if you want to replace them with something else, try this: Select Search | Replace; then, in the dialog, enter:
Find what: [^[:graph:] \r\n\t]|\xad
(don’t miss that there is a space between :] and \r)
Replace with: (empty, or whatever you want)
Wrap around: checked
Search Mode: Regular expression
and click Replace All.
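If the clean-up has to happen outside Notepad++ (for example in a script that pre-processes pasted text), a rough Python analogue might look like the sketch below. It is not a translation of the regex above: Python's `re` module has no `[:graph:]` class, so this version assumes that "non-printing" means the Unicode general categories Cc, Cf, Zs, Zl and Zp, and it keeps the ordinary space, tab, CR and LF. The name `strip_invisible` is invented for this example.

```python
import unicodedata

# Characters that are always kept, matching the exceptions in the regex above.
KEEP = {" ", "\t", "\r", "\n"}

# Assumption: these general categories are what counts as "non-printing" here.
INVISIBLE_CATEGORIES = {"Cc", "Cf", "Zs", "Zl", "Zp"}


def strip_invisible(text, replacement=""):
    """Replace every non-kept character from the categories above with `replacement`."""
    return "".join(
        replacement
        if ch not in KEEP and unicodedata.category(ch) in INVISIBLE_CATEGORIES
        else ch
        for ch in text
    )


if __name__ == "__main__":
    sample = "co\u00adop\u200b ok\u00a0!"
    print(repr(strip_invisible(sample)))       # everything invisible dropped
    print(repr(strip_invisible(sample, "?")))  # or made visible instead
```

Note that deleting a no-break space outright joins the words around it, so passing `replacement=" "` may be the safer choice for text copied from web pages.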
-
Hello, @m-fessler, @mathlete2, @alan-kilborn, @coises and All,
@m-fessler, below is a list of all the special Unicode characters which belong to one of the following:
- The Z separator category (Zs, Zl and Zp categories)
- The Cc Control character category (except for the TAB, LF and CR ones)
- The Cf Format character category
- Two So Other Symbol characters (\x{FFFC} and \x{FFFD})
This list contains 121 characters:
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  Code   |  Regex             |  Character                                 |  Abbre.  |  GC  |  Chr.  |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  0000   |  \x{0000}          |  NULL                                      |  NUL     |  Cc  |        |
|  0001   |  \x{0001}          |  START OF HEADING                          |  SOH     |  Cc  |        |
|  0002   |  \x{0002}          |  START OF TEXT                             |  STX     |  Cc  |        |
|  0003   |  \x{0003}          |  END OF TEXT                               |  ETX     |  Cc  |        |
|  0004   |  \x{0004}          |  END OF TRANSMISSION                       |  EOT     |  Cc  |        |
|  0005   |  \x{0005}          |  ENQUIRY                                   |  ENQ     |  Cc  |        |
|  0006   |  \x{0006}          |  ACKNOWLEDGE                               |  ACK     |  Cc  |        |
|  0007   |  \x{0007}          |  BELL                                      |  BEL     |  Cc  |        |
|  0008   |  \x{0008}          |  BACKSPACE                                 |  BS      |  Cc  |        |
|  000B   |  \x{000B}          |  VERTICAL TABULATION                       |  VT      |  Cc  |        |
|  000C   |  \x{000C}          |  FORM FEED                                 |  FF      |  Cc  |        |
|  000E   |  \x{000E}          |  SHIFT OUT                                 |  SO      |  Cc  |        |
|  000F   |  \x{000F}          |  SHIFT IN                                  |  SI      |  Cc  |        |
|  0010   |  \x{0010}          |  DATA LINK ESCAPE                          |  DLE     |  Cc  |        |
|  0011   |  \x{0011}          |  DEVICE CONTROL ONE                        |  DC1     |  Cc  |        |
|  0012   |  \x{0012}          |  DEVICE CONTROL TWO                        |  DC2     |  Cc  |        |
|  0013   |  \x{0013}          |  DEVICE CONTROL THREE                      |  DC3     |  Cc  |        |
|  0014   |  \x{0014}          |  DEVICE CONTROL FOUR                       |  DC4     |  Cc  |        |
|  0015   |  \x{0015}          |  NEGATIVE ACKNOWLEDGE                      |  NAK     |  Cc  |        |
|  0016   |  \x{0016}          |  SYNCHRONOUS IDLE                          |  SYN     |  Cc  |        |
|  0017   |  \x{0017}          |  END OF TRANSMISSION BLOCK                 |  ETB     |  Cc  |        |
|  0018   |  \x{0018}          |  CANCEL                                    |  CAN     |  Cc  |        |
|  0019   |  \x{0019}          |  END OF MEDIUM                             |  EM      |  Cc  |        |
|  001A   |  \x{001A}          |  SUBSTITUTE                                |  SUB     |  Cc  |        |
|  001B   |  \x{001B}          |  ESCAPE                                    |  ESC     |  Cc  |        |
|  001C   |  \x{001C}          |  FILE SEPARATOR                            |  FS      |  Cc  |        |
|  001D   |  \x{001D}          |  GROUP SEPARATOR                           |  GS      |  Cc  |        |
|  001E   |  \x{001E}          |  RECORD SEPARATOR                          |  RS      |  Cc  |        |
|  001F   |  \x{001F}          |  UNIT SEPARATOR                            |  US      |  Cc  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  007F   |  \x{007F}          |  DELETE                                    |  DEL     |  Cc  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  0080   |  \x{0080}          |  PADDING CHARACTER                         |  PAD     |  Cc  |        |
|  0081   |  \x{0081}          |  HIGH OCTET PRESET                         |  HOP     |  Cc  |        |
|  0082   |  \x{0082}          |  BREAK PERMITTED HERE                      |  BPH     |  Cc  |        |
|  0083   |  \x{0083}          |  NO BREAK HERE                             |  NBH     |  Cc  |        |
|  0084   |  \x{0084}          |  INDEX                                     |  IND     |  Cc  |        |
|  0085   |  \x{0085}          |  NEXT LINE                                 |  NEL     |  Cc  |        |
|  0086   |  \x{0086}          |  START OF SELECTED AREA                    |  SSA     |  Cc  |        |
|  0087   |  \x{0087}          |  END OF SELECTED AREA                      |  ESA     |  Cc  |        |
|  0088   |  \x{0088}          |  HORIZONTAL TABULATION SET                 |  HTS     |  Cc  |        |
|  0089   |  \x{0089}          |  HORIZONTAL TABULATION WITH JUSTIFICATION  |  HTJ     |  Cc  |        |
|  008A   |  \x{008A}          |  VERTICAL TABULATION SET                   |  VTS     |  Cc  |        |
|  008B   |  \x{008B}          |  PARTIAL LINE DOWN                         |  PLD     |  Cc  |        |
|  008C   |  \x{008C}          |  PARTIAL LINE UP                           |  PLU     |  Cc  |        |
|  008D   |  \x{008D}          |  REVERSE INDEX                             |  RI      |  Cc  |        |
|  008E   |  \x{008E}          |  SINGLE-SHIFT 2                            |  SS2     |  Cc  |        |
|  008F   |  \x{008F}          |  SINGLE-SHIFT 3                            |  SS3     |  Cc  |        |
|  0090   |  \x{0090}          |  DEVICE CONTROL STRING                     |  DCS     |  Cc  |        |
|  0091   |  \x{0091}          |  PRIVATE USE 1                             |  PU1     |  Cc  |        |
|  0092   |  \x{0092}          |  PRIVATE USE 2                             |  PU2     |  Cc  |        |
|  0093   |  \x{0093}          |  SET TRANSMIT STATE                        |  STS     |  Cc  |        |
|  0094   |  \x{0094}          |  CANCEL CHARACTER                          |  CCH     |  Cc  |        |
|  0095   |  \x{0095}          |  MESSAGE WAITING                           |  MW      |  Cc  |        |
|  0096   |  \x{0096}          |  START OF PROTECTED AREA                   |  SPA     |  Cc  |        |
|  0097   |  \x{0097}          |  END OF PROTECTED AREA                     |  EPA     |  Cc  |        |
|  0098   |  \x{0098}          |  START OF STRING                           |  SOS     |  Cc  |        |
|  0099   |  \x{0099}          |  SINGLE GRAPHIC CHARACTER INTRODUCER       |  SGCI    |  Cc  |        |
|  009A   |  \x{009A}          |  SINGLE CHARACTER INTRODUCER               |  SCI     |  Cc  |        |
|  009B   |  \x{009B}          |  CONTROL SEQUENCE INTRODUCER               |  CSI     |  Cc  |        |
|  009C   |  \x{009C}          |  STRING TERMINATOR                         |  ST      |  Cc  |        |
|  009D   |  \x{009D}          |  OPERATING SYSTEM COMMAND                  |  OSC     |  Cc  |        |
|  009E   |  \x{009E}          |  PRIVACY MESSAGE                           |  PM      |  Cc  |        |
|  009F   |  \x{009F}          |  APPLICATION PROGRAM COMMAND               |  APC     |  Cc  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  00A0   |  \x{00A0}          |  NO-BREAK SPACE                            |  NBSP    |  Zs  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  00AD   |  \x{00AD}          |  SOFT HYPHEN                               |  SHY     |  Cf  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  061C   |  \x{061C}          |  ARABIC LETTER MARK                        |  ALM     |  Cf  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  070F   |  \x{070F}          |  SYRIAC ABBREVIATION MARK                  |  SAM     |  Cf  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  0890   |  \x{0890}          |  ARABIC POUND MARK ABOVE                   |          |  Cf  |        |
|  0891   |  \x{0891}          |  ARABIC PIASTRE MARK ABOVE                 |          |  Cf  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  1680   |  \x{1680}          |  OGHAM SPACE MARK                          |  OSPM    |  Zs  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  180E   |  \x{180E}          |  MONGOLIAN VOWEL SEPARATOR                 |  MVS     |  Cf  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  2000   |  \x{2000}          |  EN QUAD                                   |  NQSP    |  Zs  |        |
|  2001   |  \x{2001}          |  EM QUAD                                   |  MQSP    |  Zs  |        |
|  2002   |  \x{2002}          |  EN SPACE                                  |  ENSP    |  Zs  |        |
|  2003   |  \x{2003}          |  EM SPACE                                  |  EMSP    |  Zs  |        |
|  2004   |  \x{2004}          |  THREE-PER-EM SPACE                        |  3/MSP   |  Zs  |        |
|  2005   |  \x{2005}          |  FOUR-PER-EM SPACE                         |  4/MSP   |  Zs  |        |
|  2006   |  \x{2006}          |  SIX-PER-EM SPACE                          |  6/MSP   |  Zs  |        |
|  2007   |  \x{2007}          |  FIGURE SPACE                              |  FSP     |  Zs  |        |
|  2008   |  \x{2008}          |  PUNCTUATION SPACE                         |  PSP     |  Zs  |        |
|  2009   |  \x{2009}          |  THIN SPACE                                |  THSP    |  Zs  |        |
|  200A   |  \x{200A}          |  HAIR SPACE                                |  HSP     |  Zs  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  200B   |  \x{200B}          |  ZERO WIDTH SPACE                          |  ZWSP    |  Cf  |        |
|  200C   |  \x{200C}          |  ZERO WIDTH NON-JOINER                     |  ZWNJ    |  Cf  |        |
|  200D   |  \x{200D}          |  ZERO WIDTH JOINER                         |  ZWJ     |  Cf  |        |
|  200E   |  \x{200E}          |  LEFT-TO-RIGHT MARK                        |  LRM     |  Cf  |        |
|  200F   |  \x{200F}          |  RIGHT-TO-LEFT MARK                        |  RLM     |  Cf  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  2028   |  \x{2028}          |  LINE SEPARATOR                            |  LS      |  Zl  |        |
|  2029   |  \x{2029}          |  PARAGRAPH SEPARATOR                       |  PS      |  Zp  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  202A   |  \x{202A}          |  LEFT-TO-RIGHT EMBEDDING                   |  LRE     |  Cf  |        |
|  202B   |  \x{202B}          |  RIGHT-TO-LEFT EMBEDDING                   |  RLE     |  Cf  |        |
|  202C   |  \x{202C}          |  POP DIRECTIONAL FORMATTING                |  PDF     |  Cf  |        |
|  202D   |  \x{202D}          |  LEFT-TO-RIGHT OVERRIDE                    |  LRO     |  Cf  |        |
|  202E   |  \x{202E}          |  RIGHT-TO-LEFT OVERRIDE                    |  RLO     |  Cf  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  202F   |  \x{202F}          |  NARROW NO-BREAK SPACE                     |  NNBSP   |  Zs  |        |
|  205F   |  \x{205F}          |  MEDIUM MATHEMATICAL SPACE                 |  MMSP    |  Zs  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  2060   |  \x{2060}          |  WORD JOINER                               |  WJ      |  Cf  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  2061   |  \x{2061}          |  FUNCTION APPLICATION                      |  (FA)    |  Cf  |        |
|  2062   |  \x{2062}          |  INVISIBLE TIMES                           |  (IT)    |  Cf  |        |
|  2063   |  \x{2063}          |  INVISIBLE SEPARATOR                       |  (IS)    |  Cf  |        |
|  2064   |  \x{2064}          |  INVISIBLE PLUS                            |  (IP)    |  Cf  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  2066   |  \x{2066}          |  LEFT-TO-RIGHT ISOLATE                     |  LRI     |  Cf  |        |
|  2067   |  \x{2067}          |  RIGHT-TO-LEFT ISOLATE                     |  RLI     |  Cf  |        |
|  2068   |  \x{2068}          |  FIRST STRONG ISOLATE                      |  FSI     |  Cf  |        |
|  2069   |  \x{2069}          |  POP DIRECTIONAL ISOLATE                   |  PDI     |  Cf  |        |
|  206A   |  \x{206A}          |  INHIBIT SYMMETRIC SWAPPING                |  ISS     |  Cf  |        |
|  206B   |  \x{206B}          |  ACTIVATE SYMMETRIC SWAPPING               |  ASS     |  Cf  |        |
|  206C   |  \x{206C}          |  INHIBIT ARABIC FORM SHAPING               |  IAFS    |  Cf  |        |
|  206D   |  \x{206D}          |  ACTIVATE ARABIC FORM SHAPING              |  AAFS    |  Cf  |        |
|  206E   |  \x{206E}          |  NATIONAL DIGIT SHAPES                     |  NADS    |  Cf  |        |
|  206F   |  \x{206F}          |  NOMINAL DIGIT SHAPES                      |  NODS    |  Cf  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  3000   |  \x{3000}          |  IDEOGRAPHIC SPACE                         |  IDSP    |  Zs  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  FEFF   |  \x{FEFF}          |  ZERO WIDTH NO-BREAK SPACE                 |  ZWNBSP  |  Cf  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  FFF9   |  \x{FFF9}          |  INTERLINEAR ANNOTATION ANCHOR             |  IAA     |  Cf  |        |
|  FFFA   |  \x{FFFA}          |  INTERLINEAR ANNOTATION SEPARATOR          |  IAS     |  Cf  |        |
|  FFFB   |  \x{FFFB}          |  INTERLINEAR ANNOTATION TERMINATOR         |  IAT     |  Cf  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  FFFC   |  \x{FFFC}          |  OBJECT REPLACEMENT CHARACTER              |  OBJ     |  So  |        |
|  FFFD   |  \x{FFFD}          |  REPLACEMENT CHARACTER                     |  ?       |  So  |  �     |
•---------•--------------------•--------------------------------------------•----------•------•--------•
|  1BCA0  |  \x{D82F}\x{DCA0}  |  SHORTHAND FORMAT LETTER OVERLAP           |  SFLO    |  Cf  |        |
|  1BCA1  |  \x{D82F}\x{DCA1}  |  SHORTHAND FORMAT CONTINUING OVERLAP       |  SFCO    |  Cf  |        |
|  1BCA2  |  \x{D82F}\x{DCA2}  |  SHORTHAND FORMAT DOWN STEP                |  SFDS    |  Cf  |        |
|  1BCA3  |  \x{D82F}\x{DCA3}  |  SHORTHAND FORMAT UP STEP                  |  SFUS    |  Cf  |        |
•---------•--------------------•--------------------------------------------•----------•------•--------•
From this list, @m-fessler, which characters do you want to Search / Mark / Replace?
Moreover, do you want to ignore all characters above the BMP (so, over \x{FFFF}) or do you consider these characters as normal chars?
Once you know which characters you want to consider, it will be easy to get the appropriate REGEX search!
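As an illustration of that last step, here is a small Python sketch that turns a chosen set of code points into a character class written in the same \x{XXXX} notation as the Regex column. The helper name `to_regex_class` is invented for this example, it only handles BMP code points (the surrogate-pair rows would need separate alternatives), and the assumption that the Notepad++ regex engine accepts ranges such as \x{200B}-\x{200F} inside a class should be checked on a test file first.

```python
def to_regex_class(code_points):
    """Build a bracketed character class from BMP code points, merging runs into ranges."""
    cps = sorted(set(code_points))
    if not cps:
        raise ValueError("need at least one code point")
    if cps[0] < 0 or cps[-1] > 0xFFFF:
        raise ValueError("only BMP code points (0000-FFFF) are handled here")

    def fmt(cp):
        # Same notation as the table's Regex column, e.g. \x{200B}
        return f"\\x{{{cp:04X}}}"

    parts = []
    start = prev = cps[0]
    # The trailing None flushes the final run.
    for cp in cps[1:] + [None]:
        if cp is not None and cp == prev + 1:
            prev = cp  # still inside a consecutive run
            continue
        parts.append(fmt(start) if start == prev else f"{fmt(start)}-{fmt(prev)}")
        start = prev = cp
    return "[" + "".join(parts) + "]"


if __name__ == "__main__":
    # Soft hyphen plus the zero-width space / non-joiner / joiner trio.
    print(to_regex_class([0x200B, 0x00AD, 0x200C, 0x200D]))
```

The resulting string would go into the Find what field with Search Mode set to Regular expression.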
Best Regards,
guy038
-