How to search for unknown 3-digit characters with black background
-
I have a UTF-8 encoded text file with about 500,000 lines. Surprisingly, I found several different unknown characters, each shown as a code about 3 characters long, in white letters on a black background, e.g. NEL, SSA, SPA, xC3, IND, STS and some others.
How can I find those characters to replace them? Umlauts and other language specific characters should remain.
I already tried the TextFX plugin, but this also changed all the umlauts and other European special characters to ##.
-
Hello, @diedrich-hesmer, and All
Explanations of the C0 and C1 control codes can be found on Wikipedia: https://en.wikipedia.org/wiki/C0_and_C1_control_codes
And also on the Unicode Consortium site:
http://www.unicode.org/charts/PDF/U0000.pdf
http://www.unicode.org/charts/PDF/U0080.pdf
Here is a summary of these characters:
•--------•--------•-------------------------------------------•------•
|  Code  | Glyph  |  Character Name                           | Cat. |
•--------•--------•-------------------------------------------•------•
|  0000  |  NUL   |  NULL \0                                  |  Cc  |
|  0001  |  SOH   |  START OF HEADING                         |  Cc  |
|  0002  |  STX   |  START OF TEXT                            |  Cc  |
|  0003  |  ETX   |  END OF TEXT                              |  Cc  |
|  0004  |  EOT   |  END OF TRANSMISSION                      |  Cc  |
|  0005  |  ENQ   |  ENQUIRY                                  |  Cc  |
|  0006  |  ACK   |  ACKNOWLEDGE                              |  Cc  |
|  0007  |  BEL   |  BELL                                     |  Cc  |
|  0008  |  BS    |  BACKSPACE [\b]                           |  Cc  |
|  0009  |  HT    |  CHARACTER TABULATION \t                  |  Cc  |
|  000A  |  LF    |  LINE FEED \n                             |  Cc  |
|  000B  |  VT    |  LINE TABULATION [\v]                     |  Cc  |
|  000C  |  FF    |  FORM FEED \f                             |  Cc  |
|  000D  |  CR    |  CARRIAGE RETURN \r                       |  Cc  |
|  000E  |  SO    |  SHIFT OUT                                |  Cc  |
|  000F  |  SI    |  SHIFT IN                                 |  Cc  |
|  0010  |  DLE   |  DATA LINK ESCAPE                         |  Cc  |
|  0011  |  DC1   |  DEVICE CONTROL ONE                       |  Cc  |
|  0012  |  DC2   |  DEVICE CONTROL TWO                       |  Cc  |
|  0013  |  DC3   |  DEVICE CONTROL THREE                     |  Cc  |
|  0014  |  DC4   |  DEVICE CONTROL FOUR                      |  Cc  |
|  0015  |  NAK   |  NEGATIVE ACKNOWLEDGE                     |  Cc  |
|  0016  |  SYN   |  SYNCHRONOUS IDLE                         |  Cc  |
|  0017  |  ETB   |  END OF TRANSMISSION BLOCK                |  Cc  |
|  0018  |  CAN   |  CANCEL                                   |  Cc  |
|  0019  |  EM    |  END OF MEDIUM                            |  Cc  |
|  001A  |  SUB   |  SUBSTITUTE                               |  Cc  |
|  001B  |  ESC   |  ESCAPE                                   |  Cc  |
|  001C  |  FS    |  INFORMATION SEPARATOR FOUR               |  Cc  |
|  001D  |  GS    |  INFORMATION SEPARATOR THREE              |  Cc  |
|  001E  |  RS    |  INFORMATION SEPARATOR TWO                |  Cc  |
|  001F  |  US    |  INFORMATION SEPARATOR ONE                |  Cc  |
•--------•--------•-------------------------------------------•------•
|  0080  |  PAD   |  PADDING CHARACTER                        |  Cc  |
|  0081  |  HOP   |  HIGH OCTET PRESET                        |  Cc  |
|  0082  |  BPH   |  BREAK PERMITTED HERE                     |  Cc  |
|  0083  |  NBH   |  NO BREAK HERE                            |  Cc  |
|  0084  |  IND   |  INDEX                                    |  Cc  |
|  0085  |  NEL   |  NEXT LINE (NEL)                          |  Cc  |
|  0086  |  SSA   |  START OF SELECTED AREA                   |  Cc  |
|  0087  |  ESA   |  END OF SELECTED AREA                     |  Cc  |
|  0088  |  HTS   |  CHARACTER TABULATION SET                 |  Cc  |
|  0089  |  HTJ   |  CHARACTER TABULATION WITH JUSTIFICATION  |  Cc  |
|  008A  |  VTS   |  LINE TABULATION SET                      |  Cc  |
|  008B  |  PLD   |  PARTIAL LINE FORWARD / DOWN              |  Cc  |
|  008C  |  PLU   |  PARTIAL LINE BACKWARD / UP               |  Cc  |
|  008D  |  RI    |  REVERSE LINE FEED / REVERSE INDEX        |  Cc  |
|  008E  |  SS2   |  SINGLE SHIFT TWO                         |  Cc  |
|  008F  |  SS3   |  SINGLE SHIFT THREE                       |  Cc  |
|  0090  |  DCS   |  DEVICE CONTROL STRING                    |  Cc  |
|  0091  |  PU1   |  PRIVATE USE ONE                          |  Cc  |
|  0092  |  PU2   |  PRIVATE USE TWO                          |  Cc  |
|  0093  |  STS   |  SET TRANSMIT STATE                       |  Cc  |
|  0094  |  CCH   |  CANCEL CHARACTER                         |  Cc  |
|  0095  |  MW    |  MESSAGE WAITING                          |  Cc  |
|  0096  |  SPA   |  START OF PROTECTED AREA                  |  Cc  |
|  0097  |  EPA   |  END OF PROTECTED AREA                    |  Cc  |
|  0098  |  SOS   |  START OF STRING                          |  Cc  |
|  0099  |  SGCI  |  SINGLE GRAPHIC CHARACTER INTRODUCER      |  Cc  |
|  009A  |  SCI   |  SINGLE CHARACTER INTRODUCER              |  Cc  |
|  009B  |  CSI   |  CONTROL SEQUENCE INTRODUCER              |  Cc  |
|  009C  |  ST    |  STRING TERMINATOR                        |  Cc  |
|  009D  |  OSC   |  OPERATING SYSTEM COMMAND                 |  Cc  |
|  009E  |  PM    |  PRIVACY MESSAGE                          |  Cc  |
|  009F  |  APC   |  APPLICATION PROGRAM COMMAND              |  Cc  |
•--------•--------•-------------------------------------------•------•
In a UTF-8 encoded file, these characters are displayed as two to four white letters on a black background. Each character can be found individually with a regular expression of the form \xNN, \x{NN} or \x{00NN}, where N represents a hexadecimal digit from 0 to F.
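For anyone who would rather check the file from a script, the single-character search can be sketched in Python. This is only an illustration under assumptions: the function name find_char and the sample string are made up, and Python's re module accepts the \xNN form but not the \x{NN} forms, so the sketch is limited to code points up to FF.

```python
import re

def find_char(text, code_point):
    """Return 1-based (line, column) pairs where the character occurs."""
    if not 0 <= code_point <= 0xFF:
        raise ValueError("this sketch only builds two-digit \\xNN patterns")
    # Build the regex \xNN, e.g. \x85 for NEL.
    pattern = re.compile("\\x%02x" % code_point)
    hits = []
    # split("\n") is used on purpose: str.splitlines() would also break
    # lines on NEL (\x85) and a few other control characters.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in pattern.finditer(line):
            hits.append((line_no, match.start() + 1))
    return hits

# Hypothetical sample: umlauts plus a NEL (\x85) and an SPA (\x96).
sample = "Gr\u00fc\u00dfe\x85aus\nK\u00f6ln\x96!"
print(find_char(sample, 0x85))  # [(1, 6)]
print(find_char(sample, 0x96))  # [(2, 5)]
```

The umlauts are left alone because only the exact code point asked for is matched.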
In addition, some C0 control characters can be searched for by their escape sequence: \t, \0, \n, …
Finally, any C0 or C1 control character can be found with the simple character class [\x00-\x1f\x80-\x9f]. However, this regex will also find any End of Line character, \r and/or \n!
So I suppose you'll likely prefer this regex, (?!\r|\n)[\x00-\x1f\x80-\x9f], which does not detect these EOL characters :-)) If you want to skip the tabulation character too, just change it to (?!\r|\n|\t)[\x00-\x1f\x80-\x9f]
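As a rough illustration outside the editor, the two regexes above also work unchanged with Python's re module. The sketch below deletes the matched characters and leaves umlauts and line endings alone; the function name strip_controls and the sample text are assumptions made up for the example.

```python
import re

# C0 and C1 controls, except CR and LF (same regex as above).
CONTROLS = re.compile(r"(?!\r|\n)[\x00-\x1f\x80-\x9f]")
# Variant that also leaves the tabulation character alone.
CONTROLS_KEEP_TAB = re.compile(r"(?!\r|\n|\t)[\x00-\x1f\x80-\x9f]")

def strip_controls(text, keep_tab=False):
    """Delete control characters, i.e. replace them with nothing."""
    pattern = CONTROLS_KEEP_TAB if keep_tab else CONTROLS
    return pattern.sub("", text)

# Hypothetical sample with NEL, TAB, SPA and NUL mixed into normal text.
sample = "M\u00fcller\x85\tStra\u00dfe\x96 5\r\nZ\u00fcrich\x00\n"
print(repr(strip_controls(sample)))
print(repr(strip_controls(sample, keep_tab=True)))
```

For a real file, one would presumably read it with open(path, encoding="utf-8", newline="") so that the \r\n line endings reach the regex untouched.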
Another possibility is to use the POSIX character class [:cntrl:], which must be enclosed in the usual square brackets. So our regex becomes (?!\r|\n)[[:cntrl:]]
This regex will detect any of the C0 and C1 control codes described above, but also some practically invisible format characters, as well as the common DELETE character (\x7f).
Again, here is a non-exhaustive summary of these characters :
•--------•--------•-------------------------------------------•------•
|  007F  |  DEL   |  DELETE                                   |  Cc  |
•--------•--------•-------------------------------------------•------•
|  200C  |  ZWNJ  |  ZERO WIDTH NON-JOINER                    |  Cf  |
|  200D  |  ZWJ   |  ZERO WIDTH JOINER                        |  Cf  |
|  200E  |  LRM   |  LEFT-TO-RIGHT MARK                       |  Cf  |
|  200F  |  RLM   |  RIGHT-TO-LEFT MARK                       |  Cf  |
•--------•--------•-------------------------------------------•------•
|  202A  |  LRE   |  LEFT-TO-RIGHT EMBEDDING                  |  Cf  |
|  202B  |  RLE   |  RIGHT-TO-LEFT EMBEDDING                  |  Cf  |
|  202C  |  PDF   |  POP DIRECTIONAL FORMATTING               |  Cf  |
|  202D  |  LRO   |  LEFT-TO-RIGHT OVERRIDE                   |  Cf  |
|  202E  |  RLO   |  RIGHT-TO-LEFT OVERRIDE                   |  Cf  |
•--------•--------•-------------------------------------------•------•
|  206A  |  ISS   |  INHIBIT SYMMETRIC SWAPPING               |  Cf  |
|  206B  |  ASS   |  ACTIVATE SYMMETRIC SWAPPING              |  Cf  |
|  206C  |  IAFS  |  INHIBIT ARABIC FORM SHAPING              |  Cf  |
|  206D  |  AAFS  |  ACTIVATE ARABIC FORM SHAPING             |  Cf  |
|  206E  |  NADS  |  NATIONAL DIGIT SHAPES                    |  Cf  |
|  206F  |  NODS  |  NOMINAL DIGIT SHAPES                     |  Cf  |
•--------•--------•-------------------------------------------•------•
|  FEFF  | ZWNBSP |  ZERO WIDTH NO-BREAK SPACE                |  Cf  |
•--------•--------•-------------------------------------------•------•
|  FFF9  |  IAA   |  INTERLINEAR ANNOTATION ANCHOR            |  Cf  |
|  FFFA  |  IAS   |  INTERLINEAR ANNOTATION SEPARATOR         |  Cf  |
|  FFFB  |  IAT   |  INTERLINEAR ANNOTATION TERMINATOR        |  Cf  |
•--------•--------•-------------------------------------------•------•
You can see a description of these characters, from the following links :
http://www.unicode.org/charts/PDF/U2000.pdf
http://www.unicode.org/charts/PDF/UFE70.pdf
http://www.unicode.org/charts/PDF/UFFF0.pdf
And any of these characters can be found individually with the simple regex \x{NNNN}
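To get an inventory of which invisible characters a text actually contains, a script can look at the Unicode General Category directly. The sketch below is an assumption-laden stand-in, not the editor's own logic: Python's re module has no [[:cntrl:]] class, so it uses unicodedata.category instead and counts everything in category Cc or Cf. That selection may not match the regex engine's [[:cntrl:]] exactly (for example, Cf also contains the soft hyphen U+00AD). The function name invisible_chars and the sample string are hypothetical.

```python
import unicodedata
from collections import Counter

def invisible_chars(text, keep="\r\n"):
    """Count characters of General Category Cc or Cf, skipping those in keep."""
    counts = Counter()
    for ch in text:
        if ch not in keep and unicodedata.category(ch) in ("Cc", "Cf"):
            # Report each hit by its code point, e.g. U+0093.
            counts["U+%04X" % ord(ch)] += 1
    return counts

# Hypothetical sample: a leading U+FEFF, a ZWJ, a DEL and two STS characters.
sample = "\ufeffna\u00efve\u200d caf\u00e9\x7f\x93\x93\r\n"
for code, count in sorted(invisible_chars(sample).items()):
    print(code, count)
```

Each reported code point can then be searched for individually with the \x{NNNN} form.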
Notes :
- In the Replace dialog, you must select the Regular expression search mode
- Of course, if you want to delete all the “black background” characters, just leave the Replace with: box EMPTY !
Best Regards,
guy038
PS :
The last column of the 2 tables represents the Unicode General Category property (Cc means a Control character and Cf means a Format character!)
-
Hi guy038,
thanks for the excellent response
Best regards,
Diedrich
-
Thanks for the solution. It helped me to replace that unwanted data in many records at a time. Very helpful.
-
Thank you so much for your instructions! They saved me a whole bunch of time!