How to search for unknown 3-digit characters with black background
-
I have a UTF-8 encoded text file with about 500,000 lines. Surprisingly, I found several different unknown characters, each shown as 3 characters in white letters on a black background, e.g. NEL, SSA, SPA, xC3, IND, STS and some others.
How can I find those characters in order to replace them? Umlauts and other language-specific characters should remain.
I already tried the TextFX plugin, but it also changed all the umlauts and other European special characters to ##.
-
Hello, @diedrich-hesmer, and All
The C0 and C1 control codes are explained on Wikipedia: https://en.wikipedia.org/wiki/C0_and_C1_control_codes
And also in the Unicode Consortium code charts:
http://www.unicode.org/charts/PDF/U0000.pdf
http://www.unicode.org/charts/PDF/U0080.pdf
Here is a summary of these characters:
•--------•--------•-------------------------------------------•------•
| Code   | Glyph  | Character Name                            | Cat. |
•--------•--------•-------------------------------------------•------•
| 0000   | NUL    | NULL \0                                   | Cc   |
| 0001   | SOH    | START OF HEADING                          | Cc   |
| 0002   | STX    | START OF TEXT                             | Cc   |
| 0003   | ETX    | END OF TEXT                               | Cc   |
| 0004   | EOT    | END OF TRANSMISSION                       | Cc   |
| 0005   | ENQ    | ENQUIRY                                   | Cc   |
| 0006   | ACK    | ACKNOWLEDGE                               | Cc   |
| 0007   | BEL    | BELL                                      | Cc   |
| 0008   | BS     | BACKSPACE [\b]                            | Cc   |
| 0009   | HT     | CHARACTER TABULATION \t                   | Cc   |
| 000A   | LF     | LINE FEED \n                              | Cc   |
| 000B   | VT     | LINE TABULATION [\v]                      | Cc   |
| 000C   | FF     | FORM FEED \f                              | Cc   |
| 000D   | CR     | CARRIAGE RETURN \r                        | Cc   |
| 000E   | SO     | SHIFT OUT                                 | Cc   |
| 000F   | SI     | SHIFT IN                                  | Cc   |
| 0010   | DLE    | DATA LINK ESCAPE                          | Cc   |
| 0011   | DC1    | DEVICE CONTROL ONE                        | Cc   |
| 0012   | DC2    | DEVICE CONTROL TWO                        | Cc   |
| 0013   | DC3    | DEVICE CONTROL THREE                      | Cc   |
| 0014   | DC4    | DEVICE CONTROL FOUR                       | Cc   |
| 0015   | NAK    | NEGATIVE ACKNOWLEDGE                      | Cc   |
| 0016   | SYN    | SYNCHRONOUS IDLE                          | Cc   |
| 0017   | ETB    | END OF TRANSMISSION BLOCK                 | Cc   |
| 0018   | CAN    | CANCEL                                    | Cc   |
| 0019   | EM     | END OF MEDIUM                             | Cc   |
| 001A   | SUB    | SUBSTITUTE                                | Cc   |
| 001B   | ESC    | ESCAPE                                    | Cc   |
| 001C   | FS     | INFORMATION SEPARATOR FOUR                | Cc   |
| 001D   | GS     | INFORMATION SEPARATOR THREE               | Cc   |
| 001E   | RS     | INFORMATION SEPARATOR TWO                 | Cc   |
| 001F   | US     | INFORMATION SEPARATOR ONE                 | Cc   |
•--------•--------•-------------------------------------------•------•
| 0080   | PAD    | PADDING CHARACTER                         | Cc   |
| 0081   | HOP    | HIGH OCTET PRESET                         | Cc   |
| 0082   | BPH    | BREAK PERMITTED HERE                      | Cc   |
| 0083   | NBH    | NO BREAK HERE                             | Cc   |
| 0084   | IND    | INDEX                                     | Cc   |
| 0085   | NEL    | NEXT LINE (NEL)                           | Cc   |
| 0086   | SSA    | START OF SELECTED AREA                    | Cc   |
| 0087   | ESA    | END OF SELECTED AREA                      | Cc   |
| 0088   | HTS    | CHARACTER TABULATION SET                  | Cc   |
| 0089   | HTJ    | CHARACTER TABULATION WITH JUSTIFICATION   | Cc   |
| 008A   | VTS    | LINE TABULATION SET                       | Cc   |
| 008B   | PLD    | PARTIAL LINE FORWARD / DOWN               | Cc   |
| 008C   | PLU    | PARTIAL LINE BACKWARD / UP                | Cc   |
| 008D   | RI     | REVERSE LINE FEED / REVERSE INDEX         | Cc   |
| 008E   | SS2    | SINGLE SHIFT TWO                          | Cc   |
| 008F   | SS3    | SINGLE SHIFT THREE                        | Cc   |
| 0090   | DCS    | DEVICE CONTROL STRING                     | Cc   |
| 0091   | PU1    | PRIVATE USE ONE                           | Cc   |
| 0092   | PU2    | PRIVATE USE TWO                           | Cc   |
| 0093   | STS    | SET TRANSMIT STATE                        | Cc   |
| 0094   | CCH    | CANCEL CHARACTER                          | Cc   |
| 0095   | MW     | MESSAGE WAITING                           | Cc   |
| 0096   | SPA    | START OF PROTECTED AREA                   | Cc   |
| 0097   | EPA    | END OF PROTECTED AREA                     | Cc   |
| 0098   | SOS    | START OF STRING                           | Cc   |
| 0099   | SGCI   | SINGLE GRAPHIC CHARACTER INTRODUCER       | Cc   |
| 009A   | SCI    | SINGLE CHARACTER INTRODUCER               | Cc   |
| 009B   | CSI    | CONTROL SEQUENCE INTRODUCER               | Cc   |
| 009C   | ST     | STRING TERMINATOR                         | Cc   |
| 009D   | OSC    | OPERATING SYSTEM COMMAND                  | Cc   |
| 009E   | PM     | PRIVACY MESSAGE                           | Cc   |
| 009F   | APC    | APPLICATION PROGRAM COMMAND               | Cc   |
•--------•--------•-------------------------------------------•------•
In a UTF-8 encoded file, these characters are displayed as two to four white characters on a black background. Each character can be found individually with a regular expression of one of these forms:
\xNN, \x{NN} or \x{00NN}, where N represents a hexadecimal digit from 0 to F.
In addition, some C0 control characters can be searched for by their escape sequence: \t, \0, \n, …
Finally, any C0 or C1 control character can be found with the simple character class [\x00-\x1f\x80-\x9f]. However, this regex also matches the End of Line characters \r and \n!
So I suppose you’ll prefer the regex (?!\r|\n)[\x00-\x1f\x80-\x9f], which does not match these EOL characters :-)) If you want to skip the tabulation character too, just change it to (?!\r|\n|\t)[\x00-\x1f\x80-\x9f]
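For anyone who wants to do the same clean-up outside Notepad++, here is a minimal sketch in Python. It assumes the text has already been decoded to a Unicode string, and it reuses the lookahead plus character class idea from above with Python's `re` module. The names `CONTROL_CHARS` and `strip_controls` are made up for this illustration.

```python
import re

# C0 (U+0000-U+001F) and C1 (U+0080-U+009F) control characters,
# except CR, LF and TAB, which the negative lookahead skips.
CONTROL_CHARS = re.compile(r"(?![\r\n\t])[\x00-\x1f\x80-\x9f]")


def strip_controls(text: str) -> str:
    """Delete C0/C1 control characters but keep line ends and tabs."""
    return CONTROL_CHARS.sub("", text)


# Sample with NEL (\x85), SPA (\x96) and STS (\x93) mixed into German text.
sample = "Gr\x85üße\x96 aus Köln\x93\r\n\tEnde"
cleaned = strip_controls(sample)
print(cleaned == "Grüße aus Köln\r\n\tEnde")  # prints True
```

Umlauts such as ü (U+00FC) lie outside both ranges, so they are left alone.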
Another possibility is to use the POSIX character class [:cntrl:], which must be enclosed in the usual square brackets. So our regex becomes (?!\r|\n)[[:cntrl:]]
This regex will detect any of the C0 and C1 control codes described above but also some practically invisible format characters, as well as the common DELETE character (\x7f).
Again, here is a summary, although not exhaustive, of these format characters:
•--------•--------•-------------------------------------------•------•
| 007F   | DEL    | DELETE                                    | Cc   |
•--------•--------•-------------------------------------------•------•
| 200C   | ZWNJ   | ZERO WIDTH NON-JOINER                     | Cf   |
| 200D   | ZWJ    | ZERO WIDTH JOINER                         | Cf   |
| 200E   | LRM    | LEFT-TO-RIGHT MARK                        | Cf   |
| 200F   | RLM    | RIGHT-TO-LEFT MARK                        | Cf   |
•--------•--------•-------------------------------------------•------•
| 202A   | LRE    | LEFT-TO-RIGHT EMBEDDING                   | Cf   |
| 202B   | RLE    | RIGHT-TO-LEFT EMBEDDING                   | Cf   |
| 202C   | PDF    | POP DIRECTIONAL FORMATTING                | Cf   |
| 202D   | LRO    | LEFT-TO-RIGHT OVERRIDE                    | Cf   |
| 202E   | RLO    | RIGHT-TO-LEFT OVERRIDE                    | Cf   |
•--------•--------•-------------------------------------------•------•
| 206A   | ISS    | INHIBIT SYMMETRIC SWAPPING                | Cf   |
| 206B   | ASS    | ACTIVATE SYMMETRIC SWAPPING               | Cf   |
| 206C   | IAFS   | INHIBIT ARABIC FORM SHAPING               | Cf   |
| 206D   | AAFS   | ACTIVATE ARABIC FORM SHAPING              | Cf   |
| 206E   | NADS   | NATIONAL DIGIT SHAPES                     | Cf   |
| 206F   | NODS   | NOMINAL DIGIT SHAPES                      | Cf   |
•--------•--------•-------------------------------------------•------•
| FEFF   | ZWNBSP | ZERO WIDTH NO-BREAK SPACE                 | Cf   |
•--------•--------•-------------------------------------------•------•
| FFF9   | IAA    | INTERLINEAR ANNOTATION ANCHOR             | Cf   |
| FFFA   | IAS    | INTERLINEAR ANNOTATION SEPARATOR          | Cf   |
| FFFB   | IAT    | INTERLINEAR ANNOTATION TERMINATOR         | Cf   |
•--------•--------•-------------------------------------------•------•
You can see a description of these characters at the following links:
http://www.unicode.org/charts/PDF/U2000.pdf
http://www.unicode.org/charts/PDF/UFE70.pdf
http://www.unicode.org/charts/PDF/UFFF0.pdf
And any of these characters may be found, individually, with the simple regex
\x{NNNN}
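As a rough illustration of searching by code point, the sketch below collects DEL and the format characters from the table into one Python character class and reports where they occur. It is only an approximation under stated assumptions: the list covers just the code points shown above, and `FORMAT_CHARS` and `find_format_chars` are hypothetical names, not anything from Notepad++.

```python
import re

# DEL plus the format characters listed in the table (not exhaustive).
FORMAT_CHARS = re.compile(
    r"[\x7f\u200c-\u200f\u202a-\u202e\u206a-\u206f\ufeff\ufff9-\ufffb]"
)


def find_format_chars(text: str) -> list[tuple[int, str]]:
    """Return (offset, 'U+NNNN') for every listed character found in text."""
    return [
        (m.start(), f"U+{ord(m.group()):04X}")
        for m in FORMAT_CHARS.finditer(text)
    ]


# A leading ZWNBSP (BOM), a left-to-right mark and a DEL hidden in the text.
print(find_format_chars("\ufeffabc\u200ed\x7f"))
# prints [(0, 'U+FEFF'), (4, 'U+200E'), (6, 'U+007F')]
```

The reported `U+NNNN` value can then be typed into the Notepad++ Find dialog as `\x{NNNN}`.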
Notes:
- In the Replace dialog, you must select the Regular expression search mode
- Of course, if you want to delete all the “black background” characters, just leave the Replace with: box EMPTY!
Best Regards,
guy038
PS:
The last column of the two tables gives the Unicode General Category property (Cc means a Control character and Cf means a Format character!)
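The General Category can also be queried directly. The sketch below uses Python's standard `unicodedata` module to list every Cc or Cf character in a string, skipping CR, LF and TAB. Treating "Cc or Cf" as equivalent to what `[[:cntrl:]]` matches in Notepad++ is an assumption on my part, and `invisible_chars` is a made-up helper name.

```python
import unicodedata

# Line ends and tabs are Cc too, but we normally want to keep them.
KEEP = {"\r", "\n", "\t"}


def invisible_chars(text: str) -> list[tuple[int, str, str]]:
    """Return (offset, 'U+NNNN', category) for each Cc/Cf character."""
    hits = []
    for offset, ch in enumerate(text):
        category = unicodedata.category(ch)
        if category in ("Cc", "Cf") and ch not in KEEP:
            hits.append((offset, f"U+{ord(ch):04X}", category))
    return hits


print(invisible_chars("ä\x85\u200f\n"))
# prints [(1, 'U+0085', 'Cc'), (2, 'U+200F', 'Cf')]
```

Note that Cf contains more characters than the table shows, for example the soft hyphen U+00AD, so check the report before deleting anything.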
-
Hi guy038,
thanks for the excellent response
Best regards,
Diedrich
-
Thanks for the solution, It helped me to replace that unwanted data for many records at a time. Very helpful.
-
Thank you soo much for your instructions! They saved me a whole bunch of time!