How to search for unknown 3-digit characters with black background

Diedrich Hesmer · Nov 22, 2017, 4:16 PM

I have a UTF-8 encoded text file with about 500,000 lines. Surprisingly, I found several different unknown characters in it, each displayed as 3 white letters on a black background, e.g. NEL, SSA, SPA, xC3, IND, STS and some others.
How can I find those characters in order to replace them? Umlauts and other language-specific characters should remain.
I already tried the TextFX plugin, but this also changed all the umlauts and other European special characters to ##.

guy038 · Nov 23, 2017, 7:56 PM

Hello, @diedrich-hesmer, and All

An explanation of the C0 and C1 control codes can be found on Wikipedia:

https://en.wikipedia.org/wiki/C0_and_C1_control_codes

And also in the Unicode Consortium's code charts:

http://www.unicode.org/charts/PDF/U0000.pdf

http://www.unicode.org/charts/PDF/U0080.pdf

Here is a summary of these characters:

    •--------•--------•-------------------------------------------•------•
    |  Code  | Glyph  |  Character Name                           | Cat. |
    •--------•--------•-------------------------------------------•------•
    |  0000  |  NUL   |  NULL                          \0         |  Cc  |
    |  0001  |  SOH   |  START OF HEADING                         |  Cc  |
    |  0002  |  STX   |  START OF TEXT                            |  Cc  |
    |  0003  |  ETX   |  END OF TEXT                              |  Cc  |
    |  0004  |  EOT   |  END OF TRANSMISSION                      |  Cc  |
    |  0005  |  ENQ   |  ENQUIRY                                  |  Cc  |
    |  0006  |  ACK   |  ACKNOWLEDGE                              |  Cc  |
    |  0007  |  BEL   |  BELL                                     |  Cc  |
    |  0008  |  BS    |  BACKSPACE                    [\b]        |  Cc  |
    |  0009  |  HT    |  CHARACTER TABULATION          \t         |  Cc  |
    |  000A  |  LF    |  LINE FEED                     \n         |  Cc  |
    |  000B  |  VT    |  LINE TABULATION              [\v]        |  Cc  |
    |  000C  |  FF    |  FORM FEED                     \f         |  Cc  |
    |  000D  |  CR    |  CARRIAGE RETURN               \r         |  Cc  |
    |  000E  |  SO    |  SHIFT OUT                                |  Cc  |
    |  000F  |  SI    |  SHIFT IN                                 |  Cc  |
    |  0010  |  DLE   |  DATA LINK ESCAPE                         |  Cc  |
    |  0011  |  DC1   |  DEVICE CONTROL ONE                       |  Cc  |
    |  0012  |  DC2   |  DEVICE CONTROL TWO                       |  Cc  |
    |  0013  |  DC3   |  DEVICE CONTROL THREE                     |  Cc  |
    |  0014  |  DC4   |  DEVICE CONTROL FOUR                      |  Cc  |
    |  0015  |  NAK   |  NEGATIVE ACKNOWLEDGE                     |  Cc  |
    |  0016  |  SYN   |  SYNCHRONOUS IDLE                         |  Cc  |
    |  0017  |  ETB   |  END OF TRANSMISSION BLOCK                |  Cc  |
    |  0018  |  CAN   |  CANCEL                                   |  Cc  |
    |  0019  |  EM    |  END OF MEDIUM                            |  Cc  |
    |  001A  |  SUB   |  SUBSTITUTE                               |  Cc  |
    |  001B  |  ESC   |  ESCAPE                                   |  Cc  |
    |  001C  |  FS    |  INFORMATION SEPARATOR FOUR               |  Cc  |
    |  001D  |  GS    |  INFORMATION SEPARATOR THREE              |  Cc  |
    |  001E  |  RS    |  INFORMATION SEPARATOR TWO                |  Cc  |
    |  001F  |  US    |  INFORMATION SEPARATOR ONE                |  Cc  |
    •--------•--------•-------------------------------------------•------•
    |  0080  |  PAD   |  PADDING CHARACTER                        |  Cc  |
    |  0081  |  HOP   |  HIGH OCTET PRESET                        |  Cc  |
    |  0082  |  BPH   |  BREAK PERMITTED HERE                     |  Cc  |
    |  0083  |  NBH   |  NO BREAK HERE                            |  Cc  |
    |  0084  |  IND   |  INDEX                                    |  Cc  |
    |  0085  |  NEL   |  NEXT LINE (NEL)                          |  Cc  |
    |  0086  |  SSA   |  START OF SELECTED AREA                   |  Cc  |
    |  0087  |  ESA   |  END OF SELECTED AREA                     |  Cc  |
    |  0088  |  HTS   |  CHARACTER TABULATION SET                 |  Cc  |
    |  0089  |  HTJ   |  CHARACTER TABULATION WITH JUSTIFICATION  |  Cc  |
    |  008A  |  VTS   |  LINE TABULATION SET                      |  Cc  |
    |  008B  |  PLD   |  PARTIAL LINE FORWARD / DOWN              |  Cc  |
    |  008C  |  PLU   |  PARTIAL LINE BACKWARD / UP               |  Cc  |
    |  008D  |  RI    |  REVERSE LINE FEED / REVERSE INDEX        |  Cc  |
    |  008E  |  SS2   |  SINGLE SHIFT TWO                         |  Cc  |
    |  008F  |  SS3   |  SINGLE SHIFT THREE                       |  Cc  |
    |  0090  |  DCS   |  DEVICE CONTROL STRING                    |  Cc  |
    |  0091  |  PU1   |  PRIVATE USE ONE                          |  Cc  |
    |  0092  |  PU2   |  PRIVATE USE TWO                          |  Cc  |
    |  0093  |  STS   |  SET TRANSMIT STATE                       |  Cc  |
    |  0094  |  CCH   |  CANCEL CHARACTER                         |  Cc  |
    |  0095  |  MW    |  MESSAGE WAITING                          |  Cc  |
    |  0096  |  SPA   |  START OF PROTECTED AREA                  |  Cc  |
    |  0097  |  EPA   |  END OF PROTECTED AREA                    |  Cc  |
    |  0098  |  SOS   |  START OF STRING                          |  Cc  |
    |  0099  |  SGCI  |  SINGLE GRAPHIC CHARACTER INTRODUCER      |  Cc  |
    |  009A  |  SCI   |  SINGLE CHARACTER INTRODUCER              |  Cc  |
    |  009B  |  CSI   |  CONTROL SEQUENCE INTRODUCER              |  Cc  |
    |  009C  |  ST    |  STRING TERMINATOR                        |  Cc  |
    |  009D  |  OSC   |  OPERATING SYSTEM COMMAND                 |  Cc  |
    |  009E  |  PM    |  PRIVACY MESSAGE                          |  Cc  |
    |  009F  |  APC   |  APPLICATION PROGRAM COMMAND              |  Cc  |
    •--------•--------•-------------------------------------------•------•

In a UTF-8 encoded file, these characters are displayed as two to four white letters on a black background. Each one can be found individually with a regular expression of the form \xNN, \x{NN} or \x{00NN}, where N is a hexadecimal digit from 0 to F.

In addition, some C0 control characters can be searched for by their escape sequence: \t, \0, \n, …

Finally, any C0 or C1 control character can be found with the simple character class [\x00-\x1f\x80-\x9f]. However, this regex also matches the end-of-line characters \r and \n!

So I suppose you’ll prefer the regex (?!\r|\n)[\x00-\x1f\x80-\x9f], which does not match these EOL characters :-)) If you want to skip the tab character too, just change it to (?!\r|\n|\t)[\x00-\x1f\x80-\x9f]
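For anyone who wants to try the same pattern outside Notepad++, here is a minimal Python sketch. It is not part of the original answer: the function name `strip_control_chars` and the sample string are made up for illustration. Python's `re` module accepts this particular pattern as written.

```python
import re

# Same pattern as above: any C0 or C1 control character,
# except CR, LF and TAB (excluded by the negative look-ahead).
CONTROL_CHARS = re.compile(r"(?!\r|\n|\t)[\x00-\x1f\x80-\x9f]")


def strip_control_chars(text: str, replacement: str = "") -> str:
    """Replace C0/C1 control characters, keeping CR, LF and TAB."""
    return CONTROL_CHARS.sub(replacement, text)


# Hypothetical sample: umlauts mixed with NEL (U+0085) and SPA (U+0096).
sample = "Gr\u00fc\u00dfe\x85 aus\x96 K\u00f6ln\tok\r\n"
cleaned = strip_control_chars(sample)
print(repr(cleaned))
```

The umlauts, the tab and the line ending are left alone; only NEL and SPA are removed. When reading a real file, opening it with `encoding="utf-8"` and `newline=""` should keep the original line endings untouched.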

Another possibility is the POSIX character class [:cntrl:], which must be enclosed in the usual square brackets, so our regex becomes (?!\r|\n)[[:cntrl:]]. This regex matches all the C0 and C1 control codes described above, but also some practically invisible format characters, as well as the common DELETE character (\x7f).

Again, here is a non-exhaustive summary of these format characters:

    •--------•--------•-------------------------------------------•------•
    |  007F  |  DEL   |  DELETE                                   |  Cc  |
    •--------•--------•-------------------------------------------•------•
    |  200C  |  ZWNJ  |  ZERO WIDTH NON-JOINER                    |  Cf  |
    |  200D  |  ZWJ   |  ZERO WIDTH JOINER                        |  Cf  |
    |  200E  |  LRM   |  LEFT-TO-RIGHT MARK                       |  Cf  |
    |  200F  |  RLM   |  RIGHT-TO-LEFT MARK                       |  Cf  |
    •--------•--------•-------------------------------------------•------•
    |  202A  |  LRE   |  LEFT-TO-RIGHT EMBEDDING                  |  Cf  |
    |  202B  |  RLE   |  RIGHT-TO-LEFT EMBEDDING                  |  Cf  |
    |  202C  |  PDF   |  POP DIRECTIONAL FORMATTING               |  Cf  |
    |  202D  |  LRO   |  LEFT-TO-RIGHT OVERRIDE                   |  Cf  |
    |  202E  |  RLO   |  RIGHT-TO-LEFT OVERRIDE                   |  Cf  |
    •--------•--------•-------------------------------------------•------•
    |  206A  |  ISS   |  INHIBIT SYMMETRIC SWAPPING               |  Cf  |
    |  206B  |  ASS   |  ACTIVATE SYMMETRIC SWAPPING              |  Cf  |
    |  206C  |  IAFS  |  INHIBIT ARABIC FORM SHAPING              |  Cf  |
    |  206D  |  AAFS  |  ACTIVATE ARABIC FORM SHAPING             |  Cf  |
    |  206E  |  NADS  |  NATIONAL DIGIT SHAPES                    |  Cf  |
    |  206F  |  NODS  |  NOMINAL DIGIT SHAPES                     |  Cf  |
    •--------•--------•-------------------------------------------•------•
    |  FEFF  | ZWNBSP |  ZERO WIDTH NO-BREAK SPACE                |  Cf  |
    •--------•--------•-------------------------------------------•------•
    |  FFF9  |  IAA   |  INTERLINEAR ANNOTATION ANCHOR            |  Cf  |
    |  FFFA  |  IAS   |  INTERLINEAR ANNOTATION SEPARATOR         |  Cf  |
    |  FFFB  |  IAT   |  INTERLINEAR ANNOTATION TERMINATOR        |  Cf  |
    •--------•--------•-------------------------------------------•------•

You can find a description of these characters at the following links:

http://www.unicode.org/charts/PDF/U2000.pdf

http://www.unicode.org/charts/PDF/UFE70.pdf

http://www.unicode.org/charts/PDF/UFFF0.pdf

Any of these characters can be found individually with the simple regex \x{NNNN}

Notes :

In the Replace dialog, you must select the Regular expression search mode
Of course, if you want to delete all the “black background” characters, just leave the Replace with: box EMPTY !

Best Regards,

guy038

PS :

The last column of the two tables gives the Unicode General Category property (Cc means a control character and Cf a format character).
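To see which of these characters a text actually contains before replacing anything, the General Category can be queried directly. The following is a rough Python sketch using the standard `unicodedata` module; the function name `count_control_and_format` and the sample string are assumptions made for this example.

```python
import unicodedata
from collections import Counter


def count_control_and_format(text: str) -> Counter:
    """Count characters of category Cc or Cf, ignoring CR, LF and TAB."""
    keep = {"\r", "\n", "\t"}
    return Counter(
        f"U+{ord(ch):04X}"
        for ch in text
        if ch not in keep and unicodedata.category(ch) in ("Cc", "Cf")
    )


# Hypothetical sample: two NEL (U+0085), one ZWNJ (U+200C), one umlaut.
sample = "a\x85b\u200cc\x85\u00e4\n"
counts = count_control_and_format(sample)
print(counts)
```

Each key is a code point that can be looked up in the tables above; umlauts and other letters belong to other categories and are not counted.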

Diedrich Hesmer · Nov 26, 2017, 3:48 PM

Hi guy038,
thanks for the excellent response

Best regards,
Diedrich

VamsiKrishna Penikalapati · Jul 18, 2019, 7:37 AM

Thanks for the solution. It helped me replace that unwanted data in many records at once. Very helpful.

Cybelle Saffa · Oct 16, 2019, 3:09 PM

Thank you so much for your instructions! They saved me a whole bunch of time!