How to search for unknown 3-digit characters with black background



  • I have a UTF-8 encoded text file with about 500,000 lines. Surprisingly, I found several different unknown characters, each shown as a 3-character label in white letters on a black background, e.g. NEL, SSA, SPA, xC3, IND, STS and some others.
    How can I find those characters to replace them? Umlauts and other language-specific characters should remain.
    I already tried the TextFX plugin, but it also changed all the umlauts and other European special characters to ##.



  • Hello, @diedrich-hesmer, and All

    An explanation of the C0 and C1 control codes can be found on Wikipedia:

    https://en.wikipedia.org/wiki/C0_and_C1_control_codes

    And also on the Unicode Consortium site:

    http://www.unicode.org/charts/PDF/U0000.pdf

    http://www.unicode.org/charts/PDF/U0080.pdf

    Here is a summary of these characters:

        •--------•--------•-------------------------------------------•------•
        |  Code  | Glyph  |  Character Name                           | Cat. |
        •--------•--------•-------------------------------------------•------•
        |  0000  |  NUL   |  NULL                          \0         |  Cc  |
        |  0001  |  SOH   |  START OF HEADING                         |  Cc  |
        |  0002  |  STX   |  START OF TEXT                            |  Cc  |
        |  0003  |  ETX   |  END OF TEXT                              |  Cc  |
        |  0004  |  EOT   |  END OF TRANSMISSION                      |  Cc  |
        |  0005  |  ENQ   |  ENQUIRY                                  |  Cc  |
        |  0006  |  ACK   |  ACKNOWLEDGE                              |  Cc  |
        |  0007  |  BEL   |  BELL                                     |  Cc  |
        |  0008  |  BS    |  BACKSPACE                    [\b]        |  Cc  |
        |  0009  |  HT    |  CHARACTER TABULATION          \t         |  Cc  |
        |  000A  |  LF    |  LINE FEED                     \n         |  Cc  |
        |  000B  |  VT    |  LINE TABULATION              [\v]        |  Cc  |
        |  000C  |  FF    |  FORM FEED                     \f         |  Cc  |
        |  000D  |  CR    |  CARRIAGE RETURN               \r         |  Cc  |
        |  000E  |  SO    |  SHIFT OUT                                |  Cc  |
        |  000F  |  SI    |  SHIFT IN                                 |  Cc  |
        |  0010  |  DLE   |  DATA LINK ESCAPE                         |  Cc  |
        |  0011  |  DC1   |  DEVICE CONTROL ONE                       |  Cc  |
        |  0012  |  DC2   |  DEVICE CONTROL TWO                       |  Cc  |
        |  0013  |  DC3   |  DEVICE CONTROL THREE                     |  Cc  |
        |  0014  |  DC4   |  DEVICE CONTROL FOUR                      |  Cc  |
        |  0015  |  NAK   |  NEGATIVE ACKNOWLEDGE                     |  Cc  |
        |  0016  |  SYN   |  SYNCHRONOUS IDLE                         |  Cc  |
        |  0017  |  ETB   |  END OF TRANSMISSION BLOCK                |  Cc  |
        |  0018  |  CAN   |  CANCEL                                   |  Cc  |
        |  0019  |  EM    |  END OF MEDIUM                            |  Cc  |
        |  001A  |  SUB   |  SUBSTITUTE                               |  Cc  |
        |  001B  |  ESC   |  ESCAPE                                   |  Cc  |
        |  001C  |  FS    |  INFORMATION SEPARATOR FOUR               |  Cc  |
        |  001D  |  GS    |  INFORMATION SEPARATOR THREE              |  Cc  |
        |  001E  |  RS    |  INFORMATION SEPARATOR TWO                |  Cc  |
        |  001F  |  US    |  INFORMATION SEPARATOR ONE                |  Cc  |
        •--------•--------•-------------------------------------------•------•
        |  0080  |  PAD   |  PADDING CHARACTER                        |  Cc  |
        |  0081  |  HOP   |  HIGH OCTET PRESET                        |  Cc  |
        |  0082  |  BPH   |  BREAK PERMITTED HERE                     |  Cc  |
        |  0083  |  NBH   |  NO BREAK HERE                            |  Cc  |
        |  0084  |  IND   |  INDEX                                    |  Cc  |
        |  0085  |  NEL   |  NEXT LINE (NEL)                          |  Cc  |
        |  0086  |  SSA   |  START OF SELECTED AREA                   |  Cc  |
        |  0087  |  ESA   |  END OF SELECTED AREA                     |  Cc  |
        |  0088  |  HTS   |  CHARACTER TABULATION SET                 |  Cc  |
        |  0089  |  HTJ   |  CHARACTER TABULATION WITH JUSTIFICATION  |  Cc  |
        |  008A  |  VTS   |  LINE TABULATION SET                      |  Cc  |
        |  008B  |  PLD   |  PARTIAL LINE FORWARD / DOWN              |  Cc  |
        |  008C  |  PLU   |  PARTIAL LINE BACKWARD / UP               |  Cc  |
        |  008D  |  RI    |  REVERSE LINE FEED / REVERSE INDEX        |  Cc  |
        |  008E  |  SS2   |  SINGLE SHIFT TWO                         |  Cc  |
        |  008F  |  SS3   |  SINGLE SHIFT THREE                       |  Cc  |
        |  0090  |  DCS   |  DEVICE CONTROL STRING                    |  Cc  |
        |  0091  |  PU1   |  PRIVATE USE ONE                          |  Cc  |
        |  0092  |  PU2   |  PRIVATE USE TWO                          |  Cc  |
        |  0093  |  STS   |  SET TRANSMIT STATE                       |  Cc  |
        |  0094  |  CCH   |  CANCEL CHARACTER                         |  Cc  |
        |  0095  |  MW    |  MESSAGE WAITING                          |  Cc  |
        |  0096  |  SPA   |  START OF PROTECTED AREA                  |  Cc  |
        |  0097  |  EPA   |  END OF PROTECTED AREA                    |  Cc  |
        |  0098  |  SOS   |  START OF STRING                          |  Cc  |
        |  0099  |  SGCI  |  SINGLE GRAPHIC CHARACTER INTRODUCER      |  Cc  |
        |  009A  |  SCI   |  SINGLE CHARACTER INTRODUCER              |  Cc  |
        |  009B  |  CSI   |  CONTROL SEQUENCE INTRODUCER              |  Cc  |
        |  009C  |  ST    |  STRING TERMINATOR                        |  Cc  |
        |  009D  |  OSC   |  OPERATING SYSTEM COMMAND                 |  Cc  |
        |  009E  |  PM    |  PRIVACY MESSAGE                          |  Cc  |
        |  009F  |  APC   |  APPLICATION PROGRAM COMMAND              |  Cc  |
        •--------•--------•-------------------------------------------•------•
    

    In a UTF-8 encoded file, these characters are displayed as two to four white letters on a black background. Each one can be found individually with a regular expression of the form \xNN, \x{NN} or \x{00NN}, where N is a hexadecimal digit from 0 to F. For example, \x85 or \x{0085} finds the NEL character.

    In addition, some C0 control characters can be searched for by their escape sequence: \t, \0, \n, …

    Finally, any C0 or C1 control character can be found with the simple character class [\x00-\x1f\x80-\x9f]. However, this regex will also find the End of Line characters \r and \n!

    So I suppose you'll prefer this regex, (?!\r|\n)[\x00-\x1f\x80-\x9f], whose negative look-ahead skips these EOL characters :-)) If you want to skip the tabulation character too, just change it to (?!\r|\n|\t)[\x00-\x1f\x80-\x9f]
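    If you ever need to do the same clean-up outside Notepad++, here is a minimal Python sketch of that regex. It is only an illustration: the function name strip_controls and the sample text are made up for this example, and Python's re module is a different regex engine from the one in Notepad++.

```python
import re

# Same idea as the regex above: any C0 or C1 control code except CR and LF.
# The look-ahead (?![\r\n]) rejects the two End of Line characters.
CONTROL_EXCEPT_EOL = re.compile(r"(?![\r\n])[\x00-\x1f\x80-\x9f]")

def strip_controls(text: str) -> str:
    """Remove C0/C1 control codes but keep line endings (hypothetical helper)."""
    return CONTROL_EXCEPT_EOL.sub("", text)

# Made-up sample: umlauts mixed with NEL (U+0085), SPA (U+0096) and NUL.
sample = "Gr\u00fc\u0085\u00dfe\r\nZeile\u0096 2\x00\n"
cleaned = strip_controls(sample)
print(repr(cleaned))
```

    The umlauts survive because ü (U+00FC) and ß (U+00DF) lie outside both ranges. Note that this version also removes tabs; add \t to the look-ahead class to keep them.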


    Another possibility is the POSIX character class [:cntrl:], which must itself be enclosed in the usual square brackets. So our regex becomes (?!\r|\n)[[:cntrl:]]. This regex will match all the C0 and C1 control codes described above, but also some practically invisible format characters, as well as the common DELETE character ( \x7f )

    Here is a non-exhaustive summary of these format characters, together with DELETE:

        •--------•--------•-------------------------------------------•------•
        |  007F  |  DEL   |  DELETE                                   |  Cc  |
        •--------•--------•-------------------------------------------•------•
        |  200C  |  ZWNJ  |  ZERO WIDTH NON-JOINER                    |  Cf  |
        |  200D  |  ZWJ   |  ZERO WIDTH JOINER                        |  Cf  |
        |  200E  |  LRM   |  LEFT-TO-RIGHT MARK                       |  Cf  |
        |  200F  |  RLM   |  RIGHT-TO-LEFT MARK                       |  Cf  |
        •--------•--------•-------------------------------------------•------•
        |  202A  |  LRE   |  LEFT-TO-RIGHT EMBEDDING                  |  Cf  |
        |  202B  |  RLE   |  RIGHT-TO-LEFT EMBEDDING                  |  Cf  |
        |  202C  |  PDF   |  POP DIRECTIONAL FORMATTING               |  Cf  |
        |  202D  |  LRO   |  LEFT-TO-RIGHT OVERRIDE                   |  Cf  |
        |  202E  |  RLO   |  RIGHT-TO-LEFT OVERRIDE                   |  Cf  |
        •--------•--------•-------------------------------------------•------•
        |  206A  |  ISS   |  INHIBIT SYMMETRIC SWAPPING               |  Cf  |
        |  206B  |  ASS   |  ACTIVATE SYMMETRIC SWAPPING              |  Cf  |
        |  206C  |  IAFS  |  INHIBIT ARABIC FORM SHAPING              |  Cf  |
        |  206D  |  AAFS  |  ACTIVATE ARABIC FORM SHAPING             |  Cf  |
        |  206E  |  NADS  |  NATIONAL DIGIT SHAPES                    |  Cf  |
        |  206F  |  NODS  |  NOMINAL DIGIT SHAPES                     |  Cf  |
        •--------•--------•-------------------------------------------•------•
        |  FEFF  | ZWNBSP |  ZERO WIDTH NO-BREAK SPACE                |  Cf  |
        •--------•--------•-------------------------------------------•------•
        |  FFF9  |  IAA   |  INTERLINEAR ANNOTATION ANCHOR            |  Cf  |
        |  FFFA  |  IAS   |  INTERLINEAR ANNOTATION SEPARATOR         |  Cf  |
        |  FFFB  |  IAT   |  INTERLINEAR ANNOTATION TERMINATOR        |  Cf  |
        •--------•--------•-------------------------------------------•------•
    

    These characters are described at the following links:

    http://www.unicode.org/charts/PDF/U2000.pdf

    http://www.unicode.org/charts/PDF/UFE70.pdf

    http://www.unicode.org/charts/PDF/UFFF0.pdf

    Any of these characters can be found individually with the simple regex \x{NNNN}, for example \x{200E} for the LEFT-TO-RIGHT MARK.
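    To see where such invisible format characters hide before replacing them, a small Python sketch like the one below can list them. It is an assumption-laden illustration, not a Notepad++ feature: the name find_format_chars is hypothetical, and the character class only covers the code points from the table above.

```python
import unicodedata
import re

# Character class built from the format characters listed in the table above.
FORMAT_CHARS = re.compile(
    "[\u200c-\u200f\u202a-\u202e\u206a-\u206f\ufeff\ufff9-\ufffb]"
)

def find_format_chars(text: str):
    """Return (line, column, code point, name) for each hit (hypothetical helper)."""
    hits = []
    # Split on LF only, so that other separators are not treated as line breaks.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in FORMAT_CHARS.finditer(line):
            ch = match.group()
            hits.append(
                (line_no, match.start() + 1, f"U+{ord(ch):04X}", unicodedata.name(ch, "?"))
            )
    return hits

for hit in find_format_chars("ab\u200dc\nxyz\ufeff"):
    print(hit)
```

    Line and column numbers are 1-based, so a reported hit can be located by hand in the editor.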


    Notes :

    • In the Replace dialog, you must select the Regular expression search mode

    • Of course, if you want to delete all the “black background” characters, just leave the Replace with: box EMPTY !

    Best Regards,

    guy038

    PS :

    The last column of the two tables gives the Unicode General Category property (Cc means a Control character and Cf means a Format character).
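    As a side note, the General Category can also be queried from a script. The sketch below uses Python's standard unicodedata module to count the Cc and Cf characters in a string; the function name category_counts and the sample text are invented for the illustration.

```python
import unicodedata

def category_counts(text: str) -> dict:
    """Count Cc and Cf characters, ignoring CR and LF (hypothetical helper)."""
    counts = {}
    for ch in text:
        cat = unicodedata.category(ch)
        if cat in ("Cc", "Cf") and ch not in "\r\n":
            key = (f"U+{ord(ch):04X}", cat)
            counts[key] = counts.get(key, 0) + 1
    return counts

# Made-up sample: two NEL, one LEFT-TO-RIGHT MARK, one tab, then a CRLF.
sample_counts = category_counts("a\u0085b\u0085\u200e\tc\r\n")
print(sample_counts)
```

    Such a count gives a quick overview of which codes are present before deciding on a replacement regex.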



  • Hi guy038,
    thanks for the excellent response

    Best regards,
    Diedrich



  • Thanks for the solution. It helped me to replace that unwanted data in many records at once. Very helpful.



  • Thank you so much for your instructions! They saved me a whole bunch of time!

