    How to search for unknown 3-digit characters with black background

    • Diedrich Hesmer

      I have a UTF-8 encoded text file with about 500,000 lines. Surprisingly, I found several different unknown characters in it, each displayed as a three-letter code in white on a black background, e.g. NEL, SSA, SPA, xC3, IND, STS and some others.
      How can I find those characters in order to replace them? Umlauts and other language-specific characters should remain untouched.
      I already tried the TextFX plugin, but it also changed all the umlauts and other European special characters to ##.

      • guy038

        Hello, @diedrich-hesmer, and All

        Explanations of the C0 and C1 control codes can be found on Wikipedia:

        https://en.wikipedia.org/wiki/C0_and_C1_control_codes

        And also on the Unicode Consortium site:

        http://www.unicode.org/charts/PDF/U0000.pdf

        http://www.unicode.org/charts/PDF/U0080.pdf

        Here is a summary of these characters:

            •--------•--------•-------------------------------------------•------•
            |  Code  | Glyph  |  Character Name                           | Cat. |
            •--------•--------•-------------------------------------------•------•
            |  0000  |  NUL   |  NULL                          \0         |  Cc  |
            |  0001  |  SOH   |  START OF HEADING                         |  Cc  |
            |  0002  |  STX   |  START OF TEXT                            |  Cc  |
            |  0003  |  ETX   |  END OF TEXT                              |  Cc  |
            |  0004  |  EOT   |  END OF TRANSMISSION                      |  Cc  |
            |  0005  |  ENQ   |  ENQUIRY                                  |  Cc  |
            |  0006  |  ACK   |  ACKNOWLEDGE                              |  Cc  |
            |  0007  |  BEL   |  BELL                                     |  Cc  |
            |  0008  |  BS    |  BACKSPACE                    [\b]        |  Cc  |
            |  0009  |  HT    |  CHARACTER TABULATION          \t         |  Cc  |
            |  000A  |  LF    |  LINE FEED                     \n         |  Cc  |
            |  000B  |  VT    |  LINE TABULATION              [\v]        |  Cc  |
            |  000C  |  FF    |  FORM FEED                     \f         |  Cc  |
            |  000D  |  CR    |  CARRIAGE RETURN               \r         |  Cc  |
            |  000E  |  SO    |  SHIFT OUT                                |  Cc  |
            |  000F  |  SI    |  SHIFT IN                                 |  Cc  |
            |  0010  |  DLE   |  DATA LINK ESCAPE                         |  Cc  |
            |  0011  |  DC1   |  DEVICE CONTROL ONE                       |  Cc  |
            |  0012  |  DC2   |  DEVICE CONTROL TWO                       |  Cc  |
            |  0013  |  DC3   |  DEVICE CONTROL THREE                     |  Cc  |
            |  0014  |  DC4   |  DEVICE CONTROL FOUR                      |  Cc  |
            |  0015  |  NAK   |  NEGATIVE ACKNOWLEDGE                     |  Cc  |
            |  0016  |  SYN   |  SYNCHRONOUS IDLE                         |  Cc  |
            |  0017  |  ETB   |  END OF TRANSMISSION BLOCK                |  Cc  |
            |  0018  |  CAN   |  CANCEL                                   |  Cc  |
            |  0019  |  EM    |  END OF MEDIUM                            |  Cc  |
            |  001A  |  SUB   |  SUBSTITUTE                               |  Cc  |
            |  001B  |  ESC   |  ESCAPE                                   |  Cc  |
            |  001C  |  FS    |  INFORMATION SEPARATOR FOUR               |  Cc  |
            |  001D  |  GS    |  INFORMATION SEPARATOR THREE              |  Cc  |
            |  001E  |  RS    |  INFORMATION SEPARATOR TWO                |  Cc  |
            |  001F  |  US    |  INFORMATION SEPARATOR ONE                |  Cc  |
            •--------•--------•-------------------------------------------•------•
            |  0080  |  PAD   |  PADDING CHARACTER                        |  Cc  |
            |  0081  |  HOP   |  HIGH OCTET PRESET                        |  Cc  |
            |  0082  |  BPH   |  BREAK PERMITTED HERE                     |  Cc  |
            |  0083  |  NBH   |  NO BREAK HERE                            |  Cc  |
            |  0084  |  IND   |  INDEX                                    |  Cc  |
            |  0085  |  NEL   |  NEXT LINE (NEL)                          |  Cc  |
            |  0086  |  SSA   |  START OF SELECTED AREA                   |  Cc  |
            |  0087  |  ESA   |  END OF SELECTED AREA                     |  Cc  |
            |  0088  |  HTS   |  CHARACTER TABULATION SET                 |  Cc  |
            |  0089  |  HTJ   |  CHARACTER TABULATION WITH JUSTIFICATION  |  Cc  |
            |  008A  |  VTS   |  LINE TABULATION SET                      |  Cc  |
            |  008B  |  PLD   |  PARTIAL LINE FORWARD / DOWN              |  Cc  |
            |  008C  |  PLU   |  PARTIAL LINE BACKWARD / UP               |  Cc  |
            |  008D  |  RI    |  REVERSE LINE FEED / REVERSE INDEX        |  Cc  |
            |  008E  |  SS2   |  SINGLE SHIFT TWO                         |  Cc  |
            |  008F  |  SS3   |  SINGLE SHIFT THREE                       |  Cc  |
            |  0090  |  DCS   |  DEVICE CONTROL STRING                    |  Cc  |
            |  0091  |  PU1   |  PRIVATE USE ONE                          |  Cc  |
            |  0092  |  PU2   |  PRIVATE USE TWO                          |  Cc  |
            |  0093  |  STS   |  SET TRANSMIT STATE                       |  Cc  |
            |  0094  |  CCH   |  CANCEL CHARACTER                         |  Cc  |
            |  0095  |  MW    |  MESSAGE WAITING                          |  Cc  |
            |  0096  |  SPA   |  START OF PROTECTED AREA                  |  Cc  |
            |  0097  |  EPA   |  END OF PROTECTED AREA                    |  Cc  |
            |  0098  |  SOS   |  START OF STRING                          |  Cc  |
            |  0099  |  SGCI  |  SINGLE GRAPHIC CHARACTER INTRODUCER      |  Cc  |
            |  009A  |  SCI   |  SINGLE CHARACTER INTRODUCER              |  Cc  |
            |  009B  |  CSI   |  CONTROL SEQUENCE INTRODUCER              |  Cc  |
            |  009C  |  ST    |  STRING TERMINATOR                        |  Cc  |
            |  009D  |  OSC   |  OPERATING SYSTEM COMMAND                 |  Cc  |
            |  009E  |  PM    |  PRIVACY MESSAGE                          |  Cc  |
            |  009F  |  APC   |  APPLICATION PROGRAM COMMAND              |  Cc  |
            •--------•--------•-------------------------------------------•------•
        

        In a UTF-8 encoded file, these characters are displayed as two to four white letters on a black background. Each one can be found individually with a regular expression of the form \xNN, \x{NN} or \x{00NN}, where N is a hexadecimal digit from 0 to F.
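
Outside Notepad++, the same single-character search can be sketched in Python. This is only an illustration: the helper name `find_code_point` and the sample string are made up for the example, and Python's `re` module understands `\xNN` and `\uNNNN` rather than the braced `\x{NN}` form, so the pattern is built from the character itself.

```python
import re

def find_code_point(text: str, code_point: int) -> list[int]:
    """Return the offset of every occurrence of a single code point."""
    # re.escape keeps the pattern literal, whatever the character is.
    pattern = re.compile(re.escape(chr(code_point)))
    return [m.start() for m in pattern.finditer(text)]

# Hypothetical sample: German text with a NEL (U+0085) and an STS (U+0093).
sample = "Stra\u00dfe\u0085 M\u00fcnchen\u0093"

print(find_code_point(sample, 0x85))  # [6]
print(find_code_point(sample, 0x93))  # [15]
```

The text must already be decoded from UTF-8 (for example with `open(path, encoding="utf-8")`), so that each control character is one code point and not a pair of bytes.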

        In addition, some C0 control characters can be searched for by their escape sequence: \t, \0, \n, …

        Finally, any C0 or C1 control character can be found with the simple character class [\x00-\x1f\x80-\x9f]. However, this regex also matches the end-of-line characters \r and \n!

        So I suppose you’ll prefer the regex (?!\r|\n)[\x00-\x1f\x80-\x9f], which does not match these EOL characters :-)) If you want to skip the tabulation character too, just change it to (?!\r|\n|\t)[\x00-\x1f\x80-\x9f]
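
For anyone who wants to run the same clean-up from a script, here is a minimal Python sketch of the two regexes above. The function name `strip_controls`, its `keep_tabs` parameter and the sample string are assumptions made for this example; the patterns themselves are copied from the post.

```python
import re

# C0 and C1 controls, skipping CR/LF (and, in the second pattern, TAB too).
CONTROLS_EXCEPT_EOL = re.compile(r"(?!\r|\n)[\x00-\x1f\x80-\x9f]")
CONTROLS_EXCEPT_EOL_TAB = re.compile(r"(?!\r|\n|\t)[\x00-\x1f\x80-\x9f]")

def strip_controls(text: str, keep_tabs: bool = True) -> str:
    """Delete control characters while leaving line endings untouched."""
    pattern = CONTROLS_EXCEPT_EOL_TAB if keep_tabs else CONTROLS_EXCEPT_EOL
    return pattern.sub("", text)

# Hypothetical sample: umlauts, a tab, a NEL (U+0085) and an SPA (U+0096).
sample = "Gr\u00fc\u0085\u00dfe\tan\u0096 alle\r\n"

print(repr(strip_controls(sample)))                   # 'Grüße\tan alle\r\n'
print(repr(strip_controls(sample, keep_tabs=False)))  # 'Grüßean alle\r\n'
```

The umlauts survive because ü (U+00FC) and ß (U+00DF) lie outside the range \x80-\x9f once the file has been decoded as UTF-8.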


        Another possibility is the POSIX character class [:cntrl:], which must be enclosed in the usual square brackets, so the regex becomes (?!\r|\n)[[:cntrl:]]. This regex matches all the C0 and C1 control codes described above, but also some practically invisible format characters, as well as the common DELETE character ( \x7f ).
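
Python's `re` module has no POSIX classes such as [[:cntrl:]], so a script has to approximate this search differently. The sketch below filters on the Unicode general categories Cc and Cf via the standard `unicodedata` module; that this set matches exactly what Notepad++ treats as [[:cntrl:]] is an assumption, and the function names are invented for the example.

```python
import unicodedata

KEEP = {"\r", "\n"}  # line endings stay, as with the (?!\r|\n) look-ahead

def is_control_or_format(ch: str) -> bool:
    """True for Cc (control) and Cf (format) characters, except CR and LF."""
    return ch not in KEEP and unicodedata.category(ch) in {"Cc", "Cf"}

def remove_control_and_format(text: str) -> str:
    return "".join(ch for ch in text if not is_control_or_format(ch))

# Hypothetical sample: ZWNJ (U+200C), DEL (U+007F), NEL (U+0085), ZWNBSP (U+FEFF).
sample = "a\u200cb\x7fc\u0085\n\u00e4\ufeff"

print(repr(remove_control_and_format(sample)))  # 'abc\nä'
```

Note that the Cf category is broader than the table in this post; it also contains the soft hyphen (U+00AD), for instance, so it is worth checking what would be removed before deleting anything. Tabs are Cc as well and are removed here unless "\t" is added to `KEEP`.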

        Again, here is a summary, although not exhaustive, of these additional characters:

            •--------•--------•-------------------------------------------•------•
            |  007F  |  DEL   |  DELETE                                   |  Cc  |
            •--------•--------•-------------------------------------------•------•
            |  200C  |  ZWNJ  |  ZERO WIDTH NON-JOINER                    |  Cf  |
            |  200D  |  ZWJ   |  ZERO WIDTH JOINER                        |  Cf  |
            |  200E  |  LRM   |  LEFT-TO-RIGHT MARK                       |  Cf  |
            |  200F  |  RLM   |  RIGHT-TO-LEFT MARK                       |  Cf  |
            •--------•--------•-------------------------------------------•------•
            |  202A  |  LRE   |  LEFT-TO-RIGHT EMBEDDING                  |  Cf  |
            |  202B  |  RLE   |  RIGHT-TO-LEFT EMBEDDING                  |  Cf  |
            |  202C  |  PDF   |  POP DIRECTIONAL FORMATTING               |  Cf  |
            |  202D  |  LRO   |  LEFT-TO-RIGHT OVERRIDE                   |  Cf  |
            |  202E  |  RLO   |  RIGHT-TO-LEFT OVERRIDE                   |  Cf  |
            •--------•--------•-------------------------------------------•------•
            |  206A  |  ISS   |  INHIBIT SYMMETRIC SWAPPING               |  Cf  |
            |  206B  |  ASS   |  ACTIVATE SYMMETRIC SWAPPING              |  Cf  |
            |  206C  |  IAFS  |  INHIBIT ARABIC FORM SHAPING              |  Cf  |
            |  206D  |  AAFS  |  ACTIVATE ARABIC FORM SHAPING             |  Cf  |
            |  206E  |  NADS  |  NATIONAL DIGIT SHAPES                    |  Cf  |
            |  206F  |  NODS  |  NOMINAL DIGIT SHAPES                     |  Cf  |
            •--------•--------•-------------------------------------------•------•
            |  FEFF  | ZWNBSP |  ZERO WIDTH NO-BREAK SPACE                |  Cf  |
            •--------•--------•-------------------------------------------•------•
            |  FFF9  |  IAA   |  INTERLINEAR ANNOTATION ANCHOR            |  Cf  |
            |  FFFA  |  IAS   |  INTERLINEAR ANNOTATION SEPARATOR         |  Cf  |
            |  FFFB  |  IAT   |  INTERLINEAR ANNOTATION TERMINATOR        |  Cf  |
            •--------•--------•-------------------------------------------•------•
        

        You can find a description of these characters at the following links:

        http://www.unicode.org/charts/PDF/U2000.pdf

        http://www.unicode.org/charts/PDF/UFE70.pdf

        http://www.unicode.org/charts/PDF/UFFF0.pdf

        Any of these characters can be found individually with the simple regex \x{NNNN}
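
Before deleting anything in a large file, it can help to list which of these characters actually occur. Here is a small Python sketch, with a made-up helper name `inventory` and a made-up sample, that counts every Cc and Cf character other than CR, LF and TAB and reports its code point and Unicode name. Control characters carry no formal name in Python's `unicodedata`, so they are shown with the placeholder `<control>`.

```python
import unicodedata
from collections import Counter

SKIP = {"\r", "\n", "\t"}

def inventory(text: str) -> list[tuple[str, str, int]]:
    """List (code point, name, count) for each control/format character found."""
    counts = Counter(
        ch for ch in text
        if ch not in SKIP and unicodedata.category(ch) in {"Cc", "Cf"}
    )
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<control>"), n)
        for ch, n in sorted(counts.items())
    ]

# Hypothetical sample: two LRM (U+200E), one NEL (U+0085), one ZWNBSP (U+FEFF).
sample = "a\u200eb\u200e\u0085c\ufeff\n"

for row in inventory(sample):
    print(row)
# ('U+0085', '<control>', 1)
# ('U+200E', 'LEFT-TO-RIGHT MARK', 2)
# ('U+FEFF', 'ZERO WIDTH NO-BREAK SPACE', 1)
```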


        Notes :

        • In the Replace dialog, you must select the Regular expression search mode

        • Of course, if you want to delete all the “black background” characters, just leave the Replace with: box EMPTY!

        Best Regards,

        guy038

        PS :

        The last column of the two tables gives the Unicode General Category property (Cc means a Control character and Cf means a Format character).

        • Diedrich Hesmer

          Hi guy038,
          thanks for the excellent response

          Best regards,
          Diedrich

          • VamsiKrishna Penikalapati

            Thanks for the solution. It helped me replace the unwanted data in many records at once. Very helpful.

            • Cybelle Saffa

              Thank you so much for your instructions! They saved me a whole bunch of time!
