
    Clean up text of non-printing characters

    • m-fessler

      Hello everyone,

      I keep stumbling around non-printing characters such as zero-width space, soft hyphen…
      They are extremely annoying, especially for (permitted/legal) copy/paste actions from web content.

      Is there a way to convert these characters into other visible ones using Notepad++?
      Is there an add-on (or a 3rd party tool) with which I can set the appropriate rules and thus “clean” text?

      Thank you in advance for any tips!
      Regards, Martin

      • Alan Kilborn @m-fessler

        @m-fessler

        I don’t know of any “add on” that does exactly what you’re asking for.

        But really, you’re just asking for a kind of “replace from a list of find/replace pairs”, and you can find many scripts and macros that do this if you search this site. One recently active thread that has one is https://community.notepad-plus-plus.org/topic/23638, but there are many others.

        If you feel like doing a little hacking of shortcuts.xml, you could build your own macro via text editing to do a series of replacements; example:

                <Macro name="Make multiple replacements" Ctrl="no" Alt="no" Shift="no" Key="0">
                    <!-- 1700 = start a search command; 1601 = "Find what" text; 1625 = search mode (0 = normal); -->
                    <!-- 1602 = "Replace with" text; 1702 = option flags (768 = wrap around + forward direction); -->
                    <!-- 1701 = run the command given in lParam (1609 = Replace All) -->
                    <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
                    <Action type="3" message="1601" wParam="0" lParam="0" sParam="find1" />
                    <Action type="3" message="1625" wParam="0" lParam="0" sParam="" />
                    <Action type="3" message="1602" wParam="0" lParam="0" sParam="replace1" />
                    <Action type="3" message="1702" wParam="0" lParam="768" sParam="" />
                    <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
                    <!-- Second replacement: the same six actions with a different pair -->
                    <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
                    <Action type="3" message="1601" wParam="0" lParam="0" sParam="find2" />
                    <Action type="3" message="1625" wParam="0" lParam="0" sParam="" />
                    <Action type="3" message="1602" wParam="0" lParam="0" sParam="replace2" />
                    <Action type="3" message="1702" wParam="0" lParam="768" sParam="" />
                    <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
                </Macro>
        

        Here, each replace operation is contained between the lines containing 1700 and 1701. So if you copy and paste that group of lines to below the last “Action” line, you’d define a third replacement. You’d obviously change “find1”, “replace1”, etc. to be a substitution pair that you’d want to make. You could build up a big set of replacements this way.
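
If the list of pairs gets long, generating the Action lines may be less error-prone than copying and pasting them. The following is a minimal Python sketch of that idea, meant to be run outside Notepad++. The function names `action` and `build_macro` are made up for illustration, and the message numbers are simply copied from the macro above; the sample pairs (zero-width space and soft hyphen) are only an example of what one might want to remove.

```python
from xml.sax.saxutils import quoteattr


def action(message, lparam=0, sparam=""):
    """Return one <Action> line in the same shape as the macro above."""
    return (
        f'<Action type="3" message="{message}" wParam="0" '
        f'lParam="{lparam}" sParam={quoteattr(sparam)} />'
    )


def build_macro(name, pairs):
    """Build a <Macro> element that runs one Replace All per (find, replace) pair."""
    lines = [f'<Macro name={quoteattr(name)} Ctrl="no" Alt="no" Shift="no" Key="0">']
    for find, replace in pairs:
        lines += [
            "    " + action(1700),                 # start a new search command
            "    " + action(1601, sparam=find),    # "Find what" text
            "    " + action(1625),                 # search mode, 0 as in the macro above
            "    " + action(1602, sparam=replace), # "Replace with" text
            "    " + action(1702, lparam=768),     # option flags as in the macro above
            "    " + action(1701, lparam=1609),    # execute Replace All
        ]
    lines.append("</Macro>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example pairs (an assumption): drop zero-width spaces and soft hyphens.
    print(build_macro("Strip invisibles", [("\u200b", ""), ("\u00ad", "")]))
```

The printed element would then be pasted into shortcuts.xml by hand, just like the hand-written macro. `quoteattr` takes care of escaping characters such as `&` or `<` in the search text.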

        • mathlete2

          Just to add to @Alan-Kilborn's suggestion: depending on the nature of the characters that you need to find/replace, you may need to change the 1625 messages to use lParam="2" so that regular expressions are used. Other advanced details of this Action code can be found here.

          • Alan Kilborn @mathlete2

            @mathlete2 said in Clean up text of non-printing characters:

            you may need to change the 1625 messages to use lParam="2"

            I was not going to complicate a technique that’s already a bit esoteric with something like THAT, especially since the need for it wasn’t in what the OP expressed. For the most part, I try to stick to proposing solutions that solve the stated problem, not some “somebody might need this” stretch.

            • mathlete2 @Alan Kilborn

              @Alan-Kilborn said in Clean up text of non-printing characters:

              the need for it wasn’t in what the OP expressed

              It’s true that the OP didn’t explicitly state that regex support was needed, but it’s also true that it didn’t explicitly state that regex wasn’t needed. Since you have already directed the user to the Macro code in shortcuts.xml, it seemed worthwhile to mention a simple tweak that is commonly used in these sorts of situations.

              • Alan Kilborn @mathlete2

                @mathlete2

                Arguably “match case” or “whole word” is even more important than the search mode.
                And again, I didn’t mention those either because the need just wasn’t there…

                • mathlete2 @Alan Kilborn

                  @Alan-Kilborn said in Clean up text of non-printing characters:

                  Arguably “match case” or “whole word” is even more important than the search mode.

                  Agreed, which is one of the reasons why I added a link to the Action code page; it’s the one that documents these sorts of things.

                  I specifically mentioned the regex setting because it’s a very useful one that users might not think to look for. Even if they do, they may have trouble finding where it is mentioned when scrolling or searching through the page manually; I certainly do, so I thought users would appreciate explicit instructions for setting it.

                  • Coises @m-fessler

                    @m-fessler said in Clean up text of non-printing characters:

                    I keep stumbling around non-printing characters such as zero-width space, soft hyphen…
                    They are extremely annoying, especially for (permitted/legal) copy/paste actions from web content.

                    Is there a way to convert these characters into other visible ones using Notepad++?

                    If you just want to see them, so you can clean them up manually, select View | Show Symbol | Show Non-Printing Characters (or Show All Characters).

                    Otherwise, if you want to replace them with something else, try this: Select Search | Replace; then, in the dialog, enter:

                    Find what: [^[:graph:] \r\n\t]|\xad (don’t miss that there is a space between :] and \r)
                    Replace with: (empty, or whatever you want)
                    Wrap around: checked
                    Search Mode: Regular expression

                    and click Replace All.
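
For text that is cleaned outside Notepad++, a similar effect can be sketched in Python. This is an approximation and not a port of the regex above: Python's `re` module has no `[:graph:]` class, so the sketch goes by Unicode general category instead. The function name `reveal_invisibles` and the `<U+XXXX>` marker format are invented for this example. By default each hidden character is turned into a visible marker, which is what the original question asked for; passing an empty replacement removes the characters.

```python
import unicodedata

# Whitespace that should stay as it is (the same exceptions as in the regex above).
KEEP = {" ", "\t", "\r", "\n"}

# General categories treated as "non-printing" in this sketch (an assumption).
HIDDEN_CATEGORIES = {"Cc", "Cf", "Zs", "Zl", "Zp"}


def reveal_invisibles(text, replacement=None):
    """Replace hidden characters with a visible <U+XXXX> marker, or with `replacement`."""
    out = []
    for ch in text:
        if ch not in KEEP and unicodedata.category(ch) in HIDDEN_CATEGORIES:
            out.append(f"<U+{ord(ch):04X}>" if replacement is None else replacement)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    sample = "zero\u200bwidth and soft\u00adhyphen"
    print(reveal_invisibles(sample))      # show where the hidden characters are
    print(reveal_invisibles(sample, ""))  # strip them
```

Which categories count as hidden is a judgment call; the no-break space (category Zs), for instance, is flagged here and may be something you would rather keep.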

                    • guy038

                      Hello, @m-fessler, @mathlete2, @alan-kilborn, @coises and All,

                      @m-fessler, below is a list of all the special Unicode characters that belong to one of the following:

                      • The Z separator category ( Zs, Zl and Zp categories )

                      • The Cc Control character category ( except for the TAB, LF and CR ones )

                      • The Cf Format character category

                      • Two So Other Symbol characters ( \x{FFFC} and \x{FFFD} )

                      This list contains 121 characters

                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   Code  |        Regex       |                 Character                  |  Abbre.  |  GC  | Chr.   |
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   0000  |      \x{0000}      |  NULL                                      |  NUL     |  Cc  |   
                          |   0001  |      \x{0001}      |  START OF HEADING                          |  SOH     |  Cc  |  
                          |   0002  |      \x{0002}      |  START OF TEXT                             |  STX     |  Cc  |  
                          |   0003  |      \x{0003}      |  END OF TEXT                               |  ETX     |  Cc  |  
                          |   0004  |      \x{0004}      |  END OF TRANSMISSION                       |  EOT     |  Cc  |  
                          |   0005  |      \x{0005}      |  ENQUIRY                                   |  ENQ     |  Cc  |  
                          |   0006  |      \x{0006}      |  ACKNOWLEDGE                               |  ACK     |  Cc  |  
                          |   0007  |      \x{0007}      |  BELL                                      |  BEL     |  Cc  |  
                          |   0008  |      \x{0008}      |  BACKSPACE                                 |  BS      |  Cc  |  
                          |   000B  |      \x{000B}      |  VERTICAL TABULATION                       |  VT      |  Cc  |  
                          |   000C  |      \x{000C}      |  FORM FEED                                 |  FF      |  Cc  |  
                          |   000E  |      \x{000E}      |  SHIFT OUT                                 |  SO      |  Cc  |  
                          |   000F  |      \x{000F}      |  SHIFT IN                                  |  SI      |  Cc  |  
                          |   0010  |      \x{0010}      |  DATA LINK ESCAPE                          |  DLE     |  Cc  |  
                          |   0011  |      \x{0011}      |  DEVICE CONTROL ONE                        |  DC1     |  Cc  |  
                          |   0012  |      \x{0012}      |  DEVICE CONTROL TWO                        |  DC2     |  Cc  |  
                          |   0013  |      \x{0013}      |  DEVICE CONTROL THREE                      |  DC3     |  Cc  |  
                          |   0014  |      \x{0014}      |  DEVICE CONTROL FOUR                       |  DC4     |  Cc  |  
                          |   0015  |      \x{0015}      |  NEGATIVE ACKNOWLEDGE                      |  NAK     |  Cc  |  
                          |   0016  |      \x{0016}      |  SYNCHRONOUS IDLE                          |  SYN     |  Cc  |  
                          |   0017  |      \x{0017}      |  END OF TRANSMISSION BLOCK                 |  ETB     |  Cc  |  
                          |   0018  |      \x{0018}      |  CANCEL                                    |  CAN     |  Cc  |  
                          |   0019  |      \x{0019}      |  END OF MEDIUM                             |  EM      |  Cc  |  
                          |   001A  |      \x{001A}      |  SUBSTITUTE                                |  SUB     |  Cc  |  
                          |   001B  |      \x{001B}      |  ESCAPE                                    |  ESC     |  Cc  |  
                          |   001C  |      \x{001C}      |  FILE SEPARATOR                            |  FS      |  Cc  |  
                          |   001D  |      \x{001D}      |  GROUP SEPARATOR                           |  GS      |  Cc  |  
                          |   001E  |      \x{001E}      |  RECORD SEPARATOR                          |  RS      |  Cc  |  
                          |   001F  |      \x{001F}      |  UNIT SEPARATOR                            |  US      |  Cc  |  
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   007F  |      \x{007F}      |  DELETE                                    |  DEL     |  Cc  |  
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   0080  |      \x{0080}      |  PADDING CHARACTER                         |  PAD     |  Cc  |
                          |   0081  |      \x{0081}      |  HIGH OCTET PRESET                         |  HOP     |  Cc  |
                          |   0082  |      \x{0082}      |  BREAK PERMITTED HERE                      |  BPH     |  Cc  |
                          |   0083  |      \x{0083}      |  NO BREAK HERE                             |  NBH     |  Cc  |
                          |   0084  |      \x{0084}      |  INDEX                                     |  IND     |  Cc  |
                          |   0085  |      \x{0085}      |  NEXT LINE                                 |  NEL     |  Cc  |
                          |   0086  |      \x{0086}      |  START OF SELECTED AREA                    |  SSA     |  Cc  |
                          |   0087  |      \x{0087}      |  END OF SELECTED AREA                      |  ESA     |  Cc  |
                          |   0088  |      \x{0088}      |  HORIZONTAL TABULATION SET                 |  HTS     |  Cc  |
                          |   0089  |      \x{0089}      |  HORIZONTAL TABULATION WITH JUSTIFICATION  |  HTJ     |  Cc  |
                          |   008A  |      \x{008A}      |  VERTICAL TABULATION SET                   |  VTS     |  Cc  |
                          |   008B  |      \x{008B}      |  PARTIAL LINE DOWN                         |  PLD     |  Cc  |
                          |   008C  |      \x{008C}      |  PARTIAL LINE UP                           |  PLU     |  Cc  |
                          |   008D  |      \x{008D}      |  REVERSE INDEX                             |  RI      |  Cc  |
                          |   008E  |      \x{008E}      |  SINGLE-SHIFT 2                            |  SS2     |  Cc  |
                          |   008F  |      \x{008F}      |  SINGLE-SHIFT 3                            |  SS3     |  Cc  |
                          |   0090  |      \x{0090}      |  DEVICE CONTROL STRING                     |  DCS     |  Cc  |
                          |   0091  |      \x{0091}      |  PRIVATE USE 1                             |  PU1     |  Cc  |
                          |   0092  |      \x{0092}      |  PRIVATE USE 2                             |  PU2     |  Cc  |
                          |   0093  |      \x{0093}      |  SET TRANSMIT STATE                        |  STS     |  Cc  |
                          |   0094  |      \x{0094}      |  CANCEL CHARACTER                          |  CCH     |  Cc  |
                          |   0095  |      \x{0095}      |  MESSAGE WAITING                           |  MW      |  Cc  |
                          |   0096  |      \x{0096}      |  START OF PROTECTED AREA                   |  SPA     |  Cc  |
                          |   0097  |      \x{0097}      |  END OF PROTECTED AREA                     |  EPA     |  Cc  |
                          |   0098  |      \x{0098}      |  START OF STRING                           |  SOS     |  Cc  |
                          |   0099  |      \x{0099}      |  SINGLE GRAPHIC CHARACTER INTRODUCER       |  SGCI    |  Cc  |
                          |   009A  |      \x{009A}      |  SINGLE CHARACTER INTRODUCER               |  SCI     |  Cc  |
                          |   009B  |      \x{009B}      |  CONTROL SEQUENCE INTRODUCER               |  CSI     |  Cc  |
                          |   009C  |      \x{009C}      |  STRING TERMINATOR                         |  ST      |  Cc  |
                          |   009D  |      \x{009D}      |  OPERATING SYSTEM COMMAND                  |  OSC     |  Cc  |
                          |   009E  |      \x{009E}      |  PRIVACY MESSAGE                           |  PM      |  Cc  |
                          |   009F  |      \x{009F}      |  APPLICATION PROGRAM COMMAND               |  APC     |  Cc  |
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   00A0  |      \x{00A0}      |  NO-BREAK SPACE                            |  NBSP    |  Zs  |   
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   00AD  |      \x{00AD}      |  SOFT HYPHEN                               |  SHY     |  Cf  |  ­
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   061C  |      \x{061C}      |  ARABIC LETTER MARK                        |  ALM     |  Cf  |  ؜
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   070F  |      \x{070F}      |  SYRIAC ABBREVIATION MARK                  |  SAM     |  Cf  |  ܏
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   0890  |      \x{0890}      |  ARABIC POUND MARK ABOVE                   |          |  Cf  |  ࢐
                          |   0891  |      \x{0891}      |  ARABIC PIASTRE MARK ABOVE                 |          |  Cf  |  ࢑
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   1680  |      \x{1680}      |  OGHAM SPACE MARK                          |  OSPM    |  Zs  |   
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   180E  |      \x{180E}      |  MONGOLIAN VOWEL SEPARATOR                 |  MVS     |  Cf  |  ᠎
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   2000  |      \x{2000}      |  EN QUAD                                   |  NQSP    |  Zs  |   
                          |   2001  |      \x{2001}      |  EM QUAD                                   |  MQSP    |  Zs  |   
                          |   2002  |      \x{2002}      |  EN SPACE                                  |  ENSP    |  Zs  |   
                          |   2003  |      \x{2003}      |  EM SPACE                                  |  EMSP    |  Zs  |   
                          |   2004  |      \x{2004}      |  THREE-PER-EM SPACE                        |  3/MSP   |  Zs  |   
                          |   2005  |      \x{2005}      |  FOUR-PER-EM SPACE                         |  4/MSP   |  Zs  |   
                          |   2006  |      \x{2006}      |  SIX-PER-EM SPACE                          |  6/MSP   |  Zs  |   
                          |   2007  |      \x{2007}      |  FIGURE SPACE                              |  FSP     |  Zs  |   
                          |   2008  |      \x{2008}      |  PUNCTUATION SPACE                         |  PSP     |  Zs  |   
                          |   2009  |      \x{2009}      |  THIN SPACE                                |  THSP    |  Zs  |   
                          |   200A  |      \x{200A}      |  HAIR SPACE                                |  HSP     |  Zs  |   
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   200B  |      \x{200B}      |  ZERO WIDTH SPACE                          |  ZWSP    |  Cf  |  ​
                          |   200C  |      \x{200C}      |  ZERO WIDTH NON-JOINER                     |  ZWNJ    |  Cf  |  ‌
                          |   200D  |      \x{200D}      |  ZERO WIDTH JOINER                         |  ZWJ     |  Cf  |  ‍
                          |   200E  |      \x{200E}      |  LEFT-TO-RIGHT MARK                        |  LRM     |  Cf  |  ‎
                          |   200F  |      \x{200F}      |  RIGHT-TO-LEFT MARK                        |  RLM     |  Cf  |  ‏
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   2028  |      \x{2028}      |  LINE SEPARATOR                            |  LS      |  Zl  |
                          |   2029  |      \x{2029}      |  PARAGRAPH SEPARATOR                       |  PS      |  Zp  |
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   202A  |      \x{202A}      |  LEFT-TO-RIGHT EMBEDDING                   |  LRE     |  Cf  |  ‪
                          |   202B  |      \x{202B}      |  RIGHT-TO-LEFT EMBEDDING                   |  RLE     |  Cf  |  ‫
                          |   202C  |      \x{202C}      |  POP DIRECTIONAL FORMATTING                |  PDF     |  Cf  |  ‬
                          |   202D  |      \x{202D}      |  LEFT-TO-RIGHT OVERRIDE                    |  LRO     |  Cf  |  ‭
                          |   202E  |      \x{202E}      |  RIGHT-TO-LEFT OVERRIDE                    |  RLO     |  Cf  |
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   202F  |      \x{202F}      |  NARROW NO-BREAK SPACE                     |  NNBSP   |  Zs  |   
                          |   205F  |      \x{205F}      |  MEDIUM MATHEMATICAL SPACE                 |  MMSP    |  Zs  |   
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   2060  |      \x{2060}      |  WORD JOINER                               |  WJ      |  Cf  |  ⁠
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   2061  |      \x{2061}      |  FUNCTION APPLICATION                      |  (FA)    |  Cf  |  ⁡
                          |   2062  |      \x{2062}      |  INVISIBLE TIMES                           |  (IT)    |  Cf  |  ⁢
                          |   2063  |      \x{2063}      |  INVISIBLE SEPARATOR                       |  (IS)    |  Cf  |  ⁣
                          |   2064  |      \x{2064}      |  INVISIBLE PLUS                            |  (IP)    |  Cf  |  ⁤
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   2066  |      \x{2066}      |  LEFT-TO-RIGHT ISOLATE                     |  LRI     |  Cf  |  ⁦
                          |   2067  |      \x{2067}      |  RIGHT-TO-LEFT ISOLATE                     |  RLI     |  Cf  |  ⁧
                          |   2068  |      \x{2068}      |  FIRST STRONG ISOLATE                      |  FSI     |  Cf  |  ⁨
                          |   2069  |      \x{2069}      |  POP DIRECTIONAL ISOLATE                   |  PDI     |  Cf  |  ⁩
                          |   206A  |      \x{206A}      |  INHIBIT SYMMETRIC SWAPPING                |  ISS     |  Cf  |  
                          |   206B  |      \x{206B}      |  ACTIVATE SYMMETRIC SWAPPING               |  ASS     |  Cf  |  
                          |   206C  |      \x{206C}      |  INHIBIT ARABIC FORM SHAPING               |  IAFS    |  Cf  |  
                          |   206D  |      \x{206D}      |  ACTIVATE ARABIC FORM SHAPING              |  AAFS    |  Cf  |  
                          |   206E  |      \x{206E}      |  NATIONAL DIGIT SHAPES                     |  NADS    |  Cf  |  
                          |   206F  |      \x{206F}      |  NOMINAL DIGIT SHAPES                      |  NODS    |  Cf  |  
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   3000  |      \x{3000}      |  IDEOGRAPHIC SPACE                         |  IDSP    |  Zs  |   
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   FEFF  |      \x{FEFF}      |  ZERO WIDTH NO-BREAK SPACE                 |  ZWNBSP  |  Cf  |  
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   FFF9  |      \x{FFF9}      |  INTERLINEAR ANNOTATION ANCHOR             |  IAA     |  Cf  |  
                          |   FFFA  |      \x{FFFA}      |  INTERLINEAR ANNOTATION SEPARATOR          |  IAS     |  Cf  |  
                          |   FFFB  |      \x{FFFB}      |  INTERLINEAR ANNOTATION TERMINATOR         |  IAT     |  Cf  |  
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |   FFFC  |      \x{FFFC}      |  OBJECT REPLACEMENT CHARACTER              |  OBJ     |  So  |  
                          |   FFFD  |      \x{FFFD}      |  REPLACEMENT CHARACTER                     |  ?       |  So  |  �
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                          |  1BCA0  |  \x{D82F}\x{DCA0}  |  SHORTHAND FORMAT LETTER OVERLAP           |  SFLO    |  Cf  |  𛲠
                          |  1BCA1  |  \x{D82F}\x{DCA1}  |  SHORTHAND FORMAT CONTINUING OVERLAP       |  SFCO    |  Cf  |  𛲡
                          |  1BCA2  |  \x{D82F}\x{DCA2}  |  SHORTHAND FORMAT DOWN STEP                |  SFDS    |  Cf  |  𛲢
                          |  1BCA3  |  \x{D82F}\x{DCA3}  |  SHORTHAND FORMAT UP STEP                  |  SFUS    |  Cf  |  𛲣
                          •---------•--------------------•--------------------------------------------•----------•------•--------•
                      

                      From this list, @m-fessler, which characters do you want to search for, mark, or replace?

                      Also, do you want to ignore all characters above the BMP (that is, above \x{FFFF}), or do you treat them as normal characters?
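
For reference, the two-part entries in the Regex column of the table, such as \x{D82F}\x{DCA0} for code point 1BCA0, are UTF-16 surrogate pairs. The arithmetic behind them can be sketched in a few lines of Python; the function name `npp_escape` is made up for this illustration.

```python
def npp_escape(code_point):
    """Return the \\x{....} notation from the table for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0xFFFF:
        # Inside the BMP a single \x{XXXX} escape is enough.
        return f"\\x{{{code_point:04X}}}"
    # Above the BMP: split into a high and a low UTF-16 surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return f"\\x{{{high:04X}}}\\x{{{low:04X}}}"


if __name__ == "__main__":
    print(npp_escape(0x200B))   # ZERO WIDTH SPACE
    print(npp_escape(0x1BCA0))  # SHORTHAND FORMAT LETTER OVERLAP
```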

                      Once you know which characters you want to target, it will be easy to build the appropriate regex search!
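
As a rough illustration of that last step, here is a small Python sketch that turns a chosen set of BMP code points into a character class written in the \x{XXXX} notation of the table, collapsing consecutive code points into ranges. The names `esc` and `build_char_class` are hypothetical, and the selection in the example (soft hyphen plus the zero-width characters) is only a guess at what one might pick. Characters above the BMP are left out on purpose, since the table writes them as two-part escapes that do not fit a simple class.

```python
def esc(code_point):
    """Format one BMP code point as \\x{XXXX}."""
    return f"\\x{{{code_point:04X}}}"


def build_char_class(code_points):
    """Build a regex character class such as [\\x{00AD}\\x{200B}-\\x{200D}]."""
    cps = sorted(set(code_points))
    if not cps:
        raise ValueError("no code points given")
    if cps[0] < 0 or cps[-1] > 0xFFFF:
        raise ValueError("this sketch only handles BMP code points")

    # Group consecutive code points into (first, last) runs.
    runs = []
    start = prev = cps[0]
    for cp in cps[1:]:
        if cp != prev + 1:
            runs.append((start, prev))
            start = cp
        prev = cp
    runs.append((start, prev))

    parts = [esc(lo) if lo == hi else f"{esc(lo)}-{esc(hi)}" for lo, hi in runs]
    return "[" + "".join(parts) + "]"


if __name__ == "__main__":
    # Hypothetical selection: soft hyphen, ZWSP, ZWNJ, ZWJ.
    print(build_char_class([0x00AD, 0x200B, 0x200C, 0x200D]))
```

The resulting string would go into the Find what field with Search Mode set to Regular expression, as in the earlier reply.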

                      Best Regards,

                      guy038
