    Any way to replace all Non ASCII characters i.e. all x80 or greater within a text file?

    • Jaack McMahon

      I have been getting text files written on non-standard keyboards (non-US character sets). The apostrophe, which should be ' (hex 27), shows up as ’, i.e. the hex bytes E2 80 99 (the UTF-8 encoding of U+2019, RIGHT SINGLE QUOTATION MARK).

      Task #1 I want to be able to find all characters greater than x7F, i.e. x80 or greater, in text files.

      Task #2 Once found then I can fix or replace them with a more standard ASCII char(s).
      Any macro or other way to do these tasks?

      Thanks Jaack

      • guy038

        Hello, @jaack-mcmahon, and All,

        Here, below, is a non-exhaustive table of some Unicode characters, with code points above 007Fh, taken from the following Unicode blocks :

        • Latin 1 Supplement
        • General Punctuation
        • Mathematical Operators
        • Miscellaneous Symbols
        • Specials

        which can be replaced by a similar standard ASCII character, with code-point < 0080h :

        +--------------------------------------------------------------+---------------------------------------------+
        |           NON-ASCII Character with Code > \x{007F}           |  Similar Character(s) with Code < \x{0080}  |
        +--------------------------------------------------------------+---------------------------------------------+
        |  Code  | Char |                     Character Name           |  Code  |  Char   |      Character Name      |
        +--------+------+----------------------------------------------+--------+---------+--------------------------+
        |  00A0  |   	|  NO-BREAK SPACE                              |  0020  |         |  SPACE                   |
        |  00A6  |  ¦	|  BROKEN BAR                                  |  007C  |     |   |  VERTICAL LINE           |
        |  00AB  |  «	|  LEFT-POINTING DOUBLE ANGLE QUOTATION MARK   |  0022  |     "   |  QUOTATION MARK          |
        |  00AD  |  ­	|  SOFT HYPHEN                                 |  002D  |     -   |  HYPHEN-MINUS            |
        |  00B4  |  ´	|  ACUTE ACCENT                                |  0027  |     '   |  APOSTROPHE              |
        |  00B7  |  ·	|  MIDDLE DOT                                  |  002E  |     .   |  FULL STOP               |
        |  00BB  |  »	|  RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK  |  0022  |     "   |  QUOTATION MARK          |
        |  00BC  |  ¼	|  VULGAR FRACTION ONE QUARTER                 |        |   1/4   |                          |
        |  00BD  |  ½	|  VULGAR FRACTION ONE HALF                    |        |   1/2   |                          |
        |  00BE  |  ¾	|  VULGAR FRACTION THREE QUARTERS              |        |   3/4   |                          |
        |  00D7  |  ×	|  MULTIPLICATION SIGN                         |  0078  |     x   |  LATIN SMALL LETTER X    |
        +--------+------+----------------------------------------------+--------+---------+--------------------------+
        |  2000  |   	|  EN QUAD                                     |        | \x20{2} |                          |
        |  2001  |   	|  EM QUAD                                     |        | \x20{4} |                          |
        |  2002  |   	|  EN SPACE                                    |        | \x20{2} |                          |
        |  2003  |   	|  EM SPACE                                    |        | \x20{4} |                          |
        |  2004  |   	|  THREE-PER-EM SPACE                          |  0020  |         |  SPACE                   |
        |  2005  |   	|  FOUR-PER-EM SPACE                           |  0020  |         |  SPACE                   |
        |  2007  |   	|  FIGURE SPACE                                |        | \x20{2} |                          |
        |  2008  |   	|  PUNCTUATION SPACE                           |  0020  |         |  SPACE                   |
        |  2010  |  ‐	|  HYPHEN                                      |  002D  |     -   |  HYPHEN-MINUS            |
        |  2011  |  ‑	|  NON-BREAKING HYPHEN                         |  002D  |     -   |  HYPHEN-MINUS            |
        |  2012  |  ‒	|  FIGURE DASH                                 |        |    --   |                          |
        |  2013  |  –	|  EN DASH                                     |  002D  |     -   |  HYPHEN-MINUS            |
        |  2014  |  —	|  EM DASH                                     |  002D  |     -   |  HYPHEN-MINUS            |
        |  2015  |  ―	|  HORIZONTAL BAR                              |  002D  |     -   |  HYPHEN-MINUS            |
        |  2016  |  ‖	|  DOUBLE VERTICAL LINE                        |        |    ||   |                          |
        |  2018  |  ‘	|  LEFT SINGLE QUOTATION MARK                  |  0027  |     '   |  APOSTROPHE              |
        |  2019  |  ’	|  RIGHT SINGLE QUOTATION MARK                 |  0027  |     '   |  APOSTROPHE              |
        |  201A  |  ‚	|  SINGLE LOW-9 QUOTATION MARK                 |  002C  |     ,   |  COMMA                   |
        |  201B  |  ‛	|  SINGLE HIGH-REVERSED-9 QUOTATION MARK       |  0060  |     `   |  GRAVE ACCENT            |
        |  201C  |  “	|  LEFT DOUBLE QUOTATION MARK                  |  0022  |     "   |  QUOTATION MARK          |
        |  201D  |  ”	|  RIGHT DOUBLE QUOTATION MARK                 |  0022  |     "   |  QUOTATION MARK          |
        |  201E  |  „	|  DOUBLE LOW-9 QUOTATION MARK                 |        |    ,,   |                          |
        |  201F  |  ‟	|  DOUBLE HIGH-REVERSED-9 QUOTATION MARK       |  0022  |     "   |  QUOTATION MARK          |
        |  2022  |  •	|  BULLET                                      |  002E  |     .   |  FULL STOP               |
        |  2024  |  ․	|  ONE DOT LEADER                              |  002E  |     .   |  FULL STOP               |
        |  2025  |  ‥	|  TWO DOT LEADER                              |        |    ..   |                          |
        |  2026  |  …	|  HORIZONTAL ELLIPSIS                         |        |   ...   |                          |
        |  2032  |  ′	|  PRIME                                       |  0027  |     '   |  APOSTROPHE              |
        |  2033  |  ″	|  DOUBLE PRIME                                |        |    ''   |                          |
        |  2034  |  ‴	|  TRIPLE PRIME                                |        |   '''   |                          |
        |  2035  |  ‵	|  REVERSED PRIME                              |  0060  |     `   |  GRAVE ACCENT            |
        |  2036  |  ‶	|  REVERSED DOUBLE PRIME                       |        |    ``   |                          |
        |  2037  |  ‷	|  REVERSED TRIPLE PRIME                       |        |   ```   |                          |
        |  2039  |  ‹	|  SINGLE LEFT-POINTING ANGLE QUOTATION MARK   |  003C  |     <   |  LESS-THAN SIGN          |
        |  203A  |  ›	|  SINGLE RIGHT-POINTING ANGLE QUOTATION MARK  |  003E  |     >   |  GREATER-THAN SIGN       |
        |  203D  |  ‽	|  INTERROBANG                                 |        |    !?   |                          |
        |  2044  |  ⁄	|  FRACTION SLASH                              |  002F  |     /   |  SOLIDUS                 |
        +--------+------+----------------------------------------------+--------+---------+--------------------------+
        |  2212  |  −	|  MINUS SIGN                                  |  002D  |     -   |  HYPHEN-MINUS            |
        |  2215  |  ∕	|  DIVISION SLASH                              |  002F  |     /   |  SOLIDUS                 |
        |  2216  |  ∖	|  SET MINUS                                   |  005C  |     \   |  REVERSE SOLIDUS         |
        |  2217  |  ∗	|  ASTERISK OPERATOR                           |  002A  |     *   |  ASTERISK                |
        |  2223  |  ∣	|  DIVIDES                                     |  007C  |     |   |  VERTICAL LINE           |
        |  2225  |  ∥	|  PARALLEL TO                                 |        |    ||   |                          |
        |  2227  |  ∧	|  LOGICAL AND                                 |  005E  |     ^   |  CIRCUMFLEX ACCENT       |
        |  2228  |  ∨	|  LOGICAL OR                                  |  0056  |     V   |  LATIN CAPITAL LETTER V  |
        |  222A  |  ∪	|  UNION                                       |  0055  |     U   |  LATIN CAPITAL LETTER U  |
        |  2236  |  ∶	|  RATIO                                       |  003A  |     :   |  COLON                   |
        |  2237  |  ∷	|  PROPORTION                                  |        |    ::   |                          |
        |  2239  |  ∹	|  EXCESS                                      |        |    -:   |                          |
        |  223C  |  ∼	|  TILDE OPERATOR                              |  007E  |     ~   |  TILDE                   |
        |  2254  |  ≔	|  COLON EQUALS                                |        |    :=   |                          |
        |  2255  |  ≕	|  EQUALS COLON                                |        |    =:   |                          |
        |  2264  |  ≤	|  LESS-THAN OR EQUAL TO                       |        |    <=   |                          |
        |  2265  |  ≥	|  GREATER-THAN OR EQUAL TO                    |        |    >=   |                          |
        |  226A  |  ≪	|  MUCH LESS-THAN                              |        |    <<   |                          |
        |  226B  |  ≫	|  MUCH GREATER-THAN                           |        |    >>   |                          |
        |  2276  |  ≶	|  LESS-THAN OR GREATER-THAN                   |        |   <|>   |                          |
        |  2277  |  ≷	|  GREATER-THAN OR LESS-THAN                   |        |   >|<   |                          |
        |  22C0  |  ⋀	|  N-ARY LOGICAL AND                           |  005E  |     ^   |  CIRCUMFLEX ACCENT       |
        |  22C1  |  ⋁	|  N-ARY LOGICAL OR                            |  0056  |     V   |  LATIN CAPITAL LETTER V  |
        |  22C3  |  ⋃	|  N-ARY UNION                                 |  0055  |     U   |  LATIN CAPITAL LETTER U  |
        |  22C5  |  ⋅	|  DOT OPERATOR                                |  002E  |     .   |  FULL STOP               |
        |  22C6  |  ⋆	|  STAR OPERATOR                               |  002A  |     *   |  ASTERISK                |
        |  22D8  |  ⋘	|  VERY MUCH LESS-THAN                         |        |   <<<   |                          |
        |  22D9  |  ⋙	|  VERY MUCH GREATER-THAN                      |        |   >>>   |                          |
        |  22EF  |  ⋯	|  MIDLINE HORIZONTAL ELLIPSIS                 |        |   ...   |                          |
        +--------+------+----------------------------------------------+--------+---------+--------------------------+
        |  2639  |  ☹	|  WHITE FROWNING FACE                         |        |   :-(   |                          |
        |  263A  |  ☺	|  WHITE SMILING FACE                          |        |   :-)   |                          |
        +--------+------+----------------------------------------------+--------+---------+--------------------------+
        |  FFFD  |  �	|  REPLACEMENT CHARACTER                       |  003F  |     ?   |  QUESTION MARK           |
        +--------+------+----------------------------------------------+--------+---------+--------------------------+
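
Outside Notepad++, the same table can drive a scripted replacement. The following is only a minimal Python sketch, assuming the text has already been decoded to a Unicode string. The names ASCII_FALLBACKS and to_ascii_lookalikes are hypothetical, and the dictionary holds just a handful of rows from the table above, not all of them:

```python
# Hypothetical subset of the table above: code point -> ASCII look-alike.
# Extend it with whichever rows matter for your files.
ASCII_FALLBACKS = {
    0x00A0: " ",     # NO-BREAK SPACE
    0x00A6: "|",     # BROKEN BAR
    0x00BD: "1/2",   # VULGAR FRACTION ONE HALF
    0x2013: "-",     # EN DASH
    0x2014: "-",     # EM DASH
    0x2018: "'",     # LEFT SINGLE QUOTATION MARK
    0x2019: "'",     # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',     # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',     # RIGHT DOUBLE QUOTATION MARK
    0x2026: "...",   # HORIZONTAL ELLIPSIS
    0x2264: "<=",    # LESS-THAN OR EQUAL TO
    0x2265: ">=",    # GREATER-THAN OR EQUAL TO
    0xFFFD: "?",     # REPLACEMENT CHARACTER
}


def to_ascii_lookalikes(text: str) -> str:
    """Replace every mapped character; anything not in the table is left as-is."""
    return text.translate(ASCII_FALLBACKS)


sample = "\u201cIt\u2019s \u00bd done\u2026\u201d x \u2264 3"
print(to_ascii_lookalikes(sample))
```

Characters that are not in the dictionary (accented letters, for instance) pass through unchanged, so the table decides exactly what gets touched.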
        

        Now, let’s suppose that you would like to replace the 14 Unicode characters listed below, on the left, with their similar standard ASCII character(s), on the right :

        |  00A6  |  ¦	|  BROKEN BAR                                  |  007C  |     |   |  VERTICAL LINE           |
        |  00BD  |  ½	|  VULGAR FRACTION ONE HALF                    |        |   1/2   |                          |
        |  2000  |   	|  EN QUAD                                     |        | \x20{2} |                          |
        |  2001  |   	|  EM QUAD                                     |        | \x20{4} |                          |
        |  2018  |  ‘	|  LEFT SINGLE QUOTATION MARK                  |  0027  |     '   |  APOSTROPHE              |
        |  2019  |  ’	|  RIGHT SINGLE QUOTATION MARK                 |  0027  |     '   |  APOSTROPHE              |
        |  201C  |  “	|  LEFT DOUBLE QUOTATION MARK                  |  0022  |     "   |  QUOTATION MARK          |
        |  201D  |  ”	|  RIGHT DOUBLE QUOTATION MARK                 |  0022  |     "   |  QUOTATION MARK          |
        |  203D  |  ‽	|  INTERROBANG                                 |        |    !?   |                          |
        |  2264  |  ≤	|  LESS-THAN OR EQUAL TO                       |        |    <=   |                          |
        |  2265  |  ≥	|  GREATER-THAN OR EQUAL TO                    |        |    >=   |                          |
        |  2639  |  ☹	|  WHITE FROWNING FACE                         |        |   :-(   |                          |
        |  263A  |  ☺	|  WHITE SMILING FACE                          |        |   :-)   |                          |
        |  FFFD  |  �	|  REPLACEMENT CHARACTER                       |  003F  |     ?   |  QUESTION MARK           |
        

        Then :

        • Open the Replace dialog, in N++ ( Ctrl + H )

        • Type in the regex (¦)|(½)|( )|( )|(‘)|(’)|(“)|(”)|(‽)|(≤)|(≥)|(☹)|(☺)|(�), in the Find what: zone

        • Type in the regex (?1|)(?{2}1/2)(?3\x20\x20)(?4\x20\x20\x20\x20)(?5')(?6')(?7")(?8")(?9!?)(?{10}<=)(?{11}>=)(?{12}\:-\()(?{13}\:-\))(?{14}?), in the Replace with: zone

        • Tick the Wrap around option

        • Select the Regular expression search mode

        • Click once on the Replace All button, or several times on the Replace button

        Et voilà !
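
For comparison, the same idea (one capturing group per character, with the replacement chosen by whichever group matched) can be sketched in Python. This is only an illustration under assumptions: Python's re module has no conditional replacement syntax like the one used above, so a callback stands in for it, and the names REPLACEMENTS, PATTERN and replace_all are hypothetical:

```python
import re

# The 14 pairs from the list above, in the same order as the groups.
REPLACEMENTS = [
    ("\u00A6", "|"),      # BROKEN BAR
    ("\u00BD", "1/2"),    # VULGAR FRACTION ONE HALF
    ("\u2000", "  "),     # EN QUAD -> 2 spaces
    ("\u2001", "    "),   # EM QUAD -> 4 spaces
    ("\u2018", "'"),      # LEFT SINGLE QUOTATION MARK
    ("\u2019", "'"),      # RIGHT SINGLE QUOTATION MARK
    ("\u201C", '"'),      # LEFT DOUBLE QUOTATION MARK
    ("\u201D", '"'),      # RIGHT DOUBLE QUOTATION MARK
    ("\u203D", "!?"),     # INTERROBANG
    ("\u2264", "<="),     # LESS-THAN OR EQUAL TO
    ("\u2265", ">="),     # GREATER-THAN OR EQUAL TO
    ("\u2639", ":-("),    # WHITE FROWNING FACE
    ("\u263A", ":-)"),    # WHITE SMILING FACE
    ("\uFFFD", "?"),      # REPLACEMENT CHARACTER
]

# Builds (c1)|(c2)|...|(c14): one capturing group per character.
PATTERN = re.compile("|".join(f"({re.escape(ch)})" for ch, _ in REPLACEMENTS))


def replace_all(text: str) -> str:
    """Replace each match with the string paired with the group that matched."""
    # m.lastindex is the number of the single group that took part in the match.
    return PATTERN.sub(lambda m: REPLACEMENTS[m.lastindex - 1][1], text)


print(replace_all("a \u2264 b \u203D \u263A"))
```

Writing the invisible EN QUAD and EM QUAD as \u2000 and \u2001 escapes keeps them distinguishable from ordinary spaces in the source.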


        Notes :

        • In the search, we simply put each character to be replaced between parentheses, so that it is stored as group 1, 2 and so on…

        • In replacement, we use a special conditional syntax (?#xxxx:yyyy) or (?{#..#}xxxx:yyyy), where :

          • # or #...# represents a group number

          • The part xxxx is written if group # or #...# matched

          • The part yyyy is written if group # or #...# did not match

        • In our case, the ELSE part, in each conditional replacement, is not present

        • If a part xxxx or yyyy contains the character :, ( or ), it must be escaped ( preceded ) with a \ symbol

        • For the second conditional replacement, I used the syntax (?{2}1/2) on purpose ! Indeed, if I had used the (?21/2) syntax, the regex engine would have wrongly read it as a reference to group 21, to be replaced with the /2 string !!

        • Finally, note that quantifiers, such as {#}, do not work in the replacement. So we need to change, for instance, the \x20{2} syntax ( 2 space characters ) to the plain \x20\x20 one !

        Best Regards,

        guy038
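
Task #1 of the question (locating every character at x80 or above) is not covered by the replacement above. Within Notepad++, searching for the character class [^\x00-\x7F] in Regular expression mode should find them. As a scripted alternative, here is a minimal Python sketch that assumes the file content has already been read and decoded as UTF-8; the function name find_non_ascii is hypothetical:

```python
import re

# Any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def find_non_ascii(text: str):
    """Return (line, column, char, code point, UTF-8 bytes in hex) per hit.

    Line and column numbers are 1-based.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for m in NON_ASCII.finditer(line):
            ch = m.group()
            utf8_hex = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
            hits.append((line_no, m.start() + 1, ch, f"U+{ord(ch):04X}", utf8_hex))
    return hits


for hit in find_non_ascii("It\u2019s fine\nall ASCII here"):
    print(hit)
```

For the right single quotation mark from the question, the sketch reports U+2019 together with the bytes E2 80 99, which is the sequence seen in the hex view.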
