Any way to replace all non-ASCII characters, i.e. all x80 or greater, within a text file?



  • I have been getting text files written on non-standard keyboards (non-USA character sets). The quote character ’, which should be hex 27, is showing up as the hex string E2 80 99.

    Task #1: I want to be able to find all characters greater than x7F, i.e. x80 or greater, in text files.

    Task #2: Once found, I can then fix or replace them with more standard ASCII character(s).
    Is there a macro or some other way to do these tasks?

    Thanks Jaack



  • Hello, @jaack-mcmahon, and All,

    Here is, below, a non-exhaustive table of some Unicode characters with code points above 007Fh, taken from the following Unicode blocks:

    • Latin-1 Supplement
    • General Punctuation
    • Mathematical Operators
    • Miscellaneous Symbols
    • Specials

    Each of them can be replaced by a similar standard ASCII character (or sequence of characters) with code point below 0080h:

    +--------------------------------------------------------------+---------------------------------------------+
    |           NON-ASCII Character with Code > \x{007F}           |  Similar Character(s) with Code < \x{0080}  |
    +--------------------------------------------------------------+---------------------------------------------+
    |  Code  | Char |                     Character Name           |  Code  |  Char   |      Character Name      |
    +--------+------+----------------------------------------------+--------+---------+--------------------------+
    |  00A0  |   	|  NO-BREAK SPACE                              |  0020  |         |  SPACE                   |
    |  00A6  |  ¦	|  BROKEN BAR                                  |  007C  |     |   |  VERTICAL LINE           |
    |  00AB  |  «	|  LEFT-POINTING DOUBLE ANGLE QUOTATION MARK   |  0022  |     "   |  QUOTATION MARK          |
    |  00AD  |  ­	|  SOFT HYPHEN                                 |  002D  |     -   |  HYPHEN-MINUS            |
    |  00B4  |  ´	|  ACUTE ACCENT                                |  0027  |     '   |  APOSTROPHE              |
    |  00B7  |  ·	|  MIDDLE DOT                                  |  002E  |     .   |  FULL STOP               |
    |  00BB  |  »	|  RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK  |  0022  |     "   |  QUOTATION MARK          |
    |  00BC  |  ¼	|  VULGAR FRACTION ONE QUARTER                 |        |   1/4   |                          |
    |  00BD  |  ½	|  VULGAR FRACTION ONE HALF                    |        |   1/2   |                          |
    |  00BE  |  ¾	|  VULGAR FRACTION THREE QUARTERS              |        |   3/4   |                          |
    |  00D7  |  ×	|  MULTIPLICATION SIGN                         |  0078  |     x   |  LATIN SMALL LETTER X    |
    +--------+------+----------------------------------------------+--------+---------+--------------------------+
    |  2000  |   	|  EN QUAD                                     |        | \x20{2} |                          |
    |  2001  |   	|  EM QUAD                                     |        | \x20{4} |                          |
    |  2002  |   	|  EN SPACE                                    |        | \x20{2} |                          |
    |  2003  |   	|  EM SPACE                                    |        | \x20{4} |                          |
    |  2004  |   	|  THREE-PER-EM SPACE                          |  0020  |         |  SPACE                   |
    |  2005  |   	|  FOUR-PER-EM SPACE                           |  0020  |         |  SPACE                   |
    |  2007  |   	|  FIGURE SPACE                                |        | \x20{2} |                          |
    |  2008  |   	|  PUNCTUATION SPACE                           |  0020  |         |  SPACE                   |
    |  2010  |  ‐	|  HYPHEN                                      |  002D  |     -   |  HYPHEN-MINUS            |
    |  2011  |  ‑	|  NON-BREAKING HYPHEN                         |  002D  |     -   |  HYPHEN-MINUS            |
    |  2012  |  ‒	|  FIGURE DASH                                 |        |    --   |                          |
    |  2013  |  –	|  EN DASH                                     |  002D  |     -   |  HYPHEN-MINUS            |
    |  2014  |  —	|  EM DASH                                     |  002D  |     -   |  HYPHEN-MINUS            |
    |  2015  |  ―	|  HORIZONTAL BAR                              |  002D  |     -   |  HYPHEN-MINUS            |
    |  2016  |  ‖	|  DOUBLE VERTICAL LINE                        |        |    ||   |                          |
    |  2018  |  ‘	|  LEFT SINGLE QUOTATION MARK                  |  0027  |     '   |  APOSTROPHE              |
    |  2019  |  ’	|  RIGHT SINGLE QUOTATION MARK                 |  0027  |     '   |  APOSTROPHE              |
    |  201A  |  ‚	|  SINGLE LOW-9 QUOTATION MARK                 |  002C  |     ,   |  COMMA                   |
    |  201B  |  ‛	|  SINGLE HIGH-REVERSED-9 QUOTATION MARK       |  0060  |     `   |  GRAVE ACCENT            |
    |  201C  |  “	|  LEFT DOUBLE QUOTATION MARK                  |  0022  |     "   |  QUOTATION MARK          |
    |  201D  |  ”	|  RIGHT DOUBLE QUOTATION MARK                 |  0022  |     "   |  QUOTATION MARK          |
    |  201E  |  „	|  DOUBLE LOW-9 QUOTATION MARK                 |        |    ,,   |                          |
    |  201F  |  ‟	|  DOUBLE HIGH-REVERSED-9 QUOTATION MARK       |  0022  |     "   |  QUOTATION MARK          |
    |  2022  |  •	|  BULLET                                      |  002E  |     .   |  FULL STOP               |
    |  2024  |  ․	|  ONE DOT LEADER                              |  002E  |     .   |  FULL STOP               |
    |  2025  |  ‥	|  TWO DOT LEADER                              |        |    ..   |                          |
    |  2026  |  …	|  HORIZONTAL ELLIPSIS                         |        |   ...   |                          |
    |  2032  |  ′	|  PRIME                                       |  0027  |     '   |  APOSTROPHE              |
    |  2033  |  ″	|  DOUBLE PRIME                                |        |    ''   |                          |
    |  2034  |  ‴	|  TRIPLE PRIME                                |        |   '''   |                          |
    |  2035  |  ‵	|  REVERSED PRIME                              |  0060  |     `   |  GRAVE ACCENT            |
    |  2036  |  ‶	|  REVERSED DOUBLE PRIME                       |        |    ``   |                          |
    |  2037  |  ‷	|  REVERSED TRIPLE PRIME                       |        |   ```   |                          |
    |  2039  |  ‹	|  SINGLE LEFT-POINTING ANGLE QUOTATION MARK   |  003C  |     <   |  LESS-THAN SIGN          |
    |  203A  |  ›	|  SINGLE RIGHT-POINTING ANGLE QUOTATION MARK  |  003E  |     >   |  GREATER-THAN SIGN       |
    |  203D  |  ‽	|  INTERROBANG                                 |        |    !?   |                          |
    |  2044  |  ⁄	|  FRACTION SLASH                              |  002F  |     /   |  SOLIDUS                 |
    +--------+------+----------------------------------------------+--------+---------+--------------------------+
    |  2212  |  −	|  MINUS SIGN                                  |  002D  |     -   |  HYPHEN-MINUS            |
    |  2215  |  ∕	|  DIVISION SLASH                              |  002F  |     /   |  SOLIDUS                 |
    |  2216  |  ∖	|  SET MINUS                                   |  005C  |     \   |  REVERSE SOLIDUS         |
    |  2217  |  ∗	|  ASTERISK OPERATOR                           |  002A  |     *   |  ASTERISK                |
    |  2223  |  ∣	|  DIVIDES                                     |  007C  |     |   |  VERTICAL LINE           |
    |  2225  |  ∥	|  PARALLEL TO                                 |        |    ||   |                          |
    |  2227  |  ∧	|  LOGICAL AND                                 |  005E  |     ^   |  CIRCUMFLEX ACCENT       |
    |  2228  |  ∨	|  LOGICAL OR                                  |  0056  |     V   |  LATIN CAPITAL LETTER V  |
    |  222A  |  ∪	|  UNION                                       |  0055  |     U   |  LATIN CAPITAL LETTER U  |
    |  2236  |  ∶	|  RATIO                                       |  003A  |     :   |  COLON                   |
    |  2237  |  ∷	|  PROPORTION                                  |        |    ::   |                          |
    |  2239  |  ∹	|  EXCESS                                      |        |    -:   |                          |
    |  223C  |  ∼	|  TILDE OPERATOR                              |  007E  |     ~   |  TILDE                   |
    |  2254  |  ≔	|  COLON EQUALS                                |        |    :=   |                          |
    |  2255  |  ≕	|  EQUALS COLON                                |        |    =:   |                          |
    |  2264  |  ≤	|  LESS-THAN OR EQUAL TO                       |        |    <=   |                          |
    |  2265  |  ≥	|  GREATER-THAN OR EQUAL TO                    |        |    >=   |                          |
    |  226A  |  ≪	|  MUCH LESS-THAN                              |        |    <<   |                          |
    |  226B  |  ≫	|  MUCH GREATER-THAN                           |        |    >>   |                          |
    |  2276  |  ≶	|  LESS-THAN OR GREATER-THAN                   |        |   <|>   |                          |
    |  2277  |  ≷	|  GREATER-THAN OR LESS-THAN                   |        |   >|<   |                          |
    |  22C0  |  ⋀	|  N-ARY LOGICAL AND                           |  005E  |     ^   |  CIRCUMFLEX ACCENT       |
    |  22C1  |  ⋁	|  N-ARY LOGICAL OR                            |  0056  |     V   |  LATIN CAPITAL LETTER V  |
    |  22C3  |  ⋃	|  N-ARY UNION                                 |  0055  |     U   |  LATIN CAPITAL LETTER U  |
    |  22C5  |  ⋅	|  DOT OPERATOR                                |  002E  |     .   |  FULL STOP               |
    |  22C6  |  ⋆	|  STAR OPERATOR                               |  002A  |     *   |  ASTERISK                |
    |  22D8  |  ⋘	|  VERY MUCH LESS-THAN                         |        |   <<<   |                          |
    |  22D9  |  ⋙	|  VERY MUCH GREATER-THAN                      |        |   >>>   |                          |
    |  22EF  |  ⋯	|  MIDLINE HORIZONTAL ELLIPSIS                 |        |   ...   |                          |
    +--------+------+----------------------------------------------+--------+---------+--------------------------+
    |  2639  |  ☹	|  WHITE FROWNING FACE                         |        |   :-(   |                          |
    |  263A  |  ☺	|  WHITE SMILING FACE                          |        |   :-)   |                          |
    +--------+------+----------------------------------------------+--------+---------+--------------------------+
    |  FFFD  |  �	|  REPLACEMENT CHARACTER                       |  003F  |     ?   |  QUESTION MARK           |
    +--------+------+----------------------------------------------+--------+---------+--------------------------+
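
    Outside Notepad++, the same idea can be sketched in a few lines of Python. This is only an illustration, not part of the original method: the names ASCII_FALLBACKS, find_non_ascii and to_ascii are hypothetical, and the dictionary holds just a handful of rows from the table above. The first helper addresses the "find everything above x7F" task, the second applies the replacements.

```python
# Hypothetical subset of the table above: code point -> ASCII replacement.
ASCII_FALLBACKS = {
    0x00A0: " ",    # NO-BREAK SPACE
    0x00A6: "|",    # BROKEN BAR
    0x00BD: "1/2",  # VULGAR FRACTION ONE HALF
    0x2013: "-",    # EN DASH
    0x2014: "-",    # EM DASH
    0x2018: "'",    # LEFT SINGLE QUOTATION MARK
    0x2019: "'",    # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',    # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',    # RIGHT DOUBLE QUOTATION MARK
    0x2026: "...",  # HORIZONTAL ELLIPSIS
    0x2264: "<=",   # LESS-THAN OR EQUAL TO
    0x2265: ">=",   # GREATER-THAN OR EQUAL TO
    0xFFFD: "?",    # REPLACEMENT CHARACTER
}


def find_non_ascii(text):
    """Return (line, column, character) for every character above U+007F."""
    hits = []
    # Split on "\n" only: str.splitlines() would also split on (and drop)
    # some non-ASCII separators such as U+0085 or U+2028.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                hits.append((line_no, col, ch))
    return hits


def to_ascii(text):
    """Replace the characters listed in ASCII_FALLBACKS, leave the rest alone."""
    return text.translate(ASCII_FALLBACKS)


sample = "It\u2019s \u201Cdone\u201D \u2026 caf\u00E9"
cleaned = to_ascii(sample)
leftovers = find_non_ascii(cleaned)  # characters the table does not cover
print(cleaned)
print(leftovers)
```

    Anything still reported in leftovers (here the é of "café") has no entry in the mapping and needs a manual decision.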
    

    Now, suppose that you would like to replace the 14 Unicode characters below, on the left, with their similar standard characters, on the right:

    |  00A6  |  ¦	|  BROKEN BAR                                  |  007C  |     |   |  VERTICAL LINE           |
    |  00BD  |  ½	|  VULGAR FRACTION ONE HALF                    |        |   1/2   |                          |
    |  2000  |   	|  EN QUAD                                     |        | \x20{2} |                          |
    |  2001  |   	|  EM QUAD                                     |        | \x20{4} |                          |
    |  2018  |  ‘	|  LEFT SINGLE QUOTATION MARK                  |  0027  |     '   |  APOSTROPHE              |
    |  2019  |  ’	|  RIGHT SINGLE QUOTATION MARK                 |  0027  |     '   |  APOSTROPHE              |
    |  201C  |  “	|  LEFT DOUBLE QUOTATION MARK                  |  0022  |     "   |  QUOTATION MARK          |
    |  201D  |  ”	|  RIGHT DOUBLE QUOTATION MARK                 |  0022  |     "   |  QUOTATION MARK          |
    |  203D  |  ‽	|  INTERROBANG                                 |        |    !?   |                          |
    |  2264  |  ≤	|  LESS-THAN OR EQUAL TO                       |        |    <=   |                          |
    |  2265  |  ≥	|  GREATER-THAN OR EQUAL TO                    |        |    >=   |                          |
    |  2639  |  ☹	|  WHITE FROWNING FACE                         |        |   :-(   |                          |
    |  263A  |  ☺	|  WHITE SMILING FACE                          |        |   :-)   |                          |
    |  FFFD  |  �	|  REPLACEMENT CHARACTER                       |  003F  |     ?   |  QUESTION MARK           |
    

    Then :

    • Open the Replace dialog, in N++ ( Ctrl + H )

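    • Type in the regex (¦)|(½)|(\x{2000})|(\x{2001})|(‘)|(’)|(“)|(”)|(‽)|(≤)|(≥)|(☹)|(☺)|(�), in the Find what: zone. Groups 3 and 4 are the EN QUAD and EM QUAD characters, written here as \x{2000} and \x{2001} because they are visually indistinguishable from a plain space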

    • Type in the regex (?1|)(?{2}1/2)(?3\x20\x20)(?4\x20\x20\x20\x20)(?5')(?6')(?7")(?8")(?9!?)(?{10}<=)(?{11}>=)(?{12}\:-\()(?{13}\:-\))(?{14}?), in the Replace with: zone

    • Tick the Wrap around option

    • Select the Regular expression search mode

    • Click once on the Replace All button, or several times on the Replace button

    Et voilà !
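
    For comparison, the search and replacement above can be mimicked in Python. This is a rough sketch under assumptions, not a translation of Notepad++'s engine: Python's re module has no (?1...) conditional replacement, so a callback picks the replacement from the number of the group that matched. The names PAIRS, PATTERN, replace_match and asciify are hypothetical.

```python
import re

# The 14 characters from the list above, in the same order as the groups
# of the search regex, each paired with its ASCII replacement.
PAIRS = [
    ("\u00A6", "|"),     # group 1:  BROKEN BAR
    ("\u00BD", "1/2"),   # group 2:  VULGAR FRACTION ONE HALF
    ("\u2000", "  "),    # group 3:  EN QUAD -> 2 spaces
    ("\u2001", "    "),  # group 4:  EM QUAD -> 4 spaces
    ("\u2018", "'"),     # group 5:  LEFT SINGLE QUOTATION MARK
    ("\u2019", "'"),     # group 6:  RIGHT SINGLE QUOTATION MARK
    ("\u201C", '"'),     # group 7:  LEFT DOUBLE QUOTATION MARK
    ("\u201D", '"'),     # group 8:  RIGHT DOUBLE QUOTATION MARK
    ("\u203D", "!?"),    # group 9:  INTERROBANG
    ("\u2264", "<="),    # group 10: LESS-THAN OR EQUAL TO
    ("\u2265", ">="),    # group 11: GREATER-THAN OR EQUAL TO
    ("\u2639", ":-("),   # group 12: WHITE FROWNING FACE
    ("\u263A", ":-)"),   # group 13: WHITE SMILING FACE
    ("\uFFFD", "?"),     # group 14: REPLACEMENT CHARACTER
]

# Builds (x)|(y)|(z)... with one capturing group per character.
PATTERN = re.compile("|".join("(" + re.escape(src) + ")" for src, _ in PAIRS))


def replace_match(match):
    # Exactly one group takes part in each match; lastindex is its number.
    return PAIRS[match.lastindex - 1][1]


def asciify(text):
    return PATTERN.sub(replace_match, text)


print(asciify("a \u2264 b \u201Cok\u201D \u263A"))
```

    Adding a character means appending one pair to PAIRS; the group numbers follow automatically, so there is no replacement string to keep in sync by hand.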


    Notes :

    • In the search, each character to be replaced is simply put between parentheses, so that it is stored as group 1, 2, and so on

    • In the replacement, we use the special conditional syntax (?#xxxx:yyyy) or (?{#...#}xxxx:yyyy), where:

      • # or #...# represents a group number

      • The part xxxx is written if group # or #...# took part in the match

      • The part yyyy is written if group # or #...# did not take part in the match

    • In our case, the ELSE part (yyyy) is absent from every conditional replacement

    • If a part xxxx or yyyy contains the character :, ( or ), it must be escaped (preceded) with a \ symbol

    • For the second conditional replacement, I deliberately used the syntax (?{2}1/2). Had I used the (?21/2) syntax, the regex engine would have wrongly read it as a condition on group 21, with /2 as the replacement string!

    • Finally, note that quantifiers, such as {#}, do not work in the replacement. So, for instance, the \x20{2} syntax of the table (2 space characters) has to be written out as \x20\x20
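
    The group-number ambiguity described in these notes is not specific to Notepad++. As a loose analogy only (Python's replacement syntax differs from the one used above), Python's re module has the same pitfall: \21 in a replacement string is read as group 21, and the unambiguous form is \g<2>1. The pattern and input below are made up for the illustration.

```python
import re

pattern = re.compile(r"(a)(b)")

# Unambiguous: contents of group 2, followed by a literal "1".
safe = pattern.sub(r"\g<2>1", "ab")

# Ambiguous: "\21" is parsed as a reference to group 21, which does not
# exist in this pattern, so re raises an error instead of guessing.
try:
    pattern.sub(r"\21", "ab")
    ambiguous_failed = False
except re.error:
    ambiguous_failed = True

# Replacement strings have no quantifiers either; repetition is spelled out.
two_spaces = "\x20" * 2

print(safe, ambiguous_failed, repr(two_spaces))
```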

    Best Regards,

    guy038

