Any way to replace all Non ASCII characters i.e. all x80 or greater within a text file?
-
I have been getting text files written on non-standard keyboards (non-USA character sets). The quote character ’, which should be ' (hex 27), shows up as the hex string E2 80 99.
Task #1: I want to be able to find all characters greater than x7F, i.e. x80 or greater, in text files.
Task #2: Once found, I can fix them or replace them with more standard ASCII character(s).
Is there a macro or some other way to do these tasks?
Thanks, Jaack
-
Hello, @jaack-mcmahon, and All,
Below is a NON-exhaustive table of some Unicode characters with a code point above 007Fh, taken from the following Unicode blocks:
- Latin-1 Supplement
- General Punctuation
- Mathematical Operators
- Miscellaneous Symbols
- Specials
Each of them can be replaced by similar standard ASCII character(s) with a code point below 0080h
+--------------------------------------------------------------+---------------------------------------------+
|           NON-ASCII Character with Code > \x{007F}           |  Similar Character(s) with Code < \x{0080}  |
+--------+------+----------------------------------------------+--------+---------+--------------------------+
|  Code  | Char |  Character Name                              |  Code  |  Char   |  Character Name          |
+--------+------+----------------------------------------------+--------+---------+--------------------------+
|  00A0  |      |  NO-BREAK SPACE                              |  0020  |         |  SPACE                   |
|  00A6  |  ¦   |  BROKEN BAR                                  |  007C  |  |      |  VERTICAL LINE           |
|  00AB  |  «   |  LEFT-POINTING DOUBLE ANGLE QUOTATION MARK   |  0022  |  "      |  QUOTATION MARK          |
|  00AD  |      |  SOFT HYPHEN                                 |  002D  |  -      |  HYPHEN-MINUS            |
|  00B4  |  ´   |  ACUTE ACCENT                                |  0027  |  '      |  APOSTROPHE              |
|  00B7  |  ·   |  MIDDLE DOT                                  |  002E  |  .      |  FULL STOP               |
|  00BB  |  »   |  RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK  |  0022  |  "      |  QUOTATION MARK          |
|  00BC  |  ¼   |  VULGAR FRACTION ONE QUARTER                 |        |  1/4    |                          |
|  00BD  |  ½   |  VULGAR FRACTION ONE HALF                    |        |  1/2    |                          |
|  00BE  |  ¾   |  VULGAR FRACTION THREE QUARTERS              |        |  3/4    |                          |
|  00D7  |  ×   |  MULTIPLICATION SIGN                         |  0078  |  x      |  LATIN SMALL LETTER X    |
+--------+------+----------------------------------------------+--------+---------+--------------------------+
|  2000  |      |  EN QUAD                                     |        |  \x20{2}|                          |
|  2001  |      |  EM QUAD                                     |        |  \x20{4}|                          |
|  2002  |      |  EN SPACE                                    |        |  \x20{2}|                          |
|  2003  |      |  EM SPACE                                    |        |  \x20{4}|                          |
|  2004  |      |  THREE-PER-EM SPACE                          |  0020  |         |  SPACE                   |
|  2005  |      |  FOUR-PER-EM SPACE                           |  0020  |         |  SPACE                   |
|  2007  |      |  FIGURE SPACE                                |        |  \x20{2}|                          |
|  2008  |      |  PUNCTUATION SPACE                           |  0020  |         |  SPACE                   |
|  2010  |  ‐   |  HYPHEN                                      |  002D  |  -      |  HYPHEN-MINUS            |
|  2011  |  ‑   |  NON-BREAKING HYPHEN                         |  002D  |  -      |  HYPHEN-MINUS            |
|  2012  |  ‒   |  FIGURE DASH                                 |        |  --     |                          |
|  2013  |  –   |  EN DASH                                     |  002D  |  -      |  HYPHEN-MINUS            |
|  2014  |  —   |  EM DASH                                     |  002D  |  -      |  HYPHEN-MINUS            |
|  2015  |  ―   |  HORIZONTAL BAR                              |  002D  |  -      |  HYPHEN-MINUS            |
|  2016  |  ‖   |  DOUBLE VERTICAL LINE                        |        |  ||     |                          |
|  2018  |  ‘   |  LEFT SINGLE QUOTATION MARK                  |  0027  |  '      |  APOSTROPHE              |
|  2019  |  ’   |  RIGHT SINGLE QUOTATION MARK                 |  0027  |  '      |  APOSTROPHE              |
|  201A  |  ‚   |  SINGLE LOW-9 QUOTATION MARK                 |  002C  |  ,      |  COMMA                   |
|  201B  |  ‛   |  SINGLE HIGH-REVERSED-9 QUOTATION MARK       |  0060  |  `      |  GRAVE ACCENT            |
|  201C  |  “   |  LEFT DOUBLE QUOTATION MARK                  |  0022  |  "      |  QUOTATION MARK          |
|  201D  |  ”   |  RIGHT DOUBLE QUOTATION MARK                 |  0022  |  "      |  QUOTATION MARK          |
|  201E  |  „   |  DOUBLE LOW-9 QUOTATION MARK                 |        |  ,,     |                          |
|  201F  |  ‟   |  DOUBLE HIGH-REVERSED-9 QUOTATION MARK       |  0022  |  "      |  QUOTATION MARK          |
|  2022  |  •   |  BULLET                                      |  002E  |  .      |  FULL STOP               |
|  2024  |  ․   |  ONE DOT LEADER                              |  002E  |  .      |  FULL STOP               |
|  2025  |  ‥   |  TWO DOT LEADER                              |        |  ..     |                          |
|  2026  |  …   |  HORIZONTAL ELLIPSIS                         |        |  ...    |                          |
|  2032  |  ′   |  PRIME                                       |  0027  |  '      |  APOSTROPHE              |
|  2033  |  ″   |  DOUBLE PRIME                                |        |  ''     |                          |
|  2034  |  ‴   |  TRIPLE PRIME                                |        |  '''    |                          |
|  2035  |  ‵   |  REVERSED PRIME                              |  0060  |  `      |  GRAVE ACCENT            |
|  2036  |  ‶   |  REVERSED DOUBLE PRIME                       |        |  ``     |                          |
|  2037  |  ‷   |  REVERSED TRIPLE PRIME                       |        |  ```    |                          |
|  2039  |  ‹   |  SINGLE LEFT-POINTING ANGLE QUOTATION MARK   |  003C  |  <      |  LESS-THAN SIGN          |
|  203A  |  ›   |  SINGLE RIGHT-POINTING ANGLE QUOTATION MARK  |  003E  |  >      |  GREATER-THAN SIGN       |
|  203D  |  ‽   |  INTERROBANG                                 |        |  !?     |                          |
|  2044  |  ⁄   |  FRACTION SLASH                              |  002F  |  /      |  SOLIDUS                 |
+--------+------+----------------------------------------------+--------+---------+--------------------------+
|  2212  |  −   |  MINUS SIGN                                  |  002D  |  -      |  HYPHEN-MINUS            |
|  2215  |  ∕   |  DIVISION SLASH                              |  002F  |  /      |  SOLIDUS                 |
|  2216  |  ∖   |  SET MINUS                                   |  005C  |  \      |  REVERSE SOLIDUS         |
|  2217  |  ∗   |  ASTERISK OPERATOR                           |  002A  |  *      |  ASTERISK                |
|  2223  |  ∣   |  DIVIDES                                     |  007C  |  |      |  VERTICAL LINE           |
|  2225  |  ∥   |  PARALLEL TO                                 |        |  ||     |                          |
|  2227  |  ∧   |  LOGICAL AND                                 |  005E  |  ^      |  CIRCUMFLEX ACCENT       |
|  2228  |  ∨   |  LOGICAL OR                                  |  0056  |  V      |  LATIN CAPITAL LETTER V  |
|  222A  |  ∪   |  UNION                                       |  0055  |  U      |  LATIN CAPITAL LETTER U  |
|  2236  |  ∶   |  RATIO                                       |  003A  |  :      |  COLON                   |
|  2237  |  ∷   |  PROPORTION                                  |        |  ::     |                          |
|  2239  |  ∹   |  EXCESS                                      |        |  -:     |                          |
|  223C  |  ∼   |  TILDE OPERATOR                              |  007E  |  ~      |  TILDE                   |
|  2254  |  ≔   |  COLON EQUALS                                |        |  :=     |                          |
|  2255  |  ≕   |  EQUALS COLON                                |        |  =:     |                          |
|  2264  |  ≤   |  LESS-THAN OR EQUAL TO                       |        |  <=     |                          |
|  2265  |  ≥   |  GREATER-THAN OR EQUAL TO                    |        |  >=     |                          |
|  226A  |  ≪   |  MUCH LESS-THAN                              |        |  <<     |                          |
|  226B  |  ≫   |  MUCH GREATER-THAN                           |        |  >>     |                          |
|  2276  |  ≶   |  LESS-THAN OR GREATER-THAN                   |        |  <|>    |                          |
|  2277  |  ≷   |  GREATER-THAN OR LESS-THAN                   |        |  >|<    |                          |
|  22C0  |  ⋀   |  N-ARY LOGICAL AND                           |  005E  |  ^      |  CIRCUMFLEX ACCENT       |
|  22C1  |  ⋁   |  N-ARY LOGICAL OR                            |  0056  |  V      |  LATIN CAPITAL LETTER V  |
|  22C3  |  ⋃   |  N-ARY UNION                                 |  0055  |  U      |  LATIN CAPITAL LETTER U  |
|  22C5  |  ⋅   |  DOT OPERATOR                                |  002E  |  .      |  FULL STOP               |
|  22C6  |  ⋆   |  STAR OPERATOR                               |  002A  |  *      |  ASTERISK                |
|  22D8  |  ⋘   |  VERY MUCH LESS-THAN                         |        |  <<<    |                          |
|  22D9  |  ⋙   |  VERY MUCH GREATER-THAN                      |        |  >>>    |                          |
|  22EF  |  ⋯   |  MIDLINE HORIZONTAL ELLIPSIS                 |        |  ...    |                          |
+--------+------+----------------------------------------------+--------+---------+--------------------------+
|  2639  |  ☹   |  WHITE FROWNING FACE                         |        |  :-(    |                          |
|  263A  |  ☺   |  WHITE SMILING FACE                          |        |  :-)    |                          |
+--------+------+----------------------------------------------+--------+---------+--------------------------+
|  FFFD  |  �   |  REPLACEMENT CHARACTER                       |  003F  |  ?      |  QUESTION MARK           |
+--------+------+----------------------------------------------+--------+---------+--------------------------+
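For Task #1, outside of Notepad++, a short Python sketch can list which characters above 007Fh a text actually contains, in the same Code / Char / Character Name layout as the table. This is only an illustration: the function name `list_non_ascii` is made up here, and it assumes the text has already been decoded (the bytes E2 80 99 from the question are the UTF-8 encoding of U+2019, so reading the file as UTF-8 is the likely choice).

```python
import unicodedata
from collections import Counter

def list_non_ascii(text):
    """Return (code point, char, Unicode name, count) for each char above U+007F."""
    counts = Counter(ch for ch in text if ord(ch) > 0x7F)
    return [
        (f"{ord(ch):04X}", ch, unicodedata.name(ch, "<unnamed>"), n)
        for ch, n in sorted(counts.items())
    ]

# Hypothetical sample text; for a real file, read it first, e.g. with
# open(path, encoding="utf-8").read()  (the encoding is an assumption).
sample = "It\u2019s \u201Cquoted\u201D \u2026 and \u2264 5"
for code, ch, name, n in list_non_ascii(sample):
    print(code, ch, name, n)
```

Any character reported here that is missing from the table needs a replacement chosen by hand.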
Now, let's suppose that, from the list above, you would like to replace these 14 Unicode characters, on the left, with their similar standard character(s), on the right:
|  00A6  |  ¦   |  BROKEN BAR                                  |  007C  |  |      |  VERTICAL LINE           |
|  00BD  |  ½   |  VULGAR FRACTION ONE HALF                    |        |  1/2    |                          |
|  2000  |      |  EN QUAD                                     |        |  \x20{2}|                          |
|  2001  |      |  EM QUAD                                     |        |  \x20{4}|                          |
|  2018  |  ‘   |  LEFT SINGLE QUOTATION MARK                  |  0027  |  '      |  APOSTROPHE              |
|  2019  |  ’   |  RIGHT SINGLE QUOTATION MARK                 |  0027  |  '      |  APOSTROPHE              |
|  201C  |  “   |  LEFT DOUBLE QUOTATION MARK                  |  0022  |  "      |  QUOTATION MARK          |
|  201D  |  ”   |  RIGHT DOUBLE QUOTATION MARK                 |  0022  |  "      |  QUOTATION MARK          |
|  203D  |  ‽   |  INTERROBANG                                 |        |  !?     |                          |
|  2264  |  ≤   |  LESS-THAN OR EQUAL TO                       |        |  <=     |                          |
|  2265  |  ≥   |  GREATER-THAN OR EQUAL TO                    |        |  >=     |                          |
|  2639  |  ☹   |  WHITE FROWNING FACE                         |        |  :-(    |                          |
|  263A  |  ☺   |  WHITE SMILING FACE                          |        |  :-)    |                          |
|  FFFD  |  �   |  REPLACEMENT CHARACTER                       |  003F  |  ?      |  QUESTION MARK           |
Then:
- Open the Replace dialog in N++ (Ctrl + H)
- In the Find what: zone, type the regex below (the invisible EN QUAD and EM QUAD characters are written as \x{2000} and \x{2001}, so that they survive a copy/paste):
  (¦)|(½)|(\x{2000})|(\x{2001})|(‘)|(’)|(“)|(”)|(‽)|(≤)|(≥)|(☹)|(☺)|(�)
- In the Replace with: zone, type the regex:
  (?1|)(?{2}1/2)(?3\x20\x20)(?4\x20\x20\x20\x20)(?5')(?6')(?7")(?8")(?9!?)(?{10}<=)(?{11}>=)(?{12}\:-\()(?{13}\:-\))(?{14}?)
- Tick the Wrap around option
- Select the Regular expression search mode
- Click once on the Replace All button, or several times on the Replace button
Et voilà !
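If the files must be processed outside Notepad++ (for instance in a batch), the same 14 replacements can be sketched in Python with `str.translate`. This is an alternative to the regex above, not a description of how Notepad++ works; the names `REPLACEMENTS` and `to_ascii` are invented for the example.

```python
# Mapping copied from the 14 rows above: code point -> ASCII replacement.
REPLACEMENTS = {
    0x00A6: "|",     # BROKEN BAR
    0x00BD: "1/2",   # VULGAR FRACTION ONE HALF
    0x2000: "  ",    # EN QUAD -> 2 spaces
    0x2001: "    ",  # EM QUAD -> 4 spaces
    0x2018: "'",     # LEFT SINGLE QUOTATION MARK
    0x2019: "'",     # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',     # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',     # RIGHT DOUBLE QUOTATION MARK
    0x203D: "!?",    # INTERROBANG
    0x2264: "<=",    # LESS-THAN OR EQUAL TO
    0x2265: ">=",    # GREATER-THAN OR EQUAL TO
    0x2639: ":-(",   # WHITE FROWNING FACE
    0x263A: ":-)",   # WHITE SMILING FACE
    0xFFFD: "?",     # REPLACEMENT CHARACTER
}

def to_ascii(text):
    """Replace the mapped characters; every other character is left untouched."""
    return text.translate(REPLACEMENTS)

print(to_ascii("\u201CIt\u2019s \u2264 \u00BD\u201D \u263A"))
```

Characters that are not in the mapping, such as accented letters, pass through unchanged, so the mapping can be extended row by row from the full table.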
Notes:
- In the search, each character to be replaced is simply placed between round parentheses, so that it is stored as group 1, 2 and so on…
- In the replacement, we use a special conditional syntax (?#xxxx:yyyy) or (?{#..#}xxxx:yyyy), where:
  - # or #...# represents a group number
  - The part xxxx is rewritten if group # or #...# exists
  - The part yyyy is rewritten if group # or #...# does not exist
- In our case, the ELSE part of each conditional replacement is not present
- If a part xxxx or yyyy contains the character :, ( or ), it must be escaped (preceded) with a \ symbol
- For the second conditional replacement, I used the syntax (?{2}1/2) on purpose! Indeed, had I used the (?21/2) syntax, the regex engine would have wrongly tried to replace any matched group 21 with the /2 string!
- To end with, note that quantifiers such as {#} do not work in the replacement. So we need to change, for instance, the \x20{2} syntax (2 space characters) into the simple \x20\x20 one!
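As an illustration of these notes, here is a small Python sketch that builds the two strings to paste into the Find what: and Replace with: zones from a mapping. The helper names are hypothetical, and it applies only the rules stated above: one group per character, the braced (?{n}...) form throughout so that digits in the text can never be read as part of the group number, a \ before :, ( and ), and spaces written out as repeated \x20. Other characters that might be special in a replacement are not handled.

```python
def escape_replacement(text):
    """Escape ':', '(' and ')' with a backslash and write each space as \\x20."""
    out = []
    for c in text:
        if c in ":()":
            out.append("\\" + c)
        elif c == " ":
            out.append("\\x20")
        else:
            out.append(c)
    return "".join(out)

def build_npp_regexes(mapping):
    """Build (find, replace) strings from {character: ASCII replacement}."""
    find_parts = []
    replace_parts = []
    for group, (char, ascii_text) in enumerate(mapping.items(), start=1):
        # Search side: one capturing group per character, written as \x{XXXX}.
        find_parts.append(f"(\\x{{{ord(char):04X}}})")
        # Replacement side: braced group number, no ELSE part.
        replace_parts.append(f"(?{{{group}}}{escape_replacement(ascii_text)})")
    return "|".join(find_parts), "".join(replace_parts)

find, replace = build_npp_regexes({"\u00BD": "1/2", "\u2000": "  ", "\u2639": ":-("})
print(find)
print(replace)
```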
Best Regards,
guy038