Any way to replace all Non ASCII characters i.e. all x80 or greater within a text file?
-
I have been getting text files written on non-standard keyboards (non-USA character sets). The quote character (normally ' , hex 27) is showing up as ’, i.e. the hex string E2 80 99.
Task #1: I want to be able to find all characters greater than x7F, i.e. x80 or greater, in text files.
Task #2: Once found, I can fix them or replace them with more standard ASCII character(s).
Is there any macro or other way to do these tasks? Thanks, Jaack
-
Hello, @jaack-mcmahon, and All,
Here, below, is a NON-exhaustive table of some Unicode characters with a code point above 007Fh, taken from the following Unicode blocks:
- Latin 1 Supplement
- General Punctuation
- Mathematical Operators
- Miscellaneous Symbols
- Specials
which can be replaced by a similar standard ASCII character, with a code point < 0080h:
+--------------------------------------------------------------+---------------------------------------------+
|  NON-ASCII Character with Code > \x{007F}                    |  Similar Character(s) with Code < \x{0080}  |
+--------------------------------------------------------------+---------------------------------------------+
|  Code  | Char |  Character Name                              |  Code  |  Char   |  Character Name          |
+--------+------+----------------------------------------------+--------+---------+--------------------------+
|  00A0  |      |  NO-BREAK SPACE                              |  0020  |         |  SPACE                   |
|  00A6  |  ¦   |  BROKEN BAR                                  |  007C  | |       |  VERTICAL LINE           |
|  00AB  |  «   |  LEFT-POINTING DOUBLE ANGLE QUOTATION MARK   |  0022  | "       |  QUOTATION MARK          |
|  00AD  |      |  SOFT HYPHEN                                 |  002D  | -       |  HYPHEN-MINUS            |
|  00B4  |  ´   |  ACUTE ACCENT                                |  0027  | '       |  APOSTROPHE              |
|  00B7  |  ·   |  MIDDLE DOT                                  |  002E  | .       |  FULL STOP               |
|  00BB  |  »   |  RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK  |  0022  | "       |  QUOTATION MARK          |
|  00BC  |  ¼   |  VULGAR FRACTION ONE QUARTER                 |        | 1/4     |                          |
|  00BD  |  ½   |  VULGAR FRACTION ONE HALF                    |        | 1/2     |                          |
|  00BE  |  ¾   |  VULGAR FRACTION THREE QUARTERS              |        | 3/4     |                          |
|  00D7  |  ×   |  MULTIPLICATION SIGN                         |  0078  | x       |  LATIN SMALL LETTER X    |
+--------+------+----------------------------------------------+--------+---------+--------------------------+
|  2000  |      |  EN QUAD                                     |        | \x20{2} |                          |
|  2001  |      |  EM QUAD                                     |        | \x20{4} |                          |
|  2002  |      |  EN SPACE                                    |        | \x20{2} |                          |
|  2003  |      |  EM SPACE                                    |        | \x20{4} |                          |
|  2004  |      |  THREE-PER-EM SPACE                          |  0020  |         |  SPACE                   |
|  2005  |      |  FOUR-PER-EM SPACE                           |  0020  |         |  SPACE                   |
|  2007  |      |  FIGURE SPACE                                |        | \x20{2} |                          |
|  2008  |      |  PUNCTUATION SPACE                           |  0020  |         |  SPACE                   |
|  2010  |  ‐   |  HYPHEN                                      |  002D  | -       |  HYPHEN-MINUS            |
|  2011  |  ‑   |  NON-BREAKING HYPHEN                         |  002D  | -       |  HYPHEN-MINUS            |
|  2012  |  ‒   |  FIGURE DASH                                 |        | --      |                          |
|  2013  |  –   |  EN DASH                                     |  002D  | -       |  HYPHEN-MINUS            |
|  2014  |  —   |  EM DASH                                     |  002D  | -       |  HYPHEN-MINUS            |
|  2015  |  ―   |  HORIZONTAL BAR                              |  002D  | -       |  HYPHEN-MINUS            |
|  2016  |  ‖   |  DOUBLE VERTICAL LINE                        |        | ||      |                          |
|  2018  |  ‘   |  LEFT SINGLE QUOTATION MARK                  |  0027  | '       |  APOSTROPHE              |
|  2019  |  ’   |  RIGHT SINGLE QUOTATION MARK                 |  0027  | '       |  APOSTROPHE              |
|  201A  |  ‚   |  SINGLE LOW-9 QUOTATION MARK                 |  002C  | ,       |  COMMA                   |
|  201B  |  ‛   |  SINGLE HIGH-REVERSED-9 QUOTATION MARK       |  0060  | `       |  GRAVE ACCENT            |
|  201C  |  “   |  LEFT DOUBLE QUOTATION MARK                  |  0022  | "       |  QUOTATION MARK          |
|  201D  |  ”   |  RIGHT DOUBLE QUOTATION MARK                 |  0022  | "       |  QUOTATION MARK          |
|  201E  |  „   |  DOUBLE LOW-9 QUOTATION MARK                 |        | ,,      |                          |
|  201F  |  ‟   |  DOUBLE HIGH-REVERSED-9 QUOTATION MARK       |  0022  | "       |  QUOTATION MARK          |
|  2022  |  •   |  BULLET                                      |  002E  | .       |  FULL STOP               |
|  2024  |  ․   |  ONE DOT LEADER                              |  002E  | .       |  FULL STOP               |
|  2025  |  ‥   |  TWO DOT LEADER                              |        | ..      |                          |
|  2026  |  …   |  HORIZONTAL ELLIPSIS                         |        | ...     |                          |
|  2032  |  ′   |  PRIME                                       |  0027  | '       |  APOSTROPHE              |
|  2033  |  ″   |  DOUBLE PRIME                                |        | ''      |                          |
|  2034  |  ‴   |  TRIPLE PRIME                                |        | '''     |                          |
|  2035  |  ‵   |  REVERSED PRIME                              |  0060  | `       |  GRAVE ACCENT            |
|  2036  |  ‶   |  REVERSED DOUBLE PRIME                       |        | ``      |                          |
|  2037  |  ‷   |  REVERSED TRIPLE PRIME                       |        | ```     |                          |
|  2039  |  ‹   |  SINGLE LEFT-POINTING ANGLE QUOTATION MARK   |  003C  | <       |  LESS-THAN SIGN          |
|  203A  |  ›   |  SINGLE RIGHT-POINTING ANGLE QUOTATION MARK  |  003E  | >       |  GREATER-THAN SIGN       |
|  203D  |  ‽   |  INTERROBANG                                 |        | !?      |                          |
|  2044  |  ⁄   |  FRACTION SLASH                              |  002F  | /       |  SOLIDUS                 |
+--------+------+----------------------------------------------+--------+---------+--------------------------+
|  2212  |  −   |  MINUS SIGN                                  |  002D  | -       |  HYPHEN-MINUS            |
|  2215  |  ∕   |  DIVISION SLASH                              |  002F  | /       |  SOLIDUS                 |
|  2216  |  ∖   |  SET MINUS                                   |  005C  | \       |  REVERSE SOLIDUS         |
|  2217  |  ∗   |  ASTERISK OPERATOR                           |  002A  | *       |  ASTERISK                |
|  2223  |  ∣   |  DIVIDES                                     |  007C  | |       |  VERTICAL LINE           |
|  2225  |  ∥   |  PARALLEL TO                                 |        | ||      |                          |
|  2227  |  ∧   |  LOGICAL AND                                 |  005E  | ^       |  CIRCUMFLEX ACCENT       |
|  2228  |  ∨   |  LOGICAL OR                                  |  0056  | V       |  LATIN CAPITAL LETTER V  |
|  222A  |  ∪   |  UNION                                       |  0055  | U       |  LATIN CAPITAL LETTER U  |
|  2236  |  ∶   |  RATIO                                       |  003A  | :       |  COLON                   |
|  2237  |  ∷   |  PROPORTION                                  |        | ::      |                          |
|  2239  |  ∹   |  EXCESS                                      |        | -:      |                          |
|  223C  |  ∼   |  TILDE OPERATOR                              |  007E  | ~       |  TILDE                   |
|  2254  |  ≔   |  COLON EQUALS                                |        | :=      |                          |
|  2255  |  ≕   |  EQUALS COLON                                |        | =:      |                          |
|  2264  |  ≤   |  LESS-THAN OR EQUAL TO                       |        | <=      |                          |
|  2265  |  ≥   |  GREATER-THAN OR EQUAL TO                    |        | >=      |                          |
|  226A  |  ≪   |  MUCH LESS-THAN                              |        | <<      |                          |
|  226B  |  ≫   |  MUCH GREATER-THAN                           |        | >>      |                          |
|  2276  |  ≶   |  LESS-THAN OR GREATER-THAN                   |        | <|>     |                          |
|  2277  |  ≷   |  GREATER-THAN OR LESS-THAN                   |        | >|<     |                          |
|  22C0  |  ⋀   |  N-ARY LOGICAL AND                           |  005E  | ^       |  CIRCUMFLEX ACCENT       |
|  22C1  |  ⋁   |  N-ARY LOGICAL OR                            |  0056  | V       |  LATIN CAPITAL LETTER V  |
|  22C3  |  ⋃   |  N-ARY UNION                                 |  0055  | U       |  LATIN CAPITAL LETTER U  |
|  22C5  |  ⋅   |  DOT OPERATOR                                |  002E  | .       |  FULL STOP               |
|  22C6  |  ⋆   |  STAR OPERATOR                               |  002A  | *       |  ASTERISK                |
|  22D8  |  ⋘   |  VERY MUCH LESS-THAN                         |        | <<<     |                          |
|  22D9  |  ⋙   |  VERY MUCH GREATER-THAN                      |        | >>>     |                          |
|  22EF  |  ⋯   |  MIDLINE HORIZONTAL ELLIPSIS                 |        | ...     |                          |
+--------+------+----------------------------------------------+--------+---------+--------------------------+
|  2639  |  ☹   |  WHITE FROWNING FACE                         |        | :-(     |                          |
|  263A  |  ☺   |  WHITE SMILING FACE                          |        | :-)     |                          |
+--------+------+----------------------------------------------+--------+---------+--------------------------+
|  FFFD  |  �   |  REPLACEMENT CHARACTER                       |  003F  | ?       |  QUESTION MARK           |
+--------+------+----------------------------------------------+--------+---------+--------------------------+
Now, let’s suppose that you would like to replace the 14 Unicode characters below, on the left, with their similar standard character(s), on the right:
|  00A6  |  ¦   |  BROKEN BAR                                  |  007C  | |       |  VERTICAL LINE           |
|  00BD  |  ½   |  VULGAR FRACTION ONE HALF                    |        | 1/2     |                          |
|  2000  |      |  EN QUAD                                     |        | \x20{2} |                          |
|  2001  |      |  EM QUAD                                     |        | \x20{4} |                          |
|  2018  |  ‘   |  LEFT SINGLE QUOTATION MARK                  |  0027  | '       |  APOSTROPHE              |
|  2019  |  ’   |  RIGHT SINGLE QUOTATION MARK                 |  0027  | '       |  APOSTROPHE              |
|  201C  |  “   |  LEFT DOUBLE QUOTATION MARK                  |  0022  | "       |  QUOTATION MARK          |
|  201D  |  ”   |  RIGHT DOUBLE QUOTATION MARK                 |  0022  | "       |  QUOTATION MARK          |
|  203D  |  ‽   |  INTERROBANG                                 |        | !?      |                          |
|  2264  |  ≤   |  LESS-THAN OR EQUAL TO                       |        | <=      |                          |
|  2265  |  ≥   |  GREATER-THAN OR EQUAL TO                    |        | >=      |                          |
|  2639  |  ☹   |  WHITE FROWNING FACE                         |        | :-(     |                          |
|  263A  |  ☺   |  WHITE SMILING FACE                          |        | :-)     |                          |
|  FFFD  |  �   |  REPLACEMENT CHARACTER                       |  003F  | ?       |  QUESTION MARK           |
Then :
- Open the Replace dialog in N++ (Ctrl + H)
- Type the following regex in the Find what: zone (the invisible EN QUAD and EM QUAD characters are written here as \x{2000} and \x{2001}):
    (¦)|(½)|(\x{2000})|(\x{2001})|(‘)|(’)|(“)|(”)|(‽)|(≤)|(≥)|(☹)|(☺)|(�)
- Type the following regex in the Replace with: zone:
    (?1|)(?{2}1/2)(?3\x20\x20)(?4\x20\x20\x20\x20)(?5')(?6')(?7")(?8")(?9!?)(?{10}<=)(?{11}>=)(?{12}\:-\()(?{13}\:-\))(?{14}?)
- Tick the Wrap around option
- Select the Regular expression search mode
- Click once on the Replace All button, or several times on the Replace button
Et voilà !
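For anyone who needs the same clean-up outside N++, here is a minimal Python sketch of the idea, not part of the N++ method itself. It assumes the text has already been decoded as UTF-8. The names REPLACEMENTS, to_ascii and remaining_non_ascii are hypothetical, chosen only for this illustration, and the mapping simply copies the 14 rows listed above.

```python
# Hypothetical mapping: the 14 characters chosen above, keyed by code point.
REPLACEMENTS = {
    0x00A6: "|",     # BROKEN BAR
    0x00BD: "1/2",   # VULGAR FRACTION ONE HALF
    0x2000: "  ",    # EN QUAD -> 2 spaces
    0x2001: "    ",  # EM QUAD -> 4 spaces
    0x2018: "'",     # LEFT SINGLE QUOTATION MARK
    0x2019: "'",     # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',     # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',     # RIGHT DOUBLE QUOTATION MARK
    0x203D: "!?",    # INTERROBANG
    0x2264: "<=",    # LESS-THAN OR EQUAL TO
    0x2265: ">=",    # GREATER-THAN OR EQUAL TO
    0x2639: ":-(",   # WHITE FROWNING FACE
    0x263A: ":-)",   # WHITE SMILING FACE
    0xFFFD: "?",     # REPLACEMENT CHARACTER
}


def to_ascii(text):
    """Replace each mapped character; everything else is left untouched."""
    return text.translate(REPLACEMENTS)


def remaining_non_ascii(text):
    """Return the distinct characters with a code point above 0x7F (Task #1)."""
    return sorted({ch for ch in text if ord(ch) > 0x7F})


if __name__ == "__main__":
    # The bytes E2 80 99 from the question are the UTF-8 form of U+2019.
    sample = b"It\xe2\x80\x99s done".decode("utf-8")
    print(to_ascii(sample))
    print(remaining_non_ascii("caf\u00e9 \u2264 3"))
```

Characters that are not in the mapping, such as accented letters, are deliberately left alone, so remaining_non_ascii can be run afterwards to see what still needs a decision.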
Notes :
- In the search regex, each character to be replaced is simply put between parentheses, so that it is stored as group 1, 2 and so on…
- In the replacement, we use a special conditional syntax (?#xxxx:yyyy) or (?{#..#}xxxx:yyyy), where:
    - # or #..# represents a group number
    - The part xxxx is rewritten if group # or #..# exists
    - The part yyyy is rewritten if group # or #..# does not exist
- In our case, the ELSE part of each conditional replacement is not present
- If a part xxxx or yyyy contains the character :, ( or ), it must be escaped (preceded) with a \ symbol
- For the second conditional replacement, I used the syntax (?{2}1/2) on purpose! Indeed, if I had used the (?21/2) syntax, the regex engine would have wrongly tried to replace any searched group 21 with the /2 string!
- To end with, note that quantifiers such as {#} do not work in the replacement. So we need to change, for instance, the \x20{2} syntax (2 space characters) into the simple \x20\x20 one!
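To illustrate the group mechanism described in these notes, here is a rough Python analogue. It is only a sketch: Python's re module has no (?#xxxx:yyyy) conditional replacement, so a callback that looks at the number of the matched group stands in for it. PATTERN, BY_GROUP, replace_by_group and convert are hypothetical names, and only four of the fourteen alternatives are shown to keep it short.

```python
import re

# One capture group per character, as in the Find what: regex
# (shortened to four alternatives for this sketch).
PATTERN = re.compile("(\u00a6)|(\u00bd)|(\u2000)|(\u203d)")

# Replacement text for each group number, mirroring (?1|)(?{2}1/2)(?3\x20\x20)...
BY_GROUP = {1: "|", 2: "1/2", 3: "  ", 4: "!?"}


def replace_by_group(match):
    # Exactly one alternative matches, so match.lastindex is the number
    # of the group that "exists" for this match.
    return BY_GROUP[match.lastindex]


def convert(text):
    """Replace every matched character with the text tied to its group."""
    return PATTERN.sub(replace_by_group, text)


if __name__ == "__main__":
    print(convert("a\u00a6b costs \u00bd, really\u203d"))
```

Because the callback returns a plain string, none of the escaping rules for :, ( and ) apply here; those are specific to the N++ replacement syntax.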
Best Regards,
guy038