Examining a character?
-
Occasionally I have a text document with an oddly encoded or corrupted character. Is there a way (inside np++) to examine a selected character? The alternative would be to exit np++ and open the file with a hex editor. Thanks.
-
There is a HexEditor plugin available for Notepad++.
However, by the time Notepad++ has loaded the file, it has already decoded the byte stream from the file into the characters used in the editor. The HexEditor plugin (or any other in-Notepad++ solution) may therefore show a different byte sequence than the one actually stored on disk, so there would always be a slight doubt about whether what an inside-Notepad++ tool showed you was correct.
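If installing a hex editor is not an option, the bytes on disk can also be inspected with a few lines of Python, outside Notepad++. This is only a minimal sketch: the function names `hex_dump` and `dump_file` are made up for illustration, and the sample bytes stand in for a real file.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format raw bytes as offset, hex pairs, and printable ASCII."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable and non-ASCII bytes are shown as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}: {hex_part:<{width * 3 - 1}}  {text_part}")
    return "\n".join(lines)


def dump_file(path: str) -> str:
    """Dump a file's bytes exactly as stored; 'rb' means no decoding happens."""
    with open(path, "rb") as handle:
        return hex_dump(handle.read())


if __name__ == "__main__":
    # Sample bytes standing in for a file: "it", then E2 80 99, then "s".
    print(hex_dump(b"it\xe2\x80\x99s"))
```

Because the file is opened in binary mode, no encoding guess is involved, so the output does not depend on any editor setting.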
-
@PeterJones Okay, I am not sure what would be ideal. Here is an example…
-
@Dave-Joseph said in Examining a character?:
I have a text document with an oddly encoded or corrupted character
Are you sure about this?
Perhaps you just aren’t using a font setting that can display the character?
Perhaps a change of font, or enabling Direct Write in the Preferences, helps the situation?
If there is still a doubt, perhaps a true hex editor (e.g. HxD – google it) tells the true story of what is going on with your data.
Lots of "perhaps"s.
Life is complicated these days.
-
That looks like you have a UTF-8 encoded file, but Notepad++ is interpreting it as “ANSI”.
The sequence ’ is what the bytes 0xE2 0x80 0x99 look like when they are read as ANSI. Those three bytes are the UTF-8 encoding of U+2019, the right single quotation mark (the right ‘smartquote’: ’), which often gets used as the apostrophe-inside-a-word when a word processor is used to edit a text file.
In my preferences, I generally set MISC > Autodetect character encoding off, and New Document > Encoding to UTF-8 with ☑ Apply to opened ANSI files enabled – this way, even if Notepad++ thinks it’s an “ANSI” file, it interprets byte sequences as UTF-8 encoded, so it should then properly interpret those three bytes.
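The misreading described above can be reproduced in Python. This is a sketch that assumes “ANSI” means Windows-1252 (Python's `cp1252` codec); the helper names are invented for the example.

```python
def misread_utf8_as_ansi(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode those bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(mojibake: str) -> str:
    """Undo the misreading: recover the original bytes, then decode as UTF-8."""
    return mojibake.encode("cp1252").decode("utf-8")


if __name__ == "__main__":
    smart_quote = "\u2019"                       # RIGHT SINGLE QUOTATION MARK
    print(smart_quote.encode("utf-8").hex(" "))  # the three UTF-8 bytes
    garbled = misread_utf8_as_ansi(smart_quote)
    print(garbled)                               # the three-character sequence
    print(repair(garbled) == smart_quote)
```

Note that Python's `cp1252` codec leaves a few bytes (such as 0x9D) unassigned, so this round trip raises an error for some characters, for example the right double quote, whose UTF-8 form ends in 0x9D.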
-
@PeterJones This is an old file that I presumed was ANSI and is declared as ANSI and the strange character was seen both before and after using “Encoding>Convert to ANSI.” Perhaps this character can’t be converted to ANSI?
-
Perhaps this character can’t be converted to ANSI?
The smart quote is not one of the less-than-256 characters available to the “ANSI encoding”, so there was no ANSI codepoint to map ’ to, so it left the byte sequence alone when it converted then saved the file.
In the modern world, there is very little reason for ever using an “ANSI encoding”, except when passing text to a legacy application that refuses to learn a reasonable encoding like UTF-8. Your HTML (I presume) does not fall into that category; IMO, your webserver should be configured to send your HTML as UTF-8 unless you have a good reason not to, and you should be editing all HTML source as UTF-8 files.
-
@PeterJones It might be nice if np++ would have given me a warning that it failed to convert successfully. I am editing a copy of a website owned by an elderly gentleman who is very nervous about making any changes – so I am attempting to make as few changes as possible.
-
@Dave-Joseph said in Examining a character?:
It might be nice if np++ would have given me a warning that it failed to convert successfully
I’m not sure what Notepad++ could have done for you in this circumstance. If I create a file with that sequence in it, using an independent hex editor, then pull the file into Notepad++, it has no issue showing the file as UTF-8 and showing the right smart-quote to me right where it should.
So, unless some bit of information is missing that would nail the door shut on a Notepad++ problem of some kind, I don’t see it. Perhaps your settings are different from what @PeterJones shows; I can’t tell from the info provided. Perhaps you’ve already “moved beyond”, but it would be good to know if a settings change alters your outcome.
-
Hello, @dave-joseph, @peterjones and All,
@peterjones, you said :
The smart quote is not one of the less-than-256 characters available to the “ANSI encoding”, so there was no ANSI codepoint to map ’ to, so it left the byte sequence alone when it converted then saved the file.
Not exactly! Indeed, the RIGHT SINGLE QUOTATION MARK character (U+2019) is really encoded in an ANSI/Windows-1252 file, not with its exact code point, but with the single byte 92. This is also the case for all the non-ASCII characters with a Unicode code point over U+00FF that map to a specific location in some 256-byte encodings.
For instance, this is the case for the 27 following characters of the ANSI - Windows-1252 encoding:
•------•--------•-------------•--------------------------------•-----------------------------------------•
| Char | INPUT  | Hex UNICODE | In an ANSI / Windows-1252 file | In a UTF-8[-BOM] , UCS-2 BE/LE BOM file |
|      |        |             |----------------•---------------•------------•----------------------------•
|      | ALT +  | Code-Point  |      Byte      |     Regex     |   Bytes    |           Regex            |
•------•--------•-------------•----------------•---------------•------------•----------------------------•
|  Œ   |  0140  |   U+0152    |       8C       |     \x8C      |   C5 92    |          \x{0152}          |
|  œ   |  0156  |   U+0153    |       9C       |     \x9C      |   C5 93    |          \x{0153}          |
|  Š   |  0138  |   U+0160    |       8A       |     \x8A      |   C5 A0    |          \x{0160}          |
|  š   |  0154  |   U+0161    |       9A       |     \x9A      |   C5 A1    |          \x{0161}          |
|  Ÿ   |  0159  |   U+0178    |       9F       |     \x9F      |   C5 B8    |          \x{0178}          |
|  Ž   |  0142  |   U+017D    |       8E       |     \x8E      |   C5 BD    |          \x{017D}          |
|  ž   |  0158  |   U+017E    |       9E       |     \x9E      |   C5 BE    |          \x{017E}          |
|  ƒ   |  0131  |   U+0192    |       83       |     \x83      |   C6 92    |          \x{0192}          |
|  ˆ   |  0136  |   U+02C6    |       88       |     \x88      |   CB 86    |          \x{02C6}          |
|  ˜   |  0152  |   U+02DC    |       98       |     \x98      |   CB 9C    |          \x{02DC}          |
|  –   |  0150  |   U+2013    |       96       |     \x96      |  E2 80 93  |          \x{2013}          |
|  —   |  0151  |   U+2014    |       97       |     \x97      |  E2 80 94  |          \x{2014}          |
|  ‘   |  0145  |   U+2018    |       91       |     \x91      |  E2 80 98  |          \x{2018}          |
|  ’   |  0146  |   U+2019    |       92       |     \x92      |  E2 80 99  |          \x{2019}          |
|  ‚   |  0130  |   U+201A    |       82       |     \x82      |  E2 80 9A  |          \x{201A}          |
|  “   |  0147  |   U+201C    |       93       |     \x93      |  E2 80 9C  |          \x{201C}          |
|  ”   |  0148  |   U+201D    |       94       |     \x94      |  E2 80 9D  |          \x{201D}          |
|  „   |  0132  |   U+201E    |       84       |     \x84      |  E2 80 9E  |          \x{201E}          |
|  †   |  0134  |   U+2020    |       86       |     \x86      |  E2 80 A0  |          \x{2020}          |
|  ‡   |  0135  |   U+2021    |       87       |     \x87      |  E2 80 A1  |          \x{2021}          |
|  •   |  0149  |   U+2022    |       95       |     \x95      |  E2 80 A2  |          \x{2022}          |
|  …   |  0133  |   U+2026    |       85       |     \x85      |  E2 80 A6  |          \x{2026}          |
|  ‰   |  0137  |   U+2030    |       89       |     \x89      |  E2 80 B0  |          \x{2030}          |
|  ‹   |  0139  |   U+2039    |       8B       |     \x8B      |  E2 80 B9  |          \x{2039}          |
|  ›   |  0155  |   U+203A    |       9B       |     \x9B      |  E2 80 BA  |          \x{203A}          |
|  €   |  0128  |   U+20AC    |       80       |     \x80      |  E2 82 AC  |          \x{20AC}          |
|  ™   |  0153  |   U+2122    |       99       |     \x99      |  E2 84 A2  |          \x{2122}          |
•------•--------•-------------•----------------•---------------•------------•----------------------------•
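The 27 characters listed above can be cross-checked with Python's built-in `cp1252` codec. This is just a sketch; the function name `cp1252_high_mappings` is invented for the example, and the five bytes that the codec leaves unassigned are simply skipped.

```python
def cp1252_high_mappings():
    """Return (byte, code point, UTF-8 bytes) for assigned bytes 0x80..0x9F."""
    rows = []
    for byte in range(0x80, 0xA0):
        try:
            char = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            continue  # byte is unassigned in Python's cp1252 codec
        rows.append((byte, ord(char), char.encode("utf-8")))
    return rows


if __name__ == "__main__":
    for byte, code_point, utf8 in cp1252_high_mappings():
        print(f"{chr(code_point)}  byte {byte:02X}  U+{code_point:04X}  "
              f"UTF-8 {utf8.hex(' ').upper()}")
```

Every row it prints has a code point above U+00FF, which is exactly why the single ANSI byte and the Unicode code point differ for these characters.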
In contrast, all the characters of the second table below, in the range [U+00A0 - U+00FF], have the same value: byte ## in ANSI and code point U+00## in Unicode. So, in a Unicode file, you have the choice between several regex syntaxes!
•------•--------•-------------•--------------------------------•-----------------------------------------•
| Char | INPUT  | Hex UNICODE | In an ANSI / Windows-1252 file | In a UTF-8[-BOM] , UCS-2 BE/LE BOM file |
|      |        |             |----------------•---------------•------------•----------------------------•
|      | ALT +  | Code-Point  |      Byte      |     Regex     |   Bytes    |          Regexes           |
•------•--------•-------------•----------------•---------------•------------•----------------------------•
|      |  0160  |   U+00A0    |       A0       |     \xA0      |   C2 A0    |    \xA0 \x{A0} \x{00A0}    |
|  ¡   |  0161  |   U+00A1    |       A1       |     \xA1      |   C2 A1    |    \xA1 \x{A1} \x{00A1}    |
|  ¢   |  0162  |   U+00A2    |       A2       |     \xA2      |   C2 A2    |    \xA2 \x{A2} \x{00A2}    |
|  £   |  0163  |   U+00A3    |       A3       |     \xA3      |   C2 A3    |    \xA3 \x{A3} \x{00A3}    |
|  ¤   |  0164  |   U+00A4    |       A4       |     \xA4      |   C2 A4    |    \xA4 \x{A4} \x{00A4}    |
|  ¥   |  0165  |   U+00A5    |       A5       |     \xA5      |   C2 A5    |    \xA5 \x{A5} \x{00A5}    |
|  ¦   |  0166  |   U+00A6    |       A6       |     \xA6      |   C2 A6    |    \xA6 \x{A6} \x{00A6}    |
|  §   |  0167  |   U+00A7    |       A7       |     \xA7      |   C2 A7    |    \xA7 \x{A7} \x{00A7}    |
|  ¨   |  0168  |   U+00A8    |       A8       |     \xA8      |   C2 A8    |    \xA8 \x{A8} \x{00A8}    |
|  ©   |  0169  |   U+00A9    |       A9       |     \xA9      |   C2 A9    |    \xA9 \x{A9} \x{00A9}    |
|  ª   |  0170  |   U+00AA    |       AA       |     \xAA      |   C2 AA    |    \xAA \x{AA} \x{00AA}    |
|  «   |  0171  |   U+00AB    |       AB       |     \xAB      |   C2 AB    |    \xAB \x{AB} \x{00AB}    |
|  ¬   |  0172  |   U+00AC    |       AC       |     \xAC      |   C2 AC    |    \xAC \x{AC} \x{00AC}    |
|      |  0173  |   U+00AD    |       AD       |     \xAD      |   C2 AD    |    \xAD \x{AD} \x{00AD}    |
|  ®   |  0174  |   U+00AE    |       AE       |     \xAE      |   C2 AE    |    \xAE \x{AE} \x{00AE}    |
|  ¯   |  0175  |   U+00AF    |       AF       |     \xAF      |   C2 AF    |    \xAF \x{AF} \x{00AF}    |
|  °   |  0176  |   U+00B0    |       B0       |     \xB0      |   C2 B0    |    \xB0 \x{B0} \x{00B0}    |
|  ±   |  0177  |   U+00B1    |       B1       |     \xB1      |   C2 B1    |    \xB1 \x{B1} \x{00B1}    |
|  ²   |  0178  |   U+00B2    |       B2       |     \xB2      |   C2 B2    |    \xB2 \x{B2} \x{00B2}    |
|  ³   |  0179  |   U+00B3    |       B3       |     \xB3      |   C2 B3    |    \xB3 \x{B3} \x{00B3}    |
|  ´   |  0180  |   U+00B4    |       B4       |     \xB4      |   C2 B4    |    \xB4 \x{B4} \x{00B4}    |
|  µ   |  0181  |   U+00B5    |       B5       |     \xB5      |   C2 B5    |    \xB5 \x{B5} \x{00B5}    |
|  ¶   |  0182  |   U+00B6    |       B6       |     \xB6      |   C2 B6    |    \xB6 \x{B6} \x{00B6}    |
|  ·   |  0183  |   U+00B7    |       B7       |     \xB7      |   C2 B7    |    \xB7 \x{B7} \x{00B7}    |
|  ¸   |  0184  |   U+00B8    |       B8       |     \xB8      |   C2 B8    |    \xB8 \x{B8} \x{00B8}    |
|  ¹   |  0185  |   U+00B9    |       B9       |     \xB9      |   C2 B9    |    \xB9 \x{B9} \x{00B9}    |
|  º   |  0186  |   U+00BA    |       BA       |     \xBA      |   C2 BA    |    \xBA \x{BA} \x{00BA}    |
|  »   |  0187  |   U+00BB    |       BB       |     \xBB      |   C2 BB    |    \xBB \x{BB} \x{00BB}    |
|  ¼   |  0188  |   U+00BC    |       BC       |     \xBC      |   C2 BC    |    \xBC \x{BC} \x{00BC}    |
|  ½   |  0189  |   U+00BD    |       BD       |     \xBD      |   C2 BD    |    \xBD \x{BD} \x{00BD}    |
|  ¾   |  0190  |   U+00BE    |       BE       |     \xBE      |   C2 BE    |    \xBE \x{BE} \x{00BE}    |
|  ¿   |  0191  |   U+00BF    |       BF       |     \xBF      |   C2 BF    |    \xBF \x{BF} \x{00BF}    |
|  À   |  0192  |   U+00C0    |       C0       |     \xC0      |   C3 80    |    \xC0 \x{C0} \x{00C0}    |
|  Á   |  0193  |   U+00C1    |       C1       |     \xC1      |   C3 81    |    \xC1 \x{C1} \x{00C1}    |
|  Â   |  0194  |   U+00C2    |       C2       |     \xC2      |   C3 82    |    \xC2 \x{C2} \x{00C2}    |
|  Ã   |  0195  |   U+00C3    |       C3       |     \xC3      |   C3 83    |    \xC3 \x{C3} \x{00C3}    |
|  Ä   |  0196  |   U+00C4    |       C4       |     \xC4      |   C3 84    |    \xC4 \x{C4} \x{00C4}    |
|  Å   |  0197  |   U+00C5    |       C5       |     \xC5      |   C3 85    |    \xC5 \x{C5} \x{00C5}    |
|  Æ   |  0198  |   U+00C6    |       C6       |     \xC6      |   C3 86    |    \xC6 \x{C6} \x{00C6}    |
|  Ç   |  0199  |   U+00C7    |       C7       |     \xC7      |   C3 87    |    \xC7 \x{C7} \x{00C7}    |
|  È   |  0200  |   U+00C8    |       C8       |     \xC8      |   C3 88    |    \xC8 \x{C8} \x{00C8}    |
|  É   |  0201  |   U+00C9    |       C9       |     \xC9      |   C3 89    |    \xC9 \x{C9} \x{00C9}    |
|  Ê   |  0202  |   U+00CA    |       CA       |     \xCA      |   C3 8A    |    \xCA \x{CA} \x{00CA}    |
|  Ë   |  0203  |   U+00CB    |       CB       |     \xCB      |   C3 8B    |    \xCB \x{CB} \x{00CB}    |
|  Ì   |  0204  |   U+00CC    |       CC       |     \xCC      |   C3 8C    |    \xCC \x{CC} \x{00CC}    |
|  Í   |  0205  |   U+00CD    |       CD       |     \xCD      |   C3 8D    |    \xCD \x{CD} \x{00CD}    |
|  Î   |  0206  |   U+00CE    |       CE       |     \xCE      |   C3 8E    |    \xCE \x{CE} \x{00CE}    |
|  Ï   |  0207  |   U+00CF    |       CF       |     \xCF      |   C3 8F    |    \xCF \x{CF} \x{00CF}    |
|  Ð   |  0208  |   U+00D0    |       D0       |     \xD0      |   C3 90    |    \xD0 \x{D0} \x{00D0}    |
|  Ñ   |  0209  |   U+00D1    |       D1       |     \xD1      |   C3 91    |    \xD1 \x{D1} \x{00D1}    |
|  Ò   |  0210  |   U+00D2    |       D2       |     \xD2      |   C3 92    |    \xD2 \x{D2} \x{00D2}    |
|  Ó   |  0211  |   U+00D3    |       D3       |     \xD3      |   C3 93    |    \xD3 \x{D3} \x{00D3}    |
|  Ô   |  0212  |   U+00D4    |       D4       |     \xD4      |   C3 94    |    \xD4 \x{D4} \x{00D4}    |
|  Õ   |  0213  |   U+00D5    |       D5       |     \xD5      |   C3 95    |    \xD5 \x{D5} \x{00D5}    |
|  Ö   |  0214  |   U+00D6    |       D6       |     \xD6      |   C3 96    |    \xD6 \x{D6} \x{00D6}    |
|  ×   |  0215  |   U+00D7    |       D7       |     \xD7      |   C3 97    |    \xD7 \x{D7} \x{00D7}    |
|  Ø   |  0216  |   U+00D8    |       D8       |     \xD8      |   C3 98    |    \xD8 \x{D8} \x{00D8}    |
|  Ù   |  0217  |   U+00D9    |       D9       |     \xD9      |   C3 99    |    \xD9 \x{D9} \x{00D9}    |
|  Ú   |  0218  |   U+00DA    |       DA       |     \xDA      |   C3 9A    |    \xDA \x{DA} \x{00DA}    |
|  Û   |  0219  |   U+00DB    |       DB       |     \xDB      |   C3 9B    |    \xDB \x{DB} \x{00DB}    |
|  Ü   |  0220  |   U+00DC    |       DC       |     \xDC      |   C3 9C    |    \xDC \x{DC} \x{00DC}    |
|  Ý   |  0221  |   U+00DD    |       DD       |     \xDD      |   C3 9D    |    \xDD \x{DD} \x{00DD}    |
|  Þ   |  0222  |   U+00DE    |       DE       |     \xDE      |   C3 9E    |    \xDE \x{DE} \x{00DE}    |
|  ß   |  0223  |   U+00DF    |       DF       |     \xDF      |   C3 9F    |    \xDF \x{DF} \x{00DF}    |
|  à   |  0224  |   U+00E0    |       E0       |     \xE0      |   C3 A0    |    \xE0 \x{E0} \x{00E0}    |
|  á   |  0225  |   U+00E1    |       E1       |     \xE1      |   C3 A1    |    \xE1 \x{E1} \x{00E1}    |
|  â   |  0226  |   U+00E2    |       E2       |     \xE2      |   C3 A2    |    \xE2 \x{E2} \x{00E2}    |
|  ã   |  0227  |   U+00E3    |       E3       |     \xE3      |   C3 A3    |    \xE3 \x{E3} \x{00E3}    |
|  ä   |  0228  |   U+00E4    |       E4       |     \xE4      |   C3 A4    |    \xE4 \x{E4} \x{00E4}    |
|  å   |  0229  |   U+00E5    |       E5       |     \xE5      |   C3 A5    |    \xE5 \x{E5} \x{00E5}    |
|  æ   |  0230  |   U+00E6    |       E6       |     \xE6      |   C3 A6    |    \xE6 \x{E6} \x{00E6}    |
|  ç   |  0231  |   U+00E7    |       E7       |     \xE7      |   C3 A7    |    \xE7 \x{E7} \x{00E7}    |
|  è   |  0232  |   U+00E8    |       E8       |     \xE8      |   C3 A8    |    \xE8 \x{E8} \x{00E8}    |
|  é   |  0233  |   U+00E9    |       E9       |     \xE9      |   C3 A9    |    \xE9 \x{E9} \x{00E9}    |
|  ê   |  0234  |   U+00EA    |       EA       |     \xEA      |   C3 AA    |    \xEA \x{EA} \x{00EA}    |
|  ë   |  0235  |   U+00EB    |       EB       |     \xEB      |   C3 AB    |    \xEB \x{EB} \x{00EB}    |
|  ì   |  0236  |   U+00EC    |       EC       |     \xEC      |   C3 AC    |    \xEC \x{EC} \x{00EC}    |
|  í   |  0237  |   U+00ED    |       ED       |     \xED      |   C3 AD    |    \xED \x{ED} \x{00ED}    |
|  î   |  0238  |   U+00EE    |       EE       |     \xEE      |   C3 AE    |    \xEE \x{EE} \x{00EE}    |
|  ï   |  0239  |   U+00EF    |       EF       |     \xEF      |   C3 AF    |    \xEF \x{EF} \x{00EF}    |
|  ð   |  0240  |   U+00F0    |       F0       |     \xF0      |   C3 B0    |    \xF0 \x{F0} \x{00F0}    |
|  ñ   |  0241  |   U+00F1    |       F1       |     \xF1      |   C3 B1    |    \xF1 \x{F1} \x{00F1}    |
|  ò   |  0242  |   U+00F2    |       F2       |     \xF2      |   C3 B2    |    \xF2 \x{F2} \x{00F2}    |
|  ó   |  0243  |   U+00F3    |       F3       |     \xF3      |   C3 B3    |    \xF3 \x{F3} \x{00F3}    |
|  ô   |  0244  |   U+00F4    |       F4       |     \xF4      |   C3 B4    |    \xF4 \x{F4} \x{00F4}    |
|  õ   |  0245  |   U+00F5    |       F5       |     \xF5      |   C3 B5    |    \xF5 \x{F5} \x{00F5}    |
|  ö   |  0246  |   U+00F6    |       F6       |     \xF6      |   C3 B6    |    \xF6 \x{F6} \x{00F6}    |
|  ÷   |  0247  |   U+00F7    |       F7       |     \xF7      |   C3 B7    |    \xF7 \x{F7} \x{00F7}    |
|  ø   |  0248  |   U+00F8    |       F8       |     \xF8      |   C3 B8    |    \xF8 \x{F8} \x{00F8}    |
|  ù   |  0249  |   U+00F9    |       F9       |     \xF9      |   C3 B9    |    \xF9 \x{F9} \x{00F9}    |
|  ú   |  0250  |   U+00FA    |       FA       |     \xFA      |   C3 BA    |    \xFA \x{FA} \x{00FA}    |
|  û   |  0251  |   U+00FB    |       FB       |     \xFB      |   C3 BB    |    \xFB \x{FB} \x{00FB}    |
|  ü   |  0252  |   U+00FC    |       FC       |     \xFC      |   C3 BC    |    \xFC \x{FC} \x{00FC}    |
|  ý   |  0253  |   U+00FD    |       FD       |     \xFD      |   C3 BD    |    \xFD \x{FD} \x{00FD}    |
|  þ   |  0254  |   U+00FE    |       FE       |     \xFE      |   C3 BE    |    \xFE \x{FE} \x{00FE}    |
|  ÿ   |  0255  |   U+00FF    |       FF       |     \xFF      |   C3 BF    |    \xFF \x{FF} \x{00FF}    |
•------•--------•-------------•----------------•---------------•------------•----------------------------•
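The regularity of this second range is easy to confirm in Python. A small sketch, again assuming ANSI means Windows-1252 (`cp1252`), with `latin1_range_rows` being a name made up for the example:

```python
def latin1_range_rows():
    """Return (code point, ANSI byte, UTF-8 bytes) for U+00A0..U+00FF."""
    rows = []
    for code_point in range(0xA0, 0x100):
        char = chr(code_point)
        ansi = char.encode("cp1252")   # always a single byte in this range
        utf8 = char.encode("utf-8")    # always two bytes in this range
        rows.append((code_point, ansi, utf8))
    return rows


if __name__ == "__main__":
    rows = latin1_range_rows()
    # The ANSI byte value equals the code point for every character here.
    print(all(ansi == bytes([cp]) for cp, ansi, _ in rows))
    # The UTF-8 lead byte is C2 for U+00A0..U+00BF and C3 for U+00C0..U+00FF.
    print(sorted({utf8[0] for _, _, utf8 in rows}))
```

So in an ANSI file each of these characters is one byte, while in a UTF-8 file the same character occupies two bytes.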
Preferably, prefix any search for a letter with the (?-i) modifier, so that the search is case-sensitive and matches that one character only!
BTW, Peter, why didn’t you mention the very useful Python script that you provided, which gives the hexadecimal and decimal Unicode code point of the character at the caret position? See below:
https://community.notepad-plus-plus.org/post/44448
@alan-kilborn’s version, in the same discussion, is interesting too, as it replaces all occurrences of a specific character with another char/string or nothing:
https://community.notepad-plus-plus.org/post/56576
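For readers who just want the idea behind such a script, here is a standalone sketch in plain Python. It is not the linked PythonScript code and does not use the Notepad++ editor object; the function name `describe_char` is invented, and the character is passed in directly rather than read from the caret position.

```python
import unicodedata


def describe_char(char: str) -> str:
    """Report the hexadecimal and decimal code point and the Unicode name."""
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    code_point = ord(char)
    name = unicodedata.name(char, "<unnamed>")
    return f"U+{code_point:04X} (decimal {code_point}) {name}"


if __name__ == "__main__":
    print(describe_char("\u2019"))
    print(describe_char("A"))
```

As with any in-editor tool, this describes the decoded character, not necessarily the bytes stored in the file.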
Best Regards,
guy038
-
@guy038 said in Examining a character?:
Not exactly ! Indeed, the RIGHT SINGLE QUOTATION MARK character ( U+2019 ) is really coded in an ANSI/Windows-1252 file, not with its exact code-point, but with the single byte 92.
I had forgotten that the smart quotes were in Windows 1252 encoding.
Which then raises the OP’s question of why “Convert to ANSI” didn’t properly convert the smart quote from UTF-8 to ANSI/Win-1252.
In my experiments, if I create the file
‘smart singles’ “smart doubles”
and save it as UTF-8, it saves the bytes
C:\Users\peter.jones\Downloads\TempData\nppCommunity>xxd smart-utf8.txt
00000000: e280 9873 6d61 7274 2073 696e 676c 6573  ...smart singles
00000010: e280 990d 0ae2 809c 736d 6172 7420 646f  ........smart do
00000020: 7562 6c65 73e2 809d 0d0a                 ubles....
If I Convert to ANSI and save, the bytes change to
C:\Users\peter.jones\Downloads\TempData\nppCommunity>xxd smart-utf8.txt
00000000: 9173 6d61 7274 2073 696e 676c 6573 920d  .smart singles..
00000010: 0a93 736d 6172 7420 646f 7562 6c65 7394  ..smart doubles.
00000020: 0d0a                                     ..
So, for me, doing Convert to ANSI does properly convert smart quotes.
Of course, if the original source already had the wrong bytes, or if Notepad++ already thought of the file as ANSI (so Convert to ANSI did nothing), that might explain the problem.
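The same experiment can be repeated without Notepad++. The sketch below assumes ANSI means Windows-1252 (`cp1252`) and that the file uses Windows line endings, as in the dumps above; `is_valid_utf8` is a hypothetical helper for checking which kind of bytes a file really holds.

```python
# The two test lines, with the smart quotes written as escapes.
SAMPLE = "\u2018smart singles\u2019\r\n\u201csmart doubles\u201d\r\n"

utf8_bytes = SAMPLE.encode("utf-8")    # three bytes per smart quote
ansi_bytes = SAMPLE.encode("cp1252")   # one byte per smart quote


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(utf8_bytes.hex(" "))
    print(ansi_bytes.hex(" "))
    print(is_valid_utf8(utf8_bytes), is_valid_utf8(ansi_bytes))
```

If a file that is supposed to be ANSI passes the UTF-8 check while containing bytes such as E2 80 99, it was most likely saved as UTF-8 at some point and is merely being displayed as ANSI.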
BTW, Peter why didn’t you speak about of your very useful Python script that your provided, which enable to get the hexadecimal and decimal Unicode code-point of the character at caret position
Because in the original problem description, I thought the encoding settings might mean that the bytes that ended up in the Scintilla editor component might not match the actual bytes on the disk (people have noted that about the HexEditor plugin before, and the same would be true with PythonScript accessing the Scintilla editor contents). I didn’t want to recommend my script, and then have it confuse him by claiming there were bytes there that aren’t actually in the file.
-
Hi, @dave-joseph, @peterjones, @alan-kilborn and All,
Peter, I confirm your test. In a UTF-8 file without encoding problems, the option Encoding > Convert to ANSI correctly changes the three UTF-8 bytes of each smart quote into one ANSI byte.
Concerning the second point, I understand: seemingly, it’s always better to see the individual bytes of a file with an external hex editor!
Cheers,
guy038