    Examining a character?

    • Dave Joseph

      Occasionally I have a text document with an oddly encoded or corrupted character. Is there a way (inside np++) to examine a selected character? The alternative would be to exit np++ and open the file with a hex editor. Thanks.

      • PeterJones @Dave Joseph

        @Dave-Joseph ,

        There is a HexEditor plugin available for Notepad++.

        However, by the time Notepad++ has loaded the file, it has already decoded the byte stream into the characters shown in the editor. The HexEditor plugin (or any other in-Notepad++ tool) may therefore show a different byte sequence than what is actually written to disk, so there would always be some doubt about whether an inside-Notepad++ tool is showing you the real bytes.
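
One way to sidestep that doubt without a dedicated hex editor is to read the file in binary mode, so that no decoding happens at all. The following is only a minimal sketch, not something posted in this thread; the function name `non_ascii_runs` and the sample bytes are made up for illustration.

```python
def non_ascii_runs(data):
    """Return (offset, hex string) for each run of bytes >= 0x80 in data."""
    runs = []
    start = None
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            if start is None:
                start = offset  # a new run of non-ASCII bytes begins here
        elif start is not None:
            runs.append((start, " ".join(f"{b:02X}" for b in data[start:offset])))
            start = None
    if start is not None:
        # The data ended while still inside a run.
        runs.append((start, " ".join(f"{b:02X}" for b in data[start:])))
    return runs


# Hypothetical sample: the word "it's" with a smart quote, as stored in a UTF-8 file.
sample = b"it\xe2\x80\x99s"
print(non_ascii_runs(sample))
```

For a real file, the bytes would come from something like `open(path, "rb").read()`; because the file is never decoded, the editor's encoding settings cannot influence what is reported.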

        • Dave Joseph @PeterJones

          @PeterJones Okay, I am not sure what would be ideal. Here is an example…
          character_problem.png

          • Alan Kilborn @Dave Joseph

            @Dave-Joseph said in Examining a character?:

            I have a text document with an oddly encoded or corrupted character

            Are you sure about this?
            Perhaps you just aren’t using a font setting that can display the character?
            Perhaps a change of font, or enabling Direct Write in the Preferences helps the situation?

            If there is still a doubt, perhaps a true hex editor (e.g. HxD – google it) tells the true story of what is going on with your data.

            Lots of "perhaps"s.
            Life is complicated these days.

            • PeterJones @Dave Joseph

              @Dave-Joseph ,

              That looks like you have a UTF-8 encoded file, but Notepad++ is interpreting it as “ANSI”.

              The sequence ’ is the three bytes 0xE2 0x80 0x99 displayed as Windows-1252 characters. Those bytes are the UTF-8 encoding of U+2019, the right single quotation mark (the right ‘smartquote’: ’), which often gets used as the apostrophe inside a word when a word processor is used to edit a text file.
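
The mix-up described above can be reproduced in a few lines. This is a sketch that assumes Python's `cp1252` codec is a fair stand-in for what Notepad++ calls “ANSI” on a Western-European Windows system; the variable names are illustrative only.

```python
quote = "\u2019"                        # RIGHT SINGLE QUOTATION MARK
utf8_bytes = quote.encode("utf-8")      # bytes stored in a UTF-8 file
misread = utf8_bytes.decode("cp1252")   # same bytes read as Windows-1252

print(" ".join(f"{b:02X}" for b in utf8_bytes))
print([f"U+{ord(c):04X}" for c in misread])

# Reversing the mistake recovers the original character.
repaired = misread.encode("cp1252").decode("utf-8")
```

The three characters in `misread` are U+00E2, U+20AC and U+2122, one per byte of the UTF-8 sequence, which is the typical signature of UTF-8 text being shown under a single-byte encoding.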

              In my preferences, I generally set MISC > Autodetect character encoding off, and New Document > Encoding to be UTF-8, ☑ Apply to opened ANSI files enabled – this way, even if Notepad++ thinks it’s an “ANSI” file, it interprets byte sequences as UTF-8 encoded, so it should then properly interpret those three bytes.

              • Dave Joseph @PeterJones

                @PeterJones This is an old file that I presumed was ANSI and is declared as ANSI and the strange character was seen both before and after using “Encoding>Convert to ANSI.” Perhaps this character can’t be converted to ANSI?

                • PeterJones @Dave Joseph

                  @Dave-Joseph ,

                  Perhaps this character can’t be converted to ANSI?

                  The smart quote is not one of the less-than-256 characters available to the “ANSI encoding”, so there was no ANSI codepoint to map ’ to, so it left the byte sequence alone when it converted then saved the file.

                  In the modern world, there is very little reason for ever using an “ANSI encoding”, except when passing text to a legacy application that refuses to learn a reasonable encoding like UTF-8. Your HTML (I presume) does not fall into that category; IMO, your webserver should be configured to send your HTML as UTF-8 unless you have a good reason not to, and you should be editing all HTML source as UTF-8 files.

                  • Dave Joseph @PeterJones

                    @PeterJones It might be nice if np++ would have given me a warning that it failed to convert successfully. I am editing a copy of a website owned by an elderly gentleman who is very nervous about making any changes – so I am attempting to make as few changes as possible.

                    • Alan Kilborn @Dave Joseph

                      @Dave-Joseph said in Examining a character?:

                      It might be nice if np++ would have given me a warning that it failed to convert successfully

                      I’m not sure what Notepad++ could have done for you in this circumstance. If I create a file with that sequence in it, using an independent hex editor, then pull the file into Notepad++, it has no issue showing the file as UTF-8 and showing the right smart-quote to me right where it should.

                      So, unless some bit of information is missing that would nail the door shut on a Notepad++ problem of some kind, I don’t see it. Perhaps your settings are different from what @PeterJones shows; I can’t tell from the info provided. Perhaps you’ve already “moved beyond”, but it would be good to know if a settings change alters your outcome.

                      • guy038

                        Hello, @dave-joseph, @peterjones and All,

                        @peterjones, you said :

                        The smart quote is not one of the less-than-256 characters available to the “ANSI encoding”, so there was no ANSI codepoint to map ’ to, so it left the byte sequence alone when it converted then saved the file.

                        Not exactly! The RIGHT SINGLE QUOTATION MARK character (U+2019) does exist in an ANSI/Windows-1252 file: it is stored not as its Unicode code point but as the single byte 0x92. The same holds for every non-ASCII character whose Unicode code point is above U+00FF but which still has a slot in some 256-byte encoding.

                        For instance, this is the case for the following 27 characters of the ANSI / Windows-1252 encoding:

                        •------•--------•-------------•--------------------------------•-----------------------------------------•
                        | Char | INPUT  | Hex UNICODE | In an ANSI / Windows-1252 file | In a UTF-8[-BOM] , UCS-2 BE/LE BOM file |
                        |      |        |             |----------------•---------------•------------•----------------------------•
                        |      | ALT +  |  Code-Point |      Byte      |     Regex     |   Bytes    |           Regex            |
                        •------•--------•-------------•----------------•---------------•------------•----------------------------•
                        |  Œ   |  0140  |   U+0152    |       8C       |     \x8C      |   C5 92    |          \x{0152}          |
                        |  œ   |  0156  |   U+0153    |       9C       |     \x9C      |   C5 93    |          \x{0153}          |
                        |  Š   |  0138  |   U+0160    |       8A       |     \x8A      |   C5 A0    |          \x{0160}          |
                        |  š   |  0154  |   U+0161    |       9A       |     \x9A      |   C5 A1    |          \x{0161}          |
                        |  Ÿ   |  0159  |   U+0178    |       9F       |     \x9F      |   C5 B8    |          \x{0178}          |
                        |  Ž   |  0142  |   U+017D    |       8E       |     \x8E      |   C5 BD    |          \x{017D}          |
                        |  ž   |  0158  |   U+017E    |       9E       |     \x9E      |   C5 BE    |          \x{017E}          |
                        |  ƒ   |  0131  |   U+0192    |       83       |     \x83      |   C6 92    |          \x{0192}          |
                        |  ˆ   |  0136  |   U+02C6    |       88       |     \x88      |   CB 86    |          \x{02C6}          |
                        |  ˜   |  0152  |   U+02DC    |       98       |     \x98      |   CB 9C    |          \x{02DC}          |
                        |  –   |  0150  |   U+2013    |       96       |     \x96      |  E2 80 93  |          \x{2013}          |
                        |  —   |  0151  |   U+2014    |       97       |     \x97      |  E2 80 94  |          \x{2014}          |
                        |  ‘   |  0145  |   U+2018    |       91       |     \x91      |  E2 80 98  |          \x{2018}          |
                        |  ’   |  0146  |   U+2019    |       92       |     \x92      |  E2 80 99  |          \x{2019}          |
                        |  ‚   |  0130  |   U+201A    |       82       |     \x82      |  E2 80 9A  |          \x{201A}          |
                        |  “   |  0147  |   U+201C    |       93       |     \x93      |  E2 80 9C  |          \x{201C}          |
                        |  ”   |  0148  |   U+201D    |       94       |     \x94      |  E2 80 9D  |          \x{201D}          |
                        |  „   |  0132  |   U+201E    |       84       |     \x84      |  E2 80 9E  |          \x{201E}          |
                        |  †   |  0134  |   U+2020    |       86       |     \x86      |  E2 80 A0  |          \x{2020}          |
                        |  ‡   |  0135  |   U+2021    |       87       |     \x87      |  E2 80 A1  |          \x{2021}          |
                        |  •   |  0149  |   U+2022    |       95       |     \x95      |  E2 80 A2  |          \x{2022}          |
                        |  …   |  0133  |   U+2026    |       85       |     \x85      |  E2 80 A6  |          \x{2026}          |
                        |  ‰   |  0137  |   U+2030    |       89       |     \x89      |  E2 80 B0  |          \x{2030}          |
                        |  ‹   |  0139  |   U+2039    |       8B       |     \x8B      |  E2 80 B9  |          \x{2039}          |
                        |  ›   |  0155  |   U+203A    |       9B       |     \x9B      |  E2 80 BA  |          \x{203A}          |
                        |  €   |  0128  |   U+20AC    |       80       |     \x80      |  E2 82 AC  |          \x{20AC}          |
                        |  ™   |  0153  |   U+2122    |       99       |     \x99      |  E2 84 A2  |          \x{2122}          |
                        •------•--------•-------------•----------------•---------------•------------•----------------------------•
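
The rows of a table like this one can be regenerated rather than typed by hand. The sketch below assumes Python's built-in `cp1252` codec as the reference for Windows-1252; `cp1252_extras` is a name invented for this example. It lists the rows in byte order, whereas the table above is sorted by code point.

```python
def cp1252_extras():
    """List (byte, code point, UTF-8 bytes) for Windows-1252 bytes 0x80-0x9F."""
    rows = []
    for byte in range(0x80, 0xA0):
        try:
            char = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D unassigned.
            continue
        rows.append((byte, ord(char), char.encode("utf-8")))
    return rows


for byte, code_point, utf8 in cp1252_extras():
    utf8_hex = " ".join(f"{b:02X}" for b in utf8)
    # The ALT code is simply the byte value in decimal, zero-padded to four digits.
    print(f"ALT+{byte:04d}  U+{code_point:04X}  {byte:02X}  {utf8_hex}")
```

Of the 32 byte values in 0x80-0x9F, five are unassigned in this codec, which leaves the 27 characters of the table.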
                        

                        In contrast, every character in the second table below, covering the range [U+00A0 - U+00FF], has the same value in both encodings: byte ## in ANSI and code point U+00## in Unicode. So, in a Unicode file, you can choose between several regex syntaxes!

                        •------•--------•-------------•--------------------------------•-----------------------------------------•
                        | Char | INPUT  | Hex UNICODE | In an ANSI / Windows-1252 file | In a UTF-8[-BOM] , UCS-2 BE/LE BOM file |
                        |      |        |             |----------------•---------------•------------•----------------------------•
                        |      | ALT +  |  Code-Point |      Byte      |     Regex     |   Bytes    |           Regexes          |
                        •------•--------•-------------•----------------•---------------•------------•----------------------------•
                        |      |  0160  |   U+00A0    |       A0       |     \xA0      |   C2 A0    |  \xA0   \x{A0}   \x{00A0}  |
                        |  ¡   |  0161  |   U+00A1    |       A1       |     \xA1      |   C2 A1    |  \xA1   \x{A1}   \x{00A1}  |
                        |  ¢   |  0162  |   U+00A2    |       A2       |     \xA2      |   C2 A2    |  \xA2   \x{A2}   \x{00A2}  |
                        |  £   |  0163  |   U+00A3    |       A3       |     \xA3      |   C2 A3    |  \xA3   \x{A3}   \x{00A3}  |
                        |  ¤   |  0164  |   U+00A4    |       A4       |     \xA4      |   C2 A4    |  \xA4   \x{A4}   \x{00A4}  |
                        |  ¥   |  0165  |   U+00A5    |       A5       |     \xA5      |   C2 A5    |  \xA5   \x{A5}   \x{00A5}  |
                        |  ¦   |  0166  |   U+00A6    |       A6       |     \xA6      |   C2 A6    |  \xA6   \x{A6}   \x{00A6}  |
                        |  §   |  0167  |   U+00A7    |       A7       |     \xA7      |   C2 A7    |  \xA7   \x{A7}   \x{00A7}  |
                        |  ¨   |  0168  |   U+00A8    |       A8       |     \xA8      |   C2 A8    |  \xA8   \x{A8}   \x{00A8}  |
                        |  ©   |  0169  |   U+00A9    |       A9       |     \xA9      |   C2 A9    |  \xA9   \x{A9}   \x{00A9}  |
                        |  ª   |  0170  |   U+00AA    |       AA       |     \xAA      |   C2 AA    |  \xAA   \x{AA}   \x{00AA}  |
                        |  «   |  0171  |   U+00AB    |       AB       |     \xAB      |   C2 AB    |  \xAB   \x{AB}   \x{00AB}  |
                        |  ¬   |  0172  |   U+00AC    |       AC       |     \xAC      |   C2 AC    |  \xAC   \x{AC}   \x{00AC}  |
                        |      |  0173  |   U+00AD    |       AD       |     \xAD      |   C2 AD    |  \xAD   \x{AD}   \x{00AD}  |
                        |  ®   |  0174  |   U+00AE    |       AE       |     \xAE      |   C2 AE    |  \xAE   \x{AE}   \x{00AE}  |
                        |  ¯   |  0175  |   U+00AF    |       AF       |     \xAF      |   C2 AF    |  \xAF   \x{AF}   \x{00AF}  |
                        |  °   |  0176  |   U+00B0    |       B0       |     \xB0      |   C2 B0    |  \xB0   \x{B0}   \x{00B0}  |
                        |  ±   |  0177  |   U+00B1    |       B1       |     \xB1      |   C2 B1    |  \xB1   \x{B1}   \x{00B1}  |
                        |  ²   |  0178  |   U+00B2    |       B2       |     \xB2      |   C2 B2    |  \xB2   \x{B2}   \x{00B2}  |
                        |  ³   |  0179  |   U+00B3    |       B3       |     \xB3      |   C2 B3    |  \xB3   \x{B3}   \x{00B3}  |
                        |  ´   |  0180  |   U+00B4    |       B4       |     \xB4      |   C2 B4    |  \xB4   \x{B4}   \x{00B4}  |
                        |  µ   |  0181  |   U+00B5    |       B5       |     \xB5      |   C2 B5    |  \xB5   \x{B5}   \x{00B5}  |
                        |  ¶   |  0182  |   U+00B6    |       B6       |     \xB6      |   C2 B6    |  \xB6   \x{B6}   \x{00B6}  |
                        |  ·   |  0183  |   U+00B7    |       B7       |     \xB7      |   C2 B7    |  \xB7   \x{B7}   \x{00B7}  |
                        |  ¸   |  0184  |   U+00B8    |       B8       |     \xB8      |   C2 B8    |  \xB8   \x{B8}   \x{00B8}  |
                        |  ¹   |  0185  |   U+00B9    |       B9       |     \xB9      |   C2 B9    |  \xB9   \x{B9}   \x{00B9}  |
                        |  º   |  0186  |   U+00BA    |       BA       |     \xBA      |   C2 BA    |  \xBA   \x{BA}   \x{00BA}  |
                        |  »   |  0187  |   U+00BB    |       BB       |     \xBB      |   C2 BB    |  \xBB   \x{BB}   \x{00BB}  |
                        |  ¼   |  0188  |   U+00BC    |       BC       |     \xBC      |   C2 BC    |  \xBC   \x{BC}   \x{00BC}  |
                        |  ½   |  0189  |   U+00BD    |       BD       |     \xBD      |   C2 BD    |  \xBD   \x{BD}   \x{00BD}  |
                        |  ¾   |  0190  |   U+00BE    |       BE       |     \xBE      |   C2 BE    |  \xBE   \x{BE}   \x{00BE}  |
                        |  ¿   |  0191  |   U+00BF    |       BF       |     \xBF      |   C2 BF    |  \xBF   \x{BF}   \x{00BF}  |
                        |  À   |  0192  |   U+00C0    |       C0       |     \xC0      |   C3 80    |  \xC0   \x{C0}   \x{00C0}  |
                        |  Á   |  0193  |   U+00C1    |       C1       |     \xC1      |   C3 81    |  \xC1   \x{C1}   \x{00C1}  |
                        |  Â   |  0194  |   U+00C2    |       C2       |     \xC2      |   C3 82    |  \xC2   \x{C2}   \x{00C2}  |
                        |  Ã   |  0195  |   U+00C3    |       C3       |     \xC3      |   C3 83    |  \xC3   \x{C3}   \x{00C3}  |
                        |  Ä   |  0196  |   U+00C4    |       C4       |     \xC4      |   C3 84    |  \xC4   \x{C4}   \x{00C4}  |
                        |  Å   |  0197  |   U+00C5    |       C5       |     \xC5      |   C3 85    |  \xC5   \x{C5}   \x{00C5}  |
                        |  Æ   |  0198  |   U+00C6    |       C6       |     \xC6      |   C3 86    |  \xC6   \x{C6}   \x{00C6}  |
                        |  Ç   |  0199  |   U+00C7    |       C7       |     \xC7      |   C3 87    |  \xC7   \x{C7}   \x{00C7}  |
                        |  È   |  0200  |   U+00C8    |       C8       |     \xC8      |   C3 88    |  \xC8   \x{C8}   \x{00C8}  |
                        |  É   |  0201  |   U+00C9    |       C9       |     \xC9      |   C3 89    |  \xC9   \x{C9}   \x{00C9}  |
                        |  Ê   |  0202  |   U+00CA    |       CA       |     \xCA      |   C3 8A    |  \xCA   \x{CA}   \x{00CA}  |
                        |  Ë   |  0203  |   U+00CB    |       CB       |     \xCB      |   C3 8B    |  \xCB   \x{CB}   \x{00CB}  |
                        |  Ì   |  0204  |   U+00CC    |       CC       |     \xCC      |   C3 8C    |  \xCC   \x{CC}   \x{00CC}  |
                        |  Í   |  0205  |   U+00CD    |       CD       |     \xCD      |   C3 8D    |  \xCD   \x{CD}   \x{00CD}  |
                        |  Î   |  0206  |   U+00CE    |       CE       |     \xCE      |   C3 8E    |  \xCE   \x{CE}   \x{00CE}  |
                        |  Ï   |  0207  |   U+00CF    |       CF       |     \xCF      |   C3 8F    |  \xCF   \x{CF}   \x{00CF}  |
                        |  Ð   |  0208  |   U+00D0    |       D0       |     \xD0      |   C3 90    |  \xD0   \x{D0}   \x{00D0}  |
                        |  Ñ   |  0209  |   U+00D1    |       D1       |     \xD1      |   C3 91    |  \xD1   \x{D1}   \x{00D1}  |
                        |  Ò   |  0210  |   U+00D2    |       D2       |     \xD2      |   C3 92    |  \xD2   \x{D2}   \x{00D2}  |
                        |  Ó   |  0211  |   U+00D3    |       D3       |     \xD3      |   C3 93    |  \xD3   \x{D3}   \x{00D3}  |
                        |  Ô   |  0212  |   U+00D4    |       D4       |     \xD4      |   C3 94    |  \xD4   \x{D4}   \x{00D4}  |
                        |  Õ   |  0213  |   U+00D5    |       D5       |     \xD5      |   C3 95    |  \xD5   \x{D5}   \x{00D5}  |
                        |  Ö   |  0214  |   U+00D6    |       D6       |     \xD6      |   C3 96    |  \xD6   \x{D6}   \x{00D6}  |
                        |  ×   |  0215  |   U+00D7    |       D7       |     \xD7      |   C3 97    |  \xD7   \x{D7}   \x{00D7}  |
                        |  Ø   |  0216  |   U+00D8    |       D8       |     \xD8      |   C3 98    |  \xD8   \x{D8}   \x{00D8}  |
                        |  Ù   |  0217  |   U+00D9    |       D9       |     \xD9      |   C3 99    |  \xD9   \x{D9}   \x{00D9}  |
                        |  Ú   |  0218  |   U+00DA    |       DA       |     \xDA      |   C3 9A    |  \xDA   \x{DA}   \x{00DA}  |
                        |  Û   |  0219  |   U+00DB    |       DB       |     \xDB      |   C3 9B    |  \xDB   \x{DB}   \x{00DB}  |
                        |  Ü   |  0220  |   U+00DC    |       DC       |     \xDC      |   C3 9C    |  \xDC   \x{DC}   \x{00DC}  |
                        |  Ý   |  0221  |   U+00DD    |       DD       |     \xDD      |   C3 9D    |  \xDD   \x{DD}   \x{00DD}  |
                        |  Þ   |  0222  |   U+00DE    |       DE       |     \xDE      |   C3 9E    |  \xDE   \x{DE}   \x{00DE}  |
                        |  ß   |  0223  |   U+00DF    |       DF       |     \xDF      |   C3 9F    |  \xDF   \x{DF}   \x{00DF}  |
                        |  à   |  0224  |   U+00E0    |       E0       |     \xE0      |   C3 A0    |  \xE0   \x{E0}   \x{00E0}  |
                        |  á   |  0225  |   U+00E1    |       E1       |     \xE1      |   C3 A1    |  \xE1   \x{E1}   \x{00E1}  |
                        |  â   |  0226  |   U+00E2    |       E2       |     \xE2      |   C3 A2    |  \xE2   \x{E2}   \x{00E2}  |
                        |  ã   |  0227  |   U+00E3    |       E3       |     \xE3      |   C3 A3    |  \xE3   \x{E3}   \x{00E3}  |
                        |  ä   |  0228  |   U+00E4    |       E4       |     \xE4      |   C3 A4    |  \xE4   \x{E4}   \x{00E4}  |
                        |  å   |  0229  |   U+00E5    |       E5       |     \xE5      |   C3 A5    |  \xE5   \x{E5}   \x{00E5}  |
                        |  æ   |  0230  |   U+00E6    |       E6       |     \xE6      |   C3 A6    |  \xE6   \x{E6}   \x{00E6}  |
                        |  ç   |  0231  |   U+00E7    |       E7       |     \xE7      |   C3 A7    |  \xE7   \x{E7}   \x{00E7}  |
                        |  è   |  0232  |   U+00E8    |       E8       |     \xE8      |   C3 A8    |  \xE8   \x{E8}   \x{00E8}  |
                        |  é   |  0233  |   U+00E9    |       E9       |     \xE9      |   C3 A9    |  \xE9   \x{E9}   \x{00E9}  |
                        |  ê   |  0234  |   U+00EA    |       EA       |     \xEA      |   C3 AA    |  \xEA   \x{EA}   \x{00EA}  |
                        |  ë   |  0235  |   U+00EB    |       EB       |     \xEB      |   C3 AB    |  \xEB   \x{EB}   \x{00EB}  |
                        |  ì   |  0236  |   U+00EC    |       EC       |     \xEC      |   C3 AC    |  \xEC   \x{EC}   \x{00EC}  |
                        |  í   |  0237  |   U+00ED    |       ED       |     \xED      |   C3 AD    |  \xED   \x{ED}   \x{00ED}  |
                        |  î   |  0238  |   U+00EE    |       EE       |     \xEE      |   C3 AE    |  \xEE   \x{EE}   \x{00EE}  |
                        |  ï   |  0239  |   U+00EF    |       EF       |     \xEF      |   C3 AF    |  \xEF   \x{EF}   \x{00EF}  |
                        |  ð   |  0240  |   U+00F0    |       F0       |     \xF0      |   C3 B0    |  \xF0   \x{F0}   \x{00F0}  |
                        |  ñ   |  0241  |   U+00F1    |       F1       |     \xF1      |   C3 B1    |  \xF1   \x{F1}   \x{00F1}  |
                        |  ò   |  0242  |   U+00F2    |       F2       |     \xF2      |   C3 B2    |  \xF2   \x{F2}   \x{00F2}  |
                        |  ó   |  0243  |   U+00F3    |       F3       |     \xF3      |   C3 B3    |  \xF3   \x{F3}   \x{00F3}  |
                        |  ô   |  0244  |   U+00F4    |       F4       |     \xF4      |   C3 B4    |  \xF4   \x{F4}   \x{00F4}  |
                        |  õ   |  0245  |   U+00F5    |       F5       |     \xF5      |   C3 B5    |  \xF5   \x{F5}   \x{00F5}  |
                        |  ö   |  0246  |   U+00F6    |       F6       |     \xF6      |   C3 B6    |  \xF6   \x{F6}   \x{00F6}  |
                        |  ÷   |  0247  |   U+00F7    |       F7       |     \xF7      |   C3 B7    |  \xF7   \x{F7}   \x{00F7}  |
                        |  ø   |  0248  |   U+00F8    |       F8       |     \xF8      |   C3 B8    |  \xF8   \x{F8}   \x{00F8}  |
                        |  ù   |  0249  |   U+00F9    |       F9       |     \xF9      |   C3 B9    |  \xF9   \x{F9}   \x{00F9}  |
                        |  ú   |  0250  |   U+00FA    |       FA       |     \xFA      |   C3 BA    |  \xFA   \x{FA}   \x{00FA}  |
                        |  û   |  0251  |   U+00FB    |       FB       |     \xFB      |   C3 BB    |  \xFB   \x{FB}   \x{00FB}  |
                        |  ü   |  0252  |   U+00FC    |       FC       |     \xFC      |   C3 BC    |  \xFC   \x{FC}   \x{00FC}  |
                        |  ý   |  0253  |   U+00FD    |       FD       |     \xFD      |   C3 BD    |  \xFD   \x{FD}   \x{00FD}  |
                        |  þ   |  0254  |   U+00FE    |       FE       |     \xFE      |   C3 BE    |  \xFE   \x{FE}   \x{00FE}  |
                        |  ÿ   |  0255  |   U+00FF    |       FF       |     \xFF      |   C3 BF    |  \xFF   \x{FF}   \x{00FF}  |
                        •------•--------•-------------•----------------•---------------•------------•----------------------------•
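
The C2 xx / C3 xx pattern in the Bytes column follows directly from the two-byte UTF-8 layout (110xxxxx 10xxxxxx). As a hedged illustration, the hand-written encoder below (`utf8_two_byte` is a name made up here) is checked against Python's own codecs for the whole U+00A0 to U+00FF range.

```python
def utf8_two_byte(code_point):
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes by hand."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("not a two-byte UTF-8 code point")
    lead = 0xC0 | (code_point >> 6)      # 110xxxxx: the top five bits
    trail = 0x80 | (code_point & 0x3F)   # 10xxxxxx: the low six bits
    return bytes([lead, trail])


for code_point in range(0xA0, 0x100):
    char = chr(code_point)
    # In Windows-1252 the byte value equals the code point for this range.
    assert char.encode("cp1252") == bytes([code_point])
    # In UTF-8 the same character takes two bytes, C2 xx or C3 xx.
    assert char.encode("utf-8") == utf8_two_byte(code_point)

print(utf8_two_byte(0xE9).hex())
```

For U+00A0 to U+00BF the shift yields a lead byte of C2 and the trail byte equals the code point itself; from U+00C0 onward the lead byte becomes C3 and the trail byte restarts at 80.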
                        

                        Preferably, prefix any search for a letter with the (?-i) modifier, so that the match is case-sensitive and you get a unique character!


                        BTW, Peter, why didn’t you mention the very useful Python script that you provided, which reports the hexadecimal and decimal Unicode code point of the character at the caret position? See below:

                        https://community.notepad-plus-plus.org/post/44448

                        @alan-kilborn’s version, in the same discussion, is interesting too, as it replaces all occurrences of a specific character with another character/string or with nothing:

                        https://community.notepad-plus-plus.org/post/56576
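
For readers who cannot follow the links, here is a standalone sketch of the kind of report such a script produces. It is not the linked script: those run inside Notepad++ and read the character at the caret from the editor, whereas this hypothetical `describe_char` helper simply takes a character as an argument.

```python
import unicodedata


def describe_char(char):
    """Return the hex and decimal code point of one character, plus its Unicode name."""
    code_point = ord(char)
    name = unicodedata.name(char, "<unnamed>")
    return f"U+{code_point:04X} (decimal {code_point}) {name}"


print(describe_char("\u2019"))
```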

                        Best Regards,

                        guy038

                        • PeterJones @guy038

                          @guy038 said in Examining a character?:

                          Not exactly! The RIGHT SINGLE QUOTATION MARK character (U+2019) does exist in an ANSI/Windows-1252 file: it is stored not as its Unicode code point but as the single byte 0x92.

                          I had forgotten that the smart quotes were in Windows 1252 encoding.

                          Which then begs the OP’s question of why “Convert to ANSI” didn’t properly convert the smart quote from UTF-8 to ANSI/Win-1252.

                          In my experiments, if I create the file

                          ‘smart singles’
                          “smart doubles”
                          

                          and save it as UTF-8, it saves the bytes

                          C:\Users\peter.jones\Downloads\TempData\nppCommunity>xxd smart-utf8.txt
                          00000000: e280 9873 6d61 7274 2073 696e 676c 6573  ...smart singles
                          00000010: e280 990d 0ae2 809c 736d 6172 7420 646f  ........smart do
                          00000020: 7562 6c65 73e2 809d 0d0a                 ubles....
                          

                          If I Convert to ANSI and save, the bytes change to

                          C:\Users\peter.jones\Downloads\TempData\nppCommunity>xxd smart-utf8.txt
                          00000000: 9173 6d61 7274 2073 696e 676c 6573 920d  .smart singles..
                          00000010: 0a93 736d 6172 7420 646f 7562 6c65 7394  ..smart doubles.
                          00000020: 0d0a                                     ..
                          

                          So, for me, doing Convert to ANSI does properly convert smart quotes.
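
The same byte counts can be cross-checked outside Notepad++. This sketch assumes the test file uses CRLF line endings (as the `0d0a` pairs in the dumps suggest) and that Python's `cp1252` codec matches the “ANSI” code page in use.

```python
# The two-line test file, with Windows line endings.
text = "\u2018smart singles\u2019\r\n\u201csmart doubles\u201d\r\n"

utf8_bytes = text.encode("utf-8")    # three bytes per smart quote
ansi_bytes = text.encode("cp1252")   # one byte per smart quote

print(len(utf8_bytes), utf8_bytes)
print(len(ansi_bytes), ansi_bytes)
```

The lengths come out as 42 and 34 bytes, matching the two dumps above: each of the four smart quotes shrinks from three bytes to one.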

                          Of course, if the original source already had the wrong bytes, or if Notepad++ already thought of the file as ANSI (so Convert to ANSI did nothing), that might explain the problem.
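
To tell those cases apart, one can check which interpretation the bytes on disk actually support. The following is a rough heuristic only, under the assumption that the file is either UTF-8 or some single-byte code page; `guess_encoding` is an illustrative name, not a Notepad++ feature.

```python
def guess_encoding(data):
    """Very rough guess: 'ascii', 'utf-8', or 'single-byte' (e.g. Windows-1252)."""
    if all(byte < 0x80 for byte in data):
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes such as a lone 0x92 are not valid UTF-8.
        return "single-byte"
    return "utf-8"


print(guess_encoding(b"it\xe2\x80\x99s"))  # smart quote stored as UTF-8
print(guess_encoding(b"it\x92s"))          # smart quote stored as Windows-1252
```

A "utf-8" result is still a guess, since E2 80 99 is also a legal (if unlikely) Windows-1252 sequence, but a file that reports "utf-8" while being treated as ANSI would fit the symptom described in this thread.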

                          BTW, Peter, why didn’t you mention the very useful Python script that you provided, which reports the hexadecimal and decimal Unicode code point of the character at the caret position?

                          Because in the original problem description, I thought the encoding settings might mean that the bytes that ended up in the Scintilla editor component might not match the actual bytes on the disk (people have noted that about the HexEditor plugin before, and the same would be true with PythonScript accessing the Scintilla editor contents). I didn’t want to recommend my script and then have it confuse him by claiming there were bytes there that aren’t actually in the file.

                          • guy038

                            Hi, @dave-joseph, @peterjones, @alan-kilborn and All,

                            Peter, I confirm your test. In a UTF-8 file without encoding problems, the option Encoding > Convert to ANSI correctly changes the three UTF-8 bytes of each smart quote into a single ANSI byte.


                            Concerning the second point, I understand: it seems it’s always better to inspect the individual bytes of a file with an external hex editor!

                            Cheers,

                            guy038
