    Unicode 'ÿ' , problem converting to Hex 'FF'

    • L
      LanceMarchetti
      last edited by

      Understanding that Notepad++ is strictly a text editor, I don’t expect magic when working with image binary. With that said, I’m stumped by an odd occurrence with my setup.
      How do I get the ASCII to HEX command to recognize ÿ (0xFF) as a permitted character for encoding? It’s the only character that halts the encoding process on my Notepad++ setup. All other characters encode perfectly fine. I have to manually change all occurrences of ‘ÿ’ to ‘FF’. Yet, on decode it works fine: it converts ‘FF’ to ‘ÿ’. But with encode, I’m not winning. Any suggestions?

      Version: 8.2, 32-bit
      [Image: ascii to hex issue with FF (Notepad++).png]

      • Mark OlsonM
        Mark Olson
        last edited by Mark Olson

        Yeah, that’s weird. Working in Notepad++ 8.5.8, I notice a different bug, where it converts ÿ into C3BF

        In any case, here’s a PythonScript script that converts from ASCII to hex. It will replace the current selection with hex if you have selected text, and if nothing is selected it will convert the entire file.

        import re
        
        # use a very large number of chars per line to keep it on one line.
        # if you prefer, you can set this to something smaller like 32.
        CHAR_PER_LINE = 1<<63
        
        def convert_to_hex_lines(text):
            raw_hex = ''.join(hex(ord(char))[2:].zfill(2) for char in text)
            if CHAR_PER_LINE >= len(raw_hex):
                return raw_hex
            # '.{1,N}' rather than '.{,N}': the latter also yields a trailing empty match
            return '\r\n'.join(re.findall('.{1,%s}' % CHAR_PER_LINE, raw_hex))
            
        selstart = editor.getSelectionStart()
        selend = editor.getSelectionEnd()
        
        if selstart == selend:
            broken_by_line = convert_to_hex_lines(editor.getText())
            editor.setText(broken_by_line)
        else:
            broken_by_line = convert_to_hex_lines(editor.getSelText())
            editor.replaceSel(broken_by_line)
        
        • Alan KilbornA
          Alan Kilborn @Mark Olson
          last edited by

          @Mark-Olson said in Unicode 'ÿ' , problem converting to Hex 'FF':

          I notice a different bug, where it converts ÿ into C3BF

          Well… is that part really a bug?
          I suppose it would depend upon what the Converter plugin is chartered to do, when working on data in a UTF-8 encoded file. After all, ÿ IS C3BF in UTF-8; see https://www.compart.com/en/unicode/U+00FF.
          Chances are good, however, that Converter was written at a time when little consideration was paid to encodings and everything was “ANSI” in Notepad++-speak.
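To make the FF versus C3BF point concrete, here is a small standard-library Python 3 sketch. It is not part of the Converter plugin or of the script posted earlier in this thread, and the variable names are only illustrative. Latin-1 is used here as a stand-in for a single-byte “ANSI” code page, which is an assumption.

```python
# The character at the heart of this thread: U+00FF
# (LATIN SMALL LETTER Y WITH DIAERESIS).
ch = "\u00ff"

# In a single-byte encoding such as Latin-1 (assumed stand-in for "ANSI"),
# the character is stored as the single byte FF.
ansi_hex = ch.encode("latin-1").hex().upper()

# In UTF-8 the same character needs two bytes, C3 BF.
utf8_hex = ch.encode("utf-8").hex().upper()

print(ansi_hex)  # FF
print(utf8_hex)  # C3BF
```

So whether a converter “should” emit FF or C3BF depends on whether it works on characters of a single-byte code page or on the bytes of a UTF-8 buffer.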

          I suppose the script is reasonable, as its input (per the OP) is a binary file, where there is no “encoding”. I’d hope that such a binary file gets loaded as ANSI (in case one intends to go back the other way, i.e., “hex -> ascii” at some point).

          Probably this whole idea is one properly filed under Bad Ideas.
          It would seem more reasonable to use a hex editor that has the requisite “conversion” functions native to it.

          • Mark OlsonM
            Mark Olson
            last edited by

            @Alan-Kilborn
            Good point about how yuml is C3BF in UTF-8, that makes sense.

            And I strongly agree that you should find a proper hex editor, and not use Notepad++ for this kind of thing.

            I personally would probably use Python Pillow for editing images, but admittedly I haven’t done too much image editing, and in any case further discussion of that is off-topic for this forum.

            • rdipardoR
              rdipardo
              last edited by

              My hunch is that ordinals which map to any printable character are rendered as printable characters. You would need to use a true hex editor if you wanted every single byte to render as hex.

              • L
                LanceMarchetti @Mark Olson
                last edited by

                @Mark-Olson I appreciate the script. Thanks

                • L
                  LanceMarchetti @Alan Kilborn
                  last edited by

                  @Alan-Kilborn Points noted. Thanks Alan.

                  • guy038G
                    guy038
                    last edited by guy038

                    Hello, @lancemarchetti, @mark-olson, @alan-kilborn, @rdipardo and All,

                    @mark-olson, your Python script works nicely for ANSI, UTF-8 and UTF-8-BOM encoded files. Unfortunately, it failed to get the right character representation for, either, the UCS-2 BE BOM and UCS-2 LE BOM encodings !

                    • Firstly, the important thing to note is that any character outside the BMP, i.e. with a code point above \x{FFFF}, is strictly forbidden in files encoded with the UCS-2 BE BOM or the UCS-2 LE BOM encoding !

                    • Secondly, in these two specific encodings, each character between \x{0000} and \x{FFFF} is coded with two consecutive bytes, which are :

                      • The Most Significant Byte ( MSB ) first and the Least Significant Byte second, for the UCS-2 BE BOM encoding

                      • The Least Significant Byte ( LSB ) first and the Most Significant Byte second, for the UCS-2 LE BOM encoding


                    For instance :

                    • Open a new tab, which should be a UTF-8 encoded file, by default

                    • Enter the simple ߿� text

                    • Save the modifications

                    • Open a second new tab

                    • Again, enter the simple ߿� text

                    • Run the Encoding > UCS-2 BE BOM encoding option

                    • Save the modifications

                    • Open a third new tab

                    • Again, enter the simple ߿� text

                    • Run the Encoding > UCS-2 LE BOM encoding option

                    • Save the modifications

                    These 3 files contain a same text of 3 characters :

                    • The DEL control character, of Unicode value \x{007F}

                    • The ߿ character ( NKO TAMAN SIGN ), of Unicode value \x{07FF}

                    • The � character ( REPLACEMENT CHARACTER), of Unicode value \x{FFFD}


                    If we run your python script on the UTF-8 file, it correctly finds the sequence 7fdfbfefbfbd which represents :

                    • The 1-byte value of the DEL char : 7f

                    • The 2-byte value of the NKO TAMAN SIGN char : dfbf

                    • The 3-byte value of the REPLACEMENT char : efbfbd


                    However, for the two other files, it wrongly gives the same result as for the UTF-8 encoded file.

                    Normally, it should have output the text :

                    • 007f07fffffd for the UCS-2 BE BOM encoded file, so the three exact chars 007f, then 07ff and fffd

                    And :

                    • 7f00ff07fdff for the UCS-2 LE BOM encoded file, which corresponds to the exact chars 007f, 07ff and fffd

                    You can verify my assumptions with a true hexadecimal editor. Note that, in addition to the file hexadecimal contents, you should see the two BOM bytes, at the very beginning of these files, which are :

                    • FEFF for the UCS-2 BE BOM encoded file

                    • FFFE for the UCS-2 LE BOM encoded file
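The expected byte sequences listed above can be cross-checked outside Notepad++ with a few lines of standard-library Python 3. This is only a sketch: it uses Python's utf-16-be and utf-16-le codecs as stand-ins for the UCS-2 BE and UCS-2 LE encodings, which is an assumption that holds for characters inside the BMP such as these three.

```python
import codecs

# DEL, NKO TAMAN SIGN, REPLACEMENT CHARACTER
text = "\u007f\u07ff\ufffd"

utf8_hex = text.encode("utf-8").hex()
be_hex = text.encode("utf-16-be").hex()  # most significant byte first
le_hex = text.encode("utf-16-le").hex()  # least significant byte first

# The byte order marks that a hex editor shows at the start of each file.
bom_be = codecs.BOM_UTF16_BE.hex().upper()
bom_le = codecs.BOM_UTF16_LE.hex().upper()

print(utf8_hex)  # 7fdfbfefbfbd
print(be_hex)    # 007f07fffffd
print(le_hex)    # 7f00ff07fdff
print(bom_be)    # FEFF
print(bom_le)    # FFFE
```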


                    Now, a simple question : How to get the hexadecimal output, in uppercase letters, with your script ?!

                    Best Regards,

                    guy038

                    P.S. :

                    If we add a character with Unicode value over \x{FFFF}, in the UTF-8 file, only, for example the 𐀀 char, it is correctly interpreted, in hexa as the F0908080 sequence !

                    • Alan KilbornA
                      Alan Kilborn @guy038
                      last edited by

                      @guy038 said in Unicode 'ÿ' , problem converting to Hex 'FF':

                      Unfortunately, it failed to get the right character representation for, either, the UCS-2 BE BOM and UCS-2 LE BOM encodings !

                      My first thought was to say this was way outside the original scope/need, but on second thought, that’s fine – talk about what you want to talk about! :-)


                      How to get the hexadecimal output, in uppercase letters, with your script ?!

                      The hex() function is the culprit here; e.g. calling hex(127) returns '0x7f'. To get uppercase you can do hex(127).upper()[2:] which will return '7F'.
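As an alternative to slicing the result of hex(), a format specification gives uppercase and zero-padding in one step. The helper below is a hypothetical illustration (the name to_hex_upper is not from the script in this thread) and is written for a Python 3 bytes object.

```python
def to_hex_upper(data):
    """Return the bytes in `data` as one uppercase, two-digit-per-byte hex string."""
    # "02X" means: pad with zeros to width 2, uppercase hexadecimal.
    return "".join(format(byte, "02X") for byte in data)

sample = to_hex_upper(b"\x7f\xff\x0a")
print(sample)  # 7FFF0A

# The slicing approach gives the same digits, but still needs zfill(2)
# for values below 0x10:
sliced = hex(10).upper()[2:]           # 'A'
padded = hex(10).upper()[2:].zfill(2)  # '0A'
```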

                      • L
                        LanceMarchetti @guy038
                        last edited by

                        @guy038 Wow, I was fascinated by your break-down. I’m learning a lot. Thanks also to Alan for allowing this chat in relation to the byte to hex issue. Guy, could you drop me a mail perhaps for future discussion related to image binary manipulation. Thanks. (bWFyY2hldHRpLmxhbmNlQGdtYWlsLmNvbQ)

                        • guy038G
                          guy038
                          last edited by

                          Hi @alan-kilborn and All,

                          Regarding the usefulness of my previous post, I simply thought that providing a Python solution to show hexadecimal values of characters, from within N++, should work for all the Notepad++ encodings !


                          Regarding my question, the solution is, then, to replace, in the Python script, the line :

                              raw_hex = ''.join(hex(ord(char))[2:].zfill(2) for char in text)
                          

                          by this one :

                              raw_hex = ''.join(hex(ord(char)).upper()[2:].zfill(2) for char in text)
                          

                          Thanks for your help, Alan !

                          BR

                          guy038

                          • guy038G
                            guy038
                            last edited by guy038

                            Hi, @LanceMarchetti ,

                            Here is my temporary E-mail address :

                            Best Regards,

                            guy038

                            • guy038G
                              guy038
                              last edited by guy038

                              Hello, @lancemarchetti,

                              Silly of me ! I’ve just understood how you coded your e-mail address !

                              So, I’m going to send you a first e-mail, very soon !

                              BR

                              guy038

                              • L
                                LanceMarchetti @Mark Olson
                                last edited by LanceMarchetti

                                @Mark-Olson

                                Hi Mark, thanks for the above py ascii-hex script… It works. But I only noticed today that when encoding a string that has NUL byte values, it encodes them to ‘20’ (space) instead of ‘00’ (NUL). Can you please show me how to fix that?
                                [Image: Hex-NUL-Issue.png]

                                • mkupperM
                                  mkupper @LanceMarchetti
                                  last edited by

                                  @LanceMarchetti said in Unicode 'ÿ' , problem converting to Hex 'FF':

                                  But I only noticed today that when encoding a string that has NUL byte values, it encodes them to ‘20’ (space) instead of ‘00’ (NUL). Can you please show me how to fix that?

                                  NUL does not appear in UTF-8 encoded Unicode unless the intent is to have codepoint U+0000 characters in your files. Keep in mind that NUL is also the string terminator for many things, including the Windows copy/paste of text mechanism.

                                  While you can construct files that contain U+0000 I would only do so when testing edge conditions. I would expect interesting behavior for U+0000 such as them turning into spaces. Notepad++ is a text editor, not a binary data editor.

                                  NUL, as a byte value, can and will appear in UTF-16 encoded Unicode for code points U+0000 and U+0001 to U+00FF. However, you are then not supposed to be looking at the bytes and thus should never need to deal with NUL bytes unless you have U+0000 in your files.
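A short standard-library Python 3 sketch of the point about NUL bytes. It is illustrative only, and utf-16-le is chosen arbitrarily as the UTF-16 variant.

```python
s = "A\u00ff"  # two ordinary characters, neither of them U+0000

utf8_bytes = s.encode("utf-8")       # b'A\xc3\xbf'
utf16_bytes = s.encode("utf-16-le")  # b'A\x00\xff\x00'

# UTF-8 never produces a 00 byte unless the text contains U+0000 itself.
nul_in_utf8 = 0 in utf8_bytes

# In UTF-16, every code point from U+0001 to U+00FF carries one 00 byte.
nul_count_utf16 = utf16_bytes.count(0)

print(nul_in_utf8)      # False
print(nul_count_utf16)  # 2
```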

                                  • Alan KilbornA
                                    Alan Kilborn @mkupper
                                    last edited by

                                    @mkupper said in Unicode 'ÿ' , problem converting to Hex 'FF':

                                    Notepad++ is a text editor, not a binary data editor.

                                    OP said, initially:

                                    Understanding that Notepad++ is strictly a text editor

                                    But, I question if he really understands that what he’s attempting to do isn’t advisable.

                                    • PeterJonesP
                                      PeterJones @LanceMarchetti
                                      last edited by

                                      @LanceMarchetti ,

                                      editor.getSelText(), and anything else that uses Notepad++'s normal interface for accessing the text characters, will convert the NUL to a space. (The same thing happens if you copy a NUL inside Notepad++'s GUI.)

                                      As has been said, Notepad++ was designed as a text editor, not a hex editor.

                                      That said, if you want to shoot yourself in the foot while trying to edit a PNG or other binary file with a text editor (which is obviously a bad idea), there are other commands available to PythonScript and similar.

                                      For example, editor.getCharAt(p) does properly return 0 when p is a variable containing the 0-based character position of a NUL inside the document. So, instead of having the custom convert_to_hex_lines() work on the result of getSelText(), you could rewrite it to take the range, as in convert_to_hex_lines(selstart, selend), and then iterate over that range of offsets; instead of hex(ord(char)), you would use hex(editor.getCharAt(p)). I’m not going to write it completely for you (nor should you expect anyone here to do it for you), but I have given you enough that, if you want to turn this into a learning exercise, I am sure you could figure out how to take what’s above, combined with my description here, and eventually get it to work.
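For readers who want a starting point, the idea described above can be sketched roughly as follows. This is not a tested PythonScript solution. The function name hex_of_range and the stand-in buffer are hypothetical, and the sketch takes the "get the value at offset p" function as a parameter so that it also runs outside Notepad++. The & 0xFF mask is a defensive assumption, in case bytes above 0x7F are reported as negative values.

```python
def hex_of_range(get_char_at, start, end):
    """Build an uppercase hex string from the values at offsets start..end-1.

    get_char_at is any callable that maps an offset to an integer byte value.
    """
    # Assumption: mask to 0..255 in case high bytes come back negative.
    return "".join("%02X" % (get_char_at(p) & 0xFF) for p in range(start, end))

# Stand-in for a document buffer (hypothetical data containing NUL bytes).
fake_doc = bytearray(b"\x89PNG\x00\x00\xff")

def fake_get_char_at(pos):
    return fake_doc[pos]

result = hex_of_range(fake_get_char_at, 0, len(fake_doc))
print(result)  # 89504E470000FF
```

Inside PythonScript, the call would presumably look like hex_of_range(editor.getCharAt, editor.getSelectionStart(), editor.getSelectionEnd()), with the result written back through editor.replaceSel() as in the original script.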

                                      • Mark OlsonM
                                        Mark Olson
                                        last edited by

                                        If NULs are involved, I would kick the problem over to VSCode, which does not have such issues with NULs.

                                        • Alan KilbornA
                                          Alan Kilborn @Mark Olson
                                          last edited by Alan Kilborn

                                          @Mark-Olson said :

                                          I would kick the problem over to VSCode, which does not have such issues with NULs

                                          There really is nothing intrinsically wrong with a text editor supporting NULs in documents, and, if VSCode indeed supports this, then bully for it. It was designed in “modern times”.

                                          Notepad++ is bound by a 20+ year old legacy, when null characters were (only) used to terminate C-strings. They are still used as C-string terminators, just not in such a “willy nilly” application as the “olden days”, when the coders were aghast at the possibility of a null character in a document (and thus made compromises based upon this).
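The C-string behavior mentioned above can be illustrated from Python with the standard ctypes module. This is only a sketch of NUL-terminated string semantics in general, not of any Notepad++ internals.

```python
import ctypes

data = b"ab\x00cd"  # five bytes with a NUL in the middle

# create_string_buffer copies the bytes and appends a terminating NUL.
buf = ctypes.create_string_buffer(data)

# .value reads the buffer the way C string functions do: it stops at the first NUL.
as_c_string = buf.value

# .raw returns every byte; slicing drops the terminator that was appended.
all_bytes = buf.raw[:len(data)]

print(as_c_string)  # b'ab'
print(all_bytes)    # b'ab\x00cd'
```

Any code path that treats the document as a C-string sees only the part before the first NUL, which is why editors with that heritage have to special-case NUL bytes.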

                                          I think Notepad++ will get better with this as time continues to march on, but it is going to be slow in happening. Best thing to do is not to do anything in Notepad++ with files that need/have these characters.

                                          • mkupperM
                                            mkupper @Mark Olson
                                            last edited by

                                            @Mark-Olson said in Unicode 'ÿ' , problem converting to Hex 'FF':

                                            If NULs are involved, I would kick the problem over to VSCode, which does not have such issues with NULs.

                                            Notepad++ and Scintilla support NULs within text files. The problems are in the details such as plugins. As Windows does not support NULs within text strings you can’t copy/paste strings containing NULs to or from any editor.
