Unicode 'ÿ' , problem converting to Hex 'FF'
-
Hello, @lancemarchetti, @mark-olson, @alan-kilborn, @rdipardo and All,
@mark-olson, your python script works nice for
ANSI
,UTF-8
andUTF-8-BOM
encoded files. Unfortunately, it failed to get the right character representation for, either, theUCS-2 BE BOM
andUCS-2 LE BOM
encodings !-
Firstly, the important thing to note is that any character over the BMP, so with code char
> x{FFFF}
, is strictly forbidden in files encoded with theUCS-2 BE BOM
or theUCS-2 LE BOM
encoding ! -
Secondly, in these two specific encodings, each character between,
\x{0000}
and\x{FFFF}
, is coded with two consecutive bytes, which are :-
The
Most Significant Byte
first ( MSB) and theLeast Significant Byte
in second, for theUCS-2 BE BOM
encoding -
The
Least Significant Byte
first ( LSB) and theMost Significant Byte
in second, for theUCS-2 LE BOM
encoding
-
For instance :
-
Open a new tab, which should be an
U8TF-8
encoded file, by default -
Enter the simple
߿�
text -
Save the modifications
-
Open a second new tab
-
Again, enter the simple
߿�
text -
Run the
Encoding > UCS-2 BE BOM
encoding option -
Save the modifications
-
Open a third new tab
-
Again, enter the simple
߿�
text -
Run the
Encoding > UCS-2 LE BOM
encoding option -
Save the modifications
These
3
files contain a same text of3
characters :-
The
character ( DEL ), of Unicode value\x{007F}
-
The
߿
character ( NKO TAMAN SIGN ), of Unicode value\x{07FF}
-
The
�
character ( REPLACEMENT CHARACTER), of Unicode value\x{FFFD}
If we run your python script on the
UTF-8
file, it correctly finds the sequence7fdfbfefbfbd
which represents :-
The
1
-byte value of the DEL char :7f
-
The
2
-bytes value of the NKO TAMAN SIGN char :dfbf
-
The
3
-bytes value of the REPLACEMENT char :efbfbd
However, for the two other files, it wrongly gives the same results as with the
UTF-8
encoded fileNormally, it should had outputted the text :
007f07fffffd
for theUCS-2 BE BOM
encoded file, so the three exact chars007f
, then07ff
andfffd
And :
7f00ff07fdff
for theUCS-2 LE BOM
encoded file, which corresponds to the exact chars007f
,07ff
andfffd
You can verify my assumptions with a true hexadecimal editor. Note that, in addition to the file hexadecimal contents, you should see the two
BOM
bytes, at the very beginning of these files, which are :-
FEFF
for theUCS-2 BE BOM
encoded file -
FFFE
for theUCS-2 LE BOM
encoded file
Now, a simple question : How to get the hexadecimal output, in uppercase !letters, with your script ?!
Best Regards,
guy038
P.S. :
If we add a character with Unicode value over
\x{FFFF}
, in theUTF-8
file, only, for example the𐀀
char, it is correctly interpreted, in hexa as theF0908080
sequence ! -
-
@guy038 said in Unicode 'ÿ' , problem converting to Hex 'FF':
Unfortunately, it failed to get the right character representation for, either, the UCS-2 BE BOM and UCS-2 LE BOM encodings !
My first thought was to say this was way outside the original scope/need, but on second thought, that’s fine – talk about what you want to talk about! :-)
How to get the hexadecimal output, in uppercase !letters, with your script ?!
The
hex()
function is the culprit here; e.g. callinghex(127)
returns'0x7f'
. To get uppercase you can dohex(127).upper()[2:]
which will return'7F'
. -
@guy038 Wow, I was fascinated by your break-down. I’m learning a lot. Thanks also to Alan for allowing this chat in relation to the byte to hex issue. Guy, could you drop me a mail perhaps for future discussion related to image binary manipulation. Thanks. (bWFyY2hldHRpLmxhbmNlQGdtYWlsLmNvbQ)
-
Hi @alan-kilborn and All,
Regarding the usefulness of my previous post, I simply thought that providing a Python solution to show hexadecimal values of characters, from within N++, should work for all the Notepad++ encodings !
Regarding my question, the solution is, then, to replace, in the Python script, the line :
raw_hex = ''.join(hex(ord(char))[2:].zfill(2) for char in text)
by this one :
raw_hex = ''.join(hex(ord(char)).upper()[2:].zfill(2) for char in text)
Thanks for your help, Alan !
BR
guy038
-
-
Hello, @lancemarchetti,
Silly of me ! I’ve just understood how you coded your e-mail address !
So, I’m going to send you a first e-mail, very soon !
BR
guy038
-
Hi Mark, thanks for the above py ascii-hex script…It works. But I only noticed today that when encoding a string that has NUL byte values , it encodes them to ‘20’ (space) instead of ‘00’ (NUL). Can you please show me how to fix that.?
-
@LanceMarchetti said in Unicode 'ÿ' , problem converting to Hex 'FF':
But I only noticed today that when encoding a string that has NUL byte values , it encodes them to ‘20’ (space) instead of ‘00’ (NUL). Can you please show me how to fix that.?
NUL
does not appear in UTF-8 encoded Unicode unless the intent is to have codepoint U+0000 characters in your files. Keep in mind thatNUL
is also the string terminator for many things, including the Windows copy/paste of text mechanism.While you can construct files that contain U+0000 I would only do so when testing edge conditions. I would expect interesting behavior for U+0000 such as them turning into spaces. Notepad++ is a text editor, not a binary data editor.
NUL
, as a byte value, can and will appear in UTF-16 encoded Unicode for code points U+0000 and U+0001 to U+00FF. However, you are then not supposed to be looking at the bytes and thus should never need to deal with NUL bytes unless you have U+0000 in your files. -
@mkupper said in Unicode 'ÿ' , problem converting to Hex 'FF':
Notepad++ is a text editor, not a binary data editor.
OP said, initially:
Understanding that Notepad++ is strictly a text editor
But, I question if he really understands that what he’s attempting to do isn’t advisable.
-
The
editor.getSelText()
and anything else that uses Notepad++'s normal interface for accessing the text characters will convert the NUL to a space. (The same things if you Copy a NUL inside Notepad++'s GUI.)As is said, Notepad++ was designed as a text editor, not a hex editor.
That said, if you want to shoot yourself in the foot while trying to edit a PNG or other binary file with a text editor (which is obviously a bad idea), there are other commands available to PythonScript and similar.
For example,
editor.getCharAt(p)
does properly return 0 whenp
is a variable containing the 0-based character position of a NUL inside the document. So if instead of having the customconvert_to_hex_lines()
use the getSelText(), you could instead rewrite it to pass in theconvert_to_hex_lines(selstart,selend)
, and then change to iterating over that range of offsets; instead ofhex(ord(char))
, you would usehex(editor.getCharAt(p))
. I’m not going to write it completely for you (nor should you expect anyone here to do it for you), but I have given you enough that if you want to turn this into a learning exercise, I am sure you could figure out how to take what’s above, combined with my description here, and eventually get it to work. -
If NULs are involved, I would kick the problem over to VSCode, which does not have such issues with NULs.
-
@Mark-Olson said :
I would kick the problem over to VSCode, which does not have such issues with NULs
There really is nothing intrinsically wrong with a text editor supporting NULs in documents, and, if VSCode indeed supports this, then bully for it. It was designed in “modern times”.
Notepad++ is bound by a 20+ year old legacy, when null characters were (only) used to terminate C-strings. They are still used as C-string terminators, just not in such a “willy nilly” application as the “olden days”, when the coders were aghast at the possibility of a null character in a document (and thus made compromises based upon this).
I think Notepad++ will get better with this as time continues to march on, but it is going to be slow in happening. Best thing to do is not to do anything in Notepad++ with files that need/have these characters.
-
@Mark-Olson said in Unicode 'ÿ' , problem converting to Hex 'FF':
If NULs are involved, I would kick the problem over to VSCode, which does not have such issues with NULs.
Notepad++ and Scintilla support NULs within text files. The problems are in the details such as plugins. As Windows does not support NULs within text strings you can’t copy/paste strings containing NULs to or from any editor.
-
@mkupper said in Unicode 'ÿ' , problem converting to Hex 'FF':
Notepad++ and Scintilla support NULs within text files.
Not fully (as I was trying to say in my previous post).
Anytime some text data is pulled into a C-string for processing, NUL bytes in that data is going to cause a problem, because it will look like the end-of-text right at the NUL byte, instead of the full length.
There are MANY examples of this in the Notepad++ source.
A very recent example (yesterday) of a user-reported problem of this nature is HERE. -
@Alan-Kilborn said in Unicode 'ÿ' , problem converting to Hex 'FF':
Not fully (as I was trying to say in my previous post).
Anytime some text data is pulled into a C-string for processing, NUL bytes in that data is going to cause a problem, because it will look like the end-of-text right at the NUL byte, instead of the full length.
There are MANY examples of this in the Notepad++ source.
A very recent example (yesterday) of a user-reported problem of this nature is HERE .Thank you - I see that one is a big issue. I did some testing.
- The convert case functions are ok.
- Splitting and joining lines are ok.
Besides sorting, these Notepad++ operations have issues:
- The reverse and randomize line order functions truncate the selection at the NUL.
- Remove Duplicate lines does nothing if there are no duplicates before a NUL… If there are duplicate lines before line(s) with NULs then the file or selection gets truncated at the first NUL and then the dupe-removal happens. I guess the initial scan for dupes is truncated at the NULL and it decides to do nothing. If there are dupes then those are removed and the NUL truncated buffer is written back to Scintilla.
I discovered that the Edit / Line Operations / Paste Special / copy and paste binary content work for text data containing NULs. That’s useful to keep in mind testing this sort of stuff.
-
Hi, @lancemarchetti, @mark-olson, @alan-kilborn, @rdipardo and All,
In my previous post, I said :
- Firstly, the important thing to note is that any character over the BMP, so with code char
> x{FFFF}
, is strictly forbidden in files encoded with theUCS-2 BE BOM
or theUCS-2 LE BOM
encoding !
This assumption is true but, in a N++ point of vue, is not accurate anymore because, since the
v8.0
version of N++, these encodings have been improved to the two new encodingsUTF-16 BE BOM
andUTF-16 LE BOM
And, indeed, these new encodings do support characters over the
BMP
! Note that all the characters over theBMP
are encoded with their surrogate representation, containing two16-bytes
code units. For instance the Unicode char\x{1F4A6}
is, therefore, encoded with its surrogate representation\x{D83D}\x{DCA6}
It’s important to point out that, within N++, you CANNOT search the character
\x{1F4A6}
itself but you can search it via its surrogate representation which is\x{D83D}\x{DCA6}
BTW, if someone is interessed, I created a macro which does the translation
\x{[#]#####}
to its corresponding surrogate pair\x{####}\x{####}
, for any character over theBMP
, between\x{10000}
and\x{10FFFF}
, of the current selection !
So, as an example :
- For a file containing ONLY the over
BMP
char SPLASHING SWEAT SYMBOL :💦
, of Unicode valueU + 1F4A6
, theUTF-16 BE BOM
encoded file would contain the bytes :
FE FF D8 3D DC A6
( So, theBOM
and the char, with the MSB byte first, and then, the LSB byte )- For a file containing ONLY the over
BMP
char “SPLASHING SWEAT SYMBOL” :💦
, of Unicode valueU + 1F4A6
, theUTF-16 LE BOM
encoded file would contain the bytes :
FF FE 3D D8 A6 DC
( So theBOM
and the char, with the LSB byte first, and then, the MSB byte )Best Regards,
guy038
- Firstly, the important thing to note is that any character over the BMP, so with code char
-
@mkupper said in Unicode 'ÿ' , problem converting to Hex 'FF':
these Notepad++ operations have issues
The list is endless (well, it IS finite, but perhaps doesn’t seem like it). :-)
HERE is another interesting one. -
@guy038 said in Unicode 'ÿ' , problem converting to Hex 'FF':
if someone is interessed, I created a macro which does the translation
I suppose the interest might be small, but why not post the macro?
-
@PeterJones Thanks Peter. I will try my best to figure it out. I’m currently 55 and have zero programming background, but find computers and binary fascinating.
Just to recap, I will try to keep this a short and as comprehensive as possible.
Example Of The ASCII to HEX Issue I am Currently Experiencing in NP++:Encoding=UTF-8
StringDataType=Image/Binary
TotalBytes=10UseCase: The manual editing of the dimensions sector in the image binary allows for the creation of CTF challenges. Manual BOM manipulation allows for forced cropping and the concealment of color-data in the rest of the image. It can also be used to create pixelated patterns for unraveling. Notepad++ has met all these requirements for me so far. So I am pretty impressed at what I can get away with in various image formats.
THIS EXAMPLE :
Binary: GIF89a NUL NUL
ASCII -> Hex: GIF89a20002000
Bytes: 1-6 , Header
Bytes: 7-10 , Dimensions
The above ASCII to Hex conversion is 100% correct and was performed using the ASCII->Hex converter Ver4.3 by Don Ho which utilizes NppConverter.dll
However this converter does not convert ‘ÿ’ to ‘FF’ for some reason. I therefore requested help on the Notepad++ forum and a py script was suggested to help with the ‘ÿ’ to ‘FF’ issue. The script worked successfully, but unfortunately landed up converting NUL bytes ‘00’ to spaces ‘20’.This is not helpful due to the fact that a space in ASCII is represented by the value ‘32’ (Hex ‘20’).
So when it comes to image dimensions as in the GIF example above 20 00 20 00 , indicates a 32x32 pixel image.
NOTE: Using a space ‘20’ to replace a NUL ‘00’ will still result in a 32x32 Gif image,HOWEVER, doing this to the dimension sector of a Jpeg will result in a 32x32px becoming a 8244x8224px image! Because the Big-Endian byte order of the 4-byte dimension block in Jpeg will translate 20202020 as 8244x8244 instead of 32x32 as in the case with GIF.
This is why I need the script to correctly translate NUL bytes to ‘00’ and NOT ‘20’, otherwise it will create outrageous dimensions in formats like Jpeg, Webp and PNG.
If NppConverter.dll is actually doing it right, then why can we not replicate it in Python?Thanks once again.
Lance Marchetti
-
@Mark-Olson I did not know that about VSCode, thanks, Ill look into it. Mark.