Hello, @peterjones, @terry-R and All,
Peter, you said, in your last post :
the “length” field is the (equivalent) file size …
it’s quite a good approximation !
Here is what the length field, located in the status bar, represents, depending on the current encoding :
For an ANSI encoded file : the size of current file
For a UTF-8 encoded file : the size of current file
For a UTF-8 BOM encoded file : the size of current file - 3 bytes ( the BOM )
For a UTF-16 BE BOM encoded file : the displayed number is erroneous ! It should be the total number of characters of current file × 2 - 2 bytes ( the BOM )
For a UTF-16 LE BOM encoded file : the displayed number is erroneous ! It should be the total number of characters of current file × 2 - 2 bytes ( the BOM )
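To check these figures outside of N++, here is a minimal Python sketch. It only computes, for a given text, the number of bytes on disk and the number of bytes without the BOM, for each Unicode encoding of the list above. The function name sizes and the sample string are mine, for illustration only, and nothing here reproduces how N++ computes its length field. ANSI is left out, as the result depends on the code page in use.

```python
import codecs

def sizes(text):
    """Return {encoding label: (bytes on disk, bytes without BOM)} for text."""
    # (label as shown in the N++ Encoding menu, Python codec, BOM bytes)
    variants = [
        ("UTF-8", "utf-8", b""),
        ("UTF-8 BOM", "utf-8", codecs.BOM_UTF8),
        ("UTF-16 BE BOM", "utf-16-be", codecs.BOM_UTF16_BE),
        ("UTF-16 LE BOM", "utf-16-le", codecs.BOM_UTF16_LE),
    ]
    results = {}
    for label, codec, bom in variants:
        payload = text.encode(codec)  # these codecs do not write a BOM themselves
        results[label] = (len(bom) + len(payload), len(payload))
    return results

sample = "h\u00e9llo"  # "héllo" : 5 characters, all inside the BMP
table = sizes(sample)
for label, (on_disk, without_bom) in table.items():
    print(f"{label:14} on disk: {on_disk:3}  without BOM: {without_bom:3}")
```

For a text containing BMP characters only, the UTF-16 size without the BOM is exactly twice the number of characters.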
Reminders :
With the UTF-16 BE BOM or UTF-16 LE BOM encodings, do not use characters outside the BMP ( so, with Unicode value > \x{FFFF} )
In this regard, the names of these two encodings are misnomers. We should roll back to the old N++ names of UCS-2 BE BOM and UCS-2 LE BOM
Indeed, in a true UTF-16 file, a character is coded with two bytes if its Unicode code-point is <= \x{FFFF} and with four bytes if its Unicode code-point is > \x{FFFF} ( using a surrogate pair, i.e. two 16-bit surrogates )
While, in a UCS-2 file, characters are always coded with two bytes ( chars of the BMP only ) and the surrogate zone ( from D800 to DFFF ) is not used
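The difference between a BMP character and a character outside the BMP can be seen with a short Python sketch, which lists the 16-bit code units produced by a true UTF-16 encoder. The helper name utf16_units and the two sample characters are my own choice, for illustration only.

```python
def utf16_units(ch):
    """Return the UTF-16 code units (as integers) used to encode ch."""
    data = ch.encode("utf-16-be")  # big-endian, no BOM written
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

bmp_char = "\u20ac"          # EURO SIGN, inside the BMP
astral_char = "\U0001F600"   # GRINNING FACE, outside the BMP

print([hex(u) for u in utf16_units(bmp_char)])     # one unit  = 2 bytes
print([hex(u) for u in utf16_units(astral_char)])  # two units = 4 bytes
```

The two units of the second character both lie in the surrogate zone ( from D800 to DFFF ), which a pure UCS-2 file never uses.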
In addition, for a UTF-16 BE BOM or UTF-16 LE BOM encoded file, N++ oddly displays, in the length zone, the bytes count as if it were a UTF-8 encoded file :-((
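To see how far apart the two counts can be, here is a small Python sketch which returns, for the same text, its size in bytes as UTF-8 and as UTF-16 without the BOM. The function name is hypothetical and the code does not read anything from N++ : it only shows the two values being compared.

```python
def utf8_vs_utf16_bytes(text):
    """Return (UTF-8 byte count, UTF-16 byte count without BOM) for text."""
    as_utf8 = len(text.encode("utf-8"))
    as_utf16 = len(text.encode("utf-16-le"))  # the "-le" codec writes no BOM
    return as_utf8, as_utf16

print(utf8_vs_utf16_bytes("abc"))         # ASCII only
print(utf8_vs_utf16_bytes("\u20acuro"))   # "€uro", starts with the EURO SIGN
```

The two numbers differ for any non-empty text made of BMP characters, as such a character needs one to three bytes in UTF-8 but always two bytes in UTF-16.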
The moral of this story is to stick to the UTF-8 encoding, in all circumstances !
Best Regards
guy038