Invisible characters
-
Hello,
I received a CSV file containing an invisible character that is shown as “PM” when Show All Characters is turned on. The file encoding is Turkish ANSI.
A hex editor displays this character as 0x9E.
I checked http://www.december.com/html/spec/ascii.html but could not work out what it is. Is it a common character of some sort?
Thanks & regards,
Ertan
-
Why are you looking at a 7bit ASCII chart for a specific 8bit Turkish “ANSI” encoding?
Narrowing to the sub-list of characters
"ZYTİNYAĞ TABAĞI"
Notepad++ lists four Turkish encodings:
- ISO 8859-3: this has a different byte for İ, so I’m assuming that’s not what you meant.
- ISO 8859-9: this looks compatible on the bytes I checked, but has nothing defined in positions 0x80-0x9F, so 0x9D would not be a valid character in that encoding
- OEM-857: this doesn’t match most of your characters, but does put Ş at hex position 0x9E
- Windows-1254: this looks compatible on the bytes I checked, but has nothing defined in 0x9D and 0x9E

Ş is at 0x9E in OEM-857, but nothing else matches with that one. Ş is at 0xDE in Win-1254 and ISO 8859-9.

It may be that the program or person that generated that unknown character was using a “standard-encoding-plus-extra” to try to get more characters (sometimes “unused” slots are filled with custom characters in certain applications or derived standards); or it could be that the program/generator mixed up the 857 codepage Ş with the 1254 encoding of Ş. Or it could be a transmission error.

Or my analysis could be completely wrong: I am not an encoding expert, and definitely not a Turkish-encoding expert. I just looked up the various encodings and compared to your listed hexdump.
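For anyone who wants to repeat this comparison outside Notepad++, here is a minimal Python sketch that decodes a single byte under each of the four encodings. The codec names (`iso8859_3`, `iso8859_9`, `cp857`, `cp1254`) are Python's standard-library names for those encodings; `CANDIDATES`, `decode_byte` and `describe` are just illustrative names, not anything Notepad++ uses. Python's ISO 8859 codecs map 0x80-0x9F to the C1 control characters, where 0x9E is "Privacy Message" (PM), which is presumably why the editor labels the byte "PM".

```python
import unicodedata

# Python codec names for the four Turkish-capable encodings discussed above.
CANDIDATES = ["iso8859_3", "iso8859_9", "cp857", "cp1254"]


def decode_byte(value, encoding):
    """Return the character for one byte value, or None if the codec has no mapping."""
    try:
        return bytes([value]).decode(encoding)
    except UnicodeDecodeError:
        return None


def describe(value):
    """Map each candidate encoding to a short description of what the byte means there."""
    results = {}
    for enc in CANDIDATES:
        ch = decode_byte(value, enc)
        if ch is None:
            results[enc] = "undefined"
        elif unicodedata.category(ch) == "Cc":
            # Control characters are invisible, so report the code point instead.
            results[enc] = "control U+%04X" % ord(ch)
        else:
            results[enc] = ch
    return results


if __name__ == "__main__":
    for enc, meaning in describe(0x9E).items():
        print(enc, meaning)
```

Calling `describe(0xDE)` instead shows where Ş sits in the Windows-1254 and ISO 8859-9 tables, which makes the suspected OEM-857 versus 1254 mix-up easy to see side by side.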
-
@PeterJones said in Invisible characters:
Why are you looking at a 7bit ASCII chart for a specific 8bit Turkish “ANSI” encoding?
I probably overlooked that by mistake.
@PeterJones said in Invisible characters:
- ISO 8859-3: this has a different byte for İ, so I’m assuming that’s not what you meant.
That should be it. You should read it as I, and most likely it is the lowercase ı (without the dot on top).
@PeterJones said in Invisible characters:
It may be that the program or person that generated that unknown character was using a “standard-encoding-plus-extra” to try to get more characters (sometimes “unused” slots are filled with custom characters in certain applications or derived standards);
And this is most likely correct, as I suspect the data was taken from an Oracle database of some kind and loaded into the FirebirdSQL database from which I am provided the CSV.
Thank you!
-
Oh, I also meant to say (but got distracted – ooh, squirrel!) that Edit > Character Panel will give the 256 character codes (0-255) for 8-bit encodings; if you change from one encoding to another, the Character Panel will correctly update to match. It lists both decimal and hexadecimal character codes, along with the character at that point. If you double-click on the character, it will insert that character in the active editor (be warned: if you double-click on the hex value, it will “helpfully” type the hex value for you in the editor).
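If you ever need the same kind of listing in a script, a rough equivalent of that table can be built in a few lines of Python. This is only a sketch of the idea (decimal code, hex code, character), not how Notepad++ produces its panel; `character_table` is a made-up helper name, and the codec names are Python's own.

```python
def character_table(encoding):
    """Return (decimal, hex, character) rows for all 256 byte values of an 8-bit encoding.

    The character is None where the codec defines no mapping for that byte.
    """
    rows = []
    for code in range(256):
        try:
            ch = bytes([code]).decode(encoding)
        except UnicodeDecodeError:
            ch = None
        rows.append((code, "%02X" % code, ch))
    return rows


if __name__ == "__main__":
    # Print only the upper half, where the Turkish encodings differ from ASCII.
    for dec, hexcode, ch in character_table("cp1254")[0x80:]:
        print(dec, hexcode, repr(ch))
```

Swapping `"cp1254"` for `"cp857"` or `"iso8859_9"` gives the table for another encoding, much like changing the encoding while the panel is open.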
-
@PeterJones I didn’t know about the Character Panel. Thanks for mentioning it, too.