2 txt files are different in Notepad, but similar in Notepad++
-
Hi
I have 2 txt files that a process in our system builds.
They each consist of a list of directories. When I inspect them in Notepad, one file (A) shows each directory as a string with no spaces, while the other (B) shows a “space” after each character (which is how we want it).
However, when I open both files in Notepad++, neither of them shows the spaces.
Under Show Symbol, “Show All Characters” is turned on.
When I inspect both files in hex they are the same (well, at least the prefix, which is the same directory). Any ideas why Notepad++ is not showing the spaces?
-
Notice you put “space” in quotes. I am betting that is because it’s not really a space; it’s really the 0 byte of the two-byte UTF-16 LE encoding that Windows uses for Unicode files (for plain ASCII characters, such as those in a directory path, the second byte of each pair is 0). Please also note that your “oneline” file in Windows Notepad shows that it’s “UTF-16 LE” (lower-right corner). That file will have the 0 bytes between the characters in its raw form (you didn’t show the hex output; you really should have).
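To make the byte layout concrete, here is a minimal Python sketch. The path `C:\dir` is a made-up sample, not taken from the original files; it only illustrates what a hex view of such a file would be expected to show.

```python
# Hypothetical sample path; the real files' contents were not posted.
text = "C:\\dir"

# "utf-16-le" encodes each character as two bytes, low byte first,
# and does not prepend a byte order mark (BOM).
raw = text.encode("utf-16-le")

# Space-separated hex dump of the raw bytes (Python 3.8+).
hex_dump = raw.hex(" ")
print(hex_dump)  # 43 00 3a 00 5c 00 64 00 69 00 72 00
```

Every ASCII character is followed by a `00` byte, which is the “space” seen when the bytes are displayed one at a time.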
Notepad++ then properly sees that the files are encoded as UTF-16 LE, and interprets each two-byte sequence as a single character, because that’s what it is. It is doing the right thing by not showing the 0 bytes (which you called “spaces” but are actually NUL bytes when interpreted as single-byte characters). If you were to look at the lower-right of Notepad++, you would see that it shows UTF-16 LE BOM or similar, as it should.
The only bug in the above examples is that MS Notepad is interpreting “multiline_protected.txt” as UTF-8 instead of the correct UTF-16 LE, and wrongly showing you the 0 bytes as separate characters instead of as part of each character, as it does for “oneline_protected.txt”.
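The difference between the two interpretations can be reproduced in a few lines of Python. This is only a sketch with a made-up sample path; it says nothing about how either editor is implemented internally.

```python
# Hypothetical sample path, encoded as UTF-16 LE without a BOM.
raw = "C:\\dir".encode("utf-16-le")

# Correct interpretation: every two bytes form one character.
as_utf16 = raw.decode("utf-16-le")

# Wrong interpretation: NUL (0x00) is a valid UTF-8 character,
# so this does not fail; it just yields a NUL after every letter.
as_utf8 = raw.decode("utf-8")

print(repr(as_utf16))
print(repr(as_utf8))
```

The second result has a `\x00` after each visible character, which an editor may draw as a blank.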
–
edit: removed all the incorrect mentions of UCS2-LE (I misremembered which was which when first writing the post).
-
I think some studying and understanding of Unicode encoding concepts is in order before you go too much farther along in your current task.
-
@PeterJones thx for the detailed answer!
One thing still confuses me: why doesn’t MS Notepad interpret the “oneline” txt as UTF-8? Why does it see only this file as UTF-16, but not other files?
-
why doesn’t MS Notepad…
That would technically be a question for an MS forum. But I will give you my insight anyway.
why doesn’t MS Notepad interpret the “oneline” txt as UTF-8?
You’ve actually got the question backwards: the oneline text file is the one that MS Notepad properly interpreted as UTF-16, because that’s what it is. The multiline text file is the one that should also be read as UTF-16, but for some reason MS Notepad reads it as UTF-8 instead, so it shows all the null characters as blanks between characters.
Why does it see only this file as UTF-16, but not other files?
I am not an expert on Microsoft’s decision-making algorithm.
My guess is that the file MS Notepad reads as UTF-8 instead of UTF-16 contains at least one character that isn’t properly encoded as 2-byte UTF-16 (so maybe the file has an odd number of bytes, which is technically impossible in a UTF-16 encoded text file, or some garbage character(s) that aren’t recognized are in there).
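One way to test that guess is to check the raw bytes directly. The helper below, `check_utf16_le`, is a hypothetical name for a rough sketch; it is not Microsoft’s detection algorithm, just three simple checks that might explain why a detector would give up on UTF-16.

```python
import codecs


def check_utf16_le(data: bytes) -> dict:
    """Report simple properties of raw bytes that are supposed to be UTF-16 LE."""
    # A UTF-16 LE file normally starts with the BOM bytes FF FE.
    has_bom = data.startswith(codecs.BOM_UTF16_LE)
    # UTF-16 uses 2-byte code units, so a valid file has an even length.
    odd_length = len(data) % 2 == 1
    # A strict decode fails on truncated data or invalid surrogates.
    try:
        data.decode("utf-16-le")
        decodes_cleanly = True
    except UnicodeDecodeError:
        decodes_cleanly = False
    return {
        "has_bom": has_bom,
        "odd_length": odd_length,
        "decodes_cleanly": decodes_cleanly,
    }


# Made-up sample data: a well-formed file and one with a stray extra byte.
good = codecs.BOM_UTF16_LE + "C:\\dir".encode("utf-16-le")
bad = "C:\\dir".encode("utf-16-le") + b"\n"
print(check_utf16_le(good))
print(check_utf16_le(bad))
```

To try it on a real file, read the file in binary mode (`open(path, "rb").read()`) and pass the bytes in. A missing BOM, an odd length, or a failed decode would each be consistent with the guess above, though none of them proves what MS Notepad actually does.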
-
@PeterJones said in 2 txt files are different in Notepad, but similar in Notepad++:
(s) that aren’t recognized are in the
thanks a lot, dude. this really was helpful and insightful !