incorrect text encoding auto-detection (windows 1251 russian)
-
Hello @Almir-Abrarov,
Making encoding detection 100% reliable is impossible, but you can help npp get a better hit rate.
What about putting some characters that are unique to your encoding in a comment on the first line of your code?
It needs to be at the beginning, as npp doesn’t scan the whole file.
Cheers
Claudia
-
I tried adding this as the first line (Perl file):
#а
(Russian “а”, byte 0xE0)
The encoding is still detected as Macintosh.
-
One letter isn’t sufficient as its hex value is used in other codecs too.
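To illustrate the point, here is a minimal Python sketch (not anything npp itself runs) that decodes the single byte 0xE0 with several 8-bit codecs. The codec list and the helper name `decode_everywhere` are my own choices for the demo.

```python
# The single byte 0xE0 is a valid character in many 8-bit code pages,
# so one letter on its own cannot identify windows-1251.
RAW = bytes([0xE0])

# Hypothetical selection of codecs for the demo; any 8-bit codec would do.
CODECS = ["cp1251", "cp1252", "mac_roman", "latin_1", "koi8_r"]


def decode_everywhere(raw, codecs=CODECS):
    """Return {codec name: decoded text} for the same raw bytes."""
    return {name: raw.decode(name) for name in codecs}


if __name__ == "__main__":
    for name, text in decode_everywhere(RAW).items():
        # Every codec decodes the byte without error, each to a different glyph.
        print(f"{name:10s} {text!r}")
```

Each codec accepts the byte and simply maps it to a different character, which is why a detector needs a longer run of letters before it can prefer one code page over another.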
Maybe I wasn’t that clear; what I meant was that you need to provide a
comment which is unique to your language.
Something which is known to be an identifier.
A combination of letters or words. (I don’t know how to say this in other words.)
Npp uses Mozilla’s chardet library, and this library tries
to find out whether the combination of different hex values is more
likely to be found in codec A or B, and then reports its guess back.
Maybe another approach might work for you as well:
stop using country-specific codecs and start using UTF-8.
Cheers
Claudia
-
I disagree.
For example: https://gist.github.com/almirus/72f5b27a229a31da293eb427f3be239a
This file is not big and contains Russian words:
line #94
вот так без сокращений
line #150
Неверный формат
Please lower the priority of the Macintosh codepage, or explain how to remove it from the list.
-
First, let me clarify: I’m an npp user like you, so I’m neither in a position to change the code as I like,
nor am I able to do it, as I’m a C++ newbie.
Mozilla’s universalchardet and npp are open source, so you can download the code and modify it to your needs.
Btw, the universalchardet source is included in the npp source, so you only need to download npp.
In regards to the issue: if I put the following line as a comment on line 1 of your source, I do get windows-1251
#сокращений
If you need it, here is the link to the Mozilla source: http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/
Cheers
Claudia
-
@Claudia-Frank is correct with the suggestion. If you check out the source here it only uses the first 1024 characters so it doesn’t have to read the entire file. This is done for efficiency reasons and because the majority of the time this is enough.
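If you want to check whether your hint comment actually lands inside that window, a rough Python sketch could look like this. The 1024 figure is taken from this thread and treated here as bytes, which is an assumption on my part; the function names are made up for the example.

```python
# Assumption: the detector samples only the first 1024 bytes of the file.
# The figure comes from this thread, not from a check of the npp source.
SAMPLE_SIZE = 1024


def non_ascii_in_sample(data, sample_size=SAMPLE_SIZE):
    """Count bytes >= 0x80 inside the first sample_size bytes of data."""
    return sum(1 for b in data[:sample_size] if b >= 0x80)


def hint_is_visible(data, sample_size=SAMPLE_SIZE):
    """True if at least one non-ASCII byte falls inside the sampled window."""
    return non_ascii_in_sample(data, sample_size) > 0


if __name__ == "__main__":
    # A file whose only Cyrillic text sits after 2000 bytes of ASCII.
    late = b"#" * 2000 + "формат".encode("cp1251")
    # The same file with the hint comment added as line 1.
    hinted = "#сокращений\n".encode("cp1251") + late
    print(hint_is_visible(late), hint_is_visible(hinted))
```

If the count is zero, the sampled part of the file is pure ASCII and the detector has nothing to tell the code pages apart with, which would match the behaviour described above.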
-
Thanks for the reply.
The same file in Notepad++ v6.5:
https://goo.gl/photos/U59tZbNfFvZpaArM8
-
Sorry, I don’t get the point.
Didn’t the hack with the comment work for you?
So you downgraded to npp 6.5 to see that it reports ANSI?
It might be that the code changed between versions 6.5 and 6.9 (I didn’t check the repository for changes).
Cheers
Claudia
-
@Claudia-Frank, the last changes at your link were made in 2012:
http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/
-
@Almir-Abrarov, I meant changes which were (or were not) made in npp between versions 6.5 and 6.9.
As I’m still confused about the current discussion, may I ask you to answer the following question/assumption?
If you add the comment to the Perl source file:
- in npp 6.9 -> OK, detected as windows-1251
- in npp 6.5 -> NOT OK, detected as ANSI
Is this what we are talking about?
Cheers
Claudia