Incorrect text encoding auto-detection (Windows-1251, Russian)

almirus256

Notepad++ v6.9.1
Build time : Mar 28 2016 - 19:48:40
Path : C:\Program Files (x86)\Notepad++\notepad++.exe
Admin mode : ON
Local Conf mode : OFF
OS : Windows 7
Plugins : mimeTools.dll NppConverter.dll NppExport.dll NppFTP.dll PluginManager.dll

Screens
https://goo.gl/photos/j1k2P4t8CZshFskE7

Claudia Frank

Hello @Almir-Abrarov,

making encoding detection 100% reliable is impossible, but you can help npp get a better hit rate.
What about putting some characters that are unique to your encoding into a comment on the first line of your code?
It needs to be at the beginning, as npp doesn’t scan the whole file.

Cheers
Claudia

almirus256

I tried adding this as the first line (it is a Perl file):
#а
(Russian “а”, byte 0xE0 in Windows-1251)
The encoding is still detected as Macintosh.

Claudia Frank

One letter isn’t sufficient, as its hex value is used in other codecs too.
Maybe I wasn’t that clear: what I meant was that you need to provide a
comment which is unique to your language.
Something which is known to be an identifier for it,
such as a combination of letters or words. (I don’t know how to say this in other words.)
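
To illustrate why a single byte says so little, here is a minimal Python sketch that decodes the byte 0xE0 with several legacy single-byte codecs. The codec list and the helper name `decode_everywhere` are my own picks for illustration; this is not the set of encodings npp actually probes.

```python
# Decode the same bytes with several legacy single-byte codecs.
# The codec list is an illustrative assumption, not what Notepad++ checks.
CANDIDATES = ["cp1251", "cp1252", "latin_1", "mac_roman", "mac_cyrillic", "koi8_r"]


def decode_everywhere(raw: bytes, codecs=CANDIDATES) -> dict:
    """Return {codec name: decoded text} for every codec that accepts the bytes."""
    results = {}
    for name in codecs:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # Some codecs leave certain byte values undefined; skip those.
            continue
    return results


if __name__ == "__main__":
    # 0xE0 is the Windows-1251 byte for the Russian letter "а".
    for name, text in decode_everywhere(b"\xe0").items():
        print(f"{name:12} -> {text!r}")
```

Every codec that returns a character here treats the byte as perfectly valid, so a detector has nothing to choose between them until it sees longer byte sequences.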

Npp uses Mozilla’s chardet library, which tries to work out
whether a given combination of hex values is more likely
to be found in codec A or codec B, and then reports its guess back.

Maybe another approach might work for you as well:
stop using country-specific codecs and start using UTF-8.
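
A minimal sketch of such a conversion in Python, assuming the existing files really are Windows-1251 (`cp1251` in Python’s codec names). The helper names `recode` and `convert_file` are hypothetical, and the file is written to a separate destination so the original stays untouched.

```python
from pathlib import Path


def recode(data: bytes, source: str = "cp1251") -> bytes:
    """Decode bytes from a legacy codec and re-encode them as UTF-8."""
    # Strict decoding raises UnicodeDecodeError on bytes the codec does not define.
    return data.decode(source).encode("utf-8")


def convert_file(src: Path, dst: Path, source: str = "cp1251") -> None:
    """Write a UTF-8 copy of src to dst; src itself is not modified."""
    dst.write_bytes(recode(src.read_bytes(), source))


if __name__ == "__main__":
    sample = "Неверный формат".encode("cp1251")
    print(recode(sample).decode("utf-8"))
```

Check the converted copy in the editor before replacing the original, since a wrong guess about the source encoding produces readable-looking but wrong text rather than an error.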

Cheers
Claudia

almirus256

I disagree.
For example https://gist.github.com/almirus/72f5b27a229a31da293eb427f3be239a

This file is not big and contains Russian words:
line #94
вот так без сокращений (“like this, without abbreviations”)
line #150
Неверный формат (“Invalid format”)

Please lower the priority of the Macintosh code page, or explain how to remove it from the list.

PS https://www.google.ru/search?q=кодировка+macintosh&oq=codepage+macint&aqs=chrome.1.69i57j0l3.4862j0j7&sourceid=chrome&ie=UTF-8#newwindow=1&q=notepad%2B%2B+macintosh+кодировка

Claudia Frank

First let me clarify: I’m an npp user like you, so I’m not in a position to change the code as I like,
nor would I be able to, as I’m a C++ newbie.

Mozilla’s universalchardet and npp are open source, so you can download them and modify them to your needs.
Btw, the universalchardet source is included in the npp source, so you only need to download npp.

Regarding the issue: if I put the following line as a comment on line 1 of your source, I do get windows-1251:

 #сокращений

If you need it, here is the link to the Mozilla source: http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/

Cheers
Claudia

dail

@Claudia-Frank is correct with the suggestion. If you check out the source here it only uses the first 1024 characters so it doesn’t have to read the entire file. This is done for efficiency reasons and because the majority of the time this is enough.
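
As a rough way to check whether a given file gives the detector anything to work with, the sketch below finds the first non-ASCII byte and compares its offset with that 1024 limit. It assumes the limit is counted in bytes, which I have not verified against the source, and the helper names are made up for this example.

```python
SAMPLE_SIZE = 1024  # figure quoted above; treating it as bytes is an assumption


def first_non_ascii_offset(data: bytes):
    """Return the offset of the first byte >= 0x80, or None if the data is pure ASCII."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            return offset
    return None


def sample_has_evidence(data: bytes, sample_size: int = SAMPLE_SIZE) -> bool:
    """True if at least one non-ASCII byte falls inside the sampled prefix."""
    offset = first_non_ascii_offset(data)
    return offset is not None and offset < sample_size


if __name__ == "__main__":
    late = b"#" * 2000 + "формат".encode("cp1251")
    early = "#сокращений\n".encode("cp1251") + late
    print(sample_has_evidence(late), sample_has_evidence(early))
```

If the first non-ASCII byte sits beyond the sampled prefix, a comment near the top of the file is the only place the detector could see language-specific bytes at all.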

almirus256

Thanks for the reply.
Here is the same file in Notepad++ v6.5:
https://goo.gl/photos/U59tZbNfFvZpaArM8

Claudia Frank

Sorry, I don’t get the point.
Didn’t the hack with the comment work for you?
So you downgraded to npp 6.5 to see that it reports ANSI?
It might be that the code has changed between versions 6.5 and 6.9 (I didn’t check the repository for changes).

Cheers
Claudia

almirus256

@Claudia-Frank the last changes at your link were made in 2012:
http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/

Claudia Frank

@Almir-Abrarov, I meant changes which were (or were not) made in npp between versions 6.5 and 6.9.
As I’m still confused about the current discussion, may I ask you to confirm the following assumption?

If you add the comment to the Perl source file:

in npp 6.9 -> OK, detected as windows-1251
in npp 6.5 -> NOT OK, detected as ANSI

Is this what we are talking about?

Cheers
Claudia