
incorrect text encoding auto-detection (windows 1251 russian)

General Discussion
textrussianencoding
11 Posts 3 Posters 13.3k Views
  • A
    almirus256
    last edited by Mar 29, 2016, 8:14 AM

    Notepad++ v6.9.1
    Build time : Mar 28 2016 - 19:48:40
    Path : C:\Program Files (x86)\Notepad++\notepad++.exe
    Admin mode : ON
    Local Conf mode : OFF
    OS : Windows 7
    Plugins : mimeTools.dll NppConverter.dll NppExport.dll NppFTP.dll PluginManager.dll

    Screens
    https://goo.gl/photos/j1k2P4t8CZshFskE7

    • C
      Claudia Frank @almirus256
      last edited by Mar 29, 2016, 10:35 PM

      Hello @Almir-Abrarov,

      Making encoding detection 100% reliable is impossible, but you can help npp get a better hit rate.
      What about putting some characters, ideally ones unique to your encoding, in a comment on the first line of your code?
      It needs to be at the beginning, as npp doesn’t scan the whole file.

      Cheers
      Claudia

      • A
        almirus256
        last edited by Mar 30, 2016, 12:27 PM

        I tried adding this as the first line (Perl file):
        #а
        (Russian “а”, byte 0xE0)
        The encoding is still detected as Macintosh.

        • C
          Claudia Frank
          last edited by Mar 31, 2016, 12:13 AM

          One letter isn’t sufficient, as its byte value is used in other codecs too.
          Maybe I wasn’t clear: what I meant was that you need to provide a
          comment which is unique to your language,
          something which is known to be an identifier:
          a combination of letters or words. (I don’t know how else to say this.)

          Npp uses Mozilla’s chardet library, and this library tries
          to find out whether the combination of byte values is more
          likely to be found in codec A or B and then reports its guess back.
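To make the point about a single letter concrete, here is a small sketch that decodes the same bytes with several single-byte codecs. The codec list is an assumption picked for illustration; it is not the list of candidates that npp or universalchardet actually considers, and the sketch does no detection at all.

```python
# Illustration only: the same bytes are valid text in several single-byte codecs.
# CANDIDATES is a hypothetical list for this example, not npp's candidate list.
CANDIDATES = ["cp1251", "mac_roman", "cp1252"]


def decode_all(raw: bytes) -> dict:
    """Decode the same bytes with every candidate codec."""
    return {name: raw.decode(name) for name in CANDIDATES}


# Cyrillic "а" is the single byte 0xE0 in windows-1251.
single = "а".encode("cp1251")
# A whole word gives a detector a sequence of bytes to judge, not just one value.
word = "сокращений".encode("cp1251")

if __name__ == "__main__":
    for label, raw in (("single letter", single), ("whole word", word)):
        print(label, raw.hex())
        for name, text in decode_all(raw).items():
            print("  ", name, repr(text))
```

Every codec in the list decodes both inputs without error, so validity alone cannot settle the question. A detector has to judge which result looks most like real text, and a full word gives it far more to go on than one byte.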

          Maybe another approach might work for you as well:
          stop using country-specific codecs and start using UTF-8.
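A minimal sketch of such a conversion, assuming the file really is windows-1251 throughout. The function name `convert_to_utf8` and the file names are made up for this example.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, source_encoding: str = "cp1251") -> None:
    """Re-encode a text file from a legacy code page to UTF-8."""
    raw = src.read_bytes()
    # Raises UnicodeDecodeError if the bytes are not valid in source_encoding.
    text = raw.decode(source_encoding)
    # Work on bytes so that line endings are written back unchanged.
    dst.write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "script_cp1251.pl"  # hypothetical file names
        dst = Path(tmp) / "script_utf8.pl"
        src.write_bytes("# Неверный формат\n".encode("cp1251"))
        convert_to_utf8(src, dst)
        print(dst.read_bytes())
```

Keep a copy of the original file until you have checked the result, because decoding with the wrong source encoding usually produces wrong characters rather than an error.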

          Cheers
          Claudia

          • A
            almirus256
            last edited by Mar 31, 2016, 7:42 AM

            I disagree.
            For example https://gist.github.com/almirus/72f5b27a229a31da293eb427f3be239a

            This file is not big and contains Russian words:
            line #94
            вот так без сокращений
            line #150
            Неверный формат

            Please lower the priority of the Macintosh code page, or explain how to remove it from the list.

            PS https://www.google.ru/search?q=кодировка+macintosh&oq=codepage+macint&aqs=chrome.1.69i57j0l3.4862j0j7&sourceid=chrome&ie=UTF-8#newwindow=1&q=notepad%2B%2B+macintosh+кодировка

            • C
              Claudia Frank
              last edited by Mar 31, 2016, 4:00 PM

              First let me clarify: I’m an npp user like you, so I’m neither in a position to change the code as I like,
              nor am I able to, as I’m a C++ newbie.

              Mozilla’s universalchardet and npp are open source, so you can download the source and modify it to your needs.
              By the way, the universalchardet source is included in the npp source, so you only need to download npp.

              Regarding the issue: if I put the following line as a comment on line 1 of your source, I do get windows-1251:

               #сокращений
              

              If you need it, here is the link to the Mozilla source: http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/

              Cheers
              Claudia

              • D
                dail
                last edited by Mar 31, 2016, 4:25 PM

                @Claudia-Frank’s suggestion is correct. If you check out the source, you will see that it only uses the first 1024 characters, so it doesn’t have to read the entire file. This is done for efficiency reasons and because the majority of the time this is enough.
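A quick way to check whether a given file gives the detector anything to work with is to look at where its first non-ASCII bytes appear. The sketch below is not npp code. It takes the 1024 figure from the post above and assumes it counts bytes, and both helper names are made up.

```python
def non_ascii_in_prefix(data: bytes, limit: int = 1024) -> int:
    """Count bytes >= 0x80 within the first `limit` bytes."""
    return sum(1 for b in data[:limit] if b >= 0x80)


def first_non_ascii_offset(data: bytes) -> int:
    """Return the offset of the first byte >= 0x80, or -1 if there is none."""
    for offset, b in enumerate(data):
        if b >= 0x80:
            return offset
    return -1


if __name__ == "__main__":
    # 1300 bytes of plain ASCII followed by Russian text in windows-1251.
    body = b"# ascii only\n" * 100 + "Неверный формат".encode("cp1251")
    # The same file with a hint comment on the first line.
    hinted = "#сокращений\n".encode("cp1251") + body

    print(non_ascii_in_prefix(body), first_non_ascii_offset(body))
    print(non_ascii_in_prefix(hinted), first_non_ascii_offset(hinted))
```

If the first function returns 0 for a file, the sampled part is pure ASCII and the Russian text further down never reaches the detector. That is why a comment on line 1 can change the result.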

                • A
                  almirus256
                  last edited by Mar 31, 2016, 6:12 PM

                  Thanks for the reply.
                  Here is the same file in Notepad++ v6.5:
                  https://goo.gl/photos/U59tZbNfFvZpaArM8

                  • C
                    Claudia Frank
                    last edited by Mar 31, 2016, 6:30 PM

                    Sorry, I don’t get the point.
                    Didn’t the hack with the comment work for you?
                    So you downgraded to npp 6.5 to see that it reports ANSI?
                    It might be that the code changed between versions 6.5 and 6.9 (I didn’t check the repository for changes).

                    Cheers
                    Claudia

                    • A
                      almirus256 @Claudia Frank
                      last edited by Apr 4, 2016, 10:42 AM

                      @Claudia-Frank the last changes at your link were made in 2012:
                      http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/

                      • C
                        Claudia Frank @almirus256
                        last edited by Apr 4, 2016, 3:06 PM

                        @Almir-Abrarov, I meant changes which were (or were not) made in npp between versions 6.5 and 6.9.
                        As I’m still confused about the current discussion, may I ask you to answer the following question/assumption?

                        If you add the comment to the Perl source file:

                        • in npp 6.9 -> OK, detected as windows-1251
                        • in npp 6.5 -> NOT OK, detected as ANSI

                        Is this what we are talking about?

                        Cheers
                        Claudia

                        The Community of users of the Notepad++ text editor.