    Incorrect text encoding auto-detection (Windows-1251, Russian)

    General Discussion
    text russian encoding
    • almirus256
      almirus256 last edited by

      Notepad++ v6.9.1
      Build time : Mar 28 2016 - 19:48:40
      Path : C:\Program Files (x86)\Notepad++\notepad++.exe
      Admin mode : ON
      Local Conf mode : OFF
      OS : Windows 7
      Plugins : mimeTools.dll NppConverter.dll NppExport.dll NppFTP.dll PluginManager.dll

      Screenshots:
      https://goo.gl/photos/j1k2P4t8CZshFskE7

      • Claudia Frank
        Claudia Frank @almirus256 last edited by

        Hello @Almir-Abrarov,

        Making encoding detection 100% reliable is impossible, but you can help npp get a better hit rate.
        What about putting some characters that are unique to your encoding in a comment on the first line of your code?
        It needs to be at the beginning because npp doesn’t scan the whole file.

        Cheers
        Claudia

        • almirus256
          almirus256 last edited by

          I tried adding this as the first line (Perl file):
          #а
          (Russian “а”, byte 0xE0)
          The encoding is still detected as Macintosh.

          • Claudia Frank
            Claudia Frank last edited by

            One letter isn’t sufficient, as its hex value is used in other codecs too.
            Maybe I wasn’t that clear: what I meant was that you need to provide a
            comment which is unique to your language.
            Something which is known to be an identifier:
            a combination of letters or words. (I don’t know how to say this in other words.)
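
To illustrate why a single byte is ambiguous, here is a minimal Python sketch. It is not Notepad++ code, and the list of codecs is only an assumption chosen for illustration. It decodes the byte 0xE0 with several single-byte codecs from Python's standard library:

```python
# Decode the single byte 0xE0 with several single-byte codecs.
# The codec list is an illustrative assumption, not what Notepad++ checks.
CODECS = ("cp1251", "mac_cyrillic", "koi8_r", "latin_1")


def decode_everywhere(raw: bytes) -> dict:
    """Return {codec name: decoded text} for every codec in CODECS."""
    return {codec: raw.decode(codec) for codec in CODECS}


decoded = decode_everywhere(b"\xe0")
for codec, text in decoded.items():
    print(f"{codec:>12}: {text!r}")
```

Every codec in the list accepts the byte without raising an error, so one character on its own gives a detector very little to tell the codecs apart.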

            Npp uses Mozilla’s chardet library, which tries
            to work out whether the combination of different byte values is more
            likely to be found in codec A or codec B and then reports its guess.
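
The idea of "which codec makes these byte combinations more likely" can be sketched with a toy scorer. This is not the universalchardet algorithm; the bigram list, the candidate codecs, and the function names are all assumptions made up for the example:

```python
# Toy stand-in for statistical detection. NOT the universalchardet algorithm:
# the bigram list and the candidate codecs are illustrative assumptions.
COMMON_BIGRAMS = ("ст", "но", "то", "на", "ен", "ов", "ни", "ра", "во", "ко")
CANDIDATES = ("cp1251", "koi8_r", "cp866")


def score(data: bytes, codec: str) -> int:
    """Count how many frequent Russian bigrams appear when decoding with codec."""
    try:
        text = data.decode(codec)
    except UnicodeDecodeError:
        return -1  # bytes are not valid in this codec at all
    return sum(text.count(bigram) for bigram in COMMON_BIGRAMS)


def guess_codec(data: bytes) -> str:
    """Return the candidate with the highest score (first one wins on a tie)."""
    return max(CANDIDATES, key=lambda codec: score(data, codec))


sample = "вот так без сокращений".encode("cp1251")
print(guess_codec(sample))
```

With a single letter such as `b"\xe0"`, every candidate scores zero here and the pick is arbitrary, which is the toy version of why one character is not enough.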

            Maybe another approach might work for you as well.
            Stop using country-specific codecs and start using UTF-8.
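
A one-off conversion can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes you already know the file really is windows-1251 (`cp1251` in Python):

```python
def reencode_to_utf8(data: bytes, source_encoding: str = "cp1251") -> bytes:
    """Decode legacy bytes with a known source encoding and re-encode as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


legacy = "Неверный формат".encode("cp1251")  # stand-in for a legacy file's bytes
converted = reencode_to_utf8(legacy)
print(len(legacy), len(converted))
```

UTF-8 has a strict byte structure, so text in a legacy single-byte codec usually fails to validate as UTF-8, which tends to make a UTF-8 file much easier for a detector to recognise.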

            Cheers
            Claudia

            • almirus256
              almirus256 last edited by

              I disagree.
              For example https://gist.github.com/almirus/72f5b27a229a31da293eb427f3be239a

              This file is not big and contains Russian words:
              line #94
              вот так без сокращений
              line #150
              Неверный формат

              Please lower the priority of the Macintosh codepage, or explain how to remove it from the list.

              PS https://www.google.ru/search?q=кодировка+macintosh&oq=codepage+macint&aqs=chrome.1.69i57j0l3.4862j0j7&sourceid=chrome&ie=UTF-8#newwindow=1&q=notepad%2B%2B+macintosh+кодировка

              • Claudia Frank
                Claudia Frank last edited by

                First let me clarify: I’m an npp user like you, so I’m not in a position to change the code as I like,
                nor would I be able to, as I’m a C++ newbie.

                Mozilla’s universalchardet and npp are open source, so you can download them and modify them to your needs.
                Btw., the universalchardet source is included in the npp source, so you only need to download npp.

                Regarding the issue: if I put the following line as a comment on line 1 of your source, I do get windows-1251:

                 #сокращений

                If you need it, here is the link to the Mozilla source: http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/

                Cheers
                Claudia

                • dail
                  dail last edited by

                  @Claudia-Frank is correct with the suggestion. If you check out the source here, it only uses the first 1024 characters, so it doesn’t have to read the entire file. This is done for efficiency reasons and because the majority of the time this is enough.
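
Why the position of the hint comment matters can be shown with a small Python sketch. This is not the Notepad++ source; `SNIFF_LIMIT` and the helper name are assumptions that simply mirror the "first 1024 characters" mentioned above:

```python
SNIFF_LIMIT = 1024  # assumed to mirror the "first 1024 characters" mentioned above


def has_non_ascii_in_prefix(data: bytes, limit: int = SNIFF_LIMIT) -> bool:
    """Return True if any byte >= 0x80 appears within the first `limit` bytes."""
    return any(byte >= 0x80 for byte in data[:limit])


ascii_code = b"print 'hello';\n" * 100            # 1500 bytes of pure ASCII
russian_line = "# Неверный формат\n".encode("cp1251")

late = ascii_code + russian_line                   # Cyrillic only after the prefix
early = "#сокращений\n".encode("cp1251") + late   # hint comment on line 1

print(has_non_ascii_in_prefix(late))
print(has_non_ascii_in_prefix(early))
```

In the `late` case, a detector limited to the prefix sees nothing but ASCII and has no Cyrillic bytes to base a guess on; moving a Russian comment to line 1 puts them inside the window.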

                  • almirus256
                    almirus256 last edited by

                    Thanks for the reply.
                    Here is the same file in Notepad++ v6.5:
                    https://goo.gl/photos/U59tZbNfFvZpaArM8

                    • Claudia Frank
                      Claudia Frank last edited by

                      Sorry, I don’t get the point.
                      Didn’t the hack with the comment work for you?
                      So you downgraded to npp 6.5 and saw that it reports ANSI?
                      It might be that the code changed between versions 6.5 and 6.9 (I didn’t check the repository for changes).

                      Cheers
                      Claudia

                      • almirus256
                        almirus256 @Claudia Frank last edited by

                        @Claudia-Frank the last changes were made in 2012,
                        according to your link http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/

                        • Claudia Frank
                          Claudia Frank @almirus256 last edited by

                          @Almir-Abrarov, I meant changes which were (or were not) made in npp between versions 6.5 and 6.9.
                          As I’m still confused about the current discussion, may I ask you to confirm the following assumption?

                          If you add the comment to the Perl source file:

                          • in npp 6.9 -> OK, detected as windows-1251
                          • in npp 6.5 -> NOT OK, detected as ANSI

                          Is this what we are talking about?

                          Cheers
                          Claudia
