    incorrect text encoding auto-detection (windows 1251 russian)

    General Discussion
    Tags: text, russian, encoding
    11 Posts 3 Posters 13.2k Views
    • Claudia Frank @almirus256

      Hello @Almir-Abrarov,

      Making encoding detection 100% reliable is impossible, but you can help npp get a better hit rate.
      What about putting some characters that are unique to your encoding in a comment on the first line of your code?
      It needs to be at the beginning because npp doesn’t scan the whole file.

      Cheers
      Claudia

      • almirus256

        I tried adding this as the first line (Perl file):
        #а
        (Russian “а”, hex E0)
        The encoding is still detected as Macintosh.

        • Claudia Frank

          One letter isn’t sufficient, as its hex value is used in other codecs too.
          Maybe I wasn’t that clear: what I meant was that you need to provide a
          comment which is unique to your language,
          something which is known to be an identifier:
          a combination of letters or words. (I don’t know how to say this in other words.)
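
To illustrate why a single letter cannot settle the question, here is a small Python sketch. It is not taken from npp or chardet; the codec list and the name `decode_everywhere` are only illustrative choices. It decodes the byte E0 under a few single-byte codecs from Python's standard library, and each of them accepts it as a valid character. In Python's `mac_cyrillic` table the byte even maps to the same letter as in `cp1251`; whether that codec corresponds to what npp labels "Macintosh" is an assumption on my part.

```python
# Decode the single byte 0xE0 under several single-byte codecs.
# The codec list is an illustrative choice, not the set npp or chardet checks.
CANDIDATES = ["cp1251", "mac_cyrillic", "koi8_r", "latin_1"]


def decode_everywhere(raw: bytes) -> dict:
    """Return {codec: decoded text} for every candidate that accepts the bytes."""
    results = {}
    for codec in CANDIDATES:
        try:
            results[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            pass  # this codec has no character for one of the bytes
    return results


if __name__ == "__main__":
    for codec, text in decode_everywhere(b"\xe0").items():
        print(f"{codec:>12}: {text!r}")
```

Since every candidate produces a legitimate character, the byte alone carries no evidence for one codec over another.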

          Npp uses Mozilla’s chardet library, and this library tries
          to find out whether the combination of different hex values is more
          likely to be found in codec A or B and then reports its guess back.
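
The idea of "more likely in codec A or B" can be sketched with a deliberately crude toy in Python. This is not how universalchardet actually scores text; the heuristic (running text is mostly lowercase letters) and the names `lowercase_cyrillic_ratio` and `toy_guess` are my own assumptions, chosen only to show the principle of decoding the same bytes several ways and keeping the most plausible result.

```python
def lowercase_cyrillic_ratio(raw: bytes, codec: str) -> float:
    """Share of decoded characters that are lowercase Cyrillic letters (U+0430..U+044F)."""
    text = raw.decode(codec, errors="replace")
    if not text:
        return 0.0
    hits = sum(1 for ch in text if "\u0430" <= ch <= "\u044f")
    return hits / len(text)


def toy_guess(raw: bytes, codecs=("cp1251", "koi8_r")) -> str:
    """Pick the candidate codec whose decoding looks most like ordinary Russian text."""
    return max(codecs, key=lambda codec: lowercase_cyrillic_ratio(raw, codec))


if __name__ == "__main__":
    sample = "вот так без сокращений"
    print(toy_guess(sample.encode("cp1251")))
    print(toy_guess(sample.encode("koi8_r")))
```

Decoding with the wrong one of these two codecs turns lowercase letters into uppercase ones, so the toy score drops. A longer sample gives the scorer more evidence, which is the reason a whole word in the comment helps more than one letter.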

          Maybe another approach might work for you as well:
          stop using country-specific codecs and start using UTF-8.
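
If you go the UTF-8 route, a one-off conversion could look roughly like the Python sketch below. It assumes the file really is windows-1251 (`cp1251` in Python); the function name `convert_to_utf8` and the file names in the usage lines are made up for illustration.

```python
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, source_codec: str = "cp1251") -> None:
    """Re-encode a file from source_codec to UTF-8, leaving line endings untouched."""
    raw = src.read_bytes()
    text = raw.decode(source_codec)  # raises UnicodeDecodeError on undefined bytes
    dst.write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical file names; write to a new file and keep the original as a backup.
    convert_to_utf8(Path("script.pl"), Path("script.utf8.pl"))
```

Working on bytes rather than in text mode avoids any newline translation, so only the character encoding changes.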

          Cheers
          Claudia

          • almirus256

            I disagree.
            For example https://gist.github.com/almirus/72f5b27a229a31da293eb427f3be239a

            This file is not big and contains Russian words:
            line #94
            вот так без сокращений
            line #150
            Неверный формат

            Please lower the priority of the Macintosh code page, or explain how to remove it from the list.

            PS https://www.google.ru/search?q=кодировка+macintosh&oq=codepage+macint&aqs=chrome.1.69i57j0l3.4862j0j7&sourceid=chrome&ie=UTF-8#newwindow=1&q=notepad%2B%2B+macintosh+кодировка

            • Claudia Frank

              First, let me clarify: I’m an npp user like you, so I’m neither in a position to change the code as I like,
              nor am I able to do it, as I’m a C++ newbie.

              Mozilla’s universalchardet and npp are open source, so you can download them and modify them to your needs.
              By the way, the universalchardet source is included in the npp source, so you only need to download npp.

              Regarding the issue: if I put the following line as a comment in line 1 of your source, I do get windows-1251:

               #сокращений
              

              If you need it, here is the link to the Mozilla source: http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/

              Cheers
              Claudia

              • dail

                @Claudia-Frank is correct with the suggestion. If you check out the source here, you will see it only uses the first 1024 characters, so it doesn’t have to read the entire file. This is done for efficiency, and because most of the time this is enough.
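
A quick way to see what that limit means for a file like the one in the gist is to count the non-ASCII bytes inside the first 1024 bytes. This Python sketch only mimics the prefix idea and is not npp's code; the name `non_ascii_in_prefix` is hypothetical, and it treats "characters" as bytes, which holds for single-byte encodings such as windows-1251.

```python
PREFIX_LIMIT = 1024  # the figure quoted in the post above


def non_ascii_in_prefix(data: bytes, limit: int = PREFIX_LIMIT) -> int:
    """Count the bytes >= 0x80 that fall inside the first `limit` bytes."""
    return sum(1 for byte in data[:limit] if byte >= 0x80)


if __name__ == "__main__":
    filler = b"# plain ascii line\n" * 100           # well over 1024 bytes of ASCII
    tail = "Неверный формат".encode("cp1251")         # Russian text far down the file
    head = "#сокращений\n".encode("cp1251")           # marker comment on line 1

    print(non_ascii_in_prefix(filler + tail))         # Russian text is past the limit
    print(non_ascii_in_prefix(head + filler + tail))  # marker comment is inside it
```

When the count is zero, a detector that only looks at the prefix has nothing but ASCII to judge by, which is why the comment has to sit near the top of the file.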

                • almirus256

                  Thanks for the reply.
                  Here is the same file in Notepad++ v6.5:
                  https://goo.gl/photos/U59tZbNfFvZpaArM8

                  • Claudia Frank

                    Sorry, I don’t get the point.
                    Didn’t the hack with the comment work for you?
                    So you downgraded to npp 6.5 and saw that it reports ANSI.
                    It might be that the code changed between versions 6.5 and 6.9 (I didn’t check the repository for changes).

                    Cheers
                    Claudia

                    • almirus256 @Claudia Frank

                      @Claudia-Frank the last changes at your link were made in 2012:
                      http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/

                      • Claudia Frank @almirus256

                        @Almir-Abrarov, I meant changes which were (or were not) made in npp between versions 6.5 and 6.9.
                        As I’m still confused about the current discussion, may I ask you to confirm or correct the following assumption?

                        If you add the comment to the Perl source file:

                        • in npp 6.9 -> OK, detected as windows-1251
                        • in npp 6.5 -> NOT OK, detected as ANSI

                        Is this what we are talking about?

                        Cheers
                        Claudia

                        The Community of users of the Notepad++ text editor.