    Help for an ANSI file

    Help wanted
    34 Posts · 7 Posters · 8.9k Views
    This topic has been deleted. Only users with topic management privileges can view it.
    • Ekopalypse

      @gstavi

      If Notepad++ does not autodetect then it must assume some default.

      I thought it would then be ANSI, which depends on what GetACP returns for the current setup.
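
      The GetACP dependence is easy to demonstrate outside Notepad++. A minimal sketch in Python (Python just stands in for the byte-level behavior here; Notepad++ itself is C++): the same file bytes read differently depending on which "ANSI" code page the system reports.

      ```python
      # Illustration only (not Notepad++ code): a file's bytes are fixed, but
      # the "ANSI" interpretation depends on the active code page, which is
      # what the Win32 GetACP() call reports.
      data = b"caf\xe9"  # 0xE9 is 'e with acute accent' in Windows-1252

      print(data.decode("cp1252"))  # on a Western-European setup: café
      print(data.decode("cp1251"))  # same bytes on a Cyrillic setup: cafй
      ```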

      • gstavi @Ekopalypse

        @Ekopalypse
        It is the first time I have ever heard of GetACP, and I wonder how a typical user is supposed to anticipate the behavior when autodetect is disabled.
        And it is obviously still broken, because a user should be allowed to instruct Notepad++ to assume a specific Unicode encoding rather than a codepage.

        • Alan Kilborn @gstavi

          @gstavi said in Help for an ANSI file:

          user should be allowed to instruct Notepad++ to assume some specific UNICODE encoding rather than codepage

          This might be relevant to that:

          HERE @PeterJones says:

          1. In the Settings > Preferences > New Document settings, if UTF-8 is chosen as your default encoding, you can also choose to always apply UTF-8 interpretation to files that Notepad++ opens and guesses are ANSI, not just to new files.

          It seems a bit strange, or downright bad, that this option is buried in with the “New Document” settings?

          • Ekopalypse

            @gstavi said in Help for an ANSI file:

            I am also not convinced that it works 100%, and I have tried to understand this part of the code, but I have to admit that it is quite confusing for me.

             I agree, it would be nice to have the possibility to force an encoding, but
             what I would really like is to force a lexer to a specific encoding,
             like batch to OEM 850 and Python to UTF-8 …

            • Alan Kilborn

              I did some more tangential playing around with this.

              I found that N++ will open a “7-bit ASCII” file (not sure how to really say that!) containing a NUL character as ANSI. All the other characters are your typical A-z0-9.
              But if the NUL is replaced with a SOH character, N++ opens it as UTF-8.
              I’m curious why it treats them differently.
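
              For what it’s worth, both of these test files are valid under either interpretation, so the bytes alone cannot explain the different choice. A quick Python check (illustrative only, this is not Npp’s actual detector):

              ```python
              # Both buffers decode cleanly as UTF-8 *and* as Windows-1252, so
              # N++'s different treatment of NUL vs. SOH must come from its own
              # heuristic, not from anything inherent in the encodings.
              with_nul = b"AB\x00CD"  # plain ASCII letters plus a NUL (0x00)
              with_soh = b"AB\x01CD"  # plain ASCII letters plus a SOH (0x01)

              for buf in (with_nul, with_soh):
                  buf.decode("utf-8")   # would raise UnicodeDecodeError if invalid
                  buf.decode("cp1252")
                  print(buf, "is valid UTF-8 and valid Windows-1252")
              ```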

              Of course, I’m mostly set up (I think) to have it work with UTF-8, but I’m less and less sure as the discussion goes on, what I should have selected in the Preferences to do this. :-)

              • Ekopalypse

                My understanding, with autodetection disabled, is the following:

                A Scintilla buffer is initialized with _codepage = ::GetACP().
                The entry point is

                Notepad_plus::doOpen(const generic_string& fileName, bool isRecursive, bool isReadOnly, int encoding, const TCHAR *backupFileName, FILETIME fileNameTimestamp)
                

                The following steps are performed:

                1. Npp checks whether the file is an HTML or XML file and whether the encoding can be read from the prolog.
                2. When it is loaded from a session, it gets the encoding that was used before;
                  else
                3. Npp tries to find out whether it is Unicode or ANSI (I don’t understand this part of the code):
                  if it is Unicode, the encoding is set accordingly;
                  otherwise Npp checks whether “open ANSI as utf8” is configured and sets either ANSI or UTF-8.
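
                The steps above can be sketched as one simplified decision function. This is a rough model of the described flow, not the actual Notepad++ code; the session lookup and prolog check are omitted, and the flag name `ansi_as_utf8` is invented for illustration:

                ```python
                import codecs

                def guess_encoding(raw: bytes, ansi_as_utf8: bool) -> str:
                    """Simplified model of the doOpen() flow sketched above:
                    BOM check, then a Unicode-vs-ANSI guess, then the
                    'open ANSI as utf8' preference for the ambiguous case."""
                    for bom, name in ((codecs.BOM_UTF8, "utf-8-bom"),
                                      (codecs.BOM_UTF16_LE, "utf-16-le"),
                                      (codecs.BOM_UTF16_BE, "utf-16-be")):
                        if raw.startswith(bom):
                            return name
                    try:
                        raw.decode("utf-8")  # stand-in for Npp's Unicode detection
                    except UnicodeDecodeError:
                        return "ansi"        # not valid UTF-8: fall back to the GetACP() page
                    if any(b > 0x7F for b in raw):
                        return "utf-8"       # valid multi-byte UTF-8 sequences present
                    # Pure ASCII is ambiguous between ANSI and UTF-8; the preference decides.
                    return "utf-8" if ansi_as_utf8 else "ansi"

                print(guess_encoding(b"ABCD", ansi_as_utf8=True))      # utf-8
                print(guess_encoding(b"AB\xe9CD", ansi_as_utf8=True))  # ansi
                ```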
                • guy038

                  Hello, @alan-kilborn and All,

                  Well, Alan, I think I guessed the problem, and there is a real bug !


                  First, I suppose that, in your Settings > Preferences... > New Document > Encoding :

                  • The UTF-8 encoding ( Not the UTF-8 with BOM one ) is selected

                  • The Apply to opened ANSI files option is selected

                  And in Settings > Preferences... > New Document > MISC. :

                  • The Autodetect character encoding option is UNCHECKED

                  Note, Alan, that this is my own configuration, too !


                  Now, let’s suppose that you open an N++ new file => So, in the status bar, the UTF-8 encoding is displayed : logical !

                  Now just write the string ABCD, save this new file as Test.txt and close Notepad++

                  When opening this file, any editor, without any other indication, cannot tell what its right encoding is :

                  • It could be the four bytes 41424344 in an ANSI file ( so any Windows encoding, such as Win-1252, Win-1251, …, because the ASCII part, from 00 to 7F, is identical )

                  • It could also be the four bytes 41424344 in an N++ UTF-8 file ( so without a BOM ). Indeed, with the UTF-8 encoding, any character with a code-point under \x{0080} is coded in 1 byte only, from 00 to 7F

                  But, as we have the Apply to opened ANSI files setting set, when you re-open the Test.txt file, you should again see the UTF-8 indication in the status bar

                  And adding the SOH character ( \x{01} ), or any character up to \x{1F} ( I verified ), between AB and CD does not change anything. The encoding will remain UTF-8 !

                  But adding the NUL character does change the encoding to ANSI, which is in contradiction with our user settings ! However, this particular case ( NUL char + pure ASCII chars only ) does not really matter, as the file contents do not change when switching from ANSI to UTF-8 and vice-versa, anyway !


                  Now, what’s more annoying is that the presence of the NUL character still forces the ANSI encoding, even if a character with a code over \x{007F} is added to the file :-(( For instance, if you add the very common French char é, to get the string ABNULCDé, and save this file with a UTF-8 encoding, when you re-open this file the encoding is wrongly changed to ANSI. So, the wrong string ABNULCDÃ© is displayed !

                  Remember that the contents of the Test.txt file, the string ABNULCDé, after saving, are 4142004344C3A9 with the UTF-8 encoding ( this same string would be coded 4142004344E9 in an ANSI file )
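
                  Those byte sequences check out; a quick Python verification of the hex values ( Win-1252 standing in for “ANSI” here ):

                  ```python
                  text = "AB\x00CD\u00e9"  # the ABNULCDé string from the example

                  utf8_bytes = text.encode("utf-8")
                  ansi_bytes = text.encode("cp1252")

                  print(utf8_bytes.hex().upper())  # 4142004344C3A9
                  print(ansi_bytes.hex().upper())  # 4142004344E9
                  # Misreading the UTF-8 bytes as ANSI produces the mojibake described above:
                  print(repr(utf8_bytes.decode("cp1252")))  # 'AB\x00CDÃ©'
                  ```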

                  So, although files with NUL characters are not common in classical text files, I suppose that this bug needs an issue to be created. What is your feeling about it ?

                  Best Regards,

                  guy038

                  • Alan Kilborn @guy038

                    @guy038 said in Help for an ANSI file:

                    First, I suppose that, in your Settings > Preferences… > New Document > Encoding

                    Right on the settings assumptions, except that for me the Autodetect character encoding option is CHECKED.

                    So, although files with NUL characters are not common in classical text files, I suppose that this bug needs an issue to be created. What is your feeling about it ?

                    Well, I was just sort of experimenting around. NUL characters are not something I typically use. Although I do have the feeling that if Scintilla allows them in the buffer (and clearly it does because I can see a black-boxed “NUL”), then Notepad++ itself should try and “do the right thing” (whatever that is) about them.

                    • Alan Kilborn

                      But…
                      It does seem like I, as a user, should be able to tell the software: "If a file can’t officially be identified via a BOM, then open it as ‘xxxxxxx’ " (UTF-8 for me! but YMMV).

                      • andrecool-68 @Ekopalypse

                        @Ekopalypse An example of an error:

                        oem-866.png

                        • Ekopalypse @andrecool-68

                          @andrecool-68
                          Npp has no chance to find out what encoding it is; neither does AkelPad.

                          What AkelPad does is to save the selected encoding in HKEY_CURRENT_USER\Software\Akelsoft\AkelPad\Recent.

                          If you open enough other documents (more than 10) and you have not changed the default setting, you will see that AkelPad opens your batch file with ANSI encoding as well.

                          • andrecool-68

                            @Ekopalypse
                            In AkelPad you can reopen a document 1000 times and re-save it 1000 times without breaking the file content, and Notepad++ cannot boast of that.
                            I mean working with encodings ))

                            • Alan Kilborn @andrecool-68

                              @andrecool-68

                              I think you missed the point of @Ekopalypse.

                              I saw somewhere, though I can’t find it now, someone wanting N++ to remember the caret position in a file that had previously been open (at some point) but is currently closed.

                              This current encoding discussion seems a candidate for the same treatment.
                              Meaning: create/maintain some sort of database for this info (encoding, caret position), and then when a file is opened, check whether it was previously encountered; if so, restore the last-known encoding selection and caret position.
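
                              Such a database could be as small as a JSON map from absolute path to last-known state. A hypothetical sketch (the file name file_state.json and the helper names are invented for illustration):

                              ```python
                              # Hypothetical per-file "session database": remember the encoding and
                              # caret position per absolute path, as suggested above.
                              import json
                              import os

                              DB_PATH = "file_state.json"

                              def load_db() -> dict:
                                  """Load the saved state, or start empty if none exists yet."""
                                  if os.path.exists(DB_PATH):
                                      with open(DB_PATH, encoding="utf-8") as f:
                                          return json.load(f)
                                  return {}

                              def remember(db: dict, path: str, encoding: str, caret: int) -> None:
                                  db[os.path.abspath(path)] = {"encoding": encoding, "caret": caret}

                              def recall(db: dict, path: str):
                                  return db.get(os.path.abspath(path))

                              db = load_db()
                              remember(db, "notes.txt", "utf-8", 120)
                              print(recall(db, "notes.txt"))  # {'encoding': 'utf-8', 'caret': 120}
                              ```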

                              • Ekopalypse @andrecool-68

                                @andrecool-68

                                But only as long as you have not opened more than the maximum number of files to be saved.

                                73679fc4-37eb-4f49-8a16-4a454c69650d-image.png

                                Create/maintain some sort of database for this info

                                But I see a disadvantage: maintenance. How do you keep this kind of db clean? I would say that during Npp’s runtime one can manage it, but
                                after a restart it will become a nightmare to keep clean.
                                How would one handle temporarily inaccessible, moved, or deleted files?
                                On the other hand, if you can configure an extension or lexer to always
                                open a file in a specific encoding, then I assume most of these issues are solved.
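
                                The per-extension alternative could be a small lookup consulted at open time. A hypothetical sketch (the table contents and helper name are invented for illustration; OEM 850 is Python’s cp850 codec):

                                ```python
                                import os

                                # Hypothetical per-extension encoding override, as suggested above:
                                # batch files as OEM 850, Python sources as UTF-8.
                                FORCED_ENCODINGS = {
                                    ".bat": "cp850",
                                    ".py": "utf-8",
                                }

                                def encoding_for(filename: str, default: str = "utf-8") -> str:
                                    """Return the forced encoding for a file's extension, or the default."""
                                    ext = os.path.splitext(filename.lower())[1]
                                    return FORCED_ENCODINGS.get(ext, default)

                                print(encoding_for("build.BAT"))  # cp850
                                print(encoding_for("notes.txt"))  # utf-8
                                ```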

                                The Community of users of the Notepad++ text editor.