    SCI_GETCODEPAGE is NOT always either 0 or 65001

    Notepad++ & Plugin Development
    • Coises

      It has been stated that SCI_GETCODEPAGE will always return either 0 (CP_ACP, for ANSI) or 65001 (CP_UTF8, for Unicode). I have repeated that statement myself. It is not true.

      While investigating an error in my code (one that has not yet been triggered in practice, but lies in wait for someone to apply it to a very large file containing non-ASCII characters), I re-examined some assumptions and found that SCI_GETCODEPAGE isn’t quite as simple as I (and apparently others) thought.

      As far as I can determine:

      • When the file encoding as shown near the bottom right of the status bar is ANSI and the system default code page is a single-byte character set, SCI_GETCODEPAGE will return 0 (CP_ACP). The actual encoding within Scintilla will be the system default code page.

      • When the file encoding is any variant of Unicode, SCI_GETCODEPAGE will return 65001 (CP_UTF8) and the actual encoding within Scintilla will be UTF-8.

      • When the file encoding is ANSI and the system default code page is one of the supported CJK encodings — 932 (Japanese, Shift-JIS), 936 (Chinese Simplified, GB2312), 949 (Korean, Windows-949 / Unified Hangul Code) or 950 (Chinese Traditional, Big5) — SCI_GETCODEPAGE will return the numeric identifier of the system default code page and the actual encoding within Scintilla will be the system default code page.

      • When the file encoding is anything other than ANSI or some variant of Unicode, SCI_GETCODEPAGE will return 65001 (CP_UTF8) and the actual encoding within Scintilla will be UTF-8.¹

      I do not know what happens if the system default code page is a multibyte encoding other than the four explicitly supported ones. The documentation for SCI_SETCODEPAGE says that Scintilla also supports code page 1361 (Korean Johab). It appears that Notepad++ does not support this encoding as an ANSI encoding… but I could be missing something. EUC-KR is listed on the Character sets menu, but that is a different multibyte encoding (51949) which is apparently not supported by Scintilla.

      When I did a test by changing my system default character set to Japanese, I started a new file, set it to ANSI, and pasted in some Japanese text. SCI_GETCODEPAGE was 932. I saved it that way. When I opened it again, the encoding was set to Shift-JIS — not ANSI — and SCI_GETCODEPAGE was 65001. (Note that the file was saved as Shift-JIS; it’s the internal encoding for editing that changed. Checking the position counts moving from one character to the next also verified that before I saved, Scintilla was using Shift-JIS, but when I opened it again, Scintilla was using UTF-8. Saving again still kept the file as Shift-JIS, as expected.)
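
      The position-count check described above can be illustrated with a small sketch. The helper name charStepBytes is hypothetical and not part of Scintilla or Notepad++, and the lead-byte ranges are assumptions based on the ranges Windows reports for these code pages. In a real plugin it is probably safer to ask Scintilla itself (for example with SCI_POSITIONAFTER) than to decode bytes by hand.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns how many bytes the character starting at
// text[pos] occupies, given the value returned by SCI_GETCODEPAGE.
inline std::size_t charStepBytes(const unsigned char* text, std::size_t length,
                                 std::size_t pos, int codePage) {
    assert(text != nullptr && pos < length);
    const unsigned char b = text[pos];
    std::size_t step = 1;  // code page 0 (single-byte ANSI) and ASCII bytes
    if (codePage == 65001) {
        // UTF-8: the lead byte encodes the sequence length.
        if (b >= 0xF0)      step = 4;
        else if (b >= 0xE0) step = 3;
        else if (b >= 0xC0) step = 2;
    } else if (codePage == 932) {
        // Shift-JIS lead bytes (assumed ranges).
        if ((b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC)) step = 2;
    } else if (codePage == 936 || codePage == 949 || codePage == 950) {
        // GBK, Windows-949 and Big5 lead bytes (assumed range).
        if (b >= 0x81 && b <= 0xFE) step = 2;
    }
    // Never step past the end of a truncated buffer.
    return (pos + step <= length) ? step : length - pos;
}
```

      With the two characters 日本, the Shift-JIS bytes 93 FA 96 7B advance two bytes per character under code page 932, while the UTF-8 bytes E6 97 A5 E6 9C AC advance three bytes per character under 65001. That is the kind of difference the before-save and after-reopen position counts reveal.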


      Bottom line:

      SCI_GETCODEPAGE can return 0, 932, 936, 949, 950 or 65001.

      • When it returns 0, character strings to and from Scintilla are in a single-byte encoding which is also the system default code page.

      • When it returns 65001, character strings to and from Scintilla are in UTF-8.

      • When it returns 932, 936, 949 or 950, character strings to and from Scintilla use the indicated multi-byte encoding.

      • Since CP_ACP, which is 0, represents the system default code page, and CP_UTF8, which is 65001, represents UTF-8, you can safely use the value returned by SCI_GETCODEPAGE in Windows API calls that take a code page identifier. However, you cannot safely assume that non-zero means UTF-8; nor can you assume that not UTF-8 means one byte = one character.
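
      The cases in this list could be captured in plugin code roughly as follows. This is only a sketch: SciTextKind, classifySciCodePage and oneByteIsOneCharacter are hypothetical names, and the codePage argument is assumed to be whatever SCI_GETCODEPAGE returned for the current document.

```cpp
#include <cassert>

// What a plugin can assume about text exchanged with Scintilla.
enum class SciTextKind {
    SingleByteAnsi,  // 0: system default code page, one byte per character
    Utf8,            // 65001
    DoubleByteCjk,   // 932, 936, 949 or 950
    Unknown          // any other value; not observed in the tests described here
};

// Hypothetical helper: classify the value returned by SCI_GETCODEPAGE.
inline SciTextKind classifySciCodePage(int codePage) {
    switch (codePage) {
        case 0:     return SciTextKind::SingleByteAnsi;
        case 65001: return SciTextKind::Utf8;
        case 932:
        case 936:
        case 949:
        case 950:   return SciTextKind::DoubleByteCjk;
        default:    return SciTextKind::Unknown;
    }
}

// True only when a byte offset can also be treated as a character index.
inline bool oneByteIsOneCharacter(int codePage) {
    return classifySciCodePage(codePage) == SciTextKind::SingleByteAnsi;
}

// A few checks documenting the two assumptions that do not hold.
inline void checkClassification() {
    assert(classifySciCodePage(932) != SciTextKind::Utf8);  // non-zero is not always UTF-8
    assert(!oneByteIsOneCharacter(950));                    // not UTF-8 is not always single-byte
}
```

      The returned value itself can still be passed unchanged as the code page argument of Windows API functions such as MultiByteToWideChar; the classification is only needed where the code reasons about bytes versus characters.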


      ¹ This is true even when the encoding is the same as the system default code page. For example, on a typical American or Western European system, the system default code page is Windows-1252. If you open a new file and (if necessary) set the encoding to ANSI, the status bar will show ANSI, the encoding within Scintilla will be Windows-1252, and SCI_GETCODEPAGE will return 0. If you select Encoding | Character sets | Western European | Windows-1252, the status bar will show Windows-1252, the encoding within Scintilla will be UTF-8, and SCI_GETCODEPAGE will return 65001.

      • Vitalii Dovgan @Coises

        @Coises
        Yes.
        This is why CNppExec::convertSciText uses Scintilla’s actual encoding, nSciCodePage, to convert Scintilla’s text to the desired encoding:
        https://github.com/d0vgan/nppexec/blob/develop/NppExec/src/NppExec.cpp#L2516

        • Coises @Coises

          I wrote in SCI_GETCODEPAGE is NOT always either 0 or 65001:

          When I did a test by changing my system default character set to Japanese, I started a new file, set it to ANSI, and pasted in some Japanese text. SCI_GETCODEPAGE was 932. I saved it that way. When I opened it again, the encoding was set to Shift-JIS — not ANSI — and SCI_GETCODEPAGE was 65001.

          For future reference:

          This only happens if Settings | Preferences | MISC | Autodetect character encoding is checked. When it is not checked, the file opens, as expected, as ANSI (SCI_GETCODEPAGE returns 932).

          The Community of users of the Notepad++ text editor.