Community
    • Login

    Names of character sets

    Scheduled Pinned Locked Moved General Discussion
    character set
    22 Posts 4 Posters 2.9k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • Paul WormerP
      Paul Wormer
      last edited by

      Today I read an old text file of mine and stumbled over its character set. I then noticed that, at least IMHO, the names for character sets used by Npp are somewhat inconsistent and a little confusing.

      I believe that “ANSI” (the first entry under the encoding tab) is not well-defined.

      The Npp character panel says on top “ASCII codes”. As is well-known “ASCII” is a 7-bit code that on the IBM-PC was extended to an 8-bit code by adding codepoints 128 to 256. The IBM code differs from the Windows CP1252 character set. I believe - but I did not check all characters - that the Npp control panel contains CP1252 characters. This set differs in codepoints 128 - 159 from Unicode Latin-1. In Latin-1 the codepoints 128-159 represent (unprintable) control characters.

      It would probably not difficult to change names of headings in Npp, but I can I imagine that for historical/legacy reasons it is not something that the Npp community (and the main developer) has high on their priority list. Personally I would find it an improvement, though, if standard names were used.

      Alan KilbornA PeterJonesP 4 Replies Last reply Reply Quote 0
      • Alan KilbornA
        Alan Kilborn @Paul Wormer
        last edited by

        @Paul-Wormer said in Names of character sets:

        The Npp character panel says on top “ASCII codes"

        It does? Can you show a screenshot?

        Your discussion probably falls into the category of “legacy behavior”. N++ provides support for “character sets” because it always has, but modern development uses Unicode.

        And thus, how much effort does one put into changing legacy features, and even nomenclature? Probably not much. So I wouldn’t expect any changes, but if you want to put the effort in, I suppose it would be reasonable for you to supply the exact changes you’d like to see.

        You could prototype your ideas here, with the end goal of creating an official issue. Or you could just jump to the official issue creation.

        PeterJonesP 1 Reply Last reply Reply Quote 1
        • PeterJonesP
          PeterJones @Paul Wormer
          last edited by PeterJones

          @Paul-Wormer

          I believe that “ANSI” (the first entry under the encoding tab) is not well-defined.

          ANSI is the “catchall” for whatever codepage your computer defaults to. This has been the Microsoft nomenclature since the MS-DOS days, no matter how misleading to technical pedants, and Notepad++ is not going to violate those decades of history on the term.

          Notepad++ uses the correct names for codepages when you look into the “character set” section of the Encoding menu. So if you want to pick a specific codepage, use those.

          The Npp character panel says on top “ASCII codes”

          Again, history had many people calling all of the codepoints 128-255 as “extended ASCII” (no matter how wrongly), and Don once again chose to maintain usage of that nomenclature. He presumably made that decision back in the early 2000s when the early editions of Notepad++ were coming out, and more people in the early 2000s still referred to such things as “ASCII”. With that weight of nomenclature outside and inside the application, Don isn’t likely to change it.

          1 Reply Last reply Reply Quote 2
          • PeterJonesP
            PeterJones @Alan Kilborn
            last edited by

            @Alan-Kilborn said in Names of character sets:

            The Npp character panel says on top “ASCII codes"

            It does? Can you show a screenshot?

            Edit > Character Panel:
            8c4c2697-907c-4f40-a478-449d032a1225-image.png

            1 Reply Last reply Reply Quote 2
            • PeterJonesP
              PeterJones @Paul Wormer
              last edited by PeterJones

              @Paul-Wormer said in Names of character sets:

              I believe that “ANSI” (the first entry under the encoding tab) is not well-defined.

              Besides what I said earlier, it’s still the term Microsoft’s notepad.exe uses as well: notepad.exe > File > Save As shows:

              a489ef52-a91f-4eec-9278-dcdb9d3ea670-image.png

              So anyone coming from MS notepad.exe to Notepad++ will expect there to be an “ANSI” encoding option.

              Alan KilbornA 1 Reply Last reply Reply Quote 2
              • Alan KilbornA
                Alan Kilborn @PeterJones
                last edited by

                …show a screenshot

                Oops.

                I suppose this is one of those cases where, long ago, maybe perhaps when I still cared about these character sets (and thus the panel), I changed the name of my panel, thusly:

                6d8dd443-7596-4691-87f0-b0c5a005448d-image.png

                So I guess I agree that “ASCII Codes Insertion Panel” is a dumb title for it. :-)

                1 Reply Last reply Reply Quote 2
                • PeterJonesP
                  PeterJones @Paul Wormer
                  last edited by PeterJones

                  @Paul-Wormer said in Names of character sets:

                  The Npp character panel says on top “ASCII codes”. As is well-known “ASCII” is a 7-bit code that on the IBM-PC was extended to an 8-bit code by adding codepoints 128 to 256. The IBM code differs from the Windows CP1252 character set. I believe - but I did not check all characters - that the Npp control panel contains CP1252 characters. This set differs in codepoints 128 - 159 from Unicode Latin-1. In Latin-1 the codepoints 128-159 represent (unprintable) control characters.

                  … Further, this paragraph by you implies that you think the Edit > Character Panel always shows the same 256 characters. You would be wrong.

                  It shows the 256 characters for whatever the active codepage is (or the first 256 codepoints from Unicode for UTF-8 or UTF-16:

                  UTF-8:
                  94f7e453-3ecc-48ad-985a-1bd96ef018dc-image.png

                  ANSI (on my system, where chcp 1252 is the default):
                  def57629-f5e1-4a13-be4f-8a530c877f0c-image.png

                  Character Set > Western > OEM-US:
                  58c2b3c8-cc40-494f-8cc0-d04b05ec700d-image.png

                  Character Set > Thai > TIS-620 (picked at random):
                  179a59bb-0801-47ee-84f8-95de551d82d9-image.png

                  …

                  And to find out what your computer considers the “ANSI”, look at the ? > Debug Info:
                  f4dd6bdf-97e4-43cf-b7ef-63ca4e2fbe91-image.png
                  It lists the “Current ANSI codepage”.

                  Paul WormerP 1 Reply Last reply Reply Quote 3
                  • Paul WormerP
                    Paul Wormer @PeterJones
                    last edited by

                    @PeterJones
                    OK, so the debug page states “Current ANSI codepage : 1252”, confirming what I thought. However, I did not know that you could set an active code page. Apparently it is not set by the entry in the tab “Encoding” because I almost always have “UTF-8” there.

                    PeterJonesP Alan KilbornA 2 Replies Last reply Reply Quote 0
                    • PeterJonesP
                      PeterJones @Paul Wormer
                      last edited by

                      @Paul-Wormer said in Names of character sets:

                      However, I did not know that you could set an active code page

                      That’s a Windows setting. In a cmd.exe prompt, you can change the codepage using chcp – but that doesn’t affect the Notepad++ “Current ANSI codepage”, even if you run Notepad++ from that cmd.exe window.

                      Your system’s default codepage (the one that Notepad++ shows) is based on your language settings in your Windows OS.

                      Paul WormerP 1 Reply Last reply Reply Quote 3
                      • Alan KilbornA
                        Alan Kilborn @Paul Wormer
                        last edited by

                        @Paul-Wormer said in Names of character sets:

                        so the debug page states “Current ANSI codepage : 1252”

                        That tells you what you are looking at when you have this in your status bar:

                        2cf8c614-ddbf-46e4-a8db-1fec36cda367-image.png

                        And this if you drop the Encoding menu:

                        1195e9d4-d725-45ca-b9ff-67e6280cfbd8-image.png

                        What’s in the Debug Info shows you what Windows has for your system’s codepage; Notepad++ just follows it.

                        BTW, when your Debug Info codepage is “1252”, the following are equivalent to “ANSI” in the screenshots above:

                        5db969e6-263e-421e-b8db-9973c16ddc21-image.png

                        which you would obtain from choosing:

                        32687e4b-39a6-4ff1-88ef-da587d40c90d-image.png

                        Paul WormerP 1 Reply Last reply Reply Quote 3
                        • Paul WormerP
                          Paul Wormer @Alan Kilborn
                          last edited by

                          @Alan-Kilborn Thank you for pointing out this route to character sets. It is interesting to find the entry “OEM-US” under the heading “Western European”. This sets the clock back to 1776.

                          When I select the OEM-US entry, the Edit > Character Panel shows the characters of the original IBM-PC extended ASCII set (CP 437). IMHO, CP 437 (or OEM 437) would be more telling than OEM-US and also more consistent with OEM 850 (the European counterpart of CP 437).

                          Incidentally, Windows’ charmap refers to CP 437 as “DOS: United States” and to CP 850 as “DOS: Western Europe”. Wouldn’t it be a good idea if N++ followed the nomenclature of the Windows charmap?

                          Alan KilbornA 1 Reply Last reply Reply Quote 1
                          • Alan KilbornA
                            Alan Kilborn @Paul Wormer
                            last edited by Alan Kilborn

                            @Paul-Wormer said in Names of character sets:

                            CP 437 (or OEM 437) would be more telling than OEM-US

                            See HERE for some also-known-as’s.

                            It is interesting to find the entry “OEM-US” under the heading “Western European”

                            I found that strange as well; I have no explanation for why it was slotted under that.

                            1 Reply Last reply Reply Quote 0
                            • Paul WormerP
                              Paul Wormer @PeterJones
                              last edited by

                              @PeterJones As Alan Kilborn pointed out, one can change code pages within N++ (under Encoding). I played around a little with it and found that often Edit > Character Panel followed the setting of the code page, but not consistently. Is there a kind of systematics that I overlook?

                              Alan KilbornA 1 Reply Last reply Reply Quote 0
                              • Alan KilbornA
                                Alan Kilborn @Paul Wormer
                                last edited by

                                @Paul-Wormer said in Names of character sets:

                                found that often Edit > Character Panel followed the setting of the code page, but not consistently

                                Please give an example of the inconsistency.

                                Paul WormerP 1 Reply Last reply Reply Quote 0
                                • Paul WormerP
                                  Paul Wormer @Alan Kilborn
                                  last edited by Paul Wormer

                                  @Alan-Kilborn
                                  53a20dab-a28d-4ff6-b803-b9141c9d2f81-afbeelding.png

                                  11e8fb79-6f17-4d52-9a1a-393ac4b01ed3-afbeelding.png

                                  As you know 0x80 is a good character to concentrate on. In UTF8 it is a control character, in CP1252 it is € and in CP 437 it is Ç. I showed results of one edit session. First UTF8 that shows the character panel of CP1252 and then OEM-US that shows CP 437.

                                  Alan KilbornA 1 Reply Last reply Reply Quote 0
                                  • Alan KilbornA
                                    Alan Kilborn @Paul Wormer
                                    last edited by Alan Kilborn

                                    @Paul-Wormer

                                    Hmm, not exactly sure what you’re getting at, but if I use the Character Panel to insert the “0x80” character into a UTF-8 file (and I think the accurate way to say that is, insert a U+0080 character), I visually obtain the € character (as I would expect), and if I save the file and open it in a hex editor I see the 3-byte combination E2 82 AC (which I also expect). So I don’t see anything unexpected or inconsistent here, but maybe I’m just missing what you’re trying to say.

                                    Paul WormerP 1 Reply Last reply Reply Quote 0
                                    • Paul WormerP
                                      Paul Wormer @Alan Kilborn
                                      last edited by Paul Wormer

                                      @Alan-Kilborn I would like to see the character panel that agrees with my choice of character set. For instance, if I choose UTF8, I like to see a panel with control codes (no letters) between 0x80 and 0xa0. Or, as in Windows charmap, the control characters may be simply skipped in the panel. In other words, I would like to see more or less the same character panels as in charmap.

                                      Now it seems from the panel as if the € sign has the code 0x80 in UTF8. And, as you correctly point out, it has a 3-byte code, not a 1-byte code in UTF8. BTW, the official Unicode code point of € is 0x20AC, the 3-bytes: E282AC give its UTF8 coding.

                                      PeterJonesP 2 Replies Last reply Reply Quote 0
                                      • PeterJonesP
                                        PeterJones @Paul Wormer
                                        last edited by

                                        This post is deleted!
                                        1 Reply Last reply Reply Quote 0
                                        • PeterJonesP
                                          PeterJones @Paul Wormer
                                          last edited by PeterJones

                                          @Paul-Wormer,

                                          UTF-8 and UTF-16 seem to be “oddities”. The character sets all just show what’s at the 255 individual codepoints for that character set. But my guess with the UTF-# encodings, because “unicode” handling was an afterthought about halfway through the Notepad++ lifecycle, is that Don just left those there in the UTF-# character panels to show the character from the nth position in Win-1252, but map it to the correct codepoint in Unicode, with the correct bytes for the UTF-# encoding. This would have made it easier for people to transition from Win-1252 to Unicode/UTF-8, while still being able to find their “favorite” characters at the same location in the Notepad++ Character Panel.

                                          Personally, it doesn’t annoy me, because I found the character panel not overly helpful for the Unicode encodings. I have a Notepad++ Run-menu entry to bring up charmap.exe. If I were going to recommend changes to the Character Panel (other than fixing the compiled and english.xml and english_customizable.xml name to be just “Character Panel” that Alan uses) would be to allow it to have multiple “subpages” on the UTF-8/UTF-16 encodings, just like charmap.exe “group by unicode subpage” does, so that you can access more than 255 characters in the Character Panel

                                          Paul WormerP 1 Reply Last reply Reply Quote 1
                                          • Paul WormerP
                                            Paul Wormer @PeterJones
                                            last edited by

                                            @PeterJones said in Names of character sets:

                                            because “unicode” handling was an afterthought about halfway through the Notepad++ lifecycle

                                            Well, the transition from a 1-byte character set to a variable-byte character set seems to my layman’s eye more than an afterthought.

                                            I don’t know anything about coding of Windows apps, but couldn’t N++ simply interface with charmap.exe?

                                            PeterJonesP 1 Reply Last reply Reply Quote 0
                                            • First post
                                              Last post
                                            The Community of users of the Notepad++ text editor.
                                            Powered by NodeBB | Contributors