How to maintain special ASCII characters

3d1l

Hi,

I’m using Notepad++ to keep notes. I set the language to C because that lets me group text into sections that I can expand and collapse using { }. I’m using special ASCII codes for arrows, bullets and other symbols, like ALT 16 ►, ALT 17 ◄, ←. After saving the file and re-opening it, Notepad++ replaces the characters with DEL, DC1, ETB, etc. The only ones that it keeps are ALT 254 ■ and ALT 251 √. Is there a way to keep the special characters?

guy038

Hi, 3d1l,

I suppose that the Windows OEM code page of your system is OEM 437 ( Encoding > Character Sets > Western European > OEM US )

If you open the Character Panel ( Edit > Character Panel ), it is easy to verify, for instance, that symbols such as ► and ◄ do NOT exist in the OEM 437 encoding and are simply replaced by the C0 control characters DLE and DC1 ( of decimal values 16 and 17, i.e. Unicode U+0010 and U+0011 ) !
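As a rough cross-check outside Notepad++, here is a minimal Python sketch. It assumes that Python's `cp437` codec is a fair stand-in for the OEM 437 encoding discussed here; the helper name `can_encode` is made up for illustration.

```python
def can_encode(ch: str, encoding: str) -> bool:
    """Return True if ch has a byte representation in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Bytes 16 and 17 decode to the control characters DLE and DC1, not to arrows.
decoded = bytes([16, 17]).decode("cp437")
print([f"U+{ord(c):04X}" for c in decoded])

# The arrows have no cp437 byte, while the block and the square root do.
for ch in "\u25ba\u25c4\u25a0\u221a":  # the symbols typed with ALT 16, 17, 254, 251
    print(f"U+{ord(ch):04X}", can_encode(ch, "cp437"))
```

Under that assumption, this is consistent with the observation that only ALT 254 ■ and ALT 251 √ survived a save and reload.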

You should use a Unicode encoding ( Encode as UTF-8, Encode as UTF-8 with BOM, UCS-2 BE BOM or UCS-2 LE BOM ), which are the only encodings able to represent a huge number of characters/symbols, provided that your current N++ text font can display them !

Refer to the end of my post, on SourceForge.net, below :

https://sourceforge.net/p/notepad-plus/discussion/331753/thread/e5b72494/#b5c1

I would advise using the universal Unicode UTF-8 encoding, which can encode any character of any language in the world ! Of course, depending on your current font, the glyphs of some characters may not be available and are then replaced with a small white square or a question mark !

So, once your text, with its current encoding, and containing specific symbols, is written, just use the N++ option Encoding > Convert to UTF-8. Then, save your file, with this new encoding. After restarting N++, your file should display all your symbols, as expected :-) ( Note that I said Convert to UTF-8. Don’t use the option Encode in UTF-8 ! )
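The difference between the two menu options can be modelled in a few lines of Python. This is only a sketch of the idea, not Notepad++'s actual code: it assumes that Convert to UTF-8 transcodes the text, while Encode in UTF-8 keeps the bytes and merely reads them as UTF-8.

```python
# Bytes as they would sit on disk in an OEM 437 file: A4 20 A2.
original_bytes = "ñ ó".encode("cp437")

# Model of "Convert to UTF-8": decode with the real current encoding, re-encode.
converted = original_bytes.decode("cp437").encode("utf-8")

# Model of "Encode in UTF-8": same bytes, simply read as if they were UTF-8.
reinterpreted = original_bytes.decode("utf-8", errors="replace")

print(converted.hex(" "))                            # new bytes, text preserved
print([f"U+{ord(c):04X}" for c in reinterpreted])    # replacement characters
```

In this model the converted bytes still decode to the original text, whereas the reinterpreted text contains U+FFFD replacement characters where the accented letters used to be.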

BTW, the N++ default Courier New font is able to display the 31 symbols ( from ALT + 1 to ALT + 31 ), below :

☺ ☻ ♥ ♦ ♣ ♠ • ◘ ○ ◙ ♂ ♀ ♪ ♫ ☼ ► ◄ ↕ ‼ ¶ § ▬ ↨ ↑ ↓ → ← ∟ ↔ ▲ ▼
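For reference, the list above can be turned into a small lookup table. The Python sketch below simply numbers the symbols from 1 to 31 in the order given and checks that they survive a UTF-8 round trip; the name `alt_code` is made up for illustration.

```python
symbols = "☺ ☻ ♥ ♦ ♣ ♠ • ◘ ○ ◙ ♂ ♀ ♪ ♫ ☼ ► ◄ ↕ ‼ ¶ § ▬ ↨ ↑ ↓ → ← ∟ ↔ ▲ ▼".split()

# Map each ALT number (1..31) to its symbol, in the order listed above.
alt_code = {n: ch for n, ch in enumerate(symbols, start=1)}

# Every symbol has its own Unicode code point and round-trips through UTF-8.
round_trip = "".join(symbols).encode("utf-8").decode("utf-8")

for n in (16, 17):
    print(n, f"U+{ord(alt_code[n]):04X}")
```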

Best Regards

guy038

3d1l

guy038, thanks for your response. I just took a quick glance at your answer and I will be looking into it. At the moment I’m using the Consolas font because it makes the zero (0) different from the letter O.

3d1l

OK, I read your messages, but I did something wrong and messed up big time. Now I have lost all accented characters (á, é, í, ú, ó, ñ, Ñ), hundreds of them. They get replaced with “words” like [xA2], and some others with the “?” character.

Actually I’m concerned, because my idea in using Notepad++ was to take notes, in a single file, in plain “vanilla” ASCII or text form, so that I can open the file anywhere without worrying about proprietary formats (like OneNote or Evernote). Now you explain to me that there isn’t really a plain text format. I like to use the Consolas font (even though it is not available on all platforms) because the zero and the letter O look different. I set the language to C not because I’m coding, but because the curly braces { } let me keep the document indented and organized, and N++ allows me to fold and unfold sections of the text.

I don’t know if it made a difference, but I pressed CTRL-A to select all the text, then went to Encoding and selected Convert to UTF-8-BOM. Then I went to Edit -> Character Panel, but there was no difference (the ASCII values still say NULL, SOH, BEL, DC1, etc). I typed several special characters and they were properly displayed, then saved the file. When I reopened the file, not only were the arrows and dots replaced, but I had also lost the accented characters. I retyped some of them and saved the document, but after reopening they were replaced again. I tried to use find/replace, but after I select the weird [xA2] word, the Replace window shows a ? inside a black diamond, so it cannot find it.

Is there a way to recover all the accented characters? And how exactly do I set up the program so it keeps the special ASCII characters?

Thanks again.

PeterJones

First, to correct a misconception. There is a plain “vanilla” ASCII. It’s a 7-bit encoding that hasn’t technically been used for decades. It involves only 128 code points, the first 32 of which are control characters, and are not guaranteed to have any specific glyph associated with them. They are control characters that are supposed to do fancy things to physical (and, by extension, virtual) terminals. For codes 16 and 17 decimal (10h and 11h), your ancient font happened to assign a glyph under certain circumstances, but those are not guaranteed displayed values under all circumstances, not even under all “plain ASCII” circumstances.
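A quick way to see those numbers is to ask Python's `unicodedata` module which of the 128 ASCII code points are classified as control characters (category `Cc`). This is just an illustrative sketch.

```python
import unicodedata

# Collect the ASCII code points that Unicode classifies as control characters.
controls = [cp for cp in range(128) if unicodedata.category(chr(cp)) == "Cc"]

print(len(controls))            # the first 32 code points plus DEL (127)
print(controls[16], controls[17])  # decimal 16 and 17: DLE and DC1
print("note".isascii(), "ñ".isascii())
```

Accented letters fail the `isascii()` check, which matches the point that they were never part of 7-bit ASCII.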

Next, accented characters. Even in the old days of MS-DOS, those were not part of ASCII. So if you were really using a plain vanilla ASCII, they are not possible. In the MS-DOS world, they were part of the “IBM PC” 8-bit “extended ASCII”, which was different from various other 8-bit extensions of ASCII throughout the world. The OEM 437 (aka CP437, “code page 437”) that @guy038 mentioned is the encoding / code page for IBM PC extended ASCII characters. But that’s only “plain vanilla encoding” if you happen to be using a machine that defaults to CP437.
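To make the "different 8-bit extensions" point concrete, the sketch below encodes the same accented letters in three encodings using Python's standard codecs. It assumes `cp437` and `cp1252` are reasonable stand-ins for the OEM and ANSI code pages on a Western European Windows machine.

```python
accented = "áéíúóñÑ"

# Same characters, different bytes, depending on the encoding in use.
table = {
    encoding: [ch.encode(encoding).hex() for ch in accented]
    for encoding in ("cp437", "cp1252", "utf-8")
}
for encoding, hex_bytes in table.items():
    print(f"{encoding:7}", " ".join(hex_bytes))

# The single byte A2 means different things in the two 8-bit code pages.
byte_a2 = (b"\xa2".decode("cp437"), b"\xa2".decode("cp1252"))
print([f"U+{ord(c):04X}" for c in byte_a2])
```

The [xA2] reported earlier in this thread happens to match the OEM 437 byte for ó, which suggests (but does not prove) that the file still held OEM 437 bytes when it was read as UTF-8.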

(Unicode and character-encoding pedants would probably find holes in my explanation…)

Now, on to your actual problem: Go to Settings > Preferences > New Documents; change Encoding to ☑ UTF-8 and ☑ Apply to opened ANSI files. Close that dialog. With this selection, new files will be encoded in UTF-8 (without the BOM, the Byte Order Mark that goes at the beginning of the file) per the first checkmark, and ANSI files (files without any BOM or other internal indication of the encoding) will be assumed to be UTF-8 per the second.
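The idea of "assume UTF-8 unless something says otherwise" can be sketched in Python as follows. This is not how Notepad++ detects encodings internally; the function `guess_encoding` and its `cp1252` fallback are assumptions made purely for illustration.

```python
import codecs

def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Guess a codec name: BOM first, then strict UTF-8, then a fallback."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        data.decode("utf-8")      # strict: raises on any invalid sequence
        return "utf-8"
    except UnicodeDecodeError:
        return fallback           # hypothetical default for legacy files

print(guess_encoding("ñ".encode("utf-8")))
print(guess_encoding("ñ".encode("cp437")))
```

A file with no BOM that contains only valid UTF-8 sequences is treated as UTF-8; anything else falls back to the legacy code page.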

Now create a new file (File > New). The Encoding menu should now show “Encode in UTF-8” selected. Enter some accented characters and some others: “á, é, í, ú, ó, ñ, Ñ, ☑, →, ▶, ◀”. (Note that those last two are NOT code points 16 and 17. They are U+25B6 “Black Right-Pointing Triangle” and U+25C0 “Black Left-Pointing Triangle”.)

The easiest way to get them into Notepad++ is to copy them from someplace else. I often use the FileFormat.Info Unicode Character Search, because you can just type the name of a character, or part of a name like “right”, and find all the Unicode characters with that in the name.

I also often use the Windows Character Map (WIN+R, charmap.exe). Select your font of choice to make sure the Unicode character you want is in your font (BTW: I would recommend a more complete Unicode font, such as DejaVu Sans Mono, which still differentiates between O and 0 and between 1 and l, but has a wide selection of Unicode characters). Then ☑ Advanced View, Character Set = Unicode, Group By = Unicode Subrange. Selecting a subrange will give you an organized list of characters; double-click on a character to put it in the Characters to copy field, and hit Copy to put it on the Windows clipboard. Then paste into NPP as usual. (The arrows I showed are in the “Block Elements & Geometric Shapes” subrange, BTW. But you should really get to know the general subranges yourself, to help you find the character you want.)
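If Python is at hand, its `unicodedata` module offers a similar search by name. The sketch below is an alternative to the web page and Character Map, not a replacement for either; the helper `find_by_name` and its default code point range (Arrows through Geometric Shapes) are choices made for illustration.

```python
import unicodedata

def find_by_name(fragment: str, start: int = 0x2190, stop: int = 0x2600):
    """List (code point, name) pairs whose Unicode name contains fragment."""
    fragment = fragment.upper()
    hits = []
    for cp in range(start, stop):
        name = unicodedata.name(chr(cp), "")  # "" for unassigned code points
        if fragment in name:
            hits.append((f"U+{cp:04X}", name))
    return hits

# Exact lookup by full name, and the name of the ALT 16 symbol for comparison.
triangle = unicodedata.lookup("BLACK RIGHT-POINTING TRIANGLE")
pointer_name = unicodedata.name("\u25ba")
print(f"U+{ord(triangle):04X}", pointer_name)

for code_point, name in find_by_name("right-pointing triangle"):
    print(code_point, name)
```

This also shows that the triangle (U+25B6) and the symbol produced by ALT 16 (U+25BA) are distinct characters with different names.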

Save and close this file. Exit and reload, and re-open the file. The fancy characters should be preserved. The Encoding menu should still show “UTF-8” selected. If you select Encoding > Convert to UTF-8-BOM, save, exit and reload, the Encoding menu should now say “UTF-8-BOM”, and the Unicode characters should still show up (the file should also be two bytes longer because of the BOM).

Let us know if this doesn’t work for you.

guy038

Hi, Peter,

Many thanks for your very detailed post !

Just a small rectification : A UTF-8 BOM encoded file should be three bytes longer than the same UTF-8 encoded file !

Indeed ! In a file with the UTF-8 encoding ( Unicode Transformation Format, 8-bit ), the invisible BOM character ( of code-point \x{FEFF} ) is written as the three bytes EF BB BF ;-)

Refer to :

https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding

With a UCS-2 BE encoding ( Universal Coded Character Set-2 ), the BOM is written with the two bytes FE FF

With a UCS-2 LE encoding ( Universal Coded Character Set-2 ), the BOM is written with the two bytes FF FE
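These byte sequences can be checked with Python's standard codecs. In this sketch, `utf-8-sig` is Python's name for UTF-8 with a BOM, and the UTF-16 codecs are used as a stand-in for UCS-2, which is identical for the BOM character.

```python
import codecs

text = "ñ \u25ba"
plain = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")   # same bytes, preceded by the BOM

print(len(with_bom) - len(plain))     # size difference caused by the BOM
print("\ufeff".encode("utf-8").hex(" "))
print("\ufeff".encode("utf-16-be").hex(" "))
print("\ufeff".encode("utf-16-le").hex(" "))
```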

Best Regards,

guy038

PeterJones

Oh, right. I forgot the BOM is encoded in its own encoding. Thanks for the correction.

3d1l

Wow!

Peter, thanks. I followed what you said and it is working. The only problem is that it seems I have lost the accents for good. I have a backup, but it was not up to date :-(

Thanks for the character search web page, very handy. The Déjà Vu font is impressive, but if I get used to it and then open the file on another platform without the font, I will not see some of the characters. Funny, it uses a dot instead of a forward slash for the zero.

Peter thanks as well, very helpful comments.