Search in folder (encoding)

Mayson Kword

@guy038, thanks a lot. There is no way to improve the encoding autodetection feature, but you’ve done as much as possible, including the advice to use a BOM, which solves my issue very well.

Have a nice day, all of you are great.

gstavi

Notepad++ assumes that a file has encoding, meaning, the entire content of the file is text (Unicode symbols) using a single encoding. Notepad++ does not try to support files where every paragraph has different encoding or files that are essentially binary with pieces of “text” at some encoding embedded here and there.

Having said that, there are 2 major ways in which Notepad++ could improve the user experience in that regard, and neither should be difficult to implement:

1. If a specific encoding is not autodetected on opening a file, Notepad++ defaults to ANSI encoding (that should be called ascii encoding). That was reasonable 20 years ago. It is unreasonable today. Utf-8 should be the default, and since it is also backward compatible with ascii it should not hurt users.
2. Notepad++ really needs a setting for “assume all files are of encoding XXX”, where XXX is selected from a combo box. My guess is that the vast majority of Notepad++ users have all their relevant files in a single encoding, and they don’t need Notepad++ to autodetect (guess) it if they can just tell it once.
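
To make the two suggestions concrete, here is a minimal sketch in Python (not Notepad++'s actual C++ code). The function name `decode_with_preference` and its `forced` and `legacy` parameters are hypothetical, and `cp1252` only stands in for whatever legacy code page a system happens to use.

```python
from typing import Optional, Tuple


def decode_with_preference(data: bytes,
                           forced: Optional[str] = None,
                           legacy: str = "cp1252") -> Tuple[str, str]:
    """Return (text, encoding_used) for the raw bytes of a file.

    Hypothetical helper: `forced` models an "assume all files are XXX"
    setting, `legacy` models the code page used as a last resort.
    """
    if forced is not None:
        # The user told us the encoding once, so nothing is guessed.
        return data.decode(forced, errors="replace"), forced
    try:
        # UTF-8 as the default; pure ASCII content decodes identically.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to the legacy code page.
        return data.decode(legacy, errors="replace"), legacy


print(decode_with_preference(b"plain English"))
print(decode_with_preference(b"\xc4"))
print(decode_with_preference(b"\xc4", forced="cp1251"))
```

Trying strict UTF-8 first is cheap because text in a legacy code page that uses bytes above 7F rarely happens to form valid UTF-8 sequences, but that is a heuristic, not a guarantee.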

Ekopalypse

@gstavi

No, afaik utf8 cannot replace ANSI code pages easily.
For example the byte c4 is Ä in cp1252 and Д in cp1251
and invalid in utf8.

But I agree, npp should have the possibility to let the user
force an encoding and it is, probably, a good idea to use utf8
as the default.
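
A quick way to check the C4 example is with Python's built-in codecs (a small sketch; the variable names are only for illustration):

```python
raw = b"\xc4"  # the single byte from the example above

as_cp1252 = raw.decode("cp1252")  # Western European code page
as_cp1251 = raw.decode("cp1251")  # Cyrillic code page
print(as_cp1252, as_cp1251)

try:
    raw.decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    # A lone C4 is an incomplete sequence, so strict decoding fails.
    lone_byte_is_valid_utf8 = False

print("valid UTF-8 on its own:", lone_byte_is_valid_utf8)
```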

guy038

Hello, @ekopalypse, and All,

You said :

For example the byte c4 is Ä in cp1252 and Д in cp1251
and invalid in utf8

Eko, I do not agree with that statement : a C4 byte can be found in a UTF-8 file, as it is the first byte of the 2-byte sequences coding the characters from Ā ( U+0100, coded as C4 80 ) to Ŀ ( U+013F, coded as C4 BF )

Refer to the link and the table below :

https://en.wikipedia.org/wiki/UTF-8#Codepage_layout

•-------•-------•--------------------------------------------------------•
| Start |  End  |                   Description                          |
•-------•-------•--------------------------------------------------------•
|   00  |   7F  |  UNIQUE byte of a 1-byte sequence ( ASCII character )  |
|   80  |   BF  |  CONTINUATION byte of a sequence  ( from 1ST to 3RD )  |
|   C0  |   C1  |  FORBIDDEN values                                      |
|   C2  |   DF  |  FIRST byte of a 2-bytes sequence                      |
|   E0  |   EF  |  FIRST byte of a 3-bytes sequence                      |
|   F0  |   F4  |  FIRST byte of a 4-bytes sequence                      |
|   F5  |   FF  |  FORBIDDEN values                                      |
•-------•-------•--------------------------------------------------------•
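
The table can be written as a small lookup function. This is just a sketch in Python, and the name `utf8_byte_role` and the returned labels are made up for the example:

```python
def utf8_byte_role(value: int) -> str:
    """Classify one byte value according to the table above."""
    if not 0x00 <= value <= 0xFF:
        raise ValueError("not a byte value")
    if value <= 0x7F:
        return "single"        # 00-7F: 1-byte sequence (ASCII)
    if value <= 0xBF:
        return "continuation"  # 80-BF: continuation byte
    if value <= 0xC1:
        return "forbidden"     # C0-C1: never appear in UTF-8
    if value <= 0xDF:
        return "lead2"         # C2-DF: first byte of a 2-byte sequence
    if value <= 0xEF:
        return "lead3"         # E0-EF: first byte of a 3-byte sequence
    if value <= 0xF4:
        return "lead4"         # F0-F4: first byte of a 4-byte sequence
    return "forbidden"         # F5-FF: never appear in UTF-8


for value in (0x41, 0x80, 0xC1, 0xC4, 0xE2, 0xF0, 0xFF):
    print(f"{value:02X}", utf8_byte_role(value))
```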

I think that your reasoning is correct if we take, for instance, the individual C1 byte, which is :

The Á character, in a Windows-1250/1252/1254/1258 encoded file
The Б character, in a Windows-1251 encoded file
The Α character, in a Windows-1253 encoded file
The ֱ character, in a Windows-1255 encoded file
The ء character, in a Windows-1256 encoded file
The Į character, in a Windows-1257 encoded file

…

Always forbidden in a UTF-8 or UTF-8-BOM encoded file
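
The C1 list above can be reproduced with Python's standard codecs. This is only a sketch for checking the values; the helper name `is_valid_utf8` is made up for the example.

```python
CODE_PAGES = ["cp1250", "cp1251", "cp1252", "cp1253", "cp1254",
              "cp1255", "cp1256", "cp1257", "cp1258"]

# Decode the same single byte under every Windows code page.
decoded = {cp: b"\xc1".decode(cp) for cp in CODE_PAGES}
for cp, char in decoded.items():
    print(cp, f"U+{ord(char):04X}", char)


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# C1 is rejected both alone and when followed by a continuation byte.
print(is_valid_utf8(b"\xc1"), is_valid_utf8(b"\xc1\x80"))
```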

Best Regards,

guy038

guy038

Hi, All,

As promised, this issue on GitHub, concerning the \A assertion !

BR

guy038

Ekopalypse

@guy038

Hi Guy, how are you doing? I hope you are doing well.

I was replying to the claim that ANSI/ASCII can be replaced by utf-8.
ASCII can, but ANSI cannot.
My example was to show why it can’t be replaced.
Yes, C4 is valid as long as it is followed by another byte that forms a valid utf8 character.
Alone it is invalid.
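
In Python terms (a small sketch, with `is_valid_utf8` being a helper name invented for the example), C4 only becomes a character once a continuation byte in the range 80-BF follows it:

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


first = b"\xc4\x80".decode("utf-8")  # lowest 2-byte sequence starting with C4
last = b"\xc4\xbf".decode("utf-8")   # highest 2-byte sequence starting with C4
print(f"U+{ord(first):04X} {first}", f"U+{ord(last):04X} {last}")

lone_c4 = is_valid_utf8(b"\xc4")          # nothing follows the lead byte
c4_then_ascii = is_valid_utf8(b"\xc4A")   # next byte is not a continuation byte
print(lone_c4, c4_then_ascii)
```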

gstavi

@Ekopalypse said in Search in folder (encoding):

No, afaik utf8 cannot replace ANSI code pages easily.

The terminology is confusing in general and Notepad++ is not helping.
There are modern encodings which can represent ANY Unicode symbol with various multibyte schemes.
There are legacy ascii encodings that can represent up to 256 symbols.
Every ascii encoding comes with a code page that defines different symbols for the range 128-255.
The symbols for 0-127 in ascii encoding (and utf8) are always the same. Let’s call them “plain English”.
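
This split between the two halves of the byte range can be checked with a short Python sketch. The choice of code pages and the helper name `high_half_differences` are just for illustration.

```python
LEGACY = ["cp1251", "cp1252", "cp1253", "cp1255"]

# Bytes 0-127 decode to the same text under ASCII, UTF-8 and the code pages.
low_half = bytes(range(0x80))
low_half_texts = {enc: low_half.decode(enc)
                  for enc in ["ascii", "utf-8"] + LEGACY}
low_half_identical = len(set(low_half_texts.values())) == 1


def high_half_differences(enc_a: str, enc_b: str) -> list:
    """Byte values in 128-255 that two code pages decode differently."""
    differing = []
    for value in range(0x80, 0x100):
        byte = bytes([value])
        # errors="replace" covers the few byte values a code page leaves undefined.
        if byte.decode(enc_a, errors="replace") != byte.decode(enc_b, errors="replace"):
            differing.append(value)
    return differing


diff = high_half_differences("cp1251", "cp1252")
print("0-127 identical everywhere:", low_half_identical)
print("128-255 values that differ between cp1251 and cp1252:", len(diff))
```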

Ascii encodings should die. Notepad++ must open them but should discourage people from creating new ones by forcing an explicit choice to do so.
People that choose one of the modern encodings save themselves trouble later.
And for the many many people who can’t understand the concept of encoding Notepad++ should help by choosing the right default.

Notepad++ default “ANSI encoding” is ascii encoding with some arbitrary code page.
Generally using ascii encoding without defining an explicit code page is equivalent to saying “I only care about plain English and don’t give a fuck about range 128-255”.

Other “code pages” or “Character Sets” are not relevant to Notepad++ default. Users who want them need to either select them manually or let the autodetect guess it. Does it even work? How accurate is guessing of a code page?

For people who are ok with “ANSI”, the majority belong to the “don’t give a fuck about 128-255” group, and they will be OK with utf8.
A minority that actually use “ANSI” and add symbols from the default code page to their documents will need to select it explicitly or hope that autodetect works. But they would be better off switching to a modern encoding anyway.
Even if the solution is not 100% backward compatible, it will benefit many more people than it would hurt.

Ekopalypse

@gstavi said in Search in folder (encoding):

I agree with most of what you said, but I think there is a misunderstanding here about ANSI. (maybe it’s me)
It’s true, ANSI is used as a type of encoding, which it is not.
Instead, it is just an abbreviation for the codepage that was used to set up the operating system.
For one person it’s cp1252, for another it’s cp1251, and for the next it’s something else, and so on.
But GetACP returns this “setup” encoding and that is,
I assume, the one that is/was used by Windows users and is used by npp.
I think that makes sense.
Nevertheless, I think using unicode and especially utf8 makes more sense these days.
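
For anyone who wants to see what their own system reports, here is a small Python sketch that calls the Win32 `GetACP` function through `ctypes`. The wrapper name `windows_ansi_code_page` is made up, and the `cp<number>` codec name is an assumption that holds for the usual Windows code pages such as 1251 or 1252.

```python
import ctypes
import sys
from typing import Optional


def windows_ansi_code_page() -> Optional[int]:
    """Return the ANSI code page reported by GetACP, or None off Windows."""
    if sys.platform != "win32":
        return None
    return ctypes.windll.kernel32.GetACP()


acp = windows_ansi_code_page()
if acp is None:
    print("Not running on Windows, so there is no ANSI code page to query.")
else:
    print(f"ANSI code page: {acp} (likely Python codec name: cp{acp})")
```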

gstavi

@Ekopalypse said in Search in folder (encoding):

I think that makes sense.

It is a legitimate decision. And it makes sense … and in my (very personal) opinion it is awful.
It’s bad for interoperability because transferring a file between 2 computers could end up badly.

But my personal dislike is because I work on a multilingual operating system where the other language is right-to-left Hebrew.
And it is unimaginably annoying when some application decides to do me a favor and adjust itself without asking.
I never want to see Hebrew on my computer unless I explicitly asked for it. The OS is obviously set up with English as the primary language but FUCKING OneNote still decides to suddenly arrange pages right-to-left because the time zone is for Israel. And it feels random, unfixable and takes the control from me.

Since users don’t explicitly choose a codepage when they set up their system, using GetACP is just a guess. And if it misfires, users will not understand why because they are unaware that a choice was made for them. Don’t guess on behalf of the user if it can be avoided.

Side story: as can be expected, I am sometimes the “tech person” for friends and family. I strongly refuse to service installations that are fully Hebrew. If you ever open regedit on a Hebrew Windows and see all the English content aligned right-to-left you will lose all respect for Microsoft.

Ekopalypse

@gstavi said in Search in folder (encoding):

If you ever open regedit on a Hebrew Windows and see all the English content aligned right-to-left you will lose all respect for Microsoft.

Maybe I should give it a try to finally be persuaded to switch to Linux :-D