[New Plugin] MultiReplace
-
Thank you!
Am I correct that currently it handles only UTF-8 codepage?
In general, SCI_GETCODEPAGE may return SC_CP_UTF8 or some multi-byte encoding. In the latter case, either the multi-byte text may be used as is or MultiByteToWideChar may be used. -
You’re correct, the plugin is primarily designed to work with UTF-8. I ran some tests with multi-byte encoding and everything worked perfectly, including CSV import and export, without any code corruption.
But, as a former DB consultant, I know that nuances in character encoding can sometimes create unexpected challenges. Therefore, I’m considering enhancing the plugin to fully support other multi-byte encodings. Thanks for bringing this to my attention.
-
After further investigation, it has become evident that Scintilla primarily operates using UTF-8 encoding for its functions, including SCI_REPLACESEL. While Scintilla can handle UTF-16 encoded documents, it internally converts them to UTF-8 for processing. Therefore, when interfacing with Scintilla, it is crucial to either use UTF-8 encoded text or convert UTF-16 text to UTF-8 before passing it to Scintilla functions.
There are no dedicated Scintilla functions for handling UTF-16 encoding specifically. Instead, Scintilla works primarily with its default UTF-8 encoding and performs any necessary conversions internally.
In essence, the plugin can roll with all encodings, as I’ve seen first-hand in my tests.
Thanks for bringing this up - It started a deep dive into Scintilla.
-
Indeed, when a text file has UTF-8 encoding or UTF-16 encoding, it is loaded to Scintilla as UTF-8. (Actually, it is Notepad++ that converts the UTF-16 input file stream into UTF-8 to store it in Scintilla. On file saving, Notepad++ does the opposite: converts the UTF-8 text from Scintilla to UTF-16 output stream). For these text data, SCI_GETCODEPAGE returns 65001.
I just tried a few files with ANSI encoding - and SCI_GETCODEPAGE returns 0 for them. In such case, the Windows ANSI encoding (CP_ACP) is used within Scintilla. Each character is represented by 1 byte, corresponding to the system’s ANSI codepage. Thus, for a Western European ANSI codepage (such as Windows-1252), characters with e.g. umlauts can be shown. For a Cyrillic ANSI codepage (such as Windows-1251), Cyrillic characters can be shown. Unlike UTF-8, the interpretation of these 1-byte characters is completely system-dependent: if the Windows ANSI codepage is Windows-1252, you will see umlauts instead of Cyrillic characters, and in the case of Windows-1251 you’ll see Cyrillic characters instead of umlauts.
So, when SCI_GETCODEPAGE returns 0, the text is not UTF-8, it is ANSI, and, correspondingly, ANSI (1-byte character, system-dependent) case conversion functions should be used instead of UTF-8 ones. -
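To make that rule concrete, here is a hypothetical sketch (not taken from the plugin) of why the two cases need different handling: the byte length of a character depends on what SCI_GETCODEPAGE reports, so byte-oriented ANSI functions must not be applied to UTF-8 buffers.

```cpp
#include <cstddef>

// Hypothetical helper: how many bytes the character starting at `leadByte`
// occupies, given the value SCI_GETCODEPAGE returned. In an ANSI document
// every character is 1 byte; in UTF-8 (65001, i.e. SC_CP_UTF8) the lead
// byte encodes the length of the whole sequence.
std::size_t bytesPerChar(int sciCodePage, unsigned char leadByte) {
    if (sciCodePage != 65001)            // 0 => system ANSI: 1 byte per char
        return 1;
    if (leadByte < 0x80) return 1;       // UTF-8: plain ASCII
    if ((leadByte & 0xE0) == 0xC0) return 2;
    if ((leadByte & 0xF0) == 0xE0) return 3;
    if ((leadByte & 0xF8) == 0xF0) return 4;
    return 1; // continuation or invalid byte; real code must handle this
}
```

For example, the byte 0xE4 is a complete character (ä) in a Windows-1252 document, but the lead byte of a 3-byte sequence in UTF-8.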
I just tried a few files with ANSI encoding - and SCI_GETCODEPAGE returns 0 for them.
As a further proviso: a plugin should always defer to the document’s encoding, never the system’s encoding. Since Windows version 1903, it’s possible for the system’s “ANSI” code page to be 65001, which maps to the same value as SC_CP_UTF8. Relying on the system’s encoding may create the illusion that every document is UTF-8, because Scintilla maps the ANSI code page identifiers to the same values as the Win32 API. If (and only if) N++ reports the document’s encoding as “ANSI” (in the status bar), sending SCI_GETCODEPAGE is effectively the same as calling the system’s ::GetACP function. You will get back whatever ACP value has been set in the registry under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage.
In such case, the Windows ANSI encoding (CP_ACP) is used within Scintilla. Each character is represented by 1 byte, corresponding to the system’s ANSI codepage.
Except for the Double-byte Character Sets, which are (still!) the typical OEM encoding on PCs in East Asian countries. Scintilla has a dedicated API for those.
-
Thanks, I will also focus on ANSI code pages. In the meantime, I found an error in the plugin’s multi-byte handling when copying to the clipboard, which I’m going to fix.
-
I am working too on an advanced Find and Replace dialog, but I am not a C++ guy, so I have only designed the GUI.
If you wish, I can share some screenshots of what I have done so far, if you are curious.
-
Please don’t post any more of those screenshots here; just refer whoever you’re talking to to this issue, which seems to cover your ideas: https://github.com/notepad-plus-plus/notepad-plus-plus/issues/9627
-
@rdipardo said in [New Plugin] MultiReplace:
Except for the Double-byte Character Sets, which are (still!) the typical OEM encoding on PCs in East Asian countries. Scintilla has a dedicated API for those.
OMG, I’ve completely forgotten about those!
So, looks like the most proper way is to invoke MultiByteToWideChar first and then to deal with Unicode strings (that consist of WCHAR characters) since they are natively supported by modern Windows. Actually, this is exactly what I’ve been doing in my code, mostly because WCHAR is native for Windows NT family.
Going further, this can be enhanced to properly handle Unicode Surrogate Pairs as well. (And these may not be handled correctly in my code because I did not add any specific processing for Surrogate Pairs. Actually, I am not sure whether the standard functions such as lstrlenW take Surrogate Pairs into account or not). -
@Alan-Kilborn That one is an old ver. I’ve updated it…
-
@wonkawilly said in [New Plugin] MultiReplace:
I’ve updated it…
I’ll hand it to you; you’re tough. Even getting banned over it doesn’t dissuade you. :-)
-
@rdipardo said in [New Plugin] MultiReplace:
because Scintilla maps the ANSI code page identifiers to the same values as the Win32 API.
Does it mean that UTF8 would directly match with ANSI in Scintilla? I’m facing the problem that normal characters match in ANSI, but special letters like Ä or Ö don’t. Does anybody have an idea how to convert a wide string into UTF-8 for SCI_SEARCHINTARGET to find these characters in ANSI? Unsurprisingly, the letter Ä matches with Ä in ANSI if I convert Ä into ANSI. … I think I did it, but it’s a pretty challenging topic.
-
Off Topic:
@Alan-Kilborn said in [New Plugin] MultiReplace:
you’re tough. Even getting banned over it doesn’t dissuade you. :-)
Big changes always involve taking big risks. And I understand that traditions are sometimes difficult to overcome. That is perfectly normal, at least by human logic. But I also know that traditions will be overcome when people are more aware and ready to make the jump. And this is also part of life and evolution. After all, evolution is the meaning of life, and without evolution life would be less meaningful.
This is a general rule that also applies to this case. -
@Thomas-Knoefel said in [New Plugin] MultiReplace:
@rdipardo said in [New Plugin] MultiReplace:
because Scintilla maps the ANSI code page identifiers to the same values as the Win32 API.
Does it mean that UTF8 would directly match with ANSI in scintilla?
Based on how SCI_GETCODEPAGE works in practice, the alternative encoding to Unicode should be thought of as the “system default” rather than “ANSI”. For most of N++’s history, the “ANSI” code page was indeed single-byte (or, in the case of the legacy CJK encodings, double-byte). But the addition of a UTF-8 OEM code page in Windows version 1903 makes “ANSI” a less useful identifier, even a potentially deceptive one. The system default is no longer directly opposed to Unicode as it once was.
So, yes, there may be times when “UTF8 would directly match with ANSI,” but only if 65001 is the value of the ACP key in the system’s registry. Check on this first:
reg query HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage /s /f "CP"
Anybody an idea how to convert a widestr into UTF8 for SCI_SEARCHINTARGET to find these Characters in ANSI?
When you see const char * in the prototype of a Scintilla API (as you will for SCI_SEARCHINTARGET), it means the expected input is a byte string (i.e. “ANSI”). The conversion you want is probably from wchar_t* to char*. A debugger can show you what the encoded text looks like after conversion. -
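On Windows, the wchar_t*-to-char* conversion just described is typically a single WideCharToMultiByte(CP_UTF8, ...) call. As a portable illustration of what that call produces, here is a minimal UTF-16-to-UTF-8 encoder (a sketch that assumes well-formed input, without full error handling):

```cpp
#include <cstddef>
#include <string>

// Minimal UTF-16 -> UTF-8 encoder (sketch; assumes well-formed input).
// On Windows, WideCharToMultiByte(CP_UTF8, 0, ...) does this job.
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        // Combine a surrogate pair into one supplementary code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size())
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[++i] - 0xDC00);
        if (cp < 0x80) {                      // 1-byte sequence (ASCII)
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {              // 2-byte sequence
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {            // 3-byte sequence
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                              // 4-byte sequence
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With this, the wide string for Ä becomes the two bytes 0xC3 0x84, which will match an Ä in a UTF-8 document; matching in an ANSI document instead needs the single ANSI byte for Ä.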
Thanks, I was still a little too focused on UTF-8 while preparing the ANSI support. But this part is now working in both directions.
@rdipardo said in [New Plugin] MultiReplace:
Except for the Double-byte Character Sets, which are (still!) the typical OEM encoding on PCs in East Asian countries. Scintilla has a dedicated API for those.
I’m trying to test DBCS on my non-Asian Windows system. Is this even possible somehow? When I test all encodings in Notepad++, SCI_GETCODEPAGE returns 0 for ANSI, and all the others give me 65001. Is there no chance of obtaining one of these encodings?
codePage == 932 || codePage == 936 || codePage == 949 || codePage == 950 || codePage == 1361
I tried the BIG5 and Shift_JIS encodings, both of which are DBCS, but I obtained the same result. Even saving and reopening makes no difference. I have the feeling that I’m looking in the wrong place. -
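For reference, the code-page list quoted above can be wrapped in a trivial helper (a sketch; 932 is Shift-JIS, 936 is GBK, 949 is UHC, 950 is Big5, and 1361 is Korean Johab):

```cpp
// The double-byte code pages that get special DBCS handling
// (Shift-JIS, GBK, UHC, Big5, Johab).
bool isDbcsCodePage(int codePage) {
    return codePage == 932 || codePage == 936 || codePage == 949
        || codePage == 950 || codePage == 1361;
}
```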
@Thomas-Knoefel said in [New Plugin] MultiReplace:
I’m trying to test DBCS on my non-Asian Windows system. Is this even possible somehow?
Yes, go to the “Language & region” system settings and click “Administrative language settings”; a “Region” dialog is shown. This “Region” dialog has an “Administrative” tab with a “Change system locale” button for non-Unicode programs.
( This is for Windows 11; it was much faster to find in Windows 7 :) ) -
And regarding your other question about conversion between a custom multi-byte encoding (either ANSI or DBCS) and UTF-8, this is actually achieved by a double conversion:
- First, call MultiByteToWideChar to convert the input multi-byte string (e.g. ANSI/DBCS) to a WCHAR string
- Second, call WideCharToMultiByte to convert the WCHAR string from step 1 into the resulting multi-byte string (e.g. UTF-8).
To convert from UTF-8 to ANSI/DBCS, just specify CP_UTF8 in step 1 and then the desired ANSI/DBCS codepage in step 2.
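The two steps can be illustrated portably, with Latin-1 standing in for a single-byte ANSI code page (a sketch only: on Windows the real work is done by MultiByteToWideChar and WideCharToMultiByte, and this version only covers code points up to U+00FF):

```cpp
#include <string>

// Step 1 stand-in for MultiByteToWideChar: Latin-1 bytes -> UTF-16.
// Latin-1 maps 1:1 onto code points U+0000..U+00FF.
std::u16string latin1ToWide(const std::string& in) {
    std::u16string wide;
    for (unsigned char b : in)
        wide += static_cast<char16_t>(b);
    return wide;
}

// Step 2 stand-in for WideCharToMultiByte(CP_UTF8, ...): UTF-16 -> UTF-8.
// Only handles code points below U+0800, which covers all of Latin-1.
std::string wideToUtf8(const std::u16string& in) {
    std::string out;
    for (char16_t u : in) {
        if (u < 0x80) {
            out += static_cast<char>(u);           // 1-byte sequence
        } else {
            out += static_cast<char>(0xC0 | (u >> 6));    // 2-byte sequence
            out += static_cast<char>(0x80 | (u & 0x3F));
        }
    }
    return out;
}
```

For example, the single Latin-1 byte 0xC4 (Ä) becomes the code point U+00C4 in step 1, and the two UTF-8 bytes 0xC3 0x84 in step 2.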
-
I opened a few issues and added some pull requests to your repo.
If you are willing to accept pull requests, I have a few more to add once those are merged.
Cheers.
-
@Michael-Vincent Thanks, I’ve seen them, and I’m going to commit them. However, the latest updates for code page handling have not been committed yet. I still need to set up a VMware image with Chinese language settings in order to test DBCS. Once that is finished, I’ll upload the final updates.