[New Plugin] MultiReplace
-
Thank you!
Am I correct that currently it handles only UTF-8 codepage?
In general, SCI_GETCODEPAGE may return SC_CP_UTF8 or some multi-byte encoding. In the latter case, either the multi-byte text may be used as is or MultiByteToWideChar may be used. -
You’re correct, the plugin is primarily designed to work with UTF-8. I ran some tests with multi-byte encoding and everything worked perfectly, including CSV import and export, without any code corruption.
But, as a former DB consultant, I know that nuances in character encoding can sometimes create unexpected challenges. Therefore, I’m considering enhancing the plugin to fully support other multi-byte encodings. Thanks for bringing this to my attention.
-
After further investigation, it has become evident that Scintilla primarily operates using UTF-8 encoding for its functions, including SCI_REPLACESEL. While Scintilla can handle UTF-16 encoded documents, it internally converts them to UTF-8 for processing. Therefore, when interfacing with Scintilla, it is crucial to either use UTF-8 encoded text or convert UTF-16 text to UTF-8 before passing it to Scintilla functions.
There are no dedicated Scintilla functions for handling UTF-16 encoding specifically. Instead, Scintilla works primarily with its default UTF-8 encoding and performs any necessary conversions internally.
In essence, the plugin can roll with all encodings, as I’ve seen first-hand in my tests.
Thanks for bringing this up - It started a deep dive into Scintilla.
-
Indeed, when a text file has UTF-8 encoding or UTF-16 encoding, it is loaded to Scintilla as UTF-8. (Actually, it is Notepad++ that converts the UTF-16 input file stream into UTF-8 to store it in Scintilla. On file saving, Notepad++ does the opposite: converts the UTF-8 text from Scintilla to UTF-16 output stream). For these text data, SCI_GETCODEPAGE returns 65001.
I just tried a few files with ANSI encoding - and SCI_GETCODEPAGE returns 0 for them. In such case, the Windows ANSI encoding (CP_ACP) is used within Scintilla. Each character is represented by 1 byte, corresponding to the system’s ANSI codepage. Thus, for a Western European ANSI codepage (such as Windows-1252), characters with e.g. umlauts can be shown. For a Cyrillic ANSI codepage (such as Windows-1251), Cyrillic characters can be shown. Unlike UTF-8, the interpretation of these 1-byte characters is completely system-dependent: if the Windows ANSI codepage is Windows-1252, you will see umlauts instead of Cyrillic characters, and in the case of Windows-1251 you’ll see Cyrillic characters instead of umlauts.
So, when SCI_GETCODEPAGE returns 0, the text is not UTF-8, it is ANSI, and, correspondingly, ANSI (1-byte character, system-dependent) case conversion functions should be used instead of UTF-8 ones. -
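To make that rule concrete, here is a hypothetical sketch (not taken from the plugin) of why the two cases need different handling: the byte length of a character depends on what SCI_GETCODEPAGE reports, so byte-oriented ANSI functions must not be applied to UTF-8 buffers.

```cpp
#include <cstddef>

// Hypothetical helper: how many bytes the character starting at `leadByte`
// occupies, given the value SCI_GETCODEPAGE returned. In an ANSI document
// every character is 1 byte; in UTF-8 (65001, i.e. SC_CP_UTF8) the lead
// byte encodes the length of the whole sequence.
std::size_t bytesPerChar(int sciCodePage, unsigned char leadByte) {
    if (sciCodePage != 65001)            // 0 => system ANSI: 1 byte per char
        return 1;
    if (leadByte < 0x80) return 1;       // UTF-8: plain ASCII
    if ((leadByte & 0xE0) == 0xC0) return 2;
    if ((leadByte & 0xF0) == 0xE0) return 3;
    if ((leadByte & 0xF8) == 0xF0) return 4;
    return 1; // continuation or invalid byte; real code must handle this
}
```

For example, the byte 0xE4 is a complete character (ä) in a Windows-1252 document, but the lead byte of a 3-byte sequence in UTF-8.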
I just tried a few files with ANSI encoding - and SCI_GETCODEPAGE returns 0 for them.
As a further proviso: a plugin should always defer to the document’s encoding, never the system’s encoding. Since Windows version 1903, it’s possible for the system’s “ANSI” code page to be 65001, which maps to the same value as SC_CP_UTF8. Relying on the system’s encoding may create the illusion that every document is UTF-8, because Scintilla maps the ANSI code page identifiers to the same values as the Win32 API. If (and only if) N++ reports the document’s encoding as “ANSI” (in the status bar), sending SCI_GETCODEPAGE is effectively the same as calling the system’s ::GetACP function. You will get back whatever ACP value has been set in the registry under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage.
In such case, the Windows ANSI encoding (CP_ACP) is used within Scintilla. Each character is represented by 1 byte, corresponding to the system’s ANSI codepage.
Except for the Double-byte Character Sets, which are (still!) the typical OEM encoding on PCs in East Asian countries. Scintilla has a dedicated API for those.
-
Thanks, I will also focus on ANSI code pages. In the meantime, I found an error in the plugin’s multi-byte handling when copying to the clipboard, which I’m going to fix.
-
I am working too on an advanced Find and Replace dialog, but I am not a C++ guy, so I have only designed the GUI.
If you wish, I can share some screenshots of what I have done so far, if you are curious.
-
Please don’t post any more of those screenshots here; just refer whoever you’re talking to to this issue, which seems to cover your ideas: https://github.com/notepad-plus-plus/notepad-plus-plus/issues/9627
-
@rdipardo said in [New Plugin] MultiReplace:
Except for the Double-byte Character Sets, which are (still!) the typical OEM encoding on PCs in East Asian countries. Scintilla has a dedicated API for those.
OMG, I’ve completely forgotten about those!
So, looks like the most proper way is to invoke MultiByteToWideChar first and then to deal with Unicode strings (that consist of WCHAR characters) since they are natively supported by modern Windows. Actually, this is exactly what I’ve been doing in my code, mostly because WCHAR is native for Windows NT family.
Going further, this can be enhanced to properly handle Unicode Surrogate Pairs as well. (And these may not be handled correctly in my code because I did not add any specific processing for Surrogate Pairs. Actually, I am not sure whether the standard functions such as lstrlenW take Surrogate Pairs into account or not). -
@Alan-Kilborn That one is an old ver. I’ve updated it…
-
@wonkawilly said in [New Plugin] MultiReplace:
I’ve updated it…
I’ll hand it to you; you’re tough. Even getting banned over it doesn’t dissuade you. :-)
-
@rdipardo said in [New Plugin] MultiReplace:
because Scintilla maps the ANSI code page identifiers to the same values as the Win32 API.
Does it mean that UTF8 would directly match with ANSI in Scintilla? I’m facing the problem that normal characters match in ANSI, but special letters like Ä or Ö don’t. Does anybody have an idea how to convert a wide string into UTF-8 for SCI_SEARCHINTARGET to find these characters in ANSI? Unsurprisingly, the letter Ä matches with Ä in ANSI if I convert Ä into ANSI. … I think I did it, but it’s a pretty challenging topic.
-
Off Topic:
@Alan-Kilborn said in [New Plugin] MultiReplace:
you’re tough. Even getting banned over it doesn’t dissuade you. :-)
Big changes always involve taking big risks. And I understand that traditions are sometimes difficult to overcome. That is perfectly normal, at least by human logic. But I also know that traditions will be overcome when people are more aware and ready to make the jump. And this is also part of life and evolution. After all, evolution is the meaning of life, and without evolution life would be less meaningful.
This is a general rule that also applies to this case. -
@Thomas-Knoefel said in [New Plugin] MultiReplace:
@rdipardo said in [New Plugin] MultiReplace:
because Scintilla maps the ANSI code page identifiers to the same values as the Win32 API.
Does it mean that UTF8 would directly match with ANSI in scintilla?
Based on how SCI_GETCODEPAGE works in practice, the alternative encoding to Unicode should be thought of as the “system default” rather than “ANSI”. For most of N++’s history, the “ANSI” code page was indeed single-byte (or, in the case of the legacy CJK encodings, double-byte). But the addition of a UTF-8 OEM code page in Windows version 1903 makes “ANSI” a less useful identifier, even a potentially deceptive one. The system default is no longer directly opposed to Unicode as it once was.
So, yes, there may be times when “UTF8 would directly match with ANSI,” but only if 65001 is the value of the ACP key in the system’s registry. Check on this first:
reg query HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage /s /f "CP"
Anybody an idea how to convert a widestr into UTF8 for SCI_SEARCHINTARGET to find these Characters in ANSI?
When you see const char * in the prototype of a Scintilla API (as you will for SCI_SEARCHINTARGET), it means the expected input is a byte string (i.e. “ANSI”). The conversion you want is probably from wchar_t* to char*. A debugger can show you what the encoded text looks like after conversion. -
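On Windows, the wchar_t*-to-char* conversion just described is typically a single WideCharToMultiByte(CP_UTF8, ...) call. As a portable illustration of what that call produces, here is a minimal UTF-16-to-UTF-8 encoder (a sketch that assumes well-formed input, without full error handling):

```cpp
#include <cstddef>
#include <string>

// Minimal UTF-16 -> UTF-8 encoder (sketch; assumes well-formed input).
// On Windows, WideCharToMultiByte(CP_UTF8, 0, ...) does this job.
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        // Combine a surrogate pair into one supplementary code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size())
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[++i] - 0xDC00);
        if (cp < 0x80) {                      // 1-byte sequence (ASCII)
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {              // 2-byte sequence
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {            // 3-byte sequence
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                              // 4-byte sequence
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With this, the wide string for Ä becomes the two bytes 0xC3 0x84, which will match an Ä in a UTF-8 document; matching in an ANSI document instead needs the single ANSI byte for Ä.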
Thanks, I was still a little too focused on UTF-8 while preparing the ANSI support. But this part is now working in both directions.
@rdipardo said in [New Plugin] MultiReplace:
Except for the Double-byte Character Sets, which are (still!) the typical OEM encoding on PCs in East Asian countries. Scintilla has a dedicated API for those.
I’m trying to test DBCS on my non-Asian Windows system. Is this even possible somehow? When I test all encodings in Notepad++, SCI_GETCODEPAGE returns 0 for ANSI, and all the others give me 65001. Is there no chance of obtaining one of these encodings?
codePage == 932 || codePage == 936 || codePage == 949 || codePage == 950 || codePage == 1361
I tried the BIG5 and Shift_JIS encodings, both of which are DBCS, but I obtained the same result. Even saving and reopening makes no difference. I have the feeling that I’m looking in the wrong place. -
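For reference, the code-page list quoted above can be wrapped in a trivial helper (a sketch; 932 is Shift-JIS, 936 is GBK, 949 is UHC, 950 is Big5, and 1361 is Korean Johab):

```cpp
// The double-byte code pages that get special DBCS handling
// (Shift-JIS, GBK, UHC, Big5, Johab).
bool isDbcsCodePage(int codePage) {
    return codePage == 932 || codePage == 936 || codePage == 949
        || codePage == 950 || codePage == 1361;
}
```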
@Thomas-Knoefel said in [New Plugin] MultiReplace:
I’m trying to test DBCS on my non-Asian Windows system. Is this even possible somehow?
Yes, go to the “Language & region” system settings and click “Administrative language settings”; a “Region” dialog is shown. This “Region” dialog has an “Administrative” tab with a “Change system locale” button for non-Unicode programs.
( This is for Windows 11; it was much faster to find in Windows 7 :) ) -
And regarding your other question about conversion between a custom multi-byte encoding (either ANSI or DBCS) and UTF-8, this is actually achieved by a double conversion:
- First, call MultiByteToWideChar to convert the input multi-byte string (e.g. ANSI/DBCS) to a WCHAR string
- Second, call WideCharToMultiByte to convert the WCHAR string from step 1 into the resulting multi-byte string (e.g. UTF-8).
To convert from UTF-8 to ANSI/DBCS, just specify CP_UTF8 in step 1 and then the desired ANSI/DBCS codepage in step 2.
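The two steps can be illustrated portably, with Latin-1 standing in for a single-byte ANSI code page (a sketch only: on Windows the real work is done by MultiByteToWideChar and WideCharToMultiByte, and this version only covers code points up to U+00FF):

```cpp
#include <string>

// Step 1 stand-in for MultiByteToWideChar: Latin-1 bytes -> UTF-16.
// Latin-1 maps 1:1 onto code points U+0000..U+00FF.
std::u16string latin1ToWide(const std::string& in) {
    std::u16string wide;
    for (unsigned char b : in)
        wide += static_cast<char16_t>(b);
    return wide;
}

// Step 2 stand-in for WideCharToMultiByte(CP_UTF8, ...): UTF-16 -> UTF-8.
// Only handles code points below U+0800, which covers all of Latin-1.
std::string wideToUtf8(const std::u16string& in) {
    std::string out;
    for (char16_t u : in) {
        if (u < 0x80) {
            out += static_cast<char>(u);           // 1-byte sequence
        } else {
            out += static_cast<char>(0xC0 | (u >> 6));    // 2-byte sequence
            out += static_cast<char>(0x80 | (u & 0x3F));
        }
    }
    return out;
}
```

For example, the single Latin-1 byte 0xC4 (Ä) becomes the code point U+00C4 in step 1, and the two UTF-8 bytes 0xC3 0x84 in step 2.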
-
I opened a few issues and added some pull requests to your repo.
If you are willing to accept pull requests, I have a few more to add once those are merged.
Cheers.
-
@Michael-Vincent Thanks, I’ve seen them, and I’m going to commit them. However, the latest updates for code page handling have not been committed yet. I still need to set up a VMware image with Chinese language settings in order to test DBCS. Once that is finished, I’ll upload the final updates.