[New Plugin] MultiReplace

Thomas Knoefel

Thanks, I will also focus on ANSI code pages. In the meantime, I found an error in how the plugin handles multibyte text when copying to the clipboard, which I’m going to fix.

wonkawilly

I am also working on an advanced Find and Replace dialog, but I am not a C++ guy, so I have only designed the GUI.

If you are curious, I can share some screenshots of what I have done so far.

Alan Kilborn

@wonkawilly

Please don’t post any more of those screenshots here; just refer whomever you’re talking to to here, which seems to cover your ideas: https://github.com/notepad-plus-plus/notepad-plus-plus/issues/9627

Vitalii Dovgan

@rdipardo said in [New Plugin] MultiReplace:

Except for the Double-byte Character Sets, which are (still!) the typical OEM encoding on PCs in East Asian countries. Scintilla has a dedicated API for those.

OMG, I’ve completely forgotten about those!
So, it looks like the most proper way is to invoke MultiByteToWideChar first and then deal with Unicode strings (which consist of WCHAR characters), since they are natively supported by modern Windows. Actually, this is exactly what I’ve been doing in my code, mostly because WCHAR is native to the Windows NT family.
Going further, this can be enhanced to properly handle Unicode Surrogate Pairs as well. (And these may not be handled correctly in my code because I did not add any specific processing for Surrogate Pairs. Actually, I am not sure whether the standard functions such as lstrlenW take Surrogate Pairs into account or not).
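On the lstrlenW question: it returns the number of WCHAR code units, so a character outside the Basic Multilingual Plane counts as two. Below is a minimal portable sketch of counting code points instead. It uses char16_t in place of WCHAR so it compiles anywhere, and countCodePoints is a hypothetical helper name, not part of any Windows or Scintilla API.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-16 string.
// A high surrogate (0xD800-0xDBFF) followed by a low surrogate
// (0xDC00-0xDFFF) encodes one code point in two code units.
std::size_t countCodePoints(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        const bool isHigh = (c >= 0xD800 && c <= 0xDBFF);
        if (isHigh && i + 1 < s.size()) {
            const char16_t next = s[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                ++i; // skip the low surrogate: the pair is one code point
            }
        }
        ++count; // an unpaired surrogate is still counted as one unit
    }
    return count;
}
```

For a string holding "A" followed by U+1F600, size() reports three code units while countCodePoints reports two code points.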

wonkawilly

@Alan-Kilborn That one is an old ver. I’ve updated it…

Alan Kilborn

@wonkawilly said in [New Plugin] MultiReplace:

I’ve updated it…

I’ll hand it to you; you’re tough. Even getting banned over it doesn’t dissuade you. :-)

Thomas Knoefel

@rdipardo said in [New Plugin] MultiReplace:

because Scintilla maps the ANSI code page identifiers to the same values as the Win32 API.

Does it mean that UTF8 would directly match with ANSI in scintilla? I’m facing the problem that normal characters match in ANSI, but special letters like Ä or Ö don’t. Anybody an idea how to convert a widestr into UTF8 for SCI_SEARCHINTARGET to find these Characters in ANSI? Unsurprisingly, the letter Ä matches Ã„ in ANSI if I convert Ä into ANSI. … I think I did it, but it is a pretty challenging topic.
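The Ä versus Ã„ symptom is consistent with UTF-8 bytes being read as a single-byte Western code page. Assuming the system code page is Windows-1252, Ä (U+00C4) is the single byte 0xC4 there, but the two bytes 0xC3 0x84 in UTF-8, and Windows-1252 displays those two bytes as Ã and „. The sketch below only makes the UTF-8 side visible; encodeUtf8 is a hypothetical helper written for illustration, not something the plugin is known to contain.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8 bytes.
// It does not validate its input (surrogates, values above 0x10FFFF).
std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling encodeUtf8(0xC4) yields the bytes 0xC3 0x84, so a search string encoded this way cannot match the single byte 0xC4 stored in a Windows-1252 document.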

rdipardo

@Thomas-Knoefel said in [New Plugin] MultiReplace:

@rdipardo said in [New Plugin] MultiReplace:

because Scintilla maps the ANSI code page identifiers to the same values as the Win32 API.

Does it mean that UTF8 would directly match with ANSI in scintilla?

Based on how SCI_GETCODEPAGE works in practice, the alternative encoding to Unicode should be thought of as the “system default” rather than “ANSI”.

For most of N++'s history, the “ANSI” code page was indeed single-byte (or, in the case of the legacy CJK encodings, double-byte). But the addition of a UTF-8 OEM code page in Windows version 1903 makes “ANSI” a less useful identifier, even a potentially deceptive one. The system default is no longer directly opposed to Unicode as it once was.

So, yes, there may be times when “UTF8 would directly match with ANSI,” but only if 65001 is the value of the ACP key in the system’s registry. Check on this first:

reg query HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage /s /f "CP"

Anybody an idea how to convert a widestr into UTF8 for SCI_SEARCHINTARGET to find these Characters in ANSI?

When you see const char * in the prototype of a Scintilla API (as you will for SCI_SEARCHINTARGET), it means the expected input is a byte string (i.e. “ANSI”). The conversion you want is probably from wchar_t* to char*. A debugger can show you what the encoded text looks like after conversion.
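A minimal sketch of that wchar_t* to char* conversion, using the Win32 function WideCharToMultiByte. The helper name wideToBytes is hypothetical, and the _WIN32 guard is only there to keep the snippet compilable on other toolchains. It assumes, per the quote above, that the value returned by SCI_GETCODEPAGE can be passed straight through as the Win32 code page (0 for the system default, 65001 for UTF-8).

```cpp
#ifdef _WIN32
#include <windows.h>
#include <string>

// Hypothetical helper: converts a wide string to a byte string in the given
// Win32 code page (e.g. CP_ACP for the system default, CP_UTF8 for UTF-8).
std::string wideToBytes(const std::wstring& wide, UINT codePage)
{
    if (wide.empty()) return std::string();
    const int wideLen = static_cast<int>(wide.size());

    // First call with a null buffer: returns the required size in bytes.
    const int size = WideCharToMultiByte(codePage, 0, wide.data(), wideLen,
                                         nullptr, 0, nullptr, nullptr);
    if (size <= 0) return std::string();

    // Second call: performs the conversion into the sized buffer.
    std::string bytes(static_cast<size_t>(size), '\0');
    WideCharToMultiByte(codePage, 0, wide.data(), wideLen,
                        &bytes[0], size, nullptr, nullptr);
    return bytes;
}
#endif
```

The resulting bytes and their length would then be what gets handed to SCI_SEARCHINTARGET. Characters that the target code page cannot represent are replaced by a default character, so a debugger check of the output is still worthwhile.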

Thomas Knoefel

Thanks, I was still focused a little too much on UTF8 while preparing the ANSI handling. But this part now works in all directions.

@rdipardo said in [New Plugin] MultiReplace:

Except for the Double-byte Character Sets, which are (still!) the typical OEM encoding on PCs in East Asian countries. Scintilla has a dedicated API for those.

I’m trying to test DBCS on my non-Asian Windows system. Is this even possible somehow? When I test all encodings in Notepad++, SCI_GETCODEPAGE returns 0 for ANSI, and all the others give me 65001. Is there no chance of obtaining one of these encodings?
codePage == 932 || codePage == 936 || codePage == 949 || codePage == 950 || codePage == 136
I tried the BIG5 and Shift_JIS encodings, both of which are DBCS, but I got the same result. Even saving and reopening makes no difference. I have the feeling that I’m looking in the wrong place.
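For reference, the check quoted above could be wrapped in a small helper like the sketch below. The name isDbcsCodePage is hypothetical. The values follow the double-byte code pages listed in the Scintilla documentation for SCI_SETCODEPAGE; that list ends with 1361 (Korean Johab), so the sketch assumes the 136 in the expression above was meant to be 1361.

```cpp
// Hypothetical helper: true when a SCI_GETCODEPAGE value names one of the
// double-byte code pages listed in the Scintilla documentation.
bool isDbcsCodePage(int codePage)
{
    switch (codePage) {
    case 932:  // Japanese Shift-JIS
    case 936:  // Simplified Chinese GBK
    case 949:  // Korean Unified Hangul Code
    case 950:  // Traditional Chinese Big5
    case 1361: // Korean Johab (assumed; the post above writes 136)
        return true;
    default:
        return false; // includes 0 (system default) and 65001 (UTF-8)
    }
}
```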

Vitalii Dovgan

@Thomas-Knoefel said in [New Plugin] MultiReplace:

I’m trying to test DBCS on my non-Asian Windows system. Is this even possible somehow?

Yes, go to the “Language & region” system settings and click “Administrative language settings” to open the “Region” dialog. On its “Administrative” tab there is a “Change system locale” button, which sets the language for non-Unicode programs.
(This is for Windows 11; it was much faster to find in Windows 7 :) )

Vitalii Dovgan

And regarding your other question about conversion between a custom multi-byte encoding (either ANSI or DBCS) and UTF-8, this actually is achieved by double conversion:

First, call MultiByteToWideChar to convert the input multi-byte string (e.g. ANSI/DBCS) to a WCHAR string.
Second, call WideCharToMultiByte to convert the WCHAR string from step 1 into the resulting multi-byte string (e.g. UTF-8).

To convert from UTF-8 to ANSI/DBCS, just specify CP_UTF8 in step 1 and then the desired ANSI/DBCS code page in step 2.
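The two steps can be sketched as a single function. This is a minimal illustration under stated assumptions, not the plugin's actual code: transcode is a hypothetical name, error handling is reduced to returning an empty string, and the _WIN32 guard only keeps the snippet compilable on non-Windows toolchains.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <string>

// Hypothetical helper: re-encodes `input` from one Win32 code page to
// another by going through a WCHAR (UTF-16) string.
std::string transcode(const std::string& input, UINT fromCodePage, UINT toCodePage)
{
    if (input.empty()) return std::string();
    const int inLen = static_cast<int>(input.size());

    // Step 1: multi-byte -> WCHAR. A call with a null buffer returns the size.
    const int wideLen = MultiByteToWideChar(fromCodePage, 0, input.data(), inLen,
                                            nullptr, 0);
    if (wideLen <= 0) return std::string();
    std::wstring wide(static_cast<size_t>(wideLen), L'\0');
    MultiByteToWideChar(fromCodePage, 0, input.data(), inLen, &wide[0], wideLen);

    // Step 2: WCHAR -> multi-byte in the target code page.
    const int outLen = WideCharToMultiByte(toCodePage, 0, wide.data(), wideLen,
                                           nullptr, 0, nullptr, nullptr);
    if (outLen <= 0) return std::string();
    std::string out(static_cast<size_t>(outLen), '\0');
    WideCharToMultiByte(toCodePage, 0, wide.data(), wideLen,
                        &out[0], outLen, nullptr, nullptr);
    return out;
}
#endif
```

For example, transcode(text, 1252, CP_UTF8) would turn Windows-1252 bytes into UTF-8, and swapping the last two arguments reverses the direction.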

Michael Vincent

@Thomas-Knoefel

I opened a few issues and added some pull requests to your repo.

If you are willing to accept pull requests, I have a few more to add once those are merged.

Cheers.

Thomas Knoefel

@Michael-Vincent Thanks, I’ve seen them, and I’m going to merge them. However, the latest updates for code page handling have not been committed yet. I still need to set up a VMware virtual machine with Chinese language settings in order to test DBCS. Once that is finished, I’ll upload the final updates.

Thomas Knoefel

@Thomas-Knoefel These are the facts I figured out. In Notepad++, SCI_GETCODEPAGE always returns 0 for ANSI and 65001 for UTF8; you won’t encounter any other code page. Asian code pages, such as the DBCS ones, only matter when reading or writing files. So these code pages won’t mess things up unless you’re working with files saved in them. As for the plugin’s Save and Load File feature, which is designed as an internal store, it will always save in UTF8 format when handling CSV files.
I think this fact will simplify the handling of code pages.

Thomas Knoefel

@Michael-Vincent said in [New Plugin] MultiReplace:

I have a few more to add once those are merged.

Thank you for your input! All requests are welcome. I can only learn from them.

Thomas Knoefel

I have finished the RC-2 version, with fixed ANSI support and 32-bit compatibility. You can find it on GitHub.

Vitalii Dovgan

Thank you! I like it!
Two things that might make the plugin even more capable: 1) a button that swaps the text between the Find What and Replace With fields; 2) checkboxes in the list to specify which Find-Replace pairs are active and which are not.

Thomas Knoefel

@Vitalii-Dovgan I will probably add both options before the final release. Thanks for your input!