Notepad++ case conversion vs. regular expression case conversion
-
I’ve come across a problem that I’m trying to figure out, and so far, I’ve been unable to even find it mentioned anywhere! The issue concerns case conversion of accented characters.
Notepad++ has case conversion commands in the Edit menu. These commands work fine, regardless of whether a character is accented or not.
The problem is that regular expressions do not work correctly! Any attempt to change the case of an accented character using a regular expression with search and replace fails.
Take the following example, a file containing only the following line:
è é ê ë ē ĕ ė ę ě ȅ ȇ ȩ ḕ ḗ e ḙ ḛ ḝ ẹ ẻ ẽ ế ề ể ễ ệ
Note the unaccented ‘e’ in the middle of the line. The letter ‘E’ was chosen at random for this example. The problem is the same for all accented letters.
Start by selecting the entire line. From the menu bar, select Edit->Convert Case to->UPPERCASE. The line changes to:
È É Ê Ë Ē Ĕ Ė Ę Ě Ȅ Ȇ Ȩ Ḕ Ḗ E Ḙ Ḛ Ḝ Ẹ Ẻ Ẽ Ế Ề Ể Ễ Ệ
With the line still selected, type “Ctrl+U”, the keyboard shortcut for “lowercase”. The line becomes:
è é ê ë ē ĕ ė ę ě ȅ ȇ ȩ ḕ ḗ e ḙ ḛ ḝ ẹ ẻ ẽ ế ề ể ễ ệ
Now attempt the same thing using search and replace:
- Type Ctrl+H to open the Search/Replace dialog
- Under Search Mode select “Regular expression”
- In the “Find what” box, enter: (.*)
- In the “Replace with” box, enter: \U\1\E
- Hit “Replace All”.
(Note that the “\E” in the replace line is just for completeness’ sake.)
The result is that the line changes to:
è é ê ë ē ĕ ė ę ě ȅ ȇ ȩ ḕ ḗ E ḙ ḛ ḝ ẹ ẻ ẽ ế ề ể ễ ệ
The only affected letter is the unaccented ‘e’, in the middle of the line. The others remain untouched!
Again, this happens with all accented letters, regardless of the letter or the accent. It also happens with the lowercase command, just like it does with the uppercase command.
What am I missing?!? I’ve looked all over for something that might explain this, but I’ve failed to find anything, so your help would be greatly appreciated!
Thanks!
-geo
-
Hello George and All,
Indeed, George, you’re completely right about this! And, as I’m French, I should have noticed this issue with accented letters a long time ago :-((
So, seemingly, the replace modifiers \u, \l, \U and \L act only on the plain ASCII alphabet, in the ranges [A-Z] and [a-z].
With an ANSI-encoded file, you get the same wrong results. Too bad!
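To make the difference concrete, here is a minimal Python sketch (not N++ code) that emulates SEARCH = (.*) and REPLACE = \U\1\E in two ways: once with an uppercase step restricted to [a-z], which is the behaviour described above, and once with a Unicode-aware uppercase. The names SAMPLE and ascii_only_upper are made up for this illustration.

```python
import re
import unicodedata

# Short sample in the spirit of George's test line. NFC normalisation keeps
# every accented letter as a single precomposed code point.
SAMPLE = unicodedata.normalize("NFC", "è é ê e ẹ ệ")

def ascii_only_upper(text):
    """Uppercase the letters a-z only and leave every other character alone."""
    return "".join(c.upper() if "a" <= c <= "z" else c for c in text)

# Emulate SEARCH = (.*) with an ASCII-only uppercase of group 1.
ascii_result = re.sub(r"(.*)", lambda m: ascii_only_upper(m.group(1)), SAMPLE)

# Same search, but with Python's Unicode-aware str.upper() on group 1.
unicode_result = re.sub(r"(.*)", lambda m: m.group(1).upper(), SAMPLE)

print(ascii_result)    # è é ê E ẹ ệ
print(unicode_result)  # È É Ê E Ẹ Ệ
```

The first output matches what George saw (only the plain e changes); the second is what the Edit menu commands produce.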
However, your problem can be solved by using an alternate N++ regex library, created by François-R Boyer in June 2013. I’ve just put that library into my v6.9 version and, indeed, the case of any accented character can now be changed :-))
Below, I’ve copied two parts of my posts to h-h-h-h, from 7 months ago, where I explain the main features of that improved version, how to install that library, and where I show some examples.
If you want to have a look at the entire discussion, just refer to the two links below:
https://notepad-plus-plus.org/community/topic/9703/is-it-planned-to-switch-to-pcre2/10
https://notepad-plus-plus.org/community/topic/9703/is-it-planned-to-switch-to-pcre2/15
So, if you install the improved François-R Boyer version of the BOOST regex engine, you’ll get some strong new regex features:
- Search is performed on 32-bit code points, so it can handle characters beyond the BMP (Basic Multilingual Plane). An interesting feature for users of many Asian scripts!
- It can manage NUL characters, both in search and in replacement.
- Look-behinds are correctly handled, even when they OVERLAP the end of the previous match.
- It can handle ALL the Universal Character Names (UCN) of the UCS Transformation Format, from \x{0} to \x{7FFFFFFF}, particularly all those of code points over \x{FFFF}, which are outside the BMP.
- The backward regex search isn’t stopped on matching a character with a Unicode code point over \x{00FF}.
And, now, I can also add:
- The case modifiers \u, \l, \U and \L do change any accented letter in replacement!!
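As a side note on what "outside the BMP" means in practice, here is a small Python sketch. It says nothing about how N++ stores text internally; it only shows that a code point above \x{FFFF} needs two 16-bit units in UTF-16, and what a search and replace on whole code points looks like. The helper utf16_units is a name invented for this example.

```python
import re

def utf16_units(ch):
    """Return how many 16-bit code units UTF-16 needs for one character."""
    return len(ch.encode("utf-16-le")) // 2

outside_bmp = "\U000104A5"   # written \x{104A5} in the regex syntax above
inside_bmp = "\u20AC"        # written \x{20AC}, the euro sign

print(utf16_units(outside_bmp))  # 2 (a surrogate pair)
print(utf16_units(inside_bmp))   # 1

# Python's re module works on code points, so this S/R is straightforward:
# SEARCH = \x{104A5}\x{20AC}   REPLACE = \x{A3}\x{10482}
subject = "x" + outside_bmp + inside_bmp + "y"
result = re.sub(outside_bmp + inside_bmp, "\u00A3\U00010482", subject)
print(len(subject), len(result))  # 4 4
```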
To get this Beta N++ regex code (which has NEVER been part of an official N++ release):
- Close any N++ session(s)
- Rename your present SciLexer.dll file as, for instance, SciLexer.xxx
- Download, from the link below, the modified SciLexer.dll file of François-R Boyer:
http://sourceforge.net/projects/npppythonplugsq/files/Beta N%2B%2B regex code/
- Copy this file into the installation folder, alongside the Notepad++.exe and SciLexer.xxx files
- Restart Notepad++
IMPORTANT :
Don’t forget that this modified SciLexer.dll, built in May 2013, is based on the old Scintilla v2.2.7!
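For anyone who prefers to script the rename-and-copy steps, here is a rough Python sketch. The function name swap_scilexer and the folder paths in the usage comment are only assumptions for illustration. N++ must be closed first, and the download itself is still done by hand.

```python
import shutil
from pathlib import Path

def swap_scilexer(install_dir, new_dll, backup_suffix=".xxx"):
    """Rename SciLexer.dll to SciLexer.xxx, then copy the modified DLL in.

    install_dir: folder that contains Notepad++.exe and SciLexer.dll
    new_dll:     path of the downloaded, modified SciLexer.dll
    Returns the path of the backup file.
    """
    install_dir = Path(install_dir)
    current = install_dir / "SciLexer.dll"
    backup = current.with_suffix(backup_suffix)
    if backup.exists():
        # Never overwrite an earlier backup of the original DLL.
        raise FileExistsError(f"backup already exists: {backup}")
    current.rename(backup)
    shutil.copy2(new_dll, current)
    return backup

# Hypothetical usage (adjust both paths to your own machine):
# swap_scilexer(r"C:\Program Files\Notepad++", r"C:\Downloads\SciLexer.dll")
```

To go back to the official engine, delete the copied file and rename SciLexer.xxx back to SciLexer.dll.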
Here is a NON-exhaustive list of issues with the current regex engine which do NOT occur with François-R Boyer’s version:
- Overlapping look-behinds and matched strings are NOT correctly handled. For instance, given the 20-character subject string aaaabaaababbbaabbabb and SEARCH = (?<!a)ba*, we get 6 matches but, unfortunately, 2 of them are wrong. With François’s improved version, it’s all OK!
- We can’t use the NUL character in replacement. For example, with the simple S/R SEARCH = ABC and REPLACE = DEF\x00GHI, the result is the string DEF only :-(. François’s version does insert the NUL character between the strings DEF and GHI!
- BACKWARD assertions are NOT correctly supported. E.g., SEARCH = \A. matches, successively, all the characters of the FIRST line. With François’s version, it only matches the FIRST character of the current file.
- It doesn’t search and replace characters which are outside the Basic Multilingual Plane (BMP). For instance, in a full UTF-8 file (with a BOM), if SEARCH = \x{104A5}\x{20AC} and REPLACE = \x{A3}\x{10482}, the present regex engine answers Invalid regular expression!, whereas François’s version does the replacement correctly!
- Now, let’s suppose, for instance, the French subject string Un événement, on a new line, and the simple SEARCH regex \w. After a click on the Find Next button, close the Replace dialog and keep on searching for word characters by hitting the F3 key. When you’re near the end of the string, search backwards by hitting SHIFT + F3. You’ll notice that it CAN’T go backwards past the é character!!! François’s version works well in both directions!
- A last example: if you try to mark the matches of the simple SEARCH regex (?<=.)., the present regex engine marks only EVERY OTHER character. With François’s version, it correctly finds all characters, except for the very first of each line!
- George Karas’s goal is correctly achieved: SEARCH = (.*) and REPLACE = \U\1\E does change any lowercase letter into its associated uppercase letter!
- François-R Boyer also created a new option, SCFIND_REGEXP_LOCALEORDER, to get ranges of characters in locale order, NOT in Unicode order. For instance, the regex range [A-B], with the Match case option SET, would match all the following characters AÀÁÂÃÄÅĀĂĄǍǺẠẢẤẦẨẪẬẮẰẲẴẶǼB, in a true UTF-8 file, with a suitable font!
- To end with, François-R Boyer’s version could display the EXACT error message, instead of the generic message Invalid regular expression. For instance, the regex (\d+ab would report the Unmatched marking parenthesis error message!
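The look-behind, \A and "every other character" symptoms in the list above can be reproduced outside N++ with a small Python sketch. The assumption here, which is only a guess and has not been checked against the N++ source, is that each new search restarts on the remaining text without being able to see what comes before it. The function names find_with_context and find_on_slices are invented for this illustration.

```python
import re

def find_with_context(pattern, text):
    """Scan the whole text; every attempt can see the preceding characters."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, text)]

def find_on_slices(pattern, text):
    """Restart each search on text[pos:], hiding everything before pos."""
    found, pos = [], 0
    while pos < len(text):
        m = re.search(pattern, text[pos:])
        if m is None or m.end() == 0:
            break  # no further match, or an empty match that cannot advance
        found.append((pos + m.start(), m.group()))
        pos += m.end()
    return found

SUBJECT = "aaaabaaababbbaabbabb"

# Overlapping look-behind: 4 correct matches versus 6 with hidden context.
print(find_with_context(r"(?<!a)ba*", SUBJECT))
# [(11, 'b'), (12, 'baa'), (16, 'ba'), (19, 'b')]
print(find_on_slices(r"(?<!a)ba*", SUBJECT))
# [(11, 'b'), (12, 'baa'), (15, 'b'), (16, 'ba'), (18, 'b'), (19, 'b')]

# (?<=.). matches every character but the first, or only every other one.
print([p for p, _ in find_with_context(r"(?<=.).", "abcdef")])  # [1, 2, 3, 4, 5]
print([p for p, _ in find_on_slices(r"(?<=.).", "abcdef")])     # [1, 3, 5]

# \A. matches the first character only, or each character of the first line.
print(find_with_context(r"\A.", "Un\nmot"))  # [(0, 'U')]
print(find_on_slices(r"\A.", "Un\nmot"))     # [(0, 'U'), (1, 'n')]
```

With the hidden-context variant, the first example gives 6 matches, of which the two at offsets 15 and 18 are wrong because the b is in fact preceded by an a. That agrees with the counts reported above, though it does not prove that this is what the engine does.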
BTW, as my present knowledge of C/C++ is close to zero, it would be nice if someone could merge François-R Boyer’s improved version of the N++ regex engine into the present SciLexer.dll file, based on Scintilla v3.3.4!
And, generally speaking, could someone find a way to include that improved version, whatever the versions of N++ and Scintilla are?
Best Regards,
guy038
-
-
I would give it a try but I can’t promise anything.
Cheers
Claudia
-
OK, there are hundreds of lines changed/different.
So, it will take some time.
Cheers
Claudia
-
Hi, Claudia,
When I asked for some help in my previous post, I didn’t think of you first, as you’re already very involved in this forum. Of course, your insight in this matter would be quite valuable. But just take it easy!!
Best Regards,
guy038
-
Guy, yesterday, when I saw that many differences, I thought - oh my god - you’ll never get this fixed.
Today I learned that Mercurial also has a function which lets me compare a previous release of Scintilla
with the source code from François-R Boyer, and it isn’t that bad anymore. In total, only 5 files are affected,
but unfortunately 4 of them are completely different. So I have to try to understand what Don and François did
to get this merged. Well, if I want to become better in cpp, there is the challenge. ;-)
Cheers
Claudia
-
Hi Guy,
I’m confused. I’ve downloaded the source from the link you provided and
copied it over. I tried to build the Scintilla library and got an error:
NMAKE : fatal error U1073: don't know how to make '../boostregex/UtfConversion.h'
I searched for UtfConversion.h and yes, it isn’t there - the error makes sense, as e.g. BoostRegExSearch.cxx has
an include statement for it.
I thought the file must have been deleted at some point. I did a history search for all deleted files but no, the file isn’t listed.
So I assume that the file has to be part of François’s code but, unfortunately, it isn’t.
So I’ll concentrate on rebuilding a new Scintilla lib based on the original functions.
Cheers
Claudia