Build boost::regex with ICU support
-
I was reading your REMARK from the above mentioned link.
May I ask you for a favor?
Can you provide me a few regex examples from that section
to see if my implementation works as expected?
For the range:\x{0} to \x{7FFFFFFF}
, is it ok if I would create
each code point on the fly and do a search to see if it matches?
Or is it needed to have multiple bytes of those values to be really
sure it is working??
Means, is each code point an entity of its own or
might it be that multiple code points form to one entity? -
Hi, @ekopalypse,
Just a first and quick anwwer, regarding the
readme.txt
of François-R Boyer… on2013-03-27
!
This folder contains my latest regex code (as of may 2013) for Notepad++ which is not yet in the release version.
The SciLexer.dll can directly replace the one from latest version of Notepad++ but not all features are accessible since the user interface has not been updated to support some new features.
It passes all automated tests that were done for the “new regex code” which is in current release, plus:
- correctly supports code points outside BPM (search is done with 32 bit codepoints instead of UTF-16);
- both search and replace strings can contain embedded null characters and/or escape sequences for null characters;
- lookbehinds are correctly handled in search and replace, even those overlapping with end of previous match;
- a new [[:inval:]] character class, to find invalid UTF-8 sequences;
- invalid UTF-8 characters can be kept in replace (e.g. replacing “(.*)” by “ab\1cd” will keep invalid UTF-8 sequences);
The following new features are not accessible in current Notepad++ user interface:
- a new SCFIND_REGEXP_LOCALEORDER option, to have character ranges in locale order instead of code point order (‘à’ is between ‘a’ and ‘b’ at least in French locale order, but is after in code point order, thus [a-b] will match also ‘à’ and other characters that would be between ‘a’ and ‘b’ in a dictionary);
- the error message can now be known when the regex is invalid (e.g. regex “(” will report an “Unmatched marking parenthesis”, while current Notepad++ only knows it is an “Invalid regular expression”);
Source: readme.txt, updated 2013-05-27
Now, @ekopalypse, I’ll try, these next days, to collect a bunch of regexes, which :
-
Does not work with our present implementation of The Boost Regex library
-
Does work properly with the François-R Boyer implementation
BR
guy038
-
@guy038 - thank you very much but take your time, no hurry.
I stay away from PC on weekends anyway and
there is still some open task for implementing ICU.
So have nice weekend to everyone. :) -
@Ekopalypse said in Build boost::regex with ICU support:
there is still some open task for implementing ICU.
Extreme necro, I know… this just turned up in a search.
There’s a nice, clean header-only implementation of boost::regex, but it doesn’t work properly for Unicode without ICU. I use boost::regex in my plugin, but I gave up trying to make sense of how to statically link whatever parts of ICU are needed by boost::regex so as to wind up with a single GitHub project / MSVC solution that compiles into one dll that works.
Did you ever figure it out?
Alternatively, did you find another source for the information needed to implement a traits class on Windows for UTF-32 as
char32_t
? I think getting those “traits” is the main hurdle, and why boost::regex uses ICU4C to implement Unicode. -
Yes, as far as I remember I was able to compile everything into a “huge” static library, but then had a problem using it with nim which made me give up, but I don’t remember exactly what steps I took back then.
No, I’m pretty sure I never looked for utf32 trait classes because I didn’t even understand the basics of cpp back then.
I’m not home at the moment, but when I get back I’ll take a look at the project to see if I left some notes for my future self.
-
Sorry, apart from my Nim test code I haven’t found anything else.
import std/[os] when not defined(cpp): {.error: "This projects needs to be compiled with cpp backend as it uses the boost::regex library.".} {.passC: "-std=gnu++17 -ID:\\Repositories\\vcpkg\\installed\\x64-windows\\include".} {.push header: "boost/regex.hpp".} type StdString {.importcpp: "std::string".} = object RegEx {.importcpp: "boost::regex".} = object Match {.importcpp: "boost::smatch".} = object SubMatch {.importcpp: "boost::ssub_match".} = object RegexError {.importcpp: "boost::regex_error".} = object {.pop.} # char compatible proc regexSearch(s: StdString, w: Match, e: RegEx): bool {.importcpp: "boost::regex_search(@)".} proc initStdString(s: cstring): StdString {.constructor, importcpp: "std::string(@)".} proc initRegEx(s: cstring): RegEx {.constructor, importcpp: "boost::regex(@)".} proc initMatch(): Match {.constructor, importcpp: "boost::smatch()".} proc size(self: Match): int {.importcpp: "size".} proc position(self: Match, i: int): int {.importcpp: "position".} proc length(self: Match, i: int): int {.importcpp: "length".} proc `[]`(self: Match, i: int32): SubMatch {.importcpp: "#[#]".} proc str(self: SubMatch): StdString {.importcpp: "str".} proc cStr(self: StdString): cstring {.importcpp: "(char *)#.c_str()".} proc what(err: RegexError): cstring {.importcpp: "(char *)#.what()".} proc position(self: RegexError): int {.importcpp: "position".} # https://www.boost.org/doc/libs/1_80_0/libs/regex/doc/html/boost_regex/ref/match_results.html # https://www.boost.org/doc/libs/1_80_0/libs/regex/doc/html/boost_regex/ref/sub_match.html when isMainModule: try: var s = initStdString("Boost Libraries Test".cstring) # std::string s = "Boost Libraries"; var e = initRegEx("(\\w+)\\s(\\w+)".cstring) # boost::regex expr{"(\\w+)\\s(\\w+)"}; var w = initMatch() # boost::smatch what; if regexSearch(s, w, e): # if (boost::regex_search(s, what, expr)) { echo(w[0].str().cStr()) # std::cout << what[0] << '\n'; echo(w[1].str().cStr(), "_", w[2].str().cStr()) # std::cout << what[1] << "_" << what[2] << '\n'; } for i in 0 ..< w.size(): let pos = w.position(i) echo(pos, "-", pos + w.length(i) - 1, " ", w[int32(i)].str().cStr()) else: echo(":-(") except RegexError as e: echo "Error in regex found at position:", e.position() # echo e.what() except: echo repr(getCurrentException())
-
@Ekopalypse said in Build boost::regex with ICU support:
Yes, as far as I remember I was able to compile everything into a “huge” static library, but then had a problem using it with nim which made me give up, but I don’t remember exactly what steps I took back then.
No, I’m pretty sure I never looked for utf32 trait classes because I didn’t even understand the basics of cpp back then.
I’m not home at the moment, but when I get back I’ll take a look at the project to see if I left some notes for my future self.
Thanks for looking.
Odd… I just came across one of my own older posts which mentioned this code from PythonScript in which they appear to have solved the problem (of creating a traits class, not of using ICU).
Now I can’t remember why I didn’t just copy that approach. There must have been a reason.
-
@Ekopalypse said in Build boost::regex with ICU support:
No, I’m pretty sure I never looked for utf32 trait classes because I didn’t even understand the basics of cpp back then.
I think I have it! I’m using the PythonScript approach of creating a new traits class, but not doing it quite the same way they do — instead I’m “delegating” everything I can to the wchar_t traits, and treating everything over 0xFFFF as opaque: no attempt to recognize word characters or digits or anything else up there… just [:unicode:] but otherwise not parts of any character class and not subject to case transformation. That way I’m pretty confident nothing will be worse than it is with
wchar_t
, but it works in Unicode code points with none of the surrogate nonsense.Not ready for public release yet, but it looks like it works. I can search and replace using expressions like
\x{1F809}
with no problem;.
matches one Unicode code point, including over 0xFFFF.So now my plan is to test this until I’m comfortable including it in a new release of Columns++. If that proves stable, I’ll raise the notion that perhaps Notepad++ could do the same thing.
-
A Alan Kilborn referenced this topic on
-
Hello, @ekopalypse, @alan-kilborn, @coises and All,
@coises and @ekopalypse, I don’t know if you’ve found the time and/or the inclination to take a look to the François-R Boyer work, just for inspiration !
Of course, this work dates from
2013
and a lot of time has passed ! Since this date, some improvements were made to our N++Boost
regex engine. In particular :-
The correct behavior of the backward assertions as
\A
-
The correct behavior of the `look-behind feature, even in case of overlapping
-
The explanation of an error in case of the
Find Invalid regular expression
message
But the highlights of this old build are still :
-
Searches and *replacements are performed in true
32 bits
code-points ( instead ofUTF-16
) -
Thus, it can handle ALL the Universal Character Names ( UCN) of the UCS Transformation Format , from
\x{0}
to\x{7FFFFFFF}
, particularly, all those of code-points over\x{FFFF}
, which are outside the BMP ( Basic Multilingual Plane ) -
Both, search and replace strings can contain embedded NUL characters and/or Escape sequences for NUL characters (
\x{0000}
) -
Backward regex search, for NON
ANSI
files, does not stop, anymore, when matching a character with Unicode code-point over\x{007F}
. -
A new
[[:inval:]]
character class, which allows you to find invalid UTF-8 sequences, which can be kept in replacement, too -
a new
SCFIND_REGEXP_LOCALEORDER
option, to have character ranges in locale order instead of code-point order ('Ã ’ is between ‘a’ and ‘b’ at least in French locale order, but is after in code point order, thus[a-b]
will match also 'Ã ’ and other characters that would be between ‘a’ and ‘b’ in a dictionary)
I tried to do some tests, installing the N++
v6.9.0
portable release on an USB key and replacing the defaultSciLexer.dll
with the @boyer’sSciLexer
. Unfortunately, when inserting this USB key on my Win-10 laptop, most of these tests cannot be performed properly because of important changes in N++ releasev8.0
-
The Scilexer.dll did not exist anymore and was included within N++ itself
-
The
UCS-2 BE BOM
andUCS-2 LE BOM
encodings were changed by theUTF-16 BE BOM
andUTF-16 LE BOM
encodings
For example, using my
Total_Chars
file, the search of the regex\x{10000}
wrongly return65
hits, instead of the right1
char ). I suppose that if we could use the Boyer implementation with a recent N++ release ( instead ofv6.9
), the result would be OK ?I also noticed a strange behavior regarding the backward regex searches, with N++
v6.9
and the Boyer build for anUTF-8
file : we have to click as many times toShift + F3
that the current character is coded with two, three or fourUTF-8
bytes !
However, one thing seems to work, as mentioned by François-R Boyer : the search and/or replacement of
NUL
character(s), whatever its syntax (\x0
,\x00
,\x{00}
,\x{000}
or\x{0000}
). For example :SEARCH
ABC\x00WYZ
REPLACE
\x0--$0--\x{000}
So, I wish you all the best to your quest towards a fully consistent regular expression engine, using
32-bits
code-points with possible local order of characters !Best Regards,
guy038
-
-
To be honest, no, I didn’t look further into boost::regex and Unicode after the problems, probably only caused by my ignorance, occurred with Nim. I admit that an implementation for the EnhanceAnyLexer plugin would be beneficial, but the interaction with cpp code still gives me a stomach ache.