Build boost::regex with ICU support
-
Two things:
What the heck is ICU? The link doesn’t explain the “what” of it.
Also, I understand why it is not OT here, but there may be better places to ask the same question.
I think I know why you are asking, and I’m intrigued, so do keep us posted on what you find out.
-
ICU (International Components for Unicode) allows boost::regex to correctly parse UTF-8, UTF-16 and UTF-32 encoded text.
-
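To make the point concrete, here is a minimal sketch of what goes wrong without encoding awareness. It uses only the standard library (not Boost or ICU), and the helper names `count_code_points` and `is_single_char_for_byte_regex` are made up for this illustration. A byte-oriented engine such as `std::regex` over `char` sees the UTF-8 encoding of "é" as two units, so `.` cannot match it as one character.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: counts code points in (assumed valid) UTF-8 by
// skipping continuation bytes, which have the bit pattern 10xxxxxx.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Hypothetical helper: asks a byte-oriented engine (std::regex over char)
// whether the whole input is exactly one "character" according to `.`.
bool is_single_char_for_byte_regex(const std::string& text) {
    static const std::regex one_char(".");
    return std::regex_match(text, one_char);
}

// Small usage example: U+00E9 is one code point but two UTF-8 bytes.
void byte_vs_code_point_demo() {
    const std::string e_acute = "\xC3\xA9";
    assert(e_acute.size() == 2);                      // two bytes
    assert(count_code_points(e_acute) == 1);          // one code point
    assert(!is_single_char_for_byte_regex(e_acute));  // `.` sees only a byte
}
```

An ICU-enabled Boost build offers types such as `boost::u32regex` that operate on code points instead, which is presumably the behaviour wanted here.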
@Ekopalypse said in Build boost::regex with ICU support:
ICU allows boost::regex to correctly parse
Pretty important then! :-)
-
Yes, very important :)
-
Wouldn’t N++ and Pythonscript be building boost with that enabled? Can’t you follow their models for getting it built?
-
PS seems to do its own UTF-8 parsing. I would like to avoid that if boost has a native way of doing it, but maybe I will have to do it.
Npp seems to handle this the same way.
-
Hello, @ekopalypse, @alan-kilborn and All,
As Alan said, I do think that the ICU project, of the Unicode Consortium, is really very important!
Maybe you could examine the improved beta N++ regex code of François-R Boyer. It is probably not related at all to the present discussion. But who knows! You may find some valuable information ;-))
To that end, just follow my road map in the remark section at the end of the post below:
https://community.notepad-plus-plus.org/topic/15765/faq-desk-where-to-find-regex-documentation
Briefly:
- Download a portable N++ v6.9.0 release
- Install it in any location, different from Windows common folders
- Rename the SciLexer.dll to whatever you want
- Download the SciLexer.dll version of François-R Boyer to the same location
- Start N++ v6.9.0
Of course, if, from the examination of this old modified SciLexer.dll file, you could understand and apply Boyer’s improvements to our present SciLexer.dll file, a big step would have been taken! You would surely deserve many packs of beer as a reward ;-))
Cheers, … in advance,
guy038
-
-
Okay, a quick update.
To compile boost::regex with ICU support, the trick is to find both the release builds and the debug builds of ICU.
More about this here.
-
I was reading your REMARK from the above mentioned link.
May I ask you for a favor?
Can you provide me a few regex examples from that section
to see if my implementation works as expected?
For the range \x{0} to \x{7FFFFFFF}, is it OK if I create each code point on the fly and do a search to see if it matches?
Or do I need to test multiple bytes of those values to be really sure it is working?
In other words, is each code point an entity of its own, or can multiple code points combine into one entity?
-
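As a rough sketch of the "create each code point on the fly" idea, the following standard-library-only C++ encodes a single code point as UTF-8. The function name `encode_utf8` is hypothetical, and the sketch assumes the Unicode limit of U+10FFFF (with surrogates excluded), so values up to \x{7FFFFFFF} beyond that limit are rejected rather than encoded. The demo also shows that one visible character can be made of several code points: "é" exists both as the single code point U+00E9 and as "e" followed by the combining accent U+0301.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: encodes one Unicode scalar value as UTF-8.
// Assumption: only U+0000..U+10FFFF minus surrogates is accepted.
std::string encode_utf8(char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    std::string out;
    if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx and three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Usage example: the same visible "é" as one code point and as two.
void combining_sequence_demo() {
    const std::string precomposed = encode_utf8(0x00E9);
    const std::string decomposed = encode_utf8(0x0065) + encode_utf8(0x0301);
    assert(precomposed == "\xC3\xA9");
    assert(decomposed == "e\xCC\x81");
    assert(precomposed != decomposed);  // different byte sequences
}
```

So a per-code-point test covers each code point as its own entity, but it would not by itself exercise sequences such as base letter plus combining mark.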
Hi, @ekopalypse,
Just a first, quick answer regarding the readme.txt of François-R Boyer… on 2013-03-27!
This folder contains my latest regex code (as of May 2013) for Notepad++ which is not yet in the release version.
The SciLexer.dll can directly replace the one from latest version of Notepad++ but not all features are accessible since the user interface has not been updated to support some new features.
It passes all automated tests that were done for the “new regex code” which is in current release, plus:
- correctly supports code points outside BMP (search is done with 32-bit code points instead of UTF-16);
- both search and replace strings can contain embedded null characters and/or escape sequences for null characters;
- lookbehinds are correctly handled in search and replace, even those overlapping with end of previous match;
- a new [[:inval:]] character class, to find invalid UTF-8 sequences;
- invalid UTF-8 characters can be kept in replace (e.g. replacing “(.*)” by “ab\1cd” will keep invalid UTF-8 sequences);
The following new features are not accessible in current Notepad++ user interface:
- a new SCFIND_REGEXP_LOCALEORDER option, to have character ranges in locale order instead of code point order (‘à’ is between ‘a’ and ‘b’ at least in French locale order, but is after in code point order, thus [a-b] will match also ‘à’ and other characters that would be between ‘a’ and ‘b’ in a dictionary);
- the error message can now be known when the regex is invalid (e.g. regex “(” will report an “Unmatched marking parenthesis”, while current Notepad++ only knows it is an “Invalid regular expression”);
Source: readme.txt, updated 2013-05-27
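The first readme item, about code points outside the BMP, can be illustrated with a small standard-library-only sketch. The function name `encode_utf16` is hypothetical and this is not Boyer's code; it only shows why a search that works on UTF-16 units sees two units (a surrogate pair) for such a code point, whereas a search on 32-bit code points sees one.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: encodes one Unicode scalar value as UTF-16.
// Code points above U+FFFF are split into a high and a low surrogate.
std::u16string encode_utf16(char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);                    // inside the BMP
    } else {
        const char32_t v = cp - 0x10000;                     // 20-bit offset
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    }
    return out;
}

// Usage example: U+1F600 lies outside the BMP and needs a surrogate pair.
void surrogate_pair_demo() {
    const std::u16string units = encode_utf16(0x1F600);
    assert(units.size() == 2);
    assert(units[0] == 0xD83D);
    assert(units[1] == 0xDE00);
    assert(encode_utf16(0x00E9).size() == 1);  // BMP code point: one unit
}
```

With UTF-16 units, a pattern like `.` could match only half of such a pair, which is presumably the kind of problem the 32-bit search avoids.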
Now, @ekopalypse, over the next few days I’ll try to collect a bunch of regexes which:
- do not work with our present implementation of the Boost Regex library
- do work properly with the François-R Boyer implementation
BR
guy038
-
@guy038 - thank you very much but take your time, no hurry.
I stay away from the PC on weekends anyway, and there are still some open tasks for implementing ICU.
So have a nice weekend, everyone. :)