    Build boost::regex with ICU support

    • Alan Kilborn @Alan Kilborn

      @Ekopalypse

      Wouldn’t N++ and Pythonscript be building boost with that enabled? Can’t you follow their models for getting it built?

      • Ekopalypse

        PythonScript (PS) seems to do its own UTF-8 parsing. I would rather avoid that if Boost has a native way of doing it, but maybe I will have to.
        Notepad++ seems to handle this the same way.

        • guy038

          Hello, @ekopalypse, @alan-kilborn and All,

          As Alan said, I do think that the ICU project, of the Unicode consortium, is really very important !

          Maybe you could examine the improved beta N++ regex code by François-R Boyer. It is probably not related to the present discussion at all. But who knows ! You may find some valuable information ;-))

          For that, just follow my road map in the remark section at the end of the post below :

          https://community.notepad-plus-plus.org/topic/15765/faq-desk-where-to-find-regex-documentation


          Briefly :

          • Download a portable N++ v6.9.0 release

          • Install it in any location, different from Windows common folders

          • Rename the existing SciLexer.dll to whatever you want

          • Download François-R Boyer’s version of SciLexer.dll to the same location

          • Start N++ v6.9.0


          Of course, if, from examining this old modified SciLexer.dll file, you could understand Boyer’s improvements and apply them to our present SciLexer.dll file, a big step would have been taken ! You would surely deserve many packs of beer as a reward ;-))

          Cheers, … in advance,

          guy038

          • Ekopalypse

            Okay, a quick update.
            To compile boost::regex with ICU support, the trick is to find
            both the release builds and the debug builds of ICU.
            More about this here.

            • Ekopalypse @guy038

              @guy038

              I was reading your REMARK section in the post linked above.
              May I ask you for a favor?
              Can you provide a few regex examples from that section
              so I can see whether my implementation works as expected?
              For the range \x{0} to \x{7FFFFFFF}, is it OK if I create
              each code point on the fly and do a search to see if it matches?
              Or does the test data need to contain multiple bytes of those values
              to be really sure it is working?
              In other words, is each code point an entity of its own, or
              can multiple code points combine to form one entity?
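
One possible way to create each code point on the fly for such a test is to encode the value straight into UTF-8 bytes and append it to the search buffer. The following is only a sketch under assumptions: the buffer is UTF-8, the function name `encode_utf8_extended` is made up for illustration, and values above 0x10FFFF (which are not valid Unicode) are written with the obsolete 5- and 6-byte forms purely so that the full \x{0} to \x{7FFFFFFF} range can be exercised. Whether a given regex engine accepts those long forms is a separate question.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one value as UTF-8 bytes.
// Values above 0x10FFFF are NOT valid Unicode; they use the obsolete
// 5- and 6-byte forms only so the whole 31-bit range can be generated.
std::string encode_utf8_extended(std::uint32_t cp) {
    std::string out;
    if (cp > 0x7FFFFFFFu) return out;          // not representable, return empty
    if (cp < 0x80) {                           // plain ASCII, one byte
        out += static_cast<char>(cp);
        return out;
    }
    // Number of continuation bytes needed for this value.
    int extra = cp < 0x800      ? 1
              : cp < 0x10000    ? 2
              : cp < 0x200000   ? 3
              : cp < 0x4000000  ? 4
                                : 5;
    // Lead byte marker: (extra + 1) high bits set, then a zero bit.
    std::uint32_t lead_mark = (0xFF00u >> (extra + 1)) & 0xFFu;
    out += static_cast<char>(lead_mark | (cp >> (6 * extra)));
    // Continuation bytes: 10xxxxxx, six payload bits each, high bits first.
    for (int i = extra - 1; i >= 0; --i)
        out += static_cast<char>(0x80u | ((cp >> (6 * i)) & 0x3Fu));
    return out;
}
```

A test loop could call this for every value in the range, place the result in an otherwise empty buffer, and check that the matching `\x{...}` expression finds exactly one hit.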

              • guy038

                Hi, @ekopalypse,

                Just a first and quick answer, regarding the readme.txt of François-R Boyer… on 2013-03-27 !


                This folder contains my latest regex code (as of May 2013) for Notepad++ which is not yet in the release version.

                The SciLexer.dll can directly replace the one from latest version of Notepad++ but not all features are accessible since the user interface has not been updated to support some new features.

                It passes all automated tests that were done for the “new regex code” which is in current release, plus:

                • correctly supports code points outside BMP (search is done with 32 bit codepoints instead of UTF-16);
                • both search and replace strings can contain embedded null characters and/or escape sequences for null characters;
                • lookbehinds are correctly handled in search and replace, even those overlapping with end of previous match;
                • a new [[:inval:]] character class, to find invalid UTF-8 sequences;
                • invalid UTF-8 characters can be kept in replace (e.g. replacing “(.*)” by “ab\1cd” will keep invalid UTF-8 sequences);

                The following new features are not accessible in current Notepad++ user interface:

                • a new SCFIND_REGEXP_LOCALEORDER option, to have character ranges in locale order instead of code point order (‘à’ is between ‘a’ and ‘b’ at least in French locale order, but is after in code point order, thus [a-b] will match also ‘à’ and other characters that would be between ‘a’ and ‘b’ in a dictionary);
                • the error message can now be known when the regex is invalid (e.g. regex “(” will report an “Unmatched marking parenthesis”, while current Notepad++ only knows it is an “Invalid regular expression”);

                Source: readme.txt, updated 2013-05-27


                Now, @ekopalypse, I’ll try, over the next few days, to collect a bunch of regexes which :

                • Do not work with our present implementation of the Boost Regex library

                • Do work properly with the François-R Boyer implementation

                BR

                guy038

                • Ekopalypse @guy038

                  @guy038 - thank you very much but take your time, no hurry.
                  I stay away from PC on weekends anyway and
                  there is still some open task for implementing ICU.
                  So, have a nice weekend, everyone. :)

                  • Coises @Ekopalypse

                    @Ekopalypse said in Build boost::regex with ICU support:

                    there is still some open task for implementing ICU.

                    Extreme necro, I know… this just turned up in a search.

                    There’s a nice, clean header-only implementation of boost::regex, but it doesn’t work properly for Unicode without ICU. I use boost::regex in my plugin, but I gave up trying to make sense of how to statically link whatever parts of ICU are needed by boost::regex so as to wind up with a single GitHub project / MSVC solution that compiles into one dll that works.

                    Did you ever figure it out?

                    Alternatively, did you find another source for the information needed to implement a traits class on Windows for UTF-32 as char32_t? I think getting those “traits” is the main hurdle, and why boost::regex uses ICU4C to implement Unicode.

                    • Ekopalypse @Coises

                      @Coises

                      Yes, as far as I remember I was able to compile everything into a “huge” static library, but then had a problem using it with nim which made me give up, but I don’t remember exactly what steps I took back then.

                      No, I’m pretty sure I never looked for utf32 trait classes because I didn’t even understand the basics of cpp back then.

                      I’m not home at the moment, but when I get back I’ll take a look at the project to see if I left some notes for my future self.

                      • Ekopalypse @Coises

                        @Coises

                        Sorry, apart from my Nim test code I haven’t found anything else.

                        import std/[os]
                        
                        when not defined(cpp): {.error: "This project needs to be compiled with the cpp backend as it uses the boost::regex library.".}
                        
                        {.passC: "-std=gnu++17 -ID:\\Repositories\\vcpkg\\installed\\x64-windows\\include".}
                        
                        {.push header: "boost/regex.hpp".}
                        type
                            StdString {.importcpp: "std::string".} = object
                            RegEx {.importcpp: "boost::regex".} = object
                            Match {.importcpp: "boost::smatch".} = object
                            SubMatch {.importcpp: "boost::ssub_match".} = object
                            RegexError {.importcpp: "boost::regex_error".} = object
                        {.pop.}
                        
                        # char compatible
                        proc regexSearch(s: StdString, w: Match, e: RegEx): bool {.importcpp: "boost::regex_search(@)".}
                        proc initStdString(s: cstring): StdString {.constructor, importcpp: "std::string(@)".}
                        proc initRegEx(s: cstring): RegEx {.constructor, importcpp: "boost::regex(@)".}
                        proc initMatch(): Match {.constructor, importcpp: "boost::smatch()".}
                        
                        proc size(self: Match): int {.importcpp: "size".}
                        proc position(self: Match, i: int): int {.importcpp: "position".}
                        
                        proc length(self: Match, i: int): int {.importcpp: "length".}
                        proc `[]`(self: Match, i: int32): SubMatch {.importcpp: "#[#]".}
                        proc str(self: SubMatch): StdString {.importcpp: "str".}
                        proc cStr(self: StdString): cstring {.importcpp: "(char *)#.c_str()".}
                        proc what(err: RegexError): cstring {.importcpp: "(char *)#.what()".}
                        
                        proc position(self: RegexError): int {.importcpp: "position".}
                        
                        # https://www.boost.org/doc/libs/1_80_0/libs/regex/doc/html/boost_regex/ref/match_results.html
                        # https://www.boost.org/doc/libs/1_80_0/libs/regex/doc/html/boost_regex/ref/sub_match.html
                        
                        when isMainModule:
                            try:
                                var s = initStdString("Boost Libraries Test".cstring)   # std::string s = "Boost Libraries Test";
                                var e = initRegEx("(\\w+)\\s(\\w+)".cstring)            # boost::regex expr{"(\\w+)\\s(\\w+)"};
                                var w = initMatch()                                     # boost::smatch what;
                                if regexSearch(s, w, e):                                # if (boost::regex_search(s, what, expr)) {
                                    echo(w[0].str().cStr())                             #     std::cout << what[0] << '\n';
                                    echo(w[1].str().cStr(), "_", w[2].str().cStr())     #     std::cout << what[1] << "_" << what[2] << '\n'; }
                        
                                    for i in 0 ..< w.size():
                                        let pos = w.position(i)
                                        echo(pos, "-", pos + w.length(i) - 1, " ", w[int32(i)].str().cStr())
                                else:
                                    echo(":-(")
                        
                            except RegexError as e:
                                echo "Error in regex found at position: ", e.position()
                                # echo e.what()
                            except:
                                echo repr(getCurrentException())
                        
                        • Coises @Ekopalypse

                          @Ekopalypse said in Build boost::regex with ICU support:

                          @Coises

                          Yes, as far as I remember I was able to compile everything into a “huge” static library, but then had a problem using it with nim which made me give up, but I don’t remember exactly what steps I took back then.

                          No, I’m pretty sure I never looked for utf32 trait classes because I didn’t even understand the basics of cpp back then.

                          I’m not home at the moment, but when I get back I’ll take a look at the project to see if I left some notes for my future self.

                          Thanks for looking.

                          Odd… I just came across one of my own older posts which mentioned this code from PythonScript in which they appear to have solved the problem (of creating a traits class, not of using ICU).

                          Now I can’t remember why I didn’t just copy that approach. There must have been a reason.

                          • Coises @Ekopalypse

                            @Ekopalypse said in Build boost::regex with ICU support:

                            No, I’m pretty sure I never looked for utf32 trait classes because I didn’t even understand the basics of cpp back then.

                            I think I have it! I’m using the PythonScript approach of creating a new traits class, but not doing it quite the same way they do — instead I’m “delegating” everything I can to the wchar_t traits, and treating everything over 0xFFFF as opaque: no attempt to recognize word characters or digits or anything else up there… just [:unicode:] but otherwise not parts of any character class and not subject to case transformation. That way I’m pretty confident nothing will be worse than it is with wchar_t, but it works in Unicode code points with none of the surrogate nonsense.

                            Not ready for public release yet, but it looks like it works. I can search and replace using expressions like \x{1F809} with no problem; . matches one Unicode code point, including over 0xFFFF.

                            So now my plan is to test this until I’m comfortable including it in a new release of Columns++. If that proves stable, I’ll raise the notion that perhaps Notepad++ could do the same thing.
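
The delegation idea described above might look roughly like the sketch below. It is not the Columns++ or PythonScript code and not a complete boost::regex traits class; the helper names are hypothetical, and the standard `<cwctype>` functions stand in for whatever the wchar_t traits would do. It only shows the rule: at or below 0xFFFF, ask the wide-character functions; above 0xFFFF, treat the code point as opaque.

```cpp
#include <cwctype>

// Hypothetical helpers illustrating "delegate to wchar_t, opaque above 0xFFFF".
// Assumption: every value <= 0xFFFF can be passed to the <cwctype> functions.

// Code points above the BMP are not classified and not case-mapped.
inline bool is_opaque(char32_t cp) { return cp > 0xFFFF; }

// Word character: letters, digits and underscore, BMP only.
inline bool is_word(char32_t cp) {
    if (is_opaque(cp)) return false;
    return cp == U'_' || std::iswalnum(static_cast<std::wint_t>(cp)) != 0;
}

// Digit: delegated to iswdigit, BMP only.
inline bool is_digit(char32_t cp) {
    if (is_opaque(cp)) return false;
    return std::iswdigit(static_cast<std::wint_t>(cp)) != 0;
}

// Case folding: delegated to towlower, identity above the BMP.
inline char32_t to_lower(char32_t cp) {
    if (is_opaque(cp)) return cp;
    return static_cast<char32_t>(std::towlower(static_cast<std::wint_t>(cp)));
}
```

With helpers like these, a code point such as \x{1F809} is never a word character or a digit and is returned unchanged by case conversion, which matches the "nothing will be worse than it is with wchar_t" goal.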

                            • guy038

                              Hello, @ekopalypse, @alan-kilborn, @coises and All,

                              @coises and @ekopalypse, I don’t know if you’ve found the time and/or the inclination to take a look at François-R Boyer’s work, just for inspiration !

                              Of course, this work dates from 2013 and a lot of time has passed ! Since then, some improvements have been made to our N++ Boost regex engine. In particular :

                              • The correct behavior of backward assertions such as \A

                              • The correct behavior of the look-behind feature, even in case of overlapping

                              • The explanation of the error when the Find dialog reports an Invalid regular expression message


                              But the highlights of this old build are still :

                              • Searches and replacements are performed on true 32-bit code points ( instead of UTF-16 )

                              • Thus, it can handle ALL the Universal Character Names ( UCN ) of the UCS Transformation Format, from \x{0} to \x{7FFFFFFF}, in particular all the code points over \x{FFFF}, which are outside the BMP ( Basic Multilingual Plane )

                              • Both, search and replace strings can contain embedded NUL characters and/or Escape sequences for NUL characters ( \x{0000} )

                              • Backward regex search, for NON ANSI files, does not stop, anymore, when matching a character with Unicode code-point over \x{007F}.

                              • A new [[:inval:]] character class, which allows you to find invalid UTF-8 sequences, which can be kept in replacement, too

                              • A new SCFIND_REGEXP_LOCALEORDER option, to have character ranges in locale order instead of code-point order (‘à’ is between ‘a’ and ‘b’ at least in French locale order, but is after in code point order, thus [a-b] will match also ‘à’ and other characters that would be between ‘a’ and ‘b’ in a dictionary)
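
To illustrate the first two points: in UTF-16, a code point above \x{FFFF} is stored as two 16-bit code units (a surrogate pair), so an engine that works on UTF-16 units sees two items where a 32-bit engine sees one. A minimal sketch; `to_utf16` is a hypothetical helper, and it assumes the input is at most 0x10FFFF, the highest value UTF-16 can represent.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Assumes cp <= 0x10FFFF and that cp is not itself a surrogate value.
std::vector<std::uint16_t> to_utf16(std::uint32_t cp) {
    if (cp < 0x10000)                       // BMP: a single code unit
        return { static_cast<std::uint16_t>(cp) };
    std::uint32_t v = cp - 0x10000;         // 20 bits are left to split
    return {
        static_cast<std::uint16_t>(0xD800 + (v >> 10)),    // high surrogate
        static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))   // low surrogate
    };
}
```

For example, \x{10000} becomes the pair D800 DC00 and \x{1F809} becomes D83E DC09, whereas as 32-bit code points each of them is a single value.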


                              I tried to do some tests, installing the N++ v6.9.0 portable release on a USB key and replacing the default SciLexer.dll with Boyer’s SciLexer.dll. Unfortunately, when using this USB key on my Win-10 laptop, most of these tests cannot be performed properly because of important changes introduced in N++ release v8.0 :

                              • SciLexer.dll no longer exists as a separate file and is included within N++ itself

                              • The UCS-2 BE BOM and UCS-2 LE BOM encodings were replaced by the UTF-16 BE BOM and UTF-16 LE BOM encodings

                              For example, using my Total_Chars file, a search for the regex \x{10000} wrongly returns 65 hits, instead of the single correct match. I suppose that if we could use the Boyer implementation with a recent N++ release ( instead of v6.9 ), the result would be OK ?

                              I also noticed a strange behavior regarding backward regex searches, with N++ v6.9 and the Boyer build, for a UTF-8 file : we have to press Shift + F3 as many times as the number of UTF-8 bytes ( two, three or four ) that encode the current character !
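
This sketch does not claim to explain the behavior above; it only illustrates why a backward step over one character in a UTF-8 buffer has to cross a variable number of bytes. Continuation bytes have the bit pattern 10xxxxxx, so moving back to the previous character means skipping them until a lead byte is found. `prev_char_start` is a hypothetical helper and assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: return the byte index where the character ending
// just before `pos` starts, by skipping UTF-8 continuation bytes (10xxxxxx).
// Assumes `s` holds well-formed UTF-8 and pos <= s.size().
std::size_t prev_char_start(const std::string& s, std::size_t pos) {
    if (pos == 0) return 0;                 // already at the beginning
    --pos;                                  // step onto the last byte
    while (pos > 0 &&
           (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80)
        --pos;                              // still inside the character
    return pos;
}
```

A backward search that moves one byte at a time instead of one character at a time would need two, three or four steps to get past a single multi-byte character.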


                              However, one thing seems to work, as mentioned by François-R Boyer : the search and/or replacement of NUL character(s), whatever the syntax used ( \x0 , \x00, \x{00}, \x{000} or \x{0000} ). For example :

                              SEARCH ABC\x00WYZ

                              REPLACE \x0--$0--\x{000}



                              So, I wish you all the best in your quest for a fully consistent regular expression engine, using 32-bit code points with an optional locale order of characters !

                              Best Regards,

                              guy038

                              • Ekopalypse @guy038

                                @guy038

                                To be honest, no, I didn’t look further into boost::regex and Unicode after the problems, probably only caused by my ignorance, occurred with Nim. I admit that an implementation for the EnhanceAnyLexer plugin would be beneficial, but the interaction with cpp code still gives me a stomach ache.
