Is it planned to switch to PCRE2?

  • Hello,

    the PCRE library has changed its API and will provide new features only by means of the new API.

    Is it planned to switch to the new versions with the new API, or may I create an issue?

    See also:

  • Okay. I’ve done that…

  • Notepad++ does not use the PCRE library. It uses boost::regex, which is completely unrelated to PCRE.

  • @gerdb42 said:

    Notepad++ does not use the PCRE library. It uses boost::regex, which is completely unrelated to PCRE.

    Why then does the features page say:

    PCRE (Perl Compatible Regular Expression) Search/Replace

    Is it just an implementation of the exact same rules?

  • @milipili said:

    Actually we would like to get rid of boost::regex and to directly use pcre.

    That’s good news!

  • Hello h-h-h-h, milipili and All,

    So, h-h-h-h and milipili, you would prefer to switch to the PCRE2 regex library. May I ask what your main reasons are?

    Below, I tried to examine some differences between the BOOST and PCRE regex libraries and, to my mind, we don’t lack important features by keeping our present regex engine! So, it’s up to you to tell me where I could be wrong :-)

    BTW, this post is quite long and not easily readable :-(( So, have a drink and begin reading this damned post !!!

    I end with a link to an improved version of our BOOST regex library, created by François-R Boyer in May 2013, which could be good enough for most N++ users!?

    At the bottom of the Wikipedia article below,

    there is a description of some differences between PCRE and PERL regular expressions.

    Below, I’ll try to test the current Boost Regex library, v1.55, included in N++, by Dave Brotherstone, from the 6.0 version, against the given differences !

    Given a slight modification of the first example, the regex ^(<(?:[^<!>]+|(?2)|(?1))*>)(!>!>!>)$ does match the subject string <<<<!>!>!>>>><>>!>!>!>.

    The process can be split in :

        < < < < !>!>!> > > >    <>  >    !>!>!>
    4         ----------
    3       --------------
    1     ------------------    --
    0   -----------------------------    ------

    Although the FIRST alternative [^<!>]+, of the NON capturing group, can NEVER be matched in the subject string, either at level 0, outside recursion, OR at higher levels, in recursion, the TWO other alternatives ( the called subpattern (?2), i.e. !>!>!>, OR the recursive subpattern (?1) ) have also been tried, in the RECURSION process, by the regex engine.

    Therefore, seemingly, with the BOOST regex library, RECURSIVE matches are NON atomic, like in PERL, and UNLIKE PCRE.

    If we consider the Search-Replacement SEARCH = ^(a(b|c){0,3})+$ and REPLACE = >\1<>\2<,

    Against the subject string abbababbaccca, we obtain the replacement string >a<>c<.

    So, with the BOOST regex library, like in PCRE and unlike PERL, any quantified capture group whose LOWER limit is 0 keeps the last NON NULL value matched by that group, EVEN IF the last iteration of the match DOESN’T include that group.
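    The same search and replacement can be cross-checked outside N++. The sketch below uses Python's built-in re module, which is neither BOOST nor PCRE, so it only illustrates the "last NON NULL value is kept" rule under the assumption that Python follows it too; the variable names are just for illustration.

```python
import re

subject = "abbababbaccca"

# Group 2 sits under a {0,3} quantifier. The final repetition of group 1
# is a bare "a", so group 2 takes no part in that last iteration.
pattern = re.compile(r"^(a(b|c){0,3})+$")

match = pattern.match(subject)
last_outer = match.group(1)  # group 1 as captured in its final iteration
last_inner = match.group(2)  # value retained from an earlier iteration

# Same replacement as in the post: >\1<>\2<
replaced = pattern.sub(r">\1<>\2<", subject)
print(last_outer, last_inner, replaced)
```

    An engine that resets inner groups on every iteration would leave group 2 empty here, which is the PERL behaviour described above.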

    The different backtracking control verbs, (*FAIL), (*F), (*PRUNE), (*SKIP), (*THEN), (*COMMIT), (*ACCEPT) and (*:NAME), inside a regex pattern, all trigger the Invalid regular expression message, in the Replace dialog.

    So, the backtracking control verbs are NOT allowed, with the current N++ BOOST regex library.

    The regex (?<A456>\d+)\l+\g<A456> does match the following strings :


    But, the regex (?<456>\d+)\l+\g<456> is considered as an INVALID regular expression.

    So, with the BOOST regex library, like in PERL and unlike PCRE, names of capture groups must NOT be TRUE numbers.

    The form (?!.*s{3,5}).+, that you may test against the example text below, is a valid regex, in N++.


    Then, with the BOOST regex library, NEGATIVE look-ahead can, seemingly, contain quantifiers.

    On my old Win XP laptop ( with 1 Gb of RAM only ! ), the regex (.+)+X does match the following TWO strings


    but, wrongly, selects ALL the file contents, with the longer subject string, below :


    This is due to the multiple matching tries of the combination of the two PLUS quantifiers, during backtracking from the end of the subject string to the X character. Of course, the limit, between these two behaviours, may change, according to your technical configuration !

    Therefore, like PCRE and unlike PERL, seemingly, the BOOST regex library has a HARD limit in recursion depth.

    Just compare with the more simple regex (.+)X which perfectly works, whatever the length of the subject string.

    Now, from the link, below,

    if I test the BOOST regex library, on all the points, listed on that page, beginning with the oldest, the missing features, compared to true PCRE patterns, and NOT previously discussed, are the following :

    • The inline modifier (?U), to turn on the ungreedy mode, is absent. Therefore we need, systematically, to add the question mark character, after a quantifier, to get an ungreedy behaviour, in regular expressions.

    • The named groups, written (?P<foo>....), are not allowed, nor are the back-references (?P=foo). However, these forms can be changed, with BOOST, into (?<foo>....) and the back-references \g<foo> or \k<foo>.

    • The callouts (?C#) and (?C'abc'), which can call an external function, are, seemingly, NOT supported by the BOOST regex library, but it’s rather useless, as for the simple S/R dialog, used in Notepad++.

    • The form \C, which matches a single byte, EVEN in UTF-8 mode, doesn’t work and, with the BOOST regex library, is just an equivalent to the DOT special character. They, both, stand for [^\n\f\r]

    Using PCRE, a safe syntax to manage the individual UTF-8 bytes of characters, could be the following regex :

    (?x) (?| (?=[\x00-\x7f])(\C) | (?=[\x80-\x{7ff}])(\C)(\C) | (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) | (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))

    • The [negative] Unicode categories forms, as \p{L} or \P{Nd}, and the Unicode script names forms, as \p{Arabic}, are NOT supported in the BOOST regex library, because, in N++, it has been compiled WITHOUT the "Unicode character property support". However, note that matching characters, by Unicode property, isn’t very fast, because it had to search in a structure of over 15000 characters, even in the Basic Multilingual Plane ( BMP ) only !

    • Unlike PCRE, from v7.20, the conditional relative capture groups are NOT allowed with the BOOST regex library. For instance the BOOST regex (\d)?\d : (?(1)two|one) digit(?(1)s) number does match the two strings 23 : two digits number and 5 : one digit number, but the regex (\d)?\d : (?(-1)two|one) digit(?(-1)s) number is an INVALID regular expression ! Luckily, it isn’t used very often and there are plenty of equivalent regexes. For instance, the above example could be simply rewritten : \d(\d : two digits| : one digit) number !

    • The Line Break modifiers ( (*CR), (*LF), (*CRLF), (*ANYCRLF) and (*ANY) ), as well as the BSR modifiers ( (*BSR_ANYCRLF) and (*BSR_UNICODE) ), the UTF modifiers ( (*UTF), (*UTF8), (*UTF16) and (*UTF32) ) and the Unicode modifier (*UCP) don’t exist in the BOOST regex library.

    • The syntax \N, which matches, in PCRE, any character different than a line break, EVEN when the (?s) begins the regex, is an INVALID form, in the BOOST regex library. The same result can be obtained, in N++, with the regex [^\n\r]

    • Finally, all the new options and control verbs, starting with the new API, PCRE2, are NOT supported, in the BOOST regex library !
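    Going back to the conditional group example in the list above, the absolute form (?(1)...) and the conditional-free rewrite can be compared side by side. This is a minimal sketch with Python's re module, used here only because it happens to accept the same (?(1)yes|no) syntax; it says nothing about how BOOST or PCRE treat the relative (?(-1)...) form. The third sample string is made up, to show a rejection.

```python
import re

# Absolute conditional: (?(1)yes|no) tests whether group 1 took part in the match.
conditional = re.compile(r"(\d)?\d : (?(1)two|one) digit(?(1)s) number")

# Equivalent rewrite proposed in the post, without any conditional.
plain = re.compile(r"\d(\d : two digits| : one digit) number")

samples = [
    "23 : two digits number",
    "5 : one digit number",
    "5 : two digits number",  # inconsistent on purpose, should be rejected
]

# One (conditional, plain) pair of booleans per sample string.
results = [
    (bool(conditional.fullmatch(text)), bool(plain.fullmatch(text)))
    for text in samples
]

for text, outcome in zip(samples, results):
    print(text, outcome)
```

    Both patterns accept the first two strings and reject the third, which is why losing the relative form is not a big deal in practice.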

    To my mind and to sum up, except for the \C syntax and, maybe, the line break and the encoding modifiers, we don’t miss major features, with the current BOOST regex library.

    On the contrary, if we move to the PCRE2 library, we would likely lose two main features of the BOOST regex library, used in the Replacement part :

    • The CONDITIONAL replacements ( (?#...) and (?#...:...) ). Let’s suppose a list of names and ages, below :


    If the search regex is ^(\d+)?.+$ and the replacement (?1Age :Name) : $&, at once, that list is, magically, changed into :

    Name : Peter
    Age  : 35
    Name : John
    Age  : 52
    Name : Marie
    Age  : 18

    • The case modifiers ( \U, \L, \u, \l and \E ). For instance, given the subject string ShaKesPeare wiLLiam, the SEARCH regex (\w+) (\w)(\w+) and the REPLACEMENT syntax \U\1 \2\L\3, we obtain the string SHAKESPEARE William
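    For readers who want to see what these two replacement features amount to, here is a rough emulation in Python, where neither the (?1...:...) conditional nor the \U / \L modifiers exist in the replacement syntax and a callback function has to do the work. The input list is reconstructed from the output shown above, and the helper names label and recase are made up for this sketch.

```python
import re

# Input list, reconstructed from the output shown in the post.
people = "Peter\n35\nJohn\n52\nMarie\n18"

def label(m):
    # Mimics the replacement (?1Age :Name) : $&
    # Group 1 is set only when the line starts with digits.
    prefix = "Age " if m.group(1) is not None else "Name"
    return f"{prefix} : {m.group(0)}"

labelled = re.sub(r"^(\d+)?.+$", label, people, flags=re.MULTILINE)
print(labelled)

def recase(m):
    # Mimics the replacement \U\1 \2\L\3
    # Groups 1 and 2 are upper-cased, group 3 is lower-cased.
    return f"{m.group(1).upper()} {m.group(2).upper()}{m.group(3).lower()}"

recased = re.sub(r"(\w+) (\w)(\w+)", recase, "ShaKesPeare wiLLiam")
print(recased)
```

    With BOOST, each of these is a one-line replacement string; with an engine that lacks them, the logic has to move into code like this.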

    On this page, below, relative to the new PCRE2 version,

    it is said, at point #5 :

    Patterns, subject strings, and replacement strings may all contain binary zeros
    and, for this reason, are always passed as a pointer and a length.

    Presently, the BOOST regex library can deal with NUL characters in subject strings and in search regexes, but they are NOT ALLOWED in replacement strings.

    Luckily, if you install the improved François-R Boyer version, of the BOOST regex engine, you’ll get some strong new features :

    • Search is performed in 32 bits code-points, so it can handle characters, over the BMP ( Basic Multilingual Plane ). An interesting feature for most Asiatic people !

    • It can manage NUL characters, both, in search and in replacement, too.

    • Look-behinds are correctly handled, even in case of OVERLAPPING, with the end of the previous match.

    • It can handle ALL the Universal Character Names ( UCN) of the UCS Transformation Format , from \x{0} to \x{7FFFFFFF}, particularly, all those of code-points over \x{FFFF}, which are outside the BMP.

    • The backward regex search isn’t stopped, on matching a character, with Unicode code-point over \x{00FF}

    To get this Beta N++ regex code ( that has NEVER been part of an official N++ release ) :

    • Rename your present SciLexer.dll file as, for instance,

    • Download, from the link below, the modified SciLexer.dll file, with François-R Boyer’s N++ regex code.

    • Copy this file, in the installation folder, along with the Notepad++.exe and the files


    Don’t forget that this modified SciLexer.dll, built in May 2013, is based on the old Scintilla v2.2.7 !

    Thank you very much for still being there, and still awake !!!

    Best Regards,


  • BTW, this post is quite long and not easily readable

    Indeed. Also, I don’t have enough insight to compare boost and PCRE2. Before starting this thread, I wasn’t even aware of the usage of the boost library, because the Notepad++ website states that it uses PCRE. PCRE2 is just the future of PCRE.

    You can read a feature that’s surely missing in the boost library on the issue page:

  • Hi, h-h-h-h,

    Concerning your example, with a list of fruits, let’s suppose the four wanted replacements, below :

    apple      -> pear
    lemon      -> orange
    strawberry -> raspberry
    apricot    -> plum

    With the current BOOST regex engine, used in N++, you can use the following S/R :

    SEARCH = (apple)|(lemon)|(strawberry)|(apricot) and REPLACE = (?3raspberry)(?2orange)(?4plum)(?1pear). Then, after a click on the Replace All button, the list below :


    is changed into :


    Notes :

    • You’ll notice that, in the replacement block, the conditional replacements don’t need to be enumerated in the same order as in the search block.

    • A trick : if a replacement string, for instance, relative to the group #5, begins with a number and contains parentheses, you can use the syntax (?{5}123\(abc\))

    • Free Spacing is ALLOWED, with the BOOST regex library. For instance, the search regex (?x) ( S \. O \. S \. ) \ \- \ \1, or also, the regex

      \- [ ] \1
      , both, match the subject string S.O.S. - S.O.S.

    • The two BOOST syntaxes (?#.......) and (?x)...#......... define a COMMENT string. For instance, the five regexes, below, are equivalent to the simple regex T+CA :

    T+(?# UPPER T, 1 or MORE times)CA

    T+CA(?#UPPER T, 1 or MORE times, followed with 'CA' )

    (?x) T+ (?# UPPER T, 1 or MORE times) CA

    (?x) T+ CA (?# UPPER T, 1 or MORE times, followed with 'CA' )

    (?x) T+ C A # UPPER T, 1 or MORE times, followed with 'CA'
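    Coming back to the fruit replacements at the top of this post: the idea behind (?3raspberry)(?2orange)(?4plum)(?1pear) is simply "emit the text attached to whichever group matched". Here is a minimal sketch of that dispatch in Python, which has no conditional replacement syntax, so a dictionary and a callback stand in for it. The input list is hypothetical, and swap and replacements are names invented for the example.

```python
import re

search = re.compile(r"(apple)|(lemon)|(strawberry)|(apricot)")

# Group number -> replacement text. As in the BOOST replacement string,
# the order in which the entries are written does not matter.
replacements = {3: "raspberry", 2: "orange", 4: "plum", 1: "pear"}

def swap(m):
    # Exactly one group matches per hit; m.lastindex gives its number.
    return replacements[m.lastindex]

fruits = "apple\nlemon\nstrawberry\napricot"  # hypothetical input list
swapped = search.sub(swap, fruits)
print(swapped)
```

    The BOOST replacement string does all of this in the Replace dialog, with no scripting.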

    I quite agree with your GitHub issue #565. But, presently, it would still be, like below !!

     •- Search mode -------------------------•
     | ( ) Normal                            |
     | ( ) Extended (\r, \n, \t, \x..., \0)  |
     | (•) Regular expression (BOOST 1.55.0) |
     |     [ ] . matches newline             |

    Did you try François-R Boyer’s version ? It’s a very powerful one !

    BTW, as my present knowledge about C/C++ is rather near zero, it would be nice if someone could merge that improved François-R Boyer version of the N++ regex engine into the present SciLexer.dll file, based on Scintilla v3.3.4 !

    And, generally speaking, would someone be able to find a way to include that improved version, whatever the versions of N++ and Scintilla are ?



  • @h-h-h-h

    The term PCRE Search/Replace just states that the regex engine used is PERL compatible.

  • @guy038:
    The syntax (apple)|(lemon), (?1pear)(?2orange) is a good one. Doesn’t PCRE have something similar? Is it possible with named capturing groups, too?

    These boost versions you speak of don’t seem to be maintained as well as PCRE2. One year you mentioned was 2013. Further, I think PCRE2 has a better-known syntax. To me this is important. doesn’t even mention boost.

    You mentioned positive aspects about PCRE2. So, you aren’t against it?

    Where did you get the information about the boost regex syntax? I haven’t found a boost regex documentation.

    That’s strange because PCRE is also an official name of a regex library with a specific syntax.

  • Hi, h-h-h-h,

    I apologize for this late reply, but I was very busy, at work, this week and I preferred to rest ! I’m, no more, the young man that I used to be, before :-((. Again, this post is quite long ! So, let’s have a second drink :-))

    Some years ago ( I can’t find again this quoted text below, but I had printed it ! ) Jan Goyvaerts, the author of the site below :

    said, in the Replacement Text Reference section :

    a list of replacement text flavours is NOT the same as the list of regular expression flavours. Indeed, replacements are NOT made by the regular expression engine, but by the tool or programming library, providing the search-and-replace capability. So, tools or languages, using the SAME regex engine, may behave DIFFERENTLY, when it comes to making replacements. E.g. The PCRE library does NOT provide a search-and-replace function => All tools and languages, implementing PCRE, use their OWN search-and-replace feature, which may result in differences in the replacement syntax.

    Of course, from the link below :

    we know that, from PCRE2 version 10.00, a new pcre2_substitute() function has been implemented. However, if you read the two sections Using PCRE2, and Substituting Matches of the page, below,

    the handling of PCRE2 is, seemingly, not as easy as it was with PCRE and the substitute function has rather simple features, if we compare with the present BOOST extended format string replacement tool, in Notepad++ ! Here are, below, some nice features about the present BOOST replacement tool :

    • With the BOOST extended format string tool, named groups can be used and any group, named or not, which doesn’t match anything, is just replaced by an empty string.

    For instance, if SEARCH = (?<letters>[A-Za-z]+) *(?<digits>\d+)|(\d+) *([A-Za-z]+) and REPLACE = Name : $+{letters}\4 Age : $+{digits}\3, from the text

    Peter 35
    63   Edith

    you get the text :

    Name : Peter  Age : 35
    Name : Marie  Age : 18
    Name : David  Age : 52
    Name : Edith  Age : 63

    • With the BOOST extended format string tool, the conditional replacements can be nested. So, if SEARCH = (\d)?(\d)?\d and REPLACE = a (?2three:(?1two:one)) digit(?1s) number, the list of numbers :


    is changed into :

    a one digit number
    a two digits number
    a three digits number
    a two digits number

    Note : For a TWO digits number, group 1 is the TENS digit, group 2 is EMPTY and the last \d is the UNITS digit !

    • With the BOOST extended format string tool, the context sequences, below, are supported :

      $MATCH or ${^MATCH} or $& or $0 or ${0}
      $PREMATCH or ${^PREMATCH} or $`
      $POSTMATCH or ${^POSTMATCH} or $'

    For instance, $^N represents the contents of the last capture group, presently matched. So, given the subject string —abcdef—, SEARCH = (a)|b|(c)(d)e|(f) and REPLACE = <$^N>, we obtain the replacement string, below :


    Why ? Well, just because :

    When it matches (a) or (f), the value of $^N is the group itself, a or f
    When it matches b ( NO group ), the value of $^N is an EMPTY string
    When it matches (c)(d)e, the value of $^N is d ( the contents of the HIGHEST-numbered group matched )

    • With the BOOST extended format string tool, the five case conversions \u, \l, \U, \L and \E are possible.

    For example, the Proper Case capitalization rule can be obtained with SEARCH = (\w)(\w*) and REPLACE = \u\1\L\2. So, the sentence “thIs is a tEST” will give the nicer text "This Is A Test"
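    The nested conditional replacement from the list above, a (?2three:(?1two:one)) digit(?1s) number, reads more easily when spelled out as plain branching. The sketch below does that in Python with a callback, as an emulation only, since Python has no such replacement syntax. The four sample numbers are made up, chosen to give the same sequence of results as in the post.

```python
import re

def describe(m):
    # (?2three:(?1two:one)) -> test group 2 first, then group 1.
    if m.group(2) is not None:
        size = "three"
    elif m.group(1) is not None:
        size = "two"
    else:
        size = "one"
    # (?1s) -> plural "s" as soon as group 1 took part in the match.
    plural = "s" if m.group(1) is not None else ""
    return f"a {size} digit{plural} number"

numbers = "7\n42\n123\n85"  # hypothetical list of numbers
described = re.sub(r"(\d)?(\d)?\d", describe, numbers)
print(described)
```

    Note how, for 42, the engine first gives 4 and 2 to the two optional groups, then has to backtrack so that the mandatory \d gets the 2, which leaves group 2 empty.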

    So, h-h-h-h, to sum up, I’m not FOR or AGAINST the new PCRE2 library. It’s just that I wouldn’t want to lose the features above, and some others, that we can already use in replacement strings !

    Secondly, to my mind, we do need to improve the present regular S/R regex engine, by using the François-R Boyer version. Of course, between the N++ version 6.0 and version 6.4.2, some improvements were done and some bugs were fixed by, both, Dave Brotherstone and François-R Boyer ( as the Zero length match call-tip message,… )

    However, although François’s version simply relies on the BOOST library, he was able to fix major issues, relative to look-behinds and backward assertions, and succeeded in handling all UNICODE characters, as well as NUL characters, in replacement !

    Here is, below, a NON exhaustive list of issues with the current regex engine, which DON’T occur with François-R Boyer’s version :

    • Overlapping lookbehinds and matched strings are NOT correctly handled. For instance, given the 20-character subject string aaaabaaababbbaabbabb and SEARCH = (?<!a)ba*, we get 6 matches, but, unfortunately, 2 results are wrong. With the improved version of François, it’s all OK !

    • We can’t use the NUL character in replacement. For example, with the simple S/R : SEARCH = ABC and REPLACE = DEF\x00GHI, the result is the string DEF only :-(. François’s version does insert the NUL character between the strings DEF and GHI !

    • BACKWARD assertions are NOT correctly supported. E.g. : SEARCH = \A. matches, successively, all the characters of the FIRST line. With François’s version, it only matches, as expected, the FIRST character of the current file

    • It doesn’t search and replace characters, which are outside the Basic Multilingual Plane ( BMP ). For instance, in a full UTF-8 file ( with a BOM ), if SEARCH = \x{104A5}\x{20AC} and REPLACE = \x{A3}\x{10482}, the present regex engine answers Invalid regular expression !, whereas François’s version does the replacement correctly !

    Note :

    Of course, for that specific S/R, you need a font that can display the Osmanya characters, and which is set as the default style font, in the Style Configurator… dialogue ! To that purpose, download the Andagii font at :

    and have a look at the Osmanya characters at :

    • Now, let’s suppose, for instance, the French subject string Un événement, on a new line, and the simple SEARCH regex \w. After a click on the Find Next button, close the Replace dialog, and keep on searching some word characters, by hitting the F3 key. When you’re, about, at the end of the string, just go searching backwards, by hitting the SHIFT + F3 key. You’ll notice that it CAN’T go backwards, past the é character !!! François’s version does work well, in both directions !

    • A last example : if you try to mark the matches of the simple SEARCH regex (?<=.)., the present regex engine marks only EVERY OTHER character. With François’s version, it correctly finds all characters, except for the very first of each line !

    • François-R Boyer also created a new option SCFIND_REGEXP_LOCALEORDER, to get ranges of characters, in a locale order, NOT in Unicode order. For instance, the regex range [A-B], with the Match case option SET, would match all the following characters AÀÁÂÃÄÅĀĂĄǍǺẠẢẤẦẨẪẬẮẰẲẴẶǼB, in a true UTF-8 file, with a suitable font !

    • To end with, the François-R Boyer’s version could display the EXACT error messages, instead of the generic message Invalid regular expression. For instance, the regex (\d+ab would report the Unmatched marking parenthesis error message !
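    For the first item of this list, the overlapping look-behind case, the expected matches can be worked out by hand: a match must start at a b which is NOT preceded by an a. The sketch below lists them with Python's re module, on the assumption that its look-behind is allowed to look at text before the start of the current search position, which is the point at stake here.

```python
import re

subject = "aaaabaaababbbaabbabb"  # the 20-character string from the post

# (?<!a)ba* : a "b" not preceded by "a", followed by any run of "a".
# Each entry is (start offset, matched text).
matches = [(m.start(), m.group()) for m in re.finditer(r"(?<!a)ba*", subject)]

for start, text in matches:
    print(start, text)
```

    That gives 4 matches, which fits the observation above that, out of the 6 matches reported by the current N++ engine, 2 are wrong.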

    So, h-h-h-h, it wouldn’t be worth switching to the PCRE2 regex engine, while keeping all these issues. To my mind, we should aim for the best regex engine but, also, the best replacement tool and the best integration to Notepad++ ! Just remember that François-R Boyer could produce this nice version, with the present BOOST library only !

    I end this post with some links to the BOOST library. I haven’t the software abilities to verify these assertions, but I think that, in N++, we currently use the BOOST v1.55 library, with the PERL syntax, and without the Unicode support !

    The Home BOOST C++ Regex library page can be found at :

    The BOOST regex SEARCH syntax is explained at :

    And the BOOST-Extended REPLACEMENT format syntax can be read at :

    Seemingly, the latest BOOST C++ Regex library version is Boost-Regex 5.0.1 ( Boost-1.59.0 ). So, the latest main page, on BOOST C++ Regex library, can be obtained at :

    And the history of the BOOST C++ Regex library is at :

    Best Regards


    P.S. :

    • Concerning the SEARCH regex documentation, there are a few typographic and syntactic errors ( which are different for each version ! ). If you still wonder about a specific BOOST syntax, I’ll be able to point out all these errors, next time !

    • From the two links below, I’m going to determine, shortly, ALL the syntaxes, that are NOT SUPPORTED yet, by the present BOOST regex engine, implemented in Notepad++.

  • Hello h-h-h-h,

    I have studied, for some days, all the sections, from the two reference chapters, of the site, below :

    and, here are, my deductions, about the possible missing BOOST features. Luckily, most of them are not main features. Of course, this list is NOT exhaustive at all.


    I won’t mention missing features that can be achieved with another regex syntax or a specific regex. For instance :

    • The special quantifier syntax {,m} may be simulated, by the simple BOOST syntax {0,m}

    • The character class subtraction [a-z-[aeiouy]] can be replaced with the BOOST regex (?![aeiouy])[a-z]

    • The TCL modifier (?p) can be replaced by the (?ms) combined modifiers BOOST form

    • The (?P<name>....) construction, for a named capturing group, in Python, may be obtained with the BOOST syntaxes (?<name>....) or (?'name'....)

    • The match context form $_, standing for the whole regex match, can be replaced by the BOOST syntax below :


    And so on…Therefore, I pointed out :

    A) In SEARCH regexes :

    • The specific syntaxes \i, \c and its negative forms \I,\C, used in XML Schema or XPath, that apply to XML names

    • All the syntaxes, related to the Unicode properties of text ( as \p{Lu}, \pM, \P{IsCntrl}, … )

    • The conditional form (?(+n)...|....) or (?(-n)...|....), where the condition is the relative nth group after, or before, the current group

    • The modifier (?n), used by .NET and JGsoft, that makes all unnamed groups non-capturing groups

    • The modifier (?J), used by PCRE, Delphi, PHP, which allows duplicate group names

    • The modifier (?U), used by PCRE , which switches the syntax, between greedy and lazy quantifiers

    • The modifier (?X), used by PCRE, that generates an error, when an invalid token is escaped

    • The modifiers (?b) and (?e), used in TCL, which interpret the regex as a POSIX Basic RE or as a POSIX Extended RE

    B) In Replacement regexes :

    • The notation \o{####}, where #### stands for an octal number, when it lies between \o{1000} and \o{7777}

    • The \0 syntax, meaning the NUL character, which CANNOT be inserted, in replacement, yet !

    So, if we would switch to the PCRE2 library, we could benefit from most of the present missing features, listed just above, but, as I said in my previous post, we would lose some nice replacement features, too !

    Best Regards,


    P.S. :

    If you think of another missing BOOST feature, compared to PCRE2 ones, just let me know !

  • @guy038:
    See for evidence that PCRE2 supports the (?{name}true:false) syntax for substitution, now.

  • @guy038 said:


    @guy038 You had me at backtracking control verbs :)

    I’ve been wondering why this wasn’t working with my regular expressions within the IDE. Now I understand why. This would be a HUGE improvement to Notepad++. There’s a lot of macros that I could finally implement that would make my life so much simpler to develop software.

  • I like using regular expressions.

    For me the biggest problem is the number of different engines that have their own expression syntax. Yes, that includes the ones that claim to be Perl compatible. IMO compatible in most cases means sharing some, re-interpreting some and extending some of the syntax instead of just extending the syntax. As a result the RE’s successfully used with one RE engine do not work with another RE engine (unless the RE’s are very-very basic).

    So at some moment you think: “I know RE’s, I can do this.” and then it appears you don’t :)