Regular Expressions slightly broken in 6.9.1?



  • Hello, first time poster so I’m hoping this is the right place to post what appears to be a bug.

    I’m attempting to use a regex that should match the 3-character extension in a list of filenames in a search/replace intended to strip them off. The regex search pattern I’m using is “\.[^.]+$” (without the quotes, of course), and the replacement text box is empty. The pattern should match only on a period followed by one or more characters that are not a period, up to the end of each line.

    Unfortunately, it appears to be matching (and therefore removing) blank lines, and lines that do not match the pattern at all with the exception of the very first line (it does successfully remove the extensions, though!). It does so regardless of the setting “. matches newline.”

    The source file looks like this prior to search/replace:

    Schema
    VHINT_HL7_PROVATION_2.7.1X.hl7

    Routing Rules
    VHINT.FromVhEpicMultiToProvationRoutingRule.cls
    VHINT.FromProvationRoutingRule.cls

    DTLs
    VHINT.FromProvationORUToEpicMDMDTL.cls
    VHINT.FromEpic700ToProvationAdtDTL.cls
    VHINT.FromEpic502ToProvationSIUDTL.cls
    VHINT.FromEpic502ToProvationSIUAddProcDTL.cls

    Components
    Settings:ToProvation.ptd
    Settings:FromVhEpicMultiToProvationRouter.ptd
    Settings:FromProvationRouter.ptd
    Settings:FromProvation.ptd
    ProvApptResource.lut
    ProvApptReason.lut

    After:

    Schema
    VHINT_HL7_PROVATION_2.7.1X
    VHINT.FromVhEpicMultiToProvationRoutingRule
    VHINT.FromProvationRoutingRule
    VHINT.FromProvationORUToEpicMDMDTL
    VHINT.FromEpic700ToProvationAdtDTL
    VHINT.FromEpic502ToProvationSIUDTL
    VHINT.FromEpic502ToProvationSIUAddProcDTL
    Settings:ToProvation
    Settings:FromVhEpicMultiToProvationRouter
    Settings:FromProvationRouter
    Settings:FromProvation
    ProvApptResource
    ProvApptReason

    Am I missing something here?



  • Am I missing something here?

    One of the quirks of regular expressions that even took me a while to understand.

    This part of your regex:

    [^.]+
    

    Is the problem. This is called “greedy” matching. So what happens is the + operator grabs as much input as possible. In your case this was also grabbing the empty lines. This needs to be “lazy” by adding a ? after it. Making it lazy will try to grab as little input as possible. The full regex would look like this.

    \.[^.]+?$
    

    Also this concept of lazy matching applies to * by adding a ? after it as well.

    Depending on how much you know about the input, it is usually better to be more specific when possible. So you could do something like:

    \.[^.]{3}$
    

    That way you only match 3 characters.
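
    For anyone who wants to compare the three patterns outside the editor, here is a minimal sketch using Python and its re module. That choice is an assumption for illustration only: Notepad++ uses the Boost engine, and the sample buffer below (LF line endings, made-up names) is hypothetical. The re.MULTILINE flag is what lets $ match at every line end.

```python
import re

# Hypothetical sample buffer with LF line endings (not the poster's file).
text = "Schema\nFILE_2.7.1X.hl7\n\nRouting Rules\nA.Rule.cls\n"

# re.MULTILINE lets $ match before each newline, not only at end of input.
greedy = re.compile(r"\.[^.]+$", re.MULTILINE)
lazy = re.compile(r"\.[^.]+?$", re.MULTILINE)
exact3 = re.compile(r"\.[^.]{3}$", re.MULTILINE)

greedy_out = greedy.sub("", text)   # swallows the blank line and "Routing Rules"
lazy_out = lazy.sub("", text)       # stops at the first line end it reaches
exact3_out = exact3.sub("", text)   # removes a dot plus exactly three characters

print(repr(greedy_out))
print(repr(lazy_out))
print(repr(exact3_out))
```

    On this sample the greedy form eats lines, while the lazy and {3} forms only strip the extensions.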



  • I very much appreciate the response, and do understand the concept of “greedy” matching. However, the $ anchor is quite unambiguous as the indicator for “end of line” and should not have been consumed as part of this pattern. And if it was truly greedy matching irrespective of line-end characters, the pattern would have matched the entire file, replacing it with the empty string I supplied in the dialog. Finally, there is a specific option for matching across line endings that (one might assume) should be selected to take effect, and it was not used in my case.

    Thanks for the alternate solutions, but I still think the feature is broken. In this case I have only 3 character extensions, but that may not always be true.



  • However, the $ anchor is quite unambiguous as the indicator for “end of line” and should not have been consumed as part of this pattern

    Not sure what you mean. The $ isn’t actually the new line character(s). It is the boundary just before the newline.

    And if it was truly greedy matching irrespective of line-end characters, the pattern would have matched the entire file

    No, [^.]+ wouldn’t match the entire file, because there are literal periods in the file. It would only match up until it reaches a line that contains a period.



  • I’m not understanding the difference between “the boundary just before the newline” and “end of line.” I do agree that the anchor does not consume the EOL character(s), so I think we’re talking about the same thing.

    With the cursor positioned at the beginning of the file, the match selection indicates that the following portion is matched, represented by the square brackets (I’ve inserted representations of the specific EOL characters in the source file, which np++ recognizes as “DOS/Windows”):

    Schema\r\n
    VHINT_HL7_PROVATION_2.7.1X[.hl7\r\n
    \r\n
    Routing Rules]\r\n

    So it’s matching across multiple line endings, even though that option is not selected in the Search/Replace dialog. I’ve used regular expressions in Perl, awk, sed, and languages with both pcre-based and alternate regular expression engines and have never seen this behavior.

    Here’s an example using Perl, showing what’s matched:

    $ perl -lne 'print $1 if /\.([^.]+)$/' testme.txt

    hl7
    cls
    cls
    cls
    cls
    cls
    cls
    ptd
    ptd
    ptd
    ptd
    lut
    lut

    And awk (note that awk is showing the lines that match, as you need gawk to support capturing):

    $ awk '/\.[^.]+$/ {print}' testme.txt

    VHINT_HL7_PROVATION_2.7.1X.hl7
    VHINT.FromVhEpicMultiToProvationRoutingRule.cls
    VHINT.FromProvationRoutingRule.cls
    VHINT.FromProvationORUToEpicMDMDTL.cls
    VHINT.FromEpic700ToProvationAdtDTL.cls
    VHINT.FromEpic502ToProvationSIUDTL.cls
    VHINT.FromEpic502ToProvationSIUAddProcDTL.cls
    Settings:ToProvation.ptd
    Settings:FromVhEpicMultiToProvationRouter.ptd
    Settings:FromProvationRouter.ptd
    Settings:FromProvation.ptd
    ProvApptResource.lut
    ProvApptReason.lut

    And sed (actually performing search/replace):

    $ sed -r -e 's/\.[^.]+$//g' testme.txt

    Schema
    VHINT_HL7_PROVATION_2.7.1X

    Routing Rules
    VHINT.FromVhEpicMultiToProvationRoutingRule
    VHINT.FromProvationRoutingRule

    DTLs
    VHINT.FromProvationORUToEpicMDMDTL
    VHINT.FromEpic700ToProvationAdtDTL
    VHINT.FromEpic502ToProvationSIUDTL
    VHINT.FromEpic502ToProvationSIUAddProcDTL

    Components
    Settings:ToProvation
    Settings:FromVhEpicMultiToProvationRouter
    Settings:FromProvationRouter
    Settings:FromProvation
    ProvApptResource
    ProvApptReason

    Still think np++ is the odd man out here (i.e. broken).



  • I do agree that the anchor does not consume the EOL character(s), so I think we’re talking about the same thing.

    I agree.

    it’s matching across multiple line endings, even though that option is not selected in the Search/Replace dialog

    That option doesn’t affect this regular expression at all. Both instances of . are literal (one is escaped, the other sits inside a character class), so neither acts as the “match any character” metacharacter.

    I’m not familiar enough with Perl, etc but I’m assuming it is due to them handling a single line of input at a time and not considering the entire stream of data.
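
    A rough way to see the difference between line-at-a-time tools and a whole-buffer search, sketched in Python (an assumption for illustration; it only approximates how sed, awk or perl -n feed input, and it is not the Boost engine used by Notepad++). The sample text is hypothetical.

```python
import re

pattern = re.compile(r"\.[^.]+$", re.MULTILINE)

# Hypothetical buffer with LF line endings.
text = "Schema\nname.hl7\n\nRouting Rules\nrule.cls\n"

# Line-at-a-time: each line is matched on its own, so [^.]+ never
# gets the chance to run past a line break.
per_line = "\n".join(pattern.sub("", line) for line in text.split("\n"))

# Whole-buffer: the same pattern sees the newlines as ordinary
# "not a dot" characters and can run across them.
whole = pattern.sub("", text)

print(repr(per_line))
print(repr(whole))
```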



  • @Jeff-Drumm You and I have the same expectation, but I’m beginning to think it may not be correct.

    If I search a file using Find Next, which can be instructive since it will highlight whatever NPP finds, using a similar expression,

    x[^x]+$
    

    then generally, only the characters following “x” on a line are highlighted. But it turns out that is because the file has an “x” on nearly every line. However, I discovered that if “x” does not appear on a line, the matched text does indeed spill over to following lines. So, it looks like the “$” forces a match at the end of a line, but not necessarily the line where the match started.

    Some example text:

    DEFINE  DSC_FLASH_PHYSMEMBASE                = 0xFE000000
    DEFINE  DSC_BIOS_PHYSMEMBASE                 = 0xFF310000
    Test 0xabcd  0x1234
    # The sizes of the various flash regions:
    # more testing
    # Is the an x on this line?
    DEFINE  DSC_MAIN_BIOS_SIZE                   = 0x00470000
    DEFINE  DSC_OEMID_SIZE                       = 0x00040000
    

    What is {{found}} by each Find Next:

    DEFINE  DSC_FLASH_PHYSMEMBASE                = 0{{xFE000000}}
    DEFINE  DSC_BIOS_PHYSMEMBASE                 = 0{{xFF310000}}
    Test 0xabcd  0{{x1234
    # The sizes of the various flash regions:
    # more testing}}
    # Is the an {{x on this line?}}
    DEFINE  DSC_MAIN_BIOS_SIZE                   = 0{{x00470000}}
    DEFINE  DSC_OEMID_SIZE                       = 0{{x00040000}}
    

    I always thought that “$” would force a match to terminate on the line where it started, but this seems to not be the case. Dail’s explanation certainly matches what I observe, but it doesn’t keep my head from spinning. :-(
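
    The same spill-over can be reproduced outside the editor. This is a minimal sketch in Python (an assumption for illustration; Python's re module is not the Boost engine, and the sample lines are shortened versions of the text above with LF line endings).

```python
import re

# Shortened, LF-terminated version of the example text (no trailing newline).
sample = "\n".join([
    "DEFINE  DSC_FLASH_PHYSMEMBASE = 0xFE000000",
    "Test 0xabcd  0x1234",
    "# The sizes of the various flash regions:",
    "# more testing",
    "# Is the an x on this line?",
    "DEFINE  DSC_OEMID_SIZE = 0x00040000",
])

# Each element is what one "Find Next" would highlight.
matches = re.findall(r"x[^x]+$", sample, flags=re.MULTILINE)

for found in matches:
    print(repr(found))
```

    The second match starts on the Test line and ends two lines further down, at the last line end before the next “x”.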

    I’m wondering how you construct a search that behaves as we thought this one should have behaved…



  • @Jim-Dailey

    I always thought that “$” would force a match to terminate on the line where it started

    As I briefly mentioned I believe it is due to command line programs dealing with a single “line” of text as opposed to a text editor dealing with multiple lines of text. The greedy part of the regex will happily grab as much text (and lines) as possible, since strictly speaking [^.] does in fact match \r and \n.

    I’m wondering how you construct a search that behaves as we thought this one should have behaved…

    Personally I’d either make sure the search was lazy (as above), or specifically not match newline characters by using:

    \.[^.\r\n]+$
    

    Or

    \.[A-Za-z]+$
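
    A quick check of both alternatives, sketched in Python (an assumption for illustration, not the Boost engine). Python's $ only treats \n as a line end, so the hypothetical sample uses LF line endings.

```python
import re

# Hypothetical buffer with LF line endings.
text = "Schema\nname.hl7\n\nRouting Rules\nrule.cls\n"

no_newlines = re.compile(r"\.[^.\r\n]+$", re.MULTILINE)
letters_only = re.compile(r"\.[A-Za-z]+$", re.MULTILINE)

# Excluding \r and \n from the class keeps every match on one line.
no_newlines_out = no_newlines.sub("", text)

# The letters-only class also stays on one line, but it skips
# extensions that contain a digit, such as ".hl7".
letters_only_out = letters_only.sub("", text)

print(repr(no_newlines_out))
print(repr(letters_only_out))
```

    The letters-only form is the stricter of the two, so it is only a fit when the extensions never contain digits.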


  • @dail I tried

    x[^x\R]+$
    

    but it made no difference. But, sure enough, this does work:

    x[^x\r\n]+$
    

    Any idea why \R doesn’t work in this case?



  • @Jim-Dailey

    I’m not sure why \R doesn’t work in that case. I’ve run across it before but never took the time to look through the boost documentation to see if that was intended behavior.



  • Ok, I give up. I went back to my old standby, Emacs, and its replace-regexp function behaves identically to Notepad++.

    That said, this is quite counter-intuitive to someone that spends a lot of time at the Unix/Linux command line. IM (probably Not So) HO, a line should be treated like a line whether it’s in a GUI editor or not unless a specific option is set.

    Also, my assumption was that the ‘.’ in ‘. matches newline’ didn’t specifically refer to the ‘.’ character but was used as shorthand for ‘any character,’ and that it would affect the behavior of a character collection.

    Live and learn.



  • @Jeff-Drumm I feel your pain. It was certainly not intuitive to me after tens of years of AWK scripting (can’t speak for how PERL, Python, or other such languages work).



  • Hello Jeff, dail, Jim and All,

    Like Dail and Jim, I spent some time to fully understand the problem !! Of course, I’ll narrow my explanations to the N++ regex engine, based on the BOOST library, v1.55. But, as my post is a bit long, Jeff and All, just let’s have a drink !!


    First of all, Jeff, you’ll find good documentation, about the Boost C++ Regex library, v1.55.0 ( its Perl-style syntax is similar to that of v1.48.0, which the links below document ), used by Notepad++, since its 6.0 version, at the TWO addresses below :

    http://www.boost.org/doc/libs/1_48_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

    http://www.boost.org/doc/libs/1_48_0/libs/regex/doc/html/boost_regex/format/boost_format_syntax.html

    • The FIRST link explains the syntax, of regular expressions, in the SEARCH part

    • The SECOND link explains the syntax, of regular expressions, in the REPLACEMENT part

    We may, also, look for valuable information, on the sites, below :

    http://www.regular-expressions.info

    http://www.rexegg.com

    http://perldoc.perl.org/perlre.html


    Secondly, about NEGATIVE classes :

    A NEGATIVE class [^....] matches any character which is DIFFERENT from all the characters of its POSITIVE equivalent class […]

    For instance, the regex [^.0-9|] matches any character different from all the characters of the regex [.0-9|] ( that is to say the dot, any usual digit or the pipe character )

    This implies that the negative class [^.0-9|] can obviously match the space character, letters and some punctuation or mathematical characters, but, also, ALL the control characters between \x00 and \x1f ( so the End of Line characters \r and \n ), as well as any UNICODE character ( Latin, Greek, Russian, Arabic …, Symbols, Arrows, Ideograms, Dingbats … ) which is different from the dot, the pipe character and the digits 0 to 9 !!

    So, you, really, must pay attention when using NEGATIVE classes and try to limit the number of possible matches, in your text !

    In your case, Jeff, it’s normal that, thinking about the possible extensions, you thought about any range of standard character(s), different from a dot and simply wrote the regex [^.]+. Moreover, as the characters not allowed, in Windows filenames, are the characters /\:*?"<>|, you could even have narrowed down the search to all but these 9 characters, in addition to the dot, with the regex [^./\\:*?"<>|] ( Note : the backslash must be, itself, escaped with another backslash, inside the class )

    However, due to the definition of a negative class, given above, and, as Dail said, that would not be enough, because the End of Line characters would still be allowed. The correct regex is, rather, [^.\r\n]+ or [^\r\n./\\:*?"<>|]+
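
    To check which characters a negative class really accepts, here is a small sketch in Python (an assumption for illustration; the candidate list is hypothetical, and Python's re module is not the Boost engine, although both treat a negated class the same way on these characters).

```python
import re

negated = re.compile(r"[^.0-9|]")

# Space, letter, CR, LF, NUL, e-acute, right arrow, then the excluded ones.
candidates = [" ", "a", "\r", "\n", "\x00", "\u00e9", "\u2192", ".", "5", "|"]

matched = [ch for ch in candidates if negated.fullmatch(ch)]
rejected = [ch for ch in candidates if not negated.fullmatch(ch)]

print([repr(ch) for ch in matched])   # includes '\r' and '\n'
print([repr(ch) for ch in rejected])  # only the dot, the digit and the pipe
```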


    Thirdly, about the . matches newline option :

    As you guessed it, Jeff, this option controls the behaviour of the DOT meta-character, in N++ regexes :

    • If this option is NOT checked, then a dot, matches for any UNICODE character, different from the three characters \f ( Form Feed ), \r ( Carriage Return ) and \n ( Line Feed )

    • If this option is CHECKED, then a dot, matches, absolutely, any UNICODE character. So, if the cursor is located at the very beginning of the file, the simple regex .+ would select all the contents of the current file, exactly like the CTRL + A shortcut would !

    But, the nice thing is that, you may change this behaviour, dynamically, inserting the modifier (?s), and/or its opposite form (?-s) in your regexes !

    For instance, the search regex (?-s)(abc.*\R)(?s).+?(?-s)(^.*xyz) and the replacement \1\2 would delete any non null amount of text, EVEN on several lines, between a line, that would contain the string abc and the first line , downwards, that would contain the string xyz !

    Notes :

    • The beginning of the regex (?-s)(abc.*\R), matches for a string abc and all the following characters of that line, exclusively, including its End of Line character(s)

    • The middle of the regex (?s).+?, matches for a non null range of any Unicode characters, EVEN IF they are located on several lines, till the nearest match of the sub-regex, below :

    • The end of the regex (?-s)(^.*xyz), matches for any range of characters, from the beginning,till the string xyz, of that same line, exclusively
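
    The same idea can be sketched in Python, with some loudly stated assumptions: Python's re module is not the Boost engine, it has no \R, and recent versions reject (?s) in the middle of a pattern. So this sketch uses the scoped group (?s:...) (Python 3.6 or later) and \r?\n in place of \R. The sample text is hypothetical.

```python
import re

# Dot does not match \n by default in Python, which plays the role of (?-s).
# (?s:...) switches "dot matches newline" on for the middle part only.
pattern = re.compile(r"(abc.*\r?\n)(?s:.+?)(^.*xyz)", re.MULTILINE)

# Hypothetical sample with LF line endings.
text = (
    "keep 1\n"
    "start abc here\n"
    "drop me\n"
    "drop me too\n"
    "end xyz here\n"
    "keep 2\n"
)

# Keep the "abc" line and the "xyz" line, delete what lies between them.
result = pattern.sub(r"\1\2", text)
print(result)
```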


    Finally, given the example text, below, we’ll try to understand, Jeff, what your damned regex \.[^.]+$ really matches :-)). In order to be clear for everybody, once you copy/paste this text example in a new tab, just click on the Show All characters button, or use the menu option View - Show Symbol - Show All characters !

    
    
    Test.123.doc  ( Line 3 )
    
    blabla        ( Line 5 )
    
    
    Jeff.test     ( Line 8 )
    
    
    
    Example.txt   ( Line 12 )
    

    In fact, in its present form, your regex asks the regex engine to find a literal dot followed by a non null range of characters, different from a dot, till an End of Line character(s) ( CRLF, LF or CR ). So, after a first click on the Find Next button, it matches from the last literal dot character of the line 3, till the two End of Line characters ( CRLF ) of the empty line 6

    Indeed, the two End of Line characters of the line 6, are immediately followed by 2 other End of Line characters of the empty line 7. And, as expected, the boundary $ does represent the zero length position between one of the searched characters ( the last character LF of line 6 ) and the Windows End of Line characters CRLF, of the empty line 7 !

    However, you could tell me : But why does the match stop at line 6, and not at the end of line 7, for example ? Well, because, in that case, the next End of Line ( the $ boundary ), that is possible to reach, would be the 2 End of Line characters of the line 8. But this case is impossible as it would have, also, matched the dot character of the filename “Jeff.test”, which is strictly forbidden ! ( Remember the regex [^.]+ )

    To be totally convinced, now, just delete the dot character, from the filename Jeff.test, in order to get the file, without extension, Jefftest. Move back to the very beginning of the example file and run, again, the search :

    This time, you should understand why it matches from the last literal dot character of the line 3, till the two End of Line characters ( CRLF ), of the empty line 10


    In other words, the $ assertion is not exactly the boundary between the last standard character of a line and its End of Line characters, but, simply, the boundary between a ( or one of the ) searched character(s) and an End of Line ( assuming that all the other parts, of the regex, are, also, matched !! )
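
    The two experiments above can be replayed with a short Python sketch. Assumptions, stated plainly: Python's re module stands in for the Boost engine, the lines use LF instead of CRLF ( Python's $ only recognises \n ), the "( Line n )" annotations are dropped, and the helper name last_line_consumed is hypothetical.

```python
import re

PATTERN = re.compile(r"\.[^.]+$", re.MULTILINE)

def last_line_consumed(lines):
    """Return the 1-based number of the last line whose line break
    lies inside the first match of PATTERN."""
    text = "\n".join(lines) + "\n"
    match = PATTERN.search(text)
    # Every newline before match.end() closes one line of the file.
    return text.count("\n", 0, match.end())

with_dot = ["", "", "Test.123.doc", "", "blabla", "", "",
            "Jeff.test", "", "", "", "Example.txt"]
without_dot = ["", "", "Test.123.doc", "", "blabla", "", "",
               "Jefftest", "", "", "", "Example.txt"]

end_with_dot = last_line_consumed(with_dot)
end_without_dot = last_line_consumed(without_dot)

print(end_with_dot, end_without_dot)
```

    In both runs the match starts at the last dot of line 3; only the line where it stops changes.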

    I must admit, that it’s a bit strange but the $ boundary may be, sometimes, the position between an End of Line character ( searched ) and other character(s) End of Line. But that’s just logic ! Thanks to you, guys, I’m finding out something new about regexes ! Marvellous :-)))


    Of course, I already knew the syntax \.[^.\r\n]+$, given by Dail, which allows, after a literal dot, the search of standard characters, exclusively, till the end of a line. But, I did NOT know the second dail syntax \.[^.]+?$, which is shorter than the former ! Quite a clever solution :-))

    Indeed, due to the question mark before the $ assertion, this regex matches, after a literal dot character, any character, different from a dot, till the nearest End of Line character(s), excluded. So, by that means, the search is, also, restricted to a single line, ONLY !

    Best Regards,

    guy038

    P.S. :

    We could have the same reasoning about the ^ beginning of line assertion :

    Given the regex ^[^.]+\. and the simple text, below :

    
    Line 2
    Line 3
    
    
    Line 6 ( Test.123.doc )
    

    It would match from the first End of Line character of the empty line 1 ( CR ) till the FIRST literal dot of the line 6, between the word Test and the number 123.

    So, from that example, it’s easy to notice that the ^ boundary is the position between an End of Line character of the previous line ( or the beginning of the file ) and the first allowed character ( the \r character, in our example ), whereas the last character matched by [^.]+ is the letter t that ends the word Test, down on line 6, just before the first dot character !
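
    A short Python sketch of the same ^ experiment, under the usual assumptions: Python's re module instead of the Boost engine, and LF line endings instead of CRLF, so the first character matched is \n rather than \r.

```python
import re

# The six-line sample above, with LF line endings.
text = "\nLine 2\nLine 3\n\n\nLine 6 ( Test.123.doc )\n"

match = re.search(r"^[^.]+\.", text, flags=re.MULTILINE)

# The match begins at the very start of the buffer (the empty line 1)
# and runs across five line breaks up to the first dot on line 6.
matched_text = match.group()
print(match.start(), repr(matched_text))
```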


    P.P.S :

    Somewhere, above, in that post, I said …assuming that all the other parts of the regex are, also, matched…. We, generally, forget this obvious rule that a regex is fully matched, ONLY IF ALL parts, of this regex, have been matched !!

    For instance, let’s imagine the simple sentence this is a test to see what happens_with_that_regex and the search regex (.+?)a\w+$, which can be split, in the following four sub-regexes :

    (.+?) , a, \w+ and $

    At first sight, you would say, ( and me too ! ) that the first lowercase letter a, after the lazy quantifier +?, in the regex, would match the first a ( the article ) of the subject string ? NOT at ALL :-((

    Indeed :

    • It CAN’T be the first or the second a , of the subject string, because the third sub-regex \w+, would have been forced to match some space characters, which are definitively NOT word characters !

    • It CAN’T be, too, the last a character , in the word that, of the subject string, because of the lazy quantifier +?, in the first sub-regex, which forces to get the closest letter a, assuming that the remainder of the regex, \w+$, is also matched !

    • Finally, it’s the third letter a, of this example, in the word happens, which is matched, by the letter a of the regex !

    To be convinced, just do a S/R, with, for instance, the following replacement regex $0>\1<

    In other words, you may, also, consider that the string, matched by the regex a\w+$, is the CLOSEST string, after the string matched by the first sub-regex (.+?) :-))

    Remark :

    • If you suppress the question mark, in the regex above, you, now, get the greedy quantifier +. So, this time, the a character, in the regex, would match the fourth letter a of the subject string, in the word “that” !
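
    Both the lazy and the greedy variant can be checked with a few lines of Python ( an assumption for illustration; Python's re module is not the Boost engine, but it backtracks the same way on this example ).

```python
import re

sentence = "this is a test to see what happens_with_that_regex"

lazy = re.search(r"(.+?)a\w+$", sentence)
greedy = re.search(r"(.+)a\w+$", sentence)

# Group 1 is everything before the letter a that the regex settled on.
lazy_prefix = lazy.group(1)
greedy_prefix = greedy.group(1)

print(repr(lazy_prefix))    # ends just before the a of "happens"
print(repr(greedy_prefix))  # ends just before the a of "that"
```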
