Replace - Regular Expression - need help on some regex snippets



  • Hi!

    I have to do a lot of Search & Replace all the time, but many of my RegEx snippets don’t work in some cases or don’t work at all. It would be nice if someone could take a look at them and:

    • correct and tell what I did wrong
    • explain the corrected code
    • tell, if they are fail-safe or if they break in certain situations

    1 . multiple replace (f)
    Example: ä Ä ż Ż → ae Ae z Z
    Find: (ä)|(Ä)|(ż)|(Ż)
    Replace: (?1ae)(?2&Ae)(?3z)(?4Z)

    2 . replace text but keep numbering (f) (z)
    Example:
    jnfvdsertzuikl 0006.jpg → blub 6.jpg
    mnbvcxdfrtgzhjuki 0007.jpeg → blub 7.jpeg
    mnbdsderzui 0008.bmp → blub 8.bmp
    Find: .*([0-9]+)(\.\w+)$
    Replace: blub $1\2

    3 . replace but increment value (w) (z)
    Example:
    ...images/69thStreet010.jpg... → ...images/xyz001.jpg
    ...images/69thStreet011.jpg... → ...images/xyz002.jpg
    ...images/69thStreet012.jpg... → ...images/xyz003.jpg
    Find: images/69thStreet([0-9]+).jpg
    Replace: images/xyz$1.jpg

    4 . remove all lines except those containing URI (n)
    This bastard is really tricky! I haven’t yet figured out how to get it done. Maybe this is beyond the possibilities of RegEx.

    Example:
    [08 Jun 15 21:57] * Joe: Hi
    check this out: http://files.example.com/blah/blub/picture.jpg
    [08 Jun 15 21:59] * John: Nice
    what about tis one
    https://www2.files3.example.com/asdf/qwerty/video.mp4

    Find: ^(?!.*http(s)?:\/\/(\w+\.)+\w+\/(\w+\/)*\w+\.\w+)$
    Replace: _clear_

    I had even more snippets, but I haven’t saved them and can’t remember what they were about.

    ______________________________
    (f) fails with some expressions
    (n) doesn’t work at all
    (w) has worked so far
    (z) leading zeros are stripped



  • Hello Pete,

    With the PCRE regex engine, introduced first, in N++ v6.0, by Dave Brotherstone, we can perform very powerful and intelligent searches/replacements !

    Concerning your first S/R, about accented characters, the syntax of simultaneous replacements is quite correct. However, I’m wondering if, in the replacement part, the right form should be (?1ae)(?2Ae)(?3z)(?4Z) ( instead of (?1ae)(?2&Ae)(?3z)(?4Z) ! )

    I can’t see any way for that S/R to fail, as the letters to search for are clearly independent, provided the Match case option of the Replace dialog is CHECKED. If not, all the replacements would be lower case, since, in your S/R, the alternative (ä) is placed before (Ä), and likewise for the letter ż.

    So, could you give me an example where this first S/R fails ?
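
    For comparison outside Notepad++, here is a minimal sketch of the same multiple replacement in Python. Python's re module has no (?1ae) conditional replacement syntax, so the sketch swaps in a lookup table and a replacement function; the names REPLACEMENTS, PATTERN and transliterate are made up for the illustration.

```python
import re

# Hypothetical lookup table: one entry per character to replace.
REPLACEMENTS = {"ä": "ae", "Ä": "Ae", "ż": "z", "Ż": "Z"}

# Build the alternation ä|Ä|ż|Ż from the table keys.
PATTERN = re.compile("|".join(map(re.escape, REPLACEMENTS)))

def transliterate(text):
    # Matching is case-sensitive by default, like "Match case" being checked.
    return PATTERN.sub(lambda m: REPLACEMENTS[m.group()], text)

print(transliterate("ä Ä ż Ż"))
```

    Because the match is case-sensitive, ä and Ä hit different table entries, which mirrors the Match case remark above.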


    Let’s see your second S/R. It’s an interesting example, as it allows us to speak about lazy and greedy quantifiers. For instance, if we consider the subject string 12345a123456789b123456789b123456789b12345, just see the difference between the two simple regexes a.*b and a.*?b

    • In the first case, the regex engine matches the longest string between the first found a and a letter b, so the string a123456789b123456789b123456789b

    • In the second case, the regex engine matches the shortest string between the first found a and a letter b, so the string a123456789b

    • In the first case, the dot does match the letter b, too. So, it will stop only when it reaches the last letter b of the current line. It’s the standard behaviour and one speaks of greedy quantifiers.

    • In the second case, due to the question mark after the quantifier, it will stop when it reaches the first found letter b, and one speaks of lazy quantifiers. If your regex engine doesn’t support this syntax, you need to use the special regex a[^b\r\n]*b to get a similar result !
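
    The greedy/lazy difference described above can be checked quickly with Python's re module (an assumption for the illustration; Notepad++ itself uses the Boost engine, but these three patterns behave the same way in both):

```python
import re

subject = "12345a123456789b123456789b123456789b12345"

# Greedy: .* runs to the last b on the line.
greedy = re.search(r"a.*b", subject).group()

# Lazy: .*? stops at the first b after the a.
lazy = re.search(r"a.*?b", subject).group()

# Negated character class: same result as the lazy form, without the ? syntax.
negated = re.search(r"a[^b\r\n]*b", subject).group()

print(greedy)
print(lazy)
print(negated)
```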

    So, if we consider the string jnfvdsertzuikl 0006.jpg, with your search regex .*([0-9]+)(\.\w+)$ :

    • The .* represents the string jnfvdsertzuikl 000

    • The ([0-9]+) is, only, the number 6 ( the last digit, before the file extension )

    Now, using the syntax .*?([0-9]+) will include the leading zeros, in the replacement, because :

    • The .*? stands for the string jnfvdsertzuikl, with a space at the end

    • The ([0-9]+) will contain, this time, all the digits ( the 0006 string, before the file extension )

    Moreover, adding the extension part, in the regex, is useless, as it’s kept unchanged during the S/R. So, assuming that any line ends with a number followed by a file extension, and that no other digits appear earlier on the line or in the extension, your 2nd S/R can be shortened to :

    SEARCH = .*?([0-9]+) OR .*?(\d+) and REPLACE = blub $1 OR blub \1

    With your syntax, the two strings mnbdsderzui 0008.bmp and mnbdsderzui 0018.bmp would have given the same result blub 8.bmp !

    If you prefer that the file name abc 0002.jpg gives blub 2.jpg and xyz0057.bmp gives blub 57.bmp, we’ll choose the search regex, below :

    SEARCH = .*?([1-9][0-9]*) OR .*?([1-9]\d*) and REPLACE = blub $1 OR blub \1

    This regex matches any possible sequence of characters till the first found digit, different from 0, followed by a possible sequence of digits. In that regex, the possible leading zeros are included in the .*? form, at the beginning of the regex.
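
    To compare the three search regexes of this second S/R side by side, here is a small sketch using Python's re module rather than Notepad++. The function names are made up for the illustration, and count=1 is an assumption that limits each call to one replacement per file name.

```python
import re

def greedy(name):
    # Original attempt: the greedy .* leaves only the last digit for group 1.
    return re.sub(r".*([0-9]+)(\.\w+)$", r"blub \1\2", name, count=1)

def lazy(name):
    # Lazy .*? stops before the first digit, so leading zeros are kept.
    return re.sub(r".*?([0-9]+)", r"blub \1", name, count=1)

def lazy_no_zeros(name):
    # Group 1 must start with 1-9, so leading zeros fall into the .*? part.
    return re.sub(r".*?([1-9][0-9]*)", r"blub \1", name, count=1)

for name in ("jnfvdsertzuikl 0006.jpg", "mnbdsderzui 0018.bmp", "xyz0057.bmp"):
    print(name, "|", greedy(name), "|", lazy(name), "|", lazy_no_zeros(name))
```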


    Concerning your 3rd S/R, I’m quite surprised when you say that this S/R could increment numbers !? I rather think that it’s just one of the limits of regexes ! For instance, changing abc 003.jpg into abc 010.jpg ( offset = +7 ), def 027.bmp into def 034.bmp ( +7 ), and so on…, seems only possible with a Python or Gawk script !

    Of course, you could use the same structure, as in your 1st S/R: SEARCH (003)|....|(027)|.... and REPLACEMENT ....(?{3}010)....(?{27}034)...., but it would be very tedious for a big range of values !

    BTW, note that, in the replacement part, I enclosed the group number between braces, to separate the group number from the digits to replace. Indeed, without braces, the conditional replacement (?27034) would be quite ambiguous. Would it mean :

    • If group 2 matches, replace with string 7034

    • If group 27 matches, replace with string 034

    • If group 270 matches, replace with string 34

    Anyway, this 3rd S/R doesn’t give the same replacement file names, that you gave, in your post !

    Instead of your sequence images/xyz001.jpg, images/xyz002.jpg, images/xyz003.jpg,…, it, simply, gives the sequence images/xyz010.jpg, images/xyz011.jpg, images/xyz012.jpg !
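
    As a rough idea of what such a script could look like for this third S/R, here is a minimal Python sketch. The function name renumber and the offset of -9 are assumptions chosen so that 010 becomes 001, as in the examples of the original post. Unlike the original search regex, the dot before jpg is escaped.

```python
import re

# Same search as the 3rd S/R, with the dot before "jpg" escaped.
PATTERN = re.compile(r"images/69thStreet([0-9]+)\.jpg")

def renumber(text, offset):
    def repl(match):
        digits = match.group(1)
        value = int(digits) + offset
        # Pad with zeros to the original width, so 010 - 9 gives 001.
        return f"images/xyz{value:0{len(digits)}d}.jpg"
    return PATTERN.sub(repl, text)

print(renumber("...images/69thStreet010.jpg...", -9))
print(renumber("...images/69thStreet012.jpg...", -9))
```

    The arithmetic happens in the replacement function, which is exactly the part that a plain regex replacement cannot do.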


    Finally, let’s study your fourth and final S/R ( The bastard one !! ) I preferred to split the problem in two parts :

    • Firstly, find a suitable regex to exactly match YOUR specific kind of URL

    • Secondly, find a regex to delete any line, which DOESN’T contain your kind of URL

    From the two examples of URL, that you gave in your post, these are made up of 4 parts

    • The string http:// or https://

    • The name of a site

    • An absolute pathname

    • A picture file name

    However, the three last parts contain only lowercase letters, digits, the slash or the dot. So, a first try would give the regex https?://[\w/.]+. But, as some links may contain dashes, hash characters ( # ) and the percent sign ( as %20 for space ), I would rather use the regex https?://[\w/.#%-]+

    Note that the strict regex, which would allow lowercase letters only, would be https?://[\l\d_/.#%-]+
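
    The URL regex above can be tried on the lines of the original post with a short Python sketch (Python's re module is an assumption here, not the engine used by Notepad++; the \l form is left out because Python does not have that escape):

```python
import re

URL = re.compile(r"https?://[\w/.#%-]+")

lines = [
    "[08 Jun 15 21:57] * Joe: Hi",
    "check this out: http://files.example.com/blah/blub/picture.jpg",
    "https://www2.files3.example.com/asdf/qwerty/video.mp4",
]

for line in lines:
    match = URL.search(line)
    # Print the matched URL, or None when the line has no URL.
    print(match.group() if match else None)
```

    Note that the character class contains the dot, so a full stop typed right after a link would be swallowed into the match.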


    Now, to get the final behaviour, we just have to use, from the start of the line ( assertion ^ ), a negative look-ahead to detect the NO-match of this URL, further on the current line. In that case, we’ll take all the characters of the current line .*, as well as its End of Line character(s), to delete them in the replacement part.

    Therefore, the correct regex should be : SEARCH = ^(?!.*https?://[\w/.#%-]+).*\R and REPLACE = NOTHING

    • The first part ^(?!.*https?://[\w/.#%-]+), which is evaluated at position 1 of each line, looks for a possible URL inside the current line. Be aware that matches, in look-arounds, are never part of the final match.

    • So, the final part .*\R does stand for the entire line, followed by its End of Line characters, which need to be deleted

    As a reminder, the \R syntax matches, among other things, the string \r\n in Windows files, the \n string in Unix files, or \r in some old Mac files
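
    The same line filter can be sketched in Python to check the logic outside Notepad++. Two adaptations are assumptions of this sketch: Python's re module has no \R, so \r?\n is used instead ( Unix or Windows line endings only ), and [^\r\n] replaces the dot so that a match never runs past a carriage return. The names URL, NO_URL_LINE and log are made up for the illustration.

```python
import re

URL = r"https?://[\w/.#%-]+"

# A line with no URL ahead of its start, plus its line ending.
NO_URL_LINE = re.compile(rf"^(?![^\r\n]*{URL})[^\r\n]*\r?\n", re.MULTILINE)

log = (
    "[08 Jun 15 21:57] * Joe: Hi\n"
    "check this out: http://files.example.com/blah/blub/picture.jpg\n"
    "[08 Jun 15 21:59] * John: Nice\n"
    "what about tis one\n"
    "https://www2.files3.example.com/asdf/qwerty/video.mp4\n"
)

# Replace every URL-free line with nothing.
kept = NO_URL_LINE.sub("", log)
print(kept)
```

    As in the Notepad++ version, a last line without any End of Line character would not be matched, so it would stay in place even if it contains no URL.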


    Hoping that some parts of my long post will be useful to you,

    Best Regards,

    guy038

    P.S. :

    You’ll find good documentation, about the new Boost C++ Regex library, v1.55.0 ( similar to the Perl Compatible Regular Expressions, v1.48.0 ), used by Notepad++, since its 6.0 version, at the TWO addresses below :

    http://www.boost.org/doc/libs/1_48_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

    http://www.boost.org/doc/libs/1_48_0/libs/regex/doc/html/boost_regex/format/boost_format_syntax.html

    • The FIRST link explains the syntax, of regular expressions, in the SEARCH part

    • The SECOND link explains the syntax, of regular expressions, in the REPLACEMENT part



  • Hi Pete,

    I’m NOT able to modify an already posted message :-(, I always get the weird message :

    Error

    You are only allowed to edit posts for
    5 second(s) after posting

    Don’t understand what that means ? Anyway, the final dotted line, of my initial post, must be understood :

    • So, the final part .*\R does stand for the entire line, followed by its End of Line characters, which need to be deleted

    Cheers,

    guy038