Need help, please - regular expressions



  • I can’t get this to work:
    example source:

    <img lorem-ipsum-dolor="/lorem/ipsum/dolor-2015-and/123456789012345/lorem_ipsum/1a2b3c4dd9651/a23w34m87.jpg" class=“lorem123” src=“11LOREM%202%20%20IpSuM%20Dolor%20sit%20%20amet%20consecteur%20-%20AdiPISCIng%20123456%20elit%20Curabitur%20QWERTY%20202020%20yes%20urna%20Interdeum%20%20Off%20Cras_files/a01b02c68.png” alt=“a01b02c68.bmp” height=“101” width=“102”>

    regex search:

    (?:<img lorem-ipsum-dolor)(?:.?)( class=|)(".?")( src=)(?:.?)(?:_files/)(.?.[jpg|png|tif|gif]")

    replace:

    <img\1\2\3"\4

    and I get this:

    <img"/lorem/ipsum/dolor-2015-and/123456789012345/lorem_ipsum/1a2b3c4dd9651/a23w34m87.jpg" class=“lorem123” src=“a01b02c68.png” alt=“a01b02c68.bmp” height=“101” width=“102”>

    This part should not be there:

    “/lorem/ipsum/dolor-2015-and/123456789012345/lorem_ipsum/1a2b3c4dd9651/a23w34m87.jpg”

    It should look like:

    <img class=“lorem123” src=“a01b02c68.png” alt=“a01b02c68.bmp” height=“101” width=“102”>

    What am I doing wrong?



  • The asterisks got cut from the regex at the top, sorry

    regex search:
    (?:<img lorem-ipsum-dolor)(?:.*?)( class=|)(".*?")( src=)(?:.*?)(?:_files/)(.*?.[jpg|png|tif|gif]")



  • Very close. You had one minor issue. This:

    ( class=|)

    Should be

    ( class=)



  • Also, take a look at your original expression here. You can see the regular expression is allowing it to skip class=.

    Note: that website might not use the exact same regular expression engine but should be close enough to reference.



  • Trying to be brief backfired again: I wanted to keep it short and it came out badly.
    The expression also has to match this:

    <img lorem-ipsum-dolor="/lorem/ipsum/dolor-1999-and/123456789012345/lorem_ipsum/1a2b3c4dd9651/c011XX001.tif" src=“11LOREM%202%20%20IpSuM%20Dolor%20sit%20%20amet%20consecteur%20-%20AdiPISCIng%20123456%20elit%20Curabitur%20QWERTY%20202020%20yes%20urna%20Interdeum%20%20Off%20Cras_files/c01vv0x01.jpg” alt=“c01vv0x01.jpeg” height=“567” width=“789”>

    IMHO ( class=|) means with it or without it. I emphasize, IMHO.
    Without the empty alternative, the second example is not matched. Because | means OR, right?
    At least that is how it looks to me, sorry…



  • You are right. ( class=|) can skip it. If you look at the image I linked, if it skips group 1, it must still “consume” something for group 2. So you would want a regular expression that only captures group 2 if 1 exists.

    So this section in your expression:

    ( class=|)(".*?")

    Should be replaced with something like:

    (?:( class=)(".*?"))?

    Note this isn’t perfect but should point you in the right direction.
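
    A minimal sketch of the difference, using Python's `re` module instead of Notepad++'s regex engine and a shortened, hypothetical sample line with plain ASCII quotes (both are assumptions, so details may differ in the editor). It only exercises the front part of the expression, up to ` src=`:

```python
import re

# Shortened stand-in for the sample line, with plain ASCII quotes (assumption).
line = ('<img lorem-ipsum-dolor="/lorem/a23w34m87.jpg" class="lorem123" '
        'src="Cras_files/a01b02c68.png" alt="a01b02c68.bmp">')

# Front part of the original search: group 1 is allowed to be empty.
original = re.compile(r'(?:<img lorem-ipsum-dolor)(?:.*?)( class=|)(".*?")( src=)')

# Same front part with the optional non-capturing wrapper suggested above.
suggested = re.compile(r'(?:<img lorem-ipsum-dolor)(?:.*?)(?:( class=)(".*?"))?( src=)')

original_groups = original.match(line).groups()
suggested_groups = suggested.match(line).groups()

print(original_groups)   # group 1 is empty, group 2 starts at the first quote
print(suggested_groups)  # group 1 is ' class=', group 2 is only the class value
```

    With the original form, group 1 matches nothing and group 2 runs from the first quote through the class value, which is how the unwanted path ends up in the replacement.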



  • In the picture it looks great. In practice, the part at the front is always left in :(



  • (?:<img lorem-ipsum-dolor)(?:.*?)(?>( class=|)(".*?"))?(".*?")( src=)(?:.*?)(?:_files/)(.*?[jpg|png|tif|gif]")

    With an atomic group it works.
    Thanks for pointing me in the right direction!



  • I’m sorry, I was blind: it doesn’t work.



  • Finally
    <code>(?:<img lorem-ipsum-dolor)(?:.*?)(?:( class=)(".*?")( src=)|( src=))(?:.*?)(?:_files/)(.*?[jpg|png|tif|gif]")</code>

    PS. I don’t know why mine didn’t show up in red :)



  • Hello Jan and Dail,

    Jan, I didn’t try your regex S/R first, or attempt to fully understand it. I just noticed two points :

    • After copying your example source into my Notepad++, the delimiters of the different attribute values ( class, src, alt, height and width ) are the pair “.....”, that is to say the LEFT DOUBLE QUOTATION MARK ( Unicode code-point \x{201c} ) and the RIGHT DOUBLE QUOTATION MARK ( Unicode code-point \x{201d} ). These characters are different from the usual QUOTATION MARK " ( \x22 )

    Therefore, the regex, proposed below, is based on these two characters \x{201c} and \x{201d}

    • Seemingly, your picture files can have the .jpg, .png, .tiff or .gif extension. However, the regex you use to match these extensions ( [jpg|png|tif|gif] ) is WRONG, because the | symbol is taken literally between square brackets ! This syntax is a character class, which matches one single character : either the pipe symbol ( | ) or one of the letters j, p, g, n, t, i, f ( in either case, when the search is case-insensitive ). In other words, this part of your regex could simply be rewritten [fgijnpt|]

    So, the correct regex is simply (jpg|png|tiff|gif) : one extension, among the four possible ones !
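
    A short sketch of that difference, written with Python's `re` module as a stand-in for the Notepad++ engine (an assumption; the bracket rule itself is the same). The file names in `names` are hypothetical:

```python
import re

char_class = re.compile(r'[jpg|png|tif|gif]')       # one character from a set
equivalent = re.compile(r'[fgijnpt|]')              # the same set, written once
alternation = re.compile(r'(?:jpg|png|tiff|gif)')   # one whole extension

# Every member of the set matches on its own, including the pipe symbol.
single_chars = [c for c in 'fgijnpt|' if char_class.fullmatch(c)]

# Hypothetical file names, each followed by the closing quote.
names = ['a01b02c68.png"', 'a01b02c68.bmp"', 'notes.txt"', 'readme.md"']

loose = re.compile(r'.*?[jpg|png|tif|gif]"')         # bracket version
strict = re.compile(r'.*?(?:jpg|png|tiff|gif)"')     # alternation version

loose_hits = [n for n in names if loose.fullmatch(n)]
strict_hits = [n for n in names if strict.fullmatch(n)]

print(single_chars)
print(loose_hits)    # anything whose last letter is in the set
print(strict_hits)   # only real jpg/png/tiff/gif names
```

    The bracket version accepts any name whose last letter happens to be in the set, which is why it appears to work on the sample lines.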


    Then, I propose the following regex S/R, below :

    SEARCH (?i-s).*?(class=“.*?” src=“).*?_files/(.*?(jpg|png|tiff|gif)) ( with a space before the src attribute )

    REPLACE <img \L\1\2

    Notes :

    • The two modifiers (?i-s) make the match case-insensitive ( i ) and stop the dot from matching line-break characters ( -s ). In the replacement, the two groups \1 and \2 are rewritten in lower case, due to the \L syntax

    • Each of the four forms .*? matches the shortest possible run of characters before the string that follows it

    • All text before the first string class on a line is matched OUTSIDE the round brackets, and is therefore deleted by the replacement

    • Group \1 is the string class=“…” src=“ and group \2 is the name of the picture, with its extension. Both are rewritten in lower case, after an initial <img string.

    If you really need the line to begin with the string <img lorem-ipsum-dolor, just change the search regex into :

    SEARCH (?i-s)<img lorem-ipsum-dolor.*?(class=“.*?” src=“).*?_files/(.*?(jpg|png|tiff|gif))
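
    To see how much this S/R depends on the exact quote characters, here is a rough Python equivalent of the first SEARCH/REPLACE pair. It is a sketch under assumptions: Python's `re` has no `\L` in replacements, so a small helper (hypothetical name `lower_groups`) lowercases the two groups instead; `(?i-s)` is reduced to `(?i)` because the dot already stops at line breaks by default in Python; and both sample lines are shortened stand-ins.

```python
import re

LQ, RQ = '\u201c', '\u201d'   # left and right double quotation marks

# Python rendering of the first SEARCH regex, built from the two quote characters.
search = re.compile(
    r'(?i).*?(class=' + LQ + r'.*?' + RQ + r' src=' + LQ
    + r').*?_files/(.*?(jpg|png|tiff|gif))'
)

def lower_groups(m):
    # Stand-in for the replacement "<img \L\1\2".
    return '<img ' + (m.group(1) + m.group(2)).lower()

# Hypothetical shortened lines: one with curly quotes, one with ASCII quotes.
curly = ('<img lorem-ipsum-dolor="/lorem/a23w34m87.jpg" class=' + LQ + 'Lorem123' + RQ
         + ' src=' + LQ + 'Cras_files/A01B02C68.png' + RQ + '>')
straight = ('<img lorem-ipsum-dolor="/lorem/a23w34m87.jpg" class="Lorem123" '
            'src="Cras_files/A01B02C68.png">')

curly_result = search.sub(lower_groups, curly)
straight_result = search.sub(lower_groups, straight)

print(curly_result)      # rewritten and lowercased
print(straight_result)   # unchanged: the regex never sees a curly quote
```

    A line that uses plain ASCII quotes is left untouched, because the pattern insists on the two curly quote characters.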

    Best Regards,

    guy038



  • Thank you very much for the analysis. I know the wrong brackets have no right to work, but in this specific example they do.
    Example:
    <img lorem-ipsum-dolor="/lorem/ipsum/dolor-2015-and/123456789012345/lorem_ipsum/1a2b3c4dd9651/a23w34m87.jpg" class="lorem123" src="11LOREM%202%20%20IpSuM%20Dolor%20sit%20%20amet%20consecteur%20-%20AdiPISCIng%20123456%20elit%20Curabitur%20QWERTY%20202020%20yes%20urna%20Interdeum%20%20Off%20Cras_files/a01b02c68.png" alt="a01b02c68.bmp" height="101" width="102">

    <img lorem-ipsum-dolor="/lorem/ipsum/dolor-1999-and/123456789012345/lorem_ipsum/1a2b3c4dd9651/c011XX001.tif" src="11LOREM%202%20%20IpSuM%20Dolor%20sit%20%20amet%20consecteur%20-%20AdiPISCIng%20123456%20elit%20Curabitur%20QWERTY%20202020%20yes%20urna%20Interdeum%20%20Off%20Cras_files/c01vv0x01.jpg" alt="c01vv0x01.jpeg" height="567" width="789">

    Regex:
    (?:<img lorem-ipsum-dolor)(?:.*?)(?:( class=)(".*?")( src=)|( src=))(?:.*?)(?:_files\/)(.*?[jpg|png|tif|gif]")
    replace:
    <img\1\2\3\4"\5

    After changing to the correct brackets, also works:
    (?:<img lorem-ipsum-dolor)(?:.*?)(?:( class=)(".*?")( src=)|( src=))(?:.*?)(?:_files\/)(.*?(jpg|png|tif|gif)")
    but then there are 6 groups; the sixth one simply does not need to be referenced in the replacement.
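
    The search and replacement above can be tried outside the editor. This is a sketch with Python's `re` module, which is an assumption since Notepad++ uses a different engine; the two lines are shortened versions of the examples, and in Python 3.5+ a group that did not take part in the match is replaced by an empty string.

```python
import re

search = re.compile(
    r'(?:<img lorem-ipsum-dolor)(?:.*?)(?:( class=)(".*?")( src=)|( src=))'
    r'(?:.*?)(?:_files\/)(.*?[jpg|png|tif|gif]")'
)
# Groups 1-3 are set when class= is present, group 4 when it is not.
replace = r'<img\1\2\3\4"\5'

# Shortened stand-ins for the two example lines (assumption).
with_class = ('<img lorem-ipsum-dolor="/lorem/a23w34m87.jpg" class="lorem123" '
              'src="11LOREM%202%20Cras_files/a01b02c68.png" '
              'alt="a01b02c68.bmp" height="101" width="102">')
without_class = ('<img lorem-ipsum-dolor="/lorem/c011XX001.tif" '
                 'src="11LOREM%202%20Cras_files/c01vv0x01.jpg" '
                 'alt="c01vv0x01.jpeg" height="567" width="789">')

with_class_result = search.sub(replace, with_class)
without_class_result = search.sub(replace, without_class)

print(with_class_result)
print(without_class_result)
```

    Only one branch of the alternation matches on each line, so either \1\2\3 or \4 contributes text and the other side stays empty.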

    With all due respect, your regex, as correct as it may be, does not work on these lines.

    Very sorry for my English, I am still learning.



  • Hi Jan,

    OK. I now understand two main points about your problem :

    • Firstly, the values of the different attributes are surrounded by the usual quotation mark ( " ), of Unicode code-point \x{0022}. Of course, my previous regex, based on the two delimiters \x{201c} and \x{201d}, COULDN’T work at all !

    • Secondly, the attribute class="........" may, sometimes, be absent from a line. Again, my previous regex supposed that this attribute was always present :-((

    So, aware of the two facts, above, my new proposed regex is :

    SEARCH (?i-s)<img lorem-ipsum-dolor.*?((?:class=".*?" )?src=").*?_files/(.*?(jpg|png|tiff|gif))

    REPLACE <img \L\1\2

    After running your S/R and mine, they, both, give the same results :-)) Nice !


    NOTES : Compared to my previous try :

    • I changed the special delimiters “.....”, by the usual ones ".....", in the search regex

    • I added a new non-capturing group (?:class=".*?" )?, which may be present or NOT, due to the final question mark ?

    • There is a space at the end of the non-capturing group, just before the closing round bracket

    • The replacement regex has NOT changed
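
    A rough Python check of the optional group, under stated assumptions: `re` stands in for the Notepad++ engine, `(?i-s)` becomes `(?i)` because Python's dot already excludes line breaks by default, the hypothetical helper `lower_groups` replaces `\L\1\2`, and the two lines are shortened versions of the examples.

```python
import re

search = re.compile(
    r'(?i)<img lorem-ipsum-dolor.*?((?:class=".*?" )?src=").*?_files/'
    r'(.*?(jpg|png|tiff|gif))'
)

def lower_groups(m):
    # Stand-in for the replacement "<img \L\1\2".
    return '<img ' + (m.group(1) + m.group(2)).lower()

# Shortened stand-ins for the two example lines (assumption).
with_class = ('<img lorem-ipsum-dolor="/lorem/a23w34m87.jpg" class="lorem123" '
              'src="11LOREM%202%20Cras_files/a01b02c68.png" '
              'alt="a01b02c68.bmp" height="101" width="102">')
without_class = ('<img lorem-ipsum-dolor="/lorem/c011XX001.tif" '
                 'src="11LOREM%202%20Cras_files/c01vv0x01.jpg" '
                 'alt="c01vv0x01.jpeg" height="567" width="789">')

# Group 1 holds 'class="..." src="' when class is present, else just 'src="'.
group1_with = search.search(with_class).group(1)
group1_without = search.search(without_class).group(1)

with_class_result = search.sub(lower_groups, with_class)
without_class_result = search.sub(lower_groups, without_class)

print(with_class_result)
print(without_class_result)
```

    Because the optional part sits inside group 1, a single replacement string covers both kinds of line.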

    Cheers,

    guy038



  • Thank you for your commitment
    Best regards,
    Jan

