Need help, please - regular expressions
-
The forum strips the stars (asterisks) from the text at the top, sorry.
regex search:
(?:<img lorem-ipsum-dolor)(?:.*?)( class=|)(“.*?”)( src=)(?:.*?)(?:_files/)(.*?.[jpg|png|tif|gif]")
-
Very close. You had one minor issue. This:
( class=|)
Should be
( class=)
-
Also, take a look at your original expression here. You can see that the regular expression allows it to skip class=.
Note: that website might not use the exact same regular expression engine, but it should be close enough to use as a reference.
-
Shortening the example has come back to bite me again: I wanted to keep it brief, and it came out badly.
The expression also has to match:
<img lorem-ipsum-dolor=“/lorem/ipsum/dolor-1999-and/123456789012345/lorem_ipsum/1a2b3c4dd9651/c011XX001.tif” src=“11LOREM%202%20%20IpSuM%20Dolor%20sit%20%20amet%20consecteur%20-%20AdiPISCIng%20123456%20elit%20Curabitur%20QWERTY%20202020%20yes%20urna%20Interdeum%20%20Off%20Cras_files/c01vv0x01.jpg” alt=“c01vv0x01.jpeg” height=“567” width=“789”>
IMHO, ( class=|) means "with it, or without it". I emphasize: IMHO.
Without the empty alternative, this second kind of line is not matched. Because | means OR, right?
At least that is how it looks to me, sorry…
-
You are right: ( class=|) can skip it. If you look at the image I linked, when it skips group 1 it must still “consume” something for group 2. So you want a regular expression that only captures group 2 if group 1 exists. So this section in your expression:
( class=|)(".*?")
Should be replaced with something like:
(?:( class=)(".*?"))?
Note this isn’t perfect but should point you in the right direction.
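To see the difference between the two forms outside Notepad++, here is a minimal sketch in Python. It assumes that Python's re module is close enough to the Notepad++ engine for these constructs, and it uses two shortened, hypothetical lines with straight quotes rather than the real data.

```python
import re

# Hypothetical, shortened versions of the two kinds of line in this thread.
with_class = '<img lorem-ipsum-dolor="/a/b.jpg" class="lorem123" src="Cras_files/a.png">'
no_class = '<img lorem-ipsum-dolor="/a/c.tif" src="Cras_files/c.jpg">'

# Original form: group 1 may be empty, but group 2 must still match something.
original = re.compile(r'<img lorem-ipsum-dolor.*?( class=|)(".*?")( src=)')
# Suggested form: groups 1 and 2 are captured together, or not at all.
suggested = re.compile(r'<img lorem-ipsum-dolor.*?(?:( class=)(".*?"))?( src=)')

def groups(pattern, line):
    """Return the captured groups of the first match in the line."""
    return pattern.search(line).groups()

for line in (with_class, no_class):
    print('original :', groups(original, line))
    print('suggested:', groups(suggested, line))
```

With the original form, group 1 comes back empty on both lines and group 2 swallows the value of lorem-ipsum-dolor, because the empty alternative succeeds before class= is ever reached. With the suggested form, groups 1 and 2 are filled only on the line that has a class attribute.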
-
In the picture it looks great. In practice, it is always left at the front :(
-
(?:<img lorem-ipsum-dolor)(?:.*?)(?>( class=|)(“.*?”))?(“.*?”)( src=)(?:.*?)(?:_files/)(.*?[jpg|png|tif|gif]")
With an atomic group, it works.
Thanks for pointing me in the right direction!
-
I’m sorry, I was blind: it doesn’t work.
-
Finally:
<code>(?:<img lorem-ipsum-dolor)(?:.*?)(?:( class=)(“.*?”)( src=)|( src=))(?:.*?)(?:_files/)(.*?[jpg|png|tif|gif]")</code>
PS. I don’t know why it didn’t show up in red :)
-
Hello Jan and Dail,
Jan, I haven’t tried your regex S/R yet, as I first wanted to fully understand it. I just noticed two points :
- After copying your example source into my Notepad++, the delimiters of the different attributes ( class, src, alt, height and width ) are the pair “.....”, that is to say the LEFT DOUBLE QUOTATION MARK ( Unicode code point \x{201c} ) and the RIGHT DOUBLE QUOTATION MARK ( Unicode code point \x{201d} ). These characters are different from the usual QUOTATION MARK " ( \x22 ). Therefore, the regex proposed below is based on these two characters, \x{201c} and \x{201d}
- Seemingly, your picture files can have the .jpg, .png, .tiff or .gif extension. However, the regex you use to match these extensions, [jpg|png|tif|gif], is totally WRONG, because the | symbol is taken literally between square brackets ! Indeed, this syntax is a single character class, which matches one single character : either the pipe symbol ( | ) or one of the letters j, p, g, n, t, i, f ( upper or lower case, when the search is case-insensitive ). In other words, this part of your regex could simply be rewritten as [fgijnpt|]. So, the correct regex is simply (jpg|png|tiff|gif) : one extension among the four possible ones !
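The difference between the square-bracket form and the round-bracket form can be checked with a short Python sketch. Python's re module is used here only as a stand-in for the Notepad++ engine, and the file name notes.txt is a made-up example.

```python
import re

bracket_form = r'[jpg|png|tif|gif]'   # character class: matches exactly ONE character
reduced_form = r'[fgijnpt|]'          # the same set, written without repeats
group_form = r'(jpg|png|tiff|gif)'    # alternation: one whole extension

# Both bracket forms accept exactly the same single characters.
printable = [chr(i) for i in range(32, 127)]
same_set = all(
    bool(re.fullmatch(bracket_form, c)) == bool(re.fullmatch(reduced_form, c))
    for c in printable
)
print('same character set:', same_set)

# The bracket form only checks the LAST character before the closing quote.
loose = re.compile(r'.*?' + bracket_form + '"')
strict = re.compile(r'.*?' + group_form + '"')
for name in ('a01b02c68.png"', 'notes.txt"'):
    print(name, bool(loose.fullmatch(name)), bool(strict.fullmatch(name)))
```

The bracket form accepts a01b02c68.png" only because the name happens to end in g, and it accepts notes.txt" just as readily, whereas the alternation rejects it.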
Then, I propose the following regex S/R :
SEARCH
(?i-s).*?(class=“.*?” src=“).*?_files/(.*?(jpg|png|tiff|gif))
( with a space before the src attribute )
REPLACE
<img \L\1\2
Notes :
- The two modifiers (?i-s) make the search case-insensitive and make the dot match standard characters only ( not line-break characters ). In the replacement, the two groups \1 and \2 are rewritten in lower case, due to the \L syntax
- Each of the four .*? forms matches the shortest possible run of characters before the string that follows it
- All matched text of a line that is NOT located between round brackets, such as everything before the first string class, is therefore deleted by the replacement
- Group \1 is the string class=“…” src=“ and group \2 is the name of the picture, with its extension. Both are rewritten, in lower case, after an initial <img string.
If you really need the match to begin with the string <img lorem-ipsum-dolor, just change the search regex into :
SEARCH
(?i-s)<img lorem-ipsum-dolor.*?(class=“.*?” src=“).*?_files/(.*?(jpg|png|tiff|gif))
Best Regards,
guy038
-
Thank you very much for the analysis. I know the wrong brackets have no right to work, but in this specific example they do.
Example:
<img lorem-ipsum-dolor="/lorem/ipsum/dolor-2015-and/123456789012345/lorem_ipsum/1a2b3c4dd9651/a23w34m87.jpg" class="lorem123" src="11LOREM%202%20%20IpSuM%20Dolor%20sit%20%20amet%20consecteur%20-%20AdiPISCIng%20123456%20elit%20Curabitur%20QWERTY%20202020%20yes%20urna%20Interdeum%20%20Off%20Cras_files/a01b02c68.png" alt="a01b02c68.bmp" height="101" width="102">
<img lorem-ipsum-dolor="/lorem/ipsum/dolor-1999-and/123456789012345/lorem_ipsum/1a2b3c4dd9651/c011XX001.tif" src="11LOREM%202%20%20IpSuM%20Dolor%20sit%20%20amet%20consecteur%20-%20AdiPISCIng%20123456%20elit%20Curabitur%20QWERTY%20202020%20yes%20urna%20Interdeum%20%20Off%20Cras_files/c01vv0x01.jpg" alt="c01vv0x01.jpeg" height="567" width="789">
Regex:
(?:<img lorem-ipsum-dolor)(?:.*?)(?:( class=)(".*?")( src=)|( src=))(?:.*?)(?:_files\/)(.*?[jpg|png|tif|gif]")
replace:
<img\1\2\3\4"\5
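For anyone who wants to try this search and replacement outside Notepad++, here is a rough Python equivalent. It is only a sketch: Python's re module stands in for the Notepad++ engine, the two input lines are shortened, hypothetical versions of the examples above, and it relies on Python 3.5 or later, where a group that did not take part in the match is replaced by an empty string.

```python
import re

# The search regex as posted, split over two string literals for readability.
search = re.compile(
    r'(?:<img lorem-ipsum-dolor)(?:.*?)(?:( class=)(".*?")( src=)|( src=))'
    r'(?:.*?)(?:_files\/)(.*?[jpg|png|tif|gif]")'
)
replacement = r'<img\1\2\3\4"\5'

# Shortened, hypothetical versions of the two example lines.
with_class = ('<img lorem-ipsum-dolor="/a/b.jpg" class="lorem123" '
              'src="11LOREM%20Cras_files/a01b02c68.png" alt="a01b02c68.bmp">')
no_class = ('<img lorem-ipsum-dolor="/a/c.tif" '
            'src="11LOREM%20Cras_files/c01vv0x01.jpg" alt="c01vv0x01.jpeg">')

def convert(line):
    """Apply the search/replace once to a single line."""
    return search.sub(replacement, line)

print(convert(with_class))
print(convert(no_class))
```

Only one branch of the alternation takes part in each match, so either groups 1 to 3 or group 4 are filled, and the replacement string can list all of them.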
After changing to the correct brackets, it also works:
(?:<img lorem-ipsum-dolor)(?:.*?)(?:( class=)(".*?")( src=)|( src=))(?:.*?)(?:_files\/)(.*?(jpg|png|tif|gif)")
but then there are 6 groups; the sixth one simply does not need to be used in the replacement. With all due respect, your regex, as correct as it may be, does not work.
Very sorry for my English, I am still learning.
-
Hi Jan,
OK. I have now understood two main points about your problem :
- Firstly, the values of the different attributes are surrounded by the usual quotation mark ( " ), of Unicode code point \x{0022}. Of course, my previous regex, based on the two delimiters \x{201c} and \x{201d}, COULDN’T work at all !
- Secondly, the attribute class="........" may sometimes be absent from a line. Again, my previous regex assumed that this attribute was always present :-((
So, aware of the two facts above, my new proposed regex is :
SEARCH
(?i-s)<img lorem-ipsum-dolor.*?((?:class=".*?" )?src=").*?_files/(.*?(jpg|png|tiff|gif))
REPLACE
<img \L\1\2
After running your S/R and mine, they, both, give the same results :-)) Nice !
NOTES : Compared to my previous try :
- I replaced the special delimiters “.....” with the usual ones, ".....", in the search regex
- I added a new non-capturing group (?:class=".*?" )?, which may be present or NOT, due to the final question mark ?
- There is a space at the end of the non-capturing group, before the closing round bracket
- The replacement regex has NOT changed
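The optional non-capturing group can also be tried in Python, with two caveats that are assumptions of this sketch rather than part of the S/R above: Python's re module has no \L in replacements, so a small function does the lower-casing, and the leading (?i-s) is replaced by the re.IGNORECASE flag, since the dot already excludes line breaks by default in Python. The two input lines are shortened, hypothetical examples.

```python
import re

search = re.compile(
    r'<img lorem-ipsum-dolor.*?((?:class=".*?" )?src=").*?_files/(.*?(jpg|png|tiff|gif))',
    re.IGNORECASE,  # stands in for the (?i) part of (?i-s)
)

def lowercase_groups(match):
    """Emulate the replacement <img \\L\\1\\2: groups 1 and 2 in lower case."""
    return '<img ' + (match.group(1) + match.group(2)).lower()

def convert(line):
    return search.sub(lowercase_groups, line)

# Shortened, hypothetical lines; upper case is added to show the lower-casing.
with_class = ('<img lorem-ipsum-dolor="/a/b.jpg" class="Lorem123" '
              'src="11LOREM%20Cras_files/A01b02c68.PNG" alt="a01b02c68.bmp">')
no_class = ('<img lorem-ipsum-dolor="/a/c.tif" '
            'src="11LOREM%20Cras_files/c01vv0x01.jpg" alt="c01vv0x01.jpeg">')

print(convert(with_class))
print(convert(no_class))
```

Because the optional part sits inside group 1, that group holds either class="…" src=" or just src=", so the same replacement covers both kinds of line.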
Cheers,
guy038
-
-
Thank you for your commitment
Best regards,
Jan