Find and Replace with RegEx help
-
EDIT: The forum software keeps truncating some of the code examples for some reason. It shows up fine in the preview pane, but once submitting, it’s gone. I don’t know how to get around this, so here’s a screenshot of what things should look like.
I can’t wrap my head around this. I’m trying to do a complicated find and replace on a MediaWiki source file. I’d like to convert a page full of image links in this format:
[[File:ImageNameRare.png|50px|link=Page Name]]
to this format:
{{ItemPic|Page Name|3||50px}}
Here are a couple of things to note. In the original format, all image names will have the suffixes of Basic, Common, Uncommon, Rare, SuperRare, or Legendary. These would need to be converted to numbers from 0 to 5, respectively. You can see where the “Rare” suffix was converted to a 3. Also, the Page Name would need to be preserved from the original to the new format. Nothing else needs to be preserved. The ImageName.png is no longer necessary and all links will feature 50px images, so that won’t change.
Finally, there are instances on the page of this file format that I do not want to replace. In these instances, the link= will be blank, as in this example:
[[File:ImageName.png|50px|link=]]
If link= is followed immediately by two end brackets then the line should be ignored.
I really hate asking this question here because I feel it’s outside of the scope of these forums, but I don’t know where else to ask. Can anybody help with this? I have over 3000 instances of this text in the document that need to be replaced, so doing it by hand is completely out of the question.
Thank you.
-
given your example
Find what:
\[\[File:.+((Basic)|((?<!Un)Common)|(Uncommon)|((?<!Super)Rare)|(SuperRare)|(Legendary))\.png\|(\d+px)\|link=(.+)\]\]
and replace with:
{{ItemPic|$9|(?{2}0)(?{3}1)(?{4}2)(?{5}3)(?{6}4)(?{7}5)||$8}}
turns this
[[File:ImageNameBasic.png|50px|link=Page Name]]
[[File:ImageNameCommon.png|50px|link=Page Name]]
[[File:ImageNameUncommon.png|50px|link=Page Name]]
[[File:ImageNameRare.png|50px|link=Page Name]]
[[File:ImageNameSuperRare.png|50px|link=Page Name]]
[[File:ImageNameLegendary.png|50px|link=Page Name]]
[[File:ImageName|50px|link=]]
into that
{{ItemPic|Page Name|0||50px}}
{{ItemPic|Page Name|1||50px}}
{{ItemPic|Page Name|2||50px}}
{{ItemPic|Page Name|3||50px}}
{{ItemPic|Page Name|4||50px}}
{{ItemPic|Page Name|5||50px}}
[[File:ImageName|50px|link=]]
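For anyone who would rather script this outside Notepad++, here is a minimal Python sketch of the same search and replace, assuming one link per line. Python's re module has no (?{2}0) conditional replacement, so the sketch emulates it with a replacement function; the names PATTERN, to_item_pic and convert are made up for this illustration.

```python
import re

# The search pattern from the post above, unchanged.
# Groups 2-7 are the six suffixes, group 8 the size, group 9 the page name.
PATTERN = re.compile(
    r"\[\[File:.+((Basic)|((?<!Un)Common)|(Uncommon)|((?<!Super)Rare)"
    r"|(SuperRare)|(Legendary))\.png\|(\d+px)\|link=(.+)\]\]"
)


def to_item_pic(match):
    # Emulates (?{2}0)(?{3}1)(?{4}2)(?{5}3)(?{6}4)(?{7}5):
    # emit the digit that belongs to whichever suffix group matched.
    digit = next(str(n - 2) for n in range(2, 8) if match.group(n) is not None)
    return "{{ItemPic|%s|%s||%s}}" % (match.group(9), digit, match.group(8))


def convert(text):
    # Lines that do not match (for example a blank link=) are left untouched.
    return PATTERN.sub(to_item_pic, text)


print(convert("[[File:ImageNameRare.png|50px|link=Page Name]]"))
# {{ItemPic|Page Name|3||50px}}
```

A line such as [[File:ImageName.png|50px|link=]] comes back unchanged, because (.+) needs at least one character between link= and the closing brackets.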
If you are interested and need help understanding what it is doing let us know.
Cheers
Claudia
-
If you are interested and need help understanding what it is doing let us know.
If you want to explain it, I’ll definitely be interested. I’m not sure how much I’ll retain. I’ve tried learning RegEx before, but I give up after a few minutes of staring at the seemingly random mix of symbols.
-
\[\[File:.+((Basic)|((?<!Un)Common)|(Uncommon)|((?<!Super)Rare)|(SuperRare)|(Legendary))\.png\|(\d+px)\|link=(.+)\]\]
Let's divide it up.
\[\[File:
We need to escape each opening square bracket [ by writing \[ because the regex engine uses the bracket for its own syntax too.
.+
whatever data, at least one character
((Basic)|((?<!Un)Common)|(Uncommon)|((?<!Super)Rare)|(SuperRare)|(Legendary))
This is basically an alternation list, which means we are looking for a match of one of the alternatives listed.
(?<!Un)Common - this means Common must not be preceded by Un
(?<!Super)Rare - guess what, yes, Rare must not be preceded by Super
\.png
Escape the dot, as it has a special meaning in regex too.
\|
The same is true for the pipe sign.
(\d+px)
Match any number of digits, at least one, followed by the literal px.
\|
Again, escaping the pipe.
link=(.+)
followed by the string link=, and whatever follows is captured up to
\]\]
the closing square brackets, again escaped.
This results in 9 match groups, which can be reused with the $NUMBER or (?{NUMBER}) syntax.
6 groups are taken by the alternation list, and those are the ones which get assigned match numbers 2-7. Match group 1 is the outer pair of parentheses around the whole list and is not of interest to us (not even sure why it needed to be defined). Match group 8 is what is returned by (\d+px), and match group 9 is what is matched after link=.
Creating the string
{{ItemPic|$9|(?{2}0)(?{3}1)(?{4}2)(?{5}3)(?{6}4)(?{7}5)||$8}}
$9 = match group 9
(?{2}0) = IF match group 2 took part in the match, insert 0, ELSE insert nothing
(?{3}1) = IF match group 3 took part in the match, insert 1, ELSE insert nothing
…
$8 = match group 8
The rest
{{ItemPic| | || }}
is what you wanted to build.
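To make the group numbering easier to see, here is a small Python sketch that lists which of the 9 groups actually hold text for a given line. It is only an illustration: the helper name participating_groups is invented, and Python's re module stands in for the Boost engine that Notepad++ uses.

```python
import re

# Same search pattern as in the explanation above.
PATTERN = re.compile(
    r"\[\[File:.+((Basic)|((?<!Un)Common)|(Uncommon)|((?<!Super)Rare)"
    r"|(SuperRare)|(Legendary))\.png\|(\d+px)\|link=(.+)\]\]"
)


def participating_groups(line):
    # Return {group number: captured text} for the groups that matched.
    match = PATTERN.search(line)
    if match is None:
        return {}
    return {n: match.group(n) for n in range(1, 10) if match.group(n) is not None}


print(participating_groups("[[File:ImageNameRare.png|50px|link=Page Name]]"))
# {1: 'Rare', 5: 'Rare', 8: '50px', 9: 'Page Name'}
```

Only one of the groups 2-7 is ever filled per match, which is why exactly one of the (?{n}digit) conditionals produces output in the replacement.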
Hope this makes sense to you.
Cheers
Claudia
-
Thank you very much. I was able to figure out how the matching expression worked after a time, but I couldn’t figure out the replacement expression. All I was able to figure out is that there were 9 groupings and I could see where each was being used, but I wasn’t able to figure out how the 9 groupings were divided. I also wasn’t exactly sure how the IF ELSE syntax worked. Heck, I didn’t even know it was an IF/ELSE statement.
This should help me immensely in the future. I’m an admin and an everyday contributor to a wiki with almost 1000 unique articles, plus hundreds more pages and templates, and I’ve been wanting to do some mass updates like this for a long time.
Thank you again.
-
Hello, @ksomeone-msomeone, @claudia-frank and All
Interesting S/R, indeed ! BTW, this post is quite long, so just have a drink, or read it in several sittings ;-))
Claudia, you’ve already fully explained these regexes while I was thinking about a solution, too ! Anyway, @ksomeone-msomeone will also get additional information from my post !
Regarding your initial text :
[[File:ImageNameRare.png|50px|link=Page Name]]
- I assumed that you don’t care about the first part, File:ImageName, after the two opening square brackets
- I think that you don’t care, either, about the second part, .png|50px|link, coming after the suffix Basic, Rare or any other !
- And, as you said, only the Page Name, after the = sign and before the two ending square brackets, must be rewritten during the replacement process
- Finally, if the line does not contain any Page Name and ends with link=]], NO replacement must occur !
Now, regarding the final expected text :
{{ItemPic|Page Name|3||50px}}
- I assumed that the part ItemPic| is a simple literal string
- Then, the Page Name, memorized from the initial text, must be rewritten, followed by a | symbol
- Now, as you explained, it should write a single digit, from 0 to 5, according to the 6 corresponding states Basic, Common, Uncommon, Rare, SuperRare and Legendary
- Finally, we just write the literal string ||50px after that digit, as you said that all images are 50px, anyway !
- Note that if the line ends with link=]], it will be unchanged after replacement and keep the square brackets instead of the curly braces !
So, from the hypotheses, above, I built the following regex S/R :
SEARCH
(?-si)\[\[.+?(?:(Basic)|(Uncommon)|(Common)|(SuperRare)|(Rare)|(Legendary)).+=(.+)\]\]
REPLACE
{{ItemPic|$7|(?{1}0)(?{2}2)(?{3}1)(?{4}4)(?{5}3)(?{6}5)||50px}}
And, given Claudia’s example text, below :
[[File:ImageNameBasic.png|50px|link=Page Name]]
[[File:ImageNameCommon.png|50px|link=Page Name]]
[[File:ImageNameUncommon.png|50px|link=Page Name]]
[[File:ImageNameRare.png|50px|link=Page Name]]
[[File:ImageNameSuperRare.png|50px|link=Page Name]]
[[File:ImageNameLegendary.png|50px|link=Page Name]]
[[File:ImageName|50px|link=]]
after performing this S/R, this text turns into :
{{ItemPic|Page Name|0||50px}}
{{ItemPic|Page Name|1||50px}}
{{ItemPic|Page Name|2||50px}}
{{ItemPic|Page Name|3||50px}}
{{ItemPic|Page Name|4||50px}}
{{ItemPic|Page Name|5||50px}}
[[File:ImageName|50px|link=]]
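As a rough cross-check outside Notepad++, the same S/R can be sketched in Python. Two assumptions apply: the leading (?-si) is dropped, since Python's re module is case-sensitive and keeps the dot from matching line breaks by default, and the conditional replacements are emulated with a lookup table. The names PATTERN, DIGITS and convert are invented for the sketch.

```python
import re

# Search regex from the post, without the (?-si) prefix (Python defaults
# already behave that way). Groups 1-6 are the suffixes, group 7 the page name.
PATTERN = re.compile(
    r"\[\[.+?(?:(Basic)|(Uncommon)|(Common)|(SuperRare)|(Rare)|(Legendary))"
    r".+=(.+)\]\]"
)

# Mirrors (?{1}0)(?{2}2)(?{3}1)(?{4}4)(?{5}3)(?{6}5).
DIGITS = {1: "0", 2: "2", 3: "1", 4: "4", 5: "3", 6: "5"}


def to_item_pic(match):
    # Pick the digit of the one suffix group that took part in the match.
    digit = next(d for g, d in DIGITS.items() if match.group(g) is not None)
    return "{{ItemPic|%s|%s||50px}}" % (match.group(7), digit)


def convert(text):
    return PATTERN.sub(to_item_pic, text)


sample = "\n".join([
    "[[File:ImageNameUncommon.png|50px|link=Page Name]]",
    "[[File:ImageNameSuperRare.png|50px|link=Page Name]]",
    "[[File:ImageName|50px|link=]]",
])
print(convert(sample))
# {{ItemPic|Page Name|2||50px}}
# {{ItemPic|Page Name|4||50px}}
# [[File:ImageName|50px|link=]]
```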
Before explaining this regex S/R, it should be useful to remember some rules !
- When you search for two words where one is part of the other, as for example Rare and SuperRare, it is safer to write the alternative as :
SuperRare|Rare
with the longest word as the first branch ( and NOT Rare|SuperRare ). Strictly, the order is decisive only when the short word is the beginning of the long one, as the regex engine tries the branches from left to right and keeps the first one that matches. Here, Rare is the end of SuperRare, so the quantifier placed before the alternative matters even more, as explained in the last point of this list !
- This explains the order of the 6 states :
(Basic)|(Uncommon)|(Common)|(SuperRare)|(Rare)|(Legendary)
With this syntax, Claudia, there is no need to use look-arounds at all. Of course, the digit order becomes 0 2 1 4 3 5 ;-))
- These 6 states are surrounded by the syntax (?:.......), which is a non-capturing group. So, group (Basic) is group 1, group (Uncommon) is group 2, and so on … ( BTW, you write Uncommon and NOT UnCommon, whereas you use SuperRare ?! )
- But I came across a problem : Let’s imagine the simple regex S/R
SEARCH (?-si)^(.+)(SuperRare|Rare) , with a case-sensitive search
REPLACE >\1< and >\2<
Then, this regex would never find the first alternative, SuperRare, just because the greedy .+ first catches the greatest range of characters, then backtracks only as far as needed to match the Rare string !! So the text :
ImageNameSuperRare
ImageNameRare
would give :
>ImageNameSuper< and >Rare<
>ImageName< and >Rare<
Now, if we change the greedy quantifier + ( meaning {1,}, BTW ) into the lazy form +?, this time, it catches, first, the shortest range of characters before trying the two branches of the alternative. So it does match the SuperRare string and, after replacement, we get the expected text :
>ImageName< and >SuperRare<
>ImageName< and >Rare<
BTW, can you guess why the similar search regex
(?-si)^(.+)(Uncommon|Common)
does not need the lazy quantifier ?.. Just because the search is case-sensitive, and the string Uncommon contains common, with a lowercase c, but never the string Common ;-))
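The greedy versus lazy behaviour described in these rules can be tried out with a few lines of Python. This is only a sketch: the helper split_suffix and the three pattern names are invented, and Python's re module is used as a stand-in for the Notepad++ engine. It also shows that, in Python at least, swapping the two branches does not change the lazy result, because Rare is the end and not the beginning of SuperRare.

```python
import re

GREEDY = r"^(.+)(SuperRare|Rare)"
LAZY = r"^(.+?)(SuperRare|Rare)"
LAZY_SWAPPED = r"^(.+?)(Rare|SuperRare)"


def split_suffix(pattern, text):
    # Return (group 1, group 2) of the first match, or None without a match.
    match = re.search(pattern, text)
    return match.groups() if match else None


# Greedy .+ grabs everything, then backtracks just far enough to find Rare.
print(split_suffix(GREEDY, "ImageNameSuperRare"))        # ('ImageNameSuper', 'Rare')
# Lazy .+? grows one character at a time and reaches the S of SuperRare first.
print(split_suffix(LAZY, "ImageNameSuperRare"))          # ('ImageName', 'SuperRare')
print(split_suffix(LAZY_SWAPPED, "ImageNameSuperRare"))  # ('ImageName', 'SuperRare')
# Case-sensitive search: no capital-C Common hides inside Uncommon.
print(split_suffix(r"^(.+)(Uncommon|Common)", "ImageNameUncommon"))
# ('ImageName', 'Uncommon')
```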
Notes, on the regex expressions :
- As usual, the two modifiers (?-si), at the beginning of the search regex :
    - Force the regex engine to do a case-sensitive search ( -i means NON-insensitive )
    - Tell the regex engine that the special dot character ( . ) stands for a single standard character, and not for a line-break one !
- Then, \[\[ matches two opening square brackets. Note that these special characters must be escaped to be considered as literal characters
- Now, the part .+? represents the shortest range of standard characters, up to one of the states Basic, Common, …
- The syntax (?:(Basic)|(Uncommon)|(Common)|(SuperRare)|(Rare)|(Legendary)) stands for an alternative, in a non-capturing group, with 6 branches, stored as groups 1 to 6
- Then, the part .+= matches the range of characters up to the unique = sign, which does not need to be stored
- Finally, the syntax (.+)\]\] represents the Page Name, stored as group 7, followed by the two ending square brackets, which, again, must be escaped
- In replacement, we first rewrite the literal string {{ItemPic| followed by the Page Name ( $7 ) and a | symbol
- Then, the part (?{1}0)(?{2}2)(?{3}1)(?{4}4)(?{5}3)(?{6}5) is a run of juxtaposed conditional replacements, in any order, of the general form (?{x}y). It means : IF group x is matched in search, write the expression y in replacement. In our case, note that y is simply a digit !
- Finally, the part ||50px}} is the literal string which has to be written on every changed line !
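What the two modifiers protect against can be illustrated with Python flags. This is an approximation under stated assumptions: Python has no leading (?-si) switch, so its defaults play the role of (?-si), while re.IGNORECASE and re.DOTALL show what happens when the options are on. The variable names are made up for the example.

```python
import re

# (?-i): with a case-sensitive search, Uncommon is found as a whole.
strict = re.search(r"^(.+)(Uncommon|Common)", "ImageNameUncommon")
# With a case-insensitive search, the greedy .+ stops at the inner "common".
loose = re.search(r"^(.+)(Uncommon|Common)", "ImageNameUncommon", re.IGNORECASE)
print(strict.groups())  # ('ImageName', 'Uncommon')
print(loose.groups())   # ('ImageNameUn', 'common')

# (?-s): the dot does not match a line break, so a match stays on one line.
broken = "[[File:ImageNameRare.png|50px|link=\nPage Name]]"
per_line = re.search(r"link=(.+)\]\]", broken)
across = re.search(r"link=(.+)\]\]", broken, re.DOTALL)
print(per_line)                # None
print(repr(across.group(1)))   # '\nPage Name'
```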
Best Regards,
guy038
P.S. :
For people new to regular expressions, their concepts and syntax, begin with this article in the N++ Wiki :
http://docs.notepad-plus-plus.org/index.php/Regular_Expressions
In addition, you’ll find good documentation about the Boost C++ Regex library, v1.55.0 ( similar to the Perl regular expressions, v5.8 ), used by Notepad++ since its 6.0 version, at the TWO addresses below :
http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/format/boost_format_syntax.html
- The FIRST link explains the syntax of regular expressions in the SEARCH part
- The SECOND link explains the syntax of regular expressions in the REPLACEMENT part
You may also look for valuable information on the sites below :
http://www.regular-expressions.info
http://perldoc.perl.org/perlre.html
Be aware that, like any documentation, it may contain some errors ! Anyway, if you detect one, that’s good news : you’re improving ;-))
-
-
@guy038 said:
Regarding your initial text :
[[File:ImageNameRare.png|50px|link=Page Name]]
- I assumed that you don’t care about the first part, File:ImageName, after the two opening square brackets
- I think that you don’t care, either, about the second part, .png|50px|link, coming after the suffix Basic, Rare or any other !
- And, as you said, only the Page Name, after the = sign and before the two ending square brackets, must be rewritten during the replacement process
- Finally, if the line does not contain any Page Name and ends with link=]], NO replacement must occur !
Now, regarding the final expected text :
{{ItemPic|Page Name|3||50px}}
- I assumed that the part ItemPic| is a simple literal string
- Then, the Page Name, memorized from the initial text, must be rewritten, followed by a | symbol
- Now, as you explained, it should write a single digit, from 0 to 5, according to the 6 corresponding states Basic, Common, Uncommon, Rare, SuperRare and Legendary
- Finally, we just write the literal string ||50px after that digit, as you said that all images are 50px, anyway !
- Note that if the line ends with link=]], it will be unchanged after replacement and keep the square brackets instead of the curly braces !
You got it exactly right. Thank you for the detailed explanation. I’ll definitely be bookmarking this post so that I can come back and read this again if I need help with something like this.
-
-
I was close, wasn’t I? :-D
thx for the detailed information and improvements - always a pleasure :-)
Cheers
Claudia
-
@Claudia-Frank said:
I was close, wasn’t I? :-D
thx for the detailed information and improvements - always a pleasure :-)
Cheers
Claudia
Yours was the one I used before guy038 made his post and it worked perfectly, so I would say you were more than close. :)