Find and Replace with RegEx help



  • EDIT: The forum software keeps truncating some of the code examples for some reason. It shows up fine in the preview pane, but once submitting, it’s gone. I don’t know how to get around this, so here’s a screenshot of what things should look like.

    I can’t wrap my head around this. I’m trying to do a complicated find and replace on a MediaWiki source file. I’d like to convert a page full of image links in this format:

    [[File:ImageNameRare.png|50px|link=Page Name]]
    

    to this format:

    {{ItemPic|Page Name|3||50px}}
    

    Here are a couple of things to note. In the original format, all image names will have the suffixes of Basic, Common, Uncommon, Rare, SuperRare, or Legendary. These would need to be converted to numbers from 0 to 5, respectively. You can see where the “Rare” suffix was converted to a 3. Also, the Page Name would need to be preserved from the original to the new format. Nothing else needs to be preserved. The ImageName.png is no longer necessary and all links will feature 50px images, so that won’t change.

    Finally, there are instances on the page of this file format that I do not want to replace. In these instances, the link= will be blank, as in this example:

    [[File:ImageName.png|50px|link=]]
    

    If link= is followed immediately by two end brackets then the line should be ignored.

    I really hate asking this question here because I feel it’s outside of the scope of these forums, but I don’t know where else to ask. Can anybody help with this? I have over 3000 instances of this text in the document that need to be replaced, so doing it by hand is completely out of the question.

    Thank you.



  • @Ksomeone-Msomeone

    given your example

    Find what:

    \[\[File:.+((Basic)|((?<!Un)Common)|(Uncommon)|((?<!Super)Rare)|(SuperRare)|(Legendary))\.png\|(\d+px)\|link=(.+)\]\]
    

    and replace with:

    {{ItemPic|$9|(?{2}0)(?{3}1)(?{4}2)(?{5}3)(?{6}4)(?{7}5)||$8}}
    

    turns this

    [[File:ImageNameBasic.png|50px|link=Page Name]]
    [[File:ImageNameCommon.png|50px|link=Page Name]]
    [[File:ImageNameUncommon.png|50px|link=Page Name]]
    [[File:ImageNameRare.png|50px|link=Page Name]]
    [[File:ImageNameSuperRare.png|50px|link=Page Name]]
    [[File:ImageNameLegendary.png|50px|link=Page Name]]
    [[File:ImageName|50px|link=]]
    

    into that

    {{ItemPic|Page Name|0||50px}}
    {{ItemPic|Page Name|1||50px}}
    {{ItemPic|Page Name|2||50px}}
    {{ItemPic|Page Name|3||50px}}
    {{ItemPic|Page Name|4||50px}}
    {{ItemPic|Page Name|5||50px}}
    [[File:ImageName|50px|link=]]
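
    For anyone who would rather script the same conversion outside Notepad++, here is a minimal Python sketch of the search and replace above. It is only an illustration under some assumptions: Python's re module has no (?{N}...) conditional replacement, so a callback picks the digit instead, and the names PATTERN, to_itempic and convert are made up for this example.

```python
import re

# The search pattern from this post, with the square brackets escaped.
PATTERN = re.compile(
    r"\[\[File:.+((Basic)|((?<!Un)Common)|(Uncommon)|((?<!Super)Rare)"
    r"|(SuperRare)|(Legendary))\.png\|(\d+px)\|link=(.+)\]\]"
)


def to_itempic(match):
    """Build the {{ItemPic|...}} text for one matched image link (hypothetical helper)."""
    # Groups 2..7 are the six suffix branches; they map to the digits 0..5.
    for digit, group in enumerate(range(2, 8)):
        if match.group(group) is not None:
            return "{{ItemPic|%s|%d||%s}}" % (match.group(9), digit, match.group(8))
    return match.group(0)  # not reached: the pattern always requires one suffix


def convert(text):
    """Replace every matching link in text; lines with an empty link= are left alone."""
    return PATTERN.sub(to_itempic, text)


print(convert("[[File:ImageNameRare.png|50px|link=Page Name]]"))
```

    Calling convert() on the whole page text should leave the link=]] lines untouched, because (.+) needs at least one character before the closing brackets.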
    

    If you are interested and need help understanding what it is doing let us know.

    Cheers
    Claudia



  • If you are interested and need help understanding what it is doing let us know.

    If you want to explain it, I’ll definitely be interested. I’m not sure how much I’ll retain. I’ve tried learning RegEx before, but I give up after a few minutes of staring at the seemingly random mix of symbols.



  • @Ksomeone-Msomeone

    \[\[File:.+((Basic)|((?<!Un)Common)|(Uncommon)|((?<!Super)Rare)|(SuperRare)|(Legendary))\.png\|(\d+px)\|link=(.+)\]\]

    Let's divide it

    \[\[File:
    we need to escape each square bracket [ by writing \[ because the regex engine uses [ for its own syntax too.

    .+
    whatever data

    ((Basic)|((?<!Un)Common)|(Uncommon)|((?<!Super)Rare)|(SuperRare)|(Legendary))
    this is basically an alternation list, meaning we are looking for a match of one of the listed words
    (?<!Un)Common - this means Common should not be prefixed by Un
    (?<!Super)Rare - guess what, yes, Rare should not be prefixed by Super

    \.png
    escape the dot as it has a special meaning in regex too
    \|
    same is true for the pipe sign
    (\d+px)
    match any number of digits, at least one, followed by the literal px
    \|
    again escaping pipe
    link=(.+)
    followed by string link= and whatever follows should be matched until
    \]\]
    the closing square brackets, again escaped.

    This results in 9 match groups which can be reused by using $NUMBER or (?{NUMBER}) syntax
    6 matches are reserved by the alternation list and those are the ones which
    get assigned to match numbers 2-7, match group 1 is not of interest for us (not even sure why it was needed to be defined),
    match group 8 is what is returned by (\d+px) and match group 9 what is
    matched after link=.

    Creating the string

    {{ItemPic|$9|(?{2}0)(?{3}1)(?{4}2)(?{5}3)(?{6}4)(?{7}5)||$8}}

    $9 = match group 9
    (?{2}0) = what matched in match group 2 should be replaced by 0 IF it was found ELSE nothing
    (?{3}1) = what matched in match group 3 should be replaced by 1 IF it was found ELSE nothing

    $8 = match group 8

    the rest

    {{ItemPic| | || }}

    is what you wanted to build.
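
    If the group numbering is still hard to picture, a small Python sketch can print what each of the 9 groups holds for one sample line. This is just an illustration, assuming Python's re module numbers groups the same way (by the position of the opening parenthesis); the variable names are made up.

```python
import re

pattern = re.compile(
    r"\[\[File:.+((Basic)|((?<!Un)Common)|(Uncommon)|((?<!Super)Rare)"
    r"|(SuperRare)|(Legendary))\.png\|(\d+px)\|link=(.+)\]\]"
)

match = pattern.search("[[File:ImageNameRare.png|50px|link=Page Name]]")

# Groups are numbered by the position of their opening parenthesis.
# Branches of the alternation that took no part in the match come back as None.
groups = {number: match.group(number) for number in range(1, 10)}
for number, value in groups.items():
    print(number, repr(value))
```

    For this line only groups 1, 5, 8 and 9 hold text. Group 1 is the outer wrapper around the whole alternation, group 5 is the Rare branch, which is why (?{5}3) writes a 3.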

    Hope this makes sense to you.

    Cheers
    Claudia



  • Thank you very much. I was able to figure out how the matching expression worked after a time, but I couldn’t figure out the replacement expression. All I was able to figure out is that there were 9 groupings and I could see where each was being used, but I wasn’t able to figure out how the 9 groupings were divided. I also wasn’t exactly sure how the IF ELSE syntax worked. Heck, I didn’t even know it was an IF/ELSE statement.

    This should help me immensely in the future. I’m an admin and an every day contributor to a wiki with almost 1000 unique articles, hundreds more pages and templates and I’ve been wanting to do some mass updates like this for a long time.

    Thank you again.



  • Hello, @ksomeone-msomeone, @claudia-frank and All

    Interesting S/R, indeed ! BTW, this post is quite long, so grab a drink, or read it in several sittings ;-))

    Claudia, you’ve already explained fully these regexes, while I was thinking about a solution, too ! Anyway, @ksomeone-msomeone, will also get additional information from my post !


    Regarding your initial text :

    [[File:ImageNameRare.png|50px|link=Page Name]]
    
    • I assumed that you don’t care about the first part File:ImageName, after the two opening square brackets

    • I think that you don’t care, also, about the second part .png|50px|link, coming after the suffixes Basic, Rare or else !

    • And, as you said, only the Page Name, after the = sign and before the two ending square brackets, must be rewritten, during the replacement process

    • Finally, if the line does not contain any Page name, and ends as link=]], NO replacement must occur !


    Now, regarding the final expected text :

    {{ItemPic|Page Name|3||50px}}
    
    • I assumed that the part ItemPic| is a simple literal string

    • Then, the Page name, memorized from the initial text, must be rewritten, followed by a | symbol

    • Now, as you explained, it should rewrite a single digit, from 0 to 5, according to the 6 corresponding states Basic , Common, Uncommon , Rare , SuperRare and Legendary

    • Finally, we just write the literal string ||50px, after the one-digit, as you said that all images are 50px, anyway !

    • Note that if the line ends with link=]], after replacement, it will be unchanged, and keep the square brackets, instead of the curly braces !


    So, from the hypotheses, above, I built the following regex S/R :

    SEARCH (?-si)\[\[.+?(?:(Basic)|(Uncommon)|(Common)|(SuperRare)|(Rare)|(Legendary)).+=(.+)\]\]

    REPLACE {{ItemPic|$7|(?{1}0)(?{2}2)(?{3}1)(?{4}4)(?{5}3)(?{6}5)||50px}}

    And, given Claudia’s example text, below :

    [[File:ImageNameBasic.png|50px|link=Page Name]]
    [[File:ImageNameCommon.png|50px|link=Page Name]]
    [[File:ImageNameUncommon.png|50px|link=Page Name]]
    [[File:ImageNameRare.png|50px|link=Page Name]]
    [[File:ImageNameSuperRare.png|50px|link=Page Name]]
    [[File:ImageNameLegendary.png|50px|link=Page Name]]
    [[File:ImageName|50px|link=]]
    

    after performing this S/R, this text turns into :

    {{ItemPic|Page Name|0||50px}}
    {{ItemPic|Page Name|1||50px}}
    {{ItemPic|Page Name|2||50px}}
    {{ItemPic|Page Name|3||50px}}
    {{ItemPic|Page Name|4||50px}}
    {{ItemPic|Page Name|5||50px}}
    [[File:ImageName|50px|link=]]
    

    Before explaining this regex S/R, it should be useful to remember some rules !

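    • When an alternation lists two words, one of which is part of the other, as for example Rare and SuperRare, the safe syntax is : SuperRare|Rare, with the longest word as the first branch ( and NOT Rare|SuperRare ). Strictly speaking, the order only changes the result when the short word is a prefix of the long one, as with Rare|RareSuper, where the first branch matches and the second one is never reached. Here, Rare is a suffix, so the order is mostly a good habit, and the real trap is the quantifier placed before the alternation, as explained further down !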

    • This explains the location of the 6 states : (Basic)|(Uncommon)|(Common)|(SuperRare)|(Rare)|(Legendary). With this syntax, Claudia, no need to use look-around at all. Of course, the order becomes 0 2 1 4 3 5 ;-))

    • These 6 states are surrounded by the syntax (?:.......), which is a non-capturing group. So, group (Basic) is group 1, group (Uncommon) is group 2 and so on … ( BTW, you write Uncommon and NOT UnCommon, whereas you use SuperRare ?!)

    • But I came across a problem : Let’s imagine the simple regex S/R

    SEARCH (?-si)^(.+)(SuperRare|Rare), with a case sensitive search

    REPLACE >\1< and >\2<

    Then, this regex would never find the first alternative, with the SuperRare word, just because the .+ would first catch the greatest range of characters and then backtrack only as far as needed to match the Rare string !! So the text :

    ImageNameSuperRare
    
    ImageNameRare
    

    would give :

    >ImageNameSuper< and >Rare<
    
    >ImageName< and >Rare<
    

    Now, if we replace the greedy quantifier + ( meaning {1,}, BTW ) with the lazy form +?, this time, it would catch, first, the shortest range of characters, before matching one of the two branches of the alternative. So it does match the SuperRare string and, after replacement, we get the expected text :

    >ImageName< and >SuperRare<
    
    >ImageName< and >Rare<
    

    BTW, do you guess why the similar search regex (?-si)^(.+)(Uncommon|Common) does not need the lazy quantifier ?.. Just because the search is case sensitive and there is no ambiguity between the string Uncommon and the string Common ;-))
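
    The greedy, lazy and case-sensitivity behaviours described above can be checked quickly outside Notepad++. Here is a small Python sketch, assuming Python's re module backtracks like the Boost engine for these simple patterns; the variable names are only for illustration.

```python
import re

# Greedy .+ grabs as much as possible, then gives back just enough for "Rare".
greedy = re.match(r"(.+)(SuperRare|Rare)", "ImageNameSuperRare")
# Lazy .+? grows one character at a time, so "SuperRare" is reached first.
lazy = re.match(r"(.+?)(SuperRare|Rare)", "ImageNameSuperRare")

# Case-sensitive: "Common" cannot match inside "Uncommon" (lowercase c).
sensitive = re.match(r"(.+)(Uncommon|Common)", "ImageNameUncommon")
# Case-insensitive: the greedy .+ now stops at "common" and splits the word.
insensitive = re.match(r"(.+)(Uncommon|Common)", "ImageNameUncommon", re.IGNORECASE)

print(greedy.groups())       # ('ImageNameSuper', 'Rare')
print(lazy.groups())         # ('ImageName', 'SuperRare')
print(sensitive.groups())    # ('ImageName', 'Uncommon')
print(insensitive.groups())  # ('ImageNameUn', 'common')
```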


    Notes, on the regex expressions :

    • As usual, the two modifiers (?-si), at beginning of the search regex :

      • Forces the regex engine to do a case-sensitive search, ( -i ) meaning NON-Insensitive

      • Tell the regex engine to consider the special dot character ( . ) as standing for a single standard character, and not line-break ones !

    • Then, \[\[ matches two opening square brackets. Note that these special characters must be escaped to be considered as literal characters

    • Now, the part .+? represents the shortest range of any standard character, till the states Basic, Common, …

    • The syntax (?:(Basic)|(Uncommon)|(Common)|(SuperRare)|(Rare)|(Legendary)) stands for an alternative, in a non-capturing group, with 6 branches, stored as groups from 1 to 6

    • Then, the part .+=, matches the range of characters, till the unique = sign, which does not need to be stored

    • Finally, the syntax (.+)\]\] represents the Page Name, stored as group 7, followed by the two ending square brackets, which, again, must be escaped

    • In replacement, we, first, rewrite the literal string {{ItemPic|, followed by the Page Name ( $7 ) and a | symbol

    • Then, the part (?{1}0)(?{2}2)(?{3}1)(?{4}4)(?{5}3)(?{6}5) are juxtaposed conditional replacements, in any order, of the general form (?{x}y). It means : IF group x is matched, in search, rewrite, in replacement, the expression y. In our case, note that y is, simply, a digit number !

    • Finally, the part ||50px}} is the literal string, which has to be rewritten, on any line !
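
    As a cross-check of the notes above, the same S/R can be sketched in Python. This is only an illustration: Python's re module is case sensitive and keeps the dot off line-breaks by default, which stands in for (?-si), and since it lacks the (?{x}y) conditional replacement, a lookup table ( the hypothetical DIGIT_FOR_GROUP ) plays that role.

```python
import re

PATTERN = re.compile(
    r"\[\[.+?(?:(Basic)|(Uncommon)|(Common)|(SuperRare)|(Rare)|(Legendary)).+=(.+)\]\]"
)

# Digit written for each of the groups 1..6,
# mirroring (?{1}0)(?{2}2)(?{3}1)(?{4}4)(?{5}3)(?{6}5).
DIGIT_FOR_GROUP = {1: 0, 2: 2, 3: 1, 4: 4, 5: 3, 6: 5}


def replace(match):
    """Return the ItemPic text for one match (hypothetical helper)."""
    for group, digit in DIGIT_FOR_GROUP.items():
        if match.group(group) is not None:
            # Group 7 is the Page Name; the size is the literal 50px.
            return "{{ItemPic|%s|%d||50px}}" % (match.group(7), digit)
    return match.group(0)  # not reached: one branch always matches


sample = "\n".join([
    "[[File:ImageNameUncommon.png|50px|link=Page Name]]",
    "[[File:ImageNameSuperRare.png|50px|link=Page Name]]",
    "[[File:ImageName|50px|link=]]",
])
result = PATTERN.sub(replace, sample)
print(result)
```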

    Best Regards,

    guy038

    P.S. :

    For noob people, about regular expressions concept and syntax, begin with that article, in N++ Wiki :

    http://docs.notepad-plus-plus.org/index.php/Regular_Expressions

    In addition, you’ll find good documentation, about the Boost C++ Regex library, v1.55.0 ( similar to the PERL Regular Common Expressions, v5.8 ), used by Notepad++, since its 6.0 version, at the TWO addresses below :

    http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

    http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/format/boost_format_syntax.html

    • The FIRST link explains the syntax, of regular expressions, in the SEARCH part

    • The SECOND link explains the syntax, of regular expressions, in the REPLACEMENT part


    You may, also, look for valuable information, on the sites, below :

    http://www.regular-expressions.info

    http://www.rexegg.com

    http://perldoc.perl.org/perlre.html

    Be aware that, as any documentation, it may contain some errors ! Anyway, if you detected one, that’s good news : you’re improving ;-))



  • @guy038 said:

    Regarding your initial text :

    [[File:ImageNameRare.png|50px|link=Page Name]]
    
    • I assumed that you don’t care about the first part File:ImageName, after the two opening square brackets

    • I think that you don’t care, also, about the second part .png|50px|link, coming after the suffixes Basic, Rare or else !

    • And, as you said, only the Page Name, after the = sign and before the two ending square brackets, must be rewritten, during the replacement process

    • Finally, if the line does not contain any Page name, and ends as link=]], NO replacement must occur !


    Now, regarding the final expected text :

    {{ItemPic|Page Name|3||50px}}
    
    • I assumed that the part ItemPic| is a simple literal string

    • Then, the Page name, memorized from the initial text, must be rewritten, followed by a | symbol

    • Now, as you explained, it should rewrite a single digit, from 0 to 5, according to the 6 corresponding states Basic , Common, Uncommon , Rare , SuperRare and Legendary

    • Finally, we just write the literal string ||50px, after the one-digit, as you said that all images are 50px, anyway !

    • Note that if the line ends with link=]], after replacement, it will be unchanged, and keep the square brackets, instead of the curly braces !

    You got it exactly right. Thank you for the detailed explanation. I’ll definitely be bookmarking this post so that I can come back and read this again if I need help with something like this.



  • @guy038

    I was close, wasn’t I? :-D

    thx for the detailed information and improvements - always a pleasure :-)

    Cheers
    Claudia



  • @Claudia-Frank said:

    @guy038

    I was close, wasn’t I? :-D

    thx for the detailed information and improvements - always a pleasure :-)

    Cheers
    Claudia

    Yours was the one I used before guy038 made his post and it worked perfectly, so I would say you were more than close. :)

