Regex pattern needed

  • Hello,

    I am looking for a way to replace all lines of the form:
    https://www.server.example/path/something%20with%20spaces
    with:
    <a href="https://www.server.example/path/something%20with%20spaces">something with spaces</a>
    using a single search/replace operation.

    Currently I do it with one operation that transforms the link to an <a href...> format, and another that replaces a %20 with a space after the closing angle bracket. The second one has to be repeated several times until all the %20 instances are replaced.

    Please assist.

  • @alexolog,

    I would approach it as a two-step process.

    1. convert https://www.server.example/path/something%20with%20spaces to <a href="https://www.server.example/path/something%20with%20spaces">something%20with%20spaces</a> – because that’s a pretty easy regex
    2. convert the >something%20with%20spaces</a> to >something with spaces</a>

    I would do this because I assume that some of your URLs might have one %20, some might have two %20, and some might have more (or none). And coding a regex for all those edge cases is fragile. OTOH, if I can just search for a URL and break it into two pieces, that’s easy.

    1. FIND = (?-s)^(https?://\S*/)([^"\s/]*)$
      REPLACE = <a href="$1$2">$2</a>
      MODE = Regular expression
    <a href="https://www.server.example/path/something%20with%20spaces">something%20with%20spaces</a>
    <a href="https://www.different.example/path/one%20space">one%20space</a>
    <a href="https://www.third.example/path/spaceless">spaceless</a>
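
    For anyone who wants to check step 1 outside the editor, here is a minimal Python sketch of the same capture-group idea. It is not the Notepad++ operation itself: the function name wrap_urls and the sample text are made up for illustration, the leading (?-s) is dropped because the pattern contains no "." and Python's re has no standalone form of that switch, and re.MULTILINE is added so that ^ and $ match at line boundaries.

```python
import re

# Same two capture groups as the step 1 FIND expression:
#   group 1 = everything up to and including the last "/"
#   group 2 = the final path segment, reused as the link text
URL_LINE = re.compile(r'^(https?://\S*/)([^"\s/]*)$', re.MULTILINE)

def wrap_urls(text: str) -> str:
    """Wrap each bare URL line in an <a href> element (link text stays encoded)."""
    return URL_LINE.sub(r'<a href="\1\2">\2</a>', text)

sample = "\n".join([
    "https://www.server.example/path/something%20with%20spaces",
    "https://www.different.example/path/one%20space",
    "https://www.third.example/path/spaceless",
])

wrapped = wrap_urls(sample)
print(wrapped)
```

    The link text still contains %20 at this point, which is what step 2 is for.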

    2. For this one, I would use @guy038’s generic “change data, but only between start and end markers” regex from this post:
    * Generic = (?-i:BSR|(?!\A)\G)(?s:(?!ESR).)*?\K(?-i:FR)
    * BSR = > (for the end of the <a href="...">)
    * ESR = </a>
    * FR = %20
    * RR = \x20 (or a literal space)
    * => FIND = (?-i:>|(?!\A)\G)(?s:(?!</a>).)*?\K(?-i:%20)
    REPLACE = \x20 (or a literal space)
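
    The substitution of BSR, ESR and FR into the generic template can be sketched as plain string building. This is only an illustration in Python: build_find is a hypothetical helper name, and the result is meant to be pasted into the editor's Find box, since Python's built-in re module does not support \G or \K and cannot run the expression itself.

```python
def build_find(bsr: str, esr: str, fr: str) -> str:
    """Fill the generic template with the three pieces.

    bsr = regex that opens a region, esr = regex that closes it,
    fr  = regex for the text to replace inside the region.
    """
    # Raw f-string: the backslashes in \A, \G and \K are kept literally.
    return rf"(?-i:{bsr}|(?!\A)\G)(?s:(?!{esr}).)*?\K(?-i:{fr})"

# The values chosen above: BSR = ">", ESR = "</a>", FR = "%20"
find_expr = build_find(">", "</a>", "%20")
print(find_expr)
```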

    Unfortunately, when I did that, my test data became

    <a href="https://www.server.example/path/something%20with%20spaces">something with spaces</a>
    <a href="https://www.different.example/path/one space">one space</a>
    <a href="https://www.third.example/path/spaceless">spaceless</a>

    … and you can see that it replaced a %20 that was inside the href portion, I think because I used such a small BSR expression. My attempt at fixing it with a more specific BSR = <a[^\s>]*> reported that it couldn’t find a match at all. I have to focus on my day job a bit more today, so I cannot continue debugging, but this is the path I’d follow.
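
    One way to see where that leak comes from is to list the regions that a bare > opens. The sketch below is an emulation in Python, not the \G expression itself (the built-in re module has neither \G nor \K): it assumes a region is "a > plus everything up to, but not including, the next </a>", and the sample text is the step 1 output.

```python
import re

text = "\n".join([
    '<a href="https://www.server.example/path/something%20with%20spaces">something%20with%20spaces</a>',
    '<a href="https://www.different.example/path/one%20space">one%20space</a>',
    '<a href="https://www.third.example/path/spaceless">spaceless</a>',
])

# Capture what follows each ">" until the next "</a>"; DOTALL lets "." cross
# line breaks, mirroring the (?s:...) group in the generic expression.
regions = re.findall(r'>((?:(?!</a>).)*)', text, flags=re.DOTALL)

for number, region in enumerate(regions, start=1):
    print(number, repr(region))
```

    In this emulation the second region starts at the > that ends the first </a>, so it runs across the line break and through the whole href of the second link, which would explain why the %20 inside that href was replaced too.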

    Maybe @guy038 will have time to tell us what I did wrong, or come up with a better BSR to keep the find-region out of the href value. Or maybe I will find some time this evening.

  • @PeterJones said in Regex pattern needed:

    Or maybe I will find some time this evening.

    Well, it was the next day, but…

    My mistake in yesterday’s modified BSR = <a[^\s>]*> was including \s in the negated character class, which meant the tag had to be <a...> without any whitespace, and that obviously cannot match <a href="...">. Once I realized that, it was easy to fix.

    • BSR = <a[^>]*>
    • ESR = </a>
    • FR = %20
    • RR = \x20 (or literal space)
    • FIND = (?-i:<a[^>]*>|(?!\A)\G)(?s:(?!</a>).)*?\K(?-i:%20)
    • REPLACE = \x20
    • Final Transformation of my previous data:
      <a href="https://www.server.example/path/something%20with%20spaces">something with spaces</a>
      <a href="https://www.different.example/path/one%20space">one space</a>
      <a href="https://www.third.example/path/spaceless">spaceless</a>
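
    The same "only between <a ...> and </a>" restriction can be cross-checked in Python with a replacement callback. This is a sketch under assumptions, not the Notepad++ expression: decode_link_text is a made-up name, and because the built-in re module lacks \G and \K, the opening tag and the link text are captured as two groups and only the second one is changed.

```python
import re

# group 1 = the opening tag (the corrected BSR), left untouched
# group 2 = everything up to, but not including, the next </a> (the ESR)
LINK = re.compile(r'(<a[^>]*>)((?:(?!</a>).)*)', re.DOTALL)

def decode_link_text(text: str) -> str:
    """Replace %20 with a space in link text only, never inside the tag."""
    return LINK.sub(lambda m: m.group(1) + m.group(2).replace("%20", " "), text)

before = "\n".join([
    '<a href="https://www.server.example/path/something%20with%20spaces">something%20with%20spaces</a>',
    '<a href="https://www.different.example/path/one%20space">one%20space</a>',
    '<a href="https://www.third.example/path/spaceless">spaceless</a>',
])

after = decode_link_text(before)
print(after)
```

    Because a region can now only start at a complete opening tag, the > that ends </a> no longer opens one, and the href values keep their %20.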
