Need help, please - regular expressions

14 Posts 3 Posters 7.3k Views
  • J
    Jan Kowalsky
    last edited by Dec 9, 2015, 7:27 AM

    I can't get this to work.
    Example source:

    <img lorem-ipsum-dolor=“/lorem/ipsum/dolor-2015-and/123456789012345/lorem_ipsum/1a2b3c4dd9651/a23w34m87.jpg” class=“lorem123” src=“11LOREM%202%20%20IpSuM%20Dolor%20sit%20%20amet%20consecteur%20-%20AdiPISCIng%20123456%20elit%20Curabitur%20QWERTY%20202020%20yes%20urna%20Interdeum%20%20Off%20Cras_files/a01b02c68.png” alt=“a01b02c68.bmp” height=“101” width=“102”>

    regex search:

    (?:<img lorem-ipsum-dolor)(?:.?)( class=|)(".?“)( src=)(?:.?)(?:_files/)(.?.[jpg|png|tif|gif]”)

    replace:

    <img\1\2\3"\4

    and I get this:

    <img"/lorem/ipsum/dolor-2015-and/123456789012345/lorem_ipsum/1a2b3c4dd9651/a23w34m87.jpg" class=“lorem123” src=“a01b02c68.png” alt=“a01b02c68.bmp” height=“101” width=“102”>

    This part should not be there:

    “/lorem/ipsum/dolor-2015-and/123456789012345/lorem_ipsum/1a2b3c4dd9651/a23w34m87.jpg”

    It should look like:

    <img class=“lorem123” src=“a01b02c68.png” alt=“a01b02c68.bmp” height=“101” width=“102”>

    What am I doing wrong?

    • J
      Jan Kowalsky
      last edited by Dec 9, 2015, 10:04 AM

      The forum stripped the asterisks from the regex in my first post, sorry.

      Regex search:
      (?:<img lorem-ipsum-dolor)(?:.*?)( class=|)(“.*?”)( src=)(?:.*?)(?:_files/)(.*?.[jpg|png|tif|gif]")

      • D
        dail
        last edited by Dec 9, 2015, 2:27 PM

        Very close. You had one minor issue. This:

        ( class=|)

        Should be

        ( class=)

        • D
          dail
          last edited by Dec 9, 2015, 2:33 PM

          Also, take a look at your original expression here. You can see that the regular expression allows it to skip class=.

          Note: that website might not use the exact same regular expression engine but should be close enough to reference.

          • J
            Jan Kowalsky
            last edited by Dec 9, 2015, 3:25 PM

            Shortening my example has backfired again: I wanted to keep it brief and it came out badly.
            The expression also has to match this:

            <img lorem-ipsum-dolor=“/lorem/ipsum/dolor-1999-and/123456789012345/lorem_ipsum/1a2b3c4dd9651/c011XX001.tif” src=“11LOREM%202%20%20IpSuM%20Dolor%20sit%20%20amet%20consecteur%20-%20AdiPISCIng%20123456%20elit%20Curabitur%20QWERTY%20202020%20yes%20urna%20Interdeum%20%20Off%20Cras_files/c01vv0x01.jpg” alt=“c01vv0x01.jpeg” height=“567” width=“789”>

            IMHO, ( class=|) means "with it or without it". I emphasize IMHO.
            Without it, the second expression is not checked, because | means OR, right?
            At least that is how it looks to me, sorry…

            • D
              dail
              last edited by dail Dec 9, 2015, 3:48 PM Dec 9, 2015, 3:44 PM

              You are right. ( class=|) can skip it. If you look at the image I linked, if it skips group 1, it must still “consume” something for group 2. So you would want a regular expression that only captures group 2 if 1 exists.

              So this section in your expression:

              ( class=|)(".*?")

              Should be replaced with something like:

              (?:( class=)(".*?"))?

              Note this isn’t perfect but should point you in the right direction.
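The difference between the two forms can be checked outside Notepad++. Below is a minimal sketch in Python; its `re` module is not the engine Notepad++ uses, so treat it as an approximation. The sample line is a shortened, hypothetical stand-in for the one in the first post, with straight quotes, and the patterns are trimmed to the part under discussion.

```python
import re

# Shortened stand-in for the sample line (an assumption, not the original data).
LINE = ('<img lorem-ipsum-dolor="/lorem/a23w34m87.jpg" class="lorem123" '
        'src="11LOREM_files/a01b02c68.png" alt="a01b02c68.bmp">')

# Original idea: group 1 may be empty, but group 2 is always required.
EMPTY_ALT = re.compile(r'<img lorem-ipsum-dolor.*?( class=|)(".*?")( src=)')

# Suggested change: groups 1 and 2 are optional together.
OPTIONAL_PAIR = re.compile(r'<img lorem-ipsum-dolor.*?(?:( class=)(".*?"))?( src=)')


def show_groups(pattern, text):
    """Return the captured groups of the first match, or None if there is none."""
    match = pattern.search(text)
    return match.groups() if match else None


empty_alt_groups = show_groups(EMPTY_ALT, LINE)
optional_pair_groups = show_groups(OPTIONAL_PAIR, LINE)
print(empty_alt_groups)      # group 1 is empty, group 2 swallows the first attribute value
print(optional_pair_groups)  # group 1 and group 2 hold only the class attribute
```

With the empty alternative, the engine takes the empty branch at the first quote it meets, so group 2 starts at the lorem-ipsum-dolor value and stretches up to the src attribute. That is where the unwanted path in the replacement comes from.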

              • J
                Jan Kowalsky
                last edited by Dec 9, 2015, 4:04 PM

                In the picture it looks great. In practice, that part is still left at the front :(

                • J
                  Jan Kowalsky
                  last edited by Jan Kowalsky Dec 9, 2015, 5:59 PM Dec 9, 2015, 5:57 PM

                  (?:<img lorem-ipsum-dolor)(?:.*?)(?>( class=|)(“.*?”))?(“.*?”)( src=)(?:.*?)(?:_files/)(.*?[jpg|png|tif|gif]")

                  With an atomic group, it works.
                  Thanks for pointing me in the right direction!

                  • J
                    Jan Kowalsky
                    last edited by Dec 9, 2015, 6:32 PM

                    I’m sorry, I was blind: it doesn’t work.

                    • J
                      Jan Kowalsky
                      last edited by Dec 10, 2015, 9:55 AM

                      Finally
                      <code>(?:<img lorem-ipsum-dolor)(?:.*?)(?:( class=)(“.*?”)( src=)|( src=))(?:.*?)(?:_files/)(.*?[jpg|png|tif|gif]")</code>

                      P.S. I don’t know why mine didn’t show up in red :)

                      • G
                        guy038
                        last edited by guy038 Dec 10, 2015, 11:06 PM Dec 10, 2015, 9:58 PM

                        Hello Jan and Dail,

                        Jan, I haven’t tried to fully understand your regex S/R yet. I just noticed two points:

                        • After copying your example source into my Notepad++, the delimiters of the different attribute values ( class, src, alt, height and width ) are the pair “.....”, that is to say the LEFT DOUBLE QUOTATION MARK ( Unicode code point \x{201c} ) and the RIGHT DOUBLE QUOTATION MARK ( Unicode code point \x{201d} ). These characters are different from the usual QUOTATION MARK " ( \x22 )

                        Therefore, the regex proposed below is based on these two characters \x{201c} and \x{201d}

                        • Seemingly, your picture files can have the .jpg, .png, .tiff or .gif extension. However, the regex you use to match these extensions ( [jpg|png|tif|gif] ) is WRONG, because the | symbol is taken literally between square brackets! This syntax is a single character class, which matches exactly one character: either the pipe symbol ( | ) or one of the letters j, p, g, n, t, i, f, whatever their case. In other words, this part of your regex could simply be rewritten as [fgijnpt|]

                        So, the correct regex is simply (jpg|png|tiff|gif) : one extension among the four possible ones!
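The character-class point is easy to verify. Here is a small sketch in Python, used only as a stand-in since its `re` module is not the Notepad++ engine. It uses the extension list from Jan's pattern ( tif rather than tiff ), and the file names are taken from the sample line.

```python
import re

CHAR_CLASS = re.compile(r'[jpg|png|tif|gif]')     # one character out of a set
ALTERNATION = re.compile(r'(?:jpg|png|tif|gif)')  # one whole extension

# Collect every lowercase letter (plus the pipe) that the class accepts.
class_members = ''.join(
    c for c in 'abcdefghijklmnopqrstuvwxyz|' if CHAR_CLASS.fullmatch(c)
)
print(class_members)

# The class never matches a full extension, only a single character.
class_matches_jpg = CHAR_CLASS.fullmatch('jpg') is not None
class_matches_pipe = CHAR_CLASS.fullmatch('|') is not None
alternation_matches_jpg = ALTERNATION.fullmatch('jpg') is not None

# Tail of the search regex: loose (character class) versus strict (alternation).
LOOSE_TAIL = re.compile(r'.*?.[jpg|png|tif|gif]"')
STRICT_TAIL = re.compile(r'.*?\.(?:jpg|png|tif|gif)"')

loose_accepts_bmp = LOOSE_TAIL.fullmatch('a01b02c68.bmp"') is not None
strict_accepts_bmp = STRICT_TAIL.fullmatch('a01b02c68.bmp"') is not None
strict_accepts_png = STRICT_TAIL.fullmatch('a01b02c68.png"') is not None
print(loose_accepts_bmp, strict_accepts_bmp, strict_accepts_png)
```

The loose tail happens to work on names ending in .png or .jpg because their last letter is in the set, but for the same reason it also accepts a name ending in .bmp.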


                        Then, I propose the following regex S/R, below :

                        SEARCH (?i-s).*?(class=“.*?” src=“).*?_files/(.*?(jpg|png|tiff|gif)) ( with a space before the src attribute )

                        REPLACE <img \L\1\2

                        Notes :

                        • The modifiers (?i-s) make the match case-insensitive and make the dot match standard characters only, not line breaks. In the replacement, the two groups \1 and \2 are rewritten in lower case, due to the \L syntax

                        • Each of the four .*? forms matches the shortest possible run of characters before the string that follows it

                        • All matched text that is NOT located between round brackets, such as everything on the line before the first string class, is therefore deleted by the replacement

                        • The group \1 is the string class=“…” src=“ and the group \2 is the name of the picture, with its extension. Both are rewritten in lower case, after an initial <img string.

                        If you really need the match to begin with the string <img lorem-ipsum-dolor, just change the search regex into:

                        SEARCH (?i-s)<img lorem-ipsum-dolor.*?(class=“.*?” src=“).*?_files/(.*?(jpg|png|tiff|gif))

                        Best Regards,

                        guy038

                        • J
                          Jan Kowalsky
                          last edited by Dec 11, 2015, 4:48 PM

                          Thank you very much for the analysis. I know the wrong brackets have no right to work, but in this specific example they do.
                          Example:
                          <img lorem-ipsum-dolor="/lorem/ipsum/dolor-2015-and/123456789012345/lorem_ipsum/1a2b3c4dd9651/a23w34m87.jpg" class="lorem123" src="11LOREM%202%20%20IpSuM%20Dolor%20sit%20%20amet%20consecteur%20-%20AdiPISCIng%20123456%20elit%20Curabitur%20QWERTY%20202020%20yes%20urna%20Interdeum%20%20Off%20Cras_files/a01b02c68.png" alt="a01b02c68.bmp" height="101" width="102">

                          <img lorem-ipsum-dolor="/lorem/ipsum/dolor-1999-and/123456789012345/lorem_ipsum/1a2b3c4dd9651/c011XX001.tif" src="11LOREM%202%20%20IpSuM%20Dolor%20sit%20%20amet%20consecteur%20-%20AdiPISCIng%20123456%20elit%20Curabitur%20QWERTY%20202020%20yes%20urna%20Interdeum%20%20Off%20Cras_files/c01vv0x01.jpg" alt="c01vv0x01.jpeg" height="567" width="789">

                          Regex:
                          (?:<img lorem-ipsum-dolor)(?:.*?)(?:( class=)(".*?")( src=)|( src=))(?:.*?)(?:_files\/)(.*?[jpg|png|tif|gif]")
                          replace:
                          <img\1\2\3\4"\5

                          After changing to the correct brackets, it also works:
                          (?:<img lorem-ipsum-dolor)(?:.*?)(?:( class=)(".*?")( src=)|( src=))(?:.*?)(?:_files\/)(.*?(jpg|png|tif|gif)")
                          but then there are 6 groups; the sixth simply does not need to be referenced in the replacement.
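Both variants can be compared with a short Python sketch. This is an approximation only: Python's `re` is not the Notepad++ engine, the two sample lines are shortened, hypothetical versions of the ones above, and the code relies on Python 3.5 or later, where a group that did not take part in the match is replaced by an empty string.

```python
import re

# Shortened stand-ins for the two sample lines (assumptions, not the original data).
SAMPLES = [
    '<img lorem-ipsum-dolor="/lorem/a23w34m87.jpg" class="lorem123" '
    'src="11LOREM_files/a01b02c68.png" alt="a01b02c68.bmp" height="101" width="102">',
    '<img lorem-ipsum-dolor="/lorem/c011XX001.tif" '
    'src="11LOREM_files/c01vv0x01.jpg" alt="c01vv0x01.jpeg" height="567" width="789">',
]

# Shared part: either " class=", its value and " src=", or " src=" alone.
PREFIX = (r'(?:<img lorem-ipsum-dolor)(?:.*?)'
          r'(?:( class=)(".*?")( src=)|( src=))(?:.*?)(?:_files\/)')

FIVE_GROUPS = re.compile(PREFIX + r'(.*?[jpg|png|tif|gif]")')  # square brackets
SIX_GROUPS = re.compile(PREFIX + r'(.*?(jpg|png|tif|gif)")')   # round brackets

# Same replacement for both; group 6 is simply never referenced.
REPLACEMENT = r'<img\1\2\3\4"\5'

results_five = [FIVE_GROUPS.sub(REPLACEMENT, line) for line in SAMPLES]
results_six = [SIX_GROUPS.sub(REPLACEMENT, line) for line in SAMPLES]

print(FIVE_GROUPS.groups, SIX_GROUPS.groups)
for line in results_six:
    print(line)
```

On these two lines both patterns give the same output; they would differ on a file name whose extension merely ends in one of the letters from the square-bracket set.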

                          With all due respect, your regex, correct as it may be, is not working.

                          Very sorry for my English, I am still learning.

                          • G
                            guy038
                            last edited by guy038 Dec 11, 2015, 8:13 PM Dec 11, 2015, 8:11 PM

                            Hi Jan,

                            OK. I now understand two main points about your problem:

                            • Firstly, the values of the different attributes are surrounded by the usual quotation mark ( " ), of Unicode code point \x{0022}. Of course, my previous regex, based on the two delimiters \x{201c} and \x{201d}, COULDN’T work at all!

                            • Secondly, the attribute class="........" may sometimes be absent from a line. Again, my previous regex supposed that this attribute was always present :-((

                            So, aware of the two facts, above, my new proposed regex is :

                            SEARCH (?i-s)<img lorem-ipsum-dolor.*?((?:class=".*?" )?src=").*?_files/(.*?(jpg|png|tiff|gif))

                            REPLACE <img \L\1\2

                            After running your S/R and mine, both give the same results :-)) Nice!


                            NOTES : Compared to my previous try :

                            • I replaced the special delimiters “.....” with the usual ones ".....", in the search regex

                            • I added a new non-capturing group (?:class=".*?" )?, which may be present or NOT, due to the final question mark ?

                            • There is a space at the end of the non-capturing group, before the closing round bracket

                            • The replacement regex has NOT changed
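For anyone who wants to try this S/R outside Notepad++, here is a rough Python equivalent. It is an adaptation, not the same engine: Python does not accept the leading (?i-s) form, so case-insensitivity is passed as a flag ( the dot already skips line breaks by default ), and since Python replacements have no \L, the lowercasing is done in a small function. The sample lines are shortened, hypothetical versions of Jan's, and the third one is invented only to show the lowercasing.

```python
import re

# Search regex from the post, without the (?i-s) prefix (see the note above).
PATTERN = re.compile(
    r'<img lorem-ipsum-dolor.*?((?:class=".*?" )?src=").*?_files/(.*?(jpg|png|tiff|gif))',
    re.IGNORECASE,
)


def rewrite(line):
    """Emulate the replacement <img \\L\\1\\2 by lowercasing groups 1 and 2."""
    return PATTERN.sub(
        lambda m: '<img ' + (m.group(1) + m.group(2)).lower(),
        line,
    )


# Shortened stand-ins for the sample lines (assumptions, not the original data).
SAMPLES = [
    '<img lorem-ipsum-dolor="/lorem/a23w34m87.jpg" class="lorem123" '
    'src="11LOREM_files/a01b02c68.png" alt="a01b02c68.bmp" height="101" width="102">',
    '<img lorem-ipsum-dolor="/lorem/c011XX001.tif" '
    'src="11LOREM_files/c01vv0x01.jpg" alt="c01vv0x01.jpeg" height="567" width="789">',
    '<img lorem-ipsum-dolor="/lorem/x.jpg" src="11LOREM_files/C01VV0X01.JPG" alt="x">',
]

rewritten = [rewrite(line) for line in SAMPLES]
for line in rewritten:
    print(line)
```

The optional non-capturing group is what lets the same pattern handle lines with and without the class attribute; everything after the picture name is left untouched because it is outside the match.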

                            Cheers,

                            guy038

                            • J
                              Jan Kowalsky
                              last edited by Dec 11, 2015, 8:29 PM

                              Thank you for your commitment.
                              Best regards,
                              Jan
