    Need help, please - regular expressions

    • dailD
      dail
      last edited by

      Also, take a look at your original expression here. You can see the regular expression is allowing it to skip class=.

      Note: that website might not use the exact same regular expression engine but should be close enough to reference.

      • Jan KowalskyJ
        Jan Kowalsky
        last edited by

        Shortening my example has backfired again: I wanted to be brief and it came out badly.
        The expression should also match this:

        <img lorem-ipsum-dolor=“/lorem/ipsum/dolor-1999-and/123456789012345/lorem_ipsum/1a2b3c4dd9651/c011XX001.tif” src=“11LOREM%202%20%20IpSuM%20Dolor%20sit%20%20amet%20consecteur%20-%20AdiPISCIng%20123456%20elit%20Curabitur%20QWERTY%20202020%20yes%20urna%20Interdeum%20%20Off%20Cras_files/c01vv0x01.jpg” alt=“c01vv0x01.jpeg” height=“567” width=“789”>

        IMHO, ( class=|) means "with it or without it". I emphasize IMHO.
        When it is absent, I expected the second expression not to be checked, because | means OR, right?
        At least that is how it looks to me, sorry…

        • dailD
          dail
          last edited by dail

          You are right. ( class=|) can skip it. If you look at the image I linked, if it skips group 1, it must still “consume” something for group 2. So you would want a regular expression that only captures group 2 if 1 exists.

          So this section in your expression:

          ( class=|)(".*?")

          Should be replaced with something like:

          (?:( class=)(".*?"))?

          Note this isn’t perfect but should point you in the right direction.
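A minimal Python sketch of the difference described above, assuming Python's `re` module backtracks like the Notepad++ engine for these constructs. The two sample tags are shortened, hypothetical stand-ins for the real lines, and a trailing `( src=)` is appended to both patterns so the match has something to anchor against.

```python
import re

# Shortened, hypothetical sample lines (not the real data from the thread).
with_class = '<img lorem-ipsum-dolor="/a/b.tif" class="c1" src="x_files/p.png">'
without_class = '<img lorem-ipsum-dolor="/a/b.tif" src="x_files/p.png">'

# Original idea: " class=" may be skipped, but the quoted group is still mandatory.
skippable = re.compile(r'<img lorem-ipsum-dolor.*?( class=|)(".*?")( src=)')

# Suggested form: the quoted group is only attempted when " class=" is present.
optional_pair = re.compile(r'<img lorem-ipsum-dolor.*?(?:( class=)(".*?"))?( src=)')

a = skippable.search(with_class)
b = optional_pair.search(with_class)
c = optional_pair.search(without_class)

# Group 1 is empty, so group 2 has to consume text starting at the first quote.
print(repr(a.group(1)), a.group(2))
# Both groups are filled only because " class=" really is there.
print(repr(b.group(1)), b.group(2))
# Neither group participates when the attribute is missing.
print(c.group(1), c.group(2))
```

With the skippable alternative, the engine takes the empty branch at the first opportunity, and group 2 then stretches from the first attribute value up to the quote in front of ` src=`.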

          • Jan KowalskyJ
            Jan Kowalsky
            last edited by

            It looks great in the picture. In practice, though, the front part is always left in place :(

            • Jan KowalskyJ
              Jan Kowalsky
              last edited by Jan Kowalsky

              (?:<img lorem-ipsum-dolor)(?:.*?)(?>( class=|)(“.*?”))?(“.*?”)( src=)(?:.*?)(?:_files/)(.*?[jpg|png|tif|gif]")

              With an atomic group, it works.
              Thanks for pointing me in the right direction!

              • Jan KowalskyJ
                Jan Kowalsky
                last edited by

                I’m sorry, I was blind: it doesn’t work.

                • Jan KowalskyJ
                  Jan Kowalsky
                  last edited by

                  Finally
                  <code>(?:<img lorem-ipsum-dolor)(?:.*?)(?:( class=)(“.*?”)( src=)|( src=))(?:.*?)(?:_files/)(.*?[jpg|png|tif|gif]")</code>

                  PS. I don’t know why mine didn’t show in red :)

                  • guy038G
                    guy038
                    last edited by guy038

                    Hello Jan and Dail,

                    Jan, I did not try to fully understand your regex S/R first. I just noticed two points :

                    • After copying your example source into my Notepad++, the delimiters of the different attribute values ( class, src, alt, height and width ) are the pair “.....”, that is to say the LEFT DOUBLE QUOTATION MARK ( Unicode code-point \x{201c} ) and the RIGHT DOUBLE QUOTATION MARK ( Unicode code-point \x{201d} ). These characters are different from the usual QUOTATION MARK " ( \x22 )

                    Therefore, the regex, proposed below, is based on these two characters \x{201c} and \x{201d}

                    • Seemingly, your picture files can have the .jpg, .png, .tiff or .gif extension. However, the regex you use to match these extensions ( [jpg|png|tif|gif] ) is WRONG, because the | symbol is taken literally between square brackets ! Indeed, this syntax is a single character class, which matches exactly one character : either the pipe symbol ( | ) or one of the letters j, p, g, n, t, i, f ( in either case, if the search is case-insensitive ). In other words, this part of your regex could simply be rewritten [fgijnpt|]

                    So, the correct regex is simply (jpg|png|tiff|gif) : one extension, among the four possible ones !
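The point about square brackets can be checked with a short Python sketch. It assumes Python's `re` treats character classes the same way as the Notepad++ engine, and it uses the `tif` spelling from Jan's pattern. The `.bmp` file name is taken from the `alt` attribute of the sample lines.

```python
import re

# One character out of the set f g i j n p t |
char_class = re.compile(r'[jpg|png|tif|gif]')
# One whole extension out of four
alternation = re.compile(r'(?:jpg|png|tif|gif)')

pipe_ok = bool(char_class.fullmatch('|'))        # the pipe is just a member of the set
word_ok = bool(char_class.fullmatch('jpg'))      # three characters cannot match one set
ext_ok = bool(alternation.fullmatch('jpg'))      # alternation matches the whole word

# The bracket form still "works" on names ending in one of those letters
# followed by a quote, which is why it also accepts an unwanted extension.
loose = re.compile(r'.*?[jpg|png|tif|gif]"')
strict = re.compile(r'.*?(?:jpg|png|tif|gif)"')
loose_bmp = bool(loose.fullmatch('a01b02c68.bmp"'))
strict_bmp = bool(strict.fullmatch('a01b02c68.bmp"'))

print(pipe_ok, word_ok, ext_ok, loose_bmp, strict_bmp)
```

The bracket form only tests the last letter before the closing quote, so it lets `.bmp` through because `p` is in the set.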


                    Then, I propose the following regex S/R, below :

                    SEARCH (?i-s).*?(class=“.*?” src=“).*?_files/(.*?(jpg|png|tiff|gif)) ( with a space, before the tag src )

                    REPLACE <img \L\1\2

                    Notes :

                    • The modifiers (?i-s) make the match case-insensitive ( i ) and make the dot match standard characters only, not line breaks ( -s ). In the replacement, however, the two groups \1 and \2 are rewritten in lower case, due to the \L syntax

                    • Each of the four .*? forms matches the shortest possible run of characters before the string that follows it

                    • All matched text that is NOT located between round brackets, such as everything before the first string class of a line, is therefore deleted by the replacement

                    • The group \1 is the string class=“…” src=“ and the group \2 is the name of the picture, with its extension. They, both, are rewritten, in lower case, after an initial <img string.

                    If you really need the line to begin with the string <img lorem-ipsum-dolor, just change the search regex into :

                    SEARCH (?i-s)<img lorem-ipsum-dolor.*?(class=“.*?” src=“).*?_files/(.*?(jpg|png|tiff|gif))

                    Best Regards,

                    guy038

                    • Jan KowalskyJ
                      Jan Kowalsky
                      last edited by

                      Thank you very much for the analysis. I know the wrong brackets have no right to work, but in this specific example they do.
                      Example:
                      <img lorem-ipsum-dolor="/lorem/ipsum/dolor-2015-and/123456789012345/lorem_ipsum/1a2b3c4dd9651/a23w34m87.jpg" class="lorem123" src="11LOREM%202%20%20IpSuM%20Dolor%20sit%20%20amet%20consecteur%20-%20AdiPISCIng%20123456%20elit%20Curabitur%20QWERTY%20202020%20yes%20urna%20Interdeum%20%20Off%20Cras_files/a01b02c68.png" alt="a01b02c68.bmp" height="101" width="102">

                      <img lorem-ipsum-dolor="/lorem/ipsum/dolor-1999-and/123456789012345/lorem_ipsum/1a2b3c4dd9651/c011XX001.tif" src="11LOREM%202%20%20IpSuM%20Dolor%20sit%20%20amet%20consecteur%20-%20AdiPISCIng%20123456%20elit%20Curabitur%20QWERTY%20202020%20yes%20urna%20Interdeum%20%20Off%20Cras_files/c01vv0x01.jpg" alt="c01vv0x01.jpeg" height="567" width="789">

                      Regex:
                      (?:<img lorem-ipsum-dolor)(?:.*?)(?:( class=)(".*?")( src=)|( src=))(?:.*?)(?:_files\/)(.*?[jpg|png|tif|gif]")
                      replace:
                      <img\1\2\3\4"\5

                      After changing to the correct brackets, it also works:
                      (?:<img lorem-ipsum-dolor)(?:.*?)(?:( class=)(".*?")( src=)|( src=))(?:.*?)(?:_files\/)(.*?(jpg|png|tif|gif)")
                      but then there are 6 groups; the sixth simply does not need to be referenced.
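If the extra group is unwanted, one option is to make the extension alternation non-capturing with `(?:...)`. The sketch below only counts groups, using Python's `re` as a stand-in for the Notepad++ engine. The variable names are illustrative.

```python
import re

# Shared first part of the search regex, exactly as posted.
prefix = (r'(?:<img lorem-ipsum-dolor)(?:.*?)'
          r'(?:( class=)(".*?")( src=)|( src=))(?:.*?)(?:_files\/)')

# Round brackets around the extensions add a sixth capturing group.
capturing = re.compile(prefix + r'(.*?(jpg|png|tif|gif)")')
# (?: ... ) groups the alternatives without capturing them.
non_capturing = re.compile(prefix + r'(.*?(?:jpg|png|tif|gif)")')

print(capturing.groups, non_capturing.groups)
```

With the non-capturing form, the replacement can keep using \1 to \5 and no unused group is left over.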

                      With all due respect, your regex, however correct it may be, does not work for me.

                      Very sorry for my English, I am still learning.

                      • guy038G
                        guy038
                        last edited by guy038

                        Hi Jan,

                        OK. I now understand two main points about your problem :

                        • Firstly, the values of the different attributes are surrounded by the usual quotation mark ( " ), of Unicode code-point \x{0022}. Of course, my previous regex, based on the two delimiters \x{201c} and \x{201d}, COULDN’T work at all !

                        • Secondly, the attribute class="........" may sometimes be absent from a line. Again, my previous regex supposed that this attribute was always present :-((

                        So, aware of the two facts, above, my new proposed regex is :

                        SEARCH (?i-s)<img lorem-ipsum-dolor.*?((?:class=".*?" )?src=").*?_files/(.*?(jpg|png|tiff|gif))

                        REPLACE <img \L\1\2

                        After running your S/R and mine, they, both, give the same results :-)) Nice !


                        NOTES : Compared to my previous try :

                        • I changed the special delimiters “.....”, by the usual ones ".....", in the search regex

                        • I added a new non-capturing group (?:class=".*?" )?, that can exist or NOT, due to the final question mark ?

                        • There is a space, ending the non-capturing group, before the closing round bracket

                        • The replacement regex has NOT changed
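The claim that both S/R give the same result can be reproduced outside Notepad++ with a Python sketch, under a few assumptions. The two sample lines are shortened, hypothetical versions of Jan's examples. As far as I know, Python's `re` does not accept the global `(?i-s)` form, so `re.IGNORECASE` is used instead and `re.DOTALL` is simply left off. `\L` has no direct Python equivalent, so it is emulated with `str.lower()`. The function names are illustrative.

```python
import re

# Shortened, hypothetical versions of the two sample lines.
lines = [
    '<img lorem-ipsum-dolor="/lorem/ipsum/a23w34m87.jpg" class="lorem123" '
    'src="11LOREM%20Cras_files/a01b02c68.png" alt="a01b02c68.bmp" height="101" width="102">',
    '<img lorem-ipsum-dolor="/lorem/ipsum/c011XX001.tif" '
    'src="11LOREM%20Cras_files/c01vv0x01.jpg" alt="c01vv0x01.jpeg" height="567" width="789">',
]

# Jan's search regex, copied as posted.
jan = re.compile(r'(?:<img lorem-ipsum-dolor)(?:.*?)(?:( class=)(".*?")( src=)|( src=))'
                 r'(?:.*?)(?:_files\/)(.*?[jpg|png|tif|gif]")')

# guy038's search regex; the (?i-s) prefix is replaced by the IGNORECASE flag.
guy = re.compile(r'<img lorem-ipsum-dolor.*?((?:class=".*?" )?src=").*?_files/'
                 r'(.*?(jpg|png|tiff|gif))', re.IGNORECASE)

def jan_replace(line):
    # Python 3.5+ substitutes an empty string for groups that did not participate.
    return jan.sub(r'<img\1\2\3\4"\5', line)

def guy_replace(line):
    # Emulates the replacement "<img \L\1\2": groups 1 and 2 are lower-cased.
    return guy.sub(lambda m: '<img ' + (m.group(1) + m.group(2)).lower(), line)

for line in lines:
    print(guy_replace(line))
    print(jan_replace(line) == guy_replace(line))
```

On these two lines both replacements keep the optional class attribute, reduce src to the bare file name, and leave the alt, height and width attributes untouched.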

                        Cheers,

                        guy038

                        • Jan KowalskyJ
                          Jan Kowalsky
                          last edited by

                          Thank you for your commitment
                          Best regards,
                          Jan

                          The Community of users of the Notepad++ text editor.