Regex: Find String in HTML Not at Line Start or Following </p>

  • S
    Sylvester Bullitt
    last edited by Feb 4, 2024, 12:25 AM

    I’m trying to write a regular expression that finds strings in an HTML file, but only if the string is (1) not at the beginning of a line, and (2) not preceded by a <p> tag. I exclude those strings because I’m going to hyphenate the strings I do find, and in my layout rules, strings at the beginning of a line aren’t hyphenated.

    My attempts so far:

       (?<!^|<p>)string_to_find
       (?<!(^|<p>))string_to_find
    

    NPP says both of the above are invalid regular expressions.

    Does anyone have an idea of how to accomplish this kind of search?

    • P
      PeterJones @Sylvester Bullitt
      last edited by Feb 4, 2024, 12:56 AM

      @Sylvester-Bullitt ,

      NPP says both of the above are invalid regular expressions.

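      An alternation whose branches differ in length (here ^ is zero-width and <p> is three characters) makes the lookbehind “variable length”, which Notepad++’s regex engine doesn’t allow.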

      Anyone have an idea of how to accomplish this kind of search?

      In logic rules, the following are equivalent

      • NOT(A OR B)
      • NOT(A) AND NOT(B)

      Since NOT(A OR B) is variable width (because A and B are different widths), use NOT(A) AND NOT(B) instead: (?<!^)(?<!<p>)string_to_find
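For anyone who wants to try the same idea outside Notepad++, here is a minimal sketch using Python's built-in `re` module, which also insists on fixed-width lookbehinds. The sample text and the variable names are made up for illustration; Notepad++ itself uses a different engine, so treat this as an analogy and not as a description of its internals.

```python
import re

# Hypothetical sample: one hit at line start, one after <p>, one mid-line.
sample = (
    "string_to_find starts this line\n"
    "<p>string_to_find follows a p tag\n"
    "here string_to_find sits mid-line\n"
)

# NOT(A OR B): the two branches differ in width (0 vs. 3 characters),
# so Python's re refuses to compile the lookbehind.
try:
    re.compile(r"(?<!^|<p>)string_to_find", re.MULTILINE)
    combined_ok = True
except re.error:
    combined_ok = False

# NOT(A) AND NOT(B): two fixed-width lookbehinds tested at the same position.
chained = re.compile(r"(?<!^)(?<!<p>)string_to_find", re.MULTILINE)
hits = [m.start() for m in chained.finditer(sample)]
```

Only the mid-line occurrence should end up in `hits`; the other two are rejected by one lookbehind each.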

      • S
        Sylvester Bullitt @PeterJones
        last edited by Sylvester Bullitt Feb 4, 2024, 1:05 AM Feb 4, 2024, 1:03 AM

        @PeterJones

        Thanks for the quick reply. While Googling for a solution, I found a regex capability I didn’t know existed. Since lookbehinds don’t change the search position, you can have multiple lookbehinds, one after the other. Knowing that, I came up with this (slightly shorter) expression which (to my amazement) did the trick:

        (?<!^)(?<!<p>)string_to_find
        

        Thanks for your help!
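The point about lookbehinds not moving the search position can be checked with a small Python sketch (built-in `re` module assumed; the sample string is invented). Because each lookbehind is zero-width, their order should not matter, and the text they inspect should not become part of the match.

```python
import re

text = "x <p>string_to_find y string_to_find"

# Same two lookbehinds, written in both orders.
a = re.compile(r"(?<!^)(?<!<p>)string_to_find", re.MULTILINE)
b = re.compile(r"(?<!<p>)(?<!^)string_to_find", re.MULTILINE)

spans_a = [m.span() for m in a.finditer(text)]
spans_b = [m.span() for m in b.finditer(text)]

# The matched text is only the string itself, never the preceding context.
matched = [m.group(0) for m in a.finditer(text)]
```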

        • A
          Alan Kilborn @Sylvester Bullitt
          last edited by Alan Kilborn Feb 4, 2024, 11:47 AM Feb 4, 2024, 11:47 AM

          @Sylvester-Bullitt said in Regex: Find String in HTML Not at Line Start or Following </p>:

          I came up with this (slightly shorter) expression

          I think Peter came up with it first (by roughly 7 minutes). :-)

          • D
            dr ramaanand @Sylvester Bullitt
            last edited by Feb 7, 2024, 2:54 PM

            @Sylvester-Bullitt said in Regex: Find String in HTML Not at Line Start or Following </p>:

            (?<!^)(?<!<p>)string_to_find

            ^string_to_find(*SKIP)(*F)|<p>string_to_find(*SKIP)(*F)|string_to_find is a more “easy to remember” regular expression you can use. Why is it easy? Because everything that needs to be skipped goes to the left of a (*SKIP)(*F)| and what needs to be found goes on the right.
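As far as I know, Python's built-in `re` module does not understand `(*SKIP)(*F)`, so the sketch below swaps in a different but related technique: list the contexts to skip as plain alternatives and put the wanted case last in a capture group. It is not the same regex as the one above, only the same "skip on the left, find on the right" layout. The sample text is invented.

```python
import re

sample = (
    "string_to_find starts this line\n"
    "<p>string_to_find follows a p tag\n"
    "here string_to_find sits mid-line\n"
)

# Alternatives to skip come first and capture nothing;
# the occurrence we actually want is the last alternative, in group 1.
pattern = re.compile(
    r"^string_to_find|<p>string_to_find|(string_to_find)",
    re.MULTILINE,
)

# Keep only the matches where group 1 took part.
wanted = [m.start(1) for m in pattern.finditer(sample) if m.group(1) is not None]
```

The skipped alternatives still match and consume their text, which is what keeps the last alternative from picking up the same occurrence afterwards.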

            • M
              Mark Olson @dr ramaanand
              last edited by Feb 7, 2024, 3:08 PM

              @dr-ramaanand said in Regex: Find String in HTML Not at Line Start or Following </p>:

              ^string_to_find(*SKIP)(*F)|<p>string_to_find(*SKIP)(*F)|string_to_find is a more “easy to remember”

              I would never advocate using backtracking control verbs like (*SKIP) and (*F) when something else would suffice, unless the backtracking control approach is MUCH simpler, which this clearly is not. Very few regex implementations include backtracking control verbs, so you will usually need to rewrite this regex when you go somewhere else.

              • S
                Sylvester Bullitt @dr ramaanand
                last edited by Feb 7, 2024, 3:10 PM

                @dr-ramaanand As it turns out, we’ve discovered a number of additional “first on line” scenarios that we’ve since added to our negative lookbehinds:

                (?#Not 1st word in line)(?<!^)(?<!^<q>)(?<!^“)(?<!<p>)(?<!<p><q>)(?<!<p>“)(?<!<p class="chorus">)(?<!<br>)
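A long chain like this can also be generated from a list, which may be easier to maintain. The following is only a sketch in Python's built-in `re` module; the helper name `not_first_word` and the sample lines are hypothetical, and the prefixes are copied from the expression above.

```python
import re

# Each entry becomes its own fixed-width negative lookbehind.
PREFIXES = ["^", "^<q>", "^“", "<p>", "<p><q>", "<p>“", '<p class="chorus">', "<br>"]

def not_first_word(word):
    """Build a pattern that finds `word` unless one of PREFIXES precedes it."""
    lookbehinds = "".join(f"(?<!{prefix})" for prefix in PREFIXES)
    return re.compile(lookbehinds + re.escape(word), re.MULTILINE)

sample = (
    "Achtung one\n"
    "<q>Achtung two\n"
    '<p class="chorus">Achtung three\n'
    "line<br>Achtung four\n"
    "Say Achtung five\n"
    "mid <q>Achtung six\n"
)

found_lines = [
    line for line in sample.splitlines()
    if not_first_word("Achtung").search(line)
]
```

With these prefixes, a `<q>` in the middle of a line does not block the match, because only `^<q>` and `<p><q>` are listed.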
                

                We’d be willing to consider other ways of doing this if there are simpler or more understandable techniques. However, I’ll be the first to admit I’m unfamiliar with most of the new regex you sent. Could you explain what its components do? And where is the documentation for them located?

                • D
                  dr ramaanand @Sylvester Bullitt
                  last edited by Feb 7, 2024, 3:25 PM

                  @Sylvester-Bullitt

                  • https://community.notepad-plus-plus.org/post/55467
                  • https://community.notepad-plus-plus.org/post/60429
                  • https://community.notepad-plus-plus.org/topic/20432
                  • https://community.notepad-plus-plus.org/post/64421
                  • https://community.notepad-plus-plus.org/post/60332
                  • https://community.notepad-plus-plus.org/post/60220
                  • M
                    Mark Olson @Sylvester Bullitt
                    last edited by Feb 7, 2024, 3:26 PM

                    @Sylvester-Bullitt
                    I default to RexEgg.com for most regex-related questions. It is an excellent resource.

                    guy038 has also written a good explanation of backtracking control verbs.

                    • D
                      dr ramaanand @Sylvester Bullitt
                      last edited by dr ramaanand Feb 7, 2024, 3:33 PM Feb 7, 2024, 3:32 PM

                      @Sylvester-Bullitt The (*SKIP)(*FAIL) method is easy because everything that needs to be skipped goes to the left of a (*SKIP)(*F)| and what needs to be found goes on the right. You can add anything else you need to skip by putting another string_to_skip(*SKIP)(*F)| on the left.

                      • S
                        Sylvester Bullitt @Mark Olson
                        last edited by Feb 7, 2024, 3:33 PM

                        @Mark-Olson Thanks!

                        • D
                          dr ramaanand @Sylvester Bullitt
                          last edited by dr ramaanand Feb 7, 2024, 3:51 PM Feb 7, 2024, 3:41 PM

                          @Sylvester-Bullitt
                          ^Achtung(*SKIP)(*F)|<p>Achtung(*SKIP)(*F)|Achtung will find the word Achtung except when it is at the beginning of a line or preceded by a <p>.
                          You may also use ^Achtung(*SKIP)(*F)|<p[^<>]*>Achtung(*SKIP)(*F)|<q>Achtung(*SKIP)(*F)|<p><q>Achtung(*SKIP)(*F)|Achtung, which skips Achtung when it directly follows any <p ...> tag (with or without attributes), <q>, or <p><q>, but finds the word Achtung otherwise.

                          To skip Achtung after <br> as well, use the regular expression ^Achtung(*SKIP)(*F)|<p[^<>]*>Achtung(*SKIP)(*F)|<q>Achtung(*SKIP)(*F)|<p><q>Achtung(*SKIP)(*F)|<br>Achtung(*SKIP)(*F)|Achtung
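One thing the skip approach allows that a fixed-width lookbehind does not is a variable-length context such as `<p[^<>]*>`. The sketch below illustrates that in Python's built-in `re` module, again substituting a capture group for `(*SKIP)(*F)` since `re` does not offer those verbs as far as I know. The helper `find_unskipped`, the `SKIP_CONTEXTS` list, and the sample text are all hypothetical.

```python
import re

# Contexts after which the word should NOT be reported.
# "<p[^<>]*>" is variable-length, so it could not sit in a fixed-width lookbehind.
SKIP_CONTEXTS = [r"^", r"<p[^<>]*>", r"<q>", r"<br>"]

def find_unskipped(word, text):
    """Return start offsets of `word` not directly preceded by a skip context."""
    w = re.escape(word)
    skips = "|".join(f"{ctx}{w}" for ctx in SKIP_CONTEXTS)
    # Skipped cases match without capturing; the wanted case lands in group 1.
    pattern = re.compile(f"{skips}|({w})", re.MULTILINE)
    return [m.start(1) for m in pattern.finditer(text) if m.group(1) is not None]

sample = (
    "Achtung at line start\n"
    '<p class="chorus">Achtung after a p tag with attributes\n'
    "<p><q>Achtung after p and q\n"
    "line one<br>Achtung after a break\n"
    "Bitte Achtung hier\n"
)

hits = find_unskipped("Achtung", sample)
```

Note that no separate `<p><q>` entry is needed here: the `<q>` context already covers that case.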

                          • S
                            Sylvester Bullitt @dr ramaanand
                            last edited by Feb 7, 2024, 3:50 PM

                            @dr-ramaanand If I read this correctly, I’d have to repeat one of these (*SKIP)(*F) constructs for each of my current negative lookbehinds. So that would actually make the overall regex longer.

                            And Mark Olson makes a good point. It would also make our regex non-portable, which is a major consideration. We prefer to use grammar that is supported by the large majority of regex engines, when we have the choice.

                            • D
                              dr ramaanand @Sylvester Bullitt
                              last edited by Feb 7, 2024, 3:54 PM

                              @Sylvester-Bullitt Please do whatever suits you. I am not insisting that you use only the (*SKIP)(*FAIL) method!

                              • S
                                Sylvester Bullitt @dr ramaanand
                                last edited by Feb 7, 2024, 4:03 PM

                                @dr-ramaanand Understood. Thanks for your input and time!

                                The Community of users of the Notepad++ text editor.