    Regex: Select the text between certain words, only from the file that contains a certain word

    12 Posts 2 Posters 613 Views
    • Robin Cruise

      Good day. I have to select some text between the words START and FINNISH in several HTML files. But I also need to select that text only if the file contains the words BABY SIN.

      For example:

      Lorem ipsum, or lipsum as it is sometimes known, is dummy text used in laying out print, graphic or web designs, of a BABY SIN mark. The passage is attributed to an unknown typesetter in the 15th century who is thought to have START scrambled parts of Cicero's De Finibus Bonorum et Malorum for use in a FINNISH type specimen book.

      I wrote a regex, but something is not quite right. I remember that @guy038 made a post about something similar, but I cannot find that post.

      (?s)(.*\b(BABY SIN)\b.*)\K(?s)(START).*(FINNISH)

      Can anyone help me?

      • guy038

        Hello, @robin-cruise,

        I’m sorry, but your phrasing is a bit ambiguous ! We must be as precise as possible !

        You said :

        I have to select some text between the words START and FINNISH in several HTML files. But I also need to select that text only if the file contains the words BABY SIN.


        Here is what I understand :

        • You want to find an expression ( but what, please : a simple char, a word, a range of words, a complete sentence, a complete line, a bunch of lines ? ) between the string delimiter START and the string delimiter FINNISH , with this exact case

        • But you want this search to occur ONLY IF the two-word string BABY SIN, with this exact case, exists in the current file, whatever its location, I suppose ( inside OR outside the START•••••••••••FINNISH interval ! )

        So, could you please elaborate on your needs ?

        BR

        guy038

        • Robin Cruise

          Hello, this is what I want to select, only if the file contains the words BABY SIN:

          [image: screenshot of the text to select]

          • guy038

            Hi , @robin-cruise,

            Ah… OK. So, when you said :

            I have to select SOME text between words start and finnish

            you wanted to express :

            I would like to select ALL text between the two words START and FINNISH

            I agree that the nuance is subtle ;-))


            Now, from your picture, I see that, apparently, you also want to match the two delimiters START and FINNISH, themselves

            • However, you didn’t answer me about the possible locations of the BABY SIN string ( inside or outside the START•••••••••••FINNISH section, before or after it ).

            • Also, does your HTML text contain only ONE or several START•••••••••••FINNISH sections ?

            BR

            guy038

            • Robin Cruise

              Oh, yes. BABY SIN can be located anywhere in the file, and my HTML contains only one START•••••••••••FINNISH section.

              (As an alternative, I may have two START•••••••••••FINNISH sections, in which case I should select the first one or, in other cases, the last one.)

              • guy038

                Hello, @Robin-cruise and All,

                The general problem is that the regex engine always searches from left to right. So, once the BABY SIN location has been passed, there is no means for the regex engine to remember that the current file contains that specific string :-((


                Of course, there’s a simple solution, used many times in regex topics ! Before speaking about it, in the second part of this post, I also considered the possibility of catching the BABY SIN string with this kind of regex :

                (?s-i)(?=\A.*?(BABY SIN))(*F)|(?(1).*?\KSTART.+?FINNISH)

                So, when the regex engine is right before the first char of current file :

                • The regex engine tests the first alternative (?s-i)(?=\A.*?(BABY SIN))(*F) and matches an empty string if the string BABY SIN exists. So, now, group 1 is defined as BABY SIN. Note that, at the end, the control verb (*F) cancels the current alternative but, luckily, does not reset group 1

                • So, due to the (*F) syntax, the regex engine switches to the next alternative (?(1).*?\KSTART.+?FINNISH) which is a conditional expression that is true ONLY IF group 1 exists. So, still from the very beginning of file, it looks for minimum stuff ( .*?), forgotten because of the \K syntax, and, finally, looks for and finds the first START•••••••••FINNISH section. Nice !

                • However, let’s imagine that the current file contains a second START•••••FINNISH section. So, the regex engine goes on processing the overall regex :

                  • Current position is obviously not at the very beginning of file, so the first alternative cannot match and the group 1 is not defined. Moreover, this first alternative is canceled due to the (*F) syntax

                  • Thus the second alternative (?(1).*?\KSTART.+?FINNISH) is processed. Note that this regex is equivalent to the regex (?(1).*?\KSTART.+?FINNISH|) with an empty ELSE part. As the group 1 is not defined, this empty ELSE part simply matches an empty string at the location right after the FINNISH word and in all the subsequent locations till the end of file !

                This is absolutely not what is expected ! Unfortunately, and unlike in programs and scripts, regex groups and subroutine calls cannot be stored across two consecutive search processes !
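By contrast, a script can keep the result of the BABY SIN test in ordinary program state while it searches. The following is only a minimal Python sketch of that idea, outside Notepad++; the function name `sections_if_keyword` is hypothetical and not part of this thread:

```python
import re

# One START ... FINNISH section, delimiters included, matched lazily.
# re.DOTALL plays the role of (?s); matching is case-sensitive by default.
SECTION = re.compile(r"START.+?FINNISH", re.DOTALL)


def sections_if_keyword(text, keyword="BABY SIN"):
    """Return every START...FINNISH section, but only if keyword occurs in text."""
    if keyword not in text:
        # The file-level condition fails, so nothing is selected.
        return []
    return SECTION.findall(text)


sample = "a BABY SIN b START one FINNISH c START two FINNISH"
print(sections_if_keyword(sample))
```

The condition is evaluated once per file and the search runs afterwards, which is exactly what a single regex search in the editor cannot do.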


                Thus, the sole practical and easy solution is to place a specific indicator at the very end of the current document, which can be detected with a look-ahead such as (?=.*indicator\z)

                As you deal with HTML, I suppose that a comment after the last </html> tag, is allowed by the language ?

                So, we could change the last line </html> into the line </html><!-- Y --> with this regex S/R

                SEARCH (?s-i)\A.*BABY SIN.*</html>\K

                REPLACE <!-- Y -->

                Note that if you LATER change the Y letter ( Yes ) to the N letter, or to anything else, in an HTML file, the regexes below would no longer find any START•••••FINNISH section in that specific file, and vice versa !


                Now, the search of a particular START•••••FINNISH section is rather easy ! To search for :

                • The first START•••••FINNISH section, use the regex (?s-i)\A.*?\KSTART.+?FINNISH(?=.*<!-- Y -->\Z)

                • The last START•••••FINNISH section, use the regex (?s-i)\A.*\KSTART.+?FINNISH(?=.*<!-- Y -->\Z)

                • The subsequent START•••••FINNISH sections, use the regex (?s-i).*?\KSTART.+?FINNISH(?=.*<!-- Y -->\Z)
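To see how the lazy and greedy prefixes pick the first or the last section, the three regexes can be approximated in Python. This is a sketch under assumptions: Python's `re` module has no `\K`, so a capturing group stands in for it, and Python's `\Z` matches only at the very end of the string, so the marker must be the last thing in the text. The names `FIRST`, `LAST` and `EVERY` are illustrative only.

```python
import re

# Look-ahead that succeeds only if the text ends with the <!-- Y --> marker.
MARKER = r"(?=.*<!-- Y -->\Z)"

# Lazy prefix: stops at the first START.
FIRST = re.compile(r"\A.*?(START.+?FINNISH)" + MARKER, re.DOTALL)
# Greedy prefix: backtracks from the end, so it stops at the last START.
LAST = re.compile(r"\A.*(START.+?FINNISH)" + MARKER, re.DOTALL)
# No \A anchor: each successive search finds the next section.
EVERY = re.compile(r"(START.+?FINNISH)" + MARKER, re.DOTALL)

marked = "x START a FINNISH y START b FINNISH z</html><!-- Y -->"
unmarked = marked.replace("<!-- Y -->", "<!-- N -->")

print(FIRST.search(marked).group(1))
print(LAST.search(marked).group(1))
print(EVERY.findall(marked))
print(FIRST.search(unmarked))
```

With the N marker, the look-ahead fails at every candidate position, so none of the three patterns match.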


                Remember to move the caret to the very beginning of the current file, in case of an individual search with a click on the Find Next button !

                Best regards,

                guy038

                • Robin Cruise

                  OK, I don’t quite understand the last part of the last three regexes, especially this <!-- Y -->

                  (?s-i)\A.*?\KSTART.+?FINNISH(?=.*<!-- Y -->\Z)

                  (?s-i)\A.*\KSTART.+?FINNISH(?=.*<!-- Y -->\Z)

                  (?s-i).*?\KSTART.+?FINNISH(?=.*<!-- Y -->\Z)

                  In my case, in these last three examples, where should I place the words BABY SIN ?

                  Would something like this work: (?s-i)(?=\A.*?(BABY SIN))(*F)|(?s-i)\A.*?\KSTART.+?FINNISH

                  • guy038

                    Hi @robin-cruise,

                    But if you totally remove any BABY SIN keyword from your file, your last regex, derived from my own attempt, still finds START.....FINNISH sections ! This is not what is expected, is it ?

                    Moreover, even if your file contains a BABY SIN string, your last regex would find the first START.....FINNISH section, only, and not the subsequent ones, in case of several sections !
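Both points can be checked with a rough Python approximation of that regex. Assumptions: `(?!)`, an always-failing look-ahead, stands in for `(*F)`, a capturing group replaces `\K`, and the names `ROBIN` and `robin_matches` are made up for this sketch.

```python
import re

# First alternative: test for BABY SIN, then always fail.
# Second alternative: anchored at the start, capture the first section.
ROBIN = re.compile(r"(?s)(?=\A.*?(BABY SIN))(?!)|\A.*?(START.+?FINNISH)")


def robin_matches(text):
    """Return the sections (group 2) found by successive searches."""
    return [m.group(2) for m in ROBIN.finditer(text)]


# No BABY SIN at all, yet the section is still found.
print(robin_matches("no keyword START x FINNISH"))
# BABY SIN present and two sections, yet only the first is found.
print(robin_matches("BABY SIN START a FINNISH START b FINNISH"))
```

Because the first alternative can never succeed, the second one runs whether or not the keyword is present, and its `\A` anchor rules out any match after the first.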


                    I’m trying to rephrase my last post ! See you later

                    BR

                    guy038

                    • Robin Cruise

                      SELECT ALL INSTANCES: (?s-i)(?=\A.*?(BABY SIN))(*F)|(?s-i).*?\KSTART.+?FINNISH

                      SELECT FIRST INSTANCE: (?s-i)(?=\A.*?(BABY SIN))(*F)|(?s-i)\A.*\KSTART.+?FINNISH

                      SELECT LAST INSTANCE: (?s-i)(?=\A.*?(BABY SIN))(*F)|(?s-i)\A.*\KSTART.+?FINNISH

                      thanks, @guy038

                      • guy038

                        Hi, @robin-cruise,

                        I’m sorry, but these three regexes do not achieve your initial goal, which was to find START.....FINNISH sections ONLY IF the string BABY SIN is found anywhere in the current file !

                        In addition, your second and third regexes seem identical !?

                        So, just wait for my next reply !

                        BR

                        guy038

                        • guy038

                          Hello, @robin-cruise and All,

                          Robin, as you want to search for START•••••FINNISH section(s) in some HTML files but ONLY IF current file contains the string BABY SIN, and taking into account the limitations, outlined at the very beginning of my previous post :

                          https://community.notepad-plus-plus.org/post/65328

                          My approach, which I slightly improved, is then :

                          FIRST step :

                          • To add the <!-- Y --> comment at the very end of any HTML file which contains, at least, one string BABY SIN

                          • To add the <!-- N --> comment at the very end of any HTML file which does not contain any string BABY SIN

                          • So, open either :

                            • The Find in files dialog, if you need to search the START•••••FINNISH section(s) in several HTML files

                            • The Replace dialog, if you need to search the START•••••FINNISH section(s) in a single HTML file

                          • SEARCH (?s-i)\A(?:.*(BABY SIN)|).*</html>(?!<)\K

                          • REPLACE ?1<!-- Y -->:<!-- N -->

                          • Select *.html in the Filters zone, if necessary

                          • Tick the Wrap around option

                          • Click on the Replace All or Replace in Files button

                          Now, after this first step, you should have :

                          • Some HTML files with an ending comment <!-- Y --> ( Those which contain a BABY SIN string )

                          • Some HTML files with an ending comment <!-- N --> ( Those which do not contain any BABY SIN string )
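For readers who prefer to script this first step, here is a hedged Python sketch of the same marking logic applied to a string. It is not the Notepad++ procedure itself: the helper name `add_marker` is hypothetical, and reading and writing the files is left out.

```python
import re

# A closing </html> tag not followed by "<", mirroring </html>(?!<),
# so a file that already carries a marker comment is left alone.
HTML_END = re.compile(r"</html>(?!<)")


def add_marker(text, keyword="BABY SIN"):
    """Append <!-- Y --> or <!-- N --> right after the last unmarked </html>."""
    matches = list(HTML_END.finditer(text))
    if not matches:
        # No unmarked </html> tag: return the text untouched.
        return text
    end = matches[-1].end()
    marker = "<!-- Y -->" if keyword in text else "<!-- N -->"
    return text[:end] + marker + text[end:]


print(add_marker("<html>BABY SIN</html>"))
print(add_marker("<html>nothing</html>"))
```

Running the helper twice on the same text changes nothing the second time, since the tag is then followed by `<`.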


                          SECOND step :

                          Now, thanks to that ending comment added, after the </html> tag, we can easily search for :

                          • The first START•••••FINNISH region, of current HTML file, if a BABY SIN string exists in current file :

                            • (?s-i)\A.*?\KSTART.+?FINNISH(?=.*<!-- Y -->\Z)
                          • The last START•••••FINNISH region, of current HTML file, if a BABY SIN string exists in current file :

                            • (?s-i)\A.*\KSTART.+?FINNISH(?=.*<!-- Y -->\Z)
                          • Any START•••••FINNISH region, in current HTML file, if a BABY SIN string exists in current file :

                            • (?s-i).*?\KSTART.+?FINNISH(?=.*<!-- Y -->\Z)

                          And :

                          • The first START•••••FINNISH region, of current HTML file, if no BABY SIN string exists in current file :

                            • (?s-i)\A.*?\KSTART.+?FINNISH(?=.*<!-- N -->\Z)
                          • The last START•••••FINNISH region, of current HTML file, if no BABY SIN string exists in current file :

                            • (?s-i)\A.*\KSTART.+?FINNISH(?=.*<!-- N -->\Z)
                          • Any START•••••FINNISH region, in current HTML file, if no BABY SIN string exists in current file :

                            • (?s-i).*?\KSTART.+?FINNISH(?=.*<!-- N -->\Z)

                          Best Regards,

                          guy038

                          • Robin Cruise

                            super answer, thank you sir @guy038
