Regex: Select the text between certain words, only from the file that contains a certain word
-
Good day. I have to select some text between the words start and finnish in several HTML files. But I also need to select that particular text only if the file contains the words BABY SIN
For example:
Lorem ipsum, or lipsum as it is sometimes known, is dummy text used in laying out print, graphic or web designs, of a BABY SIN mark. The passage is attributed to an unknown typesetter in the 15th century who is thought to have START scrambled parts of Cicero's De Finibus Bonorum et Malorum for use in a FINNISH type specimen book.
I made a regex, but something is not quite right. I also remember that @guy038 made a post about something similar, but I cannot find that post.
(?s)(.*\b(BABY SIN)\b.*)\K(?s)(START).*(FINNISH)
Can anyone help me?
-
Hello, @robin-cruise,
I’m sorry but your phrasing is a bit ambiguous ! We must be as precise as possible !
You said :
I have to select some text between words start and finnish in several html files. But, also, I need to select that particular text only if the file contains the word BABY SIN
Here is what I understand :
-
You want to find an expression ( but what, please : a single char, a word, a range of words, a complete sentence, a complete line, a bunch of lines ? ) between the string delimiter START and the string delimiter FINNISH, with this exact case
-
But you want this search to occur ONLY IF the two-word string BABY SIN, with this exact case, exists in the current file, whatever the location of BABY SIN, I suppose ( inside OR outside the START•••••••••••FINNISH interval ! )
So, could you please elaborate on your needs ?
BR
guy038
-
-
hello, this is what I want to select, only if the file contains the words BABY SIN
-
Hi , @robin-cruise,
Ah… OK. So, when you said :
I have to select SOME text between words start and finnish
you wanted to express :
I would like to select ALL text between the two words START and FINNISH
I agree that the nuance is subtle ;-))
Now, from your picture, I see that, apparently, you also want to match the two delimiters START and FINNISH themselves
-
However, you didn’t answer me about the possible locations of the BABY SIN string ( inside or outside the START•••••••••••FINNISH section, or before / after it ).
-
Also, does your HTML text contain only ONE or several START•••••••••••FINNISH sections ?
BR
guy038
-
-
Oh, yes. Baby Sin can be located anywhere in the file, and my HTML contains only one START•••••••••••FINNISH section. (But, as an alternative, it may be the case that I have 2 START•••••••••••FINNISH sections and I should select the first one or, in another case, the last one.)
-
Hello, @Robin-cruise and All,
The general problem is that the regex engine always searches from left to right. So, once the BABY SIN location is passed, there is no means for the regex engine to remember that the current file contains that specific string :-((
Of course, there’s a simple solution, used many times in regex topics ! Before speaking about it, in the second part of this post, I also considered the possibility of catching the BABY SIN string with this kind of regex :
(?s-i)(?=\A.*?(BABY SIN))(*F)|(?(1).*?\KSTART.+?FINNISH)
So, when the regex engine is right before the first char of current file :
-
The regex engine tests the first alternative (?s-i)(?=\A.*?(BABY SIN))(*F) and matches an empty string if the string BABY SIN exists. So, now, the group 1 is defined as BABY SIN. Note that, at the end, the control verb (*F) cancels the current alternative but, luckily, does not reset the group 1
-
So, due to the (*F) syntax, the regex engine switches to the next alternative (?(1).*?\KSTART.+?FINNISH) which is a conditional expression that is true ONLY IF group 1 exists. So, still from the very beginning of file, it looks for minimum stuff ( .*? ), forgotten because of the \K syntax, and, finally, looks for and finds the first START•••••••••FINNISH section. Nice !
-
However, let’s imagine that the current file contains a second START•••••FINNISH section. So, the regex engine goes on processing the overall regex :
-
Current position is obviously not at the very beginning of file, so the first alternative cannot match and the group 1 is not defined. Moreover, this first alternative is canceled due to the (*F) syntax
-
Thus the second alternative (?(1).*?\KSTART.+?FINNISH) is processed. Note that this regex is equivalent to the regex (?(1).*?\KSTART.+?FINNISH|) with an empty ELSE part. As the group 1 is not defined, this empty ELSE part simply matches an empty string at the location right after the FINNISH word and in all the subsequent locations till the end of file !
-
This is absolutely not what is expected ! Unfortunately, and unlike programs and scripts, the regex groups and subroutine calls cannot be stored across two consecutive search processes !
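For comparison, here is a minimal Python sketch of what a script can do and a single regex search cannot: remember that the keyword was seen, then collect the sections. The helper name find_sections and the sample string are hypothetical, chosen for illustration only. The keyword and delimiters are the ones from this thread.

```python
import re

# Non-greedy START...FINNISH section, with "." also matching line breaks.
SECTION = re.compile(r"START.+?FINNISH", re.DOTALL)


def find_sections(text, keyword="BABY SIN"):
    """Return every START...FINNISH section, but only if keyword occurs in text.

    Hypothetical helper: the script checks for the keyword once, then searches,
    which is the state a single regex search cannot keep.
    """
    if keyword not in text:  # case-sensitive, like (?-i)
        return []
    return SECTION.findall(text)


sample = "a BABY SIN mark. START one FINNISH and START two FINNISH end"
sections = find_sections(sample)
no_keyword = find_sections("START one FINNISH")
```

This is only a sketch outside Notepad++. Inside the editor, the end-of-file indicator described in this thread plays the role of the keyword check.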
Thus, the sole practical and easy solution is to place a specific indicator at the very end of the current document, which can be detected with a look-ahead, using, for instance, the syntax
(?=.*indicator\z)
As you deal with HTML, I suppose that a comment after the last </html> tag is allowed by the language ? So, we could change the last line </html> into the line </html><!-- Y --> with this regex S/R :
SEARCH (?s-i)\A.*BABY SIN.*</html>\K
REPLACE <!-- Y -->
Note that changing, LATER, the Y letter ( Yes ) to the N letter, or anything else, in an HTML file, would no longer trigger the search of a START•••••FINNISH section for this specific file, and vice-versa !
Now, the search of a particular START•••••FINNISH section is rather easy ! To search for :
-
The first START•••••FINNISH section, use the regex (?s-i)\A.*?\KSTART.+?FINNISH(?=.*<!-- Y -->\Z)
-
The last START•••••FINNISH section, use the regex (?s-i)\A.*\KSTART.+?FINNISH(?=.*<!-- Y -->\Z)
-
The subsequent START•••••FINNISH sections, use the regex (?s-i).*?\KSTART.+?FINNISH(?=.*<!-- Y -->\Z)
Remember to move the caret to the very beginning of the current file, in case of an individual search with a click on the Find Next button !
Best regards,
guy038
-
-
OK, I don’t quite understand the last part of the last 3 regexes, especially this <!-- Y --> :
(?s-i)\A.*?\KSTART.+?FINNISH(?=.*<!-- Y -->\Z)
(?s-i)\A.*\KSTART.+?FINNISH(?=.*<!-- Y -->\Z)
(?s-i).*?\KSTART.+?FINNISH(?=.*<!-- Y -->\Z)
In my case, in these last 3 examples, where should I place the words BABY SIN ? Will something like this work :
(?s-i)(?=\A.*?(BABY SIN))(*F)|(?s-i)\A.*?\KSTART.+?FINNISH
-
Hi @robin-cruise,
But, if you totally remove any BABY SIN keyword from your file, your last regex, derived from my own attempt, still finds START.....FINNISH sections ! Not what is expected, is it ?
Moreover, even if your file contains a BABY SIN string, your last regex would find the first START.....FINNISH section only, and not the subsequent ones, in case of several sections !
I’m trying to rephrase my last post ! See you later
BR
guy038
-
SELECT ALL INSTANCES:
(?s-i)(?=\A.*?(BABY SIN))(*F)|(?s-i).*?\KSTART.+?FINNISH
SELECT FIRST INSTANCE:
(?s-i)(?=\A.*?(BABY SIN))(*F)|(?s-i)\A.*\KSTART.+?FINNISH
SELECT LAST INSTANCE:
(?s-i)(?=\A.*?(BABY SIN))(*F)|(?s-i)\A.*\KSTART.+?FINNISH
thanks, @guy038
-
Hi, @robin-cruise,
I’m sorry, but these three regexes do not achieve your initial goal, which was to find START.....FINNISH sections ONLY IF the string BABY SIN is found anywhere in the current file !
In addition, your second and third regexes seem identical !?
So, just wait for my next reply !
BR
guy038
-
Hello, @robin-cruise and All,
Robin, you want to search for START•••••FINNISH section(s) in some HTML files but ONLY IF the current file contains the string BABY SIN. We must also take into account the limitations outlined at the very beginning of my previous post : https://community.notepad-plus-plus.org/post/65328
My method, which I slightly improved, is then :
FIRST step :
-
To add the <!-- Y --> comment at the very end of any HTML file which contains at least one string BABY SIN
-
To add the <!-- N --> comment at the very end of any HTML file which does not contain any string BABY SIN
-
So, open either :
-
The Find in files dialog, if you need to search the START•••••FINNISH section(s) in several HTML files
-
The Replace dialog, if you need to search the START•••••FINNISH section(s) in a single HTML file
-
-
SEARCH (?s-i)\A(?:.*(BABY SIN)|).*</html>(?!<)\K
-
REPLACE ?1<!-- Y -->:<!-- N -->
-
Select *.html in the Filters zone, if necessary
-
Tick the Wrap around option
-
Click on the Replace All or Replace in Files button
Now, after this first step, you should have :
-
Some HTML files with an ending comment <!-- Y --> ( those which contain a BABY SIN string )
-
Some HTML files with an ending comment <!-- N --> ( those which do not contain any BABY SIN string )
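As a rough illustration of what this first step does, here is a Python sketch of the same tagging logic. The function name tag_html is hypothetical. Python's re module has no \K, so the text up to </html> is captured and written back instead. This mirrors the S/R above under those assumptions and is not the Notepad++ engine itself.

```python
import re

# Everything up to the last "</html>" that is NOT directly followed by "<".
# The (?!<) check leaves an already tagged file ("</html><!-- Y -->") alone.
CLOSING_TAG = re.compile(r"(?s)\A(.*</html>)(?!<)")


def tag_html(text, keyword="BABY SIN"):
    """Append <!-- Y --> or <!-- N --> right after the closing </html> tag."""
    flag = "Y" if keyword in text else "N"  # case-sensitive test, like (?-i)
    return CLOSING_TAG.sub(
        lambda m: m.group(1) + f"<!-- {flag} -->", text, count=1
    )


with_keyword = tag_html("<html>a BABY SIN mark</html>")
without_keyword = tag_html("<html>nothing here</html>")
already_tagged = tag_html("<html>a BABY SIN mark</html><!-- Y -->")
```

Running the function twice on the same text leaves it unchanged the second time, which is the purpose of the (?!<) look-ahead in the SEARCH regex.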
SECOND step :
Now, thanks to that ending comment added after the </html> tag, we can easily search for :
-
The first START•••••FINNISH region of the current HTML file, if a BABY SIN string exists in the current file : (?s-i)\A.*?\KSTART.+?FINNISH(?=.*<!-- Y -->\Z)
-
The last START•••••FINNISH region of the current HTML file, if a BABY SIN string exists in the current file : (?s-i)\A.*\KSTART.+?FINNISH(?=.*<!-- Y -->\Z)
-
Any START•••••FINNISH region in the current HTML file, if a BABY SIN string exists in the current file : (?s-i).*?\KSTART.+?FINNISH(?=.*<!-- Y -->\Z)
And :
-
The first START•••••FINNISH region of the current HTML file, if no BABY SIN string exists in the current file : (?s-i)\A.*?\KSTART.+?FINNISH(?=.*<!-- N -->\Z)
-
The last START•••••FINNISH region of the current HTML file, if no BABY SIN string exists in the current file : (?s-i)\A.*\KSTART.+?FINNISH(?=.*<!-- N -->\Z)
-
Any START•••••FINNISH region in the current HTML file, if no BABY SIN string exists in the current file : (?s-i).*?\KSTART.+?FINNISH(?=.*<!-- N -->\Z)
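The six searches above can be tried outside Notepad++ with a short Python sketch. Several things here are assumptions: the helper name section_patterns and the sample text are hypothetical, a capture group stands in for \K (which Python's re lacks), and \s* is added before \Z because Python's \Z matches only at the absolute end of the string, so a trailing line break would otherwise prevent the match.

```python
import re


def section_patterns(flag):
    """Build the first / last / all patterns for files tagged with flag (Y or N)."""
    # Look-ahead: the file must end with the <!-- flag --> comment.
    tail = rf"(?=.*<!-- {flag} -->\s*\Z)"
    return {
        # Lazy .*? stops at the first START, greedy .* backtracks to the last one.
        "first": re.compile(rf"(?s)\A.*?(START.+?FINNISH){tail}"),
        "last": re.compile(rf"(?s)\A.*(START.+?FINNISH){tail}"),
        "all": re.compile(rf"(?s)(START.+?FINNISH){tail}"),
    }


tagged = "<html>BABY SIN START one FINNISH mid START two FINNISH</html><!-- Y -->\n"

yes = section_patterns("Y")
first = yes["first"].search(tagged).group(1)
last = yes["last"].search(tagged).group(1)
every = yes["all"].findall(tagged)

# The N patterns must not match a file tagged with Y.
no_match = section_patterns("N")["first"].search(tagged)
```

Only the look-ahead differs between the Y and N variants, so the same three patterns serve both cases.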
Best Regards,
guy038
-
-
super answer, thank you sir @guy038