How to find words with 12 or more alphabets that are between a <li style...........> and </li>
-
@dr-ramaanand said in How to find words with 12 or more alphabets that are between a <li style...........> and </li>:
(?-is:<span style=|(?!\A)\G).*?\K(?i)(\b[a-z]{12,}\b)
How can I make the search stop upon encountering a </span>?
-
@guy038 The regular expression
(?-is:<span style=|(?!\A)\G).*?\K(?i)(\b[a-z]{12,}\b).*?<\/span>
helps find words of 12 or more letters between <span style=.....> and </span>. Is it correct?
-
Block of text for testing:
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Haemorrhoids <br>- piles</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Haemorrhoids<br>-piles</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Offensive haemorrhages</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>CONSCIOUSNESS of womb. Hysterically inclined.</span>
<p style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Confirmatory symptoms</p>
<ol><li style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>REMEDY RELATIONSHIPS</li></ol>
Everything after the </span> should be skipped.
-
@guy038 Tweaking your RegEx above to
(?-is:<span style=|(?!\A)\G).*?\K(?i)(\b[a-z]{12,}\b)
helped find the first word with 12 or more letters between a <span style…> and </span>.
Someone at https://regex101.com tweaked it to
(?-i:<span style=[^>]*+>|(?!\A)\G)(?>[^<>\w]++|<(?!\/?span\b)[^>]*+>|\b\w{1,11}\b)*+\K(\b[a-z]{12,}\b)
which, on my laptop, also finds only the first word with 12 or more letters between a <span style…> and </span>. For them (and probably for you too), it also matches the second such word. I think this is a Notepad++ bug.
-
@dr-ramaanand said in How to find words with 12 or more alphabets that are between a <li style...........> and </li>:
I think this is a Notepad++ bug
Remember, regex101 is not specific to the Boost regex engine, which is the engine that Notepad++ uses. They might not be answering you with the Boost specifics in mind (though I haven’t read your discussions over there, so maybe they really are answering w/r/t Boost / Notepad++).
Every regex engine has its own design decisions, and just because the Boost regex design decisions are different than the engines they deal with does not mean that it’s a bug in Notepad++ (and if it is a bug in Notepad++, it would presumably be inherited from Boost regex, not inherent to Notepad++ itself, as far as I understand things).
-
@dr-ramaanand said in How to find words with 12 or more alphabets that are between a <li style...........> and </li>:
(?-i:<span style=[^>]+>|(?!\A)\G)(?>[^<>\w]++|<(?!/?span\b)[^>]+>|\b\w{1,11}\b)*+\K(\b[a-z]{12,}\b)
@PeterJones The person who helped me with the above RegEx at https://regex101.com was kind enough to send me a screenshot of his results with Notepad++, which matched every word of 12 or more letters. On my laptop, however, it matched only the first word of 12 or more letters between <span............. and </span>.
-
send me a screenshot of his results with Notepad++ which found/matched every word of 12 alphabets
That means you obviously did something differently than he did, or you did not describe your data correctly to him. Oddly enough, the same thing happens here when Guy and the others try to help you here.
Calling it a “bug” in Notepad++, at this point, seems a rather premature conclusion on your part.
I took your most-recent data (from this post) along with the regex that you got from regex101, and it replaced all the instances of twelve-or-more-letter words for me. So I would agree with the person at regex101 that the regex works.
Here are some screenshots: I start with the data pasted from your post, then I use the regex shown with Replace All, and I see
- this: [screenshot not preserved]
- get converted into: [screenshot not preserved]
The five replacements are correct.
Remember, with ANY regex that uses \K (which this one does), it will not work with a single Replace; you must use Replace All.
So, if the data looks like the data you most-recently shared here, then the regex works to transform your data by using Replace All. If your data does not look like what you shared here, then all bets are off, and all I can do is reiterate what others have told you: regex alone is the wrong tool for editing HTML.
-
Hello, @dr-ramaanand, @peterjones and All,
@dr-ramaanand, you’re trying to use this regex
(?-i:<span style=[^>]+>|(?!\A)\G)(?>[^<>\w]++|<(?!/?span\b)[^>]+>|\b\w{1,11}\b)*+\K(\b[a-z]{12,}\b)
But this one, much simpler,
(?-is:<span style=|(?!\A)\G).*?\K(?i:\b[a-z]{12,}\b)
does EXACTLY the same search and finds the same matches !!
It comes from the second form of my generic regex, discussed in this post :
where I said, textually :
When the BSR and the different matches of the FR regex are all located in a single line, any line-ending char(s) will implicitly break down the \G feature. The ESR part is then useless and the generic regex can be simplified into :
SEARCH (?-s)(?-i:BSR|(?!\A)\G).*?\K(?-i:FR)
REPLACE RR
Thus, @dr-ramaanand, just stay concentrated on the black zones to modify, in order to form a valid expression !! Do not try anything else, which is, finally, more complicated !!
So, if we just replace the BSR zone by <span style= and the FR zone by \b[a-z]{12,}\b , we end up with the valid SEARCH regex, below :
- SEARCH (?-s)(?-i:<span style=|(?!\A)\G).*?\K(?i:\b[a-z]{12,}\b)
Which I just simplified as :
- SEARCH (?-si:<span style=|(?!\A)\G).*?\K(?i:\b[a-z]{12,}\b)
And, given your INPUT text :
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Haemorrhoids <br>- piles</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Haemorrhoids<br>-piles</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Offensive haemorrhages</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>CONSCIOUSNESS of womb. Hysterically inclined.</span>
<p style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Confirmatory symptoms</p>
<ol><li style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>REMEDY RELATIONSHIPS</li></ol>
It would match, as @peterjones said, the five strings Haemorrhoids, Haemorrhoids, haemorrhages, CONSCIOUSNESS and Hysterically, which all contain 12 or 13 chars.
Best Regards,
guy038
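For readers who want to check these five matches outside Notepad++, here is a minimal Python sketch. Python's standard re module supports neither \G nor \K, so this is an emulation of the single-line form, not the Boost regex itself: it looks for the start marker on each line and then collects the long words after it on that same line. The helper name long_words_after_marker and the shortened SAMPLE text (one tag per line, abbreviated style attribute) are assumptions made for illustration.

```python
import re

START = re.compile(r"<span style=")                  # BSR zone, case-sensitive
WORD = re.compile(r"\b[a-z]{12,}\b", re.IGNORECASE)   # FR zone, any case

def long_words_after_marker(text):
    """Per line: words of 12+ letters located after the first '<span style='."""
    found = []
    for line in text.splitlines():
        start = START.search(line)
        if start is None:
            continue  # no start marker on this line, so nothing may match
        # Only the rest of the line is searched, like .*? without "dot matches newline".
        found.extend(WORD.findall(line[start.end():]))
    return found

# Hypothetical, shortened version of the test data from this thread.
SAMPLE = """\
<span style="color: black;">Haemorrhoids <br>- piles</span>
<span style="color: black;">Haemorrhoids<br>-piles</span>
<span style="color: black;">Offensive haemorrhages</span>
<span style="color: black;">CONSCIOUSNESS of womb. Hysterically inclined.</span>
<p style="color: black;">Confirmatory symptoms</p>
<ol><li style="color: black;">REMEDY RELATIONSHIPS</li></ol>
"""

print(long_words_after_marker(SAMPLE))
# ['Haemorrhoids', 'Haemorrhoids', 'haemorrhages', 'CONSCIOUSNESS', 'Hysterically']
```

The words Confirmatory and RELATIONSHIPS are skipped because their lines contain no <span style= marker.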
-
@guy038 @PeterJones It looks like I was using an old version of Notepad++. I have now installed Notepad++ v8.7.4 (32-bit), released on December 4th, 2024. The RegEx
(?-si:<span style=|(?!\A)\G).*?\K(?i:\b[a-z]{12,}\b)
does not stop finding words with 12 or more letters after a </span>, but
(?-s)(?-i:<span style=|(?!\A)\G).*?\K(?i:\b[a-z]{12,}\b)
and
(?-i:<span style=[^>]*+>|(?!\A)\G)(?>[^<>\w]++|<(?!\/?span\b)[^>]*+>|\b\w{1,11}\b)*+\K(\b[a-z]{12,}\b)
do. Thanks a lot for your time.
-
Hi, @dr-ramaanand, @peterjones and All,
@dr-ramaanand, I don’t think that your old version was a problem, regarding the results of these regexes !
You said, in your last post :
The RegEx
(?-si:<span style=|(?!\A)\G).*?\K(?i:\b[a-z]{12,}\b)
does not stop searching/finding matches of words with 12 or more letters/alphabets after a </span> …
Sorry to contradict you but, given your INPUT text, below :
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Haemorrhoids <br>- piles</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Haemorrhoids<br>-piles</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Offensive haemorrhages</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>CONSCIOUSNESS of womb. Hysterically inclined.</span>
<p style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Confirmatory symptoms</p>
<ol><li style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>REMEDY RELATIONSHIPS</li></ol>
My regex DOES find the five words Haemorrhoids, Haemorrhoids, haemorrhages, CONSCIOUSNESS and Hysterically, as do the two other regexes you mentioned !
But, as I said in my previous post, this syntax only works if the BSR region and all the matching words lie in the current line, before another <span style= syntax beginning the next line.
Now, let’s use this new INPUT text, pasted in a new tab, where I added some HTML paragraphs, after the </span> tag of each line :
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Haemorrhoids <br>- piles</span> <p>The words abbreviation and fabrications are long</p>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Haemorrhoids<br>-piles</span> <p>This radioacoustics</p>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Offensive haemorrhages</span> <p>Words nanofabrications and calligraphically contains 16 letters</p>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>CONSCIOUSNESS of womb. Hysterically inclined.</span> <p>Echocardiographies and Neovascularization</p>
<p style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Confirmatory symptoms</p>
<ol><li style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>REMEDY RELATIONSHIPS</li></ol>
This time, we cannot apply the same regex, as some words located after </span> would also be detected !
Indeed, the additional words abbreviation, fabrications, radioacoustics, nanofabrications, calligraphically, Echocardiographies and Neovascularization are not wanted !
Thus, we must use the first form of my generic regex, described in this post :
SEARCH (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?-si:FR)
With :
BSR = <span style=
ESR = </span>
FR = \b[a-z]{12,}\b
Leading to this functional regex :
SEARCH (?-si:<span style=|(?!\A)\G)(?s-i:(?!</span>).)*?\K(?i:\b[a-z]{12,}\b)
Remark : I also changed the part (?-si: , before FR, to simply (?i: , in order to find any word of more than 11 letters, whatever its case !
And again, it would just match the five words Haemorrhoids, Haemorrhoids, haemorrhages, CONSCIOUSNESS and Hysterically, which come BEFORE </span> :-))
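The BSR / ESR idea can also be sketched in Python, again as an emulation because the standard re module has no \G or \K. Instead of walking forward with (?!</span>). the sketch first isolates each region from <span style= up to the next </span> (or the end of the text when no closing tag follows), then searches for long words inside that region only. The function name long_words_inside_span and the shortened SAMPLE are assumptions made for illustration.

```python
import re

# Region from the BSR '<span style=' up to the ESR '</span>' (or end of text).
# re.DOTALL lets the region span several lines, like (?s-i:...) in the regex above.
REGION = re.compile(r"<span style=(.*?)(?:</span>|\Z)", re.DOTALL)
WORD = re.compile(r"\b[a-z]{12,}\b", re.IGNORECASE)  # FR, any case

def long_words_inside_span(text):
    """Words of 12+ letters located between '<span style=' and the next '</span>'."""
    found = []
    for region in REGION.finditer(text):
        found.extend(WORD.findall(region.group(1)))
    return found

# Hypothetical, shortened version of the second test text: long words also follow </span>.
SAMPLE = (
    '<span style="color: black;">Offensive haemorrhages</span> '
    '<p>Words nanofabrications and calligraphically contains 16 letters</p>\n'
    '<span style="color: black;">CONSCIOUSNESS of womb. Hysterically inclined.</span> '
    '<p>Echocardiographies and Neovascularization</p>\n'
)

print(long_words_inside_span(SAMPLE))
# ['haemorrhages', 'CONSCIOUSNESS', 'Hysterically']
```

The long words inside the <p> paragraphs are left out because they sit after the closing </span> of their line.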
Now, let’s get back to your original INPUT text, where I simply changed the name of the font to another name with more than 11 letters : the DejaVuSansMono font !
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “DejaVuSansMono”; font-size: 18px; color: black;”>Haemorrhoids <br>- piles</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “DejaVuSansMono”; font-size: 18px; color: black;”>Haemorrhoids<br>-piles</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “DejaVuSansMono”; font-size: 18px; color: black;”>Offensive haemorrhages</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “DejaVuSansMono”; font-size: 18px; color: black;”>CONSCIOUSNESS of womb. Hysterically inclined.</span>
<p style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Confirmatory symptoms</p>
<ol><li style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>REMEDY RELATIONSHIPS</li></ol>
If I use the last regex of my previous post :
(?-si:<span style=|(?!\A)\G).*?\K(?i:\b[a-z]{12,}\b)
Against the INPUT text just above, it would wrongly match the name of the font, four times :-((
We can solve this problem by changing the BSR search region ! Instead of <span style= , we’ll rather use <span style=.+?>
Hence, this new version :
(?-si:<span style=.+?>|(?!\A)\G).*?\K(?i:\b[a-z]{12,}\b)
Or this one, in case of possible words after </span> and/or multi-line results :
(?-si:<span style=.+?>|(?!\A)\G)(?s-i:(?!</span>).)*?\K(?i:\b[a-z]{12,}\b)
Best Regards,
guy038
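The effect of the longer BSR can be illustrated with a small Python comparison. As before, this is an emulation with the standard re module (no \G, no \K), and the function name long_words, its start parameter and the one-line SAMPLE are assumptions made for illustration. The same search is run twice: once with <span style= as the start marker, once with <span style=.+?> so that the whole opening tag is consumed before any word is looked for.

```python
import re

WORD = re.compile(r"\b[a-z]{12,}\b", re.IGNORECASE)  # FR, any case

def long_words(text, start):
    """Words of 12+ letters between the 'start' pattern and the next '</span>'."""
    # (?s:...) lets only the captured region cross line endings; 'start' keeps default flags.
    region = re.compile(start + r"(?s:(.*?))(?:</span>|\Z)")
    found = []
    for match in region.finditer(text):
        found.extend(WORD.findall(match.group(1)))
    return found

# Hypothetical one-line sample with a font name longer than 11 letters.
SAMPLE = '<span style="font-family: DejaVuSansMono; color: black;">Offensive haemorrhages</span>'

print(long_words(SAMPLE, r"<span style="))      # ['DejaVuSansMono', 'haemorrhages']
print(long_words(SAMPLE, r"<span style=.+?>"))  # ['haemorrhages']
```

Note that .+?> stops at the first > it meets, so this sketch assumes that the style attribute itself contains no > character.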
-
@guy038 Yes, thank you very much!