How to find words with 12 or more alphabets that are between a <li style...........> and </li>
-
Hello, @dr-ramaanand and All,
For your problem, I would use the second form of the generic regex, described in this post:
The SEARCH regex is
(?-s)(?-i:BSR|(?!\A)\G).*?\K(?-i:FR)
Note that the key point of that generic regex is the use of the \G regex feature, which means that the next match MUST begin right where the previous match ended!
If we apply this generic regex to your practical search, it means that:
- The BSR (Begin Search-region Regex) is <li style=
- The FR (Find Regex) is (?i)(\b[a-z]{12,}\b)
Note that I replaced the initial case-sensitive FR region (?-i:FR), embedded in a non-capturing group, with the case-insensitive region (?i)(\b[a-z]{12,}\b), embedded in group 1.
This leads to the functional regex below, which solves your practical case:
SEARCH
(?-is:<li style=|(?!\A)\G).*?\K(?i)(\b[a-z]{12,}\b)
REPLACE RR
Thus :
- Put the INPUT text below in a new tab
<div class=“right”> <ol> <li style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”><div class=“marginleft”>Haemorrhoids <br>- piles</div></li> <li style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”><div class=“marginleft”>Haemorrhoids<br>-piles</div></li> <li style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”><div class=“marginleft”>Offensive haemorrhages</div></li> <li style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”><div class=“marginleft”>CONSCIOUSNESS of womb. Hysterically inclined.</div></li> </ol>
- Move to the very beginning of the file
- Open the Find or Replace dialog
- Uncheck all the box options
- Check the Wrap around option
- Select the Regular expression search mode
- Click several times on the Find Next button to verify the different matches, or click once on the Replace All button for a global replacement
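For readers who want to check the expected matches outside Notepad++, here is a minimal Python sketch. Python's standard `re` module supports neither `\G` nor `\K`, so this is not the regex above: it emulates the same behaviour in two steps (find the first `<li style=` on a line, then collect every word of 12 or more letters from there to the end of that line). The function name and the shortened sample text are assumptions made for illustration.

```python
import re

BSR = re.compile(r"<li style=")                      # case-sensitive start marker
WORD = re.compile(r"\b[a-z]{12,}\b", re.IGNORECASE)  # 12 or more ASCII letters

def long_words_after_marker(text):
    """Return words of 12+ letters that follow the first marker on each line."""
    found = []
    for line in text.splitlines():
        start = BSR.search(line)
        if start is None:
            continue  # no marker on this line, so nothing on it is searched
        # Scan from the end of the marker to the end of the same line only
        found.extend(m.group() for m in WORD.finditer(line, start.end()))
    return found

# Shortened, hypothetical sample (straight quotes, fewer style properties)
sample = (
    '<li style="color: black;"><div class="marginleft">Haemorrhoids <br>- piles</div></li>\n'
    '<li style="color: black;"><div class="marginleft">CONSCIOUSNESS of womb. Hysterically inclined.</div></li>\n'
    '<p>Echocardiographies</p>\n'
)
print(long_words_after_marker(sample))
# ['Haemorrhoids', 'CONSCIOUSNESS', 'Hysterically']
```

The last line of the sample holds a long word but no `<li style=` marker, so it is skipped, which mirrors how the `\G` chain can only start at a BSR match.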
Note: if your text may contain accented characters, I advise you to prefer this version:
- SEARCH (?-is:<li style=|(?!\A)\G).*?\K(?i)(\b[\u\l]{12,}\b)
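To illustrate why the character class matters, here is a small Python sketch comparing an ASCII-only class with a Unicode-aware one. The class `[^\W\d_]` is a common Python idiom that roughly plays the role of `[\u\l]` here; it is an assumption of this sketch, not the Boost syntax, and the French sample phrase is hypothetical.

```python
import re

ASCII_WORD = re.compile(r"\b[a-z]{12,}\b", re.IGNORECASE)
# [^\W\d_] = word characters minus digits and underscore, roughly the Unicode letters
UNICODE_WORD = re.compile(r"\b[^\W\d_]{12,}\b")

text = "Troubles h\u00e9morragiques"  # hypothetical sample; \u00e9 is "é"

print(ASCII_WORD.findall(text))    # []
print(UNICODE_WORD.findall(text))  # ['hémorragiques']
```

The accented letter breaks the run of `[a-z]` characters, so the ASCII-only class misses the 13-letter word entirely.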
Best Regards,
guy038
-
-
@guy038 The word “Hysterically” is not matched, probably because it is the second word with 12 or more letters on the same line. Can you tweak that regex to help find such words?
-
@guy038 The regex
(?-is:<li style=|(?!\A)\G).*?\K(?i)(\b[a-z]{12,}\b).*?\K(\b[a-z]{12,}\b)
matches only the second word with 12 or more letters. So I can probably use this regex, make my changes, and then use the one you gave to find the first word with 12 or more letters.
-
Hi, @dr-ramaanand and All,
I don’t understand! With the regex
(?-is:<li style=|(?!\A)\G).*?\K(?i)(\b[a-z]{12,}\b)
once the search has matched the word CONSCIOUSNESS, if you click again on the Find Next button, it does find the word Hysterically!
If that’s not the case, just post the EXACT raw text used.
You do not need your second version!
BR
guy038
P.S.: if your <li style= strings always begin a line, you may narrow down the results with this version:
- SEARCH (?-is:^<li style=|(?!\A)\G).*?\K(?i)(\b[a-z]{12,}\b)
-
@guy038 It doesn’t begin at the start of the line, so I will not use your last regex (just above this reply). At regex101.com, your regular expression does find that second word with 12 or more letters (see https://regex101.com/r/Dw8XTK/1), but it doesn’t on my laptop. It may be due to a bug, but I will manage; I don’t want to waste your time. Thanks for your help and time!
-
@guy038 I have another block of text for testing as follows:-
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Haemorrhoids <br>- piles</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Haemorrhoids<br>-piles</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Offensive haemorrhages</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>CONSCIOUSNESS of womb. Hysterically inclined.</span>
The regular expression
(?-is:<span style=|(?!\A)\G).*?\K(?i)(\b[a-z]{12,}\b).*?\K(\b[a-z]{12,}\b)
does not find/match anything.
-
@guy038 However,
(?-is:<span style=|(?!\A)\G).*?\K(?i)(\b[a-z]{12,}\b)
does find matches.
-
@dr-ramaanand said in How to find words with 12 or more alphabets that are between a <li style...........> and </li>:
(?-is:<span style=|(?!\A)\G).*?\K(?i)(\b[a-z]{12,}\b)
How can I make the search stop upon encountering a </span>?
-
@guy038 The regular expression
(?-is:<span style=|(?!\A)\G).*?\K(?i)(\b[a-z]{12,}\b).*?<\/span>
helps find words of 12 letters or more between <span style=.....> and </span>. Is it correct?
-
Block of text for testing:-
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Haemorrhoids <br>- piles</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Haemorrhoids<br>-piles</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Offensive haemorrhages</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>CONSCIOUSNESS of womb. Hysterically inclined.</span>
<p style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Confirmatory symptoms</p>
<ol><li style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>REMEDY RELATIONSHIPS</li></ol>
Everything after the </span> should be skipped.
-
@guy038 Tweaking your RegEx above to
(?-is:<span style=|(?!\A)\G).*?\K(?i)(\b[a-z]{12,}\b)
helped find the first word with 12 or more letters between a <span style…> and </span>.
Someone at https://regex101.com tweaked it to
(?-i:<span style=[^>]*+>|(?!\A)\G)(?>[^<>\w]++|<(?!\/?span\b)[^>]*+>|\b\w{1,11}\b)*+\K(\b[a-z]{12,}\b)
which, on my laptop, also finds only the first word with 12 or more letters between a <span style…> and </span>, whereas they (and probably you too) are also able to match the second such word. I think this is a Notepad++ bug
-
@dr-ramaanand said in How to find words with 12 or more alphabets that are between a <li style...........> and </li>:
I think this is a Notepad++ bug
Remember, regex101 is not specific to the Boost regex engine, which is the engine that Notepad++ uses. They might not be answering you with the Boost specifics in mind (though I haven’t read your discussions over there, so maybe they really are answering w/r/t Boost / Notepad++).
Every regex engine has its own design decisions, and just because the Boost regex design decisions are different than the engines they deal with does not mean that it’s a bug in Notepad++ (and if it is a bug in Notepad++, it would presumably be inherited from Boost regex, not inherent to Notepad++ itself, as far as I understand things).
-
@dr-ramaanand said in How to find words with 12 or more alphabets that are between a <li style...........> and </li>:
(?-i:<span style=[^>]*+>|(?!\A)\G)(?>[^<>\w]++|<(?!\/?span\b)[^>]*+>|\b\w{1,11}\b)*+\K(\b[a-z]{12,}\b)
@PeterJones The person who helped me with the above RegEx at https://regex101.com was kind enough to send me a screenshot of his results with Notepad++ which found/matched every word of 12 alphabets or more but, on my laptop, it found/matched only the first word of 12 alphabets or more between <span............. and </span>
-
send me a screenshot of his results with Notepad++ which found/matched every word of 12 alphabets
That means you obviously did something differently than he did, or you did not describe your data correctly to him. Oddly enough, the same thing happens here when Guy and the others try to help you here.
Calling it a “bug” in Notepad++, at this point, seems a rather premature conclusion on your part.
I took your most-recent data (from this post) along with the regex that you got from regex101, and it replaced all the instances of twelve-or-more-letter words for me. So I would agree with the person from regex101 that the regex works.
Here are some screenshots: I start with the data pasted from your post, then I use the regex shown with Replace All, and I see
- this: (screenshot not preserved)
- get converted into: (screenshot not preserved)
The five replacements are correct.
Remember, with ANY regex that uses \K (which this one does), it will not work with single Replace; you must use Replace All.
So, if the data looks like the data you most-recently shared here, then the regex works to transform your data by using Replace All. If your data does not look like what you shared here, then all bets are off, and all I can do is reiterate what others have told you: regex alone is the wrong tool for editing HTML.
-
Hello, @dr-ramaanand, @peterjones and All,
@dr-ramaanand, you’re trying to use this regex:
(?-i:<span style=[^>]*+>|(?!\A)\G)(?>[^<>\w]++|<(?!\/?span\b)[^>]*+>|\b\w{1,11}\b)*+\K(\b[a-z]{12,}\b)
But this much simpler one
(?-is:<span style=|(?!\A)\G).*?\K(?i:\b[a-z]{12,}\b)
does EXACTLY the same search and finds the same matches!!
It comes from the second form of my generic regex, discussed in this post:
where I said, word for word:
When the BSR and the different matches of the FR regex are all located in a single line, any line-ending char(s) will implicitly break down the \G feature. The ESR part is then useless and the generic regex can be simplified into:
SEARCH (?-s)(?-i:BSR|(?!\A)\G).*?\K(?-i:FR)
REPLACE RR
Thus, @dr-ramaanand, just concentrate on the BSR and FR zones to modify, in order to form a valid expression!! Do not try anything else, which ends up being more complicated!!
So, if we just replace the BSR zone with <span style= and the FR zone with \b[a-z]{12,}\b, we end up with the valid SEARCH regex below:
- SEARCH (?-s)(?-i:<span style=|(?!\A)\G).*?\K(?i:\b[a-z]{12,}\b)
Which I just simplified as:
- SEARCH (?-si:<span style=|(?!\A)\G).*?\K(?i:\b[a-z]{12,}\b)
And, given your INPUT text :
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Haemorrhoids <br>- piles</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Haemorrhoids<br>-piles</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Offensive haemorrhages</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>CONSCIOUSNESS of womb. Hysterically inclined.</span>
<p style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Confirmatory symptoms</p>
<ol><li style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>REMEDY RELATIONSHIPS</li></ol>
It would match, as @peterjones said, the five strings Haemorrhoids, Haemorrhoids, haemorrhages, CONSCIOUSNESS and Hysterically, which all contain 12 or 13 chars.
Best Regards,
guy038
-
@guy038 @PeterJones It looks like I was using an older version of Notepad++. I have now installed Notepad++ v8.7.4 (32-bit), released on December 4th, 2024. The RegEx
(?-si:<span style=|(?!\A)\G).*?\K(?i:\b[a-z]{12,}\b)
does not stop searching/finding matches of words with 12 or more letters/alphabets after a </span>, but
(?-s)(?-i:<span style=|(?!\A)\G).*?\K(?i:\b[a-z]{12,}\b)
and
(?-i:<span style=[^>]*+>|(?!\A)\G)(?>[^<>\w]++|<(?!\/?span\b)[^>]*+>|\b\w{1,11}\b)*+\K(\b[a-z]{12,}\b)
do. Thanks a lot for your time.
-
Hi, @dr-ramaanand, @peterjones and All,
@dr-ramaanand, I don’t think that your old version was a problem, regarding the results of these regexes !
You said, in your last post :
The RegEx
(?-si:<span style=|(?!\A)\G).*?\K(?i:\b[a-z]{12,}\b)
does not stop searching/finding matches of words with 12 or more letters/alphabets after a </span> …
Sorry to contradict you but, given your INPUT text below:
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Haemorrhoids <br>- piles</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Haemorrhoids<br>-piles</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Offensive haemorrhages</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>CONSCIOUSNESS of womb. Hysterically inclined.</span>
<p style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Confirmatory symptoms</p>
<ol><li style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>REMEDY RELATIONSHIPS</li></ol>
My regex DOES find the five words Haemorrhoids, Haemorrhoids, haemorrhages, CONSCIOUSNESS and Hysterically, as do the two other regexes you mentioned!
But, as I said in my previous post, this syntax only works if the BSR region and all the matching words lie in the current line, before another <span style= syntax beginning the next line.
Now, let’s use this new INPUT text, pasted in a new tab, where I added some HTML paragraphs after the </span> tag of each line:
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Haemorrhoids <br>- piles</span> <p>The words abbreviation and fabrications are long</p>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Haemorrhoids<br>-piles</span> <p>This radioacoustics</p>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Offensive haemorrhages</span> <p>Words nanofabrications and calligraphically contains 16 letters</p>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>CONSCIOUSNESS of womb. Hysterically inclined.</span> <p>Echocardiographies and Neovascularization</p>
<p style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Confirmatory symptoms</p>
<ol><li style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>REMEDY RELATIONSHIPS</li></ol>
This time, we cannot apply the same regex, as some words located after </span> would also be detected!
Indeed, the additional words abbreviation, fabrications, radioacoustics, nanofabrications, calligraphically, Echocardiographies and Neovascularization are not wanted!
Thus, we must use the first form of my generic regex, described in this post:
SEARCH (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?-si:FR)
With:
BSR = <span style=
ESR = </span>
FR = \b[a-z]{12,}\b
Leading to this functional regex:
SEARCH (?-si:<span style=|(?!\A)\G)(?s-i:(?!</span>).)*?\K(?i:\b[a-z]{12,}\b)
Remark: I also replaced the part (?-si: before FR with simply (?i: in order to find any word of more than 11 letters, whatever its case!
And again, it would just match the five words Haemorrhoids, Haemorrhoids, haemorrhages, CONSCIOUSNESS and Hysterically, which come BEFORE </span> :-))
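As a cross-check of the BSR / ESR / FR idea outside Notepad++, here is a rough Python sketch. It does not use `\G` or `\K` (the standard `re` module has neither); instead it captures each region from `<span style=` up to the next `</span>`, using the same "any char not starting the ESR" construct, and then looks for long words inside that region only. The function name and the shortened sample are assumptions for illustration.

```python
import re

# Region = everything after "<span style=" up to (not including) the next "</span>".
# re.DOTALL lets the region span several lines, like the (?s-i: ... ) part above.
REGION = re.compile(r"<span style=((?:(?!</span>).)*)", re.DOTALL)
WORD = re.compile(r"\b[a-z]{12,}\b", re.IGNORECASE)

def long_words_in_spans(text):
    """Return words of 12+ letters found between each BSR and the following ESR."""
    words = []
    for region in REGION.finditer(text):
        words.extend(WORD.findall(region.group(1)))
    return words

# Shortened, hypothetical sample; the second span is deliberately split over two lines
sample = (
    '<span style="color: black;">Offensive haemorrhages</span> '
    '<p>Words nanofabrications and calligraphically</p>\n'
    '<span style="color: black;">CONSCIOUSNESS of womb.\nHysterically inclined.</span> '
    '<p>Echocardiographies</p>\n'
)
print(long_words_in_spans(sample))
# ['haemorrhages', 'CONSCIOUSNESS', 'Hysterically']
```

The long words inside the `<p>` paragraphs are ignored because they sit after a `</span>`, and the multi-line span is still searched in full.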
Now, let’s get back to your original INPUT text, where I simply replaced the font name with another name of more than 11 letters: the DejaVuSansMono font!
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “DejaVuSansMono”; font-size: 18px; color: black;”>Haemorrhoids <br>- piles</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “DejaVuSansMono”; font-size: 18px; color: black;”>Haemorrhoids<br>-piles</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “DejaVuSansMono”; font-size: 18px; color: black;”>Offensive haemorrhages</span>
<span style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “DejaVuSansMono”; font-size: 18px; color: black;”>CONSCIOUSNESS of womb. Hysterically inclined.</span>
<p style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>Confirmatory symptoms</p>
<ol><li style=“padding: 0px; list-style-type: decimal; list-style-image: none; list-style-position: outside; font-family: “verdana”; font-size: 18px; color: black;”>REMEDY RELATIONSHIPS</li></ol>
If I use the last regex of my previous post:
(?-si:<span style=|(?!\A)\G).*?\K(?i:\b[a-z]{12,}\b)
against the INPUT text just above, it wrongly matches the name of the font, four times :-((
We can solve this problem by changing the BSR search region! Instead of <span style=, we’ll use <span style=.+?>
Hence, this new version:
(?-si:<span style=.+?>|(?!\A)\G).*?\K(?i:\b[a-z]{12,}\b)
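The effect of extending the BSR can be seen with a short Python sketch. Again, this is a two-step emulation with the standard `re` module rather than the `\G` / `\K` regex itself; the helper name and the simplified one-line sample (straight quotes, unquoted font name) are assumptions for illustration.

```python
import re

WORD = re.compile(r"\b[a-z]{12,}\b", re.IGNORECASE)
SHORT_BSR = re.compile(r"<span style=")     # stops before the attribute value
FULL_BSR = re.compile(r"<span style=.+?>")  # consumes the whole opening tag

def words_after(bsr, line):
    """Return words of 12+ letters located after the first BSR match on the line."""
    start = bsr.search(line)
    if start is None:
        return []
    return [m.group() for m in WORD.finditer(line, start.end())]

line = '<span style="font-family: DejaVuSansMono; color: black;">Offensive haemorrhages</span>'
print(words_after(SHORT_BSR, line))  # ['DejaVuSansMono', 'haemorrhages']
print(words_after(FULL_BSR, line))   # ['haemorrhages']
```

With the short BSR the scan starts inside the style attribute, so the font name counts as a long word; with the full opening tag consumed, only the text content is scanned.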
Or, in case of possible words after </span> and/or multi-line results, this one:
(?-si:<span style=.+?>|(?!\A)\G)(?s-i:(?!</span>).)*?\K(?i:\b[a-z]{12,}\b)
Best Regards,
guy038
-
@guy038 Yes, thank you very much!