Regex: Select all non-ASCII characters html tags
-
Hello. I have these 4 lines. I want to make a regex that finds only those HTML tags that contain non-ASCII characters (ignoring the <em></em> tags).
Line 1 <p class="OANA"><em>评论家Drive有一些重要的东西来说，Home Edition关于总是分享胜利的人才，转向他们的起源：</em></p>
Line 2 <p class="OANA"><em>评论家Drive有一些重要的东西来说，<em>Home Edition关于总是分享胜利的</em>人才，转向他们的起源：</em></p>
Line 3 <p class="OANA"><em>评论家有一些重要的东西来说，关于总是分享胜利的人才，转向他们的起源：</em></p>
Line 4 <p class="OANA"><em>What is it called when you love your car?</em></p>
The output should be Line 1, Line 2 and Line 4 (so it should not match Line 3).
My regex is not good: it finds all of them.
Find:
<p class="OANA">+(?!\w+<em>).*(\w+[\x00-\x7F]).*</p>
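One likely reason the pattern finds every line: [\x00-\x7F] also accepts the ASCII characters of the tags themselves, so (\w+[\x00-\x7F]) can always be satisfied by the "m>" of the closing </em>. The sketch below checks this with Python's re module, not with Notepad++'s own regex engine, so treat it as an approximation. The wrap helper is hypothetical, and the fullwidth comma and colon of the sample are written as escapes.

```python
import re

COMMA, COLON = "\uff0c", "\uff1a"  # fullwidth comma and colon used in the sample


def wrap(inner):
    # Hypothetical helper: rebuild one sample paragraph around its inner text
    return f'<p class="OANA"><em>{inner}</em></p>'


lines = [
    wrap(f"评论家Drive有一些重要的东西来说{COMMA}Home Edition关于总是分享胜利的人才{COMMA}转向他们的起源{COLON}"),
    wrap(f"评论家Drive有一些重要的东西来说{COMMA}<em>Home Edition关于总是分享胜利的</em>人才{COMMA}转向他们的起源{COLON}"),
    wrap(f"评论家有一些重要的东西来说{COMMA}关于总是分享胜利的人才{COMMA}转向他们的起源{COLON}"),
    wrap("What is it called when you love your car?"),
]

# The pattern from the question, unchanged
PATTERN = re.compile(r'<p class="OANA">+(?!\w+<em>).*(\w+[\x00-\x7F]).*</p>')

for number, line in enumerate(lines, start=1):
    match = PATTERN.search(line)
    # Show whether the line matches and what the capture group grabbed
    print(number, bool(match), repr(match.group(1)) if match else None)
```

In this check the capture group ends up on the tag text rather than on the paragraph content, which is why Line 3 is not excluded.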
-
Also, it would be nice to have another regex that finds only Line 4.
-
Hello, @robin-cruise,
I don’t really understand what you want !
First, to match any Chinese character, you can use the range described by the [\x{4E00}-\x{9FFF}] character class (the CJK Unified Ideographs block).
However, your text, between the <em> and </em> tags, also contains some fullwidth punctuation characters such as ， and ： (from the Halfwidth and Fullwidth Forms block).
So, if I mark all characters with the regex
[\x{4E00}-\x{9FFF}\x{FF00}-\x{FFEF}]
in your sample, it matches 102 characters and, between the outer <em> and </em> tags :
Line 1 contains two ASCII strings ( Drive and Home Edition ) and three ranges of non-ASCII characters
Line 2 contains two ASCII strings ( Drive and Home Edition ) and four ranges of non-ASCII characters
Line 3 contains one range of non-ASCII characters only, without any ASCII char
Line 4 contains one range of ASCII characters only ( What is it called when you love your car? )
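The breakdown above can be reproduced with a short Python sketch. It is only an approximation of the Notepad++ search: Python's re module writes the ranges as \u4e00-\u9fff and \uff00-\uffef instead of \x{4E00}-\x{9FFF} and \x{FF00}-\x{FFEF}, the runs helper is hypothetical, and the inner <em> tags of Line 2 are assumed to act as separators rather than as ASCII text.

```python
import re

COMMA, COLON = "\uff0c", "\uff1a"  # fullwidth comma and colon used in the sample

# Text found between the outer <em> and </em> tags of each sample line
lines = {
    1: f"评论家Drive有一些重要的东西来说{COMMA}Home Edition关于总是分享胜利的人才{COMMA}转向他们的起源{COLON}",
    2: f"评论家Drive有一些重要的东西来说{COMMA}<em>Home Edition关于总是分享胜利的</em>人才{COMMA}转向他们的起源{COLON}",
    3: f"评论家有一些重要的东西来说{COMMA}关于总是分享胜利的人才{COMMA}转向他们的起源{COLON}",
    4: "What is it called when you love your car?",
}

CJK_RUN = re.compile(r"[\u4e00-\u9fff\uff00-\uffef]+")  # ideographs + fullwidth forms
ASCII_RUN = re.compile(r"[\x00-\x7f]+")
INNER_TAG = re.compile(r"</?em>")


def runs(text):
    # Split at inner <em> tags so the tags are not counted as ASCII text
    ascii_runs, cjk_runs = [], []
    for piece in INNER_TAG.split(text):
        ascii_runs += ASCII_RUN.findall(piece)
        cjk_runs += CJK_RUN.findall(piece)
    return ascii_runs, cjk_runs


for number, text in lines.items():
    ascii_runs, cjk_runs = runs(text)
    print(number, len(ascii_runs), "ASCII,", len(cjk_runs), "non-ASCII")

# Total number of characters matched by the CJK/fullwidth class
total = sum(len(run) for text in lines.values() for run in runs(text)[1])
print(total)
```

Under this reading, the lines with at least one ASCII run are 1, 2 and 4, which is the set the question asks for.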
Now, the question is : what do you want to do ?
On the other hand, your last question is :
Also, it would be nice to have another regex that finds only Line 4.
I suppose that the following regex
(?<=<p class="OANA"><em>)[\x00-\x7F]+?(?=</em>)
should work : it matches the shortest range of ASCII characters located after the string <p class="OANA"><em> and before the string </em>, neither of which is part of the match !
Best Regards,
guy038
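A quick way to try the lookaround regex outside Notepad++ is a small Python sketch. It assumes Python's re module behaves like the Notepad++ engine for this pattern; the lookbehind has a fixed width, which re requires. The wrap helper is hypothetical, and the fullwidth comma and colon of the sample are written as escapes.

```python
import re

COMMA, COLON = "\uff0c", "\uff1a"  # fullwidth comma and colon used in the sample


def wrap(inner):
    # Hypothetical helper: rebuild one sample paragraph around its inner text
    return f'<p class="OANA"><em>{inner}</em></p>'


lines = {
    1: wrap(f"评论家Drive有一些重要的东西来说{COMMA}Home Edition关于总是分享胜利的人才{COMMA}转向他们的起源{COLON}"),
    2: wrap(f"评论家Drive有一些重要的东西来说{COMMA}<em>Home Edition关于总是分享胜利的</em>人才{COMMA}转向他们的起源{COLON}"),
    3: wrap(f"评论家有一些重要的东西来说{COMMA}关于总是分享胜利的人才{COMMA}转向他们的起源{COLON}"),
    4: wrap("What is it called when you love your car?"),
}

# ASCII-only text right after <p class="OANA"><em>, up to the next </em>
ASCII_ONLY = re.compile(r'(?<=<p class="OANA"><em>)[\x00-\x7F]+?(?=</em>)')

matches = {}
for number, line in lines.items():
    match = ASCII_ONLY.search(line)
    if match:
        matches[number] = match.group(0)

print(matches)
```

Lines 1 to 3 start with a Chinese character right after the opening tags, so the ASCII-only class cannot begin a match there and only Line 4 is reported.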
-
-
@guy038 said in Regex: Select all non-ASCII characters html tags:
(?<=<p class="OANA"><em>)[\x00-\x7F]+?(?=</em>)
Thanks a lot, @guy038.