Regex: Select all non-ASCII characters html tags

Robin Cruise

Hello. I have these 4 lines. I want a regex that finds only those HTML tags that contain non-ASCII characters (ignoring the tags).

Line 1
<p class="OANA"><em>评论家Drive有一些重要的东西来说，Home Edition关于总是分享胜利的人才，转向他们的起源：</em></p>
Line 2
<p class="OANA"><em>评论家Drive有一些重要的东西来说，<em>Home Edition关于总是分享胜利的</em>人才，转向他们的起源：</em></p>
Line 3
<p class="OANA"><em>评论家有一些重要的东西来说，关于总是分享胜利的人才，转向他们的起源：</em></p>
Line 4
<p class="OANA"><em>What is it called when you love your car?</em></p>

The output should be Line 1, Line 2 and Line 4 (so it should not match Line 3).

My regex is not good; it finds all of them.

Find: +(?!\w+).*(\w+[\x00-\x7F]).*

Robin Cruise

Also, it would be nice to have another regex that finds only Line 4.

guy038

Hello, @robin-cruise,

I don’t really understand what you want !

First, to match any Chinese character, you must use the range described by the [\x{4E00}-\x{9FFF}] character class (the CJK Unified Ideographs block).

However, your text, between the tags, also contains some fullwidth punctuation characters, such as ， and ： (from the Halfwidth and Fullwidth Forms block, \x{FF00}-\x{FFEF}).

So, if I mark all characters with the regex [\x{4E00}-\x{9FFF}\x{FF00}-\x{FFEF}] in your sample, it matches 102 characters and, between the outer tags:

Line 1 contains two ASCII strings ( Drive and Home Edition ) and three ranges of non-ASCII characters
Line 2 contains two ASCII strings ( Drive and Home Edition ) and four ranges of non-ASCII characters
Line 3 contains only one range of non-ASCII characters, without any ASCII char
Line 4 contains only one range of ASCII characters ( What is it called when you love your car? )
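
The counts above can be reproduced outside Notepad++ with a small Python sketch. It is only an illustration under a few assumptions: Python's `re` module writes code points as `\uXXXX` instead of `\x{XXXX}`, the tag matcher `<[^>]+>` is a naive one that is only meant for this sample, and the names `runs` and `SAMPLE` are made up for the example.

```python
import re

# Python's re module spells code points as \uXXXX instead of \x{XXXX}.
NON_ASCII_RUN = re.compile(r"[\u4e00-\u9fff\uff00-\uffef]+")
ASCII_RUN = re.compile(r"[\x20-\x7e]+")  # printable ASCII, spaces included
TAG = re.compile(r"<[^>]+>")             # naive tag matcher (assumption)

SAMPLE = [
    '<p class="OANA"><em>评论家Drive有一些重要的东西来说，Home Edition关于总是分享胜利的人才，转向他们的起源：</em></p>',
    '<p class="OANA"><em>评论家Drive有一些重要的东西来说，<em>Home Edition关于总是分享胜利的</em>人才，转向他们的起源：</em></p>',
    '<p class="OANA"><em>评论家有一些重要的东西来说，关于总是分享胜利的人才，转向他们的起源：</em></p>',
    '<p class="OANA"><em>What is it called when you love your car?</em></p>',
]


def runs(line):
    """Return (ascii_runs, non_ascii_runs) for the text outside the tags."""
    ascii_runs, non_ascii_runs = [], []
    # Splitting on tags means a tag boundary also ends a run.
    for segment in TAG.split(line):
        ascii_runs += ASCII_RUN.findall(segment)
        non_ascii_runs += NON_ASCII_RUN.findall(segment)
    return ascii_runs, non_ascii_runs


for number, line in enumerate(SAMPLE, start=1):
    ascii_runs, non_ascii_runs = runs(line)
    print(f"Line {number}: {len(ascii_runs)} ASCII, {len(non_ascii_runs)} non-ASCII")

# Lines whose text contains at least one ASCII run.
with_ascii = [n for n, line in enumerate(SAMPLE, start=1) if runs(line)[0]]
```

In this sketch `with_ascii` ends up holding lines 1, 2 and 4, which is the output asked for in the first post, assuming the goal is "lines whose text, tags aside, contains some ASCII".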

Now, the question is: what do you want to do?

On the other hand, your last question was:

Also, it would be nice to have another regex that finds only Line 4.

I suppose that the following regex (?<=)[\x00-\x7F]+?(?=) should work: it matches the shortest range of ASCII characters after the opening tag, up to but not including the closing tag.
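
The tag strings inside the two lookarounds did not survive in the regex as shown, so the following Python sketch has to guess them. It assumes, purely for illustration, that the lookbehind held `<em>` and the lookahead held `</em>`; the names `ASCII_ONLY_EM` and `ascii_only_lines` are hypothetical.

```python
import re

# ASSUMPTION: the stripped tag strings were <em> and </em>.
# Shortest run of ASCII characters that sits directly between the two tags.
ASCII_ONLY_EM = re.compile(r"(?<=<em>)[\x00-\x7F]+?(?=</em>)")

SAMPLE = [
    '<p class="OANA"><em>评论家Drive有一些重要的东西来说，Home Edition关于总是分享胜利的人才，转向他们的起源：</em></p>',
    '<p class="OANA"><em>评论家Drive有一些重要的东西来说，<em>Home Edition关于总是分享胜利的</em>人才，转向他们的起源：</em></p>',
    '<p class="OANA"><em>评论家有一些重要的东西来说，关于总是分享胜利的人才，转向他们的起源：</em></p>',
    '<p class="OANA"><em>What is it called when you love your car?</em></p>',
]


def ascii_only_lines(lines):
    """Return the 1-based numbers of the lines where the pattern matches."""
    return [n for n, line in enumerate(lines, start=1) if ASCII_ONLY_EM.search(line)]


print(ascii_only_lines(SAMPLE))
```

Under that assumption only Line 4 matches: in Lines 1 to 3 every `<em>` is followed by at least one non-ASCII character before the next `</em>`, so the ASCII-only class can never reach the lookahead.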

Best Regards,

guy038

Robin Cruise

@guy038 said in Regex: Select all non-ASCII characters html tags:

(?<=)[\x00-\x7F]+?(?=)

Thanks a lot, @guy038.