Regex: Select all non-ASCII characters html tags



  • Hello. I have these 4 lines. I want a regex that finds only those HTML tags that contain non-ASCII characters (ignoring the <em></em> tags).

    Line 1
    <p class="OANA"><em>评论家Drive有一些重要的东西来说,Home Edition关于总是分享胜利的人才,转向他们的起源:</em></p>
    Line 2
    <p class="OANA"><em>评论家Drive有一些重要的东西来说,<em>Home Edition关于总是分享胜利的</em>人才,转向他们的起源:</em></p>
    Line 3
    <p class="OANA"><em>评论家有一些重要的东西来说,关于总是分享胜利的人才,转向他们的起源:</em></p>
    Line 4
    <p class="OANA"><em>What is it called when you love your car?</em></p>
    

    The output should be Line 1, Line 2 and Line 4 (so it should not match Line 3).

    My regex is not good; it finds all of them.

    Find: <p class="OANA">+(?!\w+<em>).*(\w+[\x00-\x7F]).*</p>



  • Also, it would be nice to have another regex that finds only Line 4.



  • Hello, @robin-cruise,

    I don’t really understand what you want!

    First, to match Chinese characters in the main CJK Unified Ideographs block, you must use the range described by the [\x{4E00}-\x{9FFF}] character class. Refer here

    However, your text, between the <em> and </em> tags, also contains some fullwidth punctuation characters, such as ， and ：

    Refer here

    So, if I mark all characters with the regex [\x{4E00}-\x{9FFF}\x{FF00}-\x{FFEF}], it matches 102 characters in your sample and, between the outer <em> and </em> tags:

    • Line 1 contains two ASCII strings (Drive and Home Edition) and three ranges of non-ASCII characters

    • Line 2 contains two ASCII strings (Drive and Home Edition) and four ranges of non-ASCII characters

    • Line 3 contains a single range of non-ASCII characters, without any ASCII character

    • Line 4 contains a single range of ASCII characters only (What is it called when you love your car?)
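
    The breakdown above can be reproduced outside Notepad++ with a minimal Python sketch. This is only an illustration, not the search used in the editor: it assumes Python's re module, which writes code points as \uXXXX instead of \x{XXXX}, and the helper name classify, as well as the choice to test for ASCII letters rather than any ASCII character, are my own assumptions.

```python
import re

# Same two ranges as the Notepad++ class, in Python's \uXXXX notation.
CJK_OR_FULLWIDTH = re.compile(r"[\u4E00-\u9FFF\uFF00-\uFFEF]")
ASCII_LETTER = re.compile(r"[A-Za-z]")
# Text between the outer <em> and the final </em></p> (greedy on purpose).
OUTER = re.compile(r'<p class="OANA"><em>(.*)</em></p>')
EM_TAG = re.compile(r"</?em>")

SAMPLE = [
    '<p class="OANA"><em>评论家Drive有一些重要的东西来说,Home Edition关于总是分享胜利的人才,转向他们的起源:</em></p>',
    '<p class="OANA"><em>评论家Drive有一些重要的东西来说,<em>Home Edition关于总是分享胜利的</em>人才,转向他们的起源:</em></p>',
    '<p class="OANA"><em>评论家有一些重要的东西来说,关于总是分享胜利的人才,转向他们的起源:</em></p>',
    '<p class="OANA"><em>What is it called when you love your car?</em></p>',
]


def classify(line):
    """Return (has_ascii_letters, has_cjk_or_fullwidth) for the outer <em> text.

    Returns None when the line does not have the expected <p><em> wrapper.
    """
    match = OUTER.search(line)
    if match is None:
        return None
    text = EM_TAG.sub("", match.group(1))  # drop nested <em> / </em> tags
    return bool(ASCII_LETTER.search(text)), bool(CJK_OR_FULLWIDTH.search(text))


for number, line in enumerate(SAMPLE, start=1):
    print("Line", number, classify(line))
# Line 1 (True, True)
# Line 2 (True, True)
# Line 3 (False, True)
# Line 4 (True, False)
```

    If the goal is the lines whose first value is True, that gives Line 1, Line 2 and Line 4; the line where only the second value is True is Line 3.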

    Now, the question is: what do you want to do?


    On the other hand, your last question is :

    Also, it would be nice to have another regex that finds only Line 4.

    I suppose that the following regex (?<=<p class="OANA"><em>)[\x00-\x7F]+?(?=</em>) should work: it matches the shortest range of ASCII characters that follows the string <p class="OANA"><em> and runs up to, but not including, the string </em>.
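
    As a quick check of that regex against the four sample lines, here is a small Python sketch. It is an assumption that Python's re module behaves like the Notepad++ engine for this pattern (the look-behind is fixed-width, so re accepts it), and the helper name find_ascii_only is hypothetical.

```python
import re

# The regex from the post, unchanged: ASCII-only text right after
# <p class="OANA"><em>, up to (but not including) the first </em>.
LINE4_ONLY = re.compile(r'(?<=<p class="OANA"><em>)[\x00-\x7F]+?(?=</em>)')

SAMPLE = [
    '<p class="OANA"><em>评论家Drive有一些重要的东西来说,Home Edition关于总是分享胜利的人才,转向他们的起源:</em></p>',
    '<p class="OANA"><em>评论家Drive有一些重要的东西来说,<em>Home Edition关于总是分享胜利的</em>人才,转向他们的起源:</em></p>',
    '<p class="OANA"><em>评论家有一些重要的东西来说,关于总是分享胜利的人才,转向他们的起源:</em></p>',
    '<p class="OANA"><em>What is it called when you love your car?</em></p>',
]


def find_ascii_only(line):
    """Return the matched ASCII text, or None when the regex does not match."""
    match = LINE4_ONLY.search(line)
    return match.group(0) if match else None


for number, line in enumerate(SAMPLE, start=1):
    print("Line", number, find_ascii_only(line))
# Line 1 None
# Line 2 None
# Line 3 None
# Line 4 What is it called when you love your car?
```

    Lines 1 to 3 fail because the character right after the outer <em> is non-ASCII, and the nested <em> in Line 2 is not preceded by <p class="OANA">, so the look-behind rejects it.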

    Best Regards,

    guy038



  • @guy038 said in Regex: Select all non-ASCII characters html tags:

    (?<=<p class="OANA"><em>)[\x00-\x7F]+?(?=</em>)

    thanks a lot @guy038

