Regex: Select all non-ASCII characters html tags

  • R
    Robin Cruise
    last edited by Jun 8, 2021, 5:57 AM

    Hello. I have these 4 lines. I want a regex that finds only those HTML tags that contain non-ASCII characters (ignoring the <em></em> tags).

    Line 1
    <p class="OANA"><em>评论家Drive有一些重要的东西来说,Home Edition关于总是分享胜利的人才,转向他们的起源:</em></p>
    Line 2
    <p class="OANA"><em>评论家Drive有一些重要的东西来说,<em>Home Edition关于总是分享胜利的</em>人才,转向他们的起源:</em></p>
    Line 3
    <p class="OANA"><em>评论家有一些重要的东西来说,关于总是分享胜利的人才,转向他们的起源:</em></p>
    Line 4
    <p class="OANA"><em>What is it called when you love your car?</em></p>
    

    The output should be Line 1, Line 2 and Line 4 (so it should not match Line 3).

    My regex is not good; it finds all of them.

    Find: <p class="OANA">+(?!\w+<em>).*(\w+[\x00-\x7F]).*</p>
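For readers wondering why this pattern finds every line, here is a small Python sketch. It assumes that Python's `re` module treats this particular pattern the same way Notepad++'s regex engine does, and it writes the fullwidth comma and colon of the sample as escapes. The tags themselves are ASCII, so `\w+[\x00-\x7F]` can always settle on the `m>` that ends the last `</em>`.

```python
import re

# The four sample lines. \uFF0C is the fullwidth comma, \uFF1A the
# fullwidth colon (escaped so they are not confused with "," and ":").
LINES = [
    '<p class="OANA"><em>评论家Drive有一些重要的东西来说\uFF0CHome Edition关于总是分享胜利的人才\uFF0C转向他们的起源\uFF1A</em></p>',
    '<p class="OANA"><em>评论家Drive有一些重要的东西来说\uFF0C<em>Home Edition关于总是分享胜利的</em>人才\uFF0C转向他们的起源\uFF1A</em></p>',
    '<p class="OANA"><em>评论家有一些重要的东西来说\uFF0C关于总是分享胜利的人才\uFF0C转向他们的起源\uFF1A</em></p>',
    '<p class="OANA"><em>What is it called when you love your car?</em></p>',
]

# The pattern from the post, unchanged.
BROKEN = re.compile(r'<p class="OANA">+(?!\w+<em>).*(\w+[\x00-\x7F]).*</p>')

# For each line, record what the capture group ended up matching.
captured = []
for line in LINES:
    m = BROKEN.search(line)
    captured.append(m.group(1) if m else None)

print(captured)  # → ['m>', 'm>', 'm>', 'm>']
```

Every line is matched, and the group captures part of the closing tag rather than any text, which is why the pattern cannot tell Line 3 apart from the others.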

    • R
      Robin Cruise
      last edited by Jun 8, 2021, 8:24 AM

      Also, it would be nice to have another regex that finds only Line 4.

      • G
        guy038
        posted Jun 8, 2021, 6:41 PM, last edited by guy038 Jun 8, 2021, 8:06 PM

        Hello, @robin-cruise,

        I don’t really understand what you want!

        First, to match any Chinese character, you must use the range described by the [\x{4E00}-\x{9FFF}] character class. Refer here

        However, your text, between the <em> and </em> tags, also contains some fullwidth punctuation characters, such as , and :

        Refer here

        So, if I mark all characters with the regex [\x{4E00}-\x{9FFF}\x{FF00}-\x{FFEF}], in your sample, it matches 102 characters and, between the outer <em> and </em> tags:

        • Line 1 contains two ASCII strings ( Drive and Home Edition ) and three ranges of non-ASCII characters

        • Line 2 contains two ASCII strings ( Drive and Home Edition ) and four ranges of non-ASCII characters

        • Line 3 contains a single range of non-ASCII characters, without any ASCII character

        • Line 4 contains a single range of ASCII characters only ( What is it called when you love your car? )
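The counts above can be reproduced outside Notepad++ with a short Python sketch. It assumes the sample uses the fullwidth comma (U+FF0C) and colon (U+FF1A), written here as escapes, and it spells the character class with Python's `\uXXXX` syntax instead of `\x{XXXX}`.

```python
import re

# The four sample lines, with fullwidth punctuation written as escapes.
LINES = [
    '<p class="OANA"><em>评论家Drive有一些重要的东西来说\uFF0CHome Edition关于总是分享胜利的人才\uFF0C转向他们的起源\uFF1A</em></p>',
    '<p class="OANA"><em>评论家Drive有一些重要的东西来说\uFF0C<em>Home Edition关于总是分享胜利的</em>人才\uFF0C转向他们的起源\uFF1A</em></p>',
    '<p class="OANA"><em>评论家有一些重要的东西来说\uFF0C关于总是分享胜利的人才\uFF0C转向他们的起源\uFF1A</em></p>',
    '<p class="OANA"><em>What is it called when you love your car?</em></p>',
]

# Python spelling of [\x{4E00}-\x{9FFF}\x{FF00}-\x{FFEF}]:
# CJK Unified Ideographs plus Halfwidth and Fullwidth Forms.
CJK = r'[\u4E00-\u9FFF\uFF00-\uFFEF]'

# Total number of single characters matched across all four lines.
total = sum(len(re.findall(CJK, line)) for line in LINES)

# Number of contiguous non-ASCII ranges in each line.
runs = [len(re.findall(CJK + '+', line)) for line in LINES]

print(total)  # → 102
print(runs)   # → [3, 4, 1, 0]
```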

        Now, the question is: what do you want to do?


        On the other hand, your last question is :

        Also, it would be nice to have another regex that finds only Line 4.

        I suppose that the following regex (?<=<p class="OANA"><em>)[\x00-\x7F]+?(?=</em>) should work: it matches the shortest range of ASCII characters that starts right after the string <p class="OANA"><em> and stops just before the string </em>, which is not included in the match!
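As a quick check of that regex, here is a minimal Python sketch run against the four sample lines. It assumes Python's `re` module handles the fixed-width look-behind and the lazy quantifier the same way Notepad++ does. The fullwidth comma and colon are written as escapes.

```python
import re

# The four sample lines, with fullwidth punctuation written as escapes.
LINES = [
    '<p class="OANA"><em>评论家Drive有一些重要的东西来说\uFF0CHome Edition关于总是分享胜利的人才\uFF0C转向他们的起源\uFF1A</em></p>',
    '<p class="OANA"><em>评论家Drive有一些重要的东西来说\uFF0C<em>Home Edition关于总是分享胜利的</em>人才\uFF0C转向他们的起源\uFF1A</em></p>',
    '<p class="OANA"><em>评论家有一些重要的东西来说\uFF0C关于总是分享胜利的人才\uFF0C转向他们的起源\uFF1A</em></p>',
    '<p class="OANA"><em>What is it called when you love your car?</em></p>',
]

# Look-behind: the match must start right after <p class="OANA"><em>.
# Look-ahead: the match must stop right before </em>.
# In between, only ASCII characters are allowed.
ASCII_ONLY = re.compile(r'(?<=<p class="OANA"><em>)[\x00-\x7F]+?(?=</em>)')

hits = []
for line in LINES:
    m = ASCII_ONLY.search(line)
    hits.append(m.group(0) if m else None)

print(hits)
# → [None, None, None, 'What is it called when you love your car?']
```

Lines 1 to 3 fail immediately because the first character after the opening tags is a Chinese character, which the ASCII class cannot match.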

        Best Regards,

        guy038

        • R
          Robin Cruise @guy038
          last edited by Jun 8, 2021, 7:17 PM

          @guy038 said in Regex: Select all non-ASCII characters html tags:

          (?<=<p class="OANA"><em>)[\x00-\x7F]+?(?=</em>)

          thanks a lot @guy038

          The Community of users of the Notepad++ text editor.