    Regex: Select all non-ASCII characters html tags

    • Robin Cruise

      Hello. I have these 4 lines. I want a regex that finds only those HTML tags that contain non-ASCII characters (ignoring the <em></em> tags).

      Line 1
      <p class="OANA"><em>评论家Drive有一些重要的东西来说,Home Edition关于总是分享胜利的人才,转向他们的起源:</em></p>
      Line 2
      <p class="OANA"><em>评论家Drive有一些重要的东西来说,<em>Home Edition关于总是分享胜利的</em>人才,转向他们的起源:</em></p>
      Line 3
      <p class="OANA"><em>评论家有一些重要的东西来说,关于总是分享胜利的人才,转向他们的起源:</em></p>
      Line 4
      <p class="OANA"><em>What is it called when you love your car?</em></p>
      

      The output should be Line 1, Line 2 and Line 4 (so it should not match Line 3).

      My regex is not good: it finds all of them.

      Find: <p class="OANA">+(?!\w+<em>).*(\w+[\x00-\x7F]).*</p>
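A minimal sketch for reproducing the problem outside Notepad++, assuming Python's `re` module behaves like Notepad++'s regex engine for this particular pattern (this equivalence is an assumption, not something stated in the thread). It suggests why every line is found: the `\w+[\x00-\x7F]` group can be satisfied by the tag text itself, for example the `m>` at the end of `</em>`, so the pattern never has to look at the text between the tags.

```python
import re

# The four sample lines from the post above.
LINES = {
    1: '<p class="OANA"><em>评论家Drive有一些重要的东西来说,Home Edition关于总是分享胜利的人才,转向他们的起源:</em></p>',
    2: '<p class="OANA"><em>评论家Drive有一些重要的东西来说,<em>Home Edition关于总是分享胜利的</em>人才,转向他们的起源:</em></p>',
    3: '<p class="OANA"><em>评论家有一些重要的东西来说,关于总是分享胜利的人才,转向他们的起源:</em></p>',
    4: '<p class="OANA"><em>What is it called when you love your car?</em></p>',
}

# The regex from the post, unchanged.
ORIGINAL = re.compile(r'<p class="OANA">+(?!\w+<em>).*(\w+[\x00-\x7F]).*</p>')


def lines_matched(lines):
    """Return the numbers of the lines on which the pattern finds a match."""
    return [number for number, text in lines.items() if ORIGINAL.search(text)]


if __name__ == "__main__":
    print(lines_matched(LINES))
```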

      • Robin Cruise

        Also, it would be nice to have another regex that finds only Line 4.

        • guy038

          Hello, @robin-cruise,

          I don’t really understand what you want!

          First, to match the most common Chinese characters, you can use the range described by the [\x{4E00}-\x{9FFF}] character class (the CJK Unified Ideographs block). Refer here

          However, your text, between the <em> and </em> tags, also contains some fullwidth punctuation characters, such as the comma and the colon.

          Refer here

          So, if I mark all characters with the regex [\x{4E00}-\x{9FFF}\x{FF00}-\x{FFEF}] in your sample, it matches 102 characters and, between the outer <em> and </em> tags:

          • Line 1 contains two ASCII strings ( Drive and Home Edition ) and three ranges of non-ASCII characters

          • Line 2 contains two ASCII strings ( Drive and Home Edition ) and four ranges of non-ASCII characters

          • Line 3 contains one range of non-ASCII characters only, without any ASCII character

          • Line 4 contains one range of ASCII characters only ( What is it called when you love your car? )
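The marking step described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: Python's `re` module has no `\x{4E00}` syntax, so `\u` escapes stand in for it, and `SAMPLE` is a short made-up string (not one of the four lines) built from explicit escapes so that the fullwidth comma and colon are unambiguous. The helper names are hypothetical.

```python
import re

# Assumed Python equivalent of [\x{4E00}-\x{9FFF}\x{FF00}-\x{FFEF}].
CLASS = r"\u4e00-\u9fff\uff00-\uffef"
ONE_CHAR = re.compile(f"[{CLASS}]")
RUNS = re.compile(f"[{CLASS}]+|[^{CLASS}]+")

# Hypothetical sample: 3 ideographs, "Drive", 1 ideograph + fullwidth comma,
# "Home Edition", 2 ideographs + fullwidth colon.
SAMPLE = "\u8bc4\u8bba\u5bb6Drive\u6709\uff0cHome Edition\u5173\u4e8e\uff1a"


def count_marked(text):
    """Count the characters that the character class would mark."""
    return len(ONE_CHAR.findall(text))


def split_runs(text):
    """Split text into (run, is_marked) pairs of consecutive characters."""
    return [(m.group(0), bool(ONE_CHAR.match(m.group(0))))
            for m in RUNS.finditer(text)]


if __name__ == "__main__":
    print(count_marked(SAMPLE))
    for run, is_marked in split_runs(SAMPLE):
        print("CJK/fullwidth" if is_marked else "other", repr(run))
```

On this made-up sample the split gives two ASCII strings and three marked ranges, the same shape as the description of Line 1.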

          Now, the question is: what do you want to do?


          On the other hand, your last question is:

          also, will be a nice idea to use another regex as to find only the Line 4.

          I suppose that the following regex (?<=<p class="OANA"><em>)[\x00-\x7F]+?(?=</em>) should work: it matches the shortest range of ASCII characters located right after the string <p class="OANA"><em> and up to, but not including, the string </em>.
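For anyone who wants to check this regex outside Notepad++, here is a small Python sketch run against the four sample lines. It assumes Python's `re` module treats this pattern the same way as Notepad++'s engine (the lookbehind has a fixed width, so Python accepts it); the function name is hypothetical.

```python
import re

# The four sample lines from the first post.
LINES = {
    1: '<p class="OANA"><em>评论家Drive有一些重要的东西来说,Home Edition关于总是分享胜利的人才,转向他们的起源:</em></p>',
    2: '<p class="OANA"><em>评论家Drive有一些重要的东西来说,<em>Home Edition关于总是分享胜利的</em>人才,转向他们的起源:</em></p>',
    3: '<p class="OANA"><em>评论家有一些重要的东西来说,关于总是分享胜利的人才,转向他们的起源:</em></p>',
    4: '<p class="OANA"><em>What is it called when you love your car?</em></p>',
}

# Shortest run of ASCII characters right after <p class="OANA"><em>,
# stopping just before the next </em>.
ASCII_ONLY = re.compile(r'(?<=<p class="OANA"><em>)[\x00-\x7F]+?(?=</em>)')


def find_ascii_only(lines):
    """Map each matching line number to the text matched on that line."""
    found = {}
    for number, text in lines.items():
        match = ASCII_ONLY.search(text)
        if match:
            found[number] = match.group(0)
    return found


if __name__ == "__main__":
    print(find_ascii_only(LINES))
```

Lines 1 to 3 are skipped because the character right after the opening `<em>` is not ASCII, so `[\x00-\x7F]+?` cannot match even one character there.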

          Best Regards,

          guy038

          • Robin Cruise @guy038

            @guy038 said in Regex: Select all non-ASCII characters html tags:

            (?<=<p class="OANA"><em>)[\x00-\x7F]+?(?=</em>)

            Thanks a lot, @guy038

            The Community of users of the Notepad++ text editor.