    Regex: Select all non-ASCII characters html tags

    • Robin Cruise

      Hello. I have these 4 lines. I want a regex that finds only those HTML tags that contain non-ASCII characters (ignoring the <em></em> tags).

      Line 1
      <p class="OANA"><em>评论家Drive有一些重要的东西来说，Home Edition关于总是分享胜利的人才，转向他们的起源：</em></p>
      Line 2
      <p class="OANA"><em>评论家Drive有一些重要的东西来说，<em>Home Edition关于总是分享胜利的</em>人才，转向他们的起源：</em></p>
      Line 3
      <p class="OANA"><em>评论家有一些重要的东西来说，关于总是分享胜利的人才，转向他们的起源：</em></p>
      Line 4
      <p class="OANA"><em>What is it called when you love your car?</em></p>
      

      The output should be Line 1, Line 2 and Line 4 (so it should not match Line 3).

      My regex is not good: it finds all of them.

      Find: <p class="OANA">+(?!\w+<em>).*(\w+[\x00-\x7F]).*</p>

      • Robin Cruise

        Also, it would be nice to have another regex that finds only Line 4.

        • guy038

          Hello, @robin-cruise,

          I don’t really understand what you want!

          First, to match Chinese characters, you can use the [\x{4E00}-\x{9FFF}] character class, which covers the CJK Unified Ideographs block.

          However, your text, between the <em> and </em> tags, also contains some fullwidth punctuation characters, such as ， and ：, which belong to the Halfwidth and Fullwidth Forms block (\x{FF00}-\x{FFEF}).

          So, if I mark all characters with the regex [\x{4E00}-\x{9FFF}\x{FF00}-\x{FFEF}] in your sample, it matches 102 characters and, between the outer <em> and </em> tags:

          • Line 1 contains two ASCII strings (Drive and Home Edition) and three ranges of non-ASCII characters

          • Line 2 contains two ASCII strings (Drive and Home Edition) and four ranges of non-ASCII characters

          • Line 3 contains one range of non-ASCII characters only, without any ASCII character

          • Line 4 contains one range of ASCII characters only (What is it called when you love your car?)
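          As a rough cross-check of the counts above outside Notepad++, here is a minimal Python sketch. It assumes the sample punctuation is the fullwidth "，" and "：" mentioned earlier, and it uses Python's \uXXXX escapes in place of the \x{XXXX} syntax. The names LINES, CJK_CLASS, total_chars and runs_per_line are illustrative only.

```python
import re

# Sample lines from the question. The fullwidth "，" and "：" are an
# assumption based on the remark about fullwidth punctuation.
LINES = [
    '<p class="OANA"><em>评论家Drive有一些重要的东西来说，Home Edition关于总是分享胜利的人才，转向他们的起源：</em></p>',
    '<p class="OANA"><em>评论家Drive有一些重要的东西来说，<em>Home Edition关于总是分享胜利的</em>人才，转向他们的起源：</em></p>',
    '<p class="OANA"><em>评论家有一些重要的东西来说，关于总是分享胜利的人才，转向他们的起源：</em></p>',
    '<p class="OANA"><em>What is it called when you love your car?</em></p>',
]

# Same two ranges as [\x{4E00}-\x{9FFF}\x{FF00}-\x{FFEF}].
CJK_CLASS = r'[\u4e00-\u9fff\uff00-\uffef]'

# Total number of single characters matched across the whole sample.
total_chars = sum(len(re.findall(CJK_CLASS, line)) for line in LINES)

# Number of contiguous non-ASCII ranges per line (tags and ASCII text split the ranges).
runs_per_line = [len(re.findall(CJK_CLASS + '+', line)) for line in LINES]

print(total_chars)
print(runs_per_line)
```

          Under that assumption, the per-line range counts should agree with the list above.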

          Now, the question is: what do you want to do?
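          One possible reading of the request, given the expected output of Line 1, Line 2 and Line 4, is "paragraphs whose text contains at least one ASCII letter once the tags are set aside". This is only a guess at the intent. A minimal Python sketch under that assumption follows; has_ascii_text and the other names are hypothetical helpers, not part of any answer in this thread.

```python
import re

# Sample lines from the question (fullwidth punctuation assumed).
LINES = [
    '<p class="OANA"><em>评论家Drive有一些重要的东西来说，Home Edition关于总是分享胜利的人才，转向他们的起源：</em></p>',
    '<p class="OANA"><em>评论家Drive有一些重要的东西来说，<em>Home Edition关于总是分享胜利的</em>人才，转向他们的起源：</em></p>',
    '<p class="OANA"><em>评论家有一些重要的东西来说，关于总是分享胜利的人才，转向他们的起源：</em></p>',
    '<p class="OANA"><em>What is it called when you love your car?</em></p>',
]

TAG = re.compile(r'<[^>]+>')          # any HTML tag, e.g. <p ...>, <em>, </em>
ASCII_LETTER = re.compile(r'[A-Za-z]')


def has_ascii_text(line: str) -> bool:
    """Return True if the line holds an ASCII letter outside of its tags."""
    text_only = TAG.sub('', line)     # drop the tags so "p", "em", "class" do not count
    return ASCII_LETTER.search(text_only) is not None


# 1-based line numbers that satisfy the assumed criterion.
selected = [number for number, line in enumerate(LINES, start=1) if has_ascii_text(line)]
print(selected)
```

          If the intent is different (for example, "contains only non-ASCII text"), the test inside has_ascii_text would need to change accordingly.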


          On the other hand, your last question was:

          Also, it would be nice to have another regex that finds only Line 4.

          I suppose that the following regex, (?<=<p class="OANA"><em>)[\x00-\x7F]+?(?=</em>), should work: it matches the shortest range of ASCII characters that starts right after the string <p class="OANA"><em> and ends just before the string </em> (neither string is part of the match).
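          To try that regex outside Notepad++, here is a minimal Python sketch. Python's re module accepts the same pattern because the look-behind has a fixed width. The sample lines assume fullwidth punctuation, and the names LINE4_ONLY and matched are illustrative.

```python
import re

# Sample lines from the question (fullwidth punctuation assumed).
LINES = [
    '<p class="OANA"><em>评论家Drive有一些重要的东西来说，Home Edition关于总是分享胜利的人才，转向他们的起源：</em></p>',
    '<p class="OANA"><em>评论家Drive有一些重要的东西来说，<em>Home Edition关于总是分享胜利的</em>人才，转向他们的起源：</em></p>',
    '<p class="OANA"><em>评论家有一些重要的东西来说，关于总是分享胜利的人才，转向他们的起源：</em></p>',
    '<p class="OANA"><em>What is it called when you love your car?</em></p>',
]

# Look-behind: the match must start right after <p class="OANA"><em>
# Body:        one or more ASCII characters, as few as possible
# Look-ahead:  the match must stop right before </em>
LINE4_ONLY = re.compile(r'(?<=<p class="OANA"><em>)[\x00-\x7F]+?(?=</em>)')

matched = []
for line in LINES:
    found = LINE4_ONLY.search(line)
    # Store the matched text, or None when the line does not match.
    matched.append(found.group(0) if found else None)

print(matched)
```

          Lines 1 to 3 fail because the first character after <p class="OANA"><em> is not ASCII, so the ASCII class cannot match even one character at the only position the look-behind allows.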

          Best Regards,

          guy038

          • Robin Cruise @guy038

            @guy038 said in Regex: Select all non-ASCII characters html tags:

            (?<=<p class="OANA"><em>)[\x00-\x7F]+?(?=</em>)

            Thanks a lot, @guy038!


            The Community of users of the Notepad++ text editor.