
    Remove unicode characters within range

    Help wanted
    • Dan Wier

      I have a large text document that includes accented characters like æøåáäĺćçčéđńőöřůýţžš. I am trying to remove all Unicode characters between 0 and 96 so that only characters like these are left behind. That way, when I process text like this, I know which special characters I need to be able to handle.

      I would expect this regular expression to work, but the accented letters are still removed. I presume it’s using the Unicode hex code rather than the decimal number?

      [\u0001-\u0096,-]
      

      I don’t see a Notepad++ character class that would work either. Any suggestions?

      • PeterJones

        @Dan-Wier

        As the manual says, \u#### notation belongs to Extended search mode; it is not the regular expression notation for matching a character by its code. Extended search mode has no range notation, so make sure you use each notation in the right mode.

        In regular expression mode, you use \x{####} for four-nibble Unicode characters. The digits are hexadecimal, so \x{0096} is decimal 150, not 96.
        [\x{0001}-\x{0096}] will match from 'START OF HEADING' (U+0001) to 'START OF GUARDED AREA' (U+0096) … an odd range to pick for your stated goals, but whatever makes you happy on that.

        And, BTW, the ,- is useless, since comma and hyphen are already in that Unicode range.
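
To make the range arithmetic concrete, here is a small Python sketch (not Notepad++ itself) that checks which characters fall inside U+0001 to U+0096. The helper name `in_range` and the sample characters are only illustrative assumptions.

```python
# Code points inside \x{....} are hexadecimal, so 0x0096 is decimal 150.
LOW, HIGH = 0x0001, 0x0096

def in_range(ch: str) -> bool:
    """Return True if ch lies between U+0001 and U+0096 inclusive."""
    return LOW <= ord(ch) <= HIGH

# Comma and hyphen are inside the range; the accented letters are above it.
for ch in [",", "-", "a", "\u00e6", "\u0161"]:
    print(f"U+{ord(ch):04X} {ch!r} in range: {in_range(ch)}")
```

If the goal really is decimal 0 to 96, the upper bound would be U+0060 instead, which leaves the lowercase ASCII letters outside the range.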

        With regular expression mode and regular expression syntax, it matches the right characters.
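
For anyone who would rather do the same survey outside the editor, the following is a rough Python sketch of the stated goal: strip the low range, then list the distinct characters that remain. It assumes Python's `re` module, which (unlike Notepad++ regular expression mode) does accept `\uXXXX` escapes inside a pattern; the function name `leftover_characters` and the sample text are hypothetical.

```python
import re

# Same range as the Notepad++ class [\x{0001}-\x{0096}], in Python's syntax.
LOW_RANGE = re.compile(r"[\u0001-\u0096]")

def leftover_characters(text: str) -> list[str]:
    """Strip the low range and return the distinct remaining
    characters, sorted by code point."""
    stripped = LOW_RANGE.sub("", text)
    return sorted(set(stripped))

# Escapes are used so the sample is unambiguously precomposed (NFC).
sample = "Sm\u00f6rg\u00e5sbord, caf\u00e9 - na\u00efve"
for ch in leftover_characters(sample):
    print(f"U+{ord(ch):04X} {ch}")
```

Text that stores accents as separate combining marks would report the combining marks rather than the accented letters, so normalizing first may be worth considering.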

        • Dan Wier

          @PeterJones Thank you so much for not only the answer but also for explaining the difference in notation. I may need to adjust my range, but it felt like a good place to start to see what I get back from these documents.
