Remove unicode characters within range

Dan Wier

I have a large text document that includes accented characters like æøåáäĺćçčéđńőöřůýţžš. I am trying to remove all Unicode characters between 0 and 96, leaving only these characters behind, so that I know which special characters I need to handle when I process text like this.

I would expect this regular expression to work, but the accented letters are still removed. I presume it’s using the Unicode hex code rather than the decimal number?

[\u0001-\u0096,-]

I don’t see a Notepad++ character class that would work either. Any suggestions?

PeterJones

@Dan-Wier

As the manual says, \u#### notation belongs to Extended search mode; it is not the regular expression notation for matching by character code. Extended search mode does not have range notation. Make sure you use each notation in the right search mode.

In a regular expression, you use \x{####} for four-nibble (four hex digit) Unicode characters.
[\x{0001}-\x{0096}] will match from 'START OF HEADING' (U+0001) to 'START OF GUARDED AREA' (U+0096) … an odd range to pick for your stated goals, but whatever makes you happy on that.

And, BTW, the ,- is useless, since comma and hyphen are already in that Unicode range.

With regular expression mode and regular expression syntax, it matches the right characters.
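To see why the hex-versus-decimal question matters, here is a rough sketch of the same idea outside Notepad++, written in Python. It is only an illustration: Python's re module writes the escape as \xhh rather than \x{####}, and the sample string and the helper name leftover_chars are made up for this example. It compares the range from the thread (U+0001 to U+0096, in hex) with what "0 to 96" would be if read as decimal (U+0001 to U+0060).

```python
import re

# U+0001..U+0096 written in hex, mirroring the range discussed in the thread.
HEX_RANGE = re.compile(r"[\x01-\x96]")

# Decimal 96 is 0x60, so "1 to 96" in decimal stops just before 'a' (U+0061).
DECIMAL_96_RANGE = re.compile(r"[\x01-\x60]")


def leftover_chars(text, pattern):
    """Return the sorted, de-duplicated characters that survive removal.

    Hypothetical helper for this sketch: every character matched by
    `pattern` is deleted, and whatever remains is reported once each.
    """
    return sorted(set(pattern.sub("", text)))


# Made-up sample text: "Café - æøå, test", written with escapes so the
# code points are unambiguous.
sample = "Caf\u00e9 - \u00e6\u00f8\u00e5, test"

# The hex range removes all of ASCII, including the comma and hyphen,
# so only the accented letters are left.
print(leftover_chars(sample, HEX_RANGE))

# The decimal reading ends at U+0060, so lowercase ASCII letters survive too.
print(leftover_chars(sample, DECIMAL_96_RANGE))
```

With the hex range, only the accented letters remain, and the comma and hyphen are gone without being listed separately. With the decimal reading, the lowercase ASCII letters are left behind as well, which is probably not what is wanted here.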

Dan Wier

@PeterJones Thank you so much, not only for the answer but also for explaining the difference in notation. I may need to adjust my range, but it felt like a good place to start to see what I get back from these documents.