Search for unicode char?

Boliusa bol

If I paste in a character like \uFEC1 or \uFEC9 from the Windows Character Map, how do I search for it in UTF-8 and UTF-16?

For UTF-16, I tried [\ufec1]. It picked the character up, but then it picked up \ufec9 as well.

Its UTF-8 encoding is shown at this URL: http://www.fileformat.info/info/unicode/char/fec1/index.htm

In UTF-8 it is 0xEF 0xBB 0x81 (efbb81), but I can’t see how to search for it with that either.

I’m interested in searching for it in both UTF-8 and UTF-16.

Thanks

guy038

Hello Boliusa bol,

I really wonder how you managed to find your Arabic characters with the syntax [\ufec1], or even \ufec1 !?

With the Boost regex library, used by N++ since v6.00, I think that the right syntax to find your two characters is \x{fec1} and \x{fec9}.

Your characters are part of the Arabic Presentation Forms-B block, from FE70 to FEFF, charted at the address below :

http://www.unicode.org/charts/PDF/UFE70.pdf

Secondly, don’t forget that, in N++, the search is character-oriented and NOT byte-oriented. So, when you search for \x{fec1} in a file with a UTF-8 encoding, the regex engine looks for the three consecutive bytes EF BB 81, which represent the single Arabic character ﻁ, and nothing else.
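As a rough illustration outside Notepad++, here is a minimal Python sketch of the same idea. Python's re module is only a stand-in (it is not the Boost engine that N++ uses), and the sample text is made up for the example. The search is written in terms of the code point, while the UTF-8 data stores three bytes :

```python
import re

# Hypothetical sample text containing both characters from the question.
text = "abc \uFEC1 def \uFEC9"

# The UTF-8 form of U+FEC1 is the three bytes EF BB 81.
utf8_bytes = "\uFEC1".encode("utf-8")
print(utf8_bytes.hex())  # efbb81

# A character-oriented search for U+FEC1 matches that character only,
# not its neighbour U+FEC9.
matches = re.findall("\uFEC1", text)
print(len(matches))  # 1
```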

Note that :

With the UCS-2 BE encoding ( Big Endian ), this character would be encoded with the two bytes FE C1
With the UCS-2 LE encoding ( Little Endian ), this character would be encoded with the two bytes C1 FE
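The byte order can be checked with a small Python sketch. It assumes Python's utf-16-be and utf-16-le codecs as stand-ins for UCS-2 BE and UCS-2 LE, which is safe here because the two encodings give identical bytes for any character up to FFFF. Python reports FE C1 for big endian and C1 FE for little endian :

```python
ch = "\uFEC1"  # ARABIC LETTER TAH ISOLATED FORM

# Big endian stores the most significant byte first.
big_endian = ch.encode("utf-16-be")
# Little endian stores the least significant byte first.
little_endian = ch.encode("utf-16-le")

print(big_endian.hex())     # fec1
print(little_endian.hex())  # c1fe
```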

And it is good to recall the difference between the UTF-16 and UCS-2 encodings :

With the UCS-2 encodings, supported by Notepad++, you can only encode the characters of the Basic Multilingual Plane, from UNICODE code-point \x0 to \x{FFFF}
With the UTF-16 encodings, NOT supported by Notepad++, you can encode ALL UNICODE code-points, from \x0 to \x{10FFFF}, thanks to the surrogate pairs mechanism.
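To make the surrogate pairs mechanism concrete, here is a short Python sketch. The helper name to_surrogates and the sample code-point 1F600 are my own choices for the illustration, not something taken from N++ :

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# Cross-check against Python's own UTF-16 encoder.
encoded = chr(0x1F600).encode("utf-16-be").hex().upper()
print(encoded)  # D83DDE00
```

A character such as fec1 needs no pair at all, which is why UCS-2 and UTF-16 agree on it.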

BTW, in N++, the regex syntax \x{...} does NOT work, at present, for code-points over FFFF !

Now, if the N++ search is “character” oriented, how can you detect the individual bytes of some characters in a UTF-8 file ? It’s normally impossible, and it’s rather useless to do so ! Indeed, with the single syntax \x{fec1}, you’ll always find your ARABIC LETTER TAH ISOLATED FORM, as long as your file’s encoding is a UNICODE encoding !

If you really want to see the individual bytes of a UTF-8 file, just choose the option Encode in ANSI :

For instance, let’s suppose that you have a UTF-8 file that contains ONLY the ARABIC LETTER AIN ISOLATED FORM : ﻉ. Once you choose the menu option Encoding - Encode in ANSI, N++ displays the three-character string ï»‰.

And, as the UTF-8 representation of this character is EF BB 89, it’s easy to verify that the simple regex search \xEF\xBB\x89 does find the string ï»‰
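The same reinterpretation can be sketched in Python. This assumes that the ANSI code page is Windows-1252, which depends on the system locale, and it uses Python's re module on raw bytes rather than the N++ search dialog :

```python
import re

# UTF-8 bytes of U+FEC9, ARABIC LETTER AIN ISOLATED FORM.
raw = "\uFEC9".encode("utf-8")
print(raw.hex())  # efbb89

# Reinterpreting the same bytes as Windows-1252 gives three characters.
as_ansi = raw.decode("cp1252")
print([hex(ord(c)) for c in as_ansi])  # ['0xef', '0xbb', '0x2030']

# A byte-level search for EF BB 89 on the raw data.
found = re.search(rb"\xEF\xBB\x89", raw)
print(found is not None)  # True
```

The third character prints as 0x2030 because, in that code page, byte 89 is the per mille sign.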

By the way, below is a very nice Internet tool to get the main information about each UNICODE character. By default, you type the UNICODE hexadecimal code-point of your character at the top of the page ( for instance fec9 ), but you may select one of the six other proposed interpretations :

http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?

Hope that this post will be useful to you, anyway !

Best Regards,

guy038

P.S. :

Waooh, by chance, the Windows ANSI Code Pages 1252 ( Latin 1 ) and 1256 ( Arabic ) have the same representation for the three characters of code \xEF, \xBB and \x89 !

To check this, consult the different Windows encodings, below :

https://msdn.microsoft.com/en-us/goglobal/bb964654
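The coincidence between the two code pages can also be checked with Python's built-in cp1252 and cp1256 codecs, assuming that they follow the Microsoft tables linked above :

```python
raw = bytes([0xEF, 0xBB, 0x89])

latin = raw.decode("cp1252")   # Windows-1252 ( Latin 1 )
arabic = raw.decode("cp1256")  # Windows-1256 ( Arabic )

# Both code pages map these three bytes to the same three characters.
print(latin == arabic)  # True
```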