Search for a Unicode character?
-
If I paste in characters like \uFEC1 and \uFEC9 from the Windows charmap,
how do I search for them in UTF-8 and UTF-16?
For UTF-16
I tried [\ufec1]. It picked it up, but then it picked up \ufec9 as well. Its UTF-8 code is shown at this URL: http://www.fileformat.info/info/unicode/char/fec1/index.htm
The UTF-8 for it is 0xEF 0xBB 0x81 (efbb81), but I can't see how to search for it with that either.
I'm interested in searching for it in both UTF-8 and UTF-16.
Thanks
-
Hello Boliusa bol,
I really wonder how you could find your Arabic characters with that syntax [\ufec1], or even \ufec1 !?
With the BOOST regex library, used since N++ v6.00, I think that the right syntax to find your two characters should be \x{fec1} and \x{fec9}.
Your characters are part of the Arabic Presentation Forms-B table, from FE70 to FEFF, at the address below :
http://www.unicode.org/charts/PDF/UFE70.pdf
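To illustrate what a character-oriented search means, here is a minimal Python sketch. It is only an analogy: Python's `re` module is not the Boost engine used by N++, and it spells a code point as `\uXXXX` rather than `\x{XXXX}`. The sample string is made up for the example.

```python
import re

# Made-up sample text containing both characters: TAH (U+FEC1) and AIN (U+FEC9)
text = "\uFEC1 and \uFEC9"

# Each pattern names exactly one code point, so each search finds one character only
tah_hits = re.findall(r"\uFEC1", text)
ain_hits = re.findall(r"\uFEC9", text)

print(len(tah_hits), len(ain_hits))
```

Each pattern matches its own character and nothing else, whatever the on-disk encoding of the text was.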
Secondly, don't forget that, in N++, the search is oriented towards characters and NOT bytes. So, when you search for \x{fec1} in a file with a UTF-8 encoding, the regex engine looks for the three consecutive bytes EF BB 81, which represent ONLY the Arabic character ﻁ
Note that :
-
With the UCS-2 BE encoding ( Big Endian ), this character would have been encoded with the two bytes FE C1
-
With the UCS-2 LE encoding ( Little Endian ), this character would have been encoded with the two bytes C1 FE
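These byte sequences can be checked with a short Python sketch. It assumes Python's `utf-16-be` and `utf-16-le` codecs as stand-ins for UCS-2 BE and LE, which is valid here because both schemes give the same bytes for any code point up to FFFF.

```python
ch = "\uFEC1"  # ARABIC LETTER TAH ISOLATED FORM

utf8_bytes = ch.encode("utf-8")       # three bytes
ucs2_be = ch.encode("utf-16-be")      # most significant byte first
ucs2_le = ch.encode("utf-16-le")      # least significant byte first

print(utf8_bytes.hex())  # efbb81
print(ucs2_be.hex())     # fec1
print(ucs2_le.hex())     # c1fe
```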
And it is good to recall the difference between the UTF-16 and UCS-2 encodings :
-
With the UCS-2 encodings, supported by Notepad++, you can encode any character, from UNICODE code-point \x0 to \x{FFFD}
-
With the UTF-16 encodings, NOT supported by Notepad++, you can encode ALL UNICODE code-points, from \x0 to \x{10FFFD}, with the surrogate pairs mechanism.
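As a rough illustration of the surrogate pairs mechanism, the Python sketch below splits a code-point above FFFF into its two 16-bit units. The function name `surrogate_pair` and the sample code-point U+1F600 are choices made for this example only.

```python
from typing import Tuple

def surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code-point above FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code-points above FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```

The result agrees with Python's own `utf-16-be` codec, which emits the bytes D8 3D DE 00 for the same code-point.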
BTW, in N++, the regex syntax \x{...} does NOT work, presently, for code-points over FFFF !
Now, if the N++ search is "character" oriented, how can you detect the individual bytes of some characters of a UTF-8 file ? It's normally impossible, and it's rather useless to do so ! Indeed, with the single syntax \x{fec1}, you'll always find your ARABIC LETTER TAH ISOLATED FORM, as long as your file's encoding is a UNICODE encoding !
If you really want to see the individual bytes of a UTF-8 file, just choose the option Encode in ANSI :
For instance, let's suppose that you have a UTF-8 file containing ONLY the ARABIC LETTER AIN ISOLATED FORM : ﻉ
Once you choose the menu option Encoding - Encode in ANSI, it will display the three-character string ï»‰
And, as the UTF-8 representation of this character is EF BB 89, it's easy to verify that the simple regex search \xEF\xBB\x89 does find the string ï»‰
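The same experiment can be approximated outside N++ with a small Python sketch. It assumes that "ANSI" means the Windows-1252 code page (codec `cp1252`), and it uses a bytes pattern because Python's `re` searches raw bytes only when given bytes input.

```python
import re

data = "\uFEC9".encode("utf-8")        # the UTF-8 file content: EF BB 89

# Reading the same three bytes as Windows-1252 gives three separate characters
as_ansi = data.decode("cp1252")
print(as_ansi)

# A byte-oriented search for the UTF-8 sequence
match = re.search(rb"\xEF\xBB\x89", data)
print(match.span() if match else None)
```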
By the way, here is, below, a very nice Internet tool to get the main information for each UNICODE character. By default, you must type, at the top of the page, the UNICODE hexadecimal code-point of your character ( for instance fec9 ), but you may select one of the six other proposed interpretations :
http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?
Hope that this post will be useful to you, anyway !
Best Regards,
guy038
P.S. :
Waooh, by chance, the Windows ANSI Code Pages 1252 ( Latin 1 ) and 1256 ( Arabic ) have the same representation for the three characters of code \xEF, \xBB and \x89 !
To check this, consult the tables of the different Windows encodings.
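The coincidence mentioned in this P.S. can be checked with a minimal Python sketch, assuming Python's `cp1252` and `cp1256` codecs match the Windows code pages of the same numbers.

```python
raw = bytes([0xEF, 0xBB, 0x89])

latin = raw.decode("cp1252")    # Windows-1252 ( Latin 1 )
arabic = raw.decode("cp1256")   # Windows-1256 ( Arabic )

# Both code pages map these three bytes to the same three characters
same = latin == arabic
print(latin, arabic, same)
```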