Search for unicode char?



  • If I paste in a character such as \uFEC1 or \uFEC9 from the Windows Character Map:

    How do I search for it in UTF8 and UTF16?

    For UTF-16
    I tried [\ufec1]. It found the character, but it also matched \ufec9.

    According to http://www.fileformat.info/info/unicode/char/fec1/index.htm, its UTF-8 encoding is 0xEF 0xBB 0x81 (efbb81), but I can’t see how to search for it with that either.

    I’m interested in searching for it in both UTF-8 and UTF-16.

    Thanks



  • Hello Boliusa bol,

    I really wonder how you could find your Arabic characters with that syntax [\ufec1], or even \ufec1 !?

    With the BOOST regex library, used since N++ v6.00, I think that the right syntax to find your two characters should be \x{fec1} and \x{fec9}.

    Your characters are part of the Arabic Presentation Forms-B table, from FE70 to FEFF, at the address, below :

    http://www.unicode.org/charts/PDF/UFE70.pdf


    Secondly, don’t forget that, in N++, the search works on characters and NOT on bytes. So, when you search for \x{fec1} in a file with UTF-8 encoding, the regex engine looks for the three consecutive bytes EF BB 81, which together represent that single Arabic character.

    Note that :

    • With the UCS-2 BE encoding ( Big Endian ), this character would have been encoded with the two bytes FE C1

    • With the UCS-2 LE encoding ( Little Endian ), this character would have been encoded with the two bytes C1 FE
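    To double-check the byte sequences of U+FEC1 outside N++, here is a minimal sketch in Python ( used only as a neutral checker, it is not what N++ runs internally ). For a character of the Basic Multilingual Plane, like this one, the utf-16-be and utf-16-le codecs give the same bytes as UCS-2 BE and UCS-2 LE :

```python
# Encode ARABIC LETTER TAH ISOLATED FORM (U+FEC1) in three Unicode encodings.
ch = "\ufec1"

utf8_bytes = ch.encode("utf-8")        # three bytes
ucs2_be_bytes = ch.encode("utf-16-be") # high byte first
ucs2_le_bytes = ch.encode("utf-16-le") # low byte first

print("UTF-8    :", utf8_bytes.hex())
print("UCS-2 BE :", ucs2_be_bytes.hex())
print("UCS-2 LE :", ucs2_le_bytes.hex())
```

    The "-be" and "-le" codec names are chosen on purpose : they never add a byte order mark, so only the bytes of the character itself are shown.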

    And it is good to recall the difference between the UTF-16 and UCS-2 encodings :

    • With the UCS-2 encodings, supported by Notepad++, you can encode any character of the Basic Multilingual Plane, from UNICODE code-point \x0 to \x{FFFF}

    • With the UTF-16 encodings, NOT supported by Notepad++, you can encode ALL UNICODE code-points, from \x0 to \x{10FFFF}, with the surrogate pairs mechanism.

    BTW, in N++, the regex syntax \x{...} does NOT work, presently, for code-points over FFFF !
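    To illustrate the surrogate pairs mechanism, here is a small Python sketch. The helper name utf16_units is hypothetical, and U+1F600 is just an arbitrary example of a code-point over FFFF. A character like U+FEC1 fits in one 16-bit unit, whereas a code-point over FFFF needs two :

```python
import struct

def utf16_units(ch):
    """Return the UTF-16 code units of ch as 4-digit hex strings (hypothetical helper)."""
    data = ch.encode("utf-16-be")
    # Read the bytes back two at a time, as big-endian unsigned 16-bit units.
    return [f"{unit:04X}" for (unit,) in struct.iter_unpack(">H", data)]

bmp_units = utf16_units("\ufec1")         # one unit, same as UCS-2
astral_units = utf16_units("\U0001F600")  # two units: a surrogate pair

print(bmp_units)
print(astral_units)
```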


    Now, if the N++ search is “characters” oriented, how can you detect the individual bytes of some characters of an UTF-8 file ? It’s normally impossible and it’s rather useless to do so ! Indeed, with the single syntax \x{fec1}, you’ll always find your ARABIC LETTER TAH ISOLATED FORM, as long as your file’s encoding is a UNICODE encoding !
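    The same idea can be sketched in Python, as an analogy only : once the bytes are decoded to text, the same character pattern finds the character at the same position, whatever the encoding on disk was. Note that Python’s re module writes the character as \ufec1 and does not accept the BOOST syntax \x{fec1}. The sample text is made up for the example :

```python
import re

text = "abc \ufec1 def \ufec9"   # made-up sample containing both characters
pattern = re.compile("\ufec1")   # character-oriented pattern

positions = {}
for encoding in ("utf-8", "utf-16-le", "utf-16-be"):
    raw = text.encode(encoding)      # the bytes differ for each encoding
    decoded = raw.decode(encoding)   # back to characters, as an editor would do
    positions[encoding] = pattern.search(decoded).start()

print(positions)
```

    The three encodings give three different byte streams, yet the match is always reported at the same character offset, and \ufec9 is never matched by this pattern.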

    If you really want to see the individual bytes of an UTF-8 file, just choose the option Encode in ANSI :

    For instance, let’s suppose that you have an UTF-8 file containing ONLY the ARABIC LETTER AIN ISOLATED FORM : ﻉ. Once you choose the menu option Encoding - Encode in ANSI, it will display the three-character string ï»‰.

    And, as the UTF-8 representation of this character is EF BB 89, it’s easy to verify that the simple regex search of \xEF\xBB\x89 does find the string ï»‰
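    Here is a rough Python equivalent of that experiment, assuming that the ANSI code page is Windows-1252 : the three UTF-8 bytes of U+FEC9 are re-read as three single-byte characters, and a bytes-level regex finds the raw sequence.

```python
import re

raw = "\ufec9".encode("utf-8")     # the three bytes EF BB 89

# Re-interpret each byte as one Windows-1252 character (the "ANSI" view).
ansi_view = raw.decode("cp1252")

# A bytes pattern (rb"...") matches the raw bytes, not the decoded characters.
byte_match = re.search(rb"\xEF\xBB\x89", raw)

print(raw.hex(), len(ansi_view), byte_match.span())
```

    The pattern is written as a bytes pattern on purpose : in a Python text pattern, \x89 would mean the code-point U+0089 and not the byte 89.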

    By the way, here is, below, a very nice Internet tool to get the main information about each UNICODE character. By default, you must type, at the top of the page, the UNICODE hexadecimal code-point of your character ( for instance fec9 ), but you may select one of the six other proposed interpretations :

    http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?

    Hope that this post will be useful to you, anyway !

    Best Regards,

    guy038

    P.S. :

    Waooh, by chance, the Windows ANSI Code Pages 1252 ( Latin 1 ) and 1256 ( Arabic ) have the same representation for the three characters of code \xEF, \xBB and \x89 !

    To check this, consult the different Windows encodings, below :

    https://msdn.microsoft.com/en-us/goglobal/bb964654
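    This coincidence can also be checked with a short Python sketch, relying on the cp1252 and cp1256 codecs shipped with Python. The byte C1 is added only as a counter-example, to show that the two code pages do not agree everywhere :

```python
three_bytes = bytes([0xEF, 0xBB, 0x89])

# Decode the same bytes with both Windows code pages.
as_latin = three_bytes.decode("cp1252")
as_arabic = three_bytes.decode("cp1256")
same_for_these_bytes = (as_latin == as_arabic)

# Counter-example: byte C1 is a Latin letter in 1252 and an Arabic letter in 1256.
c1_differs = bytes([0xC1]).decode("cp1252") != bytes([0xC1]).decode("cp1256")

print(same_for_these_bytes, c1_differs)
```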

