    Search for unicode char?

    • Boliusa bol

      If I paste in characters like \uFEC1 and \uFEC9 from the Windows charmap, how do I search for them in UTF-8 and UTF-16?

      For UTF-16, I tried [\ufec1]. It picked the character up, but then it picked up \ufec9 as well.

      Its UTF-8 encoding, as you can see at http://www.fileformat.info/info/unicode/char/fec1/index.htm, is 0xEF 0xBB 0x81 (efbb81), but I can’t see how to search for it with that either.

      I’m interested in searching for it in both UTF-8 and UTF-16.

      Thanks

      • guy038

        Hello Boliusa bol,

        I really wonder how you could find your Arabic characters with the syntax [\ufec1], or even \ufec1 !?

        With the Boost regex library, used by N++ since v6.00, I think that the right syntax to find your two characters should be \x{fec1} and \x{fec9}.

        Your characters are part of the Arabic Presentation Forms-B block, from FE70 to FEFF, charted at the address below :

        http://www.unicode.org/charts/PDF/UFE70.pdf


        Secondly, don’t forget that, in N++, the search operates on characters and NOT on bytes. So, when you search for \x{fec1} in a file with a UTF-8 encoding, the regex engine looks for the three consecutive bytes EF BB 81, which together represent the single Arabic character ﻁ

        Note that :

        • With the UCS-2 BE encoding ( Big Endian ), this character would be encoded with the two bytes FE C1

        • With the UCS-2 LE encoding ( Little Endian ), this character would be encoded with the two bytes C1 FE
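The byte sequences above can be checked outside Notepad++. Here is a minimal Python sketch, assuming a Python 3.8+ interpreter is at hand. Python has no codec named UCS-2, so the `utf-16-be` and `utf-16-le` codecs stand in for it; for a character below FFFF such as this one, they should produce the same two bytes.

```python
# U+FEC1 ARABIC LETTER TAH ISOLATED FORM, the character from the question.
tah = "\uFEC1"

# Encode the same single character three different ways.
utf8_bytes = tah.encode("utf-8")
utf16_be_bytes = tah.encode("utf-16-be")  # stands in for UCS-2 BE
utf16_le_bytes = tah.encode("utf-16-le")  # stands in for UCS-2 LE

print(utf8_bytes.hex(" "))      # ef bb 81
print(utf16_be_bytes.hex(" "))  # fe c1
print(utf16_le_bytes.hex(" "))  # c1 fe
```

The character is the same in all three cases; only the bytes written to disk differ.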

        And it is good to recall the difference between the UTF-16 and UCS-2 encodings :

        • With the UCS-2 encodings, supported by Notepad++, you can encode any character of the Basic Multilingual Plane, from Unicode code-point \x0 to \x{FFFF}

        • With the UTF-16 encodings, NOT supported by Notepad++, you can encode ALL Unicode code-points, from \x0 to \x{10FFFF}, with the surrogate pairs mechanism.

        BTW, in N++, the regex syntax \x{...} does NOT presently work for code-points above FFFF !
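To illustrate the surrogate pairs mechanism, here is a small Python sketch (assuming Python 3.9+). The helper `utf16_code_units` is a hypothetical name introduced only for this example, and U+1F600 is just an arbitrary code-point above FFFF chosen for the demonstration.

```python
def utf16_code_units(ch: str) -> list[int]:
    """Return the 16-bit code units that UTF-16 uses for the string ch."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# A character below FFFF needs one 16-bit unit, exactly as in UCS-2.
bmp_units = utf16_code_units("\uFEC1")

# A character above FFFF needs two units: a high and a low surrogate.
astral_units = utf16_code_units("\U0001F600")

print([hex(u) for u in bmp_units])     # ['0xfec1']
print([hex(u) for u in astral_units])  # ['0xd83d', '0xde00']
```

UCS-2 has no way to express the second case, which is the practical difference between the two encodings.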


        Now, if the N++ search is “character” oriented, how can you detect the individual bytes of some characters in a UTF-8 file ? Normally you can’t, and it’s rather useless to do so ! Indeed, with the single syntax \x{fec1}, you’ll always find your ARABIC LETTER TAH ISOLATED FORM, as long as your file’s encoding is a Unicode encoding !
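The same idea can be sketched in Python, whose `re` module also searches decoded characters rather than raw bytes. This is only an analogy: Python's engine is not the Boost engine used by N++, and it writes the escape as `\uFEC1` instead of `\x{fec1}`. The sample text is made up for the illustration.

```python
import re

# Sample text holding both characters from the question.
text = "abc \uFEC1 def \uFEC9"
pattern = re.compile("\uFEC1")

results = {}
for encoding in ("utf-8", "utf-16-le", "utf-16-be"):
    raw = text.encode(encoding)     # what would be stored on disk
    decoded = raw.decode(encoding)  # what the search actually sees
    results[encoding] = [m.start() for m in pattern.finditer(decoded)]

# Each encoding should report the same single match, at character index 4,
# and U+FEC9 should never be reported.
print(results)
```

Whatever the bytes on disk look like, one pattern finds the one character, and only that character.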

        If you really want to see the individual bytes of a UTF-8 file, just choose the option Encode in ANSI :

        For instance, let’s suppose that you have a UTF-8 file containing ONLY the ARABIC LETTER AIN ISOLATED FORM : ﻉ. Once you choose the menu option Encoding - Encode in ANSI, it will display the three-character string ï»‰.

        And, as the UTF-8 representation of this character is EF BB 89, it’s easy to verify that the simple regex search \xEF\xBB\x89 does find the string ï»‰
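A rough Python equivalent of this experiment, assuming that the ANSI code page is Windows-1252: decoding the UTF-8 bytes as `cp1252` mimics what Encode in ANSI shows, and a bytes pattern mimics the `\xEF\xBB\x89` search. Note that in Python the hex escapes only mean "bytes" when both the pattern and the data are `bytes` objects, so the search below runs on the raw data, not on the decoded string.

```python
import re

# U+FEC9 ARABIC LETTER AIN ISOLATED FORM
ain = "\uFEC9"

raw = ain.encode("utf-8")         # the three bytes EF BB 89
ansi_view = raw.decode("cp1252")  # the same bytes read as Windows-1252 text

# Byte-oriented search: a bytes pattern applied to bytes data.
match = re.search(rb"\xEF\xBB\x89", raw)

print(raw.hex(" "))   # ef bb 89
print(ansi_view)      # ï»‰
print(match.span())   # (0, 3)
```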

        By the way, here is, below, a very nice Internet tool to get the main information about each Unicode character. By default, you must type, at the top of the page, the Unicode hexadecimal code-point of your character ( for instance fec9 ), but you may select one of the six other proposed interpretations :

        http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?

        Hope that this post will be useful to you, anyway !

        Best Regards,

        guy038

        P.S. :

        Wow, by chance, the Windows ANSI code pages 1252 ( Latin 1 ) and 1256 ( Arabic ) have the same representation for the three characters of code \xEF, \xBB and \x89 !

        To check this, consult the different Windows encodings, below :

        https://msdn.microsoft.com/en-us/goglobal/bb964654
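The coincidence mentioned in this P.S. can also be checked with a short Python sketch, assuming that Python's `cp1252` and `cp1256` codecs follow the Windows code page tables linked above.

```python
# The three bytes of the UTF-8 form of U+FEC9.
raw = bytes([0xEF, 0xBB, 0x89])

# Read the same bytes through the two Windows ANSI code pages.
as_1252 = raw.decode("cp1252")  # Latin 1
as_1256 = raw.decode("cp1256")  # Arabic

print(as_1252)
print(as_1256)
print(as_1252 == as_1256)
```

If the two decoded strings compare equal, both code pages map these three byte values to the same three characters.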
