    Search for unicode char?

    • Boliusa bol

      If I paste in a character like \uFEC1 and \uFEC9 from windows charmap

      How do I search for it in UTF8 and UTF16?

      For UTF-16
      I tried [\ufec1]; it picked it up, but then it picked up \ufec9 as well.

      Its UTF8 code as you see from this URL http://www.fileformat.info/info/unicode/char/fec1/index.htm

      UTF 8 for it is 0xEF 0xBB 0x81 (efbb81) but I can’t see how to search for it with that either.

      I’m interested in searching for it in both UTF 8 and UTF 16.

      Thanks

      • guy038

        Hello Boliusa bol,

        I really wonder how you could find your Arabic characters with that syntax [\ufec1], or even \ufec1 !?

        With the BOOST regex library, used since N++ v6.00, I think that the right syntax to find your two characters should be \x{fec1} and \x{fec9}.

        Your characters are part of the Arabic Presentation Forms-B table, from FE70 to FEFF, at the address, below :

        http://www.unicode.org/charts/PDF/UFE70.pdf


        Secondly, don’t forget that, in N++, the search is character-oriented and NOT byte-oriented. So, when you search for \x{fec1} in a file with a UTF-8 encoding, the regex engine looks for the three consecutive bytes EF BB 81, which represent the single Arabic character ﻁ

        Note that :

        • With the UCS-2 BE encoding ( Big Endian ), this character would have been encoded with the two bytes FE C1

        • With the UCS-2 LE encoding ( Little Endian ), this character would have been encoded with the two bytes C1 FE
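The byte sequences for this character can be checked outside Notepad++. Here is a minimal Python sketch, assuming that Python's `utf-16-be` and `utf-16-le` codecs can stand in for UCS-2 BE and UCS-2 LE (the two families agree for every code-point up to FFFF):

```python
# Encode U+FEC1 (ARABIC LETTER TAH ISOLATED FORM) in three encodings.
ch = "\uFEC1"

utf8_hex = ch.encode("utf-8").hex()
ucs2_be_hex = ch.encode("utf-16-be").hex()  # big endian: high byte first
ucs2_le_hex = ch.encode("utf-16-le").hex()  # little endian: low byte first

print(utf8_hex)     # efbb81
print(ucs2_be_hex)  # fec1
print(ucs2_le_hex)  # c1fe
```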

        And it is good to recall the difference between the UTF-16 and UCS-2 encodings :

        • With the UCS-2 encodings, supported by Notepad++, you can encode any character, from UNICODE code-point \x0 to \x{FFFD}

        • With the UTF-16 encodings, NOT supported by Notepad++, you can encode ALL UNICODE code-points, from \x0 to \x{10FFFD}, with the surrogate pairs mechanism.

        BTW, in N++, the regex syntax \x{...} does NOT work, presently, for code-points over FFFF !
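To illustrate the surrogate pairs mechanism, here is a small Python sketch. The helper name `surrogate_pair` and the sample code-point U+1F600 are my own choices for the illustration and have nothing to do with Notepad++ itself:

```python
def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a code point above FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value to split in two halves
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# Cross-check against Python's own UTF-16 encoder.
encoded = chr(0x1F600).encode("utf-16-be").hex()
print(encoded)  # d83dde00
```

A code-point up to FFFF fits in a single 16-bit unit, which is why UCS-2 needs no such mechanism.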


        Now, if the N++ search is character-oriented, how can you detect the individual bytes of some characters in a UTF-8 file ? It’s normally impossible, and rather useless to do so ! Indeed, with the single syntax \x{fec1}, you’ll always find your ARABIC LETTER TAH ISOLATED FORM, as long as your file’s encoding is a UNICODE encoding !
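The same idea can be sketched in Python, as an analogy only: Python's `re` module is not the BOOST engine, and it spells the code-point as `\uFEC1` where Notepad++ uses `\x{fec1}`. Once the bytes are decoded into characters, one single pattern finds the letter whatever the encoding was:

```python
import re

text = "\uFEC1 \uFEC9"          # TAH and AIN isolated forms, with a space between
pattern = re.compile("\uFEC1")  # Python spelling of the code-point U+FEC1

hits = {}
for encoding in ("utf-8", "utf-16-le", "utf-16-be"):
    raw = text.encode(encoding)     # the bytes as they would sit on disk
    decoded = raw.decode(encoding)  # what a character-oriented search sees
    hits[encoding] = [m.start() for m in pattern.finditer(decoded)]

print(hits)  # one hit at character offset 0 for each encoding
```

Only the first character matches; U+FEC9 is never picked up, because the comparison is done on whole characters and not on bytes.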

        If you really want to see the individual bytes of an UTF-8 file, just choose the option Encode in ANSI :

        For instance, let’s suppose that you have a UTF-8 file containing ONLY the ARABIC LETTER AIN ISOLATED FORM : ﻉ. Once you choose the menu option Encoding - Encode in ANSI, it will display the three-character string ï»‰.

        And, as the UTF-8 representation of this character is EF BB 89, it’s easy to verify that the simple regex search \xEF\xBB\x89 does find the string ï»‰
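A rough Python equivalent of this experiment, assuming that the ANSI code page is Windows-1252 (codec `cp1252`). Note one difference: on a Python text string, `\x89` would mean the code-point U+0089, so this sketch runs the search on the raw bytes instead of the decoded text:

```python
import re

raw = "\uFEC9".encode("utf-8")   # the three bytes EF BB 89
as_ansi = raw.decode("cp1252")   # the same bytes read as Windows-1252 text

# Byte-oriented search, done on the bytes object and not on the text.
byte_match = re.search(rb"\xEF\xBB\x89", raw)

print(raw.hex())                             # efbb89
print([f"U+{ord(c):04X}" for c in as_ansi])  # ['U+00EF', 'U+00BB', 'U+2030']
print(byte_match is not None)                # True
```

The three code-points printed are those of the characters ï, » and ‰.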

        By the way, here is, below, a very nice Internet tool to get the main information about each UNICODE character. By default, you must type, at the top of the page, the UNICODE hexadecimal code-point of your character ( for instance fec9 ), but you may select one of the six other proposed interpretations :

        http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?

        Hope that this post will be useful to you, anyway !

        Best Regards,

        guy038

        P.S. :

        Waooh, by chance, the Windows ANSI Code Pages 1252 ( Latin 1 ) and 1256 ( Arabic ) have the same representation for the three characters of code \xEF, \xBB and \x89 !

        To that purpose, consult the different Windows encodings, below :

        https://msdn.microsoft.com/en-us/goglobal/bb964654
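This coincidence can be verified with a short Python sketch, assuming that Python's `cp1252` and `cp1256` codecs match the Windows code pages of the same numbers:

```python
raw = bytes([0xEF, 0xBB, 0x89])

as_1252 = raw.decode("cp1252")  # Windows Latin 1
as_1256 = raw.decode("cp1256")  # Windows Arabic

print(as_1252 == as_1256)                    # True
print([f"U+{ord(c):04X}" for c in as_1256])  # ['U+00EF', 'U+00BB', 'U+2030']
```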
