Hello Boliusa bol,
I really wonder how you could find your Arabic characters with that syntax [\ufec1], or even \ufec1 !?
With the BOOST regex library, used by N++ since v6.00, I think that the right syntax to find your two characters should be \x{fec1} and \x{fec9}.
Your characters are part of the Arabic Presentation Forms-B block, from FE70 to FEFF, whose chart is at the address below :
http://www.unicode.org/charts/PDF/UFE70.pdf
Secondly, don’t forget that, in N++, the search is character-oriented and NOT byte-oriented. So, when you search for \x{fec1} in a file with a UTF-8 encoding, the regex engine looks for the three consecutive bytes EF BB 81, which together represent the single Arabic character ﻁ
Note that :
With the UCS-2 BE encoding ( Big Endian ), this character would have been encoded with the two bytes FE C1
With the UCS-2 LE encoding ( Little Endian ), this character would have been encoded with the two bytes C1 FE
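As a quick cross-check outside N++, here is a minimal Python sketch that prints the bytes of U+FEC1 in each encoding. Python has no codec named UCS-2, so this assumes the `utf-16-be` and `utf-16-le` codecs as stand-ins, which produce the same bytes as UCS-2 for any character up to FFFF :

```python
# U+FEC1 is ARABIC LETTER TAH ISOLATED FORM.
tah = "\ufec1"

utf8_bytes = tah.encode("utf-8")
# Stand-ins for UCS-2 BE / UCS-2 LE (identical bytes for characters up to U+FFFF).
ucs2_be_bytes = tah.encode("utf-16-be")  # big endian, no BOM
ucs2_le_bytes = tah.encode("utf-16-le")  # little endian, no BOM

print(utf8_bytes.hex())     # efbb81
print(ucs2_be_bytes.hex())  # fec1
print(ucs2_le_bytes.hex())  # c1fe
```

The big endian form stores the most significant byte ( FE ) first, and the little endian form stores it last.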
And it is good to recall the difference between the UTF-16 and UCS-2 encodings :
With the UCS-2 encodings, supported by Notepad++, you can encode any character of the Basic Multilingual Plane only, from UNICODE code-point \x0 to \x{FFFF}
With the UTF-16 encodings, NOT supported by Notepad++, you can encode ALL UNICODE code-points, from \x0 to \x{10FFFF}, with the surrogate pairs mechanism.
BTW, in N++, the regex syntax \x{...} does NOT work, presently, for code-points over FFFF !
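To illustrate the surrogate pairs mechanism, here is a small Python sketch. The helper name `to_surrogate_pair` and the sample code-point U+1F600 are my own choices for the illustration, they do not come from N++ :

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


# Hypothetical sample: U+1F600, a code point beyond FFFF.
high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# Cross-check against Python's own UTF-16 big endian encoder.
expected = bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF])
print(chr(0x1F600).encode("utf-16-be") == expected)  # True
```

A code-point up to FFFF needs no pair at all, which is why UCS-2 and UTF-16 give identical bytes for it.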
Now, if the N++ search is “character” oriented, how can you detect the individual bytes of some characters of a UTF-8 file ? It’s normally impossible and it’s rather useless to do so ! Indeed, with the single syntax \x{fec1}, you’ll always find your ARABIC LETTER TAH ISOLATED FORM, as long as your file’s encoding is a UNICODE encoding !
If you really want to see the individual bytes of an UTF-8 file, just choose the option Encode in ANSI :
For instance, let’s suppose that you have a UTF-8 file containing ONLY the ARABIC LETTER AIN ISOLATED FORM : ﻉ. Once you choose the menu option Encoding - Encode in ANSI, it will display the three-character string ï»‰.
And, as the UTF-8 representation of this character is EF BB 89, it’s easy to verify that the simple regex search of \xEF\xBB\x89 does find the string ï»‰
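The same experiment can be sketched in Python. This assumes that the ANSI code page is Windows-1252 ( the `cp1252` codec ), and it uses a bytes pattern with Python's `re` module as a stand-in for the N++ search on the ANSI view :

```python
import re

ain = "\ufec9"             # ARABIC LETTER AIN ISOLATED FORM
raw = ain.encode("utf-8")  # the three bytes EF BB 89

# Reading the same bytes as ANSI (assumed here: Windows-1252)
# yields three separate characters instead of one.
as_ansi = raw.decode("cp1252")
print(len(as_ansi))                           # 3
print([f"U+{ord(c):04X}" for c in as_ansi])   # ['U+00EF', 'U+00BB', 'U+2030']

# A byte-oriented regex finds the individual bytes.
match = re.search(rb"\xEF\xBB\x89", raw)
print(match is not None)                      # True
```

The three code-points printed are those of the characters ï, » and ‰.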
By the way, here is, below, a very nice Internet tool to get the main information about each UNICODE character. By default, you must type, at the top of the page, the UNICODE hexadecimal code-point of your character ( for instance fec9 ), but you may select one of the six other proposed interpretations :
http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?
Hope that this post will be useful to you, anyway !
Best Regards,
guy038
P.S. :
Wow, by chance, the Windows ANSI Code Pages 1252 ( Latin 1 ) and 1256 ( Arabic ) have the same representation for the three characters of code \xEF, \xBB and \x89 !
To check this, consult the different Windows encodings, below :
https://msdn.microsoft.com/en-us/goglobal/bb964654
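The coincidence mentioned in this P.S. can also be checked with a few lines of Python, assuming that its `cp1252` and `cp1256` codecs match the Windows code pages of the same numbers :

```python
three_bytes = bytes([0xEF, 0xBB, 0x89])

latin_view = three_bytes.decode("cp1252")   # Windows-1252 ( Latin 1 )
arabic_view = three_bytes.decode("cp1256")  # Windows-1256 ( Arabic )

# Both code pages map these three byte values to the same characters.
print(latin_view == arabic_view)  # True
```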