Hello, @mkupper and All,
@mkupper, thank you for your appreciation !
First, I would say that the \C and \X syntaxes are far from noob regex syntaxes !
The \C syntax, as said in my previous post, should detect the individual bytes of a UTF-8 file but, actually, returns the current NON-EOF character, just like the well-known (?-s). syntax
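To picture what a true byte-level \C would be expected to see, here is a minimal Python sketch. Python is used for illustration only and is not part of Notepad++ or of the Boost engine. The four sample characters and the utf8_bytes helper are arbitrary choices of mine :

```python
# Illustration only: list the UTF-8 bytes behind a few sample characters.
# The samples are arbitrary 1-, 2-, 3- and 4-byte examples.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_bytes(ch):
    """Return the UTF-8 encoding of one character as a list of byte values."""
    return list(ch.encode("utf-8"))


# Number of bytes that a byte-level \C would have to step through per character
byte_counts = [len(utf8_bytes(ch)) for ch in samples]

for ch in samples:
    print(f"U+{ord(ch):04X}", [f"{b:02X}" for b in utf8_bytes(ch)])
```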
The \X syntax matches :
Any single Non-diacritic character
0 or more associated diacritic characters, following the Non-diacritic char
For instance, the regex --o\x{0306}\x{0320}\x{0340}--o\x{0318}\x{0346}\x{0305}-- would exactly match the 14-chars string --ŏ̠̀--o̘͆̅-- and could be replaced by the regex --\X--\X--. Just enlarge the characters to their maximum for good readability ! However, note that the simple 8-chars string -------- would also be matched by the --\X--\X-- regex !
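The grouping done by \X can be imitated, roughly, outside Notepad++. The standard Python re module has no \X, so the sketch below emulates it with unicodedata, assuming that a "diacritic" is any character whose general category starts with M. That is only my approximation of the Boost behaviour, and the split_like_X name is hypothetical :

```python
import unicodedata


def split_like_X(text):
    """Group each base character with the combining marks that follow it.

    Assumption: a character is a mark when its Unicode general category
    starts with "M" (Mn, Mc or Me).
    """
    groups = []
    for ch in text:
        if groups and unicodedata.category(ch).startswith("M"):
            groups[-1] += ch      # attach the mark to the preceding group
        else:
            groups.append(ch)     # start a new group with a base character
    return groups


decorated = "--o\u0306\u0320\u0340--o\u0318\u0346\u0305--"
plain = "--------"

print(len(decorated), len(split_like_X(decorated)))
print(len(plain), len(split_like_X(plain)))
```

Both strings split into the same number of groups, which is why the --\X--\X-- regex accepts either of them.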
Secondly, I must admit that talking about Unicode characters, in a general way, made me drift towards my Total_Chars.txt file discussion !
But, even if we use the previous THEORETICAL syntax, against the Total_Chars.txt file :
(?xs)
(
(?|
(?= [\x{0000}-\x{007F}] ) (\C) ( ) ( ) ( ) | # 128 1-byte chars in part INSIDE the BMP |
(?= [\x{0080}-\x{07FF}] ) (\C) (\C) ( ) ( ) | # 1,920 2-byte chars , in part INSIDE the BMP | 63,454 chars
(?= [\x{0800}-\x{FFFF}] ) (\C) (\C) (\C) ( ) | # 61,406 3-byte chars , in part INSIDE the BMP |
(?= [\x{10000}-\x{1FFFFF}] ) (\C) (\C) (\C) (\C) # 262,136 4-byte chars , in part OUTSIDE the BMP, with code > \x{FFFF} ( = 4 × 65,534 )
)
)
We could NOT find any result for two reasons :
The \C regex does not work with our present Boost regex engine ( See above )
The characters over \x{FFFF} are not properly handled by the Boost regex engine
So the last line (?= [\x{10000}-\x{1FFFFF}] ) (\C) (\C) (\C) (\C), regarding characters outside the BMP, should be changed to (?s).[\x{D800}-\x{DFFF}]
Using this regex, against the Total_Chars file, in the Find dialog, with the Wrap around button checked, does return 262,136 characters, when you click on the Count button
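The figures quoted in the regex comments and returned by the Count button can be cross-checked with a short Python sketch. It rests on an assumption of mine that the post does not state explicitly : the file would hold every code point of the BMP and of planes 1, 2, 3 and 14, minus the surrogates and the noncharacters. The names PLANES, is_noncharacter and utf8_length are hypothetical :

```python
# Assumption: planes 0, 1, 2, 3 and 14 only, without surrogates or noncharacters.
PLANES = [0, 1, 2, 3, 14]
SURROGATES = range(0xD800, 0xE000)


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for the last two code points of any plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def utf8_length(cp):
    """Number of bytes used by the UTF-8 encoding of a code point."""
    return len(chr(cp).encode("utf-8"))


counts = {1: 0, 2: 0, 3: 0, 4: 0}
for plane in PLANES:
    start = plane << 16
    for cp in range(start, start + 0x10000):
        if cp in SURROGATES or is_noncharacter(cp):
            continue
        counts[utf8_length(cp)] += 1

bmp_total = counts[1] + counts[2] + counts[3]
print(counts, bmp_total)
```

Under that assumption, the 4-byte count is simply 4 planes of 65,534 characters each.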
You may also turn this regex into a range delimited by two surrogate pairs, used as character boundaries :
Open the Mark dialog ( Ctrl + M )
Untick all box options
Enter the regex \x{D800}\x{DC00}.+\x{DB7F}\x{DFFD} ( first char of Plane 1 to last allowed char of Plane 14 )
Tick the Purge for each search and Wrap around options
Select the Regular expression search mode
Click on the Mark All button ( 1 hit )
Click on the Copy Marked Text button
Open a new file ( Ctrl + N )
Paste the contents of the clipboard
Again, using the (?s).[\x{D800}-\x{DFFF}] regex on the entire file or a simple Ctrl + A gives a count of 262,136 characters for this new file !
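The two surrogate pairs used as boundaries in the Mark dialog regex follow from the standard UTF-16 arithmetic. Here is a minimal Python sketch of that computation. The to_surrogate_pair function is a hypothetical helper written for this illustration, not something provided by Notepad++ :

```python
def to_surrogate_pair(cp):
    """Return the (high, low) UTF-16 surrogates of a code point above U+FFFF."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


first = to_surrogate_pair(0x10000)   # first char of Plane 1
last = to_surrogate_pair(0xEFFFD)    # last allowed char of Plane 14

print([f"{unit:04X}" for unit in first])
print([f"{unit:04X}" for unit in last])
```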
Thirdly, I would like to insist on the fact that both the LastResort-Regular.ttf font and the Total_Chars text file deal only with characters and NOT with the individual bytes of these chars, which depend on their current encoding !
So, in a sense, it’s not connected to the beginning of my initial post, regarding individual bytes. Sorry for the confusion !
Best Regards,
guy038