Hi, @coises and All,
From this link https://www.unicode.org/Public/UCD/latest/ucd/Blocks.txt, I created a list of all Unicode blocks and I verified the number of word characters in each block with each of the following tools :
Columns++
Notepad++
MultiReplace
Just download the text file Words_in_Blocks.txt from my Google Drive account below :
https://drive.google.com/file/d/1hFXLBhrKghjoMTvDk46QSk4BjlzOAPKP/view?usp=sharing
As you can see, from left to right :
Column 1 : regex needed to get the number of Word characters
Column 2 : name of each Unicode block
Column 3 : total number of characters of each block
Column 4 : number of assigned code points of each block, so far
Column 5 : Columns++ number of Word characters found
Column 6 : N++ Search and MultiReplace number of Word chars found
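For anyone who wants to rebuild column 1 and column 3 themselves, here is a minimal Python sketch. It assumes the usual Blocks.txt line format (for example `0800..083F; Samaritan`); the helper name `block_regex` is hypothetical and is not part of any of the tools mentioned above.

```python
import re

# One data line of Blocks.txt looks like: "0800..083F; Samaritan"
BLOCK_LINE = re.compile(r"^([0-9A-F]{4,6})\.\.([0-9A-F]{4,6});\s*(.+?)\s*$")

def block_regex(line):
    """Return (search regex, block name, total code points) for one Blocks.txt line.

    Comment lines and blank lines do not match the pattern and yield None.
    """
    m = BLOCK_LINE.match(line)
    if m is None:
        return None
    first, last, name = m.groups()
    total = int(last, 16) - int(first, 16) + 1
    # Same shape as column 1: a look-ahead for \w restricted to the block range
    pattern = r"(?=\w)[\x{%s}-\x{%s}]" % (first, last)
    return pattern, name, total

if __name__ == "__main__":
    print(block_regex("0800..083F; Samaritan"))
```

The generated regex is meant to be pasted into the search dialog of the editor under test; Python's own `re` module does not accept the `\x{...}` syntax.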
At this point, we can deduce some major points :
First, for any character over the BMP, the N++ search and MultiReplace always return the 0 value whereas Columns++, implemented in UTF-32, gives the correct results ! So, from now on, I’ll speak about results regarding the BMP Unicode plane, ONLY !

Secondly, in the table below, I listed all blocks where the N++ search and MultiReplace return 0 for Word chars. As I added a column which shows in which release each block was created, it’s easy to see that no block added in the Unicode release 5.2, or later, is known to our Boost regex engine !
•---------------------------•------------------------------------------------•----------•----------•-----------•------------•---------•
|        Block range        |                   Block name                   |  Total   | Assigned | Columns++ | N++ / MRep | Unicode |
|                           |                                                | Code-Pts | Code-Pts | Word Chrs | Word Chrs  | Version |
•---------------------------•------------------------------------------------•----------•----------•-----------•------------•---------•
| (?=\w)[\x{0800}-\x{083F}] | Samaritan                                      |       64 |       61 |        25 |          0 |     5.2 |
| (?=\w)[\x{18B0}-\x{18FF}] | Unified Canadian Aboriginal Syllabics Extended |       80 |       70 |        70 |          0 |     5.2 |
| (?=\w)[\x{1A20}-\x{1AAF}] | Tai Tham                                       |      144 |      127 |        74 |          0 |     5.2 |
| (?=\w)[\x{1CD0}-\x{1CFF}] | Vedic Extensions                               |       48 |       43 |        13 |          0 |     5.2 |
| (?=\w)[\x{A4D0}-\x{A4FF}] | Lisu                                           |       48 |       48 |        46 |          0 |     5.2 |
| (?=\w)[\x{A6A0}-\x{A6FF}] | Bamum                                          |       96 |       88 |        70 |          0 |     5.2 |
| (?=\w)[\x{A8E0}-\x{A8FF}] | Devanagari Extended                            |       32 |       32 |         9 |          0 |     5.2 |
| (?=\w)[\x{A960}-\x{A97F}] | Hangul Jamo Extended-A                         |       32 |       29 |        29 |          0 |     5.2 |
| (?=\w)[\x{A980}-\x{A9DF}] | Javanese                                       |       96 |       91 |        58 |          0 |     5.2 |
| (?=\w)[\x{AA60}-\x{AA7F}] | Myanmar Extended-A                             |       32 |       32 |        26 |          0 |     5.2 |
| (?=\w)[\x{AA80}-\x{AADF}] | Tai Viet                                       |       96 |       72 |        61 |          0 |     5.2 |
| (?=\w)[\x{ABC0}-\x{ABFF}] | Meetei Mayek                                   |       64 |       56 |        45 |          0 |     5.2 |
| (?=\w)[\x{D7B0}-\x{D7FF}] | Hangul Jamo Extended-B                         |       80 |       72 |        72 |          0 |     5.2 |
| (?=\w)[\x{0840}-\x{085F}] | Mandaic                                        |       32 |       29 |        25 |          0 |     6.0 |
| (?=\w)[\x{1BC0}-\x{1BFF}] | Batak                                          |       64 |       56 |        38 |          0 |     6.0 |
| (?=\w)[\x{AB00}-\x{AB2F}] | Ethiopic Extended-A                            |       48 |       32 |        32 |          0 |     6.0 |
| (?=\w)[\x{08A0}-\x{08FF}] | Arabic Extended-A                              |       96 |       96 |        42 |          0 |     6.1 |
| (?=\w)[\x{AAE0}-\x{AAFF}] | Meetei Mayek Extensions                        |       32 |       23 |        14 |          0 |     6.1 |
| (?=\w)[\x{A9E0}-\x{A9FF}] | Myanmar Extended-B                             |       32 |       31 |        30 |          0 |     7.0 |
| (?=\w)[\x{AB30}-\x{AB6F}] | Latin Extended-E                               |       64 |       60 |        57 |          0 |     7.0 |
| (?=\w)[\x{AB70}-\x{ABBF}] | Cherokee Supplement                            |       80 |       80 |        80 |          0 |     8.0 |
| (?=\w)[\x{1C80}-\x{1C8F}] | Cyrillic Extended-C                            |       16 |       11 |        11 |          0 |     9.0 |
| (?=\w)[\x{0860}-\x{086F}] | Syriac Supplement                              |       16 |       11 |        11 |          0 |    10.0 |
| (?=\w)[\x{1C90}-\x{1CBF}] | Georgian Extended                              |       48 |       46 |        46 |          0 |    11.0 |
| (?=\w)[\x{0870}-\x{089F}] | Arabic Extended-B                              |       48 |       43 |        31 |          0 |    14.0 |
•---------------------------•------------------------------------------------•----------•----------•-----------•------------•---------•

I did a quick test with N++ v8.9.1 which says :
Update to Boost 1.90.0.
But the results do not change at all. So, if I understand correctly, the Unicode data used by the Boost regex engine has not been updated since before version 5.2 ? Very surprising !
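As an independent cross-check, the same kind of count can be made outside of any editor. The sketch below is only an illustration: it uses Python's own notion of `\w` and whatever Unicode database ships with the Python build, neither of which is necessarily identical to what Columns++ or the N++ Boost engine use. The function name `count_word_chars` is hypothetical.

```python
import re
import unicodedata

WORD = re.compile(r"\w")

def count_word_chars(first, last):
    """Count code points in [first, last] that Python's \\w matches."""
    return sum(1 for cp in range(first, last + 1) if WORD.fullmatch(chr(cp)))

if __name__ == "__main__":
    # The Unicode version behind this count depends on the Python build
    print("Unicode database version:", unicodedata.unidata_version)
    print("Samaritan   :", count_word_chars(0x0800, 0x083F))
    print("Tai Tham    :", count_word_chars(0x1A20, 0x1AAF))
```

If an engine reports 0 for a whole block while a tool with a recent Unicode database reports a non-zero count, outdated character data is a plausible explanation, though this sketch alone cannot prove it.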
Thirdly, in the table below, I listed all blocks where the N++ search and MultiReplace return a number of WORD chars smaller than in the Columns++ column :
•---------------------------•------------------------------------------------•----------•----------•-----------•------------•---------•
|        Block range        |                   Block name                   |  Total   | Assigned | Columns++ | N++ / MRep | Unicode |
|                           |                                                | Code-Pts | Code-Pts | Word Chrs | Word Chrs  | Version |
•---------------------------•------------------------------------------------•----------•----------•-----------•------------•---------•
| (?=\w)[\x{02B0}-\x{02FF}] | Spacing Modifier Letters                       |       80 |       80 |        37 |         24 |     1.0 |
| (?=\w)[\x{0370}-\x{03FF}] | Greek and Coptic                               |      144 |      135 |       129 |        127 |     1.0 |
| (?=\w)[\x{0530}-\x{058F}] | Armenian                                       |       96 |       91 |        80 |         78 |     1.0 |
| (?=\w)[\x{0590}-\x{05FF}] | Hebrew                                         |      112 |       88 |        31 |         30 |     1.0 |
| (?=\w)[\x{0600}-\x{06FF}] | Arabic                                         |      256 |      256 |       173 |        172 |     1.0 |
| (?=\w)[\x{0900}-\x{097F}] | Devanagari                                     |      128 |      128 |        91 |         83 |     1.0 |
| (?=\w)[\x{0980}-\x{09FF}] | Bengali                                        |      128 |       96 |        65 |         63 |     1.0 |
| (?=\w)[\x{0A80}-\x{0AFF}] | Gujarati                                       |      128 |       91 |        63 |         62 |     1.0 |
| (?=\w)[\x{0C00}-\x{0C7F}] | Telugu                                         |      128 |      101 |        68 |         64 |     1.0 |
| (?=\w)[\x{0C80}-\x{0CFF}] | Kannada                                        |      128 |       92 |        68 |         63 |     1.0 |
| (?=\w)[\x{0D00}-\x{0D7F}] | Malayalam                                      |      128 |      118 |        77 |         69 |     1.0 |
| (?=\w)[\x{0D80}-\x{0DFF}] | Sinhala                                        |      128 |       91 |        69 |         59 |     1.0 |
| (?=\w)[\x{0E80}-\x{0EFF}] | Lao                                            |      128 |       83 |        66 |         50 |     1.0 |
| (?=\w)[\x{0F00}-\x{0FFF}] | Tibetan                                        |      256 |      211 |        60 |         59 |     1.0 |
| (?=\w)[\x{10A0}-\x{10FF}] | Georgian                                       |       96 |       88 |        87 |         82 |     1.0 |
| (?=\w)[\x{2070}-\x{209F}] | Superscripts and Subscripts                    |       48 |       42 |        15 |          7 |     1.0 |
| (?=\w)[\x{3100}-\x{312F}] | Bopomofo                                       |       48 |       43 |        43 |         41 |     1.0 |
| (?=\w)[\x{4E00}-\x{9FFF}] | CJK Unified Ideographs                         |    20992 |    20992 |     20992 |      20932 |   1.0.1 |
| (?=\w)[\x{F900}-\x{FAFF}] | CJK Compatibility Ideographs                   |      512 |      472 |       472 |        467 |   1.0.1 |
| (?=\w)[\x{16A0}-\x{16FF}] | Runic                                          |       96 |       89 |        83 |         78 |     3.0 |
| (?=\w)[\x{13A0}-\x{13FF}] | Cherokee                                       |       96 |       92 |        92 |         85 |     3.0 |
| (?=\w)[\x{1400}-\x{167F}] | Unified Canadian Aboriginal Syllabics          |      640 |      640 |       637 |        628 |     3.0 |
| (?=\w)[\x{3400}-\x{4DBF}] | CJK Unified Ideographs Extension A             |     6592 |     6592 |      6592 |       6582 |     3.0 |
| (?=\w)[\x{31A0}-\x{31BF}] | Bopomofo Extended                              |       32 |       32 |        32 |         24 |     3.0 |
| (?=\w)[\x{1100}-\x{11FF}] | Hangul Jamo                                    |      256 |      256 |       256 |        240 |     3.1 |
| (?=\w)[\x{1700}-\x{171F}] | Tagalog                                        |       32 |       23 |        19 |         17 |     3.2 |
| (?=\w)[\x{0500}-\x{052F}] | Cyrillic Supplement                            |       48 |       48 |        48 |         36 |     3.2 |
| (?=\w)[\x{1900}-\x{194F}] | Limbu                                          |       80 |       68 |        41 |         39 |     4.0 |
| (?=\w)[\x{2C00}-\x{2C5F}] | Glagolitic                                     |       96 |       96 |        96 |         94 |     4.1 |
| (?=\w)[\x{2C80}-\x{2CFF}] | Coptic                                         |      128 |      123 |       107 |        101 |     4.1 |
| (?=\w)[\x{2D00}-\x{2D2F}] | Georgian Supplement                            |       48 |       40 |        40 |         38 |     4.1 |
| (?=\w)[\x{2E00}-\x{2E7F}] | Supplemental Punctuation                       |      128 |       94 |         1 |          0 |     4.1 |
| (?=\w)[\x{1980}-\x{19DF}] | New Tai Lue                                    |       96 |       83 |        80 |         59 |     4.1 |
| (?=\w)[\x{2D30}-\x{2D7F}] | Tifinagh                                       |       80 |       59 |        57 |         55 |     4.1 |
| (?=\w)[\x{A700}-\x{A71F}] | Modifier Tone Letters                          |       32 |       32 |         9 |          0 |     4.1 |
| (?=\w)[\x{2C60}-\x{2C7F}] | Latin Extended-C                               |       32 |       32 |        32 |         29 |     5.0 |
| (?=\w)[\x{1B00}-\x{1B7F}] | Balinese                                       |      128 |      127 |        65 |         64 |     5.0 |
| (?=\w)[\x{A720}-\x{A7FF}] | Latin Extended-D                               |      224 |      204 |       200 |        109 |     5.0 |
| (?=\w)[\x{1B80}-\x{1BBF}] | Sundanese                                      |       64 |       64 |        48 |         42 |     5.1 |
| (?=\w)[\x{A640}-\x{A69F}] | Cyrillic Extended-B                            |       96 |       96 |        78 |         69 |     5.1 |
•---------------------------•------------------------------------------------•----------•----------•-----------•------------•---------•

This time, we can see that the Unicode releases listed in this table are all earlier than the Unicode 5.2 release. I haven’t exactly identified the problem, so far, for these blocks !
Fourthly, in the table below, I listed all blocks where the N++ search and MultiReplace return a number of WORD chars greater than in the Columns++ column :
•---------------------------•------------------------------------------------•----------•----------•-----------•------------•---------•
|        Block range        |                   Block name                   |  Total   | Assigned | Columns++ | N++ / MRep | Unicode |
|                           |                                                | Code-Pts | Code-Pts | Word Chrs | Word Chrs  | Version |
•---------------------------•------------------------------------------------•----------•----------•-----------•------------•---------•
| (?=\w)[\x{0080}-\x{00FF}] | Latin-1 Supplement                             |      128 |      128 |        65 |         68 |     1.0 |
| (?=\w)[\x{0E00}-\x{0E7F}] | Thai                                           |      128 |       87 |        67 |         83 |     1.0 |
| (?=\w)[\x{2150}-\x{218F}] | Number Forms                                   |       64 |       60 |         2 |         41 |     1.0 |
| (?=\w)[\x{3000}-\x{303F}] | CJK Symbols and Punctuation                    |       64 |       64 |         9 |         22 |     1.0 |
| (?=\w)[\x{1800}-\x{18AF}] | Mongolian                                      |      176 |      158 |       139 |        140 |     3.0 |
•---------------------------•------------------------------------------------•----------•----------•-----------•------------•---------•

Again, I don’t clearly understand these differences between the last two columns !
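One way to investigate such differences is to break a block down by Unicode general category and see which categories each tool seems to treat as word characters. The following Python sketch does only the breakdown, using the standard `unicodedata` module; the function name `category_tally` is hypothetical, and which categories each engine actually includes in `\w` remains an open question here.

```python
from collections import Counter
import unicodedata

def category_tally(first, last):
    """Count code points per general category (Lu, Ll, Nd, Nl, Mn, Cn, ...) in a range."""
    return Counter(unicodedata.category(chr(cp)) for cp in range(first, last + 1))

if __name__ == "__main__":
    # Blocks taken from the table above, where the two counts disagree
    for name, first, last in [
        ("Number Forms", 0x2150, 0x218F),
        ("Thai", 0x0E00, 0x0E7F),
        ("Latin-1 Supplement", 0x0080, 0x00FF),
    ]:
        tally = category_tally(first, last)
        print(name, dict(sorted(tally.items())))
```

Comparing the per-category totals with the two Word Chrs columns may show, for instance, whether letter numbers (Nl), other numbers (No) or combining marks (Mn, Mc) are counted by one tool and not by the other. The totals depend on the Unicode version of the Python build.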
Best Regards,
guy038