Regexp fails to match UTF-8 characters

PeterJones

Expanding on your data with the U+#### unicode codepoints for the characters

🤣 1 U+1F923
☺ ☺️ 2 U+263A
😊 3 U+1F60A

You will notice that Notepad++ can find the U+263A, but not the U+1F923 or U+1F60A.
@guy038 just recently posted in “Functionlist Help” about how Notepad++ cannot search for those in normal circumstances, and instead has to use the surrogate pairs.

Unfortunately, in character classes like you mentioned, that means that the characters outside the BMP (at U+10000 and above), while they can be found by ^.+, cannot be found by something that seems equivalent, like ^[\s\S]+

Would it work for you to search for “anything, followed by a lookahead for a space or EOF” instead, using something like ^.+?(?=\s|\Z)? It matches the first character on all three lines of your example data: the regex finds the first one or more characters at the beginning of a line that are followed by a space or EOF, which I think is what you intended.

guy038

Hello, @alexolog, @peterjones and All,

I’m preparing a more complete post which will explain the problem ! But, in the meantime, here are my solutions :

SEARCH (?-s)^((?!\h).[\x{DC00}-\x{DFFF}]?)+

SEARCH (?-s)((?!\h).[\x{DC00}-\x{DFFF}]?)+

The first regex finds, at the beginning of a line, any consecutive run of characters from \x{0000} to \x{10FFFF} other than a space, a tab character or a line-break
The second regex finds, anywhere in a line, any consecutive run of characters from \x{0000} to \x{10FFFF} other than a space, a tab character or a line-break

Just test them against this text :

𝅘𝅥𝅮🂡🌍🎅😀 🞋🡅🧀 !!

See you later !

Cheers,

guy038

alexolog

@PeterJones said in Regexp fails to match UTF-8 characters:

Unfortunately, in character classes like you mentioned, that means that the characters outside the BMP (at U+10000 and above), while they can be found by ^.+, cannot be found by something that seems equivalent, like ^[\s\S]+

Is there a chance that N++ will support it in the near future?

Would it work for you to search for “anything, followed by a lookahead for a space or EOF” instead, using something like ^.+?(?=\s|\Z)? It matches the first character on all three lines of your example data: the regex finds the first one or more characters at the beginning of a line that are followed by a space or EOF, which I think is what you intended.

Thank you for the suggestion, but it was a one-off transformation I needed to do on a text file, and I achieved it by using a different editor.

Olivier Thomas

You’ll find all 3 examples.
(🤣 |😊 |☺ ☺\x{FE0F} )\d

guy038

Hello, @alexolog, @peterjones @olivier-thomas and All,

Sorry for my late answer ! As @Peterjones said, problems arise when searching for Unicode characters which are outside the Basic Multilingual Plane ( BMP ), that is to say, which have a code-point between \x{10000} and \x{10FFFF} ( so over \x{FFFF} )

For instance, as the code-point of the emoticon 🤣 is over \x{FFFF} :

It cannot be represented with its real regex syntax \x{1F923}, due to a bug of the present Boost regex engine, which does not handle all characters in true 32-bit encoding, but only with the UTF-16 encoding :-(( So, searching for a code-point over \x{FFFF}, for instance \x{1F4A6}, results in the error message Find: Invalid regular expression
Moreover, the simple regex dot symbol (?-s). cannot match a character with Unicode code-point > \x{FFFF} either : on its own, it only matches one 16-bit code unit, so half of the character !
Of course, if you paste your character directly in the Find what: zone, it does find all occurrences of the ROLLING ON THE FLOOR LAUGHING character !

BTW, your two emoticons can be found in the lists, below :

https://www.unicode.org/charts/PDF/U1F600.pdf

https://www.unicode.org/charts/PDF/U1F900.pdf

Luckily, as our Boost regex engine codes characters in UTF-16, it can still represent all characters with code-point over \x{FFFF}, thanks to the surrogates mechanism. Refer to the generalities, below :

https://en.wikipedia.org/wiki/UTF-16

In short, the surrogate pair of a character, with Unicode code-point in range from \x{10000} till \x{10FFFF}, can be described by the regex :

\x{hhhh}\x{iiii} where D800 ≤ hhhh ≤ DBFF and DC00 ≤ iiii ≤ DFFF

So, if a regex involves the surrogate pair ( two 16-bit units ) of a character which is over the BMP, our regex engine is able to match it. For instance, as the surrogate pair of the character ROLLING ON THE FLOOR LAUGHING is D83E DD23, the regex \x{D83E}\x{DD23} does find all occurrences of your emoticon character !

For a full explanation about the two 16-bit code units, called a surrogate pair, refer to :

https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF

To calculate the surrogate pair of a specific character with code-point over \x{FFFF}, refer to either of :

http://www.russellcottrell.com/greek/utilities/SurrogatePairCalculator.htm

http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?
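
For readers who prefer to compute the pair locally rather than use an online calculator, here is a minimal Python sketch of the standard UTF-16 arithmetic. The function names surrogate_pair and npp_regex_for are made up for this illustration; only the formula itself comes from the UTF-16 definition linked above.

```python
def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates of a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point must be in the range U+10000..U+10FFFF")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def npp_regex_for(code_point):
    """Build the \\x{hhhh}\\x{iiii} search expression for one such character."""
    high, low = surrogate_pair(code_point)
    return "\\x{%04X}\\x{%04X}" % (high, low)


print(npp_regex_for(0x1F923))  # ROLLING ON THE FLOOR LAUGHING
print(npp_regex_for(0x1F60A))  # SMILING FACE WITH SMILING EYES
```

The two printed expressions should be the same surrogate pairs as the ones quoted in this post for these two emoticons.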

On our site, get additional information, here :

https://community.notepad-plus-plus.org/post/51068

https://community.notepad-plus-plus.org/post/43037

and, recently, I proposed a Notepad++ macro which replaces any selection of \x{hhhhh} syntaxes with their surrogate pair equivalents \x{Dhhh}\x{Diii} ! See below :

https://community.notepad-plus-plus.org/post/57528

In summary, because of the use of UTF-16, instead of UTF-32, by the present implementation of the Boost Regex library, within N++ :

Use the simple regex (?-s). to match any standard character, from \x{0000} to \x{FFFF} ( so not including the EOL chars nor the Form Feed char \x0c )
IMPORTANT : From the surrogates mechanism, explained above, one may think that the regex [\x{D800}-\x{DBFF}][\x{DC00}-\x{DFFF}] should find all the characters with Unicode code-point over \x{FFFF}. Unfortunately, this syntax does not work ! So, we need to use these derived regexes :
(?-s).[\x{DC00}-\x{DFFF}] to match any standard character from \x{10000} to \x{10FFFF}
(?-s).[\x{DC00}-\x{DFFF}]? to match any standard character, from \x{0000} to \x{10FFFF}
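
To see why "a dot, optionally followed by a low surrogate" covers one whole character, the code-unit view can be emulated outside Notepad++. The Python sketch below is only an approximation of what the editor does: it is an assumption that the engine walks the text one 16-bit unit at a time, and the helper names to_code_units, from_code_units and full_characters are hypothetical.

```python
import re


def to_code_units(text):
    """Return a string whose characters are the UTF-16 code units of text."""
    data = text.encode("utf-16-le", "surrogatepass")
    return "".join(
        chr(int.from_bytes(data[i:i + 2], "little"))
        for i in range(0, len(data), 2)
    )


def from_code_units(units):
    """Inverse of to_code_units: reassemble surrogate pairs into characters."""
    data = b"".join(ord(u).to_bytes(2, "little") for u in units)
    return data.decode("utf-16-le", "surrogatepass")


# One code unit, optionally followed by a low surrogate (DC00..DFFF).
ANY_CHAR = re.compile(r".[\udc00-\udfff]?")


def full_characters(line):
    """Split a line into whole characters using the derived regex."""
    units = to_code_units(line)
    return [from_code_units(m.group()) for m in ANY_CHAR.finditer(units)]


sample = "a\U0001F923\u263A"
print(len(to_code_units(sample)))    # number of 16-bit units in the sample
print(len(full_characters(sample)))  # number of whole characters found
```

In this emulation a bare dot would split the emoticon into two halves, whereas the derived regex keeps the pair together.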

And :

To match a specific character of the BMP, from \x{0000} to \x{FFFF}, use the regex syntax \x{hhhh}, with four hexadecimal digits
To match a specific character over the BMP, from \x{10000} to \x{10FFFF}, use the equivalent high and low surrogate pair, with the regex syntax \x{<high>}\x{<low>}, replacing the <high> and <low> values with their exact hexadecimal values, each written with four hexadecimal digits

Now, let’s go back to your example :

🤣 1
☺ ☺️ 2
😊 3

The first line contains the \x{1F923} character, a space char and the 1 digit
The second line contains the \x{263A} character, a space char, another \x{263A} char, the invisible \x{FE0F} char ( VARIATION SELECTOR-16 ), a space char and, finally, the 2 digit
The third line contains the \x{1F60A} character, a space char and the 3 digit

So, in order to find the contents of :

The first line, the \x{1F923}\x20\x31 regex must be replaced with the regex \x{D83E}\x{DD23}\x20\x31
The second line, simply use the syntax \x{263A}\x20\x{263A}\x{FE0F}\x20\x32
The third line, the \x{1F60A}\x20\x33 regex must be replaced with the regex \x{D83D}\x{DE0A}\x20\x33

And, in order to find an equivalent to the syntaxes ^[^ ]+ and ^[\S]+, which look right but fail on characters over the BMP, use :

(?-s)^((?!\x20).[\x{DC00}-\x{DFFF}]?)+

Notes :

As usual, the in-line modifier (?-s) means that any dot will match a single standard char and not EOL chars !
The ^ assertion looks for a beginning of line
As said above, the (.[\x{DC00}-\x{DFFF}]?)+ part will find any run of chars, from \x{0000} to \x{10FFFF}
But, as we must exclude the space char, we place the negative look-ahead (?!\x20) right before the . symbol, which stands for any char under \x{10000}

Best Regards,

guy038

A last example, containing your three consecutive emoticons, with a space char and digit 4 :

🤣☺😊 4

Then, the exact regex \x{1F923}\x{263A}\x{1F60A}\x20\x34 must be changed to \x{D83E}\x{DD23}\x{263A}\x{D83D}\x{DE0A}\x20\x34 !
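
Doing this conversion by hand gets tedious for longer strings. As a rough sketch, the Python snippet below lets the UTF-16 codec produce the code units and then writes each unit in the \x{hhhh} syntax. The names utf16_units and npp_regex are invented for this example, and the output always uses four hexadecimal digits, so a space comes out as \x{0020} rather than \x20.

```python
import struct


def utf16_units(text):
    """Return the list of 16-bit code units of text (big-endian UTF-16)."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def npp_regex(text):
    """Write every code unit of text as a \\x{hhhh} escape."""
    return "".join("\\x{%04X}" % unit for unit in utf16_units(text))


# The three emoticons, a space and the digit 4, as in the last example.
print(npp_regex("\U0001F923\u263A\U0001F60A 4"))
```

Characters over the BMP automatically come out as two escapes, one per surrogate, so nothing has to be looked up in a calculator.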

alexolog

@guy038 ,

Thank you for the detailed explanation!

Is there a way to match a string of UTF-8 characters that are not ASCII?

Aside: For some reason email notifications of replies do not seem to work.

guy038

Hello, @alexolog, @peterjones, @olivier-thomas and All,

First, in my previous post, I said :

Use the regex (?-s).[\x{DC00}-\x{DFFF}]? to match all standard characters, from \x{0000} to \x{10FFFF} ( so All Unicode chars ! )

You may prefer this simpler syntax (?-s).[\x{DC00}-\x{DFFF}]|. with two alternatives ( the former relative to all the characters over the BMP and the latter relative to all the characters within the BMP )

Now, your question is a bit ambiguous ! Are you talking about :

Characters with Unicode code-point over \x{00FF} ?
Characters which cannot exist in an ANSI encoded file ?

Indeed, an ANSI encoded file may contain characters whose code-point is over \x{00FF} !

Probably, you’re using the Win-1252 ANSI encoding. To verify this assumption, open the Edit > Character Panel. It should be identical to the one shown in this Wikipedia article :

https://en.wikipedia.org/wiki/Windows-1252#Character_set

which can be shortened as :

•---------------•-------•--------•----------•
|   Win-1252    |       | Unicode|  Code >  |
|  Dec   | Hex  | Char. |  C.P.  | \x{00FF} |
•---------------•-------•--------•----------•
|  0000  |  00  | <NUL> |  0000  |          |
|  ....  |  ..  | ..... |  ....  |          |
|  ....  |  ..  | ..... |  ....  |          |
|  0127  |  7F  | <DEL> |  007F  |          |
•---------------•-------•--------•----------•
|  0128  |  80  |   €   |  20AC  |    Yes   |
|  0129  |  81  | <HOP> |  0081  |          |
|  0130  |  82  |   ‚   |  201A  |    Yes   |
|  0131  |  83  |   ƒ   |  0192  |    Yes   |
|  0132  |  84  |   „   |  201E  |    Yes   |
|  0133  |  85  |   …   |  2026  |    Yes   |
|  0134  |  86  |   †   |  2020  |    Yes   |
|  0135  |  87  |   ‡   |  2021  |    Yes   |
|  0136  |  88  |   ˆ   |  02C6  |    Yes   |
|  0137  |  89  |   ‰   |  2030  |    Yes   |
|  0138  |  8A  |   Š   |  0160  |    Yes   |
|  0139  |  8B  |   ‹   |  2039  |    Yes   |
|  0140  |  8C  |   Œ   |  0152  |    Yes   |
|  0141  |  8D  | <RI>  |  008D  |          |
|  0142  |  8E  |   Ž   |  017D  |    Yes   |
|  0143  |  8F  | <SS3> |  008F  |          |
|  0144  |  90  | <DCS> |  0090  |          |
|  0145  |  91  |   ‘   |  2018  |    Yes   |
|  0146  |  92  |   ’   |  2019  |    Yes   |
|  0147  |  93  |   “   |  201C  |    Yes   |
|  0148  |  94  |   ”   |  201D  |    Yes   |
|  0149  |  95  |   •   |  2022  |    Yes   |
|  0150  |  96  |   –   |  2013  |    Yes   |
|  0151  |  97  |   —   |  2014  |    Yes   |
|  0152  |  98  |   ˜   |  02DC  |    Yes   |
|  0153  |  99  |   ™   |  2122  |    Yes   |
|  0154  |  9A  |   š   |  0161  |    Yes   |
|  0155  |  9B  |   ›   |  203A  |    Yes   |
|  0156  |  9C  |   œ   |  0153  |    Yes   |
|  0157  |  9D  | <OSC> |  009D  |          |
|  0158  |  9E  |   ž   |  017E  |    Yes   |
|  0159  |  9F  |   Ÿ   |  0178  |    Yes   |
•---------------•-------•--------•----------•
|  0160  |  A0  | <NBSP>|  00A0  |          |
|  ....  |  ..  | ..... |  ....  |          |
|  ....  |  ..  | ..... |  ....  |          |
|  0255  |  FF  |   ÿ   |  00FF  |          |
•---------------•-------•--------•----------•

If we sort this table by Unicode code-point ascending, we get :

•---------------•-------•--------•----------•
|   Win-1252    |       | Unicode|  Code >  |
|  Dec   | Hex  | Char. |  C.P.  | \x{00FF} |
•---------------•-------•--------•----------•
|  0000  |  00  | <NUL> |  0000  |          |
|  ....  |  ..  | ..... |  ....  |          |
|  ....  |  ..  | ..... |  ....  |          |
|  0127  |  7F  | <DEL> |  007F  |          |
•---------------•-------•--------•----------•
|  0129  |  81  | <HOP> |  0081  |          |
|  0141  |  8D  | <RI>  |  008D  |          |
|  0143  |  8F  | <SS3> |  008F  |          |
|  0144  |  90  | <DCS> |  0090  |          |
|  0157  |  9D  | <OSC> |  009D  |          |
•---------------•-------•--------•----------•
|  0160  |  A0  | <NBSP>|  00A0  |          |
|  ....  |  ..  | ..... |  ....  |          |
|  ....  |  ..  | ..... |  ....  |          |
|  0255  |  FF  |   ÿ   |  00FF  |          |
•---------------•-------•--------•----------•
|  0140  |  8C  |   Œ   |  0152  |    Yes   |
|  0156  |  9C  |   œ   |  0153  |    Yes   |
|  0138  |  8A  |   Š   |  0160  |    Yes   |
|  0154  |  9A  |   š   |  0161  |    Yes   |
|  0159  |  9F  |   Ÿ   |  0178  |    Yes   |
|  0142  |  8E  |   Ž   |  017D  |    Yes   |
|  0158  |  9E  |   ž   |  017E  |    Yes   |
|  0131  |  83  |   ƒ   |  0192  |    Yes   |
|  0136  |  88  |   ˆ   |  02C6  |    Yes   |
|  0152  |  98  |   ˜   |  02DC  |    Yes   |
|  0150  |  96  |   –   |  2013  |    Yes   |
|  0151  |  97  |   —   |  2014  |    Yes   |
|  0145  |  91  |   ‘   |  2018  |    Yes   |
|  0146  |  92  |   ’   |  2019  |    Yes   |
|  0130  |  82  |   ‚   |  201A  |    Yes   |
|  0147  |  93  |   “   |  201C  |    Yes   |
|  0148  |  94  |   ”   |  201D  |    Yes   |
|  0132  |  84  |   „   |  201E  |    Yes   |
|  0134  |  86  |   †   |  2020  |    Yes   |
|  0135  |  87  |   ‡   |  2021  |    Yes   |
|  0149  |  95  |   •   |  2022  |    Yes   |
|  0133  |  85  |   …   |  2026  |    Yes   |
|  0137  |  89  |   ‰   |  2030  |    Yes   |
|  0139  |  8B  |   ‹   |  2039  |    Yes   |
|  0155  |  9B  |   ›   |  203A  |    Yes   |
|  0128  |  80  |   €   |  20AC  |    Yes   |
|  0153  |  99  |   ™   |  2122  |    Yes   |
•---------------•-------•--------•----------•

So, if you want to detect all strings :

Containing characters with code-point over \x{00FF}, only, use the regex :

(?-s)(.[\x{DC00}-\x{DFFF}]|[[:unicode:]])+ ( Note the Posix character class [[:unicode:]] )

Containing characters, not involved in the Win-1252 ANSI encoding at all, use the regex :

(?-s)(.[\x{DC00}-\x{DFFF}]|[^\x{0000}-\x{007F}\x{0081}\x{008D}\x{008F}\x{0090}\x{009D}\x{00A0}-\x{00FF}ŒœŠšŸŽžƒˆ˜–—‘’‚“”„†‡•…‰‹›€™])+

Beware : It’s important to point out that this second regex does not match, for instance, classical letters, digits, spaces, tabulations and usual symbols either !! In other words, it will only find characters which are not present in the Character column of the ASCII Codes Insertion Panel ( Edit > Character Panel )
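
The same check can be sketched outside Notepad++ in Python, which works on code points directly and so needs no surrogate trick. One caveat, stated as an assumption of this sketch: Python's cp1252 codec leaves the bytes 81, 8D, 8F, 90 and 9D undefined, so they are mapped here to the control characters of the same value, as in the table above. The names win1252_repertoire and runs_outside are made up for the example.

```python
from itertools import groupby


def win1252_repertoire():
    """Return the set of 256 characters shown in the Win-1252 table."""
    chars = set()
    for byte in range(256):
        try:
            chars.add(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            # 0x81, 0x8D, 0x8F, 0x90, 0x9D: assume the C1 control of same value.
            chars.add(chr(byte))
    return chars


def runs_outside(text, repertoire):
    """Return the runs of consecutive characters absent from repertoire."""
    return [
        "".join(group)
        for outside, group in groupby(text, key=lambda c: c not in repertoire)
        if outside
    ]


repertoire = win1252_repertoire()
print(runs_outside("prix: 5\u20AC \u0100\U0001F923 ok", repertoire))
```

In the sample, the euro sign belongs to Win-1252 and is skipped, while the A with macron and the emoticon form one run of characters that an ANSI file could not hold.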

Best Regards,

Cheers,

guy038

guy038

Hi, @alexolog, @peterjones, @olivier-thomas and All,

Ouuuups, sorry ! I read your post too quickly and I thought that you were asking the question :

Is there a way to match a string of UTF-8 characters that are not ANSI ?

So, if you mean a way to detect strings of characters which are not pure ASCII ( so over \x{007F} ), use the regex :

(?-s)(.[\x{DC00}-\x{DFFF}]|[^\x{0000}-\x{007F}])+

Again, this regex will not match classical letters, digits, spaces, tabulations and usual symbols. Only characters with Unicode code-point over \x{007F} !

In other words, it will not match any character of that list :

https://en.wikipedia.org/wiki/ASCII#Character_set
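
For comparison, here is how the same search looks in an engine that handles code points natively. This is a small Python sketch, not a description of Notepad++ behaviour: Python's re module sees 🤣 as a single character, so the surrogate alternative is not needed. The name non_ascii_runs is just an illustrative choice.

```python
import re

# Any run of characters whose code point is above \x7F.
NON_ASCII_RUN = re.compile(r"[^\x00-\x7F]+")


def non_ascii_runs(text):
    """Return every run of consecutive non-ASCII characters in text."""
    return NON_ASCII_RUN.findall(text)


print(non_ascii_runs("abc \u00E9\u20AC\U0001F923 def"))
```

The sample contains é, € and 🤣 next to each other, and they are reported as one run, with the ASCII letters and spaces left out.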

BR

guy038

alexolog

Thank you!

Alan Kilborn

@guy038

Since you seem to have a good grasp on this topic…

I was reading this with some interest:
https://github.com/notepad-plus-plus/notepad-plus-plus/issues/5558

It is true indeed that:

The emoji sequence “👩‍❤️‍💋‍👩” does not seem to be rendered as a sequence on notepad++ It is rendered as 4 characters: 👩❤️💋👩

Do you have any idea on why this is?

guy038

Hello, @alan-kilborn and All,

Ah ah ! Alan, you made me discover something I didn’t know existed : the creation of a new Emoji character from a small set of Emoji characters !

I worked out the whole story, but we need to describe some technical data, first !

In this article, below, it is said :

https://en.wikipedia.org/wiki/Zero-width_joiner

The zero-width joiner ( ZWJ ) is a non-printing character used in the computerized typesetting of some complex scripts such as the Arabic script or any Indic script or, sometimes, the Roman script. When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms. When a ZWJ char ( \x{200D} ) is placed between two emoji characters ( or interspersed between multiple ), it can result in a single glyph being shown, such as the family emoji, made up of two adult emoji and one or two child emoji

Similarly, in this article, below, it is said :

https://en.wikipedia.org/wiki/Zero-width_non-joiner

The zero-width non-joiner ( ZWNJ ) is a non-printing character used in the computerization of writing systems that make use of ligatures. When a ZWNJ char ( \x{200C} ) is placed between two characters that would, otherwise, be connected into a ligature, it causes them to be printed in their final and initial forms, respectively

On the other hand, in this article, it is said :

https://www.unicode.org/charts/PDF/UFE00.pdf

The variation selector-16 ( VS-16 ) is an invisible code-point which specifies that the preceding character should be displayed with the emoji presentation. Only required if the preceding character defaults to text presentation

For instance, as the ❤ heart character ( \x{2764} ) pre-dates the emoji characters, it needs this variation selector after it, to tell systems to use the ❤️ emoji version ( \x{2764}\x{FE0F} ), not the ❤︎ text version !

Similarly, the variation selector-15 ( VS-15 ) is an invisible code-point which specifies that the preceding character should be displayed with the text representation. Only required if the preceding character defaults to emoji presentation

Now, in this page, we can read :

https://emojipedia.org/emoji-zwj-sequence/

An Emoji ZWJ Sequence is a combination of multiple emojis which display as a single emoji on supported platforms. These sequences are joined with a Zero Width Joiner character

To learn how this feature works, refer to :

https://blog.emojipedia.org/emoji-zwj-sequences-three-letters-many-possibilities/

To be exhaustive, the different special characters, involved with the Emoji characters, are :

The 2 Format Characters, U+200C and U+200D, in the Unicode block General Punctuation ( U+2000 - U+206F )

https://www.unicode.org/charts/PDF/U2000.pdf

The 26 Regional Indicator Symbols, U+1F1E6 - U+1F1FF, in the Unicode block Enclosed Alphanumeric Supplement (U+1F100 – U+1F1FF )

https://www.unicode.org/charts/PDF/U1F100.pdf

The 5 Emoji Modifiers, U+1F3FB - U+1F3FF, in the Unicode block Miscellaneous Symbols and Pictographs ( U+1F300 – U+1F5FF )

https://www.unicode.org/charts/PDF/U1F300.pdf

The 4 Emoji Components, U+1F9B0 - U+1F9B3, in the Unicode block Supplemental Symbols and Pictographs ( U+1F900 – U+1F9FF )

https://www.unicode.org/charts/PDF/U1F900.pdf

The 2 Emoji Variation Selectors, U+FE0E and U+FE0F, in the Unicode block Variation Selectors ( U+FE00 – U+FE0F )

https://www.unicode.org/charts/PDF/UFE00.pdf

Now that we have the technical background, let’s come back to your example !

In fact, the emoji 👩‍❤️‍💋‍👩 character, of juliodcs, is the combination of : 👩 emoji + ZWJ char + ❤️ dingbat + VS-16 char + ZWJ char + 💋 emoji + ZWJ char + 👩 emoji

and can be found with the following regex ( where I use the free-spacing mode for readability )

(?x)
\x{D83D}\x{DC69}  #  Woman Emoji                     U+1F469
\x{200D}          #  ZWJ character                   U+200D
\x{2764}          #  Heavy Black Heart dingbat       U+2764
\x{FE0F}          #  Variation Selector-16 character U+FE0F
\x{200D}          #  ZWJ character                   U+200D
\x{D83D}\x{DC8B}  #  Kiss Mark Emoji                 U+1F48B
\x{200D}          #  ZWJ character                   U+200D
\x{D83D}\x{DC69}  #  Woman Emoji                     U+1F469

IMPORTANT : Don’t forget that in order to search characters with code-point over U+FFFF, our regex engine needs to use the Surrogate Pairs mechanism, explained in this post :

https://community.notepad-plus-plus.org/post/57591

Note also that the Variation Selector-16 character does not seem necessary to the sequence. So we end up with the sequence described in this page :

https://emojipedia.org/kiss-woman-woman/
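
To check what a composite emoji is really made of, one can list its code points and their official names. Below is a minimal Python sketch using the standard unicodedata module; the helper name describe is hypothetical, and the names printed depend on the Unicode database shipped with the Python version in use.

```python
import unicodedata


def describe(sequence):
    """Return (U+XXXX, official name) for every code point of sequence."""
    return [
        ("U+%04X" % ord(char), unicodedata.name(char, "<unnamed>"))
        for char in sequence
    ]


# Woman + ZWJ + Heart + VS-16 + ZWJ + Kiss Mark + ZWJ + Woman
kiss = "\U0001F469\u200D\u2764\uFE0F\u200D\U0001F48B\u200D\U0001F469"

for code_point, name in describe(kiss):
    print(code_point, name)
```

The listing follows the same order as the free-spacing regex above, one line per code point, which makes it easy to compare a pasted emoji against the expected sequence.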

Another example :

From these four emoji characters, below :

U+1F468 = \x{D83D}\x{DC68} => 👨 Man
U+1F469 = \x{D83D}\x{DC69} => 👩 Woman
U+1F466 = \x{D83D}\x{DC66} => 👦 Boy
U+1F467 = \x{D83D}\x{DC67} => 👧 Girl

We can build this composite emoji 👨‍👩‍👧‍👦 ( ZWJ sequence ) = 👨 emoji + ZWJ char + 👩 emoji + ZWJ char + 👧 emoji + ZWJ char + 👦 emoji

which can be searched with the following regex, in free-spacing mode :

(?x)
\x{D83D}\x{DC68}  # Man
\x{200D}          # ZWJ ( Zero Width Joiner )
\x{D83D}\x{DC69}  # Woman
\x{200D}          # ZWJ ( Zero Width Joiner )
\x{D83D}\x{DC67}  # Girl
\x{200D}          # ZWJ ( Zero Width Joiner )
\x{D83D}\x{DC66}  # Boy

Remark :

It’s important to point out that the ZWJ sequence of emojis : 👨 emoji + ZWJ char + 👩 emoji + ZWJ char + 👦 emoji + ZWJ char + 👧 emoji does not give the expected result !

Indeed, it is just displayed as the ZWJ sequence 👨 emoji + ZWJ char + 👩 emoji + ZWJ char + 👦 emoji, followed by the single 👧 emoji ! That is to say the emoji sequence 👨‍👩‍👦‍👧
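
The family sequences discussed here can also be assembled and taken apart programmatically, which helps when testing which member orders a given font supports. This is only a sketch with invented helper names ( zwj_sequence and split_zwj ); whether the result is drawn as one glyph depends entirely on the font and renderer.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

MAN = "\U0001F468"
WOMAN = "\U0001F469"
GIRL = "\U0001F467"
BOY = "\U0001F466"


def zwj_sequence(*members):
    """Join emoji characters with a ZWJ between each pair of neighbours."""
    return ZWJ.join(members)


def split_zwj(sequence):
    """Return the members of a ZWJ sequence, without the joiners."""
    return sequence.split(ZWJ)


family = zwj_sequence(MAN, WOMAN, GIRL, BOY)
print(len(family))                          # code points: 4 emoji + 3 joiners
print(len(family.encode("utf-16-le")) // 2)  # 16-bit units seen by the editor
```

The second count is the one that matters for the regexes in this post, since each of the four emoji takes two 16-bit units.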

A third example, using the Emoji modifiers. From these chars :

U+1F3FB = \x{D83C}\x{DFFB} 🏻 EMOJI MODIFIER FITZPATRICK TYPE-1-2
U+1F3FC = \x{D83C}\x{DFFC} 🏼 EMOJI MODIFIER FITZPATRICK TYPE-3
U+1F3FD = \x{D83C}\x{DFFD} 🏽 EMOJI MODIFIER FITZPATRICK TYPE-4
U+1F3FE = \x{D83C}\x{DFFE} 🏾 EMOJI MODIFIER FITZPATRICK TYPE-5
U+1F3FF = \x{D83C}\x{DFFF} 🏿 EMOJI MODIFIER FITZPATRICK TYPE-6

We can build the following emoji characters of a girl ( 👧 emoji ) with different skin tones :

\x{D83D}\x{DC67}\x{200D}\x{D83C}\x{DFFB}    =>    👧🏻    Girl with a light skin tone
\x{D83D}\x{DC67}\x{200D}\x{D83C}\x{DFFC}    =>    👧🏼    Girl with a medium-light skin tone
\x{D83D}\x{DC67}\x{200D}\x{D83C}\x{DFFD}    =>    👧🏽    Girl with a medium skin tone
\x{D83D}\x{DC67}\x{200D}\x{D83C}\x{DFFE}    =>    👧🏾    Girl with a medium-dark skin tone
\x{D83D}\x{DC67}\x{200D}\x{D83C}\x{DFFF}    =>    👧🏿    Girl with a dark skin tone

Note the function of the ZWJ and ZWNJ format characters :

With the ZWJ char, the emoji sequence 👧 emoji + ZWJ char + 🏽 emoji modifier is displayed as the composite 👧‍🏽 emoji
With the ZWNJ char, the emoji sequence 👧 emoji + ZWNJ char + 🏽 emoji modifier is displayed as the two single emojis 👧‌🏽
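However, I noticed that, without these format chars, the sequence 👧 emoji + 🏽 emoji modifier is also displayed as the composite emoji 👧🏽 ! ( This form, with the modifier directly following the base emoji, is the one defined by the Unicode Technical Standard #51 )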
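
As a small illustration of that last form, the Python sketch below appends one of the five Fitzpatrick modifiers directly after a base emoji, with no ZWJ in between. The function name with_skin_tone and the string keys of the table are choices made for this example only.

```python
# The five emoji modifiers, keyed by the Fitzpatrick type in their names.
SKIN_TONE_MODIFIERS = {
    "1-2": "\U0001F3FB",
    "3": "\U0001F3FC",
    "4": "\U0001F3FD",
    "5": "\U0001F3FE",
    "6": "\U0001F3FF",
}


def with_skin_tone(base, fitzpatrick_type):
    """Return base immediately followed by the requested emoji modifier."""
    try:
        modifier = SKIN_TONE_MODIFIERS[fitzpatrick_type]
    except KeyError:
        raise ValueError("unknown Fitzpatrick type: %r" % (fitzpatrick_type,))
    return base + modifier


GIRL = "\U0001F467"
for tone in SKIN_TONE_MODIFIERS:
    print(tone, with_skin_tone(GIRL, tone))
```

Each result is two code points, so four 16-bit units in UTF-16, which is what a Notepad++ search expression for it has to spell out.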

A fourth example, using the regional Indicator symbols. From these chars, below :

U+1F1E7 = \x{D83C}\x{DDE7} => 🇧 Regional Indicator Symbol Letter B
U+1F1EB = \x{D83C}\x{DDEB} => 🇫 Regional Indicator Symbol Letter F
U+1F1EC = \x{D83C}\x{DDEC} => 🇬 Regional Indicator Symbol Letter G
U+1F1F4 = \x{D83C}\x{DDF4} => 🇴 Regional Indicator Symbol Letter O
U+1F1F7 = \x{D83C}\x{DDF7} => 🇷 Regional Indicator Symbol Letter R
U+1F1F8 = \x{D83C}\x{DDF8} => 🇸 Regional Indicator Symbol Letter S
U+1F1FA = \x{D83C}\x{DDFA} => 🇺 Regional Indicator Symbol Letter U

We can build, for instance, these flags :

The French flag 🇫‍🇷 from 🇫 and 🇷 Regional indicators ( \x{D83C}\x{DDEB}\x{200D}\x{D83C}\x{DDF7} )
The United States flag 🇺‍🇸 from 🇺 and 🇸 Regional indicators ( \x{D83C}\x{DDFA}\x{200D}\x{D83C}\x{DDF8} )
The United Kingdom flag 🇬‍🇧 from 🇬 and 🇧 Regional indicators ( \x{D83C}\x{DDEC}\x{200D}\x{D83C}\x{DDE7} )
The Faroe Islands flag 🇫‍🇴 from 🇫 and 🇴 Regional indicators ( \x{D83C}\x{DDEB}\x{200D}\x{D83C}\x{DDF4} )
The Brazil flag 🇧‍🇷 from 🇧 and 🇷 Regional indicators ( \x{D83C}\x{DDE7}\x{200D}\x{D83C}\x{DDF7} )
The Uganda flag 🇺‍🇬 from 🇺 and 🇬 Regional indicators ( \x{D83C}\x{DDFA}\x{200D}\x{D83C}\x{DDEC} )

Remark : You may omit the ZWJ character between the two regional indicators characters !
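
Since the 26 regional indicators are contiguous from U+1F1E6 ( letter A ), the pair for any two-letter country code can be computed rather than looked up. Here is a minimal Python sketch, with the hypothetical function name flag; it follows the remark above and puts no ZWJ between the two indicators.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag(country_code):
    """Return the two regional indicator symbols for a two-letter code."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= letter <= "Z" for letter in code):
        raise ValueError("expected two letters A-Z, got %r" % (country_code,))
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A")) for letter in code
    )


for code in ("FR", "US", "GB", "FO", "BR", "UG"):
    print(code, flag(code))
```

Whether a pair is drawn as a flag or as two boxed letters depends on the font; the function only builds the code points.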

Note the function of the VS-15 and VS-16 Variation Selector characters : some characters can be displayed either as black and white text or as coloured emojis, and these selectors choose between the two. For instance :

With the VS-15 ( \x{FE0E} ) char, the text representation is selected => the sequence : ℹ Information Source char + VS-15 char + 👨 emoji ) \x{2139}\x{FE0E}\x{D83D}\x{DC68} returns the ℹ︎👨 sequence
With the VS-16 ( \x{FE0F} ) char, the emoji representation is selected => the sequence : ℹ Information Source char + VS-16 char + 👨 emoji ) \x{2139}\x{FE0F}\x{D83D}\x{DC68} returns the ℹ️👨 sequence

To end, you’ll find a list of all Emoji characters, either individual or composite, below :

https://emojipedia.org/emoji/

And, for paranoid people, refer to the Unicode Technical Standard #51 :

https://www.unicode.org/reports/tr51/

Best Regards,

guy038

Alan Kilborn

Wow.
More to it than I’d have thought.
Thanks for the insight.
