Regexp fails to match UTF-8 characters
-
Expanding on your data with the U+#### unicode codepoints for the characters
🤣 1 U+1F923
☺ ☺️ 2 U+263A
😊 3 U+1F60A
You will notice that Notepad++ can find the U+263A, but not the U+1F923 or U+1F60A.
@guy038 just recently posted in “Functionlist Help” about how Notepad++ cannot search for those in normal circumstances, and instead has to use the surrogate pairs. Unfortunately, in character classes like you mentioned, that means that the characters outside the BMP (at U+10000 and above), while they can be found by
^.+
, cannot be found by something that seems equivalent, like ^[\s\S]+
Would it work for you to search for “anything, up to a lookahead for a space or EOF” instead, using something like
^.+?(?=\s|\Z)
This regex finds the first 1 or more characters at the beginning of a line that are followed by a space or EOF, which I think is what you intended. It matches the first character on all three of the lines from your example data.
-
Hello, @alexolog, @peterjones and All,
I’m preparing a more thorough post which will explain the problem ! But, in the meantime, here are my solutions :
SEARCH
(?-s)^((?!\h).[\x{DC00}-\x{DFFF}]?)+
SEARCH
(?-s)((?!\h).[\x{DC00}-\x{DFFF}]?)+
- The first regex finds any consecutive run of characters from \x{0000} to \x{10FFFF}, other than a space, a tab character or a line-break, which begins a line
- The second regex finds any consecutive run of characters from \x{0000} to \x{10FFFF}, other than a space, a tab character or a line-break, anywhere in a line
Just test them against this text :
𝅘𝅥𝅮🂡🌍🎅😀 🞋🡅🧀 !!
See you later !
Cheers,
guy038
-
-
@PeterJones said in Regexp fails to match UTF-8 characters:
Unfortunately, in character classes like you mentioned, that means that the characters outside the BMP (at U+10000 and above), while they can be found by
^.+
, cannot be found by something that seems equivalent, like^[\s\S]+
Is there a chance that N++ will support it in the near future?
Would it work for you to search for “anything (lookahead for followed by a space or EOF)” instead, using something like
^.+?(?=\s|\Z)
: this matches the first character on all three of the lines from your example data: this regex finds the first 1 or more characters at the beginning of a line that are followed by a space or EOF, which I think is what you intended
Thank you for the suggestion, but it was a one-off transformation I needed to do on a text file, and I achieved it by using a different editor.
-
You’ll find all 3 examples.
(🤣 |😊 |☺ ☺\x{FE0F} )\d
-
Hello, @alexolog, @peterjones @olivier-thomas and All,
Sorry for my late answer ! As @PeterJones said, problems arise when searching for Unicode characters which are over the Basic Multilingual Plane ( BMP ), that is to say with a code-point between \x{10000} and \x{10FFFF} ( so over \x{FFFF} )
For instance, as the code-point of the emoticon 🤣 is over \x{FFFF} :
- It cannot be represented with its real regex syntax \x{1F923}, due to a bug of the present Boost regex engine, which does not handle all characters in a true 32-bit encoding, but only in the UTF-16 encoding :-(( So, searching for \x{1F923} results in the error message Find: Invalid regular expression
- Moreover, the simple regex dot symbol (?-s). cannot match, on its own, a whole character with Unicode code-point > \x{FFFF}, either !
- Of course, if you paste your character directly in the Find what: zone, it does find all occurrences of the ROLLING ON THE FLOOR LAUGHING character !
BTW, your two emoticons can be found in the lists, below :
https://www.unicode.org/charts/PDF/U1F600.pdf
https://www.unicode.org/charts/PDF/U1F900.pdf
Luckily, the UTF-16 coding of characters used by our Boost regex engine allows us to code all characters with code-point over \x{FFFF}, thanks to the surrogates mechanism. Refer to generalities, below :
https://en.wikipedia.org/wiki/UTF-16
In short, the surrogate pair of a character with Unicode code-point in the range from \x{10000} to \x{10FFFF} can be described by the regex :
\x{hhhh}\x{iiii}
where D800 ≤ hhhh ≤ DBFF and DC00 ≤ iiii ≤ DFFF
So, if a regex involves the surrogate pair ( two 16-bit units ) of a character which is over the BMP, our regex engine is able to match it. For instance, as the surrogate pair of the character ROLLING ON THE FLOOR LAUGHING is D83E DD23, the regex \x{D83E}\x{DD23} does find all occurrences of your emoticon character !
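The arithmetic behind a surrogate pair can be sketched in a few lines of Python. This is only an illustration of the standard UTF-16 formula, not something taken from Notepad++ itself; the function names surrogate_pair and as_boost_regex are made up for this example.

```python
def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000 to U+10FFFF have a surrogate pair")
    offset = code_point - 0x10000       # a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits, in D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits, in DC00..DFFF
    return high, low


def as_boost_regex(code_point: int) -> str:
    """Return the \\x{....} escape(s) to type in the Find what: zone (hypothetical helper)."""
    if code_point <= 0xFFFF:
        return f"\\x{{{code_point:04X}}}"
    high, low = surrogate_pair(code_point)
    return f"\\x{{{high:04X}}}\\x{{{low:04X}}}"


print(as_boost_regex(0x1F923))  # ROLLING ON THE FLOOR LAUGHING -> \x{D83E}\x{DD23}
print(as_boost_regex(0x263A))   # WHITE SMILING FACE, inside the BMP -> \x{263A}
```

The high surrogate carries the top 10 bits of the offset and the low surrogate the bottom 10 bits, which is why both ranges are exactly 0x400 values wide.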
- For a full explanation about the two 16-bit code units, called a surrogate pair, refer to :
https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF
- For the calculation of the surrogate pair of a specific character with code-point over \x{FFFF}, refer to either :
http://www.russellcottrell.com/greek/utilities/SurrogatePairCalculator.htm
http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?
- On our site, get additional information here :
https://community.notepad-plus-plus.org/post/51068
https://community.notepad-plus-plus.org/post/43037
and recently I proposed a Notepad++ macro which replaces any selection of the \xhhhhh syntaxes with their surrogate pair equivalents \x{Dhhh}\x{Diii} ! See below :
https://community.notepad-plus-plus.org/post/57528
In summary, because of the use of UTF-16, instead of UTF-32, by the present implementation of the Boost Regex library within N++ :
- Use the simple regex (?-s). to match any standard character from \x{0000} to \x{FFFF} ( so not including the EOL chars nor the Form Feed char \x0c )
- IMPORTANT : From the surrogates mechanism explained above, one may think that the regex [\x{D800}-\x{DBFF}][\x{DC00}-\x{DFFF}] should find all the characters with Unicode code-point over \x{FFFF}. Unfortunately, this syntax does not work ! So, we need to use these derived regexes :
- (?-s).[\x{DC00}-\x{DFFF}] to match any standard character from \x{10000} to \x{10FFFF}
- (?-s).[\x{DC00}-\x{DFFF}]? to match all standard characters, from \x{0000} to \x{10FFFF}
And :
- To match a specific character of the BMP, from \x{0000} to \x{FFFF}, use the regex syntax \x{hhhh}, with four hexadecimal digits
- To match a specific character over the BMP, from \x{10000} to \x{10FFFF}, use its equivalent high and low surrogate pair, with the regex syntax \x{<high>}\x{<low>}, replacing <high> and <low> with their exact values, each written with 4 hexadecimal digits
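To see why the dot needs the optional low surrogate behind it, here is a rough emulation in Python. It is only a sketch : Python's re module works on code points, so the text is first turned into a string holding one element per UTF-16 code unit, which is my assumption of how the N++ engine sees the buffer. The helper name utf16_units is made up.

```python
import re


def utf16_units(text: str) -> str:
    """Return a string with one element per UTF-16 code unit of text."""
    data = text.encode("utf-16-le")
    return "".join(
        chr(int.from_bytes(data[i:i + 2], "little")) for i in range(0, len(data), 2)
    )


line = utf16_units("\U0001F923 1")  # the first sample line: emoji, space, digit 1

# One character = any code unit, optionally followed by a low surrogate.
ANY_CHAR = re.compile(r".[\uDC00-\uDFFF]?")
# Only characters above the BMP: the low surrogate is mandatory.
ASTRAL_CHAR = re.compile(r".[\uDC00-\uDFFF]")

print(len(line))                       # 4 code units
print(len(ANY_CHAR.findall(line)))     # 3 characters
print(len(ASTRAL_CHAR.findall(line)))  # 1 character above the BMP
```

A lone dot consumes a single code unit, so it stops in the middle of the emoji; pairing it with the low-surrogate class is what keeps the two halves together.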
Now, let’s go back to your example :
🤣 1
☺ ☺️ 2
😊 3
- The first line contains the \x{1F923} character, a space char and the 1 digit
- The second line contains the \x{263A} character, a space char, another \x{263A} char, the invisible \x{FE0F} char ( VARIATION SELECTOR-16 ), a space char and, finally, the 2 digit
- The third line contains the \x{1F60A} character, a space char and the 3 digit
So, in order to find the contents of :
- The first line, the \x{1F923}\x20\x31 regex must be replaced with the regex \x{D83E}\x{DD23}\x20\x31
- The second line, simply use the syntax \x{263A}\x20\x{263A}\x{FE0F}\x20\x32
- The third line, the \x{1F60A}\x20\x33 regex must be replaced with the regex \x{D83D}\x{DE0A}\x20\x33
And, in order to find an equivalent to the syntaxes ^[^ ]+ and ^[\S]+, which look right but do not work here, use :
(?-s)^((?!\x20).[\x{DC00}-\x{DFFF}]?)+
Notes :
- As usual, the in-line modifier (?-s) means that any dot will match a single standard char and not EOL chars !
- The ^ assertion looks for a beginning of line
- As said above, the (.[\x{DC00}-\x{DFFF}]?)+ part will find any range of chars, from \x{0000} to \x{10FFFF}
- But, as we must omit the space char, we place the negative look-ahead (?!\x20) right before the . symbol, standing for any char under \x{10000}
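The behaviour described in these notes can be tried outside Notepad++ with a small Python emulation. Again, this is a sketch under the assumption that the engine sees UTF-16 code units; the helper names utf16_units and units_to_text are invented, and Python's re is not the Boost engine ( for instance, the (?-s) prefix is dropped because the Python dot never matches a line-break by default ).

```python
import re


def utf16_units(text: str) -> str:
    """Return a string with one element per UTF-16 code unit of text."""
    data = text.encode("utf-16-le")
    return "".join(
        chr(int.from_bytes(data[i:i + 2], "little")) for i in range(0, len(data), 2)
    )


def units_to_text(units: str) -> str:
    """Re-assemble a string of UTF-16 code units into normal text."""
    return units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


# Python spelling of ^((?!\x20).[\x{DC00}-\x{DFFF}]?)+
FIRST_TOKEN = re.compile(r"^(?:(?!\x20).[\uDC00-\uDFFF]?)+", re.MULTILINE)

# The three sample lines of the thread.
sample = "\U0001F923 1\n\u263A \u263A\uFE0F 2\n\U0001F60A 3"
tokens = [units_to_text(m.group(0)) for m in FIRST_TOKEN.finditer(utf16_units(sample))]

# Print code points rather than glyphs, to stay readable on any console.
print([[f"U+{ord(c):04X}" for c in token] for token in tokens])
# [['U+1F923'], ['U+263A'], ['U+1F60A']]
```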
Best Regards,
guy038
A last example, containing your three consecutive emoticons, followed by a space char and the digit 4 :
🤣☺😊 4
Then, the exact regex \x{1F923}\x{263A}\x{1F60A}\x20\x34 must be changed into \x{D83E}\x{DD23}\x{263A}\x{D83D}\x{DE0A}\x20\x34 !
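This kind of rewriting is mechanical, so it can be scripted. The Python sketch below is not the macro mentioned earlier in the thread, just an illustration of the same idea; to_surrogate_escapes is a made-up name, and the function deliberately ignores corner cases such as an escaped backslash in front of \x or an escape sitting inside a character class.

```python
import re

# \x{...} escapes written with 5 or 6 hexadecimal digits.
ASTRAL_ESCAPE = re.compile(r"\\x\{([0-9A-Fa-f]{5,6})\}")


def to_surrogate_escapes(pattern: str) -> str:
    """Rewrite every \\x{hhhhh} above U+FFFF as its surrogate pair escapes."""

    def replace(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        if not 0x10000 <= code_point <= 0x10FFFF:
            return match.group(0)  # leave BMP or out-of-range values untouched
        offset = code_point - 0x10000
        high = 0xD800 + (offset >> 10)
        low = 0xDC00 + (offset & 0x3FF)
        return f"\\x{{{high:04X}}}\\x{{{low:04X}}}"

    return ASTRAL_ESCAPE.sub(replace, pattern)


print(to_surrogate_escapes(r"\x{1F923}\x{263A}\x{1F60A}\x20\x34"))
# \x{D83E}\x{DD23}\x{263A}\x{D83D}\x{DE0A}\x20\x34
```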
-
@guy038 ,
Thank you for the detailed explanation!
Is there a way to match a string of UTF-8 characters that are not ASCII?
Aside: For some reason email notifications of replies do not seem to work.
-
Hello, @alexolog, @peterjones, @olivier-thomas and All,
First, in my previous post, I said :
- Use the regex (?-s).[\x{DC00}-\x{DFFF}]? to match all standard characters, from \x{0000} to \x{10FFFF} ( so all Unicode chars ! )
You may prefer this simpler syntax (?-s).[\x{DC00}-\x{DFFF}]|. with two alternatives ( the former for all the characters over the BMP and the latter for all the characters within the BMP )
Now, your question is a bit ambiguous ! Do you mean :
- Characters with Unicode code-point over \x{00FF} ?
- Characters which cannot exist in an ANSI encoded file ?
Indeed, an ANSI encoded file may contain characters whose code-point is over \x{00FF} !
Probably, you’re using the Win-1252 ANSI encoding. To verify this, open the Edit > Character Panel. It should be identical to the one shown in this Wikipedia article :
https://en.wikipedia.org/wiki/Windows-1252#Character_set
which can be shortened as :
•--------------•--------•--------•----------•
|   Win-1252   |        | Unicode|  Code >  |
|  Dec  |  Hex | Char.  |  C.P.  | \x{00FF} |
•--------------•--------•--------•----------•
| 0000  |  00  | <NUL>  |  0000  |          |
| ....  |  ..  | .....  |  ....  |          |
| ....  |  ..  | .....  |  ....  |          |
| 0127  |  7F  | <DEL>  |  007F  |          |
•--------------•--------•--------•----------•
| 0128  |  80  | €      |  20AC  |   Yes    |
| 0129  |  81  | <HOP>  |  0081  |          |
| 0130  |  82  | ‚      |  201A  |   Yes    |
| 0131  |  83  | ƒ      |  0192  |   Yes    |
| 0132  |  84  | „      |  201E  |   Yes    |
| 0133  |  85  | …      |  2026  |   Yes    |
| 0134  |  86  | †      |  2020  |   Yes    |
| 0135  |  87  | ‡      |  2021  |   Yes    |
| 0136  |  88  | ˆ      |  02C6  |   Yes    |
| 0137  |  89  | ‰      |  2030  |   Yes    |
| 0138  |  8A  | Š      |  0160  |   Yes    |
| 0139  |  8B  | ‹      |  2039  |   Yes    |
| 0140  |  8C  | Œ      |  0152  |   Yes    |
| 0141  |  8D  | <RI>   |  008D  |          |
| 0142  |  8E  | Ž      |  017D  |   Yes    |
| 0143  |  8F  | <SS3>  |  008F  |          |
| 0144  |  90  | <DCS>  |  0090  |          |
| 0145  |  91  | ‘      |  2018  |   Yes    |
| 0146  |  92  | ’      |  2019  |   Yes    |
| 0147  |  93  | “      |  201C  |   Yes    |
| 0148  |  94  | ”      |  201D  |   Yes    |
| 0149  |  95  | •      |  2022  |   Yes    |
| 0150  |  96  | –      |  2013  |   Yes    |
| 0151  |  97  | —      |  2014  |   Yes    |
| 0152  |  98  | ˜      |  02DC  |   Yes    |
| 0153  |  99  | ™      |  2122  |   Yes    |
| 0154  |  9A  | š      |  0161  |   Yes    |
| 0155  |  9B  | ›      |  203A  |   Yes    |
| 0156  |  9C  | œ      |  0153  |   Yes    |
| 0157  |  9D  | <OSC>  |  009D  |          |
| 0158  |  9E  | ž      |  017E  |   Yes    |
| 0159  |  9F  | Ÿ      |  0178  |   Yes    |
•--------------•--------•--------•----------•
| 0160  |  A0  | <NBSP> |  00A0  |          |
| ....  |  ..  | .....  |  ....  |          |
| ....  |  ..  | .....  |  ....  |          |
| 0255  |  FF  | ÿ      |  00FF  |          |
•--------------•--------•--------•----------•
If we sort this table by Unicode code-point ascending, we get :
•--------------•--------•--------•----------•
|   Win-1252   |        | Unicode|  Code >  |
|  Dec  |  Hex | Char.  |  C.P.  | \x{00FF} |
•--------------•--------•--------•----------•
| 0000  |  00  | <NUL>  |  0000  |          |
| ....  |  ..  | .....  |  ....  |          |
| ....  |  ..  | .....  |  ....  |          |
| 0127  |  7F  | <DEL>  |  007F  |          |
•--------------•--------•--------•----------•
| 0129  |  81  | <HOP>  |  0081  |          |
| 0141  |  8D  | <RI>   |  008D  |          |
| 0143  |  8F  | <SS3>  |  008F  |          |
| 0144  |  90  | <DCS>  |  0090  |          |
| 0157  |  9D  | <OSC>  |  009D  |          |
•--------------•--------•--------•----------•
| 0160  |  A0  | <NBSP> |  00A0  |          |
| ....  |  ..  | .....  |  ....  |          |
| ....  |  ..  | .....  |  ....  |          |
| 0255  |  FF  | ÿ      |  00FF  |          |
•--------------•--------•--------•----------•
| 0140  |  8C  | Œ      |  0152  |   Yes    |
| 0156  |  9C  | œ      |  0153  |   Yes    |
| 0138  |  8A  | Š      |  0160  |   Yes    |
| 0154  |  9A  | š      |  0161  |   Yes    |
| 0159  |  9F  | Ÿ      |  0178  |   Yes    |
| 0142  |  8E  | Ž      |  017D  |   Yes    |
| 0158  |  9E  | ž      |  017E  |   Yes    |
| 0131  |  83  | ƒ      |  0192  |   Yes    |
| 0136  |  88  | ˆ      |  02C6  |   Yes    |
| 0152  |  98  | ˜      |  02DC  |   Yes    |
| 0150  |  96  | –      |  2013  |   Yes    |
| 0151  |  97  | —      |  2014  |   Yes    |
| 0145  |  91  | ‘      |  2018  |   Yes    |
| 0146  |  92  | ’      |  2019  |   Yes    |
| 0130  |  82  | ‚      |  201A  |   Yes    |
| 0147  |  93  | “      |  201C  |   Yes    |
| 0148  |  94  | ”      |  201D  |   Yes    |
| 0132  |  84  | „      |  201E  |   Yes    |
| 0134  |  86  | †      |  2020  |   Yes    |
| 0135  |  87  | ‡      |  2021  |   Yes    |
| 0149  |  95  | •      |  2022  |   Yes    |
| 0133  |  85  | …      |  2026  |   Yes    |
| 0137  |  89  | ‰      |  2030  |   Yes    |
| 0139  |  8B  | ‹      |  2039  |   Yes    |
| 0155  |  9B  | ›      |  203A  |   Yes    |
| 0128  |  80  | €      |  20AC  |   Yes    |
| 0153  |  99  | ™      |  2122  |   Yes    |
•--------------•--------•--------•----------•
So, if you want to detect all strings :
- Containing characters with code-point over \x{00FF} only, use the regex :
(?-s)(.[\x{DC00}-\x{DFFF}]|[[:unicode:]])+
( Note the Posix character class [[:unicode:]] )
- Containing characters not involved in the Win-1252 ANSI encoding at all, use the regex :
(?-s)(.[\x{DC00}-\x{DFFF}]|[^\x{0000}-\x{007F}\x{0081}\x{008D}\x{008F}\x{0090}\x{009D}\x{00A0}-\x{00FF}ŒœŠšŸŽžƒˆ˜–—‘’‚“”„†‡•…‰‹›€™])+
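For comparison, the Win-1252 regex can be mirrored in Python, where the surrogate alternative is not needed because Python's re already works on whole code points. This is only a sketch : the names CP1252_EXTRA, IN_CP1252 and NOT_CP1252_RUN are invented, and the list of 27 code points is simply copied from the table of this post.

```python
import re

# The 27 code points that Win-1252 assigns to bytes 0x80-0x9F, sorted by code point.
CP1252_EXTRA = [
    0x0152, 0x0153, 0x0160, 0x0161, 0x0178, 0x017D, 0x017E, 0x0192, 0x02C6,
    0x02DC, 0x2013, 0x2014, 0x2018, 0x2019, 0x201A, 0x201C, 0x201D, 0x201E,
    0x2020, 0x2021, 0x2022, 0x2026, 0x2030, 0x2039, 0x203A, 0x20AC, 0x2122,
]

# Same character class as the N++ regex: ASCII, the five control codes
# 81, 8D, 8F, 90 and 9D, the range A0-FF, then the 27 characters above.
IN_CP1252 = (
    r"\x00-\x7F"
    r"\x81\x8D\x8F\x90\x9D"
    r"\xA0-\xFF"
    + "".join(chr(cp) for cp in CP1252_EXTRA)
)
NOT_CP1252_RUN = re.compile(f"[^{IN_CP1252}]+")

# Euro sign and e-acute exist in Win-1252; Omega, the emoji and the snowman do not.
sample = "price: 5\u20ac \u03a9 \U0001F923\u2603 caf\u00e9"
runs = NOT_CP1252_RUN.findall(sample)
print([[f"U+{ord(c):04X}" for c in run] for run in runs])
# [['U+03A9'], ['U+1F923', 'U+2603']]
```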
Beware : It’s important to point out that the second regex above, for the Win-1252 case, skips, for instance, classical letters, digits, space, tabulation and usual symbols as well !! In other words, it will find any character not present in the Character column of the ASCII Codes Insertion Panel ( Edit > Character Panel )
Best Regards,
Cheers,
guy038
-
Hi, @alexolog, @peterjones, @olivier-thomas and All,
Ouuuups, sorry ! I read your post too quickly and I thought that you were asking the question :
Is there a way to match a string of UTF-8 characters that are not ANSI ?
So, if you mean a way to detect strings of characters which are not pure ASCII ( so over \x{007F} ), use the regex :
(?-s)(.[\x{DC00}-\x{DFFF}]|[^\x{0000}-\x{007F}])+
Again, this regex will not match classical letters, digits, space, tabulation and usual symbols. Only characters with Unicode code-point over \x{007F} !
In other words, it will not match any character of that list :
https://en.wikipedia.org/wiki/ASCII#Character_set
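As a cross-check outside Notepad++, the same search for runs of non-ASCII characters is a one-liner in Python, whose re module matches whole code points so the surrogate alternative can be dropped. A minimal sketch, with an invented sample string :

```python
import re

# A run of one or more characters above U+007F.
NON_ASCII_RUN = re.compile(r"[^\x00-\x7F]+")

# e-acute, smiling face + VS-16, i-diaeresis, two emoticons
sample = "caf\u00e9 \u263a\ufe0f na\u00efve \U0001F60A\U0001F923 ok"
runs = NON_ASCII_RUN.findall(sample)

# Print code points rather than glyphs, to stay readable on any console.
print([[f"U+{ord(c):04X}" for c in run] for run in runs])
# [['U+00E9'], ['U+263A', 'U+FE0F'], ['U+00EF'], ['U+1F60A', 'U+1F923']]
```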
BR
guy038
-
Thank you!
-
Since you seem to have a good grasp on this topic…
I was reading this with some interest:
https://github.com/notepad-plus-plus/notepad-plus-plus/issues/5558
It is true indeed that:
The emoji sequence “👩❤️💋👩” does not seem to be rendered as a sequence on notepad++ It is rendered as 4 characters: 👩❤️💋👩
Do you have any idea on why this is?
-
Hello, @alan-kilborn and All,
Ah ah ! Alan, you made me discover something I didn’t know existed : the creation of a new emoji character from a small set of emoji characters !
I worked out the whole story, but we need to describe some technical data first !
In this article, below, it is said :
https://en.wikipedia.org/wiki/Zero-width_joiner
The zero-width joiner ( ZWJ ) is a non-printing character used in the computerized typesetting of some complex scripts, such as the Arabic script or any Indic script, or, sometimes, the Roman script. When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms. When a ZWJ char ( \x{200D} ) is placed between two emoji characters ( or interspersed between multiple ), it can result in a single glyph being shown, such as the family emoji, made up of two adult emoji and one or two child emoji
Similarly, in this article, below, it is said :
https://en.wikipedia.org/wiki/Zero-width_non-joiner
The zero-width non-joiner ( ZWNJ ) is a non-printing character used in the computerization of writing systems that make use of ligatures. When a ZWNJ char ( \x{200C} ) is placed between two characters that would, otherwise, be connected into a ligature, it causes them to be printed in their final and initial forms, respectively
On the other hand, in this article, it is said :
https://www.unicode.org/charts/PDF/UFE00.pdf
The variation selector-16 ( VS-16 ) is an invisible code-point which specifies that the preceding character should be displayed with the emoji presentation. Only required if the preceding character defaults to text presentation
For instance, as the ❤ heart character ( \x{2764} ) pre-dates the emoji characters, it needs this variation selector after it, to tell systems to use the ❤️ emoji version ( \x{2764}\x{FE0F} ), not the ❤︎ text version !
Similarly, the variation selector-15 ( VS-15 ) is an invisible code-point which specifies that the preceding character should be displayed with the text presentation. Only required if the preceding character defaults to emoji presentation
Now, in this page, we can read :
https://emojipedia.org/emoji-zwj-sequence/
An Emoji ZWJ Sequence is a combination of multiple emojis which display as a single emoji on supported platforms. These sequences are joined with a Zero Width Joiner character
To learn how this feature works, refer to :
https://blog.emojipedia.org/emoji-zwj-sequences-three-letters-many-possibilities/
To be exhaustive, the different special characters involved with the emoji characters are :
- The 2 Format Characters, U+200C and U+200D, in the Unicode block General Punctuation ( U+2000 – U+206F )
https://www.unicode.org/charts/PDF/U2000.pdf
- The 26 Regional Indicator Symbols, U+1F1E6 – U+1F1FF, in the Unicode block Enclosed Alphanumeric Supplement ( U+1F100 – U+1F1FF )
https://www.unicode.org/charts/PDF/U1F100.pdf
- The 5 Emoji Modifiers, U+1F3FB – U+1F3FF, in the Unicode block Miscellaneous Symbols and Pictographs ( U+1F300 – U+1F5FF )
https://www.unicode.org/charts/PDF/U1F300.pdf
- The 4 Emoji Components, U+1F9B0 – U+1F9B3, in the Unicode block Supplemental Symbols and Pictographs ( U+1F900 – U+1F9FF )
https://www.unicode.org/charts/PDF/U1F900.pdf
- The 2 Emoji Variation Selectors, U+FE0E and U+FE0F, in the Unicode block Variation Selectors ( U+FE00 – U+FE0F )
https://www.unicode.org/charts/PDF/UFE00.pdf
https://www.unicode.org/charts/PDF/UFE00.pdf
Now that we have the technical background, let’s come back to your example !
In fact, the emoji 👩❤️💋👩 character, of juliodcs, is the combination of : 👩 emoji + ZWJ char + ❤️ dingbat + VS-16 char + ZWJ char + 💋 emoji + ZWJ char + 👩 emoji
and can be found with the following regex, where I use the free-spacing mode for readability :
(?x)
\x{D83D}\x{DC69}  # Woman Emoji                      U+1F469
\x{200D}          # ZWJ character                    U+200D
\x{2764}          # Heavy Black Heart dingbat        U+2764
\x{FE0F}          # Variation Selector-16 character  U+FE0F
\x{200D}          # ZWJ character                    U+200D
\x{D83D}\x{DC8B}  # Kiss Mark Emoji                  U+1F48B
\x{200D}          # ZWJ character                    U+200D
\x{D83D}\x{DC69}  # Woman Emoji                      U+1F469
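To check what a composite emoji is really made of, a few lines of Python are enough to list its code points. This is just a sketch using the standard unicodedata module; the variable name kiss is mine, and the string is typed with explicit escapes so that the invisible ZWJ and VS-16 characters cannot get lost.

```python
import unicodedata

# Woman + ZWJ + heart + VS-16 + ZWJ + kiss mark + ZWJ + woman
kiss = "\U0001F469\u200D\u2764\uFE0F\u200D\U0001F48B\u200D\U0001F469"

for ch in kiss:
    units = len(ch.encode("utf-16-le")) // 2  # 2 means a surrogate pair is needed
    print(f"U+{ord(ch):04X}  {units} UTF-16 unit(s)  {unicodedata.name(ch)}")

# 8 code points, but 11 UTF-16 code units, as in the regex above
print(len(kiss), len(kiss.encode("utf-16-le")) // 2)
```

Each character reported with 2 units is one that needs the surrogate pair syntax in the Notepad++ regex; the others can be written directly as \x{hhhh}.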
IMPORTANT : Don’t forget that, in order to search for characters with code-point over U+FFFF, our regex engine needs to use the surrogate pairs mechanism, explained in this post :
https://community.notepad-plus-plus.org/post/57591
Note also that the Variation Selector-16 character does not seem necessary to the sequence. So we end up with the sequence described on this page :
https://emojipedia.org/kiss-woman-woman/
Another example :
From these four emoji characters, below :
- U+1F468 = \x{D83D}\x{DC68} => 👨 Man
- U+1F469 = \x{D83D}\x{DC69} => 👩 Woman
- U+1F466 = \x{D83D}\x{DC66} => 👦 Boy
- U+1F467 = \x{D83D}\x{DC67} => 👧 Girl
We can build this composite emoji 👨👩👧👦 (
ZWJ
sequence ) = 👨 emoji + ZWJ char + 👩 emoji + ZWJ char + 👧 emoji + ZWJ char + 👦 emoji
which can be searched with the following regex, in free-spacing mode :
(?x)
\x{D83D}\x{DC68}  # Man
\x{200D}          # ZWJ ( Zero Width Joiner )
\x{D83D}\x{DC69}  # Woman
\x{200D}          # ZWJ ( Zero Width Joiner )
\x{D83D}\x{DC67}  # Girl
\x{200D}          # ZWJ ( Zero Width Joiner )
\x{D83D}\x{DC66}  # Boy
Remark :
It’s important to point out that the ZWJ sequence of emojis : 👨 emoji + ZWJ char + 👩 emoji + ZWJ char + 👦 emoji + ZWJ char + 👧 emoji does not give the expected result !
Indeed, it is just output as the ZWJ
sequence 👨 emoji + ZWJ char + 👩 emoji + ZWJ char + 👦 emoji, followed with the single 👧 emoji ! That is to say the emoji sequence 👨👩👦👧
A third example, using the emoji modifiers. From these chars :
- U+1F3FB = \x{D83C}\x{DFFB} => 🏻 EMOJI MODIFIER FITZPATRICK TYPE-1-2
- U+1F3FC = \x{D83C}\x{DFFC} => 🏼 EMOJI MODIFIER FITZPATRICK TYPE-3
- U+1F3FD = \x{D83C}\x{DFFD} => 🏽 EMOJI MODIFIER FITZPATRICK TYPE-4
- U+1F3FE = \x{D83C}\x{DFFE} => 🏾 EMOJI MODIFIER FITZPATRICK TYPE-5
- U+1F3FF = \x{D83C}\x{DFFF} => 🏿 EMOJI MODIFIER FITZPATRICK TYPE-6
We can build the following emoji characters of a girl ( 👧 emoji ) with different skin tones :
\x{D83D}\x{DC67}\x{200D}\x{D83C}\x{DFFB}
=> 👧🏻 Girl with a light skin tone
\x{D83D}\x{DC67}\x{200D}\x{D83C}\x{DFFC}
=> 👧🏼 Girl with a medium-light skin tone
\x{D83D}\x{DC67}\x{200D}\x{D83C}\x{DFFD}
=> 👧🏽 Girl with a medium skin tone
\x{D83D}\x{DC67}\x{200D}\x{D83C}\x{DFFE}
=> 👧🏾 Girl with a medium-dark skin tone
\x{D83D}\x{DC67}\x{200D}\x{D83C}\x{DFFF}
=> 👧🏿 Girl with a dark skin tone
Note the function of the ZWJ and ZWNJ format characters :
- With the ZWJ char, the emoji sequence 👧 emoji + ZWJ char + 🏽 emoji modifier is displayed as the composite 👧🏽 emoji
- With the ZWNJ char, the emoji sequence 👧 emoji + ZWNJ char + 🏽 emoji modifier is displayed as the two single emojis 👧‌🏽
- However, I noticed that, without these format chars, the sequence 👧 emoji + 🏽 emoji modifier is also output as the composite emoji 👧🏽 !
A fourth example, using the Regional Indicator symbols. From these chars, below :
- U+1F1E7 = \x{D83C}\x{DDE7} => 🇧 Regional Indicator Symbol Letter B
- U+1F1EB = \x{D83C}\x{DDEB} => 🇫 Regional Indicator Symbol Letter F
- U+1F1EC = \x{D83C}\x{DDEC} => 🇬 Regional Indicator Symbol Letter G
- U+1F1F4 = \x{D83C}\x{DDF4} => 🇴 Regional Indicator Symbol Letter O
- U+1F1F7 = \x{D83C}\x{DDF7} => 🇷 Regional Indicator Symbol Letter R
- U+1F1F8 = \x{D83C}\x{DDF8} => 🇸 Regional Indicator Symbol Letter S
- U+1F1FA = \x{D83C}\x{DDFA} => 🇺 Regional Indicator Symbol Letter U
We can build, for instance, these flags :
- The French flag 🇫🇷 from the 🇫 and 🇷 Regional indicators ( \x{D83C}\x{DDEB}\x{200D}\x{D83C}\x{DDF7} )
- The United States flag 🇺🇸 from the 🇺 and 🇸 Regional indicators ( \x{D83C}\x{DDFA}\x{200D}\x{D83C}\x{DDF8} )
- The United Kingdom flag 🇬🇧 from the 🇬 and 🇧 Regional indicators ( \x{D83C}\x{DDEC}\x{200D}\x{D83C}\x{DDE7} )
- The Faroe Islands flag 🇫🇴 from the 🇫 and 🇴 Regional indicators ( \x{D83C}\x{DDEB}\x{200D}\x{D83C}\x{DDF4} )
- The Brazil flag 🇧🇷 from the 🇧 and 🇷 Regional indicators ( \x{D83C}\x{DDE7}\x{200D}\x{D83C}\x{DDF7} )
- The Uganda flag 🇺🇬 from the 🇺 and 🇬 Regional indicators ( \x{D83C}\x{DDFA}\x{200D}\x{D83C}\x{DDEC} )
Remark : You may omit the ZWJ character between the two regional indicator characters !
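The mapping from a two-letter country code to its pair of regional indicators is a simple offset from U+1F1E6 ( letter A ). A minimal Python sketch, where regional_indicators is an invented name and the pair is built without any ZWJ between the two symbols, as allowed by the remark above :

```python
def regional_indicators(country_code: str) -> str:
    """Map a two-letter code such as 'FR' to its pair of Regional Indicator Symbols."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected two letters from A to Z")
    # U+1F1E6 is REGIONAL INDICATOR SYMBOL LETTER A; the others follow in order.
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in code)


for code in ("FR", "US", "GB", "FO", "BR", "UG"):
    flag = regional_indicators(code)
    print(code, " ".join(f"U+{ord(c):04X}" for c in flag))
```

For FR, this prints U+1F1EB U+1F1F7, the two symbols listed for the French flag above.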
Note the function of the VS-15 and VS-16 Variation Selector characters. For instance, as emoji sequences can be represented as black and white text or as coloured emojis :
- With the VS-15 ( \x{FE0E} ) char, the text representation is selected => the sequence ( ℹ Information Source char + VS-15 char + 👨 emoji ) \x{2139}\x{FE0E}\x{D83D}\x{DC68} returns the ℹ︎👨 sequence
- With the VS-16 ( \x{FE0F} ) char, the emoji representation is selected => the sequence ( ℹ Information Source char + VS-16 char + 👨 emoji ) \x{2139}\x{FE0F}\x{D83D}\x{DC68} returns the ℹ️👨 sequence
To end, you’ll find a list of all Emoji characters, either individual or composite, below :
And, for paranoid people, refer to the Unicode Technical Standard #51 :
https://www.unicode.org/reports/tr51/
Best Regards,
guy038
-
Wow.
More to it than I’d have thought.
Thanks for the insight.