    Regexp fails to match UTF-8 characters

      alexolog @PeterJones
      last edited by

      @PeterJones said in Regexp fails to match UTF-8 characters:

      Unfortunately, in character classes like you mentioned, that means that the characters outside the BMP (at U+10000 and above), while they can be found by ^.+, cannot be found by something that seems equivalent, like ^[\s\S]+

      Is there a chance that N++ will support it in the near future?

      Would it work for you to search for “anything (lookahead for followed by a space or EOF)” instead, using something like ^.+?(?=\s|\Z): this matches the first character on all three of the lines from your example data: this regex finds the first 1 or more characters at the beginning of a line that are followed by a space or EOF, which I think is what you intended

      Thank you for the suggestion, but it was a one-off transformation I needed to do on a text file, and I achieved it by using a different editor.

        Olivier Thomas
        last edited by Olivier Thomas

        You’ll find all 3 examples.
        (🤣 |😊 |☺ ☺\x{FE0F} )\d

          guy038
          last edited by guy038

          Hello, @alexolog, @peterjones @olivier-thomas and All,

          Sorry for my late answer ! As @PeterJones said, problems arise when searching for Unicode characters outside the Basic Multilingual Plane ( BMP ), that is, characters with a code-point between \x{10000} and \x{10FFFF} ( so above \x{FFFF} )

          For instance, as the code-point of the emoticon 🤣 is above \x{FFFF} :

          • It cannot be written with its real regex syntax \x{1F923}, due to a bug of the present Boost regex engine, which does not handle characters in true 32-bit encoding, but only in UTF-16 :-(( So, searching for \x{1F923} results in the error message Find: Invalid regular expression

          • Moreover, the simple regex dot symbol (?-s). cannot match a character with a Unicode code-point above \x{FFFF} either !

          • Of course, if you paste your character directly in the Find what: zone, it does find all occurrences of the ROLLING ON THE FLOOR LAUGHING character !

          BTW, your two emoticons can be found in the lists, below :

          https://www.unicode.org/charts/PDF/U1F600.pdf

          https://www.unicode.org/charts/PDF/U1F900.pdf


          Luckily, because our Boost regex engine encodes characters in UTF-16, every character with a code-point above \x{FFFF} can still be written, thanks to the surrogates mechanism. For the general principles, see :

          https://en.wikipedia.org/wiki/UTF-16

          In short, the surrogate pair of a character with a Unicode code-point in the range \x{10000} to \x{10FFFF} can be described by the regex :

          \x{hhhh}\x{iiii}    where D800 ≤ hhhh ≤ DBFF    and    DC00 ≤ iiii ≤ DFFF

          So, if a regex involves the surrogate pair ( two 16-bit units ) of a character outside the BMP, our regex engine is able to match it. For instance, as the surrogate pair of the character ROLLING ON THE FLOOR LAUGHING is D83E DD23, the regex \x{D83E}\x{DD23} does find all occurrences of your emoticon character !
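As a rough illustration of the arithmetic behind a surrogate pair, here is a minimal Python sketch. The function names `surrogate_pair` and `npp_regex_for` are made up for this example; only the UTF-16 formula itself comes from the standard.

```python
def surrogate_pair(code_point: int) -> tuple:
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000 to U+10FFFF have a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


def npp_regex_for(code_point: int) -> str:
    """Hypothetical helper: write the pair in the \\x{hhhh}\\x{iiii} form."""
    high, low = surrogate_pair(code_point)
    return "\\x{%04X}\\x{%04X}" % (high, low)


print(npp_regex_for(0x1F923))  # ROLLING ON THE FLOOR LAUGHING -> \x{D83E}\x{DD23}
```

The result can be pasted into the Find what: zone with the Regular expression search mode selected.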


          • For a full explanation about the two 16-bit code units, called a surrogate pair, refer to :

          https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF

          • To compute the surrogate pair of a specific character with a code-point above \x{FFFF}, use either of :

          http://www.russellcottrell.com/greek/utilities/SurrogatePairCalculator.htm

          http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?

          • On our site, get additional information, here :

          https://community.notepad-plus-plus.org/post/51068

          https://community.notepad-plus-plus.org/post/43037

          and recently I proposed a Notepad++ macro which replaces any selection of the \xhhhhh syntaxes with their surrogate pair equivalents \x{Dhhh}\x{Diii} ! See below :

          https://community.notepad-plus-plus.org/post/57528


          In summary, because of the use of UTF-16, instead of UTF-32, by the present implementation of the Boost Regex library, within N++ :

          • Use the simple regex (?-s). to match any standard character, from \x{0000} to \x{FFFF} ( so not including the EOL chars nor the Form Feed char \x0c )

          • IMPORTANT : From the surrogates mechanism explained above, one might think that the regex [\x{D800}-\x{DBFF}][\x{DC00}-\x{DFFF}] should find all the characters with a Unicode code-point above \x{FFFF}. Unfortunately, this syntax does not work ! So, we need to use these derived regexes :

          • (?-s).[\x{DC00}-\x{DFFF}] to match any standard character from \x{10000} to \x{10FFFF}

          • (?-s).[\x{DC00}-\x{DFFF}]? to match all standard characters, from \x{0000} to \x{10FFFF}

          And :

          • To match a specific character of the BMP, from \x{0000} to \x{FFFF}, use the regex syntax \x{hhhh}, with four hexadecimal digits

          • To match a specific character outside the BMP, from \x{10000} to \x{10FFFF}, use the equivalent high and low surrogate pair, with the regex syntax \x{<high>}\x{<low>}, replacing <high> and <low> with their exact values, each written with four hexadecimal digits


          Now, let’s go back to your example :

          🤣 1
          ☺ ☺️ 2
          😊 3
          
          • The first line contains the \x{1F923} character, a space char and the 1 digit

          • The second line contains the \x{263A} character, a space char, another \x{263A} char, the invisible \x{FE0F} char ( VARIATION SELECTOR-16 ), a space char and, finally, the 2 digit

          • The third line contains the \x{1F60A} character, a space char and the 3 digit

          So, in order to find the contents of :

          • The first line, the \x{1F923}\x20\x31 regex must be replaced with the regex \x{D83E}\x{DD23}\x20\x31

          • The second line, simply use the syntax \x{263A}\x20\x{263A}\x{FE0F}\x20\x32

          • The third line, the \x{1F60A}\x20\x33 regex must be replaced with the regex \x{D83D}\x{DE0A}\x20\x33
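The conversion of a whole line into this escaped form can be automated. The sketch below is only an illustration; `to_npp_regex` is a hypothetical helper name, and the choice of \xHH for code points below \x{0100} simply mirrors the style of the three regexes above.

```python
def to_npp_regex(text: str) -> str:
    """Escape every character of text, using surrogate pairs above U+FFFF."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x100:
            parts.append("\\x%02X" % cp)            # e.g. \x20 for a space
        elif cp <= 0xFFFF:
            parts.append("\\x{%04X}" % cp)          # BMP character
        else:
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            parts.append("\\x{%04X}\\x{%04X}" % (high, low))
    return "".join(parts)


# The three example lines, written with escapes so nothing invisible is lost
for line in ("\U0001F923 1", "\u263A \u263A\uFE0F 2", "\U0001F60A 3"):
    print(to_npp_regex(line))
```

Run on the three example lines, it prints the same three regexes as listed above.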


          And, in order to find an equivalent to the syntaxes ^[^ ]+ and ^[\S]+, which do not work as expected here, use :

          (?-s)^((?!\x20).[\x{DC00}-\x{DFFF}]?)+

          Notes :

          • As usual, the in-line modifier (?-s) means that any dot will match a single standard char and not EOL chars !

          • The ^ assertion looks for a beginning of line

          • As said above, the (.[\x{DC00}-\x{DFFF}]?)+ will find any range of chars, from \x{0000} to \x{10FFFF}

          • But as we must omit the space char, we place the negative look-ahead (?!\x20), right before the . symbol, standing for any char under \x{10000}
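Python's `re` module works on whole code points, not on UTF-16 code units, so the regex above cannot be tried there as it stands. The sketch below imitates the behaviour described in this post by first turning the text into one Python character per 16-bit unit. It is only an approximation of what the Boost engine does inside N++, and `to_units` / `from_units` are hypothetical helper names.

```python
import re


def to_units(text: str) -> str:
    """Re-express text as one Python character per UTF-16 code unit."""
    data = text.encode("utf-16-le")
    return "".join(
        chr(int.from_bytes(data[i:i + 2], "little")) for i in range(0, len(data), 2)
    )


def from_units(units: str) -> str:
    """Inverse of to_units: glue surrogate pairs back into real characters."""
    return units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


# Same shape as (?-s)^((?!\x20).[\x{DC00}-\x{DFFF}]?)+ ; in Python the dot
# excludes "\n" by default and (?m) lets ^ match at the start of every line.
PATTERN = re.compile(r"(?m)^(?:(?!\x20).[\uDC00-\uDFFF]?)+")

sample = "\U0001F923 1\n\u263A \u263A\uFE0F 2\n\U0001F60A 3"
found = [from_units(m.group()) for m in PATTERN.finditer(to_units(sample))]
print(ascii(found))  # one leading emoticon per line
```

The optional low-surrogate class after the dot is what keeps both halves of a pair together in the match.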

          Best Regards,

          guy038

          A last example, containing your three consecutive emoticons, with a space char and digit 4 :

          🤣☺😊 4
          

          Then, the exact regex \x{1F923}\x{263A}\x{1F60A}\x20\x34 must be replaced with \x{D83E}\x{DD23}\x{263A}\x{D83D}\x{DE0A}\x20\x34 !

            Long Tứ @PeterJones
            last edited by

            @PeterJones said in Regexp fails to match UTF-8 characters:

            @alexolog,

            Expanding on your data with the U+#### unicode codepoints for the characters

            🤣 1 U+1F923
            ☺ ☺️ 2 U+263A
            😊 3 U+1F60A
            

            You will notice that Notepad++ can find the U+263A, but not the U+1F923 or U+1F60A.
            @guy038 just recently posted in “Functionlist Help” about how Notepad++ cannot search for those in normal circumstances, and instead has to use the surrogate pairs.

            Unfortunately, in character classes like you mentioned, that means that the characters outside the BMP (at U+10000 and above), while they can be found by ^.+, cannot be found by something that seems equivalent, like ^[\s\S]+

            Would it work for you to search for “anything (lookahead for followed by a space or EOF)” instead, using something like ^.+?(?=\s|\Z): this matches the first character on all three of the lines from your example data: this regex finds the first 1 or more characters at the beginning of a line that are followed by a space or EOF, which I think is what you intended

              alexolog
              last edited by

              @guy038 ,

              Thank you for the detailed explanation!

              Is there a way to match a string of UTF-8 characters that are not ASCII?

              Aside: For some reason email notifications of replies do not seem to work.

                guy038
                last edited by

                Hello, @alexolog, @peterjones, @olivier-thomas and All,

                First, in my previous post, I said :

                • Use the regex (?-s).[\x{DC00}-\x{DFFF}]? to match all standard characters, from \x{0000} to \x{10FFFF} ( so All Unicode chars ! )

                You may prefer this simpler syntax (?-s).[\x{DC00}-\x{DFFF}]|. with two alternatives ( the former for all the characters outside the BMP and the latter for all the characters within the BMP )


                Now, your question is a bit ambiguous ! Do you mean :

                • Characters with Unicode code-point over \x{00FF} ?

                • Characters which cannot exist in an ANSI encoded file ?


                Indeed, an ANSI encoded file may contain characters whose code-point is over \x{00FF} !

                Probably, you’re using the Win-1252 ANSI encoding. To verify this, open the Edit > Character Panel. It should be identical to the table shown in this Wikipedia article :

                https://en.wikipedia.org/wiki/Windows-1252#Character_set

                which can be shortened as :

                •---------------•-------•--------•----------•
                |   Win-1252    |       | Unicode|  Code >  |
                |  Dec   | Hex  | Char. |  C.P.  | \x{00FF} |
                •---------------•-------•--------•----------•
                |  0000  |  00  | <NUL> |  0000  |          |
                |  ....  |  ..  | ..... |  ....  |          |
                |  ....  |  ..  | ..... |  ....  |          |
                |  0127  |  7F  | <DEL> |  007F  |          |
                •---------------•-------•--------•----------•
                |  0128  |  80  |   €   |  20AC  |    Yes   |
                |  0129  |  81  | <HOP> |  0081  |          |
                |  0130  |  82  |   ‚   |  201A  |    Yes   |
                |  0131  |  83  |   ƒ   |  0192  |    Yes   |
                |  0132  |  84  |   „   |  201E  |    Yes   |
                |  0133  |  85  |   …   |  2026  |    Yes   |
                |  0134  |  86  |   †   |  2020  |    Yes   |
                |  0135  |  87  |   ‡   |  2021  |    Yes   |
                |  0136  |  88  |   ˆ   |  02C6  |    Yes   |
                |  0137  |  89  |   ‰   |  2030  |    Yes   |
                |  0138  |  8A  |   Š   |  0160  |    Yes   |
                |  0139  |  8B  |   ‹   |  2039  |    Yes   |
                |  0140  |  8C  |   Œ   |  0152  |    Yes   |
                |  0141  |  8D  | <RI>  |  008D  |          |
                |  0142  |  8E  |   Ž   |  017D  |    Yes   |
                |  0143  |  8F  | <SS3> |  008F  |          |
                |  0144  |  90  | <DCS> |  0090  |          |
                |  0145  |  91  |   ‘   |  2018  |    Yes   |
                |  0146  |  92  |   ’   |  2019  |    Yes   |
                |  0147  |  93  |   “   |  201C  |    Yes   |
                |  0148  |  94  |   ”   |  201D  |    Yes   |
                |  0149  |  95  |   •   |  2022  |    Yes   |
                |  0150  |  96  |   –   |  2013  |    Yes   |
                |  0151  |  97  |   —   |  2014  |    Yes   |
                |  0152  |  98  |   ˜   |  02DC  |    Yes   |
                |  0153  |  99  |   ™   |  2122  |    Yes   |
                |  0154  |  9A  |   š   |  0161  |    Yes   |
                |  0155  |  9B  |   ›   |  203A  |    Yes   |
                |  0156  |  9C  |   œ   |  0153  |    Yes   |
                |  0157  |  9D  | <OSC> |  009D  |          |
                |  0158  |  9E  |   ž   |  017E  |    Yes   |
                |  0159  |  9F  |   Ÿ   |  0178  |    Yes   |
                •---------------•-------•--------•----------•
                |  0160  |  A0  | <NBSP>|  00A0  |          |
                |  ....  |  ..  | ..... |  ....  |          |
                |  ....  |  ..  | ..... |  ....  |          |
                |  0255  |  FF  |   ÿ   |  00FF  |          |
                •---------------•-------•--------•----------•
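The 80 to 9F part of this table can be cross-checked with Python's built-in `cp1252` codec. This is a minimal sketch; `cp1252_to_unicode` is a hypothetical helper name, and the fallback for the five undefined bytes is an assumption chosen to match the table above.

```python
def cp1252_to_unicode(byte: int) -> int:
    """Map one Windows-1252 byte value to its Unicode code point."""
    try:
        return ord(bytes([byte]).decode("cp1252"))
    except UnicodeDecodeError:
        # Python's codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined;
        # the table shows them as the C1 control with the same value.
        return byte


for byte in range(0x80, 0xA0):
    cp = cp1252_to_unicode(byte)
    over_ff = "Yes" if cp > 0xFF else ""
    print("%04d  %02X  %04X  %s" % (byte, byte, cp, over_ff))
```

Each printed row gives the decimal value, the hexadecimal value, the Unicode code point and the "Code > \x{00FF}" flag, in the same order as the table columns.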
                

                If we sort this table by Unicode code-point ascending, we get :

                •---------------•-------•--------•----------•
                |   Win-1252    |       | Unicode|  Code >  |
                |  Dec   | Hex  | Char. |  C.P.  | \x{00FF} |
                •---------------•-------•--------•----------•
                |  0000  |  00  | <NUL> |  0000  |          |
                |  ....  |  ..  | ..... |  ....  |          |
                |  ....  |  ..  | ..... |  ....  |          |
                |  0127  |  7F  | <DEL> |  007F  |          |
                •---------------•-------•--------•----------•
                |  0129  |  81  | <HOP> |  0081  |          |
                |  0141  |  8D  | <RI>  |  008D  |          |
                |  0143  |  8F  | <SS3> |  008F  |          |
                |  0144  |  90  | <DCS> |  0090  |          |
                |  0157  |  9D  | <OSC> |  009D  |          |
                •---------------•-------•--------•----------•
                |  0160  |  A0  | <NBSP>|  00A0  |          |
                |  ....  |  ..  | ..... |  ....  |          |
                |  ....  |  ..  | ..... |  ....  |          |
                |  0255  |  FF  |   ÿ   |  00FF  |          |
                •---------------•-------•--------•----------•
                |  0140  |  8C  |   Œ   |  0152  |    Yes   |
                |  0156  |  9C  |   œ   |  0153  |    Yes   |
                |  0138  |  8A  |   Š   |  0160  |    Yes   |
                |  0154  |  9A  |   š   |  0161  |    Yes   |
                |  0159  |  9F  |   Ÿ   |  0178  |    Yes   |
                |  0142  |  8E  |   Ž   |  017D  |    Yes   |
                |  0158  |  9E  |   ž   |  017E  |    Yes   |
                |  0131  |  83  |   ƒ   |  0192  |    Yes   |
                |  0136  |  88  |   ˆ   |  02C6  |    Yes   |
                |  0152  |  98  |   ˜   |  02DC  |    Yes   |
                |  0150  |  96  |   –   |  2013  |    Yes   |
                |  0151  |  97  |   —   |  2014  |    Yes   |
                |  0145  |  91  |   ‘   |  2018  |    Yes   |
                |  0146  |  92  |   ’   |  2019  |    Yes   |
                |  0130  |  82  |   ‚   |  201A  |    Yes   |
                |  0147  |  93  |   “   |  201C  |    Yes   |
                |  0148  |  94  |   ”   |  201D  |    Yes   |
                |  0132  |  84  |   „   |  201E  |    Yes   |
                |  0134  |  86  |   †   |  2020  |    Yes   |
                |  0135  |  87  |   ‡   |  2021  |    Yes   |
                |  0149  |  95  |   •   |  2022  |    Yes   |
                |  0133  |  85  |   …   |  2026  |    Yes   |
                |  0137  |  89  |   ‰   |  2030  |    Yes   |
                |  0139  |  8B  |   ‹   |  2039  |    Yes   |
                |  0155  |  9B  |   ›   |  203A  |    Yes   |
                |  0128  |  80  |   €   |  20AC  |    Yes   |
                |  0153  |  99  |   ™   |  2122  |    Yes   |
                •---------------•-------•--------•----------•
                

                So, if you want to detect all strings :

                • Containing characters with code-point over \x{00FF}, only, use the regex :

                (?-s)(.[\x{DC00}-\x{DFFF}]|[[:unicode:]])+    ( Note the Posix character class [[:unicode:]] )

                • Containing characters, not involved in the Win-1252 ANSI encoding at all, use the regex :

                (?-s)(.[\x{DC00}-\x{DFFF}]|[^\x{0000}-\x{007F}\x{0081}\x{008D}\x{008F}\x{0090}\x{009D}\x{00A0}-\x{00FF}ŒœŠšŸŽžƒˆ˜–—‘’‚“”„†‡•…‰‹›€™])+

                Beware : It’s important to point out that this second regex also skips, for instance, ordinary letters, digits, spaces, tabs and usual symbols !! In other words, it will find any character not present in the Character column of the ASCII Codes Insertion Panel ( Edit > Character Panel )
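Outside N++, the same idea can be sketched in Python, where no surrogate handling is needed because strings are sequences of code points. The names `WIN1252_CHARS` and `runs_outside_win1252` are hypothetical, and adding the five C1 controls by hand is an assumption made to follow the second regex above.

```python
from itertools import groupby

# Every character that Python's cp1252 codec can decode from a single byte ...
WIN1252_CHARS = set(bytes(range(256)).decode("cp1252", errors="ignore"))
# ... plus the five C1 controls that the regex above also excludes
# (Python's codec treats those byte values as undefined).
WIN1252_CHARS |= {"\x81", "\x8d", "\x8f", "\x90", "\x9d"}


def runs_outside_win1252(text: str) -> list:
    """Return each maximal run of characters that are not in WIN1252_CHARS."""
    return [
        "".join(group)
        for outside, group in groupby(text, key=lambda ch: ch not in WIN1252_CHARS)
        if outside
    ]


# U+0101 and the emoticon are reported; the euro sign is part of Win-1252
print(ascii(runs_outside_win1252("abc \u0101\U0001F923 \u20AC \u4E2D")))
```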

                Best Regards,

                Cheers,

                guy038

                  guy038
                  last edited by guy038

                  Hi, @alexolog, @peterjones, @olivier-thomas and All,

                  Oops, sorry ! I read your post too quickly and thought that you were asking the question :

                  Is there a way to match a string of UTF-8 characters that are not ANSI ?

                  So, if you mean a way to detect a string of characters that are not pure ASCII ( so above \x{007F} ), use the regex :

                  (?-s)(.[\x{DC00}-\x{DFFF}]|[^\x{0000}-\x{007F}])+
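For comparison, in an engine that works on whole code points, such as Python's `re` module, the surrogate alternative is not needed and a plain negated class is enough. A minimal sketch, where `NON_ASCII_RUN` is just an illustrative name:

```python
import re

# One or more consecutive characters above U+007F
NON_ASCII_RUN = re.compile(r"[^\x00-\x7F]+")

sample = "caf\u00E9 \U0001F923\u263A ok"
print(ascii(NON_ASCII_RUN.findall(sample)))
```

Here the accented letter is one run and the two adjacent emoticons form a second one.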

                  Again, this regex will not match ordinary letters, digits, spaces, tabs and usual symbols. Only characters with a Unicode code-point above \x{007F} !

                  In other words, it will not match any character of that list :

                  https://en.wikipedia.org/wiki/ASCII#Character_set

                  BR

                  guy038

                    alexolog
                    last edited by

                    Thank you!

                      Alan Kilborn
                      last edited by

                      @guy038

                      Since you seem to have a good grasp on this topic…

                      I was reading this with some interest:
                      https://github.com/notepad-plus-plus/notepad-plus-plus/issues/5558

                      It is true indeed that:

                      The emoji sequence “👩‍❤️‍💋‍👩” does not seem to be rendered as a sequence on notepad++ It is rendered as 4 characters: 👩❤️💋👩

                      Do you have any idea on why this is?

                        guy038
                        last edited by guy038

                        Hello, @alan-kilborn and All,

                        Ah ah ! Alan, you made me discover something I didn’t know existed : the creation of a new emoji character from a small set of emoji characters !

                        I worked out the whole story but, first, we need to go through some technical background !


                        In this article, below, it is said :

                        https://en.wikipedia.org/wiki/Zero-width_joiner

                        The zero-width joiner ( ZWJ ) is a non-printing character used in the computerized typesetting of some complex scripts such as the Arabic script or any Indic script or, sometimes, the Roman script. When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms. When a ZWJ char ( \x{200D} ) is placed between two emoji characters ( or interspersed between multiple ), it can result in a single glyph being shown, such as the family emoji, made up of two adult emoji and one or two child emoji

                        Similarly, in this article, below, it is said :

                        https://en.wikipedia.org/wiki/Zero-width_non-joiner

                        The zero-width non-joiner ( ZWNJ ) is a non-printing character used in the computerization of writing systems that make use of ligatures. When a ZWNJ char ( \x{200C} ) is placed between two characters that would, otherwise, be connected into a ligature, a ZWNJ causes them to be printed in their final and initial forms, respectively

                        On the other hand, in this article, it is said :

                        https://www.unicode.org/charts/PDF/UFE00.pdf

                        The variation selector-16 ( VS-16 ) is an invisible code-point which specifies that the preceding character should be displayed with the emoji presentation. Only required if the preceding character defaults to text presentation

                        For instance, as the ❤ heart character ( \x{2764} ) pre-dates the emoji characters, it needs this variation selector after it, to tell systems to use the ❤️ emoji version ( \x{2764}\x{FE0F} ), not the ❤︎ text version !

                        Similarly, the variation selector-15 ( VS-15 ) is an invisible code-point which specifies that the preceding character should be displayed with the text presentation. Only required if the preceding character defaults to emoji presentation


                        Now, in this page, we can read :

                        https://emojipedia.org/emoji-zwj-sequence/

                        An Emoji ZWJ Sequence is a combination of multiple emojis which display as a single emoji on supported platforms. These sequences are joined with a Zero Width Joiner character

                        To learn how this feature works, refer to :

                        https://blog.emojipedia.org/emoji-zwj-sequences-three-letters-many-possibilities/


                        To be exhaustive, the different special characters, involved with the Emoji characters, are :

                        • The 2 Format Characters, U+200C and U+200D, in the Unicode block General Punctuation ( U+2000 - U+206F )

                        https://www.unicode.org/charts/PDF/U2000.pdf

                        • The 26 Regional Indicator Symbols, U+1F1E6 - U+1F1FF, in the Unicode block Enclosed Alphanumeric Supplement (U+1F100 – U+1F1FF )

                        https://www.unicode.org/charts/PDF/U1F100.pdf

                        • The 5 Emoji Modifiers, U+1F3FB - U+1F3FF, in the Unicode block Miscellaneous Symbols and Pictographs ( U+1F300 – U+1F5FF )

                        https://www.unicode.org/charts/PDF/U1F300.pdf

                        • The 4 Emoji Components, U+1F9B0 - U+1F9B3, in the Unicode block Supplemental Symbols and Pictographs ( U+1F900 – U+1F9FF )

                        https://www.unicode.org/charts/PDF/U1F900.pdf

                        • The 2 Emoji Variation Selectors, U+FE0E and U+FE0F, in the Unicode block Variation Selectors ( U+FE00 – U+FE0F )

                        https://www.unicode.org/charts/PDF/UFE00.pdf


                        Now that we have the technical background, let’s come back to your example !

                        In fact, the emoji 👩‍❤️‍💋‍👩 character, of juliodcs, is the combination of : 👩 emoji + ZWJ char + ❤️ dingbat + VS-16 char + ZWJ char + 💋 emoji + ZWJ char + 👩 emoji

                        and can be found with the following regex ( where I use the free-spacing mode for readability ) :

                        (?x)
                        \x{D83D}\x{DC69}  #  Woman Emoji                     U+1F469
                        \x{200D}          #  ZWJ character                   U+200D
                        \x{2764}          #  Heavy Black Heart dingbat       U+2764
                        \x{FE0F}          #  Variation Selector-16 character U+FE0F
                        \x{200D}          #  ZWJ character                   U+200D
                        \x{D83D}\x{DC8B}  #  Kiss Mark Emoji                 U+1F48B
                        \x{200D}          #  ZWJ character                   U+200D
                        \x{D83D}\x{DC69}  #  Woman Emoji                     U+1F469
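To check what a composite emoji is really made of, its code points can be listed with Python's standard `unicodedata` module. A minimal sketch; the sequence is written with escapes so that the invisible characters are not lost when copying :

```python
import unicodedata

# Woman + ZWJ + Heavy Black Heart + VS-16 + ZWJ + Kiss Mark + ZWJ + Woman
sequence = "\U0001F469\u200D\u2764\uFE0F\u200D\U0001F48B\u200D\U0001F469"

for ch in sequence:
    # The second argument is a fallback for code points without a name
    print("U+%04X  %s" % (ord(ch), unicodedata.name(ch, "<unnamed>")))
```

This prints eight lines, one per code point, in the same order as the eight lines of the free-spacing regex.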
                        

                        IMPORTANT : Don’t forget that in order to search characters with code-point over U+FFFF, our regex engine needs to use the Surrogate Pairs mechanism, explained in this post :

                        https://community.notepad-plus-plus.org/post/57591

                        Note also that the Variation Selector-16 character does not seem to be necessary in the sequence. We end up with the sequence described on this page :

                        https://emojipedia.org/kiss-woman-woman/


                        Another example :

                        From these four emoji characters, below :

                        • U+1F468 = \x{D83D}\x{DC68}    =>    👨    Man
                        • U+1F469 = \x{D83D}\x{DC69}    =>    👩    Woman
                        • U+1F466 = \x{D83D}\x{DC66}    =>    👦    Boy
                        • U+1F467 = \x{D83D}\x{DC67}    =>    👧    Girl

                        We can build this composite emoji 👨‍👩‍👧‍👦 ( ZWJ sequence ) = 👨 emoji + ZWJ char + 👩 emoji + ZWJ char + 👧 emoji + ZWJ char + 👦 emoji

                        which can be searched with the following regex, in free-spacing mode :

                        (?x)
                        \x{D83D}\x{DC68}  # Man
                        \x{200D}          # ZWJ ( Zero Width Joiner )
                        \x{D83D}\x{DC69}  # Woman
                        \x{200D}          # ZWJ ( Zero Width Joiner )
                        \x{D83D}\x{DC67}  # Girl
                        \x{200D}          # ZWJ ( Zero Width Joiner )
                        \x{D83D}\x{DC66}  # Boy
                        

                        Remark :

                        It’s important to point out that the ZWJ sequence of emojis : 👨 emoji + ZWJ char + 👩 emoji + ZWJ char + 👦 emoji + ZWJ char + 👧 emoji does not give the expected result !

                        Indeed, it is just rendered as the ZWJ sequence 👨 emoji + ZWJ char + 👩 emoji + ZWJ char + 👦 emoji, followed by the single 👧 emoji ! That is to say, the emoji sequence 👨‍👩‍👦‍👧
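Building such a ZWJ sequence is just a matter of joining the individual emoji with U+200D. A small Python sketch, with variable names chosen only for this example, which also counts the 16-bit units that the N++ regex has to spell out :

```python
ZWJ = "\u200D"
MAN, WOMAN, GIRL, BOY = "\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"

# Man + ZWJ + Woman + ZWJ + Girl + ZWJ + Boy
family = ZWJ.join([MAN, WOMAN, GIRL, BOY])

code_points = ["U+%04X" % ord(ch) for ch in family]
utf16_units = len(family.encode("utf-16-le")) // 2

print(code_points)
print(utf16_units)  # 4 emoji x 2 units + 3 ZWJ x 1 unit
```

Whether the result is drawn as one glyph or as several depends on the font and on the application that displays it.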


                        A third example, using the Emoji modifiers. From these chars :

                        • U+1F3FB = \x{D83C}\x{DFFB}    🏻    EMOJI MODIFIER FITZPATRICK TYPE-1-2
                        • U+1F3FC = \x{D83C}\x{DFFC}    🏼    EMOJI MODIFIER FITZPATRICK TYPE-3
                        • U+1F3FD = \x{D83C}\x{DFFD}    🏽    EMOJI MODIFIER FITZPATRICK TYPE-4
                        • U+1F3FE = \x{D83C}\x{DFFE}    🏾    EMOJI MODIFIER FITZPATRICK TYPE-5
                        • U+1F3FF = \x{D83C}\x{DFFF}    🏿    EMOJI MODIFIER FITZPATRICK TYPE-6

                        We can build the following emoji characters of a girl ( 👧 emoji ) with different skin tones :

                        \x{D83D}\x{DC67}\x{200D}\x{D83C}\x{DFFB}    =>    👧🏻    Girl with a light skin tone
                        \x{D83D}\x{DC67}\x{200D}\x{D83C}\x{DFFC}    =>    👧🏼    Girl with a medium-light skin tone
                        \x{D83D}\x{DC67}\x{200D}\x{D83C}\x{DFFD}    =>    👧🏽    Girl with a medium skin tone
                        \x{D83D}\x{DC67}\x{200D}\x{D83C}\x{DFFE}    =>    👧🏾    Girl with a medium-dark skin tone
                        \x{D83D}\x{DC67}\x{200D}\x{D83C}\x{DFFF}    =>    👧🏿    Girl with a dark skin tone

                        Note the function of the ZWJ and ZWNJ format characters :

                        • With the ZWJ char, the emoji sequence 👧 emoji + ZWJ char + 🏽 emoji modifier is displayed as the composite 👧‍🏽 emoji

                        • With the ZWNJ char, the emoji sequence 👧 emoji + ZWNJ char + 🏽 emoji modifier is displayed as the two single emojis 👧‌🏽

                        • However, I noticed that, without these format chars, the sequence 👧 emoji + 🏽 emoji modifier is also outputted as the composite emoji 👧🏽 !
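The last observation, a base emoji directly followed by a modifier, can be reproduced with a short Python sketch. The variable names are illustrative, and how each sequence is drawn depends on the font and application.

```python
import unicodedata

GIRL = "\U0001F467"
# The five emoji modifiers, U+1F3FB to U+1F3FF
MODIFIERS = [chr(cp) for cp in range(0x1F3FB, 0x1F400)]

for modifier in MODIFIERS:
    sequence = GIRL + modifier  # no ZWJ or ZWNJ in between
    units = len(sequence.encode("utf-16-le")) // 2
    print(ascii(sequence), units, unicodedata.name(modifier))
```

Each sequence is two code points, so four 16-bit units, which is why its N++ regex needs four \x{hhhh} escapes when no format character is inserted.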


                        A fourth example, using the Regional Indicator symbols. From these chars, below :

                        • U+1F1E7 = \x{D83C}\x{DDE7}    =>    🇧    Regional Indicator Symbol Letter B
                        • U+1F1EB = \x{D83C}\x{DDEB}    =>    🇫    Regional Indicator Symbol Letter F
                        • U+1F1EC = \x{D83C}\x{DDEC}    =>    🇬    Regional Indicator Symbol Letter G
                        • U+1F1F4 = \x{D83C}\x{DDF4}    =>    🇴    Regional Indicator Symbol Letter O
                        • U+1F1F7 = \x{D83C}\x{DDF7}    =>    🇷    Regional Indicator Symbol Letter R
                        • U+1F1F8 = \x{D83C}\x{DDF8}    =>    🇸    Regional Indicator Symbol Letter S
                        • U+1F1FA = \x{D83C}\x{DDFA}    =>    🇺    Regional Indicator Symbol Letter U

                        We can build, for instance, these flags :

                        • The French flag    🇫‍🇷    from 🇫 and 🇷 Regional indicators ( \x{D83C}\x{DDEB}\x{200D}\x{D83C}\x{DDF7} )

                        • The United States flag    🇺‍🇸    from 🇺 and 🇸 Regional indicators ( \x{D83C}\x{DDFA}\x{200D}\x{D83C}\x{DDF8} )

                        • The United Kingdom flag    🇬‍🇧    from 🇬 and 🇧 Regional indicators ( \x{D83C}\x{DDEC}\x{200D}\x{D83C}\x{DDE7} )

                        • The Faroe Islands flag    🇫‍🇴    from 🇫 and 🇴 Regional indicators ( \x{D83C}\x{DDEB}\x{200D}\x{D83C}\x{DDF4} )

                        • The Brazil flag    🇧‍🇷    from 🇧 and 🇷 Regional indicators ( \x{D83C}\x{DDE7}\x{200D}\x{D83C}\x{DDF7} )

                        • The Uganda flag    🇺‍🇬    from 🇺 and 🇬 Regional indicators ( \x{D83C}\x{DDFA}\x{200D}\x{D83C}\x{DDEC} )

                        Remark : You may omit the ZWJ character between the two regional indicators characters !
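The mapping from a two-letter country code to its pair of regional indicators is a fixed offset from U+1F1E6 ( letter A ). A minimal Python sketch, where `regional_indicators` is a hypothetical helper name and no ZWJ is inserted between the two symbols :

```python
def regional_indicators(country_code: str) -> str:
    """Map a two-letter code such as "FR" to its two regional indicator symbols."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected two letters A-Z")
    # U+1F1E6 is REGIONAL INDICATOR SYMBOL LETTER A
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in code)


for code in ("FR", "US", "GB", "FO", "BR", "UG"):
    print(code, ["U+%04X" % ord(ch) for ch in regional_indicators(code)])
```

The code points printed for each flag are the ones listed in the table of regional indicators above.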


                        Note the function of the VS-15 and VS-16 Variation Selector characters : some characters can be displayed either as black-and-white text or as a coloured emoji. For instance :

                        • With the VS-15 ( \x{FE0E} ) char, the text representation is selected => the sequence : ℹ Information Source char + VS-15 char + 👨 emoji ) \x{2139}\x{FE0E}\x{D83D}\x{DC68} returns the ℹ︎👨 sequence

                        • With the VS-16 ( \x{FE0F} ) char, the emoji representation is selected => the sequence : ℹ Information Source char + VS-16 char + 👨 emoji ) \x{2139}\x{FE0F}\x{D83D}\x{DC68} returns the ℹ️👨 sequence


                        To end, you’ll find a list of all Emoji characters, either individual or composite, below :

                        https://emojipedia.org/emoji/

                        And, for paranoid people, refer to the Unicode Technical Standard #51 :

                        https://www.unicode.org/reports/tr51/

                        Best Regards,

                        guy038

                          Alan Kilborn
                          last edited by

                          Wow.
                          More to it than I’d have thought.
                          Thanks for the insight.
