Regex single dot character in group behaves differently than not in group

Matthews Dylan

Regex1: ^.*$

Regex2: ^(.)*$

Input line: 💦

Regex2 does not match the input line, but Regex1 does. I have a slightly more complex regex based on Regex2, where I cannot omit the parentheses, and I want it to match. Am I making some mistake, or is there a workaround?

I am basically trying to replace lines that do not contain something, but it fails and keeps lines with emojis. This is what I use: ^((?!word).)*$ based on SO answer from here

guy038

Hello, @matthews-dylan

Allow me some hours to elaborate a correct reply to your problem, which is really not easy, as it involves notions such as UTF-8 encoding, Unicode surrogates, Notepad++ encodings, regex engine handling of characters and, of course, fonts !

See you later,

guy038

guy038

Hi, @matthews-dylan and All,

I apologize for my very late reply, but I needed to do numerous verifications and tests ! I’m going to start with some general topics, and, then, I’ll come back to your specific problem to tell you why your second regex ^(.)*$ matches empty lines only and I’ll give you a solution in order to delete any line which does not contain any Emoji character. Take your time and have a drink : this post is quite long ;-))

First, I would say that most of the monospaced fonts, used in code editors, can display the glyphs of traditional characters only ! So, you need to get a more robust font, which can display most of the Unicode symbols properly ;-))

So, refer to the last section of my other post, below :

https://community.notepad-plus-plus.org/post/50673

Now, after pasting the input line of your post, with my current N++ Courier New font, I get the line, below, where your character, not handled with that font, is simply replaced with a small white square box :

Input line: □

To get information on that character, refer, again, to the last section of this other post, which speaks about a very handy on-line UTF-8 tool :

https://community.notepad-plus-plus.org/post/50983

With the help of this tool, we deduce that your special char has the following characteristics :

Character name                           SPLASHING SWEAT SYMBOL

Hex code point                           1F4A6
Decimal code point                       128166

Hex UTF-8 bytes                          F0 9F 92 A6
Octal UTF-8 bytes                        360 237 222 246

UTF-8 bytes as Latin-1 characters bytes  ð <9F> <92> ¦

Hex UTF-16 Surrogates                    D83D DCA6
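
For anyone who wants to cross-check these values without the on-line tool, here is a minimal Python sketch. It is not related to Notepad++ itself; the helper name `describe` is purely illustrative, and it relies only on the standard `unicodedata` module and the built-in `utf-8` and `utf-16-be` codecs.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Return the name, code point, UTF-8 bytes and UTF-16 code units of one character.

    `describe` is a hypothetical helper written for this post only.
    """
    cp = ord(ch)
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")
    # Split the UTF-16 byte string into 16-bit code units (big-endian)
    units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
    return {
        "name": unicodedata.name(ch),
        "hex_code_point": f"{cp:X}",
        "decimal_code_point": cp,
        "utf8_hex": " ".join(f"{b:02X}" for b in utf8),
        "utf8_octal": " ".join(f"{b:03o}" for b in utf8),
        "utf16_units": " ".join(f"{u:04X}" for u in units),
    }


info = describe("\U0001F4A6")
for key, value in info.items():
    print(f"{key:20} {value}")
```

The two UTF-16 code units reported for a character above \x{FFFF} are its high and low surrogates.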

Refer to the link, below, to see all the characters of the Unicode Miscellaneous Symbols and Pictographs block :

http://www.unicode.org/charts/PDF/U1F300.pdf

Note that the Unicode code-point of this character is 1F4A6, which is beyond the first 65536 characters of the Basic Multilingual Plane ( BMP ). Therefore, this means that :

It is correctly encoded in a UTF-8 encoded file. So, you must use the N++ UTF-8 or UTF-8 BOM encodings, which can handle all Unicode characters, from \x{0000} to \x{10FFFF}
It cannot be inserted in an ANSI encoded file, which handles 256 characters, only, from \x{00} to \x{FF}
It cannot be inserted in a N++ UCS-2 BE BOM or UCS-2 LE BOM encoded file, which can handle only the 65536 characters of the BMP, from \x{0000} to \x{FFFF}

Moreover, as the code-point of your character is over \x{FFFF} :

It cannot be represented with the regex syntax \x{1F4A6}, due to a bug of the present Boost regex engine, which does not handle all characters in true 32-bit encoding :-(( Indeed, searching for \x{1F4A6} results in the error message Find: Invalid regular expression
The simple regex dot symbol . cannot match a character, with Unicode code-point > \x{FFFF}, too !

Luckily, if you paste your character in the Find what: zone, it does find all occurrences of the SPLASHING SWEAT SYMBOL character !

Now, the surrogates mechanism allows the UTF-16 encoding ( not used in Notepad++ ) to be able to code all characters with code-point over \x{FFFF}. Refer below :

https://en.wikipedia.org/wiki/UTF-16#Description

And I found out that if I write a regex, involving the surrogates pair ( two 16-bit units ) of a character, which is over the BMP, the regex engine is able to match this character. For instance, as the surrogates pair of your character is D83D DCA6, the regex \x{D83D}\x{DCA6} does find all occurrences of your SPLASHING SWEAT SYMBOL character !

I’ve done a lot of tests and, unfortunately, using a similar syntax, to get any char, with code over \x{FFFF}, most of the regexes do not work.

Indeed, as the high 16-bits surrogate belongs to the [\x{D800}-\x{DBFF}] range and the low 16-bits surrogate belongs to the [\x{DC00}-\x{DFFF}] range :

The regex [\x{D800}-\x{DBFF}][\x{DC00}-\x{DFFF}] does not find any match
The regex [\x{D800}-\x{DBFF}]\x{DCA6} does not find any match, too
Luckily, the regex \x{D83D}[\x{DC00}-\x{DFFF}] does match your special 💦 character :-))

So, in summary, because of the wrong handling of characters, in the present implementation of the Boost Regex library, within Notepad++ :

To match any standard character, from \x{0000} to \x{FFFF} ( NOT EOL chars and the Form Feed char \x0c ), use the simple regex .
To match any standard character from \x{10000} to \x{10FFFF}, use the regex .[\x{DC00}-\x{DFFF}] OR the shorter syntax ..
To match all standard characters, from \x{0000} to \x{10FFFF}, use the regex .[\x{DC00}-\x{DFFF}]? OR the shorter syntax ..?

And :

To match a specific character of the BMP, from \x{0000} to \x{FFFF}, use the regex syntax \x{....}, with four hexadecimal digits
To match a specific character over the BMP, from \x{10000} to \x{10FFFF}, use the high and low surrogates equivalent pair, with the regex syntax \x{<high>}\x{<low>}, replacing the <high> and <low> values with their exact hexadecimal values, each written with four hexadecimal digits
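
The surrogates pair can also be computed by hand, with the standard UTF-16 arithmetic : subtract 0x10000 from the code-point, then the top 10 bits go to the high surrogate and the bottom 10 bits go to the low surrogate. Below is a small Python sketch of that calculation, which builds the search string to type in the Find what: zone. The function names `surrogate_pair` and `npp_regex_for` are hypothetical helpers, not anything provided by Notepad++ or Boost.

```python
def surrogate_pair(cp: int) -> tuple:
    """Return the (high, low) UTF-16 surrogates of a code point above 0xFFFF."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points from 0x10000 to 0x10FFFF have surrogates")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def npp_regex_for(cp: int) -> str:
    """Build the \\x{....} search syntax described above (hypothetical helper)."""
    if cp <= 0xFFFF:
        return f"\\x{{{cp:04X}}}"
    high, low = surrogate_pair(cp)
    return f"\\x{{{high:04X}}}\\x{{{low:04X}}}"


print(npp_regex_for(0x0041))    # a BMP character
print(npp_regex_for(0x1F4A6))   # SPLASHING SWEAT SYMBOL
```

For \x{1F4A6}, this gives the same D83D DCA6 pair as the on-line tool.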

First example :


From the list of chars, below :

    •----------------------------------•------------•-------•-------------------------•-------------------•--------------------------•
    |       Character NAME             | Code-Point | Char  | In a UTF-8 encoded file | Hex-16 Surrogates |       SEARCH Regex       |
    •----------------------------------•------------•-------•-------------------------•-------------------•--------------------------•
    | LATIN CAPITAL LETTER A           |    0041    |   A   | 41                      |        N/A        | \x{0041}          or  .  |
    | MATHEMATICAL BOLD CAPITAL A      |   1D400    |   𝐀   | F0 9D 90 80             |    D835 + DC00    | \x{D835}\x{DC00}  or  .. |
    | COMBINING GRAVE ACCENT BELOW     |    0316    |   ̖   | CC 96                    |        N/A        | \x{0316}          or  .  |
    | COMBINING LEFT ANGLE ABOVE       |    031A    |   ̚   | CC 9A                    |        N/A        | \x{031A}          or  .  |
    | MUSICAL SYMBOL COMBINING MARCATO |   1D17F    |   𝅿   | F0 9D 85 BF              |    D834 + DD7F    | \x{D834}\x{DD7F}  or  .. |
    •----------------------------------•------------•-------•-------------------------•-------------------•--------------------------•

We may build up some COMPOSED characters, as below :

    •-----------------------•-------•-------------------------•----------------------------•--------------------------------------------•
    |  Code-Points          | Chars | In a UTF-8 encoded file |     Hex-16 Surrogates      |                SEARCH Regex                |
    •-----------------------•-------•-------------------------•----------------------------•--------------------------------------------•
    |  0041 +  031A         |   A̚   | 41 CC 9A                |           NO               | \x{0041}\x{031A}                  or  ..   |
    |  0041 + 1D17F         |   A𝅿   | 41 F0 9D 85 BF          | D834 + DD7F ( on 2nd char) | \x{0041}\x{D834}\x{DD7F}          or  ...  |
    | 1D400 +  031A         |   𝐀̚   | F0 9D 90 80 CC 9A       | D835 + DC00 ( on 1st char) | \x{D835}\x{DC00}\x{031A}          or  ...  |
    | 1D400 + 1D17F         |   𝐀𝅿   | F0 9D 90 80 F0 9D 85 BF | D835 + DC00 + D834 + DD7F  | \x{D835}\x{DC00}\x{D834}\x{DD7F}  or  .... |
    |  0041 + 1D17F +  031A |   A𝅿̚   | 41 F0 9D 85 BF CC 9A    | D834 + DD7F ( on 2nd char) | \x{0041}\x{D834}\x{DD7F}\x{031A}  or  .... |
    |  0041 +  031A + 1D17F |   A𝅿̚   | 41 CC 9A F0 9D 85 BF    | D834 + DD7F ( on 3rd char) | \x{0041}\x{031A}\x{D834}\x{DD7F}  or  .... |
    | 1D400 +  031A +  0316 |   𝐀̖̚   | F0 9D 90 80 CC 9A CC 96 | D835 + DC00 ( on 1st char) | \x{D835}\x{DC00}\x{031A}\x{0316}  or  .... |
    •-----------------------•-------•-------------------------•----------------------------•--------------------------------------------•

Second example: If we use any of the 3 following regex S/R :

SEARCH (?-s)^.+(.[\x{DC00}-\x{DFFF}]).+

or :

SEARCH (?-s)^.+\x20(..)\x20.+

or :

SEARCH (?-s)^.+(\x{D83D}\x{DCA6}).+

and :

REPLACE A necklace of the SPLASHING SWEAT SYMBOL ––\1––\1––\1––\1––\1––\1––\1––\1––\1––

against the text This is the 💦 character, placed at the beginning of a line, we get the resulting text :

A necklace of the SPLASHING SWEAT SYMBOL ––💦––💦––💦––💦––💦––💦––💦––💦––💦––

Now, let’s go back to your problem :

Fundamentally, the problem arises because, with our present regex engine, your special 💦 character can only be matched with the regex .. It looks like, for these characters, the regex engine doesn’t see the character itself, but the two surrogate 16-bit code units !

When you process the regex ^.*$ against your text : Input line: 💦, it does match the entire line, as the regex syntax .* means any number of chars ( . or .. or ..., and so on )

Now, let’s consider the following regex syntaxes, with a capturing group 1, against this 4-lines text, pasted in a new tab :


💦

Input line: 💦

Note that the 1st and 3rd line are empty, the 2nd line contains your 💦 special char, only and the 4th line ends with that special char

Regarding the following regex examples, below, you may test them, using the -->\1<-- Replace zone

First, a quick reminder :

The INPUT text :

167844894321
16784
4566499

with the regex S/R :

SEARCH (\d)+

REPLACE -->\1<--

would result in :

-->1<--
-->4<--
-->9<--

As you can see, group 1 always contains the last stored value of the group. So, the regex could also have been rewritten as \d*(\d)
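
This "last stored value" behaviour of a repeated group is not specific to Notepad++. As a rough illustration, here is the same S/R run with Python's `re` module, which is a different engine from Boost but treats a quantified capturing group the same way. The second pattern, `\d*(\d)`, is only added here for comparison : it captures the final digit explicitly.

```python
import re

text = "167844894321\n16784\n4566499"

# A quantified group keeps only the value of its last iteration
repeated_group = re.sub(r"(\d)+", r"-->\1<--", text)

# Capturing the final digit explicitly gives the same replacement
explicit_last = re.sub(r"\d*(\d)", r"-->\1<--", text)

print(repeated_group)
print(repeated_group == explicit_last)
```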

The regex ^(.)$ cannot find anything, as no character, with code <= \x{FFFF}, exists between beginning and end of line
The regex ^(..)$ does find, in line 2, your 💦 special character, with code > \x{FFFF}, between beginning and end of line
Your regex ^(.)*$ simply matches the true empty lines 1 and 3. WHY ?
Well, as the group contains only one dot ., it cannot match your last 💦 special character, in line 2 and 4, which needs to be considered as a pseudo two-chars entity. So the overall regex fails, in these lines !
The regex ^(..)*$ does match all the lines of the subject text, because, luckily, the part Input line:, followed with a space char, is exactly 12 chars long, so an even number ! And the last value of group 1 is your 2-chars 💦 special char, right before the end of the line

Notes :

The regex ^.*(..)$ would match all the non-empty lines 2 and 4, because group 1, .., represents your 💦 special char, ending these lines
And the regex ^(?:..){6}(..)$ would match the line 4, only
The regex ^.............(.)$ does not work properly, because group 1 does not contain the whole 💦 special character ( See the result after the replacement ! )
On the contrary, the regex ^............(..)$ does find all contents of line 4, as the group 1, .., contains, exactly, the 💦 special character
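
The counts used in these examples can be checked with a short Python sketch. It assumes, as the tests above suggest, that the N++ regex engine consumes one 16-bit UTF-16 code unit per dot ; the helper name `utf16_units` is illustrative only.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit UTF-16 code units needed for `text` (hypothetical helper)."""
    return len(text.encode("utf-16-le")) // 2


prefix = "Input line: "      # the part before the special character, in line 4
drop = "\U0001F4A6"          # SPLASHING SWEAT SYMBOL

print(utf16_units(prefix))          # units consumed by the 12 dots
print(utf16_units(drop))            # units consumed by the group ..
print(utf16_units(prefix + drop))   # total units in line 4
print(len(prefix + drop))           # number of real characters ( code points )
```

Under that assumption, line 4 holds 14 units for only 13 real characters, which is why an even number of dots is needed to reach the end of the line.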

On the other hand :

The regex ^(.)* selects as many standard characters, with code-point <= \x{FFFF}, as possible. So it matches the following strings, but NOT your LAST 💦 special character !
- The null string before your 💦 special char, in line 2
- The string Input line:, followed with a space char, in line 4

And, finally :

The two regexes (.*)$ and (.*), with group 1 selecting all line contents, would match the four lines

Now, your last goal : let’s suppose that you would like to delete any line which does not contain any Unicode Emoji character :

First, from that link :

http://www.unicode.org/charts/PDF/U1F600.pdf

We learn that the Unicode Emoticons block has code-points between \x{1F600} and \x{1F64F}

With the on-line UTF-8 tool, we verify that the two Hex UTF-16 surrogates are :
- D83D DE00, for the \x{1F600} emoticon
- D83D DE4F, for the \x{1F64F} emoticon

So, we should match all the characters of the Unicode Emoticons block, with the search regex :

SEARCH \x{D83D}[\x{DE00}-\x{DE4F}]

And, yes, it does work as expected. In that case, deleting any non-empty line which does not contain any Emoticon character(s) is easy with the following regex S/R :

SEARCH (?-s)^(?!.*\x{D83D}[\x{DE00}-\x{DE4F}]).+\R

REPLACE Leave EMPTY

In contrast, the regex S/R :

SEARCH (?-s)^(?=.*\x{D83D}[\x{DE00}-\x{DE4F}]).+\R

REPLACE Leave EMPTY

would delete any non-empty line containing one or more emoticon character(s) !
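
For comparison, here is how the first S/R could look in an engine which handles code-points above \x{FFFF} natively. This is only a sketch in Python, whose `re` module accepts the plain range of the Emoticons block in a character class ; the function name `delete_lines_without_emoticons` is hypothetical. Like the N++ regex, it leaves empty lines alone.

```python
import re

# The Emoticons block, written directly as a code-point range
EMOTICON = re.compile("[\U0001F600-\U0001F64F]")


def delete_lines_without_emoticons(text: str) -> str:
    """Drop every non-empty line that contains no character of the Emoticons block."""
    kept = [line for line in text.splitlines()
            if line == "" or EMOTICON.search(line)]
    return "\n".join(kept)


sample = "hello\n\nsmile \U0001F600\nInput line: \U0001F4A6"
cleaned = delete_lines_without_emoticons(sample)
print(cleaned)
```

Note that the line ending with 💦 is deleted too, since \x{1F4A6} belongs to the Miscellaneous Symbols and Pictographs block and not to the Emoticons block.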

Not asleep yet ? That’s good news :-))

Best Regards,

guy038

P.S. :

Let’s suppose that, instead of the small Unicode Emoticons block, containing 80 characters, we would like to search for any character belonging to the Unicode Miscellaneous Symbols and Pictographs block, which contains 768 characters, including your special 💦 char

Right now, it’s getting really inextricable ! The Unicode range of that block is from \x{1F300} to \x{1F5FF}, but, because of the surrogates mechanism, it must be split in two parts :

The range of chars between \x{1F300} and \x{1F3FF}, so with surrogates pairs D83C DF00 to D83C DFFF
The range of chars between \x{1F400} and \x{1F5FF}, so with surrogates pairs D83D DC00 to D83D DDFF

Therefore, the correct regex to match all the characters of this block is, indeed :

\x{D83C}[\x{DF00}-\x{DFFF}]|\x{D83D}[\x{DC00}-\x{DDFF}]

with an alternative between two regexes, in order to match each subset !

I confirm that this regex does find the 768 characters of the Unicode Miscellaneous Symbols and Pictographs block, with code-point over \x{FFFF} !
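
The splitting can be automated. The Python sketch below walks through each high surrogate covered by a code-point range and emits one alternative per high surrogate, always writing the high surrogate as a literal \x{....} value, since the tests above show that a character class on the high surrogate does not match. The name `npp_range_regex` is a hypothetical helper and the output is only meant to be pasted in the Find what: zone.

```python
def surrogate_pair(cp: int) -> tuple:
    """Return the (high, low) UTF-16 surrogates of a code point above 0xFFFF."""
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def npp_range_regex(first: int, last: int) -> str:
    """Build one alternative per high surrogate for the range first..last."""
    if not 0x10000 <= first <= last <= 0x10FFFF:
        raise ValueError("expected a range inside 0x10000..0x10FFFF")
    hi_first, lo_first = surrogate_pair(first)
    hi_last, lo_last = surrogate_pair(last)
    parts = []
    for hi in range(hi_first, hi_last + 1):
        # Only the first and last high surrogates have a partial low range
        lo_start = lo_first if hi == hi_first else 0xDC00
        lo_end = lo_last if hi == hi_last else 0xDFFF
        if lo_start == lo_end:
            parts.append(f"\\x{{{hi:04X}}}\\x{{{lo_start:04X}}}")
        else:
            parts.append(f"\\x{{{hi:04X}}}[\\x{{{lo_start:04X}}}-\\x{{{lo_end:04X}}}]")
    return "|".join(parts)


print(npp_range_regex(0x1F300, 0x1F5FF))   # Miscellaneous Symbols and Pictographs
print(npp_range_regex(0x1F600, 0x1F64F))   # Emoticons
```

For a range spanning many high surrogates, the result gets long, as each high surrogate covers only 1024 characters.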

It’s really a pity that the N++ regex engine does not correctly handle all the characters outside the BMP. If it did, we would simply have to use the classical [\x{1F300}-\x{1F5FF}] character class !!