<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Regex single dot character in group behaves differently than not in group]]></title><description><![CDATA[<p dir="auto">Regex1: ^.*$</p>
<p dir="auto">Regex2: ^(.)*$</p>
<p dir="auto">Input line: 💦</p>
<p dir="auto">Regex2 does not match the input line, but Regex1 does. I have a somewhat more complex regex based on Regex2, where I cannot omit the parentheses, and I want it to match. Am I making a mistake, or is there a workaround?</p>
<p dir="auto">I am basically trying to replace lines that do not contain something, but it fails and keeps lines with emojis. This is what I use: ^((?!word).)*$ based on the SO answer from <a href="https://stackoverflow.com/questions/406230/regular-expression-to-match-a-line-that-doesnt-contain-a-word" rel="nofollow ugc">here</a></p>
]]></description><link>https://community.notepad-plus-plus.org/topic/18975/regex-single-dot-character-in-group-behaves-differently-than-not-in-group</link><generator>RSS for Node</generator><lastBuildDate>Sun, 10 May 2026 04:36:33 GMT</lastBuildDate><atom:link href="https://community.notepad-plus-plus.org/topic/18975.rss" rel="self" type="application/rss+xml"/><pubDate>Wed, 26 Feb 2020 21:00:00 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to Regex single dot character in group behaves differently than not in group on Mon, 02 Mar 2020 22:32:31 GMT]]></title><description><![CDATA[<p dir="auto">Hi, <a class="plugin-mentions-user plugin-mentions-a" href="/user/matthews-dylan" aria-label="Profile: matthews-dylan">@<bdi>matthews-dylan</bdi></a> and <strong>All</strong>,</p>
<p dir="auto">I apologize for my <strong>very late</strong> reply, but I needed to do <strong>numerous</strong> verifications and tests ! I’m going to start with some <strong>general</strong> topics, and, then, I’ll come back to your <strong>specific</strong> problem to tell you why your <strong>second</strong> regex <strong><code>^(.)*$</code></strong> matches <strong>empty</strong> lines <strong>only</strong> and I’ll give you a <strong>solution</strong> in order to <strong>delete</strong> any line which does <strong>not</strong> contain any <strong>Emoji</strong> character. Take your time and have a <strong>drink</strong> : this post is quite <strong>long</strong> ;-))</p>
<hr />
<p dir="auto">First, I would say that most of the <strong>monospaced</strong> fonts used in <strong>code</strong> editors can display the glyphs of <strong>traditional</strong> characters only ! So, you need a more <strong>complete</strong> font, which can display most <strong>Unicode</strong> symbols properly ;-))</p>
<p dir="auto">So, refer to the <strong>last</strong> section of my other post, below :</p>
<p dir="auto"><a href="https://community.notepad-plus-plus.org/post/50673">https://community.notepad-plus-plus.org/post/50673</a></p>
<hr />
<p dir="auto">Now, after pasting the <strong>input line</strong> of your post, with my current N++ <strong><code>Courier New</code></strong> font, I get the line, below, where your character, <strong>not</strong> handled by that font, is simply <strong>replaced</strong> with a small <strong>white square</strong> box :</p>
<p dir="auto"><strong>Input line: □</strong></p>
<p dir="auto">To get <strong>information</strong> on that character, refer, again, to the <strong>last</strong> section of this <strong>other</strong> post, which describes a <strong>very handy</strong> on-line <strong><code>UTF-8</code></strong> tool :</p>
<p dir="auto"><a href="https://community.notepad-plus-plus.org/post/50983">https://community.notepad-plus-plus.org/post/50983</a></p>
<p dir="auto">With the help of this tool, we deduce that your <strong>special</strong> char has the following characteristics :</p>
<pre><code class="language-z">Character name                           SPLASHING SWEAT SYMBOL

Hex code point                           1F4A6
Decimal code point                       128166

Hex UTF-8 bytes                          F0 9F 92 A6
Octal UTF-8 bytes                        360 237 222 246

UTF-8 bytes as Latin-1 characters bytes  ð &lt;9F&gt; &lt;92&gt; ¦

Hex UTF-16 Surrogates                    D83D DCA6
</code></pre>
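<p dir="auto">As a cross-check that does not depend on the on-line tool, the same values can be recomputed with a few lines of Python. This is only a sketch : Python is <strong>not</strong> part of Notepad++, and the variable names are purely illustrative.</p>

```python
import unicodedata

ch = "\U0001F4A6"  # the character pasted in the question

name = unicodedata.name(ch)
code_point = ord(ch)
utf8_bytes = ch.encode("utf-8")
utf8_hex = " ".join(f"{b:02X}" for b in utf8_bytes)
utf8_oct = " ".join(f"{b:03o}" for b in utf8_bytes)

# Encoding to UTF-16 (big-endian, no BOM) exposes the two surrogates.
utf16 = ch.encode("utf-16-be")
surrogates = " ".join(utf16[i:i + 2].hex().upper() for i in range(0, len(utf16), 2))

print(name)
print(f"{code_point:X}", code_point)
print(utf8_hex)
print(utf8_oct)
print(surrogates)
```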
<p dir="auto">Refer to the link, below, to see <strong>all</strong> the characters of the <strong>Unicode</strong> <strong><code>Miscellaneous Symbols and Pictographs</code></strong> block :</p>
<p dir="auto"><a href="http://www.unicode.org/charts/PDF/U1F300.pdf" rel="nofollow ugc">http://www.unicode.org/charts/PDF/U1F300.pdf</a></p>
<p dir="auto">Note that the <strong>Unicode</strong> code-point of this character is <strong><code>1F4A6</code></strong>, which lies <strong>beyond</strong> the first <strong><code>65536</code></strong> code-points, i.e. outside the <strong>Basic Multilingual Plane</strong> ( <strong><code>BMP</code></strong> ). Therefore, this means that :</p>
<ul>
<li>
<p dir="auto">It is <strong>correctly</strong> encoded in a <strong><code>UTF-8</code></strong> encoded file. So, you <strong>must</strong> use the N++ <strong><code>UTF-8</code></strong> or <strong><code>UTF-8 BOM</code></strong> encodings, which can handle <strong>all Unicode</strong> characters, from <strong><code>\x{0000}</code></strong> to <strong><code>\x{10FFFF}</code></strong></p>
</li>
<li>
<p dir="auto">It <strong>cannot</strong> be inserted in an <strong><code>ANSI</code></strong> encoded file, which handles <strong><code>256</code></strong> characters only, from <strong><code>\x{00}</code></strong> to <strong><code>\x{FF}</code></strong></p>
</li>
<li>
<p dir="auto">It <strong>cannot</strong> be inserted in a N++ <strong><code>UCS-2 BE BOM</code></strong> or <strong><code>UCS-2 LE BOM</code></strong> encoded file, which can handle <strong>only</strong> the <strong><code>65536</code></strong> characters of the <strong>BMP</strong>, from <strong><code>\x{0000}</code></strong> to <strong><code>\x{FFFF}</code></strong></p>
</li>
</ul>
<hr />
<p dir="auto">Moreover, as the <strong>code-point</strong> of your character is over <strong><code>\x{FFFF}</code></strong> :</p>
<ul>
<li>
<p dir="auto">It <strong>cannot</strong> be represented with the regex syntax <strong><code>\x{1F4A6}</code></strong>, due to a bug of the present <strong>Boost</strong> regex engine, which does <strong>not</strong> handle all characters in true <strong><code>32-bit</code></strong> encoding :-(( Searching for <strong><code>\x{1F4A6}</code></strong> results in the <strong>error</strong> message <strong><code>Find: Invalid regular expression</code></strong></p>
</li>
<li>
<p dir="auto">The simple regex <strong>dot</strong> symbol <strong><code>.</code></strong> <strong>cannot</strong> match a character with <strong>Unicode</strong> code-point <strong><code>&gt; \x{FFFF}</code></strong>, either !</p>
</li>
</ul>
<p dir="auto"><strong>Luckily</strong>, if you paste your character in the <strong><code>Find what:</code></strong> zone, it <strong>does</strong> find <strong>all</strong> occurrences of the <strong><code>SPLASHING SWEAT SYMBOL</code></strong> character !</p>
<hr />
<p dir="auto">Now, the <strong>surrogates</strong> mechanism allows the <strong><code>UTF-16</code></strong> encoding ( <strong>not</strong> used in Notepad++ )  to be able to code <strong>all</strong> characters with code-point <strong>over</strong> <strong><code>\x{FFFF}</code></strong>. Refer below :</p>
<p dir="auto"><a href="https://en.wikipedia.org/wiki/UTF-16#Description" rel="nofollow ugc">https://en.wikipedia.org/wiki/UTF-16#Description</a></p>
<p dir="auto">And I found out that if I write a regex involving the <strong>surrogates pair</strong> ( 2 <strong><code>16-bit</code></strong> units ) of a character which is <strong>over</strong> the <strong><code>BMP</code></strong>, the regex engine is able to <strong>match</strong> this character. For instance, as the <strong>surrogates</strong> pair of your character is : <strong><code>D83D DCA6</code></strong>, the regex <strong><code>\x{D83D}\x{DCA6}</code></strong> does find <strong>all</strong> occurrences of your <strong><code>SPLASHING SWEAT SYMBOL</code></strong> character !</p>
<p dir="auto">I’ve done a <strong>lot</strong> of tests and, unfortunately, when a <strong>similar</strong> syntax is used to match <strong>any</strong> char with code <strong>over</strong> <strong><code>\x{FFFF}</code></strong>, most of the regexes do <strong>not</strong> work.</p>
<p dir="auto">Indeed, as the <strong>high</strong> <strong><code>16-bits</code></strong> <strong>surrogate</strong> belongs to the <strong><code>[\x{D800}-\x{DBFF}]</code></strong> range and the <strong>low</strong> <strong><code>16-bits</code></strong> <strong>surrogate</strong> belongs to the <strong><code>[\x{DC00}-\x{DFFF}]</code></strong> range :</p>
<ul>
<li>
<p dir="auto">The regex <strong><code>[\x{D800}-\x{DBFF}][\x{DC00}-\x{DFFF}]</code></strong> does <strong>not</strong> find any match</p>
</li>
<li>
<p dir="auto">The regex <strong><code>[\x{D800}-\x{DBFF}]\x{DCA6}</code></strong> does <strong>not</strong> find any match, either</p>
</li>
<li>
<p dir="auto">Luckily, the regex <strong><code>\x{D83D}[\x{DC00}-\x{DFFF}]</code></strong> does match your <strong>special 💦</strong> character :-))</p>
</li>
</ul>
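<p dir="auto">The two surrogate ranges quoted above follow directly from the <strong><code>UTF-16</code></strong> arithmetic. A minimal Python sketch of that arithmetic is shown below; the function name <strong><code>surrogate_pair</code></strong> is hypothetical and the code only illustrates the calculation, <strong>not</strong> what the Boost engine does internally.</p>

```python
def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates of a code point above the BMP."""
    if code_point not in range(0x10000, 0x110000):
        raise ValueError("only code points from 10000 to 10FFFF have surrogates")
    # The 20-bit offset is split into two 10-bit halves (0x400 = 2**10).
    high_offset, low_offset = divmod(code_point - 0x10000, 0x400)
    return 0xD800 + high_offset, 0xDC00 + low_offset


high, low = surrogate_pair(0x1F4A6)
print(f"{high:04X} {low:04X}")
```

<p dir="auto">Feeding it the two extreme code-points <strong><code>10000</code></strong> and <strong><code>10FFFF</code></strong> gives <strong><code>D800 DC00</code></strong> and <strong><code>DBFF DFFF</code></strong>, which are the bounds of the two ranges.</p>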
<hr />
<p dir="auto">So, in <strong>summary</strong>, because of the <strong>wrong</strong> handling of characters, in the <strong>present</strong> implementation of the <strong>Boost Regex</strong> library, within Notepad++ :</p>
<ul>
<li>
<p dir="auto">To match any <strong>standard</strong> character, from <strong><code>\x{0000}</code></strong> to <strong><code>\x{FFFF}</code></strong> ( <em>NOT</em> <strong>EOL</strong> chars and the <strong>Form Feed</strong> char <strong><code>\x0c</code></strong> ), use the simple regex <strong><code>.</code></strong></p>
</li>
<li>
<p dir="auto">To match any <strong>standard</strong> character from <strong><code>\x{10000}</code></strong> to <strong><code>\x{10FFFF}</code></strong>, use the regex <strong><code>.[\x{DC00}-\x{DFFF}]</code></strong> OR the shorter syntax <strong><code>..</code></strong></p>
</li>
<li>
<p dir="auto">To match <strong>all standard</strong> characters, from <strong><code>\x{0000}</code></strong> to <strong><code>\x{10FFFF}</code></strong>, use the regex <strong><code>.[\x{DC00}-\x{DFFF}]?</code></strong>  OR the shorter syntax <strong><code>..?</code></strong></p>
</li>
</ul>
<p dir="auto">And :</p>
<ul>
<li>
<p dir="auto">To match a <strong>specific</strong> character of the <strong>BMP</strong>, from <strong><code>\x{0000}</code></strong> to <strong><code>\x{FFFF}</code></strong> use the regex syntax <strong><code>\x{....}</code></strong>, with <strong>four hexadecimal</strong> digits</p>
</li>
<li>
<p dir="auto">To match a <strong>specific</strong> character over the <strong>BMP</strong>, from <strong><code>\x{10000}</code></strong> to <strong><code>\x{10FFFF}</code></strong>, use the <strong>high</strong> and <strong>low</strong> surrogates equivalent <strong>pair</strong>, with the regex syntax <strong><code>\x{&lt;high&gt;}\x{&lt;low&gt;}</code></strong>, replacing the <em>&lt;high&gt;</em> and <em>&lt;low&gt;</em> values with their exact <strong>hexadecimal</strong> values, using <strong><code>4</code></strong> <strong>hexadecimal</strong> digits</p>
</li>
</ul>
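<p dir="auto">The two rules above can be sketched as a small Python helper that spells out a string as one <strong><code>\x{....}</code></strong> escape per <strong><code>UTF-16</code></strong> code unit. The names <strong><code>utf16_units</code></strong> and <strong><code>search_regex</code></strong> are hypothetical; the output is simply text to paste into the <strong><code>Find what:</code></strong> zone.</p>

```python
def utf16_units(text):
    """Return the list of 16-bit UTF-16 code units of text."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def search_regex(text):
    # One \x{....} escape, with four hexadecimal digits, per code unit.
    return "".join(f"\\x{{{unit:04X}}}" for unit in utf16_units(text))


print(search_regex("A"))                 # a BMP character: one escape
print(search_regex("\U0001F4A6"))        # over the BMP: high + low surrogate
print(search_regex("\U0001D400\u031A"))  # a composed character
```

<p dir="auto">The number of code units returned by <strong><code>utf16_units</code></strong> is also the number of <strong>dots</strong> needed by the shorter syntax.</p>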
<hr />
<p dir="auto"><strong>First</strong> example :</p>
<pre><code class="language-z">
From the list of chars, below :

    •----------------------------------•------------•-------•-------------------------•-------------------•--------------------------•
    |       Character NAME             | Code-Point | Char  | In a UTF-8 encoded file | Hex-16 Surrogates |       SEARCH Regex       |
    •----------------------------------•------------•-------•-------------------------•-------------------•--------------------------•
    | LATIN CAPITAL LETTER A           |    0041    |   A   | 41                      |        N/A        | \x{0041}          or  .  |
    | MATHEMATICAL BOLD CAPITAL A      |   1D400    |   𝐀   | F0 9D 90 80             |    D835 + DC00    | \x{D835}\x{DC00}  or  .. |
    | COMBINING GRAVE ACCENT BELOW     |    0316    |   ̖   | CC 96                    |        N/A        | \x{0316}          or  .  |
    | COMBINING LEFT ANGLE ABOVE       |    031A    |   ̚   | CC 9A                    |        N/A        | \x{031A}          or  .  |
    | MUSICAL SYMBOL COMBINING MARCATO |   1D17F    |   𝅿   | F0 9D 85 BF              |    D834 + DD7F    | \x{D834}\x{DD7F}  or  .. |
    •----------------------------------•------------•-------•-------------------------•-------------------•--------------------------•

We may build up some COMPOSED characters, as below :

    •-----------------------•-------•-------------------------•----------------------------•--------------------------------------------•
    |  Code-Points          | Chars | In a UTF-8 encoded file |     Hex-16 Surrogates      |                SEARCH Regex                |
    •-----------------------•-------•-------------------------•----------------------------•--------------------------------------------•
    |  0041 +  031A         |   A̚   | 41 CC 9A                |           NO               | \x{0041}\x{031A}                  or  ..   |
    |  0041 + 1D17F         |   A𝅿   | 41 F0 9D 85 BF          | D834 + DD7F ( on 2nd char) | \x{0041}\x{D834}\x{DD7F}          or  ...  |
    | 1D400 +  031A         |   𝐀̚   | F0 9D 90 80 CC 9A       | D835 + DC00 ( on 1st char) | \x{D835}\x{DC00}\x{031A}          or  ...  |
    | 1D400 + 1D17F         |   𝐀𝅿   | F0 9D 90 80 F0 9D 85 BF | D835 + DC00 + D834 + DD7F  | \x{D835}\x{DC00}\x{D834}\x{DD7F}  or  .... |
    |  0041 + 1D17F +  031A |   A𝅿̚   | 41 F0 9D 85 BF CC 9A    | D834 + DD7F ( on 2nd char) | \x{0041}\x{D834}\x{DD7F}\x{031A}  or  .... |
    |  0041 +  031A + 1D17F |   A𝅿̚   | 41 CC 9A F0 9D 85 BF    | D834 + DD7F ( on 3rd char) | \x{0041}\x{031A}\x{D834}\x{DD7F}  or  .... |
    | 1D400 +  031A +  0316 |   𝐀̖̚   | F0 9D 90 80 CC 9A CC 96 | D835 + DC00 ( on 1st char) | \x{D835}\x{DC00}\x{031A}\x{0316}  or  .... |
    •-----------------------•-------•-------------------------•----------------------------•--------------------------------------------•
</code></pre>
<p dir="auto"><strong>Second</strong> example: If we use <strong>any</strong> of the <strong><code>3</code></strong> following <strong>regex</strong> S/R :</p>
<p dir="auto">SEARCH <strong><code>(?-s)^.+(.[\x{DC00}-\x{DFFF}]).+</code></strong></p>
<p dir="auto">or :</p>
<p dir="auto">SEARCH <strong><code>(?-s)^.+\x20(..)\x20.+</code></strong></p>
<p dir="auto">or :</p>
<p dir="auto">SEARCH <strong><code>(?-s)^.+(\x{D83D}\x{DCA6}).+</code></strong></p>
<p dir="auto">and :</p>
<p dir="auto">REPLACE <strong><code>A necklace of the SPLASHING SWEAT SYMBOL ––\1––\1––\1––\1––\1––\1––\1––\1––\1––</code></strong></p>
<p dir="auto">against the text <strong>This is the 💦 character</strong>, at the <strong>beginning</strong> of a line, we get the resulting text :</p>
<p dir="auto"><strong>A necklace of the SPLASHING SWEAT SYMBOL ––💦––💦––💦––💦––💦––💦––💦––💦––💦––</strong></p>
<hr />
<p dir="auto">Now, let’s go <strong>back</strong> to your problem :</p>
<p dir="auto">Fundamentally, the problem arises because your <strong>special 💦</strong> character can be matched with the regex <strong><code>..</code></strong>, <strong>only</strong>, with our <strong>present</strong> regex engine. It looks like, for these characters, the regex engine doesn’t see the character <strong>itself</strong>, but the <strong>two</strong> surrogate <strong><code>16-bit</code></strong> code units !</p>
<p dir="auto">When you process the regex <strong><code>^.*$</code></strong> against your text : <strong>Input line: 💦</strong>, it <strong>does</strong> match the <strong>entire</strong> line, as the regex syntax <strong><code>.*</code></strong> means <strong>any</strong> number of chars ( <strong><code>.</code></strong> or <strong><code>..</code></strong> or <strong><code>...</code></strong>, and so on )</p>
<p dir="auto">Now, let’s consider the <strong>following</strong> regex syntaxes, with a <strong>capturing</strong> group <strong><code>1</code></strong>, against this <strong>4-line</strong> text, pasted in a <strong>new</strong> tab :</p>
<pre><code class="language-diff">
💦

Input line: 💦
</code></pre>
<p dir="auto">Note that the <strong><code>1st</code></strong> and <strong><code>3rd</code></strong> lines are <strong>empty</strong>, the <strong><code>2nd</code></strong> line contains your <strong>💦 special</strong> char only, and the <strong><code>4th</code></strong> line <strong>ends</strong> with that <strong>special</strong> char</p>
<p dir="auto">Regarding the <strong>following</strong> regex examples, below, you may <strong>test</strong> them, using the <strong><code>--&gt;\1&lt;--</code></strong> <strong>Replace</strong> zone</p>
<p dir="auto">Before, a quick <strong>reminder</strong> :</p>
<pre><code class="language-z">The INPUT text :

167844894321
16784
4566499

with the regex S/R :

SEARCH (\d)+

REPLACE --&gt;\1&lt;--

would result in :

--&gt;1&lt;--
--&gt;4&lt;--
--&gt;9&lt;--
</code></pre>
<p dir="auto">As you can see, <strong>group <code>1</code></strong> always contains the <strong>last stored</strong> value of the group. So, the regex could also have been rewritten as <strong><code>\d*(\d)</code></strong></p>
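<p dir="auto">The same "last value wins" behaviour can be observed outside Notepad++. The sketch below uses Python’s <strong><code>re</code></strong> module, which is a <strong>different</strong> engine than Boost but treats a repeated capturing group the same way; the variable names are illustrative.</p>

```python
import re

lines = ["167844894321", "16784", "4566499"]

# A repeated group keeps only the value of its last iteration.
last_digits = [re.fullmatch(r"(\d)+", line).group(1) for line in lines]

# An equivalent spelling: consume the leading digits, capture the final one.
same_digits = [re.fullmatch(r"\d*(\d)", line).group(1) for line in lines]

print(last_digits)
print(same_digits)
```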
<hr />
<ul>
<li>
<p dir="auto">The regex <strong><code>^(.)$</code></strong> <strong>cannot</strong> find anything, as <strong>no</strong> character, with code <strong><code>&lt;= \x{FFFF}</code></strong>, exists between <strong>beginning</strong> and <strong>end</strong> of line</p>
</li>
<li>
<p dir="auto">The regex <strong><code>^(..)$</code></strong> does find, in line <strong><code>2</code></strong>, your <strong>💦 special</strong> character, with code <strong><code>&gt; \x{FFFF}</code></strong>, between <strong>beginning</strong> and <strong>end</strong> of line</p>
</li>
<li>
<p dir="auto"><strong>Your</strong> regex <strong><code>^(.)*$</code></strong> simply matches the true <strong>empty</strong> lines <strong><code>1</code></strong> and <strong><code>3</code></strong>. WHY ?<br />
Well, as the group contains only <strong>one</strong> dot <strong><code>.</code></strong>, it <strong>cannot</strong> match your last <strong>💦 special</strong> character, in lines <strong><code>2</code></strong> and <strong><code>4</code></strong>, which needs to be considered as a pseudo <strong>two-chars</strong> entity. So the <strong>overall</strong> regex fails on these lines !</p>
</li>
<li>
<p dir="auto">The regex <strong><code>^(..)*$</code></strong> does match <strong>all</strong> the lines of the subject text, because, luckily, the part <strong>Input line:</strong>, followed with a <strong>space</strong> char, is exactly <strong>12</strong> chars long, so an <strong>even</strong> number ! And the <strong>last</strong> value of <strong>group <code>1</code></strong> is your <strong><code>2-chars</code></strong> <strong>💦 special</strong> char, right <strong>before</strong> the end of the line</p>
</li>
</ul>
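<p dir="auto">The <strong>even</strong> / <strong>odd</strong> reasoning above can be checked by counting <strong><code>UTF-16</code></strong> code units instead of characters. The Python sketch below assumes, as these tests suggest, that the regex engine sees one "char" per <strong><code>16-bit</code></strong> unit; the helper name <strong><code>utf16_length</code></strong> is hypothetical.</p>

```python
def utf16_length(line):
    """Number of 16-bit UTF-16 code units needed to store line."""
    return len(line.encode("utf-16-le")) // 2


# The 4-line test text: two empty lines, the lone character, the input line.
lines = ["", "\U0001F4A6", "", "Input line: \U0001F4A6"]

code_points = [len(line) for line in lines]
code_units = [utf16_length(line) for line in lines]

print(code_points)
print(code_units)
```

<p dir="auto">Line <strong><code>4</code></strong> counts <strong><code>13</code></strong> code-points but <strong><code>14</code></strong> code units, i.e. <strong><code>12</code></strong> units for the prefix plus <strong><code>2</code></strong> for the surrogates, which is why a whole number of <strong><code>(..)</code></strong> groups fits.</p>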
<p dir="auto"><strong>Notes</strong> :</p>
<ul>
<li>
<p dir="auto">The regex <strong><code>^.*(..)$</code></strong> would match all the <strong>non</strong>-empty lines <strong><code>2</code></strong> and <strong><code>4</code></strong>, because <strong>group <code>1</code></strong>, <strong><code>..</code></strong>, represents your <strong>💦 special</strong> char, <strong>ending</strong> these lines</p>
</li>
<li>
<p dir="auto">And the regex <strong><code>^(?:..){6}(..)$</code></strong> would match the line <strong><code>4</code></strong>, <strong>only</strong></p>
</li>
<li>
<p dir="auto">The regex <strong><code>^.............(.)$</code></strong> does <strong>not</strong> work properly, because <strong>group<code>1</code></strong> does <strong>not</strong> contain the <strong>💦 special</strong> character ( See after the <strong>replacement</strong> ! )</p>
</li>
<li>
<p dir="auto">On the contrary, the regex <strong><code>^............(..)$</code></strong> <strong>does</strong> find <strong>all</strong> contents of line <strong><code>4</code></strong>, as the <strong>group <code>1</code></strong>, <strong><code>..</code></strong>, contains, exactly, the <strong>💦 special</strong> character</p>
</li>
</ul>
<p dir="auto">On the other hand :</p>
<ul>
<li>
<p dir="auto">The regex <strong><code>^(.)*</code></strong> selects as <strong>many</strong> standard characters, with code-point <strong><code>&lt;= \x{FFFF}</code></strong>, as possible. So it matches the following strings, but <em>NOT</em> your <em>LAST</em> <strong>💦 special</strong> character !</p>
<ul>
<li>
<p dir="auto">The <strong>null</strong> string <strong>before</strong> your <strong>💦 special</strong> char, in line <strong><code>2</code></strong></p>
</li>
<li>
<p dir="auto">The string <strong><code>Input line:</code></strong>, followed with a <strong>space</strong> char, in line <strong><code>4</code></strong></p>
</li>
</ul>
</li>
</ul>
<p dir="auto">And, finally :</p>
<ul>
<li>The <strong>two</strong> regexes <strong><code>(.*)$</code></strong> and <strong><code>(.*)</code></strong>, with  <strong>group <code>1</code></strong> selecting <strong>all</strong> line contents, would match the <strong>four</strong> lines</li>
</ul>
<hr />
<p dir="auto">Now, your <strong>last</strong> goal : let’s suppose that you would like to <strong>delete any</strong> line, which does <strong>not</strong> contain <strong>any</strong> Unicode <strong><code>Emojis</code></strong> character :</p>
<ul>
<li>First, from that link :</li>
</ul>
<p dir="auto"><a href="http://www.unicode.org/charts/PDF/U1F600.pdf" rel="nofollow ugc">http://www.unicode.org/charts/PDF/U1F600.pdf</a></p>
<p dir="auto">We learn that the Unicode <strong>Emoticons</strong> block has code-points between <strong><code>\x{1F600}</code></strong> and <strong><code>\x{1F64F}</code></strong></p>
<ul>
<li>
<p dir="auto">With the <strong>on-line</strong> <strong><code>UTF-8</code></strong> tool, we verify that the <strong>two</strong> Hex <strong><code>UTF-16</code></strong> surrogates are :</p>
<ul>
<li>
<p dir="auto"><strong><code>D83D DE00</code></strong>, for the <strong><code>\x{1F600}</code></strong> <strong>emoticon</strong></p>
</li>
<li>
<p dir="auto"><strong><code>D83D DE4F</code></strong>, for the <strong><code>\x{1F64F}</code></strong> <strong>emoticon</strong></p>
</li>
</ul>
</li>
</ul>
<p dir="auto">So, we should match <strong>all</strong> the characters of the Unicode <strong><code>Emoticons</code></strong> block, with the search regex :</p>
<p dir="auto">SEARCH <strong><code>\x{D83D}[\x{DE00}-\x{DE4F}]</code></strong></p>
<p dir="auto">And, yes, it does work as <strong>expected</strong>. In that case, <strong>deleting</strong> any <strong>non</strong>-empty line which does <strong>not</strong> contain any <strong>Emoticon</strong> character(s) is easy with the following <strong>regex</strong> S/R :</p>
<p dir="auto">SEARCH <strong><code>(?-s)^(?!.*\x{D83D}[\x{DE00}-\x{DE4F}]).+\R</code></strong></p>
<p dir="auto">REPLACE <strong><code>Leave EMPTY</code></strong></p>
<hr />
<p dir="auto">In contrast, the <strong>regex</strong> S/R :</p>
<p dir="auto">SEARCH <strong><code>(?-s)^(?=.*\x{D83D}[\x{DE00}-\x{DE4F}]).+\R</code></strong></p>
<p dir="auto">REPLACE <strong><code>Leave EMPTY</code></strong></p>
<p dir="auto">would <strong>delete</strong> any <strong>non</strong>-empty line containing <strong>one</strong> or more <strong>emoticon</strong> character(s) !</p>
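<p dir="auto">For comparison, here is how the same two look-around filters could be sketched in an engine that handles characters over the <strong>BMP</strong> directly. This uses Python’s <strong><code>re</code></strong> module, <strong>not</strong> Notepad++ : the dot never matches a line-break by default there, a plain <strong><code>\n</code></strong> stands in for <strong><code>\R</code></strong>, and the function names are hypothetical.</p>

```python
import re

# The Emoticons block written as an ordinary character class.
EMOTICON = "[\U0001F600-\U0001F64F]"


def drop_lines_without_emoticon(text):
    # Negative look-ahead: delete non-empty lines holding no emoticon.
    return re.sub(r"^(?!.*" + EMOTICON + r").+\n", "", text, flags=re.MULTILINE)


def drop_lines_with_emoticon(text):
    # Positive look-ahead: delete non-empty lines holding an emoticon.
    return re.sub(r"^(?=.*" + EMOTICON + r").+\n", "", text, flags=re.MULTILINE)


sample = "hello\n\U0001F600 smile\n\nplain \U0001F4A6\nbye \U0001F64F\n"
print(repr(drop_lines_without_emoticon(sample)))
print(repr(drop_lines_with_emoticon(sample)))
```

<p dir="auto">Note that the line with the <strong>💦</strong> char is deleted by the <strong>first</strong> function, as this character does <strong>not</strong> belong to the <strong>Emoticons</strong> block, and that <strong>empty</strong> lines are kept in both cases.</p>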
<p dir="auto">Not <strong>asleep</strong> yet ? That’s good news :-))</p>
<p dir="auto">Best Regards,</p>
<p dir="auto">guy038</p>
<p dir="auto"><strong>P.S.</strong> :</p>
<p dir="auto">Let’s suppose that, instead of the <strong>small</strong> Unicode <strong><code>Emoticons</code></strong> block, containing <strong><code>80</code></strong> characters, we would like to search for <strong>any</strong> character belonging to the <strong>Unicode</strong> <strong><code>Miscellaneous Symbols and Pictographs</code></strong> block, which contains <strong><code>768</code></strong> characters and to which your <strong>special 💦</strong> char belongs</p>
<p dir="auto">Here, it’s getting really <strong>tricky</strong> ! The <strong>Unicode</strong> range of that <strong>block</strong> is from <strong><code>\x{1F300}</code></strong> to <strong><code>\x{1F5FF}</code></strong>, but, because of the <strong>surrogates</strong> mechanism, it must be split into <strong>two</strong> parts :</p>
<ul>
<li>
<p dir="auto">The range of chars between <strong><code>\x{1F300}</code></strong> and <strong><code>\x{1F3FF}</code></strong>, so with <strong>surrogates</strong> pairs <strong><code>D83C DF00</code></strong> to <strong><code>D83C DFFF</code></strong></p>
</li>
<li>
<p dir="auto">The range of chars between <strong><code>\x{1F400}</code></strong> and <strong><code>\x{1F5FF}</code></strong>, so with <strong>surrogates</strong> pairs <strong><code>D83D DC00</code></strong> to <strong><code>D83D DDFF</code></strong></p>
</li>
</ul>
<p dir="auto">Therefore, the <strong>correct</strong> regex to match <strong>all</strong> the characters of this <strong>block</strong> is, indeed :</p>
<p dir="auto"><strong><code>\x{D83C}[\x{DF00}-\x{DFFF}]|\x{D83D}[\x{DC00}-\x{DDFF}]</code></strong></p>
<p dir="auto">with an <strong>alternative</strong> between <strong>two</strong> regexes, in order to match each <strong>subset</strong> !</p>
<p dir="auto">I confirm that this regex does find the <strong><code>768</code></strong> characters of the Unicode <strong>Miscellaneous Symbols and Pictographs</strong> block, with code-point <strong>over</strong> <strong><code>\x{FFFF}</code></strong> !</p>
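<p dir="auto">Splitting a range by hand quickly becomes tedious for larger blocks. The Python sketch below generates the alternation automatically, with one branch per <strong>high</strong> surrogate; both function names are hypothetical, and the generated text is meant to be pasted into the <strong><code>Find what:</code></strong> zone.</p>

```python
def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates of a code point above the BMP."""
    high_offset, low_offset = divmod(code_point - 0x10000, 0x400)
    return 0xD800 + high_offset, 0xDC00 + low_offset


def astral_range_regex(first, last):
    """Build a surrogate-based alternation covering first..last (both over the BMP)."""
    first_high, first_low = surrogate_pair(first)
    last_high, last_low = surrogate_pair(last)
    parts = []
    for high in range(first_high, last_high + 1):
        # Only the first and last branches have a partial low-surrogate range.
        low_start = first_low if high == first_high else 0xDC00
        low_end = last_low if high == last_high else 0xDFFF
        parts.append(f"\\x{{{high:04X}}}[\\x{{{low_start:04X}}}-\\x{{{low_end:04X}}}]")
    return "|".join(parts)


print(astral_range_regex(0x1F300, 0x1F5FF))  # Miscellaneous Symbols and Pictographs
print(astral_range_regex(0x1F600, 0x1F64F))  # Emoticons
```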
<p dir="auto">It’s really a <strong>pity</strong> that the N++ regex <strong>engine</strong> does not <strong>correctly</strong> handle all the characters <strong>outside</strong> the <strong><code>BMP</code></strong>. If it did, we could <strong>simply</strong> use the classical <strong><code>[\x{1F300}-\x{1F5FF}]</code></strong> <strong>character class</strong> !!</p>
]]></description><link>https://community.notepad-plus-plus.org/post/51068</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/51068</guid><dc:creator><![CDATA[guy038]]></dc:creator><pubDate>Mon, 02 Mar 2020 22:32:31 GMT</pubDate></item><item><title><![CDATA[Reply to Regex single dot character in group behaves differently than not in group on Thu, 27 Feb 2020 15:33:55 GMT]]></title><description><![CDATA[<p dir="auto">Hello, <a class="plugin-mentions-user plugin-mentions-a" href="/user/matthews-dylan" aria-label="Profile: matthews-dylan">@<bdi>matthews-dylan</bdi></a></p>
<p dir="auto">Allow me a <strong>few</strong> hours to work out a <strong>correct</strong> reply to your problem, which is really <strong>not</strong> easy, as it involves notions such as <strong><code>UTF-8</code></strong> encoding, Unicode <strong>surrogates</strong>, Notepad++ <strong>encodings</strong>, regex <strong>engine</strong> handling of characters and, of course, <strong>fonts</strong> !</p>
<p dir="auto">See you later,</p>
<p dir="auto">guy038</p>
]]></description><link>https://community.notepad-plus-plus.org/post/51008</link><guid isPermaLink="true">https://community.notepad-plus-plus.org/post/51008</guid><dc:creator><![CDATA[guy038]]></dc:creator><pubDate>Thu, 27 Feb 2020 15:33:55 GMT</pubDate></item></channel></rss>