word character list - special characters █►◄ not selected as expected

guy038

Your special characters are :

The FULL BLOCK character ( █ ), from the Unicode Block Elements script, with code-point \x{2588}

Refer to http://www.unicode.org/charts/PDF/U2580.pdf

The BLACK RIGHT-POINTING POINTER character ( ► ), from the Unicode Geometric Shapes script, with code-point \x{25BA}
The BLACK LEFT-POINTING POINTER character ( ◄ ), from the Unicode Geometric Shapes script, with code-point \x{25C4}

Refer to http://www.unicode.org/charts/PDF/U25A0.pdf

I didn’t check previous versions of Notepad++ to verify this, but I’m afraid that you cannot set word characters with a code-point of \x{0080} or above when using a multi-byte encoding ( so, all the Unicode encodings : UTF-8, UTF-8-BOM, UCS-2 BE BOM and UCS-2 LE BOM )

Refer to https://www.scintilla.org/ScintillaDoc.html#SCI_GETWORDCHARS

It is said :

For multi-byte encodings, this API will not return meaningful values for 0x80 and above.

So the Scintilla message SCI_SETWORDCHARS, which changes the set of word characters, can handle only ASCII characters if you use, for instance, the default UTF-8 encoding
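
One way to picture the limitation quoted above: in UTF-8, each of the three characters is stored as several bytes, all of them at 0x80 or above. The following is only a minimal Python sketch of that encoding fact, not Notepad++ or Scintilla code; the helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name, UTF-8 bytes in hex) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex(" "))

# FULL BLOCK, BLACK RIGHT-POINTING POINTER, BLACK LEFT-POINTING POINTER
for ch in "\u2588\u25ba\u25c4":
    print(*describe(ch), sep=" | ")
```

Each character takes three bytes, so a table indexed by single byte values has no slot that stands for the whole character.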

So, instead of using exotic Unicode characters, I was thinking about using the MACRON symbol of Unicode code-point \x{00AF} ( Don’t laugh ! No relation with the President of the French Republic…as I’m French ! )

Refer to http://www.unicode.org/charts/PDF/U0080.pdf

All that’s next is, of course, a work-around but you may like it and even give yourself other ideas !

Of course, as its code-point is higher than \x{007F}, you will not be able to select it, along with some word chars. However :

To write it use, either, the shortcut ALT + 0175 ( Unicode, ANSI or Win-1252 encodings ) or ALT + 238 ( OEM-850 encoding )
It can help to isolate your words, easily enough, among other normal text. For instance : This is ¯¯¯Domain¯¯¯ a quick test !
You could highlight any occurrence of that specific character, using, for instance, Search > Mark All > Using 1st Style OR the context menu Style token > Using 1st Style, after selecting it
But above all, you may move from one highlighting to another, with the shortcuts Ctrl + 1 ( forward ) and Ctrl + Shift + 1 ( backward ). Note that you must use the 1 key of the main keyboard !
On the other hand, you could also use the ¯+.+?¯+ regular expression, in the Find or Mark dialogs, to match anything embedded between ¯ characters ! And you could delete these matches, in the Replace dialog, by leaving the replacement zone empty
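
To see what the ¯+.+?¯+ expression matches, here is a small sketch using Python's `re` module. It is an assumption that Python's engine behaves like the Notepad++ one for this simple pattern; the function name `strip_zones` is hypothetical.

```python
import re

# Same pattern as ¯+.+?¯+ : one or more macrons, lazily anything, macrons again
MACRON_ZONE = re.compile("\u00af+.+?\u00af+")

def strip_zones(text):
    """Delete every macron-delimited zone, like Replace All with an empty box."""
    return MACRON_ZONE.sub("", text)

sample = "This is \u00af\u00af\u00afDomain\u00af\u00af\u00af a quick test !"
print(strip_zones(sample))
```

Note that the two spaces surrounding the zone are both kept, so the result contains a double space.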

BTW, I verified that if I include the ¯ character as a word character, in Preferences... > Delimiter > Word character list, you can select, for instance, all the string ¯¯¯Domain¯¯¯ with a double-click, if typed in an ANSI encoded file
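
The difference between the ANSI and Unicode cases can be illustrated outside Notepad++. This Python sketch assumes that "ANSI" means Windows-1252 and "OEM-850" means code page 850; it only shows byte counts and the ALT codes mentioned above, and `encoded_length` is a made-up helper.

```python
MACRON = "\u00af"  # the ¯ character

def encoded_length(ch, encoding):
    """Number of bytes used to store one character in the given encoding."""
    return len(ch.encode(encoding))

print("cp1252:", encoded_length(MACRON, "cp1252"), "byte(s)")
print("utf-8 :", encoded_length(MACRON, "utf-8"), "byte(s)")

# The ALT codes are simply the byte values in each code page
print(bytes([175]).decode("cp1252") == MACRON)  # ALT + 0175
print(bytes([238]).decode("cp850") == MACRON)   # ALT + 238
```

With a single-byte encoding the macron fits in one byte, which is consistent with it being usable as a word character in an ANSI file only.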

Best Regards,

guy038

Mohammad Hussain

Sorry for the late reply (I spent some time looking at ranges and testing different characters).

Also, thank you very much for your incredibly detailed reply. I can’t believe how much time you’ve spent trying to help. Truly, truly appreciated!!

Unfortunately, none of this will work well for what I’m doing. Here’s a clearer example of a line in one of the files I distribute to my colleagues, and sometimes clients:
Generate GUID:
https://█►Domain◄█/d2l/guids/d2l.guid.2.asmx/GenerateExpiringGuid?guidType=SSO&orgId=█►MainOrgID◄█&installCode=█►InstallationCode◄█&TTL=60&data=█►Username◄█&key=█►LocalPrivateKey◄█

As you can see, not only the characters I chose are very visible, they also clearly indicate which part to modify, with the arrows helping with that.

I checked all the characters within the 0080 range, and none of them work for my purpose. The only arrow-like characters are used in html/xml files, so using them will be very confusing if someone is trying to edit html/xml.

As for the Macron character (very funny btw!), it’s not obvious enough, although it’s clearly more obvious than most other options. The other issue with it is it doesn’t belong to the 0080 range either, which means (as you mentioned), it only works with ANSI encoding, but not Unicode. All of my files are in Unicode.

I don’t fully understand what Scintilla is, but it does sound like a library/dependency beyond the control of Notepad++ code. If that’s the case, I guess I’ll just keep the same characters I was using (for visibility/ease of use), and everyone should remove them manually, and hopefully, they will remove the right amount of characters without introducing errors. It’s unfortunate though. Using them before was very convenient…

Thank you again @guy038. I truly appreciate your help :)

PeterJones

@Mohammad-Hussain said in word character list - special characters █►◄ not selected as expected:

everyone should remove them manually, and hopefully, they will remove the right amount of characters

There is an alternative to fully-manual removal. Instructions:

Double-click and overtype as they previously did, which will change █►DOMAIN◄█ into █►blah.url◄█ (for example)
After they finished all the replacements necessary, Search > Replace
- FIND = [\x{2588}\x{25ba}\x{25c4}]
- REPLACE = (leave box empty)
- Search Mode = regular expression
- REPLACE ALL

No hoping required.
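
For anyone who prefers to see the effect of that character class outside Notepad++, here is a minimal Python sketch of the same replacement. The sample values (blah.url, 6606) and the name `strip_markers` are invented for illustration.

```python
import re

# Same character class as [\x{2588}\x{25ba}\x{25c4}] : block, right and left pointers
MARKERS = re.compile("[\u2588\u25ba\u25c4]")

def strip_markers(text):
    """Remove every marker character, leaving the typed-over values in place."""
    return MARKERS.sub("", text)

# A line after the placeholders have been typed over (hypothetical values)
line = "https://\u2588\u25bablah.url\u25c4\u2588/d2l/guids?orgId=\u2588\u25ba6606\u25c4\u2588"
print(strip_markers(line))
```

Because the class matches each marker character individually, it does not matter whether a marker pair was left complete or partly deleted.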

Alternate: Don’t have them manually double-click.

Use Search > Find from the beginning
- FIND = \x{2588}\x{25ba}.*?\x{25c4}\x{2588}
- Search Mode = regular expression
- FIND NEXT
click on the tab bar; if they lost the selection (by clicking in the text instead of in the tab bar), hit F3 to re-highlight the next instance
type over the selected text, which will include typing over the █►DOMAIN◄█ into blah.url, so getting rid of the fancy characters
hit F3 and repeat typeover for all the █►...◄█ instances

If, in your encoding, the unicode \x{....} characters don’t match, you’d have to tell us what encoding you’re actually using (or possibly just paste in the actual characters, rather than the \x{....} notation).
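
The same search expression can also be used to fill the zones in one pass from a script. This is only a sketch: it adds a capture group to the FIND regex above, and the names `zone_names`, `fill_zones` and the sample values are assumptions, not anything from Notepad++.

```python
import re

# FIND regex from above, with a capture group around the zone name
ZONE = re.compile("\u2588\u25ba(.*?)\u25c4\u2588")

def zone_names(text):
    """List the names found between the marker pairs, in order."""
    return ZONE.findall(text)

def fill_zones(text, values):
    """Replace each zone whose name is in `values`; leave unknown zones untouched."""
    return ZONE.sub(lambda m: values.get(m.group(1), m.group(0)), text)

template = ("https://\u2588\u25baDomain\u25c4\u2588/d2l?orgId="
            "\u2588\u25baMainOrgID\u25c4\u2588&data=\u2588\u25baUsername\u25c4\u2588")
print(zone_names(template))
print(fill_zones(template, {"Domain": "blah.url", "MainOrgID": "6606"}))
```

Leaving unknown zones untouched keeps their markers visible, so a forgotten value is still easy to spot.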

Alan Kilborn

@PeterJones

Two very good solutions; nicely done.

Additionally, maybe recording some macros helps, and/or the Mark function. After marking, you can jump between marks by using Search > Jump down (or up) > Find Style

Mohammad Hussain

Thank you gentlemen!

Very elegant solutions indeed :)

I’ll probably use these myself (probably the macro one. Automation saves time). Most of my colleagues however don’t even know what regular expressions are, not to mention clients! lol! I guess they’ll either have to do this manually, or just use simple search to remove these characters.

Thanks again:)

Have a great day everyone! Stay safe :)

Alan Kilborn

@Mohammad-Hussain said in word character list - special characters █►◄ not selected as expected:

Most of my colleagues however don’t even know what regular expressions are

Even better for a macro-based solution; just bind a regular expression operation to a keycombo for them, and they don’t need to know much to use it.

guy038

Hi, @mohammad-hussain, @alan-kilborn, @peterjones and All,

@mohammad-hussain, in the second part of this post, I will describe a solution, using macros, for the search of each zone █►...........◄█, in each direction ( forward and backward )

However, I would like, first, to discuss with Alan and Peter a regex search bug that I had already noticed, but which did not worry me too much. Presently, however, it is very annoying, regarding the behaviour of macros involving searches !

Luckily, @mohammad-hussain, I’ve found out a work-around which will enable you to create two macros and use them to search forward / backward for your █►...........◄█ zones ;-))

So, first, let me explain the bug :

Open a new tab
Insert the sample text START é12345 é ABCDEZéGHIùJKZé é67890 é TUVWùXYZé END Zé, containing the very common French letter é and two letters ù
Place the caret at beginning of word START
Open the Find dialog
SEARCH é
Tick the Wrap around option ( IMPORTANT )
Select the Regular expression mode
Click on the Find Next button

=> The first é of the string é12345 is selected

Close the Find dialog
Go on, hitting the F3 key

=> You get the successive occurrences of the é letter

Now, hit the Shift + F3 for a backward search => nothing happens :-(( Backward search is impossible to perform

Notes :

After tests, this bug occurs when the search ends with a character with code-point > \x7F ( so NON pure ASCII char )
- Search of regexes .é, \ué or Zé did not work in backward direction, even if you choose the Backward direction option
- Search of the regex .[\x{0080}-\x{FFFF}] did not work, either, in backward search
But :
- Search of regexes é., é\x20, é\w, .é., .é\x20 or \ué. does search in backward direction
- Search of the regex .[\x{0000}-\x{007F}] or é[\x{0000}-\x{007F}] does work, as well, in backward search
This bug only occurs with a Unicode encoding ( UTF-8, UTF-8-BOM, UCS-2 BE BOM and UCS-2 LE BOM ). With an ANSI encoded file, no bug at all !
This bug does not happen, either, if you use the Normal or Extended (\n, \r, \t, \0, \x...) search mode

So, do you confirm, guys, that it’s a real bug ? If so, I’ll create an issue, soon

Mates, you may think : he’s going to give up ? No, I’m a little stubborn, even quite a lot ! So, do you see a possible work-around to that problem ?

Ah, ah ! Well, the magical regex is (?=(?s).). ( Almost ) obviously, this look-ahead assertion is always TRUE, isn’t it ? This expression misleads the regular expression engine, by making it believe that there is some additional kind of character to be taken into account !

So, in the meanwhile, here is a new regex rule :

When you cannot perform a backward search, in regular expression mode, simply add the (?=(?s).) syntax, at the end of your present search regex ;-))

Now, @mohammad-hussain, with this work-around, here are, below, the two macros to be appended at the end of the <Macros>.........</Macros> node of your active shortcuts.xml configuration file :

        <Macro name="Search Zones to Modify (Fwd)" Ctrl="yes" Alt="no" Shift="no" Key="123">                                <!-- Ctrl + F12 shortcut      -->
            <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />                                              <!-- Search Initialisation    -->
            <Action type="3" message="1601" wParam="0" lParam="0" sParam="\x{2588}\x{25ba}.*?\x{25c4}\x{2588}(?=(?s).)" />  <!-- Search of |>........<|   -->
            <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />                                              <!-- Regular Expression mode  -->
            <Action type="3" message="1702" wParam="0" lParam="768" sParam="" />                                            <!-- Search Forward and Wrap  -->
            <Action type="3" message="1701" wParam="0" lParam="1" sParam="" />                                              <!-- Find Next match          -->
        </Macro>
        <Macro name="Search Zones to Modify (Bwd)" Ctrl="yes" Alt="no" Shift="yes" Key="123">                               <!-- Ctrl + Shift + F12       -->
            <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />                                              <!-- Search Initialisation    -->
            <Action type="3" message="1601" wParam="0" lParam="0" sParam="\x{2588}\x{25ba}.*?\x{25c4}\x{2588}(?=(?s).)" />  <!-- Search of |>........<|   -->
            <Action type="3" message="1625" wParam="0" lParam="2" sParam="" />                                              <!-- Regular Expression mode  -->
            <Action type="3" message="1702" wParam="0" lParam="256" sParam="" />                                            <!-- Search Backward and Wrap -->
            <Action type="3" message="1701" wParam="0" lParam="1" sParam="" />                                              <!-- Find Previous match      -->
        </Macro>
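
As a reading aid for the numeric values in the two macros, here is a small Python sketch. The two flag values are inferred only from the comments in the macros above (256 appears in both "Wrap" searches, and 768 - 256 = 512 appears only in the forward one), so treat them as assumptions; the names are hypothetical. The value 123 is the Windows virtual-key code of F12.

```python
WRAP_AROUND = 256      # assumption: set in both macros, which both wrap
SEARCH_FORWARD = 512   # assumption: 768 - 256, set only in the (Fwd) macro
VK_F12 = 0x7B          # Windows virtual-key code for F12, i.e. Key="123"

def decode_search_flags(lparam):
    """Split the lParam of the message 1702 action into the two inferred options."""
    return {
        "wrap": bool(lparam & WRAP_AROUND),
        "direction": "forward" if lparam & SEARCH_FORWARD else "backward",
    }

print(VK_F12)
print(decode_search_flags(768))
print(decode_search_flags(256))
```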

Remark :

Depending if you have a local N++ install or not, your shortcuts.xml file can be found :

Along with the notepad++.exe file, for a local configuration, in any folder different from C:\Program files[(x86)]
In the path %AppData%\Notepad++, in case of use of the installer to install N++

I just tried it, with the last v7.8.6 version and everything went OK ! So, in summary :

To get the next █►...........◄█ zone, hit the Ctrl + F12 shortcut, which runs the Search Zones to Modify (Fwd) macro
To get the previous █►...........◄█ zone, hit the Ctrl + Shift + F12 shortcut, which runs the Search Zones to Modify (Bwd) macro
Bonus, if you hit the F12 key, you swap between the Post-It screen mode and the Normal screen mode ;-))
On the other hand, you can also run a completely independent search with the F3 and Shift + F3 shortcuts

Best Regards,

guy038

P.S. :

To be rigorous, the look-ahead syntax (?=(?s).) matches at any position within the file, except at the very end of file !

So, in case of a █►...........◄█ zone, at the very end of file, simply add a final line-break, after that zone
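
The end-of-file caveat can be reproduced with Python's `re` module, under two assumptions: that Python treats the look-ahead like the Notepad++ engine does, and that the scoped form (?s:.) is an acceptable stand-in for (?s). ( recent Python versions reject a global flag in the middle of a pattern ). The helper `count_matches` is made up for this sketch.

```python
import re

ZONE = "\u2588\u25ba.*?\u25c4\u2588"   # the zone regex used in the macros
WORKAROUND = ZONE + "(?=(?s:.))"        # same regex plus the look-ahead

def count_matches(pattern, text):
    """Count non-overlapping matches of pattern in text."""
    return len(re.findall(pattern, text))

# Two zones, the second one at the very end of the text
text = "a=\u2588\u25baDomain\u25c4\u2588&b=\u2588\u25baUsername\u25c4\u2588"

print(count_matches(ZONE, text))               # both zones
print(count_matches(WORKAROUND, text))         # the final zone is missed
print(count_matches(WORKAROUND, text + "\n"))  # a final line-break restores it
```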

Alan Kilborn

@guy038 said in word character list - special characters █►◄ not selected as expected:

So, do you confirm, guys, that it’s a real bug ? If so, I’ll create an issue, soon

I confirm the findings.
But I already thought that backwards search in Regular Expression Search mode was problematic in Notepad++.
So, it seems it is nothing truly new, except another example of the problems.

astrosofista

@Alan-Kilborn said in word character list - special characters █►◄ not selected as expected:

I confirm the findings.

Hi @Alan-Kilborn, @guy038 and All:

Me too. Ran only the first tests, not those under the Notes.

By the way, @guy038, your magical regex (?=(?s).) is a nice catch. Thank you, I saved it.

One useless but curious thing I found while playing with regex is an expression that, when you repeatedly press Find Next, confines the caret to the first word of the document, making it move in circles from the beginning to the end of the word: \A(?=\b).

Have fun!

Alan Kilborn

@astrosofista said

caret…move in circles from the beginning to the end of the word: \A(?=\b).

You must mean with Wrap around ticked.
I’m not surprised by the behavior of this regex.

It makes sense how it is working.
Well, within the confines of Notepad++ anyway. :-)