Colored "Find what:" zone

guy038

Regarding the Mark style feature, unfortunately, it cannot take into account a range of more than 2,047 characters either, just like the Find what: zone :-(

Just duplicate a zone of, let's say, 5,000 characters or so ! I had never done such a test before ;-))

BR

guy038

Alan Kilborn

@guy038 said in Colored "Find what:" zone:

Regarding the Mark style feature, unfortunately, it cannot take into account a range of more than 2,047 characters either, just like the Find what: zone

Well, I had not encountered this before (having not attempted such a large “styling”) but it does not surprise me because clearly “styling” is going to involve Notepad++'s internal find routines to do its job. And those, as we know, have this 2047 limit.

But really, it isn’t much of a limitation to what we’re discussing (your usage when doing before/after regex replacement “compares”), right? Perhaps you were thinking that the “styling” method may be better because it might not have this limitation?

As an alternative mechanism for such compares, I use an independent compare utility. The utility can do quick-to-invoke compares on the last two things copied to the clipboard. Thus it is perfect for your described application. Everyone touts the N++ Compare plugin, but I find a separate utility outside of N++, with possibly some hooks “into” N++ (via PythonScript), to be even more useful. Nothing against N++'s Compare plugin, however.

But, back to the 2047 (or is it 2046? I can’t remember) limit…

Is it truly 2047 characters, or is it 2047 bytes? Also a “can’t remember” for me. If it is “bytes” then, worst case for UTF-8 data, it might be as little as 2047-divided-by-4, or roughly 512 characters.
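
As a quick sanity check of that worst case, here is a small Python sketch of the arithmetic. It assumes, purely for illustration, that the limit is 2047 and that it is counted in bytes; neither point is confirmed at this stage of the thread.

```python
# Assumed limit, taken from the discussion above. Whether it is really
# 2047 or 2046, and bytes or characters, is exactly what is in question.
LIMIT = 2047

# UTF-8 needs between 1 and 4 bytes per character.
for bytes_per_char in (1, 2, 3, 4):
    whole_chars = LIMIT // bytes_per_char
    print(bytes_per_char, "byte(s) per char ->", whole_chars, "whole characters")

# Worst case: every character needs 4 bytes.
worst_case = LIMIT // 4
```

With integer division, the worst case comes out just under the "roughly 512" estimate.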

But even if it is characters, is such a limit “too small” for today’s conditions?
Maybe lobbying the N++ devs for an increase in this number is a reasonable thing to do?

Alan Kilborn

@Alan-Kilborn said in Colored "Find what:" zone:

I think it actually has more value to truly BE a button, or, more accurately, a horizontally-narrow dropdown,

One thing that my earlier proposal does NOT consider, is changing modes via keyboard.
I don’t currently have a great idea for this, without increasing the size on the UI.
But probably it is all pointless anyway, as Find UI changes are rarely considered by the devs.

guy038

Hi, @alan-kilborn and All,

I did a series of tests and I’ve found out an interesting point about the Find What: filling zone !!

A) First case :

Make a normal selection of some text or use the current selection
Hit the Ctrl + F, Ctrl + H, Ctrl + Shift + F or Ctrl + M shortcut. So, this selection usually fills in the Find what: zone, automatically

=> In this case, the maximum size of this zone is 2,046 bytes, whatever the characters stored and the number of bytes needed to encode each character. For instance, the string Aé▣🎷 contains 1 + 2 + 3 + 4 bytes, in a UTF-8 file. So, 10 bytes are inserted in the Find what: zone
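
The byte count of that example string can be checked outside Notepad++ with a few lines of plain Python 3. This only verifies the UTF-8 arithmetic; it says nothing about how Notepad++ itself fills the zone.

```python
# The example string, written with escapes so the code points are unambiguous:
# A (U+0041), é (U+00E9), ▣ (U+25A3), 🎷 (U+1F3B7)
text = "A\u00e9\u25a3\U0001F3B7"

# Number of UTF-8 bytes needed for each character.
utf8_sizes = [len(ch.encode("utf-8")) for ch in text]
total_bytes = sum(utf8_sizes)

print(utf8_sizes, "->", total_bytes, "bytes")
```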

B) Second case :

Copy the current selection in the clipboard, with Ctrl + C
Cancel the current selection
Hit the Ctrl + F, Ctrl + H, Ctrl + Shift + F or Ctrl + M shortcut
Delete the contents of the Find what: zone, whatever it is
Paste the contents of the clipboard with Ctrl + V, in the Find what: zone

=> In that case, the maximum size of this zone is 2,046 chars and :

Each character, with Unicode code-point <= U+FFFF, stands for one character
Each character, with Unicode code-point > U+FFFF, stands for two characters !

So the same string Aé▣🎷 contains 1 + 1 + 1 + 2 chars. Thus, 5 pseudo characters are inserted in the Find what: zone
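
The 1 + 1 + 1 + 2 count matches the number of 16-bit code units each character takes in UTF-16. The sketch below reproduces the count under that assumption; that the pasted text is measured in UTF-16 code units is an inference from the observed numbers, not something confirmed here.

```python
# Same example string: A, é, ▣, 🎷
text = "A\u00e9\u25a3\U0001F3B7"

def utf16_units(ch):
    """Number of 16-bit code units needed for one character in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2

units = [utf16_units(ch) for ch in text]
total_units = sum(units)

print(units, "->", total_units, "pseudo characters")
```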

Note that this case B) occurs, also, for the Replace with: zone, as we need, necessarily, to fill in this zone with clipboard contents, anyway ! Therefore, the maximum size of the Replace with: zone is 2,046 characters, too, with the above distinction between characters within or outside the BMP !

BTW, we get the same results whatever the current search mode used !

Best Regards,

guy038

P.S. :

For a quick test, note the differences between cases A) and B) with the one-line text of the ▣ character ( U+25A3 ), below :

▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣
▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣

First case :

Select all the above text ( 2,109 chars )
Open the Find dialog ( The Find what: zone is filled in, automatically )
Tick the Wrap around option
Click once only on the Find Next button

=> 682 characters ▣ are selected ( each char is coded with three bytes E2 96 A3 => 682 * 3 = 2,046 bytes )

Second case :

Select again all this text ( 2,109 chars )
Hit Ctrl + C
Cancel the selection
Open the Find dialog
Delete anything in the Find what: zone
Paste the clipboard contents with Ctrl + V, in the Find what: zone
Tick the Wrap around option
Click on the Find Next button

=> 2,046 characters ▣ are selected ( each char counts for itself, as its code-point is <= U+FFFF )
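
Both results of this quick test can be reproduced with a small simulation in plain Python 3. It is only a model of the observed behavior: the helper names are hypothetical, and it assumes that the zone keeps the longest run of whole characters that fits within 2,046 UTF-8 bytes (first case) or 2,046 UTF-16 code units (second case).

```python
LIMIT = 2046  # value observed in the two tests above

def utf8_cost(ch):
    """Size of one character in UTF-8 bytes."""
    return len(ch.encode("utf-8"))

def utf16_cost(ch):
    """Size of one character in UTF-16 code units."""
    return len(ch.encode("utf-16-le")) // 2

def fit(text, cost, limit=LIMIT):
    """Longest prefix of whole characters whose total cost stays <= limit."""
    used = 0
    kept = 0
    for ch in text:
        if used + cost(ch) > limit:
            break
        used += cost(ch)
        kept += 1
    return text[:kept]

sample = "\u25a3" * 2109          # the test text: 2,109 times U+25A3
case_a = fit(sample, utf8_cost)   # first case: selection, counted in bytes
case_b = fit(sample, utf16_cost)  # second case: paste, counted in code units

print(len(case_a), "characters kept in the first case")
print(len(case_b), "characters kept in the second case")
```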

Alan Kilborn

@guy038 said in Colored "Find what:" zone:

I did a series of tests and I’ve found out an interesting point about the Find What: filling zone !!

This sounds like more than “interesting” behavior.
It sounds like “buggy” behavior.
And it sounds like possibly several bugs. :-(

guy038

Hi, @alan-kilborn and All,

Yeah, I admit that it’s really borderline ! Now, which case seems more logical and which case seems more interesting ?

I would say that :

The first case seems more logical as it considers the total amount of bytes inserted in the Find what: zone, which is strictly equal to the total amount of bytes of the current selection, before calling the Find dialog
Now, the second case, pasting contents in the Find what: zone, is more interesting, of course, because we can search for a greater range of characters ( at least, 2 times more ) !
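
To put numbers on the "2 times more" remark, the sketch below compares how many copies of a single character fit under each counting rule. It assumes a limit of 2,046 in both cases, counted in UTF-8 bytes for the first case and in UTF-16 code units for the second. Note that for pure ASCII text the two rules give the same capacity, so the gain only shows up with non-ASCII characters.

```python
LIMIT = 2046  # assumed to apply to both cases, as observed above

samples = ["A", "\u00e9", "\u25a3", "\U0001F3B7"]  # A, é, ▣, 🎷

capacity = {}
for ch in samples:
    by_bytes = LIMIT // len(ch.encode("utf-8"))             # first case
    by_units = LIMIT // (len(ch.encode("utf-16-le")) // 2)  # second case
    capacity[ch] = (by_bytes, by_units)
    print("U+%04X" % ord(ch), "first case:", by_bytes, "second case:", by_units)
```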

Best Regards,

guy038

Alan Kilborn

@guy038

I would say, that if the developers are going to set some kind of limit (and often in software a limit must be set), then for user convenience and understanding, it should be a character limit. Users don’t understand characters versus bytes (unless those numbers are strictly the same, and with UTF-8 and other encodings they are NOT).

And, different methods of entry should of course not alter the amount of data that can be accepted.

guy038

Hello, @alan-kilborn and All,

Yes, Alan, the character point of view should be preferred to the byte one, like, for instance, the sel : number, in status bar, which refers to characters ( not bytes ) !

So, the second case should, therefore, be preferred. However, note that, presently, there is still a difference between chars in the BMP, counting for one char and characters outside the BMP, counting for two. A bit weird, BTW ?

BR

guy038

Alan Kilborn

@guy038 said in Colored "Find what:" zone:

the character point of view should be preferred to the byte one, like, for instance, the sel : number, in status bar, which refers to characters ( not bytes ) !

And the Pos : number, in the status bar, bothers me somewhat, as it seems intuitively like it should be one character = one “position” change as you cursor over it. But for multibyte characters it is NOT a change of one.

The Pythonscript programmer in me sort of understands this, however. Meaning how Scintilla deals with “position”.

chars in the BMP, counting for one char and characters outside the BMP, counting for two.

You might understand that way better than me.
The “two” makes me think of “surrogate pairs” but here is where I back off because I don’t know what I’m talking about. :-)

guy038

Hello, @Alan-kilborn and All,

Thanks, Alan, for your feedback !

Yes, I know that the Pos number, in the status bar, refers to the exact position (starting from 0), in the current file, of the first byte of the sequence needed to write a character, in a specific encoding ! For instance, the UTF-8 sequence of the 🎷 character, representing a saxophone, is the four-byte sequence ( F0 9F 8E B7 ). So, if you insert, in a new tab, the string A🎷Z0, you can jump, with the Search > Go to... feature, when the Offset radio button is set, to :

Pos 0, right before the A letter
Pos 1, right before the 🎷 character
Pos 5, right before the Z letter
Pos 6, right before the 0 digit

And, if you try the offset 2, 3 or 4, which are all within the UTF-8 encoding of the 🎷 character, you would just jump to the next Z char !
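
These offsets can be reproduced with plain Python 3, outside Notepad++. The sketch computes the byte offset at which each character of the string starts in UTF-8, then moves an offset that falls inside a multi-byte sequence forward to the next character start. The helper snap_forward is hypothetical and only mimics the jump described above; it is not Notepad++ or Scintilla code.

```python
text = "A\U0001F3B7Z0"       # the string A🎷Z0
data = text.encode("utf-8")  # 41 F0 9F 8E B7 5A 30

# Byte offset at which each character starts.
starts = []
offset = 0
for ch in text:
    starts.append(offset)
    offset += len(ch.encode("utf-8"))

def snap_forward(data, pos):
    """Move pos forward until it no longer points at a UTF-8 continuation byte."""
    # Continuation bytes have the bit pattern 10xxxxxx.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

print(starts)
print([snap_forward(data, pos) for pos in (2, 3, 4)])
```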

This behavior is correct now, following the issue that I created about this problem. Refer to this issue !

Now, I think that you’re right regarding your assumption about the two chars counted for a character outside the BMP : this really has something to do with the two 16-bit code units of the surrogate mechanism !

For instance :

To match the 🎷 character with a regex, use its surrogate pair \x{D83C}\x{DFB7} ( as we cannot use its complete hexadecimal code \x{1F3B7} )
The general regex (?-s).[\x{D800}-\x{DFFF}] finds any character over the BMP ( Basic Multilingual Plane ), so with code-point over \x{FFFF}
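
For reference, the surrogate pair of any character outside the BMP can be derived from its code point with the standard UTF-16 formula. The following plain Python 3 sketch does so for U+1F3B7; the function name surrogate_pair is just an illustrative choice.

```python
def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates of a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

high, low = surrogate_pair(0x1F3B7)     # the 🎷 character

# Build the search expression in the \x{....} syntax used above.
regex = r"\x{%04X}\x{%04X}" % (high, low)
print(regex)
```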

BR

guy038