Hello, @Alan-kilborn and All,
Thanks, Alan, for your feedback !
Yes, I know that the Pos number, in the status bar, refers to the exact position (starting from 0), in the current file, of the first byte of the sequence needed to write a character, in a specific encoding ! For instance, the UTF-8 sequence of the 🎷 character, representing a saxophone, is the four-byte sequence F0 9F 8E B7. So, if you insert, in a new tab, the string A🎷Z0, you can jump, with the Search > Go to... feature, when the Offset radio button is set, to :
Pos 0, right before the A letter
Pos 1, right before the 🎷 character
Pos 5, right before the Z letter
Pos 6, right before the 0 digit
And, if you try the offsets 2, 3 or 4, which all fall inside the UTF-8 encoding of the 🎷 character, you just jump to right before the next character, the Z letter !
This behavior is now correct, following an issue that I created about this problem. Refer to this issue !
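To double-check the offsets listed above, here is a minimal Python sketch. It is not Notepad++ code : the helper names utf8_offsets and snap_forward are hypothetical, and snap_forward only models the behavior described above (an offset inside a multi-byte sequence moves forward to the start of the next character).

```python
def utf8_offsets(text):
    """Return (byte_offset, character) pairs for a string encoded as UTF-8."""
    offsets = []
    pos = 0
    for ch in text:
        offsets.append((pos, ch))
        pos += len(ch.encode("utf-8"))  # 1 to 4 bytes per character
    return offsets


def snap_forward(text, offset):
    """Hypothetical model of the jump: smallest character start >= offset."""
    total = len(text.encode("utf-8"))
    starts = [pos for pos, _ in utf8_offsets(text)] + [total]
    for start in starts:
        if start >= offset:
            return start
    return total  # offsets past the end are clamped to the end of the text


sample = "A\U0001F3B7Z0"  # A, saxophone, Z, 0

for pos, ch in utf8_offsets(sample):
    print(pos, ch, ch.encode("utf-8").hex(" ").upper())

for requested in (2, 3, 4):
    print(requested, "->", snap_forward(sample, requested))
```

The first loop lists the positions 0, 1, 5 and 6, and the second one shows that the offsets 2, 3 and 4 all land on position 5, right before the Z letter.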
Now, I think that you’re right regarding your assumption about the two bytes used by a character outside the BMP : this really has something to do with the two 16-bit code units of the surrogate mechanism !
For instance :
The regex to get the 🎷 character uses its surrogate pair \x{D83C}\x{DFB7} ( as we cannot use its complete hexadecimal code \x{1F3B7} )
The general regex (?-s).[\x{D800}-\x{DFFF}] finds any character outside the BMP ( Basic Multilingual Plane ), so with a code-point over \x{FFFF}
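For anyone who wants to derive the surrogate pair of another character, here is a small Python sketch of the standard UTF-16 computation. The function names surrogate_pair and as_regex are only illustrative, and the output format simply mimics the \x{....} syntax used above.

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP have a surrogate pair")
    value = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (value >> 10)      # top 10 bits
    low = 0xDC00 + (value & 0x3FF)     # bottom 10 bits
    return high, low


def as_regex(code_point):
    """Format the pair with the \\x{....} syntax (illustrative helper)."""
    high, low = surrogate_pair(code_point)
    return "\\x{%04X}\\x{%04X}" % (high, low)


print(as_regex(0x1F3B7))  # the saxophone character
```

For the code-point 1F3B7, this gives D83C for the high surrogate and DFB7 for the low one, which matches the pair used in the regex above.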
BR
guy038