Another encoding issue



  • So, encoding has been a theme here recently in the Community.
    Or, maybe it is just me bringing it up all the time. :-)
    I’m not even working with obscure encodings, just UTF-8 (no BOM).

    Here’s the latest:

    Confirm my encoding setting:

    6415c532-08e8-4302-885d-9d41269b7615-image.png

    I had some text I was working on, which contained this UTF-8 character ➤ :

    d66bf2d3-fbd7-485a-92ae-55df46d34dda-image.png

    And I used the arrow key to caret up from the http line so that I could put some text after the “right arrow” on the line above, and what I was seeing changed to:

    020cab4f-bd02-49a2-b460-938775b06ac7-image.png

    I can’t say for sure much more than that.
    I could undo to get the original text back for the screenshot, but I couldn’t say what my actions were before this happened.
    Meaning, I don’t know what position I was at on line 247 when I pressed up-arrow and got the weirdness.
    (And no matter what I’ve tried, I can’t reproduce it.)

    But, I know that the “x” position on a line is retained sometimes so that when a move up or down is made, the x position can be maintained. Perhaps in this case this somehow caused the caret to end up in the middle of the UTF-8 character?

    This is with 7.9, BTW.



  • @Alan-Kilborn ,

    Not sure. I cannot get anything like that to happen.

    Even in this example:

     ➤ 
     x 
    
        ➤ 
        https://com
        ➤ 
        ➤ 
    

    64c34357-ada1-4ce6-a497-3435eb933a87-image.png
    … where the cursor is at offset 18, shown in red: if I try to go to offset 19, 20, or 21, it places the cursor after the ➤ (those are the three byte offsets for the UTF-8 encoding of that character), and 22 places it after the space.

    So I don’t know how you convinced it to break that character apart. Also, ➤ is U+27A4, so its three bytes should be 0xE2, 0x9E, 0xA4… so your screenshot shows that it took the two outer bytes, but the central byte is apparently missing.

    Ooh, that gave me a hint: in my example, if I go to offset 19 then hit DEL, it changed to
    9c3adeb1-fb67-4603-9e99-26f2d5f79240-image.png

    So I am guessing what happened is that you somehow got it to the central offset in the multi-byte character and deleted it – though for me, UNDO works to fix that. So maybe you triggered a script which deleted the byte from the character but plays with the UNDO history so UNDO didn’t work.
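
    For anyone who wants to check the byte arithmetic outside Notepad++, here is a minimal Python sketch. It only illustrates the UTF-8 encoding of ➤ and what is left when the central byte is dropped; it is not a reproduction of what the editor does internally, and the helper names `delete_byte` and `is_valid_utf8` are made up for this example.

```python
# Illustration only: standalone Python, not Notepad++ or Scintilla code.
ARROW = "\u27a4"  # the character ➤ (U+27A4)

# UTF-8 encoding of the character: three bytes.
encoded = ARROW.encode("utf-8")
print(encoded.hex(" "))  # e2 9e a4


def delete_byte(data: bytes, offset: int) -> bytes:
    """Return a copy of data with the single byte at offset removed (hypothetical helper)."""
    return data[:offset] + data[offset + 1:]


def is_valid_utf8(data: bytes) -> bool:
    """Report whether data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Dropping the central byte (index 1) leaves only the two outer bytes.
damaged = delete_byte(encoded, 1)
print(damaged.hex(" "))  # e2 a4

print(is_valid_utf8(encoded))  # True
print(is_valid_utf8(damaged))  # False
```

    The two remaining bytes no longer form a valid UTF-8 sequence, which would be consistent with the editor showing something other than the original character.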



  • @PeterJones

    So first, thanks for your thoughts and your experimentation.

    maybe you triggered a script which deleted the byte from the character

    I suppose this IS possible, although I don’t have any scripts that do any deleting when I’m just careting around. :-)

    It seems like even scripts should be somewhat insulated from getting into the middle of multibyte characters. Sure, it should be possible for those that need it (not me), but I pretty much always want to deal with things at a character level. So such a thing should be “difficult” to have happen, if truly caused by a script.

    But, alas, I don’t have more data on this, so that’s the end of conclusions.

    Regarding “character level”, it is a bit disturbing to me that Notepad++ allows the user to jump to an offset right in the middle of a multibyte character. Again, I would expect to be restricted to the character level by this.
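
    To make "restricted to the character level" concrete, here is a rough Python sketch of how a byte offset could be snapped to a character boundary in UTF-8 text. This is an assumption about one possible approach, not how Notepad++ or Scintilla actually implements it; `is_continuation_byte` and `snap_forward` are hypothetical names.

```python
# Illustration only: one possible way to keep an offset on a character boundary.

def is_continuation_byte(b: int) -> bool:
    """UTF-8 continuation bytes have the bit pattern 10xxxxxx."""
    return (b & 0xC0) == 0x80


def snap_forward(data: bytes, offset: int) -> int:
    """Move offset forward until it no longer points at a continuation byte."""
    offset = max(0, min(offset, len(data)))  # clamp into the valid range
    while offset < len(data) and is_continuation_byte(data[offset]):
        offset += 1
    return offset


# Bytes: 'x'(0) ' '(1) E2(2) 9E(3) A4(4) ' '(5) 'y'(6)
line = "x \u27a4 y".encode("utf-8")

print(snap_forward(line, 2))  # 2: already at the start of the arrow
print(snap_forward(line, 3))  # 5: was inside the arrow, moved past it
print(snap_forward(line, 4))  # 5: was inside the arrow, moved past it
```

    With a check like this in front of any "go to offset" or delete operation, an offset in the middle of a multibyte character would never be acted on directly.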



  • Hello, @alan-kilborn, @peterjones and All,

    So I created a new issue about this disturbing behavior ;-))

    Best regards,

    guy038

