
    Go To... offset ignores BOM

    13 Posts 4 Posters 7.1k Views
    • tictoc wareT
      tictoc ware
      last edited by

      First, thank you guys for a great, great text editor. Brilliant work.

      Now, working with UTF-8 files, I noticed that “Go To…/Offset” does not take into account the leading 3-byte BOM (EF BB BF).

      The simplest example is to create a UTF-8 BOM file and type “ab”. Then bring up the “Go To…” dialog and type 2 (as offset). This puts the caret behind the “b”. It shouldn’t. Offset-wise, it should remain before the “a”.
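
To make the example concrete, here is a minimal Python sketch (not Notepad++ code) of what such a file looks like on disk compared with what the editor shows. The variable names are illustrative only.

```python
import codecs

# File contents as written to disk: the UTF-8 BOM followed by the text "ab".
file_bytes = codecs.BOM_UTF8 + "ab".encode("utf-8")

print(file_bytes.hex(" "))            # ef bb bf 61 62
print(len(file_bytes))                # 5 bytes on disk

# What the editor shows once the BOM has been consumed on load.
buffer_text = file_bytes.decode("utf-8-sig")
print(buffer_text, len(buffer_text))  # ab 2

# File offset 2 is still inside the BOM; "a" only starts at file offset 3.
print(file_bytes[2:3].hex())          # bf
print(file_bytes.index(b"a"))         # 3
```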

      I get that it might be tricky to implement correctly, but ignoring the BOM ignores bytes that affect the true offset in the file.

      (Note that the “length” indicator in the status bar does not report the true size of the file either.)

      Thank you all.

      • Scott SumnerS
        Scott Sumner @tictoc ware
        last edited by

        @tictoc-ware

        Interesting. I think if Notepad++ were a low-level text editor (discounting the HexEditor plugin, which, despite much trying, I’ve never been able to make work), I’d be inclined to agree with you.

        Since it is NOT that, the argument can be made that once a file is opened and the BOM bytes are “consumed” for their intended purpose, what remains is the entire data content of the file. With that perspective, the offset and length seem correct…

        • tictoc wareT
          tictoc ware
          last edited by

          Perhaps the “Go to…” dialog could have a checkbox for the BOM. As it is, the offset is inaccurate. I mean, the offset option is there for a reason and I assume it means offset in the file. I could be wrong, of course.

          • gstaviG
            gstavi
            last edited by

            You should not be able to edit BOM. BOM is something the editor adds on save.
            You also don’t see BOM with ‘View -> Show Symbol -> …’ which is OK.
            ‘Offset’ is an offset in symbols, not bytes. Whether an encoding takes 1 or 4 bytes to encode a specific symbol does not matter; the offset will still be 1.

            • tictoc wareT
              tictoc ware @gstavi
              last edited by

              @gstavi said:

              You should not be able to edit BOM. BOM is something the editor adds on save.
              You also don’t see BOM with ‘View -> Show Symbol -> …’ which is OK.
              ‘Offset’ is an offset in symbols, not bytes. Whether an encoding takes 1 or 4 bytes to encode a specific symbol does not matter; the offset will still be 1.

              I’m not talking about editing the BOM, just accounting for it in the ‘Go to…/offset’.

              Thank you for pointing out that ‘length’ does not report file size at all. I was getting all kinds of errors.

              • Claudia FrankC
                Claudia Frank
                last edited by

                Hello,

                I partly agree and partly disagree.
                AFAIK, offset does take bytes into account, but the caret obviously cannot always show this “correctly”, because only one character-width space is reserved on screen even though the character itself may use more bytes for its representation.

                Length is the length, in bytes, of the buffer loaded into the Scintilla view.
                So it may match the file length, or not, as in the case of files saved with a BOM.

                See example

                Cheers
                Claudia

                • gstaviG
                  gstavi @Claudia Frank
                  last edited by

                  @Claudia-Frank

                  If offset or length represented bytes then

                  • For UTF16 encodings every 2 successive offsets would jump to the same location.
                  • Changing encoding from UTF8 to UTF16 would change the length.

                  This is not the case.

                  View -> summary shows file length in bytes (and only for saved files).
                  It seems that Length and Offset count Unicode symbols, which in my opinion is the correct thing to do.
                  An editor deals with symbols. Its internal encoding for each symbol is irrelevant. On save it should re-encode the document content into a properly encoded file, and on load do the reverse. These are the only times when the BOM is relevant.

                  • Claudia FrankC
                    Claudia Frank @gstavi
                    last edited by

                    @gstavi

                    I do see your points, and I wasn’t aware that this behaves differently from UTF-8 encoded files.
                    My first example shows a file length of 15 but only 13 visible symbols.
                    And knowing that the Scintilla documentation states that

                    SCI_GETTEXTLENGTH → int
                    SCI_GETLENGTH → int
                    Both these messages return the length of the document in bytes.
                    

                    I was under the impression that this is the case for all documents.

                    Here is another example.
                    UTF-8 encoded text is aßz
                    HexEditor shows 4 bytes
                    Length shows 4
                    Summary shows 4
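
The numbers in this example can be reproduced with a short Python sketch. It only illustrates how the same three symbols yield different counts depending on what is being counted; it makes no claim about how Notepad++ computes its status bar values.

```python
text = "aßz"

code_points = len(text)                    # symbols as the user sees them
utf8_len = len(text.encode("utf-8"))       # ß needs 2 bytes in UTF-8
utf16_len = len(text.encode("utf-16-le"))  # 2 bytes per symbol here
utf8_bom_len = len(text.encode("utf-8-sig"))  # UTF-8 plus the 3-byte BOM

print(code_points, utf8_len, utf16_len, utf8_bom_len)  # 3 4 6 7
```

A length of 4 for this text is therefore consistent with a byte count of the UTF-8 form, not with a count of symbols.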

                    So it looks like npp does something under the hood, such as encoding/decoding, for certain encodings but not all. Or we have identified a bug.

                    Will try to find out what exactly is going on.

                    Cheers
                    Claudia

                    • tictoc wareT
                      tictoc ware
                      last edited by

                      Thank you, Claudia. I agree a single solution is unclear. It might come down to providing two ways of looking at the file contents, such as the BOM (or “true byte”) option I mentioned for ‘Go to…/offset’. The status bar could also show data for both views (symbols + bytes).
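
A rough sketch of what such a BOM option would have to compute, assuming a UTF-8 file with a 3-byte BOM whose remaining bytes are held unchanged in the editor buffer. Both function names and the `bom_len` parameter are hypothetical; nothing like this is known to exist in Notepad++.

```python
def file_to_buffer_offset(file_offset: int, bom_len: int = 3) -> int:
    """Map a byte offset in the file to a byte offset in the BOM-less buffer.

    Offsets that fall inside the BOM are clamped to the start of the buffer.
    """
    return max(0, file_offset - bom_len)


def buffer_to_file_offset(buffer_offset: int, bom_len: int = 3) -> int:
    """Map a byte offset in the buffer back to a byte offset in the file."""
    return buffer_offset + bom_len


# File "EF BB BF 61 62": file offset 2 lands before "a" (buffer offset 0),
# and the position after "b" (buffer offset 2) is file offset 5.
print(file_to_buffer_offset(2))  # 0
print(buffer_to_file_offset(2))  # 5
```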

                      • gstaviG
                        gstavi
                        last edited by

                        Did you try pasting aßz, going to offset 2, and pressing Delete? I don’t think that this is a useful feature.

                        For the record I was not even aware of the ‘offset’ option in the ‘goto’ dialog and I don’t find it very useful.
                        I guess the current state is a mess of half-baked definitions (from a time dominated by ANSI) and a lacking implementation that fails in the Unicode era.
                        I still think that the guidelines I described above are the correct way to implement it. User sees and edits symbols, not bytes.

                        • Scott SumnerS
                          Scott Sumner @gstavi
                          last edited by

                          @gstavi said:

                          ‘offset’ option in the ‘goto’ dialog and I don’t find it very useful.

                          This feature of the Goto dialog can be useful to Pythonscript programmers (and probably also Luascript and Plugin (Scintilla) programmers) that often need to deal with “position” in a document. Not so much as a tool to change the current position, but as a way to see what the current caret position is during test/debug of code that works with position…

                          • gstaviG
                            gstavi @Scott Sumner
                            last edited by

                            @Scott-Sumner

                            Obviously any user uses his own subset of features.

                            I went through some code I wrote once upon a time to refresh my memory.
                            As far as I can tell, Scintilla’s logical view of the document is an array of bytes that holds the UTF-8 encoding of the text. For each line number, Scintilla knows its current start byte offset into the “array”.
                            This approach is simple and flexible, but it demands a lot of attention from anyone using it who cares about Unicode.
                            Scintilla will not protect anyone from placing the caret between bytes that compose a single UTF8 symbol.
                            So I guess that length and offset displayed by NPP are actually byte offsets for UTF-8 encoding, regardless of the encoding in which the file is written.
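
The byte-array view described above can be illustrated with a small Python sketch. It is not Scintilla code; the two helper names are made up for this example. It shows why a byte offset can land in the middle of a symbol and how a byte offset relates to a symbol count.

```python
def is_char_boundary(buf: bytes, offset: int) -> bool:
    """Return True if offset does not fall inside a multi-byte UTF-8 sequence."""
    if offset <= 0 or offset >= len(buf):
        return True
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (buf[offset] & 0xC0) != 0x80


def byte_to_char_offset(buf: bytes, offset: int) -> int:
    """Count the symbols whose first byte lies before the given byte offset."""
    return sum(1 for b in buf[:offset] if (b & 0xC0) != 0x80)


buf = "aßz".encode("utf-8")          # bytes: 61 c3 9f 7a

print(is_char_boundary(buf, 1))      # True: between "a" and "ß"
print(is_char_boundary(buf, 2))      # False: between the two bytes of "ß"
print(byte_to_char_offset(buf, 3))   # 2: "a" and "ß" start before byte 3
```

Byte offset 2 in aßz is exactly the position that splits ß, which is why going to that offset and deleting behaves oddly.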

                            I still think it is a confusing choice but these definitions should be made by people who actually use this feature.

                            • Claudia FrankC
                              Claudia Frank
                              last edited by

                              Here is what I think I have found out so far.

                              First, npp tries to detect whether the file has a BOM. If it does,
                              it strips the BOM signature and continues reading the file, converted to UTF-8.

                              If it doesn’t, it calls the chardet library to see which code page to use.
                              When chardet returns, npp checks whether the result is UTF-8; if so, it goes on reading the file …
                              if not, it converts the file to UTF-8.

                              But this, of course, only happens “virtually”, for the Scintilla control.
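
The loading steps above can be sketched in Python as follows. This is only an approximation of the described flow, not the actual Notepad++ source: the list of BOMs is an assumption, and `detect` is a hypothetical stand-in for the chardet call that simply returns an encoding name.

```python
import codecs

# (BOM, codec) pairs checked on load. Which BOMs npp recognises is assumed here.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def load_as_utf8(raw: bytes, detect=lambda data: "utf-8") -> bytes:
    """Return the UTF-8 bytes that would be handed to the edit control."""
    # Step 1: a BOM decides the encoding and is stripped from the content.
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(codec).encode("utf-8")
    # Step 2: no BOM, so ask the (hypothetical) detector for an encoding.
    encoding = detect(raw)
    if encoding.lower() in ("utf-8", "utf8"):
        return raw  # already UTF-8, use as-is
    # Step 3: anything else is converted to UTF-8 in memory only.
    return raw.decode(encoding).encode("utf-8")


print(load_as_utf8(codecs.BOM_UTF8 + b"ab"))  # b'ab'
```

In this sketch the file on disk is never changed; only the in-memory bytes differ from it, which matches the “virtual” conversion described above.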

                              Cheers
                              Claudia
