
    Go To... offset ignores BOM

    13 Posts 4 Posters 7.1k Views
    • Scott Sumner @tictoc ware

      @tictoc-ware

      Interesting. I think if Notepad++ were a low-level text editor (discounting the HexEditor plugin, which, despite much trying, I’ve never been able to make work), I’d be inclined to agree with you.

      Since it is NOT that, the argument can be made that once a file is opened and the BOM is “consumed” for its intended purpose, what remains is the entire data content of the file. With that perspective, the offset and length seem correct…
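      (A minimal Python sketch of that idea, for illustration only and not Notepad++’s actual code: check the leading bytes of the file against the known BOM signatures and drop them before treating the rest as document content. The file name is hypothetical.)

          import codecs

          # Known BOM signatures, longest first so the UTF-32 LE BOM is not
          # mistaken for the UTF-16 LE one it begins with.
          BOMS = [
              (codecs.BOM_UTF32_LE, "utf-32-le"),
              (codecs.BOM_UTF32_BE, "utf-32-be"),
              (codecs.BOM_UTF8, "utf-8"),
              (codecs.BOM_UTF16_LE, "utf-16-le"),
              (codecs.BOM_UTF16_BE, "utf-16-be"),
          ]

          def consume_bom(raw):
              """Return (detected_encoding_or_None, content bytes without the BOM)."""
              for bom, name in BOMS:
                  if raw.startswith(bom):
                      return name, raw[len(bom):]
              return None, raw

          with open("example.txt", "rb") as f:    # hypothetical file
              encoding, content = consume_bom(f.read())
          # 'content' is everything that remains once the BOM has served its
          # purpose of identifying the encoding.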

      • tictoc ware

        Perhaps the “Go to…” dialog could have a checkbox for the BOM. As it is, the offset is inaccurate. I mean, the offset option is there for a reason and I assume it means offset in the file. I could be wrong, of course.

        • gstavi

          You should not be able to edit the BOM. The BOM is something the editor adds on save.
          You also don’t see the BOM with ‘View -> Show Symbol -> …’, which is OK.
          ‘Offset’ is the offset in symbols, not bytes. Whether an encoding takes 1 or 4 bytes to encode a specific symbol does not matter; the offset will still be 1.
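          (A quick Python 3 sketch, purely illustrative, of the distinction being drawn here: indexing and counting by symbol is stable, while byte counts depend on the encoding.)

              # Character (symbol) indexing vs. byte counts.
              text = "a€z"                          # '€' is a single symbol

              print(text[1])                        # €  -> symbol offset 1, whatever the encoding
              print(len(text))                      # 3  -> three symbols

              print(len(text.encode("utf-8")))      # 5  -> '€' needs 3 bytes in UTF-8
              print(len(text.encode("utf-16-le")))  # 6  -> each of these chars needs 2 bytes in UTF-16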

          • tictoc ware @gstavi

            @gstavi said:

            You should not be able to edit the BOM. The BOM is something the editor adds on save.
            You also don’t see the BOM with ‘View -> Show Symbol -> …’, which is OK.
            ‘Offset’ is the offset in symbols, not bytes. Whether an encoding takes 1 or 4 bytes to encode a specific symbol does not matter; the offset will still be 1.

            I’m not talking about editing the BOM, just accounting for it in the ‘Go to…/offset’.

            Thank you for pointing out that ‘length’ does not report file size at all. I was getting all kinds of errors.

            • Claudia Frank

              Hello,

              I have to partially agree and disagree.
              Afaik, offset takes bytes into account but sometimes can’t
              display it “correctly”, as only one “char-width” of space is reserved
              even though the char itself uses more bytes for its representation.

              And length is the length of the buffer loaded into the scintilla view, in bytes.
              So it might match the file length as well, or not, in the case of BOM-encoded files.

              See example

              Cheers
              Claudia

              • gstavi @Claudia Frank

                @Claudia-Frank

                If offset or length represented bytes, then:

                • For UTF-16 encodings, every 2 successive offsets would jump to the same location.
                • Changing the encoding from UTF-8 to UTF-16 would change the length.

                This is not the case.

                View -> Summary shows the file length in bytes (and only for saved files).
                It seems that Length and Offset count Unicode symbols, which in my opinion is the correct thing to do.
                An editor deals with symbols. Its internal encoding for each symbol is irrelevant. On save it should re-encode the document content into a properly encoded file, and on load do the reverse. These are the only times where the BOM is relevant.
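                (For illustration only, a tiny Python sketch of that load/edit/save model; "utf-8-sig" is Python's name for UTF-8 with a BOM, and the file names are hypothetical.)

                    # Load: decode bytes into symbols; "utf-8-sig" strips a UTF-8 BOM if present.
                    with open("in.txt", "r", encoding="utf-8-sig") as f:
                        text = f.read()        # pure symbols, no BOM

                    # ... the editor works on 'text' (symbols only) here ...

                    # Save: re-encode the symbols; the codec writes the BOM back out.
                    with open("out.txt", "w", encoding="utf-8-sig") as f:
                        f.write(text)
                    # The BOM exists only at these two boundaries, never in the edited text.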

                • Claudia Frank @gstavi

                  @gstavi

                  I do see your points, and I wasn’t aware that this is different for utf-8 encoded files.
                  My first example shows that the file length is 15 but there are only 13 visible symbols.
                  And knowing that scintilla states:

                  SCI_GETTEXTLENGTH → int
                  SCI_GETLENGTH → int
                  Both these messages return the length of the document in bytes.
                  

                  I was under the impression that this is the case for all documents.

                  Here is another example.
                  UTF-8 encoded text is aßz
                  HexEditor shows 4 bytes
                  Length shows 4
                  Summary shows 4
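                  (The byte math behind that, as a small Python 3 check, purely for illustration: ‘ß’ takes two bytes in UTF-8, so 3 symbols come out as 4 bytes.)

                      text = "aßz"

                      print(len(text))                    # 3 -> three visible symbols
                      print(len(text.encode("utf-8")))    # 4 -> 'ß' (U+00DF) is 2 bytes in UTF-8
                      print(text.encode("utf-8").hex())   # 61c39f7a -- what the HexEditor shows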

                  So it looks like npp is doing something under the hood, like encoding/decoding for certain encodings
                  but not all. Or we have identified a bug.

                  Will try to find out what exactly is going on.

                  Cheers
                  Claudia

                  • tictoc ware

                    Thank you, Claudia. I agree that a single solution is unclear. It might come down to providing two ways of looking at the file contents, as I mentioned: a BOM (or “true byte”) option in the ‘Go to…/offset’ dialog. The status bar could also show data for both views (symbols + bytes).

                    • gstavi

                      Did you try pasting aßz, going to offset 2, and deleting? I don’t think that this is a useful feature.

                      For the record, I was not even aware of the ‘offset’ option in the ‘goto’ dialog, and I don’t find it very useful.
                      I guess the current state is a mess of half-baked definitions (from a time dominated by ANSI) and a lacking implementation that fails in the Unicode era.
                      I still think that the guidelines I described above are the correct way to implement it: the user sees and edits symbols, not bytes.

                      • Scott Sumner @gstavi

                        @gstavi said:

                        ‘offset’ option in the ‘goto’ dialog and I don’t find it very useful.

                        This feature of the Goto dialog can be useful to PythonScript programmers (and probably also LuaScript and plugin (Scintilla) programmers) who often need to deal with “position” in a document. Not so much as a tool to change the current position, but as a way to see what the current caret position is while testing/debugging code that works with positions…
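                        (For example, a rough snippet along those lines, assuming the PythonScript plugin is installed; it just echoes the current caret position so it can be compared with the Goto dialog’s offset.)

                            # Run from Plugins > Python Script (the 'editor' and 'console'
                            # objects are provided by the PythonScript plugin).
                            pos = editor.getCurrentPos()         # caret position as Scintilla reports it
                            line = editor.lineFromPosition(pos)  # 0-based line holding the caret

                            console.show()
                            console.write("caret position: %d (line %d)\n" % (pos, line + 1))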

                        • gstavi @Scott Sumner

                          @Scott-Sumner

                          Obviously any user uses his own subset of features.

                          I went through some code I wrote once upon a time to refresh my memory.
                          As far as I can tell, Scintilla’s logical view of the document is an array of bytes that holds the UTF-8 encoding of the text. For each line number, Scintilla knows its current start byte offset into the “array”.
                          This approach is simple and flexible, but it demands a lot of attention from anyone who uses it and cares about Unicode.
                          Scintilla will not protect anyone from placing the caret between bytes that compose a single UTF-8 symbol.
                          So I guess the length and offset displayed by NPP are actually byte offsets in the UTF-8 encoding, regardless of the encoding in which the file is written.
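                          (A plain Python illustration of that hazard, not Scintilla code: with the document held as UTF-8 bytes, a byte offset can land in the middle of one symbol.)

                              # The document as a Scintilla-style UTF-8 byte array (illustrative only).
                              doc = "aßz".encode("utf-8")   # b'a\xc3\x9fz' -- 4 bytes, 3 symbols

                              print(doc[:2])                # b'a\xc3' -- cutting at offset 2 splits 'ß'
                              try:
                                  doc[:2].decode("utf-8")   # fails: the two bytes of 'ß' were cut apart
                              except UnicodeDecodeError as e:
                                  print("invalid cut:", e)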

                          I still think it is a confusing choice but these definitions should be made by people who actually use this feature.

                          • Claudia Frank

                            What I think I have found out so far is the following:

                            First, npp tries to detect whether the file has a BOM. If it does,
                            it strips the BOM signature and continues reading the file, converted to utf-8.

                            If it isn’t a BOM file, it calls the chardet library to see which codepage to use.
                            If chardet returns a result, npp checks whether it is reported to be utf-8 -> if so, go on reading the file …
                            if not, convert it to utf-8.

                            But this, of course, happens only “virtually”, for the scintilla control.
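                            (A rough Python sketch of that flow as I understand it, not npp’s actual code; it assumes the third-party chardet package and falls back to latin-1 when detection returns nothing.)

                                import codecs
                                import chardet   # third-party package: pip install chardet

                                def load_as_internal_text(path):
                                    with open(path, "rb") as f:
                                        raw = f.read()

                                    # 1. BOM detection: strip the signature and decode accordingly.
                                    for bom, enc in ((codecs.BOM_UTF8, "utf-8"),
                                                     (codecs.BOM_UTF16_LE, "utf-16-le"),
                                                     (codecs.BOM_UTF16_BE, "utf-16-be")):
                                        if raw.startswith(bom):
                                            return raw[len(bom):].decode(enc)

                                    # 2. No BOM: ask chardet which codepage to use.
                                    guess = chardet.detect(raw)["encoding"] or "latin-1"
                                    return raw.decode(guess, errors="replace")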

                            Cheers
                            Claudia
