    Find BOMs

    • Meta Chuh moderator @Alan Kilborn

      @Alan-Kilborn

      i am bom, i am bom ;-)

      • Brigham Narins @Meta Chuh

        Thanks @Meta-Chuh. And thanks @Alan-Kilborn. I really appreciate your interest in this.

        @Meta-Chuh said:

        i only found some ps, batch, python scripts that list all bom files externally, but you have probably seen them as well (stackoverflow)

        I did see those, yes. Ideally I’d like to come up with a solution inside Notepad++, because these outside scripts and such seem to call for expertise and programs I don’t have.

        ps: if you are faster in implementing something like this, please share it.
        it would be an enrichment.

        I’ll do my best and keep you posted, but I came to you for enrichment and enlightenment! :)

        • Alan Kilborn @Brigham Narins

          @Brigham-Narins said:

          I’d like to come up with a solution inside Notepad++

          I understand why you’d want this. My earlier comment was intended to mean that I believe the BOM stuff is “consumed” when a file is opened, and thus isn’t “obtainable” later. I haven’t done any investigation, so I could be totally wrong about this…

          As for “inside Notepad++”: I’m sure you could write a PythonScript that opens files in binary mode and detects the BOM. That may or may not qualify as “inside Notepad++”, and of course it might be more effort than you were hoping to put in…

          • Alan Kilborn

            I’m waiting for a Python program to do its work, so I started playing. Here’s a PythonScript that does what I mentioned, operating on all files currently open within Notepad++. It seemed to work in the little bit of testing I did with it.

            from Npp import notepad

            # Check the first bytes on disk of every file open in Notepad++.
            for (filename, bufferID, index, view) in notepad.getFiles():
                try:
                    with open(filename, 'rb') as inf:
                        data_at_start_of_file = inf.read(3)
                except IOError:
                    continue  # e.g. an unsaved tab that has no file on disk yet
                if data_at_start_of_file.startswith(b'\xEF\xBB\xBF'):
                    print(filename + ': found utf-8 bom')
                elif data_at_start_of_file.startswith(b'\xFE\xFF'):
                    print(filename + ': found ucs-2 big endian bom')
                elif data_at_start_of_file.startswith(b'\xFF\xFE'):
                    print(filename + ': found ucs-2 little endian bom')
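The same byte check also works outside Notepad++, on files that are not open in the editor. A minimal sketch, assuming plain Python 3; the helper name `find_bom_files` is made up for illustration and does not come from the posts above:

```python
import os

# BOM signatures discussed in this thread, longest first.
SIGNATURES = (
    (b'\xEF\xBB\xBF', 'utf-8 bom'),
    (b'\xFE\xFF', 'ucs-2 big endian bom'),
    (b'\xFF\xFE', 'ucs-2 little endian bom'),
)

def find_bom_files(root):
    """Return (path, label) for every file under root that starts with a BOM."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, 'rb') as f:
                    head = f.read(3)
            except OSError:
                continue  # unreadable file: skip it
            for signature, label in SIGNATURES:
                if head.startswith(signature):
                    hits.append((path, label))
                    break
    return hits
```

Calling `find_bom_files('.')` and printing each `path + ': found ' + label` gives a listing in the same style as the script above.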
            
            • guy038

              Hello, @brigham_narins, @meta-chuh, @alan-kilborn and All,

              To answer your question simply: among all files created from within N++, the files that have a BOM (Byte Order Mark) are:

              • Files with UTF8-BOM encoding, which have a 3-byte invisible BOM (EF BB BF)

              • Files with UCS-2 BE BOM encoding, which have a 2-byte invisible BOM (FE FF)

              • Files with UCS-2 LE BOM encoding, which have a 2-byte invisible BOM (FF FE)

              In all the other encodings, there is no BOM!
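For reference, Python's standard `codecs` module exposes these same three byte sequences as constants. A small sketch in plain Python; the dictionary name `NPP_BOMS` and the helper `describe` are made up for illustration, and the keys simply reuse the encoding names from the list above:

```python
import codecs

# Encoding name -> BOM bytes written at the start of the file.
NPP_BOMS = {
    'UTF8-BOM': codecs.BOM_UTF8,          # EF BB BF
    'UCS-2 BE BOM': codecs.BOM_UTF16_BE,  # FE FF
    'UCS-2 LE BOM': codecs.BOM_UTF16_LE,  # FF FE
}

def describe(bom):
    """Format BOM bytes as space-separated upper-case hex."""
    return ' '.join('{:02X}'.format(b) for b in bytearray(bom))

for name, bom in NPP_BOMS.items():
    print('{}: {} ({} bytes)'.format(name, describe(bom), len(bom)))
```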


              Here is another way to verify the presence of a BOM:

              • Click on the View > Summary... menu option

              • Calculate the difference: File length (in byte) - Current document length

              The result is the BOM length, which should be 2 or 3 bytes, depending on the file encoding.
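The arithmetic behind that difference can be checked outside Notepad++ for the UTF-8 case. A rough sketch in plain Python 3, assuming the document length is counted in UTF-8 bytes with the BOM already stripped (an assumption about what the Summary dialog reports, not something confirmed in this thread); `bom_length_by_difference` is a made-up helper name:

```python
import os
import tempfile

def bom_length_by_difference(path):
    """File size on disk minus the UTF-8 length of the text without its BOM."""
    file_length = os.path.getsize(path)
    with open(path, 'rb') as f:
        raw = f.read()
    # 'utf-8-sig' drops a leading BOM if present, otherwise decodes as UTF-8.
    text = raw.decode('utf-8-sig')
    document_length = len(text.encode('utf-8'))
    return file_length - document_length

# Demo: write a small UTF-8 file with a BOM and measure the difference.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(b'\xEF\xBB\xBF' + u'caf\u00e9\n'.encode('utf-8'))
print(bom_length_by_difference(demo_path))  # 3
os.remove(demo_path)
```

A UTF-8 file without a BOM gives 0 with the same function.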

              Best Regards,

              guy038

              • PeterJones

                @Alan-Kilborn said:

                Here’s a PythonScript that does what I mentioned, operating on all files currently open within Notepad++.

                Thanks for that framework. I wanted to see whether the Scintilla buffer contains the BOM or whether it is filtered out before then. Building on your framework, I added some Scintilla-buffer editor.xxx commands and found that no, the BOM is not in the Scintilla buffer:

                firstBufferID = notepad.getCurrentBufferID()
                for (filename, bufferID, index, view) in notepad.getFiles():
                    # read the first bytes of the file as stored on disk
                    with open(filename, 'rb') as inf:
                        data_at_start_of_file = inf.read(3)
                    if data_at_start_of_file.startswith(b'\xEF\xBB\xBF'):
                        console.write(filename + ': found utf-8 bom\n')
                    elif data_at_start_of_file.startswith(b'\xFE\xFF'):
                        console.write(filename + ': found ucs-2 big endian bom\n')
                    elif data_at_start_of_file.startswith(b'\xFF\xFE'):
                        console.write(filename + ': found ucs-2 little endian bom\n')

                    # addendum: dump the first three characters of the Scintilla buffer
                    notepad.activateBufferID(bufferID)
                    text = editor.getText()  # renamed from str to avoid shadowing the builtin
                    console.write('buffer: length = {}\n'.format(len(text)))
                    for i in range(min(3, len(text))):  # guard against very short buffers
                        console.write('\t#{}: {} => {}\n'.format(i, text[i], ord(text[i])))

                # switch back to the tab that was active at the start
                notepad.activateBufferID(firstBufferID)
                

                Which results in:

                C:\Users\peter.jones\...\Peter's Scratchpad.md: found ucs-2 little endian bom
                buffer: length = 10861
                    #0: ~ => 126
                    #1: ~ => 126
                    #2: ~ => 126
                C:\usr\local\apps\notepad++\plugins\Config\PythonScript\scripts\NppForumPythonScripts\17244-utf-bom-reader.py: found utf-8 bom
                buffer: length = 1513
                    #0: # => 35
                    #1:   => 32
                    #2: e => 101
                

                (And no, normally my scratchpad is in UTF8-BOM, not in UCS-2 LE BOM; I just changed its encoding temporarily to test out the other BOM detections.)
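These results suggest that the BOM is consumed while the file is loaded, so the text in the buffer starts with the first real character. Python's own codecs behave in a comparable way, which makes for a small illustration. This is an analogy in plain Python 3, not a description of how Notepad++ loads files:

```python
raw = b'\xEF\xBB\xBF# encoding test'

# 'utf-8-sig' treats a leading EF BB BF as a signature and drops it.
stripped = raw.decode('utf-8-sig')
# Plain 'utf-8' keeps it as the character U+FEFF.
kept = raw.decode('utf-8')

print(repr(stripped[0]))          # '#'
print(hex(ord(kept[0])))          # 0xfeff
print(len(kept) - len(stripped))  # 1
```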

                • Alan Kilborn @PeterJones

                  @PeterJones said:

                  and found that no, the BOM is not in the Scintilla buffer

                  …we’re back to what I postulated in the beginning: meta!

                  • Meta Chuh moderator @Alan Kilborn

                    @Alan-Kilborn

                    …we’re back to what I postulated in the beginning: meta!

                    yes … you were calling ? ;-)

                    • Alan Kilborn @Meta Chuh

                      @Meta-Chuh

                      LOL

                      Okay, that has me thinking…what does your username actually mean?

                      • Meta Chuh moderator @Alan Kilborn

                        @Alan-Kilborn

                        it’s my real name.
                        unfortunately our family has generations of such strange names.
                        my brothers for example are called pikachuh and raichuh.

                        here’s a family picture of us:

                        Imgur

                        😄

                        seriously: i got meta as a nickname ages ago. when i was little, i started to use anything for everything, beyond what specific items were originally intended or designed to be used for … and through the years, more and more of that actually started to work out, without anybody (including me) understanding why. 😉
