Find BOMs

    Meta Chuh moderator @Alan Kilborn
    last edited by Mar 6, 2019, 6:36 PM

    @Alan-Kilborn

    i am bom, i am bom ;-)

      Brigham Narins @Meta Chuh
      last edited by Mar 6, 2019, 8:17 PM

      Thanks @Meta-Chuh. And thanks @Alan-Kilborn. I really appreciate your interest in this.

      @Meta-Chuh said:

      i only found some ps, batch, python scripts that list all bom files externally, but you have probably seen them as well (stackoverflow)

      I did see those, yes. Ideally I’d like to come up with a solution inside Notepad++, because these outside scripts and such seem to call for expertise and programs I don’t have.

      ps: if you are faster in implementing something like this, please share it.
      it would be an enrichment.

      I’ll do my best and keep you posted, but I came to you for enrichment and enlightenment! :)

        Alan Kilborn @Brigham Narins
        last edited by Mar 6, 2019, 8:22 PM

        @Brigham-Narins said:

        I’d like to come up with a solution inside Notepad++

        I understand why you’d want this. My earlier comment was intended to mean that I believe the BOM stuff is “consumed” when a file is opened, and thus isn’t “obtainable” later. I haven’t done any investigation, so could be totally wrong about this…

        By “inside Notepad++”, I’m sure you could write a Pythonscript that could open files in binary and detect BOM. That may or may not qualify as “inside Notepad++” and of course might be more effort than you were hoping to put in…

          Alan Kilborn
          last edited by Mar 6, 2019, 8:51 PM

          I’m waiting for a Python program to do its work, so I started playing. Here’s a Pythonscript that does what I mentioned, operating on all files currently open within Notepad++. It seemed to work for the little bit of testing I did with it.

          for (filename, bufferID, index, view) in notepad.getFiles():
              # read the first three raw bytes from disk, where a BOM would sit
              with open(filename, 'rb') as inf:
                  data_at_start_of_file = inf.read(3)
              # startswith() also copes with files shorter than the BOM being tested
              if data_at_start_of_file.startswith(b'\xEF\xBB\xBF'):
                  print('{}: found utf-8 bom'.format(filename))
              elif data_at_start_of_file.startswith(b'\xFE\xFF'):
                  print('{}: found ucs-2 big endian bom'.format(filename))
              elif data_at_start_of_file.startswith(b'\xFF\xFE'):
                  print('{}: found ucs-2 little endian bom'.format(filename))
          
            guy038
            last edited by Mar 6, 2019, 9:07 PM

            Hello, @brigham_narins, @meta-chuh, @alan-kilborn and All,

            To answer your question simply: among all files created from within N++, the files that have a BOM (Byte Order Mark) are:

            • The files with UTF8-BOM encoding, which have a 3-byte invisible BOM ( EF BB BF )

            • The files with UCS-2 BE BOM encoding, which have a 2-byte invisible BOM ( FE FF )

            • The files with UCS-2 LE BOM encoding, which have a 2-byte invisible BOM ( FF FE )

            In all the other encodings, there is no BOM!


            Here is another way to verify the presence of a BOM:

            • Click on the View > Summary... menu option

            • Calculate the difference: File length (in byte) - Current document length

            You’ve just got the BOM length, which should be 2 or 3 bytes, depending on the file encoding.
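
The three signatures listed above can also be checked in plain Python, outside Notepad++. The following is a minimal sketch, not part of the original post: `detect_bom` and the `BOMS` table are hypothetical names, and the labels simply reuse the Notepad++ encoding names. The BOM constants come from the standard `codecs` module. The length it returns is the value the subtraction above is meant to produce; whether the Summary dialog yields exactly that number for UCS-2 files depends on how it counts the document length, which is not verified here.

```python
import codecs

# BOM signatures from the list above, labelled with the Notepad++ encoding names.
# (Assumption: only these three are of interest; a UTF-32 LE file also starts
# with FF FE and would be reported as UCS-2 LE by this sketch.)
BOMS = [
    (codecs.BOM_UTF8, 'UTF-8-BOM'),         # EF BB BF
    (codecs.BOM_UTF16_BE, 'UCS-2 BE BOM'),  # FE FF
    (codecs.BOM_UTF16_LE, 'UCS-2 LE BOM'),  # FF FE
]


def detect_bom(raw):
    """Return (encoding label, BOM length in bytes), or (None, 0) if no BOM."""
    for signature, label in BOMS:
        if raw.startswith(signature):
            return label, len(signature)
    return None, 0


print(detect_bom(b'\xef\xbb\xbfhello'))
print(detect_bom(b'\xfe\xff\x00h\x00i'))
print(detect_bom(b'hello'))
```

To test a file on disk, pass in its first few bytes, for example `detect_bom(open(path, 'rb').read(4))`, where `path` is whatever file you want to inspect.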

            Best Regards,

            guy038

              PeterJones
              last edited by Mar 6, 2019, 9:23 PM

              @Alan-Kilborn said:

              Here’s a Pythonscript that does what I mentioned, operating on all files currently open within Notepad++.

              Thanks for that framework. My thought process was that I wanted to see whether the scintilla buffer contained the BOM or whether it was filtered out before then. With this framework, I added some scintilla-buffer editor.xxx commands, and found that no, the BOM is not in the scintilla buffer:

              firstBufferID = notepad.getCurrentBufferID()
              for (filename, bufferID, index, view) in notepad.getFiles():
                  # read the first three raw bytes from disk, where a BOM would sit
                  with open(filename, 'rb') as inf:
                      data_at_start_of_file = inf.read(3)
                  if data_at_start_of_file.startswith(b'\xEF\xBB\xBF'):
                      console.write(filename+': found utf-8 bom'+'\n')
                  elif data_at_start_of_file.startswith(b'\xFE\xFF'):
                      console.write(filename+': found ucs-2 big endian bom'+'\n')
                  elif data_at_start_of_file.startswith(b'\xFF\xFE'):
                      console.write(filename+': found ucs-2 little endian bom'+'\n')
              
                  # addendum: dump the first characters of the scintilla buffer for comparison
                  notepad.activateBufferID( bufferID )
                  text = editor.getText()  # renamed from 'str' to avoid shadowing the builtin
                  console.write('buffer: length = {}\n'.format(len(text)))
                  for i in range(min(3, len(text))):  # buffers may hold fewer than 3 characters
                      console.write('\t#{}: {} => {}\n'.format(i, text[i], ord(text[i])))
              
              # switch back to the tab that was active before the loop
              notepad.activateBufferID( firstBufferID )
              

              Which results in:

              C:\Users\peter.jones\...\Peter's Scratchpad.md: found ucs-2 little endian bom
              buffer: length = 10861
                  #0: ~ => 126
                  #1: ~ => 126
                  #2: ~ => 126
              C:\usr\local\apps\notepad++\plugins\Config\PythonScript\scripts\NppForumPythonScripts\17244-utf-bom-reader.py: found utf-8 bom
              buffer: length = 1513
                  #0: # => 35
                  #1:   => 32
                  #2: e => 101
              

              (And no, normally my scratchpad is in UTF8-BOM, not in UCS-2 LE BOM; I just changed its encoding temporarily to test out the other BOM-detections.)
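
The idea that a BOM is "consumed" while loading, and so never reaches the text buffer, has a close analogue in plain Python. The sketch below is only an analogy and does not show what Notepad++ does internally: it uses the standard `utf-8-sig` codec, which drops a leading UTF-8 BOM during decoding, and compares it with the plain `utf-8` codec, which keeps the BOM as the character U+FEFF. The sample text is made up for illustration.

```python
import codecs

# Bytes as they would sit on disk: a UTF-8 BOM followed by some made-up text.
raw = codecs.BOM_UTF8 + u'# encoding test'.encode('utf-8')

# 'utf-8-sig' strips a leading BOM while decoding, so the text never contains it.
text = raw.decode('utf-8-sig')

# Plain 'utf-8' keeps the BOM, which shows up as U+FEFF at index 0.
text_with_bom = raw.decode('utf-8')

print(len(raw) - len(text.encode('utf-8')))  # number of bytes dropped on decode
print(text[0] == u'\ufeff')
print(text_with_bom[0] == u'\ufeff')
```

Once the text has been decoded this way, nothing in it reveals whether the file started with a BOM, which is why the scripts in this thread go back to the raw bytes on disk.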

                Alan Kilborn @PeterJones
                last edited by Mar 6, 2019, 9:34 PM

                @PeterJones said:

                and found that no, the BOM is not in the scintilla buffer

                …we’re back to what I postulated in the beginning: meta!

                  Meta Chuh moderator @Alan Kilborn
                  last edited by Mar 6, 2019, 9:39 PM

                  @Alan-Kilborn

                  …we’re back to what I postulated in the beginning: meta!

                  yes … you were calling ? ;-)

                    Alan Kilborn @Meta Chuh
                    last edited by Mar 6, 2019, 9:42 PM

                    @Meta-Chuh

                    LOL

                    Okay, that has me thinking…what does your username actually mean?

                      Meta Chuh moderator @Alan Kilborn
                      posted Mar 6, 2019, 10:11 PM, last edited by Meta Chuh Mar 7, 2019, 2:19 AM

                      @Alan-Kilborn

                      it’s my real name.
                      unfortunately our family has generations of such strange names.
                      my brothers for example are called pikachuh and raichuh.

                      here’s a family picture of us:

                      Imgur

                      😄

                      seriously: i got meta as a nickname ages ago, because when i was little, i started to use anything for everything, beyond what specific items were originally intended or designed to be used for … and through the years, more and more of that actually started to work out, without anybody (including me) understanding why. 😉
