    Find BOMs

    13 Posts 5 Posters 6.1k Views
    • Brigham NarinsB
      Brigham Narins
      last edited by

      Hello:
      Is it possible to use Find in Files to find which files in a folder start with a byte order mark (BOM)? I have two test files in a folder; both are UTF-8 encoded, one has the BOM and the other doesn’t. I tried the regex \xEF\xBB\xBF in the Find what box, but the search returned no results.

      Am I doing something wrong? Is it not possible? Is there another way to find BOMs?

      Thanks,
      Brig

      • Meta ChuhM
        Meta Chuh moderator @Brigham Narins
        last edited by Meta Chuh

        hi @Brigham-Narins
        this is a very good and intriguing question. 👍

        Am I doing something wrong?

        no, you are doing everything correctly.

        apparently any notepad++ search will only begin after the bom.
        this applies to every search mode, regardless of whether it is a normal search within the current document or a find in files search.

        so far i have not found any way to find e.g. ef bb bf (utf-8-bom) with the built-in functions.
        i only found some ps, batch, python scripts that list all bom files externally, but you have probably seen them as well (stackoverflow)

        i think i/we need some more time to figure out something simple.
        (e.g. a custom batch script at the run menu, that searches all files at the path of the current active document. or a python script if you have this plugin installed)

        ps: if you are faster in implementing something like this, please share it.
        it would be an enrichment.

        • Alan KilbornA
          Alan Kilborn @Meta Chuh
          last edited by

          @Meta-Chuh said:

          apparently any notepad++ search will only begin after the bom

          And this seems right as BOM is meta

          • Meta ChuhM
            Meta Chuh moderator @Alan Kilborn
            last edited by

            @Alan-Kilborn

            i am bom, i am bom ;-)

            • Brigham NarinsB
              Brigham Narins @Meta Chuh
              last edited by

              Thanks @Meta-Chuh. And thanks @Alan-Kilborn. I really appreciate your interest in this.

              @Meta-Chuh said:

              i only found some ps, batch, python scripts that list all bom files externally, but you have probably seen them as well (stackoverflow)

              I did see those, yes. Ideally I’d like to come up with a solution inside Notepad++, because these outside scripts and such seem to call for expertise and programs I don’t have.

              ps: if you are faster in implementing something like this, please share it.
              it would be an enrichment.

              I’ll do my best and keep you posted, but I came to you for enrichment and enlightenment! :)

              • Alan KilbornA
                Alan Kilborn @Brigham Narins
                last edited by

                @Brigham-Narins said:

                I’d like to come up with a solution inside Notepad++

                I understand why you’d want this. My earlier comment was intended to mean that I believe the BOM stuff is “consumed” when a file is opened, and thus isn’t “obtainable” later. I haven’t done any investigation, so could be totally wrong about this…

                By “inside Notepad++”, I’m sure you could write a Pythonscript that could open files in binary and detect BOM. That may or may not qualify as “inside Notepad++” and of course might be more effort than you were hoping to put in…

                • Alan KilbornA
                  Alan Kilborn
                  last edited by

                  I’m waiting for a Python program to do its work, so I started playing. Here’s a Pythonscript that does what I mentioned, operating on all files currently open within Notepad++. It seemed to work for the little bit of testing I did with it.

                  # For every file open in Notepad++, read its first bytes from disk and test for a BOM.
                  for (filename, bufferID, index, view) in notepad.getFiles():
                      with open(filename, 'rb') as inf:
                          data_at_start_of_file = inf.read(3)
                      # bytes.startswith() behaves the same under Python 2 and Python 3
                      if data_at_start_of_file.startswith(b'\xEF\xBB\xBF'):
                          print(filename + ': found utf-8 bom')
                      elif data_at_start_of_file.startswith(b'\xFE\xFF'):
                          print(filename + ': found ucs-2 big endian bom')
                      elif data_at_start_of_file.startswith(b'\xFF\xFE'):
                          print(filename + ': found ucs-2 little endian bom')
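
The original question was about the files in a folder rather than the files currently open. The same byte check can be pointed at a directory tree. What follows is only a sketch, not something posted in the thread: the names `detect_bom` and `find_bom_files` are hypothetical, it uses only the Python standard library, and it assumes that only the three BOMs discussed here matter (a UTF-32 LE file, whose BOM also begins with FF FE, would be reported as UCS-2 LE).

```python
import codecs
import os

# BOM signatures for the three encodings mentioned in this thread.
BOMS = [
    (codecs.BOM_UTF8, 'utf-8 bom'),                    # EF BB BF
    (codecs.BOM_UTF16_BE, 'ucs-2 big endian bom'),     # FE FF
    (codecs.BOM_UTF16_LE, 'ucs-2 little endian bom'),  # FF FE
]


def detect_bom(path):
    """Return a label for the BOM at the start of the file, or None."""
    with open(path, 'rb') as f:
        head = f.read(3)
    for signature, label in BOMS:
        if head.startswith(signature):
            return label
    return None


def find_bom_files(folder):
    """Walk folder recursively; return sorted (path, label) pairs for files with a BOM."""
    hits = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            label = detect_bom(path)
            if label is not None:
                hits.append((path, label))
    return sorted(hits)
```

Calling `find_bom_files` with a folder path returns a list that can be printed line by line. Files without a BOM are simply left out of the result.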
                  
                  • guy038G
                    guy038
                    last edited by

                    Hello, @brigham_narins, @meta-chuh, @alan-kilborn and All,

                    To answer your question simply: among all the files created from within N++, the ones that have a BOM ( a Byte Order Mark ) are:

                    • Files with UTF8-BOM encoding, which have an invisible 3-byte BOM ( EF BB BF )

                    • Files with UCS-2 BE BOM encoding, which have an invisible 2-byte BOM ( FE FF )

                    • Files with UCS-2 LE BOM encoding, which have an invisible 2-byte BOM ( FF FE )

                    Files in any of the other encodings have no BOM!


                    Here is another way to verify the presence of a BOM:

                    • Click on the View > Summary... menu option

                    • Calculate the difference File length (in byte) - Current document length

                    The result is the BOM length, which should be 2 or 3 bytes, depending on the file encoding

                    Best Regards,

                    guy038
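
The subtraction described above can be mimicked on raw bytes outside the Summary dialog. This is a minimal sketch, assuming only the three BOMs listed in that post. `bom_length` is a hypothetical helper, not part of Notepad++ or the PythonScript plugin.

```python
import codecs

# The three BOM byte sequences listed above.
KNOWN_BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)


def bom_length(data):
    """Return the number of leading BOM bytes in data (0 if there is no BOM)."""
    for bom in KNOWN_BOMS:
        if data.startswith(bom):
            return len(bom)
    return 0


raw = b'\xef\xbb\xbfhello'        # bytes as stored on disk, BOM included
content = raw[bom_length(raw):]   # what is left once the BOM is stripped
print(len(raw) - len(content))    # prints 3
```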

                    • PeterJonesP
                      PeterJones
                      last edited by

                      @Alan-Kilborn said:

                      Here’s a Pythonscript that does what I mentioned, operating on all files currently open within Notepad++.

                      Thanks for that framework. My thought process was that I wanted to see whether the scintilla buffer contained the BOM or whether it was filtered out before then. With this framework, I added some scintilla-buffer editor.xxx commands, and found that no, the BOM is not in the scintilla buffer:

                      firstBufferID = notepad.getCurrentBufferID()
                      for (filename, bufferID, index, view) in notepad.getFiles():
                          # read the first bytes straight from disk
                          with open(filename, 'rb') as inf:
                              data_at_start_of_file = inf.read(3)
                          if data_at_start_of_file.startswith(b'\xEF\xBB\xBF'):
                              console.write(filename+': found utf-8 bom'+'\n')
                          elif data_at_start_of_file.startswith(b'\xFE\xFF'):
                              console.write(filename+': found ucs-2 big endian bom'+'\n')
                          elif data_at_start_of_file.startswith(b'\xFF\xFE'):
                              console.write(filename+': found ucs-2 little endian bom'+'\n')
                      
                          # addendum: dump the first characters held in the scintilla buffer
                          notepad.activateBufferID( bufferID )
                          text = editor.getText()    # renamed from str to avoid shadowing the builtin
                          console.write('buffer: length = {}\n'.format(len(text)))
                          for i in range(min(3, len(text))):    # guard against buffers shorter than 3 characters
                              console.write('\t#{}: {} => {}\n'.format(i, text[i], ord(text[i])))
                      
                      # restore the buffer that was active before the loop
                      notepad.activateBufferID( firstBufferID )
                      

                      Which results in:

                      C:\Users\peter.jones\...\Peter's Scratchpad.md: found ucs-2 little endian bom
                      buffer: length = 10861
                          #0: ~ => 126
                          #1: ~ => 126
                          #2: ~ => 126
                      C:\usr\local\apps\notepad++\plugins\Config\PythonScript\scripts\NppForumPythonScripts\17244-utf-bom-reader.py: found utf-8 bom
                      buffer: length = 1513
                          #0: # => 35
                          #1:   => 32
                          #2: e => 101
                      

                      (And no, normally my scratchpad is in UTF8-BOM, not in UCS-2 LE BOM; I just changed its encoding temporarily to test out the other BOM detections.)

                      • Alan KilbornA
                        Alan Kilborn @PeterJones
                        last edited by

                        @PeterJones said:

                        and found that no, the BOM is not in the scintilla buffer

                        …we’re back to what I postulated in the beginning: meta!

                        • Meta ChuhM
                          Meta Chuh moderator @Alan Kilborn
                          last edited by

                          @Alan-Kilborn

                          …we’re back to what I postulated in the beginning: meta!

                          yes … you were calling ? ;-)

                          • Alan KilbornA
                            Alan Kilborn @Meta Chuh
                            last edited by

                            @Meta-Chuh

                            LOL

                            Okay, that has me thinking…what does your username actually mean?

                            • Meta ChuhM
                              Meta Chuh moderator @Alan Kilborn
                              last edited by Meta Chuh

                              @Alan-Kilborn

                              it’s my real name.
                              unfortunately our family has generations of such strange names.
                              my brothers for example are called pikachuh and raichuh.

                              here’s a family picture of us:

                              Imgur

                              😄

                              seriously: i got meta as a nickname ages ago because, when i was little, i started to use anything for everything, beyond what specific items were originally intended or designed to be used for … and through the years, more and more of that actually started to work out, without anybody (including me) understanding why. 😉

                              The Community of users of the Notepad++ text editor.