Find BOMs
-
Hello:
Is it possible to use Find in Files to find which files in a folder have the Byte Order Mark in them? I have two test files in a folder; both are UTF-8 encoded, one has the BOM and the other doesn’t. I tried using the regex \xEF\xBB\xBF in the Find What box, but the search returned no results. Am I doing something wrong? Is it not possible? Is there another way to find BOMs?
Thanks,
Brig
-
hi @Brigham-Narins
this is a very good and intriguing question. 👍
Am I doing something wrong?
no, you are doing everything correctly.
apparently any notepad++ search will only begin after the bom.
this applies to any search mode, regardless of whether it is a normal search within the current document or a find in files search.
for now i did not find any possibility to find e.g. ef bb bf (utf-8-bom) with the built-in functions.
i only found some ps, batch, python scripts that list all bom files externally, but you have probably seen them as well (stackoverflow).
i think i/we need some more time to figure out something simple.
(e.g. a custom batch script at the run menu, that searches all files at the path of the current active document. or a python script if you have this plugin installed)
ps: if you are faster in implementing something like this, please share it.
it would be an enrichment.
-
@Meta-Chuh said:
apparently any notepad++ search will only begin after the bom
And this seems right as BOM is meta
-
i am bom, i am bom ;-)
-
Thanks @Meta-Chuh. And thanks @Alan-Kilborn. I really appreciate your interest in this.
@Meta-Chuh said:
i only found some ps, batch, python scripts that list all bom files externally, but you have probably seen them as well (stackoverflow)
I did see those, yes. Ideally I’d like to come up with a solution inside Notepad++, because these outside scripts and such seem to call for expertise and programs I don’t have.
ps: if you are faster in implementing something like this, please share it.
it would be an enrichment.
I’ll do my best and keep you posted, but I came to you for enrichment and enlightenment! :)
-
@Brigham-Narins said:
I’d like to come up with a solution inside Notepad++
I understand why you’d want this. My earlier comment was intended to mean that I believe the BOM stuff is “consumed” when a file is opened, and thus isn’t “obtainable” later. I haven’t done any investigation, so could be totally wrong about this…
By “inside Notepad++”, I’m sure you could write a Pythonscript that could open files in binary and detect BOM. That may or may not qualify as “inside Notepad++” and of course might be more effort than you were hoping to put in…
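For anyone who wants to try the "open files in binary and detect BOM" idea on a whole folder rather than on open tabs, here is a minimal sketch in plain Python. The names `detect_bom` and `find_bom_files` are made up for illustration, and it only looks at files directly inside the folder (no subfolders). It checks just the three BOMs discussed in this thread.

```python
import codecs
import os

# The three BOMs discussed in this thread, checked in this order.
BOMS = (
    (codecs.BOM_UTF8, "utf-8 bom"),                    # EF BB BF
    (codecs.BOM_UTF16_BE, "ucs-2 big endian bom"),     # FE FF
    (codecs.BOM_UTF16_LE, "ucs-2 little endian bom"),  # FF FE
)


def detect_bom(path):
    """Return a label for the BOM at the start of the file, or None."""
    with open(path, "rb") as f:  # binary mode, so nothing is decoded or stripped
        head = f.read(3)
    for bom, label in BOMS:
        if head.startswith(bom):
            return label
    return None


def find_bom_files(folder):
    """Return (filename, label) pairs for files in `folder` that start with a BOM."""
    found = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            label = detect_bom(path)
            if label:
                found.append((name, label))
    return found
```

Calling `find_bom_files(r"C:\some\folder")` (a placeholder path) returns a list you can print. Inside PythonScript, the folder of the active document could probably be obtained with `os.path.dirname(notepad.getCurrentFilename())`, but I have not tested that combination.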
-
I’m waiting for a Python program to do its work, so I started playing. Here’s a Pythonscript that does what I mentioned, operating on all files currently open within Notepad++. It seemed to work for the little bit of testing I did with it.
    for (filename, bufferID, index, view) in notepad.getFiles():
        inf = open(filename, 'rb')
        data_at_start_of_file = inf.read(3)
        inf.close()
        if len(data_at_start_of_file) >= 3 and ord(data_at_start_of_file[0]) == 0xEF and ord(data_at_start_of_file[1]) == 0xBB and ord(data_at_start_of_file[2]) == 0xBF:
            print(filename, ': found utf-8 bom')
        elif len(data_at_start_of_file) >= 2 and ord(data_at_start_of_file[0]) == 0xFE and ord(data_at_start_of_file[1]) == 0xFF:
            print(filename, ': found ucs-2 big endian bom')
        elif len(data_at_start_of_file) >= 2 and ord(data_at_start_of_file[0]) == 0xFF and ord(data_at_start_of_file[1]) == 0xFE:
            print(filename, ': found ucs-2 little endian bom')
-
Hello, @brigham_narins, @meta-chuh, @alan-kilborn and All,
To simply answer your question, I would say that, among all files created from within N++, the files having a BOM (a Byte Order Mark) are:
- The files with UTF8-BOM encoding, which have a 3-byte invisible BOM (EF BB BF)
- The files with UCS-2 BE BOM encoding, which have a 2-byte invisible BOM (FE FF)
- The files with UCS-2 LE BOM encoding, which have a 2-byte invisible BOM (FF FE)
In all the other encodings, a BOM does not exist!
Here is another way to verify the presence of a BOM:
- Click on the View > Summary... menu option
- Calculate the difference File length (in byte) - Current document length
You’ve just got the BOM length, which should be 2 or 3 bytes, depending on the file encoding.
Best Regards,
guy038
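The subtraction above can be tried outside Notepad++ with a small Python 3 sketch. This is only an imitation of the idea: `bom_length` is a hypothetical helper, and "document length" here is simply the text encoded without a BOM, which is an assumption about what the Summary dialog reports, not something checked against Notepad++ itself.

```python
import os
import tempfile


def bom_length(text, file_encoding, content_encoding):
    """Write `text` with a BOM-producing codec and return file size minus content size."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "sample.txt")
        # newline="" keeps line endings untouched so only the BOM changes the size
        with open(path, "w", encoding=file_encoding, newline="") as f:
            f.write(text)
        file_length = os.path.getsize(path)
    document_length = len(text.encode(content_encoding))
    return file_length - document_length


utf8_bom = bom_length("hello", "utf-8-sig", "utf-8")    # codec that writes EF BB BF
utf16_bom = bom_length("hello", "utf-16", "utf-16-le")  # codec that writes a 2-byte BOM
print(utf8_bom, utf16_bom)
```

For these two codecs the differences come out as 3 and 2 bytes, matching the BOM lengths listed in the post.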
-
-
@Alan-Kilborn said:
Here’s a Pythonscript that does what I mentioned, operating on all files currently open within Notepad++.
Thanks for that framework. My thought process was that I wanted to see whether the scintilla buffer contained the BOM or whether it was filtered out before then. With this framework, I added some scintilla-buffer
editor.xxx commands, and found that no, the BOM is not in the scintilla buffer:

    firstBufferID = notepad.getCurrentBufferID()
    for (filename, bufferID, index, view) in notepad.getFiles():
        inf = open(filename, 'rb')
        data_at_start_of_file = inf.read(3)
        inf.close()
        if len(data_at_start_of_file) >= 3 and ord(data_at_start_of_file[0]) == 0xEF and ord(data_at_start_of_file[1]) == 0xBB and ord(data_at_start_of_file[2]) == 0xBF:
            console.write(filename+': found utf-8 bom'+'\n')
        elif len(data_at_start_of_file) >= 2 and ord(data_at_start_of_file[0]) == 0xFE and ord(data_at_start_of_file[1]) == 0xFF:
            console.write(filename+': found ucs-2 big endian bom'+'\n')
        elif len(data_at_start_of_file) >= 2 and ord(data_at_start_of_file[0]) == 0xFF and ord(data_at_start_of_file[1]) == 0xFE:
            console.write(filename+': found ucs-2 little endian bom'+'\n')
        # addendum: look at the first three characters of the scintilla buffer
        notepad.activateBufferID( bufferID )
        text = editor.getText()
        console.write('buffer: length = {}\n'.format(len(text)))
        for i in range(3):
            console.write('\t#{}: {} => {}\n'.format(i, text[i], ord(text[i])))
    notepad.activateBufferID( firstBufferID )

Which results in:

    C:\Users\peter.jones\...\Peter's Scratchpad.md: found ucs-2 little endian bom
    buffer: length = 10861
        #0: ~ => 126
        #1: ~ => 126
        #2: ~ => 126
    C:\usr\local\apps\notepad++\plugins\Config\PythonScript\scripts\NppForumPythonScripts\17244-utf-bom-reader.py: found utf-8 bom
    buffer: length = 1513
        #0: # => 35
        #1:   => 32
        #2: e => 101

(And no, normally my scratchpad is in UTF8-BOM, not in UCS-2 LE BOM; I just changed its encoding temporarily to test out the other BOM-detections.)
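The same "consumed on open" effect can be imitated in plain Python 3, which may help show why a search for \xEF\xBB\xBF comes back empty. This is an analogy only, not Notepad++'s actual loading code, and `load_like_an_editor` is a made-up name.

```python
import codecs
import re


def load_like_an_editor(raw):
    """Split raw UTF-8 file bytes into (bom, text), consuming a leading BOM if present."""
    if raw.startswith(codecs.BOM_UTF8):
        return codecs.BOM_UTF8, raw[len(codecs.BOM_UTF8):].decode("utf-8")
    return b"", raw.decode("utf-8")


raw = codecs.BOM_UTF8 + b"# encoding=utf-8"
bom, text = load_like_an_editor(raw)

# The loader knows about the BOM, but it is no longer part of the searchable text.
bom_found_in_raw = raw.startswith(b"\xEF\xBB\xBF")
bom_found_in_text = re.search("\xEF\xBB\xBF|\uFEFF", text) is not None
print(bom_found_in_raw, bom_found_in_text)  # prints: True False
```

The BOM is visible in the raw bytes on disk but not in the decoded text, which is consistent with the buffer dump above starting directly at the first real character.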
-
@PeterJones said:
and found that no, the BOM is not in the scintilla buffer
…we’re back to what I postulated in the beginning: meta!
-
-
-
it’s my real name.
unfortunately our family has generations of such strange names.
my brothers for example are called pikachuh and raichuh.here’s a family picture of us:

😄
seriously: i got meta as a nick name ages ago, as when i was little, i started to use anything for everything, beyond of what specific items were originally intended, or designed to be used for … and through the years, more and more of doing that actually started to work out, without anybody (including me) understanding why. 😉