Unable to display .MHTML copies of postings

  • I posted the following query to NodeBB:

    Why can’t most browsers (Firefox, Chrome, Iron, Opera (current), Vivaldi) display .mhtml versions of pages saved from your community format style web pages? All the browsers can save the pages, but none of them can view them. So far this issue ONLY shows up on community forums, both yours (https://community.nodebb.org/) and the Notepad++ (https://community.notepad-plus-plus.org/) site. Oddly enough, the saved pages can be viewed by the OperaLegacy browser (which uses the Presto engine) and can be disassembled by the “mht2htm” program (https://pgm.bpalanka.com/mht2htm.html) which will then allow any browser to view the extracted web page. I have been saving .mhtml files for years and have only seen this display issue with websites based on the NodeBB community package. It took a while to notice, because the Windows Explorer ‘preview pane’ will display the NodeBB community pages, but a current web browser will not.

    This is not something I expect any of the Npp Community moderators to explain or fix (but it would be nice). I’m just posting this copy to be transparent. (And I prefer to archive helpful postings at community.nodebb.org in the .MHTML format instead of the more cumbersome “Webpage, Complete” format, i.e., ‘topic.html’ with images, etc. in a ‘topic_files’ folder.)

  • @artie-finkelstein ,

    I don’t know much about the MHTML format, but when I look at what it saved when I tried to save a copy of your post from Chrome, it appears to include the CSS and images in the .mhtml file, but it doesn’t save any of the javascript code. Thus, I think when the browser tries to load the JS files while displaying the local MHTML, it would have to download the JS from a remote server… which then gets into “cross-site scripting” issues, and other such issues (*). Most modern browsers frown on (disable) cross-site scripting, because that’s a security violation of the highest order, so it really doesn’t surprise me that modern browsers refuse to display something that has cross-site scripting, whereas a browser with “legacy” in its name (which is probably from before such security concerns were implemented) likely doesn’t have that safety feature.

    As to why it works for other pages you download but not for NodeBB: not sure; I see that NodeBB uses rel="prefetch", which might prevent the browser-save-as-MHTML feature from including the JS in the MHTML file… or maybe save-as-MHTML never includes JS on those. Or maybe my wild stab in the dark is completely unfounded.
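
    One way to check which kinds of resources a saved file actually contains is to count its MIME parts. The following is only a minimal sketch, assuming the saved .mhtml is a standard MIME multipart message that Python's `email` module can parse; the function name `count_part_types` is made up for illustration.

```python
import email
import sys
from collections import Counter
from pathlib import Path


def count_part_types(mhtml_bytes: bytes) -> Counter:
    """Count the content types of the non-multipart parts in an MHTML archive."""
    message = email.message_from_bytes(mhtml_bytes)
    return Counter(
        part.get_content_type()
        for part in message.walk()
        if not part.is_multipart()  # skip the multipart container itself
    )


if __name__ == "__main__":
    # Usage: python count_parts.py saved-page.mhtml [more.mhtml ...]
    for name in sys.argv[1:]:
        counts = count_part_types(Path(name).read_bytes())
        print(name)
        for content_type, count in sorted(counts.items()):
            print(f"  {content_type}: {count}")
```

    If no javascript content type shows up in the listing for a saved page, that would be consistent with the observation above that the scripts are not stored in the file.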

    (*: Caveat: I am not a web-security expert, and I may be mis-using the technical term “cross-site scripting”; but whether this would technically be cross-site scripting or not, the idea behind the security hole is the same: if the HTML is on one “server” (or your local MHTML file) but the javascript code is hosted on some remote server, the controller of the local HTML/MHTML has no control over whether or not someone could inject malicious code into the remote server’s javascript)

    But likely there’s nothing the administrators of this Forum can do, short of switching away from NodeBB (and since NodeBB is giving Don free hosting for this Forum, that’s not likely to happen).

    As an alternative, if you don’t like the multi-file based save, maybe you’d be willing to use Chrome’s print-to-PDF feature, which captures the whole page, images and all, in a single file.

    Other than that, I’m not sure what else to suggest for you.

  • Thank you for the insight on javascript, et al. I don’t believe it’s a cross-domain script issue (see the third paragraph for my reasoning), but it could be a side effect of a preload command or some other issue, which I’ll investigate.

    The MHTML format (alternately known as the ‘mime html’ or the ‘Microsoft html’ format) tries to save a pretty accurate snapshot of a webpage, much like a printed .pdf version, but the .mhtml can also be viewed in a web browser (I have all my browsers configured to download .PDFs instead of displaying them in the browser). Over the years I found the MHTML format faster to acquire (there’s no messing around with the pdf output formatting options such as header, pagination, etc.) and much faster to reload on later viewings.

    I never noticed the problem on the nodeBB snapshots until I tried tracking down a helpful answer from one of the Notepad++ Community pages and found I couldn’t view any of the saved files. All that time, the preview pane of Windows Explorer had been displaying an excellent rendering of the original webpage (which is something a .html file with images will not do, even though it’s a Microsoft abortion extension). If Explorer is showing the full webpage image, no additional downloading is required, and I believe no javascript is executing.

    I never expected the Notepad++ staff/community to “fix this issue”. It’s up to me to adapt to the web ecology. BUT, since only sites using the nodeBB software seem to fail to ‘reload’ captured .mhtml files, I raised the issue with nodeBB. As I said, I copied the Notepad++ community to be transparent. I understood that nodeBB is providing Don Ho / Notepad++ Community with service gratis. I also felt that if I didn’t point out what seemed to be a problem with the nodeBB package, it might never get changed (or rather, fixed). I’ll stick with “fixed”, because it’s never been an issue with stackoverflow, The New York Times, Microsoft, NPR or even Amazon (although they have somehow blocked Vivaldi (my primary browser) from allowing the save as mhtml option on IMDb).

    The fact that you went out of your way to provide an answer is a wonderful testament to the dedication of the community moderators. After the recent publishing of the latest manual (a very nice improvement on the 2011 stand alone html based manual), I found regularly perusing the Community pages provided much helpful insight, explanation, occasional history or design intent (which helps me understand the rationale behind the inner workings) and absolutely amazing scripts and other ways to iron out the wrinkles. I often save the long answers with nicely formatted code blocks for later reference.

    Yes, Opera Legacy is not a supported browser, BUT it also uses a unique web page engine ‘Presto’, very unlike the ‘Chrome’ engine behind most other current browsers (including Vivaldi). I have tried Firefox, but it sees that the .mhtml files are associated with Vivaldi and offers to pass the file off to Vivaldi to load it. So far I haven’t been able to convince Firefox to open it itself.

    Again, thank you for caring and responding. I’ll also play around with .pdf saves to see how well they capture the essence of the original posts (many website snapshots appear quite different between .pdf & .mhtml captures).

  • Just in case anyone is interested, I filed a bug report w/ the Chromium/Blink group-

    Issue 1216783: can't offline load .mhtml version of a saved page
    Reported by [Artie] on Sat, Jun 5, 2021, 11:46 AM CDT (just now)
    UserAgent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.95 Safari/537.36
    Example URL:
        https://community.notepad-plus-plus.org/topic/11234/new-plugin-luascript/26?_=1622589546261
    Steps to reproduce the problem:
        1. browse https://community.notepad-plus-plus.org/topic/11234/new-plugin-luascript/26?_=1622589546261
        2. save page as .mhtml
        3. try to open the saved page in a Blink based browser
        see other comments for greater detail; offending code shows up in the 'nodeBB' Community s/w package and can also be seen at: https://community.nodebb.org/topic/4456/nodebb-plugin-ns-likes-ns-likes
    What is the expected behavior?
        view the page as it was displayed before it was saved
    What went wrong?
        there is nothing displayed in the browser window, it stays empty
    Does it occur on multiple sites: Yes
    Is it a problem with a plugin? No 
    Did this work before? No 
    Does this work in other browsers? Yes
    Chrome version: 90.0.4430.95  Channel: n/a
    OS Version: 6.1 (Windows 7, Windows Server 2008 R2)
    Flash Version: none
    doesn't matter which Blink based browser is used (Chrome/Vivaldi/Opera/Iron) ; the 'offending' line of html code in the web page is: <link rel="prefetch stylesheet" href="https://community.notepad-plus-plus.org/plugins/nodebb-plugin-markdown/styles/railscasts.css"> ; if the 'prefetch' is removed from the saved .mhtml file (or even replaced w/ 'preload'), the saved file can be loaded in any browser (Blink, Presto, or QtWeb based)

    Based on my testing and reading of the W3 documents concerning ‘link rel’ statements: It appears to me that the nodeBB code is legal and the Blink engine can’t handle it when loading a saved .mhtml page. Although, I must admit, adding a prefetch argument for a 1.2 KB CSS file when the main CSS file is over 340 KB seems to be trying to optimize the wrong thing.

    I have found a simple work around for the .mhtml reload of Notepad++ Community pages:

    • change the extension of previously saved Notepad++ Community pages from .mhtml to .mht
    • configure Windows (using the ‘Open With’ dialog) to use the Opera Legacy browser to open .mht extensions
    • use the ‘3-step shuffle’ to copy long code blocks (they don’t auto-scroll):
      • left click; press <End>; complete drag to end of code

    The alternative is to write a small SED script (or similar) to patch all the .mhtml files to remove the ‘prefetch’ argument before the ‘stylesheet’ argument.
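
    As a rough illustration of that patching idea, here is a minimal Python sketch rather than SED. It rests on assumptions that are not confirmed in this thread: that the attribute appears in the saved file either literally as rel="prefetch stylesheet" or in quoted-printable form as rel=3D"prefetch stylesheet", and that a quoted-printable soft line break does not split it across two lines (this simple pattern would miss that case). The names `strip_prefetch` and `patch_file` and the ‘.patched’ output naming are made up for the example.

```python
import re
import sys
from pathlib import Path

# Matches rel="prefetch stylesheet" as plain text, or with the "=" written
# as "=3D" (quoted-printable). Assumption: the attribute is not split by a
# quoted-printable soft line break.
PREFETCH_REL = re.compile(rb'(rel=(?:3D)?")prefetch (stylesheet")')


def strip_prefetch(data: bytes) -> bytes:
    """Return data with 'prefetch ' removed from rel="prefetch stylesheet"."""
    return PREFETCH_REL.sub(rb"\1\2", data)


def patch_file(path: Path) -> bool:
    """Write a patched copy next to path; return True if anything changed."""
    original = path.read_bytes()
    patched = strip_prefetch(original)
    if patched == original:
        return False
    # Keep the original untouched and write e.g. topic.patched.mhtml instead.
    target = path.with_name(path.stem + ".patched" + path.suffix)
    target.write_bytes(patched)
    return True


if __name__ == "__main__":
    # Usage: python strip_prefetch.py saved1.mhtml saved2.mhtml ...
    for name in sys.argv[1:]:
        changed = patch_file(Path(name))
        print(f"{name}: {'patched copy written' if changed else 'no change needed'}")
```

    Writing a separate copy instead of overwriting in place is a cautious choice for an archive; the copy can replace the original once it is confirmed to load.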

    Anyway, this issue is done for me and it’s time to put a fork in it

  • Much later followup-

    The minimum Chromium version (93.0.4549.0) required to display offline MHTML copies of NodeBB based web pages, which includes the Notepad++ Community, has made it into the current Chrome and Vivaldi (4.2.2406.44) releases.

    [I’m taking the fork out, so the rest of this carcass can now fade away.]
