FAQ Desk: Why Does My .docx File Look Like Junk In Notepad++



  • Hello, and welcome to the FAQ Desk. If you find yourself linked to this post, it is likely because you asked something like “Why does my .docx file look like junk [in Notepad++]?”, or “Why doesn’t Find In Files match anything when searching a directory of PDFs?”

    Imgur

    Notepad++ is an editor for plaintext files. The file types .docx (modern MS Word files) and .pdf (Adobe’s Portable Document Format files) are binary (ie, not plaintext) files which can hold text, formatting, images, embedded objects, links, etc – but the “binary” means they are encoded in such a way that the sequence of bytes in the file (without additional decoding) do not necessarily match any plaintext representations (like ASCII, ISO 8859-*, or UTF8 Unicode), and are thus unintelligible to Notepad++. The fact that Notepad++ renders any of the text from the document as readable plain text, or that its find-in-files feature discovers any matches in those file types, is the exception rather than the rule.

    (There are times, especially in the PDF, when there is some plain text… but there is no guarantee that the sequence you are looking for will stay contained in plaintext; it might get separated by some binary characters, or otherwise have binary control characters embedded along with it, inhibiting your search. Even here, finding what you expect might be the exception rather than the rule.)

    Notepad++ was not built to read such binary files; if you want to read or search .docx files, you need to use a program (usually a word processor, such as MS Word, LibreOffice, OpenOffice, or the like) that is specifically designed to read such files; similarly, for reading .pdf files, you need a program like Adobe Acrobat Reader or other PDF-viewers or editors which are specifically designed to read such files. (The reason “acrobat search … finds many” is because acrobat is designed to read and search .pdf files)

    What you are asking is the equivalent of “I just brought my friend, who only reads English, over to index my personal library: Why is she not able to index my Russian, Hindi, and ancient Greek books?” That friend could be reasonably expected to understand British English, American English, Canadian English, and Australian English (in my analogy, various standard encodings of the same underlying text, such as the ASCII, UTF8, …), but it is unreasonable to expect her to also understand shorthand Sanskrit (in my analogy, a compressed binary format with its own proprietary encoding).