How to find and replace unrecognizable characters in multiple files of a folder with the correct character using Notepad ++?

Ramanand Jhingade

How to find and replace unrecognizable characters in multiple files of a folder with the correct character using Command Prompt, Power Shell or Notepad ++?
On the web page, it shows as a � character and in the file, when opened with Notepad++ it shows as x92, x93, x94 etc.

Ramanand Jhingade

@Ramanand-Jhingade By file, I mean the “source” file of the webpage with the .txt extension

PeterJones

@Ramanand-Jhingade

This Forum isn’t a generic help forum; we are focused on Notepad++; if you want help with command prompt or power shell, go elsewhere.

To search for x92 in Notepad++, look for \x92 when in regular expression mode.

Your real problem in that file is that you don’t understand the file’s actual encoding. On the web page, you have to send the right encoding information in the header (and maybe in the meta tag)… By “correct”, I mean that the sent encoding must match with the actual encoding of the file. And in Notepad++, it sometimes guesses the encoding wrong, because to a program, it’s all a bunch of bytes, and while there are heuristics that identify certain encodings, any encoding that doesn’t use the Unicode BOM is likely to be misinterpreted under the right (wrong?) circumstances.

Please note that in a so-called “ANSI” encoding, x91 - x94 are the “smart quotes”: ‘ ’ “ ” . So it looks like you’ve got a file where you put in smart quotes, and saved the file as ANSI (probably really Windows 1252), and that probably when you are sending the webpage, you are saying it’s UTF8; and Notepad++ has probably mis-guessed that it’s UTF8. Or, even worse, you have a mix of UTF8 and WIN-1252-encoded characters in your file, which is just wrong. If you want to keep the encoding as-is, use the following search => replace pairs:

\x91 => ‘
\x92 => ’
\x93 => “
\x94 => ”

But don’t do that until you actually understand the encoding issues involved.
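To make the "ANSI" point concrete, here is a minimal Python sketch (Python is chosen only for illustration; it is not part of the thread). It assumes "ANSI" means Windows-1252, which Python exposes under the codec name `cp1252`, and shows which Unicode character each of the four bytes becomes.

```python
# The four bytes discussed above, as they would sit in the file on disk.
raw = bytes([0x91, 0x92, 0x93, 0x94])

# Interpreted as Windows-1252 (assumed here to be what "ANSI" means),
# each single byte is one smart-quote character.
as_cp1252 = raw.decode("cp1252")

# Print the Unicode code point behind each byte (ASCII-only output).
for byte, char in zip(raw, as_cp1252):
    print(f"0x{byte:02X} -> U+{ord(char):04X}")
```

The code points printed are those of the left and right single quotes followed by the left and right double quotes.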

By file, I mean the “source” file of the webpage with the .txt extension

Why have you named your webpage source file with the .txt extension?

I would highly recommend doing research on how file encoding, especially for webpages, works. Because if you don’t, you’re likely to mess things up more than they currently are. And this Forum isn’t here to guide you through the intricacies of web design; we are here to talk about (and help with) the usage of Notepad++.

Ramanand Jhingade

@PeterJones I tried to search for \x93 and \x94 after selecting the “Regular expression” mode, but it says: Can’t find the text "\x93"

PeterJones

@Ramanand-Jhingade ,

I tried to search for \x93 and \x94 after selecting the “Regular expression” mode, but it says, Can’t find the text "\x93"

Okay, I can replicate: if I have a file open that Notepad++ thinks is UTF-8 (or UTF-8-BOM) and search for that text, it won’t find it. In an ANSI file, where x93 is a valid single-byte character, the \x93 search does work. (In UTF-8, the single byte x93 is not a representation of a real character; U+0093 needs a different sequence of bytes to encode it in UTF-8… which is the crux of the problem.)
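The byte-level claim can be checked with a short Python sketch (an illustration only; it says nothing about how Notepad++ implements its search). It tries to decode a lone 0x93 byte as UTF-8, then shows the bytes UTF-8 actually uses for U+0093 and for the left double quote U+201C.

```python
lone = b"\x93"

# A single 0x93 byte is a UTF-8 continuation byte with no lead byte,
# so strict decoding rejects it.
try:
    lone.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# The code point U+0093 (a C1 control character) needs two bytes in UTF-8.
u0093_bytes = "\u0093".encode("utf-8")

# The left double quote U+201C, which is what byte 0x93 means in
# Windows-1252, needs three bytes in UTF-8.
quote_bytes = "\u201c".encode("utf-8")

print(valid_utf8, u0093_bytes.hex(), quote_bytes.hex())
```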

If you know there aren’t any other UTF8 characters in the file, then do Encoding > ANSI (not Encoding > Convert to ANSI). This will re-interpret those bytes as WIN-1252 (“ANSI”), so it will know they are really smart quotes

At this point, you could do Encoding > Convert to UTF-8-BOM if your end application (webserver) defaults to UTF-8. Or just leave it in “ANSI” and pray that nothing messes it up again in the future, or that you don’t later want to enter text that isn’t in the “ANSI” encoding.
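The difference between re-interpreting and converting can be sketched in Python. This is a conceptual model only, assuming "ANSI" is Windows-1252 (`cp1252`); it does not claim to mirror Notepad++ internals.

```python
# Bytes on disk: Windows-1252 smart quotes around plain ASCII text.
raw = b"\x93Hello\x94"

# Like "Encoding > ANSI": same bytes, new interpretation.
# Nothing on disk would change at this step.
text = raw.decode("cp1252")

# Like "Convert to UTF-8": same characters, different bytes.
utf8_bytes = text.encode("utf-8")

# Like "Convert to UTF-8-BOM": as above, with the three-byte BOM in front.
utf8_bom_bytes = text.encode("utf-8-sig")

print(raw.hex(), utf8_bytes.hex(), utf8_bom_bytes.hex())
```

Re-interpreting changes only how the bytes are read; converting rewrites the bytes so that the same characters survive in the new encoding.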

Ramanand Jhingade

@PeterJones How do I do Encoding > Convert to UTF-8-BOM? Will that cause any problems for the images, alphabets or numerals on any of the webpages (the webpages are with .html extensions but I edit them with Notepad++)?

PeterJones

@Ramanand-Jhingade said in How to find and replace unrecognizable characters in multiple files of a folder with the correct character using Notepad ++?:

@PeterJones How do I do
Encoding > Convert to UTF-8-BOM

You look on the Notepad++ menu, where it has the word “Encoding” as a menu entry; you click on it. Then you go to the menu entry called “Convert to UTF-8-BOM” and click on it.

? Will that cause any problems for the images, alphabets or numerals on any of the webpages

“Images”, no. Your image data isn’t in the HTML source file. If you don’t know that, you probably have some studying of web technology to do.

“Alphabets or numerals”: I don’t know what’s in your page. That’s up to you to know. I already gave the caveat “if you know there aren’t any other UTF8 characters in the file” before following that procedure.

(the webpages are with .html extensions but I edit them with Notepad++)?

Yes, that’s the way that web source files work: you use a text editor to edit the plain text HTML source. If you think you have to clarify that statement because it’s not intuitively obvious to you, then you probably have some studying of web technology to do.

Ekopalypse

@Ramanand-Jhingade
and in addition to what has already been said, you can take a look here to get a better understanding about ansi, unicode and their friends.

Ramanand Jhingade

@PeterJones I did what you said and then tried to replace the unrecognizable characters with the codes you typed above but it still says,

Can’t find the text "\x93"

PeterJones

@Ramanand-Jhingade said in How to find and replace unrecognizable characters in multiple files of a folder with the correct character using Notepad ++?:

I did what you said and then tried to replace the unrecognizable characters with the codes you typed above but it still says,

Can you show screenshots of your steps, similar to what I did above, or my example below? All you have to do is hit Alt+PrintScreen inside Notepad++ (or use the windows Snipping Tool with Shift+WindowsKey+S and then draw a box around the area of screen you want to snip) then paste into your reply here.

It would be nice if you showed enough of your window so we could see the x93 characters and what they become at each step, and also see the full status bar along the bottom.

For example:

1. See that it’s UTF-8-BOM right now, so it doesn’t know what to do with the x91 and similar invalid UTF-8 bytes. Notice how they look like x91 right now.
2. Use the menus to set Encoding > ANSI. Notice how they look like smart quotes now? That’s because they are. And Notepad++ knows this. At this point, a search should work. But you don’t need to search and replace, because Notepad++ recognizes the characters at this point. There is nothing to search and replace, because the characters are right.
3. Menu Encoding > Convert to UTF-8 or Convert to UTF-8-BOM. Now this will put the file into a valid UTF-8 byte sequence. Notice also that the length changed on the status bar: that’s because in UTF-8, the smart quotes each take up 3 bytes, plus 2 bytes for the newline sequence at the end (3*4+2 = 14).
4. If everything looks right to you, Save.

Note that step 3 is only needed if your webserver is expecting the file to be in UTF-8 (or is otherwise telling the outside world that it is UTF-8). It might be that if you’re looking at a local file in your local browser (no webserver involved) it assumes UTF-8. Or maybe it assumes something different. I cannot tell you, because I have no insight into your webserver or your local computer.
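The length arithmetic (3*4+2 = 14) can be reproduced in Python. This sketch assumes the example file held exactly four smart quotes followed by a Windows CRLF newline, which is what the arithmetic implies, and that "ANSI" is Windows-1252.

```python
# Four smart quotes plus a CRLF newline, as assumed for the example file.
text = "\u2018\u2019\u201c\u201d\r\n"

# In Windows-1252 every character here is a single byte: 4*1 + 2.
ansi_len = len(text.encode("cp1252"))

# In UTF-8 each smart quote takes three bytes: 4*3 + 2.
utf8_len = len(text.encode("utf-8"))

print(ansi_len, utf8_len)
```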

-----
Note: you are responsible for your own data. I am assuming you have backed up any critical data. I am not liable for any data loss that you might incur while correctly or incorrectly following my advice.

Ramanand Jhingade

@PeterJones I am not lying but I will do what you typed above and send screenshots when I can make some time. Please think of a solution meanwhile. Thanks for your time and help.

Ramanand Jhingade

@PeterJones Showing you screenshots: Screenshot of unreconised utf character.PNG
Source of the same opened with Notepad++.PNG
The encoding is already UTF 8, so how to find and replace the unrecognized characters?

Ramanand Jhingade

@PeterJones I found a method to find all non-ascii characters from multiple files of a folder here: notepad-tip-find-out-non-ascii
I think I should check all that individually but if you know a less time consuming method, please let me know! I observed that at most places, they are showing up as they should, it is only in some places that a unicode is shown (probably a bug)

Ramanand Jhingade

@PeterJones @guy038 Is there a way to find invalid characters using the information here: how-to-change-all-invalid-characters-to-spaces ?

PeterJones

@Ramanand-Jhingade said in How to find and replace unrecognizable characters in multiple files of a folder with the correct character using Notepad ++?:

[posted screenshots]

Thank you for doing screenshots. Unfortunately, you didn’t pay attention to my request or look at my example screenshots, because your screenshots did not show the Notepad++ status bar at the bottom of the window, so there was no proof of the encoding. I will just have to take your word for “the encoding is already UTF 8”, whereas if you had done what I asked, it would have been included in the screenshots, so I could be sure. Further, you didn’t understand that my request wanted you to show a screenshot at each of the four steps of the procedure I gave you, just like my example gave four screenshots, one at each of the four steps.

The encoding is already UTF 8, so how to find and replace the unrecognized characters?

You appear to be not understanding my posts and screenshots.

Did you notice in my #1 screenshot above, shown again here:

… that the “encoding is already UTF 8” – you can see this in the lower-right corner, in the Notepad++ status bar; that’s the reason I included the status bar in my screenshot, and why I asked you to include the status bar in your screenshot.

The fact that the “encoding is already UTF 8” was the whole point of what I was trying to show you: Notepad++ thinks the encoding is UTF-8, but it has run across the x91x92x93x94 bytes which are not valid UTF 8 encoded characters – so you have badly-formed UTF-8.

You also linked to,

notepad-tip-find-out-non-ascii : https://www.datagenx.net/2015/12/notepad-tip-find-out-non-ascii.html

which suggests that you use [^\x00-\x7F]+. That would work if you were in ANSI or one of the character-set encodings. But if your file is interpreted as UTF-8, the search will not reliably find them, because the bytes x93 and x94 are not properly encoded characters. See this example:

Notice how the only two lines bookmarked are the first (where the bytes run into each other, so that the high bytes at least match the UTF-8 requirement of having multiple 0x80-0xFF bytes adjacent to each other, rather than with non-high-bit characters like a space between) and the fourth (where there are other non-ASCII but validly-encoded UTF-8 characters); it does not match line 2 (where the bytes are space separated).

Trying to find a search in Notepad++ to find invalidly-encoded characters is hard, because the Notepad++ search function assumes your data is properly encoded in whatever encoding Notepad++ is currently set to.
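Outside Notepad++, badly encoded bytes can be located by working on the raw bytes instead of on decoded text. The following Python sketch is one way to do that; `invalid_utf8_spans` is a hypothetical helper name, and the approach relies on the offsets reported by Python's own strict UTF-8 decoder.

```python
def invalid_utf8_spans(data: bytes) -> list[tuple[int, int]]:
    """Return (start, end) byte offsets that strict UTF-8 decoding rejects."""
    spans = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # everything from pos onward decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice being decoded.
            spans.append((pos + exc.start, pos + exc.end))
            pos += exc.end  # resume right after the rejected bytes
    return spans


# Valid UTF-8 text followed by two space-separated Windows-1252 quote bytes.
sample = "caf\u00e9 ".encode("utf-8") + b"\x93 x \x94"
print(invalid_utf8_spans(sample))
```

Because it never decodes the file under a guessed encoding, this finds the space-separated bytes that the bookmark search above missed. Re-decoding the remainder after each error is simple but slow on very large files.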

However, I did some more experimenting, and found a procedure that should work without ruining other UTF-8 text, and just fix the poorly-encoded smart quotes.

1. Verify that the status bar and/or the Notepad++ Encoding menu currently shows UTF-8 or UTF-8-BOM.
2. Use Encoding > ANSI to convince Notepad++ that your bytes are ANSI, not UTF-8.
- Before:
- After:
- You will notice that the “good” characters currently “look” wrong. Don’t worry about that for now. Trust me. But now “arulvaakku” looks right.
WARNING: If you are not showing as “ANSI” encoding before starting step 3, you have not followed my instructions and this will not work! Step 2 will get you to the right point, but only if you have followed my instructions.
3. Do a few search/replace operations. These four will change all single and double smart quotes into the correct three-byte sequence. (Use regular expression search mode for all search/replace below.)
- search \x91 replace \xE2\x80\x98
- search \x92 replace \xE2\x80\x99
- search \x93 replace \xE2\x80\x9C
- search \x94 replace \xE2\x80\x9D
At this point, it will look “worse”, but that’s okay. Trust me.
4. Use Encoding > UTF-8 to tell Notepad++ to re-interpret the file as if the bytes were UTF-8, which is what you want. At this point, everything looks good:
5. SAVE
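The effect of this procedure can also be sketched in Python, as a rough equivalent and not the method used in Notepad++. It decodes the bytes as UTF-8 and falls back to Windows-1252 only for bytes that are not valid UTF-8. The handler name `cp1252_fallback` and the function `repair_mixed` are made up for this illustration.

```python
import codecs


def _cp1252_fallback(exc):
    """Decode bytes that are invalid as UTF-8 as Windows-1252 instead."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # errors="replace" covers the few byte values cp1252 leaves undefined.
    return bad.decode("cp1252", errors="replace"), exc.end


# "cp1252_fallback" is a hypothetical handler name for this sketch.
codecs.register_error("cp1252_fallback", _cp1252_fallback)


def repair_mixed(data: bytes) -> bytes:
    """Return clean UTF-8 bytes from mixed UTF-8 / Windows-1252 input."""
    text = data.decode("utf-8", errors="cp1252_fallback")
    return text.encode("utf-8")


mixed = "caf\u00e9 ".encode("utf-8") + b"\x93hi\x94"
print(repair_mixed(mixed).hex())
```

Valid UTF-8 sequences pass through untouched, so already-correct characters are not damaged. As with the manual steps, work on a backup copy: a run of Windows-1252 bytes that happens to form a valid UTF-8 sequence would be left as-is.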

I think I should check all that individually but if you know a less time consuming method, please let me know! I observed that at most places, they are showing up as they should, it is only in some places that a unicode is shown (probably a bug)

My method won’t be great if you have a lot of files. If there is a bug, it’s a bug in how your HTML was generated.

Alternatives

If the only non-ASCII characters in your entire file are the x93 and x94 smart quotes, then just ignore how it “looks” in Notepad++, and tell your webserver that the file is encoded as Windows-1252 (using both server settings and maybe a meta-charset HTML tag).

If the only non-ASCII characters in your entire file are x93 and x94 smart quotes, then try to convince Notepad++ to automatically interpret it as ANSI. Some things to try to get that result:

Settings > Preferences > New Document:
- Set “Encoding” to either “ANSI” or “Windows-1252”
- Make sure “Apply to opened ANSI files” is not checked
Settings > Preferences > MISC
- Try changing the setting of “Autodetect character encoding” to either checked or not.

After changing any of those settings, you may have to reload your file to get Notepad++ to apply its new settings. I do not guarantee that these settings will work for you… the auto-detect is notorious for disagreeing with the user as to what encoding it thinks is there, and everyone has different ideas of the “right” settings, depending on what their text normally looks like, and what bytes they contain.

After loading a file, if Notepad++ doesn’t get it right, and you see the x93 and x94 boxes, just switch to Encoding > ANSI and everything will look right. On that file, you’d definitely want to include the meta-charset tag.

Non-Notepad++ Alternative

If you have lots of files that have mixed encoding with some normal UTF-8 characters and some windows-1252 smart quotes, it might not be efficient to make the changes in Notepad++. Instead, you might want to find a non-Notepad++ solution. I would suggest trying command line tools, maybe like “iconv” or “sed” – there are windows versions of those tools, but this forum is not the right place to find help on those.
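Before fixing anything across a whole folder, it helps to know which files are affected. The Python sketch below lists files whose bytes are not valid UTF-8; it is offered only as one possible triage step. The function name `files_with_invalid_utf8` and the default `*.html` pattern are assumptions for illustration.

```python
from pathlib import Path


def files_with_invalid_utf8(folder: str, pattern: str = "*.html") -> list[Path]:
    """List files under `folder` (recursively) that are not valid UTF-8."""
    flagged = []
    for path in sorted(Path(folder).rglob(pattern)):
        if not path.is_file():
            continue
        try:
            # Read raw bytes so no encoding is guessed on our behalf.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            flagged.append(path)
    return flagged
```

Calling `files_with_invalid_utf8("path/to/site")` returns only the files that need attention. A pure Windows-1252 file with smart quotes is flagged too, since its x91 to x94 bytes are not valid UTF-8.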

Done

I have explained these to the best of my ability. I am not confident that you have understood the points I have been making, or my instructions for how to fix your data. Unfortunately, I don’t know how else to say it. If you have more questions, feel free to ask; but I am going to likely leave it up to someone different to step in and try to help you, because I don’t know what more I could say that I haven’t already said.

PeterJones

@Ramanand-Jhingade ,

If you end up going down the route of non-Notepad++ solutions (remembering that here is not the right place to ask questions if you do), @Vasile-Caraus has posted a couple of non-Notepad++ tools that might be able to do the search-and-replace in the way that you want, the tools listed in these two posts. The second tool which he mentioned, grepWin, has been recommended by other users on the forum as well, especially in circumstances when Notepad++'s find-in-files wasn’t properly handling encoding-detection.

Vasile showed it working for the bytes ï¿½ (which is how the UTF-8 encoding of � looks when read as Windows-1252) because that was the focus of that previous discussion. But it will likely also work if you wanted to replace \x93 with “ and \x94 with ” – and might be easier for you to figure out than iconv or command-line grep.

Given your requirements, grepWin may be the best tool for you for this particular smart-quote problem. (if you have grepWin questions, you will need to find a grepWin forum or other generic help site, because the Notepad++ Community is focused on Notepad++)

Ramanand Jhingade

@PeterJones I finally found a solution here: how-to-find-non-ascii-unprintable-characters-using-notepad-plus-plus
We have to select the Regular expression mode and search/find with this code: [\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]
I will do the replacements one by one instead of using “Replace all”
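To show what that character class covers, here is a small Python sketch. It uses Python's `re` module, not the regex engine inside Notepad++, so it illustrates the class itself and not Notepad++ behaviour. The class matches the C0 control characters except tab, line feed and carriage return, plus DEL and the C1 range U+0080 to U+009F.

```python
import re

# Same character class as in the post above.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]")

# U+0093/U+0094 are C1 controls; U+201C/U+201D are real smart quotes.
sample = "tab\there \u0093bad\u0094 and \u201cgood\u201d\r\n"

hits = [f"U+{ord(m.group()):04X}" for m in CONTROL_CHARS.finditer(sample)]
print(hits)
```

The properly encoded smart quotes, the tab and the line ending are not matched; only the two C1 control characters are.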

Ramanand Jhingade

@Ramanand-Jhingade The Find All is making Notepad++ stop working and close if I use the above code. Any suggestions to avoid that?

PeterJones

@Ramanand-Jhingade ,

Find All is making Notepad++ to stop working and close if I use the above code.

Which Find All do you mean? Do you mean the Find > Find All in Current Document, Find > Find All in Opened Documents, or Find in Files > Find All ?

Please note that the Find in Files adds another level of confusion, because Notepad++ is trying to figure out the encoding on each file individually, and depending on the bytes in the file and your settings (as described above), it might think some are UTF-8 and others are ANSI or might pick a strange character-set value. The Find in Files isn’t great with non-ASCII characters, unfortunately. There are bug reports / feature requests, but they are taking time to get worked out.

I suggest doing one file at a time for now.

Ramanand Jhingade

@PeterJones @Ekopalypse Thank you both for your time and help. @PeterJones Please post here if the bug is fixed and I can Find all/search in multiple files of a folder