    How to find and replace unrecognizable characters in multiple files of a folder with the correct character using Notepad ++?

    26 Posts 5 Posters 59.6k Views
    • PeterJonesP
      PeterJones @Ramanand Jhingade
      last edited by

      @Ramanand-Jhingade

      This Forum isn’t a generic help forum; we are focused on Notepad++. If you want help with the command prompt or PowerShell, go elsewhere.

      To search for x92 in Notepad++, look for \x92 when in regular expression mode.

      Your real problem in that file is that you don’t understand the file’s actual encoding. On the web page, you have to send the right encoding information in the header (and maybe in the meta tag)… By “correct”, I mean that the sent encoding must match with the actual encoding of the file. And in Notepad++, it sometimes guesses the encoding wrong, because to a program, it’s all a bunch of bytes, and while there are heuristics that identify certain encodings, any encoding that doesn’t use the Unicode BOM is likely to be misinterpreted under the right (wrong?) circumstances.

      Please note that in a so-called “ANSI” encoding, x91 - x94 are the “smart quotes”: ‘ ’ “ ” . So it looks like you’ve got a file where you put in smart quotes, and saved the file as ANSI (probably really Windows 1252), and that probably when you are sending the webpage, you are saying it’s UTF8; and Notepad++ has probably mis-guessed that it’s UTF8. Or, even worse, you have a mix of UTF8 and WIN-1252-encoded characters in your file, which is just wrong. If you want to keep the encoding as-is, use the following search => replace pairs:

      • \x91 => ‘
      • \x92 => ’
      • \x93 => “
      • \x94 => ”

      But don’t do that until you actually understand the encoding issues involved.
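As a small illustration of the mismatch described above, here is a Python sketch (not part of the original post) that decodes the same four bytes two ways. It assumes the "ANSI" code page in question is Windows-1252, which Python calls `cp1252`.

```python
# The four bytes discussed above, as they would sit on disk in a Windows-1252 file.
raw = bytes([0x91, 0x92, 0x93, 0x94])

# Interpreted as Windows-1252, they are the four smart quotes.
as_cp1252 = raw.decode("cp1252")
print(as_cp1252)  # ‘’“”

# Interpreted as UTF-8, the same bytes are malformed, so decoding fails.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError as exc:
    valid_utf8 = False
    print("not valid UTF-8:", exc.reason)
```

The bytes never change here; only the declared encoding does, which is why the header sent with a web page has to match how the file was actually saved.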

      By file, I mean the “source” file of the webpage with the .txt extension

      Why have you named your webpage source file with the .txt extension?

      I would highly recommend doing research on how file encoding, especially for webpages, works. Because if you don’t, you’re likely to mess things up more than they currently are. And this Forum isn’t here to guide you through the intricacies of web design; we are here to talk about (and help with) the usage of Notepad++.

      • Ramanand JhingadeR
        Ramanand Jhingade @PeterJones
        last edited by Ramanand Jhingade

        @PeterJones I tried to search for \x93 and \x94 after selecting the "Regular expression" mode, but it says: Can’t find the text "\x93"

        • PeterJonesP
          PeterJones @Ramanand Jhingade
          last edited by

          @Ramanand-Jhingade ,

          I tried to search for \x93 and \x94 after selecting the “Regular expression” mode, but it says, Can’t find the text "\x93"

          Okay, I can replicate: if I have a file open that Notepad++ thinks is UTF-8 (or UTF-8-BOM) and search for that text, it won’t find it. In an ANSI file, where x93 is a valid byte at codepoint 0x93, the \x93 search does work. (In UTF-8, the single byte x93 is not a representation of a real character; U+0093 needs a different sequence of bytes to encode it in UTF-8… which is the crux of the problem.)

          e43fbd90-1c22-466e-8177-f18d615bb25f-image.png
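To make the "different sequence of bytes" point concrete, the following Python sketch (an illustration, not something from the thread) shows how U+0093 and the left smart quote U+201C are each encoded in UTF-8. Neither one is the single byte 0x93.

```python
# U+0093 is a C1 control character; UTF-8 needs two bytes for it.
c1_control = "\u0093".encode("utf-8")
print(c1_control)  # b'\xc2\x93'

# The left double smart quote is U+201C; UTF-8 needs three bytes for it.
left_quote = "\u201c".encode("utf-8")
print(left_quote)  # b'\xe2\x80\x9c'

# Only Windows-1252 stores that quote as the single byte 0x93.
left_quote_ansi = "\u201c".encode("cp1252")
print(left_quote_ansi)  # b'\x93'
```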

          If you know there aren’t any other UTF8 characters in the file, then do Encoding > ANSI (not Encoding > Convert to ANSI). This will re-interpret those bytes as WIN-1252 (“ANSI”), so it will know they are really smart quotes.

          b7228473-2238-4de9-8386-046609d55e59-image.png

          At this point, you could do Encoding > Convert to UTF-8-BOM if your end application (webserver) defaults to UTF-8. Or just leave it in “ANSI” and pray that nothing messes it up again in the future, or that you don’t later want to enter text that isn’t in the “ANSI” encoding.
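The difference between re-interpreting and converting can be modelled in a few lines of Python. This is only an analogy for what the two kinds of menu entry do, under the assumption that "ANSI" means Windows-1252 on the machine in question; it is not Notepad++'s actual implementation.

```python
raw = b"\x93quoted\x94"  # bytes on disk: Windows-1252 smart quotes around a word

# Like "Encoding > ANSI": the bytes stay the same, only the interpretation changes.
text = raw.decode("cp1252")
print(text)  # “quoted”

# Like "Encoding > Convert to UTF-8": the characters stay the same, the bytes change.
converted = text.encode("utf-8")
print(converted)  # b'\xe2\x80\x9cquoted\xe2\x80\x9d'

# Like "Encoding > Convert to UTF-8-BOM": the same, with a 3-byte BOM in front.
converted_bom = text.encode("utf-8-sig")
print(converted_bom[:3])  # b'\xef\xbb\xbf'
```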

          • Ramanand JhingadeR
            Ramanand Jhingade @PeterJones
            last edited by Ramanand Jhingade

            @PeterJones How do I do Encoding > Convert to UTF-8-BOM? Will that cause any problems for the images, alphabets or numerals on any of the webpages (the webpages are with .html extensions but I edit them with Notepad++)?

            • PeterJonesP
              PeterJones @Ramanand Jhingade
              last edited by

              @Ramanand-Jhingade said in How to find and replace unrecognizable characters in multiple files of a folder with the correct character using Notepad ++?:

              @PeterJones How do I do Encoding > Convert to UTF-8-BOM

              You look on the Notepad++ menu, where it has the word “Encoding” as a menu entry; you click on it. Then you go to the menu entry called “Convert to UTF-8-BOM” and click on it.

              ? Will that cause any problems for the images, alphabets or numerals on any of the webpages

              “Images”, no. Your image data isn’t in the HTML source file. If you don’t know that, you probably have some studying of web technology to do.

              “Alphabets or numerals”: I don’t know what’s in your page. That’s up to you to know. I already gave the caveat “if you know there aren’t any other UTF8 characters in the file” before following that procedure.

              (the webpages are with .html extensions but I edit them with Notepad++)?

              Yes, that’s the way that web source files work: you use a text editor to edit the plain text HTML source. If you think you have to clarify that statement because it’s not intuitively obvious to you, then you probably have some studying of web technology to do.

              • EkopalypseE
                Ekopalypse
                last edited by

                @Ramanand-Jhingade
                and in addition to what has already been said, you can take a look here to get a better understanding about ansi, unicode and their friends.

                • Ramanand JhingadeR
                  Ramanand Jhingade @PeterJones
                  last edited by Ramanand Jhingade

                  @PeterJones I did what you said and then tried to replace the unrecognizable characters with the codes you typed above but it still says,

                  Can’t find the text "\x93"
                  
                  • PeterJonesP
                    PeterJones @Ramanand Jhingade
                    last edited by PeterJones

                    @Ramanand-Jhingade said in How to find and replace unrecognizable characters in multiple files of a folder with the correct character using Notepad ++?:

                    I did what you said and then tried to replace the unrecognizable characters with the codes you typed above but it still says,

                    Can you show screenshots of your steps, similar to what I did above, or my example below? All you have to do is hit Alt+PrintScreen inside Notepad++ (or use the Windows Snipping Tool with Shift+WindowsKey+S and then draw a box around the area of screen you want to snip), then paste into your reply here.

                    It would be nice if you showed enough of your window so we could see the x93 characters and what they become at each step, and also see the full status bar along the bottom.

                    For example:

                    1. See that it’s UTF-8-BOM right now, so it doesn’t know what to do with the x91 and similar invalid UTF-8 characters
                      8ac6e99a-4939-408c-a052-8ac8f443d8d4-image.png
                      notice how they look like x91 right now

                    2. use the menus to set Encoding > ANSI
                      d9a704ff-702c-4f88-89b6-9bf940d06e3d-image.png
                      notice how they look like smart quotes now? That’s because they are. And Notepad++ knows this.

                    3. At this point, a search should work. But you don’t need to search and replace, because Notepad++ recognizes the characters at this point. There is nothing to search and replace, because the characters are right.
                      3b8827c3-782e-4150-bd29-ce2018f4554c-image.png

                    4. menu Encoding > Convert to UTF-8 or Convert to UTF-8-BOM. Now this will put the file into a valid UTF-8 byte sequence.
                      896ca1eb-bf0a-45ab-927b-5be57303f858-image.png
                      Notice also that the length changed on the status bar: that’s because in UTF-8, the smart quotes each take up 3 bytes, plus 2 bytes for the newline sequence at the end (3*4+2 = 14)

                    5. if everything looks right to you, Save

                    Note that step 4 is only needed if your webserver is expecting the file to be in UTF-8 (or is otherwise telling the outside world that it is UTF-8). It might be that if you’re looking at a local file in your local browser (no webserver involved) it assumes UTF-8. Or maybe it assumes something different. I cannot tell you, because I have no insight into your webserver or your local computer.
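The length arithmetic mentioned with the conversion to UTF-8 can be checked with a short Python sketch. It assumes the example file holds exactly the four smart quotes followed by a Windows CRLF line ending, which is what the 3*4+2 calculation implies; the screenshots themselves are not available here.

```python
# Assumed content of the example file: four smart quotes plus CRLF.
line = "\u2018\u2019\u201c\u201d\r\n"

# In Windows-1252 every character is one byte: 4*1 + 2 = 6.
ansi_len = len(line.encode("cp1252"))

# In UTF-8 each smart quote takes three bytes: 4*3 + 2 = 14.
utf8_len = len(line.encode("utf-8"))

print(ansi_len, utf8_len)  # 6 14
```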

                    -----
                    Note: you are responsible for your own data. I am assuming you have backed up any critical data. I am not liable for any data loss that you might incur while correctly or incorrectly following my advice.

                    • Ramanand JhingadeR
                      Ramanand Jhingade @PeterJones
                      last edited by

                      @PeterJones I am not lying but I will do what you typed above and send screenshots when I can make some time. Please think of a solution meanwhile. Thanks for your time and help.

                      • Ramanand JhingadeR
                        Ramanand Jhingade @PeterJones
                        last edited by

                        @PeterJones Showing you screenshots: Screenshot of unreconised utf character.PNG
                        Source of the same opened with Notepad++.PNG
                        The encoding is already UTF 8, so how to find and replace the unrecognized characters?

                        • Ramanand JhingadeR
                          Ramanand Jhingade @PeterJones
                          last edited by Ramanand Jhingade

                          @PeterJones I found a method to find all non-ascii characters from multiple files of a folder here: notepad-tip-find-out-non-ascii
                          I think I should check all that individually but if you know a less time consuming method, please let me know! I observed that at most places, they are showing up as they should, it is only in some places that a unicode is shown (probably a bug)

                          • Ramanand JhingadeR
                            Ramanand Jhingade @PeterJones
                            last edited by

                            @PeterJones @guy038 Is there a way to find invalid characters using the information here: how-to-change-all-invalid-characters-to-spaces ?

                            • PeterJonesP
                              PeterJones @Ramanand Jhingade
                              last edited by PeterJones

                              @Ramanand-Jhingade said in How to find and replace unrecognizable characters in multiple files of a folder with the correct character using Notepad ++?:

                              [posted screenshots]

                              Thank you for doing screenshots. Unfortunately, you didn’t pay attention to my request or look at my example screenshots, because your screenshots did not show the Notepad++ status bar at the bottom of the window, so there was no proof of the encoding. I will just have to take your word for “the encoding is already UTF 8”, whereas if you had done what I asked, it would have been included in the screenshots, so I could be sure. Further, you didn’t understand that my request wanted you to show a screenshot at each of the four steps of the procedure I gave you, just like my example gave four screenshots, one at each of the four steps.

                              The encoding is already UTF 8, so how to find and replace the unrecognized characters?

                              You appear to be not understanding my posts and screenshots.

                              Did you notice in my #1 screenshot above, shown again here:

                              … that the “encoding is already UTF 8” – you can see this in the lower-right corner, in the Notepad++ status bar; that’s the reason I included the status bar in my screenshot, and why I asked you to include the status bar in your screenshot.

                              The fact that the “encoding is already UTF 8” was the whole point of what I was trying to show you: Notepad++ thinks the encoding is UTF-8, but it has run across the x91x92x93x94 bytes which are not valid UTF 8 encoded characters – so you have badly-formed UTF-8.

                              You also linked to,

                              notepad-tip-find-out-non-ascii : https://www.datagenx.net/2015/12/notepad-tip-find-out-non-ascii.html

                              which suggests that you use [^\x00-\x7F]+. That would work if you were in ANSI or one of the character-set encodings. But if your file is interpreted as UTF-8, the search will not reliably find x93 and x94, because those bytes are not properly encoded characters. See this example:
                              b9a7709e-2edb-4cb6-9e5e-61b17d86bc0a-image.png
                              Notice how the only two lines bookmarked are the first (where the bytes run into each other, so that the high bytes at least match the UTF-8 requirement of having multiple 0x80-0xFF bytes adjacent to each other, rather than with non-high-bit characters like a space between) and the fourth (where there are other non-ASCII but validly-encoded UTF-8 characters); it does not match line 2 (where the bytes are space separated).

                              Trying to find a search in Notepad++ to find invalidly-encoded characters is hard, because the Notepad++ search function assumes your data is properly encoded in whatever encoding Notepad++ is currently set to.
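Outside of Notepad++, invalidly-encoded bytes are easier to locate because a script can look at the raw bytes directly. The Python sketch below is one possible approach and is not something proposed in the thread: `find_invalid_utf8` is a hypothetical helper that reports the offset of every byte that cannot be decoded as UTF-8.

```python
def find_invalid_utf8(data: bytes) -> list[int]:
    """Return the offsets of all bytes that are not part of valid UTF-8."""
    bad_offsets = []
    pos = 0
    while pos < len(data):
        try:
            # If the rest decodes cleanly, there is nothing more to report.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice we decoded.
            bad_offsets.extend(range(pos + exc.start, pos + exc.end))
            pos += exc.end
    return bad_offsets


# Valid UTF-8 ("café") mixed with two lone Windows-1252 quote bytes.
sample = b"a \x93 caf\xc3\xa9 \x94"
print(find_invalid_utf8(sample))  # [2, 10]
```

The validly-encoded `\xc3\xa9` is left alone, and the space-separated lone bytes are still found, which is the case the regular expression search missed in the example above. Re-slicing on every error is slow for large files with many bad bytes, so treat this as a diagnostic sketch.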

                              However, I did some more experimenting, and found a procedure that should work without ruining other UTF-8 text, and just fix the poorly-encoded smart quotes.

                              1. Verify that the status bar and/or the Notepad++ Encoding menu currently shows UTF-8 or UTF-8-BOM
                                474e38d3-f8f3-4589-a7fa-678b22e3ebbb-image.png

                              2. Use Encoding > ANSI to convince Notepad++ that your bytes are ANSI, not UTF-8.

                                1. Before: 79c6d537-3efd-4078-a6e5-de032d2b631d-image.png
                                2. After: 768408a9-8e20-4500-b37f-a172cc33ae7d-image.png
                                3. You will notice that the “good” characters currently “look” wrong. Don’t worry about that for now. Trust me. But now “arulvaakku” looks right

                                WARNING: If you are not showing as “ANSI” encoding before starting step 3, you have not followed my instructions and this will not work! Step 2 will get you to the right point, but only if you have followed my instructions.

                              3. Do four search/replace operations. These will change all single and double smart quotes into the correct three-byte sequence. (Use regular expression search mode for all search/replace below.)

                                1. search \x91 replace \xE2\x80\x98
                                2. search \x92 replace \xE2\x80\x99
                                3. search \x93 replace \xE2\x80\x9C
                                4. search \x94 replace \xE2\x80\x9D

                                At this point, it will look “worse”, but that’s okay. Trust me. ed37959f-afd0-4854-ab01-0def66fc86ad-image.png

                              4. Use the Encoding > UTF-8 to tell Notepad++ to re-interpret the file as if the bytes were UTF-8, which is what you want. At this point, everything looks good:
                                1f368cdd-719a-4a86-9adc-5c796a8e329b-image.png

                              5. SAVE
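For comparison, the same repair can be sketched in Python. This is an alternative under stated assumptions, not the procedure above: it decodes the file as UTF-8 and falls back to Windows-1252 only for bytes that are not valid UTF-8. The handler name `cp1252_fallback` and the function `repair_mixed` are made up for this example. One thing worth checking before relying on any byte-level replace: the values 0x91 to 0x94 can also occur as continuation bytes inside a valid multi-byte UTF-8 character (U+0C93, for instance, encodes as E0 B2 93), and the fallback approach leaves those untouched.

```python
import codecs


def _cp1252_fallback(exc):
    """Error handler: decode undecodable bytes as Windows-1252 instead."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # Bytes that Windows-1252 leaves undefined become U+FFFD.
        return bad.decode("cp1252", errors="replace"), exc.end
    raise exc


# The handler name is arbitrary; it only needs to match the decode() call.
codecs.register_error("cp1252_fallback", _cp1252_fallback)


def repair_mixed(data: bytes) -> str:
    """Decode UTF-8 text that contains stray Windows-1252 bytes."""
    return data.decode("utf-8", errors="cp1252_fallback")


mixed = "caf\u00e9 ".encode("utf-8") + b"\x93hi\x94"
print(repair_mixed(mixed))  # café “hi”
```

Writing the result back with `.encode("utf-8")` gives a file that is valid UTF-8 throughout. As with the manual procedure, work on a backup copy first.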

                              I think I should check all that individually but if you know a less time consuming method, please let me know! I observed that at most places, they are showing up as they should, it is only in some places that a unicode is shown (probably a bug)

                              My method won’t be great if you have a lot of files. If there is a bug, it’s a bug in how your HTML was generated.

                              Alternatives

                              If the only non-ASCII characters in your entire file are the x93 and x94 smart quotes, then just ignore how it “looks” in Notepad++, and tell your webserver that the file is encoded as Windows-1252 (using both server settings and maybe a meta-charset HTML tag).

                              If the only non-ASCII characters in your entire file are x93 and x94 smart quotes, then try to convince Notepad++ to automatically interpret it as ANSI. Some things to try to get that result:

                              1. Settings > Preferences > New Document: a9035066-10fa-4947-8960-9057f194e036-image.png

                                • Set “Encoding” to either “ANSI” or “Windows-1252”
                                • Make sure “Apply to opened ANSI files” is not checked
                              2. Settings > Preferences > MISC

                                • Try changing the setting of “Autodetect character encoding” to either checked or not.

                              After changing any of those settings, you may have to reload your file to get Notepad++ to apply its new settings. I do not guarantee that these settings will work for you… the auto-detect is notorious for disagreeing with the user as to what encoding it thinks is there, and everyone has different ideas of the “right” settings, depending on what their text normally looks like, and what bytes they contain.

                              3. After loading a file, if Notepad++ doesn’t get it right, and you see the x93 and x94 boxes, just switch to Encoding > ANSI and everything will look right. On that file, you’d definitely want to include the meta-charset tag.

                              Non-Notepad++ Alternative

                              If you have lots of files that have mixed encoding with some normal UTF-8 characters and some Windows-1252 smart quotes, it might not be efficient to make the changes in Notepad++. Instead, you might want to find a non-Notepad++ solution. I would suggest trying command line tools, maybe like “iconv” or “sed” – there are Windows versions of those tools, but this forum is not the right place to find help on those.
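As one example of what a non-Notepad++ approach might look like, here is a Python sketch that only reports which files in a folder are not valid UTF-8, without changing anything. It is an illustration, not a recommendation from the thread; `files_with_invalid_utf8` and the `*.html` default pattern are assumptions chosen to match the situation described.

```python
from pathlib import Path


def files_with_invalid_utf8(folder, pattern="*.html"):
    """Map each non-UTF-8 file under folder to the offset of its first bad byte."""
    report = {}
    for path in sorted(Path(folder).rglob(pattern)):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as exc:
            report[path] = exc.start
    return report
```

Calling `files_with_invalid_utf8("path/to/site")` (a placeholder path) returns an empty dictionary when every matching file decodes cleanly. A read-only scan like this is a reasonable first step, because it shows how many files are affected before any of them is modified.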

                              Done

                              I have explained these to the best of my ability. I am not confident that you have understood the points I have been making, or my instructions for how to fix your data. Unfortunately, I don’t know how else to say it. If you have more questions, feel free to ask; but I am going to likely leave it up to someone different to step in and try to help you, because I don’t know what more I could say that I haven’t already said.

                              • PeterJonesP
                                PeterJones @Ramanand Jhingade
                                last edited by

                                @Ramanand-Jhingade ,

                                If you end up going down the route of non-Notepad++ solutions (remembering that here is not the right place to ask questions if you do), @Vasile-Caraus has posted a couple of non-Notepad++ tools that might be able to do the search-and-replace in the way that you want, the tools listed in these two posts. The second tool which he mentioned, grepWin, has been recommended by other users on the forum as well, especially in circumstances when Notepad++'s find-in-files wasn’t properly handling encoding-detection.

                                Vasile showed it working for the bytes ï¿½ (which is the UTF-8 encoding for �) because that was the focus of that previous discussion. But it will likely also work if you wanted to replace \x93 with “ and \x94 with ” – and might be easier for you to figure out than iconv or command-line grep.

                                Given your requirements, grepWin may be the best tool for you for this particular smart-quote problem. (if you have grepWin questions, you will need to find a grepWin forum or other generic help site, because the Notepad++ Community is focused on Notepad++)

                                • Ramanand JhingadeR
                                  Ramanand Jhingade @PeterJones
                                  last edited by Ramanand Jhingade

                                  @PeterJones I finally found a solution here: how-to-find-non-ascii-unprintable-characters-using-notepad-plus-plus
                                  We have to select the Regular expression mode and search/find with this code: [\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]
                                  I will do the replacements one by one instead of using “Replace all”
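To see what that character class covers, here is a Python sketch using the same pattern with Python's `re` module. Notepad++ uses a different regex engine, and how it represents an invalid byte internally is not something this example can confirm, so treat it only as a reading aid for the pattern itself: it matches the C0 control characters except tab, line feed and carriage return, plus DEL and the C1 range U+0080 to U+009F.

```python
import re

# Same character class as in the post above.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]")

# Tab, CR and LF are ordinary whitespace; U+0093 and U+0094 are C1 controls;
# the properly encoded smart quotes are U+201C and U+201D.
sample = "tab\there \x93quoted\x94 and \u201cproper\u201d\r\n"

hits = [(m.start(), f"U+{ord(m.group()):04X}") for m in CONTROL_CHARS.finditer(sample)]
print(hits)  # [(9, 'U+0093'), (16, 'U+0094')]
```

Properly encoded smart quotes are not matched, so going through the hits one by one, as described above, only visits the suspicious characters.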

                                  • Ramanand JhingadeR
                                    Ramanand Jhingade @Ramanand Jhingade
                                    last edited by Ramanand Jhingade

                                    @Ramanand-Jhingade The Find All is making Notepad++ stop working and close if I use the above code. Any suggestions to avoid that?

                                    • PeterJonesP
                                      PeterJones @Ramanand Jhingade
                                      last edited by

                                      @Ramanand-Jhingade ,

                                      Find All is making Notepad++ to stop working and close if I use the above code.

                                      Which Find All do you mean? Do you mean the Find > Find All in Current Document, Find > Find All in Opened Documents, or Find in Files > Find All ?

                                      Please note that the Find in Files adds another level of confusion, because Notepad++ is trying to figure out the encoding on each file individually, and depending on the bytes in the file and your settings (as described above), it might think some are UTF-8 and others are ANSI or might pick a strange character-set value. The Find in Files isn’t great with non-ASCII characters, unfortunately. There are bug reports / feature requests, but they are taking time to get worked out.

                                      I suggest doing one file at a time for now.

                                      • Ramanand JhingadeR
                                        Ramanand Jhingade @PeterJones
                                        last edited by

                                        @PeterJones @Ekopalypse Thank you both for your time and help. @PeterJones Please post here if the bug is fixed and I can Find all/search in multiple files of a folder

                                        • PeterJonesP
                                          PeterJones @Ramanand Jhingade
                                          last edited by

                                          Please post here if the bug is fixed and I can Find all/search in multiple files of a folder

                                          Don’t misunderstand. Find in Files > Find All works for ASCII characters. And it works with valid characters in well-defined encodings (so a UTF-8-BOM or UCS-2-LE BOM file should properly search-and-replace with any valid character). It only has trouble when you’re making Notepad++ guess the encoding (one of the many character-set “encodings”) or when there are invalid characters (byte x93 all alone rather than in the appropriate multi-byte sequence in a UTF-8 file). So, for your unusual use-case, it’s not currently working; but usually, it does.
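The reason BOM-marked files count as "well-defined" is that the first few bytes state the encoding outright, so nothing has to be guessed. A minimal Python sketch of that check follows; `sniff_bom` is a hypothetical helper, and it deliberately ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs


def sniff_bom(data: bytes):
    """Return a codec name if data starts with a known BOM, else None."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    # No BOM: the encoding has to be declared elsewhere or guessed.
    return None


print(sniff_bom("\u201chi\u201d".encode("utf-8-sig")))  # utf-8-sig
print(sniff_bom(b"\x93hi\x94"))  # None
```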

                                          • Ramanand JhingadeR
                                            Ramanand Jhingade @PeterJones
                                            last edited by

                                            @PeterJones By God’s grace, the Find all/search in multiple files of a folder using the Regular expression mode finally worked after I removed all the files that did not end with the .html extension. I thank you again from the bottom of my heart!
