How to Find "�" in multiple notepad++ files ?



  • hello. I don’t know what is this � but I need to delete from all my files. I try to simple find it, but is not find it. I must deleted one by one from all files.

    any idea ?



  • So, if you were to read a UTF-8 file assuming some other 8-bit encoding (like ANSI), those three one-byte characters could appear when it’s reading an actually three-byte character.

    Specifically, those are the bytes EF BF BD, which are the three-byte encoding for U+FFFD, which is the Unicode “REPLACEMENT CHARACTER”: � is “used to replace an incoming character whose value is unknown or unrepresentable in Unicode”. So what I’m guessing really happened is that whatever generated your text file got confused about some character (it was told to output a character it didn’t know), so it output UTF-8 for the �; you then read it with an ANSI-type encoding, rather than reading it in UTF-8, so it showed up as three characters, rather than a single Unicode character �. Probably searching for the three characters � in all open files, or search-in-files option, is treating those as three valid unicode characters, so looking for more than those three bytes, and it never finds them.

    You might be able to search for \x{fffd} and replace it with whatever replacement you want for the REPLACEMENT CHARACTER. But it probably depends on what encoding you have Notepad++ set to default to, and whether you try to autodetect encoding, whether that search-and-replace will work for you.

    As a hint: anytime you see two three random-seeming bytes that are in the 0x80-0xFF range in a row, think “this was probably UTF-8, but incorrectly read as ANSI”; those random 8bit bytes are just what happen when you misinterpret UTF-8.



  • Followup: one thing you can do to find out what a random group of those, like �, represents:

    1. Create a new empty file in Notepad++
    2. Set Encoding > ANSI (not Encoding > Convert to ANSI)
    3. Paste those characters in
    4. Set Encoding > UTF-8 (not Encoding > Convert to UTF-8)
    5. The UTF-8 text is now visible, and hopefully more readable

    If you follow that sequence with �, you will get

    Now try it yourself. With the byte sequence ŃōťëƤāđ┼┿, what obfuscated text would those bytes be if properly inerpreted as UTF-8?



  • Hello , @Vasile-caraus, @peterjones and All,

    @vasile-caraus, I first thought that you probably, wrongly, saw the Byte Order Mark ( BOM ) of an Unicode encoded file. Normally, these three or two bytes, at the very beginning of an Unicode file, are not part of file and are invisible in any decent text editor. This character, of Unicode code-point \x{FEFF}, is used to signal the right bytes order, in Unicode files !

    Refer to the link, below, for further information :

    https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding


    However, I was totally wrong, because, in the first row of that table, the three bytes of the BOM ( EF BB BF) in an UTF-8 encoded file ( column 2 ), correspond to the three chars  in a Windows-1252 encoded file ( column 4 ) which is the usual ANSI encoding for Latin languages.

    Therefore, it seemed quite different from your three bytes � that you’ve noticed ! And I was wondering about this difference, without finding any satisfying solution :-(( Luckily, @peterjones did find the correct explanation !


    Regarding the Unicode Replacement character, refer to that link, below :

    https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character

    The fact that you see the three bytes �, also means that you currently use the Windows-1252 encoding, as default ANSI encoding, for all your Non-Unicode programs. To verify this statement, could you verify that the contents of the table, below :

    https://en.wikipedia.org/wiki/Windows-1252

    is identical to the table displayed when you click on the menu option Edit > Character Panel, from within Notepad++ ?


    May be, this link to this tiny in-line UTF-8 tool could be of some interest :

    http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?

    • First, read the few notes, at the end of the screen

    • Secondly, choose the representation of the searched Unicode character, selecting the appropriate radio button, from the list :

      • Character
      • Hex code-point
      • Decimal code-point
      • Hex UTF-8 bytes
      • Octal UTF-8 bytes
      • UTF-8 bytes as Latin-1 characters
      • Hex surrogates
    • Thirdly, type in that representation

    • Click on the Go button => All the representations, of the Unicode char searched, will be displayed

    For instance :

    • Select the Hex UTF-8 bytes radio button, first

    • Type in EF BF BD, with a space between each byte

    • Click on the Go button

    => You’ll get all the representations of the Unicode Replacement character

    Now :

    • Select the Hex UTF-8 bytes radio button, first

    • Type in EF BB BF, with a space between each byte

    • Click on the Go button

    => This time, you’ll get all data, regarding the Unicode Byte Order Mark character. Note that, if this character is found at any location, different from the very beginning of current file, it is known as the Unicode ZERO WIDTH NO-BREAK SPACE character !

    Best Regards,

    guy038


Log in to reply