    How to Find "ï¿½" in multiple notepad++ files ?

    6 Posts 3 Posters 3.4k Views

    • Vasile Caraus

      Hello. I don’t know what this ï¿½ is, but I need to delete it from all my files. I tried a simple search for it, but the search doesn’t find it, so I have to delete it one by one in every file.

      Any ideas?

      • PeterJones

        If a UTF-8 file is read as though it were in some 8-bit encoding (like ANSI), a single three-byte character shows up as three separate one-byte characters, which is what you are seeing.

        Specifically, those are the bytes EF BF BD, which are the three-byte UTF-8 encoding of U+FFFD, the Unicode “REPLACEMENT CHARACTER”: � is “used to replace an incoming character whose value is unknown or unrepresentable in Unicode”. So what I’m guessing really happened is that whatever generated your text file got confused about some character (it was told to output a character it didn’t know), so it output the UTF-8 for �; you then read the file with an ANSI-type encoding rather than as UTF-8, so it showed up as the three characters ï¿½ rather than as the single Unicode character �. Searching for the three characters ï¿½ in all open files, or with the search-in-files option, probably treats them as three valid Unicode characters, so it looks for more than those three bytes and never finds them.
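
To make the byte-level story concrete, here is a small Python sketch (Python is my choice for illustration, it is not something Notepad++ uses). It assumes "ANSI" means Windows-1252, which is only the case on some systems.

```python
# U+FFFD is the Unicode REPLACEMENT CHARACTER.
replacement = "\ufffd"

# Its UTF-8 encoding is the three bytes EF BF BD.
utf8_bytes = replacement.encode("utf-8")
print(utf8_bytes.hex(" "))  # ef bf bd

# Misreading those bytes as Windows-1252 (assumed ANSI code page) gives
# three separate characters: i-diaeresis, inverted question mark, one-half.
mojibake = utf8_bytes.decode("cp1252")

# Searching a UTF-8 file for those three characters as text means looking
# for their own UTF-8 encodings, which is six bytes, not the original three.
searched_bytes = mojibake.encode("utf-8")
print(searched_bytes.hex(" "))  # c3 af c2 bf c2 bd
```

The last line is one plausible reason a text search for the three visible characters comes up empty: the bytes being searched for are not the bytes in the file.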

        You might be able to search for \x{fffd} (with Search Mode set to Regular expression) and replace it with whatever you want in place of the REPLACEMENT CHARACTER. Whether that search-and-replace works for you probably depends on which encoding Notepad++ is set to default to, and on whether it tries to autodetect the encoding.
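
If the Notepad++ search does not cooperate, the same replacement can be scripted outside the editor. The following is only a sketch under assumptions: the files are valid UTF-8, the function name `replace_in_folder` and the `*.txt` default are made up for this example, and you have a backup of the folder before running it.

```python
from pathlib import Path

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, stored as EF BF BD in UTF-8


def replace_in_folder(folder, pattern="*.txt", new_text=""):
    """Replace U+FFFD in every matching UTF-8 file; return the changed paths."""
    changed = []
    for path in sorted(Path(folder).rglob(pattern)):
        if not path.is_file():
            continue
        raw = path.read_bytes()  # bytes in, bytes out: line endings are preserved
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            continue  # not valid UTF-8, so leave the file untouched
        if REPLACEMENT_CHAR in text:
            fixed = text.replace(REPLACEMENT_CHAR, new_text)
            path.write_bytes(fixed.encode("utf-8"))
            changed.append(path)
    return changed
```

Calling `replace_in_folder("some_folder")` deletes every U+FFFD; pass `new_text="?"` (or anything else) to substitute instead. Files that fail to decode as UTF-8 are skipped on purpose rather than guessed at.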

        As a hint: anytime you see two or three random-seeming characters in a row that are in the 0x80-0xFF range, think “this was probably UTF-8, but incorrectly read as ANSI”; those random 8-bit characters are just what happens when you misinterpret UTF-8.

        • PeterJones

          Followup: one thing you can do to find out what a random group of those, like ï¿½, represents:

          1. Create a new empty file in Notepad++
          2. Set Encoding > ANSI (not Encoding > Convert to ANSI)
          3. Paste those characters in
          4. Set Encoding > UTF-8 (not Encoding > Convert to UTF-8)
          5. The UTF-8 text is now visible, and hopefully more readable

          If you follow that sequence with ï¿½, you will get �
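
The same ANSI-then-UTF-8 reinterpretation can be sketched in Python. This assumes the ANSI code page is Windows-1252; the function name is invented for the example.

```python
def reinterpret_ansi_as_utf8(garbled, ansi_codec="cp1252"):
    """Turn ANSI-misread text back into bytes, then read those bytes as UTF-8.

    Returns None when the characters cannot be mapped back to single bytes,
    or when the resulting bytes are not valid UTF-8 (so the text was
    probably not misread UTF-8 in the first place).
    """
    try:
        return garbled.encode(ansi_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


# The three characters from this thread map back to U+FFFD.
assert reinterpret_ansi_as_utf8("\u00ef\u00bf\u00bd") == "\ufffd"
# Genuine ANSI text such as "café" is not valid UTF-8, so nothing is returned.
assert reinterpret_ansi_as_utf8("caf\u00e9") is None
```

Pure ASCII input comes back unchanged, because ASCII is encoded identically in both encodings.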

          Now try it yourself. With the byte sequence ŃōťëƤāđ┼┿, what obfuscated text would those bytes be if properly interpreted as UTF-8?

          • guy038

            Hello, @Vasile-caraus, @peterjones and All,

            @vasile-caraus, I first thought that you were, mistakenly, seeing the Byte Order Mark ( BOM ) of a Unicode encoded file. Normally, these three or two bytes, at the very beginning of a Unicode file, are not part of the file contents and are invisible in any decent text editor. This character, of Unicode code-point \x{FEFF}, is used to signal the byte order in Unicode files !

            Refer to the link, below, for further information :

            https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding


            However, I was totally wrong, because, in the first row of that table, the three bytes of the BOM ( EF BB BF ) in a UTF-8 encoded file ( column 2 ) correspond to the three chars ï»¿ in a Windows-1252 encoded file ( column 4 ), which is the usual ANSI encoding for Latin languages.

            Therefore, it seemed quite different from the three characters ï¿½ that you’ve noticed ! And I was wondering about this difference, without finding any satisfying explanation :-(( Luckily, @peterjones did find the correct one !
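
For comparison, here is a short Python sketch of the difference between the two sequences. It assumes Windows-1252 as the ANSI code page, and the variable names are only illustrative.

```python
import codecs

# The UTF-8 Byte Order Mark is the three bytes EF BB BF.
bom_bytes = codecs.BOM_UTF8

# Read as Windows-1252, the BOM shows up as three visible characters whose
# middle one is U+00BB (right-pointing double angle quotation mark).
bom_as_ansi = bom_bytes.decode("cp1252")

# The Replacement Character (EF BF BD) differs in its last two bytes, so the
# ANSI view differs too: U+00BF then U+00BD instead of U+00BB then U+00BF.
replacement_as_ansi = "\ufffd".encode("utf-8").decode("cp1252")

# Read correctly as UTF-8, the BOM is the single character U+FEFF.
bom_as_utf8 = bom_bytes.decode("utf-8")
```

Both sequences start with EF, which is why they look similar at a glance when misread as ANSI.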


            Regarding the Unicode Replacement character, refer to that link, below :

            https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character

            The fact that you see the three characters ï¿½ also means that you currently use the Windows-1252 encoding, as default ANSI encoding, for all your Non-Unicode programs. To confirm this, could you check that the contents of the table, below :

            https://en.wikipedia.org/wiki/Windows-1252

            is identical to the table displayed when you click on the menu option Edit > Character Panel, from within Notepad++ ?


            Maybe this link to a tiny online UTF-8 tool could be of some interest :

            http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?

            • First, read the few notes, at the end of the screen

            • Secondly, choose the representation of the searched Unicode character, selecting the appropriate radio button, from the list :

              • Character
              • Hex code-point
              • Decimal code-point
              • Hex UTF-8 bytes
              • Octal UTF-8 bytes
              • UTF-8 bytes as Latin-1 characters
              • Hex surrogates
            • Thirdly, type in that representation

            • Click on the Go button => All the representations, of the Unicode char searched, will be displayed

            For instance :

            • Select the Hex UTF-8 bytes radio button, first

            • Type in EF BF BD, with a space between each byte

            • Click on the Go button

            => You’ll get all the representations of the Unicode Replacement character

            Now :

            • Select the Hex UTF-8 bytes radio button, first

            • Type in EF BB BF, with a space between each byte

            • Click on the Go button

            => This time, you’ll get all the data regarding the Unicode Byte Order Mark character. Note that, if this character is found at any location other than the very beginning of the current file, it is known as the Unicode ZERO WIDTH NO-BREAK SPACE character !
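
If the online tool is unavailable, several of the same representations can be approximated with the Python standard library. This is only a sketch: the function names and dictionary keys are invented here, and it is not how the tool itself is implemented.

```python
import unicodedata


def from_hex_utf8(hex_bytes):
    """Decode a string such as 'EF BF BD' (hex UTF-8 bytes) into text."""
    return bytes.fromhex(hex_bytes).decode("utf-8")


def describe(char):
    """Return several representations of a single Unicode character."""
    utf8 = char.encode("utf-8")
    return {
        "name": unicodedata.name(char, "<unnamed>"),
        "hex_code_point": f"U+{ord(char):04X}",
        "decimal_code_point": ord(char),
        "hex_utf8_bytes": utf8.hex(" ").upper(),
        "octal_utf8_bytes": " ".join(f"{b:03o}" for b in utf8),
        "utf8_bytes_as_latin1": utf8.decode("latin-1"),
    }


replacement_info = describe(from_hex_utf8("EF BF BD"))
bom_info = describe(from_hex_utf8("EF BB BF"))
print(replacement_info["name"])  # REPLACEMENT CHARACTER
print(bom_info["name"])          # ZERO WIDTH NO-BREAK SPACE
```

Note that the Unicode name database always reports U+FEFF as ZERO WIDTH NO-BREAK SPACE, regardless of where the character sits in a file.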

            Best Regards,

            guy038

            • Vasile Caraus

              The safest and easiest way to find Unicode characters such as ï¿½ or � is to use Adobe Dreamweaver. Use the option Find in Folder… and select the folder that you want to search. It always finds Unicode characters. You can also use the Replace section. It works wonderfully!

              • Vasile Caraus

                Another VERY GOOD method to find ï¿½ or � Unicode characters is to use the grepWin tool. You have to check the box “treat file as binary” so grepWin won’t try to detect the encoding.
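
The idea behind a binary search (look for the raw bytes and never guess an encoding) can also be sketched in Python. This is not grepWin's implementation; the function name and defaults are assumptions made for the example.

```python
from pathlib import Path

REPLACEMENT_UTF8 = b"\xef\xbf\xbd"  # raw bytes of U+FFFD in UTF-8


def count_byte_pattern(folder, needle=REPLACEMENT_UTF8, pattern="*"):
    """Map each matching file under folder to how many times needle occurs."""
    hits = {}
    for path in sorted(Path(folder).rglob(pattern)):
        if path.is_file():
            # Compare raw bytes only, so no encoding detection is involved.
            count = path.read_bytes().count(needle)
            if count:
                hits[path] = count
    return hits
```

Passing `needle=bytes.fromhex("c3afc2bfc2bd")` would instead find files where the three misread characters were themselves saved as UTF-8 text.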
