    "Find in files" special characters not working anymore

    • Pro Bg

      Hello.

      I’ve been doing Search and Replace across a bunch of files for years, until one day it stopped working.

      I have TV show subtitles containing special characters that don’t show up on my TV, so I replace those with normal letters. There are a lot of them, so I need to batch the replacements.

      The chars don’t show as they should in notepad++ either, though I’m on UTF-8, they are ș ț Ș Ț and I was replacing them like this:

      º -> s
      þ -> t
      ª -> S
      Þ -> T

      Find in current file works just fine on my characters. Find and replace in files works just fine on normal letters.

      For example if I search for º in the open file it finds it just fine. Not in Find in Files, where I get 0 results. But if I look for any normal letter it works as it should.

      I’ve reinstalled hoping some setting blew up but no luck, probably something on my part :(

      Notepad++ v7.7.1 (64-bit)
      Build time : Jun 16 2019 - 21:24:47
      Path : C:\Program Files\Notepad++\notepad++.exe
      Admin mode : OFF
      Local Conf mode : OFF
      OS : Windows 10 (64-bit)
      Plugins : AutoSave.dll BetterMultiSelection.dll Explorer.dll mimeTools.dll NppConverter.dll NppToolBucket.dll PreviewHTML.dll

      • guy038

        Hello, @Pro-Bg and All,

        You said, in your post :

        The chars don’t show as they should in notepad++ either, though I’m on UTF-8, they are ș ț Ș Ț and I was …

        So, seemingly, you refer to the 4 characters, below :

        From the Latin Extended-B Unicode Script [ 0180 – 024F ] :
        
        |  0218  |  Letter Ș  |  LATIN CAPITAL LETTER S WITH COMMA BELOW  |
        
        |  0219  |  Letter ș  |  LATIN SMALL LETTER S WITH COMMA BELOW    |
        
        |  021A  |  Letter Ț  |  LATIN CAPITAL LETTER T WITH COMMA BELOW  |
        
        |  021B  |  Letter ț  |  LATIN SMALL LETTER T WITH COMMA BELOW    |
        

        See, to that purpose :

        http://www.unicode.org/charts/PDF/U0180.pdf

        However, judging from the characters that actually appear in your file ( ª º Þ and þ ), which are the result of a wrong encoding, I suppose that you are referring, instead, to these 4 characters :

        From the Latin Extended-A Unicode Script [ 0100 - 017F ] :
        
        |  015E  |  Letter Ş  |  LATIN CAPITAL LETTER S WITH CEDILLA  |
        
        |  015F  |  Letter ş  |  LATIN SMALL LETTER S WITH CEDILLA    |
        
        |  0162  |  Letter Ţ  |  LATIN CAPITAL LETTER T WITH CEDILLA  |
        
        |  0163  |  Letter ţ  |  LATIN SMALL LETTER T WITH CEDILLA    |
        

        See, to that purpose :

        http://www.unicode.org/charts/PDF/U0100.pdf
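
As a rough illustration of the "erroneous encoding" idea, the Python sketch below decodes the same four bytes under three code pages. The byte values 0xAA, 0xBA, 0xDE and 0xFE are an assumption, inferred from the characters ª º Þ þ quoted in the first post; the helper `code_points` is a hypothetical name used only here.

```python
# Assumption: these four byte values are what the subtitle files actually
# contain for the Romanian letters (inferred from the characters quoted above).
raw = bytes([0xAA, 0xBA, 0xDE, 0xFE])

as_western = raw.decode("cp1252")    # Western European "ANSI" view
as_central = raw.decode("cp1250")    # Windows-1250
as_latin2 = raw.decode("iso8859_2")  # ISO 8859-2


def code_points(text):
    """Format each character as U+XXXX so the output is console-safe."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


print("cp1252   :", code_points(as_western))
print("cp1250   :", code_points(as_central))
print("iso8859_2:", code_points(as_latin2))
```

Under these assumptions, the Western European reading gives ª º Þ þ, while both Central / Eastern European readings give the cedilla letters Ş ş Ţ ţ of the table above.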

        Note that if you intend to copy / paste some characters from these PDF files of the Unicode Consortium, I advise you to download them first, because, depending on your browser, some characters, although displayed correctly, may not be pasted correctly :-((

        So, before going any further, which set of characters are you referring to ? Indeed, depending on the set of characters used, we would need a different font, which properly handles these characters and correctly displays their glyphs !

        See you later,

        Best Regards,

        guy038

        • Pro Bg

          Sorry, I was looking at another file and presumed they were all UTF-8, but noticed later that they’re ANSI. That’s how the subbers made them, I have no idea.

          And yes, those are the Romanian letters, with comma below, not cedilla, but subbers in my country follow their own rules…

          I’ll upload two of the subtitles here https://gofile.io/?c=Hz4Uts because I don’t have enough privileges to upload in this topic.

          I can do search and replace in the current open file and it works just fine, it’s just the function that finds in files that doesn’t seem to work…

          • guy038

            Hi, @Pro-Bg and All,

            Firstly, when you begin to ask about character representation and/or codes, the best thing is to ask yourself : Does my operating system contain a font which can properly handle these characters and correctly display their glyphs ?

            Now, unfortunately, these 4 Romanian characters Ș, ș, Ț and ț, of Unicode code-points 0218, 0219, 021A and 021B, are handled by very few proportional fonts and, AFAIK, by the monospaced font Consolas only !

            So I advise you to use the Consolas font, which should be part of your system… On Windows 7, its version is 5.22 and, from Windows 8 onwards, its version is 5.32 and contains 2,735 glyphs

            From within notepad++ :

            • Select the Settings > Style Configurator option

            • Select Global styles in the Language drop-down list

            • Select Default style in the Style drop-down list

            • In the Font Style area, choose the Consolas font, from the drop-down list of fonts

            • Click on the Save & Close button

            Remark :

            In Notepad++, comparing the glyphs of these 4 Romanian characters ( with comma below ) with their equivalent chars ( with a cedilla ), with the Consolas font, I noticed, when maximum zoom is used, that :

            • Regarding the letter S and s, the cedilla seems closer to the bottom of character than the comma : Ș ș Ş ş

            • Regarding the letter T and t, the character appearance seems rather identical : Ț ț Ţ ţ


            Secondly, I don’t see any reason which could explain that the search/replacement would work when using the Replace dialog and NOT with the Find in Files dialog !

            Two solutions :

            • Open the Replace dialog

              • SEARCH (\x{0218})|(\x{0219})|(\x{021A})|(\x{021B})

              • REPLACE (?1S)(?2s)(?3T)(?4t)

              • Untick the Match whole word only, if necessary

              • Tick the Match case box option ( Important )

              • Tick the Wrap around box option

              • Select the Regular expression search mode

              • Click on the Replace All button

            • Open the Find in Files dialog

              • SEARCH (\x{0218})|(\x{0219})|(\x{021A})|(\x{021B})

              • REPLACE (?1S)(?2s)(?3T)(?4t)

              • Type in the correct file type in the Filters: zone

              • Type in the correct absolute path name to your file, in the Directory: zone or click on the Follow current doc. box option

              • Choose, optionally, the In all sub-folders box option, if you need to browse a file tree

              • Untick the Match whole word only, if necessary

              • Tick the Match case box option ( Important )

              • Select the Regular expression search mode

              • Click on the Replace in Files button

              • Confirm the Are you sure? dialog

            Notes :

            • In search, each of these 4 characters \x{####} is stored in its own group, numbered from 1 to 4, due to the surrounding parentheses ()

            • In replacement, due to the conditional replacement syntax (?#....), where # is the number of the matched group, the appropriate standard replacement letter, S, s, T or t is just rewritten !
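
For readers who prefer a script, the group-based idea in these notes can be sketched in Python. This is only an approximation: Python's `re` module has no `(?1S)` conditional replacement syntax, so a callback picks the letter from the number of the group that matched. The function name `strip_comma_below` is hypothetical, and the sketch only helps if the text really contains the code-points 0218 to 021B.

```python
import re

# One capture group per letter, in the same order as the search regex above.
PATTERN = re.compile("(\u0218)|(\u0219)|(\u021A)|(\u021B)")

# Replacement letter for group 1, 2, 3 and 4 respectively.
REPLACEMENTS = ["S", "s", "T", "t"]


def strip_comma_below(text):
    """Replace the four comma-below letters with plain S, s, T and t."""
    # m.lastindex is the number of the single group that matched.
    return PATTERN.sub(lambda m: REPLACEMENTS[m.lastindex - 1], text)
```

For example, `strip_comma_below("\u0218tefan \u0219i \u021Aara")` returns `"Stefan si Tara"`.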

            Cheers,

            guy038

            • PeterJones

              I’m going to assume the solution @guy038 posted will work, because his solutions usually do (or, at least, they are moving in the direction of working for whoever asked the question, because Guy doesn’t stop until they do work).

              However, before he posted, I had started down a non-regex road; I think it will be useful, so even after Guy’s post, I continued to write it up.

              @Pro-Bg said:

              …noticed later that they’re ANSI. That how the subbers made them, …
              And yes, those are the Romanian letters,

              When you said that, I took a look at the files. When you open them with Settings > Preferences > MISC > ☑ Autodetect character encoding enabled, they detect as “ANSI”, and those characters show up as you originally posted. Since you said “Romanian”, I assumed maybe it was really a Central or Eastern European encoding used, rather than the default “ANSI” Western European encoding.

              So I went to Encoding > Character Sets > Eastern European: choosing ISO 8859-2 appeared to work. But while writing this up, I realized that Romanian can be considered Central European as well, so I tried choosing … > Central European > OEM 852, which turned those characters into box-drawing characters, so that was obviously wrong. … > Central European > Windows 1250 appeared to convert those to the right characters as well.

              I don’t know all the differences between ISO 8859-2 and Windows 1250 – ah, per Wikipedia, “Windows-1250 is similar to ISO-8859-2 and has all the printable characters it has and more. However a few of them are rearranged”. You would have to know more about the files to determine which of those encodings they really are; though my guess, if they’re for subtitles, then they were done with the ISO 8859-2, not the Microsoft-centric Windows-1250.
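
One way to see whether the choice between the two encodings matters for a given file is to list the byte values that they decode differently, then check whether the file uses any of them. The following is a minimal Python sketch; the function names `differing_bytes` and `encodings_agree_on` are hypothetical, and `iso8859_2` / `cp1250` are Python's codec names for ISO 8859-2 and Windows-1250.

```python
def differing_bytes(enc_a="iso8859_2", enc_b="cp1250"):
    """Return the set of byte values that the two encodings decode differently."""
    diff = set()
    for value in range(256):
        raw = bytes([value])
        # errors="replace" keeps undefined positions from raising an exception.
        if raw.decode(enc_a, errors="replace") != raw.decode(enc_b, errors="replace"):
            diff.add(value)
    return diff


def encodings_agree_on(data):
    """True if every byte in data decodes identically under both encodings."""
    return not (set(data) & differing_bytes())
```

If `encodings_agree_on(open("sourcefile.srt", "rb").read())` is true, either encoding gives the same text for that file.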

              So really, in Notepad++, instead of doing a search-replace, all you need to do is to change the Encoding > Character Set to the appropriate one (probably ISO 8859-2, but maybe Windows-1250). After doing that, so it’s displayed properly, you should be able to read and edit the file to your heart’s content. If you’re going to be editing the file multiple times, I would suggest Encoding > Convert to UTF-8-BOM, so it will change the encoded single-byte Romanian characters to their UTF-8 multi-byte encoding, with the BOM character inserted at the beginning of the file. Once you save after the conversion, then the next time you open the file with Notepad++, it will properly interpret it as UTF-8, and all the characters will be interpreted and displayed correctly.

              As far as subtitles go: I’m guessing what prompted this is that the subtitles were showing up wrong in your video player of choice. My guess is that it was because your player didn’t know / couldn’t guess the right encoding for the file, so used ANSI like Notepad++ did. I don’t know whether your player handles UTF-8 better than a random encoding… but if it does, then maybe saving the file after converting to UTF-8-BOM will make it work right in your player. You might be able to google for your player’s name and “encoding” or “utf-8” or “unicode”, to find out which encoding it assumes or prefers.

              However, if you have a lot of files, Notepad++ might not be the most efficient for batch-converting the encoding.
              The superuser answer that I referenced in my post in another thread links to a version of iconv for Windows, which should be able to automate the conversion from ISO 8859-2 to UTF-8.

              iconv -f ISO-8859-2 -t utf-8 sourcefile.srt > outfile.srt
              

              To get that to do all files in a given directory, open a cmd.exe prompt in that directory, and run

              FOR %f in (*.srt) do @( iconv -f ISO-8859-2 -t utf-8 "%f" > "%~nf.utf8%~xf" )
              

              When I ran that on the marco polo files you showed us for download, it did properly convert them to utf-8.
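
If iconv is not available, the same batch conversion can be approximated with a short Python script. This is a sketch under assumptions, not a tested replacement for the command above: the function name `convert_dir` and its parameters are hypothetical, the source encoding defaults to ISO 8859-2, and the output naming (`name.utf8.srt`) mirrors the FOR loop.

```python
from pathlib import Path


def convert_dir(folder, src_encoding="iso8859_2", pattern="*.srt", bom=True):
    """Write a UTF-8 copy (name.utf8.srt) of every matching file in folder."""
    out_encoding = "utf-8-sig" if bom else "utf-8"  # utf-8-sig prepends a BOM
    written = []
    # sorted() materialises the listing first, so new copies are not re-read.
    for src in sorted(Path(folder).glob(pattern)):
        if src.stem.endswith(".utf8"):
            continue  # skip the output of an earlier run
        # Work on bytes so that line endings are preserved exactly.
        text = src.read_bytes().decode(src_encoding)
        dst = src.with_name(f"{src.stem}.utf8{src.suffix}")
        dst.write_bytes(text.encode(out_encoding))
        written.append(dst)
    return written
```

Calling `convert_dir(".")` in the subtitle folder would write a BOM-prefixed UTF-8 copy next to each original; pass `bom=False` to match iconv's output, which has no BOM.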

              • guy038

                Hello, @Pro-Bg, @peterjones and All,

                Thanks to @peterjones, I understood that I simply forgot to act in the right order :-(( So, @Pro-Bg, just forget the second part of my previous post, where I described the regex S/R, which is wrong :-((

                So, I, first, downloaded your archive and extracted the Marco Polo S01E01 The Wayfarer 720p BluRay DTS x264-EbP.srt file

                When opening your file, in Notepad++, I get an ANSI encoded file. BTW, I also tried to untick the Settings > Preferences > MISC > Autodetect character encoding option. Luckily, after re-opening Notepad++ and loading your file, its encoding had not changed and was still ANSI !

                I renamed your file with a shorter name and chose the .txt extension. So, from now on, your initial file will be named Test.txt !

                I’m about to show you 3 different methods to solve @Pro-Bg’s problem. Note that the first one is just Peter’s solution !


                FIRST method :

                • I used the iconv utility, as suggested by Peter, running the command, below, in a DOS console window :
                iconv -f ISO-8859-2 -t UTF-8 Test.txt > Test_ICONV.txt
                

                Indeed, the result is fine and the 4 characters ª , º, Þ and þ were correctly translated into the 4 chars Ş , ş, Ţ and ţ :-))

                Remark :

                If we assume that your file was, initially, a Windows-1250 encoded file and that we run the command, below :

                iconv -f WINDOWS-1250 -t UTF-8 Test.txt > Test_2.txt
                

                One can easily verify that the two output files are identical. So, regarding this file, these two encodings are equivalent. Nice !

                Note :

                Be aware, however, that the 4 characters Ş , ş, Ţ and ţ, in the output file, are letters with a cedilla and not the Romanian letters with a comma below : Ș , ș, Ț and ț !


                SECOND method :

                • Open a new file ( Ctrl + N )

                • If your default encoding, for new files, is not ANSI, select the first option Encoding > ANSI for this empty file. Note that, as your file is empty, you could, alternatively, run the option Encoding > Convert to ANSI

                => The ANSI encoding should be displayed in the status bar

                • Now, copy / paste the contents of the Test.txt file, in this new file

                • Then, run one of the two options :

                  • Encoding > Character Sets > Central European > Windows-1250

                  • Encoding > Character Sets > Eastern European > ISO 8859-2

                • A small window, with title Lose Undo Ability Waning pops up : You should save the current modification. All the saved modifications can not be undone. Continue ?

                • Choose the default choice, clicking on the Yes button

                • The Save as dialog then appears. So, save this new file as, let’s say, Test_NPP.txt

                => Note that the Windows-1250 ( or ISO 8859-2 ) encoding is shown in the status bar

                • Then select the Encoding > Convert to UTF-8 option ( Do not choose the UTF-8 only option ! )

                => This time, the UTF-8 encoding is displayed in the status bar

                • Save the modifications ( Ctrl + S )

                The nice thing is that the Test_NPP.txt file, built from within N++ and the Test_ICONV.txt file, output of the iconv Dos command, are strictly identical !


                THIRD method ( a bit longer ! ) :

                • Open a new file ( Ctrl + N )

                • If your default encoding, for new files, is not ANSI, select the first option Encoding > ANSI for this empty file

                => The ANSI encoding should be displayed in the status bar

                • Now, copy / paste the contents of Test.txt, in this new file

                • First, we’ll try to get rid of standard characters, in order to identify which characters would have a different byte sequence, when migrated to UTF-8. This concerns, principally, characters with code-point above \x7F. So :

                • Suppression of any ASCII character, with code in the [ 0 - 127 ] range :

                  • SEARCH [\x00-\x7f]+

                  • REPLACE Leave EMPTY

                • Leave only one character per line :

                  • SEARCH .

                  • REPLACE $0\r\n

                • Run the Edit > Line Operations > Sort Lines Lexicographically Ascending option

                • Run the Edit > Line Operations > Remove Consecutive Duplicate Lines option

                => You’re left with a tiny list of 9 characters ª º Ã Î Þ â ã î þ

                • Run the Encoding > Character Sets > Central European > Windows-1250 option

                • A small window, with title Lose Undo Ability Waning pops up : You should save the current modification. All the saved modifications can not be undone. Continue ?

                • Choose the default choice, clicking on the Yes button

                • The Save as dialog appears. So, save this new file, anywhere, with a dummy name

                => The Windows-1250 encoding is shown in the status bar

                • The tiny list has been changed into the 9 following characters Ş ş Ă Î Ţ â ă î ţ, rewritten, below, with their codes :
                Characters           Ş      ş      Ă      Î      Ţ      â      ă      î      ţ
                In Windows-1250    00AA   00BA   00C3   00CE   00DE   00E2   00E3   00EE   00FE
                ( Unicode value    015E   015F   0102   00CE   0162   00E2   0103   00EE   0163 )

                Refer, to that purpose, to the link :

                https://en.wikipedia.org/wiki/Windows-1250

                After examination of the different Unicode values, we can eliminate the 3 characters Î, â and î, which are identical in the two encodings ( Note that they correspond to the characters with a Unicode value under \x0100 )
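
The inventory built by hand in the steps above (delete the ASCII characters, split, sort, remove duplicates) can also be sketched in a few lines of Python working directly on the file's bytes. The function name `high_bytes_report` is hypothetical; `cp1252` is assumed to be the Western European "ANSI" code page and `cp1250` is Python's name for Windows-1250.

```python
def high_bytes_report(data):
    """List each distinct byte >= 0x80 with its cp1252 and cp1250 readings."""
    report = []
    for value in sorted({b for b in data if b >= 0x80}):
        raw = bytes([value])
        report.append((
            f"{value:02X}",
            raw.decode("cp1252", errors="replace"),  # how ANSI displays it
            raw.decode("cp1250", errors="replace"),  # how Windows-1250 reads it
        ))
    return report
```

Running `high_bytes_report(open("Test.txt", "rb").read())` should list one row per distinct non-ASCII byte; rows where the last two columns are equal are the characters that need no remapping.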

                • Open a new file ( Ctrl + N )

                • If your default encoding, for new files, is not ANSI, select the first option Encoding > ANSI for this empty file

                => The ANSI encoding should be displayed in the status bar

                • Now, copy / paste the contents of Test.txt, in this new file

                • Run the Encoding > Convert to UTF-8 option ( Do not choose the UTF-8 only option ! )

                => The UTF-8 encoding is, now, displayed in the status bar

                • Perform the following regex S/R :

                  • SEARCH (\x{00AA})|(\x{00BA})|(\x{00C3})|(\x{00DE})|(\x{00E3})|(\x{00FE})

                  • REPLACE (?1\x{015E})(?2\x{015F})(?3\x{0102})(?4\x{0162})(?5\x{0103})(?6\x{0163})

                => 733 replacements done

                • Save this new file and name it, let’s say, Test_REGEX.txt

                Again, the nice thing is that the Test_REGEX.txt file, built from within N++, with a regex S/R, and the Test_ICONV.txt file, output of the iconv Dos command, are strictly identical, too !
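
A different route to the same result, sketched here in Python rather than with a regex, is to undo the wrong decoding as a whole instead of listing the six characters one by one. This swaps in a round-trip technique that is not in the post above; it assumes the text was displayed as Western European `cp1252` and is really Windows-1250, and `redecode` is a hypothetical helper name.

```python
def redecode(text, shown_as="cp1252", really="cp1250"):
    """Turn mis-decoded text back into bytes, then decode it correctly."""
    # encode() recovers the original bytes; decode() reads them properly.
    return text.encode(shown_as).decode(really)
```

With the 9 characters of the list above, `redecode("ªºÃÎÞâãîþ")` returns `"ŞşĂÎŢâăîţ"`: the six characters of the regex change, and Î, â and î stay as they are.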

                Best Regards,

                guy038

                P.S. :

                Now, @Pro-Bg, if you really want to see the Romanian Ș , ș, Ț and ț letters, with comma below :

                • In N++, open any of the Test_ICONV.txt, Test_NPP.txt or Test_REGEX.txt output files ( identical UTF-8 encoded files ! )

                • Perform this last regex S/R :

                  • SEARCH (\x{015E})|(\x{015F})|(\x{0162})|(\x{0163}) ( Characters S and T with cedilla )

                  • REPLACE (?1\x{0218})(?2\x{0219})(?3\x{021A})(?4\x{021B}) ( Romanian Characters S and T with comma below )

                • Re-save your file
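
Outside Notepad++, this last substitution is a plain one-to-one character mapping, so a translation table is enough. A minimal Python sketch, where `CEDILLA_TO_COMMA` and `to_comma_below` are hypothetical names:

```python
# Map each cedilla letter to its comma-below counterpart.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # capital S
    "\u015F": "\u0219",  # small s
    "\u0162": "\u021A",  # capital T
    "\u0163": "\u021B",  # small t
})


def to_comma_below(text):
    """Replace S and T with cedilla by the Romanian comma-below letters."""
    return text.translate(CEDILLA_TO_COMMA)
```

All other characters, including ă, â and î, pass through unchanged.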

                P.P.S. :

                Those of you whose mother tongue is English are really lucky ! You have to worry very little about all these encoding problems ;-))

                • Pro Bg

                  Thank you for your time and answers, gentlemen, wasn’t expecting such support on this forum.

                  • Gregori

                    Hello!
                    I do not want to create a new topic because my problem is pretty much the same. But I would like to get a short answer, I’m not interested in the character coding stuff.

                    So, I have a lot of .cpp files, all of them are ANSI and to my luck, each comment made in Korean language. For example, there is a comment: “ÇöŔç Ŕ§ÄˇżˇĽ­ »çżëÇŇ Ľö ľř˝Ŕ´Ď´Ů.”
                    (This means “Not available at this location.” if I change the character encoding to Windows-949 but it is not important now.)
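
As a possible workaround outside Notepad++ (it does not explain or fix the editor's behaviour), a search term can be encoded to the files' real encoding and matched against their raw bytes. The Python sketch below assumes the sources are really Windows-949 (`cp949` in Python), as stated above; `find_in_files` and its parameters are hypothetical names, and a byte-level match in a multi-byte encoding can in rare cases straddle two characters.

```python
from pathlib import Path


def find_in_files(folder, term, encoding="cp949", pattern="*.cpp"):
    """Yield (path, line_number) for each line whose raw bytes contain term."""
    needle = term.encode(encoding)  # the term as it is stored on disk
    for path in sorted(Path(folder).rglob(pattern)):
        lines = path.read_bytes().splitlines()
        for number, line in enumerate(lines, start=1):
            if needle in line:
                yield path, number
```

For instance, `list(find_in_files(".", "현재"))` would list every .cpp line under the current folder that contains that word.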

                    A few Notepad++ versions ago I was able to search my source files for specially encoded characters, but nowadays I can’t.

                    So my question is, what can I do to fix the search?

                    I do not want to install an older version of Notepad++ just because of this, but I think I must. What happened? Why is the search not working correctly anymore? What can I do?

                    Thank you in advance!
