"Find in files" special characters not working anymore



  • Hello.

    I’ve been doing search and replace across a bunch of files for years, until one day it stopped working.

    I have TV show subtitles containing special characters that don’t show up on my TV, so I replace those with normal letters. There are a lot of them, so I need to do it in batch.

    The chars don’t show as they should in notepad++ either, though I’m on UTF-8, they are ș ț Ș Ț and I was replacing them like this:

    º -> s
    þ -> t
    ª -> S
    Þ -> T

    Find in current file works just fine on my characters. Find and replace in files works just fine on normal letters.

    For example, if I search for º in the open file, it finds it just fine. In Find in Files it does not: I get 0 results. But if I look for any normal letter, it works as it should.

    I’ve reinstalled hoping some setting blew up but no luck, probably something on my part :(

    Notepad++ v7.7.1 (64-bit)
    Build time : Jun 16 2019 - 21:24:47
    Path : C:\Program Files\Notepad++\notepad++.exe
    Admin mode : OFF
    Local Conf mode : OFF
    OS : Windows 10 (64-bit)
    Plugins : AutoSave.dll BetterMultiSelection.dll Explorer.dll mimeTools.dll NppConverter.dll NppToolBucket.dll PreviewHTML.dll



  • Hello, @Pro-Bg and All,

    You said, in your post :

    The chars don’t show as they should in notepad++ either, though I’m on UTF-8, they are ș ț Ș Ț and I was …

    So, seemingly, you refer to the 4 characters, below :

    From the Latin Extended-B Unicode block [ 0180 – 024F ] :
    
    |  0218  |  Letter Ș  |  LATIN CAPITAL LETTER S WITH COMMA BELOW  |
    
    |  0219  |  Letter ș  |  LATIN SMALL LETTER S WITH COMMA BELOW    |
    
    |  021A  |  Letter Ț  |  LATIN CAPITAL LETTER T WITH COMMA BELOW  |
    
    |  021B  |  Letter ț  |  LATIN SMALL LETTER T WITH COMMA BELOW    |
    

    See, to that purpose :

    http://www.unicode.org/charts/PDF/U0180.pdf

    However, after some searching, and judging from the characters you actually see in your file ( ª º Þ and þ ) because of a wrong encoding, I suppose that you are referring, instead, to these 4 characters :

    From the Latin Extended-A Unicode block [ 0100 - 017F ] :
    
    |  015E  |  Letter Ş  |  LATIN CAPITAL LETTER S WITH CEDILLA  |
    
    |  015F  |  Letter ş  |  LATIN SMALL LETTER S WITH CEDILLA    |
    
    |  0162  |  Letter Ţ  |  LATIN CAPITAL LETTER T WITH CEDILLA  |
    
    |  0163  |  Letter ţ  |  LATIN SMALL LETTER T WITH CEDILLA    |
    

    See, to that purpose :

    http://www.unicode.org/charts/PDF/U0100.pdf

    Note that if you intend to copy / paste some characters from these PDF files of the Unicode Consortium, I advise you to download them first. Depending on your browser, some characters, although well displayed, may not be pasted correctly :-((

    So, before going any further, which set of characters are you referring to ? Indeed, depending on the set of characters used, we would need a different font, which properly handles these characters and correctly displays their glyphs !

    See you later,

    Best Regards,

    guy038



  • Sorry, I was looking at another file and presumed they were all UTF-8, but noticed later that they’re ANSI. That’s how the subbers made them, I have no idea why.

    And yes, those are the Romanian letters, with comma below, not cedilla, but subbers in my country follow their own rules…

    I’ll upload two of the subtitles here https://gofile.io/?c=Hz4Uts because I don’t have enough privileges to upload in this topic.

    I can do search and replace in the current open file and it works just fine, it’s just the function that finds in files that doesn’t seem to work…



  • Hi, @Pro-Bg and All,

    Firstly, when you begin to ask about character representation and/or codes, the best thing is to ask yourself : Does my operating system contain a font which can properly handle these characters and correctly display their glyphs ?

    Now, unfortunately, these 4 Romanian characters Ș, ș, Ț and ț, with Unicode code points 0218, 0219, 021A and 021B, are handled by very few proportional fonts and, AFAIK, by the monospaced font Consolas only !

    So I advise you to use the Consolas font, which should be part of your system… On Windows 7, its version is 5.22; from Windows 8 onwards, its version is 5.32, which contains 2,735 glyphs

    From within notepad++ :

    • Select the Settings > Style Configurator option

    • Select Global styles in the Language drop-down list

    • Select Default style in the Style drop-down list

    • In the Font Style area, choose the Consolas font, from the drop-down list of fonts

    • Click on the Save & Close button

    Remark :

    In Notepad++, comparing the glyphs of these 4 Romanian characters ( with comma below ) with their equivalent chars ( with a cedilla ), with the Consolas font, I noticed, when maximum zoom is used, that :

    • Regarding the letters S and s, the cedilla seems closer to the bottom of the character than the comma : Ș ș Ş ş

    • Regarding the letters T and t, the glyphs look nearly identical : Ț ț Ţ ţ


    Secondly, I don’t see any reason which could explain that the search/replacement would work when using the Replace dialog and NOT with the Find in Files dialog !

    Two solutions :

    • Open the Replace dialog

      • SEARCH (\x{0218})|(\x{0219})|(\x{021A})|(\x{021B})

      • REPLACE (?1S)(?2s)(?3T)(?4t)

      • Untick the Match whole word only, if necessary

      • Tick the Match case box option ( Important )

      • Tick the Wrap around box option

      • Select the Regular expression search mode

      • Click on the Replace All button

    • Open the Find in Files dialog

      • SEARCH (\x{0218})|(\x{0219})|(\x{021A})|(\x{021B})

      • REPLACE (?1S)(?2s)(?3T)(?4t)

      • Type in the correct file type in the Filters: zone

      • Type in the correct absolute path name to your file, in the Directory: zone or click on the Follow current doc. box option

      • Choose, optionally, the In all sub-folders box option, if you need to browse a file tree

      • Untick the Match whole word only, if necessary

      • Tick the Match case box option ( Important )

      • Select the Regular expression search mode

      • Click on the Replace in Files button

      • Confirm the Are you sure? dialog

    Notes :

    • In search, each of these 4 characters \x{####} is stored in its own group, from 1 to 4, thanks to the surrounding parentheses ()

    • In replacement, thanks to the conditional replacement syntax (?#....), where # is the number of the matched group, the appropriate standard replacement letter, S, s, T or t, is simply written instead !
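    For readers who prefer a script, the same group-per-letter idea can be sketched in Python. This is only an illustration: Python's re module has no (?1S) conditional replacement syntax, so a small lookup keyed on the matched group number stands in for it, and the function name strip_comma_below is made up for this example.

```python
import re

# Same four capture groups as the SEARCH expression above.
PATTERN = re.compile("(\u0218)|(\u0219)|(\u021A)|(\u021B)")

# Replacement letter per group number, mirroring (?1S)(?2s)(?3T)(?4t).
REPLACEMENTS = {1: "S", 2: "s", 3: "T", 4: "t"}


def strip_comma_below(text: str) -> str:
    """Replace the four comma-below letters with plain S, s, T, t."""
    # m.lastindex is the number of the capture group that matched.
    return PATTERN.sub(lambda m: REPLACEMENTS[m.lastindex], text)


print(strip_comma_below("\u0218tefan \u0219i \u021Aara"))
```

    Like the regex above, this only matches text that really contains the code points 0218 to 021B.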

    Cheers,

    guy038



  • I’m going to assume the solution @guy038 posted will work, because his solutions usually do (or, at least, they are moving in the direction of working for whoever asked the question, because Guy doesn’t stop until they do work).

    However, before he posted, I had started down a non-regex road; I think it will be useful, so even after Guy’s post, I continued to write it up.

    @Pro-Bg said:

    …noticed later that they’re ANSI. That’s how the subbers made them, …
    And yes, those are the Romanian letters,

    When you said that, I took a look at the files. When you open them with Settings > Preferences > MISC. > ☑ Autodetect character encoding enabled, they detect as “ANSI”, and those characters show up as you originally posted. Since you said “Romanian”, I assumed maybe it was really a Central or Eastern European encoding used, rather than the default “ANSI” Western European encoding.

    So I went to Encoding > Character Sets > Eastern European: choosing ISO 8859-2 appeared to work. But while writing this up, I realized that Romanian can be considered Central European as well, so I tried choosing … > Central European > OEM 852, which made those characters box-drawing, so that was obviously wrong. … > Central European > Windows 1250 appeared to convert those to the right characters as well.

    I don’t know all the differences between ISO 8859-2 and Windows 1250 – ah, per Wikipedia, “Windows-1250 is similar to ISO-8859-2 and has all the printable characters it has and more. However a few of them are rearranged”. You would have to know more about the files to determine which of those encodings they really are; though my guess, if they’re for subtitles, then they were done with the ISO 8859-2, not the Microsoft-centric Windows-1250.
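    To see why the encoding choice matters, here is a small Python sketch that reads the same four bytes under each encoding tried above. It assumes that the “ANSI” shown by Notepad++ means Windows-1252 on the poster's machine, and the codec names are Python's, not Notepad++'s.

```python
# The four bytes that show up as º þ ª Þ when the files are read as ANSI.
RAW = bytes([0xAA, 0xBA, 0xDE, 0xFE])

# Python codec names for the encodings discussed above.
# Assumption: "ANSI" here is the Western European code page Windows-1252.
CODECS = {
    "ANSI (Windows-1252)": "cp1252",
    "ISO 8859-2": "iso8859_2",
    "Windows-1250": "cp1250",
    "OEM 852": "cp852",
}


def decode_all(raw: bytes) -> dict:
    """Decode the same bytes once per candidate encoding."""
    return {label: raw.decode(codec) for label, codec in CODECS.items()}


for label, text in decode_all(RAW).items():
    print(f"{label:20} {' '.join(text)}")
```

    For these four bytes, ISO 8859-2 and Windows-1250 give the same letters, which is why both menu choices appeared to work.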

    So really, in Notepad++, instead of doing a search-replace, all you need to do is to change the Encoding > Character Set to the appropriate one (probably ISO 8859-2, but maybe Windows-1250). After doing that, so it’s displayed properly, you should be able to read and edit the file to your heart’s content. If you’re going to be editing the file multiple times, I would suggest Encoding > Convert to UTF-8-BOM, so it will change the encoded single-byte Romanian characters to their UTF-8 multi-byte encoding, with the BOM character inserted at the beginning of the file. Once you save after the conversion, then the next time you open the file with Notepad++, it will properly interpret it as UTF-8, and all the characters will be interpreted and displayed correctly.

    As far as subtitles go: I’m guessing what prompted this is that the subtitles were showing up wrong in your video player of choice. My guess is that it was because your player didn’t know / couldn’t guess the right encoding for the file, so used ANSI like Notepad++ did. I don’t know whether your player handles UTF-8 better than a random encoding… but if it does, then maybe saving the file after converting to UTF-8-BOM will make it work right in your player. You might be able to google for your player’s name and “encoding” or “utf-8” or “unicode”, to find out which encoding it assumes or prefers.

    However, if you have a lot of files, Notepad++ might not be the most efficient for batch-converting the encoding.
    The superuser answer that I referenced in my post in another thread links to a version of iconv for Windows, which should be able to automate the conversion from ISO 8859-2 to UTF-8.

    iconv -f ISO-8859-2 -t utf-8 sourcefile.srt > outfile.srt
    

    To get that to do all files in a given directory, open a cmd.exe prompt in that directory, and run

    FOR %f in (*.srt) do @( iconv -f ISO-8859-2 -t utf-8 "%f" > "%~nf.utf8%~xf" )
    

    When I ran that on the marco polo files you showed us for download, it did properly convert them to utf-8.
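    If iconv is not at hand, roughly the same batch conversion can be sketched in Python. This is a sketch under assumptions, not a drop-in replacement: the function name convert_folder and the with_bom switch are made up here, and the source encoding defaults to ISO 8859-2 as guessed above.

```python
from pathlib import Path


def convert_folder(folder, source_encoding="iso8859_2", with_bom=False):
    """Write a UTF-8 copy of every .srt file in folder; return the new paths."""
    # "utf-8-sig" is Python's name for UTF-8 with a BOM at the start.
    target_encoding = "utf-8-sig" if with_bom else "utf-8"
    written = []
    # sorted() snapshots the listing, so new copies are not picked up mid-run.
    for src in sorted(Path(folder).glob("*.srt")):
        if src.name.endswith(".utf8.srt"):
            continue  # output of an earlier run, leave it alone
        text = src.read_bytes().decode(source_encoding)
        dst = src.with_name(src.stem + ".utf8" + src.suffix)
        # Work on bytes so the original line endings are kept as they are.
        dst.write_bytes(text.encode(target_encoding))
        written.append(dst)
    return written
```

    Calling convert_folder(".") in the subtitle directory mirrors the FOR loop above and produces name.utf8.srt next to each name.srt; passing with_bom=True gives the UTF-8-BOM variant instead.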



  • Hello, @Pro-Bg, @peterjones and All,

    Thanks to @peterjones, I understood that I simply forgot to act in the right order :-(( So, @Pro-Bg, just forget the second part of my previous post, where I described the regex S/R, which is wrong :-((

    So, I, first, downloaded your archive and extracted the Marco Polo S01E01 The Wayfarer 720p BluRay DTS x264-EbP.srt file

    When opening your file, in Notepad++, I get an ANSI encoded file. BTW, I also tried to untick the Settings > Preferences > MISC > Autodetect character encoding option. Luckily, after re-opening Notepad++ and loading your file, its encoding had not changed and was still ANSI !

    I renamed your file with a shorter name and chose the .txt extension. So, from now on, your initial file will be named Test.txt !

    I’m about to show you 3 different methods to solve @Pro-Bg’s problem. Note that the first one is just Peter’s solution !


    FIRST method :

    • I used the iconv utility, as suggested by Peter, running the command, below, in a DOS console window :
    iconv -f ISO-8859-2 -t UTF-8 Test.txt > Test_ICONV.txt
    

    Indeed, the result is fine and the 4 characters ª , º, Þ and þ were correctly translated into the 4 chars Ş , ş, Ţ and ţ :-))

    Remark :

    If we assume that your file was, initially, a Windows-1250 encoded file and that we run the command, below :

    iconv -f WINDOWS-1250 -t UTF-8 Test.txt > Test_2.txt
    

    One can easily verify that the two output files are identical. So, regarding this file, these two encodings are equivalent. Nice !
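    The same check can be made without producing two output files. The Python sketch below lists which byte values of a file would be read differently by the two encodings; the function name differing_bytes is made up for this example, and the few byte values that Windows-1250 leaves undefined are shown as a replacement character rather than raising an error.

```python
def differing_bytes(raw: bytes) -> list:
    """Return the byte values in raw that ISO 8859-2 and Windows-1250 read differently."""
    found = set()
    for value in set(raw):
        one = bytes([value])
        as_iso = one.decode("iso8859_2")
        # errors="replace" covers the handful of bytes Windows-1250 does not define.
        as_win = one.decode("cp1250", errors="replace")
        if as_iso != as_win:
            found.add(value)
    return sorted(found)


# The nine non-ASCII byte values listed for this subtitle file.
sample = bytes([0xAA, 0xBA, 0xC3, 0xCE, 0xDE, 0xE2, 0xE3, 0xEE, 0xFE])
print(differing_bytes(sample))
```

    An empty list means the two encodings are interchangeable for that file; a byte such as A9 ( Š in ISO 8859-2, © in Windows-1250 ) would show up in the list.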

    Note :

    Be aware, however, that the 4 characters Ş , ş, Ţ and ţ, in the output file, are letters with a cedilla and not the Romanian letters with a comma below : Ș , ș, Ț and ț !


    SECOND method :

    • Open a new file ( Ctrl + N )

    • If your default encoding, for new files, is not ANSI, select the first option Encoding > ANSI for this empty file. Note that, as your file is empty, you could just as well run the option Encoding > Convert to ANSI

    => The ANSI encoding should be displayed in the status bar

    • Now, copy / paste the contents of the Test.txt file, in this new file

    • Then, run one of the two options :

      • Encoding > Character Sets > Central European > Windows-1250

      • Encoding > Character Sets > Eastern European > ISO 8859-2

    • A small window, with title Lose Undo Ability Waning pops up : You should save the current modification. All the saved modifications can not be undone. Continue ?

    • Choose the default choice, clicking on the Yes button

    • The Save as dialog then appears. So, save this new file as, let’s say, Test_NPP.txt

    => Note that the Windows-1250 ( or ISO 8859-2 ) encoding is shown in the status bar

    • Then select the Encoding > Convert to UTF-8 option ( Do not choose the UTF-8 only option ! )

    => This time, the UTF-8 encoding is displayed in the status bar

    • Save the modifications ( Ctrl + S )

    The nice thing is that the Test_NPP.txt file, built from within N++ and the Test_ICONV.txt file, output of the iconv Dos command, are strictly identical !


    THIRD method ( a bit longer ! ) :

    • Open a new file ( Ctrl + N )

    • If your default encoding, for new files, is not ANSI, select the first option Encoding > ANSI for this empty file

    => The ANSI encoding should be displayed in the status bar

    • Now, copy / paste the contents of Test.txt, in this new file

    • First, we’ll try to get rid of standard characters, in order to identify which characters would have a different byte sequence, when migrated to UTF-8. This concerns, principally, characters with code-point above \x7F. So :

    • Suppression of any ASCII character, with code in the [ 0 - 127 ] range :

      • SEARCH [\x00-\x7f]+

      • REPLACE Leave EMPTY

    • Leave only one character per line :

      • SEARCH .

      • REPLACE $0\r\n

    • Run the Edit > Line Operations > Sort Lines Lexicographically Ascending option

    • Run the Edit > Line Operations > Remove Consecutive Duplicate Lines option

    => You’re left with a tiny list of 9 characters ª º Ã Î Þ â ã î þ

    • Run the Encoding > Character Sets > Central European > Windows-1250 option

    • A small window, with title Lose Undo Ability Waning pops up : You should save the current modification. All the saved modifications can not be undone. Continue ?

    • Choose the default choice, clicking on the Yes button

    • The Save as dialog appears. So, save this new file, anywhere, with a dummy name

    => The Windows-1250 encoding is shown in the status bar

    • The tiny list has been changed into the 9 following characters Ş ş Ă Î Ţ â ă î ţ, rewritten, below, with their codes :
    Characters           Ş      ş      Ă      Î      Ţ      â      ă      î      ţ
    In Windows-1250    00AA   00BA   00C3   00CE   00DE   00E2   00E3   00EE   00FE
    ( Unicode value    015E   015F   0102   00CE   0162   00E2   0103   00EE   0163 )
    

    Refer, to that purpose, to the link :

    https://en.wikipedia.org/wiki/Windows-1250

    After examining the Unicode values, we can eliminate the 3 characters Î, â and î, whose Windows-1250 code and Unicode value are identical ( Note that they are the characters with a Unicode value below \x{0100} )
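    The inventory built by hand above can also be sketched in a few lines of Python. The assumptions are the same as in the steps: the “ANSI” view is taken to be Windows-1252, the real encoding is taken to be Windows-1250, and the function name non_ascii_inventory is made up for this example.

```python
def non_ascii_inventory(raw: bytes) -> list:
    """List each distinct byte above 0x7F with its ANSI and Windows-1250 reading."""
    rows = []
    for value in sorted(set(raw)):
        if value < 0x80:
            continue  # plain ASCII reads the same in all encodings discussed here
        one = bytes([value])
        ansi = one.decode("cp1252", errors="replace")  # what the ANSI view shows
        real = one.decode("cp1250", errors="replace")  # the intended character
        rows.append((value, ansi, real, ord(real)))
    return rows


# In practice raw would be the file's bytes; here, the nine byte values from the table.
sample = bytes([0x41, 0xAA, 0xBA, 0xC3, 0xCE, 0xDE, 0xE2, 0xE3, 0xEE, 0xFE])
for value, ansi, real, code_point in non_ascii_inventory(sample):
    print(f"{value:02X}  {ansi}  {real}  U+{code_point:04X}")
```

    Rows where the byte value and the Unicode value are equal are the characters that need no replacement.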

    • Open a new file ( Ctrl + N )

    • If your default encoding, for new files, is not ANSI, select the first option Encoding > ANSI for this empty file

    => The ANSI encoding should be displayed in the status bar

    • Now, copy / paste the contents of Test.txt, in this new file

    • Run the Encoding > Convert to UTF-8 option ( Do not choose the UTF-8 only option ! )

    => The UTF-8 encoding is, now, displayed in the status bar

    • Perform the following regex S/R :

      • SEARCH (\x{00AA})|(\x{00BA})|(\x{00C3})|(\x{00DE})|(\x{00E3})|(\x{00FE})

      • REPLACE (?1\x{015E})(?2\x{015F})(?3\x{0102})(?4\x{0162})(?5\x{0103})(?6\x{0163})

    => 733 replacements done

    • Save this new file and name it, let’s say, Test_REGEX.txt

    Again, the nice thing is that the Test_REGEX.txt file, built from within N++, with a regex S/R, and the Test_ICONV.txt file, output of the iconv Dos command, are strictly identical, too !

    Best Regards,

    guy038

    P.S. :

    Now, @Pro-Bg, if you really want to see the Romanian Ș , ș, Ț and ț letters, with comma below :

    • In N++, open any one of the Test_ICONV.txt, Test_NPP.txt or Test_REGEX.txt output files ( identical UTF-8 encoded files ! )

    • Perform this last regex S/R :

      • SEARCH (\x{015E})|(\x{015F})|(\x{0162})|(\x{0163}) ( Characters S and T with cedilla )

      • REPLACE (?1\x{0218})(?2\x{0219})(?3\x{021A})(?4\x{021B}) ( Romanian Characters S and T with comma below )

    • Re-save your file
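    For a scripted version of this last step, a plain character table is enough, since each letter maps to exactly one other letter. A minimal Python sketch, assuming the text has already been decoded to Unicode ( the name to_comma_below is made up for this example ) :

```python
# Cedilla letters on the left, Romanian comma-below letters on the right.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # S with cedilla  ->  S with comma below
    "\u015F": "\u0219",  # s with cedilla  ->  s with comma below
    "\u0162": "\u021A",  # T with cedilla  ->  T with comma below
    "\u0163": "\u021B",  # t with cedilla  ->  t with comma below
})


def to_comma_below(text: str) -> str:
    """Swap the four cedilla letters for their comma-below counterparts."""
    return text.translate(CEDILLA_TO_COMMA)


print(to_comma_below("\u015fi \u0162ara"))
```

    Whether the result displays correctly still depends on the font, or on the TV, supporting the comma-below letters.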

    P.P.S. :

    You people whose mother tongue is English are really lucky ! You have to worry very little about all these encoding problems ;-))



  • Thank you for your time and answers, gentlemen, wasn’t expecting such support on this forum.

