"Find in files" special characters not working anymore
- 
 Hello. For years I’ve been running Search and Replace across batches of files, until one day it stopped working. I have TV show subtitles containing special characters that don’t show up on my TV, so I replace them with plain letters; there are a lot of them, so I need to do it in batch. The characters don’t display as they should in Notepad++ either, even though I’m on UTF-8. They are ș ț Ș Ț, and I was replacing them like this:
 º -> s
 þ -> t
 ª -> S
 Þ -> T
 Find in the current file works just fine on my characters. Find and replace in files works just fine on normal letters. For example, if I search for º in the open file, it is found. In Find in Files, though, I get 0 results, while any normal letter works as it should. I’ve reinstalled, hoping some setting had blown up, but no luck. Probably something on my part :(
 Notepad++ v7.7.1 (64-bit)
 Build time : Jun 16 2019 - 21:24:47
 Path : C:\Program Files\Notepad++\notepad++.exe
 Admin mode : OFF
 Local Conf mode : OFF
 OS : Windows 10 (64-bit)
 Plugins : AutoSave.dll BetterMultiSelection.dll Explorer.dll mimeTools.dll NppConverter.dll NppToolBucket.dll PreviewHTML.dll
- 
 Hello, @Pro-Bg and All,
 You said, in your post :
 The chars don’t show as they should in notepad++ either, though I’m on UTF-8, they are ș ț Ș Ț and I was …
 So, seemingly, you refer to the 4 characters below, from the Latin Extended-B Unicode block [ 0180 - 024F ] :
 | 0218 | Letter Ș | LATIN CAPITAL LETTER S WITH COMMA BELOW |
 | 0219 | Letter ș | LATIN SMALL LETTER S WITH COMMA BELOW |
 | 021A | Letter Ț | LATIN CAPITAL LETTER T WITH COMMA BELOW |
 | 021B | Letter ț | LATIN SMALL LETTER T WITH COMMA BELOW |
 See, to that purpose : http://www.unicode.org/charts/PDF/U0180.pdf
 However, after some searching, and judging from the characters you actually see in your file ( ª, º, Þ and þ ) because of an erroneous encoding, I suppose that you refer, instead, to these 4 characters, from the Latin Extended-A Unicode block [ 0100 - 017F ] :
 | 015E | Letter Ş | LATIN CAPITAL LETTER S WITH CEDILLA |
 | 015F | Letter ş | LATIN SMALL LETTER S WITH CEDILLA |
 | 0162 | Letter Ţ | LATIN CAPITAL LETTER T WITH CEDILLA |
 | 0163 | Letter ţ | LATIN SMALL LETTER T WITH CEDILLA |
 See, to that purpose : http://www.unicode.org/charts/PDF/U0100.pdf
 Note that if you intend to copy / paste some characters from these PDF files of the Unicode Consortium, I advise you to download them first, because, depending on your browser, some characters, although displayed correctly, may not be pasted correctly :-((
 So, before going any further, which kind of characters are you referring to ? Indeed, depending on the set of characters used, we may need a different font, one which properly handles these characters and correctly displays their glyphs !
 See you later,
 Best Regards,
 guy038
- 
 Sorry, I was looking at another file and presumed all of them were UTF-8, but noticed later that they’re ANSI. That’s how the subbers made them; I have no idea why. And yes, those are the Romanian letters, with comma below, not cedilla, but subbers in my country follow their own rules… I’ll upload two of the subtitles here https://gofile.io/?c=Hz4Uts because I don’t have enough privileges to upload in this topic. I can do search and replace in the currently open file and it works just fine; it’s only Find in Files that doesn’t seem to work…
- 
 Hi, @Pro-Bg and All,
 Firstly, when you begin to ask about character representation and/or codes, the best thing is to ask yourself : Does my operating system contain a font which can properly handle these characters and correctly display their glyphs ?
 Now, unfortunately, these 4 Romanian characters Ș, ș, Ț and ț, of Unicode code-points 0218, 0219, 021A and 021B, are handled by very few proportional fonts and, AFAIK, by the monospaced font Consolas only !
 So I advise you to use the Consolas font, which should be part of your system… On Windows 7, its version is 5.22 and, from Windows 8 onward, its version is 5.32 and contains 2,735 glyphs.
 From within Notepad++ :
- 
Select the Settings > Style Configurator option
- 
Select Global styles in the Language drop-down list
- 
Select Default style in the Style drop-down list
- 
In the Font Style area, choose the Consolas font from the drop-down list of fonts
- 
Click on the Save & Close button
 Remark : In Notepad++, comparing the glyphs of these 4 Romanian characters ( with comma below ) with their equivalent characters ( with a cedilla ), in the Consolas font, I noticed, at maximum zoom, that :
- 
Regarding the letters S and s, the cedilla seems closer to the bottom of the character than the comma : Ș ș Ş ş
- 
Regarding the letters T and t, the characters look practically identical : Ț ț Ţ ţ
 
 Secondly, I don’t see any reason why the search/replacement would work in the Replace dialog and NOT in the Find in Files dialog !
 Two solutions :
- 
Open the Replace dialog
- 
SEARCH (\x{0218})|(\x{0219})|(\x{021A})|(\x{021B})
- 
REPLACE (?1S)(?2s)(?3T)(?4t)
- 
Untick the Match whole word only option, if necessary
- 
Tick the Match case option ( Important )
- 
Tick the Wrap around option
- 
Select the Regular expression search mode
- 
Click on the Replace All button
 
- 
- 
Open the Find in Files dialog
- 
SEARCH (\x{0218})|(\x{0219})|(\x{021A})|(\x{021B})
- 
REPLACE (?1S)(?2s)(?3T)(?4t)
- 
Type the correct file type in the Filters: zone
- 
Type the correct absolute path to your files in the Directory: zone, or click on the Follow current doc. option
- 
Optionally, tick the In all sub-folders option, if you need to browse a file tree
- 
Untick the Match whole word only option, if necessary
- 
Tick the Match case option ( Important )
- 
Select the Regular expression search mode
- 
Click on the Replace in Files button
- 
Confirm the Are you sure? dialog
 
- 
 Notes :
- 
In the search, each of these 4 characters \x{####} is stored in a group, numbered from 1 to 4, thanks to the enclosing parentheses ()
- 
In the replacement, thanks to the conditional replacement syntax (?#....), where # is the number of the matched group, the appropriate standard letter S, s, T or t is simply written instead !
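
For readers who think more easily in code than in regex dialogs, the group-based replacement described in the notes above can be sketched in Python. This is only an illustration of the idea, not something Notepad++ runs; the names `PATTERN`, `LETTERS` and `strip_comma_below` are made up for the example.

```python
import re

# Same four alternatives as the SEARCH field: one capture group per character.
PATTERN = re.compile("(\u0218)|(\u0219)|(\u021A)|(\u021B)")

# Plain letters in group order, mirroring (?1S)(?2s)(?3T)(?4t).
LETTERS = ["S", "s", "T", "t"]


def strip_comma_below(text: str) -> str:
    """Replace the Romanian comma-below letters with plain ASCII letters."""
    # m.lastindex is the number of the single group that matched (1 to 4).
    return PATTERN.sub(lambda m: LETTERS[m.lastindex - 1], text)
```

Calling `strip_comma_below` on a string that is already correctly decoded replaces each of the four letters and leaves everything else untouched. It does nothing useful on text whose bytes were decoded with the wrong encoding, because the comma-below code points never appear in it.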
 Cheers, guy038 
- 
- 
 I’m going to assume the solution @guy038 posted will work, because his solutions usually do (or, at least, they move in the direction of working for whoever asked the question, because Guy doesn’t stop until they do work). However, before he posted, I had started down a non-regex road; I think it will be useful, so even after Guy’s post, I continued to write it up.
 @Pro-Bg said:
 …noticed later that they’re ANSI. That how the subbers made them, …
 And yes, those are the Romanian letters,
 When you said that, I took a look at the files. When you open them with Settings > Preferences > MISC > ☑ Autodetect character encoding enabled, they are detected as “ANSI”, and those characters show up as you originally posted. Since you said “Romanian”, I assumed that a Central or Eastern European encoding had really been used, rather than the default “ANSI” Western European encoding. So I went to Encoding > Character Sets > Eastern European: choosing ISO 8859-2 appeared to work. But while writing this up, I realized that Romanian can be considered Central European as well, so I tried … > Central European > OEM 852, which turned those characters into box-drawing characters, so that was obviously wrong. … > Central European > Windows 1250 appeared to convert them to the right characters as well. I don’t know all the differences between ISO 8859-2 and Windows 1250. Ah, per Wikipedia, “Windows-1250 is similar to ISO-8859-2 and has all the printable characters it has and more. However a few of them are rearranged”. You would have to know more about the files to determine which of those encodings they really use; my guess is that, since they’re subtitles, they were made with ISO 8859-2, not the Microsoft-centric Windows-1250.
 So really, in Notepad++, instead of doing a search-replace, all you need to do is change Encoding > Character Sets to the appropriate one (probably ISO 8859-2, but maybe Windows-1250). Once the file is displayed properly, you should be able to read and edit it to your heart’s content. If you’re going to edit the file multiple times, I would suggest Encoding > Convert to UTF-8-BOM, which changes the single-byte encoded Romanian characters to their multi-byte UTF-8 encoding, with the BOM character inserted at the beginning of the file.
Once you save after the conversion, then the next time you open the file with Notepad++, it will properly interpret it as UTF-8, and all the characters will be interpreted and displayed correctly. As far as subtitles go: I’m guessing what prompted this is that the subtitles were showing up wrong in your video player of choice. My guess is that it was because your player didn’t know / couldn’t guess the right encoding for the file, so used ANSI like Notepad++ did. I don’t know whether your player handles UTF-8 better than a random encoding… but if it does, then maybe saving the file after converting to UTF-8-BOM will make it work right in your player. You might be able to google for your player’s name and “encoding” or “utf-8” or “unicode”, to find out which encoding it assumes or prefers. However, if you have a lot of files, Notepad++ might not be the most efficient for batch-converting the encoding. 
 The superuser answer that I referenced in my post in another thread links to a version of iconv for Windows, which should be able to automate the conversion from ISO 8859-2 to UTF-8:
 iconv -f ISO-8859-2 -t utf-8 sourcefile.srt > outfile.srt
 To get that to do all files in a given directory, open a cmd.exe prompt in that directory, and run
 FOR %f in (*.srt) do @( iconv -f ISO-8859-2 -t utf-8 "%f" > "%~nf.utf8%~xf" )
 When I ran that on the Marco Polo files you made available for download, it properly converted them to UTF-8.
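
If iconv is not available, the same batch conversion can be sketched with a few lines of Python. This is a minimal sketch under assumptions: the function names `convert_to_utf8` and `convert_directory` are hypothetical, the source encoding is assumed to be ISO 8859-2 as guessed above, and the output naming simply mimics the `.utf8` suffix used in the FOR loop.

```python
from pathlib import Path


def convert_to_utf8(src: Path, source_encoding: str = "iso8859_2") -> Path:
    """Write a UTF-8 copy of src next to it, e.g. a.srt -> a.utf8.srt."""
    dst = src.with_name(f"{src.stem}.utf8{src.suffix}")
    # Work on raw bytes so that line endings are left exactly as they were.
    text = src.read_bytes().decode(source_encoding)
    dst.write_bytes(text.encode("utf-8"))
    return dst


def convert_directory(directory: Path, pattern: str = "*.srt") -> list:
    """Convert every matching file, skipping copies made by an earlier run."""
    sources = sorted(directory.glob(pattern))
    return [convert_to_utf8(p) for p in sources if not p.stem.endswith(".utf8")]
```

For example, `convert_directory(Path("."))` would convert every `.srt` file in the current directory. Passing `source_encoding="cp1250"` to `convert_to_utf8` is the Windows-1250 variant; a file that is really in some other encoding would be converted without any error but with the wrong letters, so it is worth checking one output file by eye first.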
- 
 Hello, @Pro-Bg, @peterjones and All,
 Thanks to @peterjones, I understood that I simply forgot to act in the right order :-(( So, @Pro-Bg, just forget the second part of my previous post, where I described the regex S/R, which is wrong :-((
 So, first, I downloaded your archive and extracted the Marco Polo S01E01 The Wayfarer 720p BluRay DTS x264-EbP.srt file. When opening your file in Notepad++, I get an ANSI encoded file.
 BTW, I also tried to untick the Settings > Preferences > MISC > Autodetect character encoding option. Luckily, after re-opening Notepad++ and loading your file, its encoding had not changed and was still ANSI !
 I renamed your file with a shorter name and chose the .txt extension. So, from now on, your initial file will be named Test.txt !
 I’m about to show you 3 different methods to solve @Pro-Bg’s problem. Note that the first one is just Peter’s solution !
 FIRST method :
- 
I used the iconv utility, as suggested by Peter, running the command below in a DOS console window :
 iconv -f ISO-8859-2 -t UTF-8 Test.txt > Test_ICONV.txt
 Indeed, the result is fine and the 4 characters ª, º, Þ and þ were correctly translated into the 4 characters Ş, ş, Ţ and ţ :-))
 Remark : If we assume that your file was, initially, a Windows-1250 encoded file and we run the command below :
 iconv -f WINDOWS-1250 -t UTF-8 Test.txt > Test_2.txt
 one can easily verify that the two output files are identical. So, regarding this file, these two encodings are equivalent. Nice !
 Note : Be aware, however, that the 4 characters Ş, ş, Ţ and ţ, in the output file, are letters with a cedilla and not the Romanian letters with a comma below : Ș, ș, Ț and ț !
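
The remark that both encodings give the same result for this file can be checked on the bytes themselves. Below is a small Python sketch; `SAMPLE` holds the nine non-ASCII byte values reported for the file later in this post, and `same_under` is a hypothetical helper name. It assumes that the "ANSI" view corresponds to Windows-1252, which is the usual case on a Western European Windows setup but is not stated in the thread.

```python
# The nine non-ASCII byte values found in the sample subtitle file.
SAMPLE = bytes([0xAA, 0xBA, 0xC3, 0xCE, 0xDE, 0xE2, 0xE3, 0xEE, 0xFE])


def same_under(raw: bytes, encoding_a: str, encoding_b: str) -> bool:
    """True when both encodings decode raw to exactly the same text."""
    return raw.decode(encoding_a, errors="replace") == raw.decode(
        encoding_b, errors="replace"
    )


if __name__ == "__main__":
    # Print how each candidate encoding renders the same nine bytes.
    for name in ("cp1252", "iso8859_2", "cp1250"):
        print(name, SAMPLE.decode(name))
    print(same_under(SAMPLE, "iso8859_2", "cp1250"))
```

The two Central European encodings agree on these nine bytes, which is why the two iconv outputs match for this file. They do not agree on every byte value, so a file using other accented letters could still come out differently depending on the choice.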
 SECOND method :
- 
Open a new file ( Ctrl + N )
- 
If your default encoding, for new files, is not ANSI, select the first option Encoding > ANSI for this empty file. Note that, as the file is empty, you could equally run the option Encoding > Convert to ANSI
 => The ANSI encoding should be displayed in the status bar
- 
Now, copy / paste the contents of the Test.txt file in this new file
- 
Then, run one of the two options :
- 
Encoding > Character Sets > Central European > Windows-1250
- 
Encoding > Character Sets > Eastern European > ISO 8859-2
 
- 
- 
A small window, with title Lose Undo Ability Waning pops up : You should save the current modification. All the saved modifications can not be undone. Continue ?
- 
Choose the default choice, clicking on the Yes button
- 
The Save as dialog then opens. So, save this new file as, let’s say, Test_NPP.txt
 => Note that the Windows-1250 ( or ISO 8859-2 ) encoding is shown in the status bar
- 
Then select the Encoding > Convert to UTF-8 option ( Do not choose the option named just UTF-8 ! )
 => This time, the UTF-8 encoding is displayed in the status bar
- 
Save the modifications ( Ctrl + S )
 The nice thing is that the Test_NPP.txt file, built from within N++, and the Test_ICONV.txt file, output of the iconv DOS command, are strictly identical !
 THIRD method ( a bit longer ! ) :
- 
Open a new file ( Ctrl + N )
- 
If your default encoding, for new files, is not ANSI, select the first option Encoding > ANSI for this empty file
 => The ANSI encoding should be displayed in the status bar
- 
Now, copy / paste the contents of Test.txt in this new file
- 
First, we’ll try to get rid of standard characters, in order to identify which characters would have a different byte sequence when migrated to UTF-8. This concerns, principally, characters with a code-point above \x7F. So :
- 
Suppress any ASCII character, with code in the [ 0 - 127 ] range :
- 
SEARCH [\x00-\x7f]+
- 
REPLACE Leave EMPTY
 
- 
- 
Leave only one character per line :
- 
SEARCH .
- 
REPLACE $0\r\n
 
- 
- 
Run the Edit > Line Operations > Sort Lines Lexicographically Ascending option
- 
Run the Edit > Line Operations > Remove Consecutive Duplicate Lines option
 => You’re left with a tiny list of 9 characters : ª º Ã Î Þ â ã î þ
- 
Run the Encoding > Character Sets > Central European > Windows-1250 option
- 
A small window, with title Lose Undo Ability Waning pops up : You should save the current modification. All the saved modifications can not be undone. Continue ?
- 
Choose the default choice, clicking on the Yes button
- 
The Save as dialog opens. So, save this new file, anywhere, with a dummy name
 => The Windows-1250 encoding is shown in the status bar
- 
The tiny list has been changed into the 9 following characters Ş ş Ă Î Ţ â ă î ţ, rewritten, below, with their codes :
 | Characters | Ş | ş | Ă | Î | Ţ | â | ă | î | ţ |
 | In Windows-1250 | 00AA | 00BA | 00C3 | 00CE | 00DE | 00E2 | 00E3 | 00EE | 00FE |
 | Unicode value | 015E | 015F | 0102 | 00CE | 0162 | 00E2 | 0103 | 00EE | 0163 |
 Refer, to that purpose, to the link : https://en.wikipedia.org/wiki/Windows-1250
 After examining the different Unicode values, we can eliminate the 3 characters Î, â and î, which are identical in the two encodings ( Note that they correspond to the characters with a Unicode value under \x0100 )
- 
Open a new file ( Ctrl + N )
- 
If your default encoding, for new files, is not ANSI, select the first option Encoding > ANSI for this empty file
 => The ANSI encoding should be displayed in the status bar
- 
Now, copy / paste the contents of Test.txt in this new file
- 
Run the Encoding > Convert to UTF-8 option ( Do not choose the option named just UTF-8 ! )
 => The UTF-8 encoding is, now, displayed in the status bar
- 
Perform the following regex S/R :
- 
SEARCH (\x{00AA})|(\x{00BA})|(\x{00C3})|(\x{00DE})|(\x{00E3})|(\x{00FE})
- 
REPLACE (?1\x{015E})(?2\x{015F})(?3\x{0102})(?4\x{0162})(?5\x{0103})(?6\x{0163})
 
- 
 => 733 replacements done
- 
Save this new file and name it, let’s say, Test_REGEX.txt
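
The character inventory built by hand in the third method (strip ASCII, keep one character per line, sort, remove duplicates, compare encodings) can also be sketched in Python. The function names below are hypothetical, and the default `shown_as="cp1252"` assumes that the ANSI view is Windows-1252, which the thread does not state explicitly.

```python
def non_ascii_inventory(raw: bytes, shown_as: str = "cp1252", actual: str = "cp1250"):
    """List each distinct byte above 0x7F with its two interpretations."""
    rows = []
    for value in sorted({b for b in raw if b > 0x7F}):
        byte = bytes([value])
        rows.append(
            (
                value,
                byte.decode(shown_as, errors="replace"),  # what the ANSI view shows
                byte.decode(actual, errors="replace"),    # what the author meant
            )
        )
    return rows


def needs_replacement(rows):
    """Keep only the pairs whose displayed and intended characters differ."""
    return [(shown, real) for _, shown, real in rows if shown != real]
```

Feeding the raw bytes of a file to `non_ascii_inventory` and then to `needs_replacement` gives the list of (displayed, intended) pairs; for the nine byte values listed above, that is exactly the six pairs used in the regex S/R, since Î, â and î come out the same either way.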
 Again, the nice thing is that the Test_REGEX.txt file, built from within N++ with a regex S/R, and the Test_ICONV.txt file, output of the iconv DOS command, are strictly identical, too !
 Best Regards,
 guy038
 P.S. : Now, @Pro-Bg, if you really want to see the Romanian Ș, ș, Ț and ț letters, with comma below :
- 
In N++, open any one of the Test_ICONV.txt, Test_NPP.txt or Test_REGEX.txt output files ( identical UTF-8 encoded files ! )
- 
Perform this last regex S/R :
- 
SEARCH (\x{015E})|(\x{015F})|(\x{0162})|(\x{0163}) ( Characters S and T with cedilla )
- 
REPLACE (?1\x{0218})(?2\x{0219})(?3\x{021A})(?4\x{021B}) ( Romanian characters S and T with comma below )
 
- 
- 
Re-save your file
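
The same cedilla-to-comma-below swap can be sketched in Python for text that has already been decoded correctly. The names `CEDILLA_TO_COMMA` and `to_comma_below` are made up for this example.

```python
# One-to-one mapping from the cedilla letters to the comma-below letters.
CEDILLA_TO_COMMA = str.maketrans(
    {
        "\u015E": "\u0218",  # S with cedilla -> S with comma below
        "\u015F": "\u0219",  # s with cedilla -> s with comma below
        "\u0162": "\u021A",  # T with cedilla -> T with comma below
        "\u0163": "\u021B",  # t with cedilla -> t with comma below
    }
)


def to_comma_below(text: str) -> str:
    """Return text with the four cedilla letters swapped for comma-below ones."""
    return text.translate(CEDILLA_TO_COMMA)
```

Because the mapping is one character to one character, `str.translate` is enough here and no regex is needed. Whether the comma-below letters then display properly still depends on the font, as discussed earlier in the thread.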
 P.P.S. : Those of you whose mother tongue is English are really lucky ! You have to worry very little about all these encoding problems ;-))
- 
 Thank you for your time and answers, gentlemen, wasn’t expecting such support on this forum. 
- 
 Hello! 
 I do not want to create a new topic because my problem is pretty much the same. But I would like to get a short answer; I’m not interested in the character coding stuff.
 So, I have a lot of .cpp files, all of them ANSI, and, as luck would have it, every comment is written in Korean. For example, there is a comment: “ÇöŔç Ŕ§ÄˇżˇĽ »çżëÇŇ Ľö ľř˝Ŕ´Ď´Ů.”
 (This means “Not available at this location.” if I change the character encoding to Windows-949, but that is not important now.)
 A few Notepad++ releases ago I was able to search my source files for these specially encoded characters, but now I can’t. So my question is, what can I do to fix the search? I do not want to install an older version of Notepad++ just because of this, but I think I will have to. What happened? Why doesn’t the search work correctly anymore? What can I do? Thank you in advance!
