"Find in files" special characters not working anymore
-
Hello.
I’ve been using Search and Replace on a bunch of files for years, until one day it stopped working.
I have TV shows subtitles containing special characters that don’t show up on my TV, so I replace those with normal letters and they’re a lot, so I need to batch.
The chars don’t show as they should in notepad++ either, though I’m on UTF-8, they are ș ț Ș Ț and I was replacing them like this:
º -> s
þ -> t
ª -> S
Þ -> T
Find in the current file works just fine on my characters. Find and replace in files works just fine on normal letters.
For example, if I search for º in the open file, it finds it just fine. In Find in Files, I get 0 results. But if I look for any normal letter, it works as it should.
I’ve reinstalled hoping some setting blew up but no luck, probably something on my part :(
Notepad++ v7.7.1 (64-bit)
Build time : Jun 16 2019 - 21:24:47
Path : C:\Program Files\Notepad++\notepad++.exe
Admin mode : OFF
Local Conf mode : OFF
OS : Windows 10 (64-bit)
Plugins : AutoSave.dll BetterMultiSelection.dll Explorer.dll mimeTools.dll NppConverter.dll NppToolBucket.dll PreviewHTML.dll
-
Hello, @Pro-Bg and All,
You said, in your post :
The chars don’t show as they should in notepad++ either, though I’m on UTF-8, they are ș ț Ș Ț and I was …
So, seemingly, you refer to the 4 characters below, from the Latin Extended-B Unicode Script [ 0180 – 024F ] :
| 0218 | Letter Ș | LATIN CAPITAL LETTER S WITH COMMA BELOW |
| 0219 | Letter ș | LATIN SMALL LETTER S WITH COMMA BELOW |
| 021A | Letter Ț | LATIN CAPITAL LETTER T WITH COMMA BELOW |
| 021B | Letter ț | LATIN SMALL LETTER T WITH COMMA BELOW |
See, to that purpose :
http://www.unicode.org/charts/PDF/U0180.pdf
However, after some searching, and judging from the characters you actually see in your file ( characters ª º Þ and þ ), due to an erroneous encoding, I suppose that you refer, instead, to these 4 characters, from the Latin Extended-A Unicode Script [ 0100 - 017F ] :
| 015E | Letter Ş | LATIN CAPITAL LETTER S WITH CEDILLA |
| 015F | Letter ş | LATIN SMALL LETTER S WITH CEDILLA |
| 0162 | Letter Ţ | LATIN CAPITAL LETTER T WITH CEDILLA |
| 0163 | Letter ţ | LATIN SMALL LETTER T WITH CEDILLA |
See, to that purpose :
http://www.unicode.org/charts/PDF/U0100.pdf
Note that if you intend to copy / paste some characters from these PDF files, of the Unicode Consortium, I advise you to download them first. Depending on your browser, some characters, although well displayed, may not be correctly pasted :-((
So, before going any further, which kind of characters are you referring to ? Indeed, depending on the set of characters used, we may need a different font, which properly handles these characters and correctly displays their glyphs !
See you later,
Best Regards,
guy038
-
Sorry, I was looking at another file and presumed all of them are UTF-8, but noticed later that they’re ANSI. That’s how the subbers made them, I have no idea why.
And yes, those are the Romanian letters, with comma below, not cedilla, but subbers in my country follow their own rules…
I’ll upload two of the subtitles here https://gofile.io/?c=Hz4Uts because I don’t have enough privileges to upload in this topic.
I can do search and replace in the current open file and it works just fine, it’s just the function that finds in files that doesn’t seem to work…
-
Hi, @Pro-Bg and All,
Firstly, when you begin to ask about character representation and/or codes, the best is to ask yourself : Does my operating system contain a font which can properly handle these characters and correctly display their glyphs ?
Now, unfortunately, these 4 Romanian characters Ș, ș, Ț and ț, of Unicode code-points 0218, 0219, 021A and 021B, are handled by very few proportional fonts and, AFAIK, by the monospaced font Consolas only !
So I advise you to use the Consolas font, which should be part of your system… On Windows 7, its version is 5.22 and, from Windows 8, its version is 5.32 and contains 2,735 glyphs.
From within Notepad++ :
- Select the Settings > Style Configurator option
- Select Global styles in the Language drop-down list
- Select Default style in the Style drop-down list
- In the Font Style area, choose the Consolas font, from the drop-down list of fonts
- Click on the Save & Close button
Remark :
In Notepad++, comparing the glyphs of these 4 Romanian characters ( with comma below ) with their equivalent chars ( with a cedilla ), with the Consolas font, I noticed, when maximum zoom is used, that :
- Regarding the letters S and s, the cedilla seems closer to the bottom of the character than the comma : Ș ș Ş ş
- Regarding the letters T and t, the character appearance seems rather identical : Ț ț Ţ ţ
Secondly, I don’t see any reason which could explain why the search/replacement would work when using the Replace dialog and NOT with the Find in Files dialog !
Two solutions :
- Open the Replace dialog
  - SEARCH (\x{0218})|(\x{0219})|(\x{021A})|(\x{021B})
  - REPLACE (?1S)(?2s)(?3T)(?4t)
  - Untick the Match whole word only option, if necessary
  - Tick the Match case box option ( Important )
  - Tick the Wrap around box option
  - Select the Regular expression search mode
  - Click on the Replace All button
- Open the Find in Files dialog
  - SEARCH (\x{0218})|(\x{0219})|(\x{021A})|(\x{021B})
  - REPLACE (?1S)(?2s)(?3T)(?4t)
  - Type in the correct file type in the Filters: zone
  - Type in the absolute path of the folder containing your files, in the Directory: zone, or click on the Follow current doc. box option
  - Choose, optionally, the In all sub-folders box option, if you need to browse a file tree
  - Untick the Match whole word only option, if necessary
  - Tick the Match case box option ( Important )
  - Select the Regular expression search mode
  - Click on the Replace in Files button
  - Confirm the Are you sure? dialog
Notes :
- In search, each of these 4 characters \x{####} is stored in its own group, from 1 to 4, due to the enclosing parentheses ()
- In replacement, due to the conditional replacement syntax (?#....), where # is the number of the matched group, the appropriate standard replacement letter, S, s, T or t, is just rewritten !
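For anyone who would rather script this outside Notepad++, here is a minimal Python sketch of the same idea: four capture groups, and the number of the group that matched picks the replacement letter. It assumes the text is already decoded to Unicode and really contains the comma-below code points; the names PATTERN, LETTERS and fold_comma_below are made up for this example.

```python
import re

# One capture group per character, in the same order as the SEARCH regex above.
PATTERN = re.compile("(\u0218)|(\u0219)|(\u021A)|(\u021B)")

# Replacement letter for group 1, 2, 3 and 4 respectively.
LETTERS = ["S", "s", "T", "t"]


def fold_comma_below(text: str) -> str:
    """Replace each comma-below letter with its plain ASCII counterpart."""
    # m.lastindex is the number of the group that matched, which plays the
    # role of the (?1...)(?2...) conditionals in the REPLACE field.
    return PATTERN.sub(lambda m: LETTERS[m.lastindex - 1], text)


if __name__ == "__main__":
    print(fold_comma_below("\u0218i \u021Bar\u0103"))
```

Characters outside the four groups (ă in the sample) are left untouched, just like with the Notepad++ regex.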
Cheers,
guy038
-
-
I’m going to assume the solution @guy038 posted will work, because his solutions usually do (or, at least, they are moving in the direction of working for whoever asked the question, because Guy doesn’t stop until they do work).
However, before he posted, I had started down a non-regex road; I think it will be useful, so even after Guy’s post, I continued to write it up.
@Pro-Bg said:
…noticed later that they’re ANSI. That how the subbers made them, …
And yes, those are the Romanian letters,
When you said that, I took a look at the files. When you open them with Settings > Preferences > MISC > ☑ Autodetect character encoding enabled, they detect as “ANSI”, and those characters show up as you originally posted. Since you said “Romanian”, I assumed maybe it was really a Central or Eastern European encoding used, rather than the default “ANSI” Western European encoding.
So I went to Encoding > Character Sets > Eastern European: choosing ISO 8859-2 appeared to work. But while writing this up, I realized that Romanian can be considered Central European as well, so I tried choosing … > Central European > OEM 852, which made those characters box-drawing, so that was obviously wrong. … > Central European > Windows-1250 appeared to convert those to the right characters as well.
I don’t know all the differences between ISO 8859-2 and Windows 1250 – ah, per Wikipedia, “Windows-1250 is similar to ISO-8859-2 and has all the printable characters it has and more. However a few of them are rearranged”. You would have to know more about the files to determine which of those encodings they really are; though my guess, if they’re for subtitles, then they were done with the ISO 8859-2, not the Microsoft-centric Windows-1250.
So really, in Notepad++, instead of doing a search-replace, all you need to do is to change the Encoding > Character Set to the appropriate one (probably ISO 8859-2, but maybe Windows-1250). After doing that, so it’s displayed properly, you should be able to read and edit the file to your heart’s content. If you’re going to be editing the file multiple times, I would suggest Encoding > Convert to UTF-8-BOM, so it will change the encoded single-byte Romanian characters to their UTF-8 multi-byte encoding, with the BOM character inserted at the beginning of the file. Once you save after the conversion, then the next time you open the file with Notepad++, it will properly interpret it as UTF-8, and all the characters will be interpreted and displayed correctly.
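To make the "same bytes, different encoding" point concrete, here is a small Python sketch. It assumes that "ANSI" on the poster's machine means Windows-1252 (the Western European default), and it uses only the four byte values discussed in this thread; RAW, decode_views and to_utf8_bom are names invented for the illustration.

```python
# The four bytes as they sit on disk in the subtitle files (per this thread).
RAW = bytes([0xAA, 0xBA, 0xDE, 0xFE])


def decode_views(raw: bytes) -> dict:
    """Decode the same bytes under each candidate single-byte encoding."""
    return {codec: raw.decode(codec) for codec in ("cp1252", "iso-8859-2", "cp1250")}


def to_utf8_bom(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode the bytes as UTF-8 with a BOM (Python's "utf-8-sig" codec)."""
    return raw.decode(source_encoding).encode("utf-8-sig")


if __name__ == "__main__":
    for codec, text in decode_views(RAW).items():
        print(codec, text)
```

Under cp1252 the bytes come out as the ª º Þ þ from the first post, while ISO 8859-2 and Windows-1250 both give the cedilla letters, which is why changing the character set fixes the display without any search/replace.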
As far as subtitles go: I’m guessing what prompted this is that the subtitles were showing up wrong in your video player of choice. My guess is that it was because your player didn’t know / couldn’t guess the right encoding for the file, so used ANSI like Notepad++ did. I don’t know whether your player handles UTF-8 better than a random encoding… but if it does, then maybe saving the file after converting to UTF-8-BOM will make it work right in your player. You might be able to google for your player’s name and “encoding” or “utf-8” or “unicode”, to find out which encoding it assumes or prefers.
However, if you have a lot of files, Notepad++ might not be the most efficient for batch-converting the encoding.
The superuser answer that I referenced in my post in another thread links to a version of iconv for Windows, which should be able to automate the conversion from ISO 8859-2 to UTF-8.
    iconv -f ISO-8859-2 -t utf-8 sourcefile.srt > outfile.srt
To get that to do all files in a given directory, open a cmd.exe prompt in that directory, and run
    FOR %f in (*.srt) do @( iconv -f ISO-8859-2 -t utf-8 "%f" > "%~nf.utf8%~xf" )
When I ran that on the marco polo files you showed us for download, it did properly convert them to utf-8.
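If iconv is not at hand, roughly the same batch job can be sketched in Python. This is only an illustration under the same assumption as the loop above (the sources are ISO 8859-2); convert_srt_files is a hypothetical helper name, and the output naming mimics the name.utf8.srt pattern of the FOR command.

```python
from pathlib import Path


def convert_srt_files(directory, source_encoding="iso-8859-2"):
    """Write a UTF-8 copy of every .srt file in `directory`; return the new paths."""
    written = []
    for src in sorted(Path(directory).glob("*.srt")):
        # Skip copies produced by an earlier run of this function.
        if src.name.endswith(".utf8.srt"):
            continue
        # Work on raw bytes so that line endings are carried over unchanged.
        text = src.read_bytes().decode(source_encoding)
        dst = src.with_name(src.stem + ".utf8" + src.suffix)
        dst.write_bytes(text.encode("utf-8"))
        written.append(dst)
    return written


if __name__ == "__main__":
    for path in convert_srt_files("."):
        print("wrote", path)
```

The originals are left in place, so a wrong guess at the source encoding can be retried with another value for source_encoding.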
-
Hello, @Pro-Bg, @peterjones and All,
Thanks to @peterjones, I understood that I simply forgot to act in the right order :-(( So, @Pro-Bg, just forget the second part of my previous post, where I described the regex S/R, which is wrong :-((
So, I, first, downloaded your archive and extracted the Marco Polo S01E01 The Wayfarer 720p BluRay DTS x264-EbP.srt file.
When opening your file, in Notepad++, I get an ANSI encoded file. BTW, I also tried to untick the Settings > Preferences > MISC > Autodetect character encoding option. Luckily, after re-opening Notepad++ and loading your file, its encoding had not changed and was still ANSI !
I renamed your file with a shorter name and chose the .txt extension. So, from now on, your initial file will be named Test.txt !
I’m about to show you 3 different methods to solve @Pro-Bg’s problem. Note that the first one is just Peter’s solution !
FIRST method :
- I used the iconv utility, as suggested by Peter, running the command, below, in a DOS console window :
    iconv -f ISO-8859-2 -t UTF-8 Test.txt > Test_ICONV.txt
Indeed, the result is fine and the 4 characters ª, º, Þ and þ were correctly translated into the 4 chars Ş, ş, Ţ and ţ :-))
Remark :
If we assume that your file was, initially, a Windows-1250 encoded file and that we run the command, below :
    iconv -f WINDOWS-1250 -t UTF-8 Test.txt > Test_2.txt
One can easily verify that the two output files are identical. So, regarding this file, these two encodings are equivalent. Nice !
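The reason both commands give the same output can be checked with a short Python sketch: list the byte values that ISO 8859-2 and Windows-1250 decode differently, and test whether a file uses any of them. The function names are invented for this example, and undefined bytes are decoded with errors="replace" so the comparison never raises.

```python
def differing_bytes(codec_a="iso-8859-2", codec_b="cp1250"):
    """Return the set of byte values above 0x7F that the two codecs decode differently."""
    diff = set()
    for value in range(0x80, 0x100):
        one = bytes([value])
        if one.decode(codec_a, errors="replace") != one.decode(codec_b, errors="replace"):
            diff.add(value)
    return diff


def same_under_both(data: bytes) -> bool:
    """True if `data` contains none of the bytes on which the two codecs disagree."""
    return not (set(data) & differing_bytes())


if __name__ == "__main__":
    print(sorted(hex(v) for v in differing_bytes()))
    print(same_under_both(bytes([0xAA, 0xBA, 0xDE, 0xFE])))
```

A file whose non-ASCII bytes all fall outside that set converts identically under either encoding, which matches what was observed for Test.txt.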
Note :
Be aware, however, that the 4 characters Ş, ş, Ţ and ţ, in the output file, are letters with a cedilla and not the Romanian letters with a comma below : Ș, ș, Ț and ț !
SECOND method :
- Open a new file ( Ctrl + N )
- If your default encoding, for new files, is not ANSI, select the first option Encoding > ANSI for this empty file. Note that, as your file is empty, you could, as well, run the option Encoding > Convert to ANSI
=> The ANSI encoding should be displayed in the status bar
- Now, copy / paste the contents of the Test.txt file, in this new file
- Then, run one of the two options :
  - Encoding > Character Sets > Central European > Windows-1250
  - Encoding > Character Sets > Eastern European > ISO 8859-2
- A small window, with title Lose Undo Ability Waning pops up : You should save the current modification. All the saved modifications can not be undone. Continue ?
- Choose the default choice, clicking on the Yes button
- The Save as dialog then occurs. So, save this new file as, let’s say, Test_NPP.txt
=> Note that the Windows-1250 ( or ISO 8859-2 ) encoding is shown in the status bar
- Then select the Encoding > Convert to UTF-8 option ( Do not choose the UTF-8 only option ! )
=> This time, the UTF-8 encoding is displayed in the status bar
- Save the modifications ( Ctrl + S )
The nice thing is that the Test_NPP.txt file, built from within N++, and the Test_ICONV.txt file, output of the iconv DOS command, are strictly identical !
THIRD method ( a bit longer ! ) :
- Open a new file ( Ctrl + N )
- If your default encoding, for new files, is not ANSI, select the first option Encoding > ANSI for this empty file
=> The ANSI encoding should be displayed in the status bar
- Now, copy / paste the contents of Test.txt, in this new file
- First, we’ll try to get rid of standard characters, in order to identify which characters would have a different byte sequence, when migrated to UTF-8. This concerns, principally, characters with code-point above \x7F. So :
- Suppression of any ASCII character, with code in the [ 0 - 127 ] range :
  - SEARCH [\x00-\x7f]+
  - REPLACE Leave EMPTY
- Leave only one character per line :
  - SEARCH .
  - REPLACE $0\r\n
- Run the Edit > Line Operations > Sort Lines Lexicographically Ascending option
- Run the Edit > Line Operations > Remove Consecutive Duplicate Lines option
=> You’re left with a tiny list of 9 characters ª º Ã Î Þ â ã î þ
- Run the Encoding > Character Sets > Central European > Windows-1250 option
- A small window, with title Lose Undo Ability Waning pops up : You should save the current modification. All the saved modifications can not be undone. Continue ?
- Choose the default choice, clicking on the Yes button
- The Save as dialog occurs. So, save this new file, anywhere, with a dummy name
=> The Windows-1250 encoding is shown in the status bar
- The tiny list has been changed into the 9 following characters Ş ş Ă Î Ţ â ă î ţ, rewritten, below, with their codes :
Characters           Ş     ş     Ă     Î     Ţ     â     ă     î     ţ
In Windows-1250      00AA  00BA  00C3  00CE  00DE  00E2  00E3  00EE  00FE
( Unicode value      015E  015F  0102  00CE  0162  00E2  0103  00EE  0163 )
Refer, to that purpose, to the link :
https://en.wikipedia.org/wiki/Windows-1250
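The inventory built by hand above (strip ASCII, one character per line, sort, remove duplicates, then compare the two readings) can also be sketched in a few lines of Python. This assumes that "ANSI" means Windows-1252; non_ascii_inventory is a hypothetical helper name.

```python
def non_ascii_inventory(data: bytes):
    """List each distinct byte above 0x7F with its ANSI and Windows-1250 readings.

    Each row is (byte value, char as cp1252, char as cp1250, readings differ?).
    """
    rows = []
    for value in sorted({b for b in data if b > 0x7F}):
        one = bytes([value])
        as_ansi = one.decode("cp1252", errors="replace")
        as_1250 = one.decode("cp1250", errors="replace")
        rows.append((value, as_ansi, as_1250, as_ansi != as_1250))
    return rows


if __name__ == "__main__":
    sample = bytes([0x61, 0xAA, 0xCE, 0xE2, 0xFE, 0xAA])
    for value, ansi, ce, differs in non_ascii_inventory(sample):
        print(f"{value:02X}  {ansi}  {ce}  {'differs' if differs else 'same'}")
```

Rows flagged as "same" are the characters that need no remapping between the two readings.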
After examination of the different Unicode values, we can eliminate the 3 characters Î, â and î, which are identical in the two encodings ( Note that they correspond to the characters with a Unicode value under \x0100 )
- Open a new file ( Ctrl + N )
- If your default encoding, for new files, is not ANSI, select the first option Encoding > ANSI for this empty file
=> The ANSI encoding should be displayed in the status bar
- Now, copy / paste the contents of Test.txt, in this new file
- Run the Encoding > Convert to UTF-8 option ( Do not choose the UTF-8 only option ! )
=> The UTF-8 encoding is, now, displayed in the status bar
- Perform the following regex S/R :
  - SEARCH (\x{00AA})|(\x{00BA})|(\x{00C3})|(\x{00DE})|(\x{00E3})|(\x{00FE})
  - REPLACE (?1\x{015E})(?2\x{015F})(?3\x{0102})(?4\x{0162})(?5\x{0103})(?6\x{0163})
=> 733 replacements done
- Save this new file and name it, let’s say, Test_REGEX.txt
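As a cross-check of the six-group regex, here is a Python sketch that applies the same six substitutions to text that was wrongly read as ANSI (assumed to be Windows-1252). The table simply restates the pairs from the SEARCH and REPLACE fields; REMAP and remap_misread_text are illustrative names.

```python
# Code point as shown under the ANSI reading -> code point under Windows-1250,
# in the same order as groups 1 to 6 of the regex.
REMAP = {
    0x00AA: 0x015E,  # ª -> Ş
    0x00BA: 0x015F,  # º -> ş
    0x00C3: 0x0102,  # Ã -> Ă
    0x00DE: 0x0162,  # Þ -> Ţ
    0x00E3: 0x0103,  # ã -> ă
    0x00FE: 0x0163,  # þ -> ţ
}


def remap_misread_text(text: str) -> str:
    """Apply the six substitutions to text that was decoded as ANSI by mistake."""
    return text.translate(REMAP)


if __name__ == "__main__":
    misread = b"\xaai \xfear\xe3".decode("cp1252")
    print(remap_misread_text(misread))
```

For bytes covered by the table, the result matches what decoding the original bytes as Windows-1250 would give, which is consistent with the regex output being identical to the iconv output.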
Again, the nice thing is that the Test_REGEX.txt file, built from within N++, with a regex S/R, and the Test_ICONV.txt file, output of the iconv DOS command, are strictly identical, too !
Best Regards,
guy038
P.S. :
Now, @Pro-Bg, if you really want to see the Romanian Ș, ș, Ț and ț letters, with comma below :
- In N++, open, either, the Test_ICONV.txt, Test_NPP.txt or Test_REGEX.txt output file ( identical UTF-8 encoded files ! )
- Perform this last regex S/R :
  - SEARCH (\x{015E})|(\x{015F})|(\x{0162})|(\x{0163}) ( Characters S and T with cedilla )
  - REPLACE (?1\x{0218})(?2\x{0219})(?3\x{021A})(?4\x{021B}) ( Romanian Characters S and T with comma below )
- Re-save your file
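For scripted use, the cedilla to comma-below step can be sketched in Python as well, with unicodedata used to confirm which variant a character really is (the two are hard to tell apart by eye, as noted earlier in the thread). CEDILLA_TO_COMMA, to_comma_below and describe are names made up for this example.

```python
import unicodedata

# Cedilla letters -> Romanian comma-below letters, same pairs as the regex above.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",
    "\u015F": "\u0219",
    "\u0162": "\u021A",
    "\u0163": "\u021B",
})


def to_comma_below(text: str) -> str:
    """Replace S/T with cedilla by S/T with comma below."""
    return text.translate(CEDILLA_TO_COMMA)


def describe(text: str):
    """Return the official Unicode name of every character in `text`."""
    return [unicodedata.name(ch) for ch in text]


if __name__ == "__main__":
    for name in describe(to_comma_below("\u015E\u0163")):
        print(name)
```

Whether the TV or player from the first post has glyphs for the comma-below letters is a separate question that this sketch does not answer.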
P.P.S. :
Those of you whose mother tongue is English are really lucky ! You have to worry, very little, about all these encoding problems ;-))
-
Thank you for your time and answers, gentlemen, wasn’t expecting such support on this forum.
-
Hello!
I do not want to create a new topic because my problem is pretty much the same. But I would like to get a short answer, I’m not interested in the character coding stuff.
So, I have a lot of .cpp files, all of them are ANSI and, to my luck, every comment is written in Korean. For example, there is a comment: “ÇöŔç Ŕ§ÄˇżˇĽ »çżëÇŇ Ľö ľř˝Ŕ´Ď´Ů.”
(This means “Not available at this location.” if I change the character encoding to Windows-949 but it is not important now.)
A few Notepad++ releases ago I was able to search in my source files for specially encoded characters, but nowadays I can’t.
So my question is, what can I do to fix the search?
I do not want to install an older version of Notepad++ just because of this, but I think I must. What happened? Why is the search not working correctly anymore? What can I do?
Thank you in advance!