"Find in files" special characters not working anymore
-
Hello.
I’ve been running Search and Replace across batches of files for years, until one day it stopped working.
I have TV show subtitles containing special characters that don’t show up on my TV, so I replace them with plain letters. There are a lot of files, so I need to do it in batch.
The characters don’t display properly in Notepad++ either, even though I’m on UTF-8. They are ș ț Ș Ț, and I was replacing them like this:
º -> s
þ -> t
ª -> S
Þ -> T
Find in the current file works just fine on my characters. Find and replace in files works just fine on normal letters.
For example, if I search for º in the open file, it finds it just fine. In Find in Files, I get 0 results. But if I look for any normal letter, it works as it should.
I’ve reinstalled hoping some setting blew up but no luck, probably something on my part :(
Notepad++ v7.7.1 (64-bit)
Build time : Jun 16 2019 - 21:24:47
Path : C:\Program Files\Notepad++\notepad++.exe
Admin mode : OFF
Local Conf mode : OFF
OS : Windows 10 (64-bit)
Plugins : AutoSave.dll BetterMultiSelection.dll Explorer.dll mimeTools.dll NppConverter.dll NppToolBucket.dll PreviewHTML.dll
-
Hello, @Pro-Bg and All,
You said, in your post :
The chars don’t show as they should in notepad++ either, though I’m on UTF-8, they are ș ț Ș Ț and I was …
So, seemingly, you refer to the 4 characters below, from the Latin Extended-B Unicode Script [ 0180 – 024F ] :
| 0218 | Letter Ș | LATIN CAPITAL LETTER S WITH COMMA BELOW |
| 0219 | Letter ș | LATIN SMALL LETTER S WITH COMMA BELOW |
| 021A | Letter Ț | LATIN CAPITAL LETTER T WITH COMMA BELOW |
| 021B | Letter ț | LATIN SMALL LETTER T WITH COMMA BELOW |
See, to that purpose :
http://www.unicode.org/charts/PDF/U0180.pdf
However, after some searching, and judging from the characters that you actually see in your file ( ª º Þ and þ ) due to an erroneous encoding, I suppose that you refer, instead, to these 4 characters, from the Latin Extended-A Unicode Script [ 0100 - 017F ] :
| 015E | Letter Ş | LATIN CAPITAL LETTER S WITH CEDILLA |
| 015F | Letter ş | LATIN SMALL LETTER S WITH CEDILLA |
| 0162 | Letter Ţ | LATIN CAPITAL LETTER T WITH CEDILLA |
| 0163 | Letter ţ | LATIN SMALL LETTER T WITH CEDILLA |
See, to that purpose :
http://www.unicode.org/charts/PDF/U0100.pdf
Note that if you intend to copy / paste some characters from these PDF files of the Unicode Consortium, I advise you to download them first. Depending on your browser, some characters, although displayed correctly, may not be pasted correctly :-((
So, before going any further, which set of characters are you referring to ? Indeed, depending on the set of characters used, we may need a different font, one which properly handles these characters and correctly displays their glyphs !
See you later,
Best Regards,
guy038
-
Sorry, I was looking at another file and presumed they were all UTF-8, but noticed later that they’re ANSI. That’s how the subbers made them; I have no idea why.
And yes, those are the Romanian letters, with comma below, not cedilla, but subbers in my country follow their own rules…
I’ll upload two of the subtitles here https://gofile.io/?c=Hz4Uts because I don’t have enough privileges to upload in this topic.
I can do search and replace in the current open file and it works just fine, it’s just the function that finds in files that doesn’t seem to work…
-
Hi, @Pro-Bg and All,
Firstly, when you begin to ask about character representation and/or codes, the best thing is to ask yourself : does my operating system contain a font which can properly handle these characters and correctly display their glyphs ?
Now, unfortunately, these 4 Romanian characters Ș, ș, Ț and ț, of Unicode code-points 0218, 0219, 021A and 021B, are handled by very few proportional fonts and, AFAIK, by the monospaced font Consolas only !
So I advise you to use the Consolas font, which should be part of your system… On Windows 7, its version is 5.22 and, from Windows 8, its version is 5.32 and contains 2,735 glyphs.
From within Notepad++ :
- Select the Settings > Style Configurator > option
- Select Global styles in the Language drop-down list
- Select Default style in the Style drop-down list
- In the Font Style area, choose the Consolas font from the drop-down list of fonts
- Click on the Save & Close button
Remark :
In Notepad++, comparing the glyphs of these 4 Romanian characters ( with comma below ) with their equivalent chars ( with a cedilla ), in the Consolas font, I noticed, when maximum zoom is used, that :
- Regarding the letters S and s, the cedilla seems closer to the bottom of the character than the comma : Ș ș Ş ş
- Regarding the letters T and t, the character appearance seems rather identical : Ț ț Ţ ţ
Secondly, I don’t see any reason which could explain why the search/replacement would work when using the Replace dialog and NOT with the Find in Files dialog !
Two solutions :
- Open the Replace dialog
  - SEARCH (\x{0218})|(\x{0219})|(\x{021A})|(\x{021B})
  - REPLACE (?1S)(?2s)(?3T)(?4t)
  - Untick the Match whole word only option, if necessary
  - Tick the Match case option ( Important )
  - Tick the Wrap around option
  - Select the Regular expression search mode
  - Click on the Replace All button
- Open the Find in Files dialog
  - SEARCH (\x{0218})|(\x{0219})|(\x{021A})|(\x{021B})
  - REPLACE (?1S)(?2s)(?3T)(?4t)
  - Type in the correct file type in the Filters: zone
  - Type in the correct absolute path name to your files in the Directory: zone, or click on the Follow current doc. option
  - Choose, optionally, the In all sub-folders option, if you need to browse a file tree
  - Untick the Match whole word only option, if necessary
  - Tick the Match case option ( Important )
  - Select the Regular expression search mode
  - Click on the Replace in Files button
  - Confirm the Are you sure? dialog
Notes :
- In the search, each of these 4 characters \x{####} is stored in a group, from 1 to 4, due to the enclosing parentheses ()
- In the replacement, due to the conditional replacement syntax (?#....), where # is the number of the matched group, the appropriate standard replacement letter, S, s, T or t, is simply written !
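For readers who want to try the same group-based idea outside Notepad++, here is a minimal Python sketch. It is only an illustration, not part of the solution above: Python's `re` module has no `(?1S)` conditional replacement syntax, so a callback picks the letter from the number of the group that matched. The function name `to_plain` is hypothetical.

```python
import re

# Same four alternatives as the SEARCH field above, one capture group each.
PATTERN = re.compile("(\u0218)|(\u0219)|(\u021A)|(\u021B)")

# Replacement letters, in group order: group 1 -> S, 2 -> s, 3 -> T, 4 -> t.
PLAIN_LETTERS = "SsTt"

def to_plain(text: str) -> str:
    """Replace Romanian comma-below S/T letters with plain ASCII letters."""
    # m.lastindex is the number of the single group that matched.
    return PATTERN.sub(lambda m: PLAIN_LETTERS[m.lastindex - 1], text)
```

The search is case sensitive by construction, which corresponds to ticking the Match case option in the dialogs above.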
Cheers,
guy038
-
-
I’m going to assume the solution @guy038 posted will work, because his solutions usually do (or, at least, they move in the direction of working for whoever asked the question, because Guy doesn’t stop until they do work).
However, before he posted, I had started down a non-regex road; I think it will be useful, so even after Guy’s post, I continued to write it up.
@Pro-Bg said:
…noticed later that they’re ANSI. That how the subbers made them, …
And yes, those are the Romanian letters,
When you said that, I took a look at the files. When you open them with Settings > Preferences > MISC > ☑ Autodetect character encoding enabled, they detect as “ANSI”, and those characters show up as you originally posted. Since you said “Romanian”, I assumed maybe it was really a Central or Eastern European encoding used, rather than the default “ANSI” Western European encoding.
So I went to Encoding > Character Sets > Eastern European: choosing ISO 8859-2 appeared to work. But while writing this up, I realized that Romanian can be considered Central European as well, so I tried choosing … > Central European > OEM 852, which turned those characters into box-drawing characters, so that was obviously wrong. … > Central European > Windows 1250 appeared to convert those to the right characters as well.
I don’t know all the differences between ISO 8859-2 and Windows 1250 – ah, per Wikipedia, “Windows-1250 is similar to ISO-8859-2 and has all the printable characters it has and more. However a few of them are rearranged”. You would have to know more about the files to determine which of those encodings they really are; my guess is that, since they’re for subtitles, they were made with ISO 8859-2, not the Microsoft-centric Windows-1250.
So really, in Notepad++, instead of doing a search-replace, all you need to do is to change the Encoding > Character Set to the appropriate one (probably ISO 8859-2, but maybe Windows-1250). After doing that, so it’s displayed properly, you should be able to read and edit the file to your heart’s content. If you’re going to be editing the file multiple times, I would suggest Encoding > Convert to UTF-8-BOM, so it will change the encoded single-byte Romanian characters to their UTF-8 multi-byte encoding, with the BOM character inserted at the beginning of the file. Once you save after the conversion, then the next time you open the file with Notepad++, it will properly interpret it as UTF-8, and all the characters will be interpreted and displayed correctly.
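To make the "same bytes, different interpretation" point concrete, here is a small Python sketch. It assumes that "ANSI" on a Western European Windows system means code page 1252 (that depends on the system locale), and it uses Python's `utf-8-sig` codec as a stand-in for Notepad++'s UTF-8-BOM; the variable names are just for illustration.

```python
# The four byte values behind the characters discussed in this thread.
raw = bytes([0xAA, 0xBA, 0xDE, 0xFE])

# Decode the same bytes under three single-byte encodings.
as_western = raw.decode("cp1252")     # assumed "ANSI" view: ª º Þ þ
as_latin2 = raw.decode("iso8859_2")   # ISO 8859-2 view
as_cp1250 = raw.decode("cp1250")      # Windows-1250 view

for name, text in (("cp1252", as_western),
                   ("iso8859_2", as_latin2),
                   ("cp1250", as_cp1250)):
    print(name, " ".join(text))

# Re-encoding as UTF-8 with a BOM, similar in spirit to "Convert to UTF-8-BOM".
with_bom = as_latin2.encode("utf-8-sig")
```

The bytes on disk never change when you only switch the Character Set; only the conversion step rewrites them as multi-byte UTF-8.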
As far as subtitles go: I’m guessing what prompted this is that the subtitles were showing up wrong in your video player of choice. My guess is that it was because your player didn’t know / couldn’t guess the right encoding for the file, so used ANSI like Notepad++ did. I don’t know whether your player handles UTF-8 better than a random encoding… but if it does, then maybe saving the file after converting to UTF-8-BOM will make it work right in your player. You might be able to google for your player’s name and “encoding” or “utf-8” or “unicode”, to find out which encoding it assumes or prefers.
However, if you have a lot of files, Notepad++ might not be the most efficient for batch-converting the encoding.
The superuser answer that I referenced in my post in another thread links to a version of iconv for Windows, which should be able to automate the conversion from ISO 8859-2 to UTF-8.
iconv -f ISO-8859-2 -t utf-8 sourcefile.srt > outfile.srt
To get that to do all files in a given directory, open a cmd.exe prompt in that directory, and run
FOR %f in (*.srt) do @( iconv -f ISO-8859-2 -t utf-8 "%f" > "%~nf.utf8%~xf" )
When I ran that on the Marco Polo files you showed us for download, it did properly convert them to utf-8.
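If iconv is not available, roughly the same batch conversion can be sketched in Python. This is an untested-on-your-files illustration, assuming the sources really are ISO 8859-2; `convert_folder` and its parameters are hypothetical names, and the output naming imitates the `name.utf8.srt` pattern of the FOR loop above.

```python
from pathlib import Path

def convert_folder(folder, source_encoding="iso8859_2", pattern="*.srt"):
    """Write a UTF-8 copy of every matching file, leaving the originals alone."""
    converted = []
    for src in sorted(Path(folder).glob(pattern)):
        # Skip outputs of an earlier run, e.g. "episode.utf8.srt".
        if src.stem.endswith(".utf8"):
            continue
        # Work on bytes so that line endings are preserved exactly.
        text = src.read_bytes().decode(source_encoding)
        dst = src.with_name(f"{src.stem}.utf8{src.suffix}")
        dst.write_bytes(text.encode("utf-8"))
        converted.append(dst)
    return converted
```

Calling `convert_folder(".")` from the subtitle directory would be the usage closest to the cmd.exe loop; pass `source_encoding="cp1250"` if the files turn out to be Windows-1250.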
-
Hello, @Pro-Bg, @peterjones and All,
Thanks to @peterjones, I understood that I simply forgot to act in the right order :-(( So, @Pro-Bg, just forget the second part of my previous post, where I described the regex S/R, which is wrong :-((
So, first, I downloaded your archive and extracted the Marco Polo S01E01 The Wayfarer 720p BluRay DTS x264-EbP.srt file.
When opening your file in Notepad++, I get an ANSI encoded file. BTW, I also tried unticking the Settings > Preferences > MISC > Autodetect character encoding option. Luckily, after re-opening Notepad++ and loading your file, its encoding had not changed and was still ANSI !
I renamed your file with a shorter name and chose the .txt extension. So, from now on, your initial file will be named Test.txt !
I’m about to show you 3 different methods to solve @Pro-Bg’s problem. Note that the first one is just Peter’s solution !
FIRST method :
- I used the iconv utility, as suggested by Peter, running the command below in a DOS console window :
iconv -f ISO-8859-2 -t UTF-8 Test.txt > Test_ICONV.txt
Indeed, the result is fine and the 4 characters ª, º, Þ and þ were correctly translated into the 4 chars Ş, ş, Ţ and ţ :-))
Remark :
If we assume that your file was, initially, a Windows-1250 encoded file and we run the command below :
iconv -f WINDOWS-1250 -t UTF-8 Test.txt > Test_2.txt
one can easily verify that the two output files are identical. So, regarding this file, these two encodings are equivalent. Nice !
Note :
Be aware, however, that the 4 characters Ş, ş, Ţ and ţ in the output file are letters with a cedilla and not the Romanian letters with a comma below : Ș, ș, Ț and ț !
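The "two encodings are equivalent for this file" check from the remark above can also be sketched without producing two output files. This is a minimal illustration, assuming Python's `iso8859_2` and `cp1250` codecs match what iconv uses; `encodings_agree` is a hypothetical helper name.

```python
def encodings_agree(data: bytes, first="iso8859_2", second="cp1250") -> bool:
    """Return True if both encodings decode the bytes to the same text."""
    try:
        return data.decode(first) == data.decode(second)
    except UnicodeDecodeError:
        # A byte undefined in one of the encodings counts as a disagreement.
        return False
```

For example, `encodings_agree(open("Test.txt", "rb").read())` would tell you whether the choice between the two encodings matters for that particular file. It will not tell you which encoding the author actually used.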
SECOND method :
- Open a new file ( Ctrl + N )
- If your default encoding for new files is not ANSI, select the first option Encoding > ANSI for this empty file. Note that, as your file is empty, you could equally run the option Encoding > Convert to ANSI
=> The ANSI encoding should be displayed in the status bar
- Now, copy / paste the contents of the Test.txt file into this new file
- Then, run one of the two options :
  - Encoding > Character Sets > Central European > Windows-1250
  - Encoding > Character Sets > Eastern European > ISO 8859-2
- A small window, with title Lose Undo Ability Waning, pops up : You should save the current modification. All the saved modifications can not be undone. Continue ?
- Choose the default choice, clicking on the Yes button
- The Save as dialog then appears. So, save this new file as, let’s say, Test_NPP.txt
=> Note that the Windows-1250 ( or ISO 8859-2 ) encoding is shown in the status bar
- Then select the Encoding > Convert to UTF-8 option ( Do not choose the UTF-8 only option ! )
=> This time, the UTF-8 encoding is displayed in the status bar
- Save the modifications ( Ctrl + S )
The nice thing is that the Test_NPP.txt file, built from within N++, and the Test_ICONV.txt file, output of the iconv DOS command, are strictly identical !
THIRD method ( a bit longer ! ) :
- Open a new file ( Ctrl + N )
- If your default encoding for new files is not ANSI, select the first option Encoding > ANSI for this empty file
=> The ANSI encoding should be displayed in the status bar
- Now, copy / paste the contents of Test.txt into this new file
- First, we’ll try to get rid of standard characters, in order to identify which characters would have a different byte sequence when migrated to UTF-8. This concerns, principally, characters with code-point above \x7F. So :
- Suppression of any ASCII character, with code in the [ 0 - 127 ] range :
  - SEARCH [\x00-\x7f]+
  - REPLACE Leave EMPTY
- Leave only one character per line :
  - SEARCH .
  - REPLACE $0\r\n
- Run the Edit > Line Operations > Sort Lines Lexicographically Ascending option
- Run the Edit > Line Operations > Remove Consecutive Duplicate Lines option
=> You’re left with a tiny list of 9 characters : ª º Ã Î Þ â ã î þ
- Run the Encoding > Character Sets > Central European > Windows-1250 option
- A small window, with title Lose Undo Ability Waning, pops up : You should save the current modification. All the saved modifications can not be undone. Continue ?
- Choose the default choice, clicking on the Yes button
- The Save as dialog appears. So, save this new file, anywhere, with a dummy name
=> The Windows-1250 encoding is shown in the status bar
- The tiny list has been changed into the 9 following characters Ş ş Ă Î Ţ â ă î ţ, rewritten below with their codes :
| Characters | Ş | ş | Ă | Î | Ţ | â | ă | î | ţ |
| In Windows-1250 | 00AA | 00BA | 00C3 | 00CE | 00DE | 00E2 | 00E3 | 00EE | 00FE |
| Unicode value | 015E | 015F | 0102 | 00CE | 0162 | 00E2 | 0103 | 00EE | 0163 |
Refer, to that purpose, to the link :
https://en.wikipedia.org/wiki/Windows-1250
After examination of the different Unicode values, we can eliminate the 3 characters Î, â and î, which are identical in the two encodings ( Note that they correspond to the characters with a Unicode value under \x0100 )
Open a new file (
Ctrl + N
) -
If your default encoding, for new files, is not
ANSI
, select the first optionEncoding > ANSI
for this empty file
=> The
ANSI
encoding should be displayed in the status bar-
Now, copy / paste the contents of
Test.txt
, in this new file -
Run the
Encoding > Convert to UTF-8
option ( Do not choose theUTF-8
only option ! )
=> The
UTF-8
encoding is, now, displayed in the status bar-
Perform the following regex S/R :
-
SEARCH
(\x{00AA})|(\x{00BA})|(\x{00C3})|(\x{00DE})|(\x{00E3})|(\x{00FE})
-
REPLACE
(?1\x{015E})(?2\x{015F})(?3\x{0102})(?4\x{0162})(?5\x{0103})(?6\x{0163})
-
=>
733
replacements done- Save this new file and name it, let’s say,
Test_REGEX.txt
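The steps of this third method (isolate the non-ASCII characters, look them up in Windows-1250, keep only those that differ) can be sketched in a few lines of Python. This is only an illustration under two assumptions: that the "ANSI" view is code page 1252, and that the real encoding is Windows-1250. The name `remap_table` is hypothetical.

```python
def remap_table(data: bytes, shown_as="cp1252", really="cp1250") -> dict:
    """Map each wrongly displayed character to the character it should be."""
    # Distinct bytes above 0x7F, like the sorted and de-duplicated list above.
    high_bytes = sorted({b for b in data if b > 0x7F})
    table = {}
    for b in high_bytes:
        seen = bytes([b]).decode(shown_as)    # what the ANSI view displays
        wanted = bytes([b]).decode(really)    # what the author intended
        if seen != wanted:                    # identical characters are dropped
            table[seen] = wanted
    return table
```

Applying the table with `data.decode("cp1252").translate(str.maketrans(table))` plays the role of the regex S/R step; bytes that are undefined in either code page would raise a `UnicodeDecodeError` in this sketch.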
Again, the nice thing is that the Test_REGEX.txt file, built from within N++ with a regex S/R, and the Test_ICONV.txt file, output of the iconv DOS command, are strictly identical, too !
Best Regards,
guy038
P.S. :
Now, @Pro-Bg, if you really want to see the Romanian Ș, ș, Ț and ț letters, with comma below :
- In N++, open either the Test_ICONV.txt, Test_NPP.txt or Test_REGEX.txt output file ( identical UTF-8 encoded files ! )
- Perform this last regex S/R :
  - SEARCH (\x{015E})|(\x{015F})|(\x{0162})|(\x{0163}) ( Characters S and T with cedilla )
  - REPLACE (?1\x{0218})(?2\x{0219})(?3\x{021A})(?4\x{021B}) ( Romanian characters S and T with comma below )
- Re-save your file
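The same cedilla to comma-below swap can be sketched in Python with a translation table, for anyone scripting the conversion instead of using the Replace dialog. The names below are hypothetical and the text is assumed to be already decoded to Unicode.

```python
# Cedilla letters (U+015E, U+015F, U+0162, U+0163) and their
# comma-below counterparts (U+0218, U+0219, U+021A, U+021B), in the same order.
CEDILLA_TO_COMMA = str.maketrans(
    "\u015E\u015F\u0162\u0163",
    "\u0218\u0219\u021A\u021B",
)

def cedilla_to_comma_below(text: str) -> str:
    """Swap S/T with cedilla for the Romanian S/T with comma below."""
    return text.translate(CEDILLA_TO_COMMA)
```

Whether the result displays properly still depends on the font and on the player, as discussed earlier in the thread.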
P.P.S. :
Those of you whose mother tongue is English are really lucky ! You have to worry very little about all these encoding problems ;-))
-
Thank you for your time and answers, gentlemen, wasn’t expecting such support on this forum.
-
Hello!
I do not want to create a new topic because my problem is pretty much the same. But I would like a short answer; I’m not interested in the character encoding details.
So, I have a lot of .cpp files, all of them are ANSI and, to my luck, every comment is written in Korean. For example, there is a comment: “ÇöŔç Ŕ§ÄˇżˇĽ »çżëÇŇ Ľö ľř˝Ŕ´Ď´Ů.”
(This means “Not available at this location.” if I change the character encoding to Windows-949, but that is not important now.)
A few Notepad++ versions ago I was able to search my source files for specially encoded characters, but nowadays I can’t.
So my question is, what can I do to fix the search?
I do not want to install an older version of Notepad++ just because of this, but I think I must. What happened? Why doesn’t the search work correctly anymore? What can I do?
Thank you in advance!