Search in folder (encoding)
-
Hello, @mayson-kword, @Peterjones and All,
I cannot reproduce your issue !
-
I pasted the string
“ты знаешь о” phrase
in a a newUTF-8
encoded file -
In addition, I saved it, in the directory of my portable
v7.9.2
installation, asты знаешь.txt
-
As you can verify in the first picture, this file is not presently opened
-
Searching in any file , through all levels of the
D:\@@\792
folder, does find the line of your file with some Russian characters !
Best regards
guy038
-
-
@Mayson-Kword said in Search in folder (encoding):
Changes in theese options do nothing with search in folder.
This is curious as the way I believe it works (and has to work) is that N++ opens each file that matches the Find in File specification for Directory and Filters into a tab that isn’t shown to the user (if the file it is going to search is not already open).
That tab is opened the same way in all other respects except for visibility as a normal tab.
At least this is my knowledge about it – which is limited. :-)Can you provide your Debug Info found on the ? menu?
-
@guy038, you can read my posts some more, search doesn’t work only if there is unknown data in file. This post. My apologies.
@Alan-Kilborn, of course.
Notepad++ v7.9.1 (64-bit)
Build time : Nov 2 2020 - 01:07:46
Path : D:\Programs\Notepad++\notepad++.exe
Admin mode : OFF
Local Conf mode : ON
OS Name : Windows 10 Pro (64-bit)
OS Version : 2004
OS Build : 19041.685
Current ANSI codepage : 1251
Plugins : AutoCodepage.dll DSpellCheck.dll JSMinNPP.dll MarkdownViewerPlusPlus.dll mimeTools.dll NppConverter.dll NppExec.dll NppExport.dll XMLTools.dllMaybe I also should provide my example file?
-
@Mayson-Kword said in Search in folder (encoding):
Maybe I also should provide my example file?
Yes, in general the more you can provide to help someone reproduce, the better!
AutoCodepage.dll
I wonder if this plugin is interacting in some way?
-
@Alan-Kilborn, here they are, 3 files. 01 is original, 02 is cut, 03 is minimal. Word for search is “Мисти” (correct but not working) or “Мисти” (incorrect but working), here is my search result.
Also plugin AutoCodepage was my try to solve this problem, but it didn’t help. However, it works pretty nice when I open files, not search in them. The problem remains both with and without it.
Thank you for so much attention.
-
Hi, @mayson-kword, @Peterjones, @alan-kilborn and All,
First, thanks for your files that I could download without any problem
Now, file encodings are really a puzzle for everybody and it’s a bit difficult to do pertinent tests because :
-
From your Debug-Info, your current
ANSI
codepage is1251
-
From my Debug-Info, my current
ANSI
codepage is1252
For instance, when one opens the
jackie_default_01.json
file, which is anANSI
encoded file, each of the8
occurrences of the Russian wordМисти
, found in lines13, 14, 21, 22, 35, 123, 127 and 128
, are encoded as :-
МиÑти
in my configuration, asANSI
represents really theWindows-1252
encoding -
Мисти
, with your configuration, asANSI
represents really theWindows-1251
encoding
Refer to the table, below, and the following links :
https://en.wikipedia.org/wiki/Windows-1251
https://en.wikipedia.org/wiki/Windows-1252
https://www.unicode.org/charts/PDF/U0400.pdf
•--------------•-------------•-------------•-------------•-------------• | М | и | с | т | и | •--------------•-------------•-------------•-------------•-------------• | CAPITAL EM | SMALL I | SMALL ES | SMALL TE | SMALL I | CYRILLIC letters •--------------•-------------•-------------•-------------•-------------• | U+041C | U+0438 | U+0441 | U+0442 | U+0438 | UNICODE code-points •--------------•-------------•-------------•-------------•-------------• | D0 9C | D0 B8 | D1 81 | D1 82 | D0 B8 | BYTE values of the characters with an UTF-8 encoding •--------------•-------------•-------------•-------------•-------------• | Р њ | Р ё | С Ѓ | С ‚ | Р ё | Characters displayed, AFTER "Encoding > Character sets > Cyrillic > Windows-1251" •--------------•-------------•-------------•-------------•-------------• | Ð œ | Ð ¸ | Ñ HOP | Ñ ‚ | Ð ¸ | Characters displayed, AFTER "Encoding > Character sets > Cyrillic > Windows-1252" •--------------•-------------•-------------•-------------•-------------•
It’s important to understand that, when you 're using the
ANSI
,UTF-8
, …Character sets
option, your file contents do not change at all . Notepad++ just re-interprets the file bytes as it would represent this new encoding !
So I suppose that if you want to change the
UTF-8
encoding of a file to your currentANSI
encoding ( soWindows-1251
) you should :-
Run the option
Encoding > Convert to ANSI
-
Save the modifications
As you can see the Russian word
Мисти
is still displayed ( but with theANSI
encoding ) and the search of the stringМисти
should work as expected !Refer the table, below, and the link :
https://en.wikipedia.org/wiki/Windows-1251
•--------------•-------------•-------------•-------------•-------------• | М | и | с | т | и | •--------------•-------------•-------------•-------------•-------------• | CAPITAL EM | SMALL I | SMALL ES | SMALL TE | SMALL I | CYRILLIC letters •--------------•-------------•-------------•-------------•-------------• | U+041C | U+0438 | U+0441 | U+0442 | U+0438 | UNICODE code-points •--------------•-------------•-------------•-------------•-------------• | D0 9C | D0 B8 | D1 81 | D1 82 | D0 B8 | BYTE values with an UTF-8 encoding •--------------•-------------•-------------•-------------•-------------• | CC | E8 | F1 | F2 | E8 | BYTE values, of the word "Мисти", AFTER "Encoding > Convert to ANSI" •--------------•-------------•-------------•-------------•-------------•
This time, as you have used a
Encoding > Convert ...
option, the file did change and the present byte values of the characters are replaced with the byte values of these characters in this new encoding !Best Regards,
guy038
-
-
Hi @mayson-kword and All,
Sorry, in my previous post I said :
МиÑти
in your configuration asANSI
represents really theWindows-1251
encoding
Мисти
, with my configuration asANSI
represents really theWindows-1252
encoding
In fact, the correct phrasing is the opposite :
-
МиÑти
in my configuration asANSI
represents really theWindows-1252
encoding -
Мисти
, with your configuration asANSI
represents really theWindows-1251
encoding
Note that I also modified my previous post to be exact !
BR
guy038
-
@guy038, converting file to ANSI works for my search, but it’s not a solution.
- As I said in my first post, I have a lot of files - 3222 (in all sub-folders) at this moment. I can’t open them all and convert to ANSI, then edit, then convert back to UTF8 (because I need them in UTF8).
- Even if I had few files converting them to UTF8 and back to ANSI is not a solution because the result is not equal to original file. See the image. I need no such structure changes because theese files will be put in a specific archive and used in game. Ah, it’s all very fragile.
Some users can fix this problem, of course. Convert file to ANSI and back to UTF8 / remove problematic symbols / open all files in current session. But I can’t do such thing because converting and such editing changes my files too much, also there are too many files to open them all.
Look at this post once again. The difference between case 2 and case 3 is only one “NUL” symbol - and it makes N++ not be able to find anything I want. I can provide more files to compare. Search works wierd if file contains wrong data - Pic 1 and Pic 2.
-
@Mayson-Kword said in Search in folder (encoding):
Even if I had few files converting them to UTF8 and back to ANSI is not a solution
Sorry for mistake.
I mean “Even if I had few files converting them to ANSI and back to UTF8 is not a solution”.
Can’t edit post after 3 minutes. -
What did you mean by:
Also plugin AutoCodepage was my try to solve this problem, but it didn’t help. However, it works pretty nice when I open files
Do you have problems detecting encoding in general? If so there is no reason to expect search to work for non-ascii strings.
Try saving as UTF8 with BOM. The BOM makes detection of UTF8 files explicit. -
@gstavi said in Search in folder (encoding):
Do you have problems detecting encoding in general?
Yes and no, it only happens with some rare files if I don’t use “autodetect character encoding” feature. Sometimes even with it on. In that case N++ open them using ANSI, so I can’t read anything and I need to select encoding manually. AutoCodepage with my own rules just do this work automatically instead of me.
However there is an idea in your words. After some investigations I can suggest that N++ can’t search as expected in file using “Find in files” if file is not currently opened and if N++ can’t autodetect its encoding. Am I right?
Also using UTF8 with BOM solves this problem in a best way, I think. No data loss at least. Is there any way to automatically convert thousands of files from UTF8 w/o BOM to UTF8 with BOM?
-
Adding BOM is not conversion it is just adding 3 bytes (for utf-8 bom) at the beginning of the file.
https://stackoverflow.com/questions/3127436/adding-bom-to-utf-8-filesThe 2 risks are:
- Adding BOM into already “BOMed” file.
- Adding UTF-8 BOM into file with another encoding (e.g. UCS-2).
I never needed to use AutoCodepage and don’t know what hook from Notepad++ it uses to apply its functionality but maybe either its author or Notepad++ main developers can “fix” the problem by ensuring that it is called during search as well.
-
Ok, using files with BOM solves my problem because I can easily check files for BOM and change them if needed using Python scripting. However, N++ still doesn’t search properly in closed file if fails to autodetect its encoding.
Thank you all for your participation, that’s all.
-
Hi @mayson-kword and All,
I was able to download your
txt.zip
archive and correctly extract your two filesexample file 01.txt
andexample file 02.txt
With these two files opened in N++
v7.9.2
, I got two occurrences of the stringони должны
searching in all opened tabs of the current session. So, I could not, again reproduce the issue !After a while, I realized that, for a correct search, you need to tick the option
Settings > Preferences > MISC. > Autodetect character encoding
If this option is not checked, the string
они должны
is found in the fileexample file 02.txt
only, which does not contains an ending\x00
, whatever the files are opened or not in current N++ session
- When this option is enabled, and the two files opened in current session, a click on the
Find All in All Opened Documents
, of theFind
dialog, produces :
- When this option is enabled, and your two files not opened, a click on the
Find All
button, of theFind in Files
dialog, produces :
Now, about the possibility of changing an
UTF-8
to anUTF-8-BOM
encoded file, this is very easy, from withinNotepad++
-
Open the Find in Files dialog (
Ctrl + Shift + F
) -
SEARCH
\A
-
REPLACE
\x{FEFF}
-
Select the
Regular expression
search mode -
Select the folder containing all your
UTF-8
files which have to be modified -
Select the appropriate files filter
-
Click on the
Replace All
button and confirm the dialog
Voilà !
If you want, first, to test this technique :
-
Open an
UTF-8
file ( In the status bar, you should seeUTF-8
( and notUTF-8-BOM
) -
Open the Replace dialog (
Ctrl + H
) -
Tick the
Wrap around
version -
Select the
Regular expression
search mode -
Click, ONCE only, on the Replace All button ( Do not click, previously, on the
Find Next
button or any other button ! )
=> Message
Replace All: 1 occurrence was replaced in entire file
-
Save, immediately, the modifications with
Ctrl + S
( Just note that it still mentions theUTF-8
encoding, in the status bar ) -
Close this file (
Ctrl + W
) -
Re-open the file (
Ctrl + Shift + T
)
=> In the status bar, we can verify, this time, the indication
UTF-8-BOM
, which proves that we are dealing, now, with a real UTF-8 file with aBOM.
Best Regards
guy038
- When this option is enabled, and the two files opened in current session, a click on the
-
@guy038 said in Search in folder (encoding):
about the possibility of changing an UTF-8 to an UTF-8-BOM encoded file, this is very easy, from within Notepad++
One thing to note about this is that it is a one-way operation.
You can’t go the other way (UTF-8-BOM —> UTF-8) using a similar technique (a N++ replacement).
@guy038, am I right about it? -
Hi, @alan-kilborn,
Totally exact, Alan. Indeed, the
BOM
structure is not considered as part of file contents by Notepad++ !Note also, that I found out a bug, relative to the
\A
assertion, while testing my method to add theBOM
with a regex S/RLet me some minutes to expose the problem and you tell me if I must create an issue for such a behaviour !
BR
guy038
-
Hi, @alan-kilborn and All,
Sorry for the wait, I need to eat a little bit !
Alan, I think that you already spoke about a similar behaviour, but I cannot remember the exact post
Just follow all these steps to see the issue !
-
Open a new tab (
Ctrl + N
) -
Type the three letters
bar
, only -
Save this new tab as
Test.txt
-
Open the
Replace
dialog (Ctrl + H
) -
SEARCH
\A
-
REPLACE
foo
-
Tick on the
Wrap around
option -
Select the
Regular expression
search mode -
Click on the
Replace All
button
=> As expected, the file contents are changed into the string
foobar
!Now :
-
Undo the modifications (
Ctrl + Z
) -
Re-open the
Replace
dialog (Ctrl + H
) -
SEARCH
\A
( Verify that text is indeed\A
) -
REPLACE
foo
-
Tick on the
Wrap around
option -
Select the
Regular expression
search mode -
Click, first, on the
Find Next
button ( Important )
=> The classical call tip appears, saying
zero length match
- Now, click on the
Replace All
button
=> This time, no replacement occurs, even of you click, again, on the
Replace All
button- Even if you switch to an other tab and switch back to the
Test.txt
file
=> The same regex S/R, as above, with, only, a click on the
Replace All
button does not work anymore ! And you always get the messageReplace All: 0 occurrences were replaced in entire file
-
In order to get the expected behaviour, you must :
-
Close this file (
Ctrl + W
) -
Re-open the Test.txt file (
Ctrl + Shift + T
)
or
- Close and re-start Notepad++, of course !
After these operations, a click on the
Replace All
button is, again, functional and do add the stringfoo
, right before the stringbar
!
Here is my debug -info :
Notepad++ v7.9.2 (32-bit) Build time : Dec 31 2020 - 03:58:36 Path : D:\@@\792\notepad++.exe Admin mode : OFF Local Conf mode : ON OS Name : Microsoft Windows XP (32-bit) OS Build : 2600.0 Current ANSI codepage : 1252 Plugins : DSpellCheck.dll mimeTools.dll NppConverter.dll NppExport.dll
Best Regards
guy038
-
-
I can reproduce that ; nice instructions!
You should open a real issue on that.
It’s a weird one.
I thought it might have something to do with the call-tip being active, but no, it doesn’t seem to have any bearing on it.Allan, I think that you already spoke about a similar behaviour, but I cannot remember the exact post
I don’t remember this at all, but at my advanced age…
I’m sure you could find the post, point me to it, and it would be like a stranger had written it. :-) -
You cannot reproduce the issue because N++ autodetect encoding properly for most of files if using feature “Autodetect character encoding”. But there are some files that N++ cannot interpret right even with this option on. My file “jackie_default_01.json” is one of them. Also 3221 more files are one of them.
In case if you didn’t see my words, I can repeat myself from this post. N++ doesn’t search properly in closed file if fails to autodetect its encoding. Maybe I should repeat this once more in a different way? Conditions to reproduce the issue are:
- File is not opened.
- N++ cannot autodetect its encoding.
You said “I could not, again reproduce the issue” just because you don’t meet the second condition.
-
Hello, @alan-kilborn and All,
Ah, here we are, I found out this related post :
https://community.notepad-plus-plus.org/topic/19456/regex-replace-doesn-t-work/4
Remember : At this time ( May 2020 ), @uhf7 still did not change the behaviour of the
look_behind
structures. This was done in October and functional with theV7.9.1
version !https://github.com/notepad-plus-plus/notepad-plus-plus/pull/9008
So, in the log text, below, the OP wanted to change the
_
between date and time, with a single space character2020-01-01_00:02:13 MQTT2_DVES_834483 ENERGY_Power: 0 2020-01-01_00:02:13 MQTT2_DVES_834483 ENERGY_Voltage: 268 2020-01-01_00:02:13 MQTT2_DVES_834483 Time: 2020-01-01T01:02:12 2020-01-01_00:02:13 MQTT2_DVES_834483 ENERGY_ApparentPower: 0 2020-01-01_00:02:13 MQTT2_DVES_834483 ENERGY_Today: 0.000 2020-01-01_00:02:13 MQTT2_DVES_834483 ENERGY_Current: 0.000 2020-01-01_00:02:13 MQTT2_DVES_834483 ENERGY_ReactivePower: 0 2020-01-01_00:02:13 MQTT2_DVES_834483 ENERGY_Factor: 0.00 2020-01-01_00:02:13 MQTT2_DVES_834483 ENERGY_Period: 0
And the OP asked why the following regex
(?<=\d)_(?=\d)
did not work when using the Replace button, onlyLuckily, since the
v7.9.1
version you can, first, get the first match with a click on theFind Next
button, then hit, repeatedly, theReplace
button to do one replacement at a time. Many thanks @Uhf7 for this improvement ;-))
So, Alan, I thought that the behaviour of the
\A
assertion was related to the search buffer. Indeed, as soon as the zero length match, standing for the very beginning of current file, is matched, by a click on theFind Next
button, it’s just as if the regex engine has forgotten everything about the current position, which should be found because of theWrap around
option which loops the search from the end to the start of current file !?As it’s about midnight, in France, I’ll create a GitHub issue tomorrow !
BR
guy038