    Search in folder (encoding)

    41 Posts 7 Posters 4.2k Views
    • Mayson KwordM
      Mayson Kword @PeterJones
      last edited by Mayson Kword

      @PeterJones, not related at all.

      • I’m using “Find in files”, not “Find all in all opened documents” or “Find all in current document”. I have no files open.
      • File length doesn’t matter; I’ve tried a document with 59 characters in total.
      • My search returns “0 hits in 0 files” while in issue 8034 N++ returns at least a filename. Vitalii Dovgan says:

      Notepad++ is not able to show this line in the Search Results correctly
      Notepad++ does not jump to the matching word when double-clicking this line in the Search Results

      In my case there are no lines at all in the results.

      • Mayson KwordM
        Mayson Kword
        last edited by

        Some investigations. It seems my problem is a bit more complicated.

        Pic 1 - Search “Мисти” doesn’t work. “РњРёСЃС‚Рё” is ok - 2 results.
        Pic 2 - Search “Мисти” still not working. “РњРёСЃС‚Рё” is ok.
        Pic 3 - Search “Мисти” now works perfectly, but “РњРёСЃС‚Рё” does not.

        Case 3 is the expected behavior, I suppose. So the “Find in files” function actually searches in ANSI encoding (and finds nothing, of course) if the file contains this strange data.

        • PeterJonesP
          PeterJones @Mayson Kword
          last edited by

          @Mayson-Kword said in Search in folder (encoding):

          So the “Find in files” function actually searches in ANSI encoding

          I was vaguely remembering something like that discussed earlier, but I cannot find the relevant discussion and/or issue: sorry. (I thought I’d found it with previous links, but apparently I misread.)

          • guy038G
            guy038
            last edited by guy038

            Hello, @mayson-kword, @Peterjones and All,

            I cannot reproduce your issue !

            • I pasted the string “ты знаешь о” into a new UTF-8 encoded file

            • In addition, I saved it, in the directory of my portable v7.9.2 installation, as ты знаешь.txt

            • As you can verify in the first picture, this file is not currently open

            • Searching in all files, through all levels of the D:\@@\792 folder, does find the line of your file containing the Russian characters !


            ef63d3da-7804-4f94-80fe-9b318dea1694-image.png

            7fbebbc8-3acc-431e-860f-a3c0d8bd1fd6-image.png

            Best regards

            guy038

            • Alan KilbornA
              Alan Kilborn @Mayson Kword
              last edited by

              @Mayson-Kword said in Search in folder (encoding):

              Changes in these options do nothing with search in folder.

              This is curious, as the way I believe it works (and has to work) is that N++ opens each file matching the Find in Files specification for Directory and Filters in a tab that isn’t shown to the user (if the file to be searched is not already open).
              That tab is opened in the same way as a normal tab in all respects except visibility.
              At least this is my knowledge about it – which is limited. :-)

              Can you provide your Debug Info found on the ? menu?

              • Mayson KwordM
                Mayson Kword @guy038
                last edited by Mayson Kword

                @guy038, if you read my posts again: the search fails only if there is unknown data in the file. See this post. My apologies.

                @Alan-Kilborn, of course.

                Notepad++ v7.9.1 (64-bit)
                Build time : Nov 2 2020 - 01:07:46
                Path : D:\Programs\Notepad++\notepad++.exe
                Admin mode : OFF
                Local Conf mode : ON
                OS Name : Windows 10 Pro (64-bit)
                OS Version : 2004
                OS Build : 19041.685
                Current ANSI codepage : 1251
                Plugins : AutoCodepage.dll DSpellCheck.dll JSMinNPP.dll MarkdownViewerPlusPlus.dll mimeTools.dll NppConverter.dll NppExec.dll NppExport.dll XMLTools.dll

                Maybe I also should provide my example file?

                • Alan KilbornA
                  Alan Kilborn @Mayson Kword
                  last edited by

                  @Mayson-Kword said in Search in folder (encoding):

                  Maybe I also should provide my example file?

                  Yes, in general the more you can provide to help someone reproduce, the better!

                  AutoCodepage.dll

                  I wonder if this plugin is interacting in some way?

                  • Mayson KwordM
                    Mayson Kword @Alan Kilborn
                    last edited by

                    @Alan-Kilborn, here they are, 3 files: 01 is the original, 02 is cut down, 03 is minimal. The search word is “Мисти” (correct but not working) or “РњРёСЃС‚Рё” (incorrect but working); here is my search result.

                    Also, the AutoCodepage plugin was my attempt to solve this problem, but it didn’t help. However, it works pretty nicely when I open files, not when I search in them. The problem remains both with and without it.

                    Thank you for so much attention.

                    • guy038G
                      guy038
                      last edited by guy038

                      Hi, @mayson-kword, @Peterjones, @alan-kilborn and All,

                      First, thanks for your files that I could download without any problem

                      Now, file encodings are really a puzzle for everybody and it’s a bit difficult to do pertinent tests because :

                      • From your Debug-Info, your current ANSI codepage is 1251

                      • From my Debug-Info, my current ANSI codepage is 1252

                      For instance, when one opens the jackie_default_01.json file, which Notepad++ opens as an ANSI encoded file, each of the 8 occurrences of the Russian word Мисти, found in lines 13, 14, 21, 22, 35, 123, 127 and 128, is displayed as :

                      • ÐœÐ¸Ñ[HOP]Ñ‚Ð¸ in my configuration, as ANSI really represents the Windows-1252 encoding

                      • РњРёСЃС‚Рё, with your configuration, as ANSI really represents the Windows-1251 encoding


                      Refer to the table, below, and the following links :

                      https://en.wikipedia.org/wiki/Windows-1251

                      https://en.wikipedia.org/wiki/Windows-1252

                      https://www.unicode.org/charts/PDF/U0400.pdf

                      •--------------•-------------•-------------•-------------•-------------•
                      |      М       |     и       |    с        |    т        |     и       |
                      •--------------•-------------•-------------•-------------•-------------•
                      |  CAPITAL EM  |   SMALL I   |   SMALL ES  |  SMALL TE   |   SMALL I   |    CYRILLIC letters
                      •--------------•-------------•-------------•-------------•-------------•
                      |    U+041C    |   U+0438    |   U+0441    |   U+0442    |   U+0438    |    UNICODE code-points
                      •--------------•-------------•-------------•-------------•-------------•
                      |   D0   9C    |   D0   B8   |   D1   81   |   D1   82   |   D0   B8   |    BYTE values of the characters with an UTF-8 encoding
                      •--------------•-------------•-------------•-------------•-------------•
                      |   Р    њ     |   Р    ё    |   С    Ѓ    |   С    ‚    |   Р    ё    |    Characters displayed, AFTER "Encoding > Character sets > Cyrillic > Windows-1251"
                      •--------------•-------------•-------------•-------------•-------------•
                      |   Ð    œ     |   Ð    ¸    |   Ñ    HOP  |   Ñ    ‚    |   Ð    ¸    |    Characters displayed, AFTER "Encoding > Character sets > Western European > Windows-1252"
                      •--------------•-------------•-------------•-------------•-------------•
                      

                      It’s important to understand that, when you’re using one of the ANSI, UTF-8, … or Character sets options, your file contents do not change at all. Notepad++ just re-interprets the file bytes according to the newly selected encoding !
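The re-interpretation described above can be reproduced outside Notepad++. The following is a minimal Python sketch, not a description of Notepad++ internals: it takes the UTF-8 bytes from the table and decodes them with the two single-byte code pages. The variable names are illustrative, and `errors="replace"` is used because Python's `cp1252` codec leaves byte 0x81 undefined.

```python
# UTF-8 bytes of the word, exactly as they are stored on disk.
word = "Мисти"
raw = word.encode("utf-8")
print(raw.hex(" "))  # d0 9c d0 b8 d1 81 d1 82 d0 b8

# Re-interpret the SAME bytes with a single-byte code page.
# Nothing is converted: every byte simply becomes one character.
as_cp1251 = raw.decode("cp1251")
# Python's cp1252 codec has no mapping for 0x81, so replace it with U+FFFD.
as_cp1252 = raw.decode("cp1252", errors="replace")
print(as_cp1251)
print(as_cp1252)

# A search for the correctly typed word cannot match the misread text.
print(word in as_cp1251)  # False

# Encoding the misread text back with the same code page returns the
# original bytes, which shows that the file content itself never changed.
print(as_cp1251.encode("cp1251") == raw)  # True
```

Comparing the two printed strings with the last two rows of the table shows the same characters, byte for byte.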


                      So I suppose that if you want to change the UTF-8 encoding of a file to your current ANSI encoding ( so Windows-1251 ) you should :

                      • Run the option Encoding > Convert to ANSI

                      • Save the modifications

                      As you can see, the Russian word Мисти is still displayed ( but with the ANSI encoding ) and a search for the string Мисти should work as expected !

                      Refer to the table, below, and the link :

                      https://en.wikipedia.org/wiki/Windows-1251

                      •--------------•-------------•-------------•-------------•-------------•
                      |      М       |     и       |    с        |    т        |     и       |
                      •--------------•-------------•-------------•-------------•-------------•
                      |  CAPITAL EM  |   SMALL I   |   SMALL ES  |  SMALL TE   |   SMALL I   |    CYRILLIC letters
                      •--------------•-------------•-------------•-------------•-------------•
                      |    U+041C    |   U+0438    |   U+0441    |   U+0442    |   U+0438    |    UNICODE code-points
                      •--------------•-------------•-------------•-------------•-------------•
                      |   D0   9C    |   D0   B8   |   D1   81   |   D1   82   |   D0   B8   |    BYTE values with an UTF-8 encoding
                      •--------------•-------------•-------------•-------------•-------------•
                      |      CC      |     E8      |     F1      |     F2      |    E8       |    BYTE values, of the word "Мисти", AFTER "Encoding > Convert to ANSI"
                      •--------------•-------------•-------------•-------------•-------------•
                      

                      This time, as you have used an Encoding > Convert ... option, the file did change : the previous byte values of the characters are replaced with the byte values of these characters in the new encoding !
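For comparison, a real conversion can be sketched in Python as well. This is only an illustration of the byte values in the table, assuming `cp1251` stands in for the "ANSI" code page of the poster's system; the helper `survives_round_trip` is a hypothetical name. It also shows why a conversion to ANSI can lose data when a file contains characters outside that code page.

```python
word = "Мисти"

utf8_bytes = word.encode("utf-8")    # two bytes per Cyrillic letter
ansi_bytes = word.encode("cp1251")   # one byte per Cyrillic letter
print(utf8_bytes.hex(" "))  # d0 9c d0 b8 d1 81 d1 82 d0 b8
print(ansi_bytes.hex(" "))  # cc e8 f1 f2 e8


def survives_round_trip(text, codec="cp1251"):
    """Return True if text can be converted to codec and back unchanged."""
    try:
        return text.encode(codec).decode(codec) == text
    except UnicodeEncodeError:
        # At least one character has no equivalent in the target code page.
        return False


print(survives_round_trip("Мисти"))    # True
print(survives_round_trip("Мисти ✓"))  # False
```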

                      Best Regards,

                      guy038

                      • guy038G
                        guy038
                        last edited by

                        Hi @mayson-kword and All,

                        Sorry, in my previous post I said :

                        • ÐœÐ¸Ñ[HOP]Ñ‚Ð¸ in your configuration, as ANSI really represents the Windows-1251 encoding
                        • РњРёСЃС‚Рё, with my configuration, as ANSI really represents the Windows-1252 encoding

                        In fact, the correct phrasing is the opposite :

                        • ÐœÐ¸Ñ[HOP]Ñ‚Ð¸ in my configuration, as ANSI really represents the Windows-1252 encoding

                        • РњРёСЃС‚Рё, with your configuration, as ANSI really represents the Windows-1251 encoding

                        Note that I also modified my previous post to be exact !

                        BR

                        guy038

                        • Mayson KwordM
                          Mayson Kword @guy038
                          last edited by

                          @guy038, converting file to ANSI works for my search, but it’s not a solution.

                          1. As I said in my first post, I have a lot of files - 3222 (in all sub-folders) at this moment. I can’t open them all and convert to ANSI, then edit, then convert back to UTF8 (because I need them in UTF8).
                          2. Even if I had few files converting them to UTF8 and back to ANSI is not a solution, because the result is not equal to the original file. See the image. I need no such structural changes, because these files will be put in a specific archive and used in a game. Ah, it’s all very fragile.

                          Some users can work around this problem, of course: convert the file to ANSI and back to UTF8, remove the problematic symbols, or open all the files in the current session. But I can’t do that, because such converting and editing changes my files too much, and there are too many files to open them all.

                          Look at this post once again. The only difference between case 2 and case 3 is one “NUL” symbol, and it makes N++ unable to find anything I want. I can provide more files to compare. Search works weirdly if the file contains wrong data - Pic 1 and Pic 2.
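To find out which files carry such a NUL byte without opening them in the editor, a small Python check can help. This is a minimal sketch: the function name `describe` and the returned keys are illustrative, and it only reports facts about the bytes. It does not reproduce how Notepad++ detects encodings. Note that a NUL byte is still valid UTF-8, so the file stays decodable even when it contains one.

```python
import codecs


def describe(data):
    """Report a few byte-level facts about a file's content."""
    try:
        data.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    return {
        "has_utf8_bom": data.startswith(codecs.BOM_UTF8),
        "valid_utf8": valid_utf8,
        "nul_bytes": data.count(b"\x00"),
    }


# Sample content mimicking the reported case: UTF-8 text plus one NUL byte.
clean = "Мисти".encode("utf-8")
with_nul = clean + b"\x00"

print(describe(clean))
print(describe(with_nul))
```

For a real file, pass the result of `pathlib.Path(name).read_bytes()` to `describe`.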

                          • Mayson KwordM
                            Mayson Kword @Mayson Kword
                            last edited by

                            @Mayson-Kword said in Search in folder (encoding):

                            Even if I had few files converting them to UTF8 and back to ANSI is not a solution

                            Sorry for mistake.
                            I mean “Even if I had few files converting them to ANSI and back to UTF8 is not a solution”.
                            Can’t edit post after 3 minutes.

                            • gstaviG
                              gstavi
                              last edited by

                              What did you mean by:

                              Also, the AutoCodepage plugin was my attempt to solve this problem, but it didn’t help. However, it works pretty nicely when I open files

                              Do you have problems detecting encoding in general? If so there is no reason to expect search to work for non-ascii strings.
                              Try saving as UTF8 with BOM. The BOM makes detection of UTF8 files explicit.

                              • Mayson KwordM
                                Mayson Kword @gstavi
                                last edited by Mayson Kword

                                @gstavi said in Search in folder (encoding):

                                Do you have problems detecting encoding in general?

                                Yes and no; it only happens with some rare files if I don’t use the “autodetect character encoding” feature, and sometimes even with it on. In that case N++ opens them using ANSI, so I can’t read anything and I need to select the encoding manually. AutoCodepage, with my own rules, just does this work automatically for me.

                                However, there is an idea in your words. After some investigation, I can suggest that N++ can’t search as expected with “Find in files” if the file is not currently open and N++ can’t autodetect its encoding. Am I right?

                                Also, using UTF8 with BOM solves this problem in the best way, I think. No data loss, at least. Is there any way to automatically convert thousands of files from UTF8 without BOM to UTF8 with BOM?

                                • gstaviG
                                  gstavi
                                  last edited by gstavi

                                  Adding a BOM is not a conversion; it is just adding 3 bytes (for the UTF-8 BOM) at the beginning of the file.
                                  https://stackoverflow.com/questions/3127436/adding-bom-to-utf-8-files

                                  The 2 risks are:

                                  • Adding BOM into already “BOMed” file.
                                  • Adding UTF-8 BOM into file with another encoding (e.g. UCS-2).
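Both risks can be guarded against in a short script. The following is a minimal Python sketch, not a tested production tool: `add_utf8_bom` and its status strings are hypothetical names, and the extra check that the content decodes as UTF-8 is an added precaution rather than something required by the description above.

```python
import codecs
import tempfile
from pathlib import Path

# BOMs of other Unicode encodings; a file starting with one is left alone.
OTHER_BOMS = (
    codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE,
    codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE,
)


def add_utf8_bom(path):
    """Prepend the 3-byte UTF-8 BOM to a file, unless that would be unsafe."""
    path = Path(path)
    data = path.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        return "unchanged: already has a UTF-8 BOM"
    if data.startswith(OTHER_BOMS):
        return "skipped: starts with a UTF-16/UTF-32 BOM"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "skipped: content is not valid UTF-8"
    path.write_bytes(codecs.BOM_UTF8 + data)
    return "added"


# Small demonstration on a temporary file.
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "sample.txt"
    sample.write_bytes("Мисти".encode("utf-8"))
    print(add_utf8_bom(sample))  # added
    print(add_utf8_bom(sample))  # unchanged: already has a UTF-8 BOM
```

Running it on copies of the files first is advisable, since the function rewrites files in place.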

                                  I never needed to use AutoCodepage and don’t know what hook from Notepad++ it uses to apply its functionality but maybe either its author or Notepad++ main developers can “fix” the problem by ensuring that it is called during search as well.

                                  • Mayson KwordM
                                    Mayson Kword
                                    last edited by Mayson Kword

                                    Ok, using files with a BOM solves my problem, because I can easily check files for a BOM and change them if needed using Python scripting. However, N++ still doesn’t search properly in a closed file if it fails to autodetect its encoding.
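The checking step mentioned here could look roughly like the sketch below. It is only one possible approach, not the poster's actual script: the function name `files_without_utf8_bom` and the `pattern` parameter are assumptions. It walks a folder and all its sub-folders and reads only the first three bytes of each file.

```python
import codecs
import tempfile
from pathlib import Path


def files_without_utf8_bom(root, pattern="*"):
    """Return the files under root (recursively) that do not start with a UTF-8 BOM."""
    missing = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            head = handle.read(len(codecs.BOM_UTF8))  # first 3 bytes only
        if head != codecs.BOM_UTF8:
            missing.append(path)
    return missing


# Small demonstration on a temporary folder tree.
with tempfile.TemporaryDirectory() as folder:
    root = Path(folder)
    (root / "sub").mkdir()
    (root / "with_bom.txt").write_bytes(codecs.BOM_UTF8 + b"ok")
    (root / "sub" / "no_bom.txt").write_bytes(b"ok")
    for path in files_without_utf8_bom(root, "*.txt"):
        print(path.name)  # no_bom.txt
```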

                                    Thank you all for your participation, that’s all.

                                    • guy038G
                                      guy038
                                      last edited by

                                      Hi @mayson-kword and All,

                                      I was able to download your txt.zip archive and correctly extract your two files example file 01.txt and example file 02.txt

                                      With these two files opened in N++ v7.9.2, I got two occurrences of the string они должны when searching in all opened tabs of the current session. So, again, I could not reproduce the issue !

                                      After a while, I realized that, for a correct search, you need to tick the option Settings > Preferences > MISC. > Autodetect character encoding

                                      If this option is not checked, the string они должны is found only in the file example file 02.txt, which does not contain an ending \x00, whether or not the files are opened in the current N++ session


                                      • When this option is enabled, and the two files opened in current session, a click on the Find All in All Opened Documents, of the Find dialog, produces :

                                      649590fe-4e98-44e3-b8ca-94180ec3f996-image.png

                                      • When this option is enabled, and your two files not opened, a click on the Find All button, of the Find in Files dialog, produces :

                                      51b51f60-ed11-4833-8f56-379c89520f64-image.png


                                      Now, about the possibility of changing a UTF-8 file to a UTF-8-BOM encoded file, this is very easy, from within Notepad++

                                      • Open the Find in Files dialog ( Ctrl + Shift + F )

                                      • SEARCH \A

                                      • REPLACE \x{FEFF}

                                      • Select the Regular expression search mode

                                      • Select the folder containing all your UTF-8 files which have to be modified

                                      • Select the appropriate files filter

                                      • Click on the Replace All button and confirm the dialog

                                      Voilà !


                                      If you want, first, to test this technique :

                                      • Open a UTF-8 file ( in the status bar, you should see UTF-8 and not UTF-8-BOM )

                                      • Open the Replace dialog ( Ctrl + H ) and enter the same SEARCH \A and REPLACE \x{FEFF} values as above

                                      • Tick the Wrap around option

                                      • Select the Regular expression search mode

                                      • Click, ONCE  only, on the Replace All button ( Do not click, previously, on the Find Next button or any other button ! )

                                      => Message Replace All: 1 occurrence was replaced in entire file

                                      • Save, immediately, the modifications with Ctrl + S ( Just note that it still mentions the UTF-8 encoding, in the status bar )

                                      • Close this file ( Ctrl + W )

                                      • Re-open the file ( Ctrl + Shift + T )

                                      => In the status bar, we can verify, this time, the indication UTF-8-BOM, which proves that we are dealing, now, with a real UTF-8 file with a BOM.

                                      Best Regards

                                      guy038

                                      • Alan KilbornA
                                        Alan Kilborn @guy038
                                        last edited by

                                        @guy038 said in Search in folder (encoding):

                                        about the possibility of changing a UTF-8 file to a UTF-8-BOM encoded file, this is very easy, from within Notepad++

                                        One thing to note about this is that it is a one-way operation.
                                        You can’t go the other way (UTF-8-BOM —> UTF-8) using a similar technique (a N++ replacement).
                                        @guy038, am I right about it?

                                        • guy038G
                                          guy038
                                          last edited by

                                          Hi, @alan-kilborn,

                                          Exactly right, Alan. Indeed, the BOM is not considered part of the file contents by Notepad++ !
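Outside the editor, the reverse direction is a matter of dropping the first three bytes. The following is a minimal Python sketch of that idea, with `strip_utf8_bom` as a hypothetical helper name. It also shows that Python's `utf-8-sig` codec treats the BOM as a signature and not as content, while the plain `utf-8` codec keeps it as the character U+FEFF.

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "bar".encode("utf-8")

# The plain codec keeps the BOM as a character; utf-8-sig drops it.
print(repr(with_bom.decode("utf-8")))      # '\ufeffbar'
print(repr(with_bom.decode("utf-8-sig")))  # 'bar'

print(strip_utf8_bom(with_bom))  # b'bar'
print(strip_utf8_bom(b"bar"))    # b'bar'
```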

                                          Note also that I found a bug, related to the \A assertion, while testing my method to add the BOM with a regex S/R

                                          Give me a few minutes to describe the problem, and you can tell me whether I should create an issue for this behaviour !

                                          BR

                                          guy038

                                          • guy038G
                                            guy038
                                            last edited by guy038

                                            Hi, @alan-kilborn and All,

                                            Sorry for the wait, I needed to eat something !

                                            Alan, I think that you already spoke about a similar behaviour, but I cannot remember the exact post

                                            Just follow all these steps to see the issue !

                                            • Open a new tab ( Ctrl + N )

                                            • Type the three letters bar, only

                                            • Save this new tab as Test.txt

                                            • Open the Replace dialog ( Ctrl + H )

                                            • SEARCH \A

                                            • REPLACE foo

                                            • Tick on the Wrap around option

                                            • Select the Regular expression search mode

                                            • Click on the Replace All button

                                            => As expected, the file contents are changed into the string foobar !

                                            Now :

                                            • Undo the modifications ( Ctrl + Z )

                                            • Re-open the Replace dialog ( Ctrl + H )

                                            • SEARCH \A ( Verify that text is indeed \A )

                                            • REPLACE foo

                                            • Tick on the Wrap around option

                                            • Select the Regular expression search mode

                                            • Click, first, on the Find Next button ( Important )

                                            => The classical call tip appears, saying zero length match

                                            • Now, click on the Replace All button

                                            => This time, no replacement occurs, even if you click, again, on the Replace All button

                                            • Even if you switch to another tab and switch back to the Test.txt file

                                            => The same regex S/R as above, with only a click on the Replace All button, does not work anymore ! And you always get the message Replace All: 0 occurrences were replaced in entire file


                                            • In order to get the expected behaviour, you must :

                                            • Close this file ( Ctrl + W )

                                            • Re-open the Test.txt file ( Ctrl + Shift + T )

                                            or

                                            • Close and re-start Notepad++, of course !

                                            After these operations, a click on the Replace All button is functional again and does add the string foo, right before the string bar !


                                            Here is my Debug-Info :

                                            Notepad++ v7.9.2   (32-bit)
                                            Build time : Dec 31 2020 - 03:58:36
                                            Path : D:\@@\792\notepad++.exe
                                            Admin mode : OFF
                                            Local Conf mode : ON
                                            OS Name : Microsoft Windows XP (32-bit) 
                                            OS Build : 2600.0
                                            Current ANSI codepage : 1252
                                            Plugins : DSpellCheck.dll mimeTools.dll NppConverter.dll NppExport.dll 
                                            

                                            Best Regards

                                            guy038
