    Search in folder (encoding)

    • guy038G
      guy038
      last edited by

      Hi, @alan-kilborn,

      Exactly right, Alan. Indeed, Notepad++ does not consider the BOM to be part of the file contents !

      Note also that I found a bug, related to the \A assertion, while testing my method of adding the BOM with a regex S/R

      Give me a few minutes to describe the problem, and you can tell me whether I should create an issue for this behaviour !

      BR

      guy038

      • guy038G
        guy038
        last edited by guy038

        Hi, @alan-kilborn and All,

        Sorry for the wait, I needed to eat a little !

        Alan, I think that you already spoke about a similar behaviour, but I cannot remember the exact post

        Just follow all these steps to see the issue !

        • Open a new tab ( Ctrl + N )

        • Type the three letters bar, only

        • Save this new tab as Test.txt

        • Open the Replace dialog ( Ctrl + H )

        • SEARCH \A

        • REPLACE foo

        • Tick on the Wrap around option

        • Select the Regular expression search mode

        • Click on the Replace All button

        => As expected, the file contents are changed into the string foobar !

        Now :

        • Undo the modifications ( Ctrl + Z )

        • Re-open the Replace dialog ( Ctrl + H )

        • SEARCH \A ( Verify that text is indeed \A )

        • REPLACE foo

        • Tick on the Wrap around option

        • Select the Regular expression search mode

        • Click, first, on the Find Next button ( Important )

        => The classical call tip appears, saying zero length match

        • Now, click on the Replace All button

        => This time, no replacement occurs, even if you click, again, on the Replace All button

        • Even if you switch to another tab and switch back to the Test.txt file

        => The same regex S/R as above, with only a click on the Replace All button, does not work anymore ! And you always get the message Replace All: 0 occurrences were replaced in entire file


        • In order to get the expected behaviour, you must :

        • Close this file ( Ctrl + W )

        • Re-open the Test.txt file ( Ctrl + Shift + T )

        or

        • Close and re-start Notepad++, of course !

        After these operations, a click on the Replace All button is, again, functional and does add the string foo, right before the string bar !
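For comparison, here is a minimal sketch of the expected result, written in Python rather than inside Notepad++. Python's `re` module is a different engine from the Boost regex library used by Notepad++, so this only illustrates what the `\A` assertion is supposed to do; the helper name `prepend_at_start` is made up for the example.

```python
import re


def prepend_at_start(text: str, prefix: str) -> str:
    """Insert `prefix` at the very beginning of `text` using the \\A assertion."""
    # \A matches only at the start of the subject string, so there is exactly
    # one zero-length match. A function is used as the replacement so that
    # backslashes in `prefix` are not interpreted as escapes.
    return re.sub(r"\A", lambda match: prefix, text)


result = prepend_at_start("bar", "foo")
print(result)
```

Whatever was searched before, the same call keeps giving the same answer, which is the behaviour one would expect from Replace All as well.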


        Here is my debug info :

        Notepad++ v7.9.2   (32-bit)
        Build time : Dec 31 2020 - 03:58:36
        Path : D:\@@\792\notepad++.exe
        Admin mode : OFF
        Local Conf mode : ON
        OS Name : Microsoft Windows XP (32-bit) 
        OS Build : 2600.0
        Current ANSI codepage : 1252
        Plugins : DSpellCheck.dll mimeTools.dll NppConverter.dll NppExport.dll 
        

        Best Regards

        guy038

        • Alan KilbornA
          Alan Kilborn @guy038
          last edited by

          @guy038

          I can reproduce that ; nice instructions!
          You should open a real issue on that.
          It’s a weird one.
          I thought it might have something to do with the call-tip being active, but no, it doesn’t seem to have any bearing on it.

          Alan, I think that you already spoke about a similar behaviour, but I cannot remember the exact post

          I don’t remember this at all, but at my advanced age…
          I’m sure you could find the post, point me to it, and it would be like a stranger had written it. :-)

          • Mayson KwordM
            Mayson Kword
            last edited by

            You cannot reproduce the issue because N++ autodetects the encoding properly for most files when the “Autodetect character encoding” feature is enabled. But there are some files that N++ cannot interpret correctly even with this option on. My file “jackie_default_01.json” is one of them, and so are 3221 more files.

            In case you didn’t see my words, I can repeat myself from this post: N++ doesn’t search properly in a closed file if it fails to autodetect the file’s encoding. Maybe I should repeat this once more in a different way? The conditions to reproduce the issue are:

            • File is not opened.
            • N++ cannot autodetect its encoding.

            You said “I could not, again reproduce the issue” just because you don’t meet the second condition.
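To illustrate why a wrong encoding guess can make a search miss, here is a small Python sketch. It is an assumed model, not Notepad++'s actual search code: the helper `contains` simply decodes the bytes with whatever encoding was guessed and then looks for the search string.

```python
def contains(data: bytes, needle: str, assumed_encoding: str) -> bool:
    """Decode `data` with the guessed encoding, then search for `needle`."""
    text = data.decode(assumed_encoding, errors="replace")
    return needle in text


# The file is really UTF-8, but the second search guesses Windows-1252.
data = "naïve café".encode("utf-8")

found_with_right_guess = contains(data, "café", "utf-8")
found_with_wrong_guess = contains(data, "café", "cp1252")
found_ascii_part = contains(data, "caf", "cp1252")

print(found_with_right_guess, found_with_wrong_guess, found_ascii_part)
```

Pure ASCII search strings survive a wrong guess, which may be why the problem only shows up for some files and some searches.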

            • guy038G
              guy038
              last edited by guy038

              Hello, @alan-kilborn and All,

              Ah, here we are, I found out this related post :

              https://community.notepad-plus-plus.org/topic/19456/regex-replace-doesn-t-work/4

              Remember : At that time ( May 2020 ), @uhf7 had not yet changed the behaviour of look-behind constructs. This was done in October and has been functional since the v7.9.1 version !

              https://github.com/notepad-plus-plus/notepad-plus-plus/pull/9008


              So, in the log text, below, the OP wanted to change the _ between date and time, with a single space character

              2020-01-01_00:02:13 MQTT2_DVES_834483 ENERGY_Power: 0
              2020-01-01_00:02:13 MQTT2_DVES_834483 ENERGY_Voltage: 268
              2020-01-01_00:02:13 MQTT2_DVES_834483 Time: 2020-01-01T01:02:12
              2020-01-01_00:02:13 MQTT2_DVES_834483 ENERGY_ApparentPower: 0
              2020-01-01_00:02:13 MQTT2_DVES_834483 ENERGY_Today: 0.000
              2020-01-01_00:02:13 MQTT2_DVES_834483 ENERGY_Current: 0.000
              2020-01-01_00:02:13 MQTT2_DVES_834483 ENERGY_ReactivePower: 0
              2020-01-01_00:02:13 MQTT2_DVES_834483 ENERGY_Factor: 0.00
              2020-01-01_00:02:13 MQTT2_DVES_834483 ENERGY_Period: 0
              

              And the OP asked why the following regex (?<=\d)_(?=\d) did not work when using the Replace button, only

              Luckily, since the v7.9.1 version you can, first, get the first match with a click on the Find Next button, then hit, repeatedly, the Replace button to do one replacement at a time. Many thanks @Uhf7 for this improvement ;-))
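As a quick check of the regex itself, here is a short Python sketch applying `(?<=\d)_(?=\d)` to one of the log lines above. It uses Python's `re` module, not Notepad++, and the function name `split_date_and_time` is only for illustration.

```python
import re

# An underscore with a digit immediately before it and a digit immediately
# after it; the look-arounds do not consume the digits.
DATE_TIME_SEPARATOR = re.compile(r"(?<=\d)_(?=\d)")


def split_date_and_time(line: str) -> str:
    """Replace the underscore between date and time with a single space."""
    return DATE_TIME_SEPARATOR.sub(" ", line)


sample = "2020-01-01_00:02:13 MQTT2_DVES_834483 ENERGY_Power: 0"
fixed = split_date_and_time(sample)
print(fixed)
```

The other underscores on the line are left alone because each of them has a letter on at least one side.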


              So, Alan, I thought that the behaviour of the \A assertion was related to the search buffer. Indeed, as soon as the zero length match, standing for the very beginning of current file, is matched, by a click on the Find Next button, it’s just as if the regex engine has forgotten everything about the current position, which should be found because of the Wrap around option which loops the search from the end to the start of current file !?

              As it’s about midnight, in France, I’ll create a GitHub issue tomorrow !

              BR

              guy038

              • Mayson KwordM
                Mayson Kword @Mayson Kword
                last edited by

                Here is my video with the issue. https://www.youtube.com/watch?v=RSQpqvL6cHI

                • PeterJonesP
                  PeterJones @Mayson Kword
                  last edited by

                  @Mayson-Kword ,

                  I am sorry to be the bearer of bad news: The file you are showing in that video does not appear to be a text file.

                  5bf1d27e-7b8b-4c4d-a541-799d00eec18c-image.png

                  With all those black-boxed characters, it appears to be a binary file that happens to have some of its information look like text. Notepad++ cannot be expected to handle any arbitrary binary data that you throw at it, whether or not that binary data happens to contain some number of text strings. The non-text bytes interfere with its ability to do the job Notepad++ was designed to do: edit text. In some situations, you might be able to abuse the application to edit the text included in the binary file, but you have pushed beyond that.

                  Just because something can be loaded in Notepad++, and even just because you can read some of the text from that file in Notepad++, does not mean that the underlying file is actually a plaintext file. For example, when I look at notepad++.exe in Notepad++, I can find areas that look quite similar to what is shown in your video:
                  e795c012-abdd-45fd-bcb3-72848945df5e-image.png – it’s got plain text, that I can absolutely read, and be confident that it was intended to be text… but that’s really an excerpt of bytes from an executable file, not from a text file.

                  Expecting a meaningful search or search-and-replace result when using a text editor to edit non-text files is a rather unfair expectation, in my opinion.

                  see also this faq

                  You said “I could not, again reproduce the issue” just because you don’t meet the second condition.

                  Guy was showing results from the files you provided! Don’t complain to him for not meeting your condition when you supplied the file.

                  Getting back to my main point: while the video you showed does not appear to be a plain text file, the screenshots Guy showed of your example files do appear to be text. (I don’t download arbitrary zip files from users I don’t already trust, so I cannot verify myself that they are nothing more than plain text). But, if they really are text like it appears, then it is more reasonable to expect Notepad++ to be able to handle them.

                  In that case, if you wanted to share files that show the problem you are encountering, then share those, and maybe Guy or some other brave soul will download an arbitrary .zip and try to replicate your problem. If you believe that one or both of the files from that already-shared zip do show the problem, then Guy’s assessment disagrees with you, and you’ll have to explain again exactly how to replicate the problem with the files you shared.

                  • Mayson KwordM
                    Mayson Kword @PeterJones
                    last edited by

                    @PeterJones, got it. If my files are not text files, then N++ cannot understand them properly and there is no issue at all. Thank you, and my apologies for being rude. But I also want to mention that I really did provide the problem file: “jackie_default_01.json” from the “json.zip” archive. It is the same file as in the video. So your words “if you wanted to share files that show the problem you are encountering, then share those” sound strange. Guy’s assessment disagrees with me because Guy ignores the “jackie_default_01.json” file for some reason.

                    • Mayson KwordM
                      Mayson Kword
                      last edited by

                      Ah, if you think only one file is not enough, here are 5 more files for any brave soul.

                      • guy038G
                        guy038
                        last edited by guy038

                        Hi, @mayson-kword @peterjones, @alan-kilborn and All,

                        I did some tests and drew these conclusions :

                        For a non pure text file, which can contain many control codes, in reverse video, including the NUL character :

                        • Any line ends at the first Windows CRLF, Unix LF or Mac CR control code(s)

                        • In the Find result panel, any C0 or C1 control character, except for the NUL, LF and CR characters, is simply displayed as a standard character

                        • In the Find result panel, any line containing the search string is displayed only up to the first NUL character encountered

                        • Thus, if any line, as defined above, just begins with a NUL character, nothing is displayed, although it did find the search string for that line !


                        Demonstration :

                        667a93bc-1ff8-44e7-ba28-208551953028-image.png


                        I also verified that this behaviour occurs with ANSI or any UNICODE encoding and does not depend on the type of EOL characters, too !

                        So, @mayson-kword, unless you decide to work on a copy of your non-text files, in order to delete all the \x00 characters, it seems impossible to correctly get all the lines in the Find result window :-((
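A rough Python sketch of both points follows. The function `visible_part` only models the truncation described in the observations above (it is an assumption, not Notepad++ code), and `strip_nul_bytes` shows one way the NUL-free working copy could be produced.

```python
def visible_part(line: bytes) -> bytes:
    """Model of a display that stops at the first NUL byte (assumed behaviour)."""
    return line.split(b"\x00", 1)[0]


def strip_nul_bytes(data: bytes) -> bytes:
    """Return a copy of `data` with every \\x00 byte removed."""
    return data.replace(b"\x00", b"")


line = b"\x00hidden needle"
shown_before = visible_part(line)                   # nothing is shown
shown_after = visible_part(strip_nul_bytes(line))   # the whole line is shown
print(shown_before, shown_after)
```

This should only ever be run on a copy: removing bytes from a binary file will almost certainly make it unusable for its original program.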

                        Best regards

                        guy038

                        • Mayson KwordM
                          Mayson Kword @guy038
                          last edited by

                          @guy038, thank you a lot. There is no way to improve the autodetect encoding feature, but you’ve done as much as possible, including the advice to use a BOM, which solves my issue very well.

                          Have a nice day, all of you are great.
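For many files, the BOM could also be added outside the editor. Here is a minimal Python sketch, assuming the content really is UTF-8; the helper name `with_utf8_bom` is made up for the example.

```python
import codecs


def with_utf8_bom(data: bytes) -> bytes:
    """Return `data` with a UTF-8 BOM (EF BB BF) in front, added at most once."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


marked = with_utf8_bom(b"bar")
print(marked)

# The "utf-8-sig" codec drops the BOM again when decoding, so the text
# content itself is unchanged by the marker.
text = marked.decode("utf-8-sig")
print(text)
```

Adding the marker does not convert anything: a file in some other encoding would simply be mislabelled.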

                          • gstaviG
                            gstavi
                            last edited by

                            Notepad++ assumes that a file has encoding, meaning, the entire content of the file is text (Unicode symbols) using a single encoding. Notepad++ does not try to support files where every paragraph has different encoding or files that are essentially binary with pieces of “text” at some encoding embedded here and there.

                            Having said that, there are 2 major ways in which Notepad++ could improve the user experience in that regard, neither of which should be difficult to implement:

                            • If a specific encoding is not autodetected on opening a file, Notepad++ will default to ansi encoding (that should be called ascii encoding). That was reasonable 20 years ago. It is unreasonable today. UTF-8 should be the default, and since it is also backward compatible with ascii it should not hurt users.
                            • Notepad++ really needs a setting of the form “assume all files are of encoding XXX”, where XXX is selected from a combo box. My guess is that a vast majority of Notepad++ users have all their relevant files in a single encoding and they don’t need Notepad++ to autodetect (guess) it if they can just tell it once.
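The two suggestions can be sketched together in Python as follows. This is only an illustration of the idea, not a proposal for the actual Notepad++ code; the function name `decode_with_default` and its parameters are hypothetical.

```python
from typing import Tuple


def decode_with_default(
    data: bytes,
    default_encoding: str = "utf-8",
    fallback_encoding: str = "cp1252",
) -> Tuple[str, str]:
    """Try the user's chosen default first, and fall back only if it fails.

    Returns the decoded text together with the encoding that was used.
    """
    try:
        return data.decode(default_encoding), default_encoding
    except UnicodeDecodeError:
        # The bytes are not valid in the default encoding, so use the
        # legacy code page instead of guessing.
        return data.decode(fallback_encoding, errors="replace"), fallback_encoding


print(decode_with_default(b"plain English"))
print(decode_with_default("Ä".encode("utf-8")))
print(decode_with_default(b"\xc4"))
```

Because strict UTF-8 decoding rejects most legacy 8-bit text, the fallback is reached exactly in the cases where the default would have been wrong.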
                            • EkopalypseE
                              Ekopalypse
                              last edited by

                              @gstavi

                              No, afaik utf8 cannot replace ANSI code pages easily.
                              For example the byte c4 is Ä in cp1252 and Д in cp1251
                              and invalid in utf8.

                              But I agree, npp should have the possibility to let the user
                              force an encoding and it is, probably, a good idea to use utf8
                              as the default.
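The example can be checked with Python's built-in codecs; this is just a quick sketch of the three interpretations of a lone C4 byte.

```python
LONE_C4 = b"\xc4"

as_cp1252 = LONE_C4.decode("cp1252")  # Western European code page
as_cp1251 = LONE_C4.decode("cp1251")  # Cyrillic code page

try:
    LONE_C4.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    # A single C4 byte announces a 2-byte sequence that never arrives.
    valid_utf8 = False

print(as_cp1252, as_cp1251, valid_utf8)
```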

                              • guy038G
                                guy038
                                last edited by guy038

                                Hello, @ekopalypse, and All,

                                You said :

                                For example the byte c4 is Ä in cp1252 and Д in cp1251
                                and invalid in utf8

                                Eko, I do not agree with that statement : a C4 byte can be found in a UTF-8 file, as it is the first byte of the 2-byte sequences encoding the characters from Ā ( U+0100, coded as C4 80 ) to Ŀ ( U+013F, coded as C4 BF )

                                Refer to the link and the table below :

                                https://en.wikipedia.org/wiki/UTF-8#Codepage_layout

                                •-------•-------•--------------------------------------------------------•
                                | Start |  End  |                   Description                          |
                                •-------•-------•--------------------------------------------------------•
                                |   00  |   7F  |  UNIQUE byte of a 1-byte sequence ( ASCII character )  |
                                |   80  |   BF  |  CONTINUATION byte of a sequence  ( from 1ST to 3RD )  |
                                |   C0  |   C1  |  FORBIDDEN values                                      |
                                |   C2  |   DF  |  FIRST byte of a 2-bytes sequence                      |
                                |   E0  |   EF  |  FIRST byte of a 3-bytes sequence                      |
                                |   F0  |   F4  |  FIRST byte of a 4-bytes sequence                      |
                                |   F5  |   FF  |  FORBIDDEN values                                      |
                                •-------•-------•--------------------------------------------------------•
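The table translates directly into a small classifier. The Python sketch below is only an illustration; the function name `classify_utf8_byte` and the labels it returns are made up for the example.

```python
def classify_utf8_byte(value: int) -> str:
    """Classify a single byte value according to the table above."""
    if not 0x00 <= value <= 0xFF:
        raise ValueError("a byte must be in the range 0x00..0xFF")
    if value <= 0x7F:
        return "single"        # 1-byte sequence ( ASCII character )
    if value <= 0xBF:
        return "continuation"  # 2nd, 3rd or 4th byte of a sequence
    if value <= 0xC1:
        return "forbidden"     # C0 and C1 never appear in UTF-8
    if value <= 0xDF:
        return "lead-2"        # first byte of a 2-byte sequence
    if value <= 0xEF:
        return "lead-3"        # first byte of a 3-byte sequence
    if value <= 0xF4:
        return "lead-4"        # first byte of a 4-byte sequence
    return "forbidden"         # F5 to FF never appear in UTF-8


for byte in (0x41, 0x80, 0xC1, 0xC4, 0xE2, 0xF4, 0xF5):
    print(f"{byte:02X}", classify_utf8_byte(byte))

# The two boundary characters mentioned above, encoded with Python's codec.
first = "\u0100".encode("utf-8")  # Ā
last = "\u013f".encode("utf-8")   # Ŀ
print(first.hex(), last.hex())
```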
                                

                                I think that your reasoning is correct if we take, for instance, the individual C1 byte, which is :

                                • The Á character, in a Windows-1250/1252/1254/1258 encoded file

                                • The Б character, in a Windows-1251 encoded file

                                • The Α character, in a Windows-1253 encoded file

                                • The ֱ character, in a Windows-1255 encoded file

                                • The ء character, in a Windows-1256 encoded file

                                • The Į character, in a Windows-1257 encoded file

                                …

                                • Always forbidden in a UTF-8 or UTF-8-BOM encoded file
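The list above can be reproduced with Python's built-in Windows code page codecs. This is a quick sketch, and the dictionary name `c1_by_code_page` is just for the example.

```python
CODE_PAGES = [f"cp{number}" for number in range(1250, 1259)]

# Decode the single byte C1 with each Windows code page in turn.
c1_by_code_page = {cp: b"\xc1".decode(cp) for cp in CODE_PAGES}

for code_page, char in c1_by_code_page.items():
    print(code_page, f"U+{ord(char):04X}")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# C1 is rejected on its own and also when a continuation byte follows it.
print(is_valid_utf8(b"\xc1"), is_valid_utf8(b"\xc1\x80"))
```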

                                Best Regards,

                                guy038

                                • guy038G
                                  guy038
                                  last edited by

                                  Hi, All,

                                  As promised, this issue on GitHub, concerning the \A assertion !

                                  BR

                                  guy038

                                  • EkopalypseE
                                    Ekopalypse
                                    last edited by

                                    @guy038

                                    Hi Guy, how are you doing? I hope you are doing well.

                                    I was replying to the claim that ANSI/ASCII can be replaced by utf-8.
                                    ASCII can, but ANSI cannot.
                                    My example was to show why it can’t be replaced.
                                    Yes, C4 is valid as long as it is followed by another byte that forms a valid utf8 character.
                                    Alone it is invalid.

                                    • gstaviG
                                      gstavi @Ekopalypse
                                      last edited by

                                      @Ekopalypse said in Search in folder (encoding):

                                      No, afaik utf8 cannot replace ANSI code pages easily.

                                      The terminology is confusing in general and Notepad++ is not helping.
                                      There are modern encodings which can represent ANY Unicode symbol with various multibyte schemes.
                                      There are legacy ascii encodings that can represent up to 256 symbols.
                                      Every ascii encoding comes with a code page that defines different symbols for the range 128-255.
                                      The symbols for 0-127 in ascii encoding (and utf8) are always the same. Let’s call them “plain English”.
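The claim about the 0-127 range is easy to check with Python's codecs. The sketch below compares a few encodings picked only as examples.

```python
ENCODINGS = ["ascii", "utf-8", "cp1251", "cp1252"]

# Bytes 0..127 decode to the same characters in every one of these encodings.
low_bytes = bytes(range(128))
low_decoded = {encoding: low_bytes.decode(encoding) for encoding in ENCODINGS}
low_range_identical = len(set(low_decoded.values())) == 1

# A byte in the 128..255 range depends on the code page.
high_cp1252 = b"\xe9".decode("cp1252")
high_cp1251 = b"\xe9".decode("cp1251")

print(low_range_identical, high_cp1252, high_cp1251)
```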

                                      Ascii encodings should die. Notepad++ must open them but should discourage people from creating new ones by forcing an explicit choice to do so.
                                      People that choose one of the modern encodings save themselves trouble later.
                                      And for the many many people who can’t understand the concept of encoding Notepad++ should help by choosing the right default.

                                      Notepad++ default “ANSI encoding” is ascii encoding with some arbitrary code page.
                                      Generally using ascii encoding without defining an explicit code page is equivalent to saying “I only care about plain English and don’t give a fuck about range 128-255”.

                                      Other “code pages” or “Character Sets” are not relevant to Notepad++ default. Users who want them need to either select them manually or let the autodetect guess it. Does it even work? How accurate is guessing of a code page?

                                      For people who are ok with “ANSI”, the majority belong to the “don’t give a fuck about 128-255” group and they will be OK with utf8.
                                      A minority that actually uses “ANSI” and adds symbols from the default code page to the document will need to select it explicitly or hope that autodetect works. But they are better off switching to a modern encoding anyway.
                                      Even if the solution will not be 100% backward compatible it will benefit much more people than it would hurt.

                                      • EkopalypseE
                                        Ekopalypse
                                        last edited by

                                        @gstavi said in Search in folder (encoding):

                                        I agree with most of what you said, but I think there is a misunderstanding here about ANSI. (maybe it’s me)
                                        It’s true, ANSI is used as a type of encoding, which it is not.
                                        Instead, it is just an abbreviation for the codepage that was used to set up the operating system.
                                        For one person it’s cp1252, for another it’s cp1251, and for the next it’s something else, and so on.
                                        But GetACP returns this “setup” encoding and that is,
                                        I assume, the one that is/was used by Windows users and is used by npp.
                                        I think that makes sense.
                                        Nevertheless, I think using unicode and especially utf8 makes more sense these days.
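To see which code page a given machine reports, something like the following Python sketch could be used. On Windows it calls the Win32 `GetACP` function through `ctypes`; on other systems there is no ANSI code page, so it falls back to the locale's preferred encoding, which is only a rough equivalent. The function name `system_ansi_code_page` is made up for the example.

```python
import locale
import sys


def system_ansi_code_page() -> str:
    """Return the name of the system's default legacy encoding."""
    if sys.platform == "win32":
        import ctypes

        # GetACP returns the numeric identifier of the ANSI code page,
        # for example 1252 or 1251.
        return f"cp{ctypes.windll.kernel32.GetACP()}"
    # Not Windows: report the locale's preferred encoding instead.
    return locale.getpreferredencoding(False)


print(system_ansi_code_page())
```

Two machines can print different values for the same file, which is the interoperability concern with treating "ANSI" as if it were one encoding.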

                                        • gstaviG
                                          gstavi @Ekopalypse
                                          last edited by

                                          @Ekopalypse said in Search in folder (encoding):

                                          I think that makes sense.

                                          It is a legitimate decision. And it makes sense … and in my (very personal) opinion it is awful.
                                          It’s bad for interoperability because transferring a file between 2 computers could end up badly.

                                          But my personal dislike is because I work on a multilingual operating system where the other language is right-to-left Hebrew.
                                          And it is unimaginably annoying when some application decides to do me a favor and adjust itself without asking.
                                          I never want to see Hebrew on my computer unless I explicitly asked for it. The OS is obviously setup with English as primary language but FUCKING OneNote still decides to suddenly arrange pages right-to-left because the time zone is for Israel. And it feels random, unfixable and takes the control from me.

                                          Since users don’t explicitly choose codepage when they setup their system, using GetACP is just a guess. And if it misfires, users will not understand why because they are unaware that a choice was made for them. Don’t guess on behalf of the user if it can be avoided.

                                          Side story: as can be expected I am sometimes the “tech person” for friends and family. I strongly refuse to service installations that are fully Hebrew. If you will ever open regedit on a Hebrew Windows and see all the English content aligned right-to-left you would lose all respect to Microsoft.

                                          • EkopalypseE
                                            Ekopalypse @gstavi
                                            last edited by

                                            @gstavi said in Search in folder (encoding):

                                            If you will ever open regedit on a Hebrew Windows and see all the English content aligned right-to-left you would lose all respect to Microsoft.

                                            Maybe I should give it a try to finally be persuaded to switch to Linux :-D
