
    find in files unicode text

    General Discussion
    17 Posts 6 Posters 12.4k Views
    • Tudor RanetiT
      Tudor Raneti
      last edited by

      I search the text:
      "
      fisierul “sesizare comisia pentru libertăți civile, justiție și afaceri interne si probe.pdf” anexat la prezenta
      "
      with Total Commander. It finds it in multiple doc files. The text obviously contains diacritics and such.

      I search for the same text with Notepad++, and it doesn’t find it. Total Commander also doesn’t find it unless I tick its “Unicode” check box in the search dialog.

      Anyway, I installed Notepad++ 7.2: same story. BTW, Notepad++ often goes “not responding” during these searches, whereas Total Commander doesn’t; in fact, it lets me stop the search at any time.

      P.S. There’s no such thing as Unicode text, huh? Then I guess both Google and I are hallucinating:
      https://www.google.ro/search?q=unicode+text&ie=utf-8&oe=utf-8&client=firefox-b&gws_rd=cr&ei=4PJfWLeqMs7jwQKpzrzwDQ

      • Claudia FrankC
        Claudia Frank @Tudor Raneti
        last edited by

        @Tudor-Raneti

        I will give it a try.

        What is Unicode?
        First, Unicode is a standard which assigns a unique number (a code point) to every character.
        To characters, not to text.

        What is an encoding?
        The sequence of bytes representing that number. One encoding might use one byte per number
        for as long as that is possible, whereas others use 2, 3 or 4 bytes per number.

        Why isn’t there such a thing as Unicode text?
        Because the term implies that the text (by now we know that characters are encoded) is encoded
        in Unicode, when it is actually encoded in UTF-8, UTF-16, etc.

        The same text is stored differently when a different codec is used, even though the Unicode number
        (code point) is the same.

        So when can such text be found?
        When the search text is encoded in the same way as the text stored in the file.
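
To make the point above concrete, here is a minimal Python sketch (not anything taken from Notepad++). It encodes a fragment of the search text in two ways and looks for a word in the raw bytes. The helper name `contains` is made up for this illustration.

```python
# Fragment of the search text; it contains the comma-below letters U+021B and U+0219.
text = "libertăți civile, justiție și afaceri"

utf8 = text.encode("utf-8")        # 1 or 2 bytes per character here
utf16 = text.encode("utf-16-le")   # 2 bytes per character here

def contains(haystack: bytes, needle: str, encoding: str) -> bool:
    """Return True if needle, encoded with `encoding`, occurs in the raw bytes."""
    return needle.encode(encoding) in haystack

print(utf8 == utf16)                              # same code points, different bytes
print(contains(utf16, "justiție", "utf-16-le"))   # needle encoded like the data
print(contains(utf16, "justiție", "utf-8"))       # needle encoded differently
```

The search only succeeds when the needle is encoded the same way as the stored bytes, which is the whole argument in a few lines.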

        I didn’t check npp source code but I assume the find function uses the same encoding as
        defined in current document. Maybe I’m wrong?

        Cheers
        Claudia

        Btw, searching for unicode and text on Google doesn’t prove validity.
        First, it searches for unicode and/or text, not necessarily both words together.
        Enclosing both words in quotes would have reduced the number of results drastically but, as said,
        that wouldn’t prove validity either.

        • Tudor RanetiT
          Tudor Raneti
          last edited by

          Nonsense and no answers

          • Claudia FrankC
            Claudia Frank @Tudor Raneti
            last edited by

            @Tudor-Raneti

            Well, if I’m wrong, I would appreciate it if you could point out the nonsense.
            If not for me, then for others who could benefit from it. Otherwise,
            it is nonsense to claim that something is nonsense without providing
            the facts to back that up.

            An answer is within the “nonsense” text. Maybe you didn’t read it completely.

            Cheers
            Claudia

            • Scott SumnerS
              Scott Sumner
              last edited by

              Cheers to Claudia for even trying to help the truly rude! As a helper to so many with problems here…I would listen to every word provided!

              • Tudor RanetiT
                Tudor Raneti
                last edited by

                May I please know in which way is Claudia Frank and Scott Sumner related to Notepad++?

                • Claudia FrankC
                  Claudia Frank @Tudor Raneti
                  last edited by

                  @Tudor-Raneti

                  May I please know in which way is Claudia Frank and Scott Sumner related to Notepad++?

                  Being npp users with the passion to help others.

                  Cheers
                  Claudia

                  • guy038G
                    guy038
                    last edited by guy038

                    Hi, Tudor Raneti

                    Please, slow down! Your attitude is a bit offensive to people who just want to help you! Anyway, I’ll give it another try, too!


                    From your text, below :

                    fisierul "sesizare comisia pentru libertăți civile, justiție și afaceri interne si probe.pdf" anexat la prezenta
                    

                    I simply deduced that you’re Romanian and that your text could be translated into English by the approximate text (Google translation) below:

                    file "notification Committee on Civil Liberties, Justice and Home Affairs and probe.pdf" attached to this
                    

                    With the default N++ font (Courier New), two characters of your alphabet cannot be displayed correctly and are shown as the replacement character (a small square).

                    They are:

                    • The LATIN SMALL LETTER S WITH COMMA BELOW (Romanian), with code point = \x{0219}

                    • The LATIN SMALL LETTER T WITH COMMA BELOW (Romanian), with code point = \x{021b}

                    Refer to the link , below :

                    http://www.unicode.org/charts/PDF/U0180.pdf

                    Do not confuse these two characters with the two others below, which have a cedilla instead of a comma below the letter and which are displayed correctly with the Courier New font:

                    • Letter ş , latin small letter s with cedilla, with code-point = \x{015F}

                    • Letter ţ , latin small letter t with cedilla, with code-point = \x{0163}

                    Refer to the link :

                    http://www.unicode.org/charts/PDF/U0100.pdf
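
The difference between the two pairs of letters can be checked with a small Python sketch using the standard `unicodedata` module. The helper `encodable` and the choice of `cp1250` (the Windows Central European code page) as the legacy encoding to probe are assumptions made for this illustration. Nothing here says which encoding the files from the original question actually use.

```python
import unicodedata

comma_s, comma_t = "\u0219", "\u021b"      # s and t with comma below
cedilla_s, cedilla_t = "\u015f", "\u0163"  # s and t with cedilla

for ch in (comma_s, comma_t, cedilla_s, cedilla_t):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# The look-alikes are distinct code points, and NFC normalization keeps them distinct.
same_after_nfc = (
    unicodedata.normalize("NFC", comma_s) == unicodedata.normalize("NFC", cedilla_s)
)

def encodable(ch: str, encoding: str) -> bool:
    """Return True if the character can be represented in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(same_after_nfc)
print(encodable(cedilla_s, "cp1250"), encodable(comma_s, "cp1250"))
```

A search string typed with the comma-below letters therefore cannot match text stored with the cedilla letters, whatever the encoding, because the code points themselves differ.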


                    In a new tab of my local N++ 7.2 configuration, which has the default Unicode encoding UTF-8 (without BOM):

                    • I first copied your text a couple of times

                    • Then I selected one copy of your text

                    • I immediately opened the Find dialog (Ctrl + F) => the Find what field was already filled with your text!

                    • Finally, I clicked a couple of times on the Find Next button

                    No problem, it DID detect, successively, all the occurrences of your text :-))

                    This test succeeded:

                    • With or without the Match whole word only option

                    • With or without the Match case option

                    • With or without the . matches newline option

                    • In normal, extended and regular expression search mode

                    So, maybe the current encoding of your file is not a Unicode encoding? Just look at the bottom right of the status bar!

                    Cheers,

                    guy038

                    • gstaviG
                      gstavi @Tudor Raneti
                      last edited by

                      P.S. There’s no such thing as unicode text huh? Then I guess both me and google are halucinating:
                      https://www.google.ro/search?q=unicode+text&ie=utf-8&oe=utf-8&client=firefox-b&gws_rd=cr&ei=4PJfWLeqMs7jwQKpzrzwDQ

                      Google can find any idiocy on the net.
                      If you want to start understanding Unicode read this.
                      https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

                      Although the previous interaction does not show you as the learning type.

                      • Andrew DunnA
                        Andrew Dunn
                        last edited by

                        Come now, let’s be civil.

                        To search, first go to your top menu and left click on Encoding. Then select an encoding. Now open your Find dialog and try it.

                        I suppose you’re asking for a small encoding list selection option in the Find dialog, correct? Or perhaps an option to search for multiple encodings at once? I suppose that’s faster than running multiple independent searches. Might eat up a lot of GUI space though.
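
As a rough illustration of what "searching for multiple encodings at once" could look like, here is a Python sketch that works on raw bytes. It is not a Notepad++ feature. The list `CANDIDATE_ENCODINGS` and the function names are assumptions chosen for the example.

```python
from pathlib import Path

# Candidate encodings are an assumption for illustration; adjust to the files at hand.
CANDIDATE_ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be", "cp1250")

def encodings_matching(data: bytes, pattern: str,
                       encodings=CANDIDATE_ENCODINGS) -> list[str]:
    """Return the encodings under which `pattern` occurs in the raw bytes."""
    hits = []
    for enc in encodings:
        try:
            needle = pattern.encode(enc)
        except UnicodeEncodeError:
            continue  # the pattern has characters this encoding cannot represent
        if needle in data:
            hits.append(enc)
    return hits

def search_file(path: Path, pattern: str) -> list[str]:
    """Read a file as bytes and report which encodings of the pattern match."""
    return encodings_matching(path.read_bytes(), pattern)
```

A byte-level search like this needs no encoding guess for the file, but it can report false positives (for example, a UTF-16 needle matching at an odd byte offset), so treat it as a coarse filter.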

                        • gstaviG
                          gstavi
                          last edited by gstavi

                          Out of curiosity I did browse the code to see how “find in files” works in Notepad++. As far as I understood:

                          1. NPP scans the directory tree and creates a list of all files that match the Filters.
                          2. For each file in the list NPP will:
                            – Load the file into NPP in a “hidden” state. During that load NPP guesses the file’s encoding, and it may very well guess wrong.
                            – Search hidden file within NPP for required pattern.
                            – Add matching locations to ‘finder’.
                            – Close hidden file.

                          I did not go into the actual comparison, but the pattern has to be matched against the file using some common encoding: either the pattern is re-encoded into the guessed file encoding, or during the scan each symbol is translated into its Unicode code point (U+###) and compared.
                          The bottom line is that if the initial encoding guess is wrong, NPP will fail to find matches.
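
The scheme described above can be sketched in Python as follows. This is not Notepad++'s code: the detection order in `guess_encoding` (BOM, then strict UTF-8, then a `cp1252` fallback) and both function names are assumptions made only to show how a wrong guess hides matches.

```python
import codecs
import fnmatch
import os

def guess_encoding(data: bytes) -> str:
    """Very rough guess: BOM first, then strict UTF-8, then a legacy fallback."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # this codec reads the BOM to pick the byte order
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "cp1252"  # assumed fallback; a wrong guess here hides matches

def find_in_files(root: str, name_filter: str, pattern: str):
    """Yield (path, line_number, line) for every line containing pattern."""
    for dirpath, _dirs, files in os.walk(root):
        for name in fnmatch.filter(files, name_filter):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                data = fh.read()
            # Decode with the guessed encoding, then compare code points.
            text = data.decode(guess_encoding(data), errors="replace")
            for number, line in enumerate(text.splitlines(), start=1):
                if pattern in line:
                    yield path, number, line
```

With this guesser, a UTF-16 file without a BOM whose bytes happen to be valid UTF-8 is decoded as UTF-8, and the pattern is then never found in it, which is the failure mode described above.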

                          I use NPP mostly for UTF-8/ASCII files and sometimes for UTF-16 (Microsoft’s default). NPP auto-detects these well.
                          I would not be surprised if rarer (country-specific) encodings are often misdetected.

                          In any case, this “find in files” scheme is very slow and I hardly use it. I prefer grepping externally.
                          It would be nice to implement, for code developers, a much more efficient UTF-8-only “find in files” that does not bother to load each file into Notepad++. Possibly in a plugin.
