Community
    • Login

    Unicode problem !

    Scheduled Pinned Locked Moved General Discussion
    12 Posts 5 Posters 733 Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • guy038G
      guy038
      last edited by guy038

      Hello, All,

      I’m in the process of cleaning up my Python scripts and slightly modifying a few others, and I’ve just come across something very annoying about Notepad++'s python scripts, which had escaped me until now because almost all the posts and replies on this forum are in English :-((

      For example, suppose I have the following command:

      s = notepad.prompt('Enter SEARCH regex:', '', '')

      How can I ensure that if I enter, say, the French word été, I get s = 'été' and not s = '\xe9t\xe9' !!!

      Maybe, I miss something obvious ! Anyway, thanks for your advice!

      Best Regards,

      guy038

      Alan KilbornA 1 Reply Last reply Reply Quote 1
      • Alan KilbornA
        Alan Kilborn @guy038
        last edited by

        @guy038 said in Unicode problem !:

        été

        It works fine for me, but then again, I’m using PythonScript3, which, being based on Python 3, is natively unicode.

        92d15c7a-fac6-41f3-a4f2-058b90e6c73a-image.png

        1 Reply Last reply Reply Quote 2
        • Alan KilbornA
          Alan Kilborn
          last edited by Alan Kilborn

          …but… what is the real issue you are experiencing with this?

          Meaning, \xe9t\xe9 really is été, it is just expressed differently when you view it a certain way. So there’s no loss of data…so I’m just wondering why this is a problem for you?

          For example, here’s two ways to view it; one shows what I presume you want to see, and one shows what your complaint was about:

          7345de8d-2eaf-44f9-9371-3ac8aa50c675-image.png

          It’s chopped off in the screenshot, but the msgbox’s argument is 's:' + s.

          1 Reply Last reply Reply Quote 1
          • guy038G
            guy038
            last edited by guy038

            Hi, @alan-kilborn and All,

            Here is my problem :

            Let’s start with this simple python script, below :

            editor.beginUndoAction()
            
            editor.documentStart()
            Start = editor.getCurrentPos()
            End   = editor.getTextLength()
            
            s = notepad.prompt('Enter SEARCH regex:', '', '')
            
            if s != None and len(s) > 0:
            
                r = notepad.prompt('Enter REPLACE regex:', '', '')
                if r != None:
            
                    editor.rereplace(s, r, 0, Start, End)
            
            editor.endUndoAction()
            

            And create a new tab, containing only these three lines :

            fourth
            quatrième
            quatrieme
            

            As you can, see the word quatrième is the correct french word for the english word fourth and, in the third line, it’s just the same word without the accent on letter e !

            • Now, run my above script

            • For search = fourth and replace = 4th => No trouble, the replacement occurs. Logic, as all input text is just ASCII text

            • For search = quatrième and replace 4ème => No replacement !?

            • For search = quatrieme and replace = 4ème => a replacement occurs but displays the string 4xe8me, with xe8 in inverse video, instead of the french 4ème ?!

            • For search = quatrieme and replace = 4eme => No trouble again, the replacement occurs, as only ascii chars have been used, again ! However, it’s far from correct French :-))
              BR

            guy038

            Mark OlsonM CoisesC 2 Replies Last reply Reply Quote 2
            • guy038G
              guy038
              last edited by guy038

              @alan-kilborn and All,

              I forgot to mention that I’m cleaning up the Python scripts on my old XP laptop, which is stuck on Notepad++ v.7.9.2. So, I cannot install something superior to Python script v1.5.4 !

              May be all this issue will be over, after moving these scripts on my new Windows 10 laptop and the installation of Python Script v3.0.16 !

              BR

              guy038

              Alan KilbornA 1 Reply Last reply Reply Quote 0
              • Mark OlsonM
                Mark Olson @guy038
                last edited by Mark Olson

                @guy038

                For search = fourth and replace = 4th => No trouble, the replacement occurs. Logic, as all input text is just ASCII text
                …

                For search = quatrieme and replace = 4eme => No trouble again, the replacement occurs, as only ascii chars have been used, again ! However, it’s far from correct French :-))

                I can’t replicate any of the problems in those bullet points; I suspect it may have something to do with the language your computer is using, or possibly the fact that you’re using an older PythonScript version.

                I personally hate working with diacritics and umlauts because, e.g., è can be represented as (\x300 (COMBINING GRAVE ACCENT) followed by ASCII e) OR literal è (\xe8), and Notepad++ treats them differently even though they are identical graphemes.

                1 Reply Last reply Reply Quote 0
                • Alan KilbornA
                  Alan Kilborn @guy038
                  last edited by

                  @guy038 said in Unicode problem !:

                  after moving these scripts on my new Windows 10 laptop and the installation of Python Script v3.0.16

                  I would recommend this.
                  Dealing with ANSI encoding issues is a general pain, and that’s what you have when you want to use “special” characters with PS2 (and earlier).

                  But you may still have the trouble @Mark-Olson points out, even with PS3. I know you love regex, but working with “special” characters and regex is, shall we say, challenging. :-(

                  1 Reply Last reply Reply Quote 2
                  • guy038G
                    guy038
                    last edited by guy038

                    Hello, @mark-olson, @alan-kilborn and All,

                    You’re probably right about my Python version which must be too old ( v1.5.4 ) !

                    You said :

                    I personally hate working with diacritics and umlauts

                    I can understand your point of view. However, sorry but I"m French and I would like to deal with these accentuated chars as well ! Do you know, for instance, that even the very very old qbasic.exe program of Microsoft does handle them correctly !!

                    If you still have an old 32-bits machine, just run this BASIC one-line program LINE INPUT "Votre réponse : ;r$: PRINT r$ => Everything is correctly displayed, even the label ! To be rigorous, on reflection, my qbasic version use the Multilingual Latin_1 850 encoding which is still far from Unicode, I’ll give you that !

                    But, nowadays, with the Unicode standard and the universal UTF-8 encoding, I suppose that any computer language should handle these Unicode chars properly !


                    Luckily, with the help of regexes, I can by-pass the Python script completely and, as said @alan-kilborn, just run the following S/R, even in Normal mode :

                    • SEARCH quatrième

                    • REPLACE 4ème

                    Best regards,

                    guy038

                    CoisesC 1 Reply Last reply Reply Quote 1
                    • CoisesC
                      Coises @guy038
                      last edited by

                      @guy038 said in Unicode problem !:

                      For search = quatrieme and replace = 4ème => a replacement occurs but displays the string 4xe8me, with xe8 in inverse video, instead of the french 4ème ?!

                      It sounds as if your text is in utf-8, but Python is working in Windows-1252, ISO-8859-1 or ISO-8859-15. I presume the problem goes away if your file is “ANSI” rather than Unicode?

                      I know nothing about Python, but is there some way to tell it to work in utf-8 instead of the current Windows codepage?

                      1 Reply Last reply Reply Quote 1
                      • Mark OlsonM
                        Mark Olson
                        last edited by

                        Probably relevant useful info is here:
                        https://docs.python.org/3/howto/unicode.html

                        Don’t have time to comment on which sections are most relevant here

                        PeterJonesP 1 Reply Last reply Reply Quote 1
                        • PeterJonesP
                          PeterJones @Mark Olson
                          last edited by PeterJones

                          @Mark-Olson said in Unicode problem !:

                          Probably relevant useful info is here:
                          https://docs.python.org/3/howto/unicode.html

                          Actually, since @guy038 is still on PythonScript 1.5.4, it’s using Python 2, not Python 3, so the right URL is https://docs.python.org/2.7/howto/unicode.html , because, as @Alan-Kilborn pointed out, Python 3 did a fundamental change in how it handles unicode strings between 2 and 3.

                          1 Reply Last reply Reply Quote 2
                          • CoisesC
                            Coises @guy038
                            last edited by

                            @guy038 said in Unicode problem !:

                            But, nowadays, with the Unicode standard and the universal UTF-8 encoding, I suppose that any computer language should handle these Unicode chars properly !

                            TLDR: Unicode is hard.

                            You know the joke that goes, “There’s a sort of person who sees a problem and thinks, ‘I know! I’ll use regular expressions!’ Now that person has two problems.”? One could say the same of Unicode.

                            Modern Windows is natively Unicode… but it’s UTF-16. Nearly everything else (including Scintilla) that uses Unicode does it as UTF-8.

                            And most programming languages — as well as most regular expression implementations! — think of “characters” as fixed-length entities in memory. By choosing UTF-16, Windows at least makes handling the Basic Multilingual Plane fairly straightforward, so long as you don’t mind converting nearly everyone else’s version of Unicode to Windows’ version. OK for short strings… not so good when you have megabytes or gigabytes of file data.

                            C++ had some limited Unicode-related functions in its standard library, but has now deprecated even those and just says one should use an appropriate library.

                            Even if you’re willing to convert to UTF-32 (which means quadrupling the storage requirements of ASCII characters), so that every Unicode code point is the same size in memory, not all Unicode characters are a single Unicode code point. The complexity just keeps increasing, and increasing… the hope that Unicode would “simplify” using all the world’s scripts has not exactly been realized.

                            When I wanted to incorporate Boost.Regex directly into my Columns++ plugin, I first thought to use its Unicode facilities. Then I saw that while Boost.Regex is a header-only library (if you don’t know C++, read that as “easy to include in a project without changing the way the project is structured”), its Unicode facilities require ICU4C — and if there’s any simple, straightforward way to incorporate that monstrosity into an otherwise simple, straightforward project, I couldn’t find it. I settled for the same “hack” that Notepad++ itself uses: convert UTF-8 to UTF-16 on the fly, use Windows’ support of UTF-16 where needed, and let the chips fall where they may for anyone bold enough to use combining characters or characters outside the basic multilingual plane.

                            Unicode is so complex, so full of details (What characters are “the same” when you ignore case? Well, that depends on the current locale…), that support is a project in itself, and attempts to be reasonably complete (ICU) are gigantic and constantly being updated.

                            1 Reply Last reply Reply Quote 4
                            • First post
                              Last post
                            The Community of users of the Notepad++ text editor.
                            Powered by NodeBB | Contributors