    How to normalize fancy Unicode text back to regular text?

    27 Posts 7 Posters 4.2k Views
    • Coises @mkupper

      @mkupper said in How to normalize fancy Unicode text back to regular text?:

      Some characters do change. Copy/paste the following into a UTF-8 encoded tab or file. It should be the same as when you see here on the forums.

      I stand corrected. I did not at all expect that to happen. It’s my understanding of “convert to ANSI” that is confused. I apologize.

      Very strange:

      Open Notepad++, convert empty tab to UTF-8, copy your text, paste into tab, I see all the characters.

      Copy your text, open Notepad++, convert empty tab to UTF-8, paste into tab, I see only ASCII characters.

      I have no idea what is going on here.

      • Mark Olson @Alan Kilborn

        @Alan-Kilborn said in How to normalize fancy Unicode text back to regular text?:

        Such a script could act on selected text when the script is run, and replace that text with the normalized text…pretty simple concept.

        Might as well just make the script now, save others time.

        As noted in the docstring of the code, the most obvious difference between NFKD and NFKC is how they treat characters with combining diacritics such as umlauts. Which form is better seems really context-dependent to me: if you’re sorting text, you probably want ö to be an o followed by a combining umlaut (so that ö sorts after o and before p, as expected), but if you’re doing a regular expression search, you might prefer it to be a single character.

        '''
        Normalize "fancy" Unicode text to regular text, in the selection or the whole document.
        requires PythonScript v3 or higher: https://github.com/bruderstein/PythonScript
        ref: https://community.notepad-plus-plus.org/topic/25285/how-to-normalize-fancy-unicode-text-back-to-regular-text/17
        docs: https://docs.python.org/3.10/library/unicodedata.html
        '''
        import unicodedata
        from Npp import editor

        def normalize(text):
            '''
            NFKC is Normalization Form KC: compatibility decomposition
                followed by canonical composition.
            NFKD works similarly AFAIK, but without the final composition step;
                it may be a bit faster, but it has some weird behaviors like
                breaking ö into two characters: ASCII "o" followed by the
                combining diaeresis (U+0308), whereas NFKC combines those two
                into a single character.
            '''
            return unicodedata.normalize('NFKC', text)

        selstart = editor.getSelectionStart()
        selend = editor.getSelectionEnd()

        if selstart == selend:
            # nothing selected: normalize the whole document
            text = editor.getText()
            editor.setText(normalize(text))
        else:
            # normalize only the selected text
            text = editor.getSelText()
            editor.replaceSel(normalize(text))
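
To make the NFKC/NFKD trade-off described above concrete, here is a minimal standalone sketch (plain Python 3, no Notepad++ required). The helper names `nfkc` and `nfkd` and the sample strings are illustrative assumptions, not part of the script above.

```python
import unicodedata

def nfkc(text):
    # compatibility decomposition, then canonical composition
    return unicodedata.normalize('NFKC', text)

def nfkd(text):
    # compatibility decomposition only
    return unicodedata.normalize('NFKD', text)

o_umlaut = '\u00f6'  # ö as a single precomposed character

# NFKD splits ö into "o" plus the combining diaeresis (U+0308)
decomposed = nfkd(o_umlaut)
print([hex(ord(c)) for c in decomposed])  # ['0x6f', '0x308']

# NFKC recombines the pair into one character
print(len(nfkc(decomposed)))              # 1

# Sorting by code point: the NFKD key puts ö between o and p,
# while the NFKC key puts it after every ASCII letter
words = ['p', 'ö', 'o']
print(sorted(words, key=nfkd))            # ['o', 'ö', 'p']
print(sorted(words, key=nfkc))            # ['o', 'p', 'ö']
```

Swapping 'NFKC' for 'NFKD' in the script's `normalize` function would give the decomposed behavior, if that is what your use case needs.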
        
        • guy038

          Hi, @alan-kilborn,

          I completely agree with your last assumption, and that is why I had already upvoted @peterjones’s post. I have now upvoted @mark-olson’s solution too!

          BR

          guy038

          • Dean-Corso

            Hi guys,

            Thanks again for your help. Really nice of you all.

            @PeterJones

            Thanks for the hint about the PythonScript versions. I downloaded the latest pre-release version as you did, but could not follow the same steps to enter your example lines. I got some errors trying to execute the print command (an "expand" error on the for statement, etc.), even though I entered the same thing as you. Maybe a whitespace issue or something, I’m not sure. But it is good to know that I needed a Python 3.x version; I was still using the older 2.x version.

            @Mark-Olson

            Thank you for that example script. I tried it and it seems to work. Great! The results are very good for me: it works for some of those symbol styles (not all), converting text that is entirely symbols, or plain text mixed with symbol text, back to plain text. The script works the same as the few websites I found for normalizing symbol text to plain text. That’s very good, because I no longer need those websites, which was one of my goals. It would be good if Notepad++ could add a built-in function for this in a future release, if possible.
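
On why only some styles convert: NFKC can only map a character to plain text when Unicode defines a compatibility decomposition for it. Below is a minimal sketch for spotting what is left over after normalizing. The helper name `leftover_non_ascii` and the small-capitals sample are hypothetical, chosen for illustration; the double-struck sample is taken from this thread.

```python
import unicodedata

def leftover_non_ascii(text):
    """Return the characters that are still non-ASCII after NFKC normalization."""
    normalized = unicodedata.normalize('NFKC', text)
    return [c for c in normalized if ord(c) > 127]

# Double-struck letters have compatibility decompositions to ASCII letters
print(leftover_non_ascii('𝕥𝕙𝕦𝕘 𝕝𝕚𝕗𝕖'))  # []

# Small capitals such as U+029C are ordinary letters with no decomposition,
# so NFKC leaves them unchanged and they show up in the leftover list
print(leftover_non_ascii('\u1d1b\u029c\u1d1c\u0262'))
```

Styles that show up in the leftover list would need a hand-written mapping table rather than normalization.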

            • PeterJones @Dean-Corso

              @Dean-Corso said in How to normalize fancy Unicode text back to regular text?:

              Got some errors trying to exec the print command (getting expand error on for statement etc). Just did enter same as you. Maybe some space issue or something not sure.

              If you copy/paste the PythonScript console results (including the version information) like I did above, I bet someone could tell you what happened.

              • Dean-Corso

                @PeterJones

                OK, I tried again and now I get this output…

                Python 3.12.1 (tags/v3.12.1:2305ca5, Dec  7 2023, 22:03:25) [MSC v.1937 64 bit (AMD64)]
                Initialisation took 204ms
                Ready.
                >>> import unicodedata
                >>> strings = [   '𝖙𝖍𝖚𝖌 𝖑𝖎𝖋𝖊',   '𝓽𝓱𝓾𝓰 𝓵𝓲𝓯𝓮',   '𝓉𝒽𝓊𝑔 𝓁𝒾𝒻𝑒',   '𝕥𝕙𝕦𝕘 𝕝𝕚𝕗𝕖',   'thug life', '𝘏𝘦𝘭𝘭𝘰 𝘕𝘰𝘵𝘦𝘱𝘢𝘥 𝘱𝘭𝘶𝘴 𝘱𝘭𝘶𝘴 𝘤𝘰𝘮𝘮𝘶𝘯𝘪𝘵𝘺', '𝙃𝙚𝙡𝙡𝙤 𝙉𝙤𝙩𝙚𝙥𝙖𝙙 𝙥𝙡𝙪𝙨 𝙥𝙡𝙪𝙨 𝙘𝙤𝙢𝙢𝙪𝙣𝙞𝙩𝙮']
                >>> for x in strings:
                ...   print(unicodedata.normalize( 'NFKC', x), x)
                
                

                …but I don’t see the printed output like you have. Did I miss something I needed to enter in this case?

                PS: About that earlier error, I see I forgot to enter another space before the last print command.

                • PeterJones @Dean-Corso

                  @Dean-Corso ,

                  If your PythonScript console prompt is still ... instead of >>>, you will need to enter a blank line (no whitespace) to tell the console to end the loop. It won’t run the loop until you do.
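
One way to sidestep the `...` continuation prompt altogether is to save the loop as a script file and run it, since a file has no interactive prompt waiting for a blank line to end the block. This is a sketch using three of the sample strings from the session above; the `results` list is an addition for illustration, and it assumes the printed output goes to the PythonScript console in your setup.

```python
import unicodedata

strings = ['𝖙𝖍𝖚𝖌 𝖑𝖎𝖋𝖊', '𝕥𝕙𝕦𝕘 𝕝𝕚𝕗𝕖', 'thug life']

results = []
for x in strings:
    # the loop body is indented; in a file, the block simply ends
    # where the indentation ends, with no blank line required
    normalized = unicodedata.normalize('NFKC', x)
    results.append(normalized)
    print(normalized, x)
```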
