    Is there any way to convert code points to unicode

      dinkumoil @User 000

      @User-000

      You could create an HTML file that includes your code points as numeric HTML entities (i.e. you first have to turn each number N into &#N;, e.g. by replacing \ with ;&# and fixing up both ends). For example:

      <body>
          <p>
              &#72;&#101;&#108;&#108;&#111;&#32;&#87;&#111;&#114;&#108;&#100;&#33;
          </p>
      </body>
      

      Opening the resulting file in a web browser makes it possible to copy the plain text.
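
If the list of numbers is long, the rewrite into entities can be scripted. A minimal Python 3 sketch, assuming the input is a backslash-separated list of decimal code points like the one in the question (the function name `to_entities` is made up for this example):

```python
def to_entities(text, sep="\\"):
    """Turn '72\\101\\108' into '&#72;&#101;&#108;' (decimal numeric entities)."""
    # Ignore empty pieces so a leading or trailing separator does no harm
    numbers = [part for part in text.split(sep) if part.strip()]
    return "".join("&#{};".format(int(n)) for n in numbers)


print(to_entities(r"72\101\108\108\111\32\87\111\114\108\100\33"))
```

The printed string can be pasted between the `<p>` tags of the HTML file.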

        rdipardo @User 000

        @User-000,

        i want to convert it to “Hello World!”

        As suggested above, you can encode the text as numeric entities, then convert them right in Notepad++ with the HTML Tag plugin (*1):

        Your file doesn’t even have to be a recognized markup language (*2):


        (*1) In the GIFs I call the “Select tag contents only” command, then “Decode selected (HTML) entities”.

        (*2) Exception: most named entities (e.g. &copy;) are unrecognized by the XML 1.0 specification; the plugin falls back to numeric entities in those cases.
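
Outside Notepad++, the same decoding can be done with Python's standard library: `html.unescape` resolves decimal, hexadecimal and named entities. This is only a sketch for comparison and says nothing about how the HTML Tag plugin is implemented.

```python
import html

encoded = "&#72;&#101;&#108;&#108;&#111;&#32;&#87;&#111;&#114;&#108;&#100;&#33;"
decoded = html.unescape(encoded)
print(decoded)

# Named entities are resolved as well (here: the copyright sign)
copyright_sign = html.unescape("&copy;")
```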

          User 000 @rdipardo

          @rdipardo Huge thanks, that’s what I need

            Paul Wormer @User 000

            @User-000 If you don’t mind using the PythonScript plugin, here is a simple script. Select the string 72\101\...\33 with the mouse and run the script below; it inserts the decoded (extended ASCII) characters just before the selection.

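            #72\101\108\108\111\32\87\155\114\108\100\33

            import struct           # Python 2
            from Npp import *
            s = editor.getSelText().split('\\')   # the selected numbers, as strings
            ins_pos = editor.getSelectionStart()  # insert just before the selection
            out = ''
            for char in s:
                # pack the number into a single byte, then decode that byte as CP850
                out += struct.pack('B', int(char)).decode('cp850')
            editor.insertText(ins_pos, out)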
            

            The test string in the comment will become #Hello Wørld!72\101\108\108\111\32\87\155\114\108\100\33 after selecting it and running this script.
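
For trying the idea outside Notepad++, here is a rough Python 3 equivalent that works on a plain string instead of the editor selection. The function name `decode_codes` and its `encoding` parameter are assumptions made for this sketch; only the CP850 decoding step comes from the script above.

```python
def decode_codes(text, encoding="cp850"):
    """Decode backslash-separated byte values (0-255) using the given code page."""
    values = [int(part) for part in text.split("\\") if part]
    # bytes() rejects anything outside 0-255, just like struct.pack('B', ...)
    return bytes(values).decode(encoding)


sample = r"72\101\108\108\111\32\87\155\114\108\100\33"
result = decode_codes(sample)
```

With the test string from the script, `result` holds the text Hello Wørld!.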

              guy038

              Hello, @user-000, @alan-kilborn, @dinkumoil, @rdipardo, @paul-wormer and All,

              From this Python 3 file, I was able to extract the parts regarding the Text to Ascii and Ascii to Text conversions and adapt them as a Python 2.7 N++ script, Pheeew !

              However, my script below is certainly very weak, as:

              • Results can be read in the Python console only

              • Results with accented characters are weird, although correct ?

              So, I really need some insights from the Python gurus in our forum to improve that script a bit !

              from Npp import notepad

              # Converts a string to zero-padded, 3-digit decimal character codes

              def asc(text):
                  return ''.join(str(ord(c)).zfill(3) for c in text)

              # Converts a run of 3-digit decimal character codes to a string

              def txt(codes):
                  codes = str(codes)
                  # Left-pad with zeros so that the length is a multiple of 3
                  codes = codes.zfill(-(-len(codes) // 3) * 3)
                  return ''.join(chr(int(codes[i:i + 3])) for i in range(0, len(codes), 3))

              # Main function

              def main():

                  while True:

                      option = notepad.prompt('Enter T for Ascii to Text conversion\nEnter A for Text to Ascii conversion\n', '', '')

                      # Empty input (or a cancelled prompt) ends the script
                      if not option:
                          break

                      if option == 'T':
                          answer = notepad.prompt('Enter ASCII values of 3 DECIMAL digits : ', '', '')
                          if answer:
                              print(txt(answer))

                      elif option == 'A':
                          answer = notepad.prompt('Enter an ASCII TEXT string: ', '', '')
                          if answer:
                              print(asc(answer))

              main()
              

              Best Regards,

              guy038

              P.S. :

              • I saw this article, on the Net, related to PowerShell :

              In PowerShell 6 and later, the -replace operator also accepts a script block that performs the replacement. The script block runs once for every match.

              <String> -replace <regular-expression>, {<Script-block>}

              Within the script block, use the $_ automatic variable to access the input text being replaced and other useful information. This variable’s class type is System.Text.RegularExpressions.Match.

              The following example replaces each sequence of three digits with the equivalent character. The script block runs for each set of three digits that needs to be replaced.

              Input :

              "072101108108111" -replace "\d{3}", {return [char][int]$_.Value}

              Output :

              Hello

              To know your PS version, open the PS console, type $PSVersionTable and look for the PSVersion value
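
For comparison, Python's `re.sub` accepts a function as the replacement, which plays the same role as the script block: it is called once per match and its return value replaces the matched text. A small Python 3 sketch (the function name `digits_to_text` is made up for this example):

```python
import re


def digits_to_text(digits):
    """Replace every run of three decimal digits with the character it encodes."""
    return re.sub(r"\d{3}", lambda m: chr(int(m.group())), digits)


print(digits_to_text("072101108108111"))
```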


              • I also found these two JavaScript programs, on Stack Overflow which could be of some interest :
              var StringTools = {
                stringToNumbersArray: function(data){
                  return data.split('').map(function(c){return c.charCodeAt(0);});
                },
                numbersArrayToString: function(arr){
                  return arr.map(function(n){return String.fromCharCode(n)}).join('');
                }
              }
              
              StringTools.stringToNumbersArray("Hello");
              // => [72, 101, 108, 108, 111]
              
              StringTools.numbersArrayToString([72, 101, 108, 108, 111])
              // => "Hello"
              

              And :

              function convertToNumber(str){
                var number = "";
                for (var i=0; i<str.length; i++){
                  // zero-pad each character code to 3 digits
                  var charCode = ('000' + str.charCodeAt(i)).slice(-3);
                  number += charCode;
                }
                return number;
              }
              alert(convertToNumber("SO does my homework")); //console.log is better


              function convertToString(numbers){
                var origString = "";
                // split the digit string into groups of 3
                var codes = numbers.match(/.{3}/g);
                for(var i=0; i < codes.length; i++){
                  origString += String.fromCharCode(codes[i]);
                }
                return origString;
              }
              alert(convertToString("083079032100111101115032109121032104111109101119111114107"));  //console.log is better
              
                Paul Wormer @guy038

                @guy038 said in Is there any way to convert code points to unicode:

                Results with accented characters are weird, although correct ?

                In Python 3, the function call chr(n) returns the character with Unicode code point n (in Python 2, unichr(n) does that, while chr(n) returns a single byte and only accepts 0-255). The block of Unicode code points 0x80-0x9F (128-159) is assigned to control characters, not to (accented) letters. In the block 0x80-0xFF (128-255), Unicode coincides with ISO/IEC 8859-1 (aka Latin-1).

                I introduced, for the fun of it, the Danish character ø in the Hello Wørld! example above. In the CP850 encoding (one of the code pages loosely called extended ASCII) used there, the character ø has code 155. In Latin-1, however, this same code point corresponds to a (non-printable) control character.

                So, when you use the function chr, you have to make sure that your numbers are Latin-1 codes. I don’t know if this explains the weirdness of your accented characters, but in any case different encodings can cause confusion and are a point of concern.
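
The mismatch is easy to reproduce in Python 3, where bytes and text are separate types. The sketch below decodes the same byte value 155 under both encodings and uses the standard `unicodedata` module to check what kind of character each result is:

```python
import unicodedata

raw = bytes([155])

as_cp850 = raw.decode("cp850")      # the letter o-with-stroke, U+00F8
as_latin1 = raw.decode("latin-1")   # U+009B, a C1 control character

cp850_category = unicodedata.category(as_cp850)    # 'Ll' = lowercase letter
latin1_category = unicodedata.category(as_latin1)  # 'Cc' = control character

# In Latin-1 (and therefore in Unicode) the letter sits at 248, not 155
latin1_code_of_letter = ord(as_cp850)
```

So a number list written with CP850 in mind needs 155 for ø, while one meant for chr needs 248.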

                  Ekopalypse @Paul Wormer

                  If you replace cp850 with mbcs, the ANSI code page configured for the current Windows setup is used.
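
A sketch of what that means in practice (Python 3). The `mbcs` codec exists only on Windows, so the hypothetical helper `ansi_codec` below falls back to cp1252, a common Western European ANSI code page, purely so that the example also runs elsewhere. The point is that the result for byte 155 depends on which code page is active.

```python
import sys


def ansi_codec():
    """Return 'mbcs' on Windows; elsewhere fall back to cp1252 (an assumption for this demo)."""
    return "mbcs" if sys.platform == "win32" else "cp1252"


raw = bytes([155])
in_cp850 = raw.decode("cp850")     # the code page used in the script above
in_cp1252 = raw.decode("cp1252")   # one possible ANSI code page: U+203A, a closing angle quote
in_local = raw.decode(ansi_codec(), errors="replace")  # depends on this machine's setup
```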

                    Mark Olson

                    I’d replace struct.pack('B', int(char)) with struct.pack('h', int(char)) and add a call to out = out.replace('\x00', '') at the end. 'h' packs the number into 16 bits, so values above 255 no longer raise an error (strictly, the signed 'h' stops at 32767; the unsigned 'H' covers every number less than 65536). That range conveniently includes the entire Basic Multilingual Plane and thus allows you to use 'utf-16' as the encoding.

                    Putting it all together:

                    import re
                    import struct
                    from Npp import *
                    
                    ins_pos = editor.getSelectionStart()
                    selected_text = editor.getSelText()
                    nums = [int(n) for n in re.findall(r'\d+', selected_text)]
                    decoded = [struct.pack('h', n).decode('cp850') for n in nums]
                    replacement = ''.join(decoded).replace('\x00', '')
                    
                    editor.insertText(ins_pos, replacement)
                    
                      Paul Wormer @Mark Olson

                      @Mark-Olson Why do you deem it necessary to remove \x00? Where would it come from?

                        Paul Wormer @Paul Wormer

                        @Paul-Wormer I believe I have the answer to my own question: in the 2-byte packing indicated by pack('h', n), numbers below 256 get a trailing zero byte. It shows up as \x00, also after decode('cp850'). IMHO, it would probably have been clearer (also for future use) if Mark Olson had gone the 2-byte way consistently and had written decode('utf_16') in his Python snippet instead of decode('cp850'). There would then also have been no need to remove \x00. The OP asked about 7-bit ASCII anyway, and for that all these encodings agree.
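
A Python 3 sketch of that consistent 2-byte route. Two details here are my own choices rather than anything stated in the thread: the byte order is spelled out (`<` for little-endian, matching `utf-16-le`), and the unsigned format `H` is used, because the signed `h` only accepts values up to 32767. The function name `decode_bmp` is made up for this example.

```python
import struct


def decode_bmp(numbers):
    """Pack each code point (0-65535) into 2 little-endian bytes and decode as UTF-16."""
    packed = b"".join(struct.pack("<H", n) for n in numbers)
    return packed.decode("utf-16-le")


# The zero byte discussed above: 72 packed into 2 bytes
two_bytes = struct.pack("<H", 72)   # b'H\x00'

# The whole pair is decoded as one character, so no \x00 is left to strip
hello = decode_bmp([72, 101, 108, 108, 111])
triangle = decode_bmp([0x25BC])     # U+25BC, well above 255
```

Values in the surrogate range 0xD800-0xDFFF are not characters on their own and will not decode this way.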

                          Ekopalypse @Paul Wormer

                          @Paul-Wormer

                          I would even go so far as to say that the given example is good for seeing where the problem lies with encodings, since it fits several of them: it could be ASCII, cp1252, UTF-8…

                            Alan Kilborn @dinkumoil

                            @dinkumoil said:

                            You could create an HTML file that includes your code points as HTML entity numbers

                            …and…

                            @rdipardo said:

                            you can encode the text as numeric entities, then convert them right in Notepad++ with the HTML Tag plugin

                            …and that’s all fine, but the OP ( @User-000 ) had a title on this posting of “Is there any way to convert code points to unicode”, so I found it curious that no example was provided that actually had some fun and used any “fancy” Unicode characters. (Of course, the OP didn’t do this in his sample data, either…).

                            I didn’t try with the HTML Tag plugin, but I did try the “HTML file” approach with some “fancy” character data:

                            <body>
                                <p>
                                    &#72;&#101;&#108;&#108;&#111;&#32;&#87;&#111;&#114;&#108;&#100;&#33;&#x25BC;&#x1F499;
                                </p>
                            </body>
                            

                            and indeed it worked just fine to show the fancy-ness for characters U+25BC and U+1F499; here shown in Chrome:

                            [Screenshot: the page rendered in Chrome, showing “Hello World!” followed by ▼ and 💙]
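
The same two characters can be produced in Python 3 directly from their code points, since `chr` accepts the full Unicode range (up to 0x10FFFF). U+1F499 lies outside the Basic Multilingual Plane, so it takes two 16-bit units in UTF-16 and cannot be produced by packing a single 16-bit value. A small sketch:

```python
code_points = [72, 101, 108, 108, 111, 0x25BC, 0x1F499]
text = "".join(chr(cp) for cp in code_points)

# Number of 16-bit units each of the two fancy characters takes in UTF-16
units_25bc = len(chr(0x25BC).encode("utf-16-le")) // 2    # 1
units_1f499 = len(chr(0x1F499).encode("utf-16-le")) // 2  # 2 (a surrogate pair)
```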
