
Custom lexer and Unicode UTF-8 text file content

Notepad++ & Plugin Development

  • Bas de Reuver
    Sep 12, 2022, 9:25 PM

    I’m updating the CSV Lint plug-in, which has a custom lexer for syntax highlighting, so Notepad++ adds colors to the data files.

    It works pretty well for the most part, however it still has a bug with special characters in combination with the Windows
    Unicode UTF-8 setting, as described in this post.

    In Windows 10 and Windows 11 there is a setting “Use Unicode UTF-8 for worldwide language support”, see Control Panel -> Clock and Region -> Region, tab “Administrative”, button “Change system locale”. When Unicode UTF-8 is enabled, the Notepad++ code page is set to 65001; when it’s disabled, it’s 1252 (= English + most European languages, at least on my laptop).

    When Unicode UTF-8 is enabled and the user opens a text file containing
    any non-ASCII characters (i.e. anything non-English: ë á พสระ 你好 etc.), Notepad++ will internally use a different text encoding, and this causes problems with syntax highlighting. See screenshots below.

    csvlint_unicode_syntaxhighlighting.png
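
    To illustrate the character-vs-byte mismatch with a value from the sample data further below, here is a quick standalone sketch (not from the plug-in code):

    using System;
    using System.Text;

    class CharVsByteDemo
    {
        static void Main()
        {
            // "zwölf Häuser" from the sample csv file
            string s = "zwölf Häuser";
            Console.WriteLine(s.Length);                      // 12 chars (UTF-16 code units)
            Console.WriteLine(Encoding.UTF8.GetByteCount(s)); // 14 bytes: ö and ä take 2 bytes each
            // 12 bytes in Windows-1252 (on .NET Framework; .NET Core needs the CodePages encoding provider)
            Console.WriteLine(Encoding.GetEncoding(1252).GetByteCount(s));
        }
    }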

    The problem is the way the lexer iterates through the text content. I won’t paste the complete code here, but see a summary of the lexer code below:

    public static void Lex(IntPtr instance, UIntPtr start_pos, IntPtr length_doc, int init_style, IntPtr p_access)
    {
        // ..
        IDocumentVtable vtable = (IDocumentVtable)Marshal.PtrToStructure((IntPtr)idoc.VTable, typeof(IDocumentVtable));

        // range to lex, in document positions
        int start = (int)start_pos;
        int length = (int)length_doc;

        // allocate a buffer and copy the document range into it
        IntPtr buffer_ptr = Marshal.AllocHGlobal(length);
        vtable.GetCharRange(p_access, buffer_ptr, (IntPtr)start, (IntPtr)length);

        //..

        // convert the buffer into a managed string
        string content = Marshal.PtrToStringAnsi(buffer_ptr, length);
        length = content.Length;

        for (int i = 0; i < length - 1; i++)
        {
            // check for separator character
            char cur = content[i];
            if (cur == ',') end_col = i;

            //..

            // if end of column found
            if (bEndOfColumn)
            {
                // style the column
                vtable.StartStyling(p_access, (IntPtr)(start));
                vtable.SetStyleFor(p_access, (IntPtr)(end_col - start), (char)idx);

                // etc.
            }
        }
    }

    It probably should read the text file content differently, but still be able to track both the character and byte positions somehow. I mean it should inspect each character to determine whether it is the column separator, but at the same time it should be able to compute the correct parameters for the StartStyling and SetStyleFor functions.
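
    For example, conceptually something like this rough sketch (untested, ignores surrogate pairs; buffer_ptr, length and end_col as in the code above), where the loop keeps a running byte offset next to the character index:

    // decode the raw bytes as UTF-8, then track chars and bytes in parallel
    byte[] raw = new byte[length];
    Marshal.Copy(buffer_ptr, raw, 0, length);
    string content = System.Text.Encoding.UTF8.GetString(raw);

    int bytePos = 0; // position in the document, in bytes
    for (int i = 0; i < content.Length; i++)
    {
        char cur = content[i];
        if (cur == ',') end_col = bytePos; // column boundary as a *byte* position for StartStyling/SetStyleFor
        bytePos += System.Text.Encoding.UTF8.GetByteCount(new[] { cur }); // 1 to 3 bytes per BMP char
    }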

    Any ideas on how to fix this?

    • Bas de Reuver
      Sep 12, 2022, 9:39 PM (last edited Sep 12, 2022, 9:42 PM)

      Btw you can also see the different “text parsing”(?) by using the Hex editor plug-in. See screenshots below.

      Hex editor with Unicode UTF-8 disabled:
      csvlint_hex_unicode_disabled.png

      Hex editor with Unicode UTF-8 enabled:
      csvlint_hex_unicode_enabled.png

      Whether Unicode UTF-8 is disabled or enabled, the bytes of the text file are exactly the same in both cases. In the “Dump” part you can see extra characters when Unicode UTF-8 is disabled, and there are exactly 16 characters in each row. But when Unicode UTF-8 is enabled, the “Dump” displays the correct characters, although each row doesn’t always contain exactly 16 characters.
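
      You can reproduce those “extra characters” in a couple of lines of C# by decoding the same bytes both ways (a quick sketch, using System and System.Text):

      // the UTF-8 bytes of "zwölf", decoded two different ways
      byte[] bytes = { 0x7A, 0x77, 0xC3, 0xB6, 0x6C, 0x66 };
      Console.WriteLine(Encoding.UTF8.GetString(bytes));              // zwölf  (5 chars)
      Console.WriteLine(Encoding.GetEncoding(1252).GetString(bytes)); // zwÃ¶lf (6 chars, the "extra characters")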

      Also, you can see that the out-of-sync colors are caused by how many “special characters” have been encountered up to that point in the text. See screenshot below.

      csvlint_zoom_in.png

      This is the content of the text file:

      1001;zwölf Häuser;GERMAN;Twelve houses
      2002;Frühstückskäffchen;GERMAN;Little breakfast coffee
      3003;Käsesoßenrührlöffel;GERMAN;Cheese sauce stirring spoon
      4004;Äpfel;GERMAN;Apple
      5005;Mädchen;GERMAN;Girl
      6006;Chloë;FRENCH;Girl name
      7007;Alizée;FRENCH;Girl name
      8008;你好 (nǐ hǎo);CHINESE;Hello
      9009;再见 (zài jiàn);CHINESE;Good bye
      
      • guy038
        Sep 12, 2022, 10:49 PM

        Hello @bas-de-reuver and All

        Just a remark: since I’m French, on my system this parameter can be found in:

        Settings > Time & Language > Language > Administrative language settings > Change system locale > Beta: Use Unicode UTF-8 for worldwide language support

        Best Regards,

        guy038

        • Ekopalypse @Bas de Reuver
          Sep 13, 2022, 7:25 AM

          @Bas-de-Reuver

          A guess, maybe try PtrToStringAuto instead of PtrToStringAnsi !?

          • rdipardo @Ekopalypse
            Sep 14, 2022, 3:18 AM

            “A guess, maybe try PtrToStringAuto instead of PtrToStringAnsi !?”

            No. The problem is that the separator-finding loop is only counting the lowest byte of every character. Higher code points occupy multiple bytes in UTF-8 (see for example how line 2 is off by the exact number of umlauts: ö, ä). The CLR char type transparently covers up the difference, since everything is UTF-16 internally.

            @Bas-de-Reuver, you should be comparing bytes, not chars; like this:

            diff --git a/CSVLintNppPlugin/PluginInfrastructure/Lexer.cs b/CSVLintNppPlugin/PluginInfrastructure/Lexer.cs
            index 5e0c53d..0f8217b 100644
            --- a/CSVLintNppPlugin/PluginInfrastructure/Lexer.cs
            +++ b/CSVLintNppPlugin/PluginInfrastructure/Lexer.cs
            @@ -467,8 +467,18 @@ public static void Lex(IntPtr instance, UIntPtr start_pos, IntPtr length_doc, in
                         string content = Marshal.PtrToStringAnsi(buffer_ptr, length);
             
                         // TODO: fix this; this is just a quick & dirty way to prevent index overflow when Windows = code page 65001
            +            // Start by assuming an ANSI CP like 1252 => 1 byte / character
                         length = content.Length;
            +            var byteBuf = new char[length];
            +            content.CopyTo(0, byteBuf, 0, content.Length);
            +            byte[] contentBytes = System.Text.Encoding.Default.GetBytes(byteBuf);
             
            +            if (OEMEncoding.GetACP() == 65001)
            +            {
            +                // CP UTF-8 => 1 or more bytes per character; count them all!
            +                contentBytes = System.Text.Encoding.UTF8.GetBytes(byteBuf);
            +                length = contentBytes.Length;
            +            }
                         // column color index
                         int idx = 1;
                         bool isEOL = false;
            @@ -576,10 +586,10 @@ public static void Lex(IntPtr instance, UIntPtr start_pos, IntPtr length_doc, in
                             char quote_char = Main.Settings.DefaultQuoteChar;
                             bool whitespace = true; // to catch where value is just two quotes "" right at start of line
             
            -                for (i = 0; i < length - 1; i++)
            +                for (i = 0; i < contentBytes.Length - 1; i++)
                             {
            -                    char cur = content[i];
            -                    char next = content[i + 1];
            +                    byte cur = contentBytes[i];
            +                    byte next = contentBytes[i + 1];
             
                                 if (!quote)
                                 {
            @@ -640,7 +650,7 @@ public static void Lex(IntPtr instance, UIntPtr start_pos, IntPtr length_doc, in
                             vtable.StartStyling(p_access, (IntPtr)(start + start_col));
                             vtable.SetStyleFor(p_access, (IntPtr)(length - start_col), (char)idx);
                             // exception when csv AND separator character not colored AND file ends with separator so the very last value is empty
            -                if ( (separatorChar != '\0') && (!sepcol) && (content[length-1] == separatorChar) )
            +                if ( (separatorChar != '\0') && (!sepcol) && (contentBytes[length-1] == separatorChar) )
                             {
                                 // style empty value between columns
                                 vtable.StartStyling(p_access, (IntPtr)(start + i));
            @@ -943,4 +953,10 @@ public static IntPtr PropertyGet(IntPtr instance, IntPtr key)
                             return Marshal.StringToHGlobalAnsi($"{value}\0"); 
                     }
                 }
            +
            +    internal static class OEMEncoding
            +    {
            +        [DllImport("Kernel32.dll")]
            +        public static extern uint GetACP();
            +    }
             }
            
            
            • rdipardo @rdipardo
              Sep 15, 2022, 4:19 AM

              @rdipardo said in Custom lexer and Unicode UTF-8 text file content:

              you should be comparing bytes, not chars

              See the fuller explanation in my PR comment: https://github.com/BdR76/CSVLint/pull/38

              • Ekopalypse @rdipardo
                Sep 15, 2022, 8:52 AM

                @rdipardo

                I fully agree with your explanation. I initially assumed that ANSI and Unicode documents would have to be treated differently, but that was not the case, which confused me. That a Unicode OS setup now poses this problem confuses me even more.

                This is what an ANSI and a Unicode document look like on an OS with an ANSI setup:

                302f8c33-b9d3-42f6-9a93-1f31171eec4f-image.png

                • rdipardo @Ekopalypse
                  Sep 15, 2022, 8:38 PM (last edited Sep 15, 2022, 8:40 PM)

                  As I’ve said before, there are two completely distinct settings involved here:

                  1. the system’s (“ANSI”) code page, controlling how GUI text is rendered (and also the default encoding of string data in the .NET Framework; see below)

                  2. the text encoding of the document, controlling how content bytes are saved to disk

                  In summary, this is a .NET problem (1.), not a Notepad++ problem.

                  C:\>type cp.csscript
                  System.Console.WriteLine("Default encoding of .NET strings: " + System.Text.Encoding.Default.EncodingName);
                  // vim: ft=cs
                  
                  C:\>reg query HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage /v ACP
                  
                  HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage
                      ACP    REG_SZ    1252
                  
                  C:\>csi cp.csscript
                  Default encoding of .NET strings: Western European (Windows)
                  
                  C:\>reg add HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage /v ACP /t REG_SZ /d 65001 /f
                  The operation completed successfully.
                  
                  C:\>csi cp.csscript
                  Default encoding of .NET strings: Unicode (UTF-8)
                  
                  • Bas de Reuver @rdipardo
                    Sep 16, 2022, 11:22 AM (last edited Sep 16, 2022, 11:24 AM)

                    @rdipardo thanks for clarifying the Windows settings, and for the PR, it looks very useful. One question though: if I understand correctly, the code in the PR creates a copy of the file content in a managed byte array. But the text data files can often be quite large, like 100MB, so then it would make a copy of 100MB, right?

                    That would basically double the memory usage and add performance overhead every time the syntax highlighting is applied to a csv file.

                    Is there maybe a way to use the buffer_ptr directly, like in this stackoverflow answer? I know it’s “unsafe” code from the CLR perspective, as in using an unmanaged memory pointer, but it would probably perform better. So instead of:

                    string content = Marshal.PtrToStringAnsi(buffer_ptr, length);
                    

                    do something like

                    byte* byte_buffer = (byte*)buffer_ptr;  // requires an unsafe context
                    for (var idx = 0; idx < length; idx++)  // etc.
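
                    Something like this minimal sketch, I mean (hypothetical; needs “Allow unsafe code” enabled in the project, with buffer_ptr and length as in the lexer code above):

                    unsafe
                    {
                        byte* bytes = (byte*)buffer_ptr;  // scan the unmanaged buffer in place, no managed copy
                        for (int idx = 0; idx < length - 1; idx++)
                        {
                            byte cur = bytes[idx];
                            byte next = bytes[idx + 1];
                            if (cur == (byte)';') { /* end of column, call StartStyling/SetStyleFor */ }
                        }
                    }

                    That way even a 100MB file would only be scanned, never copied.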
                    