Community
    • Login

    Syntax highlighting and file enconding and the Windows code page (1252 or 936)

    Scheduled Pinned Locked Moved Notepad++ & Plugin Development
    3 Posts 2 Posters 684 Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • Bas de ReuverB
      Bas de Reuver
      last edited by Bas de Reuver

      The syntax highlighting in the CSV Lint plugin doesn’t work correctly in all cases. There is an issue when the Windows code page is 936 = Chinese character set and the text file is UTF8.

      I want to fix the issue but it’s not as easy as I thought. There are 3 different parts that need to be coordinated properly for the syntax highlighting to work correctly.

      1. Scintilla StartStyle / SetStyle parameters
      2. CSV Lint Lexer and iterating over the characters
      3. Windows code page and string encoding

      The first point, the syntax highlighting colors in Notepad++ are set using is Scintilla StartStyling and SetStyleFor methods.
      These functions require the byte position as parameters, this is the same as when you move the cursor around the text file and look at the Pos number at the bottom in the status bar. Notice how the test file has 36 characters but the length is 58 bytes, so some characters are 1 byte and some characters are more than 1 byte.

      When Windows is set to use code page 936 (Chinese), then the test files require different parameters to get the correct colors in the UTF8 file and the ANSI file.
      I tested it by setting the positions hardcoded and then looking at the results, see screenshots below.

      csvlint_code_page_936.png

      The second point is the way the Lexer iterates through the characters and searches for the separator character. The idea is to iterate through the characters, check if it is the next separator character and call the StartStyling and SetStyleFor functions accordingly. However as mentioned, the character count and the bytes/position count is different.

      csvlint_column_position_936.png

      Currently the plug-in gets the text range using GetCharRange, and then Marshal.PtrToStringAnsi and then ByteStream.GetBytes,
      see code here

      It’s probably due to the encoding when getting the bytestream in the custom function, but it’s not obvious to me what should be to change here, and wether that is the only thing that is needs to be changed.

      The third thing is the Windows code page and the internal string encoding, but I think that works mostly correct now after some changes in this commit by @rdipardo

      1 Reply Last reply Reply Quote 0
      • Mark OlsonM
        Mark Olson
        last edited by Mark Olson

        Without thinking too carefully about your requirements or whether this will actually help much, I came up with this function for calculating the number of UTF8 bytes in a range, adjusted slightly from something I use in JsonTools. For an entire string you’d use Encoding.UTF8.GetByteCount(string s), but the below function could maybe be adjusted to your needs:

        public static int UTF8BytesBetween(string text, int start, int end)
        {
            int utf8Bytes = end - start; // start by assuming pure ASCII
            for (int ii = start; ii < end; ii++)
            {
                char c = text[ii];
                if (c > 127) // not ASCII
                {
                    if (c < 2048 || (c >= 0xd800 && c <= 0xdfff))
                        // one char in a surrogate pair (e.g., half of an emoji)
                        // or just a low non-ASCII thing 
                        utf8Bytes++;
                    // some non-surrogate big char like most Chinese chars
                    else utf8Bytes += 2;
                }
            }
            return utf8Bytes;
        }
        
        1 Reply Last reply Reply Quote 1
        • Bas de ReuverB
          Bas de Reuver
          last edited by

          Wait a minute, I just realised that it is converting a bytebuffer to a string and then back to a byte buffer. It should just convert the content buffer straight to a byte array and work with that.

          I’ve tested it with the code page 936 and 1252 and it seems to work correctly in both cases.

          1 Reply Last reply Reply Quote 3
          • First post
            Last post
          The Community of users of the Notepad++ text editor.
          Powered by NodeBB | Contributors