Syntax highlighting and file encoding and the Windows code page (1252 or 936)

Notepad++ & Plugin Development
  • Bas de Reuver
    last edited by Bas de Reuver Apr 15, 2023, 8:52 PM

    The syntax highlighting in the CSV Lint plugin doesn’t work correctly in all cases. There is an issue when the Windows code page is 936 (Simplified Chinese, GBK) and the text file is UTF8.

    I want to fix the issue but it’s not as easy as I thought. There are 3 different parts that need to be coordinated properly for the syntax highlighting to work correctly.

    1. Scintilla StartStyling / SetStyleFor parameters
    2. CSV Lint Lexer and iterating over the characters
    3. Windows code page and string encoding

    First, the syntax highlighting colors in Notepad++ are set using the Scintilla StartStyling and SetStyleFor methods.
    These functions take byte positions as parameters. This is the same number you see as Pos in the status bar at the bottom when you move the cursor around the text file. Notice how the test file has 36 characters but a length of 58 bytes, so some characters take 1 byte and others take more than 1 byte.
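
To make the difference between character count and byte position concrete, here is a minimal C# sketch. The sample line is made up for illustration (it is not the test file from the screenshots), and the non-ASCII characters are written as escapes so the result does not depend on how the source file itself is saved.

```csharp
using System;
using System.Text;

// Hypothetical sample: "naïve;中文;ok" written with escapes.
string line = "na\u00EFve;\u4E2D\u6587;ok";

int charCount = line.Length;                      // UTF-16 code units in the string
int byteCount = Encoding.UTF8.GetByteCount(line); // bytes the line occupies in a UTF8 file

// Character index of the last separator versus its byte position.
int sepCharIndex = line.LastIndexOf(';');
int sepBytePos = Encoding.UTF8.GetByteCount(line.Substring(0, sepCharIndex));

Console.WriteLine($"{charCount} characters, {byteCount} bytes");
Console.WriteLine($"separator at char index {sepCharIndex}, byte position {sepBytePos}");
```

The byte position is the value that matches the Pos number in the status bar, so it is the one the styling calls would need.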

    When Windows is set to use code page 936 (Chinese), the test files require different parameters to get the correct colors in the UTF8 file and in the ANSI file.
    I tested it by hardcoding the positions and then looking at the results, see screenshots below.

    csvlint_code_page_936.png

    The second point is the way the Lexer iterates through the characters and searches for the separator character. The idea is to iterate through the characters, check whether each one is the next separator character, and call the StartStyling and SetStyleFor functions accordingly. However, as mentioned, the character count and the byte/position count are different.
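
One way to keep the two counts in step while still iterating over characters is to carry a running byte position alongside the loop. The sketch below is only an illustration under assumptions: `Utf8Width` and `ColumnByteRanges` are hypothetical helpers, not CSV Lint code, the text is assumed to be stored as UTF8, and quoted fields are ignored.

```csharp
using System;
using System.Collections.Generic;

// Width in UTF8 bytes of one UTF-16 code unit. Each half of a surrogate
// pair counts as 2, so a full pair adds up to the 4 bytes it occupies.
static int Utf8Width(char c)
{
    if (c < 0x80) return 1;
    if (c < 0x800 || char.IsSurrogate(c)) return 2;
    return 3;
}

// Hypothetical helper: (start, length) of each column, in BYTE positions.
static List<(int Start, int Length)> ColumnByteRanges(string line, char separator)
{
    var ranges = new List<(int Start, int Length)>();
    int bytePos = 0;     // running byte position of the current character
    int columnStart = 0; // byte position where the current column begins
    foreach (char c in line)
    {
        int width = Utf8Width(c);
        if (c == separator)
        {
            ranges.Add((columnStart, bytePos - columnStart));
            columnStart = bytePos + width; // next column starts after the separator
        }
        bytePos += width;
    }
    ranges.Add((columnStart, bytePos - columnStart)); // last column
    return ranges;
}

// Made-up sample line: "naïve;中文;ok"
var columns = ColumnByteRanges("na\u00EFve;\u4E2D\u6587;ok", ';');
foreach (var (start, length) in columns)
    Console.WriteLine($"start={start} length={length}");
```

Each (start, length) pair is in the same unit as the Pos number in the status bar, which is what the styling functions are described as expecting.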

    csvlint_column_position_936.png

    Currently the plug-in gets the text range using GetCharRange, then converts it with Marshal.PtrToStringAnsi and then ByteStream.GetBytes,
    see code here

    It’s probably due to the encoding used when getting the byte stream in the custom function, but it’s not obvious to me what should be changed here, or whether that is the only thing that needs to be changed.

    The third thing is the Windows code page and the internal string encoding, but I think that mostly works correctly now after some changes in this commit by @rdipardo

    • Mark Olson
      last edited by Mark Olson Apr 15, 2023, 9:44 PM Apr 15, 2023, 9:43 PM

      Without thinking too carefully about your requirements or whether this will actually help much, I came up with this function for calculating the number of UTF8 bytes in a range, adjusted slightly from something I use in JsonTools. For an entire string you’d use Encoding.UTF8.GetByteCount(string s), but the below function could maybe be adjusted to your needs:

      public static int UTF8BytesBetween(string text, int start, int end)
      {
          int utf8Bytes = end - start; // start by assuming pure ASCII
          for (int ii = start; ii < end; ii++)
          {
              char c = text[ii];
              if (c > 127) // not ASCII
              {
                  if (c < 2048 || (c >= 0xd800 && c <= 0xdfff))
                  {
                      // 2 bytes: either a low non-ASCII char (below U+0800), or one half
                      // of a surrogate pair (e.g., half of an emoji; the pair is 4 bytes)
                      utf8Bytes++;
                  }
                  else
                  {
                      // 3 bytes: a non-surrogate char at U+0800 or above, like most Chinese chars
                      utf8Bytes += 2;
                  }
              }
          }
          return utf8Bytes;
      }
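
As a quick sanity check, the function can be compared against Encoding.UTF8.GetByteCount on a substring. The sketch below repeats the function so that it is self-contained (with `public` dropped so it compiles as a local function in a top-level program); the sample string is made up, and the comparison assumes the range does not split a surrogate pair.

```csharp
using System;
using System.Text;

static int UTF8BytesBetween(string text, int start, int end)
{
    int utf8Bytes = end - start; // start by assuming pure ASCII
    for (int ii = start; ii < end; ii++)
    {
        char c = text[ii];
        if (c > 127) // not ASCII
        {
            if (c < 2048 || (c >= 0xd800 && c <= 0xdfff))
                utf8Bytes++;    // 2-byte char, or one half of a surrogate pair
            else
                utf8Bytes += 2; // 3-byte char
        }
    }
    return utf8Bytes;
}

// Made-up sample: 'a', 'é', '中', one emoji (a surrogate pair), 'z'
string sample = "a\u00E9\u4E2D\U0001F600z";

int whole = UTF8BytesBetween(sample, 0, sample.Length);
int middle = UTF8BytesBetween(sample, 1, 3); // just "é中"
int wholeViaEncoding = Encoding.UTF8.GetByteCount(sample);
int middleViaEncoding = Encoding.UTF8.GetByteCount(sample.Substring(1, 2));

Console.WriteLine($"{whole} == {wholeViaEncoding}, {middle} == {middleViaEncoding}");
```

The function avoids allocating a substring for every range, which may matter when it is called once per column.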
      
      • Bas de Reuver
        Apr 15, 2023, 10:20 PM

        Wait a minute, I just realised that it is converting a byte buffer to a string and then back to a byte buffer. It should just convert the content buffer straight to a byte array and work with that.

        I’ve tested it with code pages 936 and 1252 and it seems to work correctly in both cases.
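
A rough sketch of what working directly on the byte array could look like, with no string round trip in between. This is an illustration under assumptions, not the actual CSV Lint change: `StyleRuns` is a hypothetical helper, each run is assumed to include its trailing separator so that the runs are contiguous, and quoted fields are ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: split one line of raw document bytes into runs of
// (start, length, column index), all measured in bytes.
static List<(int Start, int Length, int Column)> StyleRuns(byte[] content, byte separator)
{
    var runs = new List<(int Start, int Length, int Column)>();
    int runStart = 0;
    int column = 0;
    for (int pos = 0; pos < content.Length; pos++)
    {
        if (content[pos] == separator)
        {
            // Assumption: the separator is styled together with the column it ends.
            runs.Add((runStart, pos - runStart + 1, column));
            runStart = pos + 1;
            column++;
        }
    }
    if (runStart < content.Length)
        runs.Add((runStart, content.Length - runStart, column)); // last column
    return runs;
}

// Stand-in for the buffer read from the editor: UTF8 bytes of "naïve;中文;ok"
byte[] content = Encoding.UTF8.GetBytes("na\u00EFve;\u4E2D\u6587;ok");
var runs = StyleRuns(content, (byte)';');
foreach (var (start, length, column) in runs)
    Console.WriteLine($"column {column}: start={start} length={length}");
```

In a plugin the buffer would come from the editor rather than from Encoding.UTF8.GetBytes, for example by copying the unmanaged buffer with Marshal.Copy instead of decoding it with Marshal.PtrToStringAnsi; that detail is an assumption here. Scanning for an ASCII separator byte is safe in UTF8, where every byte of a multi-byte character is 0x80 or higher. In an ANSI file under code page 936, the second byte of a two-byte character can fall in the range 0x40 to 0x7E, so a separator such as `|` may need extra care, while comma, semicolon and tab lie outside that range.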

        The Community of users of the Notepad++ text editor.