Characters Not Appearing Correctly

Jdl Jacob

Hello,

I downloaded Notepad++ so I can search for a string in multiple files at once, however the program isn’t displaying the files correctly. These files are basically level files from a game, and they have a lot of null bytes along with regular text. Microsoft Notepad is able to open these files correctly, and only a few of these files are able to be opened correctly in Notepad++ so that the strings are visible. The only solution I’ve found so far is to go to Encoding>Character Sets and select anything in there, but I really shouldn’t have to do that to every file. Some of my friends who use Notepad++ do not need to do that for these files, so it appears to be something on my end? If it helps, I’m using Windows 8.1.

Thanks.

Klaus Lehmann

Hi Jacob,
You can patch SciLexer.dll.

Here is a patch that gives the hex 00 byte a better display ;-)

code
in decimal:
search for: \000\078\085\076\000\083\079\072\000
replace with: \000\048\000\000\000\083\079\072\000

in hex:
-file lexer.dlz “\x00\x4e\x55\x4c\x00\x53\x4f\x48\x00\x00\x30\x00\x00\x00\x53\x4f\x48\x00_”

I think You can do this with perl.exe!

yours klaus

guy038

Hello Jdl Jacob, Klaus and All,

From what you said, you are trying to extract valid strings, without control characters, from an executable program’s code, in order to run further searches on these strings, aren’t you? It’s a well-known general problem!

A first approach would be to simply delete every C0 control character (from \x00 to \x1f). However, this method is too restrictive. Indeed, executable programs may contain UNICODE strings, stored with their true Unicode code-points in the UCS-2 Little Endian encoding (two bytes per character, with the least significant byte first).

For instance, taking the string Test One, on two lines, the ANSI encoding gives us the list of bytes \x54\x65\x73\x74\x0D\x0A\x4F\x6E\x65, whereas the same UNICODE string gives the list of bytes \x54\x00\x65\x00\x73\x00\x74\x00\x0D\x00\x0A\x00\x4F\x00\x6E\x00\x65\x00
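As a quick check of those two byte lists, here is a minimal Python sketch. Python is not used anywhere in this thread; cp1252 is assumed as the ANSI code page, and UTF-16 LE stands in for UCS-2 Little Endian, which gives identical bytes for these characters.

```python
# "Test" and "One" on two lines, separated by a Windows EOL
text = "Test\r\nOne"

ansi_bytes = text.encode("cp1252")     # one byte per character (assumed code page)
ucs2_bytes = text.encode("utf-16-le")  # two bytes per character, low byte first


def as_escapes(data: bytes) -> str:
    """Render bytes in the \\xNN notation used in this thread."""
    return "".join(f"\\x{b:02X}" for b in data)


print(as_escapes(ansi_bytes))
print(as_escapes(ucs2_bytes))
```

Every second byte of the second list is \x00, which is why deleting all NUL bytes too early would hide the difference between the two forms.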

So, you can’t simply search for the string Test One and ignore the UNICODE strings. In our example, in addition to the classical search of the string Test One, on two lines, with the Match case option checked, a search for \x54\x00\x65\x00\x73\x00\x74\x00\x0D\x00\x0A\x00\x4F\x00\x6E\x00\x65\x00 would give a second, valid match!
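The same idea, searching for both byte forms of a string, can be sketched outside Notepad++. The helper names `find_all` and `find_text` below are hypothetical, cp1252 is an assumed ANSI code page, and the search is case-sensitive, like the Match case option.

```python
def find_all(data: bytes, needle: bytes) -> list[int]:
    """Return every offset at which needle occurs in data."""
    offsets = []
    pos = data.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(needle, pos + 1)
    return offsets


def find_text(data: bytes, text: str) -> dict[str, list[int]]:
    """Look for text both as ANSI bytes and as UTF-16 LE bytes."""
    return {
        "ansi": find_all(data, text.encode("cp1252")),
        "utf-16-le": find_all(data, text.encode("utf-16-le")),
    }


# Small made-up buffer containing the string once in each encoding
blob = b"\x00\x01Test\r\nOne\xff" + "Test\r\nOne".encode("utf-16-le") + b"\x00"
hits = find_text(blob, "Test\r\nOne")
print(hits)
```

To search real files, the bytes could be read with `open(path, "rb").read()` and passed to `find_text` one file at a time.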

Now, you should understand the need to NOT delete the NUL character ( \x00 ), too soon, as it’s part of UNICODE strings, which may contain useful information, too.

Let us try, on an example executable file, to extract all the pertinent strings. Of course, I don’t have any game level file, but we can simply use a copy of the Notepad++.exe file. I think that you’ll just have to follow the same method for your specific game level files!

IMPORTANT: For this example, I will use the 6.8 version of Notepad++.

So, first of all, copy your Notepad++.exe file and rename the copy as Test.txt.

From now on, we’re going to perform some successive Searches/Replacements ( CTRL + H ), on the Test.txt file, using regular expressions.

For all the S/R, below, I suppose that :

The cursor location is the very beginning of the Test.txt file
The Regular expression radio button is checked, in the Replace dialog
The . matches newline option is UNCHECKED, in the Replace dialog
All other options are unchecked

The initial state of the Test.txt file is 9 194 lines long, for 2 054 656 bytes

1. As the classical EOL characters \n and \r don’t have any meaning in an executable file, we first normalize all kinds of End of Line characters to the string \r\n, to get a classical Windows text file! So, SEARCH = \r\0\n\0|\r\n|\r|\n and REPLACEMENT = \r\n
2. We must now delete any character which is neither a standard character, nor an EOL character, nor the NUL character. So, SEARCH = [^\0\n\r\x20-\x7e]+ and REPLACEMENT = NOTHING
3. As, in UNICODE strings, any NUL character is normally separated from the next NUL character by a standard character, we can replace any run of consecutive NUL characters (except the first one, which may be the last byte of a UNICODE string) with an EOL, to easily notice all valid ANSI or UNICODE strings. So, SEARCH = (?<=\0)\0+ and REPLACEMENT = \r\n

=> Now, the Test.txt file is 64 341 lines long, for 817 228 bytes. All lines have a Windows EOL and the file contains mostly standard characters and also some NUL characters (\x00). But you’ll notice that, from now on, there is no longer any sequence of two consecutive NUL characters!
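For readers who prefer a script, the first three S/R can be approximated on raw bytes with Python’s `re` module. This is only a sketch under assumptions: the function name `normalize_binary` is hypothetical, and Python works on the file’s bytes whereas Notepad++ works on the decoded document, so line and byte counts may differ from the figures above.

```python
import re


def normalize_binary(data: bytes) -> bytes:
    """Approximate S/R 1 to 3 on a bytes buffer."""
    # S/R 1: normalize every kind of EOL (including UTF-16 LE CR LF) to \r\n
    data = re.sub(rb"\r\x00\n\x00|\r\n|\r|\n", b"\r\n", data)
    # S/R 2: drop everything that is not NUL, EOL or printable ASCII
    data = re.sub(rb"[^\x00\n\r\x20-\x7e]+", b"", data)
    # S/R 3: keep the first NUL of a run, turn the rest of the run into an EOL
    data = re.sub(rb"(?<=\x00)\x00+", b"\r\n", data)
    return data


sample = b"ab\x00\x00\x00\x01\x02cd\ne\x00f\x00"
result = normalize_binary(sample)
print(result)
```

After this pass, no two NUL bytes are adjacent any more, which matches the observation made about Test.txt.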

4. As NUL characters placed in the replacement part are NEVER re-written (Bug!), we simply have to change any NUL character into a specific character which is NOT part of the standard characters. I chose the Bullet, of ANSI code \x95 (or \x{2022}, its UNICODE code-point, in a NON-ASCII file). So, SEARCH = \0 and REPLACEMENT = \x95
5. Now, we’ll try to isolate and mark the different UNICODE strings. However, we must take care of a special case, where an ANSI string is followed by a UNICODE string, with only one NUL (or Bullet) character as a separator.

For instance, assuming that the symbol Ø stands for the NUL character, the sequence TestØTØeØsØtØ must be decoded as the ANSI string Test, followed by the same UNICODE string, with a NUL character as separator (and NOT as the ANSI string Tes, immediately followed by the UNICODE string tTest)

=> SEARCH = (?<![\x20-\x7e][\x20-\x7e])(?:[\x20-\x7e]\x95){3,} and REPLACEMENT = \r\n\x93$0\x94\r\n

Note: I suppose that any valid UNICODE string must contain at least three characters. So, we search for a minimum of 3 consecutive pairs standard character + NUL (now a Bullet), ONLY IF the match is NOT preceded by 2 standard characters.

In the replacement part, we re-write the entire search match $0, surrounded by the double quotation marks ( \x93 or \x{201c} and \x94 or \x{201d} ), and preceded and followed by an EOL.
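The TestØTØeØsØtØ case can be checked with a short Python sketch of S/R 4 and 5 on bytes. The name `mark_utf16_strings` is hypothetical, and a callable is used for the replacement because Python’s replacement templates don’t accept \x escapes.

```python
import re

# At least 3 pairs "printable byte + bullet", not preceded by 2 printable bytes
UTF16_RUN = re.compile(rb"(?<![\x20-\x7e][\x20-\x7e])(?:[\x20-\x7e]\x95){3,}")


def mark_utf16_strings(data: bytes) -> bytes:
    """Approximate S/R 4 and 5 on a bytes buffer."""
    # S/R 4: stand-in character (bullet, \x95) for every NUL byte
    data = data.replace(b"\x00", b"\x95")
    # S/R 5: put each UNICODE string on its own line, between \x93 and \x94
    return UTF16_RUN.sub(lambda m: b"\r\n\x93" + m.group(0) + b"\x94\r\n", data)


sample = b"Test\x00T\x00e\x00s\x00t\x00"  # ANSI "Test", one NUL, UTF-16 LE "Test"
marked = mark_utf16_strings(sample)
print(marked)
```

The negative look-behind is what rejects a match starting at the final t of the ANSI string: it is preceded by the two standard characters e and s, so the UNICODE string is correctly taken to start at the second T.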

6. As the UNICODE strings are now clearly identified, we can get rid of the Bullet character (which represented the NUL symbol) inside a “…” sequence, for easier reading! So, SEARCH = (?=.*\x94)\x95 and REPLACEMENT = NOTHING

Note: At each position where it meets a Bullet, the look-ahead structure checks whether a closing double quotation mark (\x94) exists further on in the current line. This way, we’re sure that the deleted bullet was, indeed, part of a UNICODE string ONLY!
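A minimal Python sketch of this look-ahead follows; the function name `strip_bullets_in_quotes` is hypothetical. In Python a dot never matches \n, so the look-ahead stays within the current line, much like the unchecked “. matches newline” option.

```python
import re


def strip_bullets_in_quotes(data: bytes) -> bytes:
    """Approximate S/R 6: delete a bullet only if \\x94 follows on the same line."""
    return re.sub(rb"(?=.*\x94)\x95", b"", data)


# First line: ANSI string plus its separator; second line: a marked UNICODE string
marked = b"Test\x95\r\n\x93T\x95e\x95s\x95t\x95\x94\r\n"
cleaned = strip_bullets_in_quotes(marked)
print(cleaned)
```

The bullet after the ANSI string survives because no closing quotation mark follows it on its own line.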

=> After these 3 other S/R, the Test.txt file is, now, 75 625 lines long for a size of 796 660 bytes. The UNICODE strings are correctly extracted.

7. We can, from now on, replace the remaining bullet characters, located between the ANSI strings, with an EOL, to clearly see the ANSI strings. So, SEARCH = \x95 and REPLACEMENT = \r\n
8. With all the EOL characters successively added, we must clean up the file! We’re going to delete any line which is empty or contains ONLY space characters. So, SEARCH = ^ *\R and REPLACEMENT = NOTHING
9. Finally, as for the UNICODE strings, we’ll delete any ANSI string containing fewer than 3 characters. So, SEARCH = ^.{1,2}\R and REPLACEMENT = NOTHING
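These last three S/R can be sketched in Python as well. The name `finish_cleanup` is hypothetical. Python’s `re` has no \R, so the sketch matches \r\n explicitly, assuming every EOL was already normalized by the first S/R, and it uses [^\r\n] instead of a dot because a Python dot would also match \r.

```python
import re


def finish_cleanup(data: bytes) -> bytes:
    """Approximate S/R 7 to 9 on a bytes buffer."""
    # S/R 7: each remaining bullet separates two ANSI strings
    data = data.replace(b"\x95", b"\r\n")
    # S/R 8: delete lines that are empty or contain only spaces
    data = re.sub(rb"^ *\r\n", b"", data, flags=re.MULTILINE)
    # S/R 9: delete lines of 1 or 2 characters
    data = re.sub(rb"^[^\r\n]{1,2}\r\n", b"", data, flags=re.MULTILINE)
    return data


sample = b"Test\x95\r\n\x93Test\x94\r\nab\x95  \x95Hello"
final = finish_cleanup(sample)
print(final)
```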

After these 9 S/R, you get a Test.txt file 52 830 lines long, for a total of 566 652 bytes! But, looking through this file, it is easy enough to see that the valuable strings are located in two main zones: from line 36866 to line 41392 and from line 50857 to line 52094. Once these two parts are isolated, you’ll still have to manually delete some non-pertinent strings.

I personally ended up with a Test.txt file of 3667 lines. If you have a hexadecimal editor, you may translate some of these strings or sentences into your mother language. Don’t forget to rewrite any UNICODE string according to the UCS-2 Little Endian encoding, with the same length!

For instance, lines 39500 and 39501 of the Test.txt file contain the UNICODE strings “OVR” and “INS”, which give the writing mode of text (Overwriting or Insertion) in the N++ status bar. In French, my mother language, these two strings are RFP and INS. So I could change, in a copy of Notepad++.exe, the string \x4f\x00\x56\x00\x52\x00 into the sequence \x52\x00\x46\x00\x50\x00. Et voilà!
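A same-length patch of this kind can be sketched in Python rather than in a hexadecimal editor. The helper `patch_utf16` is hypothetical. Note that it replaces every occurrence of the byte sequence, so on a real executable you would want to check the offsets first and work on a copy.

```python
def patch_utf16(data: bytes, old: str, new: str) -> bytes:
    """Replace a UTF-16 LE string with another one of exactly the same length."""
    old_bytes = old.encode("utf-16-le")
    new_bytes = new.encode("utf-16-le")
    if len(old_bytes) != len(new_bytes):
        # A different length would shift every following byte of the file
        raise ValueError("replacement must have the same encoded length")
    return data.replace(old_bytes, new_bytes)


# Made-up buffer: the UNICODE string OVR surrounded by two other bytes
buffer = b"\x01O\x00V\x00R\x00\x02"
patched = patch_utf16(buffer, "OVR", "RFP")
print(patched)
```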

I hope, Jdl Jacob, that you can use the same S/R above, for your specific needs, on your game level files. You could even shorten the procedure a bit if your files don’t contain any UNICODE string. The S/R numbers 4, 5 and 6 would then be useless. In that case, the 7th S/R needs a small change: SEARCH = \0 and REPLACEMENT = \r\n
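That shortened, ANSI-only variant could look like the Python sketch below. It is an approximation on raw bytes: the function name `ansi_strings` is hypothetical, and \r\n is matched explicitly because Python’s `re` has no \R.

```python
import re


def ansi_strings(data: bytes) -> bytes:
    """Approximate S/R 1, 2, 3, the modified 7, 8 and 9 for files without UNICODE strings."""
    data = re.sub(rb"\r\x00\n\x00|\r\n|\r|\n", b"\r\n", data)   # S/R 1
    data = re.sub(rb"[^\x00\n\r\x20-\x7e]+", b"", data)         # S/R 2
    data = re.sub(rb"(?<=\x00)\x00+", b"\r\n", data)            # S/R 3
    data = data.replace(b"\x00", b"\r\n")                       # S/R 7, modified
    data = re.sub(rb"^ *\r\n", b"", data, flags=re.MULTILINE)   # S/R 8
    data = re.sub(rb"^[^\r\n]{1,2}\r\n", b"", data, flags=re.MULTILINE)  # S/R 9
    return data


sample = b"ab\x00\x00\x00\x01\x02cd\nHello\x00World\x00\x00"
strings = ansi_strings(sample)
print(strings)
```

Each surviving line is then a candidate string of at least 3 characters, ready for an ordinary text search.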

Best Regards,

guy038