Hello Jdl Jacob, Klauss and All,
From what you said, you are trying to extract valid strings, free of control characters, from an executable program’s code, in order to run further searches on these strings, aren’t you ? It’s a general, well-known problem !
A first approach would be to simply delete any C0 control character ( from \x00 to \x1f ). However, this method is too restrictive. Indeed, executable programs may contain UNICODE strings, stored with their true Unicode code-points, in the UCS-2 Little Endian encoding ( two bytes per character, with the least significant byte first )
For instance, taking the string Test One, on two lines, the ANSI encoding gives us the logical list of bytes \x54\x65\x73\x74\x0D\x0A\x4F\x6E\x65 whereas the same UNICODE string gives the list of bytes \x54\x00\x65\x00\x73\x00\x74\x00\x0D\x00\x0A\x00\x4F\x00\x6E\x00\x65\x00
So, you can’t simply search for the string Test One and ignore UNICODE strings. In our example, in addition to the classical search for the string Test One, on two lines, with the Match case option checked, a search for \x54\x00\x65\x00\x73\x00\x74\x00\x0D\x00\x0A\x00\x4F\x00\x6E\x00\x65\x00 would give a second, valid match !
Now, you should understand the need to NOT delete the NUL character ( \x00 ), too soon, as it’s part of UNICODE strings, which may contain useful information, too.
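To make the difference between the two encodings concrete, here is a small Python sketch. It assumes that "ANSI" means the Windows-1252 code page and uses Python's utf-16-le codec, which gives the same bytes as UCS-2 Little Endian for these characters.

```python
# The two-line string from the example: "Test", a Windows EOL, then "One".
text = "Test\r\nOne"

# Assumption: "ANSI" is taken to be the single-byte Windows-1252 code page.
ansi_bytes = text.encode("cp1252")

# UTF-16LE and UCS-2 LE produce identical bytes for these characters.
ucs2_bytes = text.encode("utf-16-le")

print(ansi_bytes.hex(" "))   # 54 65 73 74 0d 0a 4f 6e 65
print(ucs2_bytes.hex(" "))   # 54 00 65 00 73 00 74 00 0d 00 0a 00 4f 00 6e 00 65 00

# Deleting every NUL byte collapses the UNICODE form into the ANSI one,
# so the two kinds of strings can no longer be told apart afterwards.
collapsed = ucs2_bytes.replace(b"\x00", b"")
print(collapsed == ansi_bytes)   # True
```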
Let us try, from an example executable file, to extract all pertinent strings. Of course, I don’t have any game level file, but we can simply use a copy of the Notepad++.exe file. I think that you’ll just have to follow the same method for your specific game level files !
IMPORTANT : For this example, I will use the 6.8 version of Notepad++
So, first of all, copy your Notepad++.exe file and rename the copy as Test.txt.
From now on, we’re going to perform successive Searches/Replacements ( S/R, with CTRL + H ) on the Test.txt file, using regular expressions.
For all the S/R below, I suppose that :
The cursor location is the very beginning of the Test.txt file
The Regular expression radio button is checked, in the Replace dialog
The . matches newline option is UNCHECKED, in the Replace dialog
All other options are unchecked
Initially, the Test.txt file is 9 194 lines long, for 2 054 656 bytes
As the classical EOL characters \n and \r don’t have any meaning in an executable file, we first normalize all kinds of End of Line characters to the string \r\n, to get a classical Windows text file ! So, SEARCH = \r\0\n\0|\r\n|\r|\n and REPLACEMENT = \r\n
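For readers who prefer a script, this EOL normalization can be sketched in Python on raw bytes. This is only an illustration with the standard re module, not what Notepad++ does internally, and the helper name normalize_eols is hypothetical.

```python
import re

# Same alternation as the SEARCH above: the UCS-2 LE form of CR+LF comes first,
# so that it is tried before the plain \r alternative.
EOL_PATTERN = re.compile(rb"\r\x00\n\x00|\r\n|\r|\n")

def normalize_eols(data: bytes) -> bytes:
    """Rewrite every kind of line break as a Windows EOL (hypothetical helper)."""
    return EOL_PATTERN.sub(b"\r\n", data)

sample = b"A\rB\nC\r\nD\r\x00\n\x00E"
print(normalize_eols(sample))   # b'A\r\nB\r\nC\r\nD\r\nE'
```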
We must now delete any character which is different from a standard character ( from \x20 to \x7e ), an EOL character or the NUL character. So, SEARCH = [^\0\n\r\x20-\x7e]+ and REPLACEMENT = NOTHING
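As a rough Python equivalent of this deletion, working on bytes ( the helper name drop_non_text and the sample bytes are made up for the illustration ) :

```python
import re

# Negated class: anything that is not NUL, LF, CR or printable ASCII ( \x20-\x7e ).
NON_TEXT = re.compile(rb"[^\x00\n\r\x20-\x7e]+")

def drop_non_text(data: bytes) -> bytes:
    """Delete every run of bytes that cannot belong to a string (hypothetical helper)."""
    return NON_TEXT.sub(b"", data)

sample = b"\x7fELF\x01\x02Hi\x00\xff\xfe!\r\n"
print(drop_non_text(sample))   # b'ELFHi\x00!\r\n'
```

Note that the NUL byte survives this step on purpose, since it may still belong to a UNICODE string.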
As any NUL character is normally separated from another NUL character by a standard character, in UNICODE strings, we can therefore change any list of consecutive NUL characters ( except the first one, which may be the last byte of a UNICODE string ) into an EOL, to easily notice all valid ANSI or UNICODE strings. So, SEARCH = (?<=\0)\0+ and REPLACEMENT = \r\n
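The look-behind trick can be checked quickly in Python. This is a sketch on an invented sample, with a hypothetical helper name split_nul_runs :

```python
import re

# A run of NULs that is itself preceded by a NUL: the first NUL of each
# group is left alone, the rest of the group becomes a single EOL.
NUL_RUN = re.compile(rb"(?<=\x00)\x00+")

def split_nul_runs(data: bytes) -> bytes:
    """Replace all but the first NUL of each run with a Windows EOL."""
    return NUL_RUN.sub(b"\r\n", data)

sample = b"abc\x00\x00\x00\x00d\x00e\x00\x00f"
print(split_nul_runs(sample))   # b'abc\x00\r\nd\x00e\x00\r\nf'
```

Isolated NULs, like the one between d and e, are untouched, which is what keeps UNICODE strings intact.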
=> Now, the Test.txt file is 64 341 lines long, for 817 228 bytes. All lines have a Windows EOL and the file contains mostly standard characters and, also, some NUL characters ( \x00 ). But you’ll notice that, from now on, there is no longer any sequence of two consecutive NUL characters !
As the NUL characters, placed in the replacement part, are NEVER re-written ( Bug ! ), we simply have to change any NUL character into a specific character, which is NOT part of the standard characters. I chose the Bullet, of ANSI code \x95, or \x{2022} ( its UNICODE code-point ), in a NON-ASCII file. So, SEARCH = \0 and REPLACEMENT = \x95
Now, we’ll try to isolate and mark the different UNICODE strings. However, we must take care of a special case, where an ANSI string is followed by a UNICODE string, with only one NUL ( or Bullet ) character as a separator.
For instance, assuming that the symbol Ø stands for the NUL character, the sequence TestØTØeØsØtØ must be decoded as the ANSI string Test, followed by the same UNICODE string, with a NUL character as separator ( and NOT as the ANSI string Tes, immediately followed by the UNICODE string tTest )
=> SEARCH = (?<![\x20-\x7e][\x20-\x7e])(?:[\x20-\x7e]\x95){3,} and REPLACEMENT = \r\n\x93$0\x94\r\n
Note : I suppose that any valid UNICODE string must contain at least three characters. So, we search for a minimum of 3 sequences of a standard character followed by a NUL ( Bullet ) character, ONLY IF the match is NOT preceded by 2 standard characters
In the replacement part, we re-write the entire search match $0, surrounded by double quotation marks ( \x93 or \x{201c} and \x94 or \x{201d} ), and preceded and followed by an EOL.
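Here is a Python sketch of this marking step, applied to the TestØTØeØsØtØ case, with the Bullet \x95 already standing for NUL. The helper name mark_wide_strings is hypothetical, and a replacement function is used because Python's re module rejects \x escapes in a replacement template.

```python
import re

# At least 3 pairs "standard character + Bullet", NOT preceded by
# two standard characters ( which would mean we are still inside an ANSI string ).
WIDE_STRING = re.compile(rb"(?<![\x20-\x7e][\x20-\x7e])(?:[\x20-\x7e]\x95){3,}")

def mark_wide_strings(data: bytes) -> bytes:
    """Put each UNICODE string on its own line, between \\x93 and \\x94."""
    return WIDE_STRING.sub(lambda m: b"\r\n\x93" + m.group(0) + b"\x94\r\n", data)

sample = b"Test\x95T\x95e\x95s\x95t\x95"
print(mark_wide_strings(sample))
# b'Test\x95\r\n\x93T\x95e\x95s\x95t\x95\x94\r\n'
```

The final t of the ANSI string Test is skipped as a starting point, because it is preceded by the two standard characters es, so the match begins at the capital T, as wanted.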
As the UNICODE strings are now clearly identified, we can get rid of the Bullet characters ( which represent the NUL characters ) inside a “…” sequence, for easier reading ! So, SEARCH = (?=.*\x94)\x95 and REPLACEMENT = NOTHING
Note : At each position where it matches a Bullet, the look-ahead structure checks whether a closing double quotation mark ( \x94 ) exists further on, in the current line. This way, we’re sure that each deleted Bullet was, indeed, part of a UNICODE string ONLY !
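The same look-ahead can be tried in Python. In this sketch, .* is written as [^\r\n]* so that the test clearly stays within the current line, and the helper name strip_marked_bullets is hypothetical.

```python
import re

# A Bullet is deleted only if a closing quotation mark \x94 follows on the same line.
BULLET_IN_QUOTES = re.compile(rb"(?=[^\r\n]*\x94)\x95")

def strip_marked_bullets(data: bytes) -> bytes:
    """Remove the Bullets located inside an already marked UNICODE string."""
    return BULLET_IN_QUOTES.sub(b"", data)

sample = b"Test\x95\r\n\x93T\x95e\x95s\x95t\x95\x94\r\nAbc\x95def\r\n"
print(strip_marked_bullets(sample))
# b'Test\x95\r\n\x93Test\x94\r\nAbc\x95def\r\n'
```

The Bullets on the first and third lines are kept, as these lines don't contain any closing quotation mark.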
=> After these 3 additional S/R, the Test.txt file is now 75 625 lines long, for a size of 796 660 bytes. The UNICODE strings are correctly extracted.
We can now replace the remaining Bullet characters, located between the ANSI strings, with an EOL, to clearly see the ANSI strings. So, SEARCH = \x95 and REPLACEMENT = \r\n
With all the EOL characters successively added, we must clean up the file ! We’re going to delete any line which is empty or contains ONLY SPACE characters. So, SEARCH = ^ *\R and REPLACEMENT = NOTHING
Finally, as for the UNICODE strings, we’ll delete any ANSI string containing fewer than 3 characters. So, SEARCH = ^.{1,2}\R and REPLACEMENT = NOTHING
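The two clean-up S/R ( empty lines, then lines of 1 or 2 characters ) could look like this in Python. Note that Python's re module has no \R escape, so this sketch assumes that every line already ends with \r\n, which is the case after the normalization done at the start. The helper name clean_up_lines is hypothetical.

```python
import re

# Multiline mode makes ^ match at the start of every line.
EMPTY_LINE = re.compile(rb"(?m)^ *\r\n")
SHORT_LINE = re.compile(rb"(?m)^[^\r\n]{1,2}\r\n")

def clean_up_lines(data: bytes) -> bytes:
    """Drop empty or space-only lines, then lines of fewer than 3 characters."""
    data = EMPTY_LINE.sub(b"", data)
    return SHORT_LINE.sub(b"", data)

sample = b"Hello\r\n\r\n   \r\nab\r\nabc\r\nx\r\nWorld\r\n"
print(clean_up_lines(sample))   # b'Hello\r\nabc\r\nWorld\r\n'
```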
After these 9 S/R, you get a Test.txt file 52 830 lines long, for a total of 566 652 bytes ! Looking through this file, it is easy enough to see that the valuable strings are located in two main zones : from line 36866 to line 41392 and from line 50857 to line 52094. Once these two parts are isolated, you’ll still have to manually delete some non-pertinent strings.
I personally got a Test.txt file of 3667 lines. If you have a hexadecimal editor, you may translate some of these strings or sentences into your mother tongue. Don’t forget to rewrite any UNICODE string according to the UCS-2 Little Endian encoding, with the same length !
For instance, there are, in lines 39500 and 39501 of the Test.txt file, the UNICODE strings “OVR” and “INS”, giving the writing mode of text ( Overwriting or Insertion ) in the N++ status bar. In French, my mother tongue, these two strings are INS and RFP. So I could change, in a copy of Notepad++.exe, the string \x4f\x00\x56\x00\x52\x00 into the sequence \x52\x00\x46\x00\x50\x00. Et voilà !
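If you would rather script such a same-length replacement than use a hexadecimal editor, a minimal Python sketch could be the following. The function name patch_wide_string is hypothetical, and the sample bytes are invented. Be aware that bytes.replace changes EVERY occurrence of the byte sequence, so, on a real executable, you should first check where the matches are.

```python
def patch_wide_string(data: bytes, old: str, new: str) -> bytes:
    """Swap a UCS-2 LE string for another one of identical byte length."""
    old_bytes = old.encode("utf-16-le")
    new_bytes = new.encode("utf-16-le")
    if len(old_bytes) != len(new_bytes):
        # A different length would shift every following byte of the file.
        raise ValueError("the new string must have the same length as the old one")
    return data.replace(old_bytes, new_bytes)

# Invented sample: two junk bytes, the wide string "OVR", then two NULs.
sample = b"\x01\x02" + "OVR".encode("utf-16-le") + b"\x00\x00"
patched = patch_wide_string(sample, "OVR", "RFP")
print(patched)   # b'\x01\x02R\x00F\x00P\x00\x00\x00'
```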
I hope, Jdl Jacob, that you can use these same S/R for your specific needs, on game level files. You could even shorten the process a bit, if your files don’t contain any UNICODE string. The S/R numbers 4, 5 and 6 would then be useless. In that case, the 7th S/R needs a small change : SEARCH = \0 and REPLACEMENT = \r\n
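As an illustration of this shortened, ANSI-only variant, here is a Python sketch chaining S/R 1, 2, 3, the modified 7th one, 8 and 9. The function name extract_ansi_strings and the sample bytes are invented, and the regexes are adapted to Python's re module ( no \R, so \r\n is written explicitly ).

```python
import re

def extract_ansi_strings(data: bytes) -> list:
    """Return the ANSI strings of 3+ characters found in a binary blob."""
    data = re.sub(rb"\r\x00\n\x00|\r\n|\r|\n", b"\r\n", data)   # S/R 1: normalize EOLs
    data = re.sub(rb"[^\x00\n\r\x20-\x7e]+", b"", data)         # S/R 2: drop non-text bytes
    data = re.sub(rb"(?<=\x00)\x00+", b"\r\n", data)            # S/R 3: split NUL runs
    data = data.replace(b"\x00", b"\r\n")                       # S/R 7, modified: NUL -> EOL
    data = re.sub(rb"(?m)^ *\r\n", b"", data)                   # S/R 8: empty lines
    data = re.sub(rb"(?m)^[^\r\n]{1,2}\r\n", b"", data)         # S/R 9: too-short strings
    return [line for line in data.split(b"\r\n") if line]

sample = b"\x01\x02Hello\x00\x00\x00ab\x00World!\x00\xff\xfeOK\x00Save file\x00"
print(extract_ansi_strings(sample))   # [b'Hello', b'World!', b'Save file']
```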
Best Regards,
guy038