Intricacies of Regex Replace via Python Script

M Andre Z Eckenrode

New to Python Script (and Python itself) but been using Notepad++ for about 10 years now, and regular expressions (in mostly basic but occasionally semi-advanced ways) for about 4 years. I’ve recorded a few regex-replace macros in Notepad++, using the built-in macro record/play functionality, but as everybody here no doubt knows, they are absolute bears to understand and edit, so I looked for other options and thought I’d give Python Script a try. But I’m running into problems, even though I don’t think I’ve been trying anything particularly complicated. Example:

I am often exporting folder content lists as text from my file manager of choice, Directory Opus, often for folders populated by multiple images (jpg, png, etc.). DOpus lets me opt to print additional information about the files, and I choose to have it print file size and resolution, which are separated from the filename and from each other by tabs:

Test.png<TAB>398<TAB>740 x 2065 x 1

For my usual purposes, I want the file list entries to look like this:

“Test.png” (398) [740 x 2065 x 1]

Using NPP’s built-in replace dialog, I’ve successfully accomplished what I wanted with the following strings:

Find: ^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)

Replace: “\1.\2” \(\3\) [\4]

When I recorded that using NPP’s built-in macro recorder, it recorded the replace string as follows:

“\1.\2” \(\3\) [\4]

Now trying to translate the above strings for use via Python Script. The sample script “Python Regex Replacements.py” that came with Python Script contains the following example code:

editor.rereplace(r"([A-Z]{3})\1", r"\1")

While the Python Script help does include the following example in the Introduction section:

editor.pyreplace(r"^Code: ([A-Z]{4,8})", r"The code is \1")

…in Editor Object > Helper Methods, it states the following:

Editor.pyreplace(search, replace[, count[, flags[, startLine[, endLine]]]]) This method has been removed from version 1.0. It was last present in version 0.9.2.0

In any case, I tried translating my own find and replace strings for Python Script use as follows:

editor.rereplace(r"^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)",r" “\1.\2” \(\3\) [\4]")

That gave me the following result:

â€œTest.pngâ€ (398) [740 x 2065 x 1]

I also tried it WITHOUT the ‘r’ preceding the opening double-quote marks, with result as shown:

â€œSOH.STXâ€ (ETX) [EOT]

(SOH/STX/ETX/EOT being control characters represented in NPP by those letter combos in white, on a black background.)

Note that most of the text I’m processing is plain ANSI, and typographical single- and double-quote marks are included in the ANSI character set. Nevertheless, I did find reference in the Python Script help to using ‘u’ to designate unicode strings, and I’d like my script to still work for the occasional unicode text file anyway, so I tried this code:

editor.rereplace(u"^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)",u" “\1.\2” \(\3\) [\4]")

Result:

“SOH.STX” (ETX) [EOT]

So I finally got the typographical quote marks to come through, but the filename, size and resolution are still bungled. What am I doing wrong?

Ekopalypse

@M-Andre-Z-Eckenrode

You cannot simply glue double quotes(").
Python knows 4 different ways to denote a string

"this is a string"

'as well as this'

"""this is a string also but has special meaning and features"""

'''like this one'''

If you have to construct a string which uses one of the delimiters,
then you can either use two different ones or you have to escape it.

'This is a string which contains a " in it'

or

"This is a string which contains a \" in it"

The r in front of a string means raw string which basically informs python
to use the string literally and not to treat escape sequences.
"Hello\tWorld" would mean Hello followed by a tab followed by World,
whereas r"Hello\tWorld" keeps the string exactly as typed and does not replace \t with a tab character.
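
A minimal sketch of that difference in plain Python (nothing Notepad++-specific is assumed here; the variable names are only for illustration):

```python
# Sketch: how Python reads the same characters with and without the r prefix.
cooked = "Hello\tWorld"    # \t is turned into one tab character
raw = r"Hello\tWorld"      # backslash and t stay as two separate characters

print(len(cooked))              # 11
print(len(raw))                 # 12
print(raw == "Hello\\tWorld")   # True: r"" only saves typing the doubled backslash
```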

I hope this clarifies your question, if not let me know.
And yes, pyreplace should NOT be used anymore.

Ekopalypse

@M-Andre-Z-Eckenrode

Just in case I wasn’t clear enough, this would be the code in Python:

editor.rereplace(r'^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)', r'"\1.\2" \(\3\) [\4]')
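
To try the pattern outside Notepad++, here is a rough sketch using Python's standard re module. That is a swap-in for illustration: editor.rereplace goes through the Boost regex engine, and the replacement syntax is not identical, so the parentheses are written without backslashes here because that is what re.sub expects. The sample line is the one from the first post.

```python
import re

# Sample line as exported: name, size and resolution separated by tabs.
line = "Test.png\t398\t740 x 2065 x 1"

pattern = r"^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)"
# Plain parentheses: this is re.sub replacement syntax, not Boost's.
replacement = r'"\1.\2" (\3) [\4]'

result = re.sub(pattern, replacement, line)
print(result)   # "Test.png" (398) [740 x 2065 x 1]
```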

M Andre Z Eckenrode

@Ekopalypse

Thanks for your reply.

You cannot simply glue double quotes(").

But I’m not “glueing” the normal (straight) double-quote marks, I’m doing the typographical (curly) ones (ASCII values 147 & 148, according to NPP’s character panel).

The r in front of a string means raw string which basically informs python
to use the string literally and not to treat escape sequences.

That explanation doesn’t seem to be borne out by my previous experimental usage, or by Python Script’s installed help document, unless I’m missing it in there somewhere. The sample script “Python Regex Replacements”, included with Python Script, uses the following line of code:

editor.rereplace(r"([A-Z]{3})\1", r"\1")

I don’t think that would do its intended work if the r caused the strings to be seen as literal.

Also, you preceded both strings with the r in your kindly suggested python code:

editor.rereplace(r'^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)', r'"\1.\2" \(\3\) [\4]')

If the r makes them literal, how would THEY work?

In any case, all of this seems to be a moot point for me now. For some unknown reason, none of my attempts to use, or tweak and use, my own script are doing anything at all in NPP. I have no clue why. I’ve tried restarting NPP, and the computer, but still no luck. The sample script “Python Regex Replacements” works, but not mine. If anybody has any ideas, I’m open.

Ekopalypse

@M-Andre-Z-Eckenrode

OK, I mistakenly assumed that “ and ” were meant to be ", because your posted script

editor.rereplace(r"^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)", r" “\1.\2” \(\3\) [\4]")

works for me; since it didn’t work on your side, I made this wrong assumption.

Concerning the raw string notation: it is needed because the boost::regex engine expects
the literal string \1 (for example, to insert the first captured group) and not SOH, so Python needs to be told
not to convert it, either by using the r'\1' notation or by escaping the backslash as '\\1'.

So why isn’t your initial example working?
Maybe there is an error in your script. Open the console as I did
and run your script. Does it show an error?

Ekopalypse

@M-Andre-Z-Eckenrode

I guess I finally understand the issue.
Your code

editor.rereplace(r"^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)",r" “\1.\2” \(\3\) [\4]")

was executed against an ANSI-encoded buffer and therefore resulted in â€œTest.pngâ€ (398) [740 x 2065 x 1],
whereas in a UTF-8 encoded buffer it would have returned “Test.png” (398) [740 x 2065 x 1].

So in order to make it work for both UTF-8 and ANSI, you might want to change it to

editor.rereplace(r"^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)", ur" “\1.\2” \(\3\) [\4]")
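
The garbled quote can be reproduced in plain Python. This is only a sketch, and it assumes the ANSI code page in question is Windows-1252 (cp1252): the UTF-8 bytes of the opening curly quote, read back as cp1252, turn into three separate characters.

```python
# Sketch: the same character in two encodings (cp1252 is an assumption here).
left_quote = u"\u201c"                       # the opening curly quote

utf8_bytes = left_quote.encode("utf-8")      # three bytes: E2 80 9C
ansi_bytes = left_quote.encode("cp1252")     # one byte: 93 (decimal 147)

# UTF-8 bytes shown in an ANSI buffer: each byte becomes its own character.
misread = utf8_bytes.decode("cp1252")

print(len(utf8_bytes), len(ansi_bytes))      # 3 1
print(misread == u"\u00e2\u20ac\u0153")      # True: the three characters seen in the garbled output
```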

Alan Kilborn

The UTF-8 / ANSI thing. Hmmm.
The answer for me would be to avoid the whole ANSI thing entirely, if possible (I’m not clear if this is possible/impossible in the OP’s situation). Usually, if you’re in total control of the files for your own purposes and you are specifically choosing ANSI as the encoding, I’d wonder why.
People (especially “old timers”) seem to confuse ANSI and ASCII, which are different things, and not realize that ASCII (what they really mean when they say ANSI) is in reality fully represented in UTF-8.
Go UTF-8!
Hopefully the above comments are on-target. I’m far from an expert, being one of the “old timers”.

Otherwise:

I’m kind of “struck” by one line in one of the OP’s posts.

@M-Andre-Z-Eckenrode said:

If the r makes them literal, how would THEY work?

I think this is a common point of confusion.
All that the r"" in Python does is to save the user from having to type (and look at) doubled backslashes.
It has nothing to do with regex AT ALL.

M Andre Z Eckenrode

@Ekopalypse

Open the console as I did and run your script. Does it show an error?

I’ve now modified and expanded my code to encompass more use cases, as such:

editor.rereplace(r"^(.*)?\.(?i)(jpe?g|png|gif|tif{1,2}|bmp)\t(.*)\t(.*)", ur"  “\1.\2” \(\3\) [\4]")

NPP running, Python console open, and code above run from the console’s Run line, it works like a charm, does exactly what it’s supposed to do.
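
For anyone who wants to check which extensions the expanded pattern accepts without opening Notepad++, here is a sketch with Python's standard re module (a stand-in, not what editor.rereplace uses internally). Recent Python versions reject an inline (?i) in the middle of a pattern, so this version passes re.IGNORECASE instead; reformat_entry and the sample lines are made up for illustration.

```python
import re

# Same alternatives as above; case-insensitivity is set with a flag rather than (?i).
PATTERN = re.compile(r"^(.*)?\.(jpe?g|png|gif|tif{1,2}|bmp)\t(.*)\t(.*)", re.IGNORECASE)

# Curly quotes given as escapes so the script file itself stays pure ASCII.
REPLACEMENT = "\u201c\\1.\\2\u201d (\\3) [\\4]"

def reformat_entry(line):
    """Return the reformatted entry, or the line unchanged if it does not match."""
    return PATTERN.sub(REPLACEMENT, line)

print(reformat_entry("Scan.TIFF\t12\t10 x 10 x 1"))
print(reformat_entry("notes.txt\t1\t2"))   # not an image extension, left as-is
```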

NPP running, Python console open, but code now run from file via menu Plugins > Python Script > Scripts > IMG+Size+Resolution, error message received:

SyntaxError: Non-ASCII character '\x93' in file C:\Users\Administrator\AppData\Roaming\Notepad++\plugins\Config\PythonScript\scripts\IMG+Size+Resolution.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

After consulting the online document specified above, which recommends and discusses various encoding declaration schemes (ascii, latin-1, iso-8859-15, utf-8, etc.) for Python scripts, I put # coding: latin-1 at the top of my script file, and attempted another run, which gets me:

File "C:\Program Files (x86)\Notepad++\plugins\PythonScript\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\x93' in position 2: character maps to <undefined>

Changed encoding declaration in file to Windows-1252, and that works, thankfully. But a question remains for me: Why does it work with an explicit encoding declaration from the console’s Run line, but require the declaration when run from script file?
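
As far as I can tell, the difference between the two declarations comes down to what byte 0x93 means in each encoding. The sketch below is plain Python and only mirrors the traceback above; how the plugin handles the string internally is not something it shows.

```python
# Sketch: the byte that the script file contains for the opening curly quote.
raw = b"\x93"

as_latin1 = raw.decode("latin-1")    # U+0093, an invisible control character
as_cp1252 = raw.decode("cp1252")     # U+201C, the curly quote that was intended

# Encoding back to cp1252 only works for the character cp1252 actually defines.
try:
    as_latin1.encode("cp1252")
    latin1_roundtrip_ok = True
except UnicodeEncodeError:
    latin1_roundtrip_ok = False      # same kind of error as in the traceback

print(latin1_roundtrip_ok)                   # False
print(as_cp1252.encode("cp1252") == raw)     # True
```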

Thanks for your helpful suggestions, Ekopalypse.

@Alan-Kilborn

you are specifically choosing ANSI as the encoding, I’d wonder why.

Because 99.9% of the text I’m dealing with is adequately expressed in ANSI, and Unicode files are larger than ANSI. Granted, they’re just text files and not particularly large ones either way, but I’m a conservationist at heart and am compelled to avoid wasting space. :-)

All that the r"" in Python does is to save the user from having to type (and look at) doubled backslashes.

Thanks for the explanation, though still not clear on why doubled backslashes would be necessary without r"".

M Andre Z Eckenrode

Why does it work with an explicit encoding declaration from the console’s Run line…

That was meant to say, “Why does it work WITHOUT an explicit encoding declaration from the console’s Run line…”

Ekopalypse

@M-Andre-Z-Eckenrode said in Intricacies of Regex Replace via Python Script:

“Why does it work WITHOUT

If you run a script from the menu, then there is no other C++
interaction than telling the Python interpreter to run that specific file, and Python 2 assumes the file to be ASCII encoded unless it declares an encoding.

If you run code from the run textbox, then PS ensures it is utf8 encoded.

though still not clear on why doubled backslashes would be necessary without r""

Because certain sequences have a different meaning:
in a regular replace action the regex engine wants to receive
the string \1 or $1, but Python would convert \1 to SOH
before it reaches the boost regex engine.
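
The SOH.STX result from the first post can be reproduced with a small sketch. It uses Python's standard re module as a stand-in for the Boost engine, so only the backslash handling is the point here; the simplified pattern is made up for the example.

```python
import re

name = "Test.png"
pattern = r"(\w+)\.(\w+)"

# Without r"": Python turns \1 and \2 into control characters SOH and STX
# before the regex engine ever sees the replacement string.
cooked = re.sub(pattern, "\1.\2", name)

# With r"" or doubled backslashes: the engine receives \1 and \2 as group references.
raw = re.sub(pattern, r"\1.\2", name)
doubled = re.sub(pattern, "\\1.\\2", name)

print(repr(cooked))    # '\x01.\x02'
print(raw, doubled)    # Test.png Test.png
```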

Alan Kilborn

@M-Andre-Z-Eckenrode said in Intricacies of Regex Replace via Python Script:

Because 99.9% of the text I’m dealing with is adequately expressed in ANSI, and Unicode files are larger than ANSI. Granted, they’re just text files and not particularly large ones either way, but I’m a conservationist at heart and am compelled to avoid wasting space. :-)

I don’t think that is a compelling reason at all.
But, democracy rules, so do what you like. :-)
But, as this thread sort of shows, a complete understanding of encoding is an important thing.
For myself, I’m not 100% of the way there, but hopefully getting better all the time.