Intricacies of Regex Replace via Python Script





  • @Ekopalypse

    Thanks for your reply.

    You cannot simply glue double quotes(").

    But I’m not “glueing” the normal (straight) double-quote marks, I’m doing the typographical (curly) ones (ASCII values 147 & 148, according to NPP’s character panel).

    The r in front of a string means raw string which basically informs python
    to use the string literally and not to treat escape sequences.

    That explanation doesn’t seem to be borne out by my previous experimental usage, or by Python Script’s installed help document, unless I’m missing it in there somewhere. The sample script “Python Regex Replacements”, included with Python Script, uses the following line of code:

    editor.rereplace(r"([A-Z]{3})\1", r"\1")
    

    I don’t think that would do its intended work if the r caused the strings to be seen as literal.

    Also, you preceded both strings with the r in your kindly suggested python code:

    editor.rereplace(r'^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)', r'"\1.\2" \(\3\) [\4]')
    

    If the r makes them literal, how would THEY work?

    In any case, all of this seems to be a moot point for me now. For some unknown reason, none of my attempts to use, or tweak and use, my own script are doing anything at all in NPP. I have no clue why. I’ve tried restarting NPP, and the computer, but still no luck. The sample script “Python Regex Replacements” works, but not mine. If anybody has any ideas, I’m open.



  • @M-Andre-Z-Eckenrode

    OK, I made the mistake of assuming that “ and ” were meant to be ", because your posted script

    editor.rereplace(r"^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)", r" “\1.\2” \(\3\) [\4]")
    

    works for me, and since it didn’t on your side, I made this wrong assumption.

    Concerning the raw string notation: it is needed because the boost::regex engine expects
    the literal string \1 (to refer to the first capture group, for example) and not SOH. Python therefore
    has to be told not to convert it, either by using the r'\1' notation or by escaping the backslash as '\\1'.
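
    A minimal sketch of that point in plain Python, with no Notepad++ or Boost involved. It is written for Python 3 (PythonScript in this thread runs Python 2, where these three literals behave the same way), and the variable names are only illustrative.

```python
# Without the r prefix, Python turns \1 into the control character SOH (0x01)
# before any regex engine ever sees the string.
plain = "\1"
raw = r"\1"
escaped = "\\1"

print(len(plain), repr(plain))   # 1 '\x01'
print(len(raw), repr(raw))       # 2 '\\1'
print(raw == escaped)            # True
```

    So r'\1' and '\\1' are two spellings of the same two-character string, and that string is what the regex engine needs to receive.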

    So why isn’t your initial example working?
    Maybe there is an error in your script. Open the console as I did
    and run your script. Does it show an error?



  • @M-Andre-Z-Eckenrode

    I guess I, finally, understood the issue.
    Your code

    editor.rereplace(r"^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)",r" “\1.\2” \(\3\) [\4]")
    

    was executed against an ANSI encoded buffer and therefore resulted in “Test.png†(398) [740 x 2065 x 1]
    whereas in a UTF-8 encoded buffer it would have returned “Test.png” (398) [740 x 2065 x 1]

    So, to make it work for both UTF-8 and ANSI, you might want to change it to

    editor.rereplace(r"^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)", ur" “\1.\2” \(\3\) [\4]")
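
    A rough sketch of what goes wrong at the byte level, assuming "ANSI" here means Windows-1252 (cp1252). It is plain Python 3 with no editor involved, so it illustrates the encodings only, not how PythonScript handles the u prefix internally.

```python
left, right = u"\u201c", u"\u201d"    # curly double quotes

utf8_bytes = left.encode("utf-8")     # three bytes
ansi_bytes = left.encode("cp1252")    # one byte, 0x93

# UTF-8 bytes read back as Windows-1252 produce mojibake.
garbled_left = utf8_bytes.decode("cp1252")
# 0x9D is undefined in Windows-1252, so the closing quote loses its last byte.
garbled_right = right.encode("utf-8").decode("cp1252", errors="replace")

print(utf8_bytes, ansi_bytes)
print(garbled_left, garbled_right)
```

    A Unicode string can be encoded correctly for either kind of buffer, whereas a byte string that is already UTF-8 only looks right in a UTF-8 buffer.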
    



  • The UTF-8 / ANSI thing. Hmmm.
    The answer for me would be to avoid the whole ANSI thing entirely, if possible (I’m not clear if this is possible/impossible in the OP’s situation). Usually, if you’re in total control of the files for your own purposes and you are specifically choosing ANSI as the encoding, I’d wonder why.
    People (especially “old timers”) seem to confuse ANSI and ASCII, which are different things, and not realize that ASCII (what they really mean when they say ANSI) is in reality fully represented in UTF-8.
    Go UTF-8!
    Hopefully the above comments are on-target. I’m far from an expert, being one of the “old timers”.
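
    A quick way to check the "ASCII is fully represented in UTF-8" statement, as a small Python 3 sketch. It assumes "ANSI" means Windows-1252 (cp1252), and the sample text is made up.

```python
text = "Test.png (398) [740 x 2065 x 1]"   # ASCII-only sample

# Pure ASCII text is byte-for-byte identical in all three encodings.
same = text.encode("ascii") == text.encode("cp1252") == text.encode("utf-8")

# Only characters outside ASCII differ, e.g. a curly quote.
quote = u"\u201c"
sizes = (len(quote.encode("cp1252")), len(quote.encode("utf-8")))

print(same, sizes)   # True (1, 3)
```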

    Otherwise:

    I’m kind of “struck” by one line in one of the OP’s post.

    @M-Andre-Z-Eckenrode said:

    If the r makes them literal, how would THEY work?

    I think this is a common point of confusion.
    All that the r"" in Python does is to save the user from having to type (and look at) doubled backslashes.
    It has nothing to do with regex AT ALL.
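
    To make that concrete, here is a small Python 3 sketch. It uses Python's own re module as a stand-in for the Boost engine behind editor.rereplace, which is an assumption made only so the example runs outside Notepad++; the sample text is invented.

```python
import re

text = "ABCABC xyz"

# Raw strings: the backslash-1 reaches the regex engine untouched.
with_raw = re.sub(r"([A-Z]{3})\1", r"\1", text)

# Doubled backslashes build exactly the same strings.
with_doubled = re.sub("([A-Z]{3})\\1", "\\1", text)

# No r, no doubling: Python has already replaced \1 with SOH (0x01),
# so the pattern looks for three capitals followed by SOH and finds nothing.
without_raw = re.sub("([A-Z]{3})\1", "\1", text)

print(with_raw)                   # ABC xyz
print(with_raw == with_doubled)   # True
print(without_raw == text)        # True
```

    The regex engine behaves identically in the first two calls; only the way the Python source spells the strings differs.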





  • @Ekopalypse

    Open the console as I did and run your script. Does it show an error?

    I’ve now modified and expanded my code to encompass more use cases, as such:

    editor.rereplace(r"^(.*)?\.(?i)(jpe?g|png|gif|tif{1,2}|bmp)\t(.*)\t(.*)", ur"  “\1.\2” \(\3\) [\4]")
    

    NPP running, Python console open, and code above run from the console’s Run line, it works like a charm, does exactly what it’s supposed to do.

    NPP running, Python console open, but code now run from file via menu Plugins > Python Script > Scripts > IMG+Size+Resolution, error message received:

    SyntaxError: Non-ASCII character '\x93' in file C:\Users\Administrator\AppData\Roaming\Notepad++\plugins\Config\PythonScript\scripts\IMG+Size+Resolution.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
    

    After consulting the online document specified above, which recommends and discusses various encoding declaration schemes (ascii, latin-1, iso-8859-15, utf-8, etc.) for Python scripts, I put # coding: latin-1 at the top of my script file, and attempted another run, which gets me:

    File "C:\Program Files (x86)\Notepad++\plugins\PythonScript\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
    UnicodeEncodeError: 'charmap' codec can't encode character u'\x93' in position 2: character maps to <undefined>
    

    Changed encoding declaration in file to Windows-1252, and that works, thankfully. But a question remains for me: Why does it work with an explicit encoding declaration from the console’s Run line, but require the declaration when run from script file?
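
    A hedged sketch of why latin-1 fails where Windows-1252 works, in plain Python 3 and looking only at the single byte 0x93 from the error message. The variable names are illustrative.

```python
opening_quote_byte = b"\x93"   # how a Windows-1252 file stores the curly quote

as_latin1 = opening_quote_byte.decode("latin-1")   # U+0093, a control character
as_cp1252 = opening_quote_byte.decode("cp1252")    # U+201C, the curly quote

try:
    as_latin1.encode("cp1252")
    latin1_round_trip = True
except UnicodeEncodeError:
    # Same failure as in the traceback: U+0093 has no Windows-1252 byte.
    latin1_round_trip = False

print(repr(as_latin1), latin1_round_trip)                  # '\x93' False
print(as_cp1252.encode("cp1252") == opening_quote_byte)    # True
```

    With the latin-1 declaration the byte is read as the wrong character, which then cannot be written back to a Windows-1252 buffer; with the Windows-1252 declaration it round-trips.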

    Thanks for your helpful suggestions, Ekopalypse.

    @Alan-Kilborn

    you are specifically choosing ANSI as the encoding, I’d wonder why.

    Because 99.9% of the text I’m dealing with is adequately expressed in ANSI, and Unicode files are larger than ANSI. Granted, they’re just text files and not particularly large ones either way, but I’m a conservationist at heart and am compelled to avoid wasting space. :-)

    All that the r"" in Python does is to save the user from having to type (and look at) doubled backslashes.

    Thanks for the explanation, though still not clear on why doubled backslashes would be necessary without r"".



  • Why does it work with an explicit encoding declaration from the console’s Run line…

    That was meant to say, “Why does it work WITHOUT an explicit encoding declaration from the console’s Run line…”



  • @M-Andre-Z-Eckenrode said in Intricacies of Regex Replace via Python Script:

    “Why does it work WITHOUT

    If you run a script from the menu, then there is no other C++
    interaction than telling the Python interpreter to run that specific file, and Python assumes it to be utf8 encoded if not told otherwise.

    If you run code from the run textbox then PS ensures it is utf8 encoded.
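
    The effect of the encoding declaration can be tried outside Notepad++. This is only a sketch: it uses Python 3's compile as a stand-in for the plugin's Python 2 interpreter (Python 3 treats undeclared source as UTF-8), and try_compile is a made-up helper.

```python
# The same one-line script, saved as Windows-1252 bytes.
body = b'quote = "\x93"\n'

def try_compile(source):
    """Return the value of `quote`, or None if the source cannot be decoded."""
    try:
        namespace = {}
        exec(compile(source, "<script>", "exec"), namespace)
        return namespace["quote"]
    except (SyntaxError, UnicodeDecodeError):
        return None

undeclared = try_compile(body)
declared = try_compile(b"# coding: cp1252\n" + body)

print(undeclared is None, declared == u"\u201c")   # True True
```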

    though still not clear on why doubled backslashes would be necessary without r""

    Because certain sequences have a special meaning in Python string literals.
    For example, in a regex replace action the regex engine wants to receive
    the string \1 or $1, but Python would convert \1 to SOH
    before it reaches the Boost regex engine.



  • @M-Andre-Z-Eckenrode said in Intricacies of Regex Replace via Python Script:

    Because 99.9% of the text I’m dealing with is adequately expressed in ANSI, and Unicode files are larger than ANSI. Granted, they’re just text files and not particularly large ones either way, but I’m a conservationist at heart and am compelled to avoid wasting space. :-)

    I don’t think that is a compelling reason at all.
    But, democracy rules, so do what you like. :-)
    But, as this thread sort of shows, a complete understanding of encoding is an important thing.
    For myself, I’m not 100% of the way there, but hopefully getting better all the time.

