
    Intricacies of Regex Replace via Python Script

    Notepad++ & Plugin Development
    14 Posts 3 Posters 2.2k Views
    • M Andre Z Eckenrode

      New to Python Script (and Python itself) but been using Notepad++ for about 10 years now, and regular expressions (in mostly basic but occasionally semi-advanced ways) for about 4 years. I’ve recorded a few regex-replace macros in Notepad++, using the built-in macro record/play functionality, but as everybody here no doubt knows, they are absolute bears to understand and edit, so I looked for other options and thought I’d give Python Script a try. But I’m running into problems, even though I don’t think I’ve been trying anything particularly complicated. Example:

      I am often exporting folder content lists as text from my file manager of choice, Directory Opus, often for folders populated by multiple images (jpg, png, etc.). DOpus lets me opt to print additional information about the files, and I choose to have it print file size and resolution, which are separated from the filename and from each other by tabs:

      Test.png<TAB>398<TAB>740 x 2065 x 1

      For my usual purposes, I want the file list entries to look like this:

      “Test.png” (398) [740 x 2065 x 1]

      Using NPP’s built-in replace dialog, I’ve successfully accomplished what I wanted with the following strings:

      Find: ^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)

      Replace: “\1.\2” \(\3\) [\4]

      When I recorded that using NPP’s built-in macro recorder, it recorded the replace string as follows:

      &#x201C;\1.\2&#x201D; \(\3\) [\4]

      Now trying to translate the above strings for use via Python Script. The sample script “Python Regex Replacements.py” that came with Python Script contains the following example code:

      editor.rereplace(r"([A-Z]{3})\1", r"\1")

      While the Python Script help does include the following example in the Introduction section:

      editor.pyreplace(r"^Code: ([A-Z]{4,8})", r"The code is \1")

      …in Editor Object > Helper Methods, it states the following:

      Editor.pyreplace(search, replace[, count[, flags[, startLine[, endLine]]]]) This method has been removed from version 1.0. It was last present in version 0.9.2.0

      In any case, I tried translating my own find and replace strings for Python Script use as follows:

      editor.rereplace(r"^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)",r" “\1.\2” \(\3\) [\4]")

      That gave me the following result:

      â€œTest.pngâ€ (398) [740 x 2065 x 1]

      I also tried it WITHOUT the ‘r’ preceding the opening double-quote marks, with result as shown:

      â€œSOH.STXâ€ (ETX) [EOT]

      (SOH/STX/ETX/EOT being control characters represented in NPP by those letter combos in white, on a black background.)

      Note that most of the text I’m processing is plain ANSI, and typographical single- and double-quote marks are included in the ANSI character set. Nevertheless, I did find reference in the Python Script help to using ‘u’ to designate unicode strings, and I’d like my script to still work for the occasional unicode text file anyway, so I tried this code:

      editor.rereplace(u"^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)",u" “\1.\2” \(\3\) [\4]")

      Result:

      “SOH.STX” (ETX) [EOT]

      So I finally got the typographical quote marks to come through, but the filename, size and resolution are still bungled. What am I doing wrong?

      • Ekopalypse @M Andre Z Eckenrode

        @M-Andre-Z-Eckenrode

        You cannot simply glue double quotes(").
        Python knows 4 different ways to denote a string

        "this is a string"
        
        'as well as this'
        
        """this is a string also but has special meaning and features"""
        
        '''like this one'''
        

        If you have to construct a string which uses one of the delimiters,
        then you can either use two different ones or you have to escape it.

        'This is a string which contains a " in it'
        

        or

        "This is a string which contains a \" in it"
        

        The r in front of a string means raw string which basically informs python
        to use the string literally and not to treat escape sequences.
        "Hello\tWorld" means Hello, followed by a tab character, followed by World,
        whereas r"Hello\tWorld" keeps the backslash and the t as two literal characters instead of replacing \t with a tab.
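
A quick way to see the difference is to compare the two literals directly. This is a minimal sketch in plain Python with no Notepad++ objects involved; the variable names are illustrative only.

```python
# Regular literal: Python turns the escape sequence \t into one tab character.
plain = "Hello\tWorld"

# Raw literal: the backslash and the letter t stay as two separate characters.
raw = r"Hello\tWorld"

print(len(plain))  # 11 characters: Hello + tab + World
print(len(raw))    # 12 characters: Hello + backslash + t + World

# A raw literal is shorthand for doubling the backslash by hand.
print(raw == "Hello\\tWorld")  # True

# A quote character can be embedded by switching delimiters or by escaping it.
print('contains a " in it' == "contains a \" in it")  # True
```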

        I hope this clarifies your question, if not let me know.
        And yes, pyreplace should NOT be used anymore.

        • Ekopalypse @M Andre Z Eckenrode

          @M-Andre-Z-Eckenrode

          Just in case I wasn’t clear enough, this would be the code in Python:

          editor.rereplace(r'^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)', r'"\1.\2" \(\3\) [\4]')
          
                • M Andre Z Eckenrode @Ekopalypse

                @Ekopalypse

                Thanks for your reply.

                You cannot simply glue double quotes(").

                But I’m not “gluing” the normal (straight) double-quote marks; I’m using the typographical (curly) ones (character values 147 & 148 in the ANSI code page, according to NPP’s character panel).

                The r in front of a string means raw string which basically informs python
                to use the string literally and not to treat escape sequences.

                That explanation doesn’t seem to be borne out by my previous experimental usage, or by Python Script’s installed help document, unless I’m missing it in there somewhere. The sample script “Python Regex Replacements”, included with Python Script, uses the following line of code:

                editor.rereplace(r"([A-Z]{3})\1", r"\1")
                

                I don’t think that would do its intended work if the r caused the strings to be seen as literal.

                Also, you preceded both strings with the r in your kindly suggested python code:

                editor.rereplace(r'^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)', r'"\1.\2" \(\3\) [\4]')
                

                If the r makes them literal, how would THEY work?

                In any case, all of this seems to be a moot point for me now. For some unknown reason, none of my attempts to use, or tweak and use, my own script are doing anything at all in NPP. I have no clue why. I’ve tried restarting NPP, and the computer, but still no luck. The sample script “Python Regex Replacements” works, but not mine. If anybody has any ideas, I’m open.

                  • Ekopalypse @M Andre Z Eckenrode

                  @M-Andre-Z-Eckenrode

                  OK, I mistakenly assumed that “ and ” were meant to be ", because the script you posted

                  editor.rereplace(r"^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)", r" “\1.\2” \(\3\) [\4]")
                  

                  works for me. Since it didn’t work on your side, I jumped to that wrong assumption.

                  Concerning the raw string notation: it is needed because the boost::regex engine expects
                  the literal two-character string \1 (for example, to insert the first captured group) and not the single control character SOH. Python therefore has to be told
                  not to convert it, either by using the raw notation r'\1' or by escaping the backslash as '\\1'.
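
The same point can be checked outside Notepad++. The sketch below uses Python's standard `re` module as a stand-in for `editor.rereplace`, which hands its strings to Boost.Regex; that substitution is an assumption made so the example runs anywhere, and the two engines differ in other details.

```python
import re

# Three spellings of "backslash one".
raw_ref = r"\1"       # two characters: backslash, 1
escaped_ref = "\\1"   # the same two characters, written with a doubled backslash
cooked_ref = "\1"     # ONE character: Python already converted it to SOH (0x01)

print(raw_ref == escaped_ref)  # True
print(cooked_ref == "\x01")    # True

# Pattern and replacement from the bundled sample script, run through `re`.
text = "ABCABC"
good = re.sub(r"([A-Z]{3})\1", r"\1", text)
print(good)  # ABC  (the engine received \1 and treated it as a backreference)

# Without the r prefix the engine receives SOH and inserts it verbatim.
bad = re.sub(r"([A-Z]{3})\1", "\1", text)
print(repr(bad))  # '\x01'
```

So the `r` prefix does not make the regex "literal"; it only keeps Python from consuming the backslash before the regex engine gets to interpret it.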

                  So why isn’t your initial example working?
                  Maybe there is an error in your script. Open the console, as I did,
                  and run your script. Does it show an error?

                    • Ekopalypse @M Andre Z Eckenrode

                    @M-Andre-Z-Eckenrode

                    I think I finally understand the issue.
                    Your code

                    editor.rereplace(r"^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)",r" “\1.\2” \(\3\) [\4]")
                    

                    was executed against an ANSI-encoded buffer and therefore resulted in â€œTest.pngâ€ (398) [740 x 2065 x 1],
                    whereas in a UTF-8-encoded buffer it would have returned “Test.png” (398) [740 x 2065 x 1].

                    So, to make it work for both UTF-8 and ANSI, you might want to change it to

                    editor.rereplace(r"^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)", ur" “\1.\2” \(\3\) [\4]")
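
Why the curly quotes came out garbled in the ANSI buffer can be reproduced in plain Python. This is a sketch under two assumptions: that the ANSI code page is Windows-1252 (the `cp1252.py` traceback later in this thread suggests so), and that the non-unicode replacement string reached the buffer as UTF-8 bytes.

```python
# UTF-8 byte sequences for the two curly quotes.
open_quote = "\u201c".encode("utf-8")    # b'\xe2\x80\x9c'
close_quote = "\u201d".encode("utf-8")   # b'\xe2\x80\x9d'

# Reading those bytes as Windows-1252 yields one character per byte.
garbled_open = open_quote.decode("cp1252")
print(garbled_open)  # â€œ

# Byte 0x9D has no character in Windows-1252, so substitute a marker for it.
garbled_close = close_quote.decode("cp1252", errors="replace")
print(garbled_close[:2])  # â€

# In a Windows-1252 buffer each curly quote needs exactly one byte instead.
ansi_quotes = "\u201c\u201d".encode("cp1252")
print(ansi_quotes)  # b'\x93\x94'
```

Passing the replacement as a unicode string lets the plugin convert it to whatever encoding the buffer actually uses, which is presumably why the `u` prefix fixes both cases.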
                    

                      • Alan Kilborn

                      The UTF-8 / ANSI thing. Hmmm.
                      The answer for me would be to avoid the whole ANSI thing entirely, if possible (I’m not clear if this is possible/impossible in the OP’s situation). Usually, if you’re in total control of the files for your own purposes and you are specifically choosing ANSI as the encoding, I’d wonder why.
                      People (especially “old timers”) seem to confuse ANSI and ASCII, which are different things, and do not realize that ASCII (what they really mean when they say ANSI) is in reality fully represented in UTF-8.
                      Go UTF-8!
                      Hopefully the above comments are on-target. I’m far from an expert, being one of the “old timers”.

                      Otherwise:

                      I’m kind of “struck” by one line in one of the OP’s post.

                      @M-Andre-Z-Eckenrode said:

                      If the r makes them literal, how would THEY work?

                      I think this is a common point of confusion.
                      All that the r"" in Python does is to save the user from having to type (and look at) doubled backslashes.
                      It has nothing to do with regex AT ALL.

                          • M Andre Z Eckenrode @Ekopalypse

                          @Ekopalypse

                          Open the console as I did and run your script. Does it show an error?

                          I’ve now modified and expanded my code to encompass more use cases, as such:

                          editor.rereplace(r"^(.*)?\.(?i)(jpe?g|png|gif|tif{1,2}|bmp)\t(.*)\t(.*)", ur"  “\1.\2” \(\3\) [\4]")
                          

                          NPP running, Python console open, and code above run from the console’s Run line, it works like a charm, does exactly what it’s supposed to do.
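
For readers without Notepad++ at hand, the transformation can be approximated with Python's built-in `re` module. This is only a sketch under stated assumptions: `re.sub` stands in for `editor.rereplace`; the inline `(?i)` is moved to a flags argument because recent Python versions reject it in the middle of a pattern; and the parentheses in the replacement are left unescaped because `re` treats them as ordinary text, whereas the thread escapes them as `\(` for Boost. The function name `format_listing` is hypothetical.

```python
import re

# Same pattern as in the post, minus the inline (?i); case-insensitivity is a flag here.
PATTERN = re.compile(
    r"^(.*)?\.(jpe?g|png|gif|tif{1,2}|bmp)\t(.*)\t(.*)",
    re.IGNORECASE | re.MULTILINE,
)

# Raw string so that \1..\4 reach the regex engine as backreferences.
REPLACEMENT = r"“\1.\2” (\3) [\4]"


def format_listing(text):
    """Rewrite each 'name.ext<TAB>size<TAB>resolution' line; other lines pass through."""
    return PATTERN.sub(REPLACEMENT, text)


sample = "Test.png\t398\t740 x 2065 x 1\nnotes.txt\t12\t-"
print(format_listing(sample))
```

The first sample line is rewritten to the quoted form shown at the top of the thread; the second is left untouched because its extension is not in the list.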

                          NPP running, Python console open, but code now run from file via menu Plugins > Python Script > Scripts > IMG+Size+Resolution, error message received:

                          SyntaxError: Non-ASCII character '\x93' in file C:\Users\Administrator\AppData\Roaming\Notepad++\plugins\Config\PythonScript\scripts\IMG+Size+Resolution.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
                          

                          After consulting the online document specified above, which recommends and discusses various encoding declaration schemes (ascii, latin-1, iso-8859-15, utf-8, etc.) for Python scripts, I put # coding: latin-1 at the top of my script file, and attempted another run, which gets me:

                          File "C:\Program Files (x86)\Notepad++\plugins\PythonScript\lib\encodings\cp1252.py", line 12, in encode
                          return codecs.charmap_encode(input,errors,encoding_table)
                          UnicodeEncodeError: 'charmap' codec can't encode character u'\x93' in position 2: character maps to <undefined>
                          

                          Changed encoding declaration in file to Windows-1252, and that works, thankfully. But a question remains for me: Why does it work with an explicit encoding declaration from the console’s Run line, but require the declaration when run from script file?
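
The difference between the two declarations comes down to what byte 0x93 means in each encoding. The sketch below is plain Python 3 (the thread itself runs Python 2, but the code-page tables are the same). A plausible reading of the traceback is that the plugin then tries to encode the resulting character into the buffer's Windows-1252 encoding; that reading is an assumption, not something the thread confirms.

```python
# How the opening curly quote is stored in an ANSI (Windows-1252) file.
raw_byte = b"\x93"

# Latin-1 maps byte 0x93 to the control character U+0093, not to a quote.
as_latin1 = raw_byte.decode("latin-1")
print(ascii(as_latin1))  # '\x93'

# Windows-1252 maps the same byte to U+201C, the left curly quote.
as_cp1252 = raw_byte.decode("cp1252")
print(ascii(as_cp1252))  # '\u201c'

# U+0093 has no slot in Windows-1252, which matches the UnicodeEncodeError above.
try:
    as_latin1.encode("cp1252")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)  # False
```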

                          Thanks for your helpful suggestions, Ekopalypse.

                          @Alan-Kilborn

                          you are specifically choosing ANSI as the encoding, I’d wonder why.

                          Because 99.9% of the text I’m dealing with is adequately expressed in ANSI, and Unicode files are larger than ANSI. Granted, they’re just text files and not particularly large ones either way, but I’m a conservationist at heart and am compelled to avoid wasting space. :-)

                          All that the r"" in Python does is to save the user from having to type (and look at) doubled backslashes.

                          Thanks for the explanation, though still not clear on why doubled backslashes would be necessary without r"".

                            • M Andre Z Eckenrode @M Andre Z Eckenrode

                            Why does it work with an explicit encoding declaration from the console’s Run line…

                            That was meant to say, “Why does it work WITHOUT an explicit encoding declaration from the console’s Run line…”

                              • Ekopalypse @M Andre Z Eckenrode

                              @M-Andre-Z-Eckenrode said in Intricacies of Regex Replace via Python Script:

                              “Why does it work WITHOUT

                              If you run a script from the menu, there is no C++
                              interaction other than telling the Python interpreter to run that specific file, and Python assumes the file is ASCII-encoded unless told otherwise.

                              If you run code from the console’s Run text box, PythonScript ensures it is UTF-8 encoded.
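
The role of the encoding declaration can be demonstrated with `compile()`, which honors the same PEP 263 comment as a script file. This sketch runs on Python 3, where undeclared source is assumed to be UTF-8; under the Python 2 interpreter used in this thread the undeclared default is ASCII, as the SyntaxError quoted earlier shows. The variable names are illustrative.

```python
# Source bytes as an ANSI (Windows-1252) editor would save them: 0x93 is the curly quote.
body = b'quote = "\x93"\n'

# Without a declaration the interpreter cannot make sense of byte 0x93.
try:
    compile(body, "<undeclared>", "exec")
    undeclared_ok = True
except (SyntaxError, UnicodeDecodeError):
    undeclared_ok = False
print(undeclared_ok)  # False

# With a PEP 263 declaration on the first line, the same bytes compile.
declared = b"# -*- coding: cp1252 -*-\n" + body
namespace = {}
exec(compile(declared, "<declared>", "exec"), namespace)
print(ascii(namespace["quote"]))  # '\u201c'
```

A script saved as UTF-8 would instead carry `# -*- coding: utf-8 -*-`; the declaration has to match how the file is actually stored on disk.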

                              though still not clear on why doubled backslashes would be necessary without r""

                              Because certain sequences have a special meaning in Python. For example,
                              in a regular replace action the regex engine wants to receive
                              the string \1 or $1, but Python would convert \1 to SOH
                              before it reaches the boost regex engine.

                                • Alan Kilborn

                                @M-Andre-Z-Eckenrode said in Intricacies of Regex Replace via Python Script:

                                Because 99.9% of the text I’m dealing with is adequately expressed in ANSI, and Unicode files are larger than ANSI. Granted, they’re just text files and not particularly large ones either way, but I’m a conservationist at heart and am compelled to avoid wasting space. :-)

                                I don’t think that is a compelling reason at all.
                                But, democracy rules, so do what you like. :-)
                                But, as this thread sort of shows, a complete understanding of encoding is an important thing.
                                For myself, I’m not 100% of the way there, but hopefully getting better all the time
