    Intricacies of Regex Replace via Python Script

      Ekopalypse @M Andre Z Eckenrode

      @M-Andre-Z-Eckenrode

      You cannot simply glue double quotes (") together.
      Python has four ways to delimit a string:

      "this is a string"
      
      'as well as this'
      
      """this is a string also but has special meaning and features"""
      
      '''like this one'''
      

      If you have to construct a string that contains one of the delimiters,
      you can either delimit the string with a different one or escape the character.

      'This is a string which contains a " in it'
      

      or

      "This is a string which contains a \" in it"
      

      The r in front of a string marks it as a raw string, which tells Python
      to take backslashes literally instead of processing escape sequences.
      "Hello\tWorld" means Hello followed by a tab followed by World,
      whereas r"Hello\tWorld" keeps the backslash and the t as two separate characters instead of replacing \t with a tab character.
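
For anyone who wants to try this outside Notepad++, here is a minimal sketch in plain Python 3 (no PythonScript objects involved; the variable names are just for illustration). It shows that the different delimiters give the same text and that a raw literal keeps the backslash:

```python
# Single, double and triple quotes all produce ordinary str objects.
mixed = 'This is a string which contains a " in it'
escaped = "This is a string which contains a \" in it"
same_text = (mixed == escaped)

# Triple-quoted strings may span several lines.
multi_line = """first line
second line"""

# A normal literal turns \t into one tab character; a raw literal keeps
# the backslash and the letter t as two separate characters.
cooked = "Hello\tWorld"
raw = r"Hello\tWorld"

print(same_text)                # True
print(len(cooked), len(raw))    # 11 12
print(raw == "Hello\\tWorld")   # True
```

So the raw prefix only changes how the literal is read from the source code; the resulting object is a normal string either way.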

      I hope this clarifies your question, if not let me know.
      And yes, pyreplace should NOT be used anymore.

        Ekopalypse @M Andre Z Eckenrode

        @M-Andre-Z-Eckenrode

        Just in case I wasn’t clear enough, this would be the code in Python:

        editor.rereplace(r'^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)', r'"\1.\2" \(\3\) [\4]')
        
              M Andre Z Eckenrode @Ekopalypse

              @Ekopalypse

              Thanks for your reply.

              You cannot simply glue double quotes(").

              But I’m not “glueing” the normal (straight) double-quote marks, I’m doing the typographical (curly) ones (ASCII values 147 & 148, according to NPP’s character panel).

              The r in front of a string means raw string which basically informs python
              to use the string literally and not to treat escape sequences.

              That explanation doesn’t seem to be borne out by my previous experimental usage, or by Python Script’s installed help document, unless I’m missing it in there somewhere. The sample script “Python Regex Replacements”, included with Python Script, uses the following line of code:

              editor.rereplace(r"([A-Z]{3})\1", r"\1")
              

              I don’t think that would do its intended work if the r caused the strings to be seen as literal.

              Also, you preceded both strings with the r in your kindly suggested python code:

              editor.rereplace(r'^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)', r'"\1.\2" \(\3\) [\4]')
              

              If the r makes them literal, how would THEY work?

              In any case, all of this seems to be a moot point for me now. For some unknown reason, none of my attempts to use, or tweak and use, my own script are doing anything at all in NPP. I have no clue why. I’ve tried restarting NPP, and the computer, but still no luck. The sample script “Python Regex Replacements” works, but not mine. If anybody has any ideas, I’m open.

                Ekopalypse @M Andre Z Eckenrode

                @M-Andre-Z-Eckenrode

                OK, I mistakenly assumed that “ and ” were meant to be ", because your posted script

                editor.rereplace(r"^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)", r" “\1.\2” \(\3\) [\4]")
                

                works for me; since it didn’t work on your side, I jumped to that wrong conclusion.

                Concerning the raw string notation: it is needed because the boost::regex engine expects
                the literal string \1 (for example, to insert the first capture group) and not the SOH character, so Python needs to be told
                not to convert it, either by using the r'\1' notation or by escaping the backslash as '\\1'.
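
A small sketch of that point, with one loud assumption: it uses Python 3's standard `re` module as a stand-in for the Boost engine behind `editor.rereplace`, because that runs anywhere. The pattern is the one from the "Python Regex Replacements" sample quoted earlier in the thread.

```python
import re

# In a normal literal, \1 is an octal escape: one character with code 1 (SOH).
cooked = "\1"
# Raw notation and a doubled backslash both give backslash + "1".
raw = r"\1"
doubled = "\\1"

# Three capital letters followed by the same three letters again.
pattern = r"([A-Z]{3})\1"

# The engine receives backslash-1 and treats it as "insert group 1".
with_backref = re.sub(pattern, raw, "ABCABC")
# The engine receives a single SOH character and inserts it literally.
with_soh = re.sub(pattern, cooked, "ABCABC")

print(raw == doubled)        # True
print(repr(with_backref))    # 'ABC'
print(repr(with_soh))        # '\x01'
```

The regex engine never learns whether the literal was raw; it only sees the characters that survive Python's own escape processing.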

                So why isn’t your initial example working?
                Maybe there is an error in your script. Open the console as I did
                and run your script. Does it show an error?

                  Ekopalypse @M Andre Z Eckenrode

                  @M-Andre-Z-Eckenrode

                  I think I finally understand the issue.
                  Your code

                  editor.rereplace(r"^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)",r" “\1.\2” \(\3\) [\4]")
                  

                  was executed against an ANSI-encoded buffer and therefore resulted in “Test.png†(398) [740 x 2065 x 1],
                  whereas in a UTF-8 encoded buffer it would have returned “Test.png” (398) [740 x 2065 x 1]

                  So in order to make it work for both UTF-8 and ANSI, you might want to change it to

                  editor.rereplace(r"^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)", ur" “\1.\2” \(\3\) [\4]")
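
To illustrate where garbage like this comes from, here is a hedged sketch in plain Python 3. It assumes "ANSI" means Windows-1252 (cp1252) on this machine, and it only looks at the opening quote, so it is not meant to reproduce the exact output shown above.

```python
opening_quote = "\u201c"                      # the curly quote character

utf8_bytes = opening_quote.encode("utf-8")    # three bytes in UTF-8
ansi_bytes = opening_quote.encode("cp1252")   # one byte in Windows-1252

# If the three UTF-8 bytes land in a buffer that is displayed as
# Windows-1252, each byte is shown as its own unrelated character.
misread = utf8_bytes.decode("cp1252")

print(utf8_bytes)      # b'\xe2\x80\x9c'
print(ansi_bytes)      # b'\x93'
print(len(misread))    # 3
```

A unicode string lets the receiving side convert the character to whatever the buffer's encoding needs, instead of pushing pre-encoded bytes into it.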
                  

                    Alan Kilborn

                    The UTF-8 / ANSI thing. Hmmm.
                    The answer for me would be to avoid the whole ANSI thing entirely, if possible (I’m not clear if this is possible/impossible in the OP’s situation). Usually, if you’re in total control of the files for your own purposes and you are specifically choosing ANSI as the encoding, I’d wonder why.
                    People (especially “old timers”) seem to confuse ANSI and ASCII, which are different things, and not realize that ASCII (what they really mean when they say ANSI) is in reality fully represented in UTF-8.
                    Go UTF-8!
                    Hopefully the above comments are on-target. I’m far from an expert, being one of the “old timers”.
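
A quick way to check the "ASCII is fully represented in UTF-8" claim, sketched in plain Python 3. The sample strings are made up for illustration, and cp1252 is assumed to be the ANSI code page in question.

```python
ascii_text = "Test.png (398) [740 x 2065 x 1]"

# Pure ASCII text is byte-for-byte identical in all three encodings.
same_bytes = (
    ascii_text.encode("ascii")
    == ascii_text.encode("cp1252")
    == ascii_text.encode("utf-8")
)

# Only characters outside ASCII cost extra bytes in UTF-8.
quoted = "\u201cTest.png\u201d"            # Test.png wrapped in curly quotes
size_ansi = len(quoted.encode("cp1252"))   # 1 byte per curly quote
size_utf8 = len(quoted.encode("utf-8"))    # 3 bytes per curly quote

print(same_bytes)              # True
print(size_ansi, size_utf8)    # 10 14
```

So a file that is mostly ASCII grows only by the few bytes needed for its non-ASCII characters when saved as UTF-8.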

                    Otherwise:

                    I’m kind of “struck” by one line in one of the OP’s posts.

                    @M-Andre-Z-Eckenrode said:

                    If the r makes them literal, how would THEY work?

                    I think this is a common point of confusion.
                    All that the r"" in Python does is to save the user from having to type (and look at) doubled backslashes.
                    It has nothing to do with regex AT ALL.

                        M Andre Z Eckenrode @Ekopalypse

                        @Ekopalypse

                        Open the console as I did and run your script. Does it show an error?

                        I’ve now modified and expanded my code to encompass more use cases, as such:

                        editor.rereplace(r"^(.*)?\.(?i)(jpe?g|png|gif|tif{1,2}|bmp)\t(.*)\t(.*)", ur"  “\1.\2” \(\3\) [\4]")
                        

                        NPP running, Python console open, and code above run from the console’s Run line, it works like a charm, does exactly what it’s supposed to do.
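
For testing the expression outside Notepad++, here is a rough adaptation to Python 3's standard `re` module. This is an assumption-laden sketch, not the PythonScript call: `re` is not the Boost engine, so the inline `(?i)` is rewritten as a scoped `(?i:...)` group and the parentheses in the replacement are left unescaped, since `re` rejects both of the original forms. The names `PATTERN`, `REPLACEMENT` and `reformat_lines` are invented for this example.

```python
import re

# (?i:...) limits case-insensitivity to the file extension alternatives.
PATTERN = re.compile(
    r"^(.*)?\.(?i:(jpe?g|png|gif|tif{1,2}|bmp))\t(.*)\t(.*)",
    re.MULTILINE,
)

OPEN_QUOTE, CLOSE_QUOTE = "\u201c", "\u201d"   # typographical quotes
REPLACEMENT = OPEN_QUOTE + r"\1.\2" + CLOSE_QUOTE + r" (\3) [\4]"


def reformat_lines(text):
    """Rewrite 'name.ext<TAB>size<TAB>resolution' lines; leave others alone."""
    return PATTERN.sub(REPLACEMENT, text)


sample = "Test.png\t398\t740 x 2065 x 1\nPHOTO.JPG\t12\t10 x 10\nnotes.txt\t1\t2"
print(reformat_lines(sample))
```

Lines whose extension is not in the list, such as the `notes.txt` line in the sample, pass through unchanged.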

                        NPP running, Python console open, but code now run from file via menu Plugins > Python Script > Scripts > IMG+Size+Resolution, error message received:

                        SyntaxError: Non-ASCII character '\x93' in file C:\Users\Administrator\AppData\Roaming\Notepad++\plugins\Config\PythonScript\scripts\IMG+Size+Resolution.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
                        

                        After consulting the online document specified above, which recommends and discusses various encoding declaration schemes (ascii, latin-1, iso-8859-15, utf-8, etc.) for Python scripts, I put # coding: latin-1 at the top of my script file, and attempted another run, which gets me:

                        File "C:\Program Files (x86)\Notepad++\plugins\PythonScript\lib\encodings\cp1252.py", line 12, in encode
                        return codecs.charmap_encode(input,errors,encoding_table)
                        UnicodeEncodeError: 'charmap' codec can't encode character u'\x93' in position 2: character maps to <undefined>
                        

                        Changed encoding declaration in file to Windows-1252, and that works, thankfully. But a question remains for me: Why does it work with an explicit encoding declaration from the console’s Run line, but require the declaration when run from script file?
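
The difference between the two declarations can be reproduced with the codecs alone. This is a sketch in plain Python 3 that works on a single byte instead of the real script file:

```python
raw_byte = b"\x93"   # the byte an ANSI (Windows-1252) file stores for the opening curly quote

as_cp1252 = raw_byte.decode("cp1252")    # U+201C, left double quotation mark
as_latin1 = raw_byte.decode("latin-1")   # U+0093, an invisible control character

# Windows-1252 has no slot for U+0093, so converting it for an ANSI
# buffer fails in the same way as the traceback above.
try:
    as_latin1.encode("cp1252")
    latin1_roundtrip_ok = True
except UnicodeEncodeError:
    latin1_roundtrip_ok = False

cp1252_roundtrip_ok = as_cp1252.encode("cp1252") == raw_byte

print(latin1_roundtrip_ok)    # False
print(cp1252_roundtrip_ok)    # True
```

Latin-1 and Windows-1252 agree on most characters, but not on the range 0x80 to 0x9F, which is exactly where the curly quotes live.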

                        Thanks for your helpful suggestions, Ekopalypse.

                        @Alan-Kilborn

                        you are specifically choosing ANSI as the encoding, I’d wonder why.

                        Because 99.9% of the text I’m dealing with is adequately expressed in ANSI, and Unicode files are larger than ANSI. Granted, they’re just text files and not particularly large ones either way, but I’m a conservationist at heart and am compelled to avoid wasting space. :-)

                        All that the r"" in Python does is to save the user from having to type (and look at) doubled backslashes.

                        Thanks for the explanation, though still not clear on why doubled backslashes would be necessary without r"".

                          M Andre Z Eckenrode @M Andre Z Eckenrode

                          Why does it work with an explicit encoding declaration from the console’s Run line…

                          That was meant to say, “Why does it work WITHOUT an explicit encoding declaration from the console’s Run line…”

                            Ekopalypse @M Andre Z Eckenrode

                            @M-Andre-Z-Eckenrode said in Intricacies of Regex Replace via Python Script:

                            “Why does it work WITHOUT

                            If you run a script from the menu, then there is no other C++
                            interaction than telling the Python interpreter to run that specific file, and Python 2 (which PythonScript uses here) assumes the file to be ASCII encoded if not told otherwise, hence the "Non-ASCII character" error.

                            If you run code from the run textbox then PS ensures it is utf8 encoded.
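
The effect of an encoding declaration can be sketched with Python 3's built-in `compile()`, which accepts raw source bytes. Two caveats: Python 3 defaults to UTF-8 for undeclared source, whereas the Python 2 interpreter in this thread complains about any non-ASCII byte, and `run_source` is a helper invented for this example.

```python
def run_source(source_bytes):
    """Compile and execute raw source bytes, then return the variable s."""
    namespace = {}
    exec(compile(source_bytes, "<demo>", "exec"), namespace)
    return namespace["s"]


# The same statement twice, with byte 0x93 inside the string literal.
declared = b"# coding: cp1252\ns = u'\x93'\n"
undeclared = b"s = u'\x93'\n"

# With the declaration, byte 0x93 is read as the opening curly quote.
value = run_source(declared)

# Without it, the lone byte 0x93 is not valid in the default encoding.
try:
    run_source(undeclared)
    undeclared_ok = True
except (SyntaxError, UnicodeDecodeError):
    undeclared_ok = False

print(value == "\u201c")
print(undeclared_ok)
```

The declaration changes nothing about the bytes on disk; it only tells the interpreter how to turn them into characters.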

                            though still not clear on why doubled backslashes would be necessary without r""

                            Because certain sequences have a different meaning in each place. For example,
                            in a regex replace action the regex engine wants to receive
                            the string \1 or $1, but Python would convert \1 to SOH
                            before it reaches the boost regex engine.

                              Alan Kilborn

                              @M-Andre-Z-Eckenrode said in Intricacies of Regex Replace via Python Script:

                              Because 99.9% of the text I’m dealing with is adequately expressed in ANSI, and Unicode files are larger than ANSI. Granted, they’re just text files and not particularly large ones either way, but I’m a conservationist at heart and am compelled to avoid wasting space. :-)

                              I don’t think that is a compelling reason at all.
                              But, democracy rules, so do what you like. :-)
                              But, as this thread sort of shows, a complete understanding of encoding is an important thing.
                              For myself, I’m not 100% of the way there, but hopefully getting better all the time.
