    Intricacies of Regex Replace via Python Script

    • Ekopalypse @M Andre Z Eckenrode

      @M-Andre-Z-Eckenrode

      Just in case I wasn’t clear enough, this would be the code in Python:

      editor.rereplace(r'^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)', r'"\1.\2" \(\3\) [\4]')
      
          • M Andre Z Eckenrode @Ekopalypse

            @Ekopalypse

            Thanks for your reply.

            You cannot simply glue double quotes(").

            But I’m not “glueing” the normal (straight) double-quote marks, I’m doing the typographical (curly) ones (ASCII values 147 & 148, according to NPP’s character panel).

            The r in front of a string means raw string which basically informs python
            to use the string literally and not to treat escape sequences.

            That explanation doesn’t seem to be borne out by my previous experimental usage, or by Python Script’s installed help document, unless I’m missing it in there somewhere. The sample script “Python Regex Replacements”, included with Python Script, uses the following line of code:

            editor.rereplace(r"([A-Z]{3})\1", r"\1")
            

            I don’t think that would do its intended work if the r caused the strings to be seen as literal.

            Also, you preceded both strings with the r in your kindly suggested python code:

            editor.rereplace(r'^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)', r'"\1.\2" \(\3\) [\4]')
            

            If the r makes them literal, how would THEY work?

            In any case, all of this seems to be a moot point for me now. For some unknown reason, none of my attempts to use, or tweak and use, my own script are doing anything at all in NPP. I have no clue why. I’ve tried restarting NPP, and the computer, but still no luck. The sample script “Python Regex Replacements” works, but not mine. If anybody has any ideas, I’m open.

            • Ekopalypse @M Andre Z Eckenrode

              @M-Andre-Z-Eckenrode

              OK, I wrongly assumed that “ and ” were meant to be ", because your posted script

              editor.rereplace(r"^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)", r" “\1.\2” \(\3\) [\4]")
              

              works for me. Since it didn’t work on your side, I jumped to that conclusion.

              Concerning the raw string notation: it is needed because the boost::regex engine expects
              the literal two-character string \1 (for example, to insert the first captured group) and not the control character SOH.
              Python therefore has to be told not to convert it, either by using the raw notation r'\1' or by escaping the backslash as '\\1'.
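To make the SOH point concrete, here is a minimal sketch in plain Python. It uses the standard `re` module as a stand-in for `editor.rereplace`, which only exists inside the PythonScript plugin. `re` and Boost differ in details, but the string-literal behaviour shown here belongs to Python itself.

```python
import re

# Without the r prefix, Python itself turns \1 into the control character SOH (0x01).
plain = '\1'
raw = r'\1'
escaped = '\\1'

print(len(plain), repr(plain))   # a single character
print(len(raw), repr(raw))       # two characters: backslash and '1'
print(raw == escaped)            # r prefix and doubled backslash are equivalent

# The regex engine only sees a backreference if the backslash survives.
text = 'ABCABC'
collapsed = re.sub(r'([A-Z]{3})\1', r'\1', text)   # backreference to group 1
broken = re.sub(r'([A-Z]{3})\1', '\1', text)       # inserts SOH instead of group 1
print(repr(collapsed), repr(broken))
```

The pattern is the one from the plugin's sample script; only the replacement string differs between the two calls.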

              So why isn’t your initial example working?
              Maybe there is an error in your script. Open the console as I did
              and run your script. Does it show an error?

              • Ekopalypse @M Andre Z Eckenrode

                @M-Andre-Z-Eckenrode

                I think I finally understand the issue.
                Your code

                editor.rereplace(r"^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)",r" “\1.\2” \(\3\) [\4]")
                

                was executed against an ANSI-encoded buffer and therefore resulted in “Test.png†(398) [740 x 2065 x 1],
                whereas in a UTF-8-encoded buffer it would have returned “Test.png” (398) [740 x 2065 x 1].

                So, to make it work for both UTF-8 and ANSI, you might want to change it to

                editor.rereplace(r"^(.*)?\.(jpg|jpeg|png)\t(.*)\t(.*)", ur" “\1.\2” \(\3\) [\4]")
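The mismatch comes down to how the two curly quotes are stored as bytes. A small sketch in plain Python 3, run outside Notepad++ and assuming the ANSI code page is Windows-1252 (an assumption on my part; other systems may use a different code page):

```python
left, right = '\u201c', '\u201d'   # the curly quotes

# One byte each in Windows-1252 (decimal 147 and 148) ...
ansi_left = left.encode('cp1252')
ansi_right = right.encode('cp1252')

# ... but three bytes each in UTF-8.
utf8_left = left.encode('utf-8')
utf8_right = right.encode('utf-8')

# UTF-8 bytes read back as Windows-1252 turn into mojibake.
garbled = utf8_left.decode('cp1252')

print(ansi_left, ansi_right)
print(utf8_left, utf8_right)
print(garbled)
```

A byte string that is correct for one buffer encoding is therefore wrong for the other, which is why a Unicode string that the plugin can convert per buffer is the safer choice.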
                

                • Alan Kilborn

                  The UTF-8 / ANSI thing. Hmmm.
                  The answer for me would be to avoid the whole ANSI thing entirely, if possible (I’m not clear if this is possible/impossible in the OP’s situation). Usually, if you’re in total control of the files for your own purposes and you are specifically choosing ANSI as the encoding, I’d wonder why.
                  People (especially “old timers”) seem to confuse ANSI and ASCII, which are different things, and not realize that ASCII (what they really mean when they say ANSI) is in reality fully represented in UTF-8.
                  Go UTF-8!
                  Hopefully the above comments are on-target. I’m far from an expert, being one of the “old timers”.

                  Otherwise:

                  I’m kind of “struck” by one line in one of the OP’s posts.

                  @M-Andre-Z-Eckenrode said:

                  If the r makes them literal, how would THEY work?

                  I think this is a common point of confusion.
                  All that the r"" in Python does is to save the user from having to type (and look at) doubled backslashes.
                  It has nothing to do with regex AT ALL.
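A short sketch of the two layers involved, using Python's standard `re` module purely as an illustration (the plugin uses Boost, but the layering is the same): first Python decides which characters the string contains, then the regex engine interprets those characters.

```python
import re

# Layer 1: Python's string literal rules decide which characters the string contains.
assert r'\d+' == '\\d+'          # the same four characters either way
assert len('\t') == 1            # Python already produced a real tab
assert len(r'\t') == 2           # backslash + 't', left for the regex engine

# Layer 2: the regex engine interprets whatever characters it receives.
line = 'Test.png\t398'
with_real_tab = re.search('\t', line)    # pattern is a literal tab character
with_escape = re.search(r'\t', line)     # pattern is \t, which re reads as "tab"
same_place = with_real_tab.span() == with_escape.span()
print(same_place)
```

For `\t` both spellings happen to find the same tab, because both layers understand that escape. For `\1` they do not agree, which is where the raw prefix starts to matter.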

                      • M Andre Z Eckenrode @Ekopalypse

                      @Ekopalypse

                      Open the console as I did and run your script. Does it show an error?

                      I’ve now modified and expanded my code to encompass more use cases, as such:

                      editor.rereplace(r"^(.*)?\.(?i)(jpe?g|png|gif|tif{1,2}|bmp)\t(.*)\t(.*)", ur"  “\1.\2” \(\3\) [\4]")
                      

                      NPP running, Python console open, and code above run from the console’s Run line, it works like a charm, does exactly what it’s supposed to do.
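For readers without Notepad++ at hand, the effect of that call can be approximated with Python's standard `re` module. This is only a sketch under stated assumptions: `re.IGNORECASE` stands in for the inline `(?i)`, `re.MULTILINE` makes `^` match at each line start, the parentheses in the replacement are left unescaped because `re.sub` rejects `\(` there (the Boost replacement syntax used by the plugin is different), and the leading spaces of the original replacement are omitted. The sample input is made up from the output quoted earlier in the thread.

```python
import re

# Stand-in for the editor.rereplace call, using Python's re instead of Boost.
pattern = re.compile(r'^(.*)?\.(jpe?g|png|gif|tif{1,2}|bmp)\t(.*)\t(.*)',
                     re.IGNORECASE | re.MULTILINE)

# \u201c and \u201d are the curly quotes; backslashes for group references are doubled.
replacement = '\u201c\\1.\\2\u201d (\\3) [\\4]'

# Hypothetical tab-separated input: name, size, resolution.
sample = 'Test.png\t398\t740 x 2065 x 1\nSCAN.TIFF\t12\t10 x 20 x 1'
result = pattern.sub(replacement, sample)
print(result)
```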

                      NPP running, Python console open, but code now run from file via menu Plugins > Python Script > Scripts > IMG+Size+Resolution, error message received:

                      SyntaxError: Non-ASCII character '\x93' in file C:\Users\Administrator\AppData\Roaming\Notepad++\plugins\Config\PythonScript\scripts\IMG+Size+Resolution.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
                      

                      After consulting the online document specified above, which recommends and discusses various encoding declaration schemes (ascii, latin-1, iso-8859-15, utf-8, etc.) for Python scripts, I put # coding: latin-1 at the top of my script file, and attempted another run, which gets me:

                      File "C:\Program Files (x86)\Notepad++\plugins\PythonScript\lib\encodings\cp1252.py", line 12, in encode
                      return codecs.charmap_encode(input,errors,encoding_table)
                      UnicodeEncodeError: 'charmap' codec can't encode character u'\x93' in position 2: character maps to <undefined>
                      

                      Changed encoding declaration in file to Windows-1252, and that works, thankfully. But a question remains for me: Why does it work with an explicit encoding declaration from the console’s Run line, but require the declaration when run from script file?
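The difference between the two declarations can be reproduced in a few lines of plain Python 3. This sketch is not PythonScript-specific; it only shows how the byte 0x93 is interpreted under each encoding and why the `'charmap'` error quoted above appears with latin-1.

```python
raw_byte = b'\x93'   # how the left curly quote is stored in a Windows-1252 file

# latin-1 maps every byte to the code point with the same number,
# so 0x93 becomes U+0093, an invisible control character.
as_latin1 = raw_byte.decode('latin-1')

# Windows-1252 maps 0x93 to U+201C, the left curly quote.
as_cp1252 = raw_byte.decode('cp1252')

# U+0093 has no Windows-1252 representation, so encoding it back fails.
try:
    as_latin1.encode('cp1252')
    roundtrip_failed = False
except UnicodeEncodeError:
    roundtrip_failed = True

print(repr(as_latin1), repr(as_cp1252), roundtrip_failed)
```

So with latin-1 declared, the script contains a control character rather than a quote, and writing it into a Windows-1252 buffer cannot succeed.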

                      Thanks for your helpful suggestions, Ekopalypse.

                      @Alan-Kilborn

                      you are specifically choosing ANSI as the encoding, I’d wonder why.

                      Because 99.9% of the text I’m dealing with is adequately expressed in ANSI, and Unicode files are larger than ANSI. Granted, they’re just text files and not particularly large ones either way, but I’m a conservationist at heart and am compelled to avoid wasting space. :-)

                      All that the r"" in Python does is to save the user from having to type (and look at) doubled backslashes.

                      Thanks for the explanation, though still not clear on why doubled backslashes would be necessary without r"".

                      • M Andre Z Eckenrode @M Andre Z Eckenrode

                        Why does it work with an explicit encoding declaration from the console’s Run line…

                        That was meant to say, “Why does it work WITHOUT an explicit encoding declaration from the console’s Run line…”

                        • Ekopalypse @M Andre Z Eckenrode

                          @M-Andre-Z-Eckenrode said in Intricacies of Regex Replace via Python Script:

                          “Why does it work WITHOUT

                          If you run a script from the menu, there is no C++ interaction
                          other than telling the Python interpreter to run that specific file, and Python 2 assumes the file is ASCII encoded unless told otherwise.

                          If you run code from the run textbox then PS ensures it is utf8 encoded.
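The role of the encoding declaration can be sketched with `compile`, which accepts source code as raw bytes much like a file read from disk. Note the assumptions: this runs on Python 3, whose default source encoding is UTF-8 rather than the ASCII default of Python 2, and `try_compile` is a hypothetical helper written for this illustration. A lone 0x93 byte is invalid under either default, so the outcome is comparable.

```python
# Source code as raw bytes, the way the interpreter reads a .py file from disk.
# The string literal contains the single byte 0x93 (left curly quote in Windows-1252).
without_cookie = b's = "\x93"\n'
with_cookie = b'# coding: cp1252\n' + without_cookie


def try_compile(source_bytes):
    """Return the value of s after running the source, or None if it is rejected."""
    namespace = {}
    try:
        exec(compile(source_bytes, '<demo>', 'exec'), namespace)
    except (SyntaxError, UnicodeError):
        return None
    return namespace['s']


rejected = try_compile(without_cookie)   # no declaration: the byte cannot be decoded
accepted = try_compile(with_cookie)      # declared encoding makes it decodable
print(repr(rejected), repr(accepted))
```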

                          though still not clear on why doubled backslashes would be necessary without r""

                          Because certain sequences have a different meaning in Python than in the regex engine.
                          For example, in a regular replace action the regex engine wants to receive
                          the string \1 or $1, but Python would convert \1 to SOH
                          before it reaches the Boost regex engine.

                          • Alan Kilborn

                            @M-Andre-Z-Eckenrode said in Intricacies of Regex Replace via Python Script:

                            Because 99.9% of the text I’m dealing with is adequately expressed in ANSI, and Unicode files are larger than ANSI. Granted, they’re just text files and not particularly large ones either way, but I’m a conservationist at heart and am compelled to avoid wasting space. :-)

                            I don’t think that is a compelling reason at all.
                            But, democracy rules, so do what you like. :-)
                            But, as this thread sort of shows, a complete understanding of encoding is an important thing.
                            For myself, I’m not 100% of the way there, but hopefully getting better all the time.
