pythonscript: any ready pyscript to replace one huge set of regex/ phrases with others?



  • @guy038 said:

    may I ask for two improvements

    We don’t really need to repeat the delimiter; we just need to NOT ignore trailing space. What causes the trailing space to be ignored in the original script is the rstrip() function. By default this function removes all whitespace from the right side of a string. If we change it so that it strips only line-ending characters, rstrip('\n'), it will leave blanks on that side. Note that this will work for line endings of \n or \r\n in the file. I mention this because at first glance it would appear to work only for line endings of \n, but that is not the case.

    Using # as a comment character is also easy; we can do it with this logic: if line[0] == '#': continue, which means “if the first column of the data is #, then ‘continue’ the ‘for’ loop by jumping back up to the ‘for’ line, ignoring the rest of the indented lines under the ‘for’”.
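
    To see the effect of the two rstrip() variants outside Notepad++, here is a minimal sketch. The sample line is hypothetical; it mimics what f.readlines() would return for a list entry with two trailing blanks.

```python
# Hypothetical line as returned by f.readlines(): two trailing blanks, then a newline.
line = '!a!A  \n'

# rstrip() with no argument strips ALL trailing whitespace, blanks included.
stripped_all = line[1:].rstrip().split(line[0])

# rstrip('\n') strips only the newline, so the two trailing blanks survive.
stripped_newline = line[1:].rstrip('\n').split(line[0])

print(stripped_all)      # ['a', 'A']
print(stripped_newline)  # ['a', 'A  ']
```

    Only the second variant keeps the blanks that belong to the replacement text.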

    A new version of the “magic” (still LOL!) script is:

    # format for each line is: delimiter then search regex then delimiter then replace regex
    sr_list = [
        '!a!A  ',
        '# I start with # so I am merely a comment line',
        '@b@B',
        '!c!C',
        ]
    
    # or take input from a file:
    #with open(r'sr_list.txt') as f: sr_list = f.readlines()
    
    editor.beginUndoAction()
    
    for line in sr_list:
        if line[0] == '#': continue
        (s,r) = line[1:].rstrip('\n').split(line[0])
        editor.rereplace(s,r)
    
    editor.endUndoAction()


  • @guy038

    Note also that I said, above, “is a common way” and not “was a common way” as I do hope that Scott will be back, on our forum, very soon !

    me too, and i think all others too … a little secret: i saw him active at the npp github repo a few days ago 😃👍 … but don’t tell anyone ;-)

    I didn’t come back and just preferred going to bed as I’ve planned to spend a ski-day, as weather was quite nice Wednesday, on Grenoble and, in addition, I also met some friends of mine, in Chamrousse ski-station ;-))

    good done, best thing to do … but envyyyyyy ;-)

    XXXX ( INSTALL folder of N++ v7.6.2 , DIFFERENT from folder “C\Program files” and folder “C\Program files (x86)” )

    thanks for your tree, it comes in very handy and i’ve bookmarked it.

    for completeness i would edit it to:
    XXXX ( PORTABLE folder of N++ v7.6.2 , DIFFERENT from folder "C\Program files" and folder "C\Program files (x86)" )
    and/or a note that doLocalConf.xml has to be present.
    just to make sure readers will not get those structures mixed up with the different folder structure of an installed version without doLocalConf.xml.



  • Hello, @v-s-rawat, @alan-kilborn, @meta-chuh and All,

    Alan, I tried your second version and everything went OK ! However, I prefer having a final separator, in order to easily see, in the SR_list.txt, the contents of the replacement regex.

    So, here is, below, my own version of your excellent script :

    #coding=utf-8
    
    import re
    
    # --------------------------------------------------------------------------------------------------------------------------------------
    
    #                                           Script "Multiples_SR.py"
    
    # A LITTLE adaptation from an ORIGINAL and VALUABLE script of Alan KILBORN ( January 2019 ) !
    
    # See https://notepad-plus-plus.org/community/topic/16942/pythonscript-any-ready-pyscript-to-replace-one-huge-set-of-regex-phrases-with-others/21
    
    # This script :
    
    #   - Reads an existing "SR_List.txt" file, of the CURRENT directory, containing a list of SEARCH/REPLACEMENT strings, ONE PER line
    #   - Selects, one at a time, a COUPLE of SEARCH and REPLACEMENT regexes  / expressions / strings / characters
    #   - Executes this present S/R on CURRENT edited file, in NOTEPAD++
    #   - Loop till the END of file
    
    # Any PURE BLANK line or COMMENT line, beginning with '#', of the "SR_list.txt" file, are simply IGNORED
    
    # --------------------------------------------------------------------------------------------------------------------------------------
    
    # For EACH line, in the "SR_List.txt" file, the format is <DELIMITER><SEARCH regex><DELIMITER><REPLACE regex><DELIMITER>
    
    ## EXAMPLES :
    ## ¯¯¯¯¯¯¯¯
    
    ##  Deletes any [ending] "; comment"  /  Delimiter = '!'
    #!(?-s)(^.*?);.+!\1!
    
    ##  Changes any LOWER-case string "notepad++" in its UPPER-case equivalent  /  Delimiter = '@'
    #@(?-i)notepad\+\+@NOTEPAD++@
    
    ##  Changes any "Smith" and 'James' strings, with that EXACT case, to, respectively, "Name" and "First name"  /  Delimiter = '&'
    ##  Deletes any "TEST" string, with that EXACT case
    #&(Smith)|TEST|(James)&(?1Name)(?2First name)&
    
    ##  Replaces any BACKSLASH character with the "123" number, both  preceded and followed with 3 SPACE characters  /  Delimiter = '%'
    #%\\%   123   %
    ##    or, also, the syntax   %\x5c%   123   %
    
    ##  Deletes any string "Fix", followed with a SPACE char, whatever its CASE  /  Delimiter = '+'
    #+(?i)Fix ++
    
    ##  Change 3 CONSECUTIVE "#" characters with 3 BACKSLASH characters  /  Delimiter = '*'
    #*###*\\\\\\*
    
    # --------------------------------------------------------------------------------------------------------------------------------------
    
    # In the CODE line, right below, you may :
    
    #   - Modify the NAME of the file, containing the SEARCH and REPLACEMENT regexes  
    #   - Indicate an ABSOLUTE or RELATIVE path, before the filename
    
    with open(r'SR_list.txt') as f: sr_list = f.readlines()
    
    # You may, as well, insert the SEARCH and REPLACE regexes, directly, in THIS script :
    
    #sr_list = [
    #    '!(?-s)(^.*?);.+!\\1!',
    #    '@(?-i)notepad\\+\\+@NOTEPAD++@',
    #    '&(Smith)|TEST|(James)&(?1Name)(?2First name)&',
    #    '%\\\\%   123   %',
    #          # or the syntax  '%\x5c\x5c%   123   %',
    #    '+(?i)Fix ++',
    #    '*###*\\\\\\\\\\\\*',
    #    ]
    
    # The use of RAW strings  r'.......'  is also possible, in order to SIMPLIFY some regexes
    
    # Note that these RAW regexes are strictly IDENTICAL to those, which could be contained in a "SR_List.txt" file, WITHOUT the 'r' PREFIX 
    
    #sr_list = [
    #    r'!(?-s)(^.*?);.+!\1!',
    #    r'@(?-i)notepad\+\+@NOTEPAD++@',
    #    r'&(Smith)|TEST|(James)&(?1Name)(?2First name)&',
    #    r'%\\%   123   %',
    #          # or the syntax  r'%\x5c%   123   %',
    #    r'+(?i)Fix ++',
    #    r'*###*\\\\\\*',
    #    ]
    
    editor.beginUndoAction()
    
    console.write ('\nMODIFICATIONS on FILE "{}" :\n\n'.format(notepad.getCurrentFilename()))
    
    # Note : Variable e is always EMPTY string ( Part AFTER the THIRD delimiter and BEFORE the END of line ! )
    
    for line in sr_list:
    
        if line[0] == '#' or line == '\n' : continue
        (s,r,e) = line[1:].rstrip('\n').split(line[0])
    
        console.write('    SEARCH  : >{}<\n'.format(s))
        console.write('    REPLACE : >{}<\n\n'.format(r))
    
        editor.rereplace(s,r)   # or editor.rereplace(s,r,re.IGNORECASE) / editor.rereplace(s,r,re.I)
    
    editor.endUndoAction()
    
    # END of Multiple_SR.py script
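
    The tuple unpacking (s,r,e) above raises a ValueError when a line has too few or too many delimiters. A minimal sketch of a more defensive parser for the same line format follows. The function name parse_sr_line is hypothetical and not part of the script above; it is plain Python, so it can be tried outside Notepad++.

```python
def parse_sr_line(line):
    """Return (search, replace) for one list line, or None if it should be skipped.

    Expected format: <DELIMITER><SEARCH regex><DELIMITER><REPLACE regex><DELIMITER>
    Raises ValueError when the delimiter count is wrong.
    """
    line = line.rstrip('\n')
    # Skip blank lines and comment lines starting with '#'.
    if not line or line.startswith('#'):
        return None
    delimiter = line[0]
    parts = line[1:].split(delimiter)
    # A well-formed line yields exactly: search, replace, and an empty tail.
    if len(parts) != 3 or parts[2] != '':
        raise ValueError('malformed S/R line: {!r}'.format(line))
    return parts[0], parts[1]


print(parse_sr_line('+(?i)Fix ++\n'))   # ('(?i)Fix ', '')
print(parse_sr_line('# a comment\n'))   # None
```

    In the loop, a None result would mean "continue", and a ValueError could be reported on the console together with the offending line.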
    

    @meta-Chuh, as you said, I slightly modified the local Notepad++ tree, in my previous post, to point out the importance of the doLocalConf.xml file ;-))
    Cheers,

    guy038



  • @guy038

    Yea, probably a good idea. Trailing blanks are hard to see without having visible line ends turned on (yuck!), or doing them as \x20 or, as you like, a trailing delimiter.

    Glad you are enjoying the script and your own script mods!



  • Would you like to create a PR of the script to be added to https://github.com/bruderstein/PythonScript/tree/master/scripts/Samples? Otherwise I could also add the last version of @guy038 , if that is ok for you.

    I know the installation of PythonScript with N++ > 7.6.x is right now a horror. Hope I will find some time to get it compatible with PluginAdmin changes. The biggest problem known so far is the move of the location of python27.dll into the plugin folder.



  • @chcg

    I know the installation of PythonScript with N++ > 7.6.x is right now a horror.

    i’ve made a little guide and summary of all paths, while being in a chat with peter, for the installed version here

    and one for the portable version here

    maybe you can use it, if you need to help someone.

    The biggest problem known so far is the move of the location of python27.dll into the plugin folder.

    i suppose so, unless the plugin spawns a process with a different relative path, not bound to notepad++.exe’s path, or maybe even a static python27 library in the spawn.



  • Hi, @alan-kilborn and All,

    I did some tests, with your script and, finally, the Python regex engine seems more reliable than our Boost regex engine ;-))

    Some bugs or limitations, present in our Boost implementation ( see the REMARK section of this FAQ, below )

    https://notepad-plus-plus.org/community/topic/15765/faq-desk-where-to-find-regex-documentation

    do not occur anymore with the Python regex engine ;-))

    Indeed :

    • You can insert, either, in search and replacement regexes, characters, located outside the BMP, directly or with the syntax \x{HHHHHHHH}

    • The NUL character, \x{0000}, can be used, either, in search and replacement regexes

    • The backward assertions, as, for instance, \A, seem correctly supported

    • The Look-behind assertions are correctly handled, even if they overlap with the end of the previous match


    Seemingly, we’ll just lack, with the Python regex engine, the case modifiers, ( \u, \l, \U, \L and \E )

    These escaped sequences are available, with our Boost engine, in the replacement part. Refer to the address, below :

    https://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/format/boost_format_syntax.html#boost_regex.format.boost_format_syntax.escape_sequences

    For instance, against this text:

    This is simple test
    

    You may test the two regex S/R :

    SEARCH \w+

    REPLACE \u$0

    and

    SEARCH \w+

    REPLACE \U$0 $0\E <$0>

    AFAIK, they do not modify anything, ( I mean regarding case of characters ! ) when executed from a Python script :-((
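
    For comparison, Python’s own re module has no \u, \l, \U, \L or \E escapes in the replacement string; with re.sub() a case change is normally done by passing a function as the replacement. The following is a minimal sketch of that technique using plain re.sub(), so it runs outside Notepad++. The helper name shout is hypothetical. I believe editor.rereplace() accepts a function as replacement too, but that is an assumption not tested in this thread.

```python
import re

text = 'This is simple test'

# Rough equivalent of REPLACE \u$0 : upper-case the first letter of each word.
first_upper = re.sub(r'\w+', lambda m: m.group(0)[0].upper() + m.group(0)[1:], text)

# Rough equivalent of REPLACE \U$0 $0\E <$0> : two upper-cased copies, then the original.
def shout(m):
    word = m.group(0)
    return '{0} {0} <{1}>'.format(word.upper(), word)

shouted = re.sub(r'\w+', shout, text)

print(first_upper)  # This Is Simple Test
print(shouted)      # THIS THIS <This> IS IS <is> SIMPLE SIMPLE <simple> TEST TEST <test>
```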

    Best Regards,

    guy038



  • @guy038 said:

    I did some tests, with your script and, finally, the Python regex engine seems more reliable than our Boost regex engine

    Can you show some examples of the Python regex engine testing you did?



  • @guy038,

    the script provided by @Alan-Kilborn uses the boost regex implementation from the PythonScript plugin, which, as you’ve already shown, is implemented differently than with npp.



  • @Eko-palypse

    Well that’s kinda what I was getting at by asking @guy038 that last question. I couldn’t tell from what he was saying if he was talking about the earlier script or if he had tried some real Python re.xxx functions for search and replace. Hence my question to him.

    uses the boost regex implementation from the PythonScript plugin which is implemented differently than with npp

    Is it truly, though? I always thought that it made calls back to whatever regex engine is in N++, but, hmmm, maybe not. Maybe I should check the source code. :)



  • @Alan-Kilborn

    From what I understand, yes, this is the case, it has the boost:regex engine implemented
    https://github.com/bruderstein/PythonScript/blob/d54a2b434ec2b51f0dbacd3828fc36a20533c2dc/PythonScript/src/Replacer.cpp



  • Hi, @alan-kilborn, and All,

    Alan, it’s just all the points, described in my previous post !


    You can insert, either, in search and replacement regexes, characters, located outside the BMP, directly or with the syntax \x{HHHHHHHH}

    From the text below :

    🍬 = \x{1F36C}
    🎂 = \x{1F382}
    🎄 = \x{1F384}
    🎅 = \x{1F385}
    🎇 = \x{1F387}
    🎺 = \x{1F3BA}
    👼 = \x{1F47C}

    with the Python regex engine, you can use :

    SEARCH [\x{0001F36C}-\x{0001F47C}].+ or [\x{1F36C}-\x{1F47C}].+

    REPLACE \x{1F385} = \\x{1F385}

    So, with my modified script : @[\x{1F36C}-\x{1F47C}].+@\x{1F385} = \\x{1F385}@

    and you get:

    🎅 = \x{1F385}
    🎅 = \x{1F385}
    🎅 = \x{1F385}
    🎅 = \x{1F385}
    🎅 = \x{1F385}
    🎅 = \x{1F385}
    🎅 = \x{1F385}

    For characters with code, above \x{FFFF}, you cannot do this kind of S/R with our Boost regex engine


    The NUL character, \x{0000}, can be used, either, in search and replacement regexes

    For instance, you can execute the following S/R, with the Python regex engine :

    SEARCH [\x20-\x7f]

    REPLACE $0\x00

    giving for the script : @[\x20-\x7f]@$0\x00@

    This S/R cannot be run with our Boost regex engine, which just deletes all the characters


    The backward assertions, as, for instance, \A, seem correctly supported

    Just imagine the text “This is a test” in a new N++ tab and the regex S/R :

    SEARCH \A.

    REPLACE -

    So, in the script, the syntax @\A.@-@

    With the Python regex engine, we get the correct text -his is a test ! With our Boost regex engine, after clicking on the Replace All button, we, wrongly, obtain the text -------------- :-((
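
    As a cross-check of the expected result, here is the same S/R done with Python’s own re module (a sketch only: re.sub() is a different engine from both editor.rereplace() and the N++ Replace dialog).

```python
import re

text = 'This is a test'

# \A anchors at the very start of the subject string, so only one match is possible.
result = re.sub(r'\A.', '-', text)

print(result)  # -his is a test
```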


    The Look-behind assertions are correctly handled, even if they overlap with the end of the previous match

    Consider the text aaaabaaabaaa and the regex S/R :

    SEARCH (?<=a)ba+

    REPLACE 123a

    => the syntax @(?<=a)ba+@123a@, in the script

    With the Python regex engine, the text is correctly modified as aaaa123a123a ( two S/R ) whereas, with the Boost regex engine, after clicking on the Replace All button, we get the wrong string aaaa123abaaa

    Indeed, the second match never occurs, as it should have seen that the last char of replacement a was right before the baaa string, hence a second match :-((
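
    The two expected matches can be listed with Python’s own re module. This is only a sketch to show the expected result; it does not exercise the Boost engine. In re, the look-behind inspects the original subject string, including characters consumed by the previous match.

```python
import re

text = 'aaaabaaabaaa'
pattern = r'(?<=a)ba+'

# Each entry is (start offset, matched text).
matches = [(m.start(), m.group(0)) for m in re.finditer(pattern, text)]
result = re.sub(pattern, '123a', text)

print(matches)  # [(4, 'baaa'), (8, 'baaa')]
print(result)   # aaaa123a123a
```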

    Cheers,

    guy038



  • @guy038

    are you really using the python regex engine?
    This would mean you have some code like re.sub(pattern, repl, string, count=0, flags=0)
    but the snippet you showed earlier uses editor.rereplace which is supposed to be the boost regex engine.



  • Hi, @eko-palypse, @alan-kilborn and All,

    Huum…, I’m a bit confused ! When I say : “With the Python regex engine…”, I’m just saying that I did all the tests with Alan’s script, above, which does use the helper method editor.rereplace ! And, of course, the classical N++ Replace dialog, to compare with.

    In fact, I’m already aware of this fact, as, some time ago, I noticed differences, while using Scott Sumner’s or Claudia Frank’s Python scripts, which dealt, essentially, with searches ! As, this time, we have a nice search and replace script, I just verified that my assumptions were correct : the present behavior of the editor.rereplace method gives improved results and seems to fix some bugs of the current implementation of the Boost library, within Notepad++ :-))

    But, I’m not a true coder ! So, unfortunately, it’s… up to all of you, to tell me why it looks better ;-))

    Cheers,

    guy038



  • @Eko-palypse @guy038

    So to clarify, when using the Pythonscript plugin, one can do 1 of 2 things:

    • editor.rereplace() which uses the Boost regex that is very similar to, but maybe not exactly the same as the one directly in N++
    • use re.sub() which uses the Python regex engine (which is its own thing, not Boost, not PCRE, not ANYTHING except Python’s own re module)

    So far I believe everything discussed in this thread is using the FIRST one.
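
    A minimal sketch of what the SECOND option could look like. The function name apply_pairs and the sample pairs are hypothetical. The editor.getText() / editor.setText() wrapping in the last comment is an assumption about how one might hook it into PythonScript and is untested here.

```python
import re

def apply_pairs(text, pairs, flags=0):
    """Apply each (search, replace) pair in order with Python's re.sub()."""
    for search, replace in pairs:
        text = re.sub(search, replace, text, flags=flags)
    return text

# Python's re uses \1 or \g<0> for groups in the replacement, not $0 or $1.
pairs = [
    (r'(\w+)@(\w+)', r'\2@\1'),
    (r'notepad\+\+', 'NOTEPAD++'),
]
print(apply_pairs('user@host runs notepad++', pairs))  # host@user runs NOTEPAD++

# Inside PythonScript this could wrap the whole document (assumption, untested):
# editor.setText(apply_pairs(editor.getText(), pairs))
```

    Note that the regex syntax is then Python’s, so Boost-only constructs such as the conditional replacement (?1Name) would not carry over.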



  • @guy038 said:

    When I (say) “With the Python regex engine…”, I’m just saying that I did all the tests with…Alan’s script

    “With the Python regex engine” would be my SECOND bullet point above, but that is not what you’re doing unless you’ve changed the editor.rereplace() call in the script to a re.sub() call (and slightly changed the other logic to cope with that change).

    BTW when you import re (to get access to the re.IGNORECASE aka re.I flag) that is all you are doing–getting access to that, which happens to be shared, for convenience, with the Boost regex engine.



  • So what I get from this is that there is a difference in the implementation details of boost:regex between npp and the pythonscript plugin.
    So the best would be if the pythonscript plugin implemented the missing pieces and npp silently stole the code and
    adapted it to work the same ;-)



  • @guy038 said:

    SEARCH \w+

    REPLACE \u$0

    AFAIK, they do not modify anything, ( I mean regarding case of characters ! ) when executed from a Python script :-((

    Interesting. I noticed that the following variant on that above WILL work to affect case when using editor.rereplace() in a script:

    Find: (\w+)
    Repl: \U\1

    It seems like either variant should capitalize all lowercase letters in a document. HOWEVER, only the script version does this! When run interactively with the Replace dialog in Notepad++, these 2 variants only capitalize the first letter of every “word”.

    Can anyone offer an explanation for:

    • why Guy’s original regex replace does nothing in the script
    • why both of these regex replaces only change to uppercase the first letter of every “word” when run with N++ interactive replace (but – and I think act correctly in the script)


  • @Alan-Kilborn said:

    why both of these regex replaces only change to uppercase the first letter of every “word” when run with N++ interactive replace (but – and I think act correctly in the script)

    Let me correct this:

    • why both of these regex replaces only change to uppercase the first letter of every “word” when run with N++ interactive replace (but the one that involves capturing group #1 and using \1 in the replace part – acts correctly in the script, at least I think it does)

    Hmm, better but maybe still not a great way of expressing it. :-P



  • @Alan-Kilborn

    If I understand you correctly, I’m totally lost; my setup must have some kind of built-in wizard, as I do get a different result. So just to clarify: having the text this is some text and aiming to get THIS IS SOME TEXT, we would use \w+ and replace with \U$0, or (\w+) with \U$1 as replacement. For me, both work the same in the dialog and neither works when called like editor.rereplace('\w+','\U$0') from a script. But you do have a different result?

