RegEx command to delete string with variable numbers



  • Hello everyone,

    I am looking for the proper RegEx command to delete a recurring string that contains variable numbers.

    I want to delete the timestamps on a Comment Summary page from a PDF. For now I do it manually, one by one: export the FDF data, rename it as XML, open it with Notepad++ and search for these strings:

    code_text
    /CreationDate(D:20200605015359+02'00') 
    /M(D:20200605015359+02'00')
    code_text
    

    The numbers are the timestamp, which varies.

    Finally, I rename the file back to FDF.

    Since I don't know how to code, if someone is kind enough, I'm looking for a script that does the same thing without opening Notepad++.

    Thanks in advance



  • @Paul-smithers

    You’re probably going to want to show some “after” text sample as well.
    The way I read what you want is that you’d end up with:

    code_text
    code_text
    

    which I’m 99% certain isn’t what you want.



  • Thanks for your quick response.

    Let me clarify what I need.

    In Adobe DC it is possible to create a file with all the comments written on a PDF document; it’s called a Comment Summary:
    https://helpx.adobe.com/acrobat/kb/print-comments-acrobat-reader.html

    What I want is to delete the comment timestamps in that file. This is possible by exporting the comments metadata as an .fdf file.

    Change the extension to .xml and it looks something like this:

    timestamptest3.jpg

    Now, for every comment, delete these two strings:

    /CreationDate(D:20200605015359+02'00')
    /M(D:20200605015359+02'00')

    The numbers are the timestamp, which changes every time.

    The result should be like this:

    timestamptest4.jpg

    For now I do it manually, so I would like the proper RegEx search expression to find all occurrences of those two strings and replace them with nothing, i.e. delete them.

    Since I have many PDFs with hundreds of comments, it would be nice if someone could help me write a script that does the same job without doing the replacement in Notepad++.

    Thanks again



  • @Paul-smithers

    I’d say this regex could match your situation:

    (?:/CreationDate|/M)\(D:\d{14}\+02'00'\)

    It seems like the +02'00' is constant, but if it is variable, we can deal with that as well.
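For the scripted version asked about earlier in the thread, here is a minimal Python sketch that applies the same regex without opening Notepad++. It assumes the exported FDF contains the timestamp entries exactly as shown above and that the offset is always +02'00'. The names `strip_timestamps` and `clean_fdf`, and the `_clean` output suffix, are made up for illustration. The file is handled as raw bytes so that nothing else in it is re-encoded.

```python
import re
from pathlib import Path

# Same pattern as above, written as a bytes regex.
# Assumption: the offset is always +02'00', as in the sample.
TIMESTAMP = re.compile(rb"(?:/CreationDate|/M)\(D:\d{14}\+02'00'\)")


def strip_timestamps(data: bytes) -> bytes:
    """Return data with every /CreationDate(...) and /M(...) entry removed."""
    return TIMESTAMP.sub(b"", data)


def clean_fdf(path: str) -> Path:
    """Write a cleaned copy next to the original (hypothetical '_clean' suffix)."""
    src = Path(path)
    dst = src.with_name(src.stem + "_clean" + src.suffix)
    dst.write_bytes(strip_timestamps(src.read_bytes()))
    return dst
```

Calling `clean_fdf("comments.fdf")` would then write `comments_clean.fdf` beside the original, so no renaming to .xml is needed. Whether Acrobat accepts the result still has to be checked on a real export.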



  • Hello, @paul-smithers and All,

    A regex search/replacement could be :

    SEARCH (?-i)(?:(/CreationDate)|M)\(D:\d{14}\+02'00'\)/

    REPLACE \x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20(?1\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20/)

    And here are the changes :

    BEFORE : <</C[1.0 0.819611 0.0]/CreationDate(D:20200606114426+02'00')/F 28/M(D:20200606114426+02'00')/NM...
    AFTER  : <</C[1.0 0.819611 0.0]                                      /F 28/                           NM...
    

    As @alan-kilborn said, if the string +02'00' is not constant, change the search regex as below :

    SEARCH (?-i)(?:(/CreationDate)|M)\(D:\d{14}.{7}\)/
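Outside Notepad++, the padding does not have to be typed out by hand: a replacement function can derive it from the length of each match. The following is a rough Python sketch of that idea, not a translation of the conditional replacement syntax above, which Python's `re` module does not have. The function name `pad_timestamps` is hypothetical, and the leading (?-i) is dropped because Python regexes are case-sensitive by default.

```python
import re

# Variable-offset form of the search regex above (trailing '/' included).
STAMP = re.compile(r"(?:(/CreationDate)|M)\(D:\d{14}.{7}\)/")


def pad_timestamps(text: str) -> str:
    """Replace each timestamp entry with as many spaces as it had characters."""
    def repl(m: re.Match) -> str:
        width = len(m.group(0))
        if m.group(1) is not None:
            # /CreationDate case: blank the entry but keep the trailing '/'
            return " " * (width - 1) + "/"
        # M case: blank everything, including the trailing '/'
        return " " * width
    return STAMP.sub(repl, text)
```

Applied to the BEFORE line above, this should give the AFTER line, with the total line length unchanged.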

    Best Regards,

    guy038



  • @guy038

    Hi Guy,
    Is there a way to get the number of spaces to use for the replacement, from the length of the original match?



  • Hi, @paul-smithers, @Alan-kilborn and All,

    Yeaaaah ! Indeed, there is a method ;-))

    I thought about the very basic replacement of each single standard char( . ) with a space char ( \x20 )

    But we need to replace text with spaces in some zones only, not everywhere ! To achieve such a task, we’ll use a feature available in our regex engine since Notepad++ v7.7 : the backtracking control verbs ! Why did this idea come to my mind ? Well, just because I’m preparing a documentation on these zero-width assertions !

    Fundamentally, the goal is to use this generic regex, below :

    ^What we do NOT want to match(*SKIP)(*F)|what we WANT to match, delimited with a LOOK-AHEAD|Again, what we do NOT want to match(*SKIP)(*F)|Again, what we WANT to match, delimited by another LOOK-AHEAD|.... and so on

    Alan, could you be patient till I build up and post this documentation about these backtracking control verbs ?

    Meanwhile, you’ll find some hints, here :

    https://www.rexegg.com/backtracking-control-verbs.html#skipfail


    A little practice :

    Assuming the initial and final text, desired by @paul-smithers

    BEFORE : <</C[1.0 0.819611 0.0]/CreationDate(D:20200606114426+02'00')/F 28/M(D:20200606114426+02'00')/NM...
    AFTER  : <</C[1.0 0.819611 0.0]                                      /F 28/                           NM...
    

    We can tell that :

    • First, text, from beginning of line till a ] is unwanted

    • Then, text, till the string /F, is wanted and, for each single char in this zone, we want to replace it with a space char

    • Now, the text /F 28/ is unwanted

    • Finally, text till the string NM is also wanted and again, for each single char in this zone, we want to replace it with a space char


    So, look how easy it is to build up the search regex, from the points above ! In addition, I’ll use the free spacing mode for a better readability

    SEARCH (?x-s) ^.+] (*SKIP)(*F) | (?=.+/F) . | /F\x2028/ (*SKIP)(*F) | (?=.+NM) .

    REPLACE \x20

    We get :

    
    Text of @paul-smithers :
    
    BEFORE : <</C[1.0 0.819611 0.0]/CreationDate(D:20200606114426+02'00')/F 28/M(D:20200606114426+02'00')/NM...
    AFTER  : <</C[1.0 0.819611 0.0]                                      /F 28/                           NM...
    
    Other TESTS :
    
    BEFORE : [1.0 0.819611 0.0]/CreationDate(D:20200606114426)/F 28/M(D:20200606+02'00')/NM...
    AFTER  : [1.0 0.819611 0.0]                               /F 28/                     NM...
    
    
    BEFORE : [1.0 0.819611]/CreationDate(+02'00')/F 28/M(D:114426+02)/NM...
    AFTER  : [1.0 0.819611]                      /F 28/               NM...
    
    

    Magic, isn’t it ;-))


    Notes :

    • Beware of the final dot, after the two positive look-aheads !

    • Of course, in case of a huge file, performance problems may occur, as each single character is replaced with a space !

    • Note, also, that the use of the \K feature would not give the same behavior. Indeed, in that case, the character to replace ( the . ) must come, necessarily, right after \K, because this regex contains 2 alternatives only, unlike the 4 alternatives of the former regex ! So only the first character of each zone gets replaced. Just try it :

    SEARCH (?-s)^.+]\K(?=.+/F).|/F 28/\K(?=.+NM).

    Cheers,

    guy038



  • @paul-smithers, @Alan-kilborn and All,

    I guess I must have been influenced by my upcoming documentation on Backtracking control verbs !

    In fact, be reassured, there is still a classical solution, which does not use this new feature. Here is this second solution, written in free-spacing mode (?x) :

    SEARCH (?x-s) (^.+]) | (?=.+/F) (.) | (/F\x2028/) | (?=.+NM) (.)

    REPLACE (?1$0)(?2\x20)(?3$0)(?4\x20)

    As you can see :

    • Any part that we do not want to change is matched and simply rewritten as is ( $0 )

    • In the zones that we do care about, each single standard character ( . ) is replaced with a space char ( \x20 )
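The same "rewrite the kept zones, blank the others" logic can be sketched in Python, whose standard `re` module does not support (*SKIP)(*F) but does accept a function as the replacement. This is only an illustration under the assumptions of the sample line above (the zones are delimited by `]`, `/F 28/` and `NM`). The name `blank_zones` is hypothetical, and the flags are passed as `re.VERBOSE | re.MULTILINE` because Python does not accept a leading (?x-s).

```python
import re

ZONES = re.compile(
    r"""
      (^.+\])        # group 1: from line start up to the last ']'  (kept)
    | (?=.+/F) (.)   # group 2: one char that still has '/F' ahead  (blanked)
    | (/F\x2028/)    # group 3: the literal text '/F 28/'           (kept)
    | (?=.+NM) (.)   # group 4: one char that still has 'NM' ahead  (blanked)
    """,
    re.VERBOSE | re.MULTILINE,
)


def blank_zones(text: str) -> str:
    """Rewrite kept zones unchanged and turn every other matched char into a space."""
    def repl(m: re.Match) -> str:
        kept = m.group(1) is not None or m.group(3) is not None
        return m.group(0) if kept else " "
    return ZONES.sub(repl, text)
```

Groups 1 and 3 play the role of (?1$0) and (?3$0), and every other match is a single character replaced by one space, as with (?2\x20) and (?4\x20).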

    BR

    guy038



  • Hello, first of all thanks everyone for the help.

    I have tried the first proposal, (?:/CreationDate|/M)\(D:\d{14}\+02'00'\) from @Alan-kilborn, and it works perfectly, since I don't need the \x20 space char.

    For some reason, if I use the other proposals, Acrobat refuses to import the modified FDF file because of an unspecified error.

    Anyway, my problem is resolved now. I use this script to remove the author name https://adobe.ly/3emVRkC and the search RegEx for the timestamp.

    Thanks again.

