help replacing



  • Hello @adam-creason, @scott-sumner and All,

    Pleased to be back on our N++ site. I just realized that I have not posted anything during September ! Luckily, this was not because of health problems :-))

    Very clever regex, Scott ! But, just for information :

    • We can merge the (?-i) and (?-s) modifiers

    • We can use the positive look-ahead (?=\R), instead of matching the end-of-line characters

    • So, we don’t need a first group to get the total range of characters name ####- …name ####-, which, thus, can be rewritten with the classical $0 syntax ( the entire searched expression )

    • Finally, if we suppose that the last item is followed by some EOL characters and that there is only one pair of identical strings name ####- in the file, we can replace the lazy syntax .*? ( zero or more characters ) with the greedy syntax .+ ( one or more characters )

    Indeed, even in case of two associated lines, which would be consecutive, like below :

    name 0003-third value
    name 0003-
    

    The part (?s).+ would match, only, the EOL character(s) ( the minimum zone ! )


    To sum up, we get the shorter regex :

    SEARCH (?-is)(name \d{4}-)(.+)(?s).+\1(?=\R)

    REPLACE $0\2
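
    For readers who want to try this search/replace outside Notepad++, here is a minimal Python sketch. It is only an approximation: Python's re module has no \R, so the look-ahead is written (?=\r\n|\n|\r), the scoped group (?s:.+) stands in for the mid-pattern (?s), and the sample text is made up from the name ####- format. Because each match swallows everything between the two paired lines, one pass may not reach every pair, so the sketch repeats the substitution until nothing changes.

```python
import re

# Approximation of (?-is)(name \d{4}-)(.+)(?s).+\1(?=\R) for Python's re module.
# [^\r\n]+ plays the role of the single-line (.+); (?s:.+) may cross lines.
PAIR = re.compile(r"(name \d{4}-)([^\r\n]+)(?s:.+)\1(?=\r\n|\n|\r)")

# Hypothetical sample: valued lines first, bare "name ####-" lines afterwards.
text = (
    "name 0001-first value\n"
    "name 0002-second value\n"
    "\n"
    "name 0002-\n"
    "name 0001-\n"
)

passes = 0
while True:
    # \g<0> is the whole match (Notepad++'s $0); \2 is the captured value.
    text, count = PAIR.subn(r"\g<0>\2", text)
    if count == 0:
        break
    passes += 1

print(text)
```

    In this sketch the loop is the scripted counterpart of clicking Replace All several times.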

    Of course, Adam, I bet that your problem has been solved for some time now :-D

    Best Regards,

    guy038



  • Scott and Guy - Hopefully, with your help, Adam’s problem has been solved, so I am not perceived as hijacking his thread.

    I have a similar issue, that I posted a month ago, but no one replied.
    You can see it at: https://notepad-plus-plus.org/community/topic/14400/find-replace-w-multiple-instances-using-a-conversion-file
    Perhaps the solution to my problem is similar to Adam’s, and you can solve it by just slightly modifying the fix you have already worked out.

    My input file, in simplified form, is similar to Adam’s:

    ==================================================
    111111,1100111
    222222,2200100
    333333,3300188

    0,2017,C001,ORFECN397,EC78,333333,FECN397,97,0,0.00,0.00,59035.26-
    0,2017,C001,ORFECN15,EC7,111111,FECN15,252.29-,0.00,35904.63-
    0,2017,C001,ORFECN355,EC66,222222,FECN355,355,0,442604.60-
    0,2017,C001,ORFECN30,EC78,222222,FECN30,0,5488848.52-,31430468.94-

    The output I am after is replacing all occurrences of the 6-digit account numbers found in the second part of the file with their corresponding 7-digit replacements shown in the first part:

    0,2017,C001,ORFECN397,EC78,3300188,FECN397,97,0,0.00,0.00,59035.26-
    0,2017,C001,ORFECN15,EC7,1100111,FECN15,252.29-,0.00,35904.63-
    0,2017,C001,ORFECN355,EC66,2200100,FECN355,355,0,442604.60-
    0,2017,C001,ORFECN30,EC78,2200100,FECN30,0,5488848.52-,31430468.94-

    Each 6-digit number can occur once, twice, hundreds of times, or not at all.
    They are not in order - I can of course sort the first part of the file, but the 6-digit account numbers to be replaced are scattered all over the second part.

    And these are huge files - the first part, the conversion table, could be several thousand lines long, and the second half can be 32MB or more, meaning over 100,000 lines!
    So I might be holding down ALT+A for a very long time, unless you know of another way.
    If it helps, the two parts can be in separate files.

    Any ideas on making N++ able to perform this huge, multi-input find-and-replace?



  • Hello, @perry-sticca, @scott-sumner and All,

    Ah ! I’m quite happy because I found a way to get the job done with a SINGLE mouse click on the Replace All button :-D

    But, first, am I right, saying that :

    • The first part ( the conversion table ) contains only different lines, corresponding to the total number of accounts ?

    • In the second part ( about 32 MB and 100,000 lines ), each account can be present once, twice or any number of times. On the other hand, an account may also be absent from that second part, although a line with its new account number is present in the first part ?

    If we can answer “yes” to these questions, here is my method, in a few steps :

    • Build a file containing, FIRST, all the financial data ( the huge part ! ) and, SECONDLY, your conversion table of all the account numbers, with any separator between them ( some blank lines, a single dashed line or whatever ! )

    REMARK : Both, the financial list and the conversion table do NOT need to be sorted, in any way

    • Open this new file, in Notepad++

    • Move back to the very beginning ( Ctrl + Home )

    • Open the Find / Replace dialog ( Ctrl + H )

    • Check, only, the Regular expression search mode

    • In the Find what: zone, type : ,(\d{6})(?s)(?=,.*\1(,\d{7}))

    • In the Replace with: zone, type, simply, \2

    • Click on the Replace All button, once only

    Et voilà !


    NOTES :

    • The beginning ,(\d{6}) looks, in the first huge part, for any six-digit account number, on each line, preceded by a comma

    • But, ONLY IF, further on, in the last part ( the conversion table ), the same six-digit number can be found, followed by a comma and the new seven-digit account number, which are stored together as group 2 ( (?s)(?=,.*\1(,\d{7})) )

    • If so, the comma and the six-digit account number are simply replaced by group 2 ( the comma followed by the new seven-digit account number ! )

    • The process continues from line to line, till it reaches the conversion table, where the replacement process stops, because it’s impossible to find, further on, another line of the form "Old_account_number,New_account_number"
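
    The notes above can be checked with a small Python sketch. It assumes that re.DOTALL is an acceptable stand-in for the (?s) modifier and uses a shortened copy of the sample data; account 555555 is deliberately left out of the table to show that an account without a table entry is not touched.

```python
import re

# Python rendering of ,(\d{6})(?s)(?=,.*\1(,\d{7})): re.DOTALL plays the role of (?s).
ACCOUNT = re.compile(r",(\d{6})(?=,.*\1(,\d{7}))", re.DOTALL)

# Financial data first, conversion table last (shortened sample).
text = (
    "0,2017,C001,ORFECN397,EC78,333333,FECN397,97,0,0.00,0.00,59035.26-\n"
    "0,2017,C001,ORFECN15,EC7,111111,FECN15,252.29-,0.00,35904.63-\n"
    "0,2017,C001,ORFECN123,EC325,555555,FECN123,58,0,0.00,0.00,6752.99-\n"
    "\n"
    "111111,1100111\n"
    "333333,3300188\n"
)

# Group 2 (comma + new number) is captured inside the look-ahead and reused here.
converted, count = ACCOUNT.subn(r"\2", text)
print(count)
print(converted)
```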


    IMPORTANT :

    If your main file is too large for Notepad++, you may split it into several files !

    Just one condition : add the entire conversion table at the end of each of them and perform the regex S/R, above, on EACH file.

    Finally, just remove the conversion table part from each file and merge all these files
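
    As a rough illustration of this split / convert / merge procedure, here is a Python sketch. The function name convert_in_chunks and the chunk_size parameter are hypothetical, re.DOTALL stands in for (?s), and every line is assumed to end with "\n".

```python
import re

ACCOUNT = re.compile(r",(\d{6})(?=,.*\1(,\d{7}))", re.DOTALL)

def convert_in_chunks(main_lines, table_lines, chunk_size):
    """Apply the S/R to chunk_size lines at a time, each followed by the full table."""
    # A blank separator line, then the whole conversion table.
    table_text = "\n" + "".join(table_lines)
    pieces = []
    for start in range(0, len(main_lines), chunk_size):
        combined = "".join(main_lines[start:start + chunk_size]) + table_text
        replaced = ACCOUNT.sub(r"\2", combined)
        # In this layout the table lines are never matched (their six-digit
        # numbers are not preceded by a comma), so the table is cut off by length.
        pieces.append(replaced[:len(replaced) - len(table_text)])
    return "".join(pieces)

main_lines = [
    "0,2017,C001,ORFECN397,EC78,333333,FECN397,97,0,0.00,0.00,59035.26-\n",
    "0,2017,C001,ORFECN15,EC7,111111,FECN15,252.29-,0.00,35904.63-\n",
    "0,2017,C001,ORFECN355,EC66,222222,FECN355,355,0,442604.60-\n",
]
table_lines = ["111111,1100111\n", "222222,2200100\n", "333333,3300188\n"]

merged = convert_in_chunks(main_lines, table_lines, chunk_size=2)
print(merged)
```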

    Best Regards,

    guy038

    P.S. :

    So, given the original text :

    ------- Main financial file -------
    
    0,2017,C001,ORFECN397,EC78,333333,FECN397,97,0,0.00,0.00,59035.26-
    0,2017,C001,ORFECN15,EC7,111111,FECN15,252.29-,0.00,35904.63-
    0,2017,C001,ORFECN355,EC66,222222,FECN355,355,0,442604.60-
    0,2017,C001,ORFECN123,EC325,555555,FECN123,58,0,0.00,0.00,6752.99-
    0,2017,C001,ORFECN30,EC78,222222,FECN30,0,5488848.52-,31430468.94-
    
    0,2017,C001,ORFECN123,EC325,555555,FECN123,58,0,0.00,0.00,6752.99-
    0,2017,C001,ORFECN15,EC7,111111,FECN15,252.29-,0.00,35904.63-
    0,2017,C001,ORFECN397,EC78,333333,FECN397,97,0,0.00,0.00,59035.26-
    0,2017,C001,ORFECN123,EC325,555555,FECN123,58,0,0.00,0.00,6752.99-
    0,2017,C001,ORFECN30,EC78,222222,FECN30,0,5488848.52-,31430468.94-
    
    0,2017,C001,ORFECN397,EC78,333333,FECN397,97,0,0.00,0.00,59035.26-
    0,2017,C001,ORFECN355,EC66,222222,FECN355,355,0,442604.60-
    0,2017,C001,ORFECN123,EC325,555555,FECN123,58,0,0.00,0.00,6752.99-
    0,2017,C001,ORFECN355,EC66,222222,FECN355,355,0,442604.60-
    0,2017,C001,ORFECN123,EC325,555555,FECN123,58,0,0.00,0.00,6752.99-
    0,2017,C001,ORFECN30,EC78,222222,FECN30,0,5488848.52-,31430468.94-
    0,2017,C001,ORFECN15,EC7,111111,FECN15,252.29-,0.00,35904.63-
    
    ------- Conversion table -----------
    
    111111,1100111
    333333,3300188
    444444,4400456
    555555,5500789
    222222,2200100
    

    After this single S/R, we get 17 replacements and the text, below :

    ------- Main financial file -------
    
    0,2017,C001,ORFECN397,EC78,3300188,FECN397,97,0,0.00,0.00,59035.26-
    0,2017,C001,ORFECN15,EC7,1100111,FECN15,252.29-,0.00,35904.63-
    0,2017,C001,ORFECN355,EC66,2200100,FECN355,355,0,442604.60-
    0,2017,C001,ORFECN123,EC325,5500789,FECN123,58,0,0.00,0.00,6752.99-
    0,2017,C001,ORFECN30,EC78,2200100,FECN30,0,5488848.52-,31430468.94-
    
    0,2017,C001,ORFECN123,EC325,5500789,FECN123,58,0,0.00,0.00,6752.99-
    0,2017,C001,ORFECN15,EC7,1100111,FECN15,252.29-,0.00,35904.63-
    0,2017,C001,ORFECN397,EC78,3300188,FECN397,97,0,0.00,0.00,59035.26-
    0,2017,C001,ORFECN123,EC325,5500789,FECN123,58,0,0.00,0.00,6752.99-
    0,2017,C001,ORFECN30,EC78,2200100,FECN30,0,5488848.52-,31430468.94-
    
    0,2017,C001,ORFECN397,EC78,3300188,FECN397,97,0,0.00,0.00,59035.26-
    0,2017,C001,ORFECN355,EC66,2200100,FECN355,355,0,442604.60-
    0,2017,C001,ORFECN123,EC325,5500789,FECN123,58,0,0.00,0.00,6752.99-
    0,2017,C001,ORFECN355,EC66,2200100,FECN355,355,0,442604.60-
    0,2017,C001,ORFECN123,EC325,5500789,FECN123,58,0,0.00,0.00,6752.99-
    0,2017,C001,ORFECN30,EC78,2200100,FECN30,0,5488848.52-,31430468.94-
    0,2017,C001,ORFECN15,EC7,1100111,FECN15,252.29-,0.00,35904.63-
    
    ------- Conversion table -----------
    
    111111,1100111
    333333,3300188
    444444,4400456
    555555,5500789
    222222,2200100
    


  • guy038: It took about 10 minutes to go through the whole file, but it worked! Merci!

    I am most appreciative of you spending the time to figure that out and help me - I hope it did not take you very long.

    The first time I tried it, it did not replace anything. That is because in my “huge” file, the 6-digit numbers are not immediately followed by a comma - they have a few blanks before the comma. But in my simplified example that I provided (and you used), I edited it so all of the lines had a comma before and after each 6-digit account number.

    So, that’s how you wrote the " ,(\d{6})(?s)(?=,.*\1(,\d{7})) " expression. Once I did a find and replace on my huge file, and removed any blanks between the account number and the comma, your expression worked perfectly.

    Is it possible to modify your regex to accommodate a file where the account numbers are immediately followed by one or more blanks, and then a comma?



  • Hi, @perry-sticca and All,

    No problem at all ! I can, even, give you two regexes :-))

    These two regexes, below, detect any six-digit account number, followed by ( consecutive ) space or tabulation character(s), or even none, before a comma

    With the regex ,(\d{6})(?s)(?=\h*,.*\1(,\d{7})), the new seven-digits account number is written but the possible blank characters, after the old account number, are kept in the file

    And with the regex ,(\d{6})\h*(?s)(?=,.*\1(,\d{7})), the new seven-digits account number replaces the old six-digits account number, as well as all possible blank characters, located after it !

    Notes :

    • The syntax \h represents any of the 3 horizontal blank characters : the Space ( \x20 ), the Tabulation ( \x09 ) or the No-Break Space ( \xA0 )

    • The quantifier * stands for 0 or more occurrences of the previous blank character
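
    The difference between the two variants can be seen in a short Python sketch. Python's re module has no \h, so the character class [ \t\xa0] is used as a stand-in, and re.DOTALL replaces (?s); the one-line sample is made up.

```python
import re

H = r"[ \t\xa0]"  # stand-in for \h, which Python's re module does not provide

# Variant 1: the blanks are only checked inside the look-ahead, so they are kept.
keep_blanks = re.compile(r",(\d{6})(?=" + H + r"*,.*\1(,\d{7}))", re.DOTALL)
# Variant 2: the blanks are part of the match, so they are replaced as well.
drop_blanks = re.compile(r",(\d{6})" + H + r"*(?=,.*\1(,\d{7}))", re.DOTALL)

text = "0,2017,C001,ORFECN397,EC78,333333   ,FECN397,97\n\n333333,3300188\n"

kept = keep_blanks.sub(r"\2", text)
dropped = drop_blanks.sub(r"\2", text)
print(kept)
print(dropped)
```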

    Cheers,

    guy038



  • Hi @adam-creason, @scott-sumner and All,

    While solving Perry Sticca’s problem ( see above ), I realized, Adam and Scott, that the repetitive Replace All actions may be avoided, if we switch the location of Adam’s text1 and text2 blocks :-))

    So let’s suppose the text, below :

    text2:
    
    name 0002-
    name 0001-
    name 1000-
    name 0003-
    
    text 1:
    
    name 0001-first value
    name 0002-second value
    name 0003-third value
    name 1000-one thousandth value
    

    Then, the regex :

    SEARCH (name \d{4}-)(?s)(?=\R.*\1((?-s).+))

    REPLACE $0\2

    would change it, after a SINGLE click on the Replace All button, into :

    text2:
    
    name 0002-second value
    name 0001-first value
    name 1000-one thousandth value
    name 0003-third value
    
    text 1:
    
    name 0001-first value
    name 0002-second value
    name 0003-third value
    name 1000-one thousandth value
    

    Magic, isn’t it !


    Notes :

    • The first part, (name \d{4}-), is the regex to search, stored as group 1

    • But ONLY IF the positive look-ahead (?s)(?=\R.*\1((?-s).+)) is true, that is to say, if group 1 is immediately followed by EOL character(s), then by any range of characters, possibly spanning several lines thanks to the (?s) modifier, up to another occurrence of group 1, and finally by the remainder of that line only, thanks to the (?-s) modifier, which is stored as group 2

    In replacement, we rewrite the entire searched expression $0 , followed by the group 2. Note that we could have used the \1\2 replacement regex, instead, for identical results !
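
    The same single-pass idea can be sketched in Python. Two substitutions are assumptions forced by Python's re module: (?:\r\n|\n|\r) replaces \R, and [^\r\n]+ replaces the (?-s).+ of group 2, while (?s:.*) is the only part allowed to cross lines.

```python
import re

EOL = r"(?:\r\n|\n|\r)"  # stand-in for \R
FILL = re.compile(r"(name \d{4}-)(?=" + EOL + r"(?s:.*)\1([^\r\n]+))")

text = (
    "text2:\n"
    "\n"
    "name 0002-\n"
    "name 0001-\n"
    "name 1000-\n"
    "name 0003-\n"
    "\n"
    "text 1:\n"
    "\n"
    "name 0001-first value\n"
    "name 0002-second value\n"
    "name 0003-third value\n"
    "name 1000-one thousandth value\n"
)

# \g<0> is the whole match (Notepad++'s $0); \2 is the value found further on.
filled, count = FILL.subn(r"\g<0>\2", text)
print(count)
print(filled)
```

    A single call to subn is enough here, because each match covers only the bare name ####- string and the look-ahead consumes nothing.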

    Cheers,

    guy038



  • Sorry, but I’m hugely out of my depth and it’s hard to interpret the changes that are being made for the two people’s tasks… this is what I have…

            <game name="Bomberman (USA)">
    			<description>Bomberman (USA)</description>
    			<rom crc="DB9DCF89" md5="0F9C8D3D3099C70368411F6DF9EF49C1" name="No_Intro_Sept_2016/No_Intro_N3.zip/Nintendo%20-%20Nintendo%20Entertainment%20System%2FBomberman%20%28USA%29.zip" sha1="D2BF7BD570430902114F1E3393F1FEB8B1C76E4D" size="16190" />
    			<title_clean>Bomberman</title_clean>
    			<plot>blah blah blah description</plot>
    			<releasedate>11/5/1987</releasedate>
    			<year>1985</year>
    			<genre>Action</genre>
    			<studio>Hudson Soft Company, Ltd.</studio>
    			<nplayers>1</nplayers>
    			<perspective>Top-Down</perspective>
    			<rating>3.4</rating>
    			<ESRB>E - Everyone</ESRB>
    			<videoid>CZ9Pu9Usk5o</videoid>
    			<thegamesdb_id>1040</thegamesdb_id>
    			<gamefaqs_url>http://www.gamefaqs.com/nes/563390-bomberman</gamefaqs_url>
    			<mobygames_url>http://www.mobygames.com/game/nes/bomberman-</mobygames_url>
    			<giantbomb_url>http://www.giantbomb.com/bomberman/3030-20589/</giantbomb_url>
    			<consolegrid_url>http://consolegrid.com/games/98</consolegrid_url>
    			<snapshot1>http://i.imgur.com/qBmuc17.jpg</snapshot1>
    			<fanart1>http://i.imgur.com/PL7xZJD.jpg</fanart1>
    		</game>
    		<game name="Bomber Man II (Japan)">
    			<description>Bomber Man II (Japan)</description>
    			<rom crc="0C401790" md5="E8DD578E17C4326D5E6E9C916B2328A1" name="No_Intro_Sept_2016/No_Intro_N3.zip/Nintendo%20-%20Nintendo%20Entertainment%20System%2FBomber%20Man%20II%20%28Japan%29.zip" sha1="CD665ACEA15A4542A9E4CF16A7CA2CE53C88726D" size="67022" />
    			<title_clean>Bomber Man II</title_clean>
    			<plot>blaaaaaaah</plot>
    			<releasedate>28/2/1993</releasedate>
    			<year>1991</year>
    			<genre>Action</genre>
    			<studio>Hudson Soft Company, Ltd., Hudson Soft USA, Inc.</studio>
    			<nplayers>1-3 VS</nplayers>
    			<perspective>Top-Down</perspective>
    			<rating>4.0</rating>
    			<videoid>7K6Ktv6G_j0</videoid>
    			<thegamesdb_id>1653</thegamesdb_id>
    			<gamefaqs_url>http://www.gamefaqs.com/nes/587150-bomberman-ii</gamefaqs_url>
    			<mobygames_url>http://www.mobygames.com/game/nes/bomberman-ii</mobygames_url>
    			<giantbomb_url>http://www.giantbomb.com/bomberman-ii/3030-5993/</giantbomb_url>
    			<consolegrid_url>http://consolegrid.com/games/7199</consolegrid_url>
    			<boxart1>http://i.imgur.com/iQH8lAk.jpg</boxart1>
    			<snapshot1>http://i.imgur.com/8lyzbhy.jpg</snapshot1>
    			<fanart1>http://i.imgur.com/wXBXYhu.jpg</fanart1>
    			<banner1>http://i.imgur.com/kv37dnC.png</banner1>
    		</game>
    

    I want to remove all non-US licensed games (thousands, spanning almost 30 lists).
    I’ve been folding everything, then batch marking by searching for (Japan)">

    and then removing all bookmarked lines, assuming that would delete everything within the folded brackets, but it’s not turning out that way at all.

    Many or all of them just lose the first (bookmarked) line and keep the rest of the data, which immediately breaks the launcher.

    please help and THANK YOU!



  • @Thomas-Daryl-Phillips-II

    Your task isn’t really related to the earlier 2 tasks–those were trying to replace text somewhere in a document based upon some text somewhere else in the doc. You just want to find text and replace it (removal by replacement with nothing still qualifies).

    Marking/bookmarking text is problematic here because your text spans multiple lines, and as you found, only the first line of a match is bookmarked. Thus you can’t follow up the marking with a delete-bookmarked-lines command.

    So I think a search for the following could do what you want. It may not be the best way to do it, but it gets the job done:

    Find what zone: (?-s)^\s*<game(?=.*\(Japan\))(?s).*?</game>\R

    If this (or ANY posting on the Notepad++ Community site) is useful, don’t reply with a “thanks”, simply up-vote ( click the ^ in the ^ 0 v area on the right ).



  • @Scott-Sumner said:

    (?-s)^\s*<game(?=.*\(Japan\))(?s).*?</game>\R

    I’m sorry to ask this. You already helped me so much… I’ve been banging my head into the wall for days over this!

    But could you please break down and explain how that selected exactly what I needed to be deleted?
    I wish to understand it so I can edit the command to fit similar filtering needs.

    i can see you used <game
    (Japan)
    and </game>

    as the keywords

    could you break down the expressions used step by step?



  • @Thomas-Daryl-Phillips-II

    Sure. I guess that means it worked for you. Okay, step by step I’ll break down the regular expression:

    (?-s): for whatever follows, when a . is used, only allow a match on the current line (a . is a “wildcard” for “any character”)

    ^: from the start of a line

    \s*: match any amount of whitespace (spaces or tabs, and note that \s also matches line-ending characters)

    <game: match <game exactly

    (?=.*\(Japan\)): keep the match going only if the exact text (Japan) occurs later on the same line (the “same line” part is due to the (?-s) from earlier)–note that this is just saying “keep the match alive”, it doesn’t include any “Japan” text in the actual match!

    (?s): switch to saying that for whatever follows, a . is allowed to match any character across line boundaries

    .*?: minimally match any number of characters until what comes next is satisfied–note this is what actually makes “Japan” part of the real match

    </game>: match </game> exactly

    \R: match a line-ending

    I think I hit it all…and like I said earlier, I didn’t analyze the problem to death so I’m sure there are better ways to do it. But as this forum is about Notepad++ and how to get things done with it, and NOT about how to craft the best-ever regular expression, I can let it go… :-)
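
    To experiment with the pieces of this breakdown outside Notepad++, here is a hedged Python sketch. The adaptations are assumptions required by Python's re module: (?m) makes ^ match at every line start, the scoped group (?s:.*?) replaces the mid-pattern (?s), and (?:\r\n|\n|\r) replaces \R. The XML sample is a cut-down, made-up version of the entries above.

```python
import re

# Without re.DOTALL, the .* in the look-ahead stays on one line, like (?-s);
# (?s:.*?) is the only part allowed to cross line boundaries.
JAPAN_GAME = re.compile(
    r"(?m)^\s*<game(?=.*\(Japan\))(?s:.*?)</game>(?:\r\n|\n|\r)"
)

xml = (
    '<game name="Bomberman (USA)">\n'
    '\t<description>Bomberman (USA)</description>\n'
    '</game>\n'
    '<game name="Bomber Man II (Japan)">\n'
    '\t<description>Bomber Man II (Japan)</description>\n'
    '</game>\n'
)

# Replacing each match with nothing deletes the whole (Japan) entry.
cleaned, count = JAPAN_GAME.subn("", xml)
print(count)
print(cleaned)
```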



  • @Scott

    There’s one thing I’m confused about, which is the backslash used in \(Japan\)



  • @Thomas-Daryl-Phillips-II

    You can see that ( and ) are used in a few other places in the regular expression. This is a clue that these characters have special meaning. So…if you have literal ( and ) that you need to match in your text, you need to put them in as \( and \). The \ is an instruction to say “interpret the following symbol literally”.

    I don’t know your exact needs–it could very well be that matching Japan and not the more restrictive (Japan) meets your need. If so, you could change to: (?-s)^\s*<game(?=.*Japan)(?s).*?</game>\R
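
    The effect of the backslashes can be seen in a few lines of Python, whose re module treats ( and ) the same way in this respect. The second title, "Super Japanese Quiz (USA)", is a hypothetical example, added only to show how the looser pattern can match more than intended.

```python
import re

line = '<game name="Bomber Man II (Japan)">'

# Unescaped, the parentheses form a capturing group and match only the word.
loose = re.search(r"(Japan)", line).group(0)
# Escaped, they match the literal ( and ) characters in the text.
strict = re.search(r"\(Japan\)", line).group(0)
# re.escape builds the escaped form from a literal string.
escaped = re.escape("(Japan)")

# Hypothetical title: the bare word matches, the parenthesised form does not.
other = '<game name="Super Japanese Quiz (USA)">'
loose_hits_other = re.search(r"Japan", other) is not None
strict_hits_other = re.search(r"\(Japan\)", other) is not None

print(loose, strict, escaped, loose_hits_other, strict_hits_other)
```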



  • this is everything i need for a HUGE chunk of my project.

    you saved my sanity and i appreciate it!!!

