help replacing

guy038

Hello, @perry-sticca, @scott-sumner and All,

Ah ! I’m quite happy because I found a way to get the job done with a SINGLE mouse click on the Replace All button :-D

But, first, am I right in saying that :

The first part ( the conversion table ) contains only different lines, corresponding to the total number of accounts ?
In the second part ( about 32 MB and 100,000 lines ), each account can be present once, twice or any number of times. On the other hand, an account may also be absent from that second part, although a line with its new account number is present in the first part ?

If we can answer “yes” to these questions, here is my method, in a few steps :

Build a file, containing, FIRST, all the financial data ( the huge part ! ) and SECONDLY your conversion table of all the account numbers, with any separator ( some blank lines, a single dashed line or whatever ! )

REMARK : Neither the financial list nor the conversion table needs to be sorted, in any way

Open this new file, in Notepad++
Move back to the very beginning ( Ctrl + Home )
Open the Find / Replace dialog ( Ctrl + H )
Check, only, the Regular expression search mode
In the Find what: zone, type : ,(\d{6})(?s)(?=,.*\1(,\d{7}))
In the Replace with: zone, type, simply, \2
Click on the Replace All button, one time only

Et voilà !

NOTES :

The beginning ,(\d{6}) looks, in the first huge part, for any six-digit account number, of each line, preceded by a comma
But, ONLY IF, further on, in the last part ( the conversion table ), the same six-digit number can be found, followed by a comma and the new seven-digit account number, which is stored as group 2 ( (?s)(?=,.*\1(,\d{7})) )
If so, the comma and the six-digit account number are simply replaced by group 2 ( the comma followed by the new seven-digit account number ! )
The process continues from line to line, till it reaches the conversion table, where the replacement process stops, because it’s impossible to find, further on, another line of the form “Old_account_number,New_account_number”
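
The mechanism described in these notes can be tried outside Notepad++ with a minimal Python sketch. It is only an illustration, under a few assumptions: the sample text is a shortened, made-up variant of the data, the account 999999 is hypothetical and deliberately missing from the table, and Python's re module is used with the re.DOTALL flag in place of the inline (?s), since Python expects global inline flags at the start of the pattern.

```python
import re

# Shortened sample: financial lines first, conversion table last.
# Account 999999 is a hypothetical account with no entry in the table.
text = (
    "0,2017,C001,ORFECN397,EC78,333333,FECN397,97,0,0.00,0.00,59035.26-\n"
    "0,2017,C001,ORFECN15,EC7,111111,FECN15,252.29-,0.00,35904.63-\n"
    "0,2017,C001,ORFECN999,EC9,999999,FECN999,1,0,12.00-\n"
    "------- Conversion table -----------\n"
    "111111,1100111\n"
    "333333,3300188\n"
)

# Same idea as the Notepad++ regex: match ",<6 digits>" only if the same
# six digits appear further on, followed by ",<7 digits>" (group 2).
# re.DOTALL lets ".*" in the look-ahead run across line breaks.
pattern = re.compile(r",(\d{6})(?=,.*\1(,\d{7}))", re.DOTALL)

# Replace the comma and old number with group 2 (comma and new number).
result, count = pattern.subn(r"\2", text)

print(count)
print(result)
```

In this sample, only the two accounts that have a line in the conversion table are rewritten. The line with 999999 and the table itself are left as they are.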

IMPORTANT :

If your main file is too large for Notepad++, you may split it into several files !

Just one condition : add the entire conversion table, at the end of each of them and perform the regex S/R, above, on EACH file.

Finally, just remove the conversion table part from each file and merge all these files
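
If the splitting has to be done many times, it could be scripted. The following Python sketch shows one possible way, under assumptions: the function name split_with_table, the chunk_size parameter and the dashed separator are all hypothetical, and the lines are supposed to be already loaded in memory as lists of strings.

```python
def split_with_table(data_lines, table_lines, chunk_size, separator="-------"):
    """Cut data_lines into chunks and append the full table to each chunk.

    All names here are illustrative; chunk_size is the number of
    financial lines wanted per piece.
    """
    chunks = []
    for start in range(0, len(data_lines), chunk_size):
        chunk = data_lines[start:start + chunk_size]
        # Every piece gets its own copy of the whole conversion table,
        # so the regex S/R can be run on each piece independently.
        chunks.append(chunk + [separator] + table_lines)
    return chunks


pieces = split_with_table(["a", "b", "c"], ["t1", "t2"], chunk_size=2)
print(pieces)
```

Each piece can then be saved to its own file, processed with the regex S/R, and trimmed of its table part before merging.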

Best Regards,

guy038

P.S. :

So, given the original text :

------- Main financial file -------

0,2017,C001,ORFECN397,EC78,333333,FECN397,97,0,0.00,0.00,59035.26-
0,2017,C001,ORFECN15,EC7,111111,FECN15,252.29-,0.00,35904.63-
0,2017,C001,ORFECN355,EC66,222222,FECN355,355,0,442604.60-
0,2017,C001,ORFECN123,EC325,555555,FECN123,58,0,0.00,0.00,6752.99-
0,2017,C001,ORFECN30,EC78,222222,FECN30,0,5488848.52-,31430468.94-

0,2017,C001,ORFECN123,EC325,555555,FECN123,58,0,0.00,0.00,6752.99-
0,2017,C001,ORFECN15,EC7,111111,FECN15,252.29-,0.00,35904.63-
0,2017,C001,ORFECN397,EC78,333333,FECN397,97,0,0.00,0.00,59035.26-
0,2017,C001,ORFECN123,EC325,555555,FECN123,58,0,0.00,0.00,6752.99-
0,2017,C001,ORFECN30,EC78,222222,FECN30,0,5488848.52-,31430468.94-

0,2017,C001,ORFECN397,EC78,333333,FECN397,97,0,0.00,0.00,59035.26-
0,2017,C001,ORFECN355,EC66,222222,FECN355,355,0,442604.60-
0,2017,C001,ORFECN123,EC325,555555,FECN123,58,0,0.00,0.00,6752.99-
0,2017,C001,ORFECN355,EC66,222222,FECN355,355,0,442604.60-
0,2017,C001,ORFECN123,EC325,555555,FECN123,58,0,0.00,0.00,6752.99-
0,2017,C001,ORFECN30,EC78,222222,FECN30,0,5488848.52-,31430468.94-
0,2017,C001,ORFECN15,EC7,111111,FECN15,252.29-,0.00,35904.63-

------- Conversion table -----------

111111,1100111
333333,3300188
444444,4400456
555555,5500789
222222,2200100

After this single S/R, we get 17 replacements and the text, below :

------- Main financial file -------

0,2017,C001,ORFECN397,EC78,3300188,FECN397,97,0,0.00,0.00,59035.26-
0,2017,C001,ORFECN15,EC7,1100111,FECN15,252.29-,0.00,35904.63-
0,2017,C001,ORFECN355,EC66,2200100,FECN355,355,0,442604.60-
0,2017,C001,ORFECN123,EC325,5500789,FECN123,58,0,0.00,0.00,6752.99-
0,2017,C001,ORFECN30,EC78,2200100,FECN30,0,5488848.52-,31430468.94-

0,2017,C001,ORFECN123,EC325,5500789,FECN123,58,0,0.00,0.00,6752.99-
0,2017,C001,ORFECN15,EC7,1100111,FECN15,252.29-,0.00,35904.63-
0,2017,C001,ORFECN397,EC78,3300188,FECN397,97,0,0.00,0.00,59035.26-
0,2017,C001,ORFECN123,EC325,5500789,FECN123,58,0,0.00,0.00,6752.99-
0,2017,C001,ORFECN30,EC78,2200100,FECN30,0,5488848.52-,31430468.94-

0,2017,C001,ORFECN397,EC78,3300188,FECN397,97,0,0.00,0.00,59035.26-
0,2017,C001,ORFECN355,EC66,2200100,FECN355,355,0,442604.60-
0,2017,C001,ORFECN123,EC325,5500789,FECN123,58,0,0.00,0.00,6752.99-
0,2017,C001,ORFECN355,EC66,2200100,FECN355,355,0,442604.60-
0,2017,C001,ORFECN123,EC325,5500789,FECN123,58,0,0.00,0.00,6752.99-
0,2017,C001,ORFECN30,EC78,2200100,FECN30,0,5488848.52-,31430468.94-
0,2017,C001,ORFECN15,EC7,1100111,FECN15,252.29-,0.00,35904.63-

------- Conversion table -----------

111111,1100111
333333,3300188
444444,4400456
555555,5500789
222222,2200100

Perry Sticca

guy038: It took about 10 minutes to go through the whole file, but it worked! Merci!

I am most appreciative of you spending the time to figure that out and help me - I hope it did not take you very long.

The first time I tried it, it did not replace anything. That is because in my “huge” file, the 6-digit numbers are not immediately followed by a comma - they have a few blanks before the comma. But in my simplified example that I provided (and you used), I edited it so all of the lines had a comma before and after each 6-digit account number.

So, that’s how you wrote the " ,(\d{6})(?s)(?=,.*\1(,\d{7})) " expression. Once I did a find and replace on my huge file, and removed any blanks between the account number and the comma, your expression worked perfectly.

Is it possible to modify your regex to accommodate a file where the account numbers are immediately followed by one or more blanks, and then a comma?

guy038

Hi, @perry-sticca and All,

No problem at all ! I can, even, give you two regexes :-))

These two regexes, below, detect any six-digit account number, followed by ( consecutive ) space or tabulation character(s), even none, before a comma

With the regex ,(\d{6})(?s)(?=\h*,.*\1(,\d{7})), the new seven-digit account number is written but the possible blank characters, after the old account number, are kept in the file

And with the regex ,(\d{6})\h*(?s)(?=,.*\1(,\d{7})), the new seven-digit account number replaces the old six-digit account number, as well as all possible blank characters, located after it !

Notes :

The syntax \h represents any of the 3 horizontal blank characters : the Space ( \x20 ), the Tabulation ( \x09 ) or the No-Break Space ( \xA0 )
The quantifier * stands for 0 or more occurrences of the previous blank character
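
To see the difference between these two regexes, here is a small Python sketch. It rests on assumptions: Python's re module does not know \h, so the character class [ \t\xa0] is used as a stand-in, re.DOTALL replaces the inline (?s), and the one-line sample is made up.

```python
import re

# Stand-in for \h, which Python's re module does not support:
# space, tabulation and no-break space.
H = r"[ \t\xa0]"

# Made-up sample: the old account number is followed by three spaces.
text = "0,2017,EC7,111111   ,FECN15\n-------\n111111,1100111\n"

# Blanks are inside the look-ahead: they are tested but not consumed.
keep_blanks = re.compile(r",(\d{6})(?=" + H + r"*,.*\1(,\d{7}))", re.DOTALL)

# Blanks are part of the match: they are replaced along with the number.
drop_blanks = re.compile(r",(\d{6})" + H + r"*(?=,.*\1(,\d{7}))", re.DOTALL)

kept = keep_blanks.sub(r"\2", text)
dropped = drop_blanks.sub(r"\2", text)

print(kept.splitlines()[0])
print(dropped.splitlines()[0])
```

With the first pattern, the three spaces stay after the new number. With the second one, they disappear.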

Cheers,

guy038

guy038

Hi @adam-creason, @scott-sumner and All,

As I solved the Perry Sticca problem ( see above ), I realized, Adam and Scott, that the repetitive Replace All actions may be avoided, if we switch the location of Adam’s text1 and text2 blocks :-))

So let’s suppose the text, below :

text2:

name 0002-
name 0001-
name 1000-
name 0003-

text 1:

name 0001-first value
name 0002-second value
name 0003-third value
name 1000-one thousandth value

Then, the regex :

SEARCH (name \d{4}-)(?s)(?=\R.*\1((?-s).+))

REPLACE $0\2

would change it, after a SINGLE click on the Replace All button, into :

text2:

name 0002-second value
name 0001-first value
name 1000-one thousandth value
name 0003-third value

text 1:

name 0001-first value
name 0002-second value
name 0003-third value
name 1000-one thousandth value

Magic, isn’t it !

Notes :

The first part, (name \d{4}-), is the regex to search, stored as group 1
But, ONLY IF the positive look-ahead, (?s)(?=\R.*\1((?-s).+)) is true. That is to say, if group 1 is immediately followed by EOL character(s), then by any range of any characters, due to the (?s) modifier, till another occurrence of group 1, followed by the remainder of that line only, due to the (?-s) modifier, which is stored as group 2

In replacement, we rewrite the entire searched expression $0 , followed by the group 2. Note that we could have used the \1\2 replacement regex, instead, for identical results !
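
For those who prefer to check this behaviour in a script, here is a minimal Python sketch of the same idea. It is an adaptation, not the exact Notepad++ regex: Python's re module has no \R and does not allow (?s) or (?-s) in the middle of a pattern, so the sketch assumes \n line endings, writes the line break as (?:\r\n|\n|\r) and uses the scoped form (?s:.*) for the part that must cross lines.

```python
import re

text = (
    "text2:\n\n"
    "name 0002-\n"
    "name 0001-\n"
    "name 1000-\n"
    "name 0003-\n\n"
    "text 1:\n\n"
    "name 0001-first value\n"
    "name 0002-second value\n"
    "name 0003-third value\n"
    "name 1000-one thousandth value\n"
)

# Group 1: the key. The look-ahead requires a line break right after it,
# then anything across lines, then the same key again, and captures the
# rest of that later line as group 2 ("." stops at "\n" there).
pattern = re.compile(r"(name \d{4}-)(?=(?:\r\n|\n|\r)(?s:.*)\1(.+))")

# \g<0> is the whole match (the $0 of Notepad++), followed by group 2.
result, count = pattern.subn(r"\g<0>\2", text)

print(count)
print(result)
```

Only the four lines of the text2 block are completed, because, in the text 1 block, each key is followed by its value and not by a line break.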

Cheers,

guy038

Thomas Daryl Phillips II

sorry but i'm hugely out of my depth and it's hard to interpret the changes that are being made for the 2 people's tasks… this is what i have…

        <game name="Bomberman (USA)">
			<description>Bomberman (USA)</description>
			<rom crc="DB9DCF89" md5="0F9C8D3D3099C70368411F6DF9EF49C1" name="No_Intro_Sept_2016/No_Intro_N3.zip/Nintendo%20-%20Nintendo%20Entertainment%20System%2FBomberman%20%28USA%29.zip" sha1="D2BF7BD570430902114F1E3393F1FEB8B1C76E4D" size="16190" />
			<title_clean>Bomberman</title_clean>
			<plot>blah blah blah description</plot>
			<releasedate>11/5/1987</releasedate>
			<year>1985</year>
			<genre>Action</genre>
			<studio>Hudson Soft Company, Ltd.</studio>
			<nplayers>1</nplayers>
			<perspective>Top-Down</perspective>
			<rating>3.4</rating>
			<ESRB>E - Everyone</ESRB>
			<videoid>CZ9Pu9Usk5o</videoid>
			<thegamesdb_id>1040</thegamesdb_id>
			<gamefaqs_url>http://www.gamefaqs.com/nes/563390-bomberman</gamefaqs_url>
			<mobygames_url>http://www.mobygames.com/game/nes/bomberman-</mobygames_url>
			<giantbomb_url>http://www.giantbomb.com/bomberman/3030-20589/</giantbomb_url>
			<consolegrid_url>http://consolegrid.com/games/98</consolegrid_url>
			<snapshot1>http://i.imgur.com/qBmuc17.jpg</snapshot1>
			<fanart1>http://i.imgur.com/PL7xZJD.jpg</fanart1>
		</game>
		<game name="Bomber Man II (Japan)">
			<description>Bomber Man II (Japan)</description>
			<rom crc="0C401790" md5="E8DD578E17C4326D5E6E9C916B2328A1" name="No_Intro_Sept_2016/No_Intro_N3.zip/Nintendo%20-%20Nintendo%20Entertainment%20System%2FBomber%20Man%20II%20%28Japan%29.zip" sha1="CD665ACEA15A4542A9E4CF16A7CA2CE53C88726D" size="67022" />
			<title_clean>Bomber Man II</title_clean>
			<plot>blaaaaaaah</plot>
			<releasedate>28/2/1993</releasedate>
			<year>1991</year>
			<genre>Action</genre>
			<studio>Hudson Soft Company, Ltd., Hudson Soft USA, Inc.</studio>
			<nplayers>1-3 VS</nplayers>
			<perspective>Top-Down</perspective>
			<rating>4.0</rating>
			<videoid>7K6Ktv6G_j0</videoid>
			<thegamesdb_id>1653</thegamesdb_id>
			<gamefaqs_url>http://www.gamefaqs.com/nes/587150-bomberman-ii</gamefaqs_url>
			<mobygames_url>http://www.mobygames.com/game/nes/bomberman-ii</mobygames_url>
			<giantbomb_url>http://www.giantbomb.com/bomberman-ii/3030-5993/</giantbomb_url>
			<consolegrid_url>http://consolegrid.com/games/7199</consolegrid_url>
			<boxart1>http://i.imgur.com/iQH8lAk.jpg</boxart1>
			<snapshot1>http://i.imgur.com/8lyzbhy.jpg</snapshot1>
			<fanart1>http://i.imgur.com/wXBXYhu.jpg</fanart1>
			<banner1>http://i.imgur.com/kv37dnC.png</banner1>
		</game>

i want to remove all non US licensed games(thousands spanning almost 30 lists)
ive been folding all then batch marking by searching (Japan)">

then either removing all bookmarked lines assuming they will delete all within folded brackets but its not turning out this way at all.

many or all are just removing the first line/bookmarked line and leaving rest of data which immediately breaks the launcher.

please help and THANK YOU!

Scott Sumner

@Thomas-Daryl-Phillips-II

Your task isn’t really related to the earlier 2 tasks–those were trying to replace text somewhere in a document based upon some text somewhere else in the doc. You just want to find text and replace it (removal by replacement with nothing still qualifies).

Marking/bookmarking text is problematic here because your text spans multiple lines, and as you found, only the first line of a match is bookmarked. Thus you can’t follow up the marking with a delete-bookmarked-lines command.

So I think a search for the following could do what you want. It may not be the best way to do it, but it gets the job done:

Find what zone: (?-s)^\s*<game(?=.*\(Japan\))(?s).*?</game>\R

If this (or ANY posting on the Notepad++ Community site) is useful, don’t reply with a “thanks”, simply up-vote ( click the ^ in the ^ 0 v area on the right ).

Thomas Daryl Phillips II

@Scott-Sumner said:

(?-s)^\s*<game(?=.*\(Japan\))(?s).*?</game>\R

im sorry to ask this. you already helped me so much… ive been banging my head into the wall for days over this!

but could you please break down and explain how that selected exactly what i needed to be deleted?
i wish to understand it so i can edit the command to fit similar filtering needs.

i can see you used <game
(Japan)
and </game>

as the keywords

could you break down the expressions used step by step?

Scott Sumner

@Thomas-Daryl-Phillips-II

Sure. I guess that means it worked for you. Okay, step by step I’ll break down the regular expression:

(?-s): for whatever follows, when a . is used, only allow a match on the current line (a . is a “wildcard” for “any character”)

^: from the start of a line

\s*: match any amount of whitespace (spaces or tabs, and line-endings too, since \s also covers those)

<game: match <game exactly

(?=.*\(Japan\)): keep the match going only if the exact text (Japan) occurs later on the same line (the “same line” part is due to the (?-s) from earlier)–note that this is just saying “keep the match alive”, it doesn’t include any “Japan” text in the actual match!

(?s): switch to saying that for whatever follows, a . is allowed to match any character across line boundaries

.*?: minimally match any number of characters until what comes next is satisfied–note this is what actually makes “Japan” part of the real match

</game>: match </game> exactly

\R: match a line-ending

I think I hit it all…and like I said earlier, I didn’t analyze the problem to death so I’m sure there are better ways to do it. But as this forum is about Notepad++ and how to get things done with it, and NOT about how to craft the best-ever regular expression, I can let it go… :-)
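
For anyone who wants to try the breakdown above outside Notepad++, here is a rough Python sketch. It is an adaptation under assumptions, not the same regex: Python's re module needs re.MULTILINE for ^ to match at each line start, has no \R (written here as (?:\r\n|\n|\r)), and needs the scoped form (?s:.*?) instead of a mid-pattern (?s). The two-game sample is a cut-down, made-up version of the posted XML.

```python
import re

# Cut-down sample with one USA block and one Japan block.
xml = (
    '    <game name="Bomberman (USA)">\n'
    '        <description>Bomberman (USA)</description>\n'
    '    </game>\n'
    '    <game name="Bomber Man II (Japan)">\n'
    '        <description>Bomber Man II (Japan)</description>\n'
    '    </game>\n'
)

pattern = re.compile(
    r"^\s*<game"            # start of a line, optional whitespace, <game
    r"(?=.*\(Japan\))"      # only if (Japan) occurs later on the same line
    r"(?s:.*?)</game>"      # shortest run, across lines, up to </game>
    r"(?:\r\n|\n|\r)",      # one line ending
    re.MULTILINE,
)

# Replacing with nothing removes each matching block entirely.
result = pattern.sub("", xml)
print(result)
```

Only the block whose opening line contains (Japan) is removed. The USA block stays complete, closing tag included.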

Thomas Daryl Phillips II

@Scott


there's one thing i'm confused about which is the backslash used in \(Japan\)

Scott Sumner

@Thomas-Daryl-Phillips-II

You can see that ( and ) are used in a few other places in the regular expression. This is a clue that these characters have special meaning. So…if you have literal ( and ) that you need to match in your text, you need to put them in as \( and \). The \ is an instruction to say “interpret the following symbol literally”.

I don’t know your exact needs–it could very well be that matching Japan and not the more restrictive (Japan) meets your need. If so, you could change to: (?-s)^\s*<game(?=.*Japan)(?s).*?</game>\R
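
The effect of the backslash can also be seen in a few lines of Python. This is only a sketch with Python's re module, which treats ( and ) the same way for this purpose. The sample strings are made up.

```python
import re

# Escaped parentheses match literal ( and ) characters in the text.
literal = re.search(r"\(Japan\)", 'name="Bomber Man II (Japan)"')

# Unescaped parentheses form a group: only the word Japan is matched.
grouped = re.search(r"(Japan)", "Made in Japan")

# re.escape adds the backslashes for you.
escaped = re.escape("(Japan)")

print(literal.group(0))
print(grouped.group(0))
print(escaped)
```

So the escaped form is the more restrictive one: it needs the parentheses to be present in the text, while the unescaped form matches the word Japan anywhere.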

Thomas Daryl Phillips II

this is everything i need for a HUGE chunk of my project.

you saved my sanity and i appreciate it!!!