Replace certain text in lines with each line another file.

guy038

Hello, @gideon-addai, @scott-sumner, @terry-r and All,

I think I found out a solution, which needs ONE click, only, of the Replace All button :-)) So ,

Copy/paste your data file in N++ new 1 tab
Again, copy and paste your data file in new 2 tab
Let’s suppose we just have this part of your file, below, as sample :

str/4
<</Contents(100 cups)/(Date)
Colour red
<</Contents(080 bowls)/(Date)
Status used
Pack team
<</Contents(200 John)/(Date)
School house

Then, in new 2 tab, open the Replace dialog ( Ctrl + H )
Select the Regular expression search mode
Tick the Wrap around option

SEARCH (?-si)^<</Contents$(.+)$.+|.+\R

REPLACE ?1\1

Click on the Replace All button

You should get, in new 2 tab, the list of all the strings located between the strings <</Contents( and )/(Date)

100 cups
080 bowls
200 John

Now, in new 1 tab, after a separation line ( I personally chose a line of tildes ), append all the new 2 contents
Then, copy/paste your list of new strings, in new 3 tab
And append all new 3 contents at the very end of new 1 tab, after, again, a separation line, with tildes

You should see, in new 1 tab, the following text :

str/4
<</Contents(100 cups)/(Date)
Colour red
<</Contents(080 bowls)/(Date)
Status used
Pack team
<</Contents(200 John)/(Date)
School house
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
100 cups
080 bowls
200 John
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tree house
Colon format
Same variable

Finally, open, again the Replace dialog and perform, in the new 1 tab, the regex S/R, below :

SEARCH (?-s)(?<=Contents$)(.+?)(?=$/(?:(?s).+)\1\R(?:.+\R){3}(.+))|(?s)^~~~.+

REPLACE ?2\2

After a single click, on the Replace All button ( Do not use the Replace button ! ), you should get your expected text :

str/4
<</Contents(Tree house)/(Date)
Colour red
<</Contents(Colon format)/(Date)
Status used
Pack team
<</Contents(Same variable)/(Date)
School house

Voila !

Notes :

First, note the parentheses symbols, ( and ), must be escaped with the \ symbol to be considered as literals !
As usual, at beginning, the (?-s) modifier means that the dot . character will match any single standard character
The main search , (.+?), stands for the shortest non-null range of characters, stored as group 1 for later use, but ONLY IF two conditions are true :
- It must be preceded with the string Contents$, due to the the look-behind (?<=Contents\()
- It must match the regex $/(?:(?s).+)\1\R(?:.+\R){3}(.+), due to the look-ahead (?=\)/(?:(?s).+)\1\R(?:.+\R){3}(.+))
  - The string )/, due to the regex part \)/
  - The longest multi-lines non-null range of characters, due to the non-capturing group (?:(?s).+)
  - Till the group 1 contents, ( your initial string between parentheses ), with its end of line, due to the syntax \1\R
  - Then followed with 3 complete lines, with the non-capturing group (?:.+\R){3}. Of course, if your list of strings, to modify, contains, let’s say, 25 values, just write (?:.+\R){25}, instead !
  - And, finally, followed with the corresponding new string, which have to be placed between parentheses, stored as group 2, due to the (.+) regex
Now, after all the initial strings are read, the regex engine tries the second alternative of the search regex, after the | symbol, that is to say the regex (?s)^~~~.+, which grabs the longest range of any character, from the first separation line till the very end of the file, due to the (?s) modifier which means that dot . matches any single character ( standard and EOL one )
In replacement :
- If the first alternative of the search regex matched, the group 2 exists and the initial string is changed into the new corresponding string, due to the conditional replacement ?2\2
- If the second alternative matched, all text, beginning at the first separation line, is deleted, as group 2 does not exist !

Again, remember that the number 3, between braces, near the end of the search regex, must be changed with the exact number of “strings” to modify, in your text

Cheers,

guy038

Scott Sumner

So @guy038’s solution is a good one, and I like that when it is finished executing, there is no “clean-up” of extraneous data, as well as it requires only one Replace All action. It only has 2 minor drawbacks: it requires more data preparation, and it requires a count to be embedded (reference the {3} part).

I’ll propose the following solution (drawback: requires multiple Replace All actions); prepare data as follows, all in one file (put lines of replacement data at the top, followed by a line of 10 dashes, followed by the main file data):

Tree house
Colon format
Same variable
----------
str/4
<</Contents(100 cups)/(Date)
Colour red
<</Contents(080 bowls)/(Date)
Status used
Pack team
<</Contents(200 John)/(Date)
School house

Do the following:

Invoke Replace dialog (default key: ctrl+h)
Find what zone: (?-i)(?:(?<group1>(?-s).+)(?<group2>(?s).+?-{10}.+?Contents\()\d+ \w+)|(?:\R+-{10}\R)
Replace with zone: $+{group2}$+{group1}
Wrap around checkbox: ticked
Search mode selection: Regular expression
Action: Press Replace All button repeatedly (or hold down it’s Alt+a accelerator) until you see Replace All: 0 occurrences were replaced.

Note: I used named groups (“group1”, “group2”) here just for an added twist on regex…I didn’t come up with great names for the groups, however!

Perhaps it is interesting to actually see the steps the replacement goes through to transform the data…it’s short…so:

First match (top) and post-first-match replacement (bottom):

Imgur

Note: “group1” is the initial yellow color in the top part, “group2” is the brown color, “group3” is the final yellow color in the top part

Second match (top) and post-second-match replacement (bottom):

Imgur

Third match (top) and post-third-match replacement (bottom):

Imgur

Fourth match (top) and post-fourth-match replacement (bottom) – this is merely a “clean-up” step:

Imgur

guy038

Hello, @gideon-addai, @scott-sumner, @terry-r and All,

Ah, Scott, I was about to post when I just saw your reply. Interesting use of named groups, both in search and replacement ! I agree that my regex needs more data preparation ! Regarding the second drawback, you’re teasing me aren’t you :-)) Not very difficult to get the line number of the last line of a file, in the status bar !

So, here is, below, a second version of the regex S/R, of my previous post, where I, simply, change the logic of the modifiers :

SEARCH (?s)(?<=Contents$)(.+?)(?=$/.+\1\R(?:[^\r\n]+\R){3}([^\r\n]+))|^---.+

REPLACE ?2\2

Notes :

Now, the (?s) modifier is set, by default. That is to say that the dot . can match, absolutely, any single character
In that case, of course :
- The regex (?-s).+\R must be changed as (?s)[^\r\n]+\R
- The regex (?-s).+ must be changed as (?s)[^\r\n]+

BTW, I found out, in the list of regex tools, referenced on the www.rexegg.com site, an amazing on-line tool, at :

https://www.debuggex.com/

Choose the PCRE flavor, instead of the default JavaScript one
Then, in the Regular Expression editor area, paste your regex, to test
And, in the Text String Editor area, paste your test data

For instance, see my second regex version, above, at the address, below :

https://www.debuggex.com/r/LUBgLokvpxHUJnrv

Astonishing, isn’t it ?

IMPORTANT :

Before pasting your regex, in Debuggex site :
Change any syntax \R into \n
Change any syntax \r into \n
Change any syntax \r\n into \n

Hence, paste the regex (?s)(?<=Contents$)(.+?)(?=$/.+\1\n(?:[^\r\n]+\n){3}([^\r\n]+))|^---.+

I’ll probably test some regex tools and, soon, I’ll add some links to the main tools, in the FAQ Desk post, relative to regex documentation :-))

Best Regards,

guy038

Gideon Addai

Thanks to you guys for all the helpful comments. @Terry R, the entries am changing are many, a thousand maybe. There is this other method I tried that worked just fine. Though it kinda included quiet a handful of steps.

It mainly involves using the Alt+Shift selection method and TextFX plugin. The aim is to try and get all lines with "<</Contents(…)/(Date) following each other, replace everything within “Contents(” and “)” with a character, and replace that character with text from second file and sort everything in it original order.

Insert the line number for each line: This will lead us back to the original order.

Select the first character in each line using Alt+Shift and arrow keys.

Select TextFX>>TextFX Tools>>T:Insert line numbers. The result looked like this;

                00000001 str/4
                00000002 <</Contents(100 cups)/(Date)
                00000003 Colour red
                00000004 <</Contents(080 bowls)/(Date)
                00000005 Status used
                00000006 Pack team
                00000007 <</Contents(200 John)/(Date)
                00000008 School house

Sort original data in ascending order: This will get the lines <</Contents(…)/(Date) following each other
3. Repeat step 1 for the orginal data (not the inserted line numbers)
4. Select TextFX>>TextFX Tools>>T:Sort lines case insensitive (at column). The result looked like this

                  00000004 <</Contents(080 bowls)/(Date)
                  00000002 <</Contents(100 cups)/(Date)
                  00000007 <</Contents(200 John)/(Date)
                  00000003 Colour red
                  00000006 Pack team
                  00000008 School house
                  00000005 Status used
                  00000001 str/4

Search and Replace everything between “Contents(” and “)” with a character (0 maybe): This will help us easily replace text from the second file (ie. the content of “Contents(” and “)” don’t contain the same number of characters)
5. Press Ctrl+H
6. Search (nts$.*?$/)} and replace with nts$0$/ with regular expression checked (might be a better regex to do this though). The result looked like this

                00000004 <</Contents(0)/(Date)
                00000002 <</Contents(0)/(Date)
                00000007 <</Contents(0)/(Date)
                00000003 Colour red
                00000006 Pack team
                00000008 School house
                00000005 Status used
                00000001 str/4

Replace all 0 with text from second file
7. Use Alt+Shift+arrow keys, to select all columns in the second file (Don’t use Ctrl+a) and copy.
8. Select all 0 between “Contents(” and “)” like you did in step 1 and 3
9. Press Ctrl+v to paste all-column-copied text from second file. The result looked like this

               00000004 <</Contents(Tree house)/(Date)
               00000002 <</Contents(Colon format)/(Date)
               00000007 <</Contents(Same variable)/(Date)
               00000003 Colour red
               00000006 Pack team
               00000008 School house
               00000005 Status used
               00000001 str/4

Select and sort the numbers from step 2 in ascending order: This will restore the original order
10. Using the same Alt+Shift+arrow keys approach, select all the beginning numbers
11. Select TextFX>>TextFX Tools>>T:Sort lines case insensitive (at column). The result looked like this;

                 00000001 str/4
                 00000002 <</Contents(Colon format)/(Date)
                 00000003 Colour red
                 00000004 <</Contents(Tree house)/(Date)
                 00000005 Status used
                 00000006 Pack team
                 00000007 <</Contents(Same variable)/(Date)
                 00000008 School house

Finally delete all beginning numbers and space: This will restore the orginal format
12. Use the same Alt+Shift+arrow keys to select all number and space columns and delete

It involves a number of steps like I said and probably might not be the best approach, but I hope the idea would help someone.

Terry R

@Gideon-Addai , see what you’ve done to the team @Notepad++ community (@Scott-Sumner, @guy038)! Your problem, on top of the one I had previously provided a solution for has really set the team off. We’re all trying to outdo each other now with the BEST regex to solve this problem.;-D

Of course as we all realize, Notepad++/regex was never designed for such a task (without plugins or lots of preperation). And whilst we can make it do it (in a round about way), in all seriousness I think a script, be it Python or something else would do the job much easier. But therein lies the rub. To create a script requires an even greater learning curve. Not many are willing to take that step, if at all they can remain within the “REGEX” world!

Terry

PS just wait @guy038 and @scott-sumner, I’ve got the gloves off now. An idea is percolating within the grey matter. It’s a bit far fetched but I’ll follow it as far as it will take me.

Terry R

Another solution

First tab has the original file
Second tab has the replacement strings, 1 per line
For each of the tabs (1 & 2) select the left most position in line 1. Select Edit, Column editor. Select text to add and type in a space, then again select column editor and for the first tab select number using 1, increase by 2 (leading zeros ticked). For the second tab select 2, increase by 2 (leading zeros ticked). Both tabs have increasing numbers seperated by a space from the original text on each line.
For the first tab, Search, Mark. Tick “Bookmark line” and type in the character string that determines it is a line needing (partial) replacement.
Create a new tab (third tab) and cut the bookmarked lines to it, Search, Bookmark, Cut bookmarked lines. Select Tab #3 and paste.
Repeat step 3 on tab #3, adding first a space, then a number, again start at 1, increase by 2 (leading zeros ticked).
Copy the contents of tab #2 to the bottom of tab#3.
Edit, Line Operations, Sort the lines as integer ascending. Add a blank line at bottom.
It is a simple matter to merge the 2nd line of each pair into position in the first line, and removing the first number on the first line of the pair.
Copy the contents of tab #3 back to tab #1.
Edit, Line Operations, Sort the lines as integer ascending.
Remove the numbers and space at the start of each line.

For the example provided the regex’s are:
#9 Find what: \d+\h(\d+\h)(.+?Contents$).+?($.+?\R)\d+\h(.+)\R
Replace with: \1\2\4\3
#12 Find what: \d+\h
Replace with: empty field here

So is it any better than @guy038 solution? I think it is marginally. First off it does not require finding out the number (although to be fair as @guy038 said, just look at the number of the last line). It does look a bit less complex as no need to have the original and replacement text “linked” so number of lines between remains constant, see “You should see, in new 1 tab, the following text :” and the image.
The main regex is a lot neater, less complex. The 2nd regex is very basic.
It also does not require any PlugIn as per OP’s own solution using TextFX, so more portable. And the OP’s solution also had 12 steps, @guy038 had 11.

This was not my original idea as per previous post, but I happened to think of this along the way. So I may (or NOT) yet provide another alternative.

Not yet tested is how to redesign this when the original text to be replaced covers multiple lines!

Terry

guy038

Hello, @gideon-addai, @scott-sumner, @terry-r and All,

@gideon-addai, your method, involving the TextFX plugin just needs a lot of manipulations. But if you are just used to TextFX, why not !

@terry-r, I think that your regex, at step #9, in your last post, could be shortened a bit, as below :

SEARCH (?-s)\d+\h(.+?$).+?($.+)\R\d+\h(.+)

REPLACE \1\3\2

So, from the text, in tab #3 :

1 03 <</Contents(100 cups)/(Date)
2 Tree house
3 07 <</Contents(080 bowls)/(Date)
4 Colon format
5 13 <</Contents(200 John)/(Date)
6 Same variable

It would give :

03 <</Contents(Tree house)/(Date)
07 <</Contents(Colon format)/(Date)
13 <</Contents(Same variable)/(Date)

Best Regards,

guy038

Scott Sumner

@Terry-R said:

We’re all trying to outdo each other now with the BEST regex to solve this problem

No…no…no…please let’s NOT do this here. Maybe on a forum strictly dedicated to regex, but not here.

For the record, I don’t post to present a “better” regex to solve a problem that has already been solved. I might, however, post to show a new twist on something…in this case the technique to hold down a keycombo to repeatedly run a Replace All…often a very effective technique that makes writing the necessary regex faster and less complicated. It’s not a brand-new twist (running repeated Replace Alls), but it doesn’t get discussed a huge amount here.

Also, in this case, I had my solution mostly composed already when @guy038 submitted his post first…it would have been disheartening to hit Discard on so much text. :-)

I don’t think there’s a lot of value in a “best” regex for typical usage by Notepad++ users. Most simply want a solution to their problem so they can move on. So there is little value in presenting something to a solved problem that shaves a few characters off a regex…unless there is a key point also being made (which should definitely be elaborated on with additional text!). This is not intended to ignore others who might put their regex into a production environment where it will be run all the time–in these cases regex efficiency (one version of “best”) is critical…but again that is not typically the case here. And a final thought: a “best” regex, in terms of said efficiency, is often NOT the shortest one. :-)

guy038

Hello, @gideon-addai, @scott-sumner, @terry-r and All,

Scott, when you said :

So there is little value in presenting something to a solved problem that shaves a few characters off a regex…

I think that it’s… just the case of my last post ;-)) Of course, I agree with you that Terry-r’s regex works nice and that my last contribution was not essential at all ! However, sometimes, I, simply, think that a shorter regex may be more easily understood, for everybody !

But, eventually, you’re right : the best regex, in terms of number of matches and minimum execution speed, is not always the shortest one ! And, generally, building a good generic regex, which can suit in many texts, at the same time, is rather difficult to achieve !

Finally ,instead of speaking about the concept of “best” regex, we would rather speak about the most suitable regex for each OP :-))

Cheers,

guy038

Terry R

Apologies @guy038 and @scott-sumner, my post on “best” regex and “trying to outdo each other” was not meant seriously (;-D). And yes our primary goal here is to help the OP with their problem. I do still think though that we are trying to make our regex’s more efficient through this process. Does that equate to “best”, possibly in a roundabout way? And yes more efficient doesn’t necessarily mean faster, it may mean less complicated for the OP to understand and follow.

And I welcome having others critique my regex, thanks @guy038, your “shortened” version of my regex was welcomed. I hadn’t spent a lot of time on the regex, I was mainly concerned with looking for a “more efficient”/“less complicated” solution.

Thanks
Terry :-))