How to copy the Headings, add some text before it and reproduce that just before some unique text

dr ramaanand

@dr-ramaanand the <b> and <span.............> have exchanged places in some files (out of multiple files of a folder). Heading 2, that is, the text between <H2..........> and </H2>, is not limited to 3 lines in some files (out of multiple files of a folder), but I want all the text that is found between <H2..........> and </H2> on all the lines to be reproduced as explained above.

Alan Kilborn

@dr-ramaanand

Please note:

This Community Forum is not a data transformation service; you should not expect to be able to always say “I have data like X and want it to look like Y” and have us do all the work for you. If you are new to the Forum, and new to regular expressions, we will often give help on the first one or two data-transformation questions, especially if they are well-asked and you show a willingness to learn; and we will point you to the documentation where you can learn how to do the data transformations for yourself in the future. But if you repeatedly ask us to do your work for you, you will find that the patience of usually-helpful Community members wears thin. The best way to learn regular expressions is by experimenting with them yourself, and getting a feel for how they work; having us spoon-feed you the answers without you putting in the effort doesn’t help you in the long term and is uninteresting and annoying for us.

USEFUL REFERENCES

Where to find regular expressions (regex) documentation

dr ramaanand

@Alan-Kilborn I tried to do something on my own but gave up because I couldn’t. If someone can help, please do! I have 400 files, so it will be difficult to edit each file separately.

PeterJones

@dr-ramaanand said in How to copy the Headings, add some text before it and reproduce that just before some unique text:

but gave up

There is the biggest problem you’ve faced so far. And only you can help you with that problem.

tried to do something on my own … I have 400 files

And yet, you didn’t show what you tried, despite the advice given in the search-replace template.

The correct advice is: if you have more than a handful of text files in the same format, they should have been converted into a database (and 400 pages is way more than a handful). The HTML then becomes a report that you generate from the database, rather than the main storage. It’s much simpler to edit a single template that gets applied to all the data in the database than it is to make the same change to hundreds of pages. Given that you’ve had multiple requests like this already (all apparently on the same set of HTML files), it would be a truly good use of your time to split it out into a database and do that. (In the HTML world, those databases are often called “Content Management Systems” or CMS.)

That said, I am doubtful you will put in the effort to do it right, and instead expect that we give you a regex that does it.

Well, I won’t do that – at least not everything – because I doubt you’d actually learn anything if I just handed you the solution, and it would thus be a lot of effort on my part to get things “just right”, only to have you come back in a few days with another similar problem that you expect us to solve for you. But what I will do is say that if it were me, I would break the problem into multiple steps.

Identify the place in all the files where you want to insert the text. Figure out a regex that will place a simple identifiable sequence (like ☺, it just has to be something that will only occur once in each file) where you want to insert the text – probably even with the prefix, like <p...>We have a ☺</p>
Assuming all the <H2> data is on contiguous lines like you’ve shown, then I’ll give you a regex that will take everything from the first <H2...> to the last </H2> near the top, and put it where the ☺ used to be:
FIND = (?-s)((<H2.*?</H2>\s*)+)(?s).*?\K☺
REPLACE = ☺$1☹
Now you have the data between a ☺ and ☹, but with the extra <H2...> and </H2> tags in your way. This means you have a “begin” and “end”, where you want to make a change just between those. Our generic regex FAQ lists just such a topic, “Replacing in a specific zone of text”. Follow the formula in that post, where ☺ is your start-of-range marker and ☹ is your end-of-range marker, and you want to search for something like <H2.*?>\s* or </H2>\s* and replace with empty string. (You might have to do it once for each if you cannot figure out the combined regex that deletes both the open and close H2 tags in one, though there are multiple variants that could do it.)
Once that’s done, you can delete the ☺ and ☹ in all the files.
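
For readers who would rather script these steps than run them in the Replace dialog, here is a minimal Python sketch of the same marker idea. It is only an illustration under assumptions: Python's re module has no \K, so the marker is replaced by plain string handling instead; the function name copy_headings is hypothetical; and step 1 is assumed to have already left exactly one ☺ in the text, somewhere after the heading block.

```python
import re

START, END = "☺", "☹"

def copy_headings(text: str) -> str:
    """Copy the first run of <H2> lines to the ☺ marker, minus the H2 tags."""
    # Step 2: find the first run of consecutive <H2...>...</H2> lines.
    block = re.search(r"(?:<H2.*?</H2>\s*)+", text, flags=re.I)
    if block is None or START not in text:
        return text  # nothing to copy, or no marker placed yet
    # Put a copy of the block between a start marker and an end marker.
    text = text.replace(START, START + block.group(0) + END, 1)

    # Step 3: delete opening and closing H2 tags, but only inside the zone.
    def strip_tags(m: re.Match) -> str:
        return re.sub(r"</?H2[^>]*>\s*", "", m.group(0), flags=re.I)

    text = re.sub(re.escape(START) + r".*?" + re.escape(END),
                  strip_tags, text, flags=re.S)

    # Step 4: remove both markers.
    return text.replace(START, "").replace(END, "")
```

Because the tags and the whitespace after them are replaced with an empty string, as suggested above, consecutive headings end up run together in the copied text; substitute a separator in strip_tags if that is not what you want.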

The only way to learn regex is by doing. You have been shown the documentation multiple times. At this point, you need to learn how to do the regex on your own, because you have gone beyond the 1-2 freebies for a newbie, and it’s doubtful that most of the regulars will bother answering you until you start showing more effort in your questions. Good luck.

dr ramaanand

@PeterJones the RegEx (?s)\A.+?\K((<h2.+?</h2>\R)+).+\K(Please\s*E-mail\s*us)(?<=<p[^]>) will probably find what I want with the <H2.........> and </H2> and then I can probably use We have a $1, $3 in the Replace field. Then I should remove the <H2.........> and </H2> with another RegEx, right? Find (We have a )<H2.*?>(.*?)</H2> and replace with $1 $2 to remove the <H2.........> and </H2> in the final results.

dr ramaanand

@PeterJones the RegEx (?s)\A.+?\K((<h2.+?</h2>\R)+).*\K(Please\s*E-mail\s*us)(?<=<p[^]>) is invalid. I, however, want a solution desperately. Please help with the correct RegEx for that! I will manage the removal of the <H2..............> and </H2> on my own (using find (We have a\x20)<H2.*?>(.*?)</H2> and replace with $1, $2)
Anyone can help. Please help!

dr ramaanand

@PeterJones (?s)\A.+?\K((<h2.+?</h2>\R)+).*\K(?=Please\s*E-mail\s*us) finds the first <H2......> to </H2> block and puts it just before the last occurrence of the “Please E-mail us” text, but I want it to come even earlier, that is, before the <p........> string. I tried (?s)\A.+?\K((<h2.+?</h2>\R)+).*\K(?=Please\s*E-mail\s*us)(?<=<p[^]>) without any success. Now I need some help. Please help!

dr ramaanand

@PeterJones <p...>We have a $1,</p>\x20$2 is what I used in the replace field (for your information)

dr ramaanand

@PeterJones (?s)\A.+?\K((<h2.+?</h2>\R)+).*\K(?=<p.*?Please\s*E-mail\s*us) helped find the <p...........> string just before the last occurrence of “Please E-mail us”. Thanks for the help!

guy038

Hello, @dr-ramaanand, @alan-kilborn, @peterjones and All,

@dr-ramaanand, here is my only contribution to your problem :

If you find this solution interesting, just be nice and make a donation to @Don-ho. It’s, I think, the least you can do !

If I assume that :

There is only 1 concerned zone, of consecutive <H2>.......</H2> blocks, located right after the <H1>.......</H1> block
The text of these <H2>.......</H2> blocks must be copied before the last line of your file which contains the string Please E-mail us
And that your INPUT text is :

<H1........>Heading1</H1>
<H2........>Some text</H2>
<H2........>Different text</H2>
<H2........>Altogether different text</H2>
Some paragraphs here
</P> (or </ul>)
<P..........><span..........><b>Please E-mail us</b></span></P>
<H2........>Heading that should not be reproduced</H2>
Some paragraphs here
</ul> (or </P>)
<P..........><b><span..........>Please E-mail us</span></b></P>

Note : Each <H2>..........</H2> line may be preceded and/or followed with tabulation and/or space characters

Then :

Open a new tab
Paste the above INPUT text in this new tab
Open the Replace dialog ( Ctrl + H )
Select the Regular expression mode and tick the Wrap around option
Follow the road map, below

Note : I’ll use the free-spacing mode, (?x), in order to easily identify the main parts of the search regexes

With the first regex S/R, below, we’ll place the line @@@ right before the last line of the file containing the string Please E-mail us

SEARCH (?xsi) \A .+ \K (?= ^ <P .+ Please \x20 E-mail \x20 us )

REPLACE @@@\r\n

So, we get this temporary OUTPUT :

<H1........>Heading1</H1>
<H2........>Some text</H2>
<H2........>Different text</H2>
<H2........>Altogether different text</H2>
Some paragraphs here
</P> (or </ul>)
<P..........><span..........><b>Please E-mail us</b></span></P>
<H2........>Heading that should not be reproduced</H2>
Some paragraphs here
</ul> (or </P>)
@@@
<P..........><b><span..........>Please E-mail us</span></b></P>
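
As an aside for anyone scripting this outside Notepad++, the first S/R could be sketched in Python roughly as follows. This is an approximation, not a translation: Python's re has no \K, so the insertion point is taken from the position of the last match instead, and the sketch assumes the <P...> tag and the phrase sit on the same line and that lines end with \n. The name insert_marker is hypothetical.

```python
import re

MARKER = "@@@"

def insert_marker(text: str) -> str:
    """Insert a MARKER line before the last <P...> line containing the phrase."""
    # re.M makes ^ match at every line start; re.I mirrors the (?i) flag.
    pattern = re.compile(r"^<P.*Please E-mail us", flags=re.I | re.M)
    matches = list(pattern.finditer(text))
    if not matches:
        return text  # phrase not found: leave the text untouched
    pos = matches[-1].start()  # start of the last matching line
    return text[:pos] + MARKER + "\n" + text[pos:]
```

Taking the last element of finditer plays the role of the greedy .+ in the SEARCH above, which pushes the match as far down the file as possible.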

With the second regex S/R, below, we’ll recopy all the <H2>.......</H2> lines, located right after the <H1>.....</H1> block, just before the delimiter line @@@

SEARCH (?xsi) (?<= </H1> \r\n ) ( \s* (?: <H2.+?> .+? </H2> \s* )+ ) ^ .+ \K (?= @@@ \R)

REPLACE \1

And we obtain this temporary OUTPUT :

<H1........>Heading1</H1>
<H2........>Some text</H2>
<H2........>Different text</H2>
<H2........>Altogether different text</H2>
Some paragraphs here
</P> (or </ul>)
<P..........><span..........><b>Please E-mail us</b></span></P>
<H2........>Heading that should not be reproduced</H2>
Some paragraphs here
</ul> (or </P>)
<H2........>Some text</H2>
<H2........>Different text</H2>
<H2........>Altogether different text</H2>
@@@
<P..........><b><span..........>Please E-mail us</span></b></P>

Note that the line <H2........>Heading that should not be reproduced</H2>, which is not consecutive to the other H2 lines, is not re-copied, as expected !
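
A rough Python counterpart of this second S/R, again only as a sketch under assumptions: lines end with \n (not \r\n), the @@@ line already exists, and the name copy_h2_block is hypothetical. Instead of \K, the block is captured first and then spliced in at the marker position.

```python
import re

def copy_h2_block(text: str) -> str:
    """Copy the run of <H2> lines that follows </H1> to just above the @@@ line."""
    # The lookbehind anchors the run right after the </H1> line; re.S lets a
    # heading's text span several lines, as the (?s) flag does above.
    block = re.search(
        r"(?<=</H1>\n)(?:[ \t]*<H2[^>]*>.+?</H2>[ \t]*\n)+",
        text,
        flags=re.I | re.S,
    )
    marker = re.search(r"^@@@$", text, flags=re.M)
    if block is None or marker is None:
        return text
    # Splice a copy of the block in front of the marker line.
    return text[:marker.start()] + block.group(0) + text[marker.start():]
```

As in the regex version, an <H2> line that does not directly follow the first run is left out of the copy, because the repeated group stops at the first line that is not an <H2> line.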

Now, with the third regex S/R, below, we’ll just rewrite the text of all these <H2>.....</H2> blocks, in a single line :

SEARCH (?xi-s) .+ > (.+) </H2> \h* \R (?= ( (?: \h* <H2 .+ \R )+ )? @@@ \R )

REPLACE We have a \1(?2, :.)

We get the temporary OUTPUT :

<H1........>Heading1</H1>
<H2........>Some text</H2>
<H2........>Different text</H2>
<H2........>Altogether different text</H2>
Some paragraphs here
</P> (or </ul>)
<P..........><span..........><b>Please E-mail us</b></span></P>
<H2........>Heading that should not be reproduced</H2>
Some paragraphs here
</ul> (or </P>)
We have a Some text, We have a Different text, We have a Altogether different text.@@@
<P..........><b><span..........>Please E-mail us</span></b></P>
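
Python's re.sub has no conditional replacement, so a script would need another way to choose between a comma and a final full stop. One possible sketch, assuming \n line endings and each heading on a single line, collects the heading texts and joins them; the name headings_to_sentence is hypothetical. Unlike the SEARCH above, it keeps any inline tags found inside a heading.

```python
import re

def headings_to_sentence(text: str) -> str:
    """Rewrite the <H2> lines directly above @@@ as one 'We have a ...' line."""
    # One or more complete <H2> lines that are immediately followed by @@@.
    zone = re.search(
        r"((?:^[ \t]*<H2[^>]*>.+</H2>[ \t]*\n)+)(?=@@@\n)",
        text,
        flags=re.I | re.M,
    )
    if zone is None:
        return text
    titles = re.findall(r"<H2[^>]*>(.+)</H2>", zone.group(1), flags=re.I)
    # ", " between items and "." after the last one.
    sentence = ", ".join("We have a " + title for title in titles) + "."
    return text[:zone.start()] + sentence + text[zone.end():]
```

The join does the job of the optional second group: every heading but the last is followed by a comma and a space, and the last one by a full stop.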

Finally, with the fourth regex S/R, below, we simply add the leading <P........> part and delete the @@@ string, altogether

SEARCH (?x-s) ^ ( .+) @@@ $

REPLACE <P whatever you need >\1

And here is your expected OUTPUT text :

<H1........>Heading1</H1>
<H2........>Some text</H2>
<H2........>Different text</H2>
<H2........>Altogether different text</H2>
Some paragraphs here
</P> (or </ul>)
<P..........><span..........><b>Please E-mail us</b></span></P>
<H2........>Heading that should not be reproduced</H2>
Some paragraphs here
</ul> (or </P>)
<P whatever you need >We have a Some text, We have a Different text, We have a Altogether different text.
<P..........><b><span..........>Please E-mail us</span></b></P>
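
For completeness, the fourth S/R is short enough to sketch in Python as well. The opening tag is left as a parameter because the real attributes are not given here; finish_line and opening_tag are hypothetical names, and \n line endings are assumed.

```python
import re

def finish_line(text: str, opening_tag: str) -> str:
    """Prefix the 'We have a ...' line with opening_tag and drop the @@@ marker."""
    # A function replacement avoids surprises if opening_tag contains
    # backslashes or text that looks like a group reference.
    return re.sub(
        r"^(.+)@@@$",
        lambda m: opening_tag + m.group(1),
        text,
        flags=re.M,
    )
```

For example, finish_line(text, "<P>") turns the line ending in @@@ into a line starting with <P> and leaves every other line alone.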

Best Regards,

guy038