Find and delete all between and including tag only when certain words are within
-
Hi everyone.
Newbie here.
I have XML files that contain scrolling lyrics for karaoke songs that we are acquiring from another company. I need to remove each <pg> tag that contains multiline phrases like:
8
BAR
INSTRUMENTAL
BREAK
They are always on their own separate page within a <pg> tag. The company told us the common words that appear every time are BAR & BREAK. Matching on those two words should keep actual lyrics in the remaining page tags from being deleted (hopefully). There may be multiple instances of these tags throughout the XML as well. I need to find and delete all of them.
I’m able to select the opening <pg and all the code up until the next opening <pg one at a time with this regex in Notepad++:
(<pg)(.+?)(?=<pg)
Is there a way to extend the above regex so that it only finds the full tags that contain both words BAR and BREAK, so they can be deleted (multiple times within a file)? I could then switch to Find In Files for a bulk search and replace routine.
Below is an example of 3 <pg> tags consecutively. I need the 2nd complete tag found and deleted, then continue on to delete another full <pg> tag if found until it reaches the end of the file. (rinse and repeat)
I have about 24 files to test with 7000 to follow. I’m hoping the common denominator of words to select between the <pg> tags are always BAR and BREAK.
Thank you so much for any help and advice.
```
<pg id="lyrics.16" t="157.09,15.88">
<ln>
<lyr s="I’M " t="161.28,.24"/>
<lyr s="ON " t="161.52,.43"/>
<lyr s="MY " t="161.95,.37"/>
<lyr s="OWN " t="162.32,1.05"/>
</ln>
<ln>
<lyr s="I’M " t="164.57,.26"/>
<lyr s="ON " t="164.83,.42"/>
<lyr s="MY " t="165.25,.43"/>
<lyr s="OWN " t="165.68,1.07"/>
</ln>
<ln>
<lyr s="I’M " t="167.91,.24"/>
<lyr s="ON " t="168.15,.38"/>
<lyr s="MY " t="168.53,.42"/>
<lyr s="OWN " t="168.95,.62"/>
</ln>
<ln>
<lyr s="NO " t="169.57,.48"/>
<lyr s="NO " t="170.05,.19"/>
<lyr s="NO " t="170.24,.41"/>
<lyr s="NO " t="170.65,.43"/>
<lyr s="NO " t="171.08,.56"/>
</ln>
<ln>
<lyr s="YEAH " t="171.64,.23"/>
<lyr s="EH " t="171.87,.42"/>
<lyr s="YEAH " t="172.29,.58"/>
</ln>
</pg>
<pg id="lyrics.17" t="172.97,7.93">
<ln>
<lyr s="8 " t="174.16,.21"/>
<lyr s="BAR " t="174.37,.24"/>
</ln>
<ln>
<lyr s="INSTRUMENTAL " t="174.61,4.52"/>
</ln>
<ln>
<lyr s="BREAK " t="179.13,1.67"/>
</ln>
</pg>
<pg id="lyrics.18" t="180.9,9.72">
<count c="pt.1" t="184.92,1.27" n="4"/>
<ln>
<lyr s="WOAH " t="186.55,.25"/>
<lyr s="OH " t="186.8,.39"/>
<lyr s="WOAH " t="187.19,.41"/>
</ln>
<ln>
<lyr s="I " t="187.6,.21"/>
<lyr s="CAN’T " t="187.81,.38"/>
<lyr s="LET " t="188.19,.28"/>
<lyr s="YOU " t="188.47,.38"/>
<lyr s="GO " t="188.85,.6"/>
</ln>
<ln>
<lyr s="MY " t="189.45,.44"/>
<lyr s="LITTLE " t="189.89,.6"/>
<lyr s="GIRL " t="190.49,.03"/>
</ln>
</pg>
```
-
I think you might find THIS THREAD and its discussion relevant to your problem.
-
@Alan-Kilborn thank you. With a little work, I was able to create this with the help of the thread you suggested. It’s a messy solution (just like my question was), but it’s working on the 30 files I tested, plus none of the actual lyrics are removed by mistake.
(?s-i:<pg(?:(?!</pg>).)*?BAR(?:(?!</pg>).)*?BREAK(?:(?!</pg>).)*?</pg>)\r\t
-
You could install the XmlTools plugin (available via PluginsAdmin) and use its XSL Transformation feature to resolve your issue.
After installing the plugin, save the following, for example as DeletePgTag.xsl, in character encoding UTF-8:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="no" indent="yes" encoding="UTF-8" />
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="/root/pg[ln[lyr[@s='8 '] and lyr[@s='BAR ']] and ln/lyr[@s='INSTRUMENTAL '] and ln/lyr[@s='BREAK ']]">
    <xsl:text></xsl:text>
  </xsl:template>
</xsl:transform>
Please note:
- Since you did not tell us the exact XML path to your <pg> nodes, I assumed that they sit directly under a root element named root; see the match expression of the second template, <xsl:template match="/root/pg .... You have to change that to your needs.
- Also, if the encoding of your XML file is not UTF-8, you have to change the encoding attribute of the xsl:output node to your needs.
Now:
- Load your XML file with Notepad++. Ensure that it is the active tab.
- Navigate to (menu) -> Plugins -> XML Tools -> XSL Transformation.
- In the dialog popping up, select the XSL file you created from the code above.
- Click the Transform button.
A new tab will be opened with the changed content of your original XML file.
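For bulk runs outside Notepad++, the same structural idea can be sketched with Python's standard library instead of XSLT. This is only an illustration and not part of the XmlTools plugin: the wrapper element root mirrors the assumption made above, the function name remove_break_pages is made up for this example, and the sample is a shortened version of the pages from the question. Unlike the XSL template, it only requires the two words BAR and BREAK.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the <root> wrapper is an assumption, as in the XSL above.
SAMPLE = """<root>
<pg id="lyrics.16" t="157.09,15.88">
<ln>
<lyr s="YEAH " t="171.64,.23"/>
</ln>
</pg>
<pg id="lyrics.17" t="172.97,7.93">
<ln>
<lyr s="8 " t="174.16,.21"/>
<lyr s="BAR " t="174.37,.24"/>
</ln>
<ln>
<lyr s="INSTRUMENTAL " t="174.61,4.52"/>
</ln>
<ln>
<lyr s="BREAK " t="179.13,1.67"/>
</ln>
</pg>
<pg id="lyrics.18" t="180.9,9.72">
<ln>
<lyr s="WOAH " t="186.55,.25"/>
</ln>
</pg>
</root>"""

# Words that must all appear (as whole words) on a page for it to be dropped.
REQUIRED_WORDS = {"BAR", "BREAK"}


def remove_break_pages(xml_text):
    """Return (cleaned_xml, removed_ids) with every BAR/BREAK page dropped."""
    root = ET.fromstring(xml_text)
    removed = []
    # Copy the element lists first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for pg in list(parent.findall("pg")):
            words = {lyr.get("s", "").strip() for lyr in pg.iter("lyr")}
            if REQUIRED_WORDS <= words:
                parent.remove(pg)
                removed.append(pg.get("id"))
    return ET.tostring(root, encoding="unicode"), removed


cleaned, removed_ids = remove_break_pages(SAMPLE)
print(removed_ids)  # ['lyrics.17']
```

Because the words are compared as whole attribute values, a lyric such as HEARTBREAK would not count as BREAK here, which is stricter than a plain text search for the two words.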
[EDIT]
Uh, too late.
[/EDIT]
-
-
Hello @jeff-Michaels, @alan-kilborn, @dinkumoil and All,
Congratulations on brilliantly achieving your goal with the help of the link provided by @alan-kilborn ;-)) You seem to be a regex guru, too!
Regarding your solution, I suppose that the end should be \r\n (instead of \r\t).
Now, I think that you can simplify your regex as:
(?xs-i) <pg \x20 (?: (?! </pg> ) . )*? BAR .*? BREAK .*? </pg> \R # Delete any <pg> section containing BAR and BREAK
- Indeed, there is no need to check that the string </pg> does not occur between the words BAR and BREAK, nor between BREAK and </pg>. Once the negative look-ahead has located the right <pg ...> block, the lazy quantifiers that follow force the correct detection of that block! (This relies on every block that contains BAR also containing BREAK after it; if one does not, the lazy .*? can run past its </pg> into a later block.)
- You may use \R, as it stands for any kind of line break (\r\n, \n and \r).
- Note that I use the free-spacing mode ((?x)), which lets you separate the main parts of the regex and allows comments after the # character.
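To see how the two variants behave before running them over thousands of files, here is a small Python sketch. It is an adaptation, not the Notepad++ syntax: Python's re module has no \R, so an explicit line-break alternation stands in for it, and case sensitivity is simply Python's default rather than -i. The page() helper and both sample strings are made up for the test. On a normal file both patterns remove only the BAR/BREAK page; on the made-up edge case, where one page contains BAR and a later page contains BREAK, the tempered version matches nothing while the simplified one swallows everything in between.

```python
import re

# Stand-in for \R, which Python's re module does not support.
EOL = r"(?:\r\n|\n|\r)"

# Tempered version: no part of the match may step over a closing </pg>.
TEMPERED = re.compile(
    r"""(?xs)
    <pg \x20
    (?: (?! </pg> ) . )*? BAR
    (?: (?! </pg> ) . )*? BREAK
    (?: (?! </pg> ) . )*? </pg>
    """ + EOL
)

# Simplified version: only the stretch before BAR is guarded.
SIMPLIFIED = re.compile(
    r"""(?xs)
    <pg \x20
    (?: (?! </pg> ) . )*? BAR
    .*? BREAK
    .*? </pg>
    """ + EOL
)


def page(page_id, *words):
    """Build a tiny <pg> block; attributes are simplified for the demo."""
    lyrs = "".join(f'<lyr s="{w} "/>\n' for w in words)
    return f'<pg id="lyrics.{page_id}">\n<ln>\n{lyrs}</ln>\n</pg>\n'


# Typical case: the instrumental page sits between two lyric pages.
NORMAL = page(16, "YEAH") + page(17, "8", "BAR", "INSTRUMENTAL", "BREAK") + page(18, "WOAH")

# Hypothetical edge case: BAR and BREAK appear in different lyric pages.
EDGE = page(1, "RAISE", "THE", "BAR") + page(2, "WOAH") + page(3, "HEARTBREAK")

for name, pattern in (("tempered", TEMPERED), ("simplified", SIMPLIFIED)):
    print(name, "keeps lyrics on NORMAL:", "WOAH" in pattern.sub("", NORMAL))
    print(name, "leaves EDGE untouched:", pattern.sub("", EDGE) == EDGE)
```

If the supplier's guarantee holds (BAR and BREAK only ever appear together on instrumental pages), both patterns give the same result; the tempered form is the safer choice when that is not certain.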
Best Regards,
guy038
-