Find and delete everything between and including a tag, only when certain words are within it

Jeff Michaels

Hi everyone.

Newbie here.

I have XML files that contain scrolling lyrics for karaoke songs that we are acquiring from another company. I need to remove every <pg> tag that contains a multiline phrase like:

8
BAR
INSTRUMENTAL
BREAK

They are always on their own separate page within a <pg> tag. The company told us the common words that appear every time are BAR & BREAK. Matching on those two words should keep actual lyrics in the remaining page tags from being deleted (hopefully). There may be multiple instances of these tags throughout the XML as well. I need to find and delete all of them.

I’m able to select the opening <pg and all the code up until the next opening <pg one at a time with this regex in Notepad++:

(<pg)(.+?)(?=<pg)

Is there a way to extend the above regex so that it requires both words, BAR and BREAK, and only those full tags are found and deleted (multiple times within a file)? Then I could switch to Find In Files for a bulk search and replace routine.

Below is an example of 3 <pg> tags consecutively. I need the 2nd complete tag found and deleted, then continue on to delete another full <pg> tag if found until it reaches the end of the file. (rinse and repeat)

I have about 24 files to test with 7000 to follow. I’m hoping the common denominator of words to select between the <pg> tags are always BAR and BREAK.

Thank you so much for any help and advice.


Alan Kilborn

I think you might find THIS THREAD and its discussion relevant to your problem.

Jeff Michaels

@Alan-Kilborn thank you. With a little work, I was able to create this with the help of the thread you suggested. It’s a messy solution (just like my question was), but it’s working on the 30 files I tested, plus none of the actual lyrics are removed by mistake.

(?s-i:<pg(?:(?!</pg>).)*?BAR(?:(?!</pg>).)*?BREAK(?:(?!</pg>).)*?</pg>)\r\t
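For anyone who wants to try the idea outside Notepad++ before a bulk run, here is a minimal Python sketch of the same tempered pattern. The sample XML is hypothetical (the real file layout is not shown in this thread), the inline `(?s-i:...)` group is replaced by `re.DOTALL`, and the trailing `\r\t` is replaced by an optional line break, so treat it as an approximation rather than an exact port.

```python
import re

# Hypothetical sample data: the tag and attribute names are assumptions,
# because the thread does not show the real XML.
SAMPLE = (
    "<root>\n"
    "<pg>\n"
    '<ln><lyr s="HELLO "/><lyr s="WORLD "/></ln>\n'
    "</pg>\n"
    "<pg>\n"
    '<ln><lyr s="8 "/><lyr s="BAR "/></ln>\n'
    '<ln><lyr s="INSTRUMENTAL "/><lyr s="BREAK "/></ln>\n'
    "</pg>\n"
    "<pg>\n"
    '<ln><lyr s="GOODBYE "/></ln>\n'
    "</pg>\n"
    "</root>\n"
)

# Each "(?:(?!</pg>).)*?" step consumes one character at a time and refuses
# to step onto a closing </pg>, so BAR and BREAK must sit in the same block.
PG_WITH_BAR_AND_BREAK = re.compile(
    r"<pg(?:(?!</pg>).)*?BAR(?:(?!</pg>).)*?BREAK(?:(?!</pg>).)*?</pg>"
    r"(?:\r\n|\n|\r)?",  # assumption: also swallow one trailing line break
    re.DOTALL,           # let "." match line breaks, like (?s) in the post
)


def strip_break_pages(text: str) -> str:
    """Return text with every <pg> block containing BAR then BREAK removed."""
    return PG_WITH_BAR_AND_BREAK.sub("", text)


cleaned = strip_break_pages(SAMPLE)
print(cleaned)
```

With this sample, only the middle page is removed and the two lyric pages survive.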

dinkumoil

@Jeff-Michaels

You could install the XmlTools plugin (available via PluginsAdmin) and use its XSL Transformation feature to resolve your issue.

After installing the plugin, save the following as, for example, DeletePgTag.xsl, using UTF-8 character encoding:

<?xml version="1.0" encoding="UTF-8"?>

<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output
    method="xml"
    omit-xml-declaration="no"
    indent="yes"
    encoding="UTF-8"
  />

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="/root/pg[ln[lyr[@s='8 '] and lyr[@s='BAR ']] and ln/lyr[@s='INSTRUMENTAL '] and ln/lyr[@s='BREAK ']]">
    <xsl:text></xsl:text>
  </xsl:template>

</xsl:transform>

Please note:

Since you did not tell us the exact XML path to your <pg> nodes, I assumed the parent node is root, see the search expression in line 18 <xsl:template match="/root/pg .... You have to change that to your needs.
Also, if the encoding of your XML file is not UTF-8, you have to change the encoding attribute of the xsl:output node in line 9 accordingly.

Now:

Load your XML file with Notepad++. Ensure that it is the active tab.
Navigate to (menu) -> Plugins -> XML Tools -> XSL Transformation.
In the dialog that pops up, select the XSL file you created from the code above.
Click the Transform button.

A new tab will be opened with the changed content of your original XML file.
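If scripting is an option, the same parser-based idea can be sketched in Python with the standard library's ElementTree instead of XSLT. This is only a sketch under the same assumptions as the stylesheet above (a root element named root, <pg> children, and <lyr> elements whose s attribute holds the word). Unlike the stylesheet, it checks only for BAR and BREAK, as in the original question, and the sample data is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the structure mirrors the assumptions made by the
# stylesheet (root/pg/ln/lyr with an "s" attribute), not a real file.
SAMPLE = """<root>
<pg><ln><lyr s="HELLO "/><lyr s="WORLD "/></ln></pg>
<pg>
<ln><lyr s="8 "/><lyr s="BAR "/></ln>
<ln><lyr s="INSTRUMENTAL "/><lyr s="BREAK "/></ln>
</pg>
<pg><ln><lyr s="GOODBYE "/></ln></pg>
</root>"""

MARKERS = {"BAR", "BREAK"}  # both words must appear on the same page


def remove_break_pages(root: ET.Element) -> int:
    """Remove <pg> children of root that contain all marker words.

    Returns the number of pages removed.
    """
    removed = 0
    # Iterate over a copy of the list because we modify root while looping.
    for pg in list(root.findall("pg")):
        words = {lyr.get("s", "").strip() for lyr in pg.iter("lyr")}
        if MARKERS <= words:  # subset test: every marker is present
            root.remove(pg)
            removed += 1
    return removed


root = ET.fromstring(SAMPLE)
removed = remove_break_pages(root)
result = ET.tostring(root, encoding="unicode")
print(removed)
print(result)
```

Because this compares whole attribute values rather than substrings, a lyric such as "BARCELONA " would not count as BAR, which a plain regex search for BAR would match.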

[EDIT]
Uh, too late.
[/EDIT]

guy038

Hello @jeff-Michaels, @alan-kilborn, @dinkumoil and All,

Congratulations on brilliantly achieving your goal with the help of the link provided by @alan-kilborn ;-)) You seem to be a regex guru, too!

Regarding your solution, I suppose that the end should be \r\n ( instead of \r\t )

Now, I think that you can simplify your regex as :

(?xs-i) <pg \x20 (?: (?! </pg> ) . )*? BAR .*? BREAK .*? </pg> \R # Delete any <pg> section containing BAR and BREAK

Indeed, as long as every <pg> block that contains BAR also contains BREAK, there is no need to check that the string </pg> does not occur between the words BAR and BREAK, or between BREAK and </pg>. Once the negative look-ahead has located the right <pg...> block, the lazy quantifiers that follow force the correct detection of that block!
You may use \R, as it stands for any kind of line-break ( \r\n, \n and \r )
Note that I use the free-spacing mode ( (?x... ), which allows the main parts of the regex to be separated by spaces and comments to be added after the # char
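To compare the simplified regex with the fully tempered one before a bulk run, here is a small Python sketch. Two things differ from Notepad++: Python's re module has no \R, so an explicit alternation stands in for it, and free-spacing mode is enabled with re.VERBOSE. The sample pages, including the n attribute, are hypothetical. The second sample is a deliberately awkward case where one page contains BAR and a later page contains BREAK; it is an assumption that such pages could occur in real files, but it shows where the two patterns diverge.

```python
import re

# Simplified pattern from the post, in free-spacing mode.
SIMPLIFIED = re.compile(
    r"""
    <pg \x20                 # opening tag followed by a literal space
    (?: (?! </pg> ) . )*?    # stay inside this block until BAR is found
    BAR .*? BREAK .*? </pg>  # plain lazy quantifiers from here on
    (?: \r\n | \n | \r )     # stand-in for \R
    """,
    re.VERBOSE | re.DOTALL,
)

# Fully tempered pattern: no step may cross a closing </pg>.
TEMPERED = re.compile(
    r"<pg\x20(?:(?!</pg>).)*?BAR(?:(?!</pg>).)*?BREAK(?:(?!</pg>).)*?</pg>"
    r"(?:\r\n|\n|\r)",
    re.DOTALL,
)

# Hypothetical typical file: page 2 holds both marker words.
TYPICAL = (
    '<pg n="1">\n<ln><lyr s="HELLO "/></ln>\n</pg>\n'
    '<pg n="2">\n<ln><lyr s="8 "/><lyr s="BAR "/></ln>\n'
    '<ln><lyr s="INSTRUMENTAL "/><lyr s="BREAK "/></ln>\n</pg>\n'
    '<pg n="3">\n<ln><lyr s="BYE "/></ln>\n</pg>\n'
)

# Hypothetical edge case: BAR and BREAK appear on different lyric pages.
EDGE = (
    '<pg n="1">\n<ln><lyr s="RAISE THE BAR "/></ln>\n</pg>\n'
    '<pg n="2">\n<ln><lyr s="HELLO "/></ln>\n</pg>\n'
    '<pg n="3">\n<ln><lyr s="GIVE ME A BREAK "/></ln>\n</pg>\n'
)

typical_simplified = SIMPLIFIED.sub("", TYPICAL)
typical_tempered = TEMPERED.sub("", TYPICAL)
edge_simplified = SIMPLIFIED.sub("", EDGE)
edge_tempered = TEMPERED.sub("", EDGE)

print(typical_simplified == typical_tempered)  # same result on typical input
print(edge_tempered == EDGE)                   # tempered leaves the lyrics alone
print(repr(edge_simplified))                   # simplified spans pages 1 to 3
```

On the typical sample both patterns remove only page 2. On the edge sample the tempered pattern changes nothing, while the simplified one matches from page 1 through page 3, so which version is safe depends on whether lyric pages can ever contain BAR without BREAK.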

Best Regards,

guy038