
    Find and delete all between and including tag only when certain words are within

    • Jeff Michaels

      Hi everyone.

      Newbie here.

      I have xml files that contain scrolling lyrics for karaoke songs that we are acquiring from another company. I need to remove each <pg> tag that contains a multiline phrase like:

      8
      BAR
      INSTRUMENTAL
      BREAK

      They are always on their own separate page within a <pg> tag. The company told us the common words that appear every time are BAR & BREAK. Matching on those should keep actual lyrics in the remaining page tags from being deleted (hopefully). There may be multiple instances of these tags throughout the xml as well. I need to find and delete all of them.

      I’m able to select the opening <pg and all the code up until the next opening <pg one at a time with this regex in Notepad++:

      (<pg)(.+?)(?=<pg)

      Is there a way to add code to locate both words BAR and BREAK to the above regex and only have those full tags found and deleted (multiple times within a file)? Then I can switch to Find In Files for a bulk search and replace routine?

      Below is an example of 3 <pg> tags consecutively. I need the 2nd complete tag found and deleted, then continue on to delete another full <pg> tag if found until it reaches the end of the file. (rinse and repeat)

      I have about 24 files to test with 7000 to follow. I’m hoping the common denominator of words to select between the <pg> tags are always BAR and BREAK.

      Thank you so much for any help and advice.

      ```
      <pg id="lyrics.16" t="157.09,15.88">
      <ln>
      <lyr s="I’M " t="161.28,.24"/>
      <lyr s="ON " t="161.52,.43"/>
      <lyr s="MY " t="161.95,.37"/>
      <lyr s="OWN " t="162.32,1.05"/>
      </ln>
      <ln>
      <lyr s="I’M " t="164.57,.26"/>
      <lyr s="ON " t="164.83,.42"/>
      <lyr s="MY " t="165.25,.43"/>
      <lyr s="OWN " t="165.68,1.07"/>
      </ln>
      <ln>
      <lyr s="I’M " t="167.91,.24"/>
      <lyr s="ON " t="168.15,.38"/>
      <lyr s="MY " t="168.53,.42"/>
      <lyr s="OWN " t="168.95,.62"/>
      </ln>
      <ln>
      <lyr s="NO " t="169.57,.48"/>
      <lyr s="NO " t="170.05,.19"/>
      <lyr s="NO " t="170.24,.41"/>
      <lyr s="NO " t="170.65,.43"/>
      <lyr s="NO " t="171.08,.56"/>
      </ln>
      <ln>
      <lyr s="YEAH " t="171.64,.23"/>
      <lyr s="EH " t="171.87,.42"/>
      <lyr s="YEAH " t="172.29,.58"/>
      </ln>
      </pg>
      <pg id="lyrics.17" t="172.97,7.93">
      <ln>
      <lyr s="8 " t="174.16,.21"/>
      <lyr s="BAR " t="174.37,.24"/>
      </ln>
      <ln>
      <lyr s="INSTRUMENTAL " t="174.61,4.52"/>
      </ln>
      <ln>
      <lyr s="BREAK " t="179.13,1.67"/>
      </ln>
      </pg>

      <pg id="lyrics.18" t="180.9,9.72">
      <count c="pt.1" t="184.92,1.27" n="4"/>
      <ln>
      <lyr s="WOAH " t="186.55,.25"/>
      <lyr s="OH " t="186.8,.39"/>
      <lyr s="WOAH " t="187.19,.41"/>
      </ln>
      <ln>
      <lyr s="I " t="187.6,.21"/>
      <lyr s="CAN’T " t="187.81,.38"/>
      <lyr s="LET " t="188.19,.28"/>
      <lyr s="YOU " t="188.47,.38"/>
      <lyr s="GO " t="188.85,.6"/>
      </ln>
      <ln>
      <lyr s="MY " t="189.45,.44"/>
      <lyr s="LITTLE " t="189.89,.6"/>
      <lyr s="GIRL " t="190.49,.03"/>
      </ln>
      </pg>
      ```

      • Alan Kilborn

        I think you might find THIS THREAD and its discussion relevant to your problem.

        • Jeff Michaels @Alan Kilborn

          @Alan-Kilborn thank you. With a little work, I was able to create this with the help of the thread you suggested. It’s a messy solution (just like my question was), but it’s working on the 30 files I tested, plus none of the actual lyrics are removed by mistake.

          (?s-i:<pg(?:(?!</pg>).)*?BAR(?:(?!</pg>).)*?BREAK(?:(?!</pg>).)*?</pg>)\r\t
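For anyone who wants to try the same idea outside Notepad++, here is a minimal Python sketch of the tempered pattern above. It is an adaptation, not the exact Notepad++ expression: the `(?s-i:...)` wrapper becomes `re.DOTALL` (Python matching is case-sensitive by default), and the trailing `\r\t` is written as `\r?\n`, which is an assumption about the files' line endings. The function name and the shortened sample data are hypothetical.

```python
import re

# Tempered pattern: each "(?:(?!</pg>).)*?" step refuses to move past a
# closing </pg>, so BAR, BREAK and </pg> must all sit inside one <pg> block.
# Assumption: every page ends with an optional \r followed by \n.
PG_WITH_BAR_BREAK = re.compile(
    r"<pg(?:(?!</pg>).)*?BAR(?:(?!</pg>).)*?BREAK(?:(?!</pg>).)*?</pg>\r?\n",
    re.DOTALL,
)


def strip_instrumental_pages(xml_text: str) -> str:
    """Return xml_text with every <pg> block containing BAR then BREAK removed."""
    return PG_WITH_BAR_BREAK.sub("", xml_text)


# Shortened, hypothetical stand-in for the three pages in the question.
sample = (
    '<pg id="lyrics.16">\n<ln>\n<lyr s="OWN "/>\n</ln>\n</pg>\n'
    '<pg id="lyrics.17">\n<ln>\n<lyr s="8 "/>\n<lyr s="BAR "/>\n</ln>\n'
    '<ln>\n<lyr s="BREAK "/>\n</ln>\n</pg>\n'
    '<pg id="lyrics.18">\n<ln>\n<lyr s="WOAH "/>\n</ln>\n</pg>\n'
)

cleaned = strip_instrumental_pages(sample)
print(cleaned)
```

Only the middle page should disappear; the pages before and after it are left untouched because neither contains BAR.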
          
          • dinkumoil @Jeff Michaels

            @Jeff-Michaels

            You could install the XmlTools plugin (available via PluginsAdmin) and use its XSL Transformation feature to resolve your issue.

            After installing the plugin, save the following, for example as DeletePgTag.xsl, with UTF-8 character encoding:

            <?xml version="1.0" encoding="UTF-8"?>
            
            <xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
            
              <xsl:output
                method="xml"
                omit-xml-declaration="no"
                indent="yes"
                encoding="UTF-8"
              />
            
              <xsl:template match="@*|node()">
                <xsl:copy>
                  <xsl:apply-templates select="@*|node()"/>
                </xsl:copy>
              </xsl:template>
            
              <xsl:template match="/root/pg[ln[lyr[@s='8 '] and lyr[@s='BAR ']] and ln/lyr[@s='INSTRUMENTAL '] and ln/lyr[@s='BREAK ']]">
                <xsl:text></xsl:text>
              </xsl:template>
            
            </xsl:transform>
            

            Please note:

            • Since you did not tell us the exact XML path to your <pg> nodes, I assumed their parent element is root; see the match expression in line 18, <xsl:template match="/root/pg .... You have to change that to your needs.

            • Also, if the encoding of your XML file is not UTF-8, you have to change the encoding attribute of the xsl:output node in line 9 to your needs.

            Now:

            1. Load your XML file with Notepad++. Ensure that it is the active tab.
            2. Navigate to (menu) -> Plugins -> XML Tools -> XSL Transformation.
            3. In the dialog popping up select the XSL file you created from the code above.
            4. Click Transform button.

            A new tab will be opened with the changed content of your original XML file.
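If a scripted route is preferred for the bulk run, the same node-level idea can be sketched in Python with the standard library's `xml.etree.ElementTree`. This is only a sketch under assumptions: the `root` element name is borrowed from the guess in the XSL above, the function name is hypothetical, and the test is looser than the XSL (it drops any `<pg>` whose `lyr` words include both BAR and BREAK as whole words, without also requiring 8 and INSTRUMENTAL).

```python
import xml.etree.ElementTree as ET


def remove_instrumental_pages(xml_text: str) -> str:
    """Drop every <pg> element whose <lyr s="..."> words include BAR and BREAK."""
    root = ET.fromstring(xml_text)
    # ElementTree elements have no parent pointer, so walk every element
    # and inspect its direct <pg> children.
    for parent in list(root.iter()):
        for pg in list(parent.findall("pg")):
            # Compare whole words only, ignoring the trailing space in s="BAR ".
            words = {lyr.get("s", "").strip() for lyr in pg.iter("lyr")}
            if {"BAR", "BREAK"} <= words:
                parent.remove(pg)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample; <root> is an assumed wrapper element.
sample = (
    "<root>"
    '<pg id="lyrics.16"><ln><lyr s="BARELY "/><lyr s="HEARTBREAK "/></ln></pg>'
    '<pg id="lyrics.17"><ln><lyr s="8 "/><lyr s="BAR "/></ln>'
    '<ln><lyr s="INSTRUMENTAL "/></ln><ln><lyr s="BREAK "/></ln></pg>'
    '<pg id="lyrics.18"><ln><lyr s="WOAH "/></ln></pg>'
    "</root>"
)

result = remove_instrumental_pages(sample)
print(result)
```

Because the comparison is on whole `s` values, a lyric page that merely contains words such as BARELY or HEARTBREAK is kept, which a plain text search for BAR and BREAK would not guarantee.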

            [EDIT]
            Uh, too late.
            [/EDIT]

            • guy038

              Hello @jeff-Michaels, @alan-kilborn, @dinkumoil and All,

              Congratulations on brilliantly achieving your goal with the help of the link provided by @alan-kilborn ;-)) You seem to be a regex guru, too!


              Regarding your solution, I suppose that the end should be \r\n ( instead of \r\t )

              Now, I think that you can simplify your regex as :

              (?xs-i) <pg \x20 (?: (?! </pg> ) . )*? BAR .*? BREAK .*? </pg> \R # Delete any <pg> section containing BAR and BREAK

              • Indeed, there is no need to check that the string </pg> does not occur between the words BAR and BREAK, or between BREAK and </pg>. Once the negative look-ahead has brought you to the right <pg...> block, the lazy quantifiers that follow force the correct detection of that block, as long as any page that contains BAR also contains BREAK after it!

              • You may use \R, as it stands for any kind of line-break ( \r\n, \n and \r )

              • Note that I use the free-spacing mode ( (?x... ), which allows the main parts of the regex to be separated and allows comments after the # char
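The simplified regex can be tried in Python as a rough sketch. Two adaptations are assumptions forced by Python's `re` module: the flags are passed as `re.VERBOSE | re.DOTALL` instead of the inline `(?xs-i)`, and `\R` is spelled out as `(?:\r\n|\n|\r)` because `re` has no `\R`. The sample data is hypothetical. The sketch also compares it with the fully tempered pattern from earlier in the thread on one edge case that may be worth testing before a bulk run: a page that contains BAR but no BREAK.

```python
import re

# Simplified pattern (free-spacing mode): only the stretch before BAR is tempered.
SIMPLIFIED = re.compile(
    r"""
    <pg \x20                 # opening tag followed by a space
    (?: (?! </pg> ) . )*?    # anything, but never past this page's </pg>
    BAR .*? BREAK .*? </pg>  # BAR, then BREAK, then a closing tag
    (?: \r\n | \n | \r )     # one line break of any kind (stand-in for \R)
    """,
    re.VERBOSE | re.DOTALL,
)

# Fully tempered pattern, for comparison (assumes \r?\n line endings).
TEMPERED = re.compile(
    r"<pg(?:(?!</pg>).)*?BAR(?:(?!</pg>).)*?BREAK(?:(?!</pg>).)*?</pg>\r?\n",
    re.DOTALL,
)

# Usual case: BAR and BREAK sit in the same page.
normal = (
    '<pg id="a">\n<lyr s="OWN "/>\n</pg>\n'
    '<pg id="b">\n<lyr s="BAR "/>\n<lyr s="BREAK "/>\n</pg>\n'
    '<pg id="c">\n<lyr s="WOAH "/>\n</pg>\n'
)

# Edge case: BAR in one lyric page, BREAK only in a later page.
tricky = (
    '<pg id="a">\n<lyr s="BAR "/>\n</pg>\n'
    '<pg id="b">\n<lyr s="HELLO "/>\n</pg>\n'
    '<pg id="c">\n<lyr s="BREAK "/>\n</pg>\n'
)

normal_simplified = SIMPLIFIED.sub("", normal)
tricky_simplified = SIMPLIFIED.sub("", tricky)
tricky_tempered = TEMPERED.sub("", tricky)
print(repr(tricky_simplified), tricky_tempered == tricky)
```

In the usual case both patterns remove only the page that holds BAR and BREAK. In the edge case the untempered `.*?` after BAR is free to run across `</pg>` until it finds BREAK, so the simplified pattern removes all three pages, while the tempered one leaves the text unchanged. Whether such pages exist in the real files is unknown, so this is a scenario to test rather than a known problem.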

              Best Regards,

              guy038

              The Community of users of the Notepad++ text editor.