Find and delete all between and including tag only when certain words are within

  • J
    Jeff Michaels
    last edited by Jeff Michaels Apr 1, 2023, 3:23 PM Apr 1, 2023, 3:21 PM

    Hi everyone.

    Newbie here.

    I have XML files that contain scrolling lyrics for karaoke songs that we are acquiring from another company. I need to remove each <pg> tag that contains a multiline phrase like:

    8
    BAR
    INSTRUMENTAL
    BREAK

    They are always on their own separate page within a <pg> tag. The company told us the common words that appear every time are BAR & BREAK. Matching on both should keep actual lyrics in the remaining page tags from being deleted (hopefully). There may be multiple instances of these tags throughout the XML as well. I need to find and delete all of them.

    I’m able to select the opening <pg and all the code up until the next opening <pg one at a time with this regex in Notepad++:

    (<pg)(.+?)(?=<pg)

    Is there a way to extend the above regex so that it also looks for both words, BAR and BREAK, and finds and deletes only those full tags (multiple times within a file)? Could I then switch to Find in Files for a bulk search-and-replace routine?

    Below is an example of 3 <pg> tags consecutively. I need the 2nd complete tag found and deleted, then continue on to delete another full <pg> tag if found until it reaches the end of the file. (rinse and repeat)

    I have about 24 files to test with, and 7000 to follow. I'm hoping the words common to the <pg> tags that need deleting are always BAR and BREAK.

    Thank you so much for any help and advice.

    ```
    <pg id="lyrics.16" t="157.09,15.88">
    <ln>
    <lyr s="I'M " t="161.28,.24"/>
    <lyr s="ON " t="161.52,.43"/>
    <lyr s="MY " t="161.95,.37"/>
    <lyr s="OWN " t="162.32,1.05"/>
    </ln>
    <ln>
    <lyr s="I'M " t="164.57,.26"/>
    <lyr s="ON " t="164.83,.42"/>
    <lyr s="MY " t="165.25,.43"/>
    <lyr s="OWN " t="165.68,1.07"/>
    </ln>
    <ln>
    <lyr s="I'M " t="167.91,.24"/>
    <lyr s="ON " t="168.15,.38"/>
    <lyr s="MY " t="168.53,.42"/>
    <lyr s="OWN " t="168.95,.62"/>
    </ln>
    <ln>
    <lyr s="NO " t="169.57,.48"/>
    <lyr s="NO " t="170.05,.19"/>
    <lyr s="NO " t="170.24,.41"/>
    <lyr s="NO " t="170.65,.43"/>
    <lyr s="NO " t="171.08,.56"/>
    </ln>
    <ln>
    <lyr s="YEAH " t="171.64,.23"/>
    <lyr s="EH " t="171.87,.42"/>
    <lyr s="YEAH " t="172.29,.58"/>
    </ln>
    </pg>
    <pg id="lyrics.17" t="172.97,7.93">
    <ln>
    <lyr s="8 " t="174.16,.21"/>
    <lyr s="BAR " t="174.37,.24"/>
    </ln>
    <ln>
    <lyr s="INSTRUMENTAL " t="174.61,4.52"/>
    </ln>
    <ln>
    <lyr s="BREAK " t="179.13,1.67"/>
    </ln>
    </pg>

    <pg id="lyrics.18" t="180.9,9.72">
    <count c="pt.1" t="184.92,1.27" n="4"/>
    <ln>
    <lyr s="WOAH " t="186.55,.25"/>
    <lyr s="OH " t="186.8,.39"/>
    <lyr s="WOAH " t="187.19,.41"/>
    </ln>
    <ln>
    <lyr s="I " t="187.6,.21"/>
    <lyr s="CAN'T " t="187.81,.38"/>
    <lyr s="LET " t="188.19,.28"/>
    <lyr s="YOU " t="188.47,.38"/>
    <lyr s="GO " t="188.85,.6"/>
    </ln>
    <ln>
    <lyr s="MY " t="189.45,.44"/>
    <lyr s="LITTLE " t="189.89,.6"/>
    <lyr s="GIRL " t="190.49,.03"/>
    </ln>
    </pg>
    ```

    • A
      Alan Kilborn
      last edited by Apr 1, 2023, 3:24 PM

      I think you might find THIS THREAD and its discussion relevant to your problem.

      • J
        Jeff Michaels @Alan Kilborn
        last edited by Apr 1, 2023, 6:04 PM

        @Alan-Kilborn thank you. With a little work, I was able to create this with the help of the thread you suggested. It’s a messy solution (just like my question was), but it’s working on the 30 files I tested, plus none of the actual lyrics are removed by mistake.

        (?s-i:<pg(?:(?!</pg>).)*?BAR(?:(?!</pg>).)*?BREAK(?:(?!</pg>).)*?</pg>)\r\t
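
For anyone who wants to try the same idea outside Notepad++, here is a minimal Python sketch of the regex above. It rests on a few assumptions: Python's `re` module stands in for Notepad++'s regex engine, `re.DOTALL` replaces `(?s)`, matching stays case-sensitive as with `-i`, and the trailing `\r\t` is swapped for an optional line break of any style. The `sample` string and the `strip_break_pages` name are made up for illustration.

```python
import re

# Python port of the regex above. Assumption: the trailing line break is
# optional and may be \r\n, \n or \r, since the real files' line endings
# are not known here.
PG_BREAK_PAGE = re.compile(
    r"<pg(?:(?!</pg>).)*?BAR(?:(?!</pg>).)*?BREAK(?:(?!</pg>).)*?</pg>(?:\r\n|\n|\r)?",
    re.DOTALL,
)


def strip_break_pages(xml_text: str) -> str:
    """Remove every <pg>...</pg> block that contains BAR followed by BREAK."""
    return PG_BREAK_PAGE.sub("", xml_text)


# Hypothetical, shortened sample modelled on the three pages in the question.
sample = (
    '<pg id="lyrics.16" t="157.09,15.88">\n'
    '<ln>\n<lyr s="YEAH " t="172.29,.58"/>\n</ln>\n'
    '</pg>\n'
    '<pg id="lyrics.17" t="172.97,7.93">\n'
    '<ln>\n<lyr s="8 " t="174.16,.21"/>\n<lyr s="BAR " t="174.37,.24"/>\n</ln>\n'
    '<ln>\n<lyr s="INSTRUMENTAL " t="174.61,4.52"/>\n</ln>\n'
    '<ln>\n<lyr s="BREAK " t="179.13,1.67"/>\n</ln>\n'
    '</pg>\n'
    '<pg id="lyrics.18" t="180.9,9.72">\n'
    '<ln>\n<lyr s="WOAH " t="186.55,.25"/>\n</ln>\n'
    '</pg>\n'
)

cleaned = strip_break_pages(sample)
print(cleaned)
```

Because each `.` is guarded by `(?!</pg>)`, a match can never run from one page into the next, so a page that has only one of the two words is left alone.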
        
        • D
          dinkumoil @Jeff Michaels
          last edited by dinkumoil Apr 1, 2023, 6:21 PM Apr 1, 2023, 6:19 PM

          @Jeff-Michaels

          You could install the XmlTools plugin (available via PluginsAdmin) and use its XSL Transformation feature to resolve your issue.

          After installing the plugin, save the following, for example as DeletePgTag.xsl, with UTF-8 character encoding:

          ```
          <?xml version="1.0" encoding="UTF-8"?>

          <xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

            <xsl:output
              method="xml"
              omit-xml-declaration="no"
              indent="yes"
              encoding="UTF-8"
            />

            <xsl:template match="@*|node()">
              <xsl:copy>
                <xsl:apply-templates select="@*|node()"/>
              </xsl:copy>
            </xsl:template>

            <xsl:template match="/root/pg[ln[lyr[@s='8 '] and lyr[@s='BAR ']] and ln/lyr[@s='INSTRUMENTAL '] and ln/lyr[@s='BREAK ']]">
              <xsl:text></xsl:text>
            </xsl:template>

          </xsl:transform>
          ```
          

          Please note:

          • Since you did not tell us the exact XML path to your <pg> nodes, I assumed they sit directly under a root element named root; see the match expression in line 18, <xsl:template match="/root/pg .... You have to adapt that path to your files.

          • Also, if the encoding of your XML file is not UTF-8, change the encoding attribute of the xsl:output node in line 9 accordingly.

          Now:

          1. Load your XML file with Notepad++. Ensure that it is the active tab.
          2. Navigate to (menu) -> Plugins -> XML Tools -> XSL Transformation.
          3. In the dialog that pops up, select the XSL file you created from the code above.
          4. Click the Transform button.

          A new tab will be opened with the changed content of your original XML file.
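
For a scripted variant of the same structure-aware approach, here is a rough Python sketch using the standard library's `xml.etree.ElementTree`. It is not what XML Tools does internally, only an illustration under assumptions: the wrapper element `root` is as hypothetical here as in the XSL above, the function name `remove_break_pages` is invented, matching on the words BAR and BREAK alone is taken to be enough, and the files use no XML namespaces. The serialized output may differ from the input in whitespace and declaration.

```python
import xml.etree.ElementTree as ET


def remove_break_pages(xml_text, words=("BAR", "BREAK")):
    """Drop every <pg> whose <lyr s="..."> values include all given words."""
    root = ET.fromstring(xml_text)
    # Collect (parent, child) pairs first so the tree is not changed
    # while it is being walked.
    doomed = []
    for parent in root.iter():
        for pg in parent.findall("pg"):
            # The s attributes carry a trailing space ("BAR "), so strip it.
            sung = {lyr.get("s", "").strip() for lyr in pg.iter("lyr")}
            if all(word in sung for word in words):
                doomed.append((parent, pg))
    for parent, pg in doomed:
        parent.remove(pg)
    return ET.tostring(root, encoding="unicode")


# Hypothetical, shortened sample; <root> is an assumed wrapper element.
sample = (
    '<root>'
    '<pg id="lyrics.16"><ln><lyr s="YEAH " t="172.29,.58"/></ln></pg>'
    '<pg id="lyrics.17">'
    '<ln><lyr s="8 " t="174.16,.21"/><lyr s="BAR " t="174.37,.24"/></ln>'
    '<ln><lyr s="INSTRUMENTAL " t="174.61,4.52"/></ln>'
    '<ln><lyr s="BREAK " t="179.13,1.67"/></ln>'
    '</pg>'
    '<pg id="lyrics.18"><ln><lyr s="WOAH " t="186.55,.25"/></ln></pg>'
    '</root>'
)

result = remove_break_pages(sample)
remaining_ids = [pg.get("id") for pg in ET.fromstring(result).iter("pg")]
print(remaining_ids)
```

Comparing whole attribute values rather than searching raw text means a lyric word that merely contains BAR or BREAK would not trigger a deletion.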

          [EDIT]
          Uh, too late.
          [/EDIT]

          • G
            guy038
            last edited by guy038 Apr 1, 2023, 7:11 PM Apr 1, 2023, 7:03 PM

            Hello @jeff-Michaels, @alan-kilborn, @dinkumoil and All,

            Congratulations on brilliantly achieving your goal with the link provided by @alan-kilborn ;-)) You seem to be a regex guru, too!


            Regarding your solution, I suppose that the end should be \r\n ( instead of \r\t )

            Now, I think that you can simplify your regex as :

            (?xs-i) <pg \x20 (?: (?! </pg> ) . )*? BAR .*? BREAK .*? </pg> \R # Delete any <pg> section containing BAR and BREAK

            • Indeed, there is no need to check that the string </pg> does not occur between the words BAR and BREAK, or between BREAK and </pg>. Once the negative look-ahead has located the right <pg...> block, the lazy quantifiers that come next force the correct detection of that block, provided every page that contains BAR also contains BREAK after it!

            • You may use \R, as it stands for any kind of line break ( \r\n, \n and \r )

            • Note that I use the free-spacing mode ( (?x... ), which lets you separate the main parts of the regex and add comments after the # char
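
The points above can be tried out in Python with a small sketch. Assumptions: Python's `re.VERBOSE` plays the role of `(?x)`, `re.DOTALL` the role of `(?s)`, and since Python's `re` has no `\R` the line break is spelled out as an alternation. The names `SIMPLIFIED` and `GUARDED`, both sample strings, and the edge case at the end (a page with BAR but no BREAK) are hypothetical; none of the posted samples contains such a page.

```python
import re

# Stand-in for \R, which Python's re module does not support.
LINE_BREAK = r"(?: \r\n | \n | \r )"

# Simplified form: only the stretch before BAR is guarded by the look-ahead.
SIMPLIFIED = re.compile(
    r"""
    <pg \x20                 # start of a page tag
    (?: (?! </pg> ) . )*?    # anything, but not past the closing tag
    BAR .*? BREAK .*?        # both marker words, in this order
    </pg>                    # closing tag
    """ + LINE_BREAK,
    re.VERBOSE | re.DOTALL,
)

# Fully guarded form, as in the earlier solution in this thread.
GUARDED = re.compile(
    r"""
    <pg \x20
    (?: (?! </pg> ) . )*? BAR
    (?: (?! </pg> ) . )*? BREAK
    (?: (?! </pg> ) . )*? </pg>
    """ + LINE_BREAK,
    re.VERBOSE | re.DOTALL,
)

# Hypothetical sample with Windows (\r\n) line endings.
crlf_sample = (
    '<pg id="lyrics.16" t="157.09,15.88">\r\n<lyr s="YEAH " t="172.29,.58"/>\r\n</pg>\r\n'
    '<pg id="lyrics.17" t="172.97,7.93">\r\n<lyr s="BAR " t="174.37,.24"/>\r\n'
    '<lyr s="BREAK " t="179.13,1.67"/>\r\n</pg>\r\n'
    '<pg id="lyrics.18" t="180.9,9.72">\r\n<lyr s="WOAH " t="186.55,.25"/>\r\n</pg>\r\n'
)
cleaned = SIMPLIFIED.sub("", crlf_sample)
print(cleaned)

# Hypothetical edge case: page "a" has BAR but no BREAK.
edge_case = (
    '<pg id="a">\n<lyr s="BAR "/>\n</pg>\n'
    '<pg id="b">\n<lyr s="LA "/>\n</pg>\n'
    '<pg id="c">\n<lyr s="BAR "/>\n<lyr s="BREAK "/>\n</pg>\n'
)
print(repr(SIMPLIFIED.sub("", edge_case)))
print(repr(GUARDED.sub("", edge_case)))
```

On the three-page sample both forms behave the same. In the edge case the unguarded `.*?` can travel from the BAR in page "a" to the BREAK in page "c", so if such pages might exist in the 7000 files, the fully guarded form looks like the safer choice.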

            Best Regards,

            guy038
