    Need help with regex for XML removal

    • Piotr Stefański

      Hello,

      I’m trying to find a regex that would allow me to remove every XML <machine …> … </machine> element that contains <control type="gambling" … />.
      Here’s the text I have:

      <machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
      	<description>3 Bags Full (3VXFC5345, New Zealand)</description>
      	<year>1996</year>
      	<manufacturer>Aristocrat</manufacturer>
      	<rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
      	<device_ref name="mc6809e"/>
      	<sample name="tick"/>
      	<input players="1" coins="2">
      		<control type="gambling" buttons="21"/>
      	</input>
      	<driver status="good" emulation="good" savestate="unsupported"/>
      </machine>
      

      Basically I just want to remove every <machine …> … </machine> element that contains type="gambling".

      Here’s what I have tried but for some reason it does not work:

      Find What = (?s)<(machine)\x20((?!<\1).)+?type="gambling".+?</\1>
      Replace With = LEFT EMPTY
      Search Mode = REGULAR EXPRESSION
      Dot Matches Newline = NOT CHECKED
      

      The weird part is sometimes I manage to find 1 instance (out of thousands) while sometimes it fails completely. Any help will be greatly appreciated.

      Kind Regards,
      Piotr

        • Alan Kilborn @Piotr Stefański

        @Piotr-Stefański

        This or something similar has potential to work:

        Find: (?s-i)<machine.+?(?:(type="gambling".+?</machine>\R)|(?:</machine>\R))
        Replace: ?{1}:$0

        Or maybe it’s just a variant of yours that will still be too complex for the regex engine.

          • Coises @Piotr Stefański

          @Piotr-Stefański said in Need help with regex for XML removal:

          it fails completely

          As in it claims to find nothing, or as in it shows “Invalid regular expression” at the bottom?

          If the latter, please hover over the three dots to the right of that message and tell us if it says something about the expression being too complex.

            • Coises @Piotr Stefański

            @Piotr-Stefański said in Need help with regex for XML removal:

            I’m trying to find a regex that would allow me to remove every XML <machine …> … </machine> element that contains <control type="gambling" … />.

            For what it’s worth:

            If I were trying to solve this problem for myself, I’d break it into a couple of steps instead of trying to write a single, very clever regular expression.

            If it is safe to assume that <control type="gambling" never occurs outside of a <machine …>…</machine> pair, then I would first do:

            Find what : <control type="gambling".*?</machine>
            Replace with : </machine DELETE>
            Search Mode: Regular expression
            . matches newline: checked

            Then I’d do:

            Find what : <machine .*?</machine(*SKIP) DELETE>\R
            Replace with : (empty)
            Search Mode: Regular expression
            . matches newline: checked
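
For anyone who wants to try the same two-step idea outside Notepad++, here is a minimal Python sketch. It is not the exact recipe above: Python's `re` module has no `(*SKIP)` verb and no `\R`, so the second step uses a negative lookahead to stop at the first `</machine`, and `\R` is spelled out as `(?:\r\n|\n|\r)`. The shortened sample data and the function name `remove_gambling_two_step` are made up for illustration.

```python
import re

# Hypothetical, shortened sample data: three <machine> elements, one is "gambling".
SAMPLE = """\
<machine name="a">
    <input players="1">
    </input>
</machine>
<machine name="b">
    <input players="1">
        <control type="gambling" buttons="21"/>
    </input>
</machine>
<machine name="c">
    <input players="1">
        <control type="other" buttons="21"/>
    </input>
</machine>
"""

NEWLINE = r"(?:\r\n|\n|\r)"  # stand-in for \R, which Python's re does not support

# Step 1: from the gambling control to the end of its machine, leave a marker.
STEP1 = re.compile(r'<control type="gambling".*?</machine>', re.DOTALL)

# Step 2: delete any machine whose closing tag carries the marker.
# The lookahead keeps the match from running past an unmarked </machine>.
STEP2 = re.compile(
    r"<machine (?:(?!</machine).)*?</machine DELETE>" + NEWLINE, re.DOTALL
)


def remove_gambling_two_step(text: str) -> str:
    marked = STEP1.sub("</machine DELETE>", text)
    return STEP2.sub("", marked)


result = remove_gambling_two_step(SAMPLE)
print(result)
```

As in the original steps, this assumes `<control type="gambling"` never occurs outside a `<machine …>…</machine>` pair.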

              • Alan Kilborn @Coises

              @Coises said in Need help with regex for XML removal:

              I’d break it into a couple of steps instead of trying to write a single, very clever regular expression.

              Ha, yea. The use of (*SKIP) pretty much makes it just as clever. :-)

              Referencing HERE we have a good explanation:

              if at current position in string, the regex engine can match the part before (*SKIP) but cannot match the part after (*SKIP), the regex engine discards any further search and the current match attempt just fails. So the regex engine must advance to the location where the zero-width (*SKIP) verb occurs, for a new match attempt

                • guy038

                Hello, @piotr-stefański, @alan-kilborn, @coises and All,

                @piotr-stefański, instead of the regex (?s)<(machine)\x20((?!<\1).)+?type="gambling".+?</\1>, preferably use this one:

                • SEARCH (?s)<machine\x20(?:(?!<machine).)+?type="gambling".+?</machine>

                • REPLACE Leave EMPTY

                This new version does not store any capture group! Maybe you’ll get better results ;-))
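
The same non-capturing pattern can be tried in Python as a rough cross-check. This is only a sketch: Python's `re` engine is not the Boost engine used by Notepad++, so it says nothing about the "too complex" behaviour there, and the shortened sample data is hypothetical.

```python
import re

# Hypothetical, shortened sample data: three <machine> elements, one is "gambling".
SAMPLE = """\
<machine name="a">
    <input players="1">
    </input>
</machine>
<machine name="b">
    <input players="1">
        <control type="gambling" buttons="21"/>
    </input>
</machine>
<machine name="c">
    <input players="1">
        <control type="other" buttons="21"/>
    </input>
</machine>
"""

# (?:(?!<machine).)+? advances one character at a time and refuses to step
# onto the start of the next <machine, so a match cannot begin in one
# element and find type="gambling" in a later one.
GAMBLING_MACHINE = re.compile(
    r'<machine\x20(?:(?!<machine).)+?type="gambling".+?</machine>',
    re.DOTALL,  # same effect as the leading (?s)
)

cleaned = GAMBLING_MACHINE.sub("", SAMPLE)
print(cleaned)
```

Note that the line break after each deleted `</machine>` is left behind, exactly as with the Notepad++ version.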


                Now, let’s start, for example, with this INPUT text :

                <machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
                	<description>3 Bags Full (3VXFC5345, New Zealand)</description>
                	<year>1996</year>
                	<manufacturer>Aristocrat</manufacturer>
                	<rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
                	<device_ref name="mc6809e"/>
                	<sample name="tick"/>
                	<input players="1" coins="2">
                	</input>
                	<driver status="good" emulation="good" savestate="unsupported"/>
                </machine>
                
                <machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
                	<description>3 Bags Full (3VXFC5345, New Zealand)</description>
                	<year>1996</year>
                	<manufacturer>Aristocrat</manufacturer>
                	<rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
                	<device_ref name="mc6809e"/>
                	<sample name="tick"/>
                	<input players="1" coins="2">
                		<control type="gambling" buttons="21"/>
                	</input>
                	<driver status="good" emulation="good" savestate="unsupported"/>
                </machine>
                
                <machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
                	<description>3 Bags Full (3VXFC5345, New Zealand)</description>
                	<year>1996</year>
                	<manufacturer>Aristocrat</manufacturer>
                	<rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
                	<device_ref name="mc6809e"/>
                	<sample name="tick"/>
                	<input players="1" coins="2">
                		<control type="other" buttons="21"/>
                	</input>
                	<driver status="good" emulation="good" savestate="unsupported"/>
                </machine>
                
                <machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
                	<description>3 Bags Full (3VXFC5345, New Zealand)</description>
                	<year>1996</year>
                	<manufacturer>Aristocrat</manufacturer>
                	<rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
                	<device_ref name="mc6809e"/>
                	<sample name="tick"/>
                	<input players="1" coins="2">
                		<control type="gambling" buttons="21"/>
                	</input>
                	<driver status="good" emulation="good" savestate="unsupported"/>
                </machine>
                

                With the @coises method, we use his first regex S/R :

                • SEARCH (?s)<control type="gambling".*?</machine>

                • REPLACE </machine DELETE>

                To get the temporary text below :

                <machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
                	<description>3 Bags Full (3VXFC5345, New Zealand)</description>
                	<year>1996</year>
                	<manufacturer>Aristocrat</manufacturer>
                	<rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
                	<device_ref name="mc6809e"/>
                	<sample name="tick"/>
                	<input players="1" coins="2">
                	</input>
                	<driver status="good" emulation="good" savestate="unsupported"/>
                </machine>
                
                <machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
                	<description>3 Bags Full (3VXFC5345, New Zealand)</description>
                	<year>1996</year>
                	<manufacturer>Aristocrat</manufacturer>
                	<rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
                	<device_ref name="mc6809e"/>
                	<sample name="tick"/>
                	<input players="1" coins="2">
                		</machine DELETE>
                
                <machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
                	<description>3 Bags Full (3VXFC5345, New Zealand)</description>
                	<year>1996</year>
                	<manufacturer>Aristocrat</manufacturer>
                	<rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
                	<device_ref name="mc6809e"/>
                	<sample name="tick"/>
                	<input players="1" coins="2">
                		<control type="other" buttons="21"/>
                	</input>
                	<driver status="good" emulation="good" savestate="unsupported"/>
                </machine>
                
                <machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
                	<description>3 Bags Full (3VXFC5345, New Zealand)</description>
                	<year>1996</year>
                	<manufacturer>Aristocrat</manufacturer>
                	<rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
                	<device_ref name="mc6809e"/>
                	<sample name="tick"/>
                	<input players="1" coins="2">
                		</machine DELETE>
                

                Then, with his second regex S/R :

                • SEARCH (?s)<machine .*?</machine(*SKIP) DELETE>\R

                • REPLACE Leave EMPTY

                We end up with our expected OUTPUT text :

                <machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
                	<description>3 Bags Full (3VXFC5345, New Zealand)</description>
                	<year>1996</year>
                	<manufacturer>Aristocrat</manufacturer>
                	<rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
                	<device_ref name="mc6809e"/>
                	<sample name="tick"/>
                	<input players="1" coins="2">
                	</input>
                	<driver status="good" emulation="good" savestate="unsupported"/>
                </machine>
                
                
                <machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
                	<description>3 Bags Full (3VXFC5345, New Zealand)</description>
                	<year>1996</year>
                	<manufacturer>Aristocrat</manufacturer>
                	<rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
                	<device_ref name="mc6809e"/>
                	<sample name="tick"/>
                	<input players="1" coins="2">
                		<control type="other" buttons="21"/>
                	</input>
                	<driver status="good" emulation="good" savestate="unsupported"/>
                </machine>
                
                

                However, note that his second regex S/R could also have been written without any control verb, with the following S/R :

                • SEARCH (?s)<machine (?:(?!</machine).)*?</machine DELETE>\R

                • REPLACE Leave EMPTY

                REMARK :

                See the fundamental difference between these two regex S/R syntaxes :

                • SEARCH (?s)<machine .*?</machine(*SKIP) DELETE>\R

                • REPLACE Leave EMPTY

                and

                • SEARCH (?s)<machine .*?</machine\K DELETE>\R

                • REPLACE Leave EMPTY

                In the first case :

                • IF the \x20DELETE>\R string is found, all this specific section will be deleted

                • IF the \x20DELETE>\R string is NOT found, as the back-tracking process cannot occur, all this specific section is just ignored

                But, in the second case :

                • IF the \x20DELETE>\R string is found, only the part \x20DELETE>\R will be deleted, due to the \K syntax

                • IF the \x20DELETE>\R string is NOT found, NO replacement occurs, due to the \K syntax


                Now, a third alternative would be to simply use the generic regex <What I don't want>(*SKIP)(*F)|<What I want>. Indeed :

                • We do NOT want to match the <machine .......</machine> sections that do not contain the type="gambling" string; these sections are simply skipped

                • We DO want to match the <machine .......</machine> sections that contain the type="gambling" string, in order to delete these specific sections with the S/R

                This leads to the functional regex S/R :

                • SEARCH (?s)<machine (?:(?!type="gambling").)+?</machine>(*SKIP)(*F)|<machine .+?</machine>

                • REPLACE Leave EMPTY
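
Python's standard `re` module has no `(*SKIP)(*F)`, but the same "what I don't want | what I want" idea can be approximated with a capture group and a replacement function. This is an emulation under that assumption, not the Boost syntax: the unwanted alternative is matched first and put back unchanged, and only the alternative captured in group 1 is deleted. The sample data is hypothetical.

```python
import re

# Hypothetical, shortened sample data: three <machine> elements, one is "gambling".
SAMPLE = """\
<machine name="a">
    <input players="1">
    </input>
</machine>
<machine name="b">
    <input players="1">
        <control type="gambling" buttons="21"/>
    </input>
</machine>
<machine name="c">
    <input players="1">
        <control type="other" buttons="21"/>
    </input>
</machine>
"""

PATTERN = re.compile(
    # Alternative 1: a machine WITHOUT type="gambling" (to be left alone).
    r'<machine (?:(?!type="gambling").)+?</machine>'
    # Alternative 2: any other machine, captured as group 1 (to be deleted).
    r"|(<machine .+?</machine>)",
    re.DOTALL,
)


def keep_or_delete(match: re.Match) -> str:
    # Group 1 is set only when alternative 2 matched.
    return "" if match.group(1) is not None else match.group(0)


cleaned = PATTERN.sub(keep_or_delete, SAMPLE)
print(cleaned)
```

The effect is the same as skipping the unwanted sections: they are consumed by the match, so the second alternative can never start inside them.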

                And, starting with the INPUT text again, we would obtain, once more, our expected OUTPUT :

                <machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
                	<description>3 Bags Full (3VXFC5345, New Zealand)</description>
                	<year>1996</year>
                	<manufacturer>Aristocrat</manufacturer>
                	<rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
                	<device_ref name="mc6809e"/>
                	<sample name="tick"/>
                	<input players="1" coins="2">
                	</input>
                	<driver status="good" emulation="good" savestate="unsupported"/>
                </machine>
                
                
                
                <machine name="3bagflnz" sourcefile="aristocrat/aristmk4.cpp" cloneof="3bagflvt" romof="3bagflvt" sampleof="3bagflvt">
                	<description>3 Bags Full (3VXFC5345, New Zealand)</description>
                	<year>1996</year>
                	<manufacturer>Aristocrat</manufacturer>
                	<rom name="2vas004.u59" merge="2vas004.u59" size="8192" crc="84226547" sha1="df9c2c01a7ac4d930c06a8c4863853ddb1a2adbe" region="maincpu" offset="2000"/>
                	<device_ref name="mc6809e"/>
                	<sample name="tick"/>
                	<input players="1" coins="2">
                		<control type="other" buttons="21"/>
                	</input>
                	<driver status="good" emulation="good" savestate="unsupported"/>
                </machine>
                
                
                

                Best Regards,

                guy038

                  • Alan Kilborn @Alan Kilborn

                  @Alan-Kilborn said in Need help with regex for XML removal:

                  Find: (?s-i)<machine.+?(?:(type="gambling".+?</machine>\R)|(?:</machine>\R))

                  Replace: ?{1}:$0

                  I feel somewhat slighted as @guy038 avoided commenting on my proposed solution. :-(

                  Mine has a bit of symmetry with:

                  What_I_don’t_want(*SKIP)(*F)|What_I_want

                  because I also used an | to specify non-wanted vs. wanted sections.

                  If a non-wanted section (according to the OP’s definition of what he doesn’t want) appears, I captured it into group1 and then my replacement replaces that whole <machine>…</machine> section with nothing (because there is nothing between the } and the :), otherwise it replaces that section with itself via $0 (thus, keeping it).
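
The conditional replacement `?{1}:$0` is Boost-specific, so here is a hedged Python sketch of the same logic using a replacement function. Assumptions: `\R` is written out as `(?:\r\n|\n|\r)`, the `(?s-i)` prefix becomes `re.DOTALL` (Python is case-sensitive by default), and the sample data is hypothetical.

```python
import re

# Hypothetical, shortened sample data: three <machine> elements, one is "gambling".
SAMPLE = """\
<machine name="a">
    <input players="1">
    </input>
</machine>
<machine name="b">
    <input players="1">
        <control type="gambling" buttons="21"/>
    </input>
</machine>
<machine name="c">
    <input players="1">
        <control type="other" buttons="21"/>
    </input>
</machine>
"""

NEWLINE = r"(?:\r\n|\n|\r)"  # stand-in for \R

PATTERN = re.compile(
    r"<machine.+?(?:"
    r'(type="gambling".+?</machine>' + NEWLINE + r")"  # group 1: non-wanted section
    r"|</machine>" + NEWLINE +                          # otherwise: a wanted section
    r")",
    re.DOTALL,
)


def replacement(match: re.Match) -> str:
    # Equivalent of ?{1}:$0 -> empty if group 1 matched, else the whole match.
    return "" if match.group(1) is not None else match.group(0)


cleaned = PATTERN.sub(replacement, SAMPLE)
print(cleaned)
```

Because the line break is part of each match, the deleted sections leave no empty lines behind.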

                    • Piotr Stefański @Alan Kilborn

                    Thank you everyone, this is a real treasure trove of info.
                    I’ll experiment with all of the above.

                    Again, huge thanks!

                      • guy038

                      Hi, @alan-kilborn, @piotr-stefański, @coises and All,

                      @alan-kilborn, I’m rather disappointed that you thought I’d intentionally omitted to comment on your solution :-(

                      As, at this stage, @coises had found a solution that used control verbs, and, what’s more, you had given him a glowing review, I just focused on his solution !

                      No, I simply didn’t notice your response. Sorry for this “faux pas” !


                      So, allow me to use the Free spacing mode again ! Thus, your regex S/R can be expressed as :

                      SEARCH  :   (?xs-i) <machine .+? (?: ( type="gambling" .+? </machine> \R ) | </machine> \R )
                                                             ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
                                                                         Group 1
                      
                      REPLACE :   ?1:$0
                      

                      And, indeed, you’ve found a very powerful solution because, if we use ?1$0: as the replacement instead, it inverts the logic and deletes ONLY the sections which do not contain the type="gambling" string !


                      So, if we assume that :

                      • BSR = <machine(?:\x20|>)

                      • ESR = </machine>\R

                      • FR = type="gambling"

                      This leads to the generic regex S/R, below :

                      SEARCH (?s-i)BSR.+?(?:(FR.+?ESR)|ESR)

                      REPLACE RR

                      And :

                      • IF RR = ?1:$0, the complete BSR....ESR sections, WITH the FR string, are deleted

                      • IF RR = ?1$0:, the complete BSR....ESR sections, WITHOUT the FR string, are deleted
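
The generic BSR / ESR / FR template can be sketched in Python as a small helper. This is an illustration only: the function name `delete_sections` and its `keep_matching` flag are invented here to stand in for the two RR variants, `\R` is written out as `(?:\r\n|\n|\r)`, and the sample data is hypothetical.

```python
import re

# Hypothetical, shortened sample data: three <machine> elements, one is "gambling".
SAMPLE = """\
<machine name="a">
    <input players="1">
    </input>
</machine>
<machine name="b">
    <input players="1">
        <control type="gambling" buttons="21"/>
    </input>
</machine>
<machine name="c">
    <input players="1">
        <control type="other" buttons="21"/>
    </input>
</machine>
"""

BSR = r"<machine(?:\x20|>)"             # Begin Search Region
ESR = r"</machine>(?:\r\n|\n|\r)"       # End Search Region (\R spelled out)
FR = r'type="gambling"'                 # string that decides the section's fate


def delete_sections(text: str, bsr: str, esr: str, fr: str,
                    keep_matching: bool = False) -> str:
    """Delete whole bsr...esr sections depending on whether they contain fr.

    keep_matching=False mimics RR = ?1:$0  (sections WITH fr are deleted).
    keep_matching=True  mimics RR = ?1$0:  (sections WITHOUT fr are deleted).
    """
    pattern = re.compile(f"{bsr}.+?(?:({fr}.+?{esr})|{esr})", re.DOTALL)

    def replacement(match: re.Match) -> str:
        has_fr = match.group(1) is not None
        return match.group(0) if has_fr == keep_matching else ""

    return pattern.sub(replacement, text)


without_gambling = delete_sections(SAMPLE, BSR, ESR, FR)
only_gambling = delete_sections(SAMPLE, BSR, ESR, FR, keep_matching=True)
print(without_gambling)
print(only_gambling)
```

Since the template itself uses group 1, any grouping inside `bsr`, `esr` or `fr` should be non-capturing, as in the `BSR` value above.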


                      Thus, Alan, I’m going to add a new BLOG post about this powerful and simple method, soon ;-))

                      Best Regards,

                      guy038

                        • Coises @guy038

                        @guy038 Before we get too excited about any of our proposed solutions, I hope we hear more from @Piotr-Stefański, the original poster.

                        When I tried copying his example data, duplicating and creating a variant with a different control type, and then making many copies of each all in one file, his original regular expression worked.

                        Unless he reports back to us that one or more of our proposed solutions worked on his actual data, or unless someone else manages to construct an example on which his expression fails and one or more of our solutions works, we don’t know that we have solved anything.

                        We don’t even know what the original problem was. I’m guessing the “complexity” message, which for some reason seems to be cropping up a lot lately; but he has not confirmed that.
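
A test along the lines described above (many copies of a gambling and a non-gambling variant in one file) can be scripted. The sketch below is only an approximation: it uses Python's `re` engine rather than the Boost engine in Notepad++, so a pass here does not rule out a complexity error there. The machine template, the copy count and the helper name `build_test_file` are all hypothetical.

```python
import re

# Hypothetical, shortened machine template with a variable control type.
MACHINE_TEMPLATE = """\
<machine name="m{index}">
    <input players="1" coins="2">
        <control type="{control}" buttons="21"/>
    </input>
</machine>
"""


def build_test_file(copies: int) -> str:
    """Alternate gambling and non-gambling machines, `copies` in total."""
    parts = []
    for index in range(copies):
        control = "gambling" if index % 2 == 0 else "other"
        parts.append(MACHINE_TEMPLATE.format(index=index, control=control))
    return "".join(parts)


# The original poster's expression, with (?s) expressed as re.DOTALL.
ORIGINAL = re.compile(
    r'<(machine)\x20((?!<\1).)+?type="gambling".+?</\1>', re.DOTALL
)

text = build_test_file(200)
cleaned, removed = ORIGINAL.subn("", text)
print(f"removed {removed} of 200 machines")
```

Scaling `copies` up, or swapping in real data, would be the way to look for an input on which the expression actually fails.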

                          • Alan Kilborn @guy038

                          @guy038 :

                          Interesting. I hadn’t thought of it as any sort of “general” solution to a problem!

                          [ But, really, there already was the makings of a general solution to half of the problem, from you (ref. HERE) ]

                          A couple of notes:


                          Note 1:

                          IF RR = ?1$0:, the complete BSR…ESR sections, WITHOUT the FR string, are deleted

                          In this variant of the replace expression, the : isn’t necessary, thus:

                          IF RR = ?1$0, the complete…


                          Note 2:

                          Since the overall regex uses group1, it isn’t available in the BSR, ESR, FR and even the RR subexpressions.

                          Thus a user of this would have to keep in mind that if he is using further grouping inside these expressions, that he has to think in terms of group2 and above.

                          This is definitely unlike another templated regex solution I use a lot, ref. HERE, where the user does not have to keep this in mind.
