Regex to delete sections of XML

  • C
    Christopher Phillips
    last edited by Jan 6, 2019, 12:56 AM

    How can I use a regex to find and delete all XML trans-unit elements that contain approved=“yes”?
    I have tried the expression below. It finds the sections that have approved=“yes”, but it becomes greedy when a section doesn’t:
    <trans.?(approved=“yes”).?unit>?\n

    Find
    <trans-unit id=“1” identifier=“e4c7” approved=“yes”>
    <source>Hello world
    </trans-unit>

    Ignore
    <trans-unit id=“5” identifier=“e4c7” approved=“no”>
    <source>Welcome to the world
    </trans-unit>

    • T
      Terry R
      last edited by Terry R Jan 6, 2019, 1:43 AM Jan 6, 2019, 1:41 AM

      @Christopher-Phillips
      This should help.

      Find what:^(<trans.+?approved=)(?=“yes”)(?s).+?trans-unit>\R
      Replace with: (leave this field empty)

      This finds the occurrences with the value “yes”, and with the Replace field left empty it removes them. Note that the regex ends with \R, which matches the line break (carriage return/line feed), so you don’t end up with extra blank lines afterwards. Also note that I started the regex with ^, meaning start of line; I assume these XML strings ALL start at the beginning of a line.

      Also note that the " characters need to be exactly as they appear in your file. If the regex doesn’t work initially, change the quotes in my regex to match the ones in your XML strings.
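
The quote mismatch can also be handled outside the editor. The following is only a minimal Python sketch, not part of the Notepad++ workflow: it assumes the text is available as a Python string, and the helper name `normalize_quotes` is made up for illustration. It replaces the curly double quotes (U+201C and U+201D) with the straight quote (U+0022) so that a regex written with straight quotes can match.

```python
# Map the curly quotes U+201C and U+201D to the straight quote U+0022.
CURLY_TO_STRAIGHT = {0x201C: '"', 0x201D: '"'}


def normalize_quotes(text: str) -> str:
    """Return text with curly double quotes replaced by straight ones.

    Hypothetical helper, named here for illustration only.
    """
    return text.translate(CURLY_TO_STRAIGHT)


# An opening tag written with curly quotes, as in the first post.
sample = "<trans-unit id=\u201c1\u201d approved=\u201cyes\u201d>"
print(normalize_quotes(sample))
```

Whether to change the quotes in the file or in the regex depends on which form the real XML uses; a valid XML file would normally use straight quotes.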

      Terry

      • G
        guy038
        last edited by guy038 Jan 6, 2019, 2:33 AM Jan 6, 2019, 2:16 AM

        Hello, @christopher-phillips, and All,

        To delete all the <trans-unit......>.........</trans-unit> areas which contain approved="yes", just run this regex S/R :

        SEARCH (?s)<(trans-unit)\x20((?!<\1).)+?approved="yes".+?</\1>\R

        REPLACE Leave EMPTY

        • Preferably, check the Wrap around option

        • Select the Regular expression search mode

        • Click, once, on the Replace All button

        Et voilà !


        Test it against the text below :

        <trans-unit id="1" identifier="e4c7" approved="yes"> <source>Hello world </trans-unit>
        
        <trans-unit id="5" identifier="e4c7" approved="no"> <source>Welcome to the world </trans-unit>
        
        <trans-unit id="1" identifier="e4c7" approved="yes">
        <source>Hello world
        </trans-unit>
        
        <trans-unit id="5" identifier="e4c7" approved="no">
        <source>Welcome to the world
        </trans-unit>
        
        <trans-unit id="1"
         identifier="e4c7"
         approved="yes">
        <source>Hello world
        </trans-unit>
        
        <trans-unit id="5"
         identifier="e4c7" approved="no">
        <source>Welcome to the world
        </trans-unit>
        

        Notes :

        • This search regex uses the usual Quotation Mark symbol " ( \x{0022} ) and not the Left and Right Double Quotation Mark “ and ” ( \x{201C} and \x{201D} ). Change the double quotes, if necessary !

        • The first part (?s) means that dot will match any single char ( standard or EOL chars )

        • Then, the regex looks for the < symbol, followed by the string trans-unit, stored as group 1 because of the parentheses, followed by a space char ( <(trans-unit)\x20 )

        • Next, the part ((?!<\1).)+?approved="yes" tries to find the smallest range of characters, even across multiple lines, up to the string approved="yes", ONLY IF the string <trans-unit cannot be found at any position of that range

        • Finally, the part .+?</\1>\R tries to match the smallest range of characters, even across multiple lines, up to the string </trans-unit>, followed by the usual EOL characters of the current line
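
For readers who want to try the idea outside Notepad++, the notes above can be sketched in Python. This is only an approximation under stated assumptions: Python's `re` module is a different engine from the one in Notepad++, it has no `\R`, so the line break is spelled out explicitly, and the function name `remove_approved_units` is hypothetical.

```python
import re

# Python's re module has no \R, so the line break is written out explicitly.
EOL = r"(?:\r\n|\n|\r)"

# re.DOTALL plays the role of the leading (?s): the dot also matches EOL chars.
APPROVED_UNIT = re.compile(
    r'<(trans-unit)\x20((?!<\1).)+?approved="yes".+?</\1>' + EOL,
    re.DOTALL,
)


def remove_approved_units(text: str) -> str:
    """Delete every trans-unit element whose opening tag has approved="yes"."""
    return APPROVED_UNIT.sub("", text)


sample = (
    '<trans-unit id="5" identifier="e4c7" approved="no">\n'
    '<source>Welcome to the world\n'
    '</trans-unit>\n'
    '<trans-unit id="1"\n'
    ' identifier="e4c7"\n'
    ' approved="yes">\n'
    '<source>Hello world\n'
    '</trans-unit>\n'
)
print(remove_approved_units(sample))
```

The negative look-ahead is what keeps a match that starts at the approved="no" unit from running on into the next unit: as soon as the range would contain another <trans-unit, that attempt fails and the search restarts further on.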

        Best Regards,

        guy038

        • C
          Christopher Phillips @guy038
          last edited by Jan 6, 2019, 5:16 PM

          @guy038 said:

          Thank you both.
          guy038, yours was the one that sorted it for me. Thanks

          • C
            Christopher Phillips
            last edited by Jan 7, 2019, 9:10 PM

            @guy038 If I have many <trans-unit> elements without approved=“yes”, Notepad++ seems unable to handle it and selects the whole file from start to finish, as if I had pressed Ctrl+A.
            It seems there is an issue with grouping, from what I could find:
            https://github.com/notepad-plus-plus/notepad-plus-plus/issues/683

            I am not sure what grouping is but can you think of a workaround?

            • G
              guy038
              last edited by Jan 8, 2019, 12:36 AM

              Hi, @christopher-phillips, @terry-r and All,

              Indeed, in some cases, the N++ regex engine wrongly matches all the file contents ! I have not been able to find out, so far, which condition(s) cause(s) this issue :-((

              But if each of your opening tags <trans-unit...........approved="yes/no" lies on a single line, the simpler regex below, without the negative look-ahead structure, should work better :

              (?-s)^\h*<(trans-unit)\x20.+approved="yes"((?s).+?)</\1>\R
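
As a rough cross-check outside Notepad++, the simpler regex can be approximated in Python. This is a sketch under assumptions, not the exact expression: Python's `re` has no `\h` or `\R`, so `[ \t]` and an explicit line-break group stand in for them, the mid-pattern `(?s)` becomes the scoped group `(?s:...)`, `re.MULTILINE` makes `^` match at each line start, and the function name is hypothetical. Only the opening tag has to fit on one line; the element body may span several lines.

```python
import re

# Stand-in for \R, which Python's re module does not provide.
EOL = r"(?:\r\n|\n|\r)"

# The first .+ cannot cross a line break, so the opening tag must be on one
# line; the scoped (?s:...) group lets the element body span several lines.
SINGLE_LINE_TAG = re.compile(
    r'^[ \t]*<(trans-unit)\x20.+approved="yes"((?s:.+?))</\1>' + EOL,
    re.MULTILINE,
)


def remove_approved_single_line_tags(text: str) -> str:
    """Delete trans-unit elements whose one-line opening tag has approved="yes"."""
    return SINGLE_LINE_TAG.sub("", text)


sample = (
    '  <trans-unit id="9" identifier="d0d863d18d76100000ad54f79a2eed11">\n'
    '    <source>No account found</source>\n'
    '  </trans-unit>\n'
    '  <trans-unit id="11" identifier="bd421d33a9b0000e1e46049b1273eb9" approved="yes">\n'
    '    <source>Cannot get questions</source>\n'
    '  </trans-unit>\n'
)
print(remove_approved_single_line_tags(sample))
```

Because the part before approved="yes" stays on one line, an opening tag without that attribute can never pull in text from a later element.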

              Cheers,

              guy038

              • C
                Christopher Phillips
                last edited by Jan 8, 2019, 7:13 AM

                Mine are not on the same line. In some cases there are line breaks within the element as well :-(

                  <trans-unit id="8" identifier="b2a7b029bf7d20000a606ec7a87bc248">
                    <source>The old password is not right</source>
                    <target state="needs-translation">The old password is not right</target>
                    <note>Context: -&gt; The old password is not correct</note>
                  </trans-unit>
                  <trans-unit id="9" identifier="d0d863d18d76100000ad54f79a2eed11">
                    <source>No account found</source>
                    <target state="needs-translation">No account found</target>
                    <note>No account</note>
                  </trans-unit>
                  <trans-unit id="11" identifier="bd421d33a9b0000e1e46049b1273eb9" approved="yes">
                    <source>Cannot get questions</source>
                    <target>ಪ್ರಶ್ನೆಗಳನ್ನು ಪಡೆಯಲು</target>
                    <note>Context: #error</note>
                  </trans-unit>
                
                • C
                  Christopher Phillips
                  last edited by Jan 8, 2019, 9:01 AM

                  @guy038 said:

                  Thanks @guy038, you are a star. I misread and thought everything needed to be on a single line. Your new regex works perfectly.
