    Regex to delete sections of XML

    • Christopher Phillips

      How can I use a regex to find and delete all XML units that contain approved=“yes”?
      I tried the regex below. It finds the sections that have “yes”, but it becomes greedy when a section doesn’t have it:
      <trans.?(approved=“yes”).?unit>?\n

      Find
      <trans-unit id=“1” identifier=“e4c7” approved=“yes”>
      <source>Hello world
      </trans-unit>

      Ignore
      <trans-unit id=“5” identifier=“e4c7” approved=“no”>
      <source>Welcome to the world
      </trans-unit>

      • Terry R

        @Christopher-Phillips
        This should help.

        Find what: ^(<trans.+?approved=)(?=“yes”)(?s).+?trans-unit>\R
        Replace with: (leave this field empty)

        This finds the occurrences whose approved parameter is “yes”; with the Replace field left empty, Replace All removes them. Note that the regex ends with \R, which matches the line break after the closing tag, so you don’t end up with extra blank lines afterwards. Also note that I started the regex with ^, meaning start of line: I assume these XML strings ALL start at the beginning of a line.

        Also note that the quote characters in the regex must be exactly the ones used in your file. If it doesn’t work at first, change the quotes in my regex to match those in your XML strings.
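For anyone who wants to check the quote issue outside Notepad++, here is a minimal Python sketch. It assumes the pattern was copied from a web page that turned the straight quotes into typographic ones; `normalize_quotes`, `forum_pattern` and `xml_text` are names made up for this illustration. Python's `re` module has neither `\R` nor a mid-pattern `(?s)`, so the sketch substitutes `\r?\n` and a scoped `(?s:...)` group. It approximates the regex above rather than reproducing it.

```python
import re

# Map typographic quotes (U+201C, U+201D) to the plain quotation mark (U+0022).
CURLY_TO_STRAIGHT = str.maketrans({'\u201c': '"', '\u201d': '"'})

def normalize_quotes(text: str) -> str:
    """Return text with curly double quotes replaced by straight ones."""
    return text.translate(CURLY_TO_STRAIGHT)

# The pattern as it might look after copy/paste from the web page (curly quotes),
# adapted for Python: (?s) -> (?s:...), \R -> \r?\n. This adaptation is an assumption.
forum_pattern = (
    r'^(<trans.+?approved=)(?=' + '\u201cyes\u201d' + r')(?s:.+?)trans-unit>\r?\n'
)

# Hypothetical file content that uses straight quotes, as real XML normally does.
xml_text = (
    '<trans-unit id="1" identifier="e4c7" approved="yes">\n'
    '<source>Hello world\n'
    '</trans-unit>\n'
    '<trans-unit id="5" identifier="e4c7" approved="no">\n'
    '<source>Welcome to the world\n'
    '</trans-unit>\n'
)

# With curly quotes in the pattern, nothing in the file matches.
as_pasted = re.compile(forum_pattern, re.MULTILINE)
print(as_pasted.search(xml_text))

# After normalizing the pattern's quotes, the approved unit is found and removed.
fixed = re.compile(normalize_quotes(forum_pattern), re.MULTILINE)
cleaned = fixed.sub('', xml_text)
print(cleaned)
```

The sketch normalizes the pattern rather than the file, which mirrors the advice above: adjust the regex to the quotes your XML actually uses.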

        Terry

        • guy038

          Hello, @christopher-phillips, and All,

          To delete all the <trans-unit ...> ... </trans-unit> areas that contain approved="yes", just execute this regex S/R :

          SEARCH (?s)<(trans-unit)\x20((?!<\1).)+?approved="yes".+?</\1>\R

          REPLACE Leave EMPTY

          • Preferably, tick the Wrap around option

          • Select the Regular expression search mode

          • Click, once, on the Replace All button

          Et voilà !


          Test it against the text below :

          <trans-unit id="1" identifier="e4c7" approved="yes"> <source>Hello world </trans-unit>
          
          <trans-unit id="5" identifier="e4c7" approved="no"> <source>Welcome to the world </trans-unit>
          
          <trans-unit id="1" identifier="e4c7" approved="yes">
          <source>Hello world
          </trans-unit>
          
          <trans-unit id="5" identifier="e4c7" approved="no">
          <source>Welcome to the world
          </trans-unit>
          
          <trans-unit id="1"
           identifier="e4c7"
           approved="yes">
          <source>Hello world
          </trans-unit>
          
          <trans-unit id="5"
           identifier="e4c7" approved="no">
          <source>Welcome to the world
          </trans-unit>
          

          Notes :

          • This search regex uses the usual Quotation Mark symbol " ( \x{0022} ) and not the Left and Right Double Quotation Mark “ and ” ( \x{201C} and \x{201D} ). Change the double quotes, if necessary !

          • The first part (?s) means that dot will match any single char ( standard or EOL chars )

          • Then, the regex looks for the < symbol, followed by the string trans-unit, stored as group 1 because of the parentheses, followed by a space char ( <(trans-unit)\x20 )

          • Next, the part ((?!<\1).)+?approved="yes" matches the shortest range of characters, possibly spanning several lines, up to the string approved="yes", ONLY IF the string <trans-unit does not occur anywhere in that range. This prevents a match from starting in an unapproved unit and running on into a later approved one

          • Finally, the part .+?</\1>\R matches the shortest range of characters, possibly spanning several lines, up to the string </trans-unit>, followed by the EOL characters of the current line
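The notes above can be tried out with a short Python sketch. This is only an approximation under stated assumptions: Python's `re` module is not the engine Notepad++ uses, it has no `\R` (so `\r?\n` is used instead), and the names `TEMPERED`, `TEST_TEXT` and `delete_approved` are invented for the illustration. The test text is a shortened version of the one given earlier in this post.

```python
import re

# Approximation of the search regex for Python's re module.
TEMPERED = re.compile(
    r'<(trans-unit)\x20'     # "<", the tag name stored as group 1, then a space
    r'(?:(?!<\1).)+?'        # shortest run of chars that never steps onto "<trans-unit"
    r'approved="yes"'        # the attribute we are looking for
    r'.+?</\1>'              # shortest run up to the closing tag
    r'\r?\n',                # stand-in for \R (assumes LF or CRLF line endings)
    re.DOTALL,               # same role as (?s): dot also matches EOL chars
)

TEST_TEXT = (
    '<trans-unit id="1" identifier="e4c7" approved="yes">\n'
    '<source>Hello world\n'
    '</trans-unit>\n'
    '<trans-unit id="5" identifier="e4c7" approved="no">\n'
    '<source>Welcome to the world\n'
    '</trans-unit>\n'
    '<trans-unit id="1"\n'
    ' identifier="e4c7"\n'
    ' approved="yes">\n'
    '<source>Hello world\n'
    '</trans-unit>\n'
)

def delete_approved(text: str) -> str:
    """Remove every trans-unit block whose opening tag carries approved="yes"."""
    return TEMPERED.sub('', text)

remaining = delete_approved(TEST_TEXT)
print(remaining)
```

The original wraps the guarded dot in a capturing group; a non-capturing group is used here because the captured character is never needed. Without the `(?!<\1)` guard, a match that starts at the unapproved unit (id 5) could run forward into the next approved unit and delete both.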

          Best Regards,

          guy038

          • Christopher Phillips @guy038

            Thank you both.
            guy038, yours was the one that sorted it for me. Thanks.

            • Christopher Phillips

              @guy038 If I have many <trans-unit> elements without approved=“yes”, Notepad++ doesn’t seem to handle it and selects the whole file from start to finish, as if I had pressed Ctrl+A.
              From what I could find, there seems to be an issue with grouping:
              https://github.com/notepad-plus-plus/notepad-plus-plus/issues/683

              I am not sure what grouping is but can you think of a workaround?

              • guy038

                Hi, @christopher-phillips, @terry-r and All,

                Indeed, in some cases, the N++ regex engine wrongly matches all the file contents ! I have not been able to find out, so far, which condition(s) cause(s) this issue :-((

                But if each range of characters <trans-unit ... approved="yes/no" (that is, each opening tag up to its approved attribute) lies on a single line, the simpler regex below, without the negative look-ahead structure, should work better :

                (?-s)^\h*<(trans-unit)\x20.+approved="yes"((?s).+?)</\1>\R
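As a rough cross-check of this simpler regex, here is a Python sketch. It rests on a few assumptions: Python's `re` has no `\h` or `\R`, so `[ \t]` and `\r?\n` stand in for them; `re.MULTILINE` is needed so that `^` matches at each line start; and `SIMPLE`, `UNITS` and `result` are illustrative names. The sample is a shortened, hypothetical pair of units in the same shape as real XLIFF content.

```python
import re

SIMPLE = re.compile(
    r'^[ \t]*'               # optional leading blanks (stand-in for \h*)
    r'<(trans-unit)\x20'     # opening tag name stored as group 1, then a space
    r'.+approved="yes"'      # dot does NOT match EOL here: same line as the tag start
    r'(?s:.+?)'              # from here on, dot may cross lines (scoped (?s))
    r'</\1>\r?\n',           # closing tag and its line break (stand-in for \R)
    re.MULTILINE,
)

UNITS = (
    '  <trans-unit id="9" identifier="d0d863d18d76100000ad54f79a2eed11">\n'
    '    <source>No account found</source>\n'
    '    <note>No account</note>\n'
    '  </trans-unit>\n'
    '  <trans-unit id="11" identifier="bd421d33a9b0000e1e46049b1273eb9" approved="yes">\n'
    '    <source>Cannot get questions</source>\n'
    '    <note>Context: #error</note>\n'
    '  </trans-unit>\n'
)

result = SIMPLE.sub('', UNITS)
print(result)
```

Only the opening tag, up to `approved="yes"`, has to sit on one line; the body of the unit can span as many lines as needed. Because the first dot cannot cross a line break, a match cannot begin in an unapproved unit and reach a later `approved="yes"`, which is why no look-ahead is required.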

                Cheers,

                guy038

                • Christopher Phillips

                  Mine are not on the same line. In some cases there are line breaks within the element as well :-(

                    <trans-unit id="8" identifier="b2a7b029bf7d20000a606ec7a87bc248">
                      <source>The old password is not right</source>
                      <target state="needs-translation">The old password is not right</target>
                      <note>Context: -&gt; The old password is not correct</note>
                    </trans-unit>
                    <trans-unit id="9" identifier="d0d863d18d76100000ad54f79a2eed11">
                      <source>No account found</source>
                      <target state="needs-translation">No account found</target>
                      <note>No account</note>
                    </trans-unit>
                    <trans-unit id="11" identifier="bd421d33a9b0000e1e46049b1273eb9" approved="yes">
                      <source>Cannot get questions</source>
                      <target>ಪ್ರಶ್ನೆಗಳನ್ನು ಪಡೆಯಲು</target>
                      <note>Context: #error</note>
                    </trans-unit>
                  
                  • Christopher Phillips

                    Thanks @guy038, you are a star. I misread and thought everything needed to be on a single line. Your new regex works perfectly.
