Regex to delete sections of XML
-
How can I use a regex to find and delete all XML units that contain approved=“yes”?
I have tried the regex below. It finds the sections that have “yes”, but it becomes greedy when a section doesn’t.
<trans.?(approved=“yes”).?unit>?\n
Find:
<trans-unit id=“1” identifier=“e4c7” approved=“yes”>
<source>Hello world
</trans-unit>
Ignore:
<trans-unit id=“5” identifier=“e4c7” approved=“no”>
<source>Welcome to the world
</trans-unit>
-
@Christopher-Phillips
This should help.
Find what:
^(<trans.+?approved=)(?=“yes”)(?s).+?trans-unit>\R
Replace with: (leave this field empty)
This finds the occurrences where the attribute is “yes”, and with the replace field left empty it removes them. Note that the regex ends with \R, which matches the line break, so you don’t finish up with extra blank lines afterwards. Also note that I started the regex with ^, meaning start of line; I assume these XML strings ALL start at the beginning of a line.
Also note that the " characters need to be exactly the same as in your file. If it doesn’t work initially, change the quotes in my regex to match the ones in your XML strings.
Terry
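For anyone who wants to try the idea outside Notepad++, here is a minimal Python sketch of the regex above. It is an adaptation, not the same expression: Python's re module has no \R and does not accept (?s) in the middle of a pattern, so the sketch spells the line break out and uses [\s\S] for "any character, line breaks included". The names UNIT_YES, remove_approved and SAMPLE are made up for illustration, and straight " quotes are assumed.

```python
import re

# Adaptation of the Notepad++ regex above for Python's re module.
# Assumptions: straight double quotes, and every unit starts at a line start.
UNIT_YES = re.compile(
    r'^(<trans[^\r\n]+?approved=)(?="yes")'  # opening tag up to approved=, then "yes"
    r'[\s\S]+?trans-unit>'                   # shortest run up to the closing tag
    r'(?:\r\n|\n|\r)',                       # stands in for \R
    re.MULTILINE,
)


def remove_approved(text: str) -> str:
    """Delete every unit whose opening line carries approved="yes"."""
    return UNIT_YES.sub('', text)


# Sample text modelled on the question, with straight quotes.
SAMPLE = (
    '<trans-unit id="1" identifier="e4c7" approved="yes">\n'
    '<source>Hello world\n'
    '</trans-unit>\n'
    '<trans-unit id="5" identifier="e4c7" approved="no">\n'
    '<source>Welcome to the world\n'
    '</trans-unit>\n'
)

print(remove_approved(SAMPLE))
```

Only the unit with approved="no" should be left in the printed text.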
-
Hello, @christopher-phillips, and All,
To delete all the <trans-unit......>.........</trans-unit> areas that contain approved="yes", just execute this regex S/R :
SEARCH (?s)<(trans-unit)\x20((?!<\1).)+?approved="yes".+?</\1>\R
REPLACE Leave EMPTY
- Check, preferably, the Wrap around option
- Select the Regular expression search mode
- Click, once, on the Replace All button
Et voilà !
Test it against the text below :
<trans-unit id="1" identifier="e4c7" approved="yes">
<source>Hello world
</trans-unit>
<trans-unit id="5" identifier="e4c7" approved="no">
<source>Welcome to the world
</trans-unit>
<trans-unit id="1" identifier="e4c7" approved="yes">
<source>Hello world
</trans-unit>
<trans-unit id="5" identifier="e4c7" approved="no">
<source>Welcome to the world
</trans-unit>
<trans-unit id="1" identifier="e4c7" approved="yes">
<source>Hello world
</trans-unit>
<trans-unit id="5" identifier="e4c7" approved="no">
<source>Welcome to the world
</trans-unit>
Notes :
- This search regex uses the usual Quotation Mark symbol " (\x{0022}) and not the Left and Right Double Quotation Marks “ and ” (\x{201C} and \x{201D}). Change the double quotes, if necessary !
- The first part (?s) means that the dot will match any single char ( standard or EOL chars )
- Then, the regex looks for the < symbol, followed by the string trans-unit, stored as group 1 because of the parentheses, followed by a space char ( <(trans-unit)\x20 )
- After that, the part ((?!<\1).)+?approved="yes" tries to find the smallest range, even multi-line, of any characters up to the string approved="yes", ONLY IF the string <trans-unit cannot be found at any position of that range
- Finally, the part .+?</\1>\R tries to match the smallest range, even multi-line, of any characters up to the string </trans-unit>, followed by the usual EOL characters of the current line
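To see what the negative look-ahead buys, here is a small Python sketch that compares the search regex with a copy that lacks the ((?!<\1).) part. This is only an approximation of the Notepad++ behaviour: Python's re has no \R, so the EOL is spelled out, and the helper unit() and the names TEMPERED, NAIVE and TEXT are invented for the demo.

```python
import re

EOL = r'(?:\r\n|\n|\r)'  # stands in for \R, which Python's re lacks

# Port of the search regex: each character before approved="yes" is accepted
# only if a new <trans-unit does not start at that position.
TEMPERED = re.compile(
    r'<(trans-unit)\x20((?!<\1).)+?approved="yes".+?</\1>' + EOL,
    re.DOTALL,
)

# Same pattern without the negative look-ahead, for comparison only.
NAIVE = re.compile(
    r'<(trans-unit)\x20.+?approved="yes".+?</\1>' + EOL,
    re.DOTALL,
)


def unit(uid: int, approved: str) -> str:
    """Build one sample unit (hypothetical helper, not part of the thread)."""
    return (f'<trans-unit id="{uid}" identifier="e4c7" approved="{approved}">\n'
            f'<source>Hello world\n</trans-unit>\n')


# A "no" unit, then a "yes" unit, then another "no" unit.
TEXT = unit(5, 'no') + unit(1, 'yes') + unit(6, 'no')

kept_tempered = TEMPERED.sub('', TEXT)
kept_naive = NAIVE.sub('', TEXT)

print(kept_tempered.count('<trans-unit '))  # 2: both "no" units survive
print(kept_naive.count('<trans-unit '))     # 1: the first "no" unit was swallowed
```

Without the look-ahead, the match starts at the first "no" unit and runs on until it meets approved="yes" in the next unit, so both get deleted.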
Best Regards,
guy038
-
-
Thank you both.
guy038, yours was the one that sorted it for me. Thanks
-
@guy038 If I have many <trans-unit> elements in a row without approved=“yes”, Notepad++ seems unable to handle it and selects the whole file from start to finish, as if I had pressed Ctrl+A.
It seems there is an issue with grouping, from what I could find:
https://github.com/notepad-plus-plus/notepad-plus-plus/issues/683
I am not sure what grouping is, but can you think of a workaround?
-
Hi, @christopher-phillips, @terry-r and All,
Indeed, in some cases, the N++ regex engine wrongly matches all the file contents ! I have not been able to find out, so far, which condition(s) cause(s) this issue :-((
But if all your ranges of characters <trans-unit...........approved="yes/no" lie on a single line only, the simpler regex below, without the negative look-ahead structure, should work better :
(?-s)^\h*<(trans-unit)\x20.+approved="yes"((?s).+?)</\1>\R
Cheers,
guy038
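As a rough check of the simpler regex outside Notepad++, here is a Python sketch. It is an adaptation under assumptions: [ \t] replaces \h, [^\r\n] replaces the dot under (?-s), the scoped flag (?s:...) replaces the inline (?s), and the EOL is spelled out because Python's re has no \R. The names SIMPLE and TEXT and the sample text are only for illustration.

```python
import re

# Port of the simpler regex. Assumptions: straight quotes, and the opening
# tag up to approved="yes" sits on a single line.
SIMPLE = re.compile(
    r'^[ \t]*<(trans-unit)\x20[^\r\n]+approved="yes"'  # one-line opening tag
    r'((?s:.+?))</\1>'                                 # body may span lines
    r'(?:\r\n|\n|\r)',                                 # stands in for \R
    re.MULTILINE,
)

# Two sample units; only the second one is approved.
TEXT = (
    '<trans-unit id="9" identifier="d0d863d18d76100000ad54f79a2eed11">\n'
    '  <source>No account found</source>\n'
    '</trans-unit>\n'
    '<trans-unit id="11" identifier="bd421d33a9b0000e1e46049b1273eb9" approved="yes">\n'
    '  <source>Cannot get questions</source>\n'
    '</trans-unit>\n'
)

result = SIMPLE.sub('', TEXT)
print(result)
```

Because the part before approved="yes" cannot cross a line break, a unit without the attribute fails at once instead of reaching into the next unit.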
-
Mine are not on the same line. In some cases there are line breaks within the element as well :-(
<trans-unit id="8" identifier="b2a7b029bf7d20000a606ec7a87bc248">
<source>The old password is not right</source>
<target state="needs-translation">The old password is not right</target>
<note>Context: -> The old password is not correct</note>
</trans-unit>
<trans-unit id="9" identifier="d0d863d18d76100000ad54f79a2eed11">
<source>No account found</source>
<target state="needs-translation">No account found</target>
<note>No account</note>
</trans-unit>
<trans-unit id="11" identifier="bd421d33a9b0000e1e46049b1273eb9" approved="yes">
<source>Cannot get questions</source>
<target>ಪ್ರಶ್ನೆಗಳನ್ನು ಪಡೆಯಲು</target>
<note>Context: #error</note>
</trans-unit>
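One possible workaround when the opening tag itself spans several lines is to do the job in a small script rather than in a single search regex: match every unit with a plain pattern, then decide in code whether to drop it. The Python sketch below is an assumption-laden illustration, not something tested inside Notepad++. It assumes straight quotes, units that are not nested, and no > inside attribute values. The names UNIT, drop_if_approved and remove_approved_units, and the shortened identifiers in TEXT, are made up.

```python
import re

# Matches any whole unit. The opening tag is captured on its own so that the
# approved attribute is only looked for there, not in the unit's body.
UNIT = re.compile(
    r'[ \t]*(<trans-unit\b[^>]*>).*?</trans-unit>[ \t]*(?:\r\n|\n|\r)?',
    re.DOTALL,
)


def drop_if_approved(match: re.Match) -> str:
    """Return '' for approved units and the untouched text otherwise."""
    opening_tag = match.group(1)
    return '' if 'approved="yes"' in opening_tag else match.group(0)


def remove_approved_units(text: str) -> str:
    """Remove every unit whose opening tag contains approved="yes"."""
    return UNIT.sub(drop_if_approved, text)


# Made-up sample with line breaks inside the opening tags.
TEXT = (
    '<trans-unit id="9"\n'
    '    identifier="d0d8">\n'
    '  <source>No account found</source>\n'
    '</trans-unit>\n'
    '<trans-unit id="11"\n'
    '    identifier="bd42"\n'
    '    approved="yes">\n'
    '  <source>Cannot get questions</source>\n'
    '</trans-unit>\n'
)

print(remove_approved_units(TEXT))
```

The pattern never has to look for approved="yes" across unit boundaries, since each match stops at the first </trans-unit> and the attribute test is a plain substring check on the captured opening tag.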
-