Replace second occurrence of a tag in multiple files

Anonymous Guy

Hello, I’ve been generating XML files from Excel rows and am almost done, but I have a problem I haven’t been able to fix yet. I have 500+ files open in Notepad++ and want to batch replace the second occurrence of a certain tag in all open files. I have two closing tags at different lines </contributor></contributor> but I’d like to change the second one to <contributor_description> so the result would be </contributor><contributor_description>. I’m not a complete beginner, but this one seems hard to crack, so any help would be greatly appreciated. Thank you in advance.

PeterJones

@Anonymous-Guy said in Replace second occurrence of a tag in multiple files:

I have two closing tags at different lines </contributor></contributor> but I’d like to change the second one to <contributor_description> so the result would be </contributor><contributor_description> … this one seems hard to crack

Either you’re not telling us the whole story, or the forum mangled what you posted, because looking at your literal text, that is one of the simplest forms of search/replace: you just have to enter what you had and what you want.

FIND = </contributor></contributor>
REPLACE = </contributor><contributor_description>
Search Mode = normal or regular expression

For me, that converted </contributor></contributor> into </contributor><contributor_description>

Once you’ve got the search/replace working in one document, you can replace all in all opened documents
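For readers who would rather script the batch step outside Notepad++, here is a minimal Python sketch of the same literal (non-regex) replacement applied to every file in a folder. The function name `replace_in_folder`, the `*.xml` glob, and the UTF-8 encoding are assumptions made for illustration; they are not part of the original advice.

```python
from pathlib import Path

# Literal strings taken from the FIND / REPLACE fields above.
FIND = "</contributor></contributor>"
REPLACE = "</contributor><contributor_description>"


def replace_in_folder(folder: str, find: str = FIND, replace: str = REPLACE) -> int:
    """Literal replace in every *.xml file in `folder`; returns the number of files changed."""
    changed = 0
    for path in sorted(Path(folder).glob("*.xml")):
        text = path.read_text(encoding="utf-8")  # assumes UTF-8 files
        new_text = text.replace(find, replace)
        if new_text != text:
            # Only rewrite files that actually contained the search text.
            path.write_text(new_text, encoding="utf-8")
            changed += 1
    return changed
```

As with Replace All in All Opened Documents, it is worth trying this on a copy of the files first.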

Of course, if the text you posted in your example (or, at least, how your data rendered in the forum) doesn’t actually match your real XML data, then that might explain why you couldn’t get it to work.

If this doesn’t work for you, please post more-representative data, and make sure you use the forum’s </> button while highlighting the data to mark your text as “code”, so the forum will not mangle the data. Also read my boilerplate for more search/replace advice.

-----

Please Read And Understand This

FYI: I often add this to my response in regex threads, unless I am sure the original poster has seen it before. Here is some helpful information for finding out more about regular expressions, and for formatting posts in this forum (especially quoting data) so that we can fully understand what you’re trying to ask:

This forum is formatted using Markdown. Fortunately, it has a formatting toolbar above the edit window, and a preview window to the right; make use of those. The </> button formats text as “code”, so that the text you format with that button will come through literally; use that formatting for example text that you want to make sure comes through literally, no matter what characters you use in the text (otherwise, the forum might interpret your example text as Markdown, with unexpected-for-you results, giving us a bad indication of what your data really is). Images can be pasted directly into your post, or you can hit the image button. (For more about how to manually use Markdown in this forum, please see @Scott-Sumner’s post in the “how to markdown code on this forum” topic, and my updates near the end.) Please use the preview window on the right to confirm that your text looks right before hitting SUBMIT. If you want to clearly communicate your text data to us, you need to properly format it.

If you have further search-and-replace (“matching”, “marking”, “bookmarking”, regular expression, “regex”) needs, study the official Notepad++ searching using regular-expressions docs, as well as this forum’s FAQ and the documentation it points to. Before asking a new regex question, understand that for future requests, many of us will expect you to show what data you have (exactly), what data you want (exactly), what regex you already tried (to show that you’re showing effort), why you thought that regex would work (to prove it wasn’t just something randomly typed), and what data you’re getting with an explanation of why that result is wrong. When you show that effort, you’ll see us bend over backward to get things working for you. If you need help formatting, see the paragraph above.

Please note that for all regex and related queries, it is best if you are explicit about what needs to match, and what shouldn’t match, and have multiple examples of both in your example dataset. Often, what shouldn’t match helps define the regular expression as much or more than what should match.

Here is the way I usually break down trying to figure out a regex (whether it’s for myself or for helping someone in the forum):

Compare which portions of each line I want to match are identical in every line (“constants”), and which parts I want to allow to differ from line to line (“variables”) while still being part of the match.

Look at both the variables and constants, and see what portions of each I’ll want to keep or move around, vs which parts get thrown away completely. Each sub-component that I want to keep will be put in a regex group. Anything that gets completely thrown away doesn’t need to be in a group, though sometimes I put it in a numbered (___) or unnumbered (?:___) group anyway, if I have a good reason for it. Anything that needs to be split apart, I break into multiple groups, instead of having it as one group.

For each group, I do a mental “how would I describe to my son how to correctly match these characters?” – which should hopefully give me a simple, foolproof algorithm of characters that must match or must not match; then I ask, “how would I translate those instructions into regex sequences?” If I don’t know the answer to the second, I read documentation, or ask a specific question.

Try it, debug, iterate.

PeterJones

@Anonymous-Guy said in Replace second occurrence of a tag in multiple files:

I have two closing tags at different lines

Oh, sorry. Re-reading, I noticed you mentioned “at different lines”. Once again, showing representative data is the best way to avoid confusion. Assuming you have

        ...blah...</contributor>
    </contributor>

which shows a newline and other whitespace between them, then use

FIND = </contributor>(\s+)</contributor>
REPLACE = </contributor>$1<contributor_description>
MODE = regular expression

This preserves the spacing, but converts the second </contributor> to <contributor_description>

        ...blah...</contributor>
    <contributor_description>
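The same whitespace-preserving replacement can be sketched with Python's `re` module, which may help when checking the expression outside Notepad++. The only syntax difference here is that the `$1` backreference in the replacement becomes `\1`; the sample text is the two-line example above.

```python
import re

# Two closing tags separated by a newline and indentation, as in the example above.
text = "        ...blah...</contributor>\n    </contributor>\n"

# (\s+) captures the newline and indentation so it can be written back unchanged.
pattern = re.compile(r"</contributor>(\s+)</contributor>")
result = pattern.sub(r"</contributor>\1<contributor_description>", text)

print(result)
```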

Anonymous Guy

@PeterJones no need to apologize. It’s my first post here and the input window is small, so I tried to describe the problem as briefly as I could, also to encourage someone to actually read it. Thank you for the FYI, and excuse me for describing the problem the way I did. I tried your suggestion and it says it can’t find the text “</contributor>(\s+)</contributor>”. I’ve done some batch editing before, so I know a few tricks with regular expressions and so on. My situation is more like …blah…</contributor> followed by another …blah…</contributor>, if that changes anything. Hopefully you can see the sample in the </> field. Thank you for trying to help.

<?xml version="1.0" encoding="utf-8"?>
<record>
  <date>1981</date>
  <title title_type="main">Dani Gancev</title>
  <title title_type="subtitle">Soncna pot</title>
  <publisher>samozal. J. Kenda</publisher>
  <relation relation_type="main">Zbirka fotografij Janija Kende</relation>
  <contributor contributor_type_id="600">Kenda, Jani</contributor>
  <contributor_description>baskitarist</contributor>
  <coverage coverage_type="spatial">Ljubljana</coverage>
  <coverage coverage_type="spatial">Diskoteka Turist</coverage>
  <description description_type="notes" language_type_id="slv">Baskitarist skupine Soncna pot na nastopu v diskoteki Turist</description>
  <subject language_type_id="slv">koncert</subject>
  <relation>1_FILM_ZITO-TURIST_DISKO_SN_017.jpg</relation>
  <identifier>1_FILM_ZITO-TURIST_DISKO_SN_017</identifier>
</record>

dinkumoil

@Anonymous-Guy

Try the following:

Find what: <contributor_description>.*?\K</contributor>
Replace with: </contributor_description>

Match case unchecked.
Wrap around checked.
Regular expression selected
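For anyone testing this outside Notepad++: Python's `re` module does not support `\K`, so a rough equivalent captures the text before the closing tag and writes it back in the replacement. This is a sketch under that assumption, with `re.IGNORECASE` standing in for "Match case unchecked"; the sample line comes from the posted XML.

```python
import re

line = "  <contributor_description>baskitarist</contributor>"

# Group 1 plays the role of everything before \K: it is matched, then put back as-is.
pattern = re.compile(r"(<contributor_description>.*?)</contributor>", re.IGNORECASE)
fixed = pattern.sub(r"\1</contributor_description>", line)

print(fixed)
```

A line such as `<contributor contributor_type_id="600">Kenda, Jani</contributor>` is left alone, because it does not contain the literal text `<contributor_description>`.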

PeterJones

@Anonymous-Guy ,

Looking at your data – I assume your sample is the data before the change:

  <contributor contributor_type_id="600">Kenda, Jani</contributor>
  <contributor_description>baskitarist</contributor>

If you want to change the second </contributor> to </contributor_description>, that’s doable (though not exactly what you described)

FIND = </contributor>(\s+.*)</contributor>
REPLACE = </contributor>$1</contributor_description>

However, I’m not sure you really need the </contributor> part of the match. If that XML snippet that we can see is representative of your data, the real problem is that you’re trying to get the closing tag to match the opening tag for <contributor_description>, which doesn’t seem to require the extra matching before.

If you have more mismatches like this, but not all of them using the same tag, I would suggest a more generic one-line-matching search, which checks, on each line, that the opening tag matches the closing tag (assuming there is a closing tag):

FIND = (?-s)^\h*<(\w+)(.*?</)\K(?!\1)\w+>
REPLACE = $1>
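A rough Python translation of this generic search is sketched below. Python's `re` has neither `\h` nor `\K`, so `[ \t]` stands in for `\h` and the text up to the closing tag name is captured and written back; the function name `fix_mismatched_closing_tags` is made up for this example.

```python
import re

# Group 1: indentation, opening tag and content up to "</" (the part before \K).
# Group 2: the opening tag name; (?!\2) rejects closing names that start with it.
MISMATCH = re.compile(r"^([ \t]*<(\w+).*?</)(?!\2)\w+>", re.MULTILINE)


def fix_mismatched_closing_tags(xml: str) -> str:
    """Rewrite a one-line element's closing tag name to match its opening tag."""
    return MISMATCH.sub(r"\1\2>", xml)


sample = (
    '  <contributor contributor_type_id="600">Kenda, Jani</contributor>\n'
    "  <contributor_description>baskitarist</contributor>\n"
)
print(fix_mismatched_closing_tags(sample))
```

Because `.` does not match a newline by default in Python (the counterpart of `(?-s)`), only elements that open and close on the same line are considered.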

EDIT: Oh, I see @dinkumoil chimed in while I was replying, with an expression that matches what I had in mind when I said “I’m not sure you really need the </contributor> part of the match”.

guy038

Hello, @anonymous-guy, @dinkumoil, @peterjones and All,

@dinkumoil’s solution can even be shortened!

Open the Replace dialog ( Ctrl + H )
SEARCH (?s-i)<contributor_description>.*?</contributor\K
REPLACE _description
Tick the Wrap around option
Select the Regular expression search mode
Click, exclusively, on the Replace All button ( Do not use the Replace button )

This regex looks for the zero-length location right before the closing > of the range <contributor_description>....</contributor>, even when the range is split over several lines, as below, and simply inserts the string _description at this location!

So, for instance, the text :

<contributor_description>baskitarist
<coverage coverage_type="spatial">Ljubljana</coverage>
<coverage coverage_type="spatial">Diskoteka Turist</coverage>
</contributor>

<contributor_description>baskitarist</contributor>

would be changed into :

<contributor_description>baskitarist
<coverage coverage_type="spatial">Ljubljana</coverage>
<coverage coverage_type="spatial">Diskoteka Turist</coverage>
</contributor_description>

<contributor_description>baskitarist</contributor_description>
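The multi-line behaviour can be checked with a small Python sketch. `re.DOTALL` plays the role of `(?s)`, and a capture group replaces `\K`, which Python's `re` lacks. The `(?=>)` lookahead is an addition of this sketch, not part of the regex above: without it, a second pass over already-corrected text would match inside `</contributor_description>` and insert `_description` again.

```python
import re

# Group 1 ends right after "</contributor"; (?=>) requires the closing ">" to follow,
# so an already-corrected "</contributor_description>" is not touched again.
PATTERN = re.compile(r"(<contributor_description>.*?</contributor)(?=>)", re.DOTALL)


def fix_description_closing_tag(text: str) -> str:
    """Insert "_description" before the ">" of the mismatched closing tag."""
    return PATTERN.sub(r"\1_description", text)


before = (
    "<contributor_description>baskitarist\n"
    '<coverage coverage_type="spatial">Ljubljana</coverage>\n'
    '<coverage coverage_type="spatial">Diskoteka Turist</coverage>\n'
    "</contributor>\n"
    "\n"
    "<contributor_description>baskitarist</contributor>\n"
)
after = fix_description_closing_tag(before)
print(after)
```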

@peterjones, your regex, which finds mismatched opening and ending tags is quite valuable. With a tiny change, this similar version (?-s)^\h*<(\w+)(.*?</)\K(?!\1)\w*> even allows the case of an empty ending tag </>

Best regards,

guy038

Anonymous Guy

@PeterJones @dinkumoil @guy038
Thank you!
I’ve tested all three methods you provided and all of them worked. I’ll surely save this offline, come back to it in the future, and try to understand it even better. Thank you for taking the time and effort to reply in such detail to my topic; I appreciate it a lot. I’m always amazed at how a few simple lines can make your life much easier. I’ve optimized my workflow in many ways using portable software, renamers and many other things, and I can edit solutions from others, but sadly I’m not yet able to come up with solutions like yours from scratch. Now that I think about it, maybe I can give something in return. Search for “Everything” by voidtools on Google, in case you don’t know about it; it’s great for personal use and much more. It scans all your drives (or mapped network locations) relatively fast (a minute or two for, say, a single 1TB drive), creates a database, and then lets you find files by parts of the file name: for example, searching “june .doc” will find salaryjune2019.doc. I use it at work to search files on over 70 TB of data. It’s free, hope you find it useful. Thanks again for your help and have a great day!