Regex: Delete all instances of the <title> HTML tag except the first one

Hellena Crainicu

I have several files that look like this:

<title>用正确方式打开 MyGainer增健肌粉！ - MYPROTEIN™</title>
blah bhla
blah bhla
<title>Home is me</title>
blah bhla
<title>Payton is your name</title>

I want to find a regex that deletes all lines that contain <title>.*</title> except the first one.

My regex is not very good:

FIND: (<title>.*?</title>)(?=(?:<title>|$)) or
(?s-i)\A.*\K<title>(.*?)(.*?</title>)
Replace by: \1

I wrote some Python code that works well, but I need a regex for this job:
--------------------------

import re

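def keep_first_title_tag(extracted_content):
    # Find the first `<title>` tag
    first = re.search(r'<title>.*?</title>', extracted_content, re.DOTALL)
    if first is None:
        # No `<title>` tag at all: nothing to remove
        return extracted_content

    # Keep everything up to the end of the first `<title>` tag,
    # then delete every later `<title>` tag from the rest of the text
    head = extracted_content[:first.end()]
    tail = re.sub(r'<title>.*?</title>', '', extracted_content[first.end():], flags=re.DOTALL)

    return head + tail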


extracted_content = """
<title>用正确方式打开 MyGainer增健肌粉！ - MYPROTEIN™</title>
blah bhla
blah bhla
<title>Home is me</title>
blah bhla
<title>Payton is your name</title>
"""

extracted_content = keep_first_title_tag(extracted_content)
print(extracted_content)

Coises

@Hellena-Crainicu

If your example is precise in that the title line you want to keep is the very first line of the file, then you can use the fact that all the other title lines will be preceded by line ending characters; so use:

\R<title>.*?</title>

and replace with an empty string.
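
For readers who want to try the same idea outside Notepad++, here is a minimal Python sketch. It assumes, as above, that the title to keep is the very first line. Python's standard `re` module has no `\R`, so the line break is spelled out explicitly; the names `LINE_BREAK`, `PATTERN`, and `drop_later_titles` are made up for this illustration.

```python
import re

# Python's re module has no \R, so the line endings are listed explicitly.
LINE_BREAK = r"(?:\r\n|\n|\r)"
PATTERN = re.compile(LINE_BREAK + r"<title>.*?</title>")


def drop_later_titles(text: str) -> str:
    """Delete every <title>...</title> that directly follows a line break."""
    return PATTERN.sub("", text)


sample = (
    "<title>First</title>\n"
    "blah\n"
    "<title>Second</title>\n"
    "blah\n"
    "<title>Third</title>\n"
)
print(drop_later_titles(sample))
```

The first title survives only because no line break precedes it. If any text comes before it, it is removed as well.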

Hellena Crainicu

@Coises Super formula, that was so easy. Thanks a lot!

Also, here is an update with another piece of Python code that does the same thing:

    import regex

    def remove_last_title_tags(text):
        # Find every `<title>` tag that starts at the beginning of a line
        pattern = r"(?<=^|\n)<title>.*?</title>"
        matches = list(regex.finditer(pattern, text, flags=regex.DOTALL))

        # Delete every match except the first one. Working from the last
        # match backwards keeps the earlier match positions valid.
        for match in reversed(matches[1:]):
            text = text[:match.start()] + text[match.end():]

        return text

    extracted_content = remove_last_title_tags(extracted_content)

Terry R

@Hellena-Crainicu

I agree with @Coises. In fact I had exactly the same regex. As asked, if it is definitely at the very start of the file, that should work.

Otherwise, if the first <title> isn’t at the very start of the file, I don’t think a regex will be able to do this in one pass. The other option would be to find the first instance and tag it, then do a second pass to remove all other instances, and a third pass to remove the tag from the remaining <title>.
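
The three passes described above can be sketched in Python as follows. This is only an illustration: the marker `<KEEP-title>` and the function name are hypothetical, and the marker is assumed not to occur anywhere in the input.

```python
import re

TITLE = re.compile(r"<title>.*?</title>", re.DOTALL)
MARK_OPEN = "<KEEP-title>"  # hypothetical marker; must not occur in the input


def keep_first_title_three_pass(text: str) -> str:
    # Pass 1: tag the first <title> so the next pass cannot match it.
    text = text.replace("<title>", MARK_OPEN, 1)
    # Pass 2: remove every remaining (untagged) title.
    text = TITLE.sub("", text)
    # Pass 3: turn the tagged title back into a normal <title>.
    return text.replace(MARK_OPEN, "<title>", 1)


sample = "intro\n<title>A</title>\nx\n<title>B</title>\ny\n<title>C</title>\n"
print(keep_first_title_three_pass(sample))
```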

Terry

mkupper

@Hellena-Crainicu Another way to do this is:

Search: (?s)(<title>.*?</title>.*?)<title>.*?</title>
Replace: $1

You would need to repeat this until it stops replacing.
In summary:

(?s) puts the regexp engine in dot-matches-newline mode, meaning that “.” also matches end-of-line characters. Normally “.” stops matching at the end of the line.
(<title>.*?</title>.*?) grabs the first title and everything after it, using a non-greedy scan.
<title>.*?</title> is the second title.

Thus we are saving the first title and everything after it up to the second title, and discarding the second title. You will find that as you do the search/replace it re-positions the cursor, meaning the second search/replace will save the third title and discard the fourth. Keep repeating. Eventually there will be just one title left, and it will always be the first one.
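
The "repeat until it stops replacing" loop can be sketched in Python like this. It is an illustration only: the function name is made up, and `re.subn` stands in for pressing Replace All once per iteration.

```python
import re

# Same expression as above: keep the first title plus what follows, drop the next title.
PAIR = re.compile(r"(<title>.*?</title>.*?)<title>.*?</title>", re.DOTALL)


def keep_first_title_by_repetition(text: str) -> tuple[str, int]:
    """Repeat the replacement until nothing changes; return the text and pass count."""
    passes = 0
    while True:
        text, replaced = PAIR.subn(r"\1", text)
        passes += 1
        if replaced == 0:
            return text, passes


sample = "p\n<title>A</title>\nx\n<title>B</title>\ny\n<title>C</title>\n"
result, passes = keep_first_title_by_repetition(sample)
print(passes)
print(result)
```

With three titles, the first two passes each replace one occurrence and the third pass replaces none, so the loop stops after three passes.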

Hellena Crainicu

@mkupper

yes, but if I have text before the first <title> instance, it will delete exactly the <title> instances that are not needed.

Try your regex with this example. You will see that the first instance of <title> is deleted, and that is exactly the one I need to keep.

<p>但是减脂期可不一定就意味着每天只吃水煮鸡胸和水煮西蓝花，完全苦行僧一样的生活。如果在营养均衡的三餐之间适当的尝试一些健康的小零食，不仅能为减脂期提供动力和新鲜感，还可以为生活增添不少的趣味呢。</p>除此之外，从营养学角度，适当的加餐可以预防三餐之间出现低血糖的现象，还能防止因为饥饿而在下一餐中暴饮暴食，摄入过多热量的情况发生。</p>因此，健康的小零食不仅有助于完成减脂目标，还能让你的减肥期丰富多彩，何乐而不为呢？下面就来推荐给大家10种好吃又健康的小零食。</p>
<title>用正确方式打开 MyGainer增健肌粉！ - MYPROTEIN™</title>
blah bhla
blah bhla
<title>Home is me</title>
blah bhla
<title>Payton is your name</title>

guy038

Hi, @hellena-crainicu, @coises, @terry-r, @mkupper and All,

@hellena-crainicu, we must take two cases into account:

A) The <title>.......</title> line is the very first one of your file(s):

For this case, I personally found two other solutions:

SEARCH (?s)(?!\A)<title>.*?</title>\R
REPLACE Leave EMPTY

AND

SEARCH (?s)\A(<title>.*?</title>\R)(*SKIP)(*F)|(?1)
REPLACE Leave EMPTY

However, @coises’s formulation, with the leading modifier (?s)

SEARCH (?s)\R<title>.*?</title>
REPLACE Leave EMPTY

is really clever and definitely the best one, as the \R syntax is quicker to execute than the negative look-ahead (?!\A), which could matter if numerous files are concerned!

B) The <title>.......</title> line is NOT necessarily the very first one of your file(s):

In this case, a solution, derived from my second formulation above, could be :

SEARCH (?s)\A.*?(<title>.*?</title>\R)(*SKIP)(*F)|(?1)
REPLACE Leave EMPTY
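
Python's standard `re` module supports neither the backtracking control verbs `(*SKIP)(*F)` nor the subroutine call `(?1)`, so the expression above cannot be reused there as-is. As a stand-in, here is a minimal sketch that gets the same result (keep the first title line, delete the later ones) with a replacement callback. It is a different technique, and the function and variable names are invented for this example.

```python
import re

# A title followed by its line break; \R is spelled out because re lacks it.
TITLE_LINE = re.compile(r"<title>.*?</title>(?:\r\n|\n|\r)", re.DOTALL)


def keep_first_title_single_pass(text: str) -> str:
    seen_first = False

    def replace(match: re.Match) -> str:
        nonlocal seen_first
        if not seen_first:
            seen_first = True
            return match.group(0)  # keep the first title line unchanged
        return ""                  # delete every later title line

    return TITLE_LINE.sub(replace, text)


sample = "intro\n<title>A</title>\nx\n<title>B</title>\ny\n<title>C</title>\n"
print(keep_first_title_single_pass(sample))
```

Like the search expression above, this only touches titles that are followed by a line break, so a title at the very end of a file with no final newline is left alone.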

Best Regards,

guy038

mkupper

@Hellena-Crainicu I was puzzled by your comment. I suspect that, while testing, you had the search expression itself at the top of the file. In that case the first title was the one inside the expression.

I modified my search expression slightly to replace < with \x3c so that we can have the search/replace expression within the file for testing. I put it at the bottom in these examples.

Here is the test I ran:

Original data

<p>但是减脂期可不一定就意味着每天只吃水煮鸡胸和水煮西蓝花，完全苦行僧一样的生活。如果在营养均衡的三餐之间适当的尝试一些健康的小零食，不仅能为减脂期提供动力和新鲜感，还可以为生活增添不少的趣味呢。</p>除此之外，从营养学角度，适当的加餐可以预防三餐之间出现低血糖的现象，还能防止因为饥饿而在下一餐中暴饮暴食，摄入过多热量的情况发生。</p>因此，健康的小零食不仅有助于完成减脂目标，还能让你的减肥期丰富多彩，何乐而不为呢？下面就来推荐给大家10种好吃又健康的小零食。</p>
<title>用正确方式打开 MyGainer增健肌粉！ - MYPROTEIN™</title>
blah bhla
blah bhla
<title>Home is me</title>
blah bhla
<title>Payton is your name</title>


Search: (?s)(\x3ctitle>.*?\x3c/title>.*?)\x3ctitle>.*?\x3c/title>
Replace: $1

### First pass
This is after doing search-replace-all one time. It removed the second title, which was on line 5.

<p>但是减脂期可不一定就意味着每天只吃水煮鸡胸和水煮西蓝花，完全苦行僧一样的生活。如果在营养均衡的三餐之间适当的尝试一些健康的小零食，不仅能为减脂期提供动力和新鲜感，还可以为生活增添不少的趣味呢。</p>除此之外，从营养学角度，适当的加餐可以预防三餐之间出现低血糖的现象，还能防止因为饥饿而在下一餐中暴饮暴食，摄入过多热量的情况发生。</p>因此，健康的小零食不仅有助于完成减脂目标，还能让你的减肥期丰富多彩，何乐而不为呢？下面就来推荐给大家10种好吃又健康的小零食。</p>
<title>用正确方式打开 MyGainer增健肌粉！ - MYPROTEIN™</title>
blah bhla
blah bhla

blah bhla
<title>Payton is your name</title>


Search: (?s)(\x3ctitle>.*?\x3c/title>.*?)\x3ctitle>.*?\x3c/title>
Replace: $1

### Second pass
This is after doing search-replace-all twice. The first pass removed the second title, which was on line 5, and the second pass removed the third title, which was on line 7.

<p>但是减脂期可不一定就意味着每天只吃水煮鸡胸和水煮西蓝花，完全苦行僧一样的生活。如果在营养均衡的三餐之间适当的尝试一些健康的小零食，不仅能为减脂期提供动力和新鲜感，还可以为生活增添不少的趣味呢。</p>除此之外，从营养学角度，适当的加餐可以预防三餐之间出现低血糖的现象，还能防止因为饥饿而在下一餐中暴饮暴食，摄入过多热量的情况发生。</p>因此，健康的小零食不仅有助于完成减脂目标，还能让你的减肥期丰富多彩，何乐而不为呢？下面就来推荐给大家10种好吃又健康的小零食。</p>
<title>用正确方式打开 MyGainer增健肌粉！ - MYPROTEIN™</title>
blah bhla
blah bhla

blah bhla



Search: (?s)(\x3ctitle>.*?\x3c/title>.*?)\x3ctitle>.*?\x3c/title>
Replace: $1

If you watch the status line at the bottom of the search/replace box you will see:

After pass 1: Replace All: 1 occurrence was replaced in entire file
After pass 2: Replace All: 1 occurrence was replaced in entire file
After pass 3: Replace All: 0 occurrences were replaced in entire file

While your examples had the titles on their own lines, I wrote the expression to allow them to be anywhere in a line and to span lines, as that’s what HTML allows. If you only want to support titles on a line by themselves, then we can add some anchoring:
Search: (?s)^(\x3ctitle>.*?\x3c/title>\R.*?\R)\x3ctitle>.*?\x3c/title>$
Replace: $1

Even that is not perfect, as it allows titles to span two or more lines. If you insist on only matching titles that sit on one line and do not span lines, then toggle the dot-matches-newline flag:
Search: ^(\x3ctitle>(?-s).*?\x3c/title>\R(?s).*?\R)\x3ctitle>(?-s).*?\x3c/title>$
Replace: $1

As you can see, the expression is getting more complicated to deal with the edge cases and requirements.