Regex: Delete all the instances of <title> html tag, except the first one

  • H
    Hellena Crainicu
    last edited by Hellena Crainicu Dec 3, 2023, 8:24 PM Dec 3, 2023, 8:15 PM

    I have several files that look like this:

    <title>用正确方式打开 MyGainer增健肌粉! - MYPROTEIN™</title>
    blah bhla
    blah bhla
    <title>Home is me</title>
    blah bhla
    <title>Payton is your name</title>
    

    I want a regex that deletes all lines containing <title>.*</title> except the first one:

    My regex is not very good:

    FIND: (<title>.*?</title>)(?=(?:<title>|$)) or
    (?s-i)\A.*\K<title>(.*?)(.*?</title>)
    Replace by: \1

    I wrote some Python code that works well, but I need a regex for this job:

    import re
    
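    def keep_first_title_tag(extracted_content):
        # Collect the inner text of every `<title>...</title>` pair
        title_tags = re.findall(r'<title>(.*?)</title>', extracted_content, re.DOTALL)
    
        # Return only the first title's text (the rest of the content is discarded);
        # if there is no title at all, return the content unchanged
        return title_tags[0] if title_tags else extracted_content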
    
    
    extracted_content = """
    <title>用正确方式打开 MyGainer增健肌粉! - MYPROTEIN™</title>
    blah bhla
    blah bhla
    <title>Home is me</title>
    blah bhla
    <title>Payton is your name</title>
    """
    
    extracted_content = keep_first_title_tag(extracted_content)
    print(extracted_content)
    
    • C
      Coises @Hellena Crainicu
      last edited by Dec 3, 2023, 8:26 PM

      @Hellena-Crainicu

      If your example is precise in that the title line you want to keep is the very first line of the file, then you can use the fact that all the other title lines will be preceded by line ending characters; so use:

      \R<title>.*?</title>

      and replace with an empty string.
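For anyone scripting this outside Notepad++, here is a minimal sketch of the same idea in Python. It rests on a few assumptions: Python's standard `re` module has no `\R`, so the line break is spelled out as `(?:\r\n|\r|\n)`; each title sits on a single line; and the name `drop_later_titles` is made up for illustration.

```python
import re

# Assumption: Python's `re` has no \R, so the line break is written out explicitly.
LATER_TITLE = re.compile(r"(?:\r\n|\r|\n)<title>.*?</title>")

def drop_later_titles(text):
    """Delete every <title>...</title> that directly follows a line break."""
    return LATER_TITLE.sub("", text)

sample = (
    "<title>First</title>\n"
    "blah bhla\n"
    "<title>Home is me</title>\n"
    "blah bhla\n"
    "<title>Payton is your name</title>\n"
)
print(drop_later_titles(sample))
```

As with the original expression, this only protects the first title when it is the very first line of the text; a title that follows any line break is removed.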

      • H
        Hellena Crainicu @Coises
        last edited by Hellena Crainicu Dec 3, 2023, 8:49 PM Dec 3, 2023, 8:30 PM

        @Coises super formula, was so easy. Thanks a lot !

        Also, here is an update with another piece of Python code that does the same thing:

            import regex
        
            def remove_last_title_tags(text):
                # Find every `<title>...</title>` that starts at the beginning of a line
                title_tags = regex.findall(r"(?<=^|\n)<title>.*?</title>", text, flags=regex.DOTALL)
        
                # Remove every match except the first one (index 0), working backwards.
                # Note: str.replace removes all copies of the string, so a later title
                # identical to the first would remove the first one too.
                for tag in reversed(title_tags[1:]):
                    text = text.replace(tag, "")
        
                return text
        
            extracted_content = remove_last_title_tags(extracted_content)
        
        • T
          Terry R @Hellena Crainicu
          last edited by Terry R Dec 3, 2023, 8:31 PM Dec 3, 2023, 8:31 PM

          @Hellena-Crainicu

          I agree with @Coises. In fact I had exactly the same regex. As asked, if it is definitely at the very start of the file, that should work.

          Otherwise, if the first <title> isn’t at the very start of the file, a regex won’t be able to do this in 1 pass. The other option would be to find the first instance and tag it, then do a second pass to remove all other instances, and a third pass to remove the tag from the remaining <title>.
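The tag-then-clean-up idea can be sketched in Python as three plain steps. This is only an illustration under stated assumptions: the marker `<title-keep>` and the function name `keep_first_title_three_pass` are hypothetical, and the marker is assumed never to occur in the files.

```python
import re

TITLE = re.compile(r"<title>.*?</title>", re.DOTALL)
KEEP_TAG = "<title-keep>"  # hypothetical marker, assumed absent from the input

def keep_first_title_three_pass(text):
    # Pass 1: tag the first opening <title> so the next pass cannot match it.
    tagged = text.replace("<title>", KEEP_TAG, 1)
    # Pass 2: delete every remaining <title>...</title>.
    pruned = TITLE.sub("", tagged)
    # Pass 3: turn the marker back into a normal <title>.
    return pruned.replace(KEEP_TAG, "<title>", 1)

sample = (
    "<p>intro</p>\n"
    "<title>First</title>\n"
    "blah\n"
    "<title>Second</title>\n"
    "blah\n"
    "<title>Third</title>\n"
)
print(keep_first_title_three_pass(sample))
```

The removed titles leave empty lines behind, because only the tags and their text are deleted, not the line breaks.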

          Terry

          • M
            mkupper @Hellena Crainicu
            last edited by Dec 4, 2023, 4:24 AM

            @Hellena-Crainicu Another way to do this is:

            Search: (?s)(<title>.*?</title>.*?)<title>.*?</title>
            Replace: $1

            You would need to repeat this until it stops replacing.
            In summary:

            • (?s) puts the regexp engine in dot matches newline mode meaning scans for “.” also include end of line characters. Normally a scan for “.” stops at the end of the line.
            • (<title>.*?</title>.*?) grabs the first title and everything after the first title using a non-greedy scan.
            • <title>.*?</title> is the second title.

            Thus we save the first title and everything after it up to the second title, and discard the second title. Each search/replace resumes after the previous match, so the next match saves the third title and discards the fourth. Keep repeating. Eventually there will be just one title left, and it will always be the first one.
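The repeat-until-nothing-changes procedure can be sketched in Python with `re.subn`, which returns the number of replacements made, much like the count in the Replace All status line. The function name `keep_first_title_repeat` is hypothetical, and `re.DOTALL` stands in for the `(?s)` modifier.

```python
import re

# Same expression as above; re.DOTALL plays the role of (?s).
PAIR = re.compile(r"(<title>.*?</title>.*?)<title>.*?</title>", re.DOTALL)

def keep_first_title_repeat(text):
    """Apply the pair replacement until it stops replacing.

    Returns the cleaned text and the number of passes that changed something.
    """
    passes = 0
    while True:
        text, replaced = PAIR.subn(r"\1", text)
        if replaced == 0:
            return text, passes
        passes += 1

sample = (
    "<p>intro</p>\n"
    "<title>First</title>\n"
    "blah\n"
    "<title>Second</title>\n"
    "blah\n"
    "<title>Third</title>\n"
)
cleaned, passes = keep_first_title_repeat(sample)
print(passes)
print(cleaned)
```

With three titles the loop needs two changing passes plus a final pass that replaces nothing.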

            • H
              Hellena Crainicu @mkupper
              last edited by Dec 5, 2023, 10:42 AM

              @mkupper

              yes, but if I have text before the first <title> instance, it will delete exactly the <title> instances that are not needed.

              Try your regex with this example. You will see that the first instance of <title> is deleted, and that is exactly the one I need to keep.

              <p>但是减脂期可不一定就意味着每天只吃水煮鸡胸和水煮西蓝花,完全苦行僧一样的生活。如果在营养均衡的三餐之间适当的尝试一些健康的小零食,不仅能为减脂期提供动力和新鲜感,还可以为生活增添不少的趣味呢。</p>除此之外,从营养学角度,适当的加餐可以预防三餐之间出现低血糖的现象,还能防止因为饥饿而在下一餐中暴饮暴食,摄入过多热量的情况发生。</p>因此,健康的小零食不仅有助于完成减脂目标,还能让你的减肥期丰富多彩,何乐而不为呢?下面就来推荐给大家10种好吃又健康的小零食。</p>
              <title>用正确方式打开 MyGainer增健肌粉! - MYPROTEIN™</title>
              blah bhla
              blah bhla
              <title>Home is me</title>
              blah bhla
              <title>Payton is your name</title>
              
              • G
                guy038
                last edited by guy038 Dec 5, 2023, 2:34 PM Dec 5, 2023, 12:40 PM

                Hi, @hellena-crainicu, @coises, @terry-r, @mkupper and All,

                @hellena-crainicu, we must take into account two cases :

                A) The <title>.......</title> line is the very first one of your file(s) :

                Then, I personally found two other solutions :

                • SEARCH (?s)(?!\A)<title>.*?</title>\R

                • REPLACE Leave EMPTY

                AND

                • SEARCH (?s)\A(<title>.*?</title>\R)(*SKIP)(*F)|(?1)

                • REPLACE Leave EMPTY

                However, @coises’s formulation, with the leading modifier (?s)

                • SEARCH (?s)\R<title>.*?</title>

                • REPLACE Leave EMPTY

                is really clever and definitely the best one, as the \R syntax is quicker to execute than the negative look-ahead (?!\A), which could matter if numerous files are concerned !


                B) The <title>.......</title> line is NOT necessarily the very first one of your file(s) :

                In this case, a solution, derived from my second formulation above, could be :

                • SEARCH (?s)\A.*?(<title>.*?</title>\R)(*SKIP)(*F)|(?1)

                • REPLACE Leave EMPTY
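Python's standard `re` module supports neither the backtracking control verbs `(*SKIP)(*F)` nor the `(?1)` subroutine call, so the sketch below swaps in a different technique to get the same effect for case B: a replacement callback that keeps the first title line and deletes every later one. The function name `keep_first_title_line` is hypothetical, and `(?:\r\n|\r|\n)` stands in for `\R`.

```python
import re

# A title followed by its line break; re.DOTALL plays the role of (?s).
TITLE_LINE = re.compile(r"<title>.*?</title>(?:\r\n|\r|\n)", re.DOTALL)

def keep_first_title_line(text):
    seen_first = False

    def replace(match):
        nonlocal seen_first
        if not seen_first:
            seen_first = True
            return match.group(0)  # keep the first title line untouched
        return ""                  # delete every later title line

    return TITLE_LINE.sub(replace, text)

sample = (
    "<p>intro</p>\n"
    "<title>First</title>\n"
    "blah\n"
    "<title>Second</title>\n"
    "blah\n"
    "<title>Third</title>\n"
)
print(keep_first_title_line(sample))
```

Like the search expression above, this needs a line break after each title, so a title on the last line of a file without a trailing line break is left alone.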

                Best Regards,

                guy038

                • M
                  mkupper @Hellena Crainicu
                  last edited by Dec 5, 2023, 5:00 PM

                  @Hellena-Crainicu I was puzzled by your comment. I suspect you were testing by having the expression you were testing with at the top of the file. In that case the first title was in the expression itself.

                  I modified my search expression slightly to replace < with \x3c so that we can have the search/replace expression within the file for testing. I put it at the bottom in these examples.

                  Here is the test I ran:

                  Original data

                  <p>但是减脂期可不一定就意味着每天只吃水煮鸡胸和水煮西蓝花,完全苦行僧一样的生活。如果在营养均衡的三餐之间适当的尝试一些健康的小零食,不仅能为减脂期提供动力和新鲜感,还可以为生活增添不少的趣味呢。</p>除此之外,从营养学角度,适当的加餐可以预防三餐之间出现低血糖的现象,还能防止因为饥饿而在下一餐中暴饮暴食,摄入过多热量的情况发生。</p>因此,健康的小零食不仅有助于完成减脂目标,还能让你的减肥期丰富多彩,何乐而不为呢?下面就来推荐给大家10种好吃又健康的小零食。</p>
                  <title>用正确方式打开 MyGainer增健肌粉! - MYPROTEIN™</title>
                  blah bhla
                  blah bhla
                  <title>Home is me</title>
                  blah bhla
                  <title>Payton is your name</title>
                  
                  
                  Search: (?s)(\x3ctitle>.*?\x3c/title>.*?)\x3ctitle>.*?\x3c/title>
                  Replace: $1
                  

                  First pass
                  This is after doing search-replace-all one time. It removed the second title that was on line 5.

                  <p>但是减脂期可不一定就意味着每天只吃水煮鸡胸和水煮西蓝花,完全苦行僧一样的生活。如果在营养均衡的三餐之间适当的尝试一些健康的小零食,不仅能为减脂期提供动力和新鲜感,还可以为生活增添不少的趣味呢。</p>除此之外,从营养学角度,适当的加餐可以预防三餐之间出现低血糖的现象,还能防止因为饥饿而在下一餐中暴饮暴食,摄入过多热量的情况发生。</p>因此,健康的小零食不仅有助于完成减脂目标,还能让你的减肥期丰富多彩,何乐而不为呢?下面就来推荐给大家10种好吃又健康的小零食。</p>
                  <title>用正确方式打开 MyGainer增健肌粉! - MYPROTEIN™</title>
                  blah bhla
                  blah bhla
                  
                  blah bhla
                  <title>Payton is your name</title>
                  
                  
                  Search: (?s)(\x3ctitle>.*?\x3c/title>.*?)\x3ctitle>.*?\x3c/title>
                  Replace: $1
                  

                  Second pass
                  This is after doing search-replace-all twice. The first pass removed the second title that was on line 5 and the second pass removed the third title that was on line 7.

                  <p>但是减脂期可不一定就意味着每天只吃水煮鸡胸和水煮西蓝花,完全苦行僧一样的生活。如果在营养均衡的三餐之间适当的尝试一些健康的小零食,不仅能为减脂期提供动力和新鲜感,还可以为生活增添不少的趣味呢。</p>除此之外,从营养学角度,适当的加餐可以预防三餐之间出现低血糖的现象,还能防止因为饥饿而在下一餐中暴饮暴食,摄入过多热量的情况发生。</p>因此,健康的小零食不仅有助于完成减脂目标,还能让你的减肥期丰富多彩,何乐而不为呢?下面就来推荐给大家10种好吃又健康的小零食。</p>
                  <title>用正确方式打开 MyGainer增健肌粉! - MYPROTEIN™</title>
                  blah bhla
                  blah bhla
                  
                  blah bhla
                  
                  
                  
                  Search: (?s)(\x3ctitle>.*?\x3c/title>.*?)\x3ctitle>.*?\x3c/title>
                  Replace: $1
                  

                  If you watch the status line at the bottom of the search/replace box you will see:

                  After pass 1: Replace All: 1 occurrence was replaced in entire file
                  After pass 2: Replace All: 1 occurrence was replaced in entire file
                  After pass 3: Replace All: 0 occurrences were replaced in entire file
                  

                  While your examples had the titles on their own lines, I coded the expression to allow them to be anywhere in a line and to span lines, as that’s what HTML allows. If you only want to support a title on a line by itself, then we can add some anchoring:
                  Search: (?s)^(\x3ctitle>.*?\x3c/title>\R.*?\R)\x3ctitle>.*?\x3c/title>$
                  Replace: $1

                  Even that is not perfect, as it allows titles to span two or more lines. If you insist on only matching titles that sit on one line and do not span lines, then toggle the dot-matches-newline flag:
                  Search: ^(\x3ctitle>(?-s).*?\x3c/title>\R(?s).*?\R)\x3ctitle>(?-s).*?\x3c/title>$
                  Replace: $1

                  As you can see, the expression is getting more complicated to deal with the edge cases and requirements.

                  The Community of users of the Notepad++ text editor.