    Regex: Delete all the instances of <title> html tag, except the first one

    • Hellena Crainicu

      I have several files that look like this:

      <title>用正确方式打开 MyGainer增健肌粉! - MYPROTEIN™</title>
      blah bhla
      blah bhla
      <title>Home is me</title>
      blah bhla
      <title>Payton is your name</title>
      

      I want a regex that deletes every line containing <title>.*</title> except the first one.

      My regex is not very good:

      FIND: (<title>.*?</title>)(?=(?:<title>|$)) or
      (?s-i)\A.*\K<title>(.*?)(.*?</title>)
      Replace by: \1

      I wrote some Python code that works well, but I need a regex for this job:
      --------------------------

      import re
      
      def keep_first_title_tag(extracted_content):
          # Find the text inside every `<title>` tag
          title_tags = re.findall(r'<title>(.*?)</title>', extracted_content, re.DOTALL)
      
          # Return only the text of the first `<title>` tag (raises IndexError if there is none)
          extracted_content = title_tags[0]
      
          return extracted_content
      
      
      extracted_content = """
      <title>用正确方式打开 MyGainer增健肌粉! - MYPROTEIN™</title>
      blah bhla
      blah bhla
      <title>Home is me</title>
      blah bhla
      <title>Payton is your name</title>
      """
      
      extracted_content = keep_first_title_tag(extracted_content)
      print(extracted_content)
      
      • Coises, replying to @Hellena Crainicu

        @Hellena-Crainicu

        If your example is precise in that the title line you want to keep is the very first line of the file, then you can use the fact that all the other title lines will be preceded by line ending characters; so use:

        \R<title>.*?</title>

        and replace with an empty string.
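
For anyone who wants to check this idea outside Notepad++, here is a minimal Python sketch. It assumes the standard `re` module, which has no `\R` escape, so the line ending is spelled out as `\r\n|\n|\r`. The function name `drop_later_titles` and the sample text are made up for illustration.

```python
import re

# Python's `re` has no \R, so the common line endings are listed explicitly.
LATER_TITLE = re.compile(r"(?:\r\n|\n|\r)<title>.*?</title>")


def drop_later_titles(text):
    """Remove every <title>...</title> that directly follows a line ending."""
    return LATER_TITLE.sub("", text)


sample = (
    "<title>First</title>\n"
    "blah bhla\n"
    "<title>Home is me</title>\n"
    "blah bhla\n"
    "<title>Payton is your name</title>\n"
)
result = drop_later_titles(sample)
print(result)
```

As in the Notepad++ version, this only protects the first title when it sits on the very first line, because that is the only title with no line ending in front of it.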

        • Hellena Crainicu, replying to @Coises

          @Coises great formula, and so simple. Thanks a lot!

          Also, here is an update with another piece of Python code that does the same thing:

              import regex
          
              def remove_last_title_tags(text):
                  # Find every `<title>` tag that starts at the beginning of the text or right after a newline
                  title_tags = regex.findall(r"(?<=^|\n)<title>.*?</title>", text, flags=regex.DOTALL)
          
                  # Remove every matched tag except the first one
                  for tag in title_tags[1:]:
                      text = text.replace(tag, "")
          
                  return text
          
              extracted_content = remove_last_title_tags(extracted_content)
          
          • Terry R, replying to @Hellena Crainicu

            @Hellena-Crainicu

            I agree with @Coises. In fact I had exactly the same regex. As asked, if it is definitely at the very start of the file, that should work.

            Otherwise, if the first <title> isn’t at the very start of the file, regex won’t be able to do this in one pass. The other option would be to find the first instance and tag it, then do a second pass to remove all other instances, and a third pass to remove the tag from the remaining <title>.

            Terry
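
The three-pass idea can be sketched in Python roughly as follows. This is only an illustration using the standard `re` module: the marker string `@@KEEP@@` and the function name `keep_first_title_three_pass` are hypothetical, and the marker is assumed not to occur anywhere in the file.

```python
import re

TITLE = re.compile(r"<title>.*?</title>", re.DOTALL)
MARK = "@@KEEP@@"  # hypothetical marker, assumed to be absent from the text


def keep_first_title_three_pass(text):
    # Pass 1: tag the first <title> so the next pass can tell it apart.
    text = TITLE.sub(lambda m: MARK + m.group(0), text, count=1)
    # Pass 2: remove every <title> that is not directly preceded by the marker.
    untagged = re.compile(
        r"(?<!" + re.escape(MARK) + r")<title>.*?</title>", re.DOTALL
    )
    text = untagged.sub("", text)
    # Pass 3: strip the marker again.
    return text.replace(MARK, "")


sample = (
    "<p>intro</p>\n"
    "<title>First</title>\n"
    "blah\n"
    "<title>Second</title>\n"
    "blah\n"
    "<title>Third</title>\n"
)
result = keep_first_title_three_pass(sample)
print(result)
```

The removed titles leave their line endings behind, so empty lines remain where the later titles used to be.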

            • mkupper, replying to @Hellena Crainicu

              @Hellena-Crainicu Another way to do this is:

              Search: (?s)(<title>.*?</title>.*?)<title>.*?</title>
              Replace: $1

              You would need to repeat this until it stops replacing.
              In summary:

              • (?s) puts the regex engine in dot-matches-newline mode, so “.” also matches end-of-line characters. Normally a scan for “.” stops at the end of the line.
              • (<title>.*?</title>.*?) grabs the first title and everything after it, up to the next title, using non-greedy scans.
              • <title>.*?</title> is the second title.

              Thus we save the first title and everything after it up to the second title, and discard the second title. Each replacement moves the search position past the match, so within a single Replace All the next match saves the third title and discards the fourth. Keep repeating. Eventually there will be just one title left, and it will always be the first one.
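
The "repeat until it stops replacing" loop can be sketched in Python as below. This is a rough equivalent using the standard `re` module, not the Notepad++ engine itself; the function name `keep_first_title_by_repeating` and the sample text are invented for the example.

```python
import re

# Same expression as above: keep a title plus what follows, drop the next title.
PAIR = re.compile(r"(<title>.*?</title>.*?)<title>.*?</title>", re.DOTALL)


def keep_first_title_by_repeating(text):
    """Repeat the replace-all until a pass replaces nothing.

    Returns the final text and the number of replacements made in each pass.
    """
    counts = []
    while True:
        text, n = PAIR.subn(r"\1", text)
        counts.append(n)
        if n == 0:
            return text, counts


sample = (
    "<p>intro</p>\n"
    "<title>First</title>\n"
    "blah\n"
    "<title>Second</title>\n"
    "blah\n"
    "<title>Third</title>\n"
)
final_text, counts = keep_first_title_by_repeating(sample)
print(counts)
print(final_text)
```

With three titles the per-pass counts come out as 1, 1, 0, and the first title survives even though it is not on the first line.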

              • Hellena Crainicu, replying to @mkupper

                @mkupper

                yes, but if I have text before the first <title> instance, it will delete exactly the <title> instances that are not needed.

                Try your regex with this example. You will see that the first instance of <title> is deleted, and that is exactly the one I need to keep.

                <p>但是减脂期可不一定就意味着每天只吃水煮鸡胸和水煮西蓝花,完全苦行僧一样的生活。如果在营养均衡的三餐之间适当的尝试一些健康的小零食,不仅能为减脂期提供动力和新鲜感,还可以为生活增添不少的趣味呢。</p>除此之外,从营养学角度,适当的加餐可以预防三餐之间出现低血糖的现象,还能防止因为饥饿而在下一餐中暴饮暴食,摄入过多热量的情况发生。</p>因此,健康的小零食不仅有助于完成减脂目标,还能让你的减肥期丰富多彩,何乐而不为呢?下面就来推荐给大家10种好吃又健康的小零食。</p>
                <title>用正确方式打开 MyGainer增健肌粉! - MYPROTEIN™</title>
                blah bhla
                blah bhla
                <title>Home is me</title>
                blah bhla
                <title>Payton is your name</title>
                
                • guy038

                  Hi, @hellena-crainicu, @coises, @terry-r, @mkupper and All,

                  @hellena-crainicu, we must take into account two cases:

                  A) The <title>.......</title> line is the very first one of your file(s) :

                  For this case, I found two other solutions:

                  • SEARCH (?s)(?!\A)<title>.*?</title>\R

                  • REPLACE Leave EMPTY

                  AND

                  • SEARCH (?s)\A(<title>.*?</title>\R)(*SKIP)(*F)|(?1)

                  • REPLACE Leave EMPTY

                  However, @coises’s formulation, with the leading modifier (?s)

                  • SEARCH (?s)\R<title>.*?</title>

                  • REPLACE Leave EMPTY

                  is really clever and definitely the best one, as the \R syntax is quicker to execute than the negative look-ahead (?!\A), which could matter if numerous files are concerned!


                  B) The <title>.......</title> line is NOT necessarily the very first one of your file(s) :

                  In this case, a solution, derived from my second formulation above, could be :

                  • SEARCH (?s)\A.*?(<title>.*?</title>\R)(*SKIP)(*F)|(?1)

                  • REPLACE Leave EMPTY
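
Python's standard `re` module does not support the `(*SKIP)(*F)` backtracking verbs or `\R`, so the sketch below gets a comparable result for case B in a different way: it locates the first title line, leaves everything up to and including it untouched, and removes title lines only from the remainder. The function name `keep_first_title` and the sample text are assumptions made for illustration.

```python
import re

# A title followed by a line ending; the ending is spelled out because
# Python's `re` has no \R.
TITLE_LINE = re.compile(r"<title>.*?</title>(?:\r\n|\n|\r)", re.DOTALL)


def keep_first_title(text):
    first = TITLE_LINE.search(text)
    if first is None:
        return text  # no title line at all, nothing to remove
    # Everything up to the end of the first title line is kept as-is;
    # title lines are deleted only from the text that follows it.
    head, tail = text[:first.end()], text[first.end():]
    return head + TITLE_LINE.sub("", tail)


sample = (
    "<p>intro</p>\n"
    "<title>First</title>\n"
    "blah\n"
    "<title>Second</title>\n"
    "blah\n"
    "<title>Third</title>\n"
)
result = keep_first_title(sample)
print(result)
```

Like the Notepad++ expression, this expects a line ending after each title, so a title at the very end of a file with no trailing line break would be left in place.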

                  Best Regards,

                  guy038

                  • mkupper, replying to @Hellena Crainicu

                    @Hellena-Crainicu I was puzzled by your comment. I suspect you were testing by having the expression you were testing with at the top of the file. In that case the first title was in the expression itself.

                    I modified my search expression slightly to replace < with \x3c so that we can have the search/replace expression within the file for testing. I put it at the bottom in these examples.

                    Here is the test I ran:

                    Original data

                    <p>但是减脂期可不一定就意味着每天只吃水煮鸡胸和水煮西蓝花,完全苦行僧一样的生活。如果在营养均衡的三餐之间适当的尝试一些健康的小零食,不仅能为减脂期提供动力和新鲜感,还可以为生活增添不少的趣味呢。</p>除此之外,从营养学角度,适当的加餐可以预防三餐之间出现低血糖的现象,还能防止因为饥饿而在下一餐中暴饮暴食,摄入过多热量的情况发生。</p>因此,健康的小零食不仅有助于完成减脂目标,还能让你的减肥期丰富多彩,何乐而不为呢?下面就来推荐给大家10种好吃又健康的小零食。</p>
                    <title>用正确方式打开 MyGainer增健肌粉! - MYPROTEIN™</title>
                    blah bhla
                    blah bhla
                    <title>Home is me</title>
                    blah bhla
                    <title>Payton is your name</title>
                    
                    
                    Search: (?s)(\x3ctitle>.*?\x3c/title>.*?)\x3ctitle>.*?\x3c/title>
                    Replace: $1
                    

                    First pass
                    This is after doing search-replace-all one time. It removed the second title that was on line 5.

                    <p>但是减脂期可不一定就意味着每天只吃水煮鸡胸和水煮西蓝花,完全苦行僧一样的生活。如果在营养均衡的三餐之间适当的尝试一些健康的小零食,不仅能为减脂期提供动力和新鲜感,还可以为生活增添不少的趣味呢。</p>除此之外,从营养学角度,适当的加餐可以预防三餐之间出现低血糖的现象,还能防止因为饥饿而在下一餐中暴饮暴食,摄入过多热量的情况发生。</p>因此,健康的小零食不仅有助于完成减脂目标,还能让你的减肥期丰富多彩,何乐而不为呢?下面就来推荐给大家10种好吃又健康的小零食。</p>
                    <title>用正确方式打开 MyGainer增健肌粉! - MYPROTEIN™</title>
                    blah bhla
                    blah bhla
                    
                    blah bhla
                    <title>Payton is your name</title>
                    
                    
                    Search: (?s)(\x3ctitle>.*?\x3c/title>.*?)\x3ctitle>.*?\x3c/title>
                    Replace: $1
                    

                    Second pass
                    This is after doing search-replace-all twice. The first pass removed the second title that was on line 5 and the second pass removed the third title that was on line 7.

                    <p>但是减脂期可不一定就意味着每天只吃水煮鸡胸和水煮西蓝花,完全苦行僧一样的生活。如果在营养均衡的三餐之间适当的尝试一些健康的小零食,不仅能为减脂期提供动力和新鲜感,还可以为生活增添不少的趣味呢。</p>除此之外,从营养学角度,适当的加餐可以预防三餐之间出现低血糖的现象,还能防止因为饥饿而在下一餐中暴饮暴食,摄入过多热量的情况发生。</p>因此,健康的小零食不仅有助于完成减脂目标,还能让你的减肥期丰富多彩,何乐而不为呢?下面就来推荐给大家10种好吃又健康的小零食。</p>
                    <title>用正确方式打开 MyGainer增健肌粉! - MYPROTEIN™</title>
                    blah bhla
                    blah bhla
                    
                    blah bhla
                    
                    
                    
                    Search: (?s)(\x3ctitle>.*?\x3c/title>.*?)\x3ctitle>.*?\x3c/title>
                    Replace: $1
                    

                    If you watch the status line at the bottom of the search/replace box you will see:

                    After pass 1: Replace All: 1 occurrence was replaced in entire file
                    After pass 2: Replace All: 1 occurrence was replaced in entire file
                    After pass 3: Replace All: 0 occurrences were replaced in entire file
                    

                    While your examples had the titles on their own lines, I had coded the expression to allow them anywhere in a line and to span lines, as that’s what HTML allows. If you only want to support titles that are on a line by themselves, then we can add some anchoring:
                    Search: (?s)^(\x3ctitle>.*?\x3c/title>\R.*?\R)\x3ctitle>.*?\x3c/title>$
                    Replace: $1

                    Even that is not perfect, as it allows titles to span two or more lines. If you insist on only matching titles that sit on one line and do not span lines, then toggle the dot/EOL spanner flag:
                    Search: ^(\x3ctitle>(?-s).*?\x3c/title>\R(?s).*?\R)\x3ctitle>(?-s).*?\x3c/title>$
                    Replace: $1

                    As you can see, the expression is getting more complicated to deal with the edge cases and requirements.
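
The last, single-line variant can be approximated in Python as sketched below. Python's `re` does not allow flags to be switched on and off in the middle of a pattern the way `(?-s)` and `(?s)` are used above, so this sketch uses the scoped form `(?s:...)` instead, together with `re.MULTILINE` for the `^` and `$` anchors. It also assumes `\n` line endings, and the function name `keep_first_title_line` is hypothetical.

```python
import re

# Titles must fill a whole line; only the text between the two titles
# is allowed to span lines, via the scoped (?s:...) group.
ANCHORED_PAIR = re.compile(
    r"^(<title>.*?</title>\n(?s:.*?)\n)<title>.*?</title>$",
    re.MULTILINE,
)


def keep_first_title_line(text):
    """Repeat the replacement until a pass no longer changes anything."""
    while True:
        text, n = ANCHORED_PAIR.subn(r"\1", text)
        if n == 0:
            return text


sample = (
    "<p>intro</p>\n"
    "<title>First</title>\n"
    "blah\n"
    "blah\n"
    "<title>Home is me</title>\n"
    "blah\n"
    "<title>Payton is your name</title>\n"
)
result = keep_first_title_line(sample)
print(result)
```

A title that shares its line with other text is not matched by this pattern and therefore stays in place.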

                    The Community of users of the Notepad++ text editor.