Regex: How to remove newline characters from a particular HTML tag?

  • R
    Robin Cruise
    last edited by Feb 27, 2022, 8:48 PM

    I have this HTML tag, which is interrupted by a line break (\n) after the word masuri:

    <p class="mb-40px">Aceasta este o melodie alcatuita din patru masuri:
    reluata apoi de catre instrumentul solist cu un cintec popular.</p>
    

    THE OUTPUT must be:

    <p class="mb-40px">Aceasta este o melodie alcatuita din patru masuri: reluata apoi de catre instrumentul solist cu un cintec popular.</p>
    

    I tried this regex, but it doesn't work well, because it also changes the rest of the HTML code, not just that particular tag.

    FIND: (?:<p class="mb-40px">|\G)(?:(?!</p>).)*?\K(\r\n|\r|\n)

    REPLACE BY: \x20

    I also found a solution by @neil-schipper on this forum, but I don't know how to adapt it to my HTML tag:

    FIND: (?<=[^\r\n])\R(?=[^\r\n])
    REPLACE BY: (LEAVE EMPTY)

    • A
      Alan Kilborn @Robin Cruise
      last edited by Feb 27, 2022, 8:53 PM

      @robin-cruise

      This is just a (by now) simple replace-but-only-between-delimiters problem; see HERE for the templatized solution.

      • R
        Robin Cruise @Alan Kilborn
        last edited by Feb 27, 2022, 10:10 PM

        @alan-kilborn THANKS, it works !!

        Find: (?-i:<p class="mb-40px">|(?!\A)\G)(?s:(?!</p>).)*?\K(?-i:(?<=[^\r\n])\R(?=[^\r\n]))

        Replace by: \x20

        • R
          Robin Cruise @Robin Cruise
          last edited by Robin Cruise Feb 28, 2022, 7:12 AM

          Another solution, using (\r\n|\r|\n) to match the line ending:

          FIND: (<p class="mb-40px">)+(.)+\K(\r\n|\r|\n)(?=.*<\/p>)

          REPLACE BY: \x20

          The generic regex formula below can be made much simpler than the ones @guy038 has posted in many of his other generic regex formulas:

          (REGION-START)+(.)+\K(FIND REGEX)(?=.*REGION-FINAL)

          • A
            Alan Kilborn @Robin Cruise
            last edited by Feb 28, 2022, 12:35 PM

            @robin-cruise said:

            The generic regex formula below can be made much simpler than the ones @guy038 has posted

            Why should you be believed over @guy038 ?

            • H
              Hellena Crainicu
              last edited by Mar 9, 2022, 7:10 AM

              @alan-kilborn @guy038

              Another alternative to Robin's generic regex, a better version, could be:

              (REGION-START)+(.)+\K(FIND REGEX)(?s:(?=.*(REGION-FINAL)))

              • G
                guy038
                last edited by guy038 Mar 10, 2022, 1:56 PM Mar 10, 2022, 1:20 PM

                Hello, @robin-cruise, @alan-kilborn, @hellena-crainicu and All,

                Referring to my first post about a generic regex, below:

                https://community.notepad-plus-plus.org/post/75007

                and as Robin wants to search for line-ending chars, we need to use, of course, the complete generic regex S/R:

                SEARCH (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?-si:FR)

                REPLACE RR

                and not the simplified single-line version


                So :

                • The FR regex is just \R, as the associated non-capturing group, beginning with (?-si:..., is useless in this case

                • The RR regex is \x20

                • The BSR regex may be strictly the string <p class="mb-40px"> but may also be expressed as <p class=".+?">

                • The ESR regex is, of course, the ending tag </p>, which must never occur before the next line-ending to replace

                giving the functional regex S/R :

                SEARCH (?-si:<p class=".+?">|(?!\A)\G)(?s-i:(?!</p>).)*?\K\R

                REPLACE \x20
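Filling in the placeholders by hand is easy to get wrong, so here is a minimal Python sketch that only assembles the search string from the generic template quoted above. The name `build_search` is hypothetical. The result is meant to be pasted into the Notepad++ "Find what" field, not run in Python, because Python's `re` module supports none of `\G`, `\K` or `\R`.

```python
# Template copied from the generic S/R above. The placeholders stand for
# BSR (start of region), ESR (end of region) and FR (what to find).
TEMPLATE = r"(?-si:{bsr}|(?!\A)\G)(?s-i:(?!{esr}).)*?\K(?-si:{fr})"


def build_search(bsr: str, esr: str, fr: str) -> str:
    """Return the Notepad++ search regex for the given BSR, ESR and FR."""
    return TEMPLATE.format(bsr=bsr, esr=esr, fr=fr)


# Values taken from the bullet list above.
print(build_search(r'<p class=".+?">', r"</p>", r"\R"))
```

Unlike the hand-written regex, this sketch always keeps the `(?-si:...)` group around FR. That is harmless here, just slightly longer.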

                Test this regex S/R against the following text:

                <a href="https://www.w3schools.com/">We strongly suggest
                to visit the
                w3schools.com
                site</a>
                
                <p class="mb-40px">Aceasta
                este o melodie alcatuita
                din patru masuri:
                reluata apoi de catre instrumentul solist
                cu un cintec popular.</p>
                
                <p class="Test">A SINGLE line</p>
                
                <h1>this is
                my very
                first heading
                </h1>
                
                <p class="123-456 789">This is	
                a quick
                text to
                verify if it
                replaces line-endings
                by a space char in <p>
                tags ONLY</p>
                

                ONLY the <p class="...">...</p> blocks, multi-line or not, should be affected by the replacement!

                Of course, these HTML fragments do not form a valid HTML file and are just used to verify the regex S/R!
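For anyone who needs the same result outside Notepad++, here is a minimal Python sketch. It deliberately swaps in a different technique, because Python's `re` module has no `\G`, `\K` or `\R`: first match each whole `<p class="...">...</p>` block, then replace the line endings inside that block through a callback. The name `join_p_lines` is hypothetical, and the sketch assumes that `<p>` blocks are not nested.

```python
import re

# One whole <p class="...">...</p> block. Only the body uses (?s:...), so "."
# may cross line endings there; the lazy ".*?" stops at the first </p>.
P_BLOCK = re.compile(r'<p class=".+?">(?s:.*?)</p>')
# Stand-in for \R, limited to the three common line endings.
LINE_ENDING = re.compile(r"\r\n|\r|\n")


def join_p_lines(html: str) -> str:
    """Replace line endings with a space, but only inside <p class=...> blocks."""
    return P_BLOCK.sub(lambda m: LINE_ENDING.sub(" ", m.group(0)), html)


sample = (
    '<p class="mb-40px">Aceasta\n'
    "este o melodie alcatuita\n"
    "din patru masuri:</p>\n"
    "\n"
    "<h1>this is\n"
    "my very\n"
    "first heading\n"
    "</h1>\n"
)
print(join_p_lines(sample))
```

The `<h1>` block keeps its line endings because the callback only ever sees text that lies between the opening `<p class=...>` and its closing `</p>`.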


                Now, the generic variants proposed by @Robin-cruise and @hellena-crainicu, with only a final look-ahead containing the ESR region, will not work most of the time :-(

                SEARCH (?-si:BSR|(?!\A)\G).*?\K(?-si:FR)(?=(?s-i:.*?ESR))

                In our case, the functional regex S/R becomes :

                SEARCH (?-si:<p class=".+?">|(?!\A)\G).*?\K\R(?=(?s-i:.*?</p>))

                REPLACE \x20

                But if you test it against, for instance :

                
                <p class="Test">Several
                consecutive
                lines</p>
                
                <h1>this is
                my very
                first heading
                </h1>
                
                <p class="Test">A SINGLE line</p>
                
                <h2>this is
                my second
                heading
                </h2>
                

                It would concatenate all text up to the last </p> of the file, leaving only the last <h2> tag untouched. You could say: "But I did add a final question mark in order to get a lazy range of chars before </p>!"

                You're right! But remember that the regex engine tries, by all means, to find a match. So it matches the CRLF chars that follow lines</p>, because the regex engine considers that the lazy .*? range of chars begins immediately after that line-ending and continues until right before the final </p>, which makes the look-ahead assertion true!

                Thus, testing that the ESR region is not reached at any position before the NEXT FR match seems to be the only method that works properly!
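The weakness of a look-ahead-only check can be reproduced in a small Python sketch. It is a simplification: only the tail of the variant is kept (a line ending followed by a look-ahead for a later `</p>`), and the `\G` chaining is dropped because Python's `re` module does not support it. The helper name `lines_whose_ending_matches` is hypothetical.

```python
import re

# A line ending that is followed, anywhere later in the text, by </p>.
LOOKAHEAD_ONLY = re.compile(r"\r?\n(?=(?s:.*?</p>))")

text = (
    '<p class="Test">Several\n'
    "consecutive\n"
    "lines</p>\n"
    "\n"
    "<h1>this is\n"
    "my very\n"
    "first heading\n"
    "</h1>\n"
    "\n"
    '<p class="Test">A SINGLE line</p>\n'
    "\n"
    "<h2>this is\n"
    "my second\n"
    "heading\n"
    "</h2>\n"
)


def lines_whose_ending_matches(s: str) -> list:
    """Return the lines whose line ending satisfies the look-ahead."""
    return [
        s[s.rfind("\n", 0, m.start()) + 1 : m.start()]
        for m in LOOKAHEAD_ONLY.finditer(s)
    ]


for line in lines_whose_ending_matches(text):
    print(repr(line))
```

The printed lines include the whole `<h1>` block, since a `</p>` still exists further down. The look-ahead only proves that some `</p>` follows, not that the current position is inside a `<p>` block.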

                Best Regards

                guy038

                Reminder : Move to the very beginning of text before clicking on the Find Next or Replace All button !

                • A
                  Alan Kilborn @Alan Kilborn
                  last edited by Mar 10, 2022, 1:25 PM

                  @alan-kilborn said:

                  Why should you be believed over @guy038 ?

                  @guy038 said:

                  Now, the generic variants proposed by @Robin-cruise and @hellena-crainicu, with only a final look-ahead containing the ESR region, will not work most of the time :-(


                  @Robin-cruise and @hellena-crainicu :

                  Be careful of posting simplifications.

                  Probably best to leave these things to the “Master”. :-)

                  • H
                    Hellena Crainicu
                    last edited by Hellena Crainicu Aug 12, 2024, 9:46 AM Aug 12, 2024, 9:43 AM

                    The best solution is this:

                    (?-si:<p class=".+?">|(?!\A)\G)(?s-i:(?!</p>).)*?\K\s+

                    General regex: (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\KFR
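Note that using `\s+` as the FR is not the same as using `\R`: it matches every run of whitespace inside the region, not only line endings. The difference can be seen in a small Python sketch applied to the body of one block (the sample string is made up, modelled on the "This is" line with a trailing tab in the test text above). It assumes that `\s` treats spaces, tabs and line endings the same way in Python's `re` module and in Notepad++.

```python
import re

# Body of a <p> block: a trailing tab before the first line ending.
body = "This is\t\na quick\ntext to"

# Line endings only (Python stand-in for \R): the tab survives.
line_endings_only = re.sub(r"\r\n|\r|\n", " ", body)

# Every whitespace run (\s+): tab plus line ending collapse into one space.
whitespace_runs = re.sub(r"\s+", " ", body)

print(repr(line_endings_only))
print(repr(whitespace_runs))
```

So the `\s+` variant also normalizes stray tabs and repeated spaces inside the region, which may or may not be wanted.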
