    Regex: Remove particular words from tags in several text pages

    12 Posts 4 Posters 6.1k Views
    • Vasile CarausV
      Vasile Caraus
      last edited by Vasile Caraus

      Hi, I need a little help. In many files, I want to find every &nbsp; inside this particular tag and replace it with a space:

      <p class="my_class">An Extension&nbsp;of Java for Event Correlation. 571 geographical/logical coordinates, or sources. Henceforth,&nbsp;we will use the term&nbsp;events to refer to&nbsp;both the incidents underlying such&nbsp;events as well as to their incarnations&nbsp;and notifications. </p>

      I made a regex for this, but it is not quite right:

      Search: (?:([<p class="my_class">&</p>]*)&nbsp;)?
      Replace by: \1 \2

      Can anyone help me a little bit?

      • guy038G
        guy038
        last edited by guy038

        Hi, @vasile-Caraus and All,

        UPDATED on 01/12/2018 : This post is, from now on, obsolete and you should look at the newer posts, below, which offer a better solution :-))

        First, Vasile, I supposed that your particular tag, <p class="my_class">........</p>, lies on a single line, only !

        Then, as all the &nbsp; strings have to be replaced by a regular space character, till the end of the line, if the current line contains the particular tag my_class, I split the problem into 3 smaller ones :

        • First, add a dummy character, right after the ending </p> of this particular tag

        • Then, change any &nbsp; string into a \x20 character, ONLY IF this dummy character exists, further on, in the current line

        • Finally, remove this dummy character

        Remarks :

        • As usual, you may choose any symbol to represent this dummy character. Only one condition : it must not already exist in your file !

        • I chose the # character, but, of course, the &, @, ~,… characters would have been OK, too !


        Each part needs a global regex S/R, with the options Wrap around and Regular expression ticked, and a click on the Replace All button

        SEARCH (?-si)<p class="my_class".+?</p>

        REPLACE $0#

        SEARCH &nbsp;(?=.+#)

        REPLACE \x20

        SEARCH #

        REPLACE EMPTY
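
The three steps above can be sketched in Python. This is only an illustration, not the Notepad++ engine: the sample text is a shortened, hypothetical version of the line from the question, the `(?-si)` modifiers are dropped because Python's `re` is case-sensitive and keeps the dot on one line by default, and `$0#` is written as a small callback.

```python
import re

# Hypothetical, shortened sample: one tagged line and one untagged line.
TAG_LINE = ('<p class="my_class">An Extension&nbsp;of Java for Event Correlation. '
            'Henceforth,&nbsp;we will use the term&nbsp;events. </p>')
OTHER_LINE = 'A line&nbsp;without the tag'
text = TAG_LINE + '\n' + OTHER_LINE

DUMMY = '#'                 # the dummy character must not occur in the text
assert DUMMY not in text

# Step 1: append the dummy character right after each my_class paragraph.
step1 = re.sub(r'<p class="my_class".+?</p>', lambda m: m.group(0) + DUMMY, text)

# Step 2: replace &nbsp; only if the dummy character follows on the same line.
step2 = re.sub(r'&nbsp;(?=.+' + re.escape(DUMMY) + ')', ' ', step1)

# Step 3: remove the dummy character.
result = step2.replace(DUMMY, '')
print(result)
```

The untagged line keeps its &nbsp; because no dummy character ever appears on it.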


        But the nice thing, Vasile, is that we can combine these three regexes into a single one :

        SEARCH (?-si)<p class="(my_class)".+?</p>(?!#)|(&nbsp;)(?=.+#)|#

        REPLACE (?1$0#)(?2\x20)

        IMPORTANT : you’ll have to click TWICE, on the Replace All button


        How does this regex S/R work ?

        • At the beginning, the modifiers (?-si) ensure that :

          • The search is performed in a case-sensitive way ( -i = NON-insensitive )

          • The dot . will match standard characters, ONLY ( -s = NO single line )

        • Then, the search is an alternative between 3 possibilities :

          • Your particular tag <p class="(my_class)".+?</p>(?!#), ONLY IF not followed by a # symbol ! Note that, in this case, group 1, my_class, exists

          • The regex (&nbsp;)(?=.+#), which tries to match any &nbsp; string, IF a # character exists, further on, on the same line. Note that, in this case, group 2, &nbsp;, exists

          • The literal # symbol

        • It is interesting to notice that, while clicking a first time, on the Replace All button, ONLY the first alternative is matched, because no # exists, at this time !

        • After a second click on the Replace All button, due to the (?!#) syntax, the first alternative never occurs, and the second alternative may occur one or several times, on the current line, if the &nbsp; string is found

        • Finally, the third alternative is always found, after the closing tag </p> !

        Now :

        • During the first global replacement, as group 1 exists, the part (?1$0#) is executed : so the entire particular tag is rewritten, followed by the # symbol

        • During the second global replacement :

          • When group 2 exists, the part (?2\x20) is executed : so any &nbsp; string is replaced by a space character

          • When the # symbol is matched, after the ending tag, neither group 1 nor group 2 exists, so nothing is written in replacement and the # symbol is, therefore, deleted !
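
The two-click behaviour described above can be reproduced with a rough Python sketch. Python's `re` has no conditional replacement syntax like `(?1$0#)(?2\x20)`, so a callback (the hypothetical `replace_once`) stands in for it, and the sample line is a shortened, made-up version of the original one.

```python
import re

# Same three alternatives as the combined search regex, without (?-si),
# which matches Python's defaults anyway.
COMBINED = re.compile(r'<p class="(my_class)".+?</p>(?!#)|(&nbsp;)(?=.+#)|#')

def replace_once(m):
    if m.group(1):        # first alternative: whole tag, not yet followed by #
        return m.group(0) + '#'
    if m.group(2):        # second alternative: &nbsp; with a # later on the line
        return ' '
    return ''             # third alternative: the # marker itself is deleted

line = '<p class="my_class">An Extension&nbsp;of Java.&nbsp;Events </p>'
first_pass = COMBINED.sub(replace_once, line)           # first "Replace All"
second_pass = COMBINED.sub(replace_once, first_pass)    # second "Replace All"
third_pass = COMBINED.sub(replace_once, second_pass)    # marker is added again

print(first_pass)
print(second_pass)
```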


        Remarks :

        • Once all the replacements done, you may re-execute this S/R again. It doesn’t matter ! Indeed :

          • A first click on Replace All adds the # symbol, right after the range <p class="my_class">........</p>

          • A second click on Replace All removes the # symbol, right after the range <p class="my_class">........</p>

        • Of course, this regex works, also, if your particular tag <p class="my_class">........</p>, is “glued” within other text, in the same line !

        Cheers,

        guy038

        • Vasile CarausV
          Vasile Caraus
          last edited by Vasile Caraus

          super, thanks a lot, guy038 !!!

          • Vasile CarausV
            Vasile Caraus
            last edited by

            I just found another beautiful answer:

            Search: (?:\G(?!^)|<p\s+class="my_class">)(?:(?!</p>).)*?\K&nbsp;
            Replace: SPACE

            • guy038G
              guy038
              last edited by guy038

              Hi, @vasile-caraus and All,

              You found a very very clever regex, indeed ! Bravo :-))

              This nice regex makes good use of the special \G assertion. This assertion represents the zero-length location of either :

              • The very beginning of the file

              • The end of the PREVIOUS search

              • The cursor, deliberately moved by the user


              To give you a rapid example of how this assertion works, open a new tab and type in these three identical lines, below :

              1234567890
              1234567890
              1234567890
              
              • Now, move the caret at the very beginning ( Ctrl + Home )

              • Open the Find dialog ( Ctrl + F )

              • Type in the regex (?-s)\G...

              • Then click, repeatedly, on the Find next button

              You should match the strings 123, 456 and 789, then the search should stop, with the cursor between digits 9 and 0, in the first line. Why ?

              • Well, as the string 123 begins the file, it was normally matched. Then, as the string 456 immediately follows the string 123, it is also matched by the regex and the same for the 789 string.

              • Now, in order to get another range of 3 standard characters, the cursor should jump to the beginning of the second line ! But this new location does not meet any of the three possible \G locations, given above !

              • Suppose you move the cursor, on purpose, between the digits 4 and 5 of the second line : again, the next clicks on the Find Next button would match the strings 567, because of the new cursor location and the 890 string because it follows the previous search. But, again, the search would stop, at the end of the second line !
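
The experiment above can be imitated in Python, with one caveat: the standard `re` module has no `\G`. A compiled pattern's `match(text, pos)` only succeeds exactly at `pos`, so restarting it at the end of the previous match gives a similar chaining effect. The helper name `chained_matches` is an invention for this sketch.

```python
import re

text = '1234567890\n1234567890\n1234567890\n'
three_chars = re.compile(r'...')   # by default the dot does not match a newline

def chained_matches(pattern, text, start):
    """Collect matches that each begin exactly where the previous one ended."""
    found, pos = [], start
    while True:
        m = pattern.match(text, pos)        # anchored at pos, much like \G
        if m is None or m.end() == pos:     # no match, or an empty match
            break
        found.append(m.group(0))
        pos = m.end()
    return found, pos

# Caret at the very beginning of the file.
from_top, stop_top = chained_matches(three_chars, text, 0)
# Caret between digits 4 and 5 of the second line (line 2 starts at offset 11).
from_line2, stop_line2 = chained_matches(three_chars, text, 11 + 4)

print(from_top, stop_top)
print(from_line2, stop_line2)
```

In both runs the chain stops in front of a line-ending, because the next three characters would have to cross it.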

              Even if you can’t see the immediate advantage of the \G assertion, there are examples where next search location MUST immediately follow the previous range of characters matched ! I will give you an interesting example, at the end of that post, which involves DNA genetic sequences !


              Vasile, after studying ( hard ! ) your regex, I think that some parts are not necessary, while keeping the power of the \G assertion :-))

              From your regex (?:\G(?!^)|<p\s+class="my_class">)(?:(?!</p>).)*?\K&nbsp;, I think that the two non-capturing groups are rather useless, because they do not store a great amount of text :

              • (?:\G(?!^)|<p\s+class="my_class">) matches a zero-length location OR the string <p class="my_class">

              • (?:(?!</p>).) matches a single standard character

              But I would add the (?-s), at beginning, giving the shortened regex

              (?-s)(\G(?!^)|<p\s+class="my_class">)((?!</p>).)*?\K&nbsp;

              Now, the negative look-ahead (?!^) is not necessary, either. Indeed, you just have to place the cursor on a blank line, at the beginning of the file, and, necessarily, the regex would have to match the string <p\s+class="my_class">, first, anyway !!

              So, a final form of your regex could be, for instance :

              (?-s)(\G|<p\s+class="my_class">)((?!</p>).)*?\K&nbsp;


              To correctly understand how Vasile’s regex works, get rid of the \K part, near the end of the regex :

              (?-s)(\G|<p\s+class="my_class">)((?!</p>).)*?&nbsp;

              and use the regex against the text, of two identical lines, below :

              A part of&nbsp;text BEFORE the&nbsp;main part<p class="my_class">An Extension&nbsp;of Java for Event Correlation. 571 geographical/logical coordinates, or sources. Henceforth,&nbsp;we will use the term&nbsp;events to refer to&nbsp;both the incidents underlying such&nbsp;events as well as to their incarnations&nbsp;and notifications. </p>A part of&nbsp;text AFTER the&nbsp;main part
              
              A part of&nbsp;text BEFORE the&nbsp;main part<p class="my_class">An Extension&nbsp;of Java for Event Correlation. 571 geographical/logical coordinates, or sources. Henceforth,&nbsp;we will use the term&nbsp;events to refer to&nbsp;both the incidents underlying such&nbsp;events as well as to their incarnations&nbsp;and notifications. </p>A part of&nbsp;text AFTER the&nbsp;main part
              

              It’s important to note that this regex correctly avoids matching the &nbsp; strings which are located outside the range <p class="my_class">.........</p> :-))

              And, after matching the last authorized &nbsp; string of a line, the next match must, necessarily, be the range <p\s+class="my_class"> of the next line :-))

              Finally, adding the \K syntax implies that it searches for any authorized &nbsp; string, only and replaces it by a normal space character


              To end with, here is an example which clearly shows the advantage of the \G assertion :

              • In a new tab, add these two lines, below :
              TGAATTCTTGAATTCAAATGAAGGTTCTGACGTCATGATGAC
              °°°     '''       °°°      °°°     ''''''
              
              • In that DNA genetic sequence code, the TGA bases range :

                • is a CODON, at °°° locations ( positions 1, 19 and 28, of the form 3k + 1 )

                • is NOT a CODON, at ''' locations ( positions 9, 36 and 39 )


              Then, starting from beginning of file, or the current cursor position :

              • The regex .*TGA matches any longest range of bases, ending with the 3 bases TGA

              • The regex (\w\w\w)*TGA matches any longest range of bases, multiples of 3 bases, ending with the 3 bases TGA

              • The regex \G(\w\w\w)*TGA matches any longest range of codons, ending with the TGA codon

              and :

              • The regex .*?TGA matches any shortest range of bases, ending with the 3 bases TGA

              • The regex (\w\w\w)*?TGA matches any shortest range of bases, multiples of 3 bases, ending with the 3 bases TGA

              • The regex \G(\w\w\w)*?TGA matches any shortest range of codons, ending with the TGA codon


              Remark : You may, also, give a try to the regexes \G(\w\w\w)*TGA|.{0} and \G(\w\w\w)*?TGA|.{0}, to see the differences !
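
To make the difference concrete, here is a small Python sketch on the same sequence. Since the standard `re` module lacks `\G`, the anchored variant is emulated by restarting `match()` at the end of the previous match; positions are reported 1-based, as in the list above.

```python
import re

dna = 'TGAATTCTTGAATTCAAATGAAGGTTCTGACGTCATGATGAC'

# Every TGA occurrence, whatever the reading frame.
all_tga = [m.start() + 1 for m in re.finditer(r'TGA', dna)]

# (\w\w\w)*?TGA without \G: after a failure the engine may restart one base
# further on, so the reading frame can be lost.
step = re.compile(r'(?:\w\w\w)*?TGA')
unanchored = [m.end() - 2 for m in step.finditer(dna)]

# Emulation of \G(\w\w\w)*?TGA: each match must begin where the last one ended.
codon_tga, pos = [], 0
while True:
    m = step.match(dna, pos)
    if m is None:
        break
    codon_tga.append(m.end() - 2)    # 1-based position of the T of TGA
    pos = m.end()

print(all_tga)
print(unanchored)
print(codon_tga)
```

Only the chained variant stays on multiples of three bases and reports the three real codons.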

              Best Regards,

              guy038

              • Jim DaileyJ
                Jim Dailey
                last edited by

                @guy038

                My mind just blew. I’m pretty sure you have a mutation in your genetic sequence! :-)

                I am guessing even a stable genius would have to work at understanding this.

                • guy038G
                  guy038
                  last edited by

                  Hi @Jim-dailey,

                  To be honest, this DNA example is not my own, of course ! You’ll find a similar version, in the Global matching paragraph, from the link :

                  https://perldoc.perl.org/perlretut.html#Using-regular-expressions-in-Perl

                  And, probably, the original example comes from the Mastering Regular Expressions book, by Jeffrey E.F. Friedl, though not sure !

                  Cheers,

                  guy038

                  • Vasile CarausV
                    Vasile Caraus
                    last edited by

                    guy038 , thanks a lot. You are the guru in the regex formulas !!

                    • Scott SumnerS
                      Scott Sumner
                      last edited by

                      Since this thread has sort of morphed into a talk about \G…not a bad thing…

                      In a limited way, the \G syntax can help with the task of matching the first 5 lines of a file, as discussed in this thread. [Maybe I should have put this posting there…but, oh, well…]

                      Because of the following restrictions (which are probably TOO restrictive), the discussion may be more of a theoretical one on how \G could assist, rather than a practical solution to the problem:

                      • The file must have 5+ lines…okay, maybe not really a restriction given the “spec”
                      • The 1st and 5th lines of the file cannot be empty; if 5th line IS empty only 4 lines will be matched, violating “spec”
                      • The line-ending on the 5th line is NOT part of the match; maybe a “spec” violation :-)

                      Given that, try the following regex in a “Find All In…” search:

                      \G(?!\R)(?-s)(.*\R){4}.*
                      

                      Try it with and without the \G to see how the \G makes it succeed.
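
For readers who want to see the effect outside Notepad++, here is a hedged Python sketch. Python's `re` supports neither `\G` nor `\R`, so `\R` is written as `\n` (assuming Unix line endings) and `\G` is emulated by a hypothetical helper, `find_all_anchored`, which restarts `match()` at the end of the previous match.

```python
import re

# (?!\R)(?-s)(.*\R){4}.* with \R simplified to \n.
FIVE_LINES = re.compile(r'(?!\n)(?:.*\n){4}.*')

def find_all_anchored(pattern, text):
    """Emulate a \\G-anchored 'find all', starting from the top of the text."""
    hits, pos = [], 0
    while True:
        m = pattern.match(text, pos)
        if m is None or m.end() == pos:
            break
        hits.append(m.group(0))
        pos = m.end()
    return hits

text = ''.join('line %d\n' % n for n in range(1, 13))   # a 12-line "file"

anchored = find_all_anchored(FIVE_LINES, text)   # "with \G"
unanchored = FIVE_LINES.findall(text)            # "without \G"

print(len(anchored), len(unanchored))
```

With the anchoring, the second attempt starts in front of the line-ending of line 5, where the look-ahead fails; without it, lines 6 to 10 are matched as well.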

                      • guy038G
                        guy038
                        last edited by

                        Hi, @vasile-caraus and All

                        Contrary to what I said, in my first reply to Vasile, my regex, below :

                        (?-si)<p class="(my_class)".+?</p>(?!#)|(&nbsp;)(?=.+#)|#

                        fails when the range <p class="my_class">............</p> does not begin the current line !

                        In addition, the regex also fails when two, or more, ranges <p class="my_class">.......</p> are located in the same line :-((

                        As I verified that Vasile’s version works when consecutive ranges <p class="my_class">.......</p> exist, on the same line, I updated my first post, on this topic !


                        So, the regex S/R, below, with the \G construction, changes any &nbsp; string into a space character, when it is found inside any <p class="my_class">........</p> range, EXCLUSIVELY

                        SEARCH (?-s)(\G|<p class="my_class">)((?!</p>).)*?\K&nbsp;

                        REPLACE \x20

                        This regex is, definitely, very powerful ! For instance, with the initial text :

                        12345&nbsp;12345&nbsp;12345<p class="my_class">ABCDEFGH&nbsp;ABCEDGH&nbsp;ABCDEFGH</p>67890&nbsp;67890&nbsp;67890<p class="my_class">IJKLMNOP&nbsp;IJKLMNOP&nbsp;IJKLMNOP</p>02468&nbsp;02468&nbsp;02468<p class="my_class">QRSTUVW&nbsp;QRSTUVW&nbsp;QRSTUVW</p>13579&nbsp;13579&nbsp;13579
                        
                        12345&nbsp;12345&nbsp;12345<p class="my_class">ABCDEFGH&nbsp;ABCEDGH&nbsp;ABCDEFGH</p>67890&nbsp;67890&nbsp;67890<p class="my_class">IJKLMNOP&nbsp;IJKLMNOP&nbsp;IJKLMNOP</p>02468&nbsp;02468&nbsp;02468<p class="my_class">QRSTUVW&nbsp;QRSTUVW&nbsp;QRSTUVW</p>13579&nbsp;13579&nbsp;13579
                        

                        After a single click on the Replace All button, we get, at once :

                        12345&nbsp;12345&nbsp;12345<p class="my_class">ABCDEFGH ABCEDGH ABCDEFGH</p>67890&nbsp;67890&nbsp;67890<p class="my_class">IJKLMNOP IJKLMNOP IJKLMNOP</p>02468&nbsp;02468&nbsp;02468<p class="my_class">QRSTUVW QRSTUVW QRSTUVW</p>13579&nbsp;13579&nbsp;13579
                        
                        12345&nbsp;12345&nbsp;12345<p class="my_class">ABCDEFGH ABCEDGH ABCDEFGH</p>67890&nbsp;67890&nbsp;67890<p class="my_class">IJKLMNOP IJKLMNOP IJKLMNOP</p>02468&nbsp;02468&nbsp;02468<p class="my_class">QRSTUVW QRSTUVW QRSTUVW</p>13579&nbsp;13579&nbsp;13579
                        

                        It’s easy to see that, ONLY the &nbsp; strings, inside all <p class="my_class">............</p> ranges, have been changed with a space :-))
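
As a cross-check outside Notepad++, the same result can be obtained in Python with a different technique, since the standard `re` module has neither `\G` nor `\K`: match each whole range first, then replace inside it with a callback. The function name and the shortened sample line are assumptions of this sketch. Unlike the `\G` regex, it needs the closing `</p>` to be present on the same line.

```python
import re

# One whole <p class="my_class">...</p> range, kept on a single line.
ZONE = re.compile(r'<p class="my_class">.*?</p>')

def nbsp_to_space_in_zones(text):
    """Replace &nbsp; by a space, but only inside my_class paragraphs."""
    return ZONE.sub(lambda m: m.group(0).replace('&nbsp;', ' '), text)

line = ('12345&nbsp;12345<p class="my_class">ABCDEFGH&nbsp;ABCDEFGH</p>'
        '67890&nbsp;67890<p class="my_class">IJKLMNOP&nbsp;IJKLMNOP</p>'
        '13579&nbsp;13579')

converted = nbsp_to_space_in_zones(line)
print(converted)
```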

                        Cheers,

                        guy038

                        P.S. :

                        Scott, I’m about to read your post… :-)

                        • guy038G
                          guy038
                          last edited by guy038

                          Hi, @scott-sumner and All,

                          Ah yes, clever use of the \G concept ! It matches just ONCE, in each file, because of the combination of the \G assertion and the negative look-ahead (?!\R), which are mutually exclusive after the first match !

                          Indeed, once the regex engine has matched the first four lines, entirely, and the contents of the fifth line, without its line-ending, the \G assertion forces the regex engine to begin the next match right after the end of the previous match. But no other match occurs because, at this location, the look-ahead (?!\R) forbids End of Line characters !

                          Just note that the (?!\R) syntax is not necessary when we use, for instance, the simple regex \G.+, which matches any non-empty first line, of all the scanned files. Indeed, once the regex engine has matched all the contents of the first line, the cursor location is between the last standard character of this first line and its End of Line character(s). But, in order to get the next standard character, it would be necessary to cross the line-ending, which would break the \G assertion, which supposes that the next match begins, EXACTLY, at the location where the previous match ended !

                          Therefore, if you’re sure that all your files, to scan, have no blank line, in the first five ones, you may use the regex :

                          (?-s)\G(.+\R){4}.+


                          Now, the five regexes, that I gave, in the post, below :

                          https://notepad-plus-plus.org/community/topic/15036/regex-copy-first-5-lines-from-all-open-documents/4

                          can be changed, according to Scott’s syntax, into :

                          • (?-s)\G(?!\R)(.*\R){0}\K.* , which finds any non-empty first line, of all scanned files

                          • (?-s)\G(?!\R)(.*\R){1}\K.* , which finds any second line, of all scanned files

                          • (?-s)\G(?!\R)(.*\R){2}\K.* , which finds any third line, of all scanned files

                          • (?-s)\G(?!\R)(.*\R){3}\K.* , which finds any fourth line, of all scanned files

                          • (?-s)\G(?!\R)(.*\R){4}\K.* , which finds any fifth line, of all scanned files
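
The five regexes above follow one pattern, which a short Python sketch can parameterise. The assumptions: `re.match` plays the role of `\G` at the top of the file, `\R` is simplified to `\n`, and a capture group stands in for `\K`, which Python's `re` does not support. The function `nth_line` is a hypothetical helper, not something from Notepad++.

```python
import re

def nth_line(text, n):
    """Return line n (1-based) of text, or None.

    None is returned when the first line is empty, mirroring (?!\\R),
    or when the text has fewer than n lines.
    """
    # Skip n-1 complete lines, then capture the contents of the next one.
    m = re.match(r'(?!\n)(?:.*\n){%d}(.*)' % (n - 1), text)
    return m.group(1) if m else None

sample = 'alpha\n\ngamma\ndelta\nepsilon\nzeta\n'
first_five = [nth_line(sample, n) for n in range(1, 6)]
print(first_five)
```

As in the Notepad++ version, an empty second, third, fourth or fifth line is still reported, as an empty string.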


                          And, after using the Find All in All Opened Documents, we would get five results in the Find result panel.

                          Then, perform the following actions :

                          • Paste all the results in a new tab ( Ctrl + A , Ctrl + C , Ctrl + N , Ctrl + V )

                          • SEARCH \x20\(1 hit\)\R\t and REPLACE 70 SPACE characters

                          • Edit > Line Operations > Sort lines Lexicographically Ascending

                          • SEARCH ^.{60}\K\x20+ and REPLACE Leave Empty

                          • SEARCH Line 5:.+\R\K and REPLACE \r\n

                          You should get something similar to :

                            .....
                          
                            C:\_751\change.log                                        Line 1: Notepad++ 7.5.1 new features/enhancements & bug-fixes:
                            C:\_751\change.log                                        Line 2: 
                            C:\_751\change.log                                        Line 3: 1.  Fix some excluded language cannot be remembered bug.
                            C:\_751\change.log                                        Line 4: 2.  Fix a localization regression bug.
                            C:\_751\change.log                                        Line 5: 3.  Fix the bug that Notepad++ create "%APPDATA%\local\notepad++" folder in local conf mode.
                          
                            C:\_751\license.txt                                       Line 1: /***************************************************************************
                            C:\_751\license.txt                                       Line 2:  * COPYING -- Describes the terms under which Notepad++ is distributed.    *
                            C:\_751\license.txt                                       Line 3:  * A copy of the GNU GPL is appended to this file.                         *
                            C:\_751\license.txt                                       Line 4:  *                                                                         *
                            C:\_751\license.txt                                       Line 5:  ****************** IMPORTANT NOTEPAD++ LICENSE TERMS **********************
                          
                            C:\_751\readme.txt                                        Line 1: What is Notepad++?
                            C:\_751\readme.txt                                        Line 2: ******************
                            C:\_751\readme.txt                                        Line 3: 
                            C:\_751\readme.txt                                        Line 4: Notepad++ is a free (as in "free speech" and also as in "free beer") source code editor and Notepad replacement that supports several programming languages and natural languages. Running in the MS Windows environment, its use is governed by GPL License.
                            C:\_751\readme.txt                                        Line 5:
                          
                            .....  
                          

                          Cheers,

                          guy038

                          • guy038G
                            guy038
                            last edited by guy038

                            Hello, @vasile-Caraus and All,

                            The general syntax of your regex is :

                            SEARCH (?-s)(\G|BR)((?!ER).)*?\KSR

                            REPLACE RR

                            where :

                            • BR ( Beginning Regex ) is the regex which defines the start of the defined zone, for search/replacement

                            • ER ( Ending Regex ) is the regex which defines the end of the defined zone, for search/replacement

                            • SR ( Search Regex ) is the regex which defines the regex to search, in any defined zone

                            • RR ( Replace Regex ) is the replacement expression which replaces each match of the search regex, in any defined zone

                            For instance, before the S/R, if we have :

                            ..SR.....SR...BR......SR............SR.....ER...SR......SR...BR.......SR........ER....BR..SR......SR.....SR...ER....SR......SR..
                            

                            it will give the results :

                            ..SR.....SR...BR......RR............RR.....ER...SR......SR...BR.......RR........ER....BR..RR......RR.....RR...ER....SR......SR..
                            

                            Of course, in Vasile’s example, we have :

                            BR = <p class="my_class">

                            ER = </p>

                            SR = &nbsp;

                            RR = \x20

                            But, let’s suppose that we take :

                            BR = [01234]+

                            ER = [56789]+

                            SR = (?i)[a-z]+

                            RR = |$0|

                            So, assuming the text :

                            ..SR.......SR....BR....SR....SR.......ER...SR........SR.....BR....SR.....ER..BR....SR......SR.....SR......ER....SR......SR.......
                            
                            ..Here.....is....111...a.....Simple...6....Example...of.....44....some...9999000...TEXT....to.....Search..7.....and.....Modify...
                            

                            and the transformed regex S/R :

                            SEARCH (?-s)(\G|[01234]+)((?![56789]+).)*?\K(?i)[a-z]+

                            REPLACE |$0|

                            it would produce, as expected, the text :

                            ..SR.......SR....BR....RR......RR.........ER...SR........SR.....BR....RR.......ER..BR....RR........RR.......RR........ER....SR......SR.......
                            
                            ..Here.....is....111...|a|.....|Simple|...6....Example...of.....44....|some|...9999000...|TEXT|....|to|.....|Search|..7.....and.....Modify...
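
The general BR / ER / SR / RR scheme can also be sketched in Python. This is an emulation under stated assumptions, not the Notepad++ engine: the standard `re` module has no `\G` or `\K`, so the hypothetical `zone_replace` restarts `match()` at the end of the previous match and uses a named group where `\K` would be. It never treats the very start of the text as a `\G` location, SR must not match an empty string, and `(?i)[a-z]+` is written `[A-Za-z]+` because recent Python versions reject an inline flag in the middle of a pattern.

```python
import re

def zone_replace(text, begin, end, search, repl):
    """Replace each `search` match by repl(match), only between begin and end."""
    begin_re = re.compile(begin)
    # ((?!ER).)*? then SR; the group 'keep' is the part \K would keep.
    step_re = re.compile('(?:(?!' + end + ').)*?(?P<keep>' + search + ')')
    pieces, copied, pos, chained = [], 0, 0, False
    while pos <= len(text):
        m = step_re.match(text, pos) if chained else None   # the \G case
        if m is None:
            chained = False
            b = begin_re.search(text, pos)                  # the BR case
            if b is None:
                break
            m = step_re.match(text, b.end())
            if m is None:
                pos = b.start() + 1     # bump along, as a regex engine would
                continue
        if m.end('keep') == m.start('keep'):
            raise ValueError('search must not match the empty string')
        pieces.append(text[copied:m.start('keep')])
        pieces.append(repl(m.group('keep')))
        copied = pos = m.end()
        chained = True
    pieces.append(text[copied:])
    return ''.join(pieces)

sample = ('..Here.....is....111...a.....Simple...6....Example...of.....44....'
          'some...9999000...TEXT....to.....Search..7.....and.....Modify...')
marked = zone_replace(sample, '[01234]+', '[56789]+', '[A-Za-z]+',
                      lambda s: '|' + s + '|')
print(marked)
```

Calling it with `'<p class="my_class">'`, `'</p>'`, `'&nbsp;'` and a function returning a space gives the behaviour of the original S/R.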
                            

                            Cheers,

                            guy038
