
    Remove everything outside of string including string

    • colonialboyC
      colonialboy
      last edited by

      Hi All,

      I would like to remove:

      • everything outside a string (eg. <Word>);
      • the string itself;

      And keep only the phrase that sits between the opening and closing strings (<Word> and </Word>).

      See below:

      Source data:
      Lorem ipsum dolor sit amet, <Word>consectetur</Word> adipiscing elit. In non porta nulla. Praesent auctor tellus sit amet libero auctor interdum. Morbi pulvinar, lorem vel volutpat scelerisque, orci magna rhoncus est, tempus <Word>sollicitudin</Word> metus ligula vitae eros. Phasellus ultricies blandit <Word>felis nec</Word> malesuada. Nulla quis neque efficitur, suscipit lacus vitae, ornare massa. <Word>Proin</Word> at blandit enim, nec vulputate leo. Aliquam sed nisl in <Word>nibh placerat</Word> fringilla.

      Resultant:
      consectetur
      sollicitudin
      felis nec
      Proin
      nibh placerat

      • Scott SumnerS
        Scott Sumner @colonialboy
        last edited by

        @colonialboy

        Hmmm. Maybe unnecessarily complicated, but this seems to work to meet the spec:

        Find-what zone: (?s)(?:.*?<Word>(.+?)</Word>)|(?!.+?<Word>).+
        Replace-with zone: (?1\1\r\n)
        Search mode: Regular expression

        What it does is match a string of any characters followed by <Word> followed by a string of characters followed by </Word>. When that matches, it is replaced by what was remembered as group 1 occurring between the delimiters.

        The trouble comes from the extra text at the end of the string (after the last set of delimiters)–so we turn the whole thing into an alternation (using |) and assert that no more opening delimiters occur and we gobble up the remaining text this way. Replacement is only done if group 1 exists, meaning that the match occurred on the left side of the alternation (i.e. valid delimited word rather than just a remaining text gobble).
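For readers who want to try the same idea outside Notepad++, here is a minimal Python sketch of the search/replace described above. The pattern is copied unchanged; Python's `re` module has no `(?1...)` conditional replacement, so a small callback stands in for it. The names `keep_group_1` and `extract_words` and the shortened sample text are illustrative assumptions, not part of the original answer.

```python
import re

# Find-what pattern from the post, unchanged. (?s) lets the dot cross line breaks.
PATTERN = re.compile(r"(?s)(?:.*?<Word>(.+?)</Word>)|(?!.+?<Word>).+")


def keep_group_1(match):
    # Stand-in for the conditional replacement (?1\1\r\n):
    # write group 1 plus a line break only when group 1 took part in the match.
    if match.group(1) is not None:
        return match.group(1) + "\r\n"
    # Right-hand side of the alternation: trailing text is simply dropped.
    return ""


def extract_words(text):
    return PATTERN.sub(keep_group_1, text)


sample = "Lorem <Word>consectetur</Word> elit <Word>felis nec</Word> malesuada."
result = extract_words(sample)
print(repr(result))
```

Because the trailing text matches only the right-hand alternative, nothing is written for it, so no extra line ending appears after the last kept phrase.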

        • guy038G
          guy038
          last edited by guy038

          Hello @colonialboy, @scott-sumner, and All,

          I think that the simple regex S/R, below, should do the job, nicely :-))

          SEARCH (?s).+?<Word>(.+?)</Word>|.+

          REPLACE \1\r\n

          OPTIONS Wrap around and Regular expression set, only

          Of course, all the text, at the end of the file, after the last </Word> boundary is replaced by a single line-break.

          In theory, I should have used Scott’s syntax, ?1\1\r\n, in the replacement. Note that the enclosing parentheses are not needed here, as the whole replacement depends only on the existence of group 1. Just a moment of laziness on my part !

          Also, colonialboy, if you deal with Unix files, just replace \r\n with \n in the replacement
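As a rough illustration of this simpler search/replace, the following Python sketch applies the same pattern and always appends a line break, which is what the unconditional replacement \1\r\n does. The function name `extract_words_simple`, its `eol` parameter and the sample text are assumptions made for the example. An unmatched group 1 is treated as an empty string.

```python
import re

# Simplified Find-what pattern from the post, unchanged.
PATTERN = re.compile(r"(?s).+?<Word>(.+?)</Word>|.+")


def extract_words_simple(text, eol="\r\n"):
    # Equivalent of the replacement \1\r\n: group 1 (empty when it did not
    # take part in the match) followed by a line break in every case.
    return PATTERN.sub(lambda m: (m.group(1) or "") + eol, text)


sample = "a <Word>x</Word> b <Word>y z</Word> tail"
windows_result = extract_words_simple(sample)
unix_result = extract_words_simple(sample, eol="\n")
print(repr(windows_result))
print(repr(unix_result))
```

The text after the last closing tag is matched by the second alternative, so it turns into one extra line break at the end of the output.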

          So from your initial text, that I split on several lines :

          Lorem ipsum
          
           dolor sit amet, <Word>consectetur</Word> adipiscing
           elit. In non porta nulla. Praesent auctor tellus sit amet libero 
           
           auctor interdum. Morbi pulvinar, lorem vel volutpat scelerisque, orci magna rhoncus est, tempus <Word>sollicitudin</Word>
          
           metus ligula vitae eros. Phasellus
          
          
           ultricies blandit <Word>felis nec</Word> malesuada. Nulla quis neque efficitur, suscipit lacus vitae,
           ornare massa. <Word>Proin</Word> at blandit enim, nec vulputate leo. 
           Aliquam sed nisl in <Word>nibh placerat</Word>
          

          This S/R would get, after a click on the Replace All button, the text :

          consectetur
          sollicitudin
          felis nec
          Proin
          nibh placerat
          
          

          Best Regards,

          guy038

          • Scott SumnerS
            Scott Sumner @guy038
            last edited by

            @guy038

            Nice regex simplification, although I would leave in the ?1 in the replace part so that the extra line-ending isn’t added. :-D
            And good point about the *nix line endings–sure would be nice if there was a syntax for simply a “line-ending” and it would do the right thing, but I understand and appreciate why you can use \R in the Find what zone but not in the Replace with zone.

            @colonialboy

            Rather than trying to select (for deletion) text that isn’t what you want to have, I like the technique discussed here which provides a method to copy out the text you are interested in (although it takes an additional plugin+script to do it).
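The linked discussion is not reproduced in this thread, so the following is only a standalone Python sketch of the general "copy out what you want" idea, not the plugin script referred to above. The function name `collect_words` and the sample text are assumptions for illustration.

```python
import re


def collect_words(text):
    # Capture only what sits between the delimiters; everything else is ignored.
    # DOTALL lets a delimited phrase span a line break.
    return re.findall(r"<Word>(.+?)</Word>", text, flags=re.DOTALL)


sample = "Lorem <Word>consectetur</Word> elit <Word>felis nec</Word> malesuada."
words = collect_words(sample)
print("\n".join(words))
```

Collecting the matches this way leaves the source text untouched, which is the main appeal of the approach over deleting the unwanted text.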

            • colonialboyC
              colonialboy
              last edited by

              Thanks to @Scott Sumner, and @guy038 for your great answers. Just to push this a little further: my ultimate aim is to put all this info into a spreadsheet. My data looks something like this:

              Source Text

              And I am looking for locations in book titles - in this case each line is a book title and each location is contained within <Word></Word>.
              I would like to preserve the line breaks even when some lines don’t contain <Word>. Also, if a line contains two or more instances of <Word></Word>, then I would like to place a tab between them so that each one lands in its own column in a spreadsheet. Something like the following:

              Resultant

              Many thanks again,

              colonialboy

              *apologies, i’m not sure how to show images inline.

              • colonialboyC
                colonialboy
                last edited by

                Or at the very least to preserve the line breaks would be fine

                • Scott SumnerS
                  Scott Sumner @colonialboy
                  last edited by Scott Sumner

                  @colonialboy

                  Here’s how I would attack your revised need (aside from advising you to experiment on your own with the solutions originally provided…so you learn)–it is just a simplification of the original solution:

                  Find-what zone: (?:.*?<Word>(.+?)</Word>)|.+
                  Replace-with zone: ?1\1\t
                  Post-replace action: Edit (menu) -> Blank Operations -> Trim Trailing Space

                  The only real difference between this and the earlier is the removal of the leading (?s) from the FW part (to keep line-breaks intact), and the substitution of \t for \r\n in the RW part (to put tab characters in between “Words” occurring on the same line). Putting the tab character in results in a trailing tab character on each line with one or more "Word"s, which the “Trim Trailing Space” action removes.
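A minimal Python sketch of this two-step approach, under a few stated assumptions: the text uses \n line endings (Python's dot matches \r, unlike the dot in Notepad++), a callback stands in for the conditional replacement ?1\1\t, and `rstrip` stands in for the Trim Trailing Space menu action. The names `words_to_columns` and `tab_after_word` and the sample text are illustrative only.

```python
import re

# Find-what pattern from the post, unchanged. Without (?s) the dot stops at \n,
# so every match stays on one line and the line breaks are never consumed.
PATTERN = re.compile(r"(?:.*?<Word>(.+?)</Word>)|.+")


def tab_after_word(match):
    # Stand-in for ?1\1\t: keep group 1 followed by a tab, drop anything else.
    if match.group(1) is not None:
        return match.group(1) + "\t"
    return ""


def words_to_columns(text):
    replaced = PATTERN.sub(tab_after_word, text)
    # Stand-in for Trim Trailing Space: strip trailing spaces and tabs per line.
    return "\n".join(line.rstrip(" \t") for line in replaced.split("\n"))


sample = "A <Word>x</Word> b <Word>y</Word> c\nno tags here\nZ <Word>w</Word>"
table = words_to_columns(sample)
print(repr(table))
```

The line without any tag becomes an empty line rather than disappearing, so the row count of the spreadsheet still matches the source.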

                  BTW, inline images may be done as follows:

                  ![](https://i.imgur.com/gxsG8RS.png)

                  will embed as:

                  • colonialboyC
                    colonialboy
                    last edited by

                    thank you again @Scott Sumner. I will for sure learn from this and also I hope it provides others too with some help. This is particularly helpful for those that work with “Named entity recognition”!

                    For convenience, here are the two images previously linked. @Scott Sumner was able to provide the conversion from this:
                    [image: Source Text]

                    to this:

                    [image: Resultant]

                    • guy038G
                      guy038
                      last edited by guy038

                      Hello @colonialboy, @scott-sumner, and All

                      UPDATE : Please, do not take into account the S/R explained in this post and see my next post :-)

                      Ah, OK ! Now, I clearly see the goal :-))

                      So you could use the following S/R regex :

                      SEARCH (?-is).*?<Word>(.*?)</Word>(?=.*<Word>)|.*?<Word>(.*?)</Word>.*\R?|.*\R?

                      REPLACE \1\2?1\t:\r\n

                      Then, after a click on the Replace All button, your source text, below :

                      Lorem <Word>ipsum</Word> dolor sit amet, consectetuer <Word>adipiscing</Word> elit.
                      Aenean<Word>comodo</Word> ligura eget dolor.
                      Aenean massa.
                      Cum sociis <Word>natoque</Word>penatibus et magnis<Word>dis</Word>parturient<Word>montes</Word> nascetur ridiculus mus.
                      Dinec quam felis <Word>ultricies</Word> nec, pellentesque eu, pretium quis, sem.
                      Nulla consequat massa quis enim.
                      Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu.
                      In enim justo, <Word>rhoncus</Word> ut, imperdiet a, venetatis vitae, justo.
                      Nulam dictum felis eu pede mollis pretium. Integer tincidunt.
                      Cras dapibus.
                      Vivamus elementum <Word>semper</Word> <Word>nisi</Word>
                      Aenean vulputate eleifend tellus.
                      Aenean leo ligula, porttitor eu, consequat vitae,<Word>eleifend</Word> ac, enim.
                      Aliquam lorem ante, <Word>dapibus</Word> in, viverra quis, feugiat a, tellus.
                      

                      will be changed, at once, into :

                      ipsum	adipiscing
                      comodo
                      
                      natoque	dis	montes
                      ultricies
                      
                      
                      rhoncus
                      
                      
                      semper	nisi
                      
                      eleifend
                      dapibus
                      

                      Nice, isn’t it ?


                      Notes :

                      • To begin, the (?-is) modifiers ensure that :

                        • The dot ( . ) meta-character matches standard characters only, and never a line-break character

                        • The search is processed in a case-sensitive way. So, the string <WORD>, for instance, would not be considered as a tag !

                      • Then the search regex is made of three alternatives :

                        • The first part .*?<Word>(.*?)</Word>(?=.*<Word>) looks for anything, even nothing, followed by a <Word>(...)</Word> block ( with its contents, possibly null, stored in group 1 ) if, further on, another <Word> tag can be found on the same line

                        • The second part .*?<Word>(.*?)</Word>.*\R? looks for anything, even nothing, followed by a <Word>(...)</Word> block ( with its contents, possibly null, stored in group 2 ) without any other <Word> tag further on in the same line, and also takes the rest of that line and its line break

                        • The last part .*\R? looks for any kind of line, even empty, that matches neither the first nor the second alternative

                      • In replacement :

                        • First, the groups 1 and 2, standing for the range of chars between the two tags <Word> and </Word>, and which are mutually exclusive, are just rewritten ( These two groups cannot both be non null at the same time ! )

                        • Then, if group 1 exists, we write a tabulation character. In all other cases, we do not rewrite any text but simply write a line break ( \r\n ) ( or \n for Unix files ).

                        • Note that the full syntax of this conditional replacement is (?1\t:\r\n), with the colon standing for the limit between the case THEN ( when group 1 exists ) and the case ELSE !


                      Remarks :

                      • Any line, even empty, which does not contain any <Word>(...)</Word> block, is simply replaced with a line break

                      • Any empty block <Word></Word> is replaced with :

                        • A tabulation character, if other blocks <Word>...</Word> exist, further on, on the same line

                        • A line break, if no other block <Word>...</Word> exists, further on, on the same line
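The notes and remarks above can be sketched in Python as follows, with several labeled assumptions. Python's `re` module does not accept a leading (?-is) and has no \R, so the sketch relies on the defaults (case-sensitive, dot stops at \n) and uses \n? in place of \R?, which assumes \n line endings. A callback stands in for the replacement \1\2?1\t:\r\n, writing \n as the line break. The last alternative can match an empty string at the very end of the text; this sketch ignores that empty match, and how Notepad++ treats it was not checked. The names `three_way` and `to_columns` and the sample text are illustrative only.

```python
import re

# The three alternatives from the post, with \n? standing in for \R?.
PATTERN = re.compile(
    r".*?<Word>(.*?)</Word>(?=.*<Word>)"  # block followed by another tag on the line
    r"|.*?<Word>(.*?)</Word>.*\n?"        # last block on the line, plus rest of line
    r"|.*\n?"                             # any line without a block
)


def three_way(match):
    # Ignore the empty match at the end of the text (assumption, see lead-in).
    if match.group(0) == "":
        return ""
    group_1, group_2 = match.group(1), match.group(2)
    kept = (group_1 or "") + (group_2 or "")
    # Stand-in for ?1\t:\r\n : tab when group 1 exists, otherwise a line break.
    return kept + ("\t" if group_1 is not None else "\n")


def to_columns(text):
    return PATTERN.sub(three_way, text)


sample = "a <Word>x</Word> b <Word>y</Word> c\nplain\n<Word>z</Word>\n"
columns = to_columns(sample)
print(repr(columns))
```

Unlike the two-step variant, no trailing tab is produced, because the last block of each line is handled by the second alternative and is followed directly by the line break.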

                      Best Regards,

                      guy038

                      P.S. :

                      Why not simply use the symbols < and > to delimit the ranges of characters to keep ? Of course, this implies that these characters are not otherwise part of your initial text !

                      Thus, the search regex would become (?-s).*?<(.*?)>(?=.*<)|.*?<(.*?)>.*\R?|.*\R? and the replacement would stay the same. This should work, although I have not tested it yet !

                      • Alan KilbornA
                        Alan Kilborn
                        last edited by

                        I reserve downvoting for rudeness, but isn’t this solution way more complicated than it needs to be? The earlier solution proposed seemed to work (I guess) and it was accepted by the OP. So what’s the point in coming up with a more complicated Reg. Exp. solution? What am I missing?

                        • guy038G
                          guy038
                          last edited by

                          Hello @colonialboy, @scott-sumner, @alan-kilborn and All

                          Alan, you’re a thousand times right on that matter. I didn’t even try Scott’s regex and I wrongly presumed that lines without any <Word>...</Word> were not replaced by a simple line-break !

                          So, I apologize and, of course, Scott’s regex is much more elegant, despite the supplementary but easy trimming operation !

                          Cheers,

                          guy038

                          • Scott SumnerS
                            Scott Sumner @guy038
                            last edited by Scott Sumner

                            @guy038 , @Alan-Kilborn :

                            I agree that unless a poster specifically says “I must have ‘____’ functionality in ONE (regex) operation” , then good multi-step solutions are perfectly viable. Especially if it keeps a regex step much more readable. Some might argue that if they have to do a bunch of individual steps constantly, it gets annoying–to those I’d say that is what the macro recording and playback feature is for. :-)

                            Side note: If the trim command I used earlier also does tab characters (and it does), shouldn’t it be called “Trim Trailing Whitespace”?

                            • Scott SumnerS
                              Scott Sumner @guy038
                              last edited by guy038

                              @guy038

                              I’m guessing you don’t check it very often (hence this posting), but I just sent you an email (to your gmail account) regarding this thread with a “new” technique in it; if it works out as an approach (let me know what you think), we’ll share it here.

                              The Community of users of the Notepad++ text editor.