    Finding multiple lines in multiple files and deleting just those lines

    • Scott SumnerS
      Scott Sumner @Steve Wilson
      last edited by

      @Steve-Wilson

      It’s cool. The large font didn’t help my understanding (still quite confused), but maybe someone else will jump in. I’m out. Cheers, bro.

      • Steve WilsonS
        Steve Wilson
        last edited by

        Sorry. I’ve got no idea what caused the large font. I assure you I didn’t do it on purpose.

        But thanks again!

        • Steve WilsonS
          Steve Wilson
          last edited by

          Okay, I think I’m making my request/explanation confusing and over-complicated. Say I have the following file:

          1
          00:00:06,000 --> 00:00:12,074
          <font color="#ffff00">Sync by honeybunny - corrected by chamallow35</font> [line to remove]
          <font color="#ffff00">www.Addic7ed.Com</font> [line to remove]

          2
          00:00:12,920 --> 00:00:14,420
          Now we’re talking. Yeah, please.

          3
          00:00:15,870 --> 00:00:16,980
          Right here, baby. Aw…

          4
          00:00:20,580 --> 00:00:21,480
          Over here.

          5
          00:00:21,480 --> 00:00:23,140
          Over here, yeah, yeah.

          6
          00:00:32,020 --> 00:00:32,990
          Over here, Kelli.

          7
          00:00:33,990 --> 00:00:35,810
          Sync by yyets.net - corrected by chamallow35 [line to remove]
          www.addic7ed.com [line to remove]

          8
          00:00:36,010 --> 00:00:38,390
          Over here, Kelli. You
          look beautiful. Right here.

          9
          00:00:38,390 --> 00:00:40,190
          Please rate this subtitle at www.osdb.link/6hdjt [line to remove]
          Help other users to choose the best subtitles [line to remove]

          And in that file I’m trying to remove just the lines that I’ve marked with [line to remove].

          I’m just trying to figure out what I would put in the “Find what” box, the “Replace with” box and what I would have the “Search mode” set to. I can’t figure it out.

          Thanks

          10
          00:00:44,180 --> 00:00:45,170
          Bupkes.

          • guy038G
            guy038
            last edited by guy038

            @steve-wilson, @scott-sumner, @terry-r and All,

            Thanks for your last post, which gives us useful information. However, one point is still unclear !

            You previously said that you wanted to get rid of line 3. But, from your last post, it seems that you also want to get rid of all the lines located after line 3 ! Am I right about it ?


            Anyway, the regex S/R, below, supposes that you want to get rid of all lines :

            • Containing the Sync by string, with that exact case

            OR

            • Containing the string www., of an Internet address

            as well as any subsequent lines, until a true empty line


            So, assuming your example, placed in a new N++ tab :

            1
            00:00:06,000 --> 00:00:12,074
            <font color="#ffff00">Sync by honeybunny - corrected by chamallow35</font> [line to remove]
            <font color="#ffff00">www.Addic7ed.Com</font> [line to remove]
            
            2
            00:00:12,920 --> 00:00:14,420
            Now we’re talking. Yeah, please.
            
            3
            00:00:15,870 --> 00:00:16,980
            Right here, baby. Aw…
            
            4
            00:00:20,580 --> 00:00:21,480
            Over here.
            
            5
            00:00:21,480 --> 00:00:23,140
            Over here, yeah, yeah.
            
            6
            00:00:32,020 --> 00:00:32,990
            Over here, Kelli.
            
            7
            00:00:33,990 --> 00:00:35,810
            Sync by yyets.net - corrected by chamallow35 [line to remove]
            www.addic7ed.com [line to remove]
            
            8
            00:00:36,010 --> 00:00:38,390
            Over here, Kelli. You
            look beautiful. Right here.
            
            9
            00:00:38,390 --> 00:00:40,190
            Please rate this subtitle at www.osdb.link/6hdjt [line to remove]
            Help other users to choose the best subtitles [line to remove]
            
            
            10
            00:00:44,180 --> 00:00:45,170
            Bupkes.
            
            • Open the Replace dialog ( CTRL + H )

            • Type, or copy/paste the regex (?-s)^.*\b(Sync by\x20|www\.).*\R(.+\R)+ in the Find what: zone

            • Leave the Replace with: zone EMPTY

            • Tick the Wrap around option

            • Select the Regular expression search mode

            • Click once, on the Replace All button

            You should obtain the expected text :

            1
            00:00:06,000 --> 00:00:12,074
            
            2
            00:00:12,920 --> 00:00:14,420
            Now we’re talking. Yeah, please.
            
            3
            00:00:15,870 --> 00:00:16,980
            Right here, baby. Aw…
            
            4
            00:00:20,580 --> 00:00:21,480
            Over here.
            
            5
            00:00:21,480 --> 00:00:23,140
            Over here, yeah, yeah.
            
            6
            00:00:32,020 --> 00:00:32,990
            Over here, Kelli.
            
            7
            00:00:33,990 --> 00:00:35,810
            
            8
            00:00:36,010 --> 00:00:38,390
            Over here, Kelli. You
            look beautiful. Right here.
            
            9
            00:00:38,390 --> 00:00:40,190
            
            
            10
            00:00:44,180 --> 00:00:45,170
            Bupkes.
            

            Voilà !
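For readers who want to try the same search outside Notepad++, here is a minimal Python sketch of the regex above. It is an approximation under stated assumptions: Python's `re` module has no `\R`, so a hypothetical `EOL` sub-pattern stands in for it, and `[^\r\n]` replaces `.` so that a bare `\r` is never swallowed as ordinary text. The names `EOL`, `PATTERN` and `sample` are illustrative only.

```python
import re

# Python's re has no \R, so spell out the line-break alternatives
# (assumption: only \r\n, \n and \r occur in the subtitle files).
EOL = r"(?:\r\n|\n|\r)"

# Same idea as the Notepad++ regex: a line containing "Sync by " or "www.",
# plus every following non-empty line, up to the next empty line.
PATTERN = re.compile(
    r"^[^\r\n]*\b(?:Sync by\x20|www\.)[^\r\n]*" + EOL
    + r"(?:[^\r\n]+" + EOL + r")+",
    re.MULTILINE,
)

sample = (
    "1\n"
    "00:00:06,000 --> 00:00:12,074\n"
    "Sync by honeybunny - corrected by chamallow35\n"
    "www.Addic7ed.Com\n"
    "\n"
    "2\n"
    "00:00:12,920 --> 00:00:14,420\n"
    "Now we're talking. Yeah, please.\n"
)

# Replace every match with nothing, like an empty "Replace with" field.
cleaned = PATTERN.sub("", sample)
print(cleaned)
```

As in the Notepad++ version, the trailing `(...)+` means a matching line is only removed when at least one non-empty line follows it.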


            If we’re not far from the goal, I could, next time, explain my search regex !

            Best Regards,

            guy038

            • Steve WilsonS
              Steve Wilson
              last edited by

              No. I’m simply trying to get rid of the lines I’ve marked in that example with [line to remove] at the end of the line. <sigh> I’m not making myself clear. What I’m trying to do is, I think, really simple. I simply want to remove multiple lines of text from within files without having to remove them one at a time.

              I said I wanted to get rid of line 3 simply to indicate that it was the third line of text that I was attempting to remove. Not the third and fourth, fifth, etc. Multiple lines containing specific text strings. In my last example I indicated those lines by appending [line to remove] to the end of each line/string that I wanted gone.

              Thanks

              • Alan KilbornA
                Alan Kilborn
                last edited by

                I would go with a find field of ^.+?\[line to remove\].*?\R and a replace box that is totally empty. That should eliminate all of the desired to be deleted lines.

                • guy038G
                  guy038
                  last edited by

                  Hi, @steve-wilson, @alan-kilborn, @scott-sumner, @terry-r and All,

                  Many thanks, Alan ! Oh my god, so simple ! Then, Steve, actually, you would like to get rid of all lines containing the literal string [line to remove], wouldn’t you ? Is it, really, the single rule needed for the regex ?

                  If so, of course, Alan’s regex works fine. You could also use the regex (?-is)^.+\[line to remove\].*\R, which :

                  • Catches single-line text, only, due to the (?-s) modifier

                  • Matches the literal string [line to remove], with that exact case, due to the (?-i) modifier

                  Remember that the Replace with: zone remains Empty
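A rough Python equivalent of this marker-based regex, offered only as a sketch. Case-sensitive matching and a `.` that stays on one line are already Python's defaults, so `(?-is)` has no counterpart here; `\R` is spelled out because Python's `re` lacks it. `MARKED_LINE` and `sample` are hypothetical names.

```python
import re

# One whole line that contains the literal text "[line to remove]",
# together with its line break. The square brackets are escaped.
MARKED_LINE = re.compile(
    r"^[^\r\n]+\[line to remove\][^\r\n]*(?:\r\n|\n|\r)",
    re.MULTILINE,
)

sample = (
    "7\n"
    "00:00:33,990 --> 00:00:35,810\n"
    "Sync by yyets.net - corrected by chamallow35 [line to remove]\n"
    "www.addic7ed.com [line to remove]\n"
    "\n"
    "8\n"
    "00:00:36,010 --> 00:00:38,390\n"
    "Over here, Kelli.\n"
)

result = MARKED_LINE.sub("", sample)
print(result)
```

Only the two marked lines disappear; the cue number, the timestamp and the blank separator line are left alone.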

                  Cheers

                  guy038

                  • Steve WilsonS
                    Steve Wilson
                    last edited by

                    Could I use the regex “(?-is)^.+[first line to remove][second line to remove].*\R” (etc.) on the lines to remove? There are probably at least a dozen or more lines I’m trying to remove from a lot of files. I just want to avoid having to do it one file (or one line) at a time.

                    And, many thanks.

                    • Steve WilsonS
                      Steve Wilson
                      last edited by

                      So, if I want to get rid of each of the following lines that a file may contain
                      "
                      <font color="#ffff00">Sync by honeybunny - corrected by chamallow35</font>
                      <font color="#ffff00">www.Addic7ed.Com</font>
                      Sync by yyets.net - corrected by chamallow35
                      www.addic7ed.com
                      Please rate this subtitle at www.osdb.link/6hdjt
                      Help other users to choose the best subtitles
                      "
                      could I just use the regex (?-is)^.+<font color="#ffff00">Sync by honeybunny - corrected by chamallow35</font>\<font color="#ffff00">www.Addic7ed.Com</font>\Sync by yyets.net - corrected by chamallow35\www.addic7ed.com\Please rate this subtitle at www.osdb.link/6hdjt\Help other users to choose the best subtitles.*\R ?

                      • guy038G
                        guy038
                        last edited by

                        Hi, @steve-wilson,

                        Ah, yes ! Of course, if you already placed strings, like First line to remove or Second line to remove…, in your files, just change my previous regex, as below :

                        (?-is)^.+\[.*line to remove\].*\R

                        Notes :

                        • The modifiers (?-is) were explained previously

                        • Then the regex matches, from beginning of line ( ^ ), any non-empty range of standard characters ( .+ ), ending with an opening square bracket symbol ( \[ )

                        • Then matching any range, possibly empty, of standard characters ( .*), ending with the string line to remove, with that exact case, and the ending square bracket symbol ( \] )

                        • And, finally, matching any remaining range of characters, possibly empty, of the current line ( .* ) , along with its End of Line characters ( \R ), which may be \r\n for Windows files, \n for Unix files or \r for Macintosh files

                        • And, as the Replacement field is empty, the complete matched line, with its line-break, is, thus, deleted

                        Remarks :

                        • The square bracket symbols, being regex symbols, must be escaped to be considered as literals !

                        • Any syntax [...... line to remove], whatever the text between the [ symbol and the string line to remove, will be taken into account by the regex, and the corresponding line selected for deletion
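The remark about escaping the square brackets can be checked with a small Python sketch (Python's `re` treats brackets the same way here; the variable names are hypothetical). Unescaped, `[line to remove]` is a character class that matches any single one of its letters or a space, not the literal marker.

```python
import re

line = "www.addic7ed.com [second line to remove]"

# Escaped brackets: matches the literal marker, with any text after "[".
escaped = re.compile(r"\[.*line to remove\]")

# Unescaped brackets: a character class, i.e. ONE character out of
# "l", "i", "n", "e", " ", "t", "o", "r", "m", "v".
unescaped = re.compile(r"[line to remove]")

print(escaped.search(line) is not None)       # the marker is found
print(escaped.search("Bupkes.") is None)      # plain dialogue is not matched
print(unescaped.search("Bupkes.").group())    # a single letter from dialogue
```

So the unescaped form would flag almost every line of dialogue, which is why the backslashes matter.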

                        Cheers,

                        guy038

                        • Steve WilsonS
                          Steve Wilson
                          last edited by

                          I REALLY do appreciate all the time you’ve put in here. I’m just not getting it. I’m going to need to input at least a dozen lines that I want gone, but if you could show me an example of a regex to remove the following six lines, it’d be terribly helpful.

                          <font color="#ffff00">Sync by honeybunny - corrected by chamallow35</font>
                          <font color="#ffff00">www.Addic7ed.Com</font>
                          Sync by yyets.net - corrected by chamallow35
                          www.addic7ed.com
                          Please rate this subtitle at www.osdb.link/6hdjt
                          Help other users to choose the best subtitles

                          I DO appreciate it. I’m just not getting it.

                          • Terry RT
                            Terry R
                            last edited by Terry R

                            I’ll quickly wade in here.
                            Of the examples provided, and knowing that the subtitle files are for movies, I’d suggest you can group some of the lines you wish to remove. For example, it’s very unlikely that dialogue would include www. or <font color or even Sync by. So, in effect, you may not need to write out the lines in full. You just need enough information to uniquely identify the lines you want to remove.

                            May I also suggest you combine ALL the files together, order them and remove duplicates. Look at what’s left. This could quickly identify what you’re trying to remove. Then using that information you get a regex to run over the original files.
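One way to approximate this survey step is sketched below in Python. It is a variation on the suggestion, not the same procedure: instead of sorting and de-duplicating, it counts in how many files each text line appears, on the assumption that advertising lines recur across many files while dialogue does not. The function name `count_text_lines`, the `SKIP` pattern and the UTF-8 encoding are all assumptions.

```python
import re
from collections import Counter
from pathlib import Path

# Lines that are only a cue number or a timestamp range are not interesting.
SKIP = re.compile(r"^\d+$|^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")


def count_text_lines(folder):
    """Count in how many .srt files under `folder` each text line appears."""
    counts = Counter()
    for path in Path(folder).glob("*.srt"):
        text = path.read_text(encoding="utf-8", errors="replace")
        # A set, so a line repeated inside one file is counted once per file.
        lines = {ln.strip() for ln in text.splitlines()}
        counts.update(ln for ln in lines if ln and not SKIP.match(ln))
    return counts


# Usage (the folder name is a placeholder):
#     for line, n in count_text_lines("subs").most_common(20):
#         print(n, line)
```

The lines at the top of that list are the candidates to turn into regex alternatives.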

                            Terry

                            • guy038G
                              guy038
                              last edited by guy038

                              @steve-wilson, and All,

                              Very sorry, because we posted, rather simultaneously :-((

                              Your last post goes towards a completely different direction ! The syntax, that you described, cannot be used in that form !! The regex would be quite invalid :-((

                              So, first, are you searching these six sentences, below, with that exact syntax ?

                              <font color="#ffff00">Sync by honeybunny - corrected by chamallow35</font>
                              <font color="#ffff00">www.Addic7ed.Com</font>
                              Sync by yyets.net - corrected by chamallow35
                              www.addic7ed.com
                              Please rate this subtitle at www.osdb.link/6hdjt
                              Help other users to choose the best subtitles
                              

                              I mean, could it be that, sometimes, you get lines with chamallow73 instead of chamallow35, OR these six lines never change, in all your files ?


                              If these 6 lines have a fixed form, the generic regex to use is :

                              SEARCH (?-is)^(.*\Q...Line 1 Contents...\E.*|.*\Q...Line 2 Contents...\E.*|.*\Q...Line 3 Contents...\E.*|..........|.*\Q...Line 6 Contents...\E.*)\R

                              Notes :

                              • ...Line #n Contents... represents the exact part of the nth line that you want to search for

                              • The \Q and \E escaped sequences ensure you that any text placed between these two boundaries, is taken, literally

                              • The | symbol is a regex symbol to separate different alternatives to search, simultaneously

                              • The .* syntaxes, located, before \Q and after \E are the areas, possibly empty, located before and after your different sentences to search. Note that, if a sentence represents all the contents of a line, you may suppress these .* syntaxes, in the corresponding alternative

                              • Finally, any alternative, between the parentheses, must begin at the start of a line ( ^ ) and end with its line-break characters ( \R )


                              If we apply this generic regex to your real example, we get the following regex :

                              (?-is)^(.*\Q<font color="#ffff00">Sync by honeybunny - corrected by chamallow35</font>\E.*|.*\Q<font color="#ffff00">www.Addic7ed.Com</font>\E.*|.*\QSync by yyets.net - corrected by chamallow35\E.*|.*\Qwww.addic7ed.com\E.*|.*\QPlease rate this subtitle at www.osdb.link/6hdjt\E.*|.*\QHelp other users to choose the best subtitles\E.*)\R

                              Et voilà :-))
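The same "list of literal strings" idea can be sketched in Python, where `re.escape` plays the role of `\Q...\E` (Python's `re` does not support `\Q`/`\E`). This is only an illustration; `UNWANTED`, `LINE` and `strip_unwanted` are hypothetical names, and `\R` is again spelled out by hand.

```python
import re

# The six literal sentences from the example above.
UNWANTED = [
    '<font color="#ffff00">Sync by honeybunny - corrected by chamallow35</font>',
    '<font color="#ffff00">www.Addic7ed.Com</font>',
    "Sync by yyets.net - corrected by chamallow35",
    "www.addic7ed.com",
    "Please rate this subtitle at www.osdb.link/6hdjt",
    "Help other users to choose the best subtitles",
]

# re.escape() neutralises regex symbols such as "." and "#" in each sentence,
# and "|" joins the escaped sentences into alternatives.
alternatives = "|".join(re.escape(s) for s in UNWANTED)

# Any text before and after the sentence, then the line break.
LINE = re.compile(
    r"^[^\r\n]*(?:" + alternatives + r")[^\r\n]*(?:\r\n|\n|\r)",
    re.MULTILINE,
)


def strip_unwanted(text):
    """Delete every complete line that contains one of the sentences."""
    return LINE.sub("", text)
```

Adding a new sentence then only means appending one string to the list, with no escaping to do by hand.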

                              Cheers,

                              guy038

                              P.S. :

                              I strongly advise you to read this FAQ post on regexes, below :

                              https://notepad-plus-plus.org/community/topic/15765/faq-desk-where-to-find-regex-documentation/1

                              • Terry RT
                                Terry R
                                last edited by Terry R

                                To remove the 6 lines in your example the following regex would work.
                                Find what: ^((?|<font color|Sync by|Please rate|Help other|www\.).+\R)
                                Replace with: (leave this field empty)

                                The search mode is regular expression and wrap around is ticked.

                                Note that I haven’t included the complete line as I think the strings I’m searching for are unique enough. The .+\R sequence at the end means as long as it starts with one of the strings, also grab the remainder of the line. The ^ at the start makes sure we are starting a search at the start of a line. Thus if these strings are NOT at the start of the line, they will not be removed.

                                The regex includes a pipe character between the different strings (|), this allows the regex to look for different strings all within the one expression, so you would only need to run it once to get all those alternatives removed. You can extend the regex by adding more pipe characters and other strings to search for.
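The "starts with one of these strings" rule can also be sketched without a regex. The Python snippet below is a rough analogue, not an exact translation (the regex additionally requires at least one character after the prefix); `PREFIXES` and `drop_prefixed` are hypothetical names.

```python
# The same prefixes as in the regex alternatives.
PREFIXES = ("<font color", "Sync by", "Please rate", "Help other", "www.")


def drop_prefixed(lines):
    """Keep only the lines that do NOT start with one of PREFIXES."""
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return [ln for ln in lines if not ln.startswith(PREFIXES)]


lines = ["www.addic7ed.com", " www.addic7ed.com", "Over here."]
print(drop_prefixed(lines))
```

Note that the second entry, which begins with a space, is kept: anchoring at the start of the line means any leading character defeats the match.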

                                Terry

                                • Steve WilsonS
                                  Steve Wilson
                                  last edited by

                                  Many Many thanks. I’ve managed to remove a few hundred bothersome lines from a few hundred srt files. Much faster than doing it one by one.

                                  I DO appreciate it.

                                  Steve

                                  • Steve WilsonS
                                    Steve Wilson
                                    last edited by

                                    OK. Many thanks again. And I DID read that documentation. It explained a lot, but I didn’t see an answer to this:
                                    I’ve got the following regex.

                                    (?-is)^(.*\Q<font color="#ffff00">Sync by honeybunny - corrected by chamallow35</font>\E.*|.*\Q<font color="#ffff00">www.Addic7ed.Com</font>\E.*|.*\QSync by yyets.net - corrected by chamallow35\E.*|.*\Qwww.addic7ed.com\E.*|.*\QPlease rate this subtitle at www.osdb.link/6hdjt\E.*|.*\QSync & corrections by\E.*|.*\Qwww.addic7ed.com\E.*|.*\QPlease rate this subtitle\E.*|.*\Q== sync, corrected by <font color="#00FF00">elderman</font> ==\E.*|.*\Q <font color="#00FFFF">@elder_man\E.*|.*\Q<font color="#00FFFF">@elder_man</font> \E.*|.*\QWWW.MY-SUBS.COM\E.*|.*\QAdvertise your product or brand here\E.*|.*\Qcontact www.OpenSubtitles.org today\E.*|.*\QAmericasCardroom.com brings poker back\E.*|.*\QMillion Dollar Sunday Tournament every Sunday\E.*|.*\QSynced & corrected by\E.*|.*\QSynced and corrected by Octavia\E.*|.*\QHelp other users to choose the best subtitles\E.*)\R

                                    It works, but since the list of lines I’d like to remove keeps growing, for the sake of simplicity I’d like to use this regex - it has the same expressions but the lines are separate. Any reason that wouldn’t work?

                                    (?-is)^(.*\Q
                                    <font color="#ffff00">Sync by honeybunny - corrected by chamallow35</font>\E.*
                                    |.*\Q
                                    <font color="#ffff00">www.Addic7ed.Com</font>\E.*
                                    |.*\Q
                                    Sync by yyets.net - corrected by chamallow35\E.*
                                    |.*\Q
                                    www.addic7ed.com\E.*
                                    |.*\Q
                                    Please rate this subtitle at www.osdb.link/6hdjt\E.*
                                    |.*\Q
                                    Sync & corrections by\E.*
                                    |.*\Q
                                    www.addic7ed.com\E.*|.*\Q
                                    Please rate this subtitle\E.*|.*\Q
                                    == sync, corrected by <font color="#00FF00">elderman</font> ==\E.*|.*\Q
                                    <font color="#00FFFF">@elder_man\E.*|.*\Q
                                    <font color="#00FFFF">@elder_man</font> \E.*|.*\Q
                                    WWW.MY-SUBS.COM\E.*|.*\Q
                                    Advertise your product or brand here\E.*|.*\Q
                                    contact www.OpenSubtitles.org today\E.*|.*\Q
                                    AmericasCardroom.com brings poker back\E.*|.*\Q
                                    Million Dollar Sunday Tournament every Sunday\E.*|.*\Q
                                    Synced & corrected by\E.*|.*\Q
                                    Synced and corrected by Octavia\E.*|.*\Q
                                    Help other users to choose the best subtitles\E.*)\R

                                    That way I could easily scan to see if a line already existed and if not just put it in before the last line followed by a macro for \E.*|.*\Q

                                    Thanks

                                    • Terry RT
                                      Terry R
                                      last edited by

                                      The following regex will get every one of those lines and it doesn’t care about the case of the character.

                                      (?i)^.*?(<font|sync.*?correct|www\.|help other.*?subtitle|please rate|advertise your|\.com|million dollar).*?\R

                                      So the search is an ‘insensitive’ search, it doesn’t care whether it’s an a or an A. So first off that would save you looking for ‘WWW.’ and ‘www.’ with 2 sub expressions.

                                      You see, there isn’t a need to type every single line you need to search for individually. If you tried, you would quickly exceed the maximum length allowed for a regex. My example identifies a complete line so long as it has the characters defined within each sub expression. They are shown between the ‘|’ characters.

                                      As a couple of the lines are very similar to possible dialogue I’ve made the sub expression look for 2 words with ‘something’ between them. I don’t care what the ‘something’ is, only that the 2 words appear on the same line. This may also be something you wish to try.

                                      The only issue with my example is that it will NOT grab the very last line if that is one of the lines you want. That’s because the last line doesn’t finish on a ‘\R’. I don’t think that would be a problem though, these ‘advertising’ lines would generally be in the first 100 or so lines of each subtitle file I think.

                                      The last 3 tests (advertise your, \.com and million dollar) might potentially occur within dialogue, so you may want to expand on those, but you still should not need to include the WHOLE line.

                                      As to why your expression didn’t work, I found it hard to read, far too much text. I just found it easier to do an example for you, maybe also because I think you are trying too hard to identify the lines. Regex is all about trying to bunch groups/strings of characters into neat buckets, that’s where its power lies. In your case you’re removing most of that power and trying to search for each unique line individually.
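A Python sketch of this keyword-based, case-insensitive search follows. It departs from the regex above in one labelled way: an extra `\Z` alternative is added at the end so that a final line with no line break is caught as well, which is an addition for illustration and not part of the original expression. `ADS` is a hypothetical name.

```python
import re

ADS = re.compile(
    # Short keyword fragments; two of them require a second word later
    # on the same line ("sync ... correct", "help other ... subtitle").
    r"^[^\r\n]*?(?:<font|sync[^\r\n]*?correct|www\.|help other[^\r\n]*?subtitle"
    r"|please rate|advertise your|\.com|million dollar)"
    # Rest of the line, then a line break OR the very end of the text.
    r"[^\r\n]*(?:\r\n|\n|\r|\Z)",
    re.IGNORECASE | re.MULTILINE,
)

text = (
    "Over here.\n"
    "WWW.MY-SUBS.COM\n"
    "Bupkes.\n"
    "Synced and corrected by Octavia"   # last line, no trailing line break
)

print(ADS.sub("", text))
```

With `re.IGNORECASE`, the single `www\.` fragment covers both `www.` and `WWW.`, as described above.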

                                      Terry

                                      • guy038G
                                        guy038
                                        last edited by guy038

                                        Hi, @steve-wilson, @terry-r and All,

                                        Indeed, you can use separated lines if you include the (?x) modifier, which enables the free-spacing regex mode !

                                        So, copy/paste the search regex, below, in the Find what: zone, with, of course the two options Regular expression and Wrap around ticked

                                        Notes :

                                        • The \R syntax does not work in Free-spacing mode. So, you must write \r\n ( or \n if Unix files )

                                        • In Free-spacing mode :

                                          • The # is the comment-line symbol. To use it, literally, simply write \#

                                          • The space character is not taken into account. To use it, literally, simply write \x20 or [ ]

                                        • However, inside the \Q.......\E syntax, both the # and the space symbols are searched as literals !

                                        So, your multi-lines search regex could be, as below :

                                        (?x)
                                        (?-is)
                                        ^(
                                        .*\Q<font color="#ffff00">Sync by honeybunny - corrected by chamallow35</font>\E.*|
                                        .*\Q<font color="#ffff00">www.Addic7ed.Com</font>\E.*|
                                        .*\QSync by yyets.net - corrected by chamallow35\E.*|
                                        .*\Qwww.addic7ed.com\E.*|
                                        .*\QPlease rate this subtitle at www.osdb.link/6hdjt\E.*|
                                        .*\QSync & corrections by\E.*|
                                        .*\QPlease rate this subtitle\E.*|
                                        .*\Q== sync, corrected by <font color="#00FF00">elderman</font> ==\E.*|
                                        .*\Q<font color="#00FFFF">@elder_man\E.*|
                                        .*\Q<font color="#00FFFF">@elder_man</font> \E.*|
                                        .*\QWWW.MY-SUBS.COM\E.*|
                                        .*\QAdvertise your product or brand here\E.*|
                                        .*\Qcontact www.OpenSubtitles.org today\E.*|
                                        .*\QAmericasCardroom.com brings poker back\E.*|
                                        .*\QMillion Dollar Sunday Tournament every Sunday\E.*|
                                        .*\QSynced & corrected by\E.*|
                                        .*\QSynced and corrected by Octavia\E.*|
                                        .*\QHelp other users to choose the best subtitles\E.*
                                        )\r\n
                                        

                                        Remarks :

                                        • The entire search selection must not exceed 2,046 characters

                                        • Unfortunately, the multi-lines replacement is NOT allowed, with our N++ regex engine !

                                        Et voilà !
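Python has a comparable free-spacing mode, `re.VERBOSE`, sketched below with a shortened list. The details differ from the N++ engine and should be read as assumptions about Python only: there is no `\Q...\E`, so every literal space is written `\ ` and a literal `#` is written `\#`; the line break is spelled out since `\R` does not exist in Python's `re`. `ADS` is a hypothetical name.

```python
import re

ADS = re.compile(
    r"""
    ^ [^\r\n]*                     # anything before the sentence
    (?:
        <font\ color="\#ffff00">www\.Addic7ed\.Com</font>   # "#" and " " escaped
      | Sync\ by\ yyets\.net
      | www\.addic7ed\.com
      | Help\ other\ users\ to\ choose\ the\ best\ subtitles
    )
    [^\r\n]*                       # anything after the sentence
    (?: \r\n | \n | \r )           # the line break
    """,
    re.VERBOSE | re.MULTILINE,
)

text = (
    "Over here.\n"
    '<font color="#ffff00">www.Addic7ed.Com</font>\n'
    "Help other users to choose the best subtitles\n"
    "Bupkes.\n"
)

print(ADS.sub("", text))
```

One alternative per line keeps the list easy to scan and extend, which was the goal of the multi-line layout.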

                                        Cheers,

                                        guy038

                                        • Steve Wilson105S
                                          Steve Wilson105
                                          last edited by

                                          OK. I’ve read all the help files and understand a lot more, but I’m finding that a lot of the subtitles I’ve edited are missing lines. For instance, the line:
                                          Help other users to choose the best subtitles
                                          might be in the subtitle and the regex would find it and remove it. But in some of the subtitles, that line starts with a space:
                                           Help other users to choose the best subtitles
                                          That leading empty space is stopping the regex from catching the line. And of course, that’s just one of the lines that it happens to. I’ve tried changing the line in the regex to .\QHelp other users to choose the best subtitles\E.|, but that doesn’t seem to have any effect. I’ve tried
                                          .Help other users to choose the best subtitles| and .?Help other users to choose the best subtitles| , and neither of those seem to help either. I don’t have any idea what I’m missing. I just want to remove any line in the subtitle that contains Help other users to choose the best subtitles . I don’t care what (if anything) precedes it.

                                          • Steve Wilson105S
                                            Steve Wilson105
                                            last edited by

                                            OK, I understand a little better what is happening. I think. If I run a regex on a subtitle file containing 100 lines, single lines of text that are in the regex ARE removed. But sections of the subtitle that have 2 consecutive lines of text listed in the regex would only have the first line removed and a space would be prefixed to the next line. If I run the regex a second time, the second line WOULD be removed. If there were a third line that was listed in the regex (a rare occurrence), that third line would have TWO spaces prefixed. Running the regex a third time would catch that line. Except in cases of there being multiple consecutive lines at the very end of the file. In that case the line just doesn’t get removed no matter how many times I run the regex.

                                            It’s much better than doing this all by hand though, so thanks again.
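For anyone hitting the same symptoms (consecutive lines needing several passes, and the last line of a file never being removed), a line-by-line script avoids them, because each line is tested on its own and no line break has to be matched. The Python sketch below is one possible approach, not something proposed in this thread; the fragment list, the function names and the UTF-8 encoding are assumptions.

```python
from pathlib import Path

# Fragments that identify an unwanted line (a shortened, illustrative list).
UNWANTED = (
    "Sync by",
    "www.",
    "Please rate this subtitle",
    "Help other users to choose the best subtitles",
)


def clean_text(text):
    """Drop every line that contains one of the UNWANTED fragments."""
    kept = [
        line
        for line in text.splitlines(keepends=True)  # keep original line breaks
        if not any(fragment in line for fragment in UNWANTED)
    ]
    return "".join(kept)


def clean_folder(folder):
    """Rewrite each .srt file in `folder` in place; return how many changed."""
    changed = 0
    for path in sorted(Path(folder).glob("*.srt")):
        # newline="" stops Python from translating \r\n, so endings survive.
        with open(path, encoding="utf-8", newline="") as fh:
            original = fh.read()
        cleaned = clean_text(original)
        if cleaned != original:
            with open(path, "w", encoding="utf-8", newline="") as fh:
                fh.write(cleaned)
            changed += 1
    return changed
```

Because the test is "contains" and not "starts with", a leading space does not matter, and a single pass handles any number of consecutive unwanted lines. Working on a copy of the folder first would be prudent, since the files are rewritten in place.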
