How to detect lines lacking "..." at end?

  • S
    Silent Resident
    last edited by Silent Resident Aug 13, 2019, 10:01 PM Aug 13, 2019, 9:58 PM

    I use Notepad++ as my primary program for editing movie subtitles.

    For years I have struggled with one recurring issue: subtitles often lack the three dots (…) at the end of a dialog break. How can I tell Notepad++ to detect the lines lacking the dots and fix them?

    For example, we have 2 dialogs, each broken into 2 parts, resulting in four subtitle lines:

    1
    00:00:10,000 --> 00:00:13,000
    All the living organisms
    can co-exist

    2
    00:00:15,000 --> 00:00:18,000
    on earth’s surface.

    3
    00:00:20,000 --> 00:00:23,000
    Earth is home to more than 7 billion humans

    4
    00:00:25,000 --> 00:00:28,000
    who live together, forming communities.

    Subtitle lines 2 and 4 are fine, so we ignore them. However, subtitle lines 1 and 3 lack the three dots (…) at their end. How can I tell Notepad++ to add “…”, like this:

    1
    00:00:10,000 --> 00:00:13,000
    All the living organisms
    can co-exist…

    3
    00:00:20,000 --> 00:00:23,000
    Earth is home to more than 7 billion humans…

    • P
      PeterJones
      last edited by Aug 13, 2019, 10:18 PM

      You seem to be asking to add “…” to any subtitle that doesn’t already end in a “.”. However, studying your data, I am guessing you really want to “add an ellipsis to any subtitle that doesn’t end with a sentence-ender”. Unfortunately, detecting all possible sentence-enders might be more difficult than you think. I mean, there’s ., but there’s also ! and ?. There are also lines that end in quotes, and an endquote doesn’t necessarily mean end of sentence…

      I also cannot tell in your example data whether there’s a blank line between subtitles, but there probably is, and I’m going to assume that all subtitles end with a blank line.

      I think, after you clarify that, we’ll be able to come up with a better solution.

      -----

      My naive guess for just “subtitles that don’t end with a period need an ellipsis” would have been

      • FIND = [^\.]\K(?=\R\R\d+$)
      • replace = ...
      • mode = regular expression

      but that seems to add two sets of ... for each, and I cannot quickly figure out why. But that’s a starting point. And if you need real end-of-sentence rather than just .-end-of-sentence detection, we can probably improve things… but it won’t be perfect. (though @guy038 will come close to perfection, I am sure.)

      (Like that! That last parenthetical sentence is the end of a sentence, and wouldn’t need an ellipsis, but ends with something that I hadn’t thought of as being an allowable sentence-ender.)

      • A
        Alan Kilborn @Silent Resident
        last edited by Aug 13, 2019, 10:24 PM

        @Silent-Resident

        What’s the definitive pattern? A lack of a trailing period on the last one of a sequence?

        • D
          dinkumoil
          last edited by dinkumoil Aug 13, 2019, 11:08 PM Aug 13, 2019, 11:03 PM

          @Silent-Resident
          My suggestion:

          Find: [^[:punct:]]$\K(?=\R\R\d+$)
          Replace: ...
          Mode: Regular expression

          @PeterJones
          Due to the addition of $ it doesn’t add two sets of .... The [:punct:] character class should match all characters in question.
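Which characters count as "punctuation" is the crux of this suggestion. As a rough way to preview which subtitle endings such a class would skip, here is a minimal Python sketch. It is an assumption-laden stand-in, not Notepad++'s implementation: the helper name `ends_with_punctuation` is hypothetical, and it uses Unicode general categories starting with `P`, which differ from POSIX `[:punct:]` (for example, `$`, `+` and `~` are Unicode symbols, not punctuation).

```python
import unicodedata

def ends_with_punctuation(line: str) -> bool:
    """True when the last visible character is in a Unicode punctuation category (P*).

    Assumption: Unicode P* categories only approximate the [:punct:] class used
    in the thread; symbols such as '$' or '+' are NOT covered here.
    """
    stripped = line.rstrip("\r\n")
    return bool(stripped) and unicodedata.category(stripped[-1]).startswith("P")

for ending in ("can co-exist", "on earth's surface.", "(Like that!)", "animals :"):
    verdict = "skipped" if ends_with_punctuation(ending) else "gets an ellipsis"
    print(f"{ending!r}: {verdict}")
```

Note that a closing parenthesis or a colon also counts as punctuation here, so lines ending that way would be skipped, which may or may not be what you want.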

          • G
            guy038
            last edited by guy038 Aug 13, 2019, 11:41 PM Aug 13, 2019, 11:27 PM

            Hello @silent-resident, and All,

            Quite easy! For instance, take your example below, to which I added 4 subtitle lines ( 5 to 8 ) :

            1
            00:00:10,000 --> 00:00:13,000
            All the living organisms
            can co-exist
            
            2
            00:00:15,000 --> 00:00:18,000
            on earth’s surface.
            
            3
            00:00:20,000 --> 00:00:23,000
            Earth is home to more than 7 billion humans
            
            4
            00:00:25,000 --> 00:00:28,000
            who live together, forming communities.
            
            5
            00:00:30,000 --> 00:00:33,000
            In the first volume of the 10th edition of "Systema Naturae",
            
            6
            00:00:36,000 --> 00:00:39,000
            written by the Swedish naturalist Carolus Linnaeus, the Animal kingdom
            
            7
            00:00:42,000 --> 00:00:45,000
            is broken down into six original classes of animals :
            
            8
            00:00:48,000 --> 00:00:51,000
            Mammalia, Aves, Amphibia, Pisces, Insecta, & Vermes.
            

            Note : The text in the last 4 subtitles comes from :

            https://en.wikipedia.org/wiki/10th_edition_of_Systema_Naturae#Animals

            So, in plain language, your request could be : add a … sign at the end of any line that does not end with a dot and that is followed by at least 2 line-breaks. In that case, here is my solution :


            • Open the Replace dialog ( Ctrl + H )

            • SEARCH [^.\r\n]\K(?=\R{2,})

            • REPLACE … OR \x{2026}

            • Tick the Wrap around option

            • Select the Regular expression search mode

            • Click on the Replace All button, exclusively ( Do not use the Replace button ! )

            And you’ll get :

            1
            00:00:10,000 --> 00:00:13,000
            All the living organisms
            can co-exist…
            
            2
            00:00:15,000 --> 00:00:18,000
            on earth’s surface.
            
            3
            00:00:20,000 --> 00:00:23,000
            Earth is home to more than 7 billion humans…
            
            4
            00:00:25,000 --> 00:00:28,000
            who live together, forming communities.
            
            5
            00:00:30,000 --> 00:00:33,000
            In the first volume of the 10th edition of "Systema Naturae",…
            
            6
            00:00:36,000 --> 00:00:39,000
            written by the Swedish naturalist Carolus Linnaeus, the Animal kingdom…
            
            7
            00:00:42,000 --> 00:00:45,000
            is broken down into six original classes of animals :…
            
            8
            00:00:48,000 --> 00:00:51,000
            Mammalia, Aves, Amphibia, Pisces, Insecta, & Vermes.
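
For anyone who wants to run the same replacement outside Notepad++, the search/replace above can be approximated in Python. This is only a sketch under stated assumptions: Python's `re` module has neither `\K` nor `\R`, so a lookbehind stands in for `\K` and an explicit CRLF/LF/CR group stands in for `\R`. The names `LINE_BREAK`, `NEEDS_ELLIPSIS` and `add_ellipsis` are hypothetical.

```python
import re

# Assumption: Python's re supports neither \K nor \R, so this sketch swaps in
# a lookbehind for \K and an explicit CRLF / LF / lone-CR group for \R.
LINE_BREAK = r"(?:\r\n|\n|\r(?!\n))"
NEEDS_ELLIPSIS = re.compile(r"(?<=[^.\r\n])(?=" + LINE_BREAK + r"{2,})")

def add_ellipsis(srt_text: str) -> str:
    """Append U+2026 to each line that lacks a final dot and precedes a blank line."""
    return NEEDS_ELLIPSIS.sub("\u2026", srt_text)

sample = (
    "1\n00:00:10,000 --> 00:00:13,000\nAll the living organisms\ncan co-exist\n\n"
    "2\n00:00:15,000 --> 00:00:18,000\non earth's surface.\n\n"
    "7\n00:00:42,000 --> 00:00:45,000\n"
    "is broken down into six original classes of animals :\n\n"
)
print(add_ellipsis(sample))
```

As in the Notepad++ result above, the line ending in a colon also receives an ellipsis, because only the dot is excluded.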
            

            Now, if you do not want to add a horizontal ellipsis ( … ) after a punctuation sign, prefer the following regex S/R :

            SEARCH \w\K(?=\R{2,})

            REPLACE … OR \x{2026}

            This time, the text will be changed into :

            1
            00:00:10,000 --> 00:00:13,000
            All the living organisms
            can co-exist…
            
            2
            00:00:15,000 --> 00:00:18,000
            on earth’s surface.
            
            3
            00:00:20,000 --> 00:00:23,000
            Earth is home to more than 7 billion humans…
            
            4
            00:00:25,000 --> 00:00:28,000
            who live together, forming communities.
            
            5
            00:00:30,000 --> 00:00:33,000
            In the first volume of the 10th edition of "Systema Naturae",
            
            6
            00:00:36,000 --> 00:00:39,000
            written by the Swedish naturalist Carolus Linnaeus, the Animal kingdom…
            
            7
            00:00:42,000 --> 00:00:45,000
            is broken down into six original classes of animals :
            
            8
            00:00:48,000 --> 00:00:51,000
            Mammalia, Aves, Amphibia, Pisces, Insecta, & Vermes.
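
The effect of switching to `\w` can be checked quickly with a small Python sketch. The same assumptions apply: a lookbehind replaces `\K`, an explicit line-break group replaces `\R`, and Python's `\w` is taken as close enough to the editor's word class for these sample endings. The names `WORD_END` and `add_ellipsis_after_word` are hypothetical.

```python
import re

# Assumption: \K and \R are unavailable in Python's re, so a lookbehind and an
# explicit line-break group stand in for them.
WORD_END = re.compile(r"(?<=\w)(?=(?:\r\n|\n|\r(?!\n)){2,})")

def add_ellipsis_after_word(srt_text: str) -> str:
    """Append U+2026 only when a subtitle's last character is a word character."""
    return WORD_END.sub("\u2026", srt_text)

endings = ("billion humans", 'of "Systema Naturae",', "classes of animals :", "& Vermes.")
for last_line in endings:
    block = f"3\n00:00:20,000 --> 00:00:23,000\n{last_line}\n\n"
    changed = add_ellipsis_after_word(block) != block
    print(f"{last_line!r}: {'ellipsis added' if changed else 'left unchanged'}")
```

Only the ending without trailing punctuation is changed, which matches the second result listed above.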
            

            Best Regards,

            guy038

            • S
              Silent Resident
              last edited by Aug 14, 2019, 12:45 AM

              Amazing! Thank you all very very much, I am grateful. This is what I was looking for.

              • G
                guy038
                last edited by guy038 Aug 14, 2019, 1:00 AM Aug 14, 2019, 12:57 AM

                Hi, @silent-resident, @peterjones, @alan-kilborn, @dinkumoil, and All,

                @peterjones :

                Peter, your search regex [^\.]\K(?=\R\R\d+$) ( which could be written [^.]\K(?=\R\R\d+$) ) does not work as expected ! Why ?

                • Considering the first subtitle, in a Windows file, your regex first matches the zero-length gap between the letter t and the \r EOL character, and it correctly adds the … symbol. OK !

                • Now, the regex engine is located between the two characters … and \r. At this location, is there a single char, different from a dot, which is followed by two line-breaks ? The answer is YES : it is the \r EOL char itself, which is followed by \n ( the first \R ) and by the \r\n pair ( the second \R ). Thus, a second … symbol is added between the \r and the \n of the first line-break

                • The same thing cannot happen again right afterwards : if the engine chose the following \r char ( as [^.] ), the only remaining EOL character would be \n, which cannot match the \R\R part of your regex ! So the next match happens, wrongly, between the \r and \n EOL characters at the end of the line on earth’s surface. and so on ... -((

                Luckily, 3 solutions are possible to get the right behaviour :

                • The [^.\r\n]\K(?=\R\R\d+$) syntax, which forces the character before the EOL characters to be different from EOL chars !

                • The [^.]$\K(?=\R\R\d+$) syntax. Adding the $ anchor forces the [^.] character to be located right before an end of line ( so, obviously, not between the two EOL characters \r and \n). It’s the solution adopted by @dinkumoil !

                • Finally, use the exact Windows EOL definition, with the [^.]\K(?=\r\n\r\n\d+$) syntax
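
The double insertion and the three fixes listed above can be reproduced with a short Python sketch. Everything in it is an emulation under stated assumptions: Python's `re` has no `\K` or `\R`, and its `$` does not treat `\r\n` as a unit, so `R` mimics an atomic `\R`, `EOL` mimics a `$` that never matches between `\r` and `\n`, and a lookbehind replaces `\K`. The names `PATTERNS`, `SAMPLE` and `match_positions` are hypothetical.

```python
import re

# Assumptions: \K, \R and a CRLF-aware `$` are all emulated here.
# R behaves like an atomic \R for CRLF, LF and lone CR.
# EOL is true only at a real end of line, never between \r and \n.
R = r"(?:\r\n|\n|\r(?!\n))"
EOL = r"(?:(?=\r|\Z)|(?<!\r)(?=\n))"
AHEAD = rf"(?={R}{R}\d+{EOL})"

PATTERNS = {
    "original": rf"(?<=[^.]){AHEAD}",            # [^.]\K(?=\R\R\d+$)
    "exclude_eol": rf"(?<=[^.\r\n]){AHEAD}",     # [^.\r\n]\K(?=\R\R\d+$)
    "dollar_anchor": rf"(?<=[^.]){EOL}{AHEAD}",  # [^.]$\K(?=\R\R\d+$)
    "explicit_crlf": rf"(?<=[^.])(?=\r\n\r\n\d+{EOL})",  # [^.]\K(?=\r\n\r\n\d+$)
}

# Windows line endings, as in the explanation above.
SAMPLE = "can co-exist\r\n\r\n2\r\non earth's surface.\r\n\r\n3\r\n"

def match_positions(pattern: str, text: str) -> list:
    """Return the offsets at which the zero-length match (and so the insert) occurs."""
    return [m.start() for m in re.finditer(pattern, text)]

for name, pattern in PATTERNS.items():
    print(name, match_positions(pattern, SAMPLE))
```

On this sample the emulated original pattern reports three positions: after the letter t, between the \r and \n that follow it, and between the \r and \n after surface. Each of the three variants reports only the first one.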

                Cheers,

                guy038

                The Community of users of the Notepad++ text editor.