    How to detect lines lacking "..." at end?

    • Silent Resident

      I use Notepad++ as my primary program for editing movie subtitles.

      For years I have been struggling with one issue: the subtitles often lack the three dots (…) at the end of the dialog breaks. How can I tell Notepad++ to detect the lines lacking dots, and fix them?

      For example, we have 2 dialogs, each broken into 2 parts, resulting in four subtitle lines:

      1
      00:00:10,000 --> 00:00:13,000
      All the living organisms
      can co-exist

      2
      00:00:15,000 --> 00:00:18,000
      on earth’s surface.

      3
      00:00:20,000 --> 00:00:23,000
      Earth is home to more than 7 billion humans

      4
      00:00:25,000 --> 00:00:28,000
      who live together, forming communities.

      Subtitle lines 2 and 4 are fine, so we ignore them. However, subtitle lines 1 and 3 lack the three dots (…) at their end. How can I tell Notepad++ to add “…”, like this:

      1
      00:00:10,000 --> 00:00:13,000
      All the living organisms
      can co-exist…

      3
      00:00:20,000 --> 00:00:23,000
      Earth is home to more than 7 billion humans…

      • PeterJones

        You seem to be asking to add “…” to any subtitle that doesn’t already end in a “.”. However, studying your data, I am guessing you really want to “add an ellipsis to any subtitle that doesn’t end with a sentence-ender”. Unfortunately, detecting all possible sentence-enders might be more difficult than you think. I mean, there’s ., but there’s also ! and ?. And there are also lines that end in quotes – and an endquote doesn’t necessarily mean end of sentence…

        I also cannot tell in your example data whether there’s a blank line between subtitles, but there probably is, and I’m going to assume that all subtitles end with a blank line.

        I think, after you clarify that, we’ll be able to come up with a better solution.

        -----

        My naive guess for just “subtitles that don’t end with a period need an ellipsis” would have been

        • FIND = [^\.]\K(?=\R\R\d+$)
        • replace = ...
        • mode = regular expression

        but that seems to add two sets of ... for each, and I cannot quickly figure out why. But that’s a starting point. And if you need real end-of-sentence rather than just .-end-of-sentence detection, we can probably improve things… but it won’t be perfect. (though @guy038 will come close to perfection, I am sure.)

        (Like that! That last parenthetical sentence is the end of a sentence, and wouldn’t need an ellipsis, but ends with something that I hadn’t thought of as being an allowable sentence-ender.)

        • Alan Kilborn @Silent Resident

          @Silent-Resident

          What’s the definitive pattern? A lack of a trailing period on the last one of a sequence?

          • dinkumoil

            @Silent-Resident
            My suggestion:

            Find: [^[:punct:]]$\K(?=\R\R\d+$)
            Replace: ...
            Mode: Regular expression

            @PeterJones
            Due to the addition of $ it doesn’t add two sets of .... The [:punct:] character class should match all characters in question.
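
For readers who want to try this idea outside Notepad++, here is a rough Python sketch of the same search. It is only an approximation: Python's `re` module has no `\K`, `\R` or `[:punct:]`, so the sketch uses a lookbehind, an explicit `\r?\n`, and the ASCII-only `string.punctuation` as a stand-in (the Notepad++ class may cover more characters). The function name `add_ellipsis` is made up for this illustration.

```python
import re
import string

# ASCII-only stand-in for [:punct:] (an assumption, see the note above).
PUNCT = re.escape(string.punctuation)

# Zero-width position that sits:
#   - right after a char that is neither punctuation nor an EOL char,
#   - right before a line break, a blank line, and a line made of digits.
PATTERN = re.compile(
    r"(?<=[^" + PUNCT + r"\r\n])(?=\r?\n\r?\n\d+\r?$)",
    re.MULTILINE,
)


def add_ellipsis(srt: str) -> str:
    """Insert '...' after subtitles that do not end with punctuation."""
    return PATTERN.sub("...", srt)


sample = (
    "1\r\n00:00:10,000 --> 00:00:13,000\r\n"
    "All the living organisms\r\ncan co-exist\r\n\r\n"
    "2\r\n00:00:15,000 --> 00:00:18,000\r\n"
    "on earth's surface.\r\n\r\n"
    "3\r\n"
)
print(add_ellipsis(sample))
```

Excluding `\r` and `\n` in the lookbehind plays the role that `$` plays in the Notepad++ regex: the insertion point can never fall between the two EOL characters.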

            • guy038

              Hello @silent-resident, and All,

              Quite easy! For instance, take your example below, to which I added 4 subtitle lines (5 to 8):

              1
              00:00:10,000 --> 00:00:13,000
              All the living organisms
              can co-exist
              
              2
              00:00:15,000 --> 00:00:18,000
              on earth’s surface.
              
              3
              00:00:20,000 --> 00:00:23,000
              Earth is home to more than 7 billion humans
              
              4
              00:00:25,000 --> 00:00:28,000
              who live together, forming communities.
              
              5
              00:00:30,000 --> 00:00:33,000
              In the first volume of the 10th edition of "Systema Naturae",
              
              6
              00:00:36,000 --> 00:00:39,000
              written by the Swedish naturalist Carolus Linnaeus, the Animal kingdom
              
              7
              00:00:42,000 --> 00:00:45,000
              is broken down into six original classes of animals :
              
              8
              00:00:48,000 --> 00:00:51,000
              Mammalia, Aves, Amphibia, Pisces, Insecta, & Vermes.
              

              Note: The text in the last 4 subtitles comes from:

              https://en.wikipedia.org/wiki/10th_edition_of_Systema_Naturae#Animals

              So, in plain language, your request could be: add a … sign at the end of any line which does not end with a dot and which is followed by at least 2 line-breaks. In that case, here is my solution:


              • Open the Replace dialog ( Ctrl + H )

              • SEARCH [^.\r\n]\K(?=\R{2,})

              • REPLACE … OR \x{2026}

              • Tick the Wrap around option

              • Select the Regular expression search mode

              • Click on the Replace All button, exclusively ( Do not use the Replace button ! )

              And you’ll get :

              1
              00:00:10,000 --> 00:00:13,000
              All the living organisms
              can co-exist…
              
              2
              00:00:15,000 --> 00:00:18,000
              on earth’s surface.
              
              3
              00:00:20,000 --> 00:00:23,000
              Earth is home to more than 7 billion humans…
              
              4
              00:00:25,000 --> 00:00:28,000
              who live together, forming communities.
              
              5
              00:00:30,000 --> 00:00:33,000
              In the first volume of the 10th edition of "Systema Naturae",…
              
              6
              00:00:36,000 --> 00:00:39,000
              written by the Swedish naturalist Carolus Linnaeus, the Animal kingdom…
              
              7
              00:00:42,000 --> 00:00:45,000
              is broken down into six original classes of animals :…
              
              8
              00:00:48,000 --> 00:00:51,000
              Mammalia, Aves, Amphibia, Pisces, Insecta, & Vermes.
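
As a cross-check outside Notepad++, the same search/replace can be sketched in Python. This is an approximation, not the Notepad++ regex itself: Python's `re` has neither `\K` nor `\R`, so the sketch assumes a lookbehind in place of `[^.\r\n]\K` and a hand-written `EOL` group in place of `\R`. The names `EOL`, `PATTERN` and `add_ellipsis` are illustrative only.

```python
import re

# One line break: \r\n, a lone \n, or a lone \r (rough stand-in for \R).
EOL = r"(?:\r\n|\n|\r(?!\n))"

# Zero-width position after a char that is neither a dot nor an EOL char,
# and before at least two consecutive line breaks.
PATTERN = re.compile(r"(?<=[^.\r\n])(?=" + EOL + r"{2,})")


def add_ellipsis(text: str) -> str:
    """Insert the horizontal ellipsis (U+2026) at every matching position."""
    return PATTERN.sub("\u2026", text)


sample = (
    "can co-exist\r\n\r\n"
    "2\r\non earth's surface.\r\n\r\n"
    '5\r\nof "Systema Naturae",\r\n\r\n'
)
print(add_ellipsis(sample))
```

As in the result above, a line ending with a comma still receives the ellipsis, because only the dot is excluded.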
              

              Now, if you do not want to add a horizontal ellipsis (…) after a punctuation sign, prefer the following regex S/R:

              SEARCH \w\K(?=\R{2,})

              REPLACE … OR \x{2026}

              This time, the text will be changed into :

              1
              00:00:10,000 --> 00:00:13,000
              All the living organisms
              can co-exist…
              
              2
              00:00:15,000 --> 00:00:18,000
              on earth’s surface.
              
              3
              00:00:20,000 --> 00:00:23,000
              Earth is home to more than 7 billion humans…
              
              4
              00:00:25,000 --> 00:00:28,000
              who live together, forming communities.
              
              5
              00:00:30,000 --> 00:00:33,000
              In the first volume of the 10th edition of "Systema Naturae",
              
              6
              00:00:36,000 --> 00:00:39,000
              written by the Swedish naturalist Carolus Linnaeus, the Animal kingdom…
              
              7
              00:00:42,000 --> 00:00:45,000
              is broken down into six original classes of animals :
              
              8
              00:00:48,000 --> 00:00:51,000
              Mammalia, Aves, Amphibia, Pisces, Insecta, & Vermes.
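
To see which line endings the `\w` variant touches, here is a small Python sketch under the same caveats as any port of a Notepad++ regex: a lookbehind replaces `\w\K`, and the `EOL` group is an assumed stand-in for `\R`. The helper name `add_ellipsis_after_word` is hypothetical.

```python
import re

# One line break: \r\n, a lone \n, or a lone \r (rough stand-in for \R).
EOL = r"(?:\r\n|\n|\r(?!\n))"

# Zero-width position after a word character, before 2+ line breaks.
WORD_END = re.compile(r"(?<=\w)(?=" + EOL + r"{2,})")


def add_ellipsis_after_word(text: str) -> str:
    """Insert U+2026 only when the subtitle ends with a word character."""
    return WORD_END.sub("\u2026", text)


# Endings taken from subtitles 5 to 8 of the example.
for ending in ["kingdom", 'Naturae",', "animals :", "Vermes."]:
    print(repr(add_ellipsis_after_word(ending + "\r\n\r\n")))
```

Keep in mind that `\w` also matches digits and the underscore, so a subtitle whose text ends with a number would get the ellipsis too.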
              

              Best Regards,

              guy038

              • Silent Resident

                Amazing! Thank you all very very much, I am grateful. This is what I was looking for.

                • guy038

                  Hi, @silent-resident, @peterjones, @alan-kilborn, @dinkumoil, and All,

                  @peterjones :

                  Peter, your search regex [^\.]\K(?=\R\R\d+$) (which could be written [^.]\K(?=\R\R\d+$)) does not work as expected! Why?

                  • Well, considering the first subtitle, in a Windows file, your regex first matches the zero-length gap between the letter t and the \r EOL character, and it correctly adds the ... string. OK!

                  • Now, the regex engine location is between the inserted ... and the \r character. From this location, is there a single char, different from a dot, which is followed by two line-breaks? The answer is YES: it’s the \r EOL char itself, which is followed by \n (so, the first \R) and by the \r\n pair (so, the second \R). Thus, it adds a second ... between the \r and the \n of the first line-break

                  • This situation cannot occur again right after: once the \r char has been taken as [^.], the next candidate is the \n character, and only a single line-break (\r\n) remains after it, which cannot match the \R\R part of your regex! So, the next match happens, wrongly, between the \r and \n EOL characters at the end of the line on earth’s surface. and so on ... -((

                  Luckily, 3 solutions are possible to get the right behaviour :

                  • The [^.\r\n]\K(?=\R\R\d+$) syntax, which forces the character before the EOL characters to be different from EOL chars!

                  • The [^.]$\K(?=\R\R\d+$) syntax. Adding the $ anchor forces the [^.] character to be located right before an end of line ( so, obviously, not between the two EOL characters \r and \n). It’s the solution adopted by @dinkumoil !

                  • Finally, use the exact Windows EOL definition, with the [^.]\K(?=\r\n\r\n\d+$) syntax
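
The double insertion and two of the three fixes can be reproduced with a small Python sketch. It rests on several assumptions: a lookbehind stands in for `\K`, the `EOL` group stands in for `\R`, and `\r?$` stands in for the Notepad++ `$`, since Python's `$` only recognises `\n` as a line end. For that same reason the `$`-anchor fix is not ported here. The names `naive`, `fixed` and `exact` are made up for the illustration.

```python
import re

# One line break: \r\n, a lone \n, or a lone \r (rough stand-in for \R).
EOL = r"(?:\r\n|\n|\r(?!\n))"

# Two line breaks, then a line made of digits (the next subtitle number).
NEXT_NUMBER = r"(?=" + EOL + EOL + r"\d+\r?$)"

# Original attempt: any char except a dot, so \r itself qualifies.
naive = re.compile(r"(?<=[^.])" + NEXT_NUMBER, re.MULTILINE)

# Fix 1: the char before the line breaks must not be an EOL char.
fixed = re.compile(r"(?<=[^.\r\n])" + NEXT_NUMBER, re.MULTILINE)

# Fix 3: spell out the exact Windows line breaks.
exact = re.compile(r"(?<=[^.])(?=\r\n\r\n\d+\r?$)", re.MULTILINE)

sample = "can co-exist\r\n\r\n2\r\non earth's surface.\r\n\r\n3\r\n"

for name, rx in [("naive", naive), ("fixed", fixed), ("exact", exact)]:
    print(name, repr(rx.sub("...", sample)))
```

With the naive pattern, the sketch inserts a second `...` between `\r` and `\n` after co-exist, and a stray one after surface., which matches the behaviour described above; the two fixed patterns insert a single `...` after co-exist only.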

                  Cheers,

                  guy038

