
    How to detect lines lacking "..." at end?

    • Silent Resident

      I use Notepad++ as my primary program for editing movie subtitles.

      For years I have struggled with one recurring issue: the subtitles often lack the three dots (…) at the end of a dialog break. How can I tell Notepad++ to detect the lines lacking dots and fix them?

      For example, we have 2 dialogs, each broken into 2 parts, resulting in four subtitles:

      1
      00:00:10,000 --> 00:00:13,000
      All the living organisms
      can co-exist

      2
      00:00:15,000 --> 00:00:18,000
      on earth’s surface.

      3
      00:00:20,000 --> 00:00:23,000
      Earth is home to more than 7 billion humans

      4
      00:00:25,000 --> 00:00:28,000
      who live together, forming communities.

      Subtitles 2 and 4 are fine, so we ignore them. However, subtitles 1 and 3 lack the three dots (…) at their end. How can I tell Notepad++ to add “…”, like this:

      1
      00:00:10,000 --> 00:00:13,000
      All the living organisms
      can co-exist…

      3
      00:00:20,000 --> 00:00:23,000
      Earth is home to more than 7 billion humans…

        • PeterJones

        You seem to be asking to add “…” to any subtitle that doesn’t already end in a “.”. However, studying your data, I am guessing you really want to “add an ellipsis to any subtitle that doesn’t end with a sentence-ender”. Unfortunately, detecting all possible sentence-enders might be more difficult than you think. There’s ., but there’s also ! and ?. There are also lines that end in quotes – and an endquote doesn’t necessarily mean end of sentence…

        I also cannot tell in your example data whether there’s a blank line between subtitles, but there probably is, and I’m going to assume that all subtitles end with a blank line.

        I think, after you clarify that, we’ll be able to come up with a better solution.

        -----

        My naive guess for just “subtitles that don’t end with a period need an ellipsis” would have been

        • FIND = [^\.]\K(?=\R\R\d+$)
        • replace = ...
        • mode = regular expression

        but that seems to add two sets of ... for each, and I cannot quickly figure out why. But that’s a starting point. And if you need real end-of-sentence rather than just .-end-of-sentence detection, we can probably improve things… but it won’t be perfect. (though @guy038 will come close to perfection, I am sure.)

        (Like that! That last parenthetical sentence is the end of a sentence, and wouldn’t need an ellipsis, but ends with something that I hadn’t thought of as being an allowable sentence-ender.)

        • Alan Kilborn

          @Silent-Resident

          What’s the definitive pattern? A lack of a trailing period on the last one of a sequence?

          • dinkumoil

            @Silent-Resident
            My suggestion:

            Find: [^[:punct:]]$\K(?=\R\R\d+$)
            Replace: ...
            Mode: Regular expression

            @PeterJones
            Due to the addition of $ it doesn’t add two sets of .... The [:punct:] character class should match all characters in question.
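For anyone who wants to try the same idea outside Notepad++, here is a minimal Python sketch. It is an approximation rather than a port. Python's `re` module has neither `\K`, `\R`, nor the `[:punct:]` class, so a lookbehind, explicit `\r?\n` pairs, and `string.punctuation` stand in for them. The helper name `add_ellipsis` is hypothetical, and `string.punctuation` covers ASCII punctuation only, which may be narrower than what `[:punct:]` matches in Notepad++.

```python
import re
import string

# Assumption: ASCII punctuation only; [:punct:] in Notepad++ may match more.
PUNCT = re.escape(string.punctuation)

# A lookbehind stands in for \K, and (?:\r?\n){2} stands in for \R\R.
# Python's "$" only recognises "\n", so "\r?" absorbs a Windows "\r", and
# "\r\n" is excluded from the class to keep the match in front of the EOL.
NEEDS_ELLIPSIS = re.compile(
    rf"(?<=[^{PUNCT}\r\n])(?=(?:\r?\n){{2}}\d+\r?$)", re.MULTILINE
)


def add_ellipsis(srt_text: str) -> str:
    """Append "..." to a subtitle whose last character is not punctuation."""
    return NEEDS_ELLIPSIS.sub("...", srt_text)


sample = "\r\n".join([
    "1",
    "00:00:10,000 --> 00:00:13,000",
    "All the living organisms",
    "can co-exist",
    "",
    "2",
    "00:00:15,000 --> 00:00:18,000",
    "on earth's surface.",
    "",
    "3",
    "",
])
result = add_ellipsis(sample)
print(result)
```

Only the subtitle ending in "can co-exist" is changed; the one ending in a period is left alone.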

            • guy038

              Hello @silent-resident, and All,

              Quite easy! For instance, take your example below, to which I added 4 subtitles (from 5 to 8):

              1
              00:00:10,000 --> 00:00:13,000
              All the living organisms
              can co-exist
              
              2
              00:00:15,000 --> 00:00:18,000
              on earth’s surface.
              
              3
              00:00:20,000 --> 00:00:23,000
              Earth is home to more than 7 billion humans
              
              4
              00:00:25,000 --> 00:00:28,000
              who live together, forming communities.
              
              5
              00:00:30,000 --> 00:00:33,000
              In the first volume of the 10th edition of "Systema Naturae",
              
              6
              00:00:36,000 --> 00:00:39,000
              written by the Swedish naturalist Carolus Linnaeus, the Animal kingdom
              
              7
              00:00:42,000 --> 00:00:45,000
              is broken down into six original classes of animals :
              
              8
              00:00:48,000 --> 00:00:51,000
              Mammalia, Aves, Amphibia, Pisces, Insecta, & Vermes.
              

              Note: The text in the last 4 subtitles comes from:

              https://en.wikipedia.org/wiki/10th_edition_of_Systema_Naturae#Animals

              So, in plain language, your request could be: add a … sign at the end of any line that does not end with a dot and that is followed by at least 2 line-breaks. In that case, here is my solution:


              • Open the Replace dialog ( Ctrl + H )

              • SEARCH [^.\r\n]\K(?=\R{2,})

              • REPLACE … OR \x{2026}

              • Tick the Wrap around option

              • Select the Regular expression search mode

              • Click on the Replace All button, exclusively ( Do not use the Replace button ! )

              And you’ll get :

              1
              00:00:10,000 --> 00:00:13,000
              All the living organisms
              can co-exist…
              
              2
              00:00:15,000 --> 00:00:18,000
              on earth’s surface.
              
              3
              00:00:20,000 --> 00:00:23,000
              Earth is home to more than 7 billion humans…
              
              4
              00:00:25,000 --> 00:00:28,000
              who live together, forming communities.
              
              5
              00:00:30,000 --> 00:00:33,000
              In the first volume of the 10th edition of "Systema Naturae",…
              
              6
              00:00:36,000 --> 00:00:39,000
              written by the Swedish naturalist Carolus Linnaeus, the Animal kingdom…
              
              7
              00:00:42,000 --> 00:00:45,000
              is broken down into six original classes of animals :…
              
              8
              00:00:48,000 --> 00:00:51,000
              Mammalia, Aves, Amphibia, Pisces, Insecta, & Vermes.
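
The same replacement can be scripted for batch use. The following is a rough Python sketch of the search/replace above, not the Notepad++ engine itself: a lookbehind replaces `\K`, and `(?:\r?\n){2,}` replaces `\R{2,}` under the assumption that the file only uses `\n` or `\r\n` line breaks. The function name `append_ellipsis` is made up for illustration.

```python
import re

# Assumption: line breaks are "\n" or "\r\n" only; \R in Notepad++ accepts more kinds.
# A lookbehind stands in for \K, which Python's re module does not support.
MISSING_DOT = re.compile(r"(?<=[^.\r\n])(?=(?:\r?\n){2,})")


def append_ellipsis(srt_text: str, mark: str = "\u2026") -> str:
    """Insert `mark` after every line that lacks a final dot and precedes a blank line."""
    return MISSING_DOT.sub(mark, srt_text)


sample = "\n".join([
    "1",
    "00:00:10,000 --> 00:00:13,000",
    "All the living organisms",
    "can co-exist",
    "",
    "2",
    "00:00:15,000 --> 00:00:18,000",
    "on earth's surface.",
    "",
    "5",
    "00:00:30,000 --> 00:00:33,000",
    'In the first volume of the 10th edition of "Systema Naturae",',
    "",
    "6",
    "",
])
fixed = append_ellipsis(sample)
print(fixed)
```

As in the Notepad++ result, the line ending in a comma also receives an ellipsis, because only a trailing dot is excluded.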
              

              Now, if you do not want to add a horizontal ellipsis ( … ) after a punctuation sign, prefer the following regex S/R:

              SEARCH \w\K(?=\R{2,})

              REPLACE … OR \x{2026}

              This time, the text will be changed into :

              1
              00:00:10,000 --> 00:00:13,000
              All the living organisms
              can co-exist…
              
              2
              00:00:15,000 --> 00:00:18,000
              on earth’s surface.
              
              3
              00:00:20,000 --> 00:00:23,000
              Earth is home to more than 7 billion humans…
              
              4
              00:00:25,000 --> 00:00:28,000
              who live together, forming communities.
              
              5
              00:00:30,000 --> 00:00:33,000
              In the first volume of the 10th edition of "Systema Naturae",
              
              6
              00:00:36,000 --> 00:00:39,000
              written by the Swedish naturalist Carolus Linnaeus, the Animal kingdom…
              
              7
              00:00:42,000 --> 00:00:45,000
              is broken down into six original classes of animals :
              
              8
              00:00:48,000 --> 00:00:51,000
              Mammalia, Aves, Amphibia, Pisces, Insecta, & Vermes.
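
A short Python sketch of this second variant, under the same assumptions as before (lookbehind instead of `\K`, only `\n` or `\r\n` line breaks). One thing worth knowing: in Python, `\w` on `str` patterns also matches digits and the underscore, so a line ending in a number would get an ellipsis too.

```python
import re

# \w matches letters, digits and "_" (Unicode-aware for str patterns in Python 3).
AFTER_WORD_CHAR = re.compile(r"(?<=\w)(?=(?:\r?\n){2,})")

sample = (
    'In the first volume of the 10th edition of "Systema Naturae",\n\n'
    "6\n00:00:36,000 --> 00:00:39,000\n"
    "written by the Swedish naturalist Carolus Linnaeus, the Animal kingdom\n\n"
    "7\n00:00:42,000 --> 00:00:45,000\n"
    "is broken down into six original classes of animals :\n\n"
    "8\n"
)
fixed = AFTER_WORD_CHAR.sub("\u2026", sample)
print(fixed)
```

Here only the line ending in "kingdom" is changed; the lines ending in a comma or a colon are skipped.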
              

              Best Regards,

              guy038

              • Silent Resident

                Amazing! Thank you all very very much, I am grateful. This is what I was looking for.

                • guy038

                  Hi, @silent-resident, @peterjones, @alan-kilborn, @dinkumoil, and All,

                  @peterjones :

                  Peter, your search regex [^\.]\K(?=\R\R\d+$) ( which could be written [^.]\K(?=\R\R\d+$) ) does not work as expected! Why?

                  • Well, considering the first sub-title, in a Windows file, your regex first matches the zero-length gap, between letter t and the \r EOL character and it correctly adds the … symbol. OK !

                  • Now, the regex engine is located between the two characters … and \r. At this location, is there a single char, different from a dot, which is followed by two line-breaks? The answer is YES: it’s the \r EOL char itself, which is followed by \n ( so, the first \R ) and by the \r\n couple ( so, the second \R ). Thus, it adds a second … symbol between the \r and the \n of the first line-break

                  • This situation cannot occur again right after: if the engine took the following \n, or the next \r, as the [^.] character, at most one line-break would remain before the digit, which cannot match the \R\R part of your regex! So, the next match happens, wrongly, between the \r and \n EOL characters, at the end of the line on earth’s surface. and so on ... -((
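
The double insertion described above can be reproduced outside Notepad++ with a small Python sketch. This is an emulation under assumptions, not the Boost engine: `R` is a hand-written stand-in for `\R` (it treats `\r\n` as one unit, or a lone `\n`, or a `\r` not followed by `\n`), a lookbehind stands in for `\K`, and `\r?$` is used because Python's `$` only recognises `\n`.

```python
import re

# Stand-in for \R: "\r\n" as one unit, a lone "\n", or a "\r" not followed by "\n".
R = r"(?:\r\n|\n|\r(?!\n))"

# Emulation of [^.]\K(?=\R\R\d+$); note that [^.] happily matches "\r".
NAIVE = re.compile(rf"(?<=[^.])(?={R}{R}\d+\r?$)", re.MULTILINE)

text = "\r\n".join([
    "1",
    "00:00:10,000 --> 00:00:13,000",
    "All the living organisms",
    "can co-exist",
    "",
    "2",
    "00:00:15,000 --> 00:00:18,000",
    "on earth's surface.",
    "",
    "3",
    "",
])

hits = [m.start() for m in NAIVE.finditer(text)]
broken = NAIVE.sub("\u2026", text)

# repr() makes the stray ellipsis between "\r" and "\n" visible.
print(len(hits))
print(repr(broken))
```

In this emulation the pattern matches three times: once after "can co-exist", once between the `\r` and `\n` that follow it, and once between the `\r` and `\n` after "on earth's surface.".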

                  Luckily, 3 solutions are possible to get the right behaviour :

                  • The [^.\r\n]\K(?=\R\R\d+$) syntax, which forces the character before the EOL characters to be different from the EOL chars !

                  • The [^.]$\K(?=\R\R\d+$) syntax. Adding the $ anchor forces the [^.] character to be located right before an end of line ( so, obviously, not between the two EOL characters \r and \n). It’s the solution adopted by @dinkumoil !

                  • Finally, use the exact Windows EOL definition, with the [^.]\K(?=\r\n\r\n\d+$) syntax
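
Two of these fixes can be checked with a Python sketch along the same lines. The same caveats apply: `R` is a hand-written approximation of `\R`, a lookbehind replaces `\K`, and the sample is assumed to use Windows line endings. The `$`-anchor fix is left out because it relies on `$` not matching between `\r` and `\n`, which Python's `$` cannot be used to demonstrate since it only recognises `\n`.

```python
import re

# Stand-in for \R: "\r\n" as one unit, a lone "\n", or a "\r" not followed by "\n".
R = r"(?:\r\n|\n|\r(?!\n))"
END = r"\d+\r?$"  # the index line of the next subtitle

# Fix 1: exclude EOL characters from the class, as in [^.\r\n]\K(?=\R\R\d+$)
FIX_EXCLUDE_EOL = re.compile(rf"(?<=[^.\r\n])(?={R}{R}{END})", re.MULTILINE)

# Fix 3: spell out the Windows EOL, as in [^.]\K(?=\r\n\r\n\d+$)
FIX_EXPLICIT_CRLF = re.compile(rf"(?<=[^.])(?=\r\n\r\n{END})", re.MULTILINE)

text = "\r\n".join([
    "1",
    "00:00:10,000 --> 00:00:13,000",
    "All the living organisms",
    "can co-exist",
    "",
    "2",
    "00:00:15,000 --> 00:00:18,000",
    "on earth's surface.",
    "",
    "3",
    "",
])

out_exclude = FIX_EXCLUDE_EOL.sub("\u2026", text)
out_crlf = FIX_EXPLICIT_CRLF.sub("\u2026", text)
print(repr(out_exclude))
print(out_exclude == out_crlf)
```

Both variants insert a single ellipsis after "can co-exist" and leave the line that already ends in a period untouched.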

                  Cheers,

                  guy038
