How to detect lines lacking "..." at end?



  • I use Notepad++ as my primary program for editing movie subtitles.

    For years I have been struggling with one issue: the subtitles often lack the three dots (…) at the end of a line where a dialog breaks. How can I tell Notepad++ to detect the lines lacking the dots and fix them?

    For example, we have 2 dialogs, each broken into 2 parts, resulting in four subtitle lines:

    1
    00:00:10,000 --> 00:00:13,000
    All the living organisms
    can co-exist

    2
    00:00:15,000 --> 00:00:18,000
    on earth’s surface.

    3
    00:00:20,000 --> 00:00:23,000
    Earth is home to more than 7 billion humans

    4
    00:00:25,000 --> 00:00:28,000
    who live together, forming communities.

    Subtitle lines 2 and 4 are fine, so we ignore them. However, subtitle lines 1 and 3 are lacking the three dots (…) at their end. How can I tell Notepad++ to add “…”, like this:

    1
    00:00:10,000 --> 00:00:13,000
    All the living organisms
    can co-exist…

    3
    00:00:20,000 --> 00:00:23,000
    Earth is home to more than 7 billion humans…



  • You seem to be asking for adding “…” to any subtitle that doesn’t already end in a “.”. However, studying your data, I am guessing you really want to “add an ellipsis to any subtitle that doesn’t end with a sentence-ender”. Unfortunately, detecting all possible sentence-enders might be more difficult than you think. I mean, there’s ., but there’s also ! and ?. And there are also lines that end in quotes, and an endquote doesn’t necessarily mean end of sentence…

    I also cannot tell in your example data whether there’s a blank line between subtitles, but there probably is, and I’m going to assume that all subtitles end with a blank line.

    I think, after you clarify that, we’ll be able to come up with a better solution.

    -----

    My naive guess for just “subtitles that don’t end with a period need an ellipsis” would have been

    • FIND = [^\.]\K(?=\R\R\d+$)
    • replace = ...
    • mode = regular expression

    but that seems to add two sets of ... for each, and I cannot quickly figure out why. But that’s a starting point. And if you need real end-of-sentence rather than just .-end-of-sentence detection, we can probably improve things… but it won’t be perfect. (though @guy038 will come close to perfection, I am sure.)

    (Like that! That last parenthetical sentence is the end of a sentence, and wouldn’t need an ellipsis, but ends with something that I hadn’t thought of as being an allowable sentence-ender.)



  • @Silent-Resident

    What’s the definitive pattern? A lack of a trailing period on the last one of a sequence?



  • @Silent-Resident
    My suggestion:

    Find: [^[:punct:]]$\K(?=\R\R\d+$)
    Replace: ...
    Mode: Regular expression

    @PeterJones
    Due to the addition of $ it doesn’t add two sets of .... The [:punct:] character class should match all characters in question.
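
For readers who want to try the same idea outside Notepad++, here is a rough Python sketch of the suggestion above. It is an approximation, not the regex from the post: Python's `re` has no `[:punct:]`, `\K` or `\R`, so the sketch assumes ASCII punctuation from `string.punctuation`, uses a lookbehind in place of `\K`, uses `\r?\n` in place of `\R`, and excludes `\r` and `\n` from the character class instead of relying on `$`. The sample text is made up for illustration.

```python
import re
import string

# ASCII punctuation only (an assumption; Notepad++'s [:punct:] may cover more).
PUNCT = re.escape(string.punctuation)

# Zero-width position after a character that is neither punctuation nor EOL,
# followed by a blank line and then the next subtitle number.
# \r?\n stands in for \R, and \r?$ lets Python's $ cope with CRLF endings.
PATTERN = re.compile(
    rf"(?<=[^{PUNCT}\r\n])(?=(?:\r?\n){{2}}\d+\r?$)",
    re.MULTILINE,
)

# Hypothetical sample: only the second subtitle lacks closing punctuation.
sample = (
    "1\r\n00:00:10,000 --> 00:00:13,000\r\nWho is there?\r\n\r\n"
    "2\r\n00:00:15,000 --> 00:00:18,000\r\nIt is only\r\n\r\n"
    "3\r\n00:00:20,000 --> 00:00:23,000\r\nme, again.\r\n"
)

result = PATTERN.sub("...", sample)
print(result)
```

Because the lookahead requires a following subtitle number, the very last subtitle in a file is never touched by this sketch, just as with the Notepad++ version.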



  • Hello @silent-resident, and All,

    Quite easy ! For instance, assuming your example, below, where I added 4 subtitles lines ( from 5 to 8 )

    1
    00:00:10,000 --> 00:00:13,000
    All the living organisms
    can co-exist
    
    2
    00:00:15,000 --> 00:00:18,000
    on earth’s surface.
    
    3
    00:00:20,000 --> 00:00:23,000
    Earth is home to more than 7 billion humans
    
    4
    00:00:25,000 --> 00:00:28,000
    who live together, forming communities.
    
    5
    00:00:30,000 --> 00:00:33,000
    In the first volume of the 10th edition of "Systema Naturae",
    
    6
    00:00:36,000 --> 00:00:39,000
    written by the Swedish naturalist Carolus Linnaeus, the Animal kingdom
    
    7
    00:00:42,000 --> 00:00:45,000
    is broken down into six original classes of animals :
    
    8
    00:00:48,000 --> 00:00:51,000
    Mammalia, Aves, Amphibia, Pisces, Insecta, & Vermes.
    

    Note : The text, included in the last 4 subtitles, comes from :

    https://en.wikipedia.org/wiki/10th_edition_of_Systema_Naturae#Animals

    So, in plain language, your request could be : Add a … sign at the end of any line which does not end with a dot and which is followed by at least 2 line-breaks. In that case, here is my solution :


    • Open the Replace dialog ( Ctrl + H )

    • SEARCH [^.\r\n]\K(?=\R{2,})

    • REPLACE … OR \x{2026}

    • Tick the Wrap around option

    • Select the Regular expression search mode

    • Click on the Replace All button, exclusively ( Do not use the Replace button ! )

    And you’ll get :

    1
    00:00:10,000 --> 00:00:13,000
    All the living organisms
    can co-exist…
    
    2
    00:00:15,000 --> 00:00:18,000
    on earth’s surface.
    
    3
    00:00:20,000 --> 00:00:23,000
    Earth is home to more than 7 billion humans…
    
    4
    00:00:25,000 --> 00:00:28,000
    who live together, forming communities.
    
    5
    00:00:30,000 --> 00:00:33,000
    In the first volume of the 10th edition of "Systema Naturae",…
    
    6
    00:00:36,000 --> 00:00:39,000
    written by the Swedish naturalist Carolus Linnaeus, the Animal kingdom…
    
    7
    00:00:42,000 --> 00:00:45,000
    is broken down into six original classes of animals :…
    
    8
    00:00:48,000 --> 00:00:51,000
    Mammalia, Aves, Amphibia, Pisces, Insecta, & Vermes.
    

    Now, if you do not want to add a horizontal ellipsis ( … ) after a punctuation sign, prefer the following regex S/R :

    SEARCH \w\K(?=\R{2,})

    REPLACE … OR \x{2026}

    This time, the text will be changed into :

    1
    00:00:10,000 --> 00:00:13,000
    All the living organisms
    can co-exist…
    
    2
    00:00:15,000 --> 00:00:18,000
    on earth’s surface.
    
    3
    00:00:20,000 --> 00:00:23,000
    Earth is home to more than 7 billion humans…
    
    4
    00:00:25,000 --> 00:00:28,000
    who live together, forming communities.
    
    5
    00:00:30,000 --> 00:00:33,000
    In the first volume of the 10th edition of "Systema Naturae",
    
    6
    00:00:36,000 --> 00:00:39,000
    written by the Swedish naturalist Carolus Linnaeus, the Animal kingdom…
    
    7
    00:00:42,000 --> 00:00:45,000
    is broken down into six original classes of animals :
    
    8
    00:00:48,000 --> 00:00:51,000
    Mammalia, Aves, Amphibia, Pisces, Insecta, & Vermes.
    

    Best Regards,

    guy038
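
The two search/replace variants in the post above can be approximated in Python for scripted use. This is only a sketch under stated assumptions: Python's `re` supports neither `\K` nor `\R`, so a lookbehind replaces `\K`, and `\r?\n` replaces `\R` (which covers CRLF and LF files, but not lone-CR files). The names `NOT_DOT` and `WORD_END` are made up for this example, and the sample is a shortened version of the subtitles above.

```python
import re

# Stand-in for \R{2,}: at least two line breaks (CRLF or LF only).
NL2 = r"(?:\r?\n){2,}"

# (?<=X) plays the role of X\K, which Python's re does not support.
NOT_DOT = re.compile(r"(?<=[^.\r\n])(?=" + NL2 + ")")   # any non-dot ending
WORD_END = re.compile(r"(?<=\w)(?=" + NL2 + ")")        # word-character ending

lines = [
    "1", "00:00:10,000 --> 00:00:13,000",
    "All the living organisms", "can co-exist", "",
    "2", "00:00:15,000 --> 00:00:18,000",
    "on earth's surface.", "",
    "5", "00:00:30,000 --> 00:00:33,000",
    'In the first volume of the 10th edition of "Systema Naturae",', "",
]
sample = "\r\n".join(lines) + "\r\n"

after_any = NOT_DOT.sub("\u2026", sample)    # ellipsis after anything but a dot
word_only = WORD_END.sub("\u2026", sample)   # ellipsis only after a word char

print(after_any)
print(word_only)
```

The first pattern also appends an ellipsis after the trailing comma of subtitle 5, while the second leaves that line alone, which mirrors the difference between the two result listings above.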



  • Amazing! Thank you all very very much, I am grateful. This is what I was looking for.



  • Hi, @silent-resident, @peterjones, @alan-kilborn, @dinkumoil, and All,

    @peterjones :

    Peter, your search regex [^\.]\K(?=\R\R\d+$) ( which could be written [^.]\K(?=\R\R\d+$) ) does not work as expected ! Why ?

    • Well, considering the first sub-title, in a Windows file, your regex first matches the zero-length gap between the letter t and the \r EOL character, and it correctly adds the ... symbol. OK !

    • Now, the regex engine location is between the two characters . and \r. At this location, is there a single char, different from a dot, which is followed by two line-breaks ? The answer is YES : it’s just the \r EOL char, which is followed by \n ( so, the first \R ) and the \r\n couple ( so, the second \R ). Thus, it adds a ... symbol between the \r and the \n of the first line-break

    • This situation cannot occur again right after because, once the \r char has been taken as [^.], the only remaining EOL character of that line-break is \n, which cannot match the \R\R part of your regex ! So, the next match happens, wrongly, between the \r and \n EOL characters at the end of the line on earth’s surface. and so on ... -((

    Luckily, 3 solutions are possible to get the right behaviour :

    • The [^.\r\n]\K(?=\R\R\d+$) syntax, which forces the character before the EOL characters to be different from the EOL chars !

    • The [^.]$\K(?=\R\R\d+$) syntax. Adding the $ anchor forces the [^.] character to be located right before an end of line ( so, obviously, not between the two EOL characters \r and \n). It’s the solution adopted by @dinkumoil !

    • Finally, use the exact Windows EOL definition, with the [^.]\K(?=\r\n\r\n\d+$) syntax

    Cheers,

    guy038
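
The double insertion described above can be reproduced outside Notepad++ with a small Python sketch. It rests on a few assumptions: a lookbehind stands in for `\K`, `\r?\n` stands in for `\R`, and `\r?$` is used because Python's `$` only recognises a position before `\n`. For the same reason the `$`-anchored variant is left out. The pattern names are made up for this example.

```python
import re

# Two subtitle endings from a Windows (CRLF) file; the second already has a dot.
sample = "can co-exist\r\n\r\n2\r\non earth's surface.\r\n\r\n3\r\n"

# Lookahead: two line breaks, then the next subtitle number at end of line.
TAIL = r"(?=(?:\r?\n){2}\d+\r?$)"

# Naive version: "any character but a dot" can also be the \r of a CRLF pair.
naive = re.compile(r"(?<=[^.])" + TAIL, re.MULTILINE)
# Fix 1: also exclude the EOL characters from the class.
no_eol = re.compile(r"(?<=[^.\r\n])" + TAIL, re.MULTILINE)
# Fix 3: spell out the exact Windows line breaks.
crlf_only = re.compile(r"(?<=[^.])(?=\r\n\r\n\d+\r?$)", re.MULTILINE)

for name, pattern in [("naive", naive), ("no_eol", no_eol), ("crlf_only", crlf_only)]:
    # repr() makes the \r and \n positions visible in the output.
    print(name, repr(pattern.sub("...", sample)))
```

With the naive pattern, a second `...` lands between `\r` and `\n`, including after the line that already ends with a dot. Both fixed patterns insert exactly one `...`, after `co-exist`.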

