How to detect lines lacking "..." at end?
-
I use Notepad++ as my primary program for editing movie subtitles.
For years I have been struggling with one issue: often the subtitles lack the three dots (…) at the end of the dialog breaks. How can I tell Notepad++ to detect the lines lacking dots, and fix them?
For example, we have 2 dialogs, each broken into 2 parts, resulting in four subtitles:
1
00:00:10,000 --> 00:00:13,000
All the living organisms
can co-exist
2
00:00:15,000 --> 00:00:18,000
on earth’s surface.
3
00:00:20,000 --> 00:00:23,000
Earth is home to more than 7 billion humans
4
00:00:25,000 --> 00:00:28,000
who live together, forming communities.
Subtitles 2 and 4 are fine, so we ignore them. However, subtitles 1 and 3 are lacking the three dots (…) at their end. How can I tell Notepad++ to add “…”, like this:
1
00:00:10,000 --> 00:00:13,000
All the living organisms
can co-exist…
3
00:00:20,000 --> 00:00:23,000
Earth is home to more than 7 billion humans…
-
You seem to be asking for adding “…” to any subtitle that doesn’t already end in a “.”. However, studying your data, I am guessing you really want “add an ellipsis to any subtitle that doesn’t end with a sentence-ender”. Unfortunately, detecting all possible sentence-enders might be more difficult than you think. I mean, there’s “.”, but there’s also “!” and “?”. But there are also lines that end in quotes – and an endquote doesn’t necessarily mean end of sentence…
I also cannot tell in your example data whether there’s a blank line between subtitles, but there probably is, and I’m going to assume that all subtitles end with a blank line.
I think, after you clarify that, we’ll be able to come up with a better solution.
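For the "detect" half of the question, here is a minimal Python sketch (outside Notepad++) that lists the subtitles whose last text line lacks a sentence-ender. It assumes SRT blocks separated by a blank line; the function name and the SENTENCE_ENDERS tuple are hypothetical and only cover the characters mentioned above plus the ellipsis itself.

```python
import re

# Assumption: these are the only sentence-enders we care about; extend as needed.
SENTENCE_ENDERS = (".", "!", "?", "…")

def unfinished_subtitles(srt_text):
    """Return the index numbers of subtitles whose last text line
    does not end with one of SENTENCE_ENDERS."""
    flagged = []
    # Normalise line endings, then split the file on blank lines.
    text = srt_text.replace("\r\n", "\n").replace("\r", "\n").strip()
    for block in re.split(r"\n{2,}", text):
        lines = block.split("\n")
        if len(lines) < 3:
            continue  # not "index + timing + text"; skip it
        index, last_line = lines[0], lines[-1].rstrip()
        if not last_line.endswith(SENTENCE_ENDERS):
            flagged.append(index)
    return flagged

sample = (
    "1\n00:00:10,000 --> 00:00:13,000\nAll the living organisms\ncan co-exist\n\n"
    "2\n00:00:15,000 --> 00:00:18,000\non earth’s surface.\n\n"
    "3\n00:00:20,000 --> 00:00:23,000\nEarth is home to more than 7 billion humans\n\n"
    "4\n00:00:25,000 --> 00:00:28,000\nwho live together, forming communities.\n"
)
print(unfinished_subtitles(sample))  # ['1', '3']
```

Lines ending in a closing quote are still flagged here, which is exactly the ambiguity described above.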
-----
My naive guess for just “subtitles that don’t end with a period need an ellipsis” would have been
- FIND = [^\.]\K(?=\R\R\d+$)
- replace = ...
- mode = regular expression
but that seems to add two sets of “...” for each, and I cannot quickly figure out why. But that’s a starting point. And if you need real end-of-sentence rather than just “.”-end-of-sentence detection, we can probably improve things… but it won’t be perfect. (Though @guy038 will come close to perfection, I am sure.)
(Like that! That last parenthetical sentence is the end of a sentence, and wouldn’t need an ellipsis, but ends with something that I hadn’t thought of as being an allowable sentence-ender.)
-
What’s the definitive pattern? A lack of a trailing period on the last one of a sequence?
-
@Silent-Resident
My suggestion:
Find: [^[:punct:]]$\K(?=\R\R\d+$)
Replace: ...
Mode: Regular expression
@PeterJones
Due to the addition of $ it doesn’t add two sets of “...”. The [:punct:] character class should match all characters in question.
-
Hello @silent-resident, and All,
Quite easy! For instance, assume your example below, where I added 4 subtitles (from 5 to 8):
1
00:00:10,000 --> 00:00:13,000
All the living organisms
can co-exist

2
00:00:15,000 --> 00:00:18,000
on earth’s surface.

3
00:00:20,000 --> 00:00:23,000
Earth is home to more than 7 billion humans

4
00:00:25,000 --> 00:00:28,000
who live together, forming communities.

5
00:00:30,000 --> 00:00:33,000
In the first volume of the 10th edition of "Systema Naturae",

6
00:00:36,000 --> 00:00:39,000
written by the Swedish naturalist Carolus Linnaeus, the Animal kingdom

7
00:00:42,000 --> 00:00:45,000
is broken down into six original classes of animals :

8
00:00:48,000 --> 00:00:51,000
Mammalia, Aves, Amphibia, Pisces, Insecta, & Vermes.
Note: The text included in the last 4 subtitles comes from: https://en.wikipedia.org/wiki/10th_edition_of_Systema_Naturae#Animals
So, in plain language, your request could be: add a … sign at the end of any line which does not end with a dot and which is followed by at least 2 line-breaks. In that case, here is my solution:
- Open the Replace dialog (Ctrl + H)
- SEARCH [^.\r\n]\K(?=\R{2,})
- REPLACE … OR \x{2026}
- Tick the Wrap around option
- Select the Regular expression search mode
- Click on the Replace All button, exclusively (do not use the Replace button!)
And you’ll get :
1
00:00:10,000 --> 00:00:13,000
All the living organisms
can co-exist…

2
00:00:15,000 --> 00:00:18,000
on earth’s surface.

3
00:00:20,000 --> 00:00:23,000
Earth is home to more than 7 billion humans…

4
00:00:25,000 --> 00:00:28,000
who live together, forming communities.

5
00:00:30,000 --> 00:00:33,000
In the first volume of the 10th edition of "Systema Naturae",…

6
00:00:36,000 --> 00:00:39,000
written by the Swedish naturalist Carolus Linnaeus, the Animal kingdom…

7
00:00:42,000 --> 00:00:45,000
is broken down into six original classes of animals :…

8
00:00:48,000 --> 00:00:51,000
Mammalia, Aves, Amphibia, Pisces, Insecta, & Vermes.
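For anyone who wants to batch-process files outside Notepad++, the same idea can be sketched in Python. This is a translation, not the regex above: Python's re module has neither \K nor \R, so \K is rewritten as a lookbehind and \R as an explicit alternation (the NL name is my own, and it only covers CR, LF and CRLF).

```python
import re

# Stand-in for \R: one line break. The (?!\n) keeps \r\n from being
# counted as two separate breaks.
NL = r"(?:\r\n|\n|\r(?!\n))"

# [^.\r\n]\K(?=\R{2,}) with \K rewritten as a lookbehind.
PATTERN = re.compile(r"(?<=[^.\r\n])(?=" + NL + r"{2})")

def add_ellipsis(srt_text):
    """Insert U+2026 after any line that does not end with a dot
    and is followed by at least two line breaks."""
    return PATTERN.sub("\u2026", srt_text)

sample = (
    "1\r\n00:00:10,000 --> 00:00:13,000\r\n"
    "All the living organisms\r\ncan co-exist\r\n\r\n"
    "2\r\n00:00:15,000 --> 00:00:18,000\r\n"
    "on earth’s surface.\r\n\r\n"
    "7\r\n00:00:42,000 --> 00:00:45,000\r\n"
    "is broken down into six original classes of animals :\r\n\r\n"
    "8\r\n00:00:48,000 --> 00:00:51,000\r\n"
    "Mammalia, Aves, Amphibia, Pisces, Insecta, & Vermes.\r\n"
)
print(add_ellipsis(sample))
```

As in the result above, the line ending with " :" also receives an ellipsis, because only a trailing dot is excluded.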
Now, if you do not want to add a horizontal ellipsis (…) after a punctuation sign, prefer the following regex S/R:
SEARCH \w\K(?=\R{2,})
REPLACE … OR \x{2026}
This time, the text will be changed into:
1
00:00:10,000 --> 00:00:13,000
All the living organisms
can co-exist…

2
00:00:15,000 --> 00:00:18,000
on earth’s surface.

3
00:00:20,000 --> 00:00:23,000
Earth is home to more than 7 billion humans…

4
00:00:25,000 --> 00:00:28,000
who live together, forming communities.

5
00:00:30,000 --> 00:00:33,000
In the first volume of the 10th edition of "Systema Naturae",

6
00:00:36,000 --> 00:00:39,000
written by the Swedish naturalist Carolus Linnaeus, the Animal kingdom…

7
00:00:42,000 --> 00:00:45,000
is broken down into six original classes of animals :

8
00:00:48,000 --> 00:00:51,000
Mammalia, Aves, Amphibia, Pisces, Insecta, & Vermes.
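The word-character variant can be approximated in Python the same way. Again this is only a sketch under assumptions: \K becomes a lookbehind, \R becomes the hypothetical NL alternation, and Python's \w may not classify every character exactly as the Notepad++ engine does.

```python
import re

NL = r"(?:\r\n|\n|\r(?!\n))"   # stand-in for \R (CR, LF or CRLF only)

# \w\K(?=\R{2,}) with \K rewritten as a lookbehind.
WORD_END = re.compile(r"(?<=\w)(?=" + NL + r"{2})")

def add_ellipsis_after_words(srt_text):
    """Insert U+2026 only after lines that end with a word character."""
    return WORD_END.sub("\u2026", srt_text)

sample = (
    "5\n00:00:30,000 --> 00:00:33,000\n"
    'In the first volume of the 10th edition of "Systema Naturae",\n\n'
    "6\n00:00:36,000 --> 00:00:39,000\n"
    "written by the Swedish naturalist Carolus Linnaeus, the Animal kingdom\n\n"
    "7\n00:00:42,000 --> 00:00:45,000\n"
    "is broken down into six original classes of animals :\n\n"
    "8\n00:00:48,000 --> 00:00:51,000\n"
    "Mammalia, Aves, Amphibia, Pisces, Insecta, & Vermes.\n"
)
print(add_ellipsis_after_words(sample))
```

Only subtitle 6 changes in this sample; the lines ending with a comma, a colon or a dot are left alone.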
Best Regards,
guy038
-
Amazing! Thank you all very very much, I am grateful. This is what I was looking for.
-
Hi, @silent-resident, @peterjones, @alan-kilborn, @dinkumoil, and All,
Peter, your search regex [^\.]\K(?=\R\R\d+$) (which could be written [^.]\K(?=\R\R\d+$)) does not work as expected! Why?
- Well, considering the first subtitle, in a Windows file, your regex first matches the zero-length gap between the letter t and the \r EOL character, and it correctly adds the … symbol. OK!
- Now, the regex engine location is between the two characters … and \r. At this location, is there a single char, different from a dot, which can be followed by two line-breaks? The answer is YES: it’s just the \r EOL char, which is followed by \n (so, the first \R) and the \r\n couple (so, the second \R). Thus, it adds a ... symbol between the \r and the \n of the first line-break.
- This situation cannot occur again right after: if it had chosen the \r char (as [^.]), then the only remaining EOL character would be \n, which cannot match the \R\R part of your regex! So, the next match happens, wrongly, between the \r and \n EOL characters, at the end of the line “on earth’s surface.”, and so on... :-((
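The walk-through above can be reproduced, approximately, in Python. This is an emulation under assumptions, not the Notepad++ engine: \K is rewritten as a lookbehind, \R as the hypothetical NL alternation, and $ as a simple "end of line follows" check.

```python
import re

NL = r"(?:\r\n|\n|\r(?!\n))"   # stand-in for \R: one line break, \r\n kept whole
EOL = r"(?=[\r\n]|\Z)"         # rough stand-in for the trailing $

# [^.]\K(?=\R\R\d+$) with \K rewritten as a lookbehind.
NAIVE = re.compile(r"(?<=[^.])(?=" + NL + NL + r"\d+" + EOL + ")")

sample = (
    "All the living organisms\r\ncan co-exist\r\n\r\n"
    "2\r\n00:00:15,000 --> 00:00:18,000\r\non earth’s surface.\r\n\r\n"
    "3\r\n00:00:20,000 --> 00:00:23,000\r\n"
    "Earth is home to more than 7 billion humans\r\n"
)

buggy = NAIVE.sub("...", sample)

# repr() makes the \r and \n characters visible.
print(repr(buggy))
```

In the printed string, "can co-exist" is followed by "...\r...\n" (two insertions), and "surface." is followed by "\r...\n" (one unwanted insertion between \r and \n), which matches the behaviour described in the list.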
Luckily, 3 solutions are possible to get the right behaviour:
- The [^.\r\n]\K(?=\R\R\d+$) syntax, which forces the character before the EOL characters to be different from EOL chars!
- The [^.]$\K(?=\R\R\d+$) syntax. Adding the $ anchor forces the [^.] character to be located right before an end of line (so, obviously, not between the two EOL characters \r and \n). It’s the solution adopted by @dinkumoil!
- Finally, use the exact Windows EOL definition, with the [^.]\K(?=\r\n\r\n\d+$) syntax
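To compare the three variants side by side, here is a rough Python emulation. It rests on assumptions: \K is rewritten as a lookbehind, NL stands in for \R, and EOL stands in for $ on the premise, as described above, that $ never matches between \r and \n. The dictionary keys are just labels of my own.

```python
import re

NL = r"(?:\r\n|\n|\r(?!\n))"            # stand-in for \R
EOL = r"(?:(?<!\r)(?=\n)|(?=\r)|\Z)"    # stand-in for $: never between \r and \n
TAIL = NL + NL + r"\d+" + EOL           # stand-in for \R\R\d+$

FIXES = {
    "exclude EOL chars": r"(?<=[^.\r\n])(?=" + TAIL + ")",
    "anchor with $": r"(?<=[^.])" + EOL + "(?=" + TAIL + ")",
    "explicit CRLF": r"(?<=[^.])(?=\r\n\r\n\d+" + EOL + ")",
}

sample = (
    "All the living organisms\r\ncan co-exist\r\n\r\n"
    "2\r\n00:00:15,000 --> 00:00:18,000\r\non earth’s surface.\r\n\r\n"
    "3\r\n00:00:20,000 --> 00:00:23,000\r\n"
    "Earth is home to more than 7 billion humans\r\n"
)

results = {name: re.sub(rx, "...", sample) for name, rx in FIXES.items()}
for name, out in results.items():
    print(name, out.count("..."))
```

Under these assumptions each variant inserts a single "..." (after "can co-exist") and leaves the line ending with a dot untouched. The third variant only applies to files with Windows line endings.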
Cheers,
guy038
-