    Delete all-cap words

    • LordP666L
      LordP666
      last edited by

      I’m trying to see if I can use the program to edit subtitle files.

      One of the things I come across quite a bit is finding hearing-impaired subtitles where the caption words are in all caps. Is there a way to find/delete all words composed entirely of capital letters?

      I did tons of searching but most of what I found was about changing cases.

      Example:

      14
      00:00:26,960 --> 00:00:29,040
      CALL DISCONNECTS

      15
      00:00:33,920 --> 00:00:37,400
      DRAMATIC MUSIC

      16
      00:00:46,360 --> 00:00:48,040
      KNOCK AT DOOR

      • Terry RT
        Terry R @LordP666
        last edited by Terry R

        @LordP666 said in Delete all-cap words:

        Is there a way to find/delete all words composed entirely of capital letters?

        Yes you can. The regular expression in the Replace function would be:
        Find What:(?-i)\b\u+\b
        Replace With: empty field here.
        As it’s a regular expression the search mode MUST be regular expression.

        However at this point it will leave one/some spaces on what might otherwise be a blank line that you will likely want to deal with, especially if the line was ONLY capitalised words. This is easily accomplished by using Edit, Line Operation, Remove empty lines (containing blank characters).

        Terry

        PS I should also add that I think you might underestimate what exactly you asked for. I suggest you run my solution over some copies of your files. I believe you might be surprised at what exactly occurs. I’ll say no more. By all means come back with a revised version of your question and the boundaries you’d like to impose.
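For anyone who wants to try the idea outside Notepad++, here is a minimal Python sketch of the find/replace and the blank-line cleanup described above. It is only an approximation: it assumes ASCII text, so `[A-Z]` stands in for Notepad++'s `\u`, and the function name is made up for illustration.

```python
import re

# ASCII stand-in for the Notepad++ pattern (?-i)\b\u+\b
ALL_CAPS_WORD = re.compile(r"\b[A-Z]+\b")


def delete_all_caps_words(text: str) -> str:
    """Delete every all-caps word, then drop lines left empty or blank."""
    stripped = ALL_CAPS_WORD.sub("", text)
    # Rough equivalent of "Remove empty lines (containing blank characters)"
    kept = [line for line in stripped.split("\n") if line.strip()]
    return "\n".join(kept)


sample = "14\nCALL DISCONNECTS\nAnd I'd tell you that that was"
print(delete_all_caps_words(sample))
```

In the sample, the caption line disappears, but the dialogue line also loses its "I": a single capital letter standing alone counts as an all-caps word, and so do words such as "OK".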

        • Mark OlsonM
          Mark Olson
          last edited by Mark Olson

          I think ^\d+\R(?'time'\d\d:\d\d:\d\d,\d{3}) --> (?&time)\R\h*(?-i:\u+(?:\h+|\h*$))+(?:\R\R|\Z)

          would work as a regular expression that would recognize all of the following:

          2
          59:27:64,961 --> 95:70:01,800
          GURX 
          
          665
          44:47:89,100 --> 59:53:22,074
          	A VCL	AJI
          
          

          but would not recognize either of the following:

          7
          26:24:77,203 --> 98:21:33,850
          kkDQvz	KCvAV gHDrp 
          
          94
          59:46:00,913 --> 12:69:24,441
           cBLyrUj W S	px
          
          

          The above regular expression recognizes a sequence number, then a newline, then two timestamps in two digits:two digits:two digits,three digits format separated by a -->, then a newline, then a series of all-uppercase words, then two newlines or the end of the file.
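A rough Python translation of that block-level expression, for experimenting outside Notepad++. This is a sketch under assumptions: Python's `re` has no `(?&time)` subroutine call, so the timestamp pattern is repeated, and `\n`, `[ \t]` and `[A-Z]` stand in for `\R`, `\h` and `\u` (Unix line endings and ASCII letters only). The names are illustrative.

```python
import re

TIME = r"\d\d:\d\d:\d\d,\d{3}"

# Sequence number, timing line, one line of all-caps words,
# then a blank line or the end of the text.
ALL_CAPS_BLOCK = re.compile(
    r"^\d+\n" + TIME + " --> " + TIME + r"\n"
    r"[ \t]*(?:[A-Z]+(?:[ \t]+|[ \t]*$))+"
    r"(?:\n\n|\Z)",
    re.MULTILINE,
)


def delete_all_caps_blocks(srt_text: str) -> str:
    """Remove whole subtitle entries whose single text line is all caps."""
    return ALL_CAPS_BLOCK.sub("", srt_text)


srt = (
    "14\n00:00:26,960 --> 00:00:29,040\nCALL DISCONNECTS\n\n"
    "15\n00:00:33,920 --> 00:00:37,400\nHello there\n"
)
print(delete_all_caps_blocks(srt))
```

Because the whole entry is matched, the sequence number and timing line go away together with the caption, which leaves the remaining entries numbered with gaps.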

          • CoisesC
            Coises
            last edited by Coises

            You would find it much easier to use Subtitle Edit. Choose Remove text for hearing impaired… from the Tools menu. Check Remove line if UPPERCASE in the dialog.

            • LordP666L
              LordP666 @Mark Olson
              last edited by LordP666

              @Terry R

              Thanks, Terry. That worked great except for capitalized “I”s.

              Example:
              Original: And I’d tell you that that was
              Result: And 'd tell you that that was

              The expression also removed capitalized “I”s standing alone:
              Original: “OK, I won’t go then,” but then
              Result: “, won’t go then,” but then

              It’s a start though.

              Another oddity is this:
              Original: DOOR CLOSES, KEY TURNS
              Result: , (somehow the comma was kept)

              EDIT: Just for the sake of being complete…the expression also deletes all instances of “OK”, in which both letters are always capitalized. So far these are the only two cases where the expression works a little too well.

              I’m doing a file comparison and so far those two particular errors are all that I’ve seen.

              • LordP666L
                LordP666 @Coises
                last edited by

                @Coises

                I asked for help on this precisely because Subtitle Edit didn’t work. It does not see these lines as “hearing impaired” because they are not in any sort of brackets or parentheses.

                • Terry RT
                  Terry R @LordP666
                  last edited by

                  @LordP666
                  You will also get any A in sentences removed. It was as I alluded to in my post. You need to better define the question: what defines a line or words which must be deleted, versus normal capitalised words? These are edge cases and will often make the final solution more difficult to find.

                  Is it that it’s always a full line of capitals?
                  Would they ever contain non-alpha characters?

                  These are things you need to consider, but which you never provided in your original request. There may also be other criteria to consider.

                  Terry

                  • LordP666L
                    LordP666 @Terry R
                    last edited by LordP666

                    @Terry-R

                    I see. I’ll try to work it out.

                    Thanks for the help.

                    EDIT: I just realized that the majority of the lines that I need to edit have the letter “I” - if I could exempt that, then editing the resulting file would be tremendously easier - so far I have run into a couple of “OK”s and one “DI” (“Detective Inspector”).

                    • Terry RT
                      Terry R @LordP666
                      last edited by

                      @LordP666

                      We are happy to help. It’s just that getting a great solution requires a very well-defined question. You have already found that you did not account for some edge cases.

                      Have a think, consider my last 2 questions and then come back with a better-defined question. Sometimes we can’t get you everything, but maybe just highlighting the relevant lines would be a better idea. It’s possible to mark lines for checking, then quickly skip through the marked lines, unmarking the ones you don’t want removed before removing whatever is left marked.

                      Terry

                      • Terry RT
                        Terry R @LordP666
                        last edited by

                        @LordP666
                        Here is one possible solution.

                        First remove all lines that contain no lowercase characters (they may still contain punctuation). Then mark the remaining lines that contain more than, say, 3 (or some other suitable number of) uppercase characters. You would then check those, unmark the ones you want to keep, and remove the lines left marked.

                        Terry
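The two-step idea above can be sketched in Python as follows. This is an illustration rather than a Notepad++ recipe, and it adds one assumption that the description leaves open: a line is only dropped if it contains at least one uppercase letter, so sequence numbers, timing lines and blank lines survive. The function name and the `max_caps` parameter are made up.

```python
def triage(lines, max_caps=3):
    """Drop all-caps lines; flag remaining lines that have many capitals."""
    kept, review = [], []
    for line in lines:
        has_upper = any(c.isupper() for c in line)
        has_lower = any(c.islower() for c in line)
        if has_upper and not has_lower:
            continue  # no lowercase at all: treat as a caption and drop it
        kept.append(line)
        if sum(c.isupper() for c in line) > max_caps:
            review.append(line)  # candidate for a manual check
    return kept, review


kept, review = triage(
    ["14", "DRAMATIC MUSIC", "Hello there", "DCI Smith met DI Jones"]
)
print(kept)
print(review)
```

A dialogue line consisting only of "I" or "OK" would still be dropped by the first step, so this sketch does not remove the need for the manual check.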

                        • LordP666L
                          LordP666 @Terry R
                          last edited by LordP666

                          @Terry-R

                          “Sometimes we can’t get you everything” - I’m very OK with being close ;)

                          As I said, having an exception would be a massive improvement - I would still have to edit odds and ends like “OK”, or “DCI”…but it seems that 99% of what I had to edit were freestanding "I"s and also in “I’ve”, “I’m”, “I’d” - and these are not 100% all caps.

                          I suppose what I want is to delete words in all caps, and not ones where there are a mix, such as proper names. But that first person “I” is a killer.

                          • LordP666L
                            LordP666 @Terry R
                            last edited by

                            @Terry-R

                            I decided to try OpenAI where I could keep refining my objective and finally arrived at: I want the expression to delete lines where all the letters are capitalized but to exclude lines that have lowercase letters.

                            It gave me this: ^(?!.[a-z])([A-Z]+(\s+[A-Z]+))?$

                            I tried it and nothing changed.

                            LOL

                            • guy038G
                              guy038
                              last edited by guy038

                              Hello, @lordp666, @terry-R, @mark-olson, @coises and All,

                              I’m in a bit of a hurry because I’ve got an appointment at a ski shop to buy some new ski boots! So I cannot develop my solution at present.


                              But just try the following regex S/R, which deletes all subtitle lines containing ONLY uppercase letters or digits, except for lines containing the words OK, DI or I.

                              • SEARCH (?-si)^(?!\d\d)(?!\b(OK|DI|I)\b)[\u\d]+(?:[[:punct:]\h]+(?!\b(OK|DI|I)\b)[\u\d]+)*$(\R|\z)

                              • REPLACE Leave EMPTY

                              For example, from this INPUT text :

                              14
                              00:00:26,960 -> 00:00:29,040
                              CALL DISCONNECTS
                              DRAMATIC MUSIC
                              KNOCK AT DOOR
                              And I’d tell you that that was
                              "OK, I won’t go then," but then
                              DOOR CLOSES, KEY TURNS
                              I
                              T
                              TE
                              TEST
                              TEST1
                              T OK
                              TE OK
                              TEST OK
                              TEST1 OK
                              WORD1 WORD2 BLA
                              WORD1 WORD2 BLA
                              WORD1 WORD2 BLA
                              WORD1 WORD2 OK BLA
                              WORD1 WORD2 DI BLA
                              WORD1 WORD2 I  BLA
                              TOKT
                              ABC TOKT
                              

                              You would be left with this OUTPUT one :

                              14
                              00:00:26,960 -> 00:00:29,040
                              And I’d tell you that that was
                              "OK, I won’t go then," but then
                              I
                              T OK
                              TE OK
                              TEST OK
                              TEST1 OK
                              WORD1 WORD2 OK BLA
                              WORD1 WORD2 DI BLA
                              WORD1 WORD2 I  BLA
                              

                              As you can see, after the sequence number and timing line, the only lines that remain are those containing the word I, OK or DI in uppercase!
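To experiment with this search outside Notepad++, here is a hedged Python approximation. Python's `re` module does not support `\u`, `\h`, `\R` or `[[:punct:]]`, so the sketch substitutes `[A-Z]`, a blank or tab, `\n` and ASCII punctuation from `string.punctuation`; it therefore assumes ASCII text with Unix line endings. The function names and the `keep_words` parameter are invented for the example.

```python
import re
import string

# Punctuation or horizontal blanks between words: stands in for [[:punct:]\h]+
SEPARATOR = "[" + re.escape(string.punctuation) + r" \t]+"


def build_caption_line_regex(keep_words=("OK", "DI", "I")):
    """Build the line-deleting pattern for a given list of protected words."""
    alternatives = "|".join(re.escape(w) for w in keep_words)
    # Negative lookahead: the word starting here must not be a protected word
    not_kept = r"(?!\b(?:" + alternatives + r")\b)"
    word = not_kept + r"[A-Z\d]+"  # [A-Z\d] stands in for [\u\d]
    pattern = r"^(?!\d\d)" + word + "(?:" + SEPARATOR + word + r")*$(?:\n|\Z)"
    return re.compile(pattern, re.MULTILINE)


def delete_caption_lines(text, keep_words=("OK", "DI", "I")):
    """Delete lines made only of uppercase words or digits, minus exceptions."""
    return build_caption_line_regex(keep_words).sub("", text)


sample = "\n".join([
    "14",
    "00:00:26,960 -> 00:00:29,040",
    "CALL DISCONNECTS",
    "And I'd tell you that that was",
    "DOOR CLOSES, KEY TURNS",
    "I",
    "TEST1 OK",
    "ABC TOKT",
]) + "\n"
print(delete_caption_lines(sample))
```

Passing a longer `keep_words` tuple protects additional words. One thing to watch, at least in this translation: `(?!\d\d)` only protects lines that start with two digits, so a single-digit sequence number such as `7` on its own line would be deleted too.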

                              See you later

                              Best Regards

                              guy038

                              • LordP666L
                                LordP666 @guy038
                                last edited by

                                @guy038

                                Wow. That worked really well - I’ll have to ask OpenAI to parse it for me as I only have the tiniest knowledge of this stuff.

                                Amazingly, one line that should not have been removed…was not removed. It was: PE.

                                Another one that passed the test was this: I L-O-V-E.

                                Where the character is reading the first letters of separate lines.

                                I am sure that there will be other words than “DI” or “OK” - but man, it beats editing 70-80 lines with that pesky first person “I”.

                                I am in awe - thank you so much for your help.

                                • CoisesC
                                  Coises @LordP666
                                  last edited by

                                  @LordP666 said in Delete all-cap words:

                                  @Coises

                                  I asked for help on this precisely because Subtitle Edit didn’t work. It does not see these lines as “hearing impaired” because they are not in any sort of brackets or parentheses.

                                  Are you sure you checked the box Remove line if UPPERCASE (not Only if text is UPPERCASE under Remove text before a colon (:), which is something different) in the Remove text for hearing impaired dialog?

                                  Which lines are listed in the bottom half of the dialog changes as you change settings at the top.

                                  Testing with Subtitle Edit 4.0.8 after adding the lines you gave as examples, I find them to be detected in the dialog when I check that box, and removed when Apply or OK is clicked.

                                  • guy038G
                                    guy038
                                    last edited by guy038

                                    Hi, @lordp666, @terry-R, @mark-olson, @coises and All,

                                    Glad to see that this search regex meets your needs, globally !


                                    I didn’t understand one sentence of your reply. You said:

                                    Amazingly, one line that should not have been removed…was not removed. It was: PE.
                                    

                                    Well …

                                    • If you do not want to keep lines that contain the word PE in uppercase, simply use my previous search regex :

                                      • SEARCH (?-si)^(?!\d\d)(?!\b(OK|DI|I)\b)[\u\d]+(?:[[:punct:]\h]+(?!\b(OK|DI|I)\b)[\u\d]+)*(\R|\z)

                                    • If you do want to keep lines that contain the word PE in uppercase, use the following search regex :

                                      • SEARCH (?-si)^(?!\d\d)(?!\b(PE|OK|DI|I)\b)[\u\d]+(?:[[:punct:]\h]+(?!\b(PE|OK|DI|I)\b)[\u\d]+)*(\R|\z)

                                    Regarding the line containing the string I L-O-V-E, it wouldn’t be sensible to find a suitable regex for such a case ! You’ll have to parse your text and decide whether you keep, or not, this kind of line and some other very specific ones :-(

                                    However, in order to help you with this task, AFTER the global first replacement above, you could follow this simple road map :

                                    • Move to the very beginning of your file to parse

                                    • Open the Mark dialog ( Ctrl + M )

                                      • SEARCH (?-i)^[\u\h[:punct:]]+$

                                      • Uncheck all box options

                                      • Check the Bookmark line, Purge for each search and Wrap around box options

                                      • Select the Regular expression mode

                                      • Click on the Mark All button

                                    • Now press the F2 key repeatedly to step through all the remaining lines, which could be deleted or kept :-)
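The marking step of the road map above can be imitated in Python by listing the line numbers that would get a bookmark. This is a sketch only: `[A-Z]`, a blank or tab, and ASCII punctuation from `string.punctuation` stand in for `\u`, `\h` and `[[:punct:]]`, so curly quotes and accented capitals are not covered. The function name is hypothetical.

```python
import re
import string

# Stand-in for (?-i)^[\u\h[:punct:]]+$ applied to one line at a time
REVIEW_LINE = re.compile("[A-Z" + re.escape(string.punctuation) + r" \t]+")


def lines_to_review(text):
    """Return 1-based numbers of lines made only of capitals, blanks, punctuation."""
    return [
        number
        for number, line in enumerate(text.split("\n"), start=1)
        if REVIEW_LINE.fullmatch(line)
    ]


print(lines_to_review("I L-O-V-E.\nHello\nPE.\n14\n"))
```

Lines with digits or lowercase letters are never listed, so timing lines and ordinary dialogue are skipped, while leftovers such as `PE.` are surfaced for a manual decision.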

                                    Cheers,

                                    guy038

                                    P.S. :

                                    Next time, I’ll give you some hints about the regexes used !

                                    • LordP666L
                                      LordP666 @guy038
                                      last edited by LordP666

                                      @guy038

                                      Sorry, I meant that it worked very well - that line in the dialogue was just this: PE

                                      Meaning “Physical Education”.

                                      It was NOT a “Hearing Impaired” part of the subtitles and should not have been deleted.

                                      As I said before, I played with OpenAI trying to clarify what I wanted to do and had come to the (false) conclusion that I wanted to delete ALL lines that had no lowercase letters.

                                      That one single line in the dialogue showed me that it would be a mistake to delete all lines without lowercase.

                                      Your suggestion worked better than I realized.

                                      Thanks again.

                                      EDIT: I also use OpenAI to decode these search strings and I wonder what you think of its abilities? It gave me this:

                                      ChatGPT
                                      Certainly! Let’s break down the regular expression (?-si)^(?!\d\d)(?!\b(OK|DI|I)\b)[\u\d]+(?:[[:punct:]\h]+(?!\b(OK|DI|I)\b)[\u\d]+)*$(\R|\z) piece by piece:

                                      (?-si): This is a mode modifier.

                                      s: This option allows . to match newline characters.
                                      i: This option makes the matching case insensitive.
                                      -si: The minus sign indicates that both options are turned off. So, . will not match newlines, and the matching will be case sensitive.
                                      ^: Asserts the start of a line.

                                      (?!\d\d): A negative lookahead assertion that checks if the string does not start with two digits. If the string starts with two digits, the match will fail.

                                      (?!\b(OK|DI|I)\b): Another negative lookahead that ensures that the string does not contain the whole words “OK”, “DI”, or “I”. The \b asserts a word boundary, meaning that “OK”, “DI”, or “I” must not be present as complete words.

                                      [\u\d]+: Matches one or more characters that are either:

                                      \u: A Unicode character (the exact definition may depend on the specific implementation, often it means any character except control characters).
                                      \d: Any digit (0-9).
                                      (?: … ): This is a non-capturing group, meaning it groups part of the regex together without capturing the matched text for back-referencing.

                                      [[:punct:]\h]+: Inside the non-capturing group, this matches one or more characters that are:

                                      [:punct:]: Any punctuation character.
                                      \h: Any horizontal whitespace (spaces or tabs).
                                      (?!\b(OK|DI|I)\b): Again, another negative lookahead to ensure that after the punctuation and whitespace, the words “OK”, “DI”, or “I” do not appear.

                                      [\u\d]+: After the punctuation and whitespace, this matches one or more characters that are either Unicode characters or digits.

                                      *: The non-capturing group can repeat zero or more times, allowing for multiple segments of the pattern.

                                      $: Asserts the end of a line.

                                      (\R|\z): This matches either:

                                      \R: Any line break (newline character).
                                      \z: The end of the string, which ensures that the match goes to the very end of the input.
                                      Summary
                                      This regex matches a string that:

                                      Does not start with two digits.
                                      Does not contain the whole words “OK”, “DI”, or “I”.
                                      Contains one or more Unicode characters or digits.
                                      May have segments of punctuation/whitespace followed by more Unicode characters or digits.
                                      The whole match must end at the line’s end or the end of the string.
                                      Let me know if you need further clarification on any part!

                                      • LordP666L
                                        LordP666 @Coises
                                        last edited by

                                        @Coises

                                        Thanks for the tip. I don’t normally mess around with the settings as I’m not 100% sure about some of them. I will check it out in the future.

                                        Now I’m wondering if SubtitleEdit would have removed that one line I described to guy038, among others.

                                        PE

                                        • Alan KilbornA
                                          Alan Kilborn @LordP666
                                          last edited by

                                          @LordP666 said in Delete all-cap words:

                                          I also use OpenAI to decode these search strings and I wonder what you think of its abilities

                                          Postings of AI generated content are disallowed on this forum.
                                          Anyone that wanted to see what AI thought of the regular expression can run it through an AI for themselves.

                                          • PeterJonesP
                                            PeterJones
                                            last edited by

                                            In back-and-forth between @Coises and @LordP666:

                                            You would find it much easier to use Subtitle Edit.
                                            …
                                            I asked for help on this precisely because Subtitle Edit didn’t work.
                                            …
                                            Testing with Subtitle Edit 4.0.8 …
                                            …
                                            Now I’m wondering if SubtitleEdit would have …

                                            And as a reminder, this Forum is for Notepad++, not Subtitle Edit. If you two want to continue to talk about that alternative, you can use the private chat of the forum if you want. But let’s keep the focus of this conversation on possible Notepad++ solutions (without resorting to probabilistic Generative AI hallucinations).
