Community
    • Login

    Need help with adding </i> to the end of multiple different strings.

    Scheduled Pinned Locked Moved Help wanted · · · – – – · · ·
    17 Posts 5 Posters 929 Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • PeterJonesP
      PeterJones @Hank K
      last edited by PeterJones

      @Hank-K,

      467
      00:44:43,929 --> 00:44:47,306
      MAXIM: <i>Hey, do you speak any Russian?
      CURNOW: <i>No.
      
      470
      00:44:51,687 --> 00:44:56,023
      FLOYD: <i>Hey, Curnow, have you heard the one
      about the marathon runner and the chicken?
      

      The problem is trying to distinguish the “MAXIM” line, where you want it to end after the first line, compared to the “FLOYD” line, where you want the <i>...</i> to span to the end of the block of text. Without more knowledge of your edge cases, it’s likely to fail. For example, if the rule was “</i> goes on the same line if the next line starts with a capital”, then I could come up with the exception:

      470
      00:44:51,687 --> 00:44:56,023
      FLOYD: <i>Hey, Curnow, have you heard that
      George is angry?
      

      … where the next line starts with a capital, but it’s obviously (to a human) not the end of the text that needs </i>

      If your rule is “</i> goes on the same line if the next line starts with ALL CAPS, possibly with hyphen-then-space prefix”, then I would have to ask whether there is ever ANY ALLCAPS SHOUTING in these subtitles? Because if there is SHOUTING, then there CAN BE a circumstance where it makes it match the regex even though it should be a multiline italics rather than single line.

      My guess is that for most rules you come up with, I could probably come up with a practical situation in which the natural subtitles will look like they match your rule, but don’t actually match the intent of the rule.

      Hank KH 1 Reply Last reply Reply Quote 0
      • PeterJonesP
        PeterJones @Hank K
        last edited by

        @Hank-K ,

        … however, I think this comes pretty close to what you’ll want.

        FIND = (?-is)<i>.*$(\R^(?!- )(?!\u+:).+)*
        REPLACE = `$0</i>

        Essentially, I started with “match from <i> to the end of the line”, which was easy. Then I added on "and zero or more subsequent lines if there is any text = (\R.+)* , which then got multi-line paragraphs of speech, but also grabbed too much. Then I added in “but the extra line cannot start with hyphen-space” using (?!- ) (which is a “negative lookahead” saying that it cannot match what’s inside that group), so it stopped capturing those; and then I added “and the extra line cannot start with ALLCAPS-then-colon” with (?!\u+:) .

        That comes up with the same answer that you said you wanted, so I think it’s good.

        I tried to explain my methodology: start simple, and then start adding in exceptions. Because really, figuring out how to do that on your own is the only way to get better at regex. And anytime we do give you a regex (which likely won’t be many more times), you need to study every little component from there: look up in the documentation below – especially the N++ User Manual link – what each piece of a regex means, and then play around to see how editing that piece will change the results of your matches.

        ----

        Useful References

        • Notepad++ Online User Manual: Searching/Regex
        • FAQ: Where to find other regular expressions (regex) documentation

        ----

        Please note: This Community Forum is not a data transformation service; you should not expect to be able to always say “I have data like X and want it to look like Y” and have us do all the work for you. If you are new to the Forum, and new to regular expressions, we will often give help on the first one or two data-transformation questions, especially if they are well-asked and you show a willingness to learn; and we will point you to the documentation where you can learn how to do the data transformations for yourself in the future. But if you repeatedly ask us to do your work for you, you will find that the patience of usually-helpful Community members wears thin. The best way to learn regular expressions is by experimenting with them yourself, and getting a feel for how they work; having us spoon-feed you the answers without you putting in the effort doesn’t help you in the long term and is uninteresting and annoying for us.

        1 Reply Last reply Reply Quote 1
        • Hank KH
          Hank K @PeterJones
          last edited by

          @PeterJones
          Not sure I totally understand SHOUTING. Those are the characters names with the following text that is in italics.
          In this case:

          467
          00:44:43,929 --> 00:44:47,306
          MAXIM: <i>Hey, do you speak any Russian?
          CURNOW: <i>No.
          

          Both characters are off screen which means both strings need the closing </i> to produce the following:

          467
          00:44:43,929 --> 00:44:47,306
          MAXIM: <i>Hey, do you speak any Russian?</i>
          CURNOW: <i>No.</i>
          

          All of what I showed in the BEFORE above in 1st post need the closing </i>.

          If two different (or three) FIND/REPLACE are required, that OK.

          PeterJonesP 1 Reply Last reply Reply Quote 0
          • PeterJonesP
            PeterJones @Hank K
            last edited by PeterJones

            @Hank-K said in Need help with adding </i> to the end of multiple different strings.:

            Not sure I totally understand SHOUTING. Those are the characters names with the following text that is in italics.

            123
            11:22:33,444 --> 11:22:34,567
            MAXIM: <i>GET OUT OF MY WAY, YOU IDIOT
            CURNOW: <i>Hey, calm down, no need to shout.
            

            vs

            123
            11:22:33,444 --> 11:22:34,567
            MAXIM: <i>GET OUT OF MY WAY, YOU IDIOT
            CURNOW: I'M STILL TALKING TO YOU
            

            … in the first, MAXIM is shouting at CURNOW, and then CURNOW replies.

            In the second , MAXIM is still talking to CURNOW on the second line of the paragraph (his whole quote being “GET OUT OF MY WAY, YOU IDIOT CURNOW: I’M STILL TALKING TO YOU”). It takes true intelligence, not a regex, to understand the difference between those two lines. Even my regex won’t handle this exception. But I think my regex comes close enough that it will work under most circumstances.

            Hank KH 1 Reply Last reply Reply Quote 0
            • Hank KH
              Hank K @PeterJones
              last edited by

              @PeterJones
              Your regex is very complex an I am NO way near understanding, that’s why I come here for help. I’m still a noob with regex.
              Your regex got the majority which I’m super thankful for.
              It only missed ONE:

              466
              00:44:40,050 --> 00:44:43,928
              - <i>How's his pulse?</i>
              - RUDENKO: It's high. Not to worry too much.
              

              I can work with that. Thank you so much for this, much appreciated.
              FYI: I’m 70 years old an have a hard time wrapping my head around those complex regex command strings.
              Cheers, Hank

              Hank KH PeterJonesP 2 Replies Last reply Reply Quote 0
              • Hank KH
                Hank K @Hank K
                last edited by

                This post is deleted!
                1 Reply Last reply Reply Quote 0
                • PeterJonesP
                  PeterJones @Hank K
                  last edited by PeterJones

                  @Hank-K said in Need help with adding </i> to the end of multiple different strings.:

                  It only missed ONE:
                  …
                  Oops, it missed this one too:

                  It didn’t miss either of those for me:

                  0c22bfae-82b1-45c2-ad2b-15666e8d3322-image.png

                  Did you have your cursor later in the file and not have Wrap Around checkmarked?

                  addendum: here’s a “before” picture using the Mark tab in the dialog box, to show what it’s going to replace
                  02017fb1-3e33-4001-934f-98b7459564ba-image.png

                  I am NO way near understanding

                  The only way to learn is to study and practice what you’ve been given. The same is true at 7 or 70

                  Hank KH 1 Reply Last reply Reply Quote 0
                  • Hank KH
                    Hank K @PeterJones
                    last edited by

                    @PeterJones said in Need help with adding </i> to the end of multiple different strings.:

                    …

                    You are TOTALLY correct … my BAD, old age sorry.
                    This is so AMAZING… thank you so MUCH Peter.

                    Hank KH 1 Reply Last reply Reply Quote 0
                    • Hank KH
                      Hank K @Hank K
                      last edited by

                      @PeterJones
                      I forgot to mention one issue, which is some of the line strings could already have the </i>.
                      In those cases I get double</i></i>.
                      But easy fix afterwards.
                      Thanks again.

                      PeterJonesP 1 Reply Last reply Reply Quote 0
                      • PeterJonesP
                        PeterJones @Hank K
                        last edited by

                        @Hank-K said in Need help with adding </i> to the end of multiple different strings.:

                        But easy fix afterwards.

                        Adding in the restriction to not already have the </i> would make it much more complicated regex. I think it’s best to just run the one above, then remove any duplicate </i></i>

                        1 Reply Last reply Reply Quote 0
                        • CoisesC
                          Coises
                          last edited by

                          You might find it easier to use Subtitle Edit.

                          One of the options under Tools | Fix common errors… is Fix invalid italic tags.

                          1 Reply Last reply Reply Quote 3
                          • guy038G
                            guy038
                            last edited by

                            Hello, @hank-k, @peterjones and All,

                            @peterjones, I think that we can shorten your regex to :

                            SEARCH (?-is)<i>.*(\R(?!- |\u+:).+)?

                            REPLACE $0</i>

                            • Indeed, the $ before \R and the ^ after \R are useless

                            • The two conditions can be combined in a single look-ahead


                            Note that, in your regex, the conditions are :

                            IF ( DIFFERENT from -\x20 ) at beginning of line AND ( IF DIFFERENT from \u+: ) at beginning of line :

                            => Look for the next part \R.+

                            But, in my regex the unique condition is :

                            IF ( DIFFERENT from -\x20 OR DIFFERENT from \u+: ) at beginning of line :

                            => Look for the next part \R.+


                            However, these two formulations are totally identical, because the conditions are mutually exclusive.

                            So, ONLY 3 cases may occur :

                            -\x20 \u+: Result Notes
                            False False MATCH The regex = (?-is)<i>.*\R.+
                            False True NO match The regex = (?-is)<i>.*
                            True False NO match The regex = (?-is)<i>.*

                            Best Regards,

                            guy038

                            PeterJonesP 1 Reply Last reply Reply Quote 0
                            • PeterJonesP
                              PeterJones @guy038
                              last edited by

                              @guy038

                              Logically (per De Morgan’s Laws), NOT(A OR B) is identical to NOT(A) AND NOT(B). Neither one is logically “simpler” or “harder”; they are just alternate formulations of the same logical concept. An individual might more naturally think in terms of one or the other, but they are the same thing. In this case, because of the way I was building it up to try to help the original poster understand, I thought it best to keep it in the same format that I built it up step-by-step in my mind.

                              And I left the ^ and $ in there, again because they were there at an earlier stage in the development process, and there’s no good reason to remove them. (And in my mind, they help distinguish between “the end of the logical line” $ and “the separator character(s) that goes between one line and the next” \R , so they serve a mental purpose, and have no effect on the final regex.)

                              I am quite willing to have things in alternate formats if it helps me understand or explain a regex better.

                              1 Reply Last reply Reply Quote 3
                              • guy038G
                                guy038
                                last edited by guy038

                                Hi, @hank-k, @peterjones and All,

                                @peterjones, I was upset because I initially thought that your two consecutive conditions should be interpreted as IF(NOT (A)) OR IF(NOT (B)), and I was perplex as my regex was interpreted as IF NOT ((A) OR (B)) and I know, because of the Morgan’s laws, that one of the two should be false !

                                Luckily, I was wrong. Your regex must be interpreted as IF (NOT (A)) AND IF (NOT (B)), because each condition must be true, right after the line-break ! And, of course, you’re right : it’s just the two forms of the same logical concept :-)


                                Now, regarding the necessity to add the $ assertion or not, I did some tests and the problem seems quite difficult !

                                Paste the following text in a new tab :

                                
                                
                                
                                a
                                a
                                a
                                az
                                az
                                az
                                a
z
                                a
z
                                a
z
                                a
z
                                a
z
                                a
z
                                a…z
                                a…z
                                a…z
                                

                                In this text :

                                • Each line is repeated with the 3 possible line-breaks ( \r\n, \n, and \r )

                                • We have, successively, 3 empty lines, 3 chars a, 3 strings aFFz, 3 strings aLSz, 3 strings aPSz and 3 strings aNELz


                                Just test the search of $ against these lines : after the first six lines correctly detected, the regex matches a null length after, both, the a letter and the z letter

                                To my mind, a correct regex to grasp all the characters of a line and thus, all the chars till the very end of each line, could be :

                                • SEARCH/MARK (?s)^([^\r\n]+?|)(?=\r\n|\r|\n)

                                And, in order to get the very end of these lines, we would use the regex :

                                • SEARCH/MARK (?s)^([^\r\n]+?|)(?=\r\n|\r|\n)\K

                                As you see, these formulations are not obvious too !

                                In fact, the simple $ regex matches any position right before the CR, LF, FF, LS, PS or NEL character

                                Best Regards,

                                guy038

                                1 Reply Last reply Reply Quote 0
                                • guy038G
                                  guy038
                                  last edited by guy038

                                  Hi, @peterjones and All,

                                  Regarding the mix of special chars, in lines, which can be wrongly interpreted as line-ending chars, I improved my previous regex about :

                                  • The way to catch all the line contents of any complete line, wrapped or not, whatever its line-ending char(s), even none

                                  • The way to catch the true $ assertion, which represents the very end of any line, empty or not


                                  SEARCH / MARK (?s)^(?:[^\r\n]+?|(?<![\x{000C}\x{0085}\x{2028}\x{2029}]))(?=\r\n|\r|\n|\z)

                                  The contents of each line is stored as group 0, which can be re-used, in the replacement part, with the $0 syntax

                                  SEARCH / MARK (?s)^(?:[^\r\n]+?|(?<![\x{000C}\x{0085}\x{2028}\x{2029}]))\K(?=\r\n|\r|\n|\z)

                                  This second regex gives the zero-length location of the very end of each line, empty or not, whatever its line-ending char(s), even none


                                  I added some tests when special chars are alone or begin or end the current line, and this for the 3 line-end syntaxes ( \r\n, \n and \r ), giving the INPUT text, below, which should work in all cases :

                                  
                                  
                                  
                                  a
                                  a
                                  a
                                  
                                  
                                  
                                  …
                                  …
                                  …
                                  

                                  

                                  

                                  

                                  

                                  

                                  az
                                  az
                                  az
                                  a
z
                                  a
z
                                  a
z
                                  a
z
                                  a
z
                                  a
z
                                  a…z
                                  a…z
                                  a…z
                                  z
                                  z
                                  z
                                  
z
                                  
z
                                  
z
                                  
z
                                  
z
                                  
z
                                  …z
                                  …z
                                  …z
                                  a
                                  a
                                  a
                                  a

                                  a

                                  a

                                  a

                                  a

                                  a

                                  a…
                                  a…
                                  a…
                                  

                                  Just test my two new regexes against this INPUT text !

                                  Best Regards

                                  guy038

                                  P.S. :

                                  With the free-spacing mode, the first regex becomes :

                                  SEARCH / MARK (?xs) ^ (?: [^\r\n]+? | (?<! [\x{000C}\x{0085}\x{2028}\x{2029}] ) ) (?= \r\n | \r | \n | \z )

                                  So, this regex searches, either, for :

                                  • The smallest range of characters, different from \r and \n, after beginning of current line, till the line-ending char(s) or the very end of file

                                  • A zero-length string, at beginning of current line, not preceded by any of these four chars \x{00OC}, \x{0085}, \x{2028} and \x{2029} and followed with any line-ending char(s) or by the very end of file

                                  Alan KilbornA 1 Reply Last reply Reply Quote 0
                                  • Alan KilbornA
                                    Alan Kilborn @guy038
                                    last edited by

                                    @guy038 said:

                                    Paste the following text in a new tab … Each line is repeated with the 3 possible line-breaks ( \r\n, \n, and \r )

                                    For me, at least, the line ending types indicated didn’t carry over in the copy, that is, I got this:

                                    b57e3295-ffa2-4686-aaa5-caaf59887fdf-image.png

                                    It’s not really a problem, but to carry on from that point to replicate what @guy038 is testing, one should manually adjust the line-endings before continuing.

                                    1 Reply Last reply Reply Quote 1
                                    • First post
                                      Last post
                                    The Community of users of the Notepad++ text editor.
                                    Powered by NodeBB | Contributors