    Regex. Remove headings that has no full stop at the end (.)

    • Dumitru S.

      Re: Help - Remove a line from a specific text but keep a part of the text

      I read your post above and thought that it matched my own subject, so I am writing directly here.
      I have many headings in a text, and each one ends with a Carriage Return and Line Feed (CRLN):
      Subject:Beginning of the heading words words words end of header with no full stop at the endCRLN
      Abc 200:1 words words words and the end of sentence.

      I used this regex: Regex.png
      and it matches until Abc. But, I need to remove only the heading without Abc. I would really appreciate your input. Thank you!

      • guy038

        Hello, @dumitru-s and All,

        If I fully understand you, you would like to remove lines :

        • Which begin with the word Subject, with this exact case

        AND

        • Which does not end with a full stop sign .

        If so, the following S/R should work :

        SEARCH (?-si)^Subject.+(?<!\.)\R

        REPLACE Leave EMPTY
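For readers who want to try this rule outside Notepad++, here is a minimal Python sketch of the same idea. It is an approximation, not the exact regex above: Python's `re` module has no `\R`, so the `EOL` group below stands in for it, and `[^\r\n]+` replaces `.+` because Python's `.` also matches `\r`. The names `EOL`, `SINGLE_LINE_HEADING` and `remove_single_line_headings` are made up for illustration.

```python
import re

# Stand-in for \R (assumption: only \r\n, \n and \r line endings matter here).
EOL = r"(?:\r\n|\n|\r)"

# A line that starts with "Subject" (case-sensitive) and whose last
# character before the line break is not a full stop.
SINGLE_LINE_HEADING = re.compile(
    r"^Subject[^\r\n]+(?<!\.)" + EOL,
    re.MULTILINE,
)


def remove_single_line_headings(text):
    """Delete every one-line heading that does not end with a full stop."""
    return SINGLE_LINE_HEADING.sub("", text)


sample = (
    "Subject:Heading with no full stop\r\n"
    "Abc 200:1 words and the end of sentence.\r\n"
    "Subject:Heading with a full stop.\r\n"
    "Abc 200:2 more words.\r\n"
)
cleaned = remove_single_line_headings(sample)
```

In this sketch only the first `Subject` line is removed; the one ending with a full stop is kept because the negative lookbehind `(?<!\.)` fails right before its line break.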

        Best Regards

        guy038

        • Dumitru S.

          @guy038 said in Regex. Remove headings that has no full stop at the end (.):

          (?-si)^Subject.+(?<!\.)\R

          Yes, you understood me very well. I tried your code and it works fully. So, thank you.

          However, my text has 34900 lines, and I just discovered that some headings are broken in the middle, twice or even more times, by a Carriage Return and Line Feed. Like this:

          Subject:Beginning of the heading wordsCRLN
          words words words words words words wordsCRLN
          words end of header with no full stop at the endCRLN
          Abc 200:1 words words words and the end of sentence.

          So, in this case the match goes only until the first CRLN, and not all the way to the end, just before Abc, as I need it to.

          I tried to tweak your code, but I need to study more before I can really understand what you did.

          I would further appreciate your help. Thank you a lot @guy038 !

          • guy038

            Hi, @dumitru-s and All,

            OK ! So, in other words, you expect to delete any block of lines :

            • Beginning with the word Subject, with this exact case

            • Ending with the last complete line which does not end with a full stop sign .

            Try this S/R :

            SEARCH (?s-i)^Subject[^.]+\R

            REPLACE Leave EMPTY

            Notes :

            • The part [^.]+ matches the greatest non-null range of any character, including EOL chars like \r and \n , different from a dot char ( . )

            • The \R syntax matches any EOL char(s), so \r\n in Windows files, \n in Unix files and \r in Mac files
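The two notes above can be checked with a small Python sketch. It is only an approximation of the Notepad++ regex: Python has no `\R`, so the hypothetical `EOL` group replaces it, and all names here are illustrative. Inside a character class such as `[^.]`, line-break characters are matched in Python too, so no `(?s)` equivalent is needed.

```python
import re

# Stand-in for \R (assumption: only \r\n, \n and \r line endings matter here).
EOL = r"(?:\r\n|\n|\r)"

# "Subject", then the longest run of characters that are not a full stop
# (line breaks included), backtracking to the last line break in that run.
MULTI_LINE_HEADING = re.compile(r"^Subject[^.]+" + EOL, re.MULTILINE)


def remove_multi_line_headings(text):
    """Delete a heading block that may span several lines."""
    return MULTI_LINE_HEADING.sub("", text)


sample = (
    "Subject:Beginning of the heading words\r\n"
    "words words words words words words words\r\n"
    "words end of header with no full stop at the end\r\n"
    "Abc 200:1 words words words and the end of sentence.\r\n"
)
cleaned = remove_multi_line_headings(sample)
```

Because `[^.]+` is greedy, it first runs up to the full stop at the end of the `Abc` line and then backtracks to the last line break before it, so the match ends just before `Abc`. A side effect worth knowing: a full stop anywhere inside the heading itself would stop the run early, so such a heading would be matched only partly or not at all.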

            BR

            guy038

            • Dumitru S.

              Thank you for S/R, and thank you for your Notes.

              What I also want is to delete the WHOLE block of lines, and only those blocks of lines:

              • Beginning from the word Subject, with this exact case

              • Ending with the word that is just before the Abc but without Abc.

              …as seen below:

              Subject: Beginning of the heading wordsCRLN
              words words words words words words wordsCRLN
              words end of header with no full stop at the endCRLN
              Abc 200:1 words words words and the end of sentence.

              The trick is that the word just before Abc is at the end of the line, but it does not have a full stop.
              Thank you again!

              • PeterJones @Dumitru S.

                @Dumitru-S said in Regex. Remove headings that has no full stop at the end (.):

                The trick is that the word just before Abc is at the end of the line, but it does not have a full stop

                No, the trick is that you have not defined the rule for when words CRLF words should be treated as not the end, but end CRLF abc should be treated as the end. Because in our reading of your fake text, we cannot see any difference between words and abc – abc is still alphabetic text, just like words is. Guy works magic with regex, but he cannot actually read your mind.

                • guy038

                  Hi, @dumitru-s, @peterjones and All,

                  Sorry, @dumitru-s, but I’m still confused about what you want. I’m surely missing something obvious !

                  Let’s suppose the sample text below :

                  Subject: BEGINNING of the heading words words words words words       ( Line A )
                  words words words words words words words words words words words     ( Line B )
                  words words words words END of header with NO full stop at the end    ( Line C )
                  Abc 200:1 words words words and the end of sentence with a FULL STOP. ( Line D )
                  
                  Subject: BEGINNING of the heading words words words words words       ( Line  A )
                  words words words words words words words words words words words     ( Line B1 )
                  words words words words words words words words words words words     ( Line B2 )
                  words words words words words words words words words words words     ( Line B3 )
                  words words words words words words words words words words words     ( Line B4 )
                  words words words words END of header with NO full stop at the end    ( Line  C )
                  Abc 200:1 words words words and the end of sentence with a FULL STOP. ( Line  D )
                  

                  The regex given in my previous post ( (?s-i)^Subject[^.]+\R ) does select from the beginning of Line A ( Subject ) till the very end of Line C ( so, till the closing parenthesis AND the EOL chars \r and \n ). In other words, it selects till right before the Abc string. Doesn’t it ?

                  Best Regards,

                  guy038

                  • Dumitru S.

                    Hi @peterjones, @guy038 and All,

                    You both are right. I was wondering how it could be that you were wrong and, at the same time, I was wrong too. I was sure I was wrong somewhere but did not know where; now I have found it.
                    The fact that I was not able to clearly define my text just shows what a rookie I am in Regex.

                    My text being so long, I missed some aspects of it. I apologize, but now I get it. Thank you for your patience.
                    As you see it in the picture and in the video link below, the difference is in the punctuation at the end of the lines. That is where, in some of my lines, your code did not work and I did not know why.

                    MyTextStructure.png
                    Watch MyTextStrucure.mp4 and see how the code works on it.

                    So, what I need is to remove the text starting from the word Subject all the way up to the first Abc, without deleting Abc. Note that at the end of the heading, just before the start of the first Abc line, there may or may not be punctuation, but there will never be a full stop.

                    This is a real challenge, I know, at least for me it is. I hope I defined it clearly this time.

                    • PeterJones @Dumitru S.

                      @Dumitru-S ,

                      (I don’t know what’s in your video, because my I.T. department blocks dropbox and most other such services.)

                      You still have not explained how the regex is supposed to tell the difference between a line that starts with words and a line that starts with Abc. Is Abc literal text, and we’re supposed to recognize that the Subject header ends on the line before literal Abc each time? Or can that Abc be different text each time? Is the real indicator that the line starts with text followed by a number and a colon (like Abc 200: or Xyz 555:)? Or is it that the line starts with a capital letter? Or is it that the line starts with a capital letter and has a colon?

                      Please notice that your examples have completely ignored the advice that I gave to the other poster in the “Remove a line from a specific text…” discussion that you linked in the first post in this topic. That advice said that you should always give examples of text/lines that should match and text/lines that should not match, and that the example data should have as much variation as your real data. We cannot tell if you are using Abc as literal text, or as a placeholder meaning that several different texts can go in that slot. And since you use Abc every time, it keeps reinforcing that literal Abc is what you’re looking for, but the text Abc looks like a placeholder, so we’re afraid that you do not actually mean that, and that you’re going to come back and complain that it only worked on Abc, not on Xyz, despite the fact that you’ve never mentioned Xyz.

                      For example, which of these situations should be regarded as a “missing full stop” situation?

                      Chapter 1
                      Subject: This is a 
                      multiline header,
                      Abc and this isn't part of the header
                      
                      Chapter 2
                      Subject: Another multiline
                      header goes here,
                      Abc 200:1 and more text
                      
                      Chapter 3
                      Subject: Third header
                      that goes
                      across multiple
                      lines,
                      Abc 555: how about this one
                      
                      Chapter 4
                      Subject: Fourth multiple lines,
                      all in the same header,
                      as evidenced here
                      Xyz 200: what about this one?
                      
                      Chapter 5
                      Subject: And here's
                      a fifth
                      example
                      Xyz without colon or number
                      
                      Chapter 6
                      Subject: Yet another example,
                      two lines this time,
                      Xyz: with colon, no number
                      
                      Chapter 7
                      Subject: this has multiple
                      lines, and the next will start,
                      abc 200: this has lower case, so is "start," still part of a matching header or not?
                      
                      Chapter 7
                      Subject: what about
                      multiline followed by 
                      abc: with lowercase, and with colon, but no number
                      
                      Chapter 8
                      Subject: here is one
                      last subject,
                      abc without colon and no number and all lowercase
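
To make the ambiguity concrete, here is a small Python sketch that applies three of the candidate "end of header" rules asked about above to a shortened copy of these chapters. Everything in it is an assumption for illustration: the rule names, the `RULES` table, the `matched_headers` helper, and the extra guard `(?:(?!^Subject:).)*?` that stops one header from running into the next. It is not a regex proposed anywhere in this thread.

```python
import re

SAMPLE = (
    "Subject: This is a\n"
    "multiline header,\n"
    "Abc and this isn't part of the header\n"
    "\n"
    "Subject: Fourth multiple lines,\n"
    "as evidenced here\n"
    "Xyz 200: what about this one?\n"
    "\n"
    "Subject: here is one\n"
    "last subject,\n"
    "abc without colon and no number and all lowercase\n"
)

# Hypothetical rules for what the line AFTER the header must start with.
RULES = {
    "literal Abc": r"Abc",
    "word, number, colon": r"[A-Za-z]+ \d+:",
    "capital letter": r"[A-Z]",
}


def matched_headers(text, terminator):
    """Return every Subject block that ends right before a terminator line."""
    pattern = re.compile(
        r"^Subject:(?:(?!^Subject:).)*?(?=^" + terminator + ")",
        re.DOTALL | re.MULTILINE,
    )
    return pattern.findall(text)


def first_lines(text, rule_name):
    """First line of each matched header, for easy comparison."""
    return [m.splitlines()[0] for m in matched_headers(text, RULES[rule_name])]
```

Each rule selects a different set of headers from the same text, which is why the examples need to say which rule is meant.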
                      

                      This is a real challenge, I know, at least for me it is

                      The real challenge is that you cannot even describe what you want in language and examples. How you can expect us to guess it is beyond me.

                      However, if you hadn’t had the history of this whole confusing thread, if you were to have just said,

                      So, what I need is to remove text starting from the word Subject all the way until the first literal Abc without deleting Abc. The subject section is possibly crossing multiple lines

                      I would have replied,

                      FIND = (?s)Subject:.*?(?=^Abc)
                      REPLACE = empty
                      MODE = regular expression

                      f50ea771-9b0f-45da-9050-c0ccc629a43c-image.png
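
For anyone testing this outside Notepad++, a rough Python equivalent of the FIND expression above might look like the sketch below. Two assumptions: Python needs `re.MULTILINE` for `^` to match at the start of every line (Notepad++ does that by default), and the function name is invented for this example.

```python
import re

# (?s) becomes re.DOTALL; re.MULTILINE makes ^ match at each line start.
UNTIL_ABC = re.compile(r"Subject:.*?(?=^Abc)", re.DOTALL | re.MULTILINE)


def remove_subject_until_abc(text):
    """Delete from 'Subject:' up to, but not including, the next line starting with 'Abc'."""
    return UNTIL_ABC.sub("", text)


sample = (
    "Chapter 2\n"
    "Subject: Another multiline\n"
    "header goes here,\n"
    "Abc 200:1 and more text\n"
    "\n"
    "Chapter 4\n"
    "Subject: Fourth multiple lines,\n"
    "all in the same header,\n"
    "as evidenced here\n"
    "Xyz 200: what about this one?\n"
)
cleaned = remove_subject_until_abc(sample)
```

In this sample the Chapter 2 header is removed and the Chapter 4 header is left alone, because no line starting with literal `Abc` follows it.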

                      If you had said,

                      So, what I need is to remove text starting from the word Subject all the way until the first Abc without deleting Abc. The subject section is possibly crossing multiple lines. Never delete a subject section if it ends with a full stop.

                      I would have replied,

                      FIND = (?s)Subject:.*?(?<!\.)\R(?=^Abc)
                      REPLACE = empty
                      MODE = regular expression

                      (this adds the restriction that a literal full-stop (\.) cannot be the character just before the newline sequence (\R).)

                      456119a4-624f-45e6-993c-8650ae1ece09-image.png

                      But, once again, this is all guessing, because you need to spend more effort in giving examples that will

                      Also, the regexes so far have a problem: because they refuse to match a Subject that does end in a full stop, when they find one the match might extend across multiple chapters, as in this example:
                      3525f9b9-11e4-45b8-8f88-22b022b73b9b-image.png
                      If this isn’t what you want, then you will have to give more examples of text that should match and text that shouldn’t.
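
The cross-chapter effect described above can be reproduced with a short Python sketch. It rests on several assumptions: the lookbehind `(?<!\.)` is placed just before the line break, the `EOL` group stands in for `\R` (which Python lacks), and the guarded variant with `(?:(?!Subject:).)*?` is only one possible way to keep a match inside a single chapter. It is not something proposed in this thread, and all names are illustrative.

```python
import re

EOL = r"(?:\r\n|\n|\r)"  # stand-in for \R
FLAGS = re.DOTALL | re.MULTILINE

# Header must not end with a full stop; the match may cross chapters.
NO_STOP = re.compile(r"Subject:.*?(?<!\.)" + EOL + r"(?=^Abc)", FLAGS)

# Same rule, but the match may not run past another "Subject:".
NO_STOP_GUARDED = re.compile(
    r"Subject:(?:(?!Subject:).)*?(?<!\.)" + EOL + r"(?=^Abc)", FLAGS
)

sample = (
    "Chapter 1\n"
    "Subject: First header ends with a full stop.\n"
    "Abc 200:1 first body.\n"
    "\n"
    "Chapter 2\n"
    "Subject: Second header has no full stop\n"
    "Abc 200:2 second body.\n"
)

crossing = NO_STOP.sub("", sample)
guarded = NO_STOP_GUARDED.sub("", sample)
```

With this sample, the unguarded pattern starts at the Chapter 1 header, rejects its full stop, and keeps going until the Chapter 2 header, so everything in between is deleted. The guarded pattern leaves Chapter 1 untouched and removes only the Chapter 2 header.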

                      ----

                      Do you want regex search/replace help? Then please be patient and polite, show some effort, and be willing to learn; answer questions and requests for clarification that are made of you. All example text should be marked as literal text using the </> toolbar button or manual Markdown syntax. To make regex in red (and so they keep their special characters like *), use backticks, like `^.*?blah.*?\z`. Screenshots can be pasted from the clipboard to your post using Ctrl+V to show graphical items, but any text should be included as literal text in your post so we can easily copy/paste your data. Show the data you have and the text you want to get from that data; include examples of things that should match and be transformed, and things that don’t match and should be left alone; show edge cases and make sure your examples are as varied as your real data. Show the regex you already tried, and why you thought it should work; tell us what’s wrong with what you do get. Read the official NPP Searching / Regex docs and the forum’s Regular Expression FAQ. If you follow these guidelines, you’re much more likely to get helpful replies that solve your problem in the shortest number of tries.

                      • Alan Kilborn

                        Maybe we need a template that posters with a regular expression question have to fill out successfully in order to move to the first level of help.

                        If they can’t get through the template, then absolutely no help…not a bunch of back-and-forth-(pull-teeth)-to-get-an-accurate-statement-of-the-problem, not a lot of guessing about what is wanted, not a ton of lets-solve-every-possible-problem-this-could-be.

                        I don’t think we have anything to feel bad about in denying help if someone can’t accurately state what they need help with!

                        So, okay, I enjoy reading about these types of problems and their solutions, but if there is a huge amount of the “diversions” enumerated above in a thread, I get confused and lose interest – because it gets darned hard to follow. And maybe that way I miss out on a “gem” of a new technique. And I don’t want that to happen.

                        • Dumitru S. @PeterJones

                          @PeterJones said in Regex. Remove headings that has no full stop at the end (.):

                          (?s)Subject:.*?(?=^Abc)

                          Thank you very much, sir, for answering with so much professionalism and in much detail.

                          This code worked very well for me: (?s)Subject:.*?(?=^Abc) , and this is exactly what I desired from the beginning, although it was not easy for me to explain in words what I need; I am just a beginner.

                          I would like to study carefully what you wrote. Thank you @guy038 @PeterJones @Alan-Kilborn and All.

                          Have an excellent day today!

                          • guy038

                            Hello, @dumitru-s, @peterjones and All,

                            Sorry to reactivate this topic but, @dumitru-s, from the picture, below, could you just bookmark ALL the lines which should be matched, in each chapter, by the “future” regex ?

                            You are also invited to add some other chapters, with the corresponding bookmarks, if this way could improve our comprehension of what you want to match ;-))

                            0903c4a4-b40b-46d0-8358-f512b43d45f6-image.png

                            Thanks for this extra work !

                            BR

                            guy038

                            The Community of users of the Notepad++ text editor.