    REGEX - Select everything before a particular word, including the line with the word?

    • Md Abdullah Al Noman

      I want to delete everything between two points in XML files of about 36,000 lines.
      The portion to delete is repeated throughout the files.
      I can insert the two marker points using a simple find and replace, so I would be left with…
      <Middle></Middle>
      <WebsiteList></WebsiteList>
      <EventList></EventList>
      <Note></Note>
      <LastName></LastName>
      START-DELETING… <Photo>data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAANsAAADbCAIAAABr4XMXAAAAA3NCSVQICAjb4U/gAAAKBklEQVR4
      nO3df2vjyB3H8akt18KyV8LCVpA5p/USwy3koIX77+7plT6XPpiWDeyCjw2Nl5izg4K1toJ8kUL/
      UEjS/HAcr0bzne98Xv8cy4E9Jm+P9XP0p3/8818CgIya6gEA/B8UCbSgSKAFRQItKBJoQZFAC4oE
      WlAk0IIigRYUCbSgSKAFRQItKBJoQZFAC4oEWlAk0IIigRYUCbSgSKAFRQItKBJoQZFAC4oEWlAk
      0IIigRYUCbSgSKAFRQItKBJoQZFAC4oEWlAk0IIigRYUCbSgSKAFRQItKBJosVQPQEuu2xJCuK5T
      /DOOEyFEsk6z/EblsFhAkbtyHDvou67rtNv2k//ZK/6TZfk6SeP4Ko6TOL6qeIQ8oMhXNJsN3+8E
      fe+5EB+zrLrnOp7rCNHLsjyKVlG0iuMEc+fuUOStoO/1A7ft2JZVL6Y6IUTxz/1e0LLqQeAFgSeE
      KNKMom9I81UoUgghxkdhkU6hmOpKfH3f7/h+R4gwilbLOImi1WZzXeLrc4IiRdD3HuYoVZHm+9HB
      ep3OF0uk+RSKFMNhr/o3bbftdvsAaT5lepF+t2PbDYUDeJjmbHaJbU3TiyyOLFLQbtvjcZhlwXy+
      nE4vjO3S6HM2rtuqbAtyR5ZVHwz8n38+CsOu6rGoYdwc2Ww2Doc93+/sfVinApZVfz86aDv25LeZ
      6rFUzawig743GgWUW3yomL9Ni9KgX+1ms6FRjoUg8PxuR/UoKmVQkeNxqFeOhfE4tOoG/ZlM+ah+
      t1PuaZjKWFZ9PB6oHkV1TClyNDpQPYT9+X6HzlEq2YwoMuh7ag+Df7/xkSnTpBFFKjlPWC7bbjD4
      FLvgXySDCbIwCLsm7OLw/4RsphbLqocDX/UopGNepPILKcplwjTJ/OP5PqvDyyZMkyhSM0Gf1qUh
      peNcpOu2dDxJs51tNxzn9XvQ9MW7SC1P0rwq6LuqhyAR7yJ5nufg+k0rcC7Sbv5Z9RCk2OXOcX2x
      LpLRcZ9HGG9Kci6SMcti+4dj+8FAUygSaEGRQAuKBFo4F7lep6qHAG/Gucgsz1UPAd6Mc5GMpSnb
      ZatQpJYYL6SGIoEWFAm0oEigBUUCLShSP8s4UT0EiTgXuWF6iCTPOC+/y7nIlOkhkuJZO1xxLpLr
      HBnjV1tTXOeShPX5es5FJkmaZdxObafpNe/HOHAuUgjB7wGuCdOJ/w7zIqNopXoIJeO6KXKHeZH8
      Dt3x3q0R7IvcbK6ZXbfLe7dGsC9SCDGbXaoeQpl479YI9kW6bovT7jaz+f5ZnJ/5NfprMOC12mK7
      bf/6ywchxGQymy+WqocjBec5klmOD7FZyvopzkUyxnhJIxSpJU4bx4+gSC3xOxd1h3ORjCcSfkf+
      73AukvEJN8yRWuL6Z8uynPH1FpyL5HeZRYHr5ypwLpLl9ZGC79xf4FykYDqdRNE31UOQCEVqZhkn
      vC+2YF4kv91tft+xR5gXudlcM1vYjvdGpGBfpOB1Y0qaXnP6OM/iXySn0xvsb2kQJhTJaVJhvxEp
      TCgyjq94HJXMsjy6RJEs8JhaeHyKVxlRJI8bAM6mF6qHUAUjiozjK033b+4OXZ2fR4xXw3+I851f
      D33+9NVp21l28/e/jVSP5Q2m04t1klpWjf1hyDumFJnlN8Ufdb1ONXpiehR9433O8CkjfrUf0uiQ
      3nqdmpajMLJIbX7+NPrylMjAIrX5M2v05SmRcUVm+Y0uB/Y0+vKUyLgihSZnuqNoZeBGpDCzSC3m
      SC0GKYOJRW4218T/3lmW8751YQsTixTkZyBjf7KFsUXOF0vKFwTxOBG/H0OLFEKcU117dxknZh73
      KZhb5Ow8ojlNLuax6iGoZG6RWX5DcJpM02uTf7KFyUUKktPk5Ldz1UNQzOgis/yG1GWwUbQyeQuy
      YHSRQojZ7JLIKZwsyycT0ydIgSKFEJ8/faXw2z2ZzIw9BvkQihRZfvPx5ExtlJPJzIT7DHeBIoUQ
      IklShVEyfjjNHlDkrSRJT0/n1b/vfL5Ejg+hyHvp5g8Vb2rEHYa7Q5FAC4pUrO1oc2NkNVDkPafy
      OLIsJ3WIngIUea/i6SrL8o8nZ5yWbisFirxX5RyJHF+CIm9Z9Vpla10gxy1Q5C3Xdap5I+S4HYq8
      5bqtCt4FOb4KRd7y/Xey3wI57gJFCiGE3+3YdkPqWyDHHaFIIYQIB12pr48cd4ciheu2PJm7Ncjx
      TVCkGA578l4cOb6V6UWGYVfeBIkc92B0kY5jvx8dSHpx5Lgfc4tsNhs/HR9KenHkuDdTVsZ/xKrX
      Pvz4g2XVZbz4fjlaVn18FEoaUumWcbKYL2VcbmxikY5j/3R8SC3H4+NDja6VdN3W4bB3Nr2Yln01
      nXFFBn1vNAqQYykOh708y8tdrMagIpvNxvvRge93JL2+aTkWhsP+fBGXeBunEUVa9Vo48AdhV95W
      mpk5CiEsq+Y4domL+DMv0nVbQd/z/Y7UPQZjcyy4bgtFbmPVa77/zvc7rtuqYNfV8BxLx6fIZrPh
      +52g71X52EPkWDrti1QSYgE5yqBxkcUGorx95+2QoyT6Fek49iDsyt5Z2Y5sjlmWn88uH+1ntB17
      JO30fel0KjLoe2HYVf50bLI5RtHqy+nvG80XEtKgyAqOJu6OZo5Jkn45/Z3HitGki2w2G4fDXhB4
      qgdyj1qOWZZPpxcEHzqxN6JFWvXaaHRAqsUCqRzPZ5fT6SLLWK0VTa5IUr/R30lejnGcfDmds7z+
      klaRfrczHmtzjeB2knJM0+vp9ILxsrxUirTqtfF4oOrgYunkzY7//s8XZj/Tj5C4q6HZbBwf/wU5
      7sLvvrj2huu2NDru+BL1c6TUK7qrJ/tAz3DYe/qTfXfpZ4nX4KiiuEip97soMT4KpR4Gt+1G0Pfu
      orSsWhj6hzJvOa+Y4iJ//PCD7AV3KlbBt+tumhyE3eGwx+n7LNQWGfQ9qcubcGXbjdHooILVs5RQ
      WaTU5U14G4Ryl85SSNm+NtevOHwndUVyOdYD5VJTpFWvoUh4lpoiff8dsz1EKIuqIjFBwvNQJNCi
      oEi/ixzhRSqKxAQJL0ORQEvVRTqOjb1s2KLqIjFBwnZVF1nN4wdBX1UXiQVGYLtKi8RGJLyq0iLt
      Ji72gVdUO0eqXrIH6Ku0SGxEwqsqLbJukbgZFyir9K6Gk5OzKt8OdIRJC2hRv4IAlMh1nV9/+aB6
      FN8FcyTQgiKBFhQJtKBIoAVFlqzEp6jqotylflFkyaJopXoIVSv3GREosmTzxZLHUzx2dDa9KPdn
      AUWW79Pnr4ZEeT67nE4vyn1NHCEvX5blH0/+G/S9IPAcx7bYnc2P46s0/WO+iGUs6YsiZZkvloyf
      qCAPt68v6A5FAi0oEmhBkUALigRaUCTQgiKBFhQJtKBIoAVFAi0oEmhBkUALigRaUCTQ8j+9xvaf
      +IjmkgAAAABJRU5ErkJggg==
      </Photo>
      END-DELETING
      <GroupList></GroupList>
      <Job></Job>

      Hoping you have the answer.

      • Terry R

        Given the example you provided, the following would remove all text between, and including, the START-DELETING and END-DELETING lines.
        Find: (START-DELETING.+\R)(.+\R)+(END-DELETING\R)
        Replace: (leave empty)

        The assumption is that there is at least 1 line between the 2 identifying lines (START and END); that’s the (.+\R)+ portion of the regex. Because .+ needs at least one character, a blank line inside the block would stop the match (with “. matches newline” unticked). Also note that the first group (START-DELETING.+\R) includes the .+ because your example has 3 period characters, and more text, after the marker. I’ve put brackets around each sub-portion only to make it easier to see what each group is doing. Only the middle group’s brackets are strictly necessary, i.e. (.+\R)+.

        You say you can insert the START and END lines using a simple find and replace. With my regex you could instead put the original strings you searched for in place of those markers. That would save you 1 or 2 additional steps.
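For anyone who wants to try this line-by-line pattern outside Notepad++, here is a minimal Python sketch. It is only an approximation: Python's `re` module has no `\R`, so the line break is spelled out as `(?:\r\n|\n|\r)`, and the sample text is a shortened, invented stand-in for the XML in the question.

```python
import re

# Python's re has no \R, so the possible line breaks are spelled out
# (an assumption of this sketch, not Notepad++ syntax).
EOL = r"(?:\r\n|\n|\r)"

# Same three groups as the Find expression:
# the START line, one or more non-empty lines, the END line.
PATTERN = re.compile(rf"(START-DELETING.+{EOL})(.+{EOL})+(END-DELETING{EOL})")

# Shortened, invented stand-in for the XML in the question.
sample = (
    "<LastName></LastName>\n"
    "START-DELETING... <Photo>data:image/png;base64,AAAA\n"
    "BBBB\n"
    "</Photo>\n"
    "END-DELETING\n"
    "<GroupList></GroupList>\n"
)

# Replacing with an empty string deletes the markers and everything between them.
cleaned = PATTERN.sub("", sample)
print(cleaned)
```

Only the two lines outside the markers should remain in `cleaned`.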

        Hope this helps.

        Terry

        • guy038

          Hello, @md-abdullah-al-noman, @terry-r and All,

          I think that the following regex S/R could be used, too:

          SEARCH (?s-i)^\h*START-DELETING.+?END-DELETING\R

          REPLACE Leave EMPTY

          Notes:

          • First, the (?s-i) modifiers mean that, from now on:

            • Any regex dot symbol ( . ) matches absolutely any single character ( standard ones and EOL ones )

            • The search is case-sensitive ( the -i part switches case-insensitivity off )

          • Then, the part ^\h*START-DELETING looks, from the beginning of a line ( ^ ), for the upper-case string START-DELETING, possibly preceded by some horizontal whitespace characters ( space or tab )

          • At the end, the part END-DELETING\R searches for the upper-case string END-DELETING, followed by its line-break character(s)

          • The middle part .+? represents the shortest range of any characters between the two strings START-DELETING and END-DELETING

          • Finally, as the replacement is empty, the whole match is simply deleted
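The notes above can be checked with a rough Python sketch. Python's `re` supports neither `\h` nor `\R`, nor the `(?s-i)` form at the start of a pattern, so this version assumes `[ \t]`, an explicit line-break group and the `re.DOTALL` / `re.MULTILINE` flags as stand-ins. The sample text is invented.

```python
import re

FLAGS = re.DOTALL | re.MULTILINE  # DOTALL plays the role of (?s); ^ needs MULTILINE in Python
EOL = r"(?:\r\n|\n|\r)"           # stand-in for \R

# Lazy version, as in the SEARCH expression (matching is case-sensitive by default).
lazy = re.compile(r"^[ \t]*START-DELETING.+?END-DELETING" + EOL, FLAGS)
# Greedy version, only for contrast.
greedy = re.compile(r"^[ \t]*START-DELETING.+END-DELETING" + EOL, FLAGS)

# Invented sample: two marked blocks with one line to keep between them.
sample = (
    "first\n"
    "  START-DELETING junk\n"
    "more junk\n"
    "END-DELETING\n"
    "keep me\n"
    "START-DELETING\n"
    "other junk\n"
    "END-DELETING\n"
    "last\n"
)

lazy_result = lazy.sub("", sample)      # each block is removed separately
greedy_result = greedy.sub("", sample)  # runs from the first START to the last END

print(repr(lazy_result))
print(repr(greedy_result))
```

The lazy quantifier is what keeps the line between the two blocks; the greedy variant would delete it along with both blocks.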

          Best regards,

          guy038

          • David Bennett

            @guy038 you truly are a legend, I agree with the other poster. You are so deep into notepad++ regex, impressive!
            I believe you may also know this - IMHO quite common - case, although I can’t find it described anywhere:

            Suppose you have just one large file (wordpress sql database in fact, opened in my favorite editor notepad++) and STRING A and STRING B should always belong together:
            FIND ALL INSTANCES OF ANY TEXT across lines
            WHERE STRING A sometime later
            IS FOLLOWED BY ANOTHER STRING A
            INSTEAD OF THE “CLOSING” STRING B

            Example: Find all instances where, across lines, there’s the literal string [/social]
            and, after any kind and number of characters, another literal string [/social],
            BUT nowhere between the two is there a literal string [social], although there should be, because [social] and [/social] belong together.

            So basically, in the example case, string A and string B always belong together; two A’s or two B’s must never follow each other. Always the A string, then the B string, then again the A string, then the B string, etc. And so you need to find any “fault”: where A is followed sometime later by another A, instead of first a B string.

            Did I explain this well enough?

            I am sure none of the above, nor anything else I have found, works because I’ve tried them all. Would you have an idea how to go about this?

            • guy038

              Hello @david-bennett, and All,

              Thanks, David. You explained your problem very well. So you’re looking for ranges [social].......[/social] where, unfortunately, one boundary, either [social] or [/social], is missing, aren’t you?

              As a sample, in the text below, I have indicated where a boundary is missing:

              ...[social].......[/social]..............[/social]........[social]............[social]..........[/social]...
                                               ^                                     ^
                                        [social] missing                     [/social] missing
              

              BTW, I also assume that your database does NOT contain nested blocks [social].......[/social] such as, for instance:

              [social]....[social].....[social].....[social].....[/social]....[/social].....[/social]......[social].....[/social]....[/social]...
              

              In that case, a possible regex could be :

              SEARCH (?-s)(?<=\[/social\])((?!\[social\]).)+?(?=\[/social\])|(?<=\[social\])((?!\[/social\]).)+?(?=\[social\])

              If you apply this regex against the text below, it selects all the zones where a boundary is missing!

              ...[social].......[/social]..............[/social]........[social]............[social]..........[/social]...
                                        >              <                       >            <
              

              Note that :

              • If the selection is surrounded by two [social] boundaries, then a [/social] ending boundary is missing inside this selection

              • If the selection is surrounded by two [/social] boundaries, then a [social] starting boundary is missing inside this selection


              If your text may be split over several lines, preferably use this almost identical regex, which is also correct for ONE-line blocks [social].......[/social]!

              SEARCH (?s)(?<=\[/social\])((?!\[social\]).)+?(?=\[/social\])|(?<=\[social\])((?!\[/social\]).)+?(?=\[social\])

              ...[social].......[/social]....
                                        >
              .....
              .....[/social].....
                   <
              .......
              ...[social].......[/social]..............[/social]........[social]............[social]..........[/social]...
                                        >              <                       >            <
              ......
              ...[social]...
                        >
              ....
              .....[social]...
                   <
              .......[/social]...
              

              Notes :

              • The square brackets need to be escaped with the \ character, as they have a special meaning, in regular expressions

              • At the beginning, the (?-s) or (?s) modifier determines if the dot meta-character ( . ) represents a single standard character only, or any character

              • Then the regex engine tries to match one of the two alternatives :

                • (?<=\[/social\])((?!\[social\]).)+?(?=\[/social\])

                • (?<=\[social\])((?!\[/social\]).)+?(?=\[social\])

              • The first alternative matches the smallest range of characters ( (....)+? ), surrounded by two strings [/social], due to the look-behind (?<=\[/social\]) and the look-ahead (?=\[/social\])

              • The second alternative matches the smallest range of characters ( (....)+? ), surrounded by two strings [social], due to the look-behind (?<=\[social\]) and the look-ahead (?=\[social\])

              • In the first alternative, this range must not contain, at any position, the string [social], due to the negative look-ahead, in the construction (?!\[social\]).

              • In the second alternative, this range must not contain, at any position, the string [/social], due to the negative look-ahead, in the construction (?!\[/social\]).
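To see the two alternatives at work outside Notepad++, here is a small Python sketch. It assumes `re.DOTALL` as the equivalent of `(?s)` and uses non-capturing groups so that each match is simply the selected zone. The sample string is invented, with one orphan closer and one orphan opener.

```python
import re

# Non-capturing groups are used here; otherwise this is the multi-line search above.
PATTERN = re.compile(
    r"(?<=\[/social\])(?:(?!\[social\]).)+?(?=\[/social\])"
    r"|(?<=\[social\])(?:(?!\[/social\]).)+?(?=\[social\])",
    re.DOTALL,
)

# Invented sample: the second [/social] has no opener,
# and the second [social] follows an opener that was never closed.
text = (
    "x[social]ok[/social]\n"
    "ORPHAN-CLOSE[/social] gap [social] ORPHAN-OPEN [social]ok[/social]y"
)

# Each match is a zone enclosed by two boundaries of the same kind.
zones = [m.group() for m in PATTERN.finditer(text)]
print(zones)
```

The first zone spans a line break, which is why the dot has to be allowed to match EOL characters in the multi-line case.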

              Best Regards,

              guy038

                • David Bennett

                  Hey @guy038, thanks for replying! And hello to every notepad++ user.

                  So you are suggesting, in words,

                  • match a prefix, here [/social], but exclude it from the capture
                  • capture a group:
                    — if suffix is absent, here [social]
                    — and any character, one or more times, but as few as possible
                  • match a suffix, here [/social], but exclude it from the capture

                  Is that worded right?

                  Earlier I had tried many variations with the look-behind and look-ahead as well, because this simple construct makes so much sense, and then, in between, tried to exclude captures where [social] appears, like it normally should.
                  Your group notation ((?!\[social\]).)+? however I hadn’t tried; thanks for this new variation in my assortment. I always used exclusion notations like .*?(?!\[social\]) and even tried .*?[^(\[social\])], which I think is wrong regex syntax in Notepad++ too.
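The difference between the two exclusion notations can be shown with a short Python sketch (an illustration on an invented string, not a test in Notepad++ itself). A negative look-ahead placed after `.*?` only constrains the position where the lazy match stops, whereas a look-ahead placed before each dot is checked at every character of the range.

```python
import re

# Invented sample: two closers with a stray opener between them.
text = "[/social] a [social] b [/social]"

# Look-ahead after .*? : only the stopping position is checked,
# so the range can still run across a [social].
naive = re.search(r"(?<=\[/social\]).*?(?!\[social\])(?=\[/social\])", text)

# Look-ahead before each dot: every character of the range is checked.
guarded = re.search(r"(?<=\[/social\])(?:(?!\[social\]).)+?(?=\[/social\])", text)

print(naive.group() if naive else None)
print(guarded)
```

Here the first pattern still returns a range containing `[social]`, while the second finds nothing, which is the behaviour wanted for the exclusion.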

                  Either way, unfortunately your regex too does not find the instance where [/social] follows a [/social] without the corresponding [social] in between.
                  (“Corresponding” because [social]…[/social] here is a “shortcode” in WordPress, but could be anything in other text-processing situations.)

                  Using your regex, in all the variations you gave, in my Notepad++ 7.5.8 highlights the entire text (here a database), i.e. it has 0 hits.

                  So I was wondering, could there be, logically, any kind of situation where

                  • notepad++ COUNT [social] has 117 hits

                  • and notepad++ COUNT [/social] has 118 hits
                    (as it does)

                  • and yet, this would NOT be due to the presumed occurrence of one end-marker, here [/social], missing its corresponding start-marker, here [social]?

                  Because if, logically, such situation is possible (despite that I myself can’t think of one), then your regex may be working despite that in my particular case it cannot find anything.

                  Did I explain this puzzle well enough?

                  Just to clarify, this is not “a Notepad++ oddity”: my Expresso regex software has 0 hits too with your proposed regex, and with all the notations that I had tried earlier.
                  If there is any oddity, then it must be right within the SQL database. But I can’t think of one. Can you or anyone else, maybe?


                  Edit: I went back up to add some extra characters as this comment software here seems to require ESCAPING (like Regex does), otherwise it shows a DIFFERENT text-to-be-posted (even on the right side WHILE you are writing) than which you input (even WHILE you input it on the left side).
                  Hopefully now the OUTPUT text matches my INPUT text…
                  How do YOU more easily get your Regex notations to show up LIKE YOU ENTER THEM? Yours come up in red on grey background?

                  I am posting this again, because “You are only allowed to edit posts for 180 second(s) after posting”… and then even “As new user you can only post once every 1200 seconds” - lol, such bureaucracy makes even genuine comments like mine needlessly difficult…

                  • guy038

                    Hi, @david-bennett, and All,

                    OK, my regex does not match anything. Quite disappointing :-(( But I don’t give up !

                    If you get no result, this means that either:

                    • My regular expression is not well constructed, or my approach to solving your problem is wrong

                    • Some characters in your text, or its general layout, prevent us from obtaining positive results

                    • Maybe both of the above apply together :-((

                    So, if you don’t mind, and if your data is neither confidential nor personal, you could send it ( or part of it ) to me. Here is, below, my e-mail address :

                    Working with real data is always better and, anyway, Notepad++ is really a Swiss Army knife ! So, no doubt about it: we will finally find an acceptable solution ;-))


                    Regarding the red on gray background, you can obtain it by wrapping your text between two grave accents ( ` )

                    For instance

                    • If you write `text in red on gray`                =>    normal text :          text in red on gray

                    • If you write *`text in red on gray`*            =>    text in italic :         text in red on gray

                    • If you write **`text in red on gray`**        =>    text in Bold :          text in red on gray

                    • If you write ***`text in red on gray`***    =>    text in Bold-Italic :          text in red on gray


                    Refer, also, to the excellent summary of the Markdown syntax, on our forum, below, by @scott-sumner !

                    https://notepad-plus-plus.org/community/topic/14262/how-to-markdown-code-on-this-forum/4

                    And this FAQ Desk: post, from @peterjones, will give some additional information :

                    https://notepad-plus-plus.org/community/topic/15739/faq-desk-request-for-help-without-sufficient-information-to-help-you/1

                    Best Regards,

                    guy038

                    P.S. :

                    David, it would be particularly interesting if you could send me the part where you got 117 hits for [social] and 118 hits for [/social] !

                    • David Bennett

                      First, thanks so much for replying! @guy038
                      Maybe you are an official here (hence why you know so much), either way, much appreciated taking the time!

                      Second,

                      “If you get no result, this means that, either :…”

                      Well no, like I hinted, the reason likely is at my end, lol:

                      “So I was wondering, could there be, logically, any kind of situation where notepad++ COUNT [social] has 117 hits, and notepad++ COUNT [/social] has 118 hits (as it does), and yet, this would NOT be due to the presumed occurrence of one end-marker, here [/social], missing its corresponding start-marker, here [social]? - Because if, logically, such situation is possible (despite that I myself can’t think of one), then your regex may be working despite that in my particular case it cannot find anything.”

                      Again, your regex may be working well in other raw texts :-)
                      So were you, or anyone else, able to think of a “logical” possibility/explanation of the above COUNTS?

                      Also, thanks a lot for your “markdown” explanation and for Scott’s helpful link. I multi-clipboarded it and just have to remember it; as you noticed, I already used the “quote” tip above.

                      Well, posting the raw db publicly certainly is not wise, but sending it to you would be no problem, I think. I assume you would find that your regex works in general, and maybe even find out why it doesn’t work here. So I think it would be good to know both, yes. :-)

                      Presumably, had you found a flaw in my initial assumption (above) you would have raised it, right @guy038 ?

                      If anyone else finds a flaw in it, shout it out loud, will ya?

                      • guy038

                        Hi, @david-bennett, and All,

                        Hum…, not totally sure that I understood your message properly, but the N++ COUNT feature always scans the entire current file, even if one or several selections are present, and whether the Wrap around option is set or not !

                        Of course, in Normal mode, for instance, the count result may be different if you tick/untick the Match case and/or the Match whole word only options

                        So, to my mind, if we assume that:

                        • The Match case AND the Match whole word only options are not ticked

                        • No parameter has been changed, in the Find dialog, between the two COUNT operations ( except for the Find what: zone, of course ! )

                        => The counts of the presumably well-balanced strings [social] and [/social] should always be identical. If NOT, it necessarily means that there are, indeed, one or several additional occurrences of one of the boundaries :-((
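The counting argument can also be checked with a linear scan instead of a look-around regex. The Python sketch below is only an illustration under the same assumption as before (blocks are never nested); the function name `find_unbalanced` and the sample string are made up for this example. It walks through the boundaries in order and reports each one that breaks the opener/closer alternation. It does not report a dangling opener at the very end of the text.

```python
import re

def find_unbalanced(text, name="social"):
    """Return (offset, description) for each boundary that breaks the alternation."""
    token = re.compile(r"\[(/?)" + re.escape(name) + r"\]")
    problems = []
    expect_open = True  # outside a block, the next boundary should be an opener
    for m in token.finditer(text):
        is_close = m.group(1) == "/"
        if is_close and expect_open:
            problems.append((m.start(), "closer without opener"))
        elif not is_close and not expect_open:
            problems.append((m.start(), "opener without closer"))
        # After a closer we expect an opener next, and vice versa.
        expect_open = is_close
    return problems

# Invented sample: one orphan closer and one unclosed opener.
sample = "a[social]b[/social]c[/social]d[social]e[social]f[/social]g"
print(sample.count("[social]"), sample.count("[/social]"))
print(find_unbalanced(sample))
```

Note that in this sample both counts are equal although two boundaries are misplaced: unequal counts prove that something is wrong, but equal counts alone do not prove that everything is balanced.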

                        Cheers,

                        guy038

                        • David Bennett

                          the N++ COUNT feature always scan the entire current file, even if one or several selections are present and whatever the Wrap around option is set or unset !

                          Agreed. No doubt about that. And regardless of the options set, both normal and regex search always find one more [/word] than [word]. I verified it for each potential oddity. ;-)
                          And yet, your earlier proposed regexes (is that a word?) give 0 hits. In regex search mode, lol.

                          Did you get the sql raw text I sent to the email you … made public? Did you find out if your suggested regex is working some other way in it?
                          I know that Notepad++ regex normally works, in this and any SQL db, because I am using it all the time, successfully. So now you’ve got me curious: what is your explanation for why your regexes fail in this SQL db?

                          • guy038

                            Hello, @david-bennett, and All,

                            As planned, I got your file without any trouble, and I first renamed it as a simple .txt file

                            When opened in N++, I also deleted your comments at the beginning of the file. So the first line is -- Dumping data for table `wp_posts`

                            Here are some characteristics of this working file :

                            • Size : 9 048 416 bytes

                            • Lines : 1,761 ( the smallest data line is 237 chars long and the longest is 336936 long ! )

                            • 118 occurrences of the string [sociallocker], in that exact case

                            • 119 occurrences of the string [/sociallocker], in that exact case

                            I quickly understood that each line, even a long one, contained only one block [sociallocker]........[/sociallocker] ! ( containing between 208 and 96081 characters )

                            Then, from above, we can deduce that the file contains :

                            • 118 correct blocks ......[sociallocker].....[/sociallocker].........

                            • 1 incorrect block ..........................[/sociallocker].........

                            IMPORTANT : Because of the very long lines of your file, I advise you not to use the Wrap feature ! So, uncheck the View > Word wrap menu option for quick navigation in the text and between tabs


                            Before checking why my previous regex does not work, it seemed better to build a simplified regex, which looks for any line containing the string [/sociallocker], but ONLY IF the string [sociallocker] cannot be found between the beginning of that line and the [/sociallocker] string. So, I first tried:

                            SEARCH (?-s)^((?!\[sociallocker\]).)+?(?=\[/sociallocker\])

                            Unfortunately, it wrongly grabbed all the file contents, due to a general failure of the regex engine !?

                            I had no luck, either, using a greedy quantifier with the regex below, which gave identical results

                            SEARCH (?-s)^((?!\[sociallocker\]).)+(?=\[/sociallocker\])

                            I then had the intuition that using a non-capturing group could be the solution. Hence, the regex:

                            SEARCH (?-s)^(?:(?!\[sociallocker\]).)+?(?=\[/sociallocker\])

                            Bingo ! As expected, it did match all the contents of line 516, up to the [/sociallocker] string ;-))

                            Strangely, the greedy syntax below leads to a catastrophic breakdown, too, as above !

                            SEARCH (?-s)^(?:(?!\[sociallocker\]).)+(?=\[/sociallocker\])

                            I suppose that it’s due both to the backtracking steps needed by the regex engine and to the presence of the look-ahead feature :-((

                            On reflection, it seemed to me that we could remove the look-ahead, too, as we just want to identify the line without the starting boundary. Hence, the regex below, with a lazy quantifier:

                            SEARCH (?-s)^(?:(?!\[sociallocker\]).)+?\[/sociallocker\]

                            Whaooou ! Again, line 516 is correctly matched ! And, this time, the same regex with the greedy quantifier, below, works too !

                            SEARCH (?-s)^(?:(?!\[sociallocker\]).)+\[/sociallocker\]

                            Here is, below, the general layout of the three lines, 514, 515 and 516

                             N°  1234567890...             col 24388                  col 43781     col 43795
                                                              v                            v             v
                            514  (4437,...................... [sociallocker]...............[/sociallocker]...4285 chars.....CRLF
                            515  INSERT ...368 chars.....CRLF
                            516  (4441,..........................................................................................[/sociallocker]......CRLF
                                                                                                                                 ^             ^ 
                                 1234567890...                                                                              col 55224      column 55238
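The last simplified regex can be tried on a tiny stand-in for lines 514 to 516. This Python sketch is hedged accordingly: the three sample lines are invented miniatures (the real ones are tens of thousands of characters long), `(?-s)` is dropped because Python's dot already stops at a line break unless `re.DOTALL` is set, and the failures described above are specific to the Notepad++ regex engine on a very large file, so they are not reproduced here.

```python
import re

# In Python, ^ matches at each line start only with re.MULTILINE.
PATTERN = re.compile(r"^(?:(?!\[sociallocker\]).)+\[/sociallocker\]", re.MULTILINE)

# Invented miniature of lines 514, 515 and 516.
dump = "\n".join([
    "(4437, 'intro [sociallocker]hidden[/sociallocker] tail'),",
    "INSERT ...",
    "(4441, 'no opener on this line [/sociallocker] tail'),",
])

# 1-based numbers of the lines whose first boundary is a closer.
hits = [dump.count("\n", 0, m.start()) + 1 for m in PATTERN.finditer(dump)]
print(hits)
```

Only the third sample line, the counterpart of line 516, is reported.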
                            

                            Before coming back to the regex of my previous post, let us verify the number of ranges between two ending boundaries [/sociallocker] with the simple regex below:

                            SEARCH (?s)(?<=\[/sociallocker\]).+?(?=\[/sociallocker\])

                            We get 118 hits. This is quite logical, as we have 119 strings [/sociallocker] => 118 intervals ( which contain between 2,657 and 1,339,177 chars ! )

                            Remark:

                            You might ask: why not just use (?s)\[/sociallocker\].+?\[/sociallocker\], without the two look-arounds? Well, it’s not exactly the same: after a match, the search resumes after the ending boundary just consumed, so, unfortunately, it skips one interval after each match :-((
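The remark about skipped intervals is easy to reproduce on a toy string. In this Python sketch (the sample text is invented) there are four ending boundaries, hence three intervals. The look-around version finds all three, while the version that consumes the boundaries finds only two matches and skips the interval in the middle.

```python
import re

# Invented sample: 4 ending boundaries => 3 intervals (a, b, c).
text = "[/sociallocker]a[/sociallocker]b[/sociallocker]c[/sociallocker]"

# Boundaries are only looked at, never consumed, so every interval is found.
with_lookarounds = re.findall(r"(?s)(?<=\[/sociallocker\]).+?(?=\[/sociallocker\])", text)

# Boundaries are part of the match, so the closing one cannot open the next match.
consuming = re.findall(r"(?s)\[/sociallocker\].+?\[/sociallocker\]", text)

print(with_lookarounds)
print(len(consuming))
```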


                            Well, you remember that, in my previous post, the regex contained two alternatives. Naturally, I thought it was safer to test them , separately :

                            So, the regex, below, tries to match two starting boundaries [sociallocker], without any ending boundary [/sociallocker], inside :

                            SEARCH (?s)(?<=\[sociallocker\])(?:(?!\[/sociallocker\]).)+?(?=\[sociallocker\])

                            => ~ 22 seconds ,later, on my old laptop, it gets no results, as expected ! Rather reassuring, isn’t it ?

                            Similarly, here is the regex, which tries to match two ending boundaries [/sociallocker], without any starting boundary [sociallocker], inside :

                            SEARCH (?s)(?<=\[/sociallocker\])(?:(?!\[sociallocker\]).)+?(?=\[/sociallocker\])

                            => 12 seconds later, it did select line 516, up to its lone boundary [/sociallocker] ;-))
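                            As a small-scale check of these two regexes, here is a hedged Python sketch. The sample text is invented ( its third block has lost its opening boundary, like line 516 ), and Python's re module is only assumed to agree with the Notepad++ engine on such a tiny input :

```python
import re

OPEN, CLOSE = "[sociallocker]", "[/sociallocker]"

# Two consecutive openings with no closing boundary in between.
TWO_OPENS = re.compile(
    r"(?s)(?<=\[sociallocker\])(?:(?!\[/sociallocker\]).)+?(?=\[sociallocker\])")
# Two consecutive closings with no opening boundary in between.
TWO_CLOSES = re.compile(
    r"(?s)(?<=\[/sociallocker\])(?:(?!\[sociallocker\]).)+?(?=\[/sociallocker\])")

# Invented sample: the block around "D" has no opening boundary.
sample = OPEN + "A" + CLOSE + "B" + OPEN + "C" + CLOSE + "D" + CLOSE + "E"

print(TWO_OPENS.findall(sample))   # ranges between two openings
print(TWO_CLOSES.findall(sample))  # ranges between two closings
```

                            Only the second regex finds something here, namely the range that sits between two closings.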


                            Now, it’s time to do the final tests, with the overall regex :

                            SEARCH (?s)(?<=\[/sociallocker\])(?:(?!\[sociallocker\]).)+?(?=\[/sociallocker\])|(?<=\[sociallocker\])(?:(?!\[/sociallocker\]).)+?(?=\[sociallocker\])

                            But again, one wrong match of all the file contents :-((

                            If we swap the two alternatives, no luck either !

                            SEARCH (?s)(?<=\[sociallocker\])(?:(?!\[/sociallocker\]).)+?(?=\[sociallocker\])|(?<=\[/sociallocker\])(?:(?!\[sociallocker\]).)+?(?=\[/sociallocker\])

                            Remark :

                            I did additional tests to find the limit, and it turns out that these two final regexes match correctly ONLY IF the size of the file is, roughly, under 5 MB. Presumably, it depends on the amount of RAM in your configuration and/or on some flaws in our regex engine !

                            Anyway, there are usually one or several alternative solutions ! For instance, I’m thinking of this simple one :

                            • Copy all your data in a new tab

                            • Mark, with bookmarks, in Normal search, all lines containing the string [/sociallocker]

                            • Use the command Search > Bookmark > Remove Unmarked Lines => your file is shortened to 119 lines

                            • Mark, with bookmarks, in Normal search, all lines containing the string [sociallocker]

                            • Use, again, the Search > Bookmark menu, but this time with the command Remove Bookmarked Lines => your file should now contain 1 line, beginning with :
                              (4441, 1, ‘2014-08-23 11:18:50’, ‘2014-08-23 11:18:50’ : our expected line !

                            • Now, searching your original text, in Regular expression mode, for the regex ^\(4441, should clearly identify line 516 :-))
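                            The goal of these steps ( isolating the lines that contain an ending boundary but no starting boundary ) can also be sketched in a few lines of Python. This is only an illustration : the function name and the three-line sample are invented, and a real run would read the lines from the actual file :

```python
def lines_missing_opening(lines, opening="[sociallocker]", closing="[/sociallocker]"):
    """Return (line_number, line) pairs that contain the closing boundary
    but not the opening one. Line numbers start at 1, as in the editor."""
    return [(number, line)
            for number, line in enumerate(lines, start=1)
            if closing in line and opening not in line]

# Invented three-line sample standing in for the real SQL dump.
sample = [
    "(4437, ... [sociallocker] ... [/sociallocker] ...",
    "INSERT ...",
    "(4441, ... [/sociallocker] ...",
]
print(lines_missing_opening(sample))
```

                            A plain substring test is enough here, because the string [sociallocker] never occurs inside [/sociallocker].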

                            Cheers,

                            guy038

                            P.S. :

                            Looking again at your file, I noticed that the starting and ending boundaries are embedded, as below :

                            • <span class=\"sharely\">[sociallocker]</span>

                            • <span class=\"sharely\">[/sociallocker]</span>

                            and, indeed, one complete block <span class=\"sharely\">[sociallocker]</span> is missing, in line 516

                            • David BennettD
                              David Bennett @guy038
                              last edited by

                              Huh, you write SO MUCH @guy038, I too would have preferred you left certain details out… ;-)

                              Let’s get to the gist: If I understood correctly, you found out

                              • the N++ regex engine has some “shortages” if the src txt exceeds ~ 5 Mb
                              • and in my particular case the regex that works is … below, lol

                              Working regex, in this sql and likely many other sql “texts”:
                              (?-s)^(?:(?!\[social\]).)+?(?=\[/social\])

                              The difference to your earlier regexes is tiny:
                              (?s)(?<=[social])((?![/social]).)+?(?=[social])

                              Wow, thanks so much! I have learned a lot along the way… am sure all others reading this have too.
                              Couldn’t see how you saw things like “longest line”, but am sure, practice makes perfect, with N++. You surely have a LOT of practice. :-)

                              To be honest, I have to chew on the difference in regex a bit more… when my mind is clearer/emptier. Again, thank you!!

                              • Scott SumnerS
                                Scott Sumner @David Bennett
                                last edited by

                                @David-Bennett said:

                                Using your regex in all variations you gave…highlights the entire text…

                                This highlighting-of-the-entire-text thing often comes up when the regex engine encounters a “catastrophic error”. There is some more info on it here if you care to read about it. Yes, it seems to be a bug in Notepad++, but it usually appears when a less-than-ideal regex is crafted to solve a problem.

                                Where does a less-than-ideal regex come from? It often comes from the helpers on this forum trying to craft a regex to solve someone’s problem, without having the full benefit of exposure to the data that is being acted upon. Usually all they have is a really vague description of the data (although your description from the very beginning was really good!) or a few sample lines that may or may not be truly representative. Notice that once @guy038 got ahold of your real data file, he was able to craft something which met your need and avoided this problem.

                                • guy038G
                                  guy038
                                  last edited by guy038

                                  Hi, @david-bennett, and All,

                                  Couldn’t see how you saw things like “longest line”, but am sure, practice makes perfect, with N++. You surely have a LOT of practice. :-)

                                  Well… not practice, just a bit of logic and the appropriate regexes, as always ;-))


                                  If M and N are integers and, by convention, Count(regex) represents the total number of matches of a regex in the current file :

                                  • Count((?-s)^.{N,}) gives the number of lines containing at least N character(s)

                                  • Count((?-s)^.{M,N}$) gives the number of lines containing between M and N character(s)

                                  • Count((?-s)^.{1,N}$) gives the number of non-empty lines containing at most N character(s)
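                                  For readers who prefer to experiment outside Notepad++, the Count convention can be approximated in Python. This is a sketch under assumptions : the helper name count is invented, the sample text uses plain \n line endings, and the (?-s) prefix is dropped because, in Python's re module, the dot already excludes the newline by default :

```python
import re

def count(regex, text):
    """Number of matches of regex in text; re.MULTILINE makes ^ and $ work per line."""
    return sum(1 for _ in re.finditer(regex, text, flags=re.MULTILINE))

# Invented sample: lines of 2, 4, 0 and 6 characters.
text = "ab\nabcd\n\nabcdef"

print(count(r"^.{3,}", text))    # lines with at least 3 characters
print(count(r"^.{2,4}$", text))  # lines with between 2 and 4 characters
print(count(r"^.{1,4}$", text))  # non-empty lines with at most 4 characters
```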


                                  Thus, if we want to know ( just out of curiosity ! ) how many characters the longest line of a file contains, simply use something similar to the mathematical bisection/dichotomy method :

                                  https://en.wikipedia.org/wiki/Bisection_method

                                  • Let’s begin with Count ((?-s)^.{1000000,}). If result > 0, retry with numbers over one million !

                                  • If result = 0, retry with Count ((?-s)^.{500000,}) ( Case of your file )

                                    • if result > 0, retry with Count ((?-s)^.{750000,})

                                    • if result = 0, retry with Count ((?-s)^.{250000,}) => With your file, we get 1 hit. Nice !

                                  Then, just find that line, with the Find button, and read the Ln: and Sel: zones of the status bar ;-)) ( => Line 20, 330936 chars )

                                  Note that you could carry on with the method, taking the midpoint of the relevant interval as the number N at each step !
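                                  The manual bisection above can be automated. Here is a minimal Python sketch of the idea : the function names are invented, the upper bound is an assumption ( no line is supposed to be longer than it ), and the regex is the Python equivalent of (?-s)^.{N,} :

```python
import re

def has_line_of_at_least(text, n):
    """True if some line of text contains at least n characters."""
    return re.search(r"^.{%d,}" % n, text, flags=re.MULTILINE) is not None

def longest_line_length(text, upper=1_000_000):
    """Bisection for the largest n such that a line of >= n characters exists,
    assuming (hypothetically) that no line is longer than `upper`."""
    low, high = 0, upper          # has_line_of_at_least(text, 0) is always true
    while low < high:
        mid = (low + high + 1) // 2
        if has_line_of_at_least(text, mid):
            low = mid             # a line this long exists: look higher
        else:
            high = mid - 1        # no such line: look lower
    return low

# Invented sample: lines of 2, 7 and 3 characters.
text = "ab\nabcdefg\nabc"
print(longest_line_length(text, upper=1000))
```

                                  Each step halves the interval, so even an upper bound of one million needs only about 20 searches.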


                                  Similarly, to get the shortest line of a file ( note that the logic is reversed ! ) :

                                  • Let’s begin with Count ((?-s)^.{1,1000}$). If result = 0, retry with a number over 1000, let’s say 5000

                                  • If result > 0, retry with Count ((?-s)^.{1,500}$) ( Case of your file )

                                    • if result = 0, retry with Count ((?-s)^.{1,750}$)

                                    • if result > 0, retry with Count ((?-s)^.{1,250}$) => With your file, we’re lucky: we get 1 hit, again !

                                  So, find that line and read the Ln: and Sel: zones of the status bar ! ( => Line 1705, 237 chars )
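                                  The reversed logic can be sketched the same way in Python. Again, the names are invented and the default upper bound of 1000 is only an assumption; like the regex (?-s)^.{1,N}$, this sketch ignores empty lines :

```python
import re

def has_line_of_at_most(text, n):
    """True if some non-empty line of text contains at most n characters."""
    return re.search(r"^.{1,%d}$" % n, text, flags=re.MULTILINE) is not None

def shortest_line_length(text, upper=1000):
    """Bisection for the smallest n >= 1 such that a non-empty line of
    at most n characters exists; None if no line fits within `upper`."""
    if not has_line_of_at_most(text, upper):
        return None
    low, high = 1, upper
    while low < high:
        mid = (low + high) // 2
        if has_line_of_at_most(text, mid):
            high = mid            # such a line exists: look lower
        else:
            low = mid + 1         # no such line: look higher
    return low

# Invented sample: lines of 6, 3, 0 and 8 characters.
text = "abcdef\nabc\n\nabcdefgh"
print(shortest_line_length(text))
```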

                                  Cheers,

                                  guy038

                                  • Scott SumnerS
                                    Scott Sumner @guy038
                                    last edited by

                                    @guy038 said:

                                    if we want to know…how many characters…the longest line of a file, simply use…the mathematical bisection/dichotomy method

                                    Well…that’s a whole lot of trial and error work. How about a little Pythonscript?:

                                    longest_line_length = 0
                                    shortest_line_length = None  # None until the first line has been seen

                                    def fel(line_contents, line_number, total_lines):
                                        # Called by editor.forEachLine() once for each line of the current file
                                        global longest_line_length, shortest_line_length
                                        llc = len(line_contents.rstrip('\r\n'))  # length without the line-ending
                                        if llc > longest_line_length:
                                            longest_line_length = llc
                                        if shortest_line_length is None or llc < shortest_line_length:
                                            shortest_line_length = llc

                                    editor.forEachLine(fel)

                                    notepad.messageBox('The longest line in the current file is {}; the shortest is {}.'.format(longest_line_length, shortest_line_length), 'INFO')
                                    

                                    When run it pops up a box with the results, for example:

                                    [Image: screenshot of the resulting message box]
