
Find line above given text in document

27 Posts 9 Posters 3.4k Views
  • B
    Benji2025 @Ekopalypse
    last edited by Benji2025 Apr 1, 2025, 2:44 PM Apr 1, 2025, 2:33 PM

    @Ekopalypse

    Hi Eko, thank you for the quick response. I have managed to sort it now using Python.

    Tooltip in screenshot.

    7deb9ff6-4225-4a3b-81ec-2824167d8330-image.png

    No invisible characters by the way, and it works fine with a smaller amount of lines.

    • E
      Ekopalypse @Benji2025
      last edited by Apr 1, 2025, 5:25 PM

      @Benji2025 said in Find line above given text in document:

      The suggested method works fine on a small number of lines but with millions of lines

      I tested this with 120_000_000 lines and it worked for me, but to be honest, the file was only 1.5 GB, so … your file must be much larger accordingly. Unfortunately I don’t know the internals well enough to say at what size this becomes a problem.

      c5cafef1-7945-4c85-bab7-e0567b13ca81-{B52BFEFA-92C5-4B2D-A551-4DDD7409CB32}.png

      I have managed to sort it now using Python

      hehe … from my point of view that is ALWAYS the solution :-D

      • G
        guy038
        last edited by guy038 Apr 1, 2025, 5:50 PM Apr 1, 2025, 5:44 PM

        Hello, @benji2025, @ekopalypse, @alan-kilborn and All,

        Oh, My God, I’ve been beaten by @ekopalypse :-((

        @benji2025, I’d really like to know the average size of your files and their number of lines !

        Indeed, I did a test with a file of size 143,151,374 bytes, containing 3,151,513 lines of 47 characters each. Both my regex and yours worked fine and marked 121,212 lines !!

        So, your regex that I used is (?-s)^.*$(?=\RTEST)

        And I used a similar syntax (?-s)^.*\R(?=TEST$)


        On my old Win XP machine, with the N++ v7.9.2 release :

        • Your regex did the marking operation in about 24.3 seconds

        • My regex did the marking operation in about 23.2 seconds

        So, I suppose that you should try my regex version !

        Best Regards,

        guy038

        • B
          Benji2025 @guy038
          last edited by Apr 2, 2025, 8:02 AM

          @guy038 @Ekopalypse

          Thanks guys

          Just the 9 million lines at 1GB, and that is just the short run of a job I am running.

          Same error with guy038’s syntax.

          197094a7-ce8a-48c8-a66e-8d0ffd20d25a-image.png

          • M
            Mark Olson @Ekopalypse
            last edited by Apr 2, 2025, 3:37 PM

            @Ekopalypse said in Find line above given text in document:

            I have managed to sort it now using Python

            hehe … from my point of view that is ALWAYS the solution :-D

            Yeah, to expand on that for the benefit of others who don’t know: Python’s re library is usually at least 10x faster than Notepad++'s built-in search capability, such that it is vastly better when searching extremely large files. The Columns++ and MultiReplace plugins, while very powerful in their own right, will AFAIK never be much faster than the Notepad++ find/replace form because they also do their search-replace operations through Scintilla.

            @PeterJones
            At this point I’ve repeated this PSA enough times that it should probably be added to one of the FAQ’s, maybe this one?
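
For readers who want to try the standalone-Python route, here is a minimal sketch. It is not the script @Benji2025 used (that was never posted in this thread); the function name `lines_before` and the sample lines are made up for illustration. The idea is to walk through the lines once and remember only the previous one, so memory use stays flat however large the input is.

```python
from typing import Iterable, Iterator, Optional


def lines_before(lines: Iterable[str], marker: str) -> Iterator[str]:
    """Yield every line that is immediately followed by a line starting with marker."""
    previous: Optional[str] = None
    for line in lines:
        current = line.rstrip("\r\n")  # tolerate both LF and CRLF endings
        if previous is not None and current.startswith(marker):
            yield previous
        previous = current


# Hypothetical sample data standing in for the real log file.
sample = [
    "copy a.txt\n",
    "copy b.txt\n",
    "Access is denied\n",
    "copy c.txt\n",
    "Access is denied\n",
]
hits = list(lines_before(sample, "Access is denied"))
print(hits)
```

Because an open file object is also an iterable of lines, the same function can be fed a file handle directly, which is what keeps it usable on multi-gigabyte files.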

            • C
              Coises @Mark Olson
              last edited by Apr 2, 2025, 5:06 PM

              @Mark-Olson said in Find line above given text in document:

              The Columns++ and MultiReplace plugins, while very powerful in their own right, will AFAIK never be much faster than the Notepad++ find/replace form because they also do their search-replace operations through Scintilla.

              A minor technical quibble: the regex search in Columns++ does not use Scintilla search. While it does search within Scintilla’s buffer (avoiding a copy, using a documented interface that exposes the content as addressable bytes), it uses Boost::regex directly, with its own custom iterators, rather than the Scintilla search interface.

              Still, this probably only makes it approximately equal in speed relative to Notepad++, since it does use the same regular expression engine in essentially the same way.

              • C
                Coises @guy038
                last edited by Apr 2, 2025, 5:22 PM

                @guy038

                I’ll have to see if I can test this with some large files, but does it not strike you as very strange that this expression:

                (?-s)^.*$(?=\RAccess is denied)

                should yield a complexity error, regardless of the data? Complexity errors are supposed to happen when the same text keeps getting rescanned; that is, not just when the operation takes a long time, but when the number of bytes examined is growing “too much” faster than the start point is being moved forward (or else when internal stacks overflow preset bounds). This expression shouldn’t cause that. It doesn’t backtrack.

                @Benji2025 — I know you’ve solved your problem, but if you are still reading and interested: Does the same thing happen on the same data with:

                (?-s)^.*+(?=\RAccess is denied)

                That shouldn’t matter, but maybe the regex engine isn’t as smart as I think it is.

                • P
                  PeterJones @Mark Olson
                  last edited by PeterJones Apr 2, 2025, 5:27 PM Apr 2, 2025, 5:27 PM

                  @Mark-Olson said in Find line above given text in document:

                  At this point I’ve repeated this PSA enough times that it should probably be added to one of the FAQ’s, maybe this one?

                  The suggestion to “use Python” is not necessarily the same as the suggestion to “use PythonScript”; I am not sure whether @Benji2025 was using the standalone python interpreter, or using the Notepad++ automation plugin to search the open file. I am also not sure whether your comment about performance is saying “python.exe’s re library is faster” or “using PythonScript and the re library is faster” or “using PythonScript and its re-like editor.research() is faster”. Because those are three different things.

                  The FAQ you pointed to is only about using PythonScript plugin (or other plugins) to do mathy-replacements; the generalized statement you seem to be making doesn’t seem to be restricted to mathy-replacements, so I’m not sure that’s the best place, even once the context of the claim is clarified.

                  So that we don’t clutter this specific question with workshopping, if you wanted to make an RFC post in the Blogs category to workshop a new FAQ entry, once everyone was happy with everything workshopped, I could create a new entry in the FAQ category and duplicate the final version of the post (with you as the author).

                  • T
                    Terry R @Coises
                    last edited by Terry R Apr 2, 2025, 8:48 PM Apr 2, 2025, 8:47 PM

                    @Coises said in Find line above given text in document:

                    I’ll have to see if I can test this with some large files, but does it not strike you as very strange that this expression:

                    I tried thinking a bit laterally about this issue, more specifically the error of “complexity” stopping the regular expression from completing.

                    Since we all seemed unsure of why it would generate such a message from what “seemed” to be a simple find expression I thought I would do a small amount of testing to see if the .* or the lookahead was likely to blame.

                    I created a file with the lines bla bla bla bla and Access is denied on a ratio of about 10 bla to 1 of Access. I got to 240M lines approx (so by my calculation around 3.6 GB) at which point the lags in updating NPP were significant. At this point I gave the original regex a try. Work called so after I completed that job I came back to a very sorry dual monitor Windows 11 system. One monitor had called it quits (Windows wouldn’t use it) and the smaller monitor had all windows squished on it. On the upside, the regex worked, whereas I actually wanted it to fail so I could hatch my next cunning step, a revised regex to continue testing.

                    So back to my idea of seeing which part might be the cause of the issue. Since the original request was to be able to mark the lines I considered not trying to “capture” the entire line, instead just use something like .\R(?=Access is denied$). Mark function will still mark the line even if only 1 character on that line is sought.

                    Another similar idea would have been to (again) just select the last character on a line and also select the following line with “access is denied.” If the reason for the “mark” was to extract them to another tab/file, then it would be much simpler to remove the “access is denied” line at that time.

                    There was also a 3rd idea. That is to remove the \R immediately before the “access is denied” line. Then use the mark function to mark the actual text “Access is denied”. Based on this another slightly modified idea is to copy the “Access is denied” to the end of the line above. Then Mark lines which have this text but not from the start of the line.

                    As shown, there are often many answers/solutions to the problem, especially if one is willing to divide and conquer. It is nice to give a one line solution but often complexities (inability to solve or easily adjust or user unable to comprehend the solution) will make a multiple step solution more palatable.

                    Terry
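
The join-then-search idea can be tried on a small sample before running it on a huge file. What follows is only a sketch using Python's `re` module rather than Notepad++ itself: the sample text is invented, and `\n` stands in for `\R`, which `re` doesn't support. Step 1 removes the line break in front of each `Access is denied` line, step 2 finds lines where that text appears but not at the start of the line, and step 3 puts the line breaks back.

```python
import re

MARKER = "Access is denied"
# Invented sample text with "\n" line endings.
text = "open a.txt\nopen b.txt\nAccess is denied\nopen c.txt\n"

# Step 1: join each marker line onto the line above it.
joined = re.sub(r"\n(?=" + re.escape(MARKER) + r"$)", "", text, flags=re.MULTILINE)

# Step 2: lines that end with the marker but have other text in front of it.
pattern = re.compile(r"^(.+)" + re.escape(MARKER) + r"$", re.MULTILINE)
lines_above = pattern.findall(joined)

# Step 3: restore the line break in front of the marker.
restored = pattern.sub(lambda m: m.group(1) + "\n" + MARKER, joined)

print(lines_above)
print(restored == text)
```

One caveat with this sketch: if two `Access is denied` lines follow each other directly, step 1 joins both onto the same line and step 3 restores only one break, so such data would need extra care.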

                    • C
                      Coises @Coises
                      last edited by Coises Apr 2, 2025, 9:54 PM Apr 2, 2025, 9:53 PM

                      I wrote:

                      I’ll have to see if I can test this with some large files

                      I have a moderately large file (19,473 lines, 119,172,867 bytes) I sometimes use for testing. It does not contain any lines which begin Access is denied.

                      This expression:
                      (?-s)^.*$(?=\RAccess is denied)
                      results in the complexity error message; this one:
                      (?-s)^.*+(?=\RAccess is denied)
                      correctly returns zero matches.

                      So the regular expression engine does not optimize the first expression to the equivalent second expression.
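
The two forms can also be compared outside Notepad++. The sketch below uses Python's `re` module, which is a different engine from Boost::regex, so it says nothing about the complexity error itself; it only checks that both forms select the same lines on a small sample. Assumptions: the sample text is made up, `\R` is replaced by `\n` because `re` doesn't support `\R`, `(?-s)` is dropped because the dot doesn't match a newline by default in Python, and possessive quantifiers need Python 3.11 or later (on older versions the code simply skips that form).

```python
import re

# Made-up sample data with "\n" line endings.
text = "first\nAccess is denied\nsecond\nthird\nAccess is denied\n"

# Greedy form with the "$" anchor.
greedy = re.compile(r"^.*$(?=\nAccess is denied)", re.MULTILINE)
greedy_hits = greedy.findall(text)

# Possessive form; Python versions before 3.11 reject ".*+" with re.error.
try:
    possessive = re.compile(r"^.*+(?=\nAccess is denied)", re.MULTILINE)
    possessive_hits = possessive.findall(text)
except re.error:
    possessive_hits = None  # possessive quantifiers not available here

print(greedy_hits)
print(possessive_hits)
```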

                      • T
                        Terry R @Coises
                        last edited by Terry R Apr 2, 2025, 10:16 PM Apr 2, 2025, 10:02 PM

                        @Coises said in Find line above given text in document:

                        So the regular expression engine does not optimize the first expression to the equivalent second expression.

                        So although the .*$ doesn’t appear to allow for backtracking, maybe that’s what the engine thinks is possible, hence the error. In the second the .*+ states “there CANNOT be any backtracking!”

                        So if the OP changed the $ to a + that alone might be sufficient to allow for the regex to work.

                        Terry

                        PS actually the more I think about it, the .*$ will allow for backtracking. My (Our) thinking likely has to change. Although $ is a meta-character, it still doesn’t have the “power” to command the engine to not backtrack, whereas the + does. It might even be such that any character at the $ position isn’t regarded as an anchor to prevent backtracking when/if deemed (possibly) needed by the engine. I’d be interested in going back over some of these errors reported and seeing if it is possible to add the possessive modifier and re-test.

                        • C
                          Coises @Terry R
                          last edited by Apr 2, 2025, 10:18 PM

                          @Terry-R said in Find line above given text in document:

                          Since we all seemed unsure of why it would generate such a message from what “seemed” to be a simple find expression I thought I would do a small amount of testing to see if the .* or the lookahead was likely to blame.

                          A truly strange result happened when I tried removing the lookahead and using plain old Count in the Find dialog. With my 19,473 line, 119,172,867 byte file, entering either of these expressions:
                          (?-s)^.*$
                          (?-s)^.*+
                          into the Find window and pressing Count causes Notepad++ to hang (“Not Responding”). I’ve waited over six minutes before force closing.

                          So, I tried to Count one of those expressions in the search in my Columns++ plugin, because (depending on the cause) that can show a progress meter for slow operations, and I wanted to get an idea what was happening.

                          It completed, with the correct answer (19,473, the number of lines) in around one second. The result is the same with either expression. (The original expressions behave as in Notepad++: the version with the dollar sign gets a complexity error, the version with a plus sign works.)

                          Now given that both use Boost::regex, I have no idea why Notepad++ hangs.

                          • C
                            Coises @Coises
                            last edited by Coises Apr 3, 2025, 1:53 AM Apr 2, 2025, 10:30 PM

                            @Coises said in Find line above given text in document:

                            A truly strange result happened when I tried removing the lookahead and using plain old Count in the Find dialog. With my 19,473 line, 119,172,867 byte file, entering either of these expressions:
                            (?-s)^.*$
                            (?-s)^.*+
                            into the Find window and pressing Count causes Notepad++ to hang (“Not Responding”). I’ve waited over six minutes before force closing.

                            Now given that both use Boost::regex, I have no idea why Notepad++ hangs.

                            Ugh. Doesn’t hang in 8.6.6 portable. Which gives me a bad feeling it could be related to PR #16208 .

                            Edit to add: Reported in 8.7.9 announcement as a regression. Still studying the cause.

                            • E
                              Ekopalypse @guy038
                              last edited by Apr 3, 2025, 10:08 AM

                              @guy038 said in Find line above given text in document:

                              Oh, My God, I’ve been beaten by @ekopalypse :-((

                              What’s that supposed to mean?? Lol - just kidding :-D

                              • G
                                guy038
                                last edited by guy038 Apr 3, 2025, 4:50 PM Apr 3, 2025, 4:43 PM

                                Hello, @ekopalypse,

                                Maybe it’s a language barrier ! It was, in no way, offensive to you !

                                I just wanted to say that your example seemed closer to @benji2025’s case and that it was impossible for me to compete with you ( 120,000,000 lines !) ;-)))

                                BR

                                guy038

                                • G
                                  guy038
                                  last edited by Apr 3, 2025, 4:49 PM

                                  Hi, @benji2025, @ekopalypse, @alan-kilborn, @coises and All,

                                  I did additional tests :

                                  First, if the Word wrap option is enabled, using the same regex as before, with the Match Case option checked, the BookMarking operation took about the same time : 23.2 seconds

                                  Secondly, after the BookMarking operation, if you re-run the same regex, the whole operation is done in 16 seconds. This seems logical because N++ does not have to re-bookmark the already bookmarked lines !

                                  Thirdly, if I do not check the Bookmark line option, in the Mark dialog, the Marking operation is a bit quicker : 19.4 seconds

                                  Fourthly, if I select all the contents of the test file ( => the In selection box is automatically checked ) the operation is a bit slower : 23.8 seconds


                                  Now, for all the tests below, I used these rules :

                                  • The Word wrap option is unchecked

                                  • In the Mark dialog, the Bookmark line option is checked and all the other box options are unchecked.

                                  • Generally the Match case option is unchecked but may be checked on a few occasions.

                                  • Before each search, I hit the Clear all marks button and place the caret at the very beginning of the test file.

                                  • I did my tests twice : on my old XP machine, with N++ v7.9.2 and on my new W10 laptop, with N++ v8.7.6 ( @coises, I avoided, on purpose, using the v8.7.8 and v8.7.9 releases ! )

                                  • Each time, I opened N++, from a command prompt window, with the command Notepad++ -nosession Benji.txt Test_Benji.txt, so with only these two files.

                                  • For the W10 test, I simply used a USB key containing the portable N++ v8.7.6 release and the test file.

                                  -------------------------- NEC XP - N++ v7.9.2 -------------------------------------------------------------------------------------------------------
                                  
                                  (?-s)^.*\R(?=TEST$)             24   s                         Option 'Match Case' unchecked   
                                  
                                  (?-s)^.*\R(?=TEST$)             23.1 s                         Option 'Match Case' checked    ( The test in my **previous** post )
                                  
                                  (?-s)^.+\R(?=TEST$)             23.9 s                         Option 'Match Case' unchecked   
                                  
                                  (?-s)^.*+\R(?=TEST$)            19.9 s   ( Atomic )            Option 'Match Case' unchecked   
                                  
                                  (?-s)^.++\R(?=TEST$)            19.8 s   ( Atomic )            Option 'Match Case' unchecked   
                                  
                                  (?-s)^.++\r\n(?=TEST$)          18.6 s   ( Atomic )            Option 'Match Case' unchecked   
                                  
                                  (?-s)^.++\r\n(?=TEST$)          17.7 s   ( Atomic )            Option 'Match Case' checked     
                                  
                                  (?-is)^.++\r\n(?=TEST$)         69   s   ( Atomic )  ( ?! )    Option 'Match Case' unchecked   
                                  
                                  --- Without the ^ symbol ---
                                  
                                  (?-s).*\R(?=TEST$)             25.6 s                         Option 'Match Case' unchecked   
                                  
                                  (?-s).+\R(?=TEST$)             23.1 s                         Option 'Match Case' unchecked   
                                  
                                  (?-s).*+\R(?=TEST$)            21.5 s   ( Atomic )            Option 'Match Case' unchecked   
                                  
                                  (?-s).++\R(?=TEST$)            19.2 s   ( Atomic )            Option 'Match Case' unchecked   
                                  
                                  (?-s).++\r\n(?=TEST$)          18.1 s   ( Atomic )            Option 'Match Case' unchecked   
                                  
                                  (?-s).++\r\n(?=TEST$)          17.25 s  ( Atomic )            Option 'Match Case' checked     
                                  
                                  (?-is).++\r\n(?=TEST$)        237   s   ( Atomic )  ( ?! )    Option 'Match Case' unchecked   
                                  
                                  --- Without the (?-s)^ part ---
                                  
                                  .*\R(?=TEST$)                  25.1 s                         Option 'Match Case' unchecked   
                                  
                                  .+\R(?=TEST$)                  22.8 s                         Option 'Match Case' unchecked   
                                  
                                  .*+\R(?=TEST$)                 21.2 s   ( Atomic )            Option 'Match Case' unchecked   
                                  
                                  .++\R(?=TEST$)                 18.9 s   ( Atomic )            Option 'Match Case' unchecked   
                                  
                                  .++\r\n(?=TEST$)               17.8 s   ( Atomic )            Option 'Match Case' unchecked   
                                  
                                  .++\r\n(?=TEST$)               17   s   ( Atomic )            Option 'Match Case' checked     
                                  
                                  (?-i).++\r\n(?=TEST$)         236   s   ( Atomic )  ( ?! )    Option 'Match Case' unchecked   
                                  
                                  --- Using the @Terry-R solution ---
                                  
                                  (?-s).\R(?=TEST$)              77   s               ( ?! )    Option 'Match Case' unchecked   
                                  
                                  (?-s).\r\n(?=TEST$)            71   s               ( ?! )    Option 'Match Case' unchecked   
                                  
                                  
                                  (?-s).{1}+\R(?=TEST$)          93   s   ( Atomic )  ( ?! )    Option 'Match Case' unchecked   
                                  
                                  (?-s).{1}+\r\n(?=TEST$)        84   s   ( Atomic )  ( ?! )    Option 'Match Case' unchecked   
                                  
                                  
                                  .{1}+\R(?=TEST$)               88   s   ( Atomic )  ( ?! )    Option 'Match Case' unchecked   
                                  
                                  .{1}+\r\n(?=TEST$)             79   s   ( Atomic )  ( ?! )    Option 'Match Case' unchecked   
                                  
                                  --- After CONCATENATION of the line BEFORE the line TEST with the line TEST ---
                                  
                                  First, the regex \R(?=TEST$) is replaced with NOTHING ( 57 s ) => 142,908,950 bytes for 3,030,301 lines. Then :
                                  
                                  TEST$                          20.8 s                         Option 'Match Case' unchecked   
                                  
                                  TEST$                          13.6 s                         Option 'Match Case' checked     
                                  
                                  (?-i)TEST$                     13.8 s                         Option 'Match Case' unchecked   
                                  
                                  Last, the regex TEST$ is replaced with \r\n$0  ( 65 s )
                                  
                                  -------------------------- HP Win 10 - N++ 8.7.6 -----------------------------------------------------------------------------------------------------
                                  
                                  (?-s)^.*\R(?=TEST$)             2.3  s                         Option 'Match Case' unchecked   
                                  
                                  (?-s)^.*\R(?=TEST$)             2    s                         Option 'Match Case' checked    ( The test in my **previous** post )
                                  
                                  (?-s)^.+\R(?=TEST$)             2.3  s                         Option 'Match Case' unchecked   
                                  
                                  (?-s)^.*+\R(?=TEST$)            1.9  s   ( Atomic )            Option 'Match Case' unchecked   
                                  
                                  (?-s)^.++\R(?=TEST$)            1.97 s   ( Atomic )            Option 'Match Case' unchecked   
                                  
                                  (?-s)^.++\r\n(?=TEST$)          1.86 s   ( Atomic )            Option 'Match Case' unchecked   
                                  
                                  (?-s)^.++\r\n(?=TEST$)          1.5  s   ( Atomic )            Option 'Match Case' checked     
                                  
                                  (?-is)^.++\r\n(?=TEST$)         5.8  s   ( Atomic ) ( ! )      Option 'Match Case' unchecked   
                                  
                                  ---
                                  
                                  --- Without the ^ symbol ---
                                  
                                  (?-s).*\R(?=TEST$)              2.7  s                        Option 'Match Case' unchecked   
                                  
                                  (?-s).+\R(?=TEST$)              2.3  s                        Option 'Match Case' unchecked   
                                  
                                  (?-s).*+\R(?=TEST$)             2.3  s  ( Atomic )            Option 'Match Case' unchecked   
                                  
                                  (?-s).++\R(?=TEST$)             1.9  s  ( Atomic )            Option 'Match Case' unchecked   
                                  
                                  (?-s).++\r\n(?=TEST$)           1.8  s  ( Atomic )            Option 'Match Case' unchecked   
                                  
                                  (?-s).++\r\n(?=TEST$)           1.45 s  ( Atomic )            Option 'Match Case' checked     
                                  
                                  (?-is).++\r\n(?=TEST$)         23.9  s  ( Atomic )  ( !? )    Option 'Match Case' unchecked   
                                  
                                  ---
                                  
                                  --- Without the (?-s)^ part ---
                                  
                                  .*\R(?=TEST$)                   2.7  s                        Option 'Match Case' unchecked   
                                  
                                  .+\R(?=TEST$)                   2.23 s                        Option 'Match Case' unchecked   
                                  
                                  .*+\R(?=TEST$)                  2.23 s  ( Atomic )            Option 'Match Case' unchecked   
                                  
                                  .++\R(?=TEST$)                  1.8  s  ( Atomic )            Option 'Match Case' unchecked   
                                  
                                  .++\r\n(?=TEST$)                1.7  s  ( Atomic )            Option 'Match Case' unchecked   
                                  
                                  .++\r\n(?=TEST$)                1.38 s  ( Atomic )            Option 'Match Case' checked     
                                  
                                  (?-i).++\r\n(?=TEST$)          24    s  ( Atomic )  ( ?! )    Option 'Match Case' unchecked   
                                  
                                  --- Using the @Terry-R solution ---
                                  
                                  (?-s).\R(?=TEST$)               6.35 s              ( ! )     Option 'Match Case' unchecked   
                                  
                                  (?-s).\r\n(?=TEST$)             8.6  s              ( ! )     Option 'Match Case' unchecked   
                                  
                                  
                                  (?-s).{1}+\R(?=TEST$)           8.2  s  ( Atomic )  ( ! )     Option 'Match Case' unchecked   
                                  
                                  (?-s).{1}+\r\n(?=TEST$)        10.6  s  ( Atomic )  ( ! )     Option 'Match Case' unchecked   
                                  
                                  
                                  .{1}+\R(?=TEST$)                7.5  s  ( Atomic )  ( ! )     Option 'Match Case' unchecked   
                                  
                                  .{1}+\r\n(?=TEST$)              9.75 s  ( Atomic )  ( ! )     Option 'Match Case' unchecked   
                                  
                                  --- After CONCATENATION of the line BEFORE the line TEST with the line TEST ---
                                  
                                  First, the regex \R(?=TEST$) is replaced with NOTHING ( 26.2 s ) => 142,908,950 bytes for 3,030,301 lines. Then :
                                  
                                  TEST$                           2.3  s                        Option 'Match Case' unchecked   
                                  
                                  TEST$                           0.95 s                        Option 'Match Case' checked     
                                  
                                  (?-i)TEST$                      1    s                        Option 'Match Case' unchecked   
                                  
                                  Last, the regex TEST$ is replaced with \r\n$0  ( 25.3 s )
                                  

                                  Conclusion :

                                  So, given the rules above, the best syntaxes seem to be, on my new Windows 10 machine :

                                  • The regex .++\r\n(?=TEST$) in 1.38 seconds, with the Match Case option checked.

                                  • The regex TEST$ in 0.95 seconds, AFTER an initial concatenation of the line before the line TEST with the line TEST.

                                  Best Regards,

                                  guy038
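
For anyone who wants to repeat this kind of comparison outside Notepad++, a small timing harness is sketched below. It uses Python's `re` module, so the absolute numbers are not comparable with the Notepad++ timings above. The synthetic data is an assumption (filler content, 25 filler lines per `TEST` line, the number of groups) and not a copy of the test file, and `\n` stands in for `\R`, which `re` doesn't support. The helper names `build_sample` and `time_pattern` are made up for this example.

```python
import re
import time
from typing import Tuple


def build_sample(groups: int) -> str:
    """Build synthetic text: 25 filler lines followed by one TEST line, repeated."""
    filler = "x" * 47 + "\n"
    return (filler * 25 + "TEST\n") * groups


def time_pattern(pattern: str, text: str) -> Tuple[int, float]:
    """Return (match count, elapsed seconds) for one full scan of text."""
    compiled = re.compile(pattern, re.MULTILINE)
    start = time.perf_counter()
    count = sum(1 for _ in compiled.finditer(text))
    return count, time.perf_counter() - start


sample = build_sample(2_000)
results = {}
for pattern in (r"^.*\n(?=TEST$)", r"^.+\n(?=TEST$)", r".+\n(?=TEST$)"):
    results[pattern] = time_pattern(pattern, sample)
    count, seconds = results[pattern]
    print(f"{pattern:<20} {count:>6} matches  {seconds:.3f} s")
```

Each pattern should report one match per group (the line just above each `TEST` line); timings vary from run to run, so averaging several runs gives a fairer picture.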

                                  The Community of users of the Notepad++ text editor.