
    deleting specific lines that don't meet a criteria

    12 Posts 5 Posters 1.2k Views
    • Neil SchipperN
      Neil Schipper @M P
      last edited by

      @m-p You will need to be patient: on a test file of 1000 lines, with 50% of the lines matching the regex, the Mark operation took about 1.25 minutes on my old Intel® Core™ i5 CPU 650 @ 3.20 GHz, and with a portable npp rev that I use to fool with plugins, etc:

      Notepad++ v8.1.9   (32-bit)
      Build time : Oct 21 2021 - 23:32:04
      Path : C:\Users\neils\Downloads\npp\npp.8.1.9.portable\notepad++.exe
      Command Line : -multiInst
      Admin mode : OFF
      Local Conf mode : ON
      Cloud Config : OFF
      OS Name : Windows 10 Enterprise (64-bit) 
      OS Version : 2004
      OS Build : 19041.1348
      Current ANSI codepage : 1252
      Plugins : AnalysePlugin.dll AutoCodepage.dll AutoEolFormat.dll BetterMultiSelection.dll BookmarksDook.dll BracketsCheck.dll ColorPicker.dll CustomLineNumbers.dll ElasticTabstops.dll ExtSettings.dll FileSwitcher.dll FingerText.dll FWDataViz.dll GitSCM.dll GotoLineCol.dll HexEditor.dll LightExplorer.dll linefilter2.dll Linefilter3.dll linesort.dll LocationNavigate.dll MarkdownViewerPlusPlus.dll MenuIcons.dll mimeTools.dll MultiClipboard.dll MusicPlaye_1.0.11x86r.dll NavigateTo.dll NewFileBrowser.dll nppAutoDetectIndent.dll NppCalc.dll NppConverter.dll NppExport.dll NppMarkdownPanel.dll NppMenuSearch.dll NppQCP.dll NppTextViz.dll OpenSelection.dll pork2sausage.dll PythonScript.dll QuickText.dll RegexTrainer.dll SecurePad.dll selectNLaunch.dll VisualStudioLineCopy.dll _CustomizeToolbar.dll 
      
      

      (I forget why I even chose a 32-bit rev.)

      Your 400k lines would take on the order of 500 minutes. (The remove unmarked lines operation was very quick.)
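The 500-minute figure follows from scaling the 1000-line measurement linearly. A minimal sketch of that arithmetic, assuming the Mark time grows in proportion to the line count (the function name is hypothetical):

```python
def extrapolate_minutes(measured_lines, measured_minutes, target_lines):
    """Scale a measured duration linearly to a larger line count.

    Assumption: run time is proportional to the number of lines.
    """
    return measured_minutes * target_lines / measured_lines


# 1000 lines took about 1.25 minutes; estimate for 400k lines.
print(extrapolate_minutes(1_000, 1.25, 400_000))  # 500.0
```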

      I also noticed that the Mark dialog’s Find what entry box changed to show blank during the lengthy process. This made me worry that the process had hung up, but fortunately that turned out to not be the case.

      • astrosofistaA
        astrosofista @M P
        last edited by astrosofista

        @m-p, @Neil-Schipper, all

        As said before, I am also not sure that the regex engine can process 400000 lines in a single run. Despite the warning, I suggest the following regular expression that will delete all lines that do not meet the proposed criteria:

        Search: (?-s)(^(.+?).:\2\R)|.*\R
        Replace: ?1$0:
        

        So, take care to make a backup copy of the file, put the caret at the very beginning of the document, select just the Regular Expression mode and click on Replace All.
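For anyone who wants to try the idea outside Notepad++, here is a rough Python sketch of the same search and conditional replacement. It is an emulation, not the Notepad++ engine: Python's `re` module has neither `\R` nor the `?1$0:` conditional replacement syntax, so this assumes `\n` line endings and uses a replacement function instead. The name `keep_matching` is hypothetical.

```python
import re

# Same shape as the Search regex above, with \R swapped for \n (assumption:
# the text uses \n line endings). Without re.DOTALL, "." stops at newlines,
# which mirrors the (?-s) flag.
PATTERN = re.compile(r"(^(.+?).:\2\n)|.*\n", re.MULTILINE)


def keep_matching(text):
    """Keep lines of the form <prefix><any char>:<prefix>, delete the rest."""
    # Emulates the "?1$0:" replacement: if group 1 took part in the match,
    # put the whole match back; otherwise replace it with nothing.
    return PATTERN.sub(
        lambda m: m.group(0) if m.group(1) is not None else "", text
    )


sample = "1234567:123456\n7658656:dfcghvioti6uzfghj\nab:a\n"
print(keep_matching(sample), end="")
```

As with the original regex, a final line with no trailing line break is left untouched, because both alternatives require the line ending.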

        Stay safe

        • Neil SchipperN
          Neil Schipper @astrosofista
          last edited by Neil Schipper

          @astrosofista Nice solution. It runs much faster than mine. (I do find it mysterious that deleting would be faster than mere marking & bookmarking, but what do I know about the innards of the regex machinery?).

          Also, although I’ve skimmed the topic in the docs, I’ve never actually seen a Substitution Conditional in action, so thanks.

          … and I found a flaw in my solution: it doesn’t match short lines of the form ab:a. The fix is below:

          ^(\w{2,})\w:\1$  ===> bingo ===> oops, prevents xy:x
          ^(\w{1,})\w:\1$  ===> handles that case
          ^(\w+)\w:\1$  ===> ditto but uses more conventional construct than {1,}
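The difference between the first and last variants can be checked quickly in Python, whose `re` module treats `\w`, `{2,}`, `+` and backreferences the same way for these simple inputs. This is only a sketch for comparison; the variable names are hypothetical and Notepad++ itself uses a different regex engine.

```python
import re

too_strict = re.compile(r"^(\w{2,})\w:\1$")  # needs 3+ chars before the colon
fixed = re.compile(r"^(\w+)\w:\1$")          # needs 2+ chars before the colon

for line in ["abc:ab", "xy:x"]:
    # Print the line, then whether each variant matches it.
    print(line, bool(too_strict.search(line)), bool(fixed.search(line)))
```

Both variants match `abc:ab`, but only the `\w+` version matches the short line `xy:x`.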
          
          • guy038G
            guy038
            last edited by guy038

            Hello, @m-p, @neil-schipper, @astrosofista and All,

            Here is a method :

            • Open the Replace dialog ( Ctrl + H )

            • SEARCH (?-s)^(\w+)\w:\1\R(*SKIP)(*FAIL)|.+\R

            • REPLACE Leave EMPTY

            • Tick the Wrap around option

            • Select the Regular expression search mode

            • Click on the Replace All button


            As a test, I duplicated your 6-lines text, below, 66,667 times for a total of 400,002 lines

            1234567:123456
            567890ß:567890
            dfcghvioti6uzfghj:dfcghvioti6uzfgh
            7658656:dfcghvioti6uzfghj
            sgagdaskdj:oijvjpi osdcj
            98760798698657:9876079869865
            

            After a click on the Replace All button, 16 s later, it displayed the message “133,334 occurrences were replaced”, which is exactly the two lines below, deleted 66,667 times each! (Tested with N++ v8.1.9.2 on a Win 10 Pro 64-bit laptop with an SSD.)

            7658656:dfcghvioti6uzfghj
            sgagdaskdj:oijvjpi osdcj
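The reported count can be cross-checked with a small Python sketch. Python's standard `re` module has no backtracking control verbs such as `(*SKIP)(*FAIL)`, so this does not run the SEARCH regex itself; it assumes the equivalent rule "keep a line if it has the form prefix, one word character, colon, prefix; delete it otherwise". The names `SAMPLE`, `KEEP` and `split_lines` are hypothetical.

```python
import re

# The six sample lines from the test above.
SAMPLE = [
    "1234567:123456",
    "567890ß:567890",
    "dfcghvioti6uzfghj:dfcghvioti6uzfgh",
    "7658656:dfcghvioti6uzfghj",
    "sgagdaskdj:oijvjpi osdcj",
    "98760798698657:9876079869865",
]

# Line shape from the first alternative of the SEARCH regex, without \R;
# fullmatch() anchors it to the whole line.
KEEP = re.compile(r"(\w+)\w:\1")


def split_lines(lines):
    """Return (kept, deleted) according to the KEEP line shape."""
    kept, deleted = [], []
    for line in lines:
        (kept if KEEP.fullmatch(line) else deleted).append(line)
    return kept, deleted


kept, deleted = split_lines(SAMPLE * 66_667)
print(len(kept) + len(deleted), len(deleted))  # 400002 133334
```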
            

            Best Regards,

            guy038

            P.S. :

            See the definition of the Backtracking Control verbs (*SKIP) and (*FAIL) below :

            https://community.notepad-plus-plus.org/post/55464

            • M PM
              M P @Neil Schipper
              last edited by

              @neil-schipper thank you for this amazing help. It processed very fast!! (1min max)

              • Neil SchipperN
                Neil Schipper @M P
                last edited by Neil Schipper

                @m-p I’m glad it worked out for you. (I hope you checked that no cases of ab:a were missed!)

                It’s interesting to me that the solutions of @astrosofista and @guy038 are so much faster than mine. On my machine and with my test data:

                • my solution’s (book)marking phase (using my most recent regex) took about 58 seconds
                • @astrosofista’s one-step solution took about 8 seconds
                • @guy038’s one-step solution took about 4 seconds

                @guy038, I’ve been spending some time with your amazing backtracking control verbs write-up. I’m having a tough time with it, and I haven’t absorbed much as yet. Maybe I’m not used to thinking (in a correct and disciplined way) about normal backtracking, and this makes it hard to think about modifying it. (Also, the organ between my ears is not quite what it was 30 years ago.) Anyway, I’ll plod along some more and maybe after {3,} readings, something will be absorbed and retained.

                • Alan KilbornA
                  Alan Kilborn @Neil Schipper
                  last edited by Alan Kilborn

                  @neil-schipper said in deleting specific lines that don't meet a criteria:

                  my solution’s (book)marking phase (using my most recent regex) took about 58 seconds

                  Operations involving bookmarking are often slow with Notepad++. This has been discussed on the forum before, without (I believe) any solutions to the problem being derived.

                  Some possible references:

                  • https://community.notepad-plus-plus.org/topic/15159/bookmark-multiple-lines
                  • https://community.notepad-plus-plus.org/topic/18900/persistent-highlight-of-characters-e-g-no-break-space
                  • https://github.com/notepad-plus-plus/notepad-plus-plus/issues/8279
                  • Neil SchipperN
                    Neil Schipper @Alan Kilborn
                    last edited by

                    @alan-kilborn Thanks, I’ll look at those. I figured the main contributor to slowness was promiscuous backtracking on lines that don’t meet the mark condition.

                    • Alan KilbornA
                      Alan Kilborn @Neil Schipper
                      last edited by

                      @neil-schipper said in deleting specific lines that don't meet a criteria:

                      I figured the main contributor to slowness was promiscuous backtracking on lines that don’t meet the mark condition

                      That could be. Actually, I wasn’t that absorbed with this issue, so the slowness may well NOT involve the bookmarking action. I just thought I’d point it out for general awareness that there is some sort of performance issue with some bookmarking actions.

                      • Neil SchipperN
                        Neil Schipper @guy038
                        last edited by

                        Hi @astrosofista & @guy038 & @alan-kilborn,

                        OK, I found some closure on this. When re-running my bookmarking regex I noticed activity in the BookmarksDook panel. If I made the panel invisible, the process was still very slow.

                        I also observed that a native bookmarking process like Inverse Bookmark (which has nothing to do with running a regex) was super slow.

                        Then I went to a different Npp instance, this one minimalist (no plugins), and ran the regex against the same data, and the process was blink-of-an-eye. I then doubled the data to 2k lines, and it was still pretty much instant.

                        Then I took Bookmarks@Dook out of the earlier instance (moved plug-in subdir away, restarted): the bookmarking by regex process was instant. Restored Bookmarks@Dook plugin, slow again.

                        So there you have it.

                        @alan-kilborn Those threads deal with bookmark processes driven by PythonScript, not native. The two issues might still be related. None of those discussions made mention of the presence or absence of the plugin.
