    deleting specific lines that don't meet a criteria

    • Neil Schipper @M P

      @m-p Hello. My solution is in the text box below. (I left my failed regexes in for public amusement.) Copy the text into a ‘new’ file tab and try it. I can’t guarantee it on a monster size file, but kindly let us know.

      (\w+)\w:/1 ===> fail
      (\w*?)\w:\1  ===> does match whole lines of interest, but also parts of lines we want to not match
      (\w{2,})\w:\1  ===> still also matches parts of to-reject lines
      ^(\w{2,})\w:\1$  ===> bingo
      
      Recipe:
      1. Select the text of the last regex above ($ is the last character of the regex); invoke 'Mark' (Ctrl+M) (the dialog appears with the regex in the 'Find what' box); enable 'Bookmark line' and 'Regular expression'; move the cursor to the start of the file; execute 'Mark All'
      2. Search -> Bookmark -> Remove unmarked lines
      
      1234567:123456
      123456:123456
      12346:123456
      567890ß:567890
      dfcghvioti6uzfghj:dfcghvioti6uzfgh
      7658656:dfcghvioti6uzfghj
      sgagdaskdj:oijvjpi osdcj
      98760798698657:9876079869865
      
      • Neil Schipper @M P

        @m-p You will need to be patient: on a test file of 1000 lines, with 50% of the lines matching the regex, the Mark operation took about 1.25 minutes on my old Intel® Core™ i5 CPU 650 @ 3.20 GHz, and with a portable npp rev that I use to fool with plugins, etc:

        Notepad++ v8.1.9   (32-bit)
        Build time : Oct 21 2021 - 23:32:04
        Path : C:\Users\neils\Downloads\npp\npp.8.1.9.portable\notepad++.exe
        Command Line : -multiInst
        Admin mode : OFF
        Local Conf mode : ON
        Cloud Config : OFF
        OS Name : Windows 10 Enterprise (64-bit) 
        OS Version : 2004
        OS Build : 19041.1348
        Current ANSI codepage : 1252
        Plugins : AnalysePlugin.dll AutoCodepage.dll AutoEolFormat.dll BetterMultiSelection.dll BookmarksDook.dll BracketsCheck.dll ColorPicker.dll CustomLineNumbers.dll ElasticTabstops.dll ExtSettings.dll FileSwitcher.dll FingerText.dll FWDataViz.dll GitSCM.dll GotoLineCol.dll HexEditor.dll LightExplorer.dll linefilter2.dll Linefilter3.dll linesort.dll LocationNavigate.dll MarkdownViewerPlusPlus.dll MenuIcons.dll mimeTools.dll MultiClipboard.dll MusicPlaye_1.0.11x86r.dll NavigateTo.dll NewFileBrowser.dll nppAutoDetectIndent.dll NppCalc.dll NppConverter.dll NppExport.dll NppMarkdownPanel.dll NppMenuSearch.dll NppQCP.dll NppTextViz.dll OpenSelection.dll pork2sausage.dll PythonScript.dll QuickText.dll RegexTrainer.dll SecurePad.dll selectNLaunch.dll VisualStudioLineCopy.dll _CustomizeToolbar.dll 
        
        

        (I forget why I even chose a 32-bit rev.)

        Scaling linearly from that test (400 × 1.25 minutes), your 400k lines would take on the order of 500 minutes. (The 'Remove unmarked lines' operation was very quick.)

        I also noticed that the Mark dialog’s Find what entry box changed to show blank during the lengthy process. This made me worry that the process had hung up, but fortunately that turned out to not be the case.

        • astrosofista @M P

          @m-p, @Neil-Schipper, all

          As said before, I am also not sure that the regex engine can process 400000 lines in a single run. Despite the warning, I suggest the following regular expression that will delete all lines that do not meet the proposed criteria:

          Search: (?-s)(^(.+?).:\2\R)|.*\R
          Replace: ?1$0:
          

          So, take care to make a backup copy of the file, put the caret at the very beginning of the document, select only the Regular expression search mode, and click on Replace All.
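
For readers who want to check the logic outside Notepad++, here is a minimal Python sketch of the same idea. It is an illustration under stated assumptions, not part of the solution itself: Python's `re` has no `\R` and no conditional replacement syntax, so `\n` stands in for `\R` and a replacement function plays the role of `?1$0:`. The name `keep_matching_lines` and the sample text are made up for the demo.

```python
import re

# Python port of the search expression. Assumptions: "\n" stands in for
# Boost's \R, and (?-s) is dropped because "." already excludes newlines
# in Python unless re.DOTALL is set.
PATTERN = re.compile(r"(^(.+?).:\2\n)|.*\n", re.MULTILINE)


def keep_matching_lines(text: str) -> str:
    """Keep lines of the form X?:X and drop every other newline-terminated line."""
    # Group 1 is set only when the first alternative matched, which mirrors
    # the conditional replacement ?1$0: (re-emit the match, else emit nothing).
    return PATTERN.sub(
        lambda m: m.group(0) if m.group(1) is not None else "", text
    )


SAMPLE = (
    "1234567:123456\n"
    "123456:123456\n"
    "12346:123456\n"
    "567890ß:567890\n"
    "dfcghvioti6uzfghj:dfcghvioti6uzfgh\n"
    "7658656:dfcghvioti6uzfghj\n"
    "sgagdaskdj:oijvjpi osdcj\n"
    "98760798698657:9876079869865\n"
)

print(keep_matching_lines(SAMPLE), end="")
```

On the sample, only the four lines whose left side is the right side plus one extra character survive. As in the Notepad++ version, a final line without a line ending matches neither alternative and is left untouched.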

          Stay safe

          • Neil Schipper @astrosofista

            @astrosofista Nice solution. It runs much faster than mine. (I do find it mysterious that deleting would be faster than mere marking & bookmarking, but what do I know about the innards of the regex machinery?).

            Also, although I’ve skimmed the topic in the docs, I’ve never actually seen a Substitution Conditional in action, so thanks.

            … and I found a flaw in my solution: it doesn’t match short lines of the form ab:a. The fix is below:

            ^(\w{2,})\w:\1$  ===> bingo ===> oops, prevents xy:x
            ^(\w{1,})\w:\1$  ===> handles that case
            ^(\w+)\w:\1$  ===> ditto but uses more conventional construct than {1,}
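
The difference between these variants can be checked with a short Python sketch. It is only an illustration: Python's `re` is not the Boost engine that Notepad++ uses, though the three patterns below behave the same way on this data. The `hits` helper and the extra `xy:x` test line are assumptions added for the demo.

```python
import re

SAMPLE_LINES = [
    "1234567:123456",
    "123456:123456",
    "12346:123456",
    "567890ß:567890",
    "dfcghvioti6uzfghj:dfcghvioti6uzfgh",
    "7658656:dfcghvioti6uzfghj",
    "sgagdaskdj:oijvjpi osdcj",
    "98760798698657:9876079869865",
    "xy:x",  # added to exercise the short-line case
]

UNANCHORED = re.compile(r"(\w{2,})\w:\1")      # can match part of a line
ANCHORED_MIN2 = re.compile(r"^(\w{2,})\w:\1$")  # whole line, but misses xy:x
ANCHORED_PLUS = re.compile(r"^(\w+)\w:\1$")     # whole line, handles xy:x


def hits(pattern, lines):
    """Return the lines in which the pattern is found anywhere."""
    return [line for line in lines if pattern.search(line)]


for name, pattern in [
    ("unanchored", UNANCHORED),
    ("anchored {2,}", ANCHORED_MIN2),
    ("anchored +", ANCHORED_PLUS),
]:
    print(name, hits(pattern, SAMPLE_LINES))
```

The unanchored pattern also reports `123456:123456` and `12346:123456`, because it is satisfied by a partial match such as `123456:12345`. Anchoring with `^` and `$` rejects those, and relaxing `{2,}` to `+` admits `xy:x`.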
            
            • guy038

              Hello, @m-p, @neil-schipper, @astrosofista and All,

              Here is a method :

              • Open the Replace dialog ( Ctrl + H )

              • SEARCH (?-s)^(\w+)\w:\1\R(*SKIP)(*FAIL)|.+\R

              • REPLACE Leave EMPTY

              • Tick the Wrap around option

              • Select the Regular expression search mode

              • Click on the Replace All button


              As a test, I duplicated your 6-line text, below, 66,667 times, for a total of 400,002 lines

              1234567:123456
              567890ß:567890
              dfcghvioti6uzfghj:dfcghvioti6uzfgh
              7658656:dfcghvioti6uzfghj
              sgagdaskdj:oijvjpi osdcj
              98760798698657:9876079869865
              

              After a click on the Replace All button, 16 s later, it displayed the message “133,334 occurrences were replaced”, so exactly the two lines, below, 66,667 times! (With N++ v8.1.9.2 on a Win 10 Pro 64-bit laptop with an SSD.)

              7658656:dfcghvioti6uzfghj
              sgagdaskdj:oijvjpi osdcj
              

              Best Regards,

              guy038

              P.S. :

              See the definition of the Backtracking Control verbs (*SKIP) and (*FAIL) below :

              https://community.notepad-plus-plus.org/post/55464
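
The counts in this test can be reproduced with a small Python sketch. It is an emulation under stated assumptions, not the Notepad++ mechanism: Python's standard `re` module has no `(*SKIP)` or `(*FAIL)` verbs, so the "match the lines to keep, discard that match, replace everything else" idiom is approximated with a per-line filter. The function name `delete_non_matching` is hypothetical.

```python
import re

# A line to keep: some word characters, one extra word character, a colon,
# then the first group again, with nothing else on the line.
KEEP = re.compile(r"(\w+)\w:\1")

BLOCK = [
    "1234567:123456",
    "567890ß:567890",
    "dfcghvioti6uzfghj:dfcghvioti6uzfgh",
    "7658656:dfcghvioti6uzfghj",
    "sgagdaskdj:oijvjpi osdcj",
    "98760798698657:9876079869865",
]


def delete_non_matching(lines):
    """Return (kept_lines, number_deleted).

    Empty lines are kept, because the second alternative .+\\R needs at
    least one character before the line ending.
    """
    kept = [line for line in lines if line == "" or KEEP.fullmatch(line)]
    return kept, len(lines) - len(kept)


copies = 66_667
lines = BLOCK * copies
kept, deleted = delete_non_matching(lines)
print(len(lines), deleted)  # prints: 400002 133334
```

Two of the six lines in each copy fail the pattern, which gives 2 × 66,667 = 133,334 deletions, in agreement with the message reported by Replace All.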

              • M P @Neil Schipper

                @neil-schipper thank you for this amazing help. It processed very fast!! (1min max)

                • Neil Schipper @M P

                  @m-p I’m glad it worked out for you. (I hope you checked that no cases of ab:a were missed!)

                  It’s interesting to me that the solutions of @astrosofista and @guy038 are so much faster than mine. On my machine and with my test data:

                  • my solution’s (book)marking phase (using my most recent regex) took about 58 seconds
                  • @astrosofista’s one-step solution took about 8 seconds
                  • @guy038’s one-step solution took about 4 seconds

                  @guy038, I’ve been spending some time with your amazing backtracking control verbs write-up. I’m having a tough time with it, and I haven’t absorbed much as yet. Maybe I’m not used to thinking (in a correct and disciplined way) about normal backtracking, and this makes it hard to think about modifying it. (Also, the organ between my ears is not quite what it was 30 years ago.) Anyway, I’ll plod along some more and maybe after {3,} readings, something will be absorbed and retained.

                  • Alan Kilborn @Neil Schipper

                    @neil-schipper said in deleting specific lines that don't meet a criteria:

                    my solution’s (book)marking phase (using my most recent regex) took about 58 seconds

                    Operations involving bookmarking are often slow in Notepad++. This has been discussed on the forum before, without (I believe) any solution to the problem being found.

                    Some possible references:

                    • https://community.notepad-plus-plus.org/topic/15159/bookmark-multiple-lines
                    • https://community.notepad-plus-plus.org/topic/18900/persistent-highlight-of-characters-e-g-no-break-space
                    • https://github.com/notepad-plus-plus/notepad-plus-plus/issues/8279
                    • Neil Schipper @Alan Kilborn

                      @alan-kilborn Thanks, I’ll look at those. I figured the main contributor to slowness was promiscuous backtracking on lines that don’t meet the mark condition.

                      • Alan Kilborn @Neil Schipper

                        @neil-schipper said in deleting specific lines that don't meet a criteria:

                        I figured the main contributor to slowness was promiscuous backtracking on lines that don’t meet the mark condition

                        That could be. Actually, I wasn’t that absorbed with this issue, so the slowness may well NOT involve the bookmarking action. I just thought I’d point it out for general awareness that there is some sort of performance issue with some bookmarking actions.

                        • Neil Schipper @guy038

                          Hi @astrosofista & @guy038 & @alan-kilborn,

                          OK, I found some closure on this. When re-running my bookmarking regex I noticed activity in the BookmarksDook panel. If I made the panel invisible, the process was still very slow.

                          I also observed that a native bookmarking process like Inverse Bookmark (which has nothing to do with running a regex) was super slow.

                          Then I went to a different Npp instance, this one minimalist (no plugins), and ran the regex against the same data, and the process was blink-of-an-eye. I then doubled the data to 2k lines, and it was still pretty much instant.

                          Then I took Bookmarks@Dook out of the earlier instance (moved plug-in subdir away, restarted): the bookmarking by regex process was instant. Restored Bookmarks@Dook plugin, slow again.

                          So there you have it.

                          @alan-kilborn Those threads deal with bookmark processes driven by PythonScript, not native. The two issues might still be related. None of those discussions made mention of the presence or absence of the plugin.
