    deleting specific lines that don't meet a criteria

    • M PM
      M P
      last edited by

      I got a list with over 400000 lines that looks similar to this:

      1234567:123456
      567890ß:567890
      dfcghvioti6uzfghj:dfcghvioti6uzfgh
      7658656:dfcghvioti6uzfghj
      sgagdaskdj:oijvjpi osdcj
      98760798698657:9876079869865

      I ONLY want to keep the lines that have the same text before and after the “:”, EXCEPT that the last number/letter is missing from the text after the “:”, for example:

      1234567:123456
      567890ß:567890
      dfcghvioti6uzfghj:dfcghvioti6uzfgh

      7658656:dfcghvioti6uzfghj
      sgagdaskdj:oijvjpi osdcj

      98760798698657:9876079869865

      As you can see, the lines I want to keep are the first three and the last one. The only difference between the first and second word is that the last number/letter is missing from the second word/text.

      How can this be done on a scale of 400000 lines? I'm happy to pay $25 for a solution!

      • Neil SchipperN
        Neil Schipper @M P
        last edited by

        @m-p Hello. My solution is in the text box below. (I left my failed regexes in for public amusement.) Copy the text into a ‘new’ file tab and try it. I can’t guarantee it on a monster size file, but kindly let us know.

        (\w+)\w:/1 ===> fail
        (\w*?)\w:\1  ===> does match whole lines of interest, but also parts of lines we want to not match
        (\w{2,})\w:\1  ===> still also matches parts of to-reject lines
        ^(\w{2,})\w:\1$  ===> bingo
        
        Recipe:
        1. Select the text of the last regex above ($ is the last char of the regex); invoke 'Mark' (Ctrl+M) (the dialog appears with the regex in the 'Find what' box); enable 'Bookmark line' and 'Regular expression'; in the file, move the caret to the home position; click 'Mark All'
        2. Search -> Bookmark -> Remove unmarked lines
        
        1234567:123456
        123456:123456
        12346:123456
        567890ß:567890
        dfcghvioti6uzfghj:dfcghvioti6uzfgh
        7658656:dfcghvioti6uzfghj
        sgagdaskdj:oijvjpi osdcj
        98760798698657:9876079869865
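The four attempts above can also be tried outside Notepad++. What follows is a minimal sketch, assuming Python's built-in `re` module as a stand-in: Notepad++ uses the Boost regex engine, and the two may differ in corner cases. The first pattern is kept exactly as posted, including `/1`. The names `SAMPLE`, `ATTEMPTS`, `matched_text` and `final_keep` are hypothetical and exist only for this illustration.

```python
import re

# Sample lines copied from the text box above.
SAMPLE = [
    "1234567:123456",
    "123456:123456",
    "12346:123456",
    "567890ß:567890",
    "dfcghvioti6uzfghj:dfcghvioti6uzfgh",
    "7658656:dfcghvioti6uzfghj",
    "sgagdaskdj:oijvjpi osdcj",
    "98760798698657:9876079869865",
]

# The four regexes from the post, in order (the first one verbatim, "/1" included).
ATTEMPTS = [
    r"(\w+)\w:/1",
    r"(\w*?)\w:\1",
    r"(\w{2,})\w:\1",
    r"^(\w{2,})\w:\1$",
]


def matched_text(pattern, line):
    """Return the part of `line` that `pattern` matches, or None if no match."""
    m = re.search(pattern, line)
    return m.group(0) if m else None


# Show, for every attempt, which part of each sample line it would match.
for pattern in ATTEMPTS:
    print(pattern)
    for line in SAMPLE:
        print("   ", repr(line), "->", repr(matched_text(pattern, line)))

# Lines the anchored (last) regex would bookmark, i.e. the lines to keep.
final_keep = [line for line in SAMPLE if re.search(ATTEMPTS[-1], line)]
```

The unanchored attempts report a partial match on lines such as `123456:123456`, which is why the `^` and `$` anchors matter.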
        
        • Neil SchipperN
          Neil Schipper @M P
          last edited by

          @m-p You will need to be patient: on a test file of 1000 lines, with 50% of the lines matching the regex, the Mark operation took about 1.25 minutes on my old Intel® Core™ i5 CPU 650 @ 3.20 GHz, and with a portable npp rev that I use to fool with plugins, etc:

          Notepad++ v8.1.9   (32-bit)
          Build time : Oct 21 2021 - 23:32:04
          Path : C:\Users\neils\Downloads\npp\npp.8.1.9.portable\notepad++.exe
          Command Line : -multiInst
          Admin mode : OFF
          Local Conf mode : ON
          Cloud Config : OFF
          OS Name : Windows 10 Enterprise (64-bit) 
          OS Version : 2004
          OS Build : 19041.1348
          Current ANSI codepage : 1252
          Plugins : AnalysePlugin.dll AutoCodepage.dll AutoEolFormat.dll BetterMultiSelection.dll BookmarksDook.dll BracketsCheck.dll ColorPicker.dll CustomLineNumbers.dll ElasticTabstops.dll ExtSettings.dll FileSwitcher.dll FingerText.dll FWDataViz.dll GitSCM.dll GotoLineCol.dll HexEditor.dll LightExplorer.dll linefilter2.dll Linefilter3.dll linesort.dll LocationNavigate.dll MarkdownViewerPlusPlus.dll MenuIcons.dll mimeTools.dll MultiClipboard.dll MusicPlaye_1.0.11x86r.dll NavigateTo.dll NewFileBrowser.dll nppAutoDetectIndent.dll NppCalc.dll NppConverter.dll NppExport.dll NppMarkdownPanel.dll NppMenuSearch.dll NppQCP.dll NppTextViz.dll OpenSelection.dll pork2sausage.dll PythonScript.dll QuickText.dll RegexTrainer.dll SecurePad.dll selectNLaunch.dll VisualStudioLineCopy.dll _CustomizeToolbar.dll 
          
          

          (I forget why I even chose a 32-bit rev.)

          Your 400k lines would take on the order of 500 minutes. (The remove unmarked lines operation was very quick.)
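The 500-minute figure is a straight linear extrapolation from the 1000-line test. Here is a minimal sketch of that arithmetic, assuming run time grows in proportion to the number of lines (which may not hold in practice); the function name is hypothetical.

```python
def extrapolate_minutes(measured_minutes, measured_lines, target_lines):
    """Scale a measured run time linearly to a different line count."""
    return measured_minutes * target_lines / measured_lines


# 1.25 minutes for 1000 lines, scaled to 400,000 lines.
estimate = extrapolate_minutes(1.25, 1000, 400_000)
print(estimate)  # 500.0
```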

          I also noticed that the Mark dialog’s Find what entry box changed to show blank during the lengthy process. This made me worry that the process had hung up, but fortunately that turned out to not be the case.

          • astrosofistaA
            astrosofista @M P
            last edited by astrosofista

            @m-p, @Neil-Schipper, all

            As said before, I am also not sure that the regex engine can process 400000 lines in a single run. Despite the warning, I suggest the following regular expression that will delete all lines that do not meet the proposed criteria:

            Search: (?-s)(^(.+?).:\2\R)|.*\R
            Replace: ?1$0:
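To see what this search/replace pair does, here is a minimal sketch in Python. It rests on several assumptions: Python's `re` module has neither `\R` nor the conditional replacement `?1$0:`, so the sketch uses a plain `\n` line ending and a callback that returns the whole match when group 1 took part and an empty string otherwise. `(?-s)` is dropped because `.` does not match a newline in Python by default. The names `PATTERN`, `keep_or_drop` and `text` are hypothetical.

```python
import re

# Same shape as the search regex above, with "\n" standing in for \R.
# Group 1 matches a whole "keep" line; the second alternative matches any other line.
PATTERN = re.compile(r"(^(.+?).:\2\n)|.*\n", re.MULTILINE)


def keep_or_drop(match):
    """Emulate the replacement ?1$0: (whole match if group 1 matched, else nothing)."""
    return match.group(0) if match.group(1) is not None else ""


text = (
    "1234567:123456\n"
    "567890ß:567890\n"
    "dfcghvioti6uzfghj:dfcghvioti6uzfgh\n"
    "7658656:dfcghvioti6uzfghj\n"
    "sgagdaskdj:oijvjpi osdcj\n"
    "98760798698657:9876079869865\n"
)

result = PATTERN.sub(keep_or_drop, text)
print(result)
```

Note that both alternatives require a line ending, so in this sketch a final line without a trailing newline would be left untouched.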
            

            So, take care to make a backup copy of the file, put the caret at the very beginning of the document, select just the Regular Expression mode and click on Replace All.

            Stay safe

            • Neil SchipperN
              Neil Schipper @astrosofista
              last edited by Neil Schipper

              @astrosofista Nice solution. It runs much faster than mine. (I do find it mysterious that deleting would be faster than mere marking & bookmarking, but what do I know about the innards of the regex machinery?).

              Also, although I’ve skimmed the topic in the docs, I’ve never actually seen a Substitution Conditional in action, so thanks.

              … and I found a flaw in my solution: it doesn’t match short lines of the form ab:a. The fix is below:

              ^(\w{2,})\w:\1$  ===> bingo ===> oops, prevents xy:x
              ^(\w{1,})\w:\1$  ===> handles that case
              ^(\w+)\w:\1$  ===> ditto but uses more conventional construct than {1,}
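The difference between the `{2,}` and `+` versions is easy to check on a two-character prefix. This is a minimal sketch, again assuming Python's `re` module as a stand-in for the Boost engine in Notepad++; `OLD` and `NEW` are hypothetical names.

```python
import re

OLD = re.compile(r"^(\w{2,})\w:\1$")  # needs at least 3 characters before the colon
NEW = re.compile(r"^(\w+)\w:\1$")     # needs at least 2 characters before the colon

for line in ["xy:x", "ab:a", "1234567:123456", "7658656:dfcghvioti6uzfghj"]:
    # Print whether each version of the regex would mark the line.
    print(line, "| old:", bool(OLD.search(line)), "| new:", bool(NEW.search(line)))
```

With `{2,}` the captured group alone must be two characters long, so a line like `xy:x`, where the group can only be `x`, is never marked.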
              
              • guy038G
                guy038
                last edited by guy038

                Hello, @m-p, @neil-schipper, @astrosofista and All,

                Here is a method :

                • Open the Replace dialog ( Ctrl + H )

                • SEARCH (?-s)^(\w+)\w:\1\R(*SKIP)(*FAIL)|.+\R

                • REPLACE Leave EMPTY

                • Tick the Wrap around option

                • Select the Regular expression search mode

                • Click on the Replace All button


                As a test, I duplicated your 6-line text, below, 66,667 times for a total of 400,002 lines

                1234567:123456
                567890ß:567890
                dfcghvioti6uzfghj:dfcghvioti6uzfgh
                7658656:dfcghvioti6uzfghj
                sgagdaskdj:oijvjpi osdcj
                98760798698657:9876079869865
                

                After a click on the Replace All button, 16 s later, it displayed the message “133,334 occurrences were replaced”, so exactly the two lines below, 66,667 times! (With N++ v8.1.9.2 on a Win 10 Pro 64-bit laptop with an SSD.)

                7658656:dfcghvioti6uzfghj
                sgagdaskdj:oijvjpi osdcj
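The line counts in this test can be reproduced outside Notepad++. This is a minimal sketch under stated assumptions: Python's built-in `re` module does not support `(*SKIP)(*FAIL)`, so instead of matching the lines to delete, the sketch keeps every line that matches the "keep" part of the regex, which should give the same outcome. It says nothing about Notepad++ timings. The names `KEEP`, `SAMPLE` and `filter_lines` are hypothetical.

```python
import re

# The "keep" part of the search regex; lines that do not match it are dropped.
KEEP = re.compile(r"^(\w+)\w:\1$")

SAMPLE = [
    "1234567:123456",
    "567890ß:567890",
    "dfcghvioti6uzfghj:dfcghvioti6uzfgh",
    "7658656:dfcghvioti6uzfghj",
    "sgagdaskdj:oijvjpi osdcj",
    "98760798698657:9876079869865",
]


def filter_lines(lines):
    """Return only the lines whose text after ':' is the prefix minus its last character."""
    return [line for line in lines if KEEP.match(line)]


lines = SAMPLE * 66_667           # 400,002 lines, as in the test above
kept = filter_lines(lines)
removed = len(lines) - len(kept)  # number of deleted lines

print(len(lines), len(kept), removed)  # 400002 266668 133334
```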
                

                Best Regards,

                guy038

                P.S. :

                See the definition of the Backtracking Control verbs (*SKIP) and (*FAIL) below :

                https://community.notepad-plus-plus.org/post/55464

                • M PM
                  M P @Neil Schipper
                  last edited by

                  @neil-schipper thank you for this amazing help. It processed very fast!! (1min max)

                  • Neil SchipperN
                    Neil Schipper @M P
                    last edited by Neil Schipper

                    @m-p I’m glad it worked out for you. (I hope you checked that no cases of ab:a were missed!)

                    It’s interesting to me that the solutions of @astrosofista and @guy038 are so much faster than mine. On my machine and with my test data:

                    • my solution’s (book)marking phase (using my most recent regex) took about 58 seconds
                    • @astrosofista’s one-step solution took about 8 seconds
                    • @guy038 's one-step solution took about 4 seconds

                    @guy038, I’ve been spending some time with your amazing backtracking control verbs write-up. I’m having a tough time with it, and I haven’t absorbed much as yet. Maybe I’m not used to thinking (in a correct and disciplined way) about normal backtracking, and this makes it hard to think about modifying it. (Also, the organ between my ears is not quite what it was 30 years ago.) Anyway, I’ll plod along some more and maybe after {3,} readings, something will be absorbed and retained.

                    • Alan KilbornA
                      Alan Kilborn @Neil Schipper
                      last edited by Alan Kilborn

                      @neil-schipper said in deleting specific lines that don't meet a criteria:

                      my solution’s (book)marking phase (using my most recent regex) took about 58 seconds

                      Operations involving bookmarking are often slow with Notepad++. This has been discussed on the forum before, without (I believe) any solution to the problem being found.

                      Some possible references:

                      • https://community.notepad-plus-plus.org/topic/15159/bookmark-multiple-lines
                      • https://community.notepad-plus-plus.org/topic/18900/persistent-highlight-of-characters-e-g-no-break-space
                      • https://github.com/notepad-plus-plus/notepad-plus-plus/issues/8279
                      • Neil SchipperN
                        Neil Schipper @Alan Kilborn
                        last edited by

                        @alan-kilborn Thanks, I’ll look at those. I figured the main contributor to slowness was promiscuous backtracking on lines that don’t meet the mark condition.

                        • Alan KilbornA
                          Alan Kilborn @Neil Schipper
                          last edited by

                          @neil-schipper said in deleting specific lines that don't meet a criteria:

                          I figured the main contributor to slowness was promiscuous backtracking on lines that don’t meet the mark condition

                          That could be. Actually, I wasn’t that absorbed with this issue, so the slowness may well NOT involve the bookmarking action. I just thought I’d point it out for general awareness that there is some sort of performance issue with some bookmarking actions.

                          • Neil SchipperN
                            Neil Schipper @guy038
                            last edited by

                            Hi @astrosofista & @guy038 & @alan-kilborn,

                            OK, I found some closure on this. When re-running my bookmarking regex I noticed activity in the BookmarksDook panel. If I made the panel invisible, the process was still very slow.

                            I also observed that a native bookmarking process like Inverse Bookmark (which has nothing to do with running a regex) was super slow.

                            Then I went to a different Npp instance, this one minimalist (no plugins), and ran the regex against the same data, and the process was blink-of-an-eye. I then doubled the data to 2k lines, and it was still pretty much instant.

                            Then I took Bookmarks@Dook out of the earlier instance (moved plug-in subdir away, restarted): the bookmarking by regex process was instant. Restored Bookmarks@Dook plugin, slow again.

                            So there you have it.

                            @alan-kilborn Those threads deal with bookmark processes driven by PythonScript, not native. The two issues might still be related. None of those discussions made mention of the presence or absence of the plugin.

                            The Community of users of the Notepad++ text editor.