deleting specific lines that don't meet a criteria

  • M
    M P
    last edited by Dec 2, 2021, 10:01 AM

    I got a list with over 400000 lines that looks similar to this:

    1234567:123456
    567890ß:567890
    dfcghvioti6uzfghj:dfcghvioti6uzfgh
    7658656:dfcghvioti6uzfghj
    sgagdaskdj:oijvjpi osdcj
    98760798698657:9876079869865

    I ONLY want to keep the lines that have the same text before and after the “:”, EXCEPT that the last number/letter is missing from the text after the “:”. For example:

    1234567:123456
    567890ß:567890
    dfcghvioti6uzfghj:dfcghvioti6uzfgh

    7658656:dfcghvioti6uzfghj
    sgagdaskdj:oijvjpi osdcj

    98760798698657:9876079869865

    As you can see, the lines marked in bold (the first three and the last one) are the ones I need to keep. The only difference between the first and second word is that the last number/letter is missing from the second word/text.

    How can this be done on a scale of 400000 lines? I'm happy to pay $25 for a solution!

    • N
      Neil Schipper @M P
      last edited by Dec 2, 2021, 12:34 PM

      @m-p Hello. My solution is in the text box below. (I left my failed regexes in for public amusement.) Copy the text into a ‘new’ file tab and try it. I can’t guarantee it on a monster size file, but kindly let us know.

      (\w+)\w:/1 ===> fail
      (\w*?)\w:\1  ===> does match whole lines of interest, but also parts of lines we want to not match
      (\w{2,})\w:\1  ===> still also matches parts of to-reject lines
      ^(\w{2,})\w:\1$  ===> bingo
      
      Recipe:
      1. Select the text of the last regex above ($ is the last char of the regex); invoke 'Mark' (Ctrl+M) (the dialog appears with the regex in the 'Find what' box); enable 'Bookmark line' and 'Regular expression'; move the cursor to the start of the file; execute 'Mark All'
      2. Search -> Bookmark -> Remove unmarked lines
      
      1234567:123456
      123456:123456
      12346:123456
      567890ß:567890
      dfcghvioti6uzfghj:dfcghvioti6uzfgh
      7658656:dfcghvioti6uzfghj
      sgagdaskdj:oijvjpi osdcj
      98760798698657:9876079869865
      
      • N
        Neil Schipper @M P
        last edited by Dec 2, 2021, 1:40 PM

        @m-p You will need to be patient: on a test file of 1000 lines, with 50% of the lines matching the regex, the Mark operation took about 1.25 minutes on my old Intel® Core™ i5 CPU 650 @ 3.20 GHz, and with a portable npp rev that I use to fool with plugins, etc:

        Notepad++ v8.1.9   (32-bit)
        Build time : Oct 21 2021 - 23:32:04
        Path : C:\Users\neils\Downloads\npp\npp.8.1.9.portable\notepad++.exe
        Command Line : -multiInst
        Admin mode : OFF
        Local Conf mode : ON
        Cloud Config : OFF
        OS Name : Windows 10 Enterprise (64-bit) 
        OS Version : 2004
        OS Build : 19041.1348
        Current ANSI codepage : 1252
        Plugins : AnalysePlugin.dll AutoCodepage.dll AutoEolFormat.dll BetterMultiSelection.dll BookmarksDook.dll BracketsCheck.dll ColorPicker.dll CustomLineNumbers.dll ElasticTabstops.dll ExtSettings.dll FileSwitcher.dll FingerText.dll FWDataViz.dll GitSCM.dll GotoLineCol.dll HexEditor.dll LightExplorer.dll linefilter2.dll Linefilter3.dll linesort.dll LocationNavigate.dll MarkdownViewerPlusPlus.dll MenuIcons.dll mimeTools.dll MultiClipboard.dll MusicPlaye_1.0.11x86r.dll NavigateTo.dll NewFileBrowser.dll nppAutoDetectIndent.dll NppCalc.dll NppConverter.dll NppExport.dll NppMarkdownPanel.dll NppMenuSearch.dll NppQCP.dll NppTextViz.dll OpenSelection.dll pork2sausage.dll PythonScript.dll QuickText.dll RegexTrainer.dll SecurePad.dll selectNLaunch.dll VisualStudioLineCopy.dll _CustomizeToolbar.dll 
        
        

        (I forget why I even chose a 32-bit rev.)

        Your 400k lines would take on the order of 500 minutes. (The remove unmarked lines operation was very quick.)

        I also noticed that the Mark dialog’s Find what entry box changed to show blank during the lengthy process. This made me worry that the process had hung up, but fortunately that turned out to not be the case.

        • A
          astrosofista @M P
          last edited by astrosofista Dec 2, 2021, 3:35 PM Dec 2, 2021, 3:34 PM

          @m-p, @Neil-Schipper, all

          As said before, I am also not sure that the regex engine can process 400000 lines in a single run. Despite the warning, I suggest the following regular expression that will delete all lines that do not meet the proposed criteria:

          Search: (?-s)(^(.+?).:\2\R)|.*\R
          Replace: ?1$0:
          

          So, take care to make a backup copy of the file, put the caret at the very beginning of the document, select just the Regular Expression mode and click on Replace All.
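
For anyone who wants to check the logic outside Notepad++, here is a rough Python sketch of the same idea. It is not the Notepad++ replacement itself: Python's `re` module has no `\R` and no `?1...:` conditional in the replacement, so this version assumes LF line endings and uses a callback to emulate the conditional. The function name `keep_matching_lines` is made up for illustration.

```python
import re

# Same shape as the search regex above, with \R swapped for \n
# (assumption: the text uses LF line endings).
PATTERN = re.compile(r"(^(.+?).:\2\n)|.*\n", re.MULTILINE)

def keep_matching_lines(text: str) -> str:
    """Keep only lines whose left side equals the right side plus one extra char."""
    # Group 1 is set only when the 'keep' alternative matched, so the
    # callback plays the role of the ?1$0: conditional replacement.
    return PATTERN.sub(lambda m: m.group(0) if m.group(1) else "", text)

sample = (
    "1234567:123456\n"
    "567890ß:567890\n"
    "dfcghvioti6uzfghj:dfcghvioti6uzfgh\n"
    "7658656:dfcghvioti6uzfghj\n"
    "sgagdaskdj:oijvjpi osdcj\n"
    "98760798698657:9876079869865\n"
)

result = keep_matching_lines(sample)
print(result, end="")
```

As with the regex above, a final line that has no line break after it is never matched by either alternative, so it is left untouched.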

          Stay safe

          • N
            Neil Schipper @astrosofista
            last edited by Neil Schipper Dec 2, 2021, 5:10 PM Dec 2, 2021, 5:09 PM

            @astrosofista Nice solution. It runs much faster than mine. (I do find it mysterious that deleting would be faster than mere marking & bookmarking, but what do I know about the innards of the regex machinery?).

            Also, although I’ve skimmed the topic in the docs, I’ve never actually seen a Substitution Conditional in action, so thanks.

            … and I found a flaw in my solution: it doesn’t match short lines of the form ab:a. The fix is below:

            ^(\w{2,})\w:\1$  ===> bingo ===> oops, prevents xy:x
            ^(\w{1,})\w:\1$  ===> handles that case
            ^(\w+)\w:\1$  ===> ditto but uses more conventional construct than {1,}
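
The difference between the two quantifiers can be checked quickly with Python's `re` module, which handles these particular patterns the same way for the inputs shown. This is only a small sketch; the names `OLD`, `NEW` and `results` are chosen for illustration.

```python
import re

OLD = re.compile(r"^(\w{2,})\w:\1$")  # needs at least 3 chars before the colon
NEW = re.compile(r"^(\w+)\w:\1$")     # needs at least 2 chars before the colon

tests = ["xy:x", "ab:a", "1234567:123456", "123456:123456", "12346:123456"]

# Map each test line to (matched by OLD, matched by NEW).
results = {line: (bool(OLD.search(line)), bool(NEW.search(line))) for line in tests}

for line, (old, new) in results.items():
    print(f"{line!r}: old={old} new={new}")
```

Only the two short lines differ: the `{2,}` version rejects them, the `+` version accepts them, and both versions reject the lines where the right side is not exactly the left side minus its last character.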
            
            • G
              guy038
              last edited by guy038 Dec 4, 2021, 9:54 AM Dec 2, 2021, 5:53 PM

              Hello, @m-p, @neil-schipper, @astrosofista and All,

              Here is a method :

              • Open the Replace dialog ( Ctrl + H )

              • SEARCH (?-s)^(\w+)\w:\1\R(*SKIP)(*FAIL)|.+\R

              • REPLACE Leave EMPTY

              • Tick the Wrap around option

              • Select the Regular expression search mode

              • Click on the Replace All button


              As a test, I duplicated your 6-lines text, below, 66,667 times for a total of 400,002 lines

              1234567:123456
              567890ß:567890
              dfcghvioti6uzfghj:dfcghvioti6uzfgh
              7658656:dfcghvioti6uzfghj
              sgagdaskdj:oijvjpi osdcj
              98760798698657:9876079869865
              

              After a click on the Replace All button, 16 s later, it displayed the message “133,334 occurrences were replaced”, so exactly the two lines below, 66,667 times! (With N++ v8.1.9.2 on a Win 10 Pro 64-bit laptop with an SSD.)

              7658656:dfcghvioti6uzfghj
              sgagdaskdj:oijvjpi osdcj
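
The counts in this test can be reproduced with a short Python sketch. Note that Python's `re` module does not support the `(*SKIP)(*FAIL)` verbs, so this is a plain keep/drop filter using only the "keep" half of the regex, not the backtracking-control technique itself. The variable names are illustrative.

```python
import re

# The "keep" part of the search regex, applied to one line at a time.
KEEP = re.compile(r"^(\w+)\w:\1$")

block = [
    "1234567:123456",
    "567890ß:567890",
    "dfcghvioti6uzfghj:dfcghvioti6uzfgh",
    "7658656:dfcghvioti6uzfghj",
    "sgagdaskdj:oijvjpi osdcj",
    "98760798698657:9876079869865",
]

lines = block * 66_667  # same duplication factor as in the test above
kept = [ln for ln in lines if KEEP.match(ln)]
removed = len(lines) - len(kept)

print(f"total={len(lines)} kept={len(kept)} removed={removed}")
```

Two of the six lines in each block fail the check, which is where the 133,334 removals come from.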
              

              Best Regards,

              guy038

              P.S. :

              See the definition of the Backtracking Control verbs (*SKIP) and (*FAIL) below :

              https://community.notepad-plus-plus.org/post/55464

              • M
                M P @Neil Schipper
                last edited by Dec 3, 2021, 8:42 AM

                @neil-schipper thank you for this amazing help. It processed very fast!! (1min max)

                • N
                  Neil Schipper @M P
                  last edited by Neil Schipper Dec 4, 2021, 7:38 AM Dec 4, 2021, 7:37 AM

                  @m-p I’m glad it worked out for you. (I hope you checked that no cases of ab:a were missed!)

                  It’s interesting to me that the solutions of @astrosofista and @guy038 are so much faster than mine. On my machine and with my test data:

                  • my solution’s (book)marking phase (using my most recent regex) took about 58 seconds
                  • @astrosofista’s one-step solution took about 8 seconds
                  • @guy038’s one-step solution took about 4 seconds

                  @guy038, I’ve been spending some time with your amazing backtracking control verbs write-up. I’m having a tough time with it, and I haven’t absorbed much as yet. Maybe I’m not used to thinking (in a correct and disciplined way) about normal backtracking, and this makes it hard to think about modifying it. (Also, the organ between my ears is not quite what it was 30 years ago.) Anyway, I’ll plod along some more and maybe after {3,} readings, something will be absorbed and retained.

                  • A
                    Alan Kilborn @Neil Schipper
                    last edited by Alan Kilborn Dec 4, 2021, 12:28 PM Dec 4, 2021, 12:26 PM

                    @neil-schipper said in deleting specific lines that don't meet a criteria:

                    my solution’s (book)marking phase (using my most recent regex) took about 58 seconds

                    Operations involving bookmarking are often slow with Notepad++. This has been discussed on the forum before without, I believe, any solution to the problem being found.

                    Some possible references:

                    • https://community.notepad-plus-plus.org/topic/15159/bookmark-multiple-lines
                    • https://community.notepad-plus-plus.org/topic/18900/persistent-highlight-of-characters-e-g-no-break-space
                    • https://github.com/notepad-plus-plus/notepad-plus-plus/issues/8279
                    • N
                      Neil Schipper @Alan Kilborn
                      last edited by Dec 4, 2021, 12:47 PM

                      @alan-kilborn Thanks, I’ll look at those. I figured the main contributor to slowness was promiscuous backtracking on lines that don’t meet the mark condition.

                      • A
                        Alan Kilborn @Neil Schipper
                        last edited by Dec 4, 2021, 1:19 PM

                        @neil-schipper said in deleting specific lines that don't meet a criteria:

                        I figured the main contributor to slowness was promiscuous backtracking on lines that don’t meet the mark condition

                        That could be. Actually, I wasn’t that absorbed with this issue, so the slowness may well NOT involve the bookmarking action. I just thought I’d point it out for general awareness that there is some sort of performance issue with some bookmarking actions.

                        • N
                          Neil Schipper @guy038
                          last edited by Dec 6, 2021, 4:29 PM

                          Hi @astrosofista & @guy038 & @alan-kilborn,

                          OK, I found some closure on this. When re-running my bookmarking regex I noticed activity in the BookmarksDook panel. If I made the panel invisible, the process was still very slow.

                          I also observed that a native bookmarking process like Inverse Bookmark (which has nothing to do with running a regex) was super slow.

                          Then I went to a different Npp instance, this one minimalist (no plugins), and ran the regex against the same data, and the process was blink-of-an-eye. I then doubled the data to 2k lines, and it was still pretty much instant.

                          Then I took Bookmarks@Dook out of the earlier instance (moved plug-in subdir away, restarted): the bookmarking by regex process was instant. Restored Bookmarks@Dook plugin, slow again.

                          So there you have it.

                          @alan-kilborn Those threads deal with bookmark processes driven by PythonScript, not native. The two issues might still be related. None of those discussions made mention of the presence or absence of the plugin.

                          The Community of users of the Notepad++ text editor.