deleting specific lines that don't meet a criteria
-
@m-p Hello. My solution is in the text box below. (I left my failed regexes in for public amusement.) Copy the text into a ‘new’ file tab and try it. I can’t guarantee it on a monster size file, but kindly let us know.
(\w+)\w:/1 ===> fail
(\w*?)\w:\1 ===> does match whole lines of interest, but also parts of lines we want to not match
(\w{2,})\w:\1 ===> still also matches parts of to-reject lines
^(\w{2,})\w:\1$ ===> bingo

Recipe:
1. select text of the last regex above ($ is last char of regex); invoke 'Mark' (Ctl-m) (dialog appears, regex appears in the 'Find what' box); enable 'Bookmark line' and Reg exprn; in the file move cursor to home position; execute 'Mark All'
2. Search -> Bookmark -> Remove unmarked lines

1234567:123456
123456:123456
12346:123456
567890ß:567890
dfcghvioti6uzfghj:dfcghvioti6uzfgh
7658656:dfcghvioti6uzfghj
sgagdaskdj:oijvjpi osdcj
98760798698657:9876079869865
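For anyone who wants to sanity-check the regex outside Notepad++, here is a minimal Python sketch of what the recipe amounts to: keep only the lines that match the final regex. It assumes Python's `re` treats `\w` and backreferences the same way on this data; the names `KEEP`, `SAMPLE` and `keep_matching_lines` are made up for illustration and are not part of Notepad++.

```python
import re

# Final regex from the post: left side = right side plus one extra word character.
KEEP = re.compile(r"^(\w{2,})\w:\1$")

# The sample lines from the post (illustrative test data).
SAMPLE = """1234567:123456
123456:123456
12346:123456
567890ß:567890
dfcghvioti6uzfghj:dfcghvioti6uzfgh
7658656:dfcghvioti6uzfghj
sgagdaskdj:oijvjpi osdcj
98760798698657:9876079869865
"""


def keep_matching_lines(text: str) -> list:
    """Rough equivalent of 'Mark All' with bookmarks, then 'Remove unmarked lines'."""
    return [line for line in text.splitlines() if KEEP.search(line)]


kept = keep_matching_lines(SAMPLE)
print("\n".join(kept))
```

On the sample, this keeps the four lines whose left side is the right side plus one trailing character.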
-
@m-p You will need to be patient: on a test file of 1000 lines, with 50% of the lines matching the regex, the Mark operation took about 1.25 minutes on my old Intel® Core™ i5 CPU 650 @ 3.20 GHz, and with a portable npp rev that I use to fool with plugins, etc:
Notepad++ v8.1.9 (32-bit)
Build time : Oct 21 2021 - 23:32:04
Path : C:\Users\neils\Downloads\npp\npp.8.1.9.portable\notepad++.exe
Command Line : -multiInst
Admin mode : OFF
Local Conf mode : ON
Cloud Config : OFF
OS Name : Windows 10 Enterprise (64-bit)
OS Version : 2004
OS Build : 19041.1348
Current ANSI codepage : 1252
Plugins : AnalysePlugin.dll AutoCodepage.dll AutoEolFormat.dll BetterMultiSelection.dll BookmarksDook.dll BracketsCheck.dll ColorPicker.dll CustomLineNumbers.dll ElasticTabstops.dll ExtSettings.dll FileSwitcher.dll FingerText.dll FWDataViz.dll GitSCM.dll GotoLineCol.dll HexEditor.dll LightExplorer.dll linefilter2.dll Linefilter3.dll linesort.dll LocationNavigate.dll MarkdownViewerPlusPlus.dll MenuIcons.dll mimeTools.dll MultiClipboard.dll MusicPlaye_1.0.11x86r.dll NavigateTo.dll NewFileBrowser.dll nppAutoDetectIndent.dll NppCalc.dll NppConverter.dll NppExport.dll NppMarkdownPanel.dll NppMenuSearch.dll NppQCP.dll NppTextViz.dll OpenSelection.dll pork2sausage.dll PythonScript.dll QuickText.dll RegexTrainer.dll SecurePad.dll selectNLaunch.dll VisualStudioLineCopy.dll _CustomizeToolbar.dll
(I forget why I even chose a 32-bit rev.)
Your 400k lines would take on the order of 500 minutes. (The remove unmarked lines operation was very quick.)
I also noticed that the Mark dialog’s Find what entry box changed to show blank during the lengthy process. This made me worry that the process had hung up, but fortunately that turned out to not be the case.
-
@m-p, @Neil-Schipper, all
As said before, I am also not sure that the regex engine can process 400000 lines in a single run. Despite the warning, I suggest the following regular expression that will delete all lines that do not meet the proposed criteria:
Search: (?-s)(^(.+?).:\2\R)|.*\R
Replace: ?1$0:
So, take care to make a backup copy of the file, put the caret at the very beginning of the document, select just the Regular Expression mode, and click on Replace All.
Stay safe
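For readers who have not seen a substitution conditional before, the idea can be sketched in Python. The first alternative captures a whole line that meets the criteria in group 1; the second alternative matches any other line. The replacement `?1$0:` puts the match back when group 1 took part and puts nothing otherwise. Python's `re` has neither `\R` nor conditional replacements, so this sketch assumes `\n` line endings and uses a replacement function instead; `PATTERN` and `delete_non_matching` are illustrative names only.

```python
import re

# Same two alternatives as the Search regex, with \R swapped for \n (assumption: LF line endings).
# In Python, '.' already excludes newlines by default, which plays the role of (?-s).
PATTERN = re.compile(r"(^(.+?).:\2\n)|.*\n", re.MULTILINE)


def delete_non_matching(text: str) -> str:
    """Emulate the '?1$0:' replacement: keep the match if group 1 took part, else drop it."""
    return PATTERN.sub(lambda m: m.group(0) if m.group(1) is not None else "", text)


sample = (
    "1234567:123456\n"
    "7658656:dfcghvioti6uzfghj\n"
    "sgagdaskdj:oijvjpi osdcj\n"
    "98760798698657:9876079869865\n"
)
print(delete_non_matching(sample), end="")
```

As in Notepad++, a final line without a line ending is matched by neither alternative and is left untouched.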
-
@astrosofista Nice solution. It runs much faster than mine. (I do find it mysterious that deleting would be faster than mere marking & bookmarking, but what do I know about the innards of the regex machinery?).
Also, although I’ve skimmed the topic in the docs, I’ve never actually seen a Substitution Conditional in action, so thanks.
… and I found a flaw in my solution: it doesn’t match short lines of the form ab:a. Fix is below:
^(\w{2,})\w:\1$ ===> bingo ===> oops, prevents xy:x
^(\w{1,})\w:\1$ ===> handles that case
^(\w+)\w:\1$ ===> ditto but uses more conventional construct than {1,}
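A quick way to see the difference between the two quantifiers is to run both versions against a few short lines. This is only a sketch using Python's `re`, on the assumption that it behaves like the Notepad++ engine for these simple patterns; `OLD` and `NEW` are names chosen here for illustration.

```python
import re

OLD = re.compile(r"^(\w{2,})\w:\1$")  # group needs at least 2 chars, so the left side needs 3
NEW = re.compile(r"^(\w+)\w:\1$")     # group needs at least 1 char, so the left side needs 2

for line in ("ab:a", "xy:x", "1234567:123456", "123456:123456"):
    # Print whether each version of the regex accepts the line.
    print(line, "old:", bool(OLD.search(line)), "new:", bool(NEW.search(line)))
```

Only the short `ab:a` and `xy:x` cases change; longer lines behave the same under both versions.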
-
Hello, @m-p, @neil-schipper, @astrosofista and All,
Here is a method :
- Open the Replace dialog (Ctrl + H)
- SEARCH (?-s)^(\w+)\w:\1\R(*SKIP)(*FAIL)|.+\R
- REPLACE Leave EMPTY
- Tick the Wrap around option
- Select the Regular expression search mode
- Click on the Replace All button
As a test, I duplicated your 6-line text, below, 66,667 times for a total of 400,002 lines:
1234567:123456
567890ß:567890
dfcghvioti6uzfghj:dfcghvioti6uzfgh
7658656:dfcghvioti6uzfghj
sgagdaskdj:oijvjpi osdcj
98760798698657:9876079869865
After a click on the Replace All button, 16 s later, it displayed the message “133,334 occurrences were replaced”, so exactly the two lines below, 66,667 times! (With N++ v8.1.9.2 on a Win 10 Pro 64-bit laptop with an SSD.)
7658656:dfcghvioti6uzfghj
sgagdaskdj:oijvjpi osdcj
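The counts in this test can be reproduced outside Notepad++ with a rough Python sketch. Python's `re` module has no backtracking control verbs, so the sketch swaps in a negative lookahead to get a similar "delete every line that does not meet the criteria" effect; it is not the same mechanism as (*SKIP)(*FAIL). It also assumes `\n` line endings, and `BLOCK`, `DELETE` and `run_test` are hypothetical names.

```python
import re

# The 6-line sample text from the test.
BLOCK = (
    "1234567:123456\n"
    "567890ß:567890\n"
    "dfcghvioti6uzfghj:dfcghvioti6uzfgh\n"
    "7658656:dfcghvioti6uzfghj\n"
    "sgagdaskdj:oijvjpi osdcj\n"
    "98760798698657:9876079869865\n"
)

# Match a whole non-empty line only if it does NOT have the form <x><one char>:<x>.
DELETE = re.compile(r"^(?!(\w+)\w:\1$).+\n", re.MULTILINE)


def run_test(copies: int):
    """Duplicate the block, delete the non-conforming lines, and report the counts."""
    text = BLOCK * copies
    total_lines = text.count("\n")
    remaining, deleted = DELETE.subn("", text)
    return total_lines, deleted, remaining


total, deleted, remaining = run_test(66_667)
print(total, "lines in total,", deleted, "lines deleted")
```

With 66,667 copies, the totals agree with the figures reported above: 400,002 lines, of which 133,334 are deleted.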
Best Regards,
guy038
P.S. :
See the definition of the Backtracking Control verbs (*SKIP) and (*FAIL) below :
-
@neil-schipper thank you for this amazing help. It processed very fast!! (1min max)
-
@m-p I’m glad it worked out for you. (I hope you checked that no cases of ab:a were missed!)
It’s interesting to me that the solutions of @astrosofista and @guy038 are so much faster than mine. On my machine and with my test data:
- my solution’s (book)marking phase (using my most recent regex) took about 58 seconds
- @astrosofista’s one-step solution took about 8 seconds
- @guy038’s one-step solution took about 4 seconds
@guy038, I’ve been spending some time with your amazing backtracking control verbs write-up. I’m having a tough time with it, and I haven’t absorbed much as yet. Maybe I’m not used to thinking (in a correct and disciplined way) about normal backtracking, and this makes it hard to think about modifying it. (Also, the organ between my ears is not quite what it was 30 years ago.) Anyway, I’ll plod along some more and maybe after {3,} readings, something will be absorbed and retained.
-
@neil-schipper said in deleting specific lines that don't meet a criteria:
my solution’s (book)marking phase (using my most recent regex) took about 58 seconds
Operations involving bookmarking are often slow with Notepad++. This has been discussed on the forum before, without (I believe) any solution to the problem being found.
Some possible references:
-
@alan-kilborn Thanks, I’ll look at those. I figured the main contributor to slowness was promiscuous backtracking on lines that don’t meet the mark condition.
-
@neil-schipper said in deleting specific lines that don't meet a criteria:
I figured the main contributor to slowness was promiscuous backtracking on lines that don’t meet the mark condition
That could be. Actually, I wasn’t that absorbed with this issue, so the slowness may well NOT involve the bookmarking action. I just thought I’d point it out for general awareness that there is some sort of performance issue with some bookmarking actions.
-
Hi @astrosofista & @guy038 & @alan-kilborn,
OK, I found some closure on this. When re-running my bookmarking regex I noticed activity in the Bookmarks@Dook panel. Even when I made the panel invisible, the process was still very slow.
I also observed that a native bookmarking process like Inverse Bookmark (which has nothing to do with running a regex) was super slow.
Then I went to a different Npp instance, this one minimalist (no plugins), and ran the regex against the same data, and the process was blink-of-an-eye. I then doubled the data to 2k lines, and it was still pretty much instant.
Then I took Bookmarks@Dook out of the earlier instance (moved plug-in subdir away, restarted): the bookmarking by regex process was instant. Restored Bookmarks@Dook plugin, slow again.
So there you have it.
@alan-kilborn Those threads deal with bookmark processes driven by PythonScript, not native. The two issues might still be related. None of those discussions made mention of the presence or absence of the plugin.