How to delete all lines found in another txt document
-
@DimakSerpg said in How to delete all lines found in another txt document:
why it’s so complicated?
Because it’s essentially trying to recreate a full programming language or database system in something that’s meant for text editing, not database manipulation. I have never heard of a text editor in which “delete all lines found in another txt document” is implemented natively.
If you search the forum, there are also examples of using the PythonScript plugin to programmatically do essentially the same thing.
I’m unfamiliar with notepad
The application is Notepad++, not notepad. There’s a difference (the latter being the simple app that Microsoft has included with Windows for decades, the former being the high-powered text editor that we talk about in this Forum).
and this doesn’t work.
…
So it’s eight symbols now, and for some reason there are 3 dots??
He is this forum’s acknowledged regex guru, but even a guru can sometimes make mistakes or not explain things well (especially when they are communicating technical information in a language other than their native language).
I believe the ... was supposed to indicate that there could be more beyond the initial three equals symbols. And I believe that showing five equals ===== instead of three equals === was just enthusiasm on Guy’s part.
If it helps, think of those instructions as
- At the end of source.txt, add a new line beginning with at least three = equal symbols
- Then append the contents of the delete.txt file after the line you just added
And given the instructions above, the SEARCH line needs to change as well:
- SEARCH (?s-i)^((?-s).+\R)(?=.*^===+\R.*^\1)|^=+\R.+ (it should only have 3 = in a row, not the 4 that Guy originally showed)
So assuming original source.txt:
    this is okay
    delete me
    this was good
    i should be deleted
    fine
and original delete.txt:
    i should be deleted
    delete me
those would be merged into
    this is okay
    delete me
    this was good
    i should be deleted
    fine
    ===
    i should be deleted
    delete me
then running FIND WHAT (?s-i)^((?-s).+\R)(?=.*^===+\R.*^\1)|^=+\R.+ with REPLACE WITH <empty>, SEARCH MODE = regular expression, click REPLACE ALL, I get:
    this is okay
    this was good
    fine
This sequence successfully eliminated the lines from delete.txt that were in source.txt.
As with all search/replace instructions that you get from a forum, I highly recommend having a backup copy of any data before you run a REPLACE ALL that you don’t understand.
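For anyone who wants to try the expression on the example data outside Notepad++ first, here is a minimal sketch in Python. It is an approximation, not the Notepad++ behavior itself: as far as I know, Python's re module has no \R and does not accept the bare (?-s) form, so the sketch assumes \n line endings and rewrites those parts as \n and (?-s:...). The names MERGED and PATTERN are only illustrative.

```python
import re

# Merged buffer from the example: source lines, a === separator, then delete lines.
MERGED = (
    "this is okay\n"
    "delete me\n"
    "this was good\n"
    "i should be deleted\n"
    "fine\n"
    "===\n"
    "i should be deleted\n"
    "delete me\n"
)

# Approximate translation of the Notepad++ expression (assumes \n line endings):
#   first alternative:  a whole line that appears again after the === separator
#   second alternative: the separator line and everything after it
PATTERN = re.compile(
    r"(?s)^((?-s:.+)\n)(?=.*^===+\n.*^\1)|^=+\n.+",
    re.MULTILINE,
)

# Replace every match with nothing, as REPLACE ALL with an empty replacement would.
result = PATTERN.sub("", MERGED)
print(result, end="")
```

Note that every candidate line triggers a lookahead over the rest of the buffer, which is likely why this style of expression gets slow as the file grows.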
-
@PeterJones for some reason it works with your examples.
But when I use the same method with my text, it doesn’t work.
Maybe it’s because my text is big? There are like 4.4 million lines, when source and delete files are merged.
It’s all just numbers. So i want to delete 2 million numbers that are in my source file with 2.4 million numbers.
After i click “replace all” it just deletes everything.
But it works without any problems when i pick like 100 lines. So the problem is in 4.4 million lines.
-
@DimakSerpg said in How to delete all lines found in another txt document:
It’s all just numbers. So i want to delete 2 million numbers that are in my source file with 2.4 million numbers.
After i click “replace all” it just deletes everything.
That’s a different problem than we normally see with big files and such activity. Normally, big files make it so that there’s not enough space in the regex memory, and the regex will thus not run… But Guy’s regex was intended to be immune to long files (and my modification should have been, too), since the capture-memory of the regex should only be one line’s worth.
I’m really surprised that its fallback would be to delete everything. (Well, unless the 2.4M in source.txt aren’t unique, and it just so happens that every line in source.txt is also contained in the 2M lines of delete.txt. It might be worth trying Edit > Line Operations > Remove Duplicate Lines on a copy of source.txt, and seeing if there are still more than 2M lines after the removal; if there are 2M or fewer lines, then it’s entirely possible that every unique line matches a line from delete.txt.)
But it works without any problems when I pick like 100 lines. So the problem is in 4.4 million lines.
If it’s not multiples of the same line in source.txt, then it’s beyond me. Maybe when Guy or one of the other regex greats has a chance, they can come try to give an alternative that will work with your data.
It would help if you could provide a list of like 20 lines of source.txt and 5 lines of delete.txt – you can use fake numbers, if there’s something confidential about the numbers, but they should “look like” real data. Someone that has the time and ability could then take those examples, and make huge datafiles that have lots of numbers that are similar to those examples, and see if they can come up with something that works for deleting 2M lines from 2.4M lines of source.
But I hinted at it before, and will phrase it differently to make it explicit: a text editor is the wrong tool for the job. You are essentially trying to delete a huge number of records from a database – this could probably be done in a database application, and it could be easily done in a few lines of code with a good programming language – but we cannot help you with either database or programming solutions here, because this forum is about Notepad++.
-
@PeterJones said in How to delete all lines found in another txt document:
but even a guru can sometimes make mistakes or not explain things well (especially when they are communicating technical information in a language other than their native language)
I believe the … was supposed to indicate that there could be more beyond the initial three equals symbols.
And I think that the posters receiving information need to actually do some THINKING about what they’re being given…
-
@PeterJones I updated notepad, thought it might help, and now there’s an error.
@Alan-Kilborn said in How to delete all lines found in another txt document:
And I think that the posters receiving information need to actually do some THINKING about what they’re being given…
Uhh… no? It’s pretty simple.
- do this
- then this
- done
I don’t need to know exactly what this command means, I don’t need to learn regex for this. It’s a simple command that would work as it is, but the problem is on my side because of the large text.
-
Hello, @dimakserpg and All,
Could you provide us a small part of your source.txt and delete.txt ( let's say about 50 lines of each ) ?
Try to insert these sections as raw text, using the </> button when writing your post !
I will try to find out a new method, suitable for big files !
Best Regards,
guy038
BTW, in my regex, I used this part ^===+\R which represents a complete line of, at least, 3 equal signs, followed by its line-break
Thus, as long as this line begins with ===, it doesn't matter if more equal signs are written right after !
-
@DimakSerpg said in How to delete all lines found in another txt document:
Uhh… no? It’s pretty simple.
If it were simple, you would’ve figured it out without help.
I don’t need to know exactly what this command means, I don’t need to learn regex for this.
That’s a poor attitude. So essentially you are saying, “I don’t need to learn because I can dupe other people into doing it for free for me”. See how much help you receive if you continue with that attitude in life. I have already given you a working solution for reasonable quantities of data, and given you alternate suggestions of non-Notepad++ ideas that you might want to pursue; after this post, I’ve had my say.
I notice you also didn’t bother showing any example data, like I requested. And now Guy has requested it as well. If you don’t at least put in that much thought and effort, it will be virtually impossible for someone to help you, even if they were willing to look beyond your attitude.
It’s a simple command that would work as it is, but the problem is on my side because of the large text.
It’s not a simple command, but it does work correctly with smaller datasets.
----
Please note: This Community Forum is not a data transformation service; you should not expect to be able to always say “I have data like X and want it to look like Y” and have us do all the work for you. If you are new to the Forum, and new to regular expressions, we will often give help on the first one or two data-transformation questions, especially if they are well-asked and you show a willingness to learn; and we will point you to the documentation where you can learn how to do the data transformations for yourself in the future. But if you repeatedly ask us to do your work for you, you will find that the patience of usually-helpful Community members wears thin. The best way to learn regular expressions is by experimenting with them yourself, and getting a feel for how they work; having us spoon-feed you the answers without you putting in the effort doesn’t help you in the long term and is uninteresting and annoying for us.
-
@guy038 said in How to delete all lines found in another txt document:
Could you provide us a small part of your source.txt and delete.txt ( let's say about 50 lines of each ) ?
@PeterJones said in How to delete all lines found in another txt document:
I notice you also didn’t bother showing any example data, like I requested. And now Guy has requested it as well. If you don’t at least put in that much thought and effort, it will be virtually impossible for someone to help you, even if they were willing to look beyond your attitude.
-
Did you notice the part where I said, “Someone ... could then take those examples, and make huge datafiles” – I wasn’t claiming that they would use just the small example; I was saying they needed that small example as a starting point, to try to replicate the problem with the original regex and try to solve it using the extended data.
I don’t understand why you are unwilling to provide even that much. Guy has said he’s willing to help you, and all you have to do to receive that help is to provide example data that he can start from. If you choose not to share a small amount of example data, I think even Guy’s willingness to help you will not be able to overcome your lack of effort.
-
REGEX IN NOTEPAD++ IS THE WRONG TOOL FOR THIS JOB!
I created three sets of files:
- 100,000 7-digit numbers in each, where it will delete about 1/3 of the ones from source.txt
- 1,000,000 7-digit numbers in each, where it will delete about 1/2 of the ones from source.txt
- 10,000,000 9-digit numbers in each, where it will delete about 1/3 of the ones from source.txt
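Test data of this shape can be produced with a short script. The following is only a sketch of one way to do it, not the method actually used for the files above: the function name make_test_pair, the fixed seed, and the choice to make the overlap exact and the source numbers unique are all assumptions.

```python
import random


def make_test_pair(n, digits, overlap_fraction, seed=0):
    """Build two lists of n numbers (as strings) with a known overlap.

    Hypothetical helper: the source numbers are unique, and exactly
    round(n * overlap_fraction) of them also appear in the delete list.
    """
    rng = random.Random(seed)
    lo, hi = 10 ** (digits - 1), 10 ** digits   # e.g. 1000000..9999999 for 7 digits
    k = round(n * overlap_fraction)             # lines that should get deleted
    # Draw all distinct numbers at once: n for source, n - k extras for delete.
    pool = rng.sample(range(lo, hi), n + (n - k))
    source = pool[:n]
    delete = rng.sample(source, k) + pool[n:]   # k shared numbers plus the extras
    rng.shuffle(delete)
    return [str(x) for x in source], [str(x) for x in delete]


# Small demonstration; larger values of n follow the same pattern.
src, dele = make_test_pair(n=20, digits=7, overlap_fraction=0.5)
print(len(src), len(dele), len(set(src) & set(dele)))
```

Each list can then be written out one number per line, for example with "\n".join(src), to whatever file names are convenient.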
I started
notepad++ -nosession -multiInst -noPlugin src1e5.txt del1e5.txt
running on the regex for the smallest of those.
Then in another Notepad++ session, I spent about 10 minutes coding up a script in Perl, and made sure it worked on the 100,000 line file in under a second. It then worked on the 1,000,000 line file in about 4 seconds. And then it processed the 10,000,000 line file in 4 minutes.
I then wrote up this post. By the time I was done with that, it still hadn’t finished running the regex in Notepad++.
IyFwZXJsDQp1c2UgNS4wMTI7DQp1c2Ugd2FybmluZ3M7DQp1c2Ugc3RyaWN0Ow0KdXNlIFRpbWU6OkhpUmVzIHF3L3RpbWUvOw0KDQpwcmludCBTVERFUlIgc2NhbGFyIHRpbWUsICJcbiI7DQpteSBAc3JjID0gZG8geyBvcGVuIG15ICRmaCwgJzwnLCAnc3JjMWU3LnR4dCc7IDwkZmg+IH07DQpteSBAZGVsID0gZG8geyBvcGVuIG15ICRmaCwgJzwnLCAnZGVsMWU3LnR4dCc7IDwkZmg+IH07DQpteSAlaDsgQGh7QGRlbH0gPSBAZGVsOw0Kb3BlbiBteSAkZmgsICc+JywgJ291dDFlNy50eHQnOw0Kc2VsZWN0ICRmaDsNCiRcID0gIiI7DQpwcmludCBmb3IgZ3JlcCB7IWV4aXN0cyAkaHskX319IEBzcmM7DQpwcmludCBTVERFUlIgc2NhbGFyIHRpbWUsICJcbiI7DQo
If you can figure out how to decode that text box using Notepad++, and run a perl script (not in Notepad++), it’s yours, for free, no tech support provided. Good luck,
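For readers who want the general idea without decoding anything: loading the delete list into a hash-based lookup and keeping only the source lines not found in it fits in a few lines of most languages. Below is a minimal Python sketch of that idea. It is not a transcription of the script above, and the function names, the UTF-8 encoding, and the file names mentioned afterwards are placeholders.

```python
def remove_listed_lines(source_lines, delete_lines):
    """Return the source lines that do not appear in delete_lines.

    Lines are compared without their trailing line break, so a final
    line that lacks a newline still matches.
    """
    unwanted = {line.rstrip("\r\n") for line in delete_lines}
    return [line for line in source_lines if line.rstrip("\r\n") not in unwanted]


def filter_file(source_path, delete_path, output_path):
    """File-based wrapper; paths are chosen by the caller, UTF-8 is an assumption."""
    with open(delete_path, encoding="utf-8") as f:
        delete_lines = f.readlines()
    with open(source_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as out:
        out.writelines(remove_listed_lines(src, delete_lines))


# In-memory demonstration using the small example from earlier in the thread.
sample_source = ["this is okay\n", "delete me\n", "this was good\n",
                 "i should be deleted\n", "fine\n"]
sample_delete = ["i should be deleted\n", "delete me"]
kept = remove_listed_lines(sample_source, sample_delete)
print("".join(kept), end="")
```

Calling filter_file("source.txt", "delete.txt", "out.txt") would write the surviving lines to a new file and leave the originals untouched. Set membership is a constant-time lookup on average, so the work grows with the total number of lines rather than with every line rescanning the rest of the file.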