Use Text File to Remove Lines?
-
Hi. This is sorta hard to explain, but here goes.
I have a big text file filled with a lot of ID’s. Is it possible to use the text file to remove lines?
In other words is it possible to use it like a database of lines to remove? -
@mrmagnum8841 said in Use Text File to Remove Lines?:
In other words is it possible to use it like a database of lines to remove?
Kind of. I am going to assume you have two files: the huge file that has lines you want to remove (`database.txt`), and another smaller file with a list of IDs that will indicate which lines to delete (`list.txt`).

Most important, back up both files; I recommend working from a copy rather than the original.
Example `list.txt`:

    ID005
    ID007
    ID011
    ID013

Example `database.txt`:

    This has many things, ID001, yada
    This has many things, ID002, yada
    This has many things, ID003, yada
    This has many things, ID004, yada
    This has many things, ID005, yada
    This has many things, ID006, yada
    This has many things, ID007, yada
    This has many things, ID008, yada
    This has many things, ID009, yada
    This has many things, ID010, yada
    This has many things, ID011, yada
    This has many things, ID012, yada
    This has many things, ID013, yada
    This has many things, ID014, yada
    This has many things, ID015, yada
    This has many things, ID016, yada

Desired result (5, 7, 11, 13 deleted):

    This has many things, ID001, yada
    This has many things, ID002, yada
    This has many things, ID003, yada
    This has many things, ID004, yada
    This has many things, ID006, yada
    This has many things, ID008, yada
    This has many things, ID009, yada
    This has many things, ID010, yada
    This has many things, ID012, yada
    This has many things, ID014, yada
    This has many things, ID015, yada
    This has many things, ID016, yada
1. In `list.txt`, `Ctrl+A`, `Ctrl+J`: this joins everything into one long line, space-separated
2. Search > Replace (or `Ctrl+H`): goal is to make the list `|`-separated
   - FIND = `\h+`
   - REPLACE = `|`
   - MODE = regular expression
   - Replace All
3. Another replacement: goal is to get the lines down to less than 1000 characters each (assuming no ID is greater than 50 characters)
   - FIND = `(?-s)^(.{950,}?)\|`
   - REPLACE = `$1\r\n`
   - MODE = regular expression
   - Replace All
   - You don’t need this step 3 if your line is less than 1000 characters after step 2.
4. Next replacement: goal is to make each line look like `(ID#|ID#|...|ID#)`, with a bit of stuff at the beginning and end
   - FIND = `(?-s)^.*$`
   - REPLACE = `\(?-s\)^.*\($0\).*\\R?`
   - MODE = regular expression
   - Replace All
5. Now, for each line in `list.txt`:
   - Copy the line from `list.txt`
   - Switch to the `database.txt` window
   - Search > Replace (or `Ctrl+H`)
     - paste the line into the FIND box, so it looks like FIND = `(?-s)^.*(ID#|ID#|...|ID#).*\R?`
     - make the REPLACE box empty
     - MODE = regular expression
     - Replace All
   - Repeat as necessary for each line from `list.txt`
-----
caveat emptor
This sequence seemed to work for me, based on my understanding of your issue, and is published here to help you learn how to do this. I make no guarantees or warranties as to the functionality for you. You are responsible to save and backup all data before and after running this sequence. If you want to use it long term, I recommend investing time in adding error checking and verifying with edge cases.
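For anyone who would rather script the same idea than click through the dialog, here is a minimal Python sketch of the chunked-alternation approach described above. It is an illustration under assumptions, not part of the Notepad++ procedure: the function names `chunk_ids` and `remove_lines` are made up for this example, `max_len` simply mirrors the ~1000-character limit, and since Python's `re` module has no `\R`, an optional `\r?\n` stands in for the line break.

```python
import re

def chunk_ids(ids, max_len=1000):
    """Join IDs with '|' into alternations of at most max_len characters each."""
    chunks, current, length = [], [], 0
    for raw in ids:
        token = re.escape(raw)                    # IDs are treated as literal text
        extra = len(token) + (1 if current else 0)
        if current and length + extra > max_len:
            chunks.append("|".join(current))
            current, length = [], 0
            extra = len(token)
        current.append(token)
        length += extra
    if current:
        chunks.append("|".join(current))
    return chunks

def remove_lines(database_text, ids, max_len=1000):
    """Delete every line that contains one of the IDs, one alternation chunk at a time."""
    for alternation in chunk_ids(ids, max_len):
        pattern = re.compile(r"^.*(?:" + alternation + r").*(?:\r?\n)?", re.MULTILINE)
        database_text = pattern.sub("", database_text)
    return database_text

# Dummy data matching the example above
list_text = "ID005\nID007\nID011\nID013\n"
database_text = "".join(f"This has many things, ID{n:03d}, yada\n" for n in range(1, 17))

result = remove_lines(database_text, list_text.split())
print(result)
```

As with the regex in the steps above, this matches an ID anywhere in the line, so an ID that is a prefix of a longer ID would also match the longer one.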
-
Thank you. However I’ve already seen something similar and the only real issue with it is the length.
As of right now, I have over 23738 lines with 32 characters being the length of each line and more and more get added. -
Hello, @mrmagnum8841, @peterjones and All,
Hi Peter, we have already seen this type of request, many times, on our forum !
So @mrmagnum8841, here is the road map :
- Open a new N++ tab
- First, paste the contents of the `database.txt` file in that new tab
- Secondly, add a line containing at least 3 equal signs (`===`)
- Thirdly, append the contents of the `list.txt` file
- Now open the Replace dialog (`Ctrl + H`)
  - SEARCH `(?s-i)^(?-s:.*(\w+).*\R)(?=.*^===+.+?^\1$)|^===.+`
  - REPLACE `Leave EMPTY`
  - Tick the `Wrap around` option, if necessary
  - Select the `Regular expression` search mode
  - Click on the `Replace All` button

Voila, that’s all !

Notes :

- Globally, this regex checks whether a word of the current line is also present, with the same case, in the 2nd part of the file, after the line of equal signs `=====`, as the nearest complete line
- If so, all the current line contents, with its line-break, are selected and, as the replacement zone is empty, this line is just deleted
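The logic described in these notes can also be sketched outside of Notepad++. The Python below is only an illustration with assumed details: the function name `remove_listed` is hypothetical, list entries are assumed to sit one per line after the `===` separator, and words are compared as whole `\w+` tokens through a set lookup instead of a regex lookahead.

```python
import re

def remove_listed(combined_text):
    """Split at the first line of 3+ '=' signs, then drop every line of the first
    part that contains a word listed (one per line) in the second part."""
    separator = re.search(r"^={3,}.*\n?", combined_text, flags=re.MULTILINE)
    if separator is None:
        return combined_text                      # no separator line: nothing to do
    database_part = combined_text[:separator.start()]
    list_part = combined_text[separator.end():]
    listed = {entry.strip() for entry in list_part.splitlines() if entry.strip()}
    kept = [
        line
        for line in database_part.splitlines(keepends=True)
        if not listed.intersection(re.findall(r"\w+", line))   # case-sensitive match
    ]
    return "".join(kept)

combined = (
    "This has many things, ID001, yada\n"
    "This has many things, ID002, yada\n"
    "===\n"
    "ID002\n"
)
cleaned = remove_listed(combined)
print(cleaned)
```

Because the lookup goes through a set, the cost per line does not depend on how far away the list is, which is the part the lookahead-based regex has to scan for.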
Best Regards,
guy038
-
-
@guy038 said,
Hi Peter, we have already seen this type of request,
I know. I just couldn’t quickly find any of them to link. Unfortunately, you have so many excellent regex posts on this forum, but I could never bookmark enough of them in an organized fashion to be able to always find the one I am thinking of for any given future reply. :-) I tried my best to recreate it from memory (not knowing if you were going to be answering over the weekend or not), but I forgot your trick for doing it in the same file rather than in 1000-character chunks. I guess I should have just waited for you to reply. ;-)
As of right now, I have over 23738 lines with 32 characters being the length of each line and more and more get added.
You’ve now told us you have 23738 lines of data, which is helpful information; but you haven’t told us how many IDs there are to delete from those 23738 lines (is it a few IDs, a few hundred IDs, a few thousand IDs, more than half the IDs)?
Using the nomenclature from my first reply: with my solution, it shouldn’t be a problem if `database.txt` has 23k lines; it became an issue if `list.txt` had more than ~1000 characters in the list of IDs, which is why my procedure split the list into multiple groups of IDs if there were too many characters in your `list.txt`.

That’s one of the benefits of @guy038’s solution: it can take as many IDs as you want to delete from the main `database.txt`, and it will still work. Unfortunately, historically, if there are too many characters in the lookahead expression, Notepad++ occasionally gives up and just selects everything, which would have the unfortunate side effect of deleting everything.

If you try @guy038’s solution, and if it deletes everything or otherwise deletes too much, let us know, and we will try to help you through. But if that happens, it will help us to help you if you could give us a better representation of what you have (for example, whether you really have two files); you are going to have to stop making us guess how your data is structured. I assumed two files: one as a `database.txt` that had the main data you wanted to process, and a second `list.txt` that just listed the IDs you wanted to delete from `database.txt`. But that was just an assumption, which you have neither confirmed nor denied. So if you need more help from us, you will need to provide more information, including dummy data. You can use the `</>` button on the post toolbar to format data as text (like I did in my original reply); give us an excerpt (not the full 23000 lines) of your data. If there is sensitive/secret information, just make up a handful of lines of dummy data that looks similar but with fake names, numbers, etc.; and give us an example of the IDs you’d like to delete from your dummy data.
-
Hello, @mrmagnum8841, @peterjones and All,
I re-tested my regex S/R with a substantial number of lines and I must say that this S/R, proposed in my previous post, failed miserably, even with very little data :-(((

Assuming that each line of the `database.txt` file contains `32` characters, it does not work with more than `160` lines :-(( A pity !

Even this modified S/R, where I use delimiters to better catch the identifier `ID###` :

SEARCH `(?s-i)^(?-s:.*,\x20(\w+),.*\R)(?=.*^===+.+?^\1$)|^===.+`

with `,\x20` as the start delimiter and `,` as the end delimiter, can support about `2,200` lines but not more :-(

I also used this other version, without the line delimiter `=======` :

SEARCH `(?-si)^.*,\x20(\w+),.*\R(?=(?s).+?^\1$)|^(ID...\R)+`

But, though the regex seems simpler, the result is worse, as it can only handle about `1,850` lines !

And, anyway, all of these regex S/R end up selecting all the file contents, which is, obviously, not the desired goal !

Finally, it seems that @peterjones’s solution is the most efficient ! The only drawback of his method is when the `list.txt` file contains too many identifiers OR when a lot of these identifiers do not exist in the `database.txt` file ! In this latter case, this leads to a resulting regex `(?-s)^.*(ID#|ID#|...|ID#).*\R?` containing too many useless `ID#` alternatives !

So, here is my new attempt :
- It should support large `database.txt` and `list.txt` files
- The contents of the `list.txt` file may refer to identifiers whose lines, in the `database.txt` file, have to be deleted or, on the contrary, have to be retained !
- Of course, it does not alter the initial order of the lines of the `database.txt` file
- It minimizes the number of alternatives of the final regex, created in order to process the `database.txt` contents
Here is the road map, assuming the following examples of :

- The initial `list.txt` file (`12` lines, not sorted) :

    ID011
    ID000
    ID005
    ID037
    ID008
    ID013
    ID024
    ID043
    ID003
    ID026
    ID028
    ID016

- The initial `database.txt` file (`50` lines of `32` chars, not sorted) :

    This a simple test, ID007, ABCDE
    This a simple test, ID024, ABCDE
    This a simple test, ID011, ABCDE
    This a simple test, ID001, ABCDE
    This a simple test, ID002, ABCDE
    This a simple test, ID004, ABCDE
    This a simple test, ID005, ABCDE
    This a simple test, ID006, ABCDE
    This a simple test, ID007, ABCDE
    This a simple test, ID007, ABCDE
    This a simple test, ID024, ABCDE
    This a simple test, ID009, ABCDE
    This a simple test, ID010, ABCDE
    This a simple test, ID024, ABCDE
    This a simple test, ID011, ABCDE
    This a simple test, ID007, ABCDE
    This a simple test, ID011, ABCDE
    This a simple test, ID003, ABCDE
    This a simple test, ID012, ABCDE
    This a simple test, ID013, ABCDE
    This a simple test, ID014, ABCDE
    This a simple test, ID017, ABCDE
    This a simple test, ID011, ABCDE
    This a simple test, ID018, ABCDE
    This a simple test, ID019, ABCDE
    This a simple test, ID020, ABCDE
    This a simple test, ID021, ABCDE
    This a simple test, ID022, ABCDE
    This a simple test, ID023, ABCDE
    This a simple test, ID012, ABCDE
    This a simple test, ID007, ABCDE
    This a simple test, ID023, ABCDE
    This a simple test, ID024, ABCDE
    This a simple test, ID024, ABCDE
    This a simple test, ID011, ABCDE
    This a simple test, ID025, ABCDE
    This a simple test, ID024, ABCDE
    This a simple test, ID027, ABCDE
    This a simple test, ID023, ABCDE
    This a simple test, ID023, ABCDE
    This a simple test, ID028, ABCDE
    This a simple test, ID029, ABCDE
    This a simple test, ID012, ABCDE
    This a simple test, ID029, ABCDE
    This a simple test, ID029, ABCDE
    This a simple test, ID029, ABCDE
    This a simple test, ID030, ABCDE
    This a simple test, ID007, ABCDE
    This a simple test, ID003, ABCDE
    This a simple test, ID024, ABCDE
- First, copy the `database.txt` contents as, let’s say, the file `dummy.txt`
- Now, open the `dummy.txt` file
- Perform the regex S/R :
  - SEARCH `(?-s)^.*(ID\d\d\d).*\R`
  - REPLACE `\1\t$0`

=> We copy the identifier at the beginning of all lines, followed with a `tab` separator, and get :

    ID007	This a simple test, ID007, ABCDE
    ID024	This a simple test, ID024, ABCDE
    ID011	This a simple test, ID011, ABCDE
    ID001	This a simple test, ID001, ABCDE
    ID002	This a simple test, ID002, ABCDE
    ID004	This a simple test, ID004, ABCDE
    ID005	This a simple test, ID005, ABCDE
    ID006	This a simple test, ID006, ABCDE
    ID007	This a simple test, ID007, ABCDE
    ID007	This a simple test, ID007, ABCDE
    ID024	This a simple test, ID024, ABCDE
    ID009	This a simple test, ID009, ABCDE
    ID010	This a simple test, ID010, ABCDE
    ID024	This a simple test, ID024, ABCDE
    ID011	This a simple test, ID011, ABCDE
    ID007	This a simple test, ID007, ABCDE
    ID011	This a simple test, ID011, ABCDE
    ID003	This a simple test, ID003, ABCDE
    ID012	This a simple test, ID012, ABCDE
    ID013	This a simple test, ID013, ABCDE
    ID014	This a simple test, ID014, ABCDE
    ID017	This a simple test, ID017, ABCDE
    ID011	This a simple test, ID011, ABCDE
    ID018	This a simple test, ID018, ABCDE
    ID019	This a simple test, ID019, ABCDE
    ID020	This a simple test, ID020, ABCDE
    ID021	This a simple test, ID021, ABCDE
    ID022	This a simple test, ID022, ABCDE
    ID023	This a simple test, ID023, ABCDE
    ID012	This a simple test, ID012, ABCDE
    ID007	This a simple test, ID007, ABCDE
    ID023	This a simple test, ID023, ABCDE
    ID024	This a simple test, ID024, ABCDE
    ID024	This a simple test, ID024, ABCDE
    ID011	This a simple test, ID011, ABCDE
    ID025	This a simple test, ID025, ABCDE
    ID024	This a simple test, ID024, ABCDE
    ID027	This a simple test, ID027, ABCDE
    ID023	This a simple test, ID023, ABCDE
    ID023	This a simple test, ID023, ABCDE
    ID028	This a simple test, ID028, ABCDE
    ID029	This a simple test, ID029, ABCDE
    ID012	This a simple test, ID012, ABCDE
    ID029	This a simple test, ID029, ABCDE
    ID029	This a simple test, ID029, ABCDE
    ID029	This a simple test, ID029, ABCDE
    ID030	This a simple test, ID030, ABCDE
    ID007	This a simple test, ID007, ABCDE
    ID003	This a simple test, ID003, ABCDE
    ID024	This a simple test, ID024, ABCDE
- Append the `list.txt` contents at the end of the `dummy.txt` file, giving these `62` lines, below :

    ID007	This a simple test, ID007, ABCDE
    ID024	This a simple test, ID024, ABCDE
    ID011	This a simple test, ID011, ABCDE
    ID001	This a simple test, ID001, ABCDE
    ID002	This a simple test, ID002, ABCDE
    ID004	This a simple test, ID004, ABCDE
    ID005	This a simple test, ID005, ABCDE
    ID006	This a simple test, ID006, ABCDE
    ID007	This a simple test, ID007, ABCDE
    ID007	This a simple test, ID007, ABCDE
    ID024	This a simple test, ID024, ABCDE
    ID009	This a simple test, ID009, ABCDE
    ID010	This a simple test, ID010, ABCDE
    ID024	This a simple test, ID024, ABCDE
    ID011	This a simple test, ID011, ABCDE
    ID007	This a simple test, ID007, ABCDE
    ID011	This a simple test, ID011, ABCDE
    ID003	This a simple test, ID003, ABCDE
    ID012	This a simple test, ID012, ABCDE
    ID013	This a simple test, ID013, ABCDE
    ID014	This a simple test, ID014, ABCDE
    ID017	This a simple test, ID017, ABCDE
    ID011	This a simple test, ID011, ABCDE
    ID018	This a simple test, ID018, ABCDE
    ID019	This a simple test, ID019, ABCDE
    ID020	This a simple test, ID020, ABCDE
    ID021	This a simple test, ID021, ABCDE
    ID022	This a simple test, ID022, ABCDE
    ID023	This a simple test, ID023, ABCDE
    ID012	This a simple test, ID012, ABCDE
    ID007	This a simple test, ID007, ABCDE
    ID023	This a simple test, ID023, ABCDE
    ID024	This a simple test, ID024, ABCDE
    ID024	This a simple test, ID024, ABCDE
    ID011	This a simple test, ID011, ABCDE
    ID025	This a simple test, ID025, ABCDE
    ID024	This a simple test, ID024, ABCDE
    ID027	This a simple test, ID027, ABCDE
    ID023	This a simple test, ID023, ABCDE
    ID023	This a simple test, ID023, ABCDE
    ID028	This a simple test, ID028, ABCDE
    ID029	This a simple test, ID029, ABCDE
    ID012	This a simple test, ID012, ABCDE
    ID029	This a simple test, ID029, ABCDE
    ID029	This a simple test, ID029, ABCDE
    ID029	This a simple test, ID029, ABCDE
    ID030	This a simple test, ID030, ABCDE
    ID007	This a simple test, ID007, ABCDE
    ID003	This a simple test, ID003, ABCDE
    ID024	This a simple test, ID024, ABCDE
    ID011
    ID000
    ID005
    ID037
    ID008
    ID013
    ID024
    ID043
    ID003
    ID026
    ID028
    ID016
- Select the option `Edit > Line operations > Sort Lines Lexicographically Descending`

We get the following text :

    ID043
    ID037
    ID030	This a simple test, ID030, ABCDE
    ID029	This a simple test, ID029, ABCDE
    ID029	This a simple test, ID029, ABCDE
    ID029	This a simple test, ID029, ABCDE
    ID029	This a simple test, ID029, ABCDE
    ID028	This a simple test, ID028, ABCDE
    ID028
    ID027	This a simple test, ID027, ABCDE
    ID026
    ID025	This a simple test, ID025, ABCDE
    ID024	This a simple test, ID024, ABCDE
    ID024	This a simple test, ID024, ABCDE
    ID024	This a simple test, ID024, ABCDE
    ID024	This a simple test, ID024, ABCDE
    ID024	This a simple test, ID024, ABCDE
    ID024	This a simple test, ID024, ABCDE
    ID024	This a simple test, ID024, ABCDE
    ID024
    ID023	This a simple test, ID023, ABCDE
    ID023	This a simple test, ID023, ABCDE
    ID023	This a simple test, ID023, ABCDE
    ID023	This a simple test, ID023, ABCDE
    ID022	This a simple test, ID022, ABCDE
    ID021	This a simple test, ID021, ABCDE
    ID020	This a simple test, ID020, ABCDE
    ID019	This a simple test, ID019, ABCDE
    ID018	This a simple test, ID018, ABCDE
    ID017	This a simple test, ID017, ABCDE
    ID016
    ID014	This a simple test, ID014, ABCDE
    ID013	This a simple test, ID013, ABCDE
    ID013
    ID012	This a simple test, ID012, ABCDE
    ID012	This a simple test, ID012, ABCDE
    ID012	This a simple test, ID012, ABCDE
    ID011	This a simple test, ID011, ABCDE
    ID011	This a simple test, ID011, ABCDE
    ID011	This a simple test, ID011, ABCDE
    ID011	This a simple test, ID011, ABCDE
    ID011	This a simple test, ID011, ABCDE
    ID011
    ID010	This a simple test, ID010, ABCDE
    ID009	This a simple test, ID009, ABCDE
    ID008
    ID007	This a simple test, ID007, ABCDE
    ID007	This a simple test, ID007, ABCDE
    ID007	This a simple test, ID007, ABCDE
    ID007	This a simple test, ID007, ABCDE
    ID007	This a simple test, ID007, ABCDE
    ID007	This a simple test, ID007, ABCDE
    ID006	This a simple test, ID006, ABCDE
    ID005	This a simple test, ID005, ABCDE
    ID005
    ID004	This a simple test, ID004, ABCDE
    ID003	This a simple test, ID003, ABCDE
    ID003	This a simple test, ID003, ABCDE
    ID003
    ID002	This a simple test, ID002, ABCDE
    ID001	This a simple test, ID001, ABCDE
    ID000
- Now, perform the regex S/R :
  - SEARCH `(?-s)ID(.{3}).+\RID\1\R|.+\R`
  - REPLACE `?1|\1`

=> You should always obtain a single line, like below :

    |028|024|013|011|005|003

Remark : This line should not exceed `2,010` characters. However, this should generally be the case, as we collect only identifiers present in the `database.txt` file. I also omitted the common part `ID` to get a smaller expression !
- At this penultimate step, we’ll use a regex S/R to … create a new search regex ! So :
  - SEARCH `^\|(.+)`
  - REPLACE `\(?-s\)^\(?=.*ID\(\1\)\).+\\R`

  which gives us the regex : `(?-s)^(?=.*ID(028|024|013|011|005|003)).+\R`
- Save the one-line file `dummy.txt`
- Finally, open your `database.txt` file
- And here is the final regex S/R to perform. Two cases :
  - (A) If the `list.txt` file contains all the identifiers to be deleted in `database.txt`, use the new search regex :
    - SEARCH `(?-s)^(?=.*ID(028|024|013|011|005|003)).+\R`
    - REPLACE `Leave EMPTY`
  - (B) If the `list.txt` file contains the identifiers which must be retained in `database.txt`, add the part `|^.+\R` at the end and modify the replacement part :
    - SEARCH `(?-s)^(?=.*ID(028|024|013|011|005|003)).+\R|^.+\R`
    - REPLACE `?1$0`
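The road map above (prefix, append, sort, collapse, build the regex) can be traced in a few lines of Python. This is a hedged sketch, not a replacement for the Notepad++ steps: the function names are invented for the example, Python's plain string sort stands in for `Sort Lines Lexicographically Descending`, `\n` replaces `\R`, and a replacement function plays the role of the conditional replacement `?1|\1`, which Python's `re` does not have. The final regex is only built as a string to paste into Notepad++; it is not compiled here.

```python
import re

def prefix_with_id(database_text):
    # Python spelling of SEARCH (?-s)^.*(ID\d\d\d).*\R  /  REPLACE \1\t$0
    return re.sub(r"^.*(ID\d\d\d).*\n", r"\1\t\g<0>", database_text, flags=re.MULTILINE)

def collapse_ids(prefixed_text, list_text):
    # Append the list, sort descending, then keep only IDs present in both parts
    lines = prefixed_text.splitlines() + list_text.split()
    lines.sort(reverse=True)      # a bare "ID024" sorts after every "ID024<TAB>..." line
    buffer = "".join(line + "\n" for line in lines)

    def keep_matched_id(match):
        # Same role as the replacement ?1|\1
        return "|" + match.group(1) if match.group(1) is not None else ""

    return re.sub(r"ID(.{3}).+\nID\1\n|.+\n", keep_matched_id, buffer)

def build_search_regex(collapsed):
    # Same role as SEARCH ^\|(.+) with the escaped replacement shown above
    return re.sub(r"^\|(.+)", lambda m: r"(?-s)^(?=.*ID(" + m.group(1) + r")).+\R", collapsed)

# Small dummy data in the same shape as the example files
database_text = (
    "This a simple test, ID007, ABCDE\n"
    "This a simple test, ID024, ABCDE\n"
    "This a simple test, ID005, ABCDE\n"
    "This a simple test, ID024, ABCDE\n"
)
list_text = "ID024\nID000\nID005\nID043\n"

collapsed = collapse_ids(prefix_with_id(database_text), list_text)
search_regex = build_search_regex(collapsed)
print(collapsed)       # |024|005
print(search_regex)    # (?-s)^(?=.*ID(024|005)).+\R
```

`ID000` and `ID043` appear in the list but not in the data, so they never make it into the alternation, which is the point of the sort-and-collapse step.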
- With the regex S/R `(A)`, we get a final `database.txt` file of `33` lines, below :

    This a simple test, ID007, ABCDE
    This a simple test, ID001, ABCDE
    This a simple test, ID002, ABCDE
    This a simple test, ID004, ABCDE
    This a simple test, ID006, ABCDE
    This a simple test, ID007, ABCDE
    This a simple test, ID007, ABCDE
    This a simple test, ID009, ABCDE
    This a simple test, ID010, ABCDE
    This a simple test, ID007, ABCDE
    This a simple test, ID012, ABCDE
    This a simple test, ID014, ABCDE
    This a simple test, ID017, ABCDE
    This a simple test, ID018, ABCDE
    This a simple test, ID019, ABCDE
    This a simple test, ID020, ABCDE
    This a simple test, ID021, ABCDE
    This a simple test, ID022, ABCDE
    This a simple test, ID023, ABCDE
    This a simple test, ID012, ABCDE
    This a simple test, ID007, ABCDE
    This a simple test, ID023, ABCDE
    This a simple test, ID025, ABCDE
    This a simple test, ID027, ABCDE
    This a simple test, ID023, ABCDE
    This a simple test, ID023, ABCDE
    This a simple test, ID029, ABCDE
    This a simple test, ID012, ABCDE
    This a simple test, ID029, ABCDE
    This a simple test, ID029, ABCDE
    This a simple test, ID029, ABCDE
    This a simple test, ID030, ABCDE
    This a simple test, ID007, ABCDE

- With the regex S/R `(B)`, we get a final `database.txt` file of `17` lines, below :

    This a simple test, ID024, ABCDE
    This a simple test, ID011, ABCDE
    This a simple test, ID005, ABCDE
    This a simple test, ID024, ABCDE
    This a simple test, ID024, ABCDE
    This a simple test, ID011, ABCDE
    This a simple test, ID011, ABCDE
    This a simple test, ID003, ABCDE
    This a simple test, ID013, ABCDE
    This a simple test, ID011, ABCDE
    This a simple test, ID024, ABCDE
    This a simple test, ID024, ABCDE
    This a simple test, ID011, ABCDE
    This a simple test, ID024, ABCDE
    This a simple test, ID028, ABCDE
    This a simple test, ID003, ABCDE
    This a simple test, ID024, ABCDE
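To check cases (A) and (B) on dummy data without touching the real file, the two replacements can be imitated in Python. This is a sketch under assumptions: `filter_lines` and its `keep_listed` flag are hypothetical names, `\n` stands in for `\R`, and a replacement function emulates the empty replacement of (A) and the conditional `?1$0` of (B).

```python
import re

def filter_lines(database_text, id_suffixes, keep_listed=False):
    """keep_listed=False imitates case (A): delete the lines whose ID is listed.
    keep_listed=True imitates case (B): keep only the lines whose ID is listed."""
    alternation = "|".join(re.escape(suffix) for suffix in id_suffixes)
    pattern = re.compile(r"^(?=.*ID(" + alternation + r")).+\n|^.+\n", re.MULTILINE)

    def replace(match):
        listed = match.group(1) is not None     # group 1 is set only by the first branch
        return match.group(0) if listed == keep_listed else ""

    return pattern.sub(replace, database_text)

database_text = (
    "This a simple test, ID007, ABCDE\n"
    "This a simple test, ID024, ABCDE\n"
    "This a simple test, ID011, ABCDE\n"
    "This a simple test, ID001, ABCDE\n"
)

result_a = filter_lines(database_text, ["024", "011"])                    # case (A)
result_b = filter_lines(database_text, ["024", "011"], keep_listed=True)  # case (B)
print(result_a)
print(result_b)
```

As in the Notepad++ version, a final line without a line break is not matched by either branch and is left untouched.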
Note that, if we had followed @peterjones’s method, we would have ended up with this search regex, a bit longer :

`(?-s)^.*(ID011|ID000|ID005|ID037|ID008|ID013|ID024|ID043|ID003|ID026|ID028|ID016).*\R?`

Of course, here, there is no notable difference but, depending on the `list.txt` contents, it could be of some importance !

Best Regards,
guy038
-
-
@guy038 ,
Great effort, hopefully not wasted.
Unfortunately, @mrmagnum8841 has never come back and answered whether my interpretation of the problem is in any way accurate.
I am the one who introduced the `database.txt` and `list.txt` idea (as the most reasonable way I could come up with of interpreting what was asked for, but never explicitly stated). And I am the one who used “ID###”; @mrmagnum8841 just said “big text file filled with a lot of ID’s”. There may not be anything so easy to capture as the “ID” prefix before each ID. It may be that each ID is really a UUID, or it may be that each ID is really exactly 32 hexadecimal characters, or it may be that each ID is really someone’s name with all spaces and special characters removed, or it may be that each ID just appears to our eyes to be a random set of characters with a random length. Making too many optimizations without any feedback from @mrmagnum8841 might be an interesting mental exercise, but we have no idea if we’ve ever been answering @mrmagnum8841’s actual need.

@mrmagnum8841, if you want more help than we have provided, please actually respond with answers to the questions raised, and let us know how close, or far, we are to actually solving your problem.
-
I think that what I now call “the Dail-ism” is an applicable comment at this point.
-
Hi, @mrmagnum8841, @peterjones, @alan-kilborn and All,
When elaborating my previous post, I remembered, from this post :

https://community.notepad-plus-plus.org/post/51385

the following regex `(?-s)^(.+\R)(?=(?s).+?^\1)`, which, indeed, could work with a `5 Mb` file containing more than `200,000` lines ! Much better, isn’t it ?

Seemingly, the fact that, in this regex, the group `1` corresponds to an entire line, with its line-break, whereas the `(?-si)^.*,\x20(\w+),.*\R(?=(?s).+?^\1$)` syntax stores only the `ID###` part of each line in group `1` (which fails with a file over `82` Kb - `2,500` lines !) makes all the difference !! Why ?

As you said, Peter, it was a mental exercise, not specifically intended for the OP, in order to find a correct way to filter fairly large files, as I’m rather irritated by the limitations of my various regular expression attempts :-((
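For comparison, the whole-line regex quoted above deletes a non-empty line when an identical line shows up again further down. A rough Python approximation, given only as an illustration (the function name is hypothetical, and edge cases may differ, for instance two directly adjacent copies, since the lookahead requires at least one character before the later copy), uses a set instead of a lookahead:

```python
def drop_lines_repeated_later(text):
    """Keep only the last occurrence of each non-empty line (line break included)."""
    seen = set()
    kept = []
    for line in reversed(text.splitlines(keepends=True)):
        is_blank = line.strip("\r\n") == ""      # the regex needs .+ so blank lines stay
        if is_blank or line not in seen:
            kept.append(line)
            seen.add(line)
    kept.reverse()
    return "".join(kept)

sample = "a\nb\na\nc\nb\n"
deduplicated = drop_lines_repeated_later(sample)
print(deduplicated)
```

Each line is looked up once in the set, so the amount of work grows with the number of lines rather than with the distance between a line and its later copy.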
Cheers,
guy038
-