Use Text File to Remove Lines?

mrmagnum8841

Hi. This is sorta hard to explain, but here goes.
I have a big text file filled with a lot of ID’s. Is it possible to use the text file to remove lines?
In other words is it possible to use it like a database of lines to remove?

PeterJones

@mrmagnum8841 said in Use Text File to Remove Lines?:

In other words is it possible to use it like a database of lines to remove?

Kind of. I am going to assume you have two files – the huge file that has lines you want to remove (database.txt), and another smaller file with a list of IDs that will indicate which lines to delete (list.txt).

Most importantly, back up both files – I recommend working from a copy rather than the original.

example list.txt:

ID005
ID007
ID011
ID013

example database.txt:

This has many things, ID001, yada
This has many things, ID002, yada
This has many things, ID003, yada
This has many things, ID004, yada
This has many things, ID005, yada
This has many things, ID006, yada
This has many things, ID007, yada
This has many things, ID008, yada
This has many things, ID009, yada
This has many things, ID010, yada
This has many things, ID011, yada
This has many things, ID012, yada
This has many things, ID013, yada
This has many things, ID014, yada
This has many things, ID015, yada
This has many things, ID016, yada

desired result (5, 7, 11, 13 deleted):

This has many things, ID001, yada
This has many things, ID002, yada
This has many things, ID003, yada
This has many things, ID004, yada
This has many things, ID006, yada
This has many things, ID008, yada
This has many things, ID009, yada
This has many things, ID010, yada
This has many things, ID012, yada
This has many things, ID014, yada
This has many things, ID015, yada
This has many things, ID016, yada

In list.txt,
1. Ctrl+A, Ctrl+J: this joins everything into one long line, space-separated
2. Search > Replace (or Ctrl+H): goal is to make the list |-separated
  - FIND = \h+
  - REPLACE = |
  - MODE = regular expression
  - Replace All
3. Another Replacement: goal is to get the lines down to less than 1000 characters each (assuming no ID is greater than 50 characters)
  - FIND = (?-s)^(.{950,}?)\|
  - REPLACE = $1\r\n
  - MODE = regular expression
  - Replace All
  - You don’t need this step 3 if your line is less than 1000 characters after step 2.
4. Next replacement: goal is to make each line look like (ID#|ID#|...|ID#), with a bit of stuff at the beginning and end
  - FIND = (?-s)^.*$
  - REPLACE = \(?-s\)^.*\($0\).*\\R?
  - MODE = regular expression
  - Replace All
Now, for each line in list.txt:
1. Copy the line from list.txt
2. Switch to database.txt window
3. Search > Replace (or Ctrl+H)
  - paste the line into the FIND box, so it looks like FIND = (?-s)^.*(ID#|ID#|...|ID#).*\R?
  - make the REPLACE box empty
  - MODE = regular expression
  - Replace All
4. Repeat as necessary for each line from list.txt
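
For readers who would rather script it, the chunked find-and-replace above can be approximated outside Notepad++. The following is only a sketch in Python, not part of the original procedure: the function names build_patterns and remove_lines are made up for illustration, the 1000-character budget mirrors the limit mentioned in step 3 (here it is applied to the alternation part only), and \n stands in for \R, which Python’s re module does not support.

```python
import re

def build_patterns(ids, max_len=1000):
    """Group IDs into alternation regexes whose ID list stays within max_len characters."""
    chunks, chunk, size = [], [], 0
    for ident in ids:
        escaped = re.escape(ident)
        # Start a new chunk when adding this ID (plus its "|") would exceed the budget.
        if chunk and size + len(escaped) + 1 > max_len:
            chunks.append(chunk)
            chunk, size = [], 0
        chunk.append(escaped)
        size += len(escaped) + 1
    if chunk:
        chunks.append(chunk)
    # One pattern per chunk, shaped like ^.*(ID#|ID#|...).*\n? from the steps above.
    return [re.compile(r"^.*(?:" + "|".join(c) + r").*\n?", re.MULTILINE) for c in chunks]

def remove_lines(database_text, ids, max_len=1000):
    # Apply each chunk in turn, like repeating Replace All once per line of list.txt.
    for pattern in build_patterns(ids, max_len):
        database_text = pattern.sub("", database_text)
    return database_text

# Example data shaped like the database.txt / list.txt samples above.
database = "".join(f"This has many things, ID{n:03d}, yada\n" for n in range(1, 17))
id_list = ["ID005", "ID007", "ID011", "ID013"]
result = remove_lines(database, id_list)
print(result)
```

Like the Notepad++ version, this matches IDs as substrings, so ID005 would also hit a line containing ID0050.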

-----

caveat emptor

This sequence seemed to work for me, based on my understanding of your issue, and is published here to help you learn how to do this. I make no guarantees or warranties as to the functionality for you. You are responsible to save and backup all data before and after running this sequence. If you want to use it long term, I recommend investing time in adding error checking and verifying with edge cases.

mrmagnum8841

Thank you. However I’ve already seen something similar and the only real issue with it is the length.
As of right now, I have over 23738 lines with 32 characters being the length of each line and more and more get added.

guy038

Hello, @mrmagnum8841, @peterjones and All,

Hi Peter, we have already seen this type of request, many times, on our forum !

So @mrmagnum8841, here is the road map :

Open a N++ new tab
First, paste the contents of the database.txt file in that new tab
Secondly, add a line containing, at least, 3 equal signs ( === )
Thirdly, append the contents of the list.txt file
Now open the Replace dialog ( Ctrl + H )
- SEARCH (?s-i)^(?-s:.*(\w+).*\R)(?=.*^===+.+?^\1$)|^===.+
- REPLACE Leave EMPTY
- Tick the Wrap around option, if necessary
- Select the Regular expression search mode
- Click on the Replace All button

Voila, that’s all !

Notes :

Broadly, this regex checks whether a word of the current line is also present, with the same case, as the nearest complete line in the 2nd part of the file, after the line of equal signs =====
If so, the whole current line, with its line-break, is selected and, as the replacement zone is empty, this line is just deleted
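
To see the mechanics of this search on a tiny input, here is a rough Python transliteration. It is a sketch under assumptions, not the Notepad++ behaviour itself: Python’s re module has no \R (so \n is used), it cannot switch a flag off globally (so the -i part is dropped, re being case-sensitive by default), and the sample data is invented to resemble the earlier examples.

```python
import re

# First alternative: a line holding a word that also appears, as a complete
# line, somewhere after the "===" separator. Second alternative: the separator
# and everything after it.
PATTERN = re.compile(
    r"(?s)^(?-s:.*(\w+).*\n)(?=.*^===+.+?^\1$)|^===.+",
    re.MULTILINE,
)

database = (
    "This has many things, ID004, yada\n"
    "This has many things, ID005, yada\n"
    "This has many things, ID006, yada\n"
)
id_list = "ID005\nID007\n"

# Database first, then the separator line, then the list of IDs.
combined = database + "===\n" + id_list
result = PATTERN.sub("", combined)
print(result)
```

The lookahead rescans the rest of the buffer for every candidate word, so this is only meant for small inputs.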

Best Regards,

guy038

PeterJones

@guy038 said,

Hi Peter, we have already seen this type of request,

I know. I just couldn’t quickly find any of them to link. Unfortunately, you have so many excellent regex posts on this forum, but I could never bookmark enough of them in an organized fashion to be able to always find the one I am thinking of for any given future reply. :-) I tried my best to recreate it from memory (not knowing if you were going to be answering over the weekend or not), but I forgot your trick for doing it in the same file rather than in 1000-character chunks. I guess I should have just waited for you to reply. ;-)

@mrmagnum8841 ,

As of right now, I have over 23738 lines with 32 characters being the length of each line and more and more get added.

You’ve now told us you have 23738 lines of data, which is helpful information; but you haven’t told us how many IDs there are to delete from those 23738 lines (is it a few IDs, a few hundred IDs, a few thousand IDs, more than half the IDs)?

Using the nomenclature I did in my first reply: with my solution, it shouldn’t be a problem if database.txt has 23k lines; with my solution, it became an issue if list.txt had more than ~1000 characters in the list of IDs… which is why my procedure grouped it into multiple groups of IDs if there were too many characters in your list.txt.

That’s one of the benefits of @guy038’s solution: it can take as many IDs as you want to delete from the main database.txt, and it should still work. Unfortunately, historically, if there are too many characters in the lookahead expression, Notepad++ occasionally gives up and just selects everything, which would have the unfortunate side effect of deleting everything.

If you try @guy038’s solution, and if it deletes everything or otherwise deletes too much, let us know, and we will try to help you through. But if that happens, it will help us to help you if you could give us a better representation of what you have (for instance, whether you really have two files); you are going to have to stop making us guess how your data is structured. I assumed two files: one as a database.txt that had the main data you wanted to process, and a second list.txt that just listed the IDs you wanted to delete from database.txt. But that was just an assumption, which you have neither confirmed nor denied. So if you need more help from us, you will need to provide more information, including dummy data. You can use the </> button on the post toolbar to format data as text (like I did in my original reply); give us an excerpt (not the full 23000 lines) of your data – if there is sensitive/secret information, just make up a handful of lines of dummy data that looks similar but with fake names, numbers, etc; and give us an example of the IDs you’d like to delete from your dummy data.

guy038

Hello, @mrmagnum8841, @peterjones and All,

I re-tested my regex S/R with a sizeable number of lines and I must say that this S/R, proposed in my previous post, failed miserably, even with very little data :-(((

Assuming that each line of the database.txt file contains 32 characters, it stops working beyond about 160 lines :-(( A pity !

Even this modified S/R, where I use delimiters to better catch the identifier ID### :

SEARCH (?s-i)^(?-s:.*,\x20(\w+),.*\R)(?=.*^===+.+?^\1$)|^===.+

With ,\x20 as the start delimiter and , as an end delimiter, can support about 2,200 lines but not more :-(

I also used this other version without the line delimiter ======= :

SEARCH (?-si)^.*,\x20(\w+),.*\R(?=(?s).+?^\1$)|^(ID...\R)+

But, though the regex seems simpler, the result is worse as it can only handle about 1,850 lines !

And, anyway, all of these regex S/R end up selecting all the file contents which is, obviously, not the desired goal !

Finally, it seems that @peterjones’s solution is the most efficient ! The only drawback of his method is when the list.txt file contains too many identifiers OR when a lot of these identifiers do not exist in the database.txt file ! In this latter case, this leads to a resulting regex (?-s)^.*(ID#|ID#|...|ID#).*\R? containing too many useless ID# alternatives !

So, here is my new attempt :

It should support large database.txt and list.txt files alike
The list.txt file may hold either the identifiers whose lines, in the database.txt file, have to be deleted or, on the contrary, the identifiers whose lines have to be retained !
Of course, it does not alter the initial order of the lines of the database.txt file
It minimizes the number of alternatives in the final regex, created in order to process the database.txt contents

Here is the road map, assuming the following examples of :

The initial list.txt file ( 12 lines, not sorted ) :

ID011
ID000
ID005
ID037
ID008
ID013
ID024
ID043
ID003
ID026
ID028
ID016

The initial database.txt file ( 50 lines, of 32 chars, not sorted ) :

This a simple test, ID007, ABCDE
This a simple test, ID024, ABCDE
This a simple test, ID011, ABCDE
This a simple test, ID001, ABCDE
This a simple test, ID002, ABCDE
This a simple test, ID004, ABCDE
This a simple test, ID005, ABCDE
This a simple test, ID006, ABCDE
This a simple test, ID007, ABCDE
This a simple test, ID007, ABCDE
This a simple test, ID024, ABCDE
This a simple test, ID009, ABCDE
This a simple test, ID010, ABCDE
This a simple test, ID024, ABCDE
This a simple test, ID011, ABCDE
This a simple test, ID007, ABCDE
This a simple test, ID011, ABCDE
This a simple test, ID003, ABCDE
This a simple test, ID012, ABCDE
This a simple test, ID013, ABCDE
This a simple test, ID014, ABCDE
This a simple test, ID017, ABCDE
This a simple test, ID011, ABCDE
This a simple test, ID018, ABCDE
This a simple test, ID019, ABCDE
This a simple test, ID020, ABCDE
This a simple test, ID021, ABCDE
This a simple test, ID022, ABCDE
This a simple test, ID023, ABCDE
This a simple test, ID012, ABCDE
This a simple test, ID007, ABCDE
This a simple test, ID023, ABCDE
This a simple test, ID024, ABCDE
This a simple test, ID024, ABCDE
This a simple test, ID011, ABCDE
This a simple test, ID025, ABCDE
This a simple test, ID024, ABCDE
This a simple test, ID027, ABCDE
This a simple test, ID023, ABCDE
This a simple test, ID023, ABCDE
This a simple test, ID028, ABCDE
This a simple test, ID029, ABCDE
This a simple test, ID012, ABCDE
This a simple test, ID029, ABCDE
This a simple test, ID029, ABCDE
This a simple test, ID029, ABCDE
This a simple test, ID030, ABCDE
This a simple test, ID007, ABCDE
This a simple test, ID003, ABCDE
This a simple test, ID024, ABCDE

First, copy the database.txt contents as, let’s say, the file dummy.txt
Now, open the dummy.txt file
Perform the regex S/R :
- SEARCH (?-s)^.*(ID\d\d\d).*\R
- REPLACE \1\t$0

=> We copy the identifier at the beginning of each line, followed by a tab separator, and get :

ID007	This a simple test, ID007, ABCDE
ID024	This a simple test, ID024, ABCDE
ID011	This a simple test, ID011, ABCDE
ID001	This a simple test, ID001, ABCDE
ID002	This a simple test, ID002, ABCDE
ID004	This a simple test, ID004, ABCDE
ID005	This a simple test, ID005, ABCDE
ID006	This a simple test, ID006, ABCDE
ID007	This a simple test, ID007, ABCDE
ID007	This a simple test, ID007, ABCDE
ID024	This a simple test, ID024, ABCDE
ID009	This a simple test, ID009, ABCDE
ID010	This a simple test, ID010, ABCDE
ID024	This a simple test, ID024, ABCDE
ID011	This a simple test, ID011, ABCDE
ID007	This a simple test, ID007, ABCDE
ID011	This a simple test, ID011, ABCDE
ID003	This a simple test, ID003, ABCDE
ID012	This a simple test, ID012, ABCDE
ID013	This a simple test, ID013, ABCDE
ID014	This a simple test, ID014, ABCDE
ID017	This a simple test, ID017, ABCDE
ID011	This a simple test, ID011, ABCDE
ID018	This a simple test, ID018, ABCDE
ID019	This a simple test, ID019, ABCDE
ID020	This a simple test, ID020, ABCDE
ID021	This a simple test, ID021, ABCDE
ID022	This a simple test, ID022, ABCDE
ID023	This a simple test, ID023, ABCDE
ID012	This a simple test, ID012, ABCDE
ID007	This a simple test, ID007, ABCDE
ID023	This a simple test, ID023, ABCDE
ID024	This a simple test, ID024, ABCDE
ID024	This a simple test, ID024, ABCDE
ID011	This a simple test, ID011, ABCDE
ID025	This a simple test, ID025, ABCDE
ID024	This a simple test, ID024, ABCDE
ID027	This a simple test, ID027, ABCDE
ID023	This a simple test, ID023, ABCDE
ID023	This a simple test, ID023, ABCDE
ID028	This a simple test, ID028, ABCDE
ID029	This a simple test, ID029, ABCDE
ID012	This a simple test, ID012, ABCDE
ID029	This a simple test, ID029, ABCDE
ID029	This a simple test, ID029, ABCDE
ID029	This a simple test, ID029, ABCDE
ID030	This a simple test, ID030, ABCDE
ID007	This a simple test, ID007, ABCDE
ID003	This a simple test, ID003, ABCDE
ID024	This a simple test, ID024, ABCDE
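
The prefixing step can also be reproduced in a few lines of Python, which may help when checking what the S/R is expected to produce. This is a minimal sketch: the function name prefix_with_id is invented here, and the pattern matches each line without its line-break (so \R is not needed), which gives the same visible result on this data.

```python
import re

def prefix_with_id(text):
    # Similar in spirit to SEARCH (?-s)^.*(ID\d\d\d).*\R / REPLACE \1\t$0.
    # The greedy leading .* means the LAST ID### on a line is the one captured.
    # Lines without any ID### are left untouched.
    return re.sub(r"^.*(ID\d\d\d).*$", r"\1\t\g<0>", text, flags=re.MULTILINE)

sample = (
    "This a simple test, ID007, ABCDE\n"
    "This a simple test, ID024, ABCDE\n"
)
prefixed = prefix_with_id(sample)
print(prefixed)
```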

Append the list.txt contents, at the end of the dummy.txt file, giving these 62 lines, below :

ID007	This a simple test, ID007, ABCDE
ID024	This a simple test, ID024, ABCDE
ID011	This a simple test, ID011, ABCDE
ID001	This a simple test, ID001, ABCDE
ID002	This a simple test, ID002, ABCDE
ID004	This a simple test, ID004, ABCDE
ID005	This a simple test, ID005, ABCDE
ID006	This a simple test, ID006, ABCDE
ID007	This a simple test, ID007, ABCDE
ID007	This a simple test, ID007, ABCDE
ID024	This a simple test, ID024, ABCDE
ID009	This a simple test, ID009, ABCDE
ID010	This a simple test, ID010, ABCDE
ID024	This a simple test, ID024, ABCDE
ID011	This a simple test, ID011, ABCDE
ID007	This a simple test, ID007, ABCDE
ID011	This a simple test, ID011, ABCDE
ID003	This a simple test, ID003, ABCDE
ID012	This a simple test, ID012, ABCDE
ID013	This a simple test, ID013, ABCDE
ID014	This a simple test, ID014, ABCDE
ID017	This a simple test, ID017, ABCDE
ID011	This a simple test, ID011, ABCDE
ID018	This a simple test, ID018, ABCDE
ID019	This a simple test, ID019, ABCDE
ID020	This a simple test, ID020, ABCDE
ID021	This a simple test, ID021, ABCDE
ID022	This a simple test, ID022, ABCDE
ID023	This a simple test, ID023, ABCDE
ID012	This a simple test, ID012, ABCDE
ID007	This a simple test, ID007, ABCDE
ID023	This a simple test, ID023, ABCDE
ID024	This a simple test, ID024, ABCDE
ID024	This a simple test, ID024, ABCDE
ID011	This a simple test, ID011, ABCDE
ID025	This a simple test, ID025, ABCDE
ID024	This a simple test, ID024, ABCDE
ID027	This a simple test, ID027, ABCDE
ID023	This a simple test, ID023, ABCDE
ID023	This a simple test, ID023, ABCDE
ID028	This a simple test, ID028, ABCDE
ID029	This a simple test, ID029, ABCDE
ID012	This a simple test, ID012, ABCDE
ID029	This a simple test, ID029, ABCDE
ID029	This a simple test, ID029, ABCDE
ID029	This a simple test, ID029, ABCDE
ID030	This a simple test, ID030, ABCDE
ID007	This a simple test, ID007, ABCDE
ID003	This a simple test, ID003, ABCDE
ID024	This a simple test, ID024, ABCDE
ID011
ID000
ID005
ID037
ID008
ID013
ID024
ID043
ID003
ID026
ID028
ID016

Select the option Edit > Line operations > Sort Lines Lexicographically Descending

We get the following text :

ID043
ID037
ID030	This a simple test, ID030, ABCDE
ID029	This a simple test, ID029, ABCDE
ID029	This a simple test, ID029, ABCDE
ID029	This a simple test, ID029, ABCDE
ID029	This a simple test, ID029, ABCDE
ID028	This a simple test, ID028, ABCDE
ID028
ID027	This a simple test, ID027, ABCDE
ID026
ID025	This a simple test, ID025, ABCDE
ID024	This a simple test, ID024, ABCDE
ID024	This a simple test, ID024, ABCDE
ID024	This a simple test, ID024, ABCDE
ID024	This a simple test, ID024, ABCDE
ID024	This a simple test, ID024, ABCDE
ID024	This a simple test, ID024, ABCDE
ID024	This a simple test, ID024, ABCDE
ID024
ID023	This a simple test, ID023, ABCDE
ID023	This a simple test, ID023, ABCDE
ID023	This a simple test, ID023, ABCDE
ID023	This a simple test, ID023, ABCDE
ID022	This a simple test, ID022, ABCDE
ID021	This a simple test, ID021, ABCDE
ID020	This a simple test, ID020, ABCDE
ID019	This a simple test, ID019, ABCDE
ID018	This a simple test, ID018, ABCDE
ID017	This a simple test, ID017, ABCDE
ID016
ID014	This a simple test, ID014, ABCDE
ID013	This a simple test, ID013, ABCDE
ID013
ID012	This a simple test, ID012, ABCDE
ID012	This a simple test, ID012, ABCDE
ID012	This a simple test, ID012, ABCDE
ID011	This a simple test, ID011, ABCDE
ID011	This a simple test, ID011, ABCDE
ID011	This a simple test, ID011, ABCDE
ID011	This a simple test, ID011, ABCDE
ID011	This a simple test, ID011, ABCDE
ID011
ID010	This a simple test, ID010, ABCDE
ID009	This a simple test, ID009, ABCDE
ID008
ID007	This a simple test, ID007, ABCDE
ID007	This a simple test, ID007, ABCDE
ID007	This a simple test, ID007, ABCDE
ID007	This a simple test, ID007, ABCDE
ID007	This a simple test, ID007, ABCDE
ID007	This a simple test, ID007, ABCDE
ID006	This a simple test, ID006, ABCDE
ID005	This a simple test, ID005, ABCDE
ID005
ID004	This a simple test, ID004, ABCDE
ID003	This a simple test, ID003, ABCDE
ID003	This a simple test, ID003, ABCDE
ID003
ID002	This a simple test, ID002, ABCDE
ID001	This a simple test, ID001, ABCDE
ID000

Now, perform the regex S/R :
- SEARCH (?-s)ID(.{3}).+\RID\1\R|.+\R
- REPLACE ?1|\1

=> You should always obtain a single line, like below :

|028|024|013|011|005|003

Remark : This line should not exceed 2,010 characters. However, this should generally be the case as we collect only the identifiers present in the database.txt file. I also omitted the common part ID to get a smaller expression !
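
The append, sort and collapse steps can be mimicked in Python to see why only the identifiers present in both files survive. This is a sketch under assumptions: the helper name ids_present_in_both is made up, every identifier is taken to be ID followed by exactly 3 characters as in the example, and Python’s default string ordering is used in place of the Notepad++ sort command.

```python
import re

def ids_present_in_both(prefixed_lines, id_list):
    """Return a '|024|011' style string of the IDs found in both inputs."""
    # Sorting in descending order places each bare ID directly after the
    # prefixed lines sharing that ID (a bare "ID024" sorts below "ID024\t...").
    lines = sorted(prefixed_lines + id_list, reverse=True)
    text = "".join(line + "\n" for line in lines)
    # First alternative: a prefixed line immediately followed by its bare ID.
    # Second alternative: any other line, which is simply dropped.
    pattern = re.compile(r"ID(.{3}).+\nID\1\n|.+\n")
    return pattern.sub(lambda m: "|" + m.group(1) if m.group(1) else "", text)

prefixed = [
    "ID007\tThis a simple test, ID007, ABCDE",
    "ID024\tThis a simple test, ID024, ABCDE",
    "ID011\tThis a simple test, ID011, ABCDE",
    "ID024\tThis a simple test, ID024, ABCDE",
]
wanted = ["ID011", "ID000", "ID024", "ID043"]
collapsed = ids_present_in_both(prefixed, wanted)
print(collapsed)
```

Here ID000 and ID043 are dropped because no database line carries them, and ID007 is dropped because it is not in the list.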

At this penultimate step, we’ll use a regex S/R to … create a new search regex ! So :
- SEARCH ^\|(.+)
- REPLACE \(?-s\)^\(?=.*ID\(\1\)\).+\\R

which gives us the regex :

(?-s)^(?=.*ID(028|024|013|011|005|003)).+\R

Save the one-line file dummy.txt
Finally, open your database.txt file
And, here is the final regex S/R to perform. Two cases :
- (A) If the list.txt file contains all the identifiers to be deleted, in database.txt, use the new search regex :
  - SEARCH (?-s)^(?=.*ID(028|024|013|011|005|003)).+\R
  - REPLACE Leave EMPTY
- (B) If the list.txt file contains the only identifiers to be retained, in database.txt, add the part |^.+\R at the end and modify the replacement part :
  - SEARCH (?-s)^(?=.*ID(028|024|013|011|005|003)).+\R|^.+\R
  - REPLACE ?1$0

With the regex S/R (A), we get a final database.txt file of 33 lines, below :

This a simple test, ID007, ABCDE
This a simple test, ID001, ABCDE
This a simple test, ID002, ABCDE
This a simple test, ID004, ABCDE
This a simple test, ID006, ABCDE
This a simple test, ID007, ABCDE
This a simple test, ID007, ABCDE
This a simple test, ID009, ABCDE
This a simple test, ID010, ABCDE
This a simple test, ID007, ABCDE
This a simple test, ID012, ABCDE
This a simple test, ID014, ABCDE
This a simple test, ID017, ABCDE
This a simple test, ID018, ABCDE
This a simple test, ID019, ABCDE
This a simple test, ID020, ABCDE
This a simple test, ID021, ABCDE
This a simple test, ID022, ABCDE
This a simple test, ID023, ABCDE
This a simple test, ID012, ABCDE
This a simple test, ID007, ABCDE
This a simple test, ID023, ABCDE
This a simple test, ID025, ABCDE
This a simple test, ID027, ABCDE
This a simple test, ID023, ABCDE
This a simple test, ID023, ABCDE
This a simple test, ID029, ABCDE
This a simple test, ID012, ABCDE
This a simple test, ID029, ABCDE
This a simple test, ID029, ABCDE
This a simple test, ID029, ABCDE
This a simple test, ID030, ABCDE
This a simple test, ID007, ABCDE

With the regex S/R (B), we get a final database.txt file of 17 lines, below :

This a simple test, ID024, ABCDE
This a simple test, ID011, ABCDE
This a simple test, ID005, ABCDE
This a simple test, ID024, ABCDE
This a simple test, ID024, ABCDE
This a simple test, ID011, ABCDE
This a simple test, ID011, ABCDE
This a simple test, ID003, ABCDE
This a simple test, ID013, ABCDE
This a simple test, ID011, ABCDE
This a simple test, ID024, ABCDE
This a simple test, ID024, ABCDE
This a simple test, ID011, ABCDE
This a simple test, ID024, ABCDE
This a simple test, ID028, ABCDE
This a simple test, ID003, ABCDE
This a simple test, ID024, ABCDE
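
Both final cases can be tried out in Python on a few lines before running them on real data. Again this is only a sketch: delete_listed and keep_listed are hypothetical helper names, \n replaces \R, and the collapsed string is assumed to contain plain digits so that no escaping is needed.

```python
import re

def build_alternatives(collapsed):
    # "|024|011" -> "024|011"; assumes plain digits, so nothing is escaped.
    return collapsed.lstrip("|")

def delete_listed(text, collapsed):
    # Case (A): remove every line containing one of the collected IDs.
    alternatives = build_alternatives(collapsed)
    pattern = re.compile(r"^(?=.*ID(?:" + alternatives + r")).+\n", re.MULTILINE)
    return pattern.sub("", text)

def keep_listed(text, collapsed):
    # Case (B): group 1 is set only when the lookahead succeeded, so a line is
    # written back when group 1 exists and dropped otherwise (like ?1$0).
    alternatives = build_alternatives(collapsed)
    pattern = re.compile(r"^(?=.*ID(" + alternatives + r")).+\n|^.+\n", re.MULTILINE)
    return pattern.sub(lambda m: m.group(0) if m.group(1) else "", text)

database = (
    "This a simple test, ID007, ABCDE\n"
    "This a simple test, ID024, ABCDE\n"
    "This a simple test, ID011, ABCDE\n"
    "This a simple test, ID001, ABCDE\n"
)
collapsed = "|024|011"
deleted = delete_listed(database, collapsed)
kept = keep_listed(database, collapsed)
print(deleted)
print(kept)
```

Every line is assumed to end with a line-break; a final line without one would not be matched by either pattern.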

Note that, if we had followed @peterjones’s method, we would have ended up with this search regex, a bit longer :

(?-s)^.*(ID011|ID000|ID005|ID037|ID008|ID013|ID024|ID043|ID003|ID026|ID028|ID016).*\R?

Of course, here, there is no notable difference but, depending on the list.txt contents, it could be of some importance !

Best Regards,

guy038

PeterJones

@guy038 ,

Great effort, hopefully not wasted.

Unfortunately, @mrmagnum8841 has never come back and answered whether my interpretation of the problem is in any way accurate.

I am the one who introduced the database.txt and list.txt idea (as the most reasonable way I could come up with of interpreting what was asked for, but never explicitly stated). And I am the one who used “ID###” – @mrmagnum8841 just said “big text file filled with a lot of ID’s”. There may not be anything so easy to capture as the “ID” prefix before each ID. It may be that each ID is really a UUID, or it may be that each ID is really exactly 32 hexadecimal characters, or it may be that each ID is really someone’s name with all spaces and special characters removed, or it may be that each ID just appears to our eyes to be a random set of characters with a random length. Making too many optimizations without any feedback from @mrmagnum8841 might be an interesting mental exercise, but we have no idea if we’ve ever been answering @mrmagnum8841’s actual need.

@mrmagnum8841 , if you want more help than we have provided, please actually respond with answers to the questions raised, and let us know how close, or far, we are to actually solving your problem.

Alan Kilborn

I think that what I now call “the Dail-ism” is an applicable comment at this point.

guy038

Hi, @mrmagnum8841, @peterjones, @alan-kilborn and All,

When elaborating my previous post, I remembered, from this post :

https://community.notepad-plus-plus.org/post/51385

The following regex (?-s)^(.+\R)(?=(?s).+?^\1), which, indeed, could work with a 5 MB file, containing more than 200,000 lines ! Much better, isn’t it ?

Seemingly, the fact that, in this regex, group 1 corresponds to an entire line, with its line-break, whereas the (?-si)^.*,\x20(\w+),.*\R(?=(?s).+?^\1$) syntax stores only the ID### part of each line in group 1 ( which fails with a file over 82 KB - 2,500 lines ! ) makes all the difference !! Why ?
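
For anyone wanting to see what the whole-line regex does on a small input, here is a Python rendering. It is a sketch, not a statement about Notepad++ performance: \n replaces \R, and the (?s) is written as a scoped group (?s:...) because recent Python versions reject a global inline flag in the middle of a pattern.

```python
import re

# Group 1 holds a complete line with its line-break; the lookahead then asks
# for the same complete line further down. Because of .+? at least one
# character must sit in between, so a copy on the very next line is not caught.
PATTERN = re.compile(r"^(.+\n)(?=(?s:.+?)^\1)", re.MULTILINE)

text = "alpha\nbeta\nalpha\ngamma\nbeta\n"
result = PATTERN.sub("", text)
print(result)
```

Each line that occurs again later is deleted, so only the last copy of alpha and of beta remains.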

As you said, Peter, it was a mental exercise, not specifically intended for the OP, in order to find a correct way to filter fairly large files, as I’m rather irritated by the limitations of my various regular expression attempts :-((

Cheers,

guy038