Remove entries from second file
-
Using Notepad++, I have one main file, 1.txt, that has 10,000 unique URLs in it. In my second file, 2.txt, I have a list of 2000 URLs. What I need to do is take those 2000 URLs that are in 2.txt and have them removed from 1.txt, which will leave me with 8000 URLs in 1.txt.
Goal: The 1.txt has 10,000 URL’s which contain 2000 that I need to remove. The 2000 that needs to be removed are in 2.txt.
Is it possible to run some kind of search/replace to perform this action? Thanks in advance!
-
@Michael-Rebusify said in Remove entries from second file:
Goal: The 1.txt has 10,000 URL’s which contain 2000 that I need to remove.
Welcome to the NPP forum.
From the little you have explained, it may be possible to use a regular expression (regex) to remove the 2000 unwanted URLs. However, you will need to provide some more detail and, if possible, an excerpt from the 2 files.
- Does the URL list in file #2 only contain 1 of each URL?
- Are the URLs in both files on lines by themselves, or are they embedded in sentences or other data?
- Would you be okay with combining the contents of both files, then, after sorting (provided the URLs are on lines by themselves), using a regex to remove any lines where 2 consecutive lines are duplicated?
That’s how I might do it, with the proviso that I’d need to see sample data, or at least a dummy representation of your data if the actual data is sensitive.
Terry
-
Hi,
- Yes, there are 2000 URLs in 2.txt and they are all unique.
- Yes, each URL is on its own line.
- If I combine by putting all of 2.txt at the bottom of 1.txt then we’d have to remove all that were duplicates. That would remove the initial URL and the second one.
I have to remove 10,000 URL’s but need to keep 2000 of them (2.txt).
-
So if one considers the data:
one
two
three
four
five
six
seven
eight
nine
ten
-------
five
seven
Note that the line of dashes is just there as a visual divider between 2 sections; here the sections could be considered the first, larger file at the top, and the second, smaller file at the bottom. Each section contains unique lines, but obviously there is going to be commonality between the sections.
So a regular expression replacement operation using
^(.+?\R)(?=(?s).*?\1)
as the search expression and an empty replace expression seems to remove the content of the top section that also appears in the bottom section, leaving the bottom section intact:
one
two
three
four
six
eight
nine
ten
-------
five
seven
So, in theory, if one combines into one file the 10000 line section from the first file (placed at the top in the new file) and the 2000 line section (placed at the bottom in the new file) and runs the above replacement on the new file, it should do the job?
Obviously, when the operation is complete, copy the top 8000 lines of the file to whatever file you need it in.
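For anyone who would rather script this than run a replacement, here is a minimal Python sketch of the same outcome. Note that it swaps the regex for a plain set lookup, so it is a different technique and not a port of the expression above. The names `remove_listed` and `filter_file`, and whatever paths you pass in, are made up for illustration, and it assumes one URL per line with no stray whitespace.

```python
from pathlib import Path


def remove_listed(main_lines, unwanted_lines):
    """Return main_lines, in order, minus any line found in unwanted_lines."""
    unwanted = set(unwanted_lines)  # a set keeps each membership test cheap
    return [line for line in main_lines if line not in unwanted]


def filter_file(main_path, unwanted_path, out_path):
    """Hypothetical wrapper: read both files and write the filtered list."""
    main_lines = Path(main_path).read_text(encoding="utf-8").splitlines()
    unwanted_lines = Path(unwanted_path).read_text(encoding="utf-8").splitlines()
    kept = remove_listed(main_lines, unwanted_lines)
    Path(out_path).write_text("\n".join(kept) + "\n", encoding="utf-8")
    return len(kept)  # number of lines written


# Same toy data as above: the top section and the bottom section.
top = ["one", "two", "three", "four", "five",
       "six", "seven", "eight", "nine", "ten"]
bottom = ["five", "seven"]
print(remove_listed(top, bottom))
# ['one', 'two', 'three', 'four', 'six', 'eight', 'nine', 'ten']
```

Because whole lines are compared, an entry is only dropped when it matches an unwanted line exactly.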
-
Thank you!
-
Hello, @michael-rebusify, @terry-r, @alan-kilborn and All,
I was waiting the Terry’s reply, first, but, in the meantime, I had already imagined a suitable regex, a bit longer, which could handle special cases such as, for instance, duplicate lines in the first part, before the separation line, with no similar line in the second part, which, obviously, should not be considered !
As you replied to @michael-rebusify, and now that we assume that no duplicate line exists, in each section, here is my shortened solution :
SEARCH
(?-s)^(.+)\R(?s)(?=.*^\1(\R|\z))|^%%%.+
REPLACE
Leave EMPTY
Let’s examine, Alan, the differences with your search regex ^(.+?\R)(?=(?s).*?\1) :
- Firstly, note that I added the alternative ^%%%.+ , which grasps, after the first part, all the second section from the separation line, included, which must be deleted too !
- Secondly, I changed the \1 syntax with ^\1(\R|\z) , which forces a line, in the first section, to have an exact equivalent in the second section, even if the last line of the 2nd section does not end with a line-break. See the example, below, to easily pin down the differences ;-))
- Thirdly, it was necessary to place the \R syntax, outside the group 1 , to be able to include the \z assertion
- Fourthly, in order that my second alternative has the implicit (?s) modifier, I needed to place the prior (?s) before the positive look-ahead structure !
So, let’s consider a new N++ tab containing :
- All the File_1 contents
- A line of, at least, 3 percent characters
- All the File_2 contents, with the last line possibly without any line-break
Here is some example data :
one
two
three
four
five
six
seven
eight
nine
ten
twenty-two
%%%%%%%%%%
twenty-two
five
nineteen
seven
After the regex S/R (?-s)^(.+)\R(?s)(?=.*^\1(\R|\z))|^%%%.+ , you should get the expected contents of the new File_1 :
one
two
three
four
six
eight
nine
ten
With your version, Alan, ^(.+?\R)(?=(?s).*?\1) , assuming the second seven word is the very end of file, you would have obtained, after replacement :
one
three
four
six
eight
nine
ten
%%%%%%%%%%
twenty-two
five
nineteen
seven
Best regards,
guy038
-
-
Sometimes I think we run the risk of “oversolving” and confusing an OP. Oftentimes it is best not to read extra things into a specification, especially when the original is described very well.
-
Hi, @alan-kilborn,
I totally agree with your statement. Nevertheless, your regex would be more exact, just adding one more ^ symbol, giving : ^(.+?\R)(?=(?s).*?^\1)
Test it against the very simple text, below :
two
---
twenty-two
The version \1 would wrongly select the word two , whereas the version ^\1 correctly does not find any occurrence ;-))
Cheers,
guy038
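The effect of the extra ^ can be reproduced with a small Python check, given here as an illustration under assumptions: \R is swapped for \n, (?s) is scoped as (?s:...) to suit Python’s re module, and the test text is assumed to end with a line-break. The variable names are made up.

```python
import re

# The simple test text, with a trailing line-break after the last line.
text = "two\n---\ntwenty-two\n"

# Without ^ the back-reference may match in the middle of a later line.
unanchored = re.compile(r"^(.+?\n)(?=(?s:.*?)\1)", re.MULTILINE)
# With ^ the back-reference must begin at the start of a later line.
anchored = re.compile(r"^(.+?\n)(?=(?s:.*?)^\1)", re.MULTILINE)

print(repr(unanchored.sub("", text)))  # '---\ntwenty-two\n'
print(repr(anchored.sub("", text)))    # 'two\n---\ntwenty-two\n'
```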
-
@guy038 said in Remove entries from second file:
I was waiting the Terry’s reply, first
@guy038 you don’t need to await my reply first. Sure I had intended to give a regex answer but other matters got in the way. I knew that those questions needed to be asked in order to get a full and correct understanding of the problem.
My regex would have been similar, although, given that a lookahead can be problematic with large amounts of data, I try to avoid that situation like the plague. I would have simply combined both files and sorted them, thus putting 2 “same” lines together, and used a regex to remove both. If the original ordering needed to be kept, line numbering would have been an option, although it creates more steps overall.
Good thing is the result the OP was seeking has been achieved and that’s all that matters.
Cheers
Terry
-
\1 would wrongly select
Yes. Perhaps I should have invented some dummy url data instead of a simple number-to-word list; if I had done so there would not have been any entry that would have been a subset of another entry, like “two” is a subset of “twenty-two”. Again, I was considering the OP’s well-stated problem case.
-
@Michael-Rebusify said in Remove entries from second file:
0 URL’s which contain 2000 that I need to remove. The 2000 that needs to be removed are in 2.txt.
Is it possible to run some kind of search/replace to perform this action? Thanks in advance!
Hi Michael,
First I would copy your file, then I would record a macro where you delete every second row.
You can then run the macro till the end of the file…
Does that help you?
-
Hello, @terry-r and All,
I said :
I was waiting the Terry’s reply, first,…
Because I think it is fairer to let the first guy, helping the OP, develop his solution ;-)) Then, you can jump into the discussion, proposing alternate solutions, too !
Besides, I know that I’m really too eager to give my regex solutions and, very often, I must prevent some people from proposing their own solutions ;-))
Now, regarding your solution, you’re right about it : best to avoid large amounts of data inside the look-ahead structure ;-))
So, from the short example data, given in my previous post :
one
two
three
four
five
six
seven
eight
nine
ten
twenty-two
%%%%%%%%%%
twenty-two
five
nineteen
seven
First, using the Edit > Column Editor... option, at column 12 or more, we would get :
one        01
two        02
three      03
four       04
five       05
six        06
seven      07
eight      08
nine       09
ten        10
twenty-two 11
%%%%%%%%%% 12
twenty-two 13
five       14
nineteen   15
seven      16
And, after the Edit > Line Operations > Sort Lines Lexicographically Ascending option, we have :
%%%%%%%%%% 12
eight      08
five       05
five       14
four       04
nine       09
nineteen   15
one        01
seven      07
seven      16
six        06
ten        10
three      03
twenty-two 11
twenty-two 13
two        02
Now, using the following regex S/R :
SEARCH
(?-s)^(.+)\x20+\d+\R\1\x20+\d+\R?
OR
(?-s)^(.+)(\x20+\d+\R?)\1(?2)
REPLACE
Leave EMPTY
We are left with :
%%%%%%%%%% 12
eight      08
four       04
nine       09
nineteen   15
one        01
six        06
ten        10
three      03
two        02
Then, moving back the numbers from the end to the beginning of line and adding a space column, with the column mode selection, we would obtain :
12 %%%%%%%%%%
08 eight
04 four
09 nine
15 nineteen
01 one
06 six
10 ten
03 three
02 two
And, after a last ascending sort, we have :
01 one
02 two
03 three
04 four
06 six
08 eight
09 nine
10 ten
12 %%%%%%%%%%
15 nineteen
Finally, after processing this last regex S/R, we get our expected text, removing leading numbers and trailing spaces as well as anything from the separation line till the very end of file :
SEARCH
(?s)^\d+\h*%%%+.+|^\d+\h*|\h+$
REPLACE
Leave EMPTY
one
two
three
four
six
eight
nine
ten
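For readers who prefer a script, the numbering / sort / remove pairs / re-sort sequence above can be sketched in Python as follows. It is an illustration under assumptions, not a replacement for the N++ steps: the function name remove_via_sort and its separator_prefix parameter are made up, each section is assumed to contain unique lines, and no File_1 line is assumed to begin with %%%.

```python
def remove_via_sort(combined_lines, separator_prefix="%%%"):
    """Mimic the numbering / sort / de-duplicate / re-sort steps on a list of lines."""
    # Step 1: remember each line's original position (the Column Editor step).
    tagged = [(line, idx) for idx, line in enumerate(combined_lines)]
    # Step 2: sort by text so that identical lines become neighbours.
    tagged.sort()
    # Step 3: drop both members of every pair of identical neighbours.
    kept = []
    i = 0
    while i < len(tagged):
        if i + 1 < len(tagged) and tagged[i][0] == tagged[i + 1][0]:
            i += 2  # skip the File_1 line and its File_2 twin
        else:
            kept.append(tagged[i])
            i += 1
    # Step 4: restore the original order using the saved positions.
    kept.sort(key=lambda pair: pair[1])
    # Step 5: drop the numbers and everything from the separation line on.
    result = []
    for line, _ in kept:
        if line.startswith(separator_prefix):
            break
        result.append(line)
    return result


combined = [
    "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "ten", "twenty-two",
    "%%%%%%%%%%",
    "twenty-two", "five", "nineteen", "seven",
]
print(remove_via_sort(combined))
# ['one', 'two', 'three', 'four', 'six', 'eight', 'nine', 'ten']
```

Here the positions are kept in tuples rather than appended to the text, so there are no trailing numbers or spaces to strip at the end.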
Cheers,
guy038