Subtract document B from A
-
Hi @guy038 !
Thank you so much for your efforts, they are quite admirable.
I believe we’re getting closer here but when taking your “replace $ with [line of values]” step (55 seconds on my laptop) it stops at line 161,454.
To mitigate this, I’ve cut all data from line 161,455 to a new document and ran the replace again. This time, it stopped at line 48,192.
At this point I noticed a strange phenomenon: I couldn’t delete any text from these files anymore so I decided to start over. A new S/R attempt stopped at line 128,653 and everything after that came out garbled (showing “LF” blocks).
I believe this is related to the punctuation inherent to the content. To better understand this, perhaps it’s easier to just have a look at the actual data. Therefore I have uploaded the files on Gdrive:
DocA.txt
DocB.txt
Warning: do not attempt to visit any of the websites mentioned within these files as they may contain malicious data!
DocA is a hosts file that blocks unwanted servers. DocB holds safe Facebook servers that are supposed to be deleted from DocA.
I hope that helps. If this is too difficult to achieve or too much effort has gone into this already, you really shouldn’t bother. I’ll do it manually each time if I must, no worries :)
Cheers
Timmy -
It may be time to consider the “screaming monkey”…
I tried @Claudia-Frank 's Pythonscript code on your documents and, I don’t know, I either didn’t wait long enough, or something else went wrong, but I ended up having to kill Notepad++ to put a stop to whatever it (or the PS plugin) was doing.
I thought about trying @guy038’s regular expression on the data, but then I thought, “this is too much work; no one will really be willing to do this, let alone remember it”.
So…it may be time to bite the monkey and give this task to an external scripting language. Starting with @Claudia-Frank 's Pythonscript code, I came up with the following Python3 code that runs on your data files so quickly that I didn’t even think to time it:
    with open('DocA.txt', 'r', encoding='utf-8') as filea: linesa = filea.readlines()
    with open('DocB.txt', 'r', encoding='utf-8') as fileb: setb = set(fileb.readlines())
    linesa = sorted(linesa)
    seta = set(linesa)
    linesc = sorted(list(seta - setb))
    with open('DocA_sorted.txt', 'w', encoding='utf-8') as filea: filea.write(''.join(linesa))
    with open('DocC.txt', 'w', encoding='utf-8') as filec: filec.write(''.join(linesc))
I sorted the lines because that makes comparing DocA_sorted and DocC (where C = B - A) easy when attempting to validate the results.
Sometimes it’s just best to leave the confines of Notepad++ for a task, and I’m sure Perl or AWK or “insert your favorite programming language here” works just as well as Python3 for this problem.
-
tried it ten times in a row without a problem. Takes < 1sec to do the job.
Cheers
Claudia -
Interesting…still don’t know what was going wrong with it for me…but I’m not inclined to spend any more time to find out. :-)
Some tasks just sort of feel right outside Notepad++, and for me this is one of them. I guess if there was a no-brainer menu function that did it I’d feel differently, but I dunno, do you think we’ll ever see that? :-)
And BTW, I should have said C = A - B above.
-
Hi, @timmy-M, @claudia-frank, @scott-sumner and All,
As Scott said, I do think that an external scripting language is the best way to solve your goal ! And I’m convinced that both Claudia’s and Scott’s scripts work just fine !
But your stubborn servant just gave a try to a regex solution, which is, of course, longer and less elegant than a script :-((
Timmy, I could, correctly, download your two files DocA.txt and DocB.txt, without any problem!
- The hosts file, DocA.txt, is a Unix UTF-8 encoded file, containing 255,386 lines
- The safe Facebook servers file, DocB.txt, is a Windows UTF-8 BOM encoded file, containing only 113 lines
Now, if we try to join all the lines of your DocB.txt, containing the safe sites, into a single line, we get a line of 3,701 bytes. But, if we were going to add 3,701 bytes to each line of DocA.txt to simulate my 2nd method, in my previous post, the total size of DocA.txt would have increased to about 908 MiB (6,788,298 + 255,386 * 3,701 = 951,971,884 bytes!). Obviously, this resulting file would be too big to handle while running even simple regexes :-((
So, …we need to find a 3rd way!
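As a quick sanity check of that estimate, the arithmetic can be reproduced in a few lines of Python. The figures are the ones quoted in the post; the variable names are only illustrative.

```python
# Figures quoted in the post above; variable names are illustrative.
doc_a_bytes = 6_788_298      # current size of DocA.txt
doc_a_lines = 255_386        # number of lines in DocA.txt
safe_line_bytes = 3_701      # DocB.txt joined into one line

# Appending the joined DocB line to every line of DocA.
total_bytes = doc_a_bytes + doc_a_lines * safe_line_bytes
total_mib = total_bytes / 1024 ** 2

print(total_bytes)           # 951971884
print(round(total_mib))      # 908
```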
To begin with, I copied DocA.txt as DocC.txt and DocB.txt as DocD.txt
Then, I did some verifications on these two files:
- I normalized all letters of these files to lower-case (with Ctrl + A and Ctrl + U)
- I performed a classical sort (Edit > Line Operations > Sort Lines Lexicographically Ascending)
Note: The sort comes after the Ctrl + U command, as the sort is a true Unicode sort!
- Then I ran the regex ^(.+\R)\1+ to search for possible consecutive duplicate lines
- DocD.txt did not contain any duplicate line
- DocC.txt contained 118 duplicates, which appeared due to the previous normalization to lower-case. So, in order to delete them, I used the regex S/R:
SEARCH
^(.+\R)\1+
REPLACE
\1
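For readers who prefer a script, the lower-case / sort / remove-consecutive-duplicates sequence above can be sketched in Python roughly as follows. This is only an approximation under stated assumptions: the function name and sample lines are hypothetical, and Python sorts by code point, which may order some lines differently from the Notepad++ sort (the de-duplication result is the same either way).

```python
def normalize_and_dedupe(lines):
    """Lower-case every line, sort, and keep one copy of each line."""
    lowered = [line.lower() for line in lines]
    result = []
    for line in sorted(lowered):
        # After sorting, duplicates are adjacent, so comparing with the
        # previously kept line is enough (the job of ^(.+\R)\1+ -> \1).
        if not result or line != result[-1]:
            result.append(line)
    return result

# Hypothetical sample data, not taken from DocA.txt.
sample = [
    "0.0.0.0 Ads.Example.com",
    "0.0.0.0 ads.example.com",
    "0.0.0.0 a.example.net",
]
cleaned = normalize_and_dedupe(sample)
print(cleaned)
```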
So, DocC.txt now contained 255,268 lines
- I noticed that, in DocD.txt, many addresses ended with “facebook.com” or “fbcdn.net”, for instance. So, after copying the DocD.txt contents into a new tab, I decided to get an idea of the different sites regarded as safe, with the regex S/R below:
SEARCH
^.{8}.+[.-]([^.\r\n]+\.[^.\r\n]+)$
REPLACE
\1\t\t\t\t\t$0
After a classical sort, I obtained the following list:
amazonaws.com   0.0.0.0 ec2-34-193-80-93.compute-1.amazonaws.com
facebook.com    0.0.0.0 0-act.channel.facebook.com
facebook.com    0.0.0.0 0-edge-chat.facebook.com
facebook.com    0.0.0.0 1-act.channel.facebook.com
facebook.com    0.0.0.0 1-edge-chat.facebook.com
facebook.com    0.0.0.0 2-act.channel.facebook.com
facebook.com    0.0.0.0 2-edge-chat.facebook.com
facebook.com    0.0.0.0 3-act.channel.facebook.com
facebook.com    0.0.0.0 3-edge-chat.facebook.com
facebook.com    0.0.0.0 4-act.channel.facebook.com
facebook.com    0.0.0.0 4-edge-chat.facebook.com
facebook.com    0.0.0.0 5-act.channel.facebook.com
facebook.com    0.0.0.0 5-edge-chat.facebook.com
facebook.com    0.0.0.0 6-act.channel.facebook.com
facebook.com    0.0.0.0 6-edge-chat.facebook.com
facebook.com    0.0.0.0 act.channel.facebook.com
facebook.com    0.0.0.0 api-read.facebook.com
facebook.com    0.0.0.0 api.ak.facebook.com
facebook.com    0.0.0.0 api.connect.facebook.com
facebook.com    0.0.0.0 app.logs-facebook.com
facebook.com    0.0.0.0 ar-ar.facebook.com
facebook.com    0.0.0.0 attachments.facebook.com
facebook.com    0.0.0.0 b-api.facebook.com
facebook.com    0.0.0.0 b-graph.facebook.com
facebook.com    0.0.0.0 b.static.ak.facebook.com
facebook.com    0.0.0.0 badge.facebook.com
facebook.com    0.0.0.0 beta-chat-01-05-ash3.facebook.com
facebook.com    0.0.0.0 bigzipfiles.facebook.com
facebook.com    0.0.0.0 channel-ecmp-05-ash3.facebook.com
facebook.com    0.0.0.0 channel-staging-ecmp-05-ash3.facebook.com
facebook.com    0.0.0.0 channel-testing-ecmp-05-ash3.facebook.com
facebook.com    0.0.0.0 check4.facebook.com
facebook.com    0.0.0.0 check6.facebook.com
facebook.com    0.0.0.0 creative.ak.facebook.com
facebook.com    0.0.0.0 d.facebook.com
facebook.com    0.0.0.0 de-de.facebook.com
facebook.com    0.0.0.0 developers.facebook.com
facebook.com    0.0.0.0 edge-chat.facebook.com
facebook.com    0.0.0.0 edge-mqtt-mini-shv-01-lax3.facebook.com
facebook.com    0.0.0.0 edge-mqtt-mini-shv-02-lax3.facebook.com
facebook.com    0.0.0.0 edge-star-shv-01-lax3.facebook.com
facebook.com    0.0.0.0 edge-star-shv-02-lax3.facebook.com
facebook.com    0.0.0.0 error.facebook.com
facebook.com    0.0.0.0 es-la.facebook.com
facebook.com    0.0.0.0 fr-fr.facebook.com
facebook.com    0.0.0.0 graph.facebook.com
facebook.com    0.0.0.0 hi-in.facebook.com
facebook.com    0.0.0.0 inyour-slb-01-05-ash3.facebook.com
facebook.com    0.0.0.0 it-it.facebook.com
facebook.com    0.0.0.0 ja-jp.facebook.com
facebook.com    0.0.0.0 messages-facebook.com
facebook.com    0.0.0.0 mqtt.facebook.com
facebook.com    0.0.0.0 orcart.facebook.com
facebook.com    0.0.0.0 origincache-starfacebook-ai-01-05-ash3.facebook.com
facebook.com    0.0.0.0 pixel.facebook.com
facebook.com    0.0.0.0 profile.ak.facebook.com
facebook.com    0.0.0.0 pt-br.facebook.com
facebook.com    0.0.0.0 s-static.ak.facebook.com
facebook.com    0.0.0.0 s-static.facebook.com
facebook.com    0.0.0.0 secure-profile.facebook.com
facebook.com    0.0.0.0 ssl.connect.facebook.com
facebook.com    0.0.0.0 star.c10r.facebook.com
facebook.com    0.0.0.0 star.facebook.com
facebook.com    0.0.0.0 static.ak.connect.facebook.com
facebook.com    0.0.0.0 static.ak.facebook.com
facebook.com    0.0.0.0 staticxx.facebook.com
facebook.com    0.0.0.0 touch.facebook.com
facebook.com    0.0.0.0 upload.facebook.com
facebook.com    0.0.0.0 vupload.facebook.com
facebook.com    0.0.0.0 vupload2.vvv.facebook.com
facebook.com    0.0.0.0 www.login.facebook.com
facebook.com    0.0.0.0 zh-cn.facebook.com
facebook.com    0.0.0.0 zh-tw.facebook.com
fbcdn.net       0.0.0.0 b.static.ak.fbcdn.net
fbcdn.net       0.0.0.0 creative.ak.fbcdn.net
fbcdn.net       0.0.0.0 ent-a.xx.fbcdn.net
fbcdn.net       0.0.0.0 ent-b.xx.fbcdn.net
fbcdn.net       0.0.0.0 ent-c.xx.fbcdn.net
fbcdn.net       0.0.0.0 ent-d.xx.fbcdn.net
fbcdn.net       0.0.0.0 ent-e.xx.fbcdn.net
fbcdn.net       0.0.0.0 external.ak.fbcdn.net
fbcdn.net       0.0.0.0 origincache-ai-01-05-ash3.fbcdn.net
fbcdn.net       0.0.0.0 photos-a.ak.fbcdn.net
fbcdn.net       0.0.0.0 photos-b.ak.fbcdn.net
fbcdn.net       0.0.0.0 photos-c.ak.fbcdn.net
fbcdn.net       0.0.0.0 photos-d.ak.fbcdn.net
fbcdn.net       0.0.0.0 photos-e.ak.fbcdn.net
fbcdn.net       0.0.0.0 photos-f.ak.fbcdn.net
fbcdn.net       0.0.0.0 photos-g.ak.fbcdn.net
fbcdn.net       0.0.0.0 photos-h.ak.fbcdn.net
fbcdn.net       0.0.0.0 profile.ak.fbcdn.net
fbcdn.net       0.0.0.0 s-external.ak.fbcdn.net
fbcdn.net       0.0.0.0 s-static.ak.fbcdn.net
fbcdn.net       0.0.0.0 scontent-a-lax.xx.fbcdn.net
fbcdn.net       0.0.0.0 scontent-a-sin.xx.fbcdn.net
fbcdn.net       0.0.0.0 scontent-a.xx.fbcdn.net
fbcdn.net       0.0.0.0 scontent-b-lax.xx.fbcdn.net
fbcdn.net       0.0.0.0 scontent-b-sin.xx.fbcdn.net
fbcdn.net       0.0.0.0 scontent-b.xx.fbcdn.net
fbcdn.net       0.0.0.0 scontent-c.xx.fbcdn.net
fbcdn.net       0.0.0.0 scontent-d.xx.fbcdn.net
fbcdn.net       0.0.0.0 scontent-e.xx.fbcdn.net
fbcdn.net       0.0.0.0 scontent-mxp.xx.fbcdn.net
fbcdn.net       0.0.0.0 scontent.xx.fbcdn.net
fbcdn.net       0.0.0.0 sphotos-a.xx.fbcdn.net
fbcdn.net       0.0.0.0 static.ak.fbcdn.net
fbcdn.net       0.0.0.0 video.xx.fbcdn.net
fbcdn.net       0.0.0.0 vthumb.ak.fbcdn.net
fbcdn.net       0.0.0.0 xx-fbcdn-shv-01-lax3.fbcdn.net
fbcdn.net       0.0.0.0 xx-fbcdn-shv-02-lax3.fbcdn.net
net23.net       0.0.0.0 scontent-vie-224-xx-fbcdn.net23.net
net23.net       0.0.0.0 scontent-vie-73-xx-fbcdn.net23.net
net23.net       0.0.0.0 scontent-vie-75-xx-fbcdn.net23.net
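To see how the suffix-extraction S/R above picks the last two labels of each address, here is a small Python sketch using the same search pattern. The helper name is hypothetical, a single tab is used instead of five, and Python spells the whole match as \g<0> where Notepad++ uses $0.

```python
import re

# Same pattern as the SEARCH field above: skip the 8-character
# "0.0.0.0 " prefix, then capture the last two dot-separated labels.
SUFFIX_RE = re.compile(r'^.{8}.+[.-]([^.\r\n]+\.[^.\r\n]+)$')

def tag_with_suffix(line):
    """Return 'suffix<TAB>original line', or the line unchanged if no match."""
    # Python writes the whole match as \g<0> where Notepad++ uses $0.
    return SUFFIX_RE.sub(r'\1\t\g<0>', line)

for entry in ("0.0.0.0 0-act.channel.facebook.com",
              "0.0.0.0 messages-facebook.com",
              "0.0.0.0 scontent-vie-73-xx-fbcdn.net23.net"):
    print(tag_with_suffix(entry))
```

Because the leading .+ is greedy, the captured group is the shortest possible tail, which is why messages-facebook.com is filed under facebook.com and the net23.net hosts are not filed under fbcdn.net.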
And a quick examination shows that all these 113 sites, considered as safe, have an address ending with one of the 4 values below:
amazonaws.com
facebook.com
fbcdn.net
net23.net
Thus, from the remaining 255,268 lines of DocC.txt, only those whose address ends with amazonaws.com, facebook.com, fbcdn.net, or net23.net have to be compared with the list of the 113 safe servers!
- So, in DocC.txt, I bookmarked all lines matching the regex (amazonaws|facebook)\.com$|(fbcdn|net23)\.net$. I obtained 301 bookmarked results that I copied to the clipboard with the option Search > Bookmark > Copy Bookmarked Lines
- Then, I pasted the 301 bookmarked lines at the very beginning of DocD.txt, followed by a line of at least 3 dashes as a separator. Finally, DocD.txt contained 301 results to verify + 1 line of dashes + 113 safe sites, that is to say a total of 415 lines
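The bookmarking step amounts to filtering lines by their ending. A minimal Python sketch of that filter, reusing the same alternation, could look like this; the function name and the sample lines are hypothetical, and lines are assumed to carry no trailing \r.

```python
import re

# Same alternation as the bookmark regex above.
SAFE_SUFFIX_RE = re.compile(r'(amazonaws|facebook)\.com$|(fbcdn|net23)\.net$')

def candidate_lines(lines):
    """Keep only the lines whose address ends with one of the 4 suffixes."""
    return [line for line in lines if SAFE_SUFFIX_RE.search(line)]

# Hypothetical sample lines standing in for DocC.txt.
sample = [
    "0.0.0.0 ads.example.com",
    "0.0.0.0 pixel.facebook.com",
    "0.0.0.0 tracker.example.net",
    "0.0.0.0 video.xx.fbcdn.net",
]
print(candidate_lines(sample))
```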
Now, with the regex (?-s)^(.+)\R(?s)(?=.*\R\1\R?), I marked all the lines which have a duplicate further on in DocD.txt => 106 lines bookmarked!
And using the command Search > Bookmark > Remove Unmarked Lines, I was left with a DocD.txt file containing only 106 lines, which are in both DocA.txt and DocB.txt. So, these lines / sites needed to be deleted from the DocC.txt file!
To that purpose:
- I added these 106 lines at the end of the DocC.txt file, giving a file with 255,374 lines
- I did a last sort operation (Edit > Line Operations > Sort Lines Lexicographically Ascending)
=> After the sort, each of these 106 sites should appear as a block of two consecutive duplicate lines
And, finally, with the regex S/R:
SEARCH
^(.+\R)\1+
REPLACE
Leave EMPTY
I got rid of these safe servers and obtained a DocC.txt file of 255,162 lines (255,374 - 106 * 2) which contains only unwanted servers!
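Outside the editor, the last two stages (finding the lines common to both lists, then removing them) do not need the append / sort / delete-pairs trick. A rough Python equivalent, with a hypothetical function name and made-up sample data, might be:

```python
def subtract_safe(doc_c_lines, candidates, safe_lines):
    """Return doc_c_lines without the candidates that are also safe."""
    safe = set(safe_lines)
    # Lines present in both lists (the 106 bookmarked lines in the post).
    in_both = {line for line in candidates if line in safe}
    # Dropping them directly replaces the append / sort / delete-pairs trick.
    return [line for line in doc_c_lines if line not in in_both]

# Hypothetical sample data.
doc_c = [
    "0.0.0.0 ads.example.com",
    "0.0.0.0 evil.facebook.com",
    "0.0.0.0 pixel.facebook.com",
]
candidates = ["0.0.0.0 evil.facebook.com", "0.0.0.0 pixel.facebook.com"]
safe = ["0.0.0.0 pixel.facebook.com", "0.0.0.0 touch.facebook.com"]
remaining = subtract_safe(doc_c, candidates, safe)
print(remaining)
```

Here the safe line that is absent from doc_c (touch.facebook.com) is simply ignored, mirroring the 7 DocB.txt lines that were never in DocA.txt.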
Now, if you do not want to repeat all the above process, by yourself, just send me an e-mail to :
And I’ll send you back this DocC.txt (= DocA.txt - DocB.txt) of 255,162 lines
Cheers,
guy038
P.S.:
Timmy, to be exhaustive, and regarding your initial files:
DocB.txt contains 113 lines / safe servers:
- 7 lines, not present in DocA.txt
- 106 lines, present in DocA.txt
And DocA.txt contains 255,386 lines:
- 118 duplicate lines (due to case normalization), which have been deleted
- 106 lines which are present in both DocA.txt and DocB.txt and have been deleted, too!
- 254,967 lines, with an end of line different from amazonaws.com, facebook.com, fbcdn.net, and net23.net, which have to remain in DocA.txt
- 195 lines, with an end of line equal to amazonaws.com, facebook.com, fbcdn.net, or net23.net but not present in DocB.txt. So, they must remain in DocA.txt, too!
Hence, the final DocA.txt (= my DocC.txt), which contains 255,162 unwanted servers (254,967 + 195)
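The counts in this breakdown can be cross-checked mechanically. The sketch below only restates numbers quoted in the thread; the variable names are illustrative.

```python
# Counts quoted in the thread; names are illustrative.
doc_a_total = 255_386
case_duplicates = 118
in_both = 106            # present in DocA.txt and DocB.txt
only_in_b = 7
other_suffix = 254_967   # lines not ending with one of the 4 suffixes
suffix_not_safe = 195    # matching suffix, but absent from DocB.txt

assert only_in_b + in_both == 113                       # DocB.txt
assert doc_a_total - case_duplicates == 255_268         # after de-duplication
assert in_both + suffix_not_safe == 301                 # bookmarked candidates
assert 255_268 + in_both - 2 * in_both == 255_162       # append, then delete pairs
assert other_suffix + suffix_not_safe == 255_162        # final DocC.txt
print("all counts agree")
```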
-
“Replace All: 106 occurrences were replaced.”
@guy038 You sir, are a regex LEGEND! :D
Of course I have taken every step you’ve explained, it’s the least I could do after you’ve put in such effort. I even found a minor mistake in your explanation (I think):
Now, with the regex (?-s)^(.+)\R(?s)(?=.*\R\1\R?), I marked all the lines which have a duplicate, further on, in DocD.txt
Should be bookmarked (with regular marking, the “Remove Unmarked Lines” removes all lines).
@Scott-Sumner @Claudia-Frank
I guess it’s time to face that monkey then… I installed the Python plugin script, navigated to Python Script > New Script and created ScreamingMonkey.py
Now, selecting Run Previous Script (ScreamingMonkey) doesn’t do anything. I’ve done some searching and found that I should install Python first from python.org/downloads/ , should I? Version 3?
Before installing something I know nothing about, I’d like to verify with you guys whether that is the right course of action. I hope that’s okay :) -
no, there is no need to install an additional python package, as the python script plugin delivers its own.
What you need to do is to open your two files like I’ve shown above, as the script will reference them by using editor1 and editor2, and click on ScreamingMonkey.
Once it has been run you can then call it again by using Run Previous Script.
What do you see if you try to open the python script console?
Plugin->PythonScript->Show Console.
If nothing happens, no new window opens, then how did you install it, via plugin manager or via the msi package?
The msi installation is preferred as it was reported that the installation via plugin manager sometimes doesn’t copy all needed files.
Cheers
Claudia -
Hello, @timmy-m and All,
Ah, yes! You’re right, although what I wrote was correct, too! Indeed, I used Remove Unmarked Lines in order to change DocD.txt first => So, 106 lines remained, which I then added to the DocC.txt contents.
But you certainly used this easier solution, below:
…
…
Now, with the regex (?-s)^(.+)\R(?s)(?=.*\R\1\R?), I marked all the lines which have a duplicate further on in DocD.txt => 106 lines bookmarked!
And using the command Search > Bookmark > Copy Bookmarked Lines, I put on the clipboard these 106 lines only, which are in both DocA.txt and DocB.txt and have to be deleted from the DocC.txt file!
To that purpose:
- I paste these 106 lines (Ctrl + V) at the end of the DocC.txt file, giving a file with 255,374 lines
…
…
Best Regards,
guy038
-
Regarding “marked versus bookmarked”:
Not @guy038 's fault… The Notepad++ user interface is confusing on this point; it uses the terminology “mark” often when it should (IMO) use “bookmark”. Alternatively, if it used “redmark” and “bookmark” exclusively there would be no confusion, but I suppose using “red” wouldn’t totally be correct as it is just the default color and can be changed by the user.
-
I used the msi install method. First updated N++ to latest, then installed Python plugin.
@Claudia-Frank said:
What do you see if you try to open the python script console?
Plugin->PythonScript->Show Console.
---------------------------
Python 2.7.6-notepad++ r2 (default, Apr 21 2014, 19:26:54) [MSC v.1600 32 bit (Intel)]
Initialisation took 31ms
Ready.
Traceback (most recent call last):
  File "%AppData%\Roaming\Notepad++\plugins\Config\PythonScript\scripts\ScreamingMonkey.py", line 1, in <module>
    with open('DocA.txt', 'r', encoding='utf-8') as filea: linesa = filea.readlines()
TypeError: 'encoding' is an invalid keyword argument for this function
Traceback (most recent call last):
  File "%AppData%\Roaming\Notepad++\plugins\Config\PythonScript\scripts\ScreamingMonkey.py", line 1, in <module>
    with open('DocA.txt', 'r', encoding='utf-8') as filea: linesa = filea.readlines()
TypeError: 'encoding' is an invalid keyword argument for this function
---------------------------
Eh… Lil’ help please? That monkey’s screaming really loud right now O_O
-
but that is not my script.
Scott’s script assumes you run it under Python 3, as it uses the encoding argument, which is not available in Python 2, the version used by the Python Script plugin.
Did you try my 3 lines code posted above?
If you still want to use Scott’s code and execute it via the Python Script plugin, then get rid of the encoding='utf-8', so something like
    with open('DocA.txt', 'r') as filea ...
but you have to ensure that DocA.txt can be found, and the Python Script working directory is set to the directory where notepad++.exe is.
Cheers
Claudia -
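An alternative to dropping the argument, offered here only as a sketch: io.open accepts encoding= on both Python 2.6+ and Python 3, so a script written with it should run under either. This assumes the io module is available in the plugin's bundled Python 2.7, which was not verified in this thread; the temporary file below is hypothetical and stands in for DocA.txt.

```python
import io
import os
import tempfile

# io.open accepts encoding= on both Python 2.6+ and Python 3,
# unlike the built-in open() of Python 2.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')  # hypothetical file
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(u'0.0.0.0 example.com\n')

with io.open(path, 'r', encoding='utf-8') as handle:
    lines = handle.readlines()

print(lines)
```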
Yea, for sure go with @Claudia-Frank 's script, since you seem to want to stay more “within” Notepad++!
-
@Claudia-Frank said:
Did you try my 3 lines code posted above?
I have and eventually succeeded! On my first dozen attempts (turns out I really suck at taming monkeys) I just kept reorganizing the order of all entries in either DocA or DocB, depending on which one was selected, without any entry getting removed.
After trying a bunch of stuff, I found that I had to use View > Move/Clone Current Document > Move to Other View on Doc B. I just couldn’t get my head around “view0” and “view1” at first but didn’t want to bother you with such mundane question either ^_^
DocA’s order does get messed up eventually but that’s easily fixed with a regular sort. Perhaps that sort can be implemented in the script?
Anyway, it all worked out! Thank you very much @Scott-Sumner, @Claudia-Frank and @guy038 for this learning experience. You’re a dream team ;)
-
good to see that you got it sorted out :-)
and to keep sorting, change this
editor1.setText('\r\n'.join(s1-s2))
to this
editor1.setText('\r\n'.join(sorted(s1-s2)))
that should do the trick.
Cheers
Claudia
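Why the order got scrambled in the first place: assuming s1 and s2 are Python sets of lines (the original script is not shown here), a set difference has no defined order, so the result has to be sorted before joining. A standalone illustration with made-up data:

```python
# Hypothetical stand-ins for the text of the two views.
s1 = {"0.0.0.0 b.example.com", "0.0.0.0 a.example.com", "0.0.0.0 pixel.facebook.com"}
s2 = {"0.0.0.0 pixel.facebook.com"}

# A set has no defined order, so sort the difference before joining.
result = '\r\n'.join(sorted(s1 - s2))
print(repr(result))
```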