    Subtract document B from A

    • guy038
      last edited by guy038

      Hi, @timmy-M, @claudia-frank, @scott-sumner and All,

      As Scott said, I do think that an external scripting language is the best way to reach your goal ! And I’m convinced that both Claudia’s and Scott’s scripts work just fine !

      But your stubborn servant still gave a regex solution a try, which is, of course, longer and less elegant than a script :-((


      Timmy, I was able to download your two files, DocA.txt and DocB.txt, without any problem !

      • The hosts file, DocA.txt, is a Unix-format, UTF-8 encoded file, containing 255,386 lines

      • The Safe Facebook servers file, DocB.txt, is a Windows-format, UTF-8-BOM encoded file, containing only 113 lines

      Now, if we join all the lines of your DocB.txt, containing the safe sites, into a single line, we get a line of 3,701 bytes. But, if we added these 3,701 bytes to each line of DocA.txt, to simulate the 2nd method of my previous post, the total size of DocA.txt would grow to about 908 MB ( 6,788,298 + 255,386 * 3,701 = 951,971,884 bytes ! ) Obviously, the resulting file would be too big to handle, even with simple regexes :-((
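The size estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic quoted in the post; the variable names are made up for illustration.

```python
# Figures quoted in the post; the names are illustrative only.
doc_a_bytes = 6_788_298    # current size of DocA.txt
doc_a_lines = 255_386      # number of lines in DocA.txt
safe_list_bytes = 3_701    # DocB.txt joined into a single line

# Appending the whole safe list to every line of DocA.txt.
projected_bytes = doc_a_bytes + doc_a_lines * safe_list_bytes
projected_mib = projected_bytes / (1024 * 1024)

print(projected_bytes)
print(round(projected_mib))
```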

      So, …we need to find a 3rd way !

      To begin with, I copied DocA.txt as DocC.txt and DocB.txt as DocD.txt

      Then, I did some verifications on these two files :

      • I converted all letters in these files to lower-case ( with Ctrl + A, then Ctrl + U )

      • I performed a classical sort ( Edit > Line Operations > Sort Lines Lexicographically Ascending )

      Note : the sort must come after the Ctrl + U command, as it is a true Unicode sort, so upper-case and lower-case forms of a same line would not end up next to each other !

      • Then I ran the regex ^(.+\R)\1+ to search for possible duplicate consecutive lines

        • DocD.txt did not contain any duplicate line

        • DocC.txt contained 118 duplicates, which appeared due to the previous normalization to lower-case. So, in order to delete them, I used the regex S/R :

      SEARCH ^(.+\R)\1+

      REPLACE \1

      So, DocC.txt, now, contained 255,268 lines
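For readers who prefer to see the three preparation steps in script form, here is a minimal Python sketch of them (lower-case, sort, collapse consecutive duplicates). It is an approximation, not what Notepad++ does internally: Python's `re` module has no `\R`, so a plain `\n` stands in for the line break, and the `.example` host names are invented test data.

```python
import re

def normalize_sort_dedupe(text):
    """Lower-case, sort, and collapse runs of identical consecutive lines."""
    lines = sorted(text.lower().splitlines())
    joined = "".join(line + "\n" for line in lines)
    # Same idea as SEARCH ^(.+\R)\1+ / REPLACE \1, with "\n" instead of \R.
    return re.sub(r"^(.+\n)\1+", r"\1", joined, flags=re.MULTILINE)

# Hypothetical sample: the first and third lines differ only by case.
sample = "0.0.0.0 B.example\n0.0.0.0 a.example\n0.0.0.0 b.example\n"
cleaned = normalize_sort_dedupe(sample)
print(cleaned)
```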

      • I noticed that, in DocD.txt, many addresses ended with “facebook.com” or “fbcdn.net”, for instance. So, after copying the DocD.txt contents into a new tab, I decided to get an idea of the different sites regarded as safe, with the regex S/R below :

      SEARCH ^.{8}.+[.-]([^.\r\n]+\.[^.\r\n]+)$

      REPLACE \1\t\t\t\t\t$0

      After a classical sort, I obtained the following list :

      amazonaws.com					0.0.0.0 ec2-34-193-80-93.compute-1.amazonaws.com
      facebook.com					0.0.0.0 0-act.channel.facebook.com
      facebook.com					0.0.0.0 0-edge-chat.facebook.com
      facebook.com					0.0.0.0 1-act.channel.facebook.com
      facebook.com					0.0.0.0 1-edge-chat.facebook.com
      facebook.com					0.0.0.0 2-act.channel.facebook.com
      facebook.com					0.0.0.0 2-edge-chat.facebook.com
      facebook.com					0.0.0.0 3-act.channel.facebook.com
      facebook.com					0.0.0.0 3-edge-chat.facebook.com
      facebook.com					0.0.0.0 4-act.channel.facebook.com
      facebook.com					0.0.0.0 4-edge-chat.facebook.com
      facebook.com					0.0.0.0 5-act.channel.facebook.com
      facebook.com					0.0.0.0 5-edge-chat.facebook.com
      facebook.com					0.0.0.0 6-act.channel.facebook.com
      facebook.com					0.0.0.0 6-edge-chat.facebook.com
      facebook.com					0.0.0.0 act.channel.facebook.com
      facebook.com					0.0.0.0 api-read.facebook.com
      facebook.com					0.0.0.0 api.ak.facebook.com
      facebook.com					0.0.0.0 api.connect.facebook.com
      facebook.com					0.0.0.0 app.logs-facebook.com
      facebook.com					0.0.0.0 ar-ar.facebook.com
      facebook.com					0.0.0.0 attachments.facebook.com
      facebook.com					0.0.0.0 b-api.facebook.com
      facebook.com					0.0.0.0 b-graph.facebook.com
      facebook.com					0.0.0.0 b.static.ak.facebook.com
      facebook.com					0.0.0.0 badge.facebook.com
      facebook.com					0.0.0.0 beta-chat-01-05-ash3.facebook.com
      facebook.com					0.0.0.0 bigzipfiles.facebook.com
      facebook.com					0.0.0.0 channel-ecmp-05-ash3.facebook.com
      facebook.com					0.0.0.0 channel-staging-ecmp-05-ash3.facebook.com
      facebook.com					0.0.0.0 channel-testing-ecmp-05-ash3.facebook.com
      facebook.com					0.0.0.0 check4.facebook.com
      facebook.com					0.0.0.0 check6.facebook.com
      facebook.com					0.0.0.0 creative.ak.facebook.com
      facebook.com					0.0.0.0 d.facebook.com
      facebook.com					0.0.0.0 de-de.facebook.com
      facebook.com					0.0.0.0 developers.facebook.com
      facebook.com					0.0.0.0 edge-chat.facebook.com
      facebook.com					0.0.0.0 edge-mqtt-mini-shv-01-lax3.facebook.com
      facebook.com					0.0.0.0 edge-mqtt-mini-shv-02-lax3.facebook.com
      facebook.com					0.0.0.0 edge-star-shv-01-lax3.facebook.com
      facebook.com					0.0.0.0 edge-star-shv-02-lax3.facebook.com
      facebook.com					0.0.0.0 error.facebook.com
      facebook.com					0.0.0.0 es-la.facebook.com
      facebook.com					0.0.0.0 fr-fr.facebook.com
      facebook.com					0.0.0.0 graph.facebook.com
      facebook.com					0.0.0.0 hi-in.facebook.com
      facebook.com					0.0.0.0 inyour-slb-01-05-ash3.facebook.com
      facebook.com					0.0.0.0 it-it.facebook.com
      facebook.com					0.0.0.0 ja-jp.facebook.com
      facebook.com					0.0.0.0 messages-facebook.com
      facebook.com					0.0.0.0 mqtt.facebook.com
      facebook.com					0.0.0.0 orcart.facebook.com
      facebook.com					0.0.0.0 origincache-starfacebook-ai-01-05-ash3.facebook.com
      facebook.com					0.0.0.0 pixel.facebook.com
      facebook.com					0.0.0.0 profile.ak.facebook.com
      facebook.com					0.0.0.0 pt-br.facebook.com
      facebook.com					0.0.0.0 s-static.ak.facebook.com
      facebook.com					0.0.0.0 s-static.facebook.com
      facebook.com					0.0.0.0 secure-profile.facebook.com
      facebook.com					0.0.0.0 ssl.connect.facebook.com
      facebook.com					0.0.0.0 star.c10r.facebook.com
      facebook.com					0.0.0.0 star.facebook.com
      facebook.com					0.0.0.0 static.ak.connect.facebook.com
      facebook.com					0.0.0.0 static.ak.facebook.com
      facebook.com					0.0.0.0 staticxx.facebook.com
      facebook.com					0.0.0.0 touch.facebook.com
      facebook.com					0.0.0.0 upload.facebook.com
      facebook.com					0.0.0.0 vupload.facebook.com
      facebook.com					0.0.0.0 vupload2.vvv.facebook.com
      facebook.com					0.0.0.0 www.login.facebook.com
      facebook.com					0.0.0.0 zh-cn.facebook.com
      facebook.com					0.0.0.0 zh-tw.facebook.com
      fbcdn.net					0.0.0.0 b.static.ak.fbcdn.net
      fbcdn.net					0.0.0.0 creative.ak.fbcdn.net
      fbcdn.net					0.0.0.0 ent-a.xx.fbcdn.net
      fbcdn.net					0.0.0.0 ent-b.xx.fbcdn.net
      fbcdn.net					0.0.0.0 ent-c.xx.fbcdn.net
      fbcdn.net					0.0.0.0 ent-d.xx.fbcdn.net
      fbcdn.net					0.0.0.0 ent-e.xx.fbcdn.net
      fbcdn.net					0.0.0.0 external.ak.fbcdn.net
      fbcdn.net					0.0.0.0 origincache-ai-01-05-ash3.fbcdn.net
      fbcdn.net					0.0.0.0 photos-a.ak.fbcdn.net
      fbcdn.net					0.0.0.0 photos-b.ak.fbcdn.net
      fbcdn.net					0.0.0.0 photos-c.ak.fbcdn.net
      fbcdn.net					0.0.0.0 photos-d.ak.fbcdn.net
      fbcdn.net					0.0.0.0 photos-e.ak.fbcdn.net
      fbcdn.net					0.0.0.0 photos-f.ak.fbcdn.net
      fbcdn.net					0.0.0.0 photos-g.ak.fbcdn.net
      fbcdn.net					0.0.0.0 photos-h.ak.fbcdn.net
      fbcdn.net					0.0.0.0 profile.ak.fbcdn.net
      fbcdn.net					0.0.0.0 s-external.ak.fbcdn.net
      fbcdn.net					0.0.0.0 s-static.ak.fbcdn.net
      fbcdn.net					0.0.0.0 scontent-a-lax.xx.fbcdn.net
      fbcdn.net					0.0.0.0 scontent-a-sin.xx.fbcdn.net
      fbcdn.net					0.0.0.0 scontent-a.xx.fbcdn.net
      fbcdn.net					0.0.0.0 scontent-b-lax.xx.fbcdn.net
      fbcdn.net					0.0.0.0 scontent-b-sin.xx.fbcdn.net
      fbcdn.net					0.0.0.0 scontent-b.xx.fbcdn.net
      fbcdn.net					0.0.0.0 scontent-c.xx.fbcdn.net
      fbcdn.net					0.0.0.0 scontent-d.xx.fbcdn.net
      fbcdn.net					0.0.0.0 scontent-e.xx.fbcdn.net
      fbcdn.net					0.0.0.0 scontent-mxp.xx.fbcdn.net
      fbcdn.net					0.0.0.0 scontent.xx.fbcdn.net
      fbcdn.net					0.0.0.0 sphotos-a.xx.fbcdn.net
      fbcdn.net					0.0.0.0 static.ak.fbcdn.net
      fbcdn.net					0.0.0.0 video.xx.fbcdn.net
      fbcdn.net					0.0.0.0 vthumb.ak.fbcdn.net
      fbcdn.net					0.0.0.0 xx-fbcdn-shv-01-lax3.fbcdn.net
      fbcdn.net					0.0.0.0 xx-fbcdn-shv-02-lax3.fbcdn.net
      net23.net					0.0.0.0 scontent-vie-224-xx-fbcdn.net23.net
      net23.net					0.0.0.0 scontent-vie-73-xx-fbcdn.net23.net
      net23.net					0.0.0.0 scontent-vie-75-xx-fbcdn.net23.net
      

      And a quick examination shows that all these 113 sites, considered as safe, have an address ending with one of the 4 values, below :

      amazonaws.com
      facebook.com
      fbcdn.net
      net23.net
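The same suffix extraction can be tried outside Notepad++. The sketch below reuses the SEARCH pattern unchanged in Python's `re` module and only reads group 1; the helper name `registered_suffix` is hypothetical, and the four sample lines are taken from the list above.

```python
import re

# Same pattern as the SEARCH regex above.
SUFFIX = re.compile(r"^.{8}.+[.-]([^.\r\n]+\.[^.\r\n]+)$")

def registered_suffix(line):
    """Return the last two dot-separated labels of the address, or None."""
    match = SUFFIX.match(line)
    return match.group(1) if match else None

samples = [
    "0.0.0.0 ec2-34-193-80-93.compute-1.amazonaws.com",
    "0.0.0.0 app.logs-facebook.com",
    "0.0.0.0 scontent-a.xx.fbcdn.net",
    "0.0.0.0 scontent-vie-73-xx-fbcdn.net23.net",
]
suffixes = sorted({registered_suffix(s) for s in samples})
print(suffixes)
```

Because `[.-]` also accepts a hyphen, a line such as `app.logs-facebook.com` is grouped under `facebook.com`, which matches the list shown above.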
      

      Thus, from the remaining 255,268 lines of DocC.txt, only those whose address ends with amazonaws.com, facebook.com, fbcdn.net, or net23.net have to be compared with the list of the 113 safe servers !

      • So, in DocC.txt, I bookmarked all lines matching the regex (amazonaws|facebook)\.com$|(fbcdn|net23)\.net$. I obtained 301 bookmarked lines, which I copied to the clipboard with the option Search > Bookmark > Copy Bookmarked lines

      • Then, I pasted the 301 bookmarked lines at the very beginning of DocD.txt, followed by a line of, at least, 3 dashes, as a separator. Finally, DocD.txt contained 301 lines to verify + 1 line of dashes + 113 safe sites, that is to say a total of 415 lines

      Now, with the regex (?-s)^(.+)\R(?s)(?=.*\R\1\R?), I marked all the lines which have a duplicate, further on, in DocD.txt => 106 lines bookmarked !

      And using the command Search > Bookmark > Remove Unmarked lines, I was left with a DocD.txt file containing only 106 lines, which are in both DocA.txt and DocB.txt. So, these lines / sites needed to be deleted from the DocC.txt file !
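As a rough script equivalent of the bookmarking and comparison steps, the sketch below pre-filters the block list with the same suffix regex and then keeps the lines that also occur in the safe list. Note the swapped-in technique: a Python set lookup replaces the look-ahead regex used in the post. The function name and the sample lines (`ads.example.org`, `tracker.facebook.com`) are invented for illustration.

```python
import re

CANDIDATE = re.compile(r"(amazonaws|facebook)\.com$|(fbcdn|net23)\.net$")

def lines_in_both(blocklist_lines, safe_lines):
    """Return the blocklist lines that also appear in the safe list."""
    safe = set(safe_lines)
    # Pre-filter mirrors the bookmarking step; the set lookup stands in for
    # the regex that finds lines repeated further down the file.
    candidates = [ln for ln in blocklist_lines if CANDIDATE.search(ln)]
    return [ln for ln in candidates if ln in safe]

# Hypothetical miniature versions of DocC.txt and DocD.txt.
doc_c = [
    "0.0.0.0 ads.example.org",
    "0.0.0.0 graph.facebook.com",
    "0.0.0.0 tracker.facebook.com",
    "0.0.0.0 static.ak.fbcdn.net",
]
doc_d = [
    "0.0.0.0 graph.facebook.com",
    "0.0.0.0 static.ak.fbcdn.net",
    "0.0.0.0 pixel.facebook.com",
]
common = lines_in_both(doc_c, doc_d)
print(common)
```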

      To that purpose :

      • I added these 106 lines at the end of the DocC.txt file, giving a file with 255,374 lines

      • I did a last sort operation ( Edit > Line Operations > Sort Lines Lexicographically Ascending )

      => After sort, these 106 sites should appear, each, as a block of two consecutive duplicate lines

      And, finally, with the regex S/R :

      SEARCH ^(.+\R)\1+

      REPLACE Leave EMPTY

      I got rid of these safe servers and obtained a DocC.txt file of 255,162 lines ( 255,374 - 106 * 2 ) which contains, only, unwanted servers !
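The append, sort and delete trick can be sketched in Python as follows. It assumes, as in the post, that the main list has no duplicates left and that every line to remove really occurs in it; the function name and the `.example` hosts are hypothetical, and `\n` again stands in for `\R`.

```python
import re

def subtract_by_sorting(lines, to_remove):
    """Append the lines to remove, sort, then delete every duplicated run."""
    merged = sorted(lines + to_remove)
    text = "".join(ln + "\n" for ln in merged)
    # Empty replacement: both copies of each duplicated line disappear.
    remaining = re.sub(r"^(.+\n)\1+", "", text, flags=re.MULTILINE)
    return remaining.splitlines()

# Hypothetical sample data.
blocklist = ["0.0.0.0 a.example", "0.0.0.0 b.example", "0.0.0.0 c.example"]
safe_hits = ["0.0.0.0 b.example"]
kept_lines = subtract_by_sorting(blocklist, safe_hits)
print(kept_lines)
```

Each removed site costs two lines (the original and the appended copy), which is why the line count in the post drops by 106 * 2.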


      Now, if you do not want to repeat all the above process, by yourself, just send me an e-mail to :

      And I’ll send you, back, this DocC.txt ( = DocA.txt - DocB.txt ) of 255,162 lines

      Cheers,

      guy038

      P.S. :

      Timmy, to be exhaustive, and regarding your initial files :

      DocB.txt contains 113 lines / safe servers :

      • 7 lines, not present in DocA.txt

      • 106 lines, present in DocA.txt


      And DocA.txt contains 255,386 lines :

      • 118 duplicate lines ( due to case normalization ), which have been deleted

      • 106 lines which are present in both DocA.txt and DocB.txt and have been deleted, too !

      • 254,967 lines whose address does not end with amazonaws.com, facebook.com, fbcdn.net, or net23.net, which have to remain in DocA.txt

      • 195 lines whose address ends with amazonaws.com, facebook.com, fbcdn.net, or net23.net, but which are not present in DocB.txt. So, they must remain in DocA.txt, too !


      Hence, the final DocA.txt ( = my DocC.txt ), which contains 255,162 unwanted servers ( 254,967 + 195 )
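The tallies in this P.S. are consistent with one another, as this short check shows. The numbers come straight from the post; the variable names are only illustrative.

```python
total_lines = 255_386            # lines in the original DocA.txt
case_duplicates = 118            # removed after lower-casing
in_both = 106                    # present in DocA.txt and DocB.txt
other_suffix = 254_967           # address ends with some other domain
same_suffix_not_safe = 195       # same four domains, but not in DocB.txt

kept = other_suffix + same_suffix_not_safe
bookmarked = in_both + same_suffix_not_safe   # lines matched by the suffix regex
missing_from_a = 113 - in_both                # safe servers absent from DocA.txt

print(kept, bookmarked, missing_from_a)
```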

      • Timmy M

        “Replace All: 106 occurrences were replaced.”

        @guy038 You sir, are a regex LEGEND! :D

        Of course I have taken every step you’ve explained, it’s the least I could do after you’ve put in such effort. I even found a minor mistake in your explanation (I think):

        Now, with the regex (?-s)^(.+)\R(?s)(?=.*\R\1\R?), I marked all the lines which have a duplicate, further on, in DocD.txt

        That should read “bookmarked” (with regular marking, “Remove Unmarked Lines” removes all lines).

        @Scott-Sumner @Claudia-Frank
        I guess it’s time to face that monkey then… I installed the Python Script plugin, navigated to Python Script > New Script and created ScreamingMonkey.py
        Now, selecting Run Previous Script (ScreamingMonkey) doesn’t do anything. I’ve done some searching and found that I should install Python first from python.org/downloads/, should I? Version 3?
        Before installing something I know nothing about, I’d like to verify with you guys whether that is the right course of action. I hope that’s okay :)

        • Claudia Frank @Timmy M

          @Timmy-M

          No, there is no need to install an additional Python package,
          as the Python Script plugin ships its own.

          What you need to do is to open your two files like I’ve shown above, as the
          script references them through editor1 and editor2, and then click on ScreamingMonkey.
          Once it has been run, you can call it again by using Run Previous Script.

          What do you see if you try to open the python script console?
          Plugin->PythonScript->Show Console.

          If nothing happens, no new window opens, then how did
          you install it, via plugin manager or via the msi package?

          The msi installation is preferred as it was reported that the installation via plugin manager sometimes doesn’t copy all needed files.

          Cheers
          Claudia

          • guy038
            last edited by guy038

            Hello, @timmy-m and All,

            Ah, yes ! You’re right, although what I wrote was correct, too ! Indeed, I used the Remove Unmarked lines command to change DocD.txt first => So, 106 lines remained, which I then added to the DocC.txt contents.

            But you, certainly, used this easier solution, below :


            …
            …

            Now, with the regex (?-s)^(.+)\R(?s)(?=.*\R\1\R?), I marked all the lines which have a duplicate, further on, in DocD.txt => 106 lines bookmarked !

            And using the command Search > Bookmark > Copy Bookmarked lines, I put, in the clipboard, these 106 lines, only, which are, both, in DocA.txt and DocB.txt and have to be deleted from the DocC.txt file !

            To that purpose :

            • I pasted these 106 lines ( Ctrl + V ) at the end of the DocC.txt file, giving a file with 255,374 lines

            …
            …

            Best Regards,

            guy038

            • Scott Sumner @Timmy M

              @Timmy-M

              Regarding “marked versus bookmarked”:

              Not @guy038 's fault… The Notepad++ user interface is confusing on this point; it uses the terminology “mark” often when it should (IMO) use “bookmark”. Alternatively, if it used “redmark” and “bookmark” exclusively there would be no confusion, but I suppose using “red” wouldn’t totally be correct as it is just the default color and can be changed by the user.

              • Timmy M

                I used the msi install method. First updated N++ to latest, then installed Python plugin.

                @Claudia-Frank said:

                What do you see if you try to open the python script console?
                Plugin->PythonScript->Show Console.

                ---------------------------.
                Python 2.7.6-notepad++ r2 (default, Apr 21 2014, 19:26:54) [MSC v.1600 32 bit (Intel)]
                Initialisation took 31ms
                Ready.
                Traceback (most recent call last):
                  File "%AppData%\Roaming\Notepad++\plugins\Config\PythonScript\scripts\ScreamingMonkey.py", line 1, in <module>
                    with open('DocA.txt', 'r', encoding='utf-8') as filea: linesa = filea.readlines()
                TypeError: 'encoding' is an invalid keyword argument for this function
                Traceback (most recent call last):
                  File "%AppData%\Roaming\Notepad++\plugins\Config\PythonScript\scripts\ScreamingMonkey.py", line 1, in <module>
                    with open('DocA.txt', 'r', encoding='utf-8') as filea: linesa = filea.readlines()
                TypeError: 'encoding' is an invalid keyword argument for this function
                ---------------------------.

                Eh… Lil’ help please? That monkey’s screaming really loud right now O_O

                • Claudia Frank @Timmy M
                  last edited by Claudia Frank

                  @Timmy-M

                  But that is not my script.
                  Scott’s script assumes you run it under Python 3, as it uses the encoding argument,
                  which the built-in open() does not accept in Python 2, the version used by the Python Script plugin.

                  Did you try my 3 lines code posted above?

                  If you still want to use Scott’s code and execute it via the Python Script plugin, then
                  get rid of the encoding='utf-8' argument, so something like

                  with open('DocA.txt', 'r') as filea ...
                  

                  But you have to ensure that DocA.txt can be found: with a relative path like this one, it is
                  looked up in the Python Script working directory, which is the directory where notepad++.exe is.
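An alternative to dropping the argument, sketched here and not part of either script in this thread, is `io.open`, which accepts `encoding=` on Python 2.6+ and is the same function as the built-in `open` on Python 3. The helper name `read_lines` and the temporary demo file are made up so that the example runs without DocA.txt.

```python
import io
import os
import tempfile

def read_lines(path):
    """Read a UTF-8 text file on both Python 2.7 and Python 3."""
    # io.open takes encoding= on Python 2.6+; on Python 3 it is the built-in open.
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read().splitlines()

# Hypothetical demo file, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with io.open(demo_path, "w", encoding="utf-8") as handle:
    handle.write(u"0.0.0.0 a.example\n0.0.0.0 b.example\n")

demo_lines = read_lines(demo_path)
print(demo_lines)
```

Passing an absolute path to such a helper also sidesteps the working-directory question.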

                  Cheers
                  Claudia

                  • Scott Sumner @Timmy M

                    @Timmy-M

                    Yea, for sure go with @Claudia-Frank 's script, since you seem to want to stay more “within” Notepad++!

                    • Timmy M

                      @Claudia-Frank said:

                      Did you try my 3 lines code posted above?

                      I have and eventually succeeded! On my first dozen attempts (turns out I really suck at taming monkeys) I just kept reorganizing the order of all entries in either DocA or DocB, depending on which one was selected, without any entry getting removed.
                      After trying a bunch of stuff, I found that I had to use View > Move/Clone Current Document > Move to Other View on DocB. I just couldn’t get my head around “view0” and “view1” at first but didn’t want to bother you with such a mundane question either ^_^
                      DocA’s order does get messed up eventually but that’s easily fixed with a regular sort. Perhaps that sort can be implemented in the script?

                      Anyway, it all worked out! Thank you very much @Scott-Sumner, @Claudia-Frank and @guy038 for this learning experience. You’re a dream team ;)

                      • Claudia Frank @Timmy M

                        @Timmy-M

                        good to see that you got it sorted out :-)

                        and to keep sorting, change this

                        editor1.setText('\r\n'.join(s1-s2))
                        

                        to this

                        editor1.setText('\r\n'.join(sorted(s1-s2)))
                        

                        that should do the trick.
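The effect of wrapping the set difference in `sorted()` can be seen in a standalone sketch. The original three-line script is not shown in this part of the thread, so the function below is an assumption about its shape: `s1` and `s2` are taken to be sets of lines, and plain strings replace the `editor1` / `editor2` objects, which only exist inside the Python Script plugin.

```python
def subtract_lines(text_a, text_b):
    """Return the lines of text_a that are not in text_b, sorted, CRLF-joined."""
    s1 = set(text_a.splitlines())
    s2 = set(text_b.splitlines())
    # A set has no defined order, so sorted() restores a stable line order.
    return "\r\n".join(sorted(s1 - s2))

result = subtract_lines("c\r\na\r\nb", "b")
print(repr(result))
```

Without `sorted()`, the order of the joined lines depends on the set's internal ordering, which would explain why DocA's order got shuffled.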

                        Cheers
                        Claudia
