    Subtract document B from A

    • Timmy M

      I have one document with over 200k lines: DocA.txt
      I have another document with about 100 lines: DocB.txt

      I would like to subtract the 100 lines in DocB.txt from all the lines in DocA.txt

      Example

      Let’s say DocA.txt looks like this:
      Giraffe
      Crocodile
      Ant
      Panther
      Elephant
      Mosquito
      Zebra
      Lion
      Butterfly
      Antelope

      …and DocB.txt looks like this:
      Ant
      Mosquito
      Butterfly

      Is there a way, natively in N++ or by using a plugin, to make sure all lines present in DocB.txt are removed from DocA.txt so the result looks like this?
      Giraffe
      Crocodile
      Panther
      Elephant
      Zebra
      Lion
      Antelope

      Note that a value like “Ant” did not interfere with a value like “Antelope”, only the entire line should be taken into account. Any help would be greatly appreciated :)
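The request above (remove every line of DocB.txt from DocA.txt, comparing whole lines only) can be sketched in Python. This is only a minimal illustration and not code from any poster in the thread; the function name `subtract_lines` is hypothetical.

```python
def subtract_lines(lines_a, lines_b):
    """Return the lines of A that are not in B, keeping A's order and duplicates."""
    # Whole-line membership test, so "Ant" can never match "Antelope".
    remove = set(lines_b)
    return [line for line in lines_a if line not in remove]


doc_a = ["Giraffe", "Crocodile", "Ant", "Panther", "Elephant",
         "Mosquito", "Zebra", "Lion", "Butterfly", "Antelope"]
doc_b = ["Ant", "Mosquito", "Butterfly"]

result = subtract_lines(doc_a, doc_b)
print("\n".join(result))
```

Because the lookup is done against a set of complete lines, the cost grows roughly with the size of DocA.txt, not with the product of both file sizes.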

      • Scott Sumner @Timmy M

        @Timmy-M

        This thread will probably get you going.
        If not, there is a link to yet-another-thread in there that might help.
        You’d have to put the two files together into a single file; hopefully that is not a deal-breaker for you.

        • Claudia Frank @Timmy M

          @Timmy-M

          Just in case you are using the Python Script plugin and the
          lines in document A are unique (or you don’t mind if they are unique afterwards),
          you could use something like this:

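          # Python Script plugin: editor1 is the first view, editor2 the second
          s1 = set(editor1.getText().splitlines())
          s2 = set(editor2.getText().splitlines())
          # set difference A - B; duplicates collapse and the original line order is not kept
          editor1.setText('\r\n'.join(s1-s2))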
          

          which turns

          into

          Document A needs to be in View0 and Document B needs to be in View1

          Cheers
          Claudia
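The caveat about uniqueness in the post above comes from how set difference works. The following sketch uses made-up sample data to contrast it with an order-preserving list filter; the list-based variant is an assumption added for comparison, not Claudia's code.

```python
doc_a = ["Zebra", "Ant", "Lion", "Zebra", "Mosquito"]
doc_b = ["Ant", "Mosquito"]

# Set difference, as in the snippet above: unique lines only, order not guaranteed.
as_set = set(doc_a) - set(doc_b)

# Order- and duplicate-preserving alternative (an assumption, not Claudia's code).
remove = set(doc_b)
as_list = [line for line in doc_a if line not in remove]

print(sorted(as_set))   # sorted only to make the printed output stable
print(as_list)
```

If the order of Document A matters, or repeated lines must survive, the list filter is probably the safer choice.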

          • guy038

            Hello, @timmy-M, @claudia-frank and All,

            Sorry, again, Claudia… !

            I think, Timmy, that your goal can easily be achieved with a regex S/R! So:

            • First, open a new tab ( Ctrl + N )

            • Then, copy all the Doc A contents into this new tab

            • Now, at the end of the list, add a new line beginning with at least 3 dashes ( --- ) and followed by a normal line-break

            • Then, copy all the Doc B contents after this line of dashes

            • Open the Replace dialog ( Ctrl + H )

            • Type in the regex (?-s)^(.+)\R(?s)(?=.*\R\1\R?)|^---.+ , in the Find what: zone

            • Leave the Replace with: zone EMPTY

            • Tick the Wrap around option

            • Select the Regular expression search mode

            • Finally, click on the Replace All button

            Et voilà !


            Notes :

            • This regex looks for a complete line that is repeated somewhere later in the file

            • When a match occurs, this duplicate line is deleted

            • Finally, the string --- followed by the entire Doc B contents is also matched and deleted

            • Note also that IF your list in Doc A already contains duplicate names, they are removed as well ;-))

            Best Regards,

            guy038
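The separator-and-lookahead idea from the post above can be tried on the small animal example with Python's `re` module. This is a rough sketch under stated assumptions: Python has no `\R` and no mid-pattern `(?s)` switch, so it assumes `\n` line endings and uses the scoped form `(?s:...)`. It also deviates from the posted regex in one place: the optional `\R?` would let a line match the start of a longer later line, so the sketch requires a line end after the backreference instead.

```python
import re

doc_a = ["Giraffe", "Crocodile", "Ant", "Panther", "Elephant",
         "Mosquito", "Zebra", "Lion", "Butterfly", "Antelope"]
doc_b = ["Ant", "Mosquito", "Butterfly"]

# Doc A, then a line of dashes, then Doc B, all in one buffer.
combined = "\n".join(doc_a) + "\n---\n" + "\n".join(doc_b) + "\n"

pattern = re.compile(
    r"^(.+)\n(?=(?s:.*)\n\1(?:\n|\Z))"   # a line that reappears, whole, further down
    r"|^---(?s:.+)",                     # the separator and everything after it
    re.MULTILINE,
)

remaining = pattern.sub("", combined).splitlines()
print(remaining)
```

On a few lines this runs instantly; the lookahead rescans the rest of the buffer for every line, which is consistent with the slowdowns reported later in the thread for large files.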

            • Timmy M

              Thank you @Scott Sumner, @Claudia-Frank and @guy038 for your suggestions.
              I’ll be honest with you, when I see “Python”, the monkey in my head starts jumping around screaming so that might not be such a good idea for me :)

              I tried the step-by-step that @guy038 wrote here, but that reduced a file with 255,507 lines to one single line, and it stated “Replace All: 5 occurrences were replaced”.

              Sorry for being a rookie in all of this but “regex” means “regular expression”, right? I don’t need to install a plug-in or something? And by “followed by a normal line-break”, you just mean pressing enter (another new line), right?
              I’ve made a screenshot of the replace box to verify if I’ve set everything right: photos (dot) app (dot) goo (dot) gl (slash) FlszelbbwReIXJWl1

              To clarify my intention a bit more: I’m trying to apply a whitelist on a hosts file. The structure of the text looks like this:
              0.0.0.0 www.notepad-plus-plus.org
              0.0.0.0 www.google.com
              Etc.

              • Scott Sumner

                @Timmy-M said:

                the monkey in my head starts jumping around screaming

                :-D

                So if I take the advice I gave earlier and look at the other thread, and apply the technique from there, here’s what I get on your test data:

                Imgur

                …which seems to meet your request because I can then execute the command to delete all bookmarked lines.

                Of course, maybe this doesn’t work on your real data. It seems like sometimes the regular expression (a.k.a. “regex”) engine can have trouble with a search of this sort on “big” data, or maybe it just seems this way when an inefficient regex is used.

                • guy038

                  Hi, @timmy-M, @claudia-frank, @scott-sumner and All,

                  I did some tests with different data configurations and, indeed, sometimes the previous regex finds a single match, which is the entire file contents :-((. I suppose that this behavior is due to the number of characters matched by the .* part in the lookahead (?=.*\R\1\R?), which can be quite large in some huge files. So my previous regex (?-s)^(.+)\R(?s)(?=.*\R\1\R?) is not reliable, because it seems to work only on small or medium-sized files!

                  Refer to similar problems ( catastrophic backtracking ) at the address below:

                  https://www.regular-expressions.info/catastrophic.html


                  So I tried ( hard! ), these last two days, to find a new regex which would work for your specific task ( Subtract B from A ). My idea was to find a way to get each line of Document A close to every item of Document B!

                  I simply assume that, even if your Doc A has a large number of lines, your Doc B, containing the different words to get rid of, is rather small, with a limited number of lines ( I mean… about 100 or fewer! ). But this new regex does NOT change the structure of your Doc A at all:

                  • It keeps possible duplicates, in Doc A

                  • It does not perform any sort operation


                  To test this new regex, I decided to start with the license.txt file installed by N++ v7.5.6. With different regexes, I extracted all words, one per line, without any punctuation or symbol character. I obtained a list of 2,654 words/lines.

                  And for readability, I lowercased every word, too! So this list begins and ends as below:

                  copying
                  describes
                  the
                  terms
                  under
                  which
                  notepad++
                  is
                  distributed
                  .....
                  .....
                  .....
                  possibility
                  of
                  such
                  damages
                  end
                  of
                  terms
                  and
                  conditions
                  

                  Then I repeated this list until it appeared 19 times, in order to get a file of 50,426 lines ( so each word has 18 duplicates further on! ). This file will be used as Doc A.

                  Now, I decided that every word without digits, containing at most 3 letters, in Doc A ( so 60 words ) will be part of Doc B. I also added 9 other words, frequently used in license.txt. Finally, the contents of Doc B are:

                  a
                  act
                  add
                  all
                  an
                  and
                  any
                  are
                  as
                  ask
                  at
                  be
                  but
                  by
                  can
                  do
                  don
                  end
                  fee
                  for
                  get
                  gnu
                  gpl
                  has
                  he
                  ho
                  how
                  if
                  in
                  inc
                  is
                  it
                  its
                  law
                  ma
                  may
                  new
                  no
                  not
                  of
                  on
                  one
                  or
                  our
                  out
                  run
                  say
                  see
                  she
                  so
                  the
                  to
                  two
                  up
                  usa
                  use
                  way
                  we
                  who
                  you
                  conditions
                  program
                  license
                  copyright
                  software
                  modification
                  modifications
                  free
                  distribution
                  

                  Then, the main idea is to turn the Doc B list of words into a single line, with the simple regex:

                  SEARCH \R

                  REPLACE ,

                  After this replacement, Doc B should contain a single line:

                  ,a,act,add,all,an,and,any,are,as,ask,at,be,but,by,can,do,don,end,fee,for,get,gnu,gpl,has,he,ho,how,if,in,inc,is,it,its,law,ma,may,new,no,not,of,on,one,or,our,out,run,say,see,she,so,the,to,two,up,usa,use,way,we,who,you,conditions,program,license,copyright,software,modification,modifications,free,distribution,
                  

                  IMPORTANT :

                  • After replacement, your list of words MUST begin and end with a comma ( , ). If not, just add a comma at the beginning and/or end of the line!

                  • From now on, I strongly advise you to turn off the Word Wrap feature ( View > Word wrap ). Indeed, navigating huge files is much easier with that option off ;-)

                  Now, in Doc A, use the following regex S/R :

                  SEARCH $

                  REPLACE : ,a,act,add,all,an,and,any,are,as,ask,at,be,but,by,can,do,don,end,fee,for,get,gnu,gpl,has,he,ho,how,if,in,inc,is,it,its,law,ma,may,new,no,not,of,on,one,or,our,out,run,say,see,she,so,the,to,two,up,usa,use,way,we,who,you,conditions,program,license,copyright,software,modification,modifications,free,distribution,

                  After replacement ( 5s on my laptop ), the single-line list of Doc B words is appended to the end of each of the 50,426 lines of Doc A.


                  Finally, here is, below, the new regex S/R, which will delete any line of Doc A, also contained in Doc B :

                  SEARCH (?-s)^(.+?,)(.*,)?\1.*\R|,.+

                  REPLACE Leave EMPTY

                  After replacement ( 1m 15s on my laptop ), I obtained a list of 25,764 words/lines, in the same order as in Doc A, each containing at least 4 letters and also different from the 9 words below:

                  conditions
                  program
                  license
                  copyright
                  software
                  modification
                  modifications
                  free
                  distribution
                  

                  Notes :

                  • You have certainly understood that, this time, each initial word of Doc A, at the beginning of a line, is simply compared with a possible identical word in the SAME line only! If one is found, the entire line, with its line break, is deleted. If not, only the part matched by ,.+ after the alternation symbol ( | ), which represents the list of Doc B words, is deleted :-))

                  • So I suppose that we can rely on that regex, even for a very big file. Of course, the replace operation will take longer, but it is safe! Latest news: I gave it a try with a new Doc A, which is 10 times bigger than the initial one. So, a file of 159,275,670 bytes, with 504,260 lines! And, after the final replacement, with the regex below:

                  SEARCH (?-s)^(.+?,)(.*,)?\1.*\R|,.+

                  REPLACE Leave EMPTY

                  I was left with 257,640 lines ( so, as expected, 10 times the last result ), after 12m 27s of processing, while simultaneously listening to music online! Now I’m quite sure that it’s a reliable regex!
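The per-line comparison described in this post can be reproduced on a small scale in Python. The sketch below is only an approximation under stated assumptions: it uses made-up sample words, `\n` line endings in place of `\R`, and plain string concatenation in place of the `$` search/replace step that appends the list to every line.

```python
import re

doc_a = ["Giraffe", "Ant", "Lion", "Mosquito", "Antelope", "Lion"]
doc_b = ["Ant", "Mosquito", "Butterfly"]

# Doc B flattened to one comma-delimited line that begins and ends with a comma.
suffix = "," + ",".join(doc_b) + ","

# Equivalent of replacing $ with the flattened list: every Doc A line carries the list.
tagged = "".join(line + suffix + "\n" for line in doc_a)

# Same pattern as in the post, with \n instead of \R.
pattern = re.compile(r"^(.+?,)(.*,)?\1.*\n|,.+", re.MULTILINE)

remaining = pattern.sub("", tagged).splitlines()
print(remaining)
```

Each line is only ever compared with the list attached to that same line, so no match has to span the whole file; duplicates in Doc A and the original order are kept.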


                  So, Timmy, if you get some spare time, just give this method a try and tell me if you get positive results!

                  Cheers,

                  guy038

                  • Timmy M

                    Hi @guy038 !

                    Thank you so much for your efforts, they are quite admirable.

                    I believe we’re getting closer here but when taking your “replace $ with [line of values]” step (55 seconds on my laptop) it stops at line 161,454.
                    To mitigate this, I’ve cut all data from line 161,455 to a new document and ran the replace again. This time, it stopped at line 48,192.
                    At this point I noticed a strange phenomenon: I couldn’t delete any text from these files anymore so I decided to start over. A new S/R attempt stopped at line 128,653 and everything after that came out garbled (showing “LF” blocks).
                    I believe this is related to the punctuation inherent to the content. To better understand this, perhaps it’s easier to just have a look at the actual data. Therefore I have uploaded the files on Gdrive:
                    DocA.txt
                    DocB.txt
                    Warning: do not attempt to visit any of the websites mentioned within these files as they may contain malicious data!
                    DocA is a hosts file that blocks unwanted servers. DocB holds safe Facebook servers that are supposed to be deleted from DocA.

                    I hope that helps. If this is too difficult to achieve or too much effort has gone into this already, you really shouldn’t bother. I’ll do it manually each time if I must, no worries :)

                    Cheers
                    Timmy

                    • Scott Sumner @Timmy M

                      @Timmy-M

                      It may be time to consider the “screaming monkey”…

                      I tried @Claudia-Frank 's Pythonscript code on your documents and, I don’t know, I either didn’t wait long enough, or something else went wrong, but I ended up having to kill Notepad++ to put a stop to whatever it (or the PS plugin) was doing.

                      I thought about trying @guy038’s regular expression on the data, but then I thought, “this is too much work; no one will really be willing to do this, let alone remember it”.

                      So…it may be time to bite the monkey and give this task to an external scripting language. Starting with @Claudia-Frank 's Pythonscript code, I came up with the following Python3 code that runs on your data files so quickly that I didn’t even think to time it:

                      # 'utf-8-sig' reads UTF-8 with or without a BOM, so a BOM cannot end up inside the first line
                      with open('DocA.txt', 'r', encoding='utf-8-sig') as filea: linesa = filea.readlines()
                      with open('DocB.txt', 'r', encoding='utf-8-sig') as fileb: setb = set(fileb.readlines())
                      
                      linesa = sorted(linesa)
                      seta = set(linesa)
                      
                      # set difference; sorting makes the result easy to compare with DocA_sorted
                      linesc = sorted(seta - setb)
                      
                      with open('DocA_sorted.txt', 'w', encoding='utf-8') as filea: filea.write(''.join(linesa))
                      with open('DocC.txt', 'w', encoding='utf-8') as filec: filec.write(''.join(linesc))
                      

                      I sorted the lines because that makes comparing DocA_sorted and DocC (where C = B - A) easy when attempting to validate the results.

                      Sometimes it’s just best to leave the confines of Notepad++ for a task, and I’m sure Perl or AWK or “insert your favorite programming language here” works just as well as Python3 for this problem.
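For readers who want to keep the original line order and any duplicates, an external script could also stream the big file line by line. The following is a minimal sketch, not Scott's code: the function name `subtract_files` and the demo file contents are hypothetical, and the output is always written with `\n` line endings.

```python
import os
import tempfile


def subtract_files(path_a, path_b, path_out):
    """Write the lines of path_a that do not occur in path_b, in their original order."""
    # 'utf-8-sig' accepts UTF-8 with or without a BOM.
    with open(path_b, encoding="utf-8-sig") as fb:
        remove = {line.rstrip("\r\n") for line in fb}
    kept = 0
    with open(path_a, encoding="utf-8-sig") as fa, \
         open(path_out, "w", encoding="utf-8", newline="\n") as fo:
        for line in fa:                      # streams A instead of loading it whole
            text = line.rstrip("\r\n")       # compare without line terminators
            if text not in remove:
                fo.write(text + "\n")
                kept += 1
    return kept


# Small self-contained demo with temporary files (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    a, b, c = (os.path.join(tmp, n) for n in ("DocA.txt", "DocB.txt", "DocC.txt"))
    with open(a, "w", encoding="utf-8", newline="") as f:
        # Unix line endings, no newline after the last line
        f.write("0.0.0.0 one.example\n0.0.0.0 two.example\n0.0.0.0 three.example")
    with open(b, "w", encoding="utf-8-sig", newline="") as f:
        # BOM and Windows line endings
        f.write("0.0.0.0 three.example\r\n0.0.0.0 one.example\r\n")
    kept = subtract_files(a, b, c)
    with open(c, encoding="utf-8") as f:
        output = f.read()

print(kept, repr(output))
```

Stripping the line terminators before comparing means a missing final newline or a mix of Unix and Windows line endings does not prevent a match.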

                      • Claudia Frank @Scott Sumner

                        @Scott-Sumner

                        tried it ten times in a row without a problem. Takes < 1sec to do the job.

                        Cheers
                        Claudia

                        • Scott Sumner @Claudia Frank

                          @Claudia-Frank

                          Interesting…still don’t know what was going wrong with it for me…but I’m not inclined to spend any more time to find out. :-)

                          Some tasks just sort of feel right outside Notepad++, and for me this is one of them. I guess if there was a no-brainer menu function that did it I’d feel differently, but I dunno, do you think we’ll ever see that? :-)

                          And BTW, I should have said C = A - B above.

                          • guy038

                            Hi, @timmy-M, @claudia-frank, @scott-sumner and All,

                            As Scott said, I do think that an external scripting language is the best way to reach your goal! And I’m convinced that both Claudia’s and Scott’s scripts work just fine!

                            But your stubborn servant just tried a regex solution, which is, of course, longer and less elegant than a script :-((


                            Timmy, I was able to download your two files, DocA.txt and DocB.txt, without any problem!

                            • The hosts file, DocA.txt, is a Unix UTF-8 encoded file, containing 255,386 lines

                            • The safe Facebook servers file, DocB.txt, is a Windows UTF-8 BOM encoded file, containing only 113 lines

                            Now, if we join all the lines of your DocB.txt, containing the safe sites, into a single line, we get a line of 3,701 bytes. But if we were to add 3,701 bytes to each line of DocA.txt, as in the 2nd method from my previous post, the total size of DocA.txt would increase to about 908 MB ( 6,788,298 + 255,386 * 3,701 = 951,971,884 bytes! ). Obviously, this resulting file would be too big to handle, even with simple regexes :-((
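The size estimate above is easy to re-check. The figures in this short sketch are the ones quoted in the post; the variable names are only illustrative.

```python
doc_a_bytes = 6_788_298      # size of DocA.txt, as quoted above
doc_a_lines = 255_386        # number of lines in DocA.txt
joined_b_bytes = 3_701       # length of DocB.txt flattened to one line

total = doc_a_bytes + doc_a_lines * joined_b_bytes
print(total)                       # 951971884
print(round(total / 1024 ** 2))    # size in units of 1024 * 1024 bytes
```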

                            So, …we need to find out a 3rd way !

                            To begin with, I copied DocA.txt as DocC.txt and DocB.txt as DocD.txt

                            Then, I did some verifications on these two files :

                            • I normalized all letters of these files in lower-case ( with Ctrl-A and Ctrl + U )

                            • I performed a classical sort ( Edit > Line Operations > Sort Lines Lexicographically Ascending )

                            Note : Sort comes after the Ctrl + U command, as the sort is a true Unicode sort !

                            • Then I ran the regex ^(.+\R)\1+ to search possible duplicate consecutive lines

                              • DocD.txt did not contain any duplicate line

                              • DocC.txt contained 118 duplicates, which appeared due to the previous normalization to lower-case. So, in order to delete them, I used the regex S/R :

                            SEARCH ^(.+\R)\1+

                            REPLACE \1

                            So, DocC.txt, now, contained 255,268 lines

                            • I noticed that, in DocD.txt, many addresses ended with “facebook.com” or “fbcdn.net”, for instance. So, after copying the DocD.txt contents into a new tab, I decided to get an idea of the different sites regarded as safe, with the regex S/R below:

                            SEARCH ^.{8}.+[.-]([^.\r\n]+\.[^.\r\n]+)$

                            REPLACE \1\t\t\t\t\t$0

                            After a classical sort, I obtained the following list :

                            amazonaws.com					0.0.0.0 ec2-34-193-80-93.compute-1.amazonaws.com
                            facebook.com					0.0.0.0 0-act.channel.facebook.com
                            facebook.com					0.0.0.0 0-edge-chat.facebook.com
                            facebook.com					0.0.0.0 1-act.channel.facebook.com
                            facebook.com					0.0.0.0 1-edge-chat.facebook.com
                            facebook.com					0.0.0.0 2-act.channel.facebook.com
                            facebook.com					0.0.0.0 2-edge-chat.facebook.com
                            facebook.com					0.0.0.0 3-act.channel.facebook.com
                            facebook.com					0.0.0.0 3-edge-chat.facebook.com
                            facebook.com					0.0.0.0 4-act.channel.facebook.com
                            facebook.com					0.0.0.0 4-edge-chat.facebook.com
                            facebook.com					0.0.0.0 5-act.channel.facebook.com
                            facebook.com					0.0.0.0 5-edge-chat.facebook.com
                            facebook.com					0.0.0.0 6-act.channel.facebook.com
                            facebook.com					0.0.0.0 6-edge-chat.facebook.com
                            facebook.com					0.0.0.0 act.channel.facebook.com
                            facebook.com					0.0.0.0 api-read.facebook.com
                            facebook.com					0.0.0.0 api.ak.facebook.com
                            facebook.com					0.0.0.0 api.connect.facebook.com
                            facebook.com					0.0.0.0 app.logs-facebook.com
                            facebook.com					0.0.0.0 ar-ar.facebook.com
                            facebook.com					0.0.0.0 attachments.facebook.com
                            facebook.com					0.0.0.0 b-api.facebook.com
                            facebook.com					0.0.0.0 b-graph.facebook.com
                            facebook.com					0.0.0.0 b.static.ak.facebook.com
                            facebook.com					0.0.0.0 badge.facebook.com
                            facebook.com					0.0.0.0 beta-chat-01-05-ash3.facebook.com
                            facebook.com					0.0.0.0 bigzipfiles.facebook.com
                            facebook.com					0.0.0.0 channel-ecmp-05-ash3.facebook.com
                            facebook.com					0.0.0.0 channel-staging-ecmp-05-ash3.facebook.com
                            facebook.com					0.0.0.0 channel-testing-ecmp-05-ash3.facebook.com
                            facebook.com					0.0.0.0 check4.facebook.com
                            facebook.com					0.0.0.0 check6.facebook.com
                            facebook.com					0.0.0.0 creative.ak.facebook.com
                            facebook.com					0.0.0.0 d.facebook.com
                            facebook.com					0.0.0.0 de-de.facebook.com
                            facebook.com					0.0.0.0 developers.facebook.com
                            facebook.com					0.0.0.0 edge-chat.facebook.com
                            facebook.com					0.0.0.0 edge-mqtt-mini-shv-01-lax3.facebook.com
                            facebook.com					0.0.0.0 edge-mqtt-mini-shv-02-lax3.facebook.com
                            facebook.com					0.0.0.0 edge-star-shv-01-lax3.facebook.com
                            facebook.com					0.0.0.0 edge-star-shv-02-lax3.facebook.com
                            facebook.com					0.0.0.0 error.facebook.com
                            facebook.com					0.0.0.0 es-la.facebook.com
                            facebook.com					0.0.0.0 fr-fr.facebook.com
                            facebook.com					0.0.0.0 graph.facebook.com
                            facebook.com					0.0.0.0 hi-in.facebook.com
                            facebook.com					0.0.0.0 inyour-slb-01-05-ash3.facebook.com
                            facebook.com					0.0.0.0 it-it.facebook.com
                            facebook.com					0.0.0.0 ja-jp.facebook.com
                            facebook.com					0.0.0.0 messages-facebook.com
                            facebook.com					0.0.0.0 mqtt.facebook.com
                            facebook.com					0.0.0.0 orcart.facebook.com
                            facebook.com					0.0.0.0 origincache-starfacebook-ai-01-05-ash3.facebook.com
                            facebook.com					0.0.0.0 pixel.facebook.com
                            facebook.com					0.0.0.0 profile.ak.facebook.com
                            facebook.com					0.0.0.0 pt-br.facebook.com
                            facebook.com					0.0.0.0 s-static.ak.facebook.com
                            facebook.com					0.0.0.0 s-static.facebook.com
                            facebook.com					0.0.0.0 secure-profile.facebook.com
                            facebook.com					0.0.0.0 ssl.connect.facebook.com
                            facebook.com					0.0.0.0 star.c10r.facebook.com
                            facebook.com					0.0.0.0 star.facebook.com
                            facebook.com					0.0.0.0 static.ak.connect.facebook.com
                            facebook.com					0.0.0.0 static.ak.facebook.com
                            facebook.com					0.0.0.0 staticxx.facebook.com
                            facebook.com					0.0.0.0 touch.facebook.com
                            facebook.com					0.0.0.0 upload.facebook.com
                            facebook.com					0.0.0.0 vupload.facebook.com
                            facebook.com					0.0.0.0 vupload2.vvv.facebook.com
                            facebook.com					0.0.0.0 www.login.facebook.com
                            facebook.com					0.0.0.0 zh-cn.facebook.com
                            facebook.com					0.0.0.0 zh-tw.facebook.com
                            fbcdn.net					0.0.0.0 b.static.ak.fbcdn.net
                            fbcdn.net					0.0.0.0 creative.ak.fbcdn.net
                            fbcdn.net					0.0.0.0 ent-a.xx.fbcdn.net
                            fbcdn.net					0.0.0.0 ent-b.xx.fbcdn.net
                            fbcdn.net					0.0.0.0 ent-c.xx.fbcdn.net
                            fbcdn.net					0.0.0.0 ent-d.xx.fbcdn.net
                            fbcdn.net					0.0.0.0 ent-e.xx.fbcdn.net
                            fbcdn.net					0.0.0.0 external.ak.fbcdn.net
                            fbcdn.net					0.0.0.0 origincache-ai-01-05-ash3.fbcdn.net
                            fbcdn.net					0.0.0.0 photos-a.ak.fbcdn.net
                            fbcdn.net					0.0.0.0 photos-b.ak.fbcdn.net
                            fbcdn.net					0.0.0.0 photos-c.ak.fbcdn.net
                            fbcdn.net					0.0.0.0 photos-d.ak.fbcdn.net
                            fbcdn.net					0.0.0.0 photos-e.ak.fbcdn.net
                            fbcdn.net					0.0.0.0 photos-f.ak.fbcdn.net
                            fbcdn.net					0.0.0.0 photos-g.ak.fbcdn.net
                            fbcdn.net					0.0.0.0 photos-h.ak.fbcdn.net
                            fbcdn.net					0.0.0.0 profile.ak.fbcdn.net
                            fbcdn.net					0.0.0.0 s-external.ak.fbcdn.net
                            fbcdn.net					0.0.0.0 s-static.ak.fbcdn.net
                            fbcdn.net					0.0.0.0 scontent-a-lax.xx.fbcdn.net
                            fbcdn.net					0.0.0.0 scontent-a-sin.xx.fbcdn.net
                            fbcdn.net					0.0.0.0 scontent-a.xx.fbcdn.net
                            fbcdn.net					0.0.0.0 scontent-b-lax.xx.fbcdn.net
                            fbcdn.net					0.0.0.0 scontent-b-sin.xx.fbcdn.net
                            fbcdn.net					0.0.0.0 scontent-b.xx.fbcdn.net
                            fbcdn.net					0.0.0.0 scontent-c.xx.fbcdn.net
                            fbcdn.net					0.0.0.0 scontent-d.xx.fbcdn.net
                            fbcdn.net					0.0.0.0 scontent-e.xx.fbcdn.net
                            fbcdn.net					0.0.0.0 scontent-mxp.xx.fbcdn.net
                            fbcdn.net					0.0.0.0 scontent.xx.fbcdn.net
                            fbcdn.net					0.0.0.0 sphotos-a.xx.fbcdn.net
                            fbcdn.net					0.0.0.0 static.ak.fbcdn.net
                            fbcdn.net					0.0.0.0 video.xx.fbcdn.net
                            fbcdn.net					0.0.0.0 vthumb.ak.fbcdn.net
                            fbcdn.net					0.0.0.0 xx-fbcdn-shv-01-lax3.fbcdn.net
                            fbcdn.net					0.0.0.0 xx-fbcdn-shv-02-lax3.fbcdn.net
                            net23.net					0.0.0.0 scontent-vie-224-xx-fbcdn.net23.net
                            net23.net					0.0.0.0 scontent-vie-73-xx-fbcdn.net23.net
                            net23.net					0.0.0.0 scontent-vie-75-xx-fbcdn.net23.net
                            

                            And a quick examination shows that all these 113 sites, considered as safe, have an address ending with one of the 4 values, below :

                            amazonaws.com
                            facebook.com
                            fbcdn.net
                            net23.net
                            

                            Thus, from the remaining 255,268 lines of DocC.txt, only those whose address ends with amazonaws.com, facebook.com, fbcdn.net, or net23.net have to be compared with the list of the 113 safe servers!
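The verification steps above (lower-casing, sorting, collapsing consecutive duplicates, then extracting the trailing domain of each entry) can be sketched in Python. This is an illustration under assumptions: `\n` line endings stand in for `\R`, Python's default string sort stands in for the editor's sort, and the first sample line is a made-up mixed-case duplicate.

```python
import re

lines = [
    "0.0.0.0 Graph.Facebook.com",      # made-up mixed-case variant
    "0.0.0.0 static.ak.fbcdn.net",
    "0.0.0.0 graph.facebook.com",
    "0.0.0.0 ec2-34-193-80-93.compute-1.amazonaws.com",
    "0.0.0.0 scontent-vie-73-xx-fbcdn.net23.net",
]

# Lower-case and sort, so that identical lines become neighbours.
text = "".join(sorted(line.lower() + "\n" for line in lines))

# Collapse each run of identical consecutive lines to one copy
# (counterpart of replacing ^(.+\R)\1+ with \1).
deduped = re.sub(r"^(.+\n)\1+", r"\1", text, flags=re.MULTILINE).splitlines()

# Capture the last two dot-separated labels of each host name,
# using the same pattern as the S/R above.
domain_re = re.compile(r"^.{8}.+[.-]([^.\r\n]+\.[^.\r\n]+)$")
domains = sorted({domain_re.match(line).group(1) for line in deduped})

print(len(deduped), domains)
```

The `.{8}` at the start skips the fixed `0.0.0.0 ` prefix, which is exactly 8 characters long.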

                            • So, in DocC.txt, I bookmarked all lines matching the regex (amazonaws|facebook)\.com$|(fbcdn|net23)\.net$. I obtained 301 bookmarked results, which I copied to the clipboard with the option Search > Bookmark > Copy Bookmarked lines

                            • Then, I pasted the 301 bookmarked lines at the very beginning of DocD.txt, followed by a line of at least 3 dashes as a separator. Finally, DocD.txt contained 301 results to verify + 1 line of dashes + 113 safe sites, that is to say a total of 415 lines

                            Now, with the regex (?-s)^(.+)\R(?s)(?=.*\R\1\R?), I marked all the lines which have a duplicate, further on, in DocD.txt => 106 lines bookmarked !

                            And using the command Search > Bookmark > Remove Unmarked lines, I was left with a DocD.txt file, containing 106 lines, only, which are, both, in DocA.txt and DocB.txt. So, these lines / sites needed to be deleted from the DocC.txt file !
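The "has a duplicate further on" step can also be approximated outside the editor. The sketch below is a plain-Python analogue, not a translation of the regex: it compares whole lines exactly, and the helper name `lines_with_later_duplicate` is hypothetical.

```python
def lines_with_later_duplicate(lines):
    """Return, in original order, each line that occurs again later on."""
    # Count how many copies of each line exist in total.
    remaining = {}
    for line in lines:
        remaining[line] = remaining.get(line, 0) + 1

    result = []
    for line in lines:
        remaining[line] -= 1          # copies still ahead of this position
        if remaining[line] > 0:
            result.append(line)
    return result


# Candidates first, then a separator, then the safe list (as in DocD.txt).
doc_d = ["a.fbcdn.net", "b.fbcdn.net", "---", "b.fbcdn.net", "c.fbcdn.net"]
print(lines_with_later_duplicate(doc_d))
```

With the candidates placed before the separator and the safe list after it, the returned lines are the candidates that also appear in the safe list.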

                            To that purpose :

                            • I added these 106 lines at the end of the DocC.txt file, giving a file with 255,374 lines

                            • I did a last sort operation ( Edit > Line Operations > Sort Lines Lexicographically Ascending )

                            => After sort, these 106 sites should appear, each, as a block of two consecutive duplicate lines

                            And, finally, with the regex S/R :

                            SEARCH ^(.+\R)\1+

                            REPLACE Leave EMPTY

                            I got rid of these safe servers and obtained a DocC.txt file of 255,162 lines ( 255,374 - 106 * 2 ) which contains, only, unwanted servers !
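The sort-then-delete trick relies on duplicates ending up next to each other after sorting. A minimal Python sketch of the same idea follows; it assumes the lines are held in a list, and note that Python's `sorted` may order lines differently from Notepad++'s lexicographic sort. The name `drop_repeated_lines` is hypothetical.

```python
from itertools import groupby


def drop_repeated_lines(lines):
    """Sort the lines, then drop every line that occurs more than once."""
    kept = []
    for line, block in groupby(sorted(lines)):
        # After sorting, equal lines form one consecutive block.
        if len(list(block)) == 1:
            kept.append(line)
    return kept


# "a" appears twice (once from the big file, once from the appended list).
print(drop_repeated_lines(["b", "a", "c", "a"]))
```

As with the regex S/R, both copies of a repeated line disappear, which is why appending the 106 lines and then removing the doubled blocks subtracts them from the file.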


                            Now, if you do not want to repeat all the above process, by yourself, just send me an e-mail to :

                            And I’ll send you, back, this DocC.txt ( = DocA.txt - DocB.txt ) of 255,162 lines

                            Cheers,

                            guy038

                            P.S. :

                            Timmy, to be exhaustive, and regarding your initial files :

                            DocB.txt contains 113 lines / safe servers :

                            • 7 lines, not present in DocA.txt

                            • 106 lines, present in DocA.txt


                            And DocA.txt contains 255,386 lines :

                            • 118 duplicate lines ( due to case normalization ), which have been deleted

                            • 106 lines which are, both, present in DocA.txt and DocB.txt and have been deleted, too !

                            • 254,967 lines, with an end of line, different from amazonaws.com, facebook.com, fbcdn.net, and net23.net, which have to remain in DocA.txt

                            • 195 lines, with an end of line, equal to amazonaws.com, facebook.com, fbcdn.net, or net23.net but not present in DocB.txt. So, they must remain in DocA.txt, too !


                            Hence, the final DocA.txt ( = my DocC.txt ) which contains 255,162 unwanted servers ( 254,967 + 195 )
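The figures quoted in this post are consistent with each other. The short check below only re-does the arithmetic with the numbers given above; the variable names are made up for the illustration.

```python
doc_a_total = 255386           # lines in the original DocA.txt
case_duplicates = 118          # duplicates removed after case normalization
in_both = 106                  # lines present in both DocA.txt and DocB.txt
other_suffix = 254967          # lines not ending with one of the 4 suffixes
safe_suffix_not_in_b = 195     # lines with a matching suffix, absent from DocB.txt

after_dedup = doc_a_total - case_duplicates       # size of DocC.txt before subtraction
bookmarked = in_both + safe_suffix_not_in_b       # lines matched by the suffix regex
with_appended = after_dedup + in_both             # after pasting the 106 lines
final_count = with_appended - 2 * in_both         # after removing the doubled blocks

print(after_dedup, bookmarked, with_appended, final_count)
```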

                            • Timmy MT
                              Timmy M
                              last edited by

                              “Replace All: 106 occurrences were replaced.”

                              @guy038 You sir, are a regex LEGEND! :D

                              Of course I have taken every step you’ve explained, it’s the least I could do after you’ve put in such effort. I even found a minor mistake in your explanation (I think):

                              Now, with the regex (?-s)^(.+)\R(?s)(?=.*\R\1\R?), I marked all the lines which have a duplicate, further on, in DocD.txt

                              Should be bookmarked (with regular marking, the “Remove Unmarked Lines” removes all lines).

                              @Scott-Sumner @Claudia-Frank
                              I guess it’s time to face that monkey then… I installed the Python Script plugin, navigated to Python Script > New Script and created ScreamingMonkey.py
                              Now, selecting Run Previous Script (ScreamingMonkey) doesn’t do anything. I’ve done some searching and found that I should first install Python from python.org/downloads/. Should I? Version 3?
                              Before installing something I know nothing about, I’d like to verify with you guys whether that is the right course of action. I hope that’s okay :)

                              • Claudia FrankC
                                Claudia Frank @Timmy M
                                last edited by

                                @Timmy-M

                                no, there is no need to install an additional python package,
                                as python script plugin delivers its own.

                                What you need to do is to open your two files like I’ve shown above, as the
                                script will reference them by using editor1 and editor2, and then click on ScreamingMonkey.
                                Once it has been run you can then call it again by using Run Previous Script.

                                What do you see if you try to open the python script console?
                                Plugin->PythonScript->Show Console.

                                If nothing happens, no new window opens, then how did
                                you install it, via plugin manager or via the msi package?

                                The msi installation is preferred as it was reported that the installation via plugin manager sometimes doesn’t copy all needed files.

                                Cheers
                                Claudia

                                • guy038G
                                  guy038
                                  last edited by guy038

                                  Hello, @timmy-m and All,

                                  Ah, yes ! You’re right, although what I wrote was correct, too ! Indeed, I used the Remove Unmarked lines command in order to change DocD.txt, first => So, 106 lines remained, which I then added to the DocC.txt contents.

                                  But you, certainly, used this easier solution, below :


                                  …
                                  …

                                  Now, with the regex (?-s)^(.+)\R(?s)(?=.*\R\1\R?), I marked all the lines which have a duplicate, further on, in DocD.txt => 106 lines bookmarked !

                                  And using the command Search > Bookmark > Copy Bookmarked lines, I put, in the clipboard, these 106 lines, only, which are, both, in DocA.txt and DocB.txt and have to be deleted from the DocC.txt file !

                                  To that purpose :

                                  • I paste these 106 lines ( Ctrl + V ) at the end of the DocC.txt file, giving a file with 255,374 lines

                                  …
                                  …

                                  Best Regards,

                                  guy038

                                  • Scott SumnerS
                                    Scott Sumner @Timmy M
                                    last edited by

                                    @Timmy-M

                                    Regarding “marked versus bookmarked”:

                                    Not @guy038 's fault… The Notepad++ user interface is confusing on this point; it uses the terminology “mark” often when it should (IMO) use “bookmark”. Alternatively, if it used “redmark” and “bookmark” exclusively there would be no confusion, but I suppose using “red” wouldn’t totally be correct as it is just the default color and can be changed by the user.

                                    • Timmy MT
                                      Timmy M
                                      last edited by

                                      I used the msi install method. First updated N++ to latest, then installed Python plugin.

                                      @Claudia-Frank said:

                                      What do you see if you try to open the python script console?
                                      Plugin->PythonScript->Show Console.

                                      ---------------------------.
                                      Python 2.7.6-notepad++ r2 (default, Apr 21 2014, 19:26:54) [MSC v.1600 32 bit (Intel)]
                                      Initialisation took 31ms
                                      Ready.
                                      Traceback (most recent call last):
                                      File "%AppData%\Roaming\Notepad++\plugins\Config\PythonScript\scripts\ScreamingMonkey.py", line 1, in <module>
                                      with open('DocA.txt', 'r', encoding='utf-8') as filea: linesa = filea.readlines()
                                      TypeError: 'encoding' is an invalid keyword argument for this function
                                      Traceback (most recent call last):
                                      File "%AppData%\Roaming\Notepad++\plugins\Config\PythonScript\scripts\ScreamingMonkey.py", line 1, in <module>
                                      with open('DocA.txt', 'r', encoding='utf-8') as filea: linesa = filea.readlines()
                                      TypeError: 'encoding' is an invalid keyword argument for this function
                                      ---------------------------.

                                      Eh… Lil’ help please? That monkey’s screaming really loud right now O_O

                                      • Claudia FrankC
                                        Claudia Frank @Timmy M
                                        last edited by Claudia Frank

                                        @Timmy-M

                                        but that is not my script.
                                        Scott’s script assumes you run it under python3, as it uses the encoding argument,
                                        which is not available in python2, the version used by python script.

                                        Did you try my 3 lines code posted above?

                                        If you still want to use Scott’s code and execute it via the python script plugin, then
                                        get rid of the encoding='utf-8', so something like

                                        with open('DocA.txt', 'r') as filea ...
                                        

                                        but you have to ensure that DocA.txt can be found, since python script’s
                                        working directory is set to the directory where notepad++.exe is.

                                        Cheers
                                        Claudia
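As an alternative to dropping the argument, `io.open` accepts `encoding=` on both Python 2.7 and Python 3, so a script written that way should run under either. This is a minimal sketch, not the code from this thread; `read_lines` is a hypothetical helper, and passing a full path avoids depending on the working directory.

```python
import io


def read_lines(path, encoding="utf-8"):
    """Read a text file into a list of lines, without the line endings."""
    # io.open takes an encoding on Python 2.7 as well as on Python 3,
    # unlike the built-in open() of Python 2.
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read().splitlines()
```

Usage would be something like `lines_a = read_lines(full_path_to_doc_a)`, where `full_path_to_doc_a` stands for wherever DocA.txt actually lives.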

                                        • Scott SumnerS
                                          Scott Sumner @Timmy M
                                          last edited by

                                          @Timmy-M

                                          Yea, for sure go with @Claudia-Frank 's script, since you seem to want to stay more “within” Notepad++!

                                          • Timmy MT
                                            Timmy M
                                            last edited by

                                            @Claudia-Frank said:

                                            Did you try my 3 lines code posted above?

                                            I have and eventually succeeded! On my first dozen attempts (turns out I really suck at taming monkeys) I just kept reorganizing the order of all entries in either DocA or DocB, depending on which one was selected, without any entry getting removed.
                                            After trying a bunch of stuff, I found that I had to use View > Move/Clone Current Document > Move to Other View on DocB. I just couldn’t get my head around “view0” and “view1” at first but didn’t want to bother you with such a mundane question either ^_^
                                            DocA’s order does get messed up eventually but that’s easily fixed with a regular sort. Perhaps that sort can be implemented in the script?

                                            Anyway, it all worked out! Thank you very much @Scott-Sumner, @Claudia-Frank and @guy038 for this learning experience. You’re a dream team ;)
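On the ordering question: one way to subtract whole lines while keeping DocA's original order is to use a set only for the lookup and filter the list in place. The sketch below is plain Python under that assumption, not the script used in this thread, and `subtract_lines` is a hypothetical name.

```python
def subtract_lines(lines_a, lines_b):
    """Return lines_a without any line that also appears in lines_b."""
    to_remove = set(lines_b)      # fast whole-line membership test
    # Iterating over lines_a itself keeps its original order.
    return [line for line in lines_a if line not in to_remove]


doc_a = ["Giraffe", "Crocodile", "Ant", "Panther", "Elephant",
         "Mosquito", "Zebra", "Lion", "Butterfly", "Antelope"]
doc_b = ["Ant", "Mosquito", "Butterfly"]
print(subtract_lines(doc_a, doc_b))
```

Because whole lines are compared, "Ant" does not affect "Antelope". If a sorted result is wanted instead, wrapping the return value in `sorted(...)` would do it.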
