Deleting numbers from LIST 1, that also appear in LIST 2



  • hello all!
    I would like to delete the numbers from LIST 1 that also appear in LIST 2

    the process should look like this:

    before:
    LIST 1
    12345
    23456
    34567

    LIST 2
    12345

    after:
    LIST 1
    23456
    34567

    In total I have 2 million numbers in list one
    and 1.6 million numbers in list two, so manual work isn’t possible.

    What would you guys recommend I do?
    I’m thankful for any help!



  • @M-P ,

    Hmmm… I know that @guy038 has previously posted sequences which delete lines from FILE1 which match entries in FILE2… but I think those are more for hundreds or thousands of entries. With millions of entries, I am not sure some of those work.

    There are also some PythonScript plugin solutions that have been presented for similar things. If you’re willing to use that scripting plugin to help you, let us know, and we can try to dig up the solutions (and/or craft a new one). (To install PythonScript, go to Plugins > Plugins Admin and use that interface to install the PythonScript plugin.)



  • @M-P

    I scripted a solution to a similar problem HERE. It could probably be adapted easily to bookmark all lines with hits, providing a mechanism by which to delete said lines.

    If you’re willing to use the PythonScript plugin, it is a possibility.

    @PeterJones Yea, regex solutions such as those of @guy038 are likely to run into the “regex engine” overflow issue when run on such large data.
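
    For a sense of what a script-based approach boils down to, here is a minimal sketch in plain Python. It is not the script linked above; `remove_listed` is a hypothetical name, and inside the PythonScript plugin the lines would come from the two views rather than from literals. A set keeps each lookup cheap, so the regex engine is not involved at all.

```python
def remove_listed(list1_lines, list2_lines):
    """Return the lines of list 1 that do not appear in list 2, keeping order."""
    # A set gives fast membership tests, so one pass over list 1 is enough
    # even when both lists hold millions of entries.
    to_remove = {line.strip() for line in list2_lines}
    return [line for line in list1_lines if line.strip() not in to_remove]


list1 = ["12345", "23456", "34567"]
list2 = ["12345"]
print(remove_listed(list1, list2))  # ['23456', '34567']
```

    Lines are compared whole (after stripping surrounding whitespace), so a short entry in list 2 never matches part of a longer number in list 1.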



  • @M-P said in Deleting numbers from LIST 1, that also appear in LIST 2:

    It’s easier for me to write the program itself that will do this than to explain how to write it. even vbs will do.



  • @TroshinDV said in Deleting numbers from LIST 1, that also appear in LIST 2:

    It’s easier for me to write the program itself that will do this than to explain how to write it. even vbs will do.

    Please DON’T do that.
    Let’s stick to “inside Notepad++” solutions here (which includes plugins).



  • @Alan-Kilborn OK.
    Notepad++ is still a text editor, but it does have the ability to process files. Two options: Python via the PythonScript plugin, or JavaScript from WSH using the JH plugin.
    Here you just need to drop in a script that processes the text file in a certain way.



  • Thank you. I have downloaded the script and followed the instructions mentioned in the thread, but how can I put it to use now? Or, as Michael Scott would say: why don’t you explain it to me like I’m an 8 year old? ;) Sorry, I’m new to this.



  • @M-P

    No worries, if you’re an intelligent 8 year old, then, well, we’ve got something to work with. We’ll guide you through.

    But, what stage did you get to, can you run the script?
    Do you see where to put your data from your lists before running the script?



  • Well, yes, I’m at the stage that says: Enter PREFIX-word-SUFFIX. Should LIST 2 be entered there?



  • @M-P

    Hmm, you’ve lost me as to where you are in the process as I can’t find “prefix” in anything I’ve pointed to…



  • @M-P

    List 2 should be entered in the “secondary view” of Notepad++, example:

    fef8dadd-2665-4390-90ed-3e44fe9b0f68-image.png



  • Alright, I have set up the secondary view and downloaded the script that you published 19 hours ago. I tried it with only a couple thousand numbers and it took the program quite a while, but the program also marked all 9s that are in list 1, together with the numbers that are in list 2.

    Here is a video of that:
    https://streamable.com/l81voi



  • @M-P

    It wasn’t considered that numbers in the list could be part of larger numbers. Meaning that you probably have a line with only 9 on it in your list in the secondary view?

    A similar thing would happen if you had a line of 123 and in the main view you had 1234, 61235, 7891230, etc.
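
    To make the effect concrete, here is a small sketch in plain Python (the function names are hypothetical, and this is not the script under discussion). It contrasts a plain substring search with a search anchored to the whole line:

```python
import re


def substring_hits(needle, lines):
    # Plain substring search: "123" also hits "1234", "61235" and "7891230".
    return [line for line in lines if needle in line]


def whole_line_hits(needle, lines):
    # Anchored search: the needle must make up the entire line.
    pattern = re.compile(r"^" + re.escape(needle) + r"$")
    return [line for line in lines if pattern.search(line)]


lines = ["123", "1234", "61235", "7891230", "456"]
print(substring_hits("123", lines))   # ['123', '1234', '61235', '7891230']
print(whole_line_hits("123", lines))  # ['123']
```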

    We can work around that, but some other further experimentation has shown that this might not be a viable way to do what you need. I don’t think the script logic is flawed, but I don’t have time immediately to dig in deeper to see what is truly going on.

    In light of that I suppose I’d advise you to seek an alternate solution if your need is short-term. I’d really not like to see a non-Notepad++ solution worked out in this forum, because it is off-topic, but perhaps this one time is OK (because I feel bad taking you down the N++ road only to abort) if someone has time and wants to present one of those.



  • Has the problem been solved?



  • @Alan-Kilborn I have found the 9, thank you. And I believe that everything worked now. How can I delete the numbers from FILE 1 in the final stage?



  • @M-P

    I’m amazed you didn’t encounter any of the “weirdness” I did.
    But I suppose that is a good thing.
    I have a “highly scripted” setup, so perhaps that is what caused the weirdness I saw when I was experimenting?

    But, your next step should be to right-click where I have the red dot in the following (doesn’t have to be on line 6!, just in that same vertical “bookmark margin” area), and choose the indicated command:

    fcd7b7d3-8745-4c32-809e-7e9536232776-image.png



  • Thank you! Hopefully there won’t be any problems when working with the millions of numbers, but I’ll come back if there are. For now everything went super smoothly.



  • @M-P

    BTW, before finishing up your task, I would check carefully whether more things similar to the 9 problem discussed earlier have occurred. Let me know what you find; there’s an easy way to avoid that.



  • I got thinking more about how this problem would be better solved.
    If we allow ourselves to dream about features that Notepad++ maybe itself should have, here’s how I think I’d solve it:

    1. Add a delimiter line at the bottom of the file from which lines are to be removed (the delimiter line contains data that doesn’t otherwise occur in either file)
    2. Paste all lines from the second file (containing the list of things to be removed from the first file), below the delimiter line in the first file
    3. Choose Delete all non-unique lines from the Edit menu’s Line Operations submenu <— special note: fantasy Notepad++ feature that does not currently exist!!
    4. Remove the delimiter line added earlier and any lines that remain after it

    After that the first file would contain the desired data.
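
    The four steps above can be sketched in plain Python. This only illustrates the logic of the imagined command; `DELIMITER` and `remove_via_non_unique` are hypothetical names, and "non-unique" is taken to mean that every copy of a repeated line is deleted.

```python
from collections import Counter

# Hypothetical marker; it must not occur in either file.
DELIMITER = "=== DELIMITER ==="


def remove_via_non_unique(file1_lines, file2_lines):
    # Steps 1-2: file 1, then the delimiter line, then file 2 pasted below it.
    combined = file1_lines + [DELIMITER] + file2_lines
    # Step 3: delete every line that occurs more than once (all copies go).
    counts = Counter(combined)
    unique_only = [line for line in combined if counts[line] == 1]
    # Step 4: drop the delimiter and whatever remains after it.
    return unique_only[:unique_only.index(DELIMITER)]


print(remove_via_non_unique(["12345", "23456", "34567"], ["12345"]))
# ['23456', '34567']
```

    One consequence of this approach: a line that is repeated inside the first file itself would also disappear, even if it never appears in the second file.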

    I’m fairly certain I’ve seen a “Delete all non-unique lines” (or maybe a “Keep only unique lines”) in a different editor, but I can’t remember for sure which one. UltraEdit? Hmm.

    Anyway, we recently have had Delete Duplicate Lines functionality added, how about the addition of another new command?

    @PeterJones Yep, don’t say it…I will…FEATURE REQUEST



  • Hello, @m-p, @peterjones, @alan-kilborn, @troshindv and All,

    As @peterJones said, a simple regex S/R could work for moderate file sizes. But with files of about 2,000,000 lines, this S/R would probably fail completely because of the regex engine’s overflow issue :-((

    But all is not lost! The problem is that, in huge files, there may be a very large gap between a line and its first duplicate. Luckily, this problem can be eliminated with the following steps:

    • First, number all the lines

    • Then, sort the lines in an ascending order

    • Delete all lines which exist in more than 1 copy, which should be easy as these lines are, now, consecutive

    • Re-sort all the remaining unique lines to restore their initial list order
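
    As a rough illustration of these four steps, here is a sketch in plain Python rather than in Notepad++ itself. The function name is hypothetical, and it assumes, as in the walkthrough that follows, that list 2 has been appended to list 1 and that every entry of list 2 also occurs in list 1.

```python
from itertools import groupby


def remove_duplicated_lines(lines):
    # Step 1: number all the lines so the original order can be restored.
    numbered = list(enumerate(lines))
    # Step 2: sort by content, so identical lines become consecutive.
    numbered.sort(key=lambda pair: pair[1])
    # Step 3: keep only the lines that exist in exactly one copy.
    survivors = []
    for _, group in groupby(numbered, key=lambda pair: pair[1]):
        group = list(group)
        if len(group) == 1:
            survivors.append(group[0])
    # Step 4: re-sort on the line number to restore the initial order.
    survivors.sort(key=lambda pair: pair[0])
    return [text for _, text in survivors]


list1 = ["12345", "23456", "34567"]
list2 = ["12345"]
print(remove_duplicated_lines(list1 + list2))  # ['23456', '34567']
```

    Because duplicates sit next to each other after the sort, each comparison only ever looks at neighbouring lines, which is what removes the "very large gap" problem.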

    Below, I’ll try to explain these steps with a short text. However, I’m quite confident that this method should work with huge lists too, allowing of course for the time needed to perform the sorts and regex search/replacements!


    Let’s go :

    • From the license.txt file, I extracted only the non-blank lines and shortened them to, roughly, their first 32 characters, ending up with this 42-line text:
    Preamble
    The licenses for most software
    When we speak of free software,
    To protect your rights, we need
    For example, if you distribute
    We protect your rights with two
    Also, for each author's protect
    Finally, any free program is
    The precise terms and condition
    TERMS AND CONDITIONS FOR COPYING
    0. This License applies to any
    Activities other copying,
    1. You may copy and distribute
    You may charge a fee for the
    2. You may modify your copy or
    a) You must cause the modified
    b) You must cause any work that
    c) If the modified program
    These requirements apply to the
    Thus, it is not the intent of
    In addition, mere aggregation
    3. You may copy and distribute
    a) Accompany it with the
    b) Accompany it with a written
    c) Accompany it with the
    The source code for a work mean
    If distribution of executable
    4. You may not copy, modify,
    5. You are not required to
    6. Each time you redistribute
    7. If, as a consequence of a
    If any portion of this section
    It is not the purpose of this
    This section is intended to make
    8. If the distribution and/or
    9. The Free Software Foundation
    Each version is given a
    10. If you wish to incorporate
    NO WARRANTY
    11. BECAUSE THE PROGRAM IS
    12. IN NO EVENT UNLESS REQUIRED
    END OF TERMS AND CONDITIONS
    
    • Then I appended to this list 33 of these 42 lines (about 80 % of the total, i.e. roughly the same proportion as in your lists: 1,600,000 / 2,000,000)

    No separation line is needed. Thus, we now start with this text, where the added lines begin at line 43 :

    Preamble
    The licenses for most software
    When we speak of free software,
    To protect your rights, we need
    For example, if you distribute
    We protect your rights with two
    Also, for each author's protect
    Finally, any free program is
    The precise terms and condition
    TERMS AND CONDITIONS FOR COPYING
    0. This License applies to any
    Activities other copying,
    1. You may copy and distribute
    You may charge a fee for the
    2. You may modify your copy or
    a) You must cause the modified
    b) You must cause any work that
    c) If the modified program
    These requirements apply to the
    Thus, it is not the intent of
    In addition, mere aggregation
    3. You may copy and distribute
    a) Accompany it with the
    b) Accompany it with a written
    c) Accompany it with the
    The source code for a work mean
    If distribution of executable
    4. You may not copy, modify,
    5. You are not required to
    6. Each time you redistribute
    7. If, as a consequence of a
    If any portion of this section
    It is not the purpose of this
    This section is intended to make
    8. If the distribution and/or
    9. The Free Software Foundation
    Each version is given a
    10. If you wish to incorporate
    NO WARRANTY
    11. BECAUSE THE PROGRAM IS
    12. IN NO EVENT UNLESS REQUIRED
    END OF TERMS AND CONDITIONS
    The licenses for most software
    When we speak of free software,
    To protect your rights, we need
    We protect your rights with two
    Also, for each author's protect
    Finally, any free program is
    The precise terms and condition
    TERMS AND CONDITIONS FOR COPYING
    Activities other copying,
    1. You may copy and distribute
    You may charge a fee for the
    2. You may modify your copy or
    b) You must cause any work that
    c) If the modified program
    These requirements apply to the
    Thus, it is not the intent of
    In addition, mere aggregation
    b) Accompany it with a written
    c) Accompany it with the
    The source code for a work mean
    If distribution of executable
    4. You may not copy, modify,
    5. You are not required to
    6. Each time you redistribute
    If any portion of this section
    It is not the purpose of this
    This section is intended to make
    8. If the distribution and/or
    9. The Free Software Foundation
    10. If you wish to incorporate
    NO WARRANTY
    12. IN NO EVENT UNLESS REQUIRED
    END OF TERMS AND CONDITIONS
    

    Note: after all this is done, we should be left with a text of 9 unique lines (42 - 33)!

    • At the end of the first line, we add space characters up to, let’s say, column 110

    • We open the column editor ( Alt + C )

      • We select the Number to Insert option

      • Type in the value 1 in each zone

      • Tick the Leading zeros box

      • Verify that the Dec format is ticked

      • Click on the OK button

      • Delete the last virtual line 76

    => We get this text :

    Preamble                                                                                                     01
    The licenses for most software                                                                               02
    When we speak of free software,                                                                              03
    To protect your rights, we need                                                                              04
    For example, if you distribute                                                                               05
    We protect your rights with two                                                                              06
    Also, for each author's protect                                                                              07
    Finally, any free program is                                                                                 08
    The precise terms and condition                                                                              09
    TERMS AND CONDITIONS FOR COPYING                                                                             10
    0. This License applies to any                                                                               11
    Activities other copying,                                                                                    12
    1. You may copy and distribute                                                                               13
    You may charge a fee for the                                                                                 14
    2. You may modify your copy or                                                                               15
    a) You must cause the modified                                                                               16
    b) You must cause any work that                                                                              17
    c) If the modified program                                                                                   18
    These requirements apply to the                                                                              19
    Thus, it is not the intent of                                                                                20
    In addition, mere aggregation                                                                                21
    3. You may copy and distribute                                                                               22
    a) Accompany it with the                                                                                     23
    b) Accompany it with a written                                                                               24
    c) Accompany it with the                                                                                     25
    The source code for a work mean                                                                              26
    If distribution of executable                                                                                27
    4. You may not copy, modify,                                                                                 28
    5. You are not required to                                                                                   29
    6. Each time you redistribute                                                                                30
    7. If, as a consequence of a                                                                                 31
    If any portion of this section                                                                               32
    It is not the purpose of this                                                                                33
    This section is intended to make                                                                             34
    8. If the distribution and/or                                                                                35
    9. The Free Software Foundation                                                                              36
    Each version is given a                                                                                      37
    10. If you wish to incorporate                                                                               38
    NO WARRANTY                                                                                                  39
    11. BECAUSE THE PROGRAM IS                                                                                   40
    12. IN NO EVENT UNLESS REQUIRED                                                                              41
    END OF TERMS AND CONDITIONS                                                                                  42
    The licenses for most software                                                                               43
    When we speak of free software,                                                                              44
    To protect your rights, we need                                                                              45
    We protect your rights with two                                                                              46
    Also, for each author's protect                                                                              47
    Finally, any free program is                                                                                 48
    The precise terms and condition                                                                              49
    TERMS AND CONDITIONS FOR COPYING                                                                             50
    Activities other copying,                                                                                    51
    1. You may copy and distribute                                                                               52
    You may charge a fee for the                                                                                 53
    2. You may modify your copy or                                                                               54
    b) You must cause any work that                                                                              55
    c) If the modified program                                                                                   56
    These requirements apply to the                                                                              57
    Thus, it is not the intent of                                                                                58
    In addition, mere aggregation                                                                                59
    b) Accompany it with a written                                                                               60
    c) Accompany it with the                                                                                     61
    The source code for a work mean                                                                              62
    If distribution of executable                                                                                63
    4. You may not copy, modify,                                                                                 64
    5. You are not required to                                                                                   65
    6. Each time you redistribute                                                                                66
    If any portion of this section                                                                               67
    It is not the purpose of this                                                                                68
    This section is intended to make                                                                             69
    8. If the distribution and/or                                                                                70
    9. The Free Software Foundation                                                                              71
    10. If you wish to incorporate                                                                               72
    NO WARRANTY                                                                                                  73
    12. IN NO EVENT UNLESS REQUIRED                                                                              74
    END OF TERMS AND CONDITIONS                                                                                  75
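
    For readers who prefer to see the numbering step as code, here is a rough Python equivalent of what the column editor does above. It is only a sketch: `number_lines` and its `column` parameter are hypothetical, and the counter width is simply chosen to fit the last line number, which matches the two-digit output shown.

```python
def number_lines(lines, column=110):
    # Width of the counter: enough digits for the last line number,
    # so shorter numbers get leading zeros (01, 02, ...).
    width = len(str(len(lines)))
    numbered = []
    for number, line in enumerate(lines, start=1):
        # Pad with spaces up to the chosen width, then append the counter.
        numbered.append(line.ljust(column) + str(number).zfill(width))
    return numbered


sample = ["Preamble", "The licenses for most software"]
for row in number_lines(sample, column=40):
    print(row)
```

    The fixed-width, zero-padded counter matters later: it lets a plain text sort on the number restore the original line order.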
    

    More in the next post !

    guy038

