Deleting numbers from LIST 1, that also appear in LIST 2
-
hello all!
I would like to delete the numbers from LIST 1 that also appear in LIST 2. The process should look like this:
before:
LIST 1
12345
23456
34567
LIST 2
12345
after:
LIST 1
23456
34567
In total I have 2 million numbers in list one
and 1.6 million numbers in list two, so there won’t be any manual work possible. What would you guys recommend for me to do?
I’m thankful for any help! -
@M-P ,
Hmmm… I know that @guy038 has previously posted sequences which delete lines from FILE1 which match entries in FILE2… but I think those are more for hundreds or thousands of entries. With millions of entries, I am not sure some of those work.
There are also some PythonScript plugin solutions that have been presented for similar things. If you’re willing to use that scripting plugin to help you, let us know, and we can try to dig up the solutions (and/or craft a new one). (To install PythonScript, go to Plugins > Plugins Admin and use that interface to install the PythonScript plugin.)
-
I scripted a solution to a similar problem HERE. It could probably be adapted easily to bookmark all lines with hits, providing a mechanism by which to delete said lines.
If you’re willing to use the Pythonscript plugin, it is a possibility.
@PeterJones Yea, regex solutions such as those of @guy038 are likely to run into the “regex engine” overflow issue when run on such large data.
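For readers wondering what such a script boils down to, here is a minimal sketch in plain Python. It is not the script linked above. It assumes each list holds one number per line and that a line should be removed only when the whole line (ignoring surrounding whitespace) appears in the second list. The function name `remove_listed` is made up for illustration, and wiring it to the PythonScript plugin's editor objects is left out.

```python
def remove_listed(list1_text, list2_text):
    """Return list1_text without the lines that also occur in list2_text."""
    # A set gives fast lookups, which matters with millions of entries.
    to_remove = {line.strip() for line in list2_text.splitlines() if line.strip()}
    # Keep a line only if its whole (stripped) content is not in list 2.
    kept = [line for line in list1_text.splitlines() if line.strip() not in to_remove]
    return "\n".join(kept)


if __name__ == "__main__":
    list1 = "12345\n23456\n34567"
    list2 = "12345"
    print(remove_listed(list1, list2))
```

Because the comparison is done on whole lines through a set, no regex engine is involved, so the gap between a number and its match in the other list should not matter.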
-
@M-P said in Deleting numbers from LIST 1, that also appear in LIST 2:
It’s easier for me to write the program itself that will do this than to explain how to write it. even vbs will do.
-
@TroshinDV said in Deleting numbers from LIST 1, that also appear in LIST 2:
It’s easier for me to write the program itself that will do this than to explain how to write it. even vbs will do.
Please DON’T do that.
Let’s stick to “inside Notepad++” solutions here (which includes plugins). -
@Alan-Kilborn OK.
Notepad++ is still a text editor, but it does have the ability to process files. Two options: Python by means of the PythonScript plugin, or JavaScript from WSH using the JH plugin.
Here you just need to throw in a script that will process the text file in a certain way. -
thank you - i have downloaded the script and followed the instructions that were mentioned in the thread, but how can I put it to use now? or like michael scott would say; why don’t you explain it to me like im an 8 year old? ;) - sorry im new to this.
-
No worries, if you’re an intelligent 8 year old, then, well, we’ve got something to work with. We’ll guide you through.
But, what stage did you get to, can you run the script?
Do you see where to put your data from your lists before running the script? -
well, yes, im at the stage saying: Enter PREFIX-word-SUFFIX - should the LIST 2 be entered there?
-
Hmm, you’ve lost me as to where you are in the process as I can’t find “prefix” in anything I’ve pointed to…
-
List 2 should be entered in the “secondary view” of Notepad++, example:
-
Alright i have done the secondary view and downloaded the script that you published 19 hours ago. i tried it with only a couple thousand numbers and it took the program quite a while, but the program also marked all 9s that are in list 1, together with the numbers that are in list 2.
here i’ve done a video about that:
https://streamable.com/l81voi -
It wasn’t considered that numbers in the list could be part of larger numbers. Meaning that you probably have a line with only 9 on it in your list in the secondary view?
A similar thing would happen if you had a line of 123 and in the main view you had 1234, 61235, 7891230, etc.
We can work around that, but some further experimentation has shown that this might not be a viable way to do what you need. I don’t think the script logic is flawed, but I don’t have time immediately to dig in deeper to see what is truly going on.
In light of that I suppose I’d advise you to seek an alternate solution if your need is short-term. I’d really not like to see a non-Notepad++ solution worked out in this forum, because it is off-topic, but perhaps this one time is OK (because I feel bad taking you down the N++ road only to abort) if someone has time and wants to present one of those.
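The partial-match pitfall described in this post can be illustrated with a small sketch in plain Python. This is not the script from the linked thread, and the helper names are hypothetical. A substring test flags every line that merely contains an entry, while a whole-line comparison flags only exact matches.

```python
def substring_hits(main_lines, entries):
    """Lines flagged when an entry only has to occur somewhere in the line."""
    return [line for line in main_lines if any(e in line for e in entries)]


def whole_line_hits(main_lines, entries):
    """Lines flagged only when the entire line equals an entry."""
    wanted = set(entries)
    return [line for line in main_lines if line in wanted]


if __name__ == "__main__":
    main_view = ["123", "1234", "61235", "7891230", "456"]
    secondary_view = ["123"]
    print(substring_hits(main_view, secondary_view))
    print(whole_line_hits(main_view, secondary_view))
```

In a regex-based search, the same effect would typically come from anchoring the pattern to the start and end of the line.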
-
solved a problem?
-
@Alan-Kilborn i have found the 9, thank you. And i believe that everything worked now. How can i delete the numbers from FILE 1 in the final stage?
-
I’m amazed you didn’t encounter any of the “weirdness” I did.
But I suppose that is a good thing.
I have a “highly scripted” setup, so perhaps that is what is causing the weirdness I see when I was experimenting?
But, your next step should be to right-click where I have the red dot in the following (doesn’t have to be on line 6!, just in that same vertical “bookmark margin” area), and choose the indicated command:
-
thank you! hopefully there won’t be any problems when working with the millions of numbers but i would come back in that case. for now everything went super smoothly.
-
BTW, before finishing up your task, I would be very cautious: more things similar to the 9 problem discussed earlier might have occurred. You should check on that; let me know, there’s an easy way to avoid that. -
I got thinking more about how this problem would be better solved.
If we allow ourselves to dream about features that Notepad++ maybe itself should have, here’s how I think I’d solve it:
- Add a delimiter line at the bottom of the file from which lines are to be removed (delimiter line contains data that doesn’t otherwise occur in either file)
- Paste all lines from the second file (containing the list of things to be removed from the first file), below the delimiter line in the first file
- Choose Delete all non-unique lines from the Edit menu’s Line Operations submenu <— special note: fantasy Notepad++ feature that does not currently exist!!
- Remove the delimiter line added earlier and any lines that remain after it
After that the first file would contain the desired data.
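The delimiter idea above can be sketched outside the editor in plain Python. This only illustrates the logic; the "Delete all non-unique lines" command does not exist in Notepad++, and the `DELIMITER` value and function names below are assumptions. Note that with this approach any line that occurs twice within the first file itself would also disappear.

```python
from collections import Counter

# Assumed not to occur in either file.
DELIMITER = "=====DELIMITER====="


def delete_non_unique_lines(lines):
    """Drop every line that occurs more than once (all of its copies)."""
    counts = Counter(lines)
    return [line for line in lines if counts[line] == 1]


def remove_via_delimiter(file1_lines, file2_lines):
    """Append file 2 below a delimiter, delete non-unique lines, cut at the delimiter."""
    combined = file1_lines + [DELIMITER] + file2_lines
    survivors = delete_non_unique_lines(combined)
    # The delimiter occurs exactly once, so it always survives;
    # everything before it is the cleaned-up first file.
    return survivors[:survivors.index(DELIMITER)]


if __name__ == "__main__":
    print(remove_via_delimiter(["12345", "23456", "34567"], ["12345"]))
```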
I’m fairly certain I’ve seen a “Delete all non-unique lines” (or maybe a “Keep only unique lines”) in a different editor, but I can’t for sure remember which one. Ultraedit? Hmm.
Anyway, we recently have had Delete Duplicate Lines functionality added, how about the addition of another new command?
@PeterJones Yep, don’t say it…I will…FEATURE REQUEST
-
Hello, @m-p, @peterjones, @alan-kilborn, @troshindv and All,
As @peterJones said, a simple regex S/R could work for files of moderate size. But with files of about 2,000,000 lines, this S/R would probably go totally wrong because of the regex engine’s overflow issue :-((
But all is not lost! The problem is that, in huge files, there may be a very large gap between a line and its first duplicate. Luckily, this problem can be eliminated by using the following steps:
- First, number all the lines
- Then, sort the lines in ascending order
- Delete all lines which exist in more than 1 copy, which should be easy as these lines are, now, consecutive
- Re-sort all the remaining unique lines to restore their initial list order
Below, I’ll try to explain these steps with a short text. However, I am quite confident that this method should work with huge lists too, apart from the time needed, of course, to perform the sorts and regex search/replacements!
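The four steps (number, sort, delete lines existing in more than one copy, re-sort) can be sketched in plain Python as follows. This is only an illustration of the logic, not the Notepad++ procedure itself; the function name is hypothetical, and the input is assumed to be the first list with the second list appended to it.

```python
from itertools import groupby


def remove_duplicated_lines(lines):
    """Number, sort, drop lines with more than one copy, restore the order."""
    # Step 1: number all the lines.
    numbered = list(enumerate(lines))
    # Step 2: sort by content so that identical lines become consecutive.
    numbered.sort(key=lambda pair: pair[1])
    # Step 3: keep only the lines that exist in exactly one copy.
    unique = []
    for _, group in groupby(numbered, key=lambda pair: pair[1]):
        copies = list(group)
        if len(copies) == 1:
            unique.append(copies[0])
    # Step 4: re-sort on the line number to restore the initial order.
    unique.sort(key=lambda pair: pair[0])
    return [text for _, text in unique]


if __name__ == "__main__":
    list1 = ["12345", "23456", "34567"]
    list2 = ["12345"]
    print(remove_duplicated_lines(list1 + list2))
```

Sorting first is what makes the duplicates adjacent, so the deletion step only ever has to compare neighbouring lines.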
Let’s go :
- From the license.txt file, I extracted only the non-blank lines and shortened them to, roughly, their first 32 characters, ending with this 42-line text:
Preamble
The licenses for most software
When we speak of free software,
To protect your rights, we need
For example, if you distribute
We protect your rights with two
Also, for each author's protect
Finally, any free program is
The precise terms and condition
TERMS AND CONDITIONS FOR COPYING
0. This License applies to any
Activities other copying,
1. You may copy and distribute
You may charge a fee for the
2. You may modify your copy or
a) You must cause the modified
b) You must cause any work that
c) If the modified program
These requirements apply to the
Thus, it is not the intent of
In addition, mere aggregation
3. You may copy and distribute
a) Accompany it with the
b) Accompany it with a written
c) Accompany it with the
The source code for a work mean
If distribution of executable
4. You may not copy, modify,
5. You are not required to
6. Each time you redistribute
7. If, as a consequence of a
If any portion of this section
It is not the purpose of this
This section is intended to make
8. If the distribution and/or
9. The Free Software Foundation
Each version is given a
10. If you wish to incorporate
NO WARRANTY
11. BECAUSE THE PROGRAM IS
12. IN NO EVENT UNLESS REQUIRED
END OF TERMS AND CONDITIONS
- Then I appended, to this list, 33 lines out of these 42 lines (so about 80 % of the total, i.e. the same proportion as your lists: 1,600,000 / 2,000,000)
No separation line is needed. Thus, we now start with this text, where the added lines begin at line 43:
Preamble
The licenses for most software
When we speak of free software,
To protect your rights, we need
For example, if you distribute
We protect your rights with two
Also, for each author's protect
Finally, any free program is
The precise terms and condition
TERMS AND CONDITIONS FOR COPYING
0. This License applies to any
Activities other copying,
1. You may copy and distribute
You may charge a fee for the
2. You may modify your copy or
a) You must cause the modified
b) You must cause any work that
c) If the modified program
These requirements apply to the
Thus, it is not the intent of
In addition, mere aggregation
3. You may copy and distribute
a) Accompany it with the
b) Accompany it with a written
c) Accompany it with the
The source code for a work mean
If distribution of executable
4. You may not copy, modify,
5. You are not required to
6. Each time you redistribute
7. If, as a consequence of a
If any portion of this section
It is not the purpose of this
This section is intended to make
8. If the distribution and/or
9. The Free Software Foundation
Each version is given a
10. If you wish to incorporate
NO WARRANTY
11. BECAUSE THE PROGRAM IS
12. IN NO EVENT UNLESS REQUIRED
END OF TERMS AND CONDITIONS
The licenses for most software
When we speak of free software,
To protect your rights, we need
We protect your rights with two
Also, for each author's protect
Finally, any free program is
The precise terms and condition
TERMS AND CONDITIONS FOR COPYING
Activities other copying,
1. You may copy and distribute
You may charge a fee for the
2. You may modify your copy or
b) You must cause any work that
c) If the modified program
These requirements apply to the
Thus, it is not the intent of
In addition, mere aggregation
b) Accompany it with a written
c) Accompany it with the
The source code for a work mean
If distribution of executable
4. You may not copy, modify,
5. You are not required to
6. Each time you redistribute
If any portion of this section
It is not the purpose of this
This section is intended to make
8. If the distribution and/or
9. The Free Software Foundation
10. If you wish to incorporate
NO WARRANTY
12. IN NO EVENT UNLESS REQUIRED
END OF TERMS AND CONDITIONS
Note: So, you agree that, after all this is done, we should be left with a text of 9 unique lines! (42 - 33)
- From the end of the first line, we add some space characters till, let’s say, the column 110
- We open the column editor (Alt + C)
- We select the Number to Insert option
- Type in the value 1 in each zone
- Tick the Leading zeros box
- Verify that the Dec format is ticked
- Click on the OK button
- Delete the last virtual line 76
=> We get this text :
Preamble 01
The licenses for most software 02
When we speak of free software, 03
To protect your rights, we need 04
For example, if you distribute 05
We protect your rights with two 06
Also, for each author's protect 07
Finally, any free program is 08
The precise terms and condition 09
TERMS AND CONDITIONS FOR COPYING 10
0. This License applies to any 11
Activities other copying, 12
1. You may copy and distribute 13
You may charge a fee for the 14
2. You may modify your copy or 15
a) You must cause the modified 16
b) You must cause any work that 17
c) If the modified program 18
These requirements apply to the 19
Thus, it is not the intent of 20
In addition, mere aggregation 21
3. You may copy and distribute 22
a) Accompany it with the 23
b) Accompany it with a written 24
c) Accompany it with the 25
The source code for a work mean 26
If distribution of executable 27
4. You may not copy, modify, 28
5. You are not required to 29
6. Each time you redistribute 30
7. If, as a consequence of a 31
If any portion of this section 32
It is not the purpose of this 33
This section is intended to make 34
8. If the distribution and/or 35
9. The Free Software Foundation 36
Each version is given a 37
10. If you wish to incorporate 38
NO WARRANTY 39
11. BECAUSE THE PROGRAM IS 40
12. IN NO EVENT UNLESS REQUIRED 41
END OF TERMS AND CONDITIONS 42
The licenses for most software 43
When we speak of free software, 44
To protect your rights, we need 45
We protect your rights with two 46
Also, for each author's protect 47
Finally, any free program is 48
The precise terms and condition 49
TERMS AND CONDITIONS FOR COPYING 50
Activities other copying, 51
1. You may copy and distribute 52
You may charge a fee for the 53
2. You may modify your copy or 54
b) You must cause any work that 55
c) If the modified program 56
These requirements apply to the 57
Thus, it is not the intent of 58
In addition, mere aggregation 59
b) Accompany it with a written 60
c) Accompany it with the 61
The source code for a work mean 62
If distribution of executable 63
4. You may not copy, modify, 64
5. You are not required to 65
6. Each time you redistribute 66
If any portion of this section 67
It is not the purpose of this 68
This section is intended to make 69
8. If the distribution and/or 70
9. The Free Software Foundation 71
10. If you wish to incorporate 72
NO WARRANTY 73
12. IN NO EVENT UNLESS REQUIRED 74
END OF TERMS AND CONDITIONS 75
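The numbering step done with the column editor can be mimicked in plain Python. This is a rough sketch, not a replica of the column editor's exact behavior. It assumes a starting value of 1, an increment of 1, leading zeros, and numbers beginning at column 110 (counting from 1); the function name and the `column` parameter are hypothetical.

```python
def number_lines(lines, column=110):
    """Pad each line with spaces up to `column`, then append a zero-padded counter."""
    # Number of digits needed for the largest line number, e.g. 2 for 75 lines.
    width = len(str(len(lines)))
    return [
        # ljust(column - 1) fills the line so the number starts at `column`.
        line.ljust(column - 1) + str(index).zfill(width)
        for index, line in enumerate(lines, start=1)
    ]


if __name__ == "__main__":
    sample = ["Preamble", "The licenses for most software"]
    for numbered in number_lines(sample, column=40):
        print(numbered)
```

Zero-padding keeps every number the same width, so a later sort on the number column should put the lines back in their initial order.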
More in the next post !
guy038
-