Deleting numbers from LIST 1, that also appear in LIST 2
-
I’m amazed you didn’t encounter any of the “weirdness” I did.
But I suppose that is a good thing.
I have a “highly scripted” setup, so perhaps that is what caused the weirdness I saw when I was experimenting?
But, your next step should be to right-click where I have the red dot in the following (doesn’t have to be on line 6!, just in that same vertical “bookmark margin” area), and choose the indicated command:
-
thank you! hopefully there won’t be any problems when working with the millions of numbers but i would come back in that case. for now everything went super smoothly.
-
BTW, before finishing up your task, I would check very carefully whether more things similar to the 9 problem discussed earlier might have occurred. Let me know; there’s an easy way to avoid that.
-
I got thinking more about how this problem would be better solved.
If we allow ourselves to dream about features that Notepad++ maybe itself should have, here’s how I think I’d solve it:
- Add a delimiter line at the bottom of the file from which lines are to be removed (the delimiter line contains data that doesn’t otherwise occur in either file)
- Paste all lines from the second file (containing the list of things to be removed from the first file), below the delimiter line in the first file
- Choose Delete all non-unique lines from the Edit menu’s Line Operations submenu <— special note: fantasy Notepad++ feature that does not currently exist!!
- Remove the delimiter line added earlier and any lines that remain after it
After that the first file would contain the desired data.
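The steps above can be sketched in Python. This is only an illustration of the idea, not a Notepad++ feature: the function name `remove_listed` and the delimiter string are made up for the example, and the fantasy "Delete all non-unique lines" command is emulated with `collections.Counter`.

```python
from collections import Counter

DELIMITER = "<<<DELIMITER>>>"  # hypothetical marker; must not occur in either list


def remove_listed(first, second):
    """Append delimiter + second list, delete every line that occurs more
    than once, then cut off the delimiter and whatever follows it."""
    combined = first + [DELIMITER] + second
    counts = Counter(combined)
    # "Delete all non-unique lines": keep only lines that occur exactly once.
    unique_only = [line for line in combined if counts[line] == 1]
    # Remove the delimiter line and any lines that remain after it.
    return unique_only[:unique_only.index(DELIMITER)]


list1 = ["111", "222", "333", "444"]
list2 = ["222", "444", "999"]
print(remove_listed(list1, list2))  # ['111', '333']
```

One side effect of this approach: a line that is already duplicated inside the first list is removed too, even if it never appears in the second list.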
I’m fairly certain I’ve seen a “Delete all non-unique lines” (or maybe a “Keep only unique lines”) in a different editor, but I can’t for sure remember which one. Ultraedit? Hmm.
Anyway, we recently have had Delete Duplicate Lines functionality added, how about the addition of another new command?
@PeterJones Yep, don’t say it…I will…FEATURE REQUEST
-
Hello, @m-p, @peterjones, @alan-kilborn, @troshindv and All,
As @peterJones said, a simple regex S/R could work for files of moderate size. But with files of about 2,000,000 lines, this S/R would probably give totally wrong results because of the regex engine’s overflow issue :-((
But all is not lost! The problem is that, in huge files, there may be a very large gap between a line and its first duplicate. Luckily, this problem can be eliminated by using the following steps:
- First, number all the lines
- Then, sort the lines in ascending order
- Delete all lines which exist in more than 1 copy, which should be easy as these lines are now consecutive
- Re-sort all the remaining unique lines to restore their initial list order
Below, I’ll try to explain these steps with a short text. However, I am quite confident that this method should work with huge lists too, apart from the time needed, of course, to perform the sorts and the regex search/replacements!
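The four steps above can also be sketched outside Notepad++. The following Python is only an illustration of the idea, not the S/R procedure itself; the function name `keep_lines_seen_once` is made up for this example.

```python
from itertools import groupby


def keep_lines_seen_once(lines):
    # Step 1: number the lines so the original order can be restored later.
    numbered = [(text, index) for index, text in enumerate(lines, start=1)]
    # Step 2: sort, so that identical lines become consecutive.
    numbered.sort()
    # Step 3: delete every group of identical lines that has more than one copy.
    survivors = []
    for _, group in groupby(numbered, key=lambda pair: pair[0]):
        group = list(group)
        if len(group) == 1:
            survivors.append(group[0])
    # Step 4: re-sort by line number to restore the initial order.
    survivors.sort(key=lambda pair: pair[1])
    return [text for text, _ in survivors]


sample = ["Preamble", "NO WARRANTY", "Each version is given a", "NO WARRANTY"]
print(keep_lines_seen_once(sample))  # ['Preamble', 'Each version is given a']
```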
Let’s go :
- From the license.txt file, I extracted only the non-blank lines and shortened them to, roughly, their first 32 characters, ending with this 42-line text:
Preamble
The licenses for most software
When we speak of free software,
To protect your rights, we need
For example, if you distribute
We protect your rights with two
Also, for each author's protect
Finally, any free program is
The precise terms and condition
TERMS AND CONDITIONS FOR COPYING
0. This License applies to any
Activities other copying,
1. You may copy and distribute
You may charge a fee for the
2. You may modify your copy or
a) You must cause the modified
b) You must cause any work that
c) If the modified program
These requirements apply to the
Thus, it is not the intent of
In addition, mere aggregation
3. You may copy and distribute
a) Accompany it with the
b) Accompany it with a written
c) Accompany it with the
The source code for a work mean
If distribution of executable
4. You may not copy, modify,
5. You are not required to
6. Each time you redistribute
7. If, as a consequence of a
If any portion of this section
It is not the purpose of this
This section is intended to make
8. If the distribution and/or
9. The Free Software Foundation
Each version is given a
10. If you wish to incorporate
NO WARRANTY
11. BECAUSE THE PROGRAM IS
12. IN NO EVENT UNLESS REQUIRED
END OF TERMS AND CONDITIONS
- Then I appended, to this list, 33 lines out of these 42 lines (so about 80 % of the total, i.e. the same proportion as in your lists: 1,600,000 / 2,000,000)
No separation line is needed. Thus, we now start with this text, where the added lines begin at line 43:
Preamble
The licenses for most software
When we speak of free software,
To protect your rights, we need
For example, if you distribute
We protect your rights with two
Also, for each author's protect
Finally, any free program is
The precise terms and condition
TERMS AND CONDITIONS FOR COPYING
0. This License applies to any
Activities other copying,
1. You may copy and distribute
You may charge a fee for the
2. You may modify your copy or
a) You must cause the modified
b) You must cause any work that
c) If the modified program
These requirements apply to the
Thus, it is not the intent of
In addition, mere aggregation
3. You may copy and distribute
a) Accompany it with the
b) Accompany it with a written
c) Accompany it with the
The source code for a work mean
If distribution of executable
4. You may not copy, modify,
5. You are not required to
6. Each time you redistribute
7. If, as a consequence of a
If any portion of this section
It is not the purpose of this
This section is intended to make
8. If the distribution and/or
9. The Free Software Foundation
Each version is given a
10. If you wish to incorporate
NO WARRANTY
11. BECAUSE THE PROGRAM IS
12. IN NO EVENT UNLESS REQUIRED
END OF TERMS AND CONDITIONS
The licenses for most software
When we speak of free software,
To protect your rights, we need
We protect your rights with two
Also, for each author's protect
Finally, any free program is
The precise terms and condition
TERMS AND CONDITIONS FOR COPYING
Activities other copying,
1. You may copy and distribute
You may charge a fee for the
2. You may modify your copy or
b) You must cause any work that
c) If the modified program
These requirements apply to the
Thus, it is not the intent of
In addition, mere aggregation
b) Accompany it with a written
c) Accompany it with the
The source code for a work mean
If distribution of executable
4. You may not copy, modify,
5. You are not required to
6. Each time you redistribute
If any portion of this section
It is not the purpose of this
This section is intended to make
8. If the distribution and/or
9. The Free Software Foundation
10. If you wish to incorporate
NO WARRANTY
12. IN NO EVENT UNLESS REQUIRED
END OF TERMS AND CONDITIONS
Note: so, you agree that, after all this is done, we should be left with a text of 9 unique lines (42 - 33)!
- From the end of the first line, we add some space characters till, let’s say, column 110
- We open the column editor ( Alt + C )
- We select the Number to Insert option
- Type in the value 1 in each zone
- Tick the Leading zeros box
- Verify that the Dec format is ticked
- Click on the OK button
- Delete the last virtual line 76
=> We get this text:
Preamble                            01
The licenses for most software      02
When we speak of free software,     03
To protect your rights, we need     04
For example, if you distribute      05
We protect your rights with two     06
Also, for each author's protect     07
Finally, any free program is        08
The precise terms and condition     09
TERMS AND CONDITIONS FOR COPYING    10
0. This License applies to any      11
Activities other copying,           12
1. You may copy and distribute      13
You may charge a fee for the        14
2. You may modify your copy or      15
a) You must cause the modified      16
b) You must cause any work that     17
c) If the modified program          18
These requirements apply to the     19
Thus, it is not the intent of       20
In addition, mere aggregation       21
3. You may copy and distribute      22
a) Accompany it with the            23
b) Accompany it with a written      24
c) Accompany it with the            25
The source code for a work mean     26
If distribution of executable       27
4. You may not copy, modify,        28
5. You are not required to          29
6. Each time you redistribute       30
7. If, as a consequence of a        31
If any portion of this section      32
It is not the purpose of this       33
This section is intended to make    34
8. If the distribution and/or       35
9. The Free Software Foundation     36
Each version is given a             37
10. If you wish to incorporate      38
NO WARRANTY                         39
11. BECAUSE THE PROGRAM IS          40
12. IN NO EVENT UNLESS REQUIRED     41
END OF TERMS AND CONDITIONS         42
The licenses for most software      43
When we speak of free software,     44
To protect your rights, we need     45
We protect your rights with two     46
Also, for each author's protect     47
Finally, any free program is        48
The precise terms and condition     49
TERMS AND CONDITIONS FOR COPYING    50
Activities other copying,           51
1. You may copy and distribute      52
You may charge a fee for the        53
2. You may modify your copy or      54
b) You must cause any work that     55
c) If the modified program          56
These requirements apply to the     57
Thus, it is not the intent of       58
In addition, mere aggregation       59
b) Accompany it with a written      60
c) Accompany it with the            61
The source code for a work mean     62
If distribution of executable       63
4. You may not copy, modify,        64
5. You are not required to          65
6. Each time you redistribute       66
If any portion of this section      67
It is not the purpose of this       68
This section is intended to make    69
8. If the distribution and/or       70
9. The Free Software Foundation     71
10. If you wish to incorporate      72
NO WARRANTY                         73
12. IN NO EVENT UNLESS REQUIRED     74
END OF TERMS AND CONDITIONS         75
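For readers who want to reproduce the numbering step in a script, here is a rough Python equivalent of what the column editor does above. It is a sketch, not Notepad++ behaviour: `number_lines` is a made-up name, and it assumes `column` is larger than the longest line, so that at least one space separates the text from its number.

```python
def number_lines(lines, column=110):
    """Pad each line with spaces up to `column` characters, then append a
    zero-padded decimal line number (like "Leading zeros" + "Dec")."""
    width = len(str(len(lines)))  # e.g. 75 lines -> 2 digits
    return [
        f"{text.ljust(column)}{index:0{width}d}"
        for index, text in enumerate(lines, start=1)
    ]


demo = ["Preamble", "NO WARRANTY", "END OF TERMS AND CONDITIONS"]
for row in number_lines(demo, column=30):
    print(row)
```

The leading zeros matter later on: they let a plain text sort put the numbers back in numeric order.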
More in the next post !
guy038
-
-
Hello, @m-p, @peterjones, @alan-kilborn, @troshindv and All,
Continuation of my previous post !
- Then, we use the Edit > Line Operations > Sort Lines Lexicographically Ascending menu option, without any selection
=> The example text becomes:
0. This License applies to any      11
1. You may copy and distribute      13
1. You may copy and distribute      52
10. If you wish to incorporate      38
10. If you wish to incorporate      72
11. BECAUSE THE PROGRAM IS          40
12. IN NO EVENT UNLESS REQUIRED     41
12. IN NO EVENT UNLESS REQUIRED     74
2. You may modify your copy or      15
2. You may modify your copy or      54
3. You may copy and distribute      22
4. You may not copy, modify,        28
4. You may not copy, modify,        64
5. You are not required to          29
5. You are not required to          65
6. Each time you redistribute       30
6. Each time you redistribute       66
7. If, as a consequence of a        31
8. If the distribution and/or       35
8. If the distribution and/or       70
9. The Free Software Foundation     36
9. The Free Software Foundation     71
Activities other copying,           12
Activities other copying,           51
Also, for each author's protect     07
Also, for each author's protect     47
END OF TERMS AND CONDITIONS         42
END OF TERMS AND CONDITIONS         75
Each version is given a             37
Finally, any free program is        08
Finally, any free program is        48
For example, if you distribute      05
If any portion of this section      32
If any portion of this section      67
If distribution of executable       27
If distribution of executable       63
In addition, mere aggregation       21
In addition, mere aggregation       59
It is not the purpose of this       33
It is not the purpose of this       68
NO WARRANTY                         39
NO WARRANTY                         73
Preamble                            01
TERMS AND CONDITIONS FOR COPYING    10
TERMS AND CONDITIONS FOR COPYING    50
The licenses for most software      02
The licenses for most software      43
The precise terms and condition     09
The precise terms and condition     49
The source code for a work mean     26
The source code for a work mean     62
These requirements apply to the     19
These requirements apply to the     57
This section is intended to make    34
This section is intended to make    69
Thus, it is not the intent of       20
Thus, it is not the intent of       58
To protect your rights, we need     04
To protect your rights, we need     45
We protect your rights with two     06
We protect your rights with two     46
When we speak of free software,     03
When we speak of free software,     44
You may charge a fee for the        14
You may charge a fee for the        53
a) Accompany it with the            23
a) You must cause the modified      16
b) Accompany it with a written      24
b) Accompany it with a written      60
b) You must cause any work that     17
b) You must cause any work that     55
c) Accompany it with the            25
c) Accompany it with the            61
c) If the modified program          18
c) If the modified program          56
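The ordering seen above (digits first, then upper-case, then lower-case, with "10." before "2.") is what a plain character-by-character comparison gives. As a small illustration, Python's built-in `sorted` compares strings by code point and reproduces the same order on a few of the sample lines; this only mirrors the result shown above and says nothing about how Notepad++ implements its sort.

```python
lines = [
    "a) Accompany it with the",
    "You may charge a fee for the",
    "10. If you wish to incorporate",
    "2. You may modify your copy or",
    "NO WARRANTY",
]

# Character-by-character comparison: '1' < '2' < 'N' < 'Y' < 'a'
for line in sorted(lines):
    print(line)
```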
- Now, we open the Replace dialog ( Ctrl + H )
- SEARCH ^(.+)(\x20+\d+\R)(\1(?2))+
- REPLACE Leave EMPTY
- Tick the Wrap around option
- Select the Regular expression search mode
- Click on the Replace All button
=> You should get the status message 33 occurrences were replaced, leading to this text:
0. This License applies to any      11
11. BECAUSE THE PROGRAM IS          40
3. You may copy and distribute      22
7. If, as a consequence of a        31
Each version is given a             37
For example, if you distribute      05
Preamble                            01
a) Accompany it with the            23
a) You must cause the modified      16
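To experiment with this search outside Notepad++, here is an approximate Python translation. Python's `re` module supports neither `\R` nor the `(?2)` subroutine call, so the repeated part is written out in full and line endings are assumed to be `\n`; the names `DUPLICATE_RUN` and `drop_duplicated_runs` are invented for this sketch.

```python
import re

# Group 1 is the line text; a run is that same text repeated on the following
# line(s), each time followed by spaces, a number and a line break.
DUPLICATE_RUN = re.compile(r"^(.+)\x20+\d+\n(?:\1\x20+\d+\n)+", re.MULTILINE)


def drop_duplicated_runs(sorted_text):
    """Delete every run of consecutive lines that differ only by their number.
    Returns the new text and the number of runs removed."""
    return DUPLICATE_RUN.subn("", sorted_text)


sorted_text = (
    "NO WARRANTY       39\n"
    "NO WARRANTY       73\n"
    "Preamble          01\n"
    "a) Accompany it   23\n"
    "a) You must cause 16\n"
)
result, count = drop_duplicated_runs(sorted_text)
print(count)  # 1
print(result)
```

Each match covers a whole group of identical lines, which is why the count reported is the number of duplicated groups rather than the number of deleted lines.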
-
Although it would be possible to use a column mode selection to sort the lines by the number at the end of each line, I’m not sure it would work properly with a huge list. So, I prefer to take a safer method and perform another regex S/R:
- SEARCH ^(.+?)\x20{2,}(\d+)
- REPLACE \2\t\t\1
And we end with:
11		0. This License applies to any
40		11. BECAUSE THE PROGRAM IS
22		3. You may copy and distribute
31		7. If, as a consequence of a
37		Each version is given a
05		For example, if you distribute
01		Preamble
23		a) Accompany it with the
16		a) You must cause the modified
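The same replacement can be tried in Python, where this particular pattern and replacement string work unchanged with `re.sub` in multiline mode. The names `MOVE_NUMBER` and `number_first` are made up for this sketch.

```python
import re

# Lazy group 1 = line text, then at least two spaces, then the trailing number.
MOVE_NUMBER = re.compile(r"^(.+?)\x20{2,}(\d+)", re.MULTILINE)


def number_first(text):
    """Rewrite 'text   NN' as 'NN<TAB><TAB>text' on every line."""
    return MOVE_NUMBER.sub(r"\2\t\t\1", text)


print(number_first("Preamble          01\nEach version is given a    37\n"))
```

Note the `\x20{2,}`: a line separated from its number by a single space would be left unchanged, so the padding column has to leave at least two spaces after the longest line.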
- Again, we use the Edit > Line Operations > Sort Lines Lexicographically Ascending menu option, to restore the initial file order, giving:
01		Preamble
05		For example, if you distribute
11		0. This License applies to any
16		a) You must cause the modified
22		3. You may copy and distribute
23		a) Accompany it with the
31		7. If, as a consequence of a
37		Each version is given a
40		11. BECAUSE THE PROGRAM IS
And, finally, we perform a last regex S/R, below, to get rid of the temporary numbering!
- SEARCH ^\d+\t+
- REPLACE Leave EMPTY
=> Our expected text, with the 9 unique lines:
Preamble
For example, if you distribute
0. This License applies to any
a) You must cause the modified
3. You may copy and distribute
a) Accompany it with the
7. If, as a consequence of a
Each version is given a
11. BECAUSE THE PROGRAM IS
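The last two operations (sorting on the leading number, then stripping it) can be sketched in Python as follows; `restore_order` is a hypothetical helper name. The example also shows why the tab separator is useful: a line whose own text starts with digits, such as "0. This License", is not damaged, because `^\d+\t+` only matches digits that are directly followed by a tab.

```python
import re


def restore_order(numbered_lines):
    """Sort 'NN<TAB><TAB>text' lines, then strip the temporary numbering."""
    # Leading zeros make the lexicographic order equal to the numeric order.
    ordered = sorted(numbered_lines)
    return [re.sub(r"^\d+\t+", "", line) for line in ordered]


print(restore_order([
    "11\t\t0. This License applies to any",
    "05\t\tFor example, if you distribute",
    "01\t\tPreamble",
]))
# ['Preamble', 'For example, if you distribute', '0. This License applies to any']
```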
Best Regards,
guy038
-
Perhaps that regex-intensive solution becomes the de facto standard way of solving this problem.
But, it might be nice to see in Notepad++ itself, a command to “Remove lines from primary view tab that occur in secondary view tab”, or some such less-wordy verbiage.
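What such a command would compute can be sketched in a few lines of Python. This is purely illustrative of the wished-for behaviour (no such Notepad++ command exists, as said above), and the function name `remove_lines_found_in` is invented here.

```python
def remove_lines_found_in(primary, secondary):
    """Keep, in their original order, the primary lines absent from secondary."""
    unwanted = set(secondary)  # set lookup keeps this fast for long lists
    return [line for line in primary if line not in unwanted]


primary = ["111", "222", "333", "222", "555", "555"]
secondary = ["222", "999"]
print(remove_lines_found_in(primary, secondary))  # ['111', '333', '555', '555']
```

Unlike the sort-and-delete method, this keeps lines that are duplicated inside the primary list itself (the two "555" here), as long as they do not occur in the secondary list.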
-
@Alan-Kilborn said in Deleting numbers from LIST 1, that also appear in LIST 2:
Perhaps that regex-intensive solution becomes the de facto standard way of solving this problem.
Will be for small volumes.
The volume of data dictates its own terms.
PS. It is better to wrap all actions in a macro.
-
@TroshinDV said in Deleting numbers from LIST 1, that also appear in LIST 2:
Will be for small volumes.
The volume of data dictates its own terms.
That doesn’t make sense, as the solution crafted by @guy038 was specifically considering large “volumes”.
PS. It is better to wrap all actions in a macro.
I don’t believe @guy038’s solution can be made into a macro; can you explain how you think it can be?
-
Well, there were definitely some problems. LIST 1 has 7.4 million numbers whilst LIST 2 has 1.2 million numbers. I believe that this is definitely too much to deal with. I’ll try the method of @guy038 now, even though I’m not sure I understood it all. Let’s try it at least.
-
Hello, @m-p, @peterjones, @alan-kilborn, @troshindv and All,
Oh…! Indeed, dealing with two files of 7,400,000 and 1,200,000 lines is not an easy task! So you will have to work with an 8,600,000-line file: good luck!
Do not hesitate to ask me for more information if you encounter difficulties in implementing my method!
- First, I would advise you to repeat my own tiny example, to get its general idea
- Regarding your real example, I would say that:
  - The N++ sort feature is very quick, in all cases
  - I suppose that the numbering operation, with the column editor, should not take very long either!
  - The first of the three regex S/R will probably take some time. Just be patient: it should work in the end!
Best Regards
guy038
-
-