Delete Partially Duplicate Lines
-
“<File>*C:\temp\wow64_curl_31bf3856ad364e35_10.0.17763.1_none_79864446a9b50636.manifest</File>”
“<File>*C:\temp\wow64_curl_31bf3856ad364e35_10.0.17763.2452_none_d64763ca235f44ee.manifest</File>”

“<File>*C:\temp\wow64_desktop_shell-search-srchadmin_31bf3856ad364e35_7.0.17763.1697_none_25f969e6c88ea71c.manifest</File>”
“<File>*C:\temp\wow64_desktop_shell-search-srchadmin_31bf3856ad364e35_7.0.17763.1_none_ee97cccf7037d170.manifest</File>”

“<File>*C:\temp\wow64_fdssdp_31bf3856ad364e35_10.0.17763.1697_none_92a793d47a7fee4c.manifest</File>”
“<File>*C:\temp\wow64_fdssdp_31bf3856ad364e35_10.0.17763.1_none_35f886cf00c88f58.manifest</File>”

“<File>*C:\temp\wow64_libarchive-internal_31bf3856ad364e35_10.0.17763.1_none_75bc5302db7de6ed.manifest</File>”
“<File>*C:\temp\wow64_libarchive-internal_31bf3856ad364e35_10.0.17763.2452_none_d27d7286552825a5.manifest</File>”

3 files, 2 versions = 6 files
You could say they are partially duplicated.
For this to work, the following must be accomplished:
You can assume all lines contain 17763.
Find 17763.
Compare the text preceding 17763 to the text preceding the next 17763. If the two match, delete both lines.
-
@e-l said in Delete Partially Duplicate Lines:
Compare the text preceding 17763 to the text preceding the next 17763. If the two match, delete both lines.
Your example seems to imply there are sets of 2 lines followed by an empty line. Assuming this is the case the following regex (regular expression) will likely work.
We will use the “Mark” function (Search, Mark, Ctrl+M default shortcut). Enter the following into the Find What window:
Find What: (?-s)(.+?)17763.+\R\117763.+\R\R?
Set the Search mode to “Regular Expression” and tick “Bookmark line”. Click Mark All, and each set of 2 lines with “partial duplicate” text preceding the 17763 string, along with the following empty line, will be bookmarked with a blue circle.
Once you have verified a sample of those marked lines, you can right-click in the left margin (where the blue circle is) and select “Remove Bookmarked Lines”.
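For anyone who wants to try the pairing idea outside Notepad++, here is a minimal Python sketch. It is not the Notepad++ engine, so treat it as an approximation under stated assumptions: Python's re module has no \R, so a line break is spelled out as (?:\r\n|\n|\r), and the backreference is written as a named group because \1 followed by digits would be read differently in Python. The leading ^ anchor, the function name drop_partial_duplicates and the "wow64_other" line are additions for illustration only.

```python
import re

# Python's re has no \R, so spell out the line-break alternatives.
NL = r"(?:\r\n|\n|\r)"

# Rough equivalent of (?-s)(.+?)17763.+\R\117763.+\R\R?
# The ^ with re.MULTILINE is an addition that pins the prefix to a line start.
PAIR = re.compile(
    rf"^(?P<prefix>.+?)17763.+{NL}(?P=prefix)17763.+{NL}{NL}?",
    re.MULTILINE,
)

def drop_partial_duplicates(text: str) -> str:
    """Remove each pair of consecutive lines sharing the text before 17763."""
    return PAIR.sub("", text)

sample = (
    "wow64_curl_31bf3856ad364e35_10.0.17763.1_none_79864446a9b50636.manifest\n"
    "wow64_curl_31bf3856ad364e35_10.0.17763.2452_none_d64763ca235f44ee.manifest\n"
    "\n"
    # Hypothetical unpaired line, not from the original data:
    "wow64_other_31bf3856ad364e35_10.0.17763.1_none_0000.manifest\n"
)
print(drop_partial_duplicates(sample))
```

In this sketch the two curl lines and the empty line after them are removed, while the unpaired line is kept because the line after it does not start with the same prefix.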
If this doesn’t work for you it will likely be as a result of a misrepresentation of your example data. In that case you need to read the “pinned” post at the start of the Help wanted section called “Please read this before posting”. It outlines the need for examples to appear inside of black boxes and how to achieve it.
Terry
-
I figured it out.
First the text must be reverse sorted via https://www.textfixer.com/tools/remove-duplicate-lines.php
Result:
“*C:\temp\wow64_libarchive-internal_31bf3856ad364e35_10.0.17763.2452_none_d27d7286552825a5.manifest”
“*C:\temp\wow64_libarchive-internal_31bf3856ad364e35_10.0.17763.1_none_75bc5302db7de6ed.manifest”
“*C:\temp\wow64_fdssdp_31bf3856ad364e35_10.0.17763.1697_none_92a793d47a7fee4c.manifest”
“*C:\temp\wow64_fdssdp_31bf3856ad364e35_10.0.17763.1_none_35f886cf00c88f58.manifest”
“*C:\temp\wow64_desktop_shell-search-srchadmin_31bf3856ad364e35_7.0.17763.1697_none_25f969e6c88ea71c.manifest”
“*C:\temp\wow64_desktop_shell-search-srchadmin_31bf3856ad364e35_7.0.17763.1_none_ee97cccf7037d170.manifest”
“*C:\temp\wow64_curl_31bf3856ad364e35_10.0.17763.2452_none_d64763ca235f44ee.manifest”
“*C:\temp\wow64_curl_31bf3856ad364e35_10.0.17763.1_none_79864446a9b50636.manifest”
- Then this must be copied into Notepad++ with no blank lines.
- Ctrl+H
- Find what: .*17763.[123456789][0123456789].*\R.*(?:\R|$)
- Replace with: (leave empty)
- check Wrap around
- check Regular expression
- Do NOT check “. matches newline”
- Replace all
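The replace steps above can be approximated in Python as a rough sketch. Assumptions: Python's re has no \R, so (?:\r\n|\n|\r) stands in for it; \Z (end of text) stands in for $; the *C:\temp\ prefix is dropped from the sample lines to avoid backslash escaping; and the function name delete_matched_pairs is made up for this example.

```python
import re

NL = r"(?:\r\n|\n|\r)"  # stand-in for \R, which Python's re lacks

# Rough equivalent of .*17763.[123456789][0123456789].*\R.*(?:\R|$)
# \Z (end of text) stands in for $ so the last line need not end in a break.
PAIR = re.compile(rf".*17763.[123456789][0123456789].*{NL}.*(?:{NL}|\Z)")

def delete_matched_pairs(text: str) -> str:
    """Delete each matching line together with the line that follows it."""
    return PAIR.sub("", text)

# The reverse-sorted lines from the post, in the order shown there.
reverse_sorted = "\n".join([
    "wow64_libarchive-internal_31bf3856ad364e35_10.0.17763.2452_none_d27d7286552825a5.manifest",
    "wow64_libarchive-internal_31bf3856ad364e35_10.0.17763.1_none_75bc5302db7de6ed.manifest",
    "wow64_fdssdp_31bf3856ad364e35_10.0.17763.1697_none_92a793d47a7fee4c.manifest",
    "wow64_fdssdp_31bf3856ad364e35_10.0.17763.1_none_35f886cf00c88f58.manifest",
    "wow64_desktop_shell-search-srchadmin_31bf3856ad364e35_7.0.17763.1697_none_25f969e6c88ea71c.manifest",
    "wow64_desktop_shell-search-srchadmin_31bf3856ad364e35_7.0.17763.1_none_ee97cccf7037d170.manifest",
    "wow64_curl_31bf3856ad364e35_10.0.17763.2452_none_d64763ca235f44ee.manifest",
    "wow64_curl_31bf3856ad364e35_10.0.17763.1_none_79864446a9b50636.manifest",
])

print(repr(delete_matched_pairs(reverse_sorted)))  # ''
```

The expression only matches a line whose revision after 17763 has at least two digits, so the order of the lines matters: if such a line comes second in its pair, it is deleted together with the first line of the next pair instead of its own partner.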
This is a change of strategy. I said before to focus on the text preceding 17763, but it turns out there was an easier way. It is crucial that the lines are sorted, and specifically reverse sorted, because the expression deletes the line where it matches and the line that follows it.
As indicated, the desired output was blank, but in changing my strategy I botched my test files test17763…
For reference:
```
.17763.[123456789][0123456789] : deletes within asterix, dot after 17763 is literal, square brackets each pair indicates 1 acceptable character
\R : any kind of line break
.* : 0 or more of any character but newline
(?: : start non-capturing group
\R : any kind of line break
| : OR
$ : end of line
) : end group
```
-
@e-l said in Delete Partially Duplicate Lines:
.17763.[123456789][0123456789] : deletes within asterix, dot after 17763 is literal, square brackets each pair indicates 1 acceptable character
There are some issues with the description of your regex. The first wrong belief is that the DOT following the string 17763 is a literal DOT. Instead it represents a single character position, whatever that character is; a literal DOT would have to be escaped as \. to match only a dot. Given you say “do not check . matches newline”, the DOT will not match end-of-line characters.
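A small Python sketch of the difference, for illustration only. Python's re module is a different engine from the one in Notepad++, but it treats an unescaped and an escaped dot in the same way described here; the test strings are made up.

```python
import re

any_char = re.compile(r"17763.1")      # unescaped dot: any single character
literal_dot = re.compile(r"17763\.1")  # escaped dot: only a real "."

for s in ["17763.1", "17763_1", "17763\n1"]:
    # Columns: the string, match with unescaped dot, match with escaped dot
    print(repr(s), bool(any_char.search(s)), bool(literal_dot.search(s)))
```

The unescaped form accepts both "17763.1" and "17763_1", the escaped form accepts only "17763.1", and neither accepts the string with a line break in that position.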
The second inaccuracy is thinking you need to list every digit within each of your 2 [ ] groups. You can represent these as [1-9] and [0-9], as each group is made up of consecutive characters. If, for example, you wanted to match [01236789], you would write this as [0-36-9].
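As a quick check of the range shorthand, here is a minimal Python sketch (character classes behave the same way in Python's re for these simple cases; the helper name matched is made up for the example). It compares which digits each long form and its shortened form accept.

```python
import re

digits = "0123456789"
pairs = [
    ("[123456789]", "[1-9]"),
    ("[0123456789]", "[0-9]"),
    ("[01236789]", "[0-36-9]"),
]

def matched(char_class: str) -> str:
    """Return the digits that the given character class accepts."""
    return "".join(re.findall(char_class, digits))

for long_form, short_form in pairs:
    # Prints True when both forms accept exactly the same digits.
    print(long_form, short_form, matched(long_form) == matched(short_form))
```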
Also be careful with the .* as this is greedy. If your line contains multiple 17763 strings, the greedy quantifier will treat the last 17763 string as the important one. This is probably not a problem here, but in other tests you may get incorrect results.
I think you need to start with the basics of learning regex. If you look in our FAQ section you will find posts that will steer you in the right direction. I feel you may need to undo some of your beliefs. Be aware that there are several flavors of regex. By reading our FAQ you will see which one is used within Notepad++.
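To illustrate the caution about the greedy .* with something runnable, here is a minimal Python sketch. The sample line is hypothetical (none of the lines in this thread contain 17763 twice), and Python's re is used only because greedy and lazy quantifiers behave the same way there.

```python
import re

line = "x_17763_y_17763_z"  # hypothetical line with two 17763 strings

# Greedy: .* grabs as much as possible, so it stops at the LAST 17763.
greedy = re.match(r"(.*)17763", line).group(1)

# Lazy: .*? grabs as little as possible, so it stops at the FIRST 17763.
lazy = re.match(r"(.*?)17763", line).group(1)

print(greedy)  # x_17763_y_
print(lazy)    # x_
```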
In the end I suppose what I say may well fall on deaf ears. You have something that (in your mind) achieves the desired result. So good on you for trying it yourself.
Terry