Need help filtering lines starting with same strings

Jamie W

Hello, I have some very large files which I need to filter. The files contain a lot of lines starting with the same characters, and I want to keep only the unique lines. If there are 5 lines starting with the same x number of characters, I want to bookmark/remove them.
Example

Bob1919:12345
Bob1:12345
Bob1919:982623
Sam10:12345
Bob1919:55555
Alex:888888

I want the result:

Bob1:12345
Alex:888888
Sam10:12345

Due to the latter part of the lines being different, it isn’t possible to sort by occurrence and remove the most frequently occurring lines. It also isn’t possible to bookmark Bob1919, because there are too many names like this, meaning you’d have to bookmark those lines one by one. Thanks!

guy038

Hello @jamie-w and All,

Why not a simple regex S/R ?

Open the Replace dialog ( Ctrl + H )
- SEARCH (?-si)^Bob1919.+\R OR (?i-s)^Bob1919.+\R for a search insensitive to case
- REPLACE Leave EMPTY
Tick the Wrap around option
Click on the Replace All button

Voila !

Probably, you would get the same result, marking lines containing the string Bob1919 and then deleting all bookmarked lines but I presume it won’t be as fast as the search/replacement !
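
For anyone who would rather script this outside Notepad++, here is a rough Python sketch of the same idea. It is only an illustration: the function name remove_lines_starting_with is made up for this example, and because Python's re module has no \R, the line ending is written as \r?\n instead.

```python
import re

def remove_lines_starting_with(text, prefix, ignore_case=False):
    """Delete every line that begins with `prefix` (hypothetical helper)."""
    flags = re.MULTILINE
    if ignore_case:
        flags |= re.IGNORECASE
    # ^prefix, then the rest of the line, then the line break.
    # \r?\n stands in for Notepad++'s \R (assumption: LF or CRLF files).
    pattern = re.compile("^" + re.escape(prefix) + r".+\r?\n", flags)
    return pattern.sub("", text)

sample = "Bob1919:12345\nBob1:12345\nBob1919:982623\nSam10:12345\n"
cleaned = remove_lines_starting_with(sample, "Bob1919")
print(cleaned)
```

As in the dialog version, a line is only removed when it ends with a line break, so a final line without one would be left alone.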

Best Regards,

guy038

Jamie W

@guy038

I need to do this for thousands of different usernames. Not just Bob1919, that was one example.
It would not be efficient to find the names 1 by 1 and search. I need something that bookmarks lines if there are more than 5 lines which start with the same string. Sorry for my bad explanation.

guy038

Hi, @jamie-w,

OK, I see !

Would you mind if your files need to be sorted ? Indeed, after deleting the lines whose beginning, let’s say, up to the colon, occurs more than 5 times, the remaining lines would no longer be in their initial order !

BR

guy038

Jamie W

Yes, this would be perfect. If there is a way to make it so lines which are the same from ^beginning to :, it would resolve my problem. How can I achieve this result?

Terry R

@Jamie-W said in Need help filtering lines starting with same strings:

If there is a way to make it so lines which are the same from ^beginning to :

Hi, welcome to the NPP forum. I saw your question and instantly thought of a solution. Given you are OK with sorting the file so that ALL similar lines are together, it makes for an easy regular expression (regex) solution. So first, you MUST sort the lines lexicographically ascending (actually, even descending should work). This is done by selecting the Edit menu, then Line Operations, then Sort Lines Lexico…

So, using the “Replace” function we have:
Find What: (?-s)^([^:]+:).+?\R(\1.+?\R){4,}
Replace With: (leave empty)
Make sure “Search Mode” is set to “Regular expression”. “Wrap around” should probably NOT be ticked, as we need to make sure the cursor is at the very first position in the file, although with the cursor positioned correctly it won’t matter.

You should have the cursor before the first column on the first line, so that the first line is included in any run of “similar duplicates”. This is very important.

To give some background on what the regex is doing:
(?-s) means the . dot character cannot match carriage return/line feed characters.
^ means start at the very first position on any line, also not important if the cursor is in the correct position.
([^:]+:).+?\R captures the characters up to and including the : as group 1 (the part inside the first pair of brackets), then also matches the remainder of the line including the carriage return/line feed.
(\1.+?\R){4,} looks for a following line that begins with the same text as group 1, that means up to and including the :, and if found also matches the rest of that line and its carriage return/line feed. The {4,} requires that we find at least 4 such following lines, so at least 5 lines in total with the same beginning.

Use the “Replace All” button and all “duplicate copies” (must be at least 5 lines of the same starting characters) will be removed as we have nothing in the “replace with” field.

By changing the 4 in the {4,} to any number you can adjust how many copies must exist in order to be removed. Note that as we are using the : as a delimiter we don’t need to specify how many characters MUST be considered. The first line tested will decide that for each sequence found.
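
If you want to try the logic outside Notepad++, the sort-then-replace approach can be sketched in Python roughly as follows. This is an approximation, not the exact Notepad++ behaviour: the function name drop_repeated_prefixes and the min_repeats parameter are invented for the example, \R is replaced by \n (Python's re module does not support \R), and the character class is written [^:\n] so the prefix cannot run across a line break.

```python
import re

def drop_repeated_prefixes(text, min_repeats=4):
    """Sort the lines, then delete every run of lines sharing the same
    'prefix:' when the run has at least min_repeats + 1 lines."""
    # Sort first so lines with the same prefix sit next to each other.
    lines = sorted(text.splitlines())
    # Give every line, including the last, a trailing newline.
    sorted_text = "".join(line + "\n" for line in lines)
    # Same shape as the Find What expression, with {4,} made adjustable.
    pattern = re.compile(
        r"^([^:\n]+:).+\n(\1.+\n){" + str(min_repeats) + r",}",
        re.MULTILINE,
    )
    return pattern.sub("", sorted_text)

sample = (
    "Bob1919:1\nSam10:12345\nBob1919:2\nBob1:12345\n"
    "Bob1919:3\nAlex:888888\nBob1919:4\nBob1919:5"
)
result = drop_repeated_prefixes(sample)
print(result)
```

Raising or lowering min_repeats plays the same role as changing the 4 in {4,}.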

Give it a go and let us know how you got on. Possibly there might be adjustments required, but so long as your example data was representative of the real data, my test on it (also adding some other copies of it) worked as expected.

Terry

Jamie W

@Terry-R said in Need help filtering lines starting with same strings:

(?-s)^([^:]+:).+?\R(\1.+?\R){4,}

Hello Terry, thank you so much for your help with this. And to you @guy038. Couldn’t find this anywhere on the web and it has saved me many hours of work. Is there a way I can donate/tip to you or the NPP Community? Thanks again.

Terry R

@Jamie-W said in Need help filtering lines starting with same strings:

Is there a way I can donate/tip to you or the NPP Community?

Most certainly, if you went to:
https://notepad-plus-plus.org/donate/
that’s the location to “pay back/forward” if you wish. I presume my regex worked without any issues. Don’t feel that you MUST, but it is nice to get feedback. It’s also nice to “upvote” on posts if you agree/like the information. That’s possible by using the ^ character just below each post on the right side. If anything, getting positive feedback is what mostly drives us volunteers on the forum, we pay it forward by helping.

Terry

guy038

Hello, @jamie-w, @terry-R and All,

Ah, Terry, nice shot ! Just one question : why do you add the lazy quantifier +? to get the end of lines after the colon char ? I suppose the normal greedy one ( + ) should work as well !

On the contrary, I would use it for the part of the text before the : char. Thus, this new version :

SEARCH (?-s)^(.+?:).+\R(\1.+\R){4,}
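
A quick way to check that both variants behave the same on simple data is a small Python comparison. It is only a sketch under assumptions: \R is replaced by \n because Python's re module has no \R, [^:]+ is written [^:\n]+ to keep the prefix on one line, and the sample data is invented.

```python
import re

# Original form: lazy quantifier on the tail of each line.
LAZY_TAIL = re.compile(r"^([^:\n]+:).+?\n(\1.+?\n){4,}", re.MULTILINE)
# Revised form: lazy quantifier on the prefix, greedy on the tail.
GREEDY_TAIL = re.compile(r"^(.+?:).+\n(\1.+\n){4,}", re.MULTILINE)

# Five lines sharing one prefix, followed by two unrelated lines (already sorted).
sample = "".join(f"Bob1919:{n}\n" for n in range(5)) + "Bob1:12345\nSam10:12345\n"

lazy_result = LAZY_TAIL.sub("", sample)
greedy_result = GREEDY_TAIL.sub("", sample)
print(lazy_result == greedy_result)
```

Since the dot cannot cross a line break here, the tail of each line is consumed up to the newline either way.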

Now, after posting my question to @jamie-w, about sorting, I thought of a method, a bit longer, which would keep the initial order of lines :

First, we would number all lines of file with the column editor ( Alt + C ), adding this number at the end of lines
Then we would sort text lexicographically ascending
Now, we would perform the regex S/R, deleting lines with more than X identical beginnings
Then, we would move the numbers from end to beginning of each line, with another regex S/R
Again, we would sort all the remaining lines of the file
And, finally, we would delete the temporary numbering, at beginning of each line
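
The same number / sort / delete / re-sort idea can be sketched in Python, in case a script is more convenient than several S/R passes. This is only an illustration: the function name filter_keep_order and the threshold parameter are invented, and the grouping is done with itertools.groupby instead of a regex.

```python
from itertools import groupby

def prefix_key(numbered_line):
    """Text up to and including the first colon (whole line if no colon)."""
    head, sep, _ = numbered_line[1].partition(":")
    return head + sep

def filter_keep_order(lines, threshold=5):
    """Drop every group of `threshold` or more lines sharing the same
    'prefix:' while keeping the surviving lines in their original order."""
    # Number the lines so the original order can be restored later.
    numbered = list(enumerate(lines))
    # Sort by text so lines with the same prefix become adjacent.
    numbered.sort(key=lambda pair: pair[1])
    # Keep only the groups that are smaller than the threshold.
    kept = []
    for _, group in groupby(numbered, key=prefix_key):
        members = list(group)
        if len(members) < threshold:
            kept.extend(members)
    # Sort by the saved line numbers, then drop the numbering.
    kept.sort(key=lambda pair: pair[0])
    return [text for _, text in kept]

lines = [
    "Bob1919:1", "Sam10:12345", "Bob1919:2", "Bob1:12345",
    "Bob1919:3", "Alex:888888", "Bob1919:4", "Bob1919:5",
]
print(filter_keep_order(lines))
```

With threshold=5 this removes the same groups as the {4,} regex, that is, five or more lines with the same beginning.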

@jamie-w, you said :

Is there a way I can donate/tip to you or the NPP Community?

We really appreciate it !

Best Regards,

guy038

Terry R

@guy038 said in Need help filtering lines starting with same strings:

Ah, Terry, nice shot ! Just one question :

I guess I’m not a greedy person😉. Actually I guess I just didn’t proof my solution well enough. I had a concept in my mind, quickly tested it, found it worked and posted.

Of course, as I was intending to grab the whole line, use of the lazy quantifier was just a rookie mistake. At least no harm, no foul, eh? (<-- oops, there it goes again)

Terry