    Need help filtering lines starting with same strings

    • Jamie W

      Hello, I have some very large files which I need to filter. The files contain a lot of lines starting with the same characters, and I want to keep only the unique lines. If there are 5 lines starting with the same first x characters, I want to bookmark or remove them.
      Example

      Bob1919:12345
      Bob1:12345
      Bob1919:982623
      Sam10:12345
      Bob1919:55555
      Alex:888888

      I want the result:

      Bob1:12345
      Alex:888888
      Sam10:12345

      Because the latter part of each line is different, it isn’t possible to sort by occurrence and remove the most frequently occurring lines. It also isn’t possible to bookmark Bob1919, because there are too many names like this, meaning you’d have to bookmark those lines one by one. Thanks!

      • guy038

        Hello @jamie-w and All,

        Why not a simple regex S/R ?

        • Open the Replace dialog ( Ctrl + H )

          • SEARCH (?-si)^Bob1919.+\R    OR    (?i-s)^Bob1919.+\R for a search insensitive to case

          • REPLACE Leave EMPTY

        • Tick the Wrap around option

        • Click on the Replace All button

        Voila !


        You would probably get the same result by marking the lines containing the string Bob1919 and then deleting all bookmarked lines, but I presume it won’t be as fast as the search/replacement !
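For anyone who would rather script this outside Notepad++, here is a minimal Python sketch of the same idea. It is an illustration only, not part of the Notepad++ solution: the function name `drop_lines_starting_with` is made up for this example, and because Python's `re` module has no `\R`, the line break is spelled out by hand.

```python
import re

# Python's re module does not support \R, so the line break is written out.
EOL = r"(?:\r\n|\n|\r)"


def drop_lines_starting_with(text, prefix, ignore_case=False):
    """Delete every line that starts with `prefix` (hypothetical helper)."""
    flags = re.MULTILINE
    if ignore_case:
        flags |= re.IGNORECASE
    # Same shape as ^Bob1919.+\R : prefix, rest of the line, then the line break.
    pattern = re.compile("^" + re.escape(prefix) + ".+" + EOL, flags)
    return pattern.sub("", text)


sample = (
    "Bob1919:12345\n"
    "Bob1:12345\n"
    "Bob1919:982623\n"
    "Sam10:12345\n"
    "Bob1919:55555\n"
    "Alex:888888\n"
)
print(drop_lines_starting_with(sample, "Bob1919"))
```

As with the regex above, a final line without a trailing line break would not be matched.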

        Best Regards,

        guy038

        • Jamie W

          @guy038

          I need to do this for thousands of different usernames, not just Bob1919; that was only one example.
          It would not be efficient to find the names one by one and search for each. I need something that bookmarks lines if there are more than 5 lines which start with the same string. Sorry for my bad explanation.

          • guy038

            Hi, @jamie-w,

            OK, I see !

            Would you mind if your files need sorting ? Indeed, after deleting the lines whose beginning, let’s say up to the colon, occurs more than 5 times, the remaining lines would no longer be in their initial order !

            BR

            guy038

            • Jamie W

              Yes this would be perfect. If there is a way to make it so lines which are the same from ^beginning to :, it would resolve my problem. How can I achieve this result?

              • Terry R

                @Jamie-W said in Need help filtering lines starting with same strings:

                If there is a way to make it so lines which are the same from ^beginning to :

                Hi, welcome to the NPP forum. I saw your question and instantly thought of a solution. Given that you would be OK with sorting the file so ALL similar lines are together, it makes for an easy regular expression (regex) solution. So first, you MUST sort the lines lexicographically ascending; actually, even descending should work. This is done by selecting the Edit menu, then Line Operations, then Sort Lines Lexico…

                So, using the “Replace” function we have:
                Find What: (?-s)^([^:]+:).+?\R(\1.+?\R){4,}
                Replace With: (leave empty)
                Make sure “Search Mode” is set to “Regular expression”. “Wrap around” probably should NOT be ticked, because the search needs to start with the cursor at the very first position in the file; with the cursor correctly positioned it won’t matter either way.

                You should have the cursor before the first column on the first line, so that the search will include any “similar duplicates” that involve the first line. This is very important.

                To give some background on what the regex is doing:
                (?-s) means the . (dot) character cannot match carriage return/line feed characters.
                ^ means start at the very first position on any line; this is also not important if the cursor is in the correct position.
                ([^:]+:).+?\R identifies the characters up to and including the :, which become group 1 (the part inside the first pair of parentheses), then also matches the remainder of the line including the carriage return/line feed.
                (\1.+?\R){4,} looks for a “duplicate” of the first line, that is, a line with the same characters up to the :, and if found also matches the rest of that line and its carriage return/line feed. The {4,} requires at least 4 further such lines after the first one, so at least 5 “duplicates” in total.

                Use the “Replace All” button and all “duplicate copies” (must be at least 5 lines of the same starting characters) will be removed as we have nothing in the “replace with” field.

                By changing the 4 in the {4,} to any number you can adjust how many copies must exist in order to be removed. Note that as we are using the : as a delimiter we don’t need to specify how many characters MUST be considered. The first line tested will decide that for each sequence found.
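The sort-then-replace procedure above can also be sketched in Python for testing the idea on sample data. This is an assumption-laden illustration rather than what Notepad++ does internally: the helper name `remove_repeated_prefixes` is invented, line endings are normalised to `\n` so that `\n` can stand in for `\R`, and `[^:\n]` is used instead of `[^:]` so the prefix cannot run across a line break.

```python
import re


def remove_repeated_prefixes(text, min_lines=5):
    """Sort the lines, then delete runs of >= min_lines lines sharing a prefix.

    Hypothetical helper; the prefix is everything up to and including
    the first colon on the line.
    """
    # Sorting puts all lines with the same prefix next to each other.
    lines = sorted(text.splitlines())
    sorted_text = "".join(line + "\n" for line in lines)

    # {4,} in the forum regex corresponds to min_lines - 1 further lines.
    repeat = "{%d,}" % (min_lines - 1)
    pattern = re.compile(
        r"^([^:\n]+:).+\n(?:\1.+\n)" + repeat,
        re.MULTILINE,
    )
    return pattern.sub("", sorted_text)


sample = (
    "Bob1919:1\nBob1:12345\nBob1919:2\nSam10:12345\n"
    "Bob1919:3\nAlex:888888\nBob1919:4\nBob1919:5\n"
)
print(remove_repeated_prefixes(sample))
```

Passing a different `min_lines` plays the same role as changing the 4 in `{4,}`.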

                Give it a go and let us know how you got on. Adjustments might be required, but so long as your example data was representative of the real data, it should behave as it did in my test on that data (with some extra copies added), which worked as expected.

                Terry

                • Jamie W

                  @Terry-R said in Need help filtering lines starting with same strings:

                  (?-s)^([^:]+:).+?\R(\1.+?\R){4,}

                  Hello Terry, thank you so much for your help with this. And to you @guy038. Couldn’t find this anywhere on the web and it has saved me many hours of work. Is there a way I can donate/tip to you or the NPP Community? Thanks again.

                  • Terry R

                    @Jamie-W said in Need help filtering lines starting with same strings:

                    Is there a way I can donate/tip to you or the NPP Community?

                    Most certainly, if you went to:
                    https://notepad-plus-plus.org/donate/
                    that’s the location to “pay back/forward” if you wish. I presume my regex worked without any issues. Don’t feel that you MUST, but it is nice to get feedback. It’s also nice to “upvote” on posts if you agree/like the information. That’s possible by using the ^ character just below each post on the right side. If anything, getting positive feedback is what mostly drives us volunteers on the forum, we pay it forward by helping.

                    Terry

                    • guy038

                      Hello, @jamie-w, @terry-R and All,

                      Ah, Terry, nice shot ! Just one question : why do you add the lazy quantifier +? to get the end of lines after the colon char ? I suppose the normal greedy one ( + ) should work as well !

                      On the other hand, I would use the lazy quantifier for the part of the text before the : char. Hence this new version :

                      SEARCH (?-s)^(.+?:).+\R(\1.+\R){4,}
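To see that the lazy and greedy forms find the same thing here, the two searches can be compared on a small sample in Python. This is only a sketch under assumptions: `\n` stands in for `\R` (which Python's `re` lacks), `[^:\n]` replaces `[^:]` to keep the prefix on one line, and the names `TERRY`, `GUY` and `spans` are made up for the comparison.

```python
import re

# Lazy .+? after the colon, as in the first version posted.
TERRY = r"^([^:\n]+:).+?\n(?:\1.+?\n){4,}"
# Lazy .+? before the colon, greedy .+ after it, as in the new version.
GUY = r"^(.+?:).+\n(?:\1.+\n){4,}"

# Five lines sharing the prefix "Bob1919:" followed by one unrelated line.
sample = "".join(f"Bob1919:{n}\n" for n in range(5)) + "Bob1:12345\n"


def spans(pattern, text):
    """Return the (start, end) offsets of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text, re.MULTILINE)]


print(spans(TERRY, sample) == spans(GUY, sample))
```

Because the dot cannot cross a line break and the next thing to match is the line break itself, both quantifiers end up consuming the rest of the line.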


                      Now, after posting my question to @jamie-w, about sorting, I thought of a method, a bit longer, which would keep the initial order of lines :

                      • First, we would number all lines of file with the column editor ( Alt + C ), adding this number at the end of lines

                      • Then we would sort text lexicographically ascending

                      • Now, we would perform the regex S/R, deleting lines with more than X identical beginnings

                      • Then, we would move the numbers from end to beginning of each line, with another regex S/R

                      • Again, we would sort all the remaining lines of the file

                      • And, finally, we would delete the temporary numbering, at beginning of each line
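The steps above can be sketched in Python as a number, sort, filter, re-sort sequence. This is just one possible illustration of the method, not a Notepad++ feature: the function name `remove_repeated_prefixes_keep_order` and its parameters are hypothetical, and lines that contain no delimiter are grouped by their whole text, which is an assumption of this sketch.

```python
from itertools import groupby


def remove_repeated_prefixes_keep_order(lines, min_lines=5, delimiter=":"):
    """Drop lines whose prefix occurs min_lines times or more, keeping order."""

    def prefix(line):
        # Everything up to and including the first delimiter.
        head, sep, _ = line.partition(delimiter)
        return head + sep

    # Step 1: number every line.
    numbered = list(enumerate(lines))
    # Step 2: sort by text so lines with the same prefix become adjacent.
    numbered.sort(key=lambda pair: pair[1])
    # Step 3: keep only runs shorter than min_lines.
    kept = []
    for _, run in groupby(numbered, key=lambda pair: prefix(pair[1])):
        run = list(run)
        if len(run) < min_lines:
            kept.extend(run)
    # Steps 4 and 5: sort on the number again to restore the original order.
    kept.sort(key=lambda pair: pair[0])
    # Step 6: discard the temporary numbering.
    return [line for _, line in kept]


sample = [
    "Bob1919:12345", "Bob1:12345", "Bob1919:982623",
    "Sam10:12345", "Bob1919:55555", "Alex:888888",
]
print(remove_repeated_prefixes_keep_order(sample, min_lines=3))
```

With `min_lines=3` the three Bob1919 lines of the original example are dropped and the remaining lines stay in their original order.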


                      @jamie-w, you said :

                      Is there a way I can donate/tip to you or the NPP Community?

                      We really appreciate it !

                      Best Regards,

                      guy038

                      • Terry R

                        @guy038 said in Need help filtering lines starting with same strings:

                        Ah, Terry, nice shot ! Just one question :

                        I guess I’m not a greedy person😉. Actually I guess I just didn’t proof my solution well enough. I had a concept in my mind, quickly tested it, found it worked and posted.

                        Of course as I was intending to grab the whole line use of the lazy parameter was just a rookie mistake. At least no harm, no foul, eh? (<-- oops, there it goes again)

                        Terry
