    How to group lines with same beginning

    • Mark Olson

      The following sequence of regex-replace should do the trick.

      Open the find/replace form with Search->Replace... or Ctrl+H (using default keybindings).

      Make sure Search Mode is set to Regular expression and Wrap around is checked.

      1. Mark the first line in each series of lines with the same start using the following replacement:
        • Find what: (?-s)^([^\$\r\n]*)(.*\R)(?:\1(.*)(?:\R|\z))*
        • Replace with: \x07${0}
      2. Convert your document into the desired final form (shown below)
        • Find what: (?-s)(\R?)(\x07)?([^\$]*)(\$+)(.*)
        • Replace with: (?2$1$3$4$5: / $5)

      Result:

      aaaa$$$$$$hello1 / hello2 / hello3
      bbbb$$$$$$$$hello4 / hello5
      cccc$$$$hello6 / hello7 / hello8
      

      The first and second regular expressions can be tweaked a fair amount to solve similar problems to this one.
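
To check what the two replacements are meant to produce, here is a minimal plain-Python sketch of the same grouping without any regex. It is not the method from the post above: it swaps in itertools.groupby, and the function name group_lines is made up for illustration. It assumes consecutive lines belong together only when the text before the first $ is exactly equal.

```python
from itertools import groupby

def group_lines(text):
    """Join consecutive lines that share the text before the first '$'."""
    out = []
    lines = text.splitlines()
    for prefix, run in groupby(lines, key=lambda line: line.split('$', 1)[0]):
        run = list(run)
        # Keep the first line whole; from the others keep only what follows the '$' run.
        tails = [line[len(prefix):].lstrip('$') for line in run[1:]]
        out.append(' / '.join([run[0]] + tails))
    return '\n'.join(out)

sample = '\n'.join([
    'aaaa$$$$$$hello1', 'aaaa$$$$hello2', 'aaaa$$$$hello3',
    'bbbb$$$$$$$$hello4', 'bbbb$$$$hello5',
    'cccc$$$$hello6', 'cccc$$$$hello7', 'cccc$$$$hello8',
])
print(group_lines(sample))
```

On the sample data this prints the three lines shown under "Result" above.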

      • Alan Kilborn @Mark Olson

        @Mark-Olson

        I wanted to see if someone could learn to fish.
        Apparently you just wanted to feed them for today.
        :-)

        • Nataly Flower @Mark Olson

          @Mark-Olson

          Thank you so much, Mark, for your reply! :)

          I’m very sorry I couldn’t reply sooner. Due to several issues, I haven’t been able to. I apologize for it.

          Anyway, I really appreciate your help. I’ve tried to follow the steps you describe, but at point 1, after making sure that Search Mode is set to Regular expression and Wrap around is checked, when I try to do the replacement (I’ve tried both “Replace All” and “Replace”), Notepad++ shows the error message ‘Find: Invalid regular expression’.

          What do you think we could change in the expression of that first Find what?

          • Nataly Flower @Alan Kilborn

            @Alan-Kilborn

            Thank you for the link, but although the question there may be somewhat similar, I would prefer to learn from my own question, if possible, since it matches my needs exactly.

            You could consider this a fish of courtesy, and later, if I feel like eating similar fish, don’t worry: I will do my best to catch them by myself. :)

            • Alan Kilborn @Nataly Flower

              @Nataly-Flower

              It’s so similar to the other one that you missed a golden opportunity to learn something. And since you were spoon fed, I’m confident you won’t be doing any learning.

              • PeterJones @Nataly Flower

                @Nataly-Flower said in How to group lines with same beginning:

                after making sure that Search Mode is set to Regular expression and Wrap around is checked, when I try to do the replacement (I’ve tried both “Replace All” and “Replace”), Notepad++ shows the error message ‘Find: Invalid regular expression’.

                Then you have done something wrong, because when I copy @Mark-Olson’s first regex, and use it on your data, it works perfectly, without giving an error:

                [Screenshot: the first regex applied to the sample data without an error]

                So does the second:
                [Screenshot: the second regex applied to the sample data without an error]

                We cannot begin to guess what you did wrong, because you shared nothing about the error message. If you hover over the little popout icon in the Find: Invalid Regular Expression message, it will tell you exactly what’s wrong with your regex. Here’s one where I intentionally edited the regex to be invalid:
                [Screenshot: the popout message for an intentionally invalid regex]

                Hmm, I said you had done something wrong… but another question might be: how big is your file? And how many lines in a row might match with the exact same prefix? Because there is a limit to how many bytes can fit inside a capture group in a regex. So if you have lines that are thousands of characters wide, or have thousands of the aaaa-prefixed lines, it might be enough to overwhelm the regex engine. If it gets too big, it can come back with an Invalid Regular Expression message, even though it’s really an “invalid size” problem. But you haven’t given us enough to be sure.
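
The two questions above (how long are the lines, and how many consecutive lines share a prefix) can be answered with a few lines of Python. This is only a sketch: the helper name prefix_run_stats is made up, it assumes the prefix is everything before the first $, and it does not know at what size Notepad++ actually gives up, since that threshold is not stated in this thread.

```python
from itertools import groupby

def prefix_run_stats(text):
    """Return (longest line, longest run in lines, longest run in characters)."""
    lines = text.splitlines(keepends=True)
    longest_line = max((len(line) for line in lines), default=0)
    longest_run_lines = 0
    longest_run_chars = 0
    # Group consecutive lines by the text before the first '$'.
    key = lambda line: line.rstrip('\r\n').split('$', 1)[0]
    for _, run in groupby(lines, key=key):
        run = list(run)
        longest_run_lines = max(longest_run_lines, len(run))
        longest_run_chars = max(longest_run_chars, sum(len(line) for line in run))
    return longest_line, longest_run_lines, longest_run_chars

stats = prefix_run_stats('aaaa$$x\naaaa$$y\nbbbb$$z\n')
print(stats)
```

Large values for the second or third number mean that a single regex match has to span a lot of text, which is the situation described above.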

                • Nataly Flower @PeterJones

                  @PeterJones

                  Thank you very much for your assistance. I appreciate it a lot. :)

                  I must say you have given me the key to realizing what was happening. I was not really doing anything wrong, nor is there anything wrong with the regex formulated above by @Mark-Olson, whom I also thank once again for his help.

                  When you asked about the size of my file, I tried the replacements on the original file and they did indeed give an error. Hovering the mouse over the little icon you mentioned, I could see the following message: ‘Ran out of stack space trying to match the regular expression’. It shows that I am still a novice with Notepad++: I didn’t know that the icon held a message; I thought it merely accompanied the Find: Invalid Regular Expression message.

                  So I tried the replacements again on a smaller version of the file, and this time they worked perfectly. It was a question of file size.

                  Having said this, I would like to thank you once again for helping me. Thank you so much.

                  • Mark Olson

                    @Nataly-Flower

                    Fortunately, this problem can still be solved with regular expressions, because the regex engines of many other languages (including Python and C#) do not run into the same limit on match size that Notepad++ hit here.

                    Here’s a PythonScript script that would solve the problem. I’ve tested it with documents that have as many as 400000 consecutive lines that all have the same beginning before the first $ character. And before someone complains that this is basically just a couple of simple PythonScript commands wrapped around a pure Python script, I’m aware of that and I just wanted to post this anyway.

                    By the way, this highlights a general pattern: if you are dissatisfied with the performance of Notepad++'s native find/replace form, try using re.sub to do the same find/replace operation in PythonScript, and you will likely see a massive performance improvement.

                    '''
                    ====== SOURCE ======
                    Requires PythonScript (https://github.com/bruderstein/PythonScript/releases)
                    Based on this question: https://community.notepad-plus-plus.org/post/96456
                    ====== DESCRIPTION ======
                    The goal of this script is to replace a document of the form
                    """
                    aaaa$$$$$$hello1
                    aaaa$$$$hello2
                    aaaa$$$$hello3
                    bbbb$$$$$$$$hello4
                    bbbb$$$$hello5
                    cccc$$$$hello6
                    cccc$$$$hello7
                    cccc$$$$hello8
                    """
                    with
                    """
                    aaaa$$$$$$hello1 / hello2 / hello3
                    bbbb$$$$$$$$hello4 / hello5
                    cccc$$$$hello6 / hello7 / hello8
                    """
                    
                    In the words of the original poster, "So, I would like to know how to group easily, maybe with a macro or a built-in feature in Notepad, these lines that share the same beginning (aaaa, bbbb, cccc) up to the mark $"
                    ====== EXAMPLE ======
                    See above.
                    '''
                    from Npp import editor
                    import re
                    
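                    # To generate a large test document that matches the problem, uncomment the line below.
                    # (400000 is written without underscores so it also parses on Python 2 based PythonScript.)
                    # editor.setText('\r\n'.join(('%s$$$$hello%d' % (x * 4, ii)) for x in 'abcdefghijklm' for ii in range(400000)))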
                    
                    oldText = editor.getText()
                    
                    # Use BEL to mark the first line in each series of lines with the same start
                    FIRST_REX = r'(?m)^([^$\r\n]*)([^\r\n]*(?:\r?\n|\r))((?:\1(?:[^\r\n]*)(?:\r?\n|\r)?)*)'
                    # print('======= DOING FIRST REPLACEMENTS ======')
                    def replacer1(m):
                        # print(m.groups())
                        return '\x07' + m.group(0)
                    
                    newText1 = re.sub(FIRST_REX, replacer1, oldText)
                    
                    # Convert your document into the desired final form
                    # print('======= DOING SECOND REPLACEMENTS ======')
                    def replacer2(m):
                        grps = m.groups()
                        # print(grps)
                        return ('%s%s%s%s' % (grps[0], grps[2], grps[3], grps[4])) if grps[1] else (' / ' + grps[4])
                    
                    newText2 = re.sub(r'((?:\r?\n|\r)?)(\x07)?([^$\r\n]*)(\$+)([^\r\n]*)', replacer2, newText1)
                    editor.setText(newText2)
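
The two patterns from the script can also be tried outside Notepad++, without the Npp module. The sketch below reuses them unchanged; the wrapper name group_same_prefix and the CRLF-joined sample are only for illustration, and it assumes a Python 3 interpreter.

```python
import re

# Patterns copied from the PythonScript script above.
FIRST_REX = r'(?m)^([^$\r\n]*)([^\r\n]*(?:\r?\n|\r))((?:\1(?:[^\r\n]*)(?:\r?\n|\r)?)*)'
SECOND_REX = r'((?:\r?\n|\r)?)(\x07)?([^$\r\n]*)(\$+)([^\r\n]*)'

def group_same_prefix(text):
    # Step 1: put a BEL in front of the first line of each run.
    marked = re.sub(FIRST_REX, lambda m: '\x07' + m.group(0), text)

    # Step 2: keep BEL-marked lines whole, fold the others into ' / tail'.
    def merge(m):
        newline, bel, prefix, dollars, rest = m.groups()
        if bel:
            return newline + prefix + dollars + rest
        return ' / ' + rest

    return re.sub(SECOND_REX, merge, marked)

sample = '\r\n'.join([
    'aaaa$$$$$$hello1', 'aaaa$$$$hello2', 'aaaa$$$$hello3',
    'bbbb$$$$$$$$hello4', 'bbbb$$$$hello5',
    'cccc$$$$hello6', 'cccc$$$$hello7', 'cccc$$$$hello8',
])
result = group_same_prefix(sample)
print(result)
```

Running it on the eight sample lines gives the three grouped lines from the script's docstring.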
                    

                    EDIT: The regex search form in JsonTools can also achieve this task much faster than Notepad++, but still noticeably slower than PythonScript. The main advantage of the regex search form is that, when using the s_fa function, it provides a tree view making it easy to see all the capture groups of each regex search result.

                    I have not tested other plugins like ColumnsPlusPlus or MultiReplace for this task, but in my experience neither of those plugins comes anywhere close to the raw performance of Python’s re.sub when the number of replacements is very large.

                    • Alan Kilborn @Mark Olson

                      @Mark-Olson said:

                      And before someone complains that this is basically just a couple of simple PythonScript commands wrapped around a pure Python script

                      It’s no problem. In fact, it’s totally appropriate. :-)


                      Just to point out:

                      Python’s re uses a different engine than Notepad++ does. While this can be a “good thing”, sometimes it will trip a user up – they’ll get a “tricky” regular expression working in Notepad++, and then run into trouble when trying to automate using the same expression in a script. I’m not saying anything like that would happen with the specific problem of this thread…it’s just something to be aware of.

                      • Mark Olson @Alan Kilborn

                        @Alan-Kilborn said in How to group lines with same beginning:

                        Python’s re uses a different engine than Notepad++ does. While this can be a “good thing”, sometimes it will trip a user up – they’ll get a “tricky” regular expression working in Notepad++, and then run into trouble when trying to automate using the same expression in a script.

                        This does in fact happen in multiple places in my script. I’ll just break down how the regular expressions I used had to change to be compatible with Python’s re engine.

                        STEP 1 REGEX CHANGES

                        (?-s)^([^\$\r\n]*)(.*\R)(?:\1(.*)(?:\R|\z))*
                        becomes
                        (?m)^([^$\r\n]*)([^\r\n]*(?:\r?\n|\r))((?:\1(?:[^\r\n]*)(?:\r?\n|\r)?)*)

                        1. (?-s) is unnecessary (because . already does not match newline by default in re)
                        2. (?m) is necessary to make it so that ^ matches at the beginning of the file and at the beginning of lines. In Notepad++ regex, ^ matches the beginning of lines by default.
                        3. Every instance of . must become [^\r\n] because in Python . matches \r, which is bad because that is the first character of the \r\n sequence that indicates a newline in Windows.
                        4. Every instance of \R (shorthand for any newline) must become (?:\r?\n|\r), which matches the three most common newlines (\n, \r\n, and \r)
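
Points 2 to 4 are easy to see on a tiny CRLF string. The snippet below is a small sketch using only Python's standard re module; the variable names are made up for illustration.

```python
import re

crlf_text = 'one\r\ntwo\r\nthree'

# Point 3: '.' stops only at '\n', so the '\r' of each '\r\n' leaks into the match.
dot_matches = re.findall(r'.+', crlf_text)
safe_matches = re.findall(r'[^\r\n]+', crlf_text)

# Point 2: without (?m), '^' anchors only at the very start of the string.
without_m = re.findall(r'^\w+', crlf_text)
with_m = re.findall(r'(?m)^\w+', crlf_text)

# Point 4: the alternation that stands in for \R handles \n, \r\n and \r.
pieces = re.split(r'\r?\n|\r', 'a\nb\r\nc\rd')

# Whether \R itself compiles depends on the Python version, so it is only
# reported here rather than assumed.
try:
    re.compile(r'\R')
    accepts_backslash_R = True
except re.error:
    accepts_backslash_R = False

print(dot_matches, safe_matches, without_m, with_m, pieces, accepts_backslash_R)
```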

                        The Step 2 regex also needs to be changed from (?-s)(\R?)(\x07)?([^\$]*)(\$+)(.*) to ((?:\r?\n|\r)?)(\x07)?([^$\r\n]*)(\$+)([^\r\n]*) because of points 3 and 4 above (. matches \r in Python, and Python’s re has no \R).

                        Finally, I had to create callback functions (def replacer1(m): and def replacer2(m): above), because the replacement syntax I used in Notepad++ (${0} and the conditional (?2...:...)) doesn’t work in Python’s re.
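
As a rough guide to translating replacements, here is a small sketch. The patterns and inputs are made up for illustration and are not the ones from this thread. The whole-match reference can stay a plain template, because Python spells it \g<0>, while a conditional replacement needs a callback.

```python
import re

# Notepad++ replacement  \x07${0}  (BEL plus the whole match):
# Python's template syntax for the whole match is \g<0>.
marked = re.sub(r'(?m)^\w+', '\x07\\g<0>', 'abc\ndef')

# Notepad++ conditional replacement  (?1[$2]:$2!)  has no template
# equivalent in Python's re, so a callback makes the decision per match.
def conditional(m):
    # group(1) is None when the optional group did not take part in the match.
    if m.group(1) is not None:
        return '[' + m.group(2) + ']'
    return m.group(2) + '!'

converted = re.sub(r'(#)?(\w+)', conditional, '#a b')
print(repr(marked), repr(converted))
```

Testing with "is not None" rather than plain truthiness keeps the callback correct even when the optional group can match an empty string.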
