    How to group lines with same beginning

    • Alan KilbornA
      Alan Kilborn @Nataly Flower
      last edited by Alan Kilborn

      @Nataly-Flower

      Your problem seems very similar to this one: https://community.notepad-plus-plus.org/topic/26062/eliminate-excess-verbiage-from-audio-transcript

      Maybe you can figure out how to do what you want, based on the solution of that other one.

      • Mark OlsonM
        Mark Olson
        last edited by

        The following sequence of regex-replace should do the trick.

        Open the find/replace form with Search->Replace... or Ctrl+H (using default keybindings).

        Make sure Search Mode is set to Regular expression and Wrap around is checked.

        1. Mark the first line in each series of lines with the same start using the following replacement:
          • Find what: (?-s)^([^\$\r\n]*)(.*\R)(?:\1(.*)(?:\R|\z))*
          • Replace with: \x07${0}
        2. Convert your document into the desired final form (shown below)
          • Find what: (?-s)(\R?)(\x07)?([^\$]*)(\$+)(.*)
          • Replace with: (?2$1$3$4$5: / $5)

        Result:

        aaaa$$$$$$hello1 / hello2 / hello3
        bbbb$$$$$$$$hello4 / hello5
        cccc$$$$hello6 / hello7 / hello8
        

        The first and second regular expressions can be tweaked a fair amount to solve similar problems to this one.

        • Alan KilbornA
          Alan Kilborn @Mark Olson
          last edited by

          @Mark-Olson

          I wanted to see if someone could learn to fish.
          Apparently you just wanted to feed them for today.
          :-)

          • Nataly FlowerN
            Nataly Flower @Mark Olson
            last edited by

            @Mark-Olson

            Thank you so much, Mark, for your reply! :)

            I’m very sorry I couldn’t reply sooner. Due to several issues, I haven’t been able to. I apologize for it.

            Anyway, I really appreciate your help. I’ve tried to follow the steps you indicate, but at point 1, after making sure that Search Mode is set to Regular expression and Wrap around is checked, when I try to do the replacement (I’ve tried to do it both by clicking “Replace All” and “Replace”), Notepad throws error message ‘Find: Invalid regular expression’.

            What do you think we could change in the expression of that first Find what?

            • Nataly FlowerN
              Nataly Flower @Alan Kilborn
              last edited by

              @Alan-Kilborn

              Thank you for the link. Although the question there may be somewhat similar, I would prefer to learn from my own question, if possible, since it matches my needs exactly.

              You could consider this a courtesy fish. Later, if I need similar fish, don’t worry, I will do my best to catch them myself. :)

              • Alan KilbornA
                Alan Kilborn @Nataly Flower
                last edited by

                @Nataly-Flower

                It’s so similar to the other one that you missed a golden opportunity to learn something. And since you were spoon fed, I’m confident you won’t be doing any learning.

                • PeterJonesP
                  PeterJones @Nataly Flower
                  last edited by PeterJones

                  @Nataly-Flower said in How to group lines with same beginning:

                  after making sure that Search Mode is set to Regular expression and Wrap around is checked, when I try to do the replacement (I’ve tried to do it both by clicking “Replace All” and “Replace”), Notepad throws error message ‘Find: Invalid regular expression’.

                  Then you have done something wrong, because when I copy @Mark-Olson’s first regex, and use it on your data, it works perfectly, without giving an error:

                  dcdb199f-6b66-4c4d-9230-109e87e863e6-image.png

                  So does the second:
                  d851302d-9a47-4f82-baea-60a1d6db32b8-image.png

                  We cannot begin to guess what you did wrong, because you shared nothing about the error message. If you hover over the little popout icon in the Find: Invalid Regular Expression message, it will tell you exactly what’s wrong with your regex. Here’s one where I intentionally edited the regex to be invalid:
                  061b577f-f312-423b-8e45-92d4ec5465e8-image.png

                  Hmm, I said you had done something wrong… but another question might be: how big is your file? And how many lines in a row might match with the exact same prefix? Because there is a limit to how many bytes can fit inside a capture group in a regex. So if you have lines that are thousands of characters wide, or have thousands of the aaaa-prefixed lines, it might be enough to overwhelm the regex engine. If it gets too big, it can come back with an Invalid Regular Expression message, even though it’s really an “invalid size” problem. But you haven’t given us enough to be sure.
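
The two questions above (how long are the lines, and how many consecutive lines share a prefix) can be answered with a few lines of plain Python. This is only a rough sketch, not something from the original posts: the helper names `prefix_of` and `run_stats` are made up for illustration, and it assumes the prefix is everything before the first `$`.

```python
from itertools import groupby

def prefix_of(line):
    """Return the part of the line before the first '$' (the whole line if there is none)."""
    return line.split('$', 1)[0]

def run_stats(text):
    """Return (longest_run, prefix_of_longest_run, longest_line_length)."""
    lines = text.splitlines()  # handles \n, \r\n and \r
    longest_run, longest_prefix = 0, None
    # groupby only merges *consecutive* lines with an equal key
    for prefix, group in groupby(lines, key=prefix_of):
        n = sum(1 for _ in group)
        if n > longest_run:
            longest_run, longest_prefix = n, prefix
    longest_line = max([len(ln) for ln in lines] or [0])
    return longest_run, longest_prefix, longest_line

sample = ("aaaa$$$$$$hello1\naaaa$$$$hello2\naaaa$$$$hello3\n"
          "bbbb$$$$$$$$hello4\nbbbb$$$$hello5\n")
print(run_stats(sample))
```

If the first number is in the thousands, or the last one is very large, the size explanation becomes more plausible than a typo in the regex.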

                  • Nataly FlowerN
                    Nataly Flower @PeterJones
                    last edited by

                    @PeterJones

                    Thank you very much for your assistance. I appreciate it a lot. :)

                    I must say you have given me the key to realizing what was happening. I was not really doing anything wrong, nor is there anything wrong with the regex formulated above by @Mark-Olson, whom I also thank once again for his help.

                    When you asked about the size of my file, I tried the replacements on the original file and it does indeed give an error. Hovering the mouse over the little icon you mentioned, I could see the following message: ‘Ran out of stack space trying to match the regular expression’. It shows that I am still a novice with Notepad++: I didn’t know that a message could be seen in that icon; I thought it was just an icon accompanying the Find: Invalid Regular Expression message.

                    So I tried the replacements again on a smaller version of the file, and this time they worked perfectly. It was a question of file size.

                    Having said this, I would like to thank you once again for helping me. Thank you so much.

                    • Mark OlsonM
                      Mark Olson
                      last edited by Mark Olson

                      @Nataly-Flower

                      Fortunately, this problem can still be solved with regular expressions, because the regex engines used by most scripting languages (including Python and C#) are not susceptible to the same match length limit that Notepad++ suffers from.

                      Here’s a PythonScript script that would solve the problem. I’ve tested it with documents that have as many as 400000 consecutive lines that all have the same beginning before the first $ character. And before someone complains that this is basically just a couple of simple PythonScript commands wrapped around a pure Python script, I’m aware of that and I just wanted to post this anyway.

                      By the way, this highlights a general pattern: if you are dissatisfied with the performance of Notepad++'s native find/replace form, try using re.sub to do the same find/replace operation in PythonScript, and you will likely see a massive performance improvement.
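
As a rough illustration of that general pattern (not part of the original post), the sketch below wraps `re.subn` in a small helper. The helper name `replace_all` and the example pattern (strip trailing spaces and tabs) are assumptions chosen for illustration; only `editor.getText()` and `editor.setText()` are taken from the script in this post. The `try`/`except` around the `Npp` import is there only so the sketch also runs in plain Python outside Notepad++.

```python
import re

try:
    from Npp import editor  # only available inside Notepad++ with PythonScript
except ImportError:
    editor = None           # lets the sketch run in a plain Python interpreter

# Hypothetical example pattern: trailing spaces/tabs before any newline or end of text.
# A lookahead is used instead of (?m)$ because Python's $ does not stop before \r.
TRAILING_WS = r'[ \t]+(?=\r?\n|\r|\Z)'

def replace_all(text, pattern, replacement):
    """Return (new_text, number_of_replacements) using Python's re engine."""
    return re.subn(pattern, replacement, text)

if editor is not None:
    new_text, count = replace_all(editor.getText(), TRAILING_WS, '')
    if count:
        editor.setText(new_text)
```

Whether this is actually faster than the native Replace dialog will depend on the pattern and the file, so it is worth timing on real data.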

                      '''
                      ====== SOURCE ======
                      Requires PythonScript (https://github.com/bruderstein/PythonScript/releases)
                      Based on this question: https://community.notepad-plus-plus.org/post/96456
                      ====== DESCRIPTION ======
                      The goal of this script is to replace a document of the form
                      """
                      aaaa$$$$$$hello1
                      aaaa$$$$hello2
                      aaaa$$$$hello3
                      bbbb$$$$$$$$hello4
                      bbbb$$$$hello5
                      cccc$$$$hello6
                      cccc$$$$hello7
                      cccc$$$$hello8
                      """
                      with
                      """
                      aaaa$$$$$$hello1 / hello2 / hello3
                      bbbb$$$$$$$$hello4 / hello5
                      cccc$$$$hello6 / hello7 / hello8
                      """
                      
                      In the words of the original poster, "So, I would like to know how to group easily, maybe with a macro or a built-in feature in Notepad, these lines that share the same beginning (aaaa, bbbb, cccc) up to the mark $"
                      ====== EXAMPLE ======
                      See above.
                      '''
                      from Npp import editor
                      import re
                      
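                      # To generate a large test document that matches the problem, uncomment the line below
                      # (400000 is written without underscores so that it also parses on Python 2.7 builds of PythonScript):
                      # editor.setText('\r\n'.join(('%s$$$$hello%d' % (x * 4, ii)) for x in 'abcdefghijklm' for ii in range(400000)))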
                      
                      oldText = editor.getText()
                      
                      # Use BEL to mark the first line in each series of lines with the same start
                      FIRST_REX = r'(?m)^([^$\r\n]*)([^\r\n]*(?:\r?\n|\r))((?:\1(?:[^\r\n]*)(?:\r?\n|\r)?)*)'
                      # print('======= DOING FIRST REPLACEMENTS ======')
                      def replacer1(m):
                          # print(m.groups())
                          return '\x07' + m.group(0)
                      
                      newText1 = re.sub(FIRST_REX, replacer1, oldText)
                      
                      # Convert your document into the desired final form
                      # print('======= DOING SECOND REPLACEMENTS ======')
                      def replacer2(m):
                          grps = m.groups()
                          # print(grps)
                          return ('%s%s%s%s' % (grps[0], grps[2], grps[3], grps[4])) if grps[1] else (' / ' + grps[4])
                      
                      newText2 = re.sub(r'((?:\r?\n|\r)?)(\x07)?([^$\r\n]*)(\$+)([^\r\n]*)', replacer2, newText1)
                      editor.setText(newText2)
                      

                      EDIT: The regex search form in JsonTools can also achieve this task much faster than Notepad++, but still noticeably slower than PythonScript. The main advantage of the regex search form is that, when using the s_fa function, it provides a tree view making it easy to see all the capture groups of each regex search result.

                      I have not tested other plugins like ColumnsPlusPlus or MultiReplace for this task, but in my experience neither of those plugins comes anywhere close to the raw performance of Python’s re.sub when the number of replacements is very large.
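
For cross-checking the script's output outside Notepad++, here is a minimal non-regex sketch of the same grouping using `itertools.groupby`. It is not the method used in the script above, and `group_lines` is a name made up for illustration. It assumes every line contains at least one `$`, compares the full text before the first `$` for equality, and joins the result with `\n` regardless of the original line endings.

```python
from itertools import groupby
import re

def group_lines(text, sep=' / '):
    """Merge consecutive lines that share the same text before the first '$'."""
    out = []
    for _prefix, group in groupby(text.splitlines(),
                                  key=lambda ln: ln.split('$', 1)[0]):
        group = list(group)
        first = group[0]  # the first line of each run is kept whole
        # for later lines, drop the prefix and the run of '$' that follows it
        rest = [re.sub(r'^[^$]*\$+', '', ln) for ln in group[1:]]
        out.append(sep.join([first] + rest))
    return '\n'.join(out)

sample = '\r\n'.join([
    'aaaa$$$$$$hello1', 'aaaa$$$$hello2', 'aaaa$$$$hello3',
    'bbbb$$$$$$$$hello4', 'bbbb$$$$hello5',
    'cccc$$$$hello6', 'cccc$$$$hello7', 'cccc$$$$hello8',
])
print(group_lines(sample))
```

On the sample from the docstring this should print the same three lines shown there.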

                      • Alan KilbornA
                        Alan Kilborn @Mark Olson
                        last edited by

                        @Mark-Olson said:

                        And before someone complains that this is basically just a couple of simple PythonScript commands wrapped around a pure Python script

                        It’s no problem. In fact, it’s totally appropriate. :-)


                        Just to point out:

                        Python’s re uses a different engine than Notepad++ does. While this can be a “good thing”, sometimes it will trip a user up – they’ll get a “tricky” regular expression working in Notepad++, and then run into trouble when trying to automate using the same expression in a script. I’m not saying anything like that would happen with the specific problem of this thread…it’s just something to be aware of.

                        • Mark OlsonM
                          Mark Olson @Alan Kilborn
                          last edited by Mark Olson

                          @Alan-Kilborn said in How to group lines with same beginning:

                          Python’s re uses a different engine than Notepad++ does. While this can be a “good thing”, sometimes it will trip a user up – they’ll get a “tricky” regular expression working in Notepad++, and then run into trouble when trying to automate using the same expression in a script.

                          This does in fact happen in multiple places in my script. I’ll just break down how the regular expressions I used had to change to be compatible with Python’s re engine.

                          STEP 1 REGEX CHANGES

                          (?-s)^([^\$\r\n]*)(.*\R)(?:\1(.*)(?:\R|\z))*
                          becomes
                          (?m)^([^$\r\n]*)([^\r\n]*(?:\r?\n|\r))((?:\1(?:[^\r\n]*)(?:\r?\n|\r)?)*)

                          1. (?-s) is unnecessary (because . already does not match newline by default in re)
                          2. (?m) is necessary to make it so that ^ matches at the beginning of the file and at the beginning of lines. In Notepad++ regex, ^ matches the beginning of lines by default.
                          3. Every instance of . must become [^\r\n] because in Python . matches \r, which is bad because that is the first character of the \r\n sequence that indicates a newline in Windows.
                          4. Every instance of \R (shorthand for any newline) must become (?:\r?\n|\r), which matches the three most common newlines (\n, \r\n, and \r)
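
Points 3 and 4 are easy to check in a plain Python interpreter. The snippet below is a small illustrative sketch (the variable names are made up); it shows `.` swallowing `\r`, the `[^\r\n]` replacement, and the `(?:\r?\n|\r)` stand-in for `\R`. The last part only probes whether `\R` compiles: on the Python 3 versions I am aware of it raises `re.error`, while Python 2.7 accepts it as a literal `R`, so treat that part as an assumption about your interpreter.

```python
import re

text = 'ab\r\ncd'

# Point 3: in Python, '.' excludes only \n, so the \r ends up inside the match
dot_matches = re.findall(r'.+', text)           # ['ab\r', 'cd']
safe_matches = re.findall(r'[^\r\n]+', text)    # ['ab', 'cd']

# Point 4: explicit alternation standing in for \R
NEWLINE = r'(?:\r?\n|\r)'
pieces = re.split(NEWLINE, 'one\r\ntwo\nthree\rfour')

# Probe whether this interpreter's re module accepts \R at all
try:
    re.compile(r'\R')
    backslash_R_compiles = True
except re.error:
    backslash_R_compiles = False

print(dot_matches, safe_matches, pieces, backslash_R_compiles)
```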

                          The Step 2 regex also needs to be changed from (?-s)(\R?)(\x07)?([^\$]*)(\$+)(.*) to ((?:\r?\n|\r)?)(\x07)?([^$\r\n]*)(\$+)([^\r\n]*) because of points 3 and 4 above (. matches \r in Python, and Python’s re lacks \R).

                          Finally, I had to create callback functions (def replacer1(m): and def replacer2(m): above), because the replacement syntax I used in Notepad++ (${0} and the conditional (?2$1$3$4$5: / $5)) doesn’t work in Python’s re.
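
To make the replacement-side difference concrete, here is a small sketch in plain Python; the sample strings and the `conditional` helper are invented for illustration. Python's `re.sub` spells "the whole match" as `\g<0>` in a replacement string, so a Step 1 style replacement could also be written without a callback. It has no equivalent of the Notepad++ conditional `(?2…:…)`, so a replacement that depends on whether a group matched needs a function.

```python
import re

# Step 1 style: put BEL in front of the whole match.
# Notepad++ replacement: \x07${0}     Python replacement string: BEL + \g<0>
marked = re.sub(r'(?m)^[^\r\n]+', '\x07\\g<0>', 'aaaa$$x\nbbbb$$y')

# Step 2 style: the output depends on whether the optional BEL group matched.
def conditional(m):
    # group 1 is the optional BEL marker; it is None when it did not participate
    if m.group(1):
        return 'FIRST:' + m.group(2)
    return 'REST:' + m.group(2)

result = re.sub(r'(\x07)?(\w+)', conditional, '\x07one two')
print(repr(marked), repr(result))
```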
