    Help with semi-complicated regex / Notepad++ regex issue

    • DaveyD

      Hi, I need help with a regex.
      I have a text file that contains a folder tree structure.
      The problem is that the first file of each folder is listed on the same line as the folder.
      Sample text (note that each “FileA” is preceded by a tab character):

      Folder1
      |--Folder1A	FileA
      |	|--FileB
      |	|--FileC
      |	|--FileD
      |--Folder1B	FileA
      |	|--FileB
      |	|--FileC
      |	|--Folder1B1	FileA
      |	|	|--FileB
      |	|	|--FileC
      |	|	|--FileD
      |	|--Folder1B2	FileA
      |	|	|--FileB
      

      What I thought to do was as follows:

      1. find a tab character that is not preceded by a pipe (|) --> put that into group 1
      2. put the remaining text on that line --> group 2
      3. the folder level of the next line --> group 3

      I was able to search that with the following:

      (?<=[^\|])(\t)(.+$\r\n)([\|\t\-]+)
      

      Then, I wanted to replace that with:

      \r\n$3$2
      

      Although I thought, and still think, this should work, Notepad++ did not replace anything! It finds the text that I want, but when I press Replace, it doesn’t do anything!
      I don’t know if this is a problem with the regex or N++, but it seems weird. Just to mention, this is not the first time I’ve had this issue where N++ wouldn’t replace anything.

      Windows 10 64bit
      Notepad++ 6.9.1

      Any help will be appreciated!
      Thanks,
      David

      • dail

        Definitely seems like a bug using the positive look behind. Doesn’t allow replacing the match at all.

        I managed to change (?<=[^\|]) into (?<!\|) and it seemed to actually allow replacing the match. I think the regex will need tweaked some more but you seem to know what you are doing. :)
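
A minimal sketch of what the two lookbehinds accept, using Python's `re` module as a stand-in. Notepad++ uses the Boost engine, so this does not reproduce the Replace-button behaviour described above; it only shows that the two forms are not strictly equivalent. The sample strings are made up for illustration.

```python
import re

# (?<=[^|]) needs some character other than a pipe before the tab,
# so it cannot match a tab at the very start of the text.
positive = re.compile(r"(?<=[^|])\t")
# (?<!\|) only requires that no pipe precedes the tab,
# so a tab at the very start of the text is accepted.
negative = re.compile(r"(?<!\|)\t")

samples = ["Folder1A\tFileA", "|\t|--FileB", "\tFileA"]
results = {s: (bool(positive.search(s)), bool(negative.search(s))) for s in samples}
for text, (pos, neg) in results.items():
    print(repr(text), "positive:", pos, "negative:", neg)
```

For the folder tree in this thread the difference does not matter, because the tab before “FileA” always follows a folder name.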

        • AdrianHHH

          I find the look behinds and look in fronts confusing and almost never use them. For this problem a simple search and replace works.

          Search for: ^([^\t\r\n]+)\t(.+)\r\n([| -]+)
          Replace with: \1\r\n\3\2\r\n\3

          Where:
          \1 contains the folder name with its preceding “^[| -]+” characters.
          \2 contains the “FileA” part
          \3 contains the “^[| -]+” characters preceding the “FileB” part.

          • gerdb42

            So by putting all answers together we’ll get (?<![|])(\t)(.+$\r\n)([\|\t\-]+) for searching and \r\n$3$2$3 for replacement.
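
To see why the trailing $3 matters, here is a rough Python port of the combined search, offered as a sketch rather than as what Notepad++ does internally. Assumptions: the `$` is dropped because Python's `$` does not match immediately before `\r\n`, groups are written `\1`..`\3` instead of `$1`..`$3`, and the two-line CRLF sample is made up.

```python
import re

# Port of (?<![|])(\t)(.+$\r\n)([\|\t\-]+) without the "$" (see lead-in).
search = re.compile(r"(?<![|])(\t)(.+\r\n)([\|\t\-]+)")

sample = "|--Folder1A\tFileA\r\n|\t|--FileB\r\n"

# Replacement without the trailing group 3: FileB loses its tree prefix.
without_last = search.sub(r"\r\n\3\2", sample)
# Replacement with the trailing group 3: the prefix is written twice,
# once for FileA and once for FileB.
with_last = search.sub(r"\r\n\3\2\3", sample)

print(repr(without_last))
print(repr(with_last))
```

Group 3 is consumed by the match, so it has to be written back once for the moved “FileA” line and once more for the line it originally belonged to.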

            • DaveyD

              @dail - thanks :) . That is very interesting. seems like the bug is only in positive look behind.

              @AdrianHHH, your search needed some tweaking in the last part because the pipe and the dash needed to be escaped. However, even after that it didn’t work as expected because it didn’t find only the lines with a tab in the middle; it also found every line that started with a pipe and a tab.

              @gerdb42, this works! Basically, you put dail’s search tweak, together with the last $3 that I was missing in the replacement (which I hadn’t had a chance to test since the replace wasn’t working… :) )

              Thanks guys
              David

              • dail

                @AdrianHHH @DaveyD

                I’ve also had luck using \K instead of look behinds. It sets the cursor at that position. For example the original RE posted (obviously it still needed work at that point):

                (?<=[^\|])(\t)(.+$\r\n)([\|\t\-]+)
                

                Would become:

                [^\|]\K(\t)(.+$\r\n)([\|\t\-]+)
                

                That way whatever [^\|] matches isn’t actually selected for replacement. To quote the Boost documentation on \K:

                \K Resets the start location of $0 to the current text position: in other words everything to the left of \K is “kept back” and does not form part of the regular expression match.

                • guy038

                  Hello DaveyD, Dail, AdrianHHH, gerdb42 and All,

                  DaveyD, if you just click on the Replace All button ( instead of hitting the Replace button several times ), your regex does the job correctly !

                  We, also, get the same behaviour with the final search regex of Dail, built with the \K syntax

                  So, to sum up : With the given example text, below, without any space inside :

                  Folder1
                  |--Folder1A	FileA
                  |	|--FileB
                  |	|--FileC
                  |	|--FileD
                  |--Folder1B	FileA
                  |	|--FileB
                  |	|--FileC
                  |	|--Folder1B1	FileA
                  |	|	|--FileB
                  |	|	|--FileC
                  |	|	|--FileD
                  |	|--Folder1B2	FileA
                  |	|	|--FileB
                  

                  And the common replacement regex \r\n$3$2$3 or \r\n\3\2\3 :

                  • The DaveyD search regex (?<=[^\|])(\t)(.+$\r\n)([\|\t\-]+) works, with the Replace All button, ONLY

                  • The Dail search regex (?<!\|)(\t)(.+$\r\n)([|\t-]+) works with, either, the Replace All or the Replace button

                  • The Gerdb42 search regex (?<![|])(\t)(.+$\r\n)([\|\t\-]+) works with, either, the Replace All or the Replace button

                  • The Dail search regex [^\|]\K(\t)(.+$\r\n)([\|\t\-]+) works with the Replace All button, ONLY


                  Now, allow me to give you my own solution :

                  Find what : (?<!\|)\t(.+)(\R[|\t-]+)

                  Replace with : $2$1$2

                  Notes :

                  • As we don’t need the tabulation character, before FileA, we do not have to surround it with round brackets

                  • We move the End of line characters, after FileA, into the final group 2. So, we do not need the part \r\n, at the beginning of the replacement regex

                  • Inside the character class [|\t-], the escape character before the pipe character is unnecessary

                  • Inside the character class [|\t-], the escape character before the minus character isn’t mandatory either, as long as the minus sign begins or ends the class !
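
For anyone who wants to try the same search/replace outside Notepad++, here is a minimal Python sketch. It is a port under stated assumptions, not the Boost behaviour itself: Python's `re` has no `\R`, so `\r?\n` stands in for it, `[^\r\n]+` replaces `.+` so that a stray `\r` is never captured, and the function name `split_first_file` is hypothetical.

```python
import re

# Port of (?<!\|)\t(.+)(\R[|\t-]+); see the lead-in for the substitutions made.
PATTERN = re.compile(r"(?<!\|)\t([^\r\n]+)(\r?\n[|\t-]+)")

def split_first_file(tree: str) -> str:
    """Move the first file of each folder onto its own line."""
    # \2 is the line break plus the tree prefix of the following line,
    # \1 is the file name that was sharing a line with the folder.
    return PATTERN.sub(r"\2\1\2", tree)

tree = (
    "Folder1\n"
    "|--Folder1A\tFileA\n"
    "|\t|--FileB\n"
    "|\t|--Folder1B1\tFileA\n"
    "|\t|\t|--FileB\n"
)
fixed = split_first_file(tree)
print(fixed)
```

Because the line break sits inside group 2, the replacement needs no literal `\r\n`, which is the point of the second note above.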


                  Remark :

                  With the improved regex engine, built by François-R Boyer in June 2013 ( link below ), the search is done with true 32-bit code points instead of 16-bit ones, and DaveyD’s regex does work with the Replace button, too !

                  https://sourceforge.net/projects/npppythonplugsq/files/Beta N%2B%2B regex code/

                  Unfortunately, this improved regex engine, does not work anymore, since the 6.9.1 version of N++ :-((( Too sad !

                  So, to use it, just replace the current SciLexer.dll by the version of François-R Boyer, based on Scintilla 2.2.7.0, in a N++ version, prior to the 6.9.1 one !

                  Best Regards,

                  guy038

                  P.S :

                  If you copy/paste the folder tree structure, above, to do some tests, you must perform the following S/R, first :

                  Find what : \x20{1,4}

                  Replace with : \t

                  in order to change all the space characters into tabulation characters !
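
The same clean-up can be sketched in Python, in case the pasted text is being prepared by a script; the helper name `spaces_to_tabs` and the sample line are made up for illustration.

```python
import re

def spaces_to_tabs(text: str) -> str:
    # Each run of one to four spaces becomes a single tab, mirroring the
    # Find what / Replace with pair above.
    return re.sub(r"\x20{1,4}", "\t", text)

pasted = "|   |--Folder1B1    FileA"
restored = spaces_to_tabs(pasted)
print(repr(restored))
```

Note that a run of more than four spaces turns into more than one tab, so this only suits text where each tab was rendered as at most four spaces.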

                  • DaveyD

                    @dail thanks for reminding me about the \K - I’ve used it in the past - it comes in handy when needing to use +*{} operators on lookbehinds.

                    @guy038 - as always, thanks for the super-duper clarification! I’ve tried your expression as well and it works great!
                    It would be nice if we can get all the working regex pieces together… :)

                    Thanks to all!
                    David

                    • guy038

                      Hi DaveyD and All,

                      Ah, Yes ! I can apply the Dail’s \K syntax to my previous regex. So, finally, here are my two solutions :

                      Find what : (?<!\|)\t(.+)(\R[|\t-]+) which works with, either, the Replace All or the Replace button

                      OR

                      Find what : [^|]\t\K(.+)(\R[|\t-]+), which ONLY works with the Replace All button

                      For these two search regexes, the replacement regex is :

                      Replace with : $2$1$2

                      I don’t think that we can shorten them, anymore !

                      Notes :

                      • When your search regex contains a \K form, the step by step replacement never works !! That’s the normal behaviour !

                      • There are two cases, where the \K feature is mandatory and can NOT be replaced with lookbehinds :

                        • When the regex, inside the lookbehind, could match non-fixed length strings, as, for instance, the regex (?<=\d+)abc

                        • When the regex, inside the lookbehind contains alternatives, of different length, as, for instance, the regex (?<=(12|345|6789))abc

                      • So, in order to get valid regular expressions, you must change them, respectively, into the two, below, which include, both, the \K syntax :

                        • \d+\Kabc

                        • (12|345|6789)\Kabc

                      • By contrast, the two regexes (?<=\d{3})abc and (?<=(00|99))abc are valid. Inside the lookbehind, the former always matches exactly three digits and, in the latter, every alternative is exactly two digits long :-)

                      You may test these 4 regexes against this short example text, below :

                      00abc
                      12abc
                      345abc
                      6789abc
                      99abc
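
The fixed-length rule can also be checked with Python's `re` module, which happens to enforce a similar restriction. This is only an illustrative sketch: Python is not the Boost engine used by Notepad++ and has no `\K`, so the last step emulates `\K` with a capturing group that is written back in the replacement. The helper `compiles` and the replacement text `XYZ` are made up.

```python
import re

TEXT = "00abc\n12abc\n345abc\n6789abc\n99abc\n"

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

variable_width = [r"(?<=\d+)abc", r"(?<=(12|345|6789))abc"]
fixed_width = [r"(?<=\d{3})abc", r"(?<=(00|99))abc"]

print([compiles(p) for p in variable_width])  # rejected by Python's re
print([compiles(p) for p in fixed_width])     # accepted

# Number of "abc" occurrences each valid lookbehind finds in TEXT.
hits = [len(list(re.finditer(p, TEXT))) for p in fixed_width]

# Stand-in for \d+\Kabc: capture the digits and put them back unchanged.
kept = re.sub(r"(\d+)abc", r"\1XYZ", TEXT)
print(kept)
```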
                      

                      Cheers,

                      guy038
