Help with semi-complicated regex / Notepad++ regex issue



  • Hi, I need help with a regex
    I have a text file that contains a folder tree structure
    The problem is that the first file of each folder is listed on the same line as the folder
    Sample text (note that each “FileA” is preceded by a tab character):

    Folder1
    |--Folder1A	FileA
    |	|--FileB
    |	|--FileC
    |	|--FileD
    |--Folder1B	FileA
    |	|--FileB
    |	|--FileC
    |	|--Folder1B1	FileA
    |	|	|--FileB
    |	|	|--FileC
    |	|	|--FileD
    |	|--Folder1B2	FileA
    |	|	|--FileB
    

    What I thought to do was as follows:

    1. find a tab character that is not preceded by a pipe (|) --> put that into group 1
    2. put the remaining text on that line --> group 2
    3. the folder level of the next line --> group 3

    I was able to search that with the following:

    (?<=[^\|])(\t)(.+$\r\n)([\|\t\-]+)
    

    Then, I wanted to replace that with:

    \r\n$3$2
    

    Although I thought, and still think, this should work, Notepad++ did not replace anything! It finds the text that I want, but when I press Replace, it doesn’t do anything!
    I don’t know if this is a problem with the regex or N++, but it seems weird. Just to mention, this is not the first time I’ve had this issue where N++ wouldn’t replace anything.

    Windows 10 64bit
    Notepad++ 6.9.1

    Any help will be appreciated!
    Thanks,
    David



  • Definitely seems like a bug using the positive look behind. Doesn’t allow replacing the match at all.

    I managed to change (?<=[^\|]) into (?<!\|) and it seemed to actually allow replacing the match. I think the regex will need to be tweaked some more but you seem to know what you are doing. :)



    I find lookbehinds and lookaheads confusing and almost never use them. For this problem a simple search and replace works.

    Search for: ^([^\t\r\n]+)\t(.+)\r\n([| -]+)
    Replace with: \1\r\n\3\2\r\n\3

    Where:
    \1 contains the folder name with its preceding “^[| -]+” characters.
    \2 contains the “FileA” part
    \3 contains the “^[| -]+” characters preceding the “FileB” part.



  • So by putting all answers together we’ll get (?<![|])(\t)(.+$\r\n)([\|\t\-]+) for searching and \r\n$3$2$3 for replacement.
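
    To see why the trailing $3 matters, here is a rough Python sketch rather than Notepad++ itself. It assumes CRLF line endings and Python's `re` module, where `.+$\r\n` does not behave as it does in Notepad++, so `[^\r\n]+\r\n` is swapped in for group 2. The names `SEARCH` and `sample` are made up for illustration.

```python
import re

# Adaptation for Python's re: [^\r\n]+\r\n stands in for the thread's .+$\r\n.
SEARCH = re.compile(r"(?<![|])(\t)([^\r\n]+\r\n)([|\t-]+)")

sample = "|--Folder1A\tFileA\r\n|\t|--FileB\r\n"

# Original replacement: group 3 is consumed once, so FileB loses its prefix.
without_last_group = SEARCH.sub(r"\r\n\3\2", sample)

# Replacement with group 3 repeated at the end: the prefix is written back.
with_last_group = SEARCH.sub(r"\r\n\3\2\3", sample)

print(repr(without_last_group))
print(repr(with_last_group))
```

    Group 3 is part of the match, so it is removed from in front of FileB unless the replacement emits it a second time.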



    @dail - thanks :) . That is very interesting. It seems like the bug is only in positive look behind.

    @AdrianHHH, your search needed some tweaking in the last part because the pipe and the dash needed to be escaped. However, even after that it didn’t work as expected because it didn’t only find the lines with a tab in the middle. I.e., it found every line that started with a pipe and a tab.
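
    The over-matching can be reproduced outside Notepad++ with a small Python sketch. It assumes CRLF line endings and uses `re.MULTILINE` so that `^` matches at each line start, as it does in Notepad++. The names `SIMPLE` and `sample` are hypothetical.

```python
import re

# The simple search from the thread, unchanged; MULTILINE makes ^ match
# at the start of every line.
SIMPLE = re.compile(r"^([^\t\r\n]+)\t(.+)\r\n([| -]+)", re.MULTILINE)

sample = (
    "Folder1\r\n"
    "|--Folder1A\tFileA\r\n"
    "|\t|--FileB\r\n"
    "|\t|--FileC\r\n"
    "|\t|--FileD\r\n"
)

# Collect what group 1 captured for every match.
first_groups = [m.group(1) for m in SIMPLE.finditer(sample)]
print(first_groups)
```

    The second match has a bare pipe as group 1: a line such as `|<tab>|--FileC` also has the shape "non-tab text, tab, rest of line", so it is picked up alongside the folder lines.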

    @gerdb42, this works! Basically, you put dail’s search tweak, together with the last $3 that I was missing in the replacement (which I hadn’t had a chance to test since the replace wasn’t working… :) )

    Thanks guys
    David



  • @AdrianHHH @DaveyD

    I’ve also had luck using \K instead of look behinds. It sets the cursor at that position. For example the original RE posted (obviously it still needed work at that point):

    (?<=[^\|])(\t)(.+$\r\n)([\|\t\-]+)
    

    Would become:

    [^\|]\K(\t)(.+$\r\n)([\|\t\-]+)
    

    That way whatever [^\|] matches isn’t actually selected for replacement. To quote the boost documentation about \K

    \K Resets the start location of $0 to the current text position: in other words everything to the left of \K is “kept back” and does not form part of the regular expression match.
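
    For anyone testing outside Notepad++: Python's standard `re` module has no \K, so the "kept back" part can only be approximated there. A minimal sketch, assuming "\n" or "\r\n" line endings and using a made-up pattern name `EMULATED_K`, captures the left-hand character and writes it back in the replacement:

```python
import re

# No \K in Python's re: the character before the tab is captured as
# group 1 and re-emitted, instead of being excluded from the match.
EMULATED_K = re.compile(r"([^|])\t([^\r\n]+)(\r?\n[|\t-]+)")

sample = "|--Folder1A\tFileA\n|\t|--FileB\n"

# \1 restores the kept character; \3\2\3 moves FileA to its own line.
result = EMULATED_K.sub(r"\1\3\2\3", sample)
print(repr(result))
```

    With \K the engine simply leaves that character out of the match, so no extra group is needed in the replacement.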



  • Hello DaveyD, Dail, AdrianHHH, gerdb42 and All,

    DaveyD, if you just click on the Replace All button ( instead of several hits on the Replace button ), your regex does the job correctly !

    We, also, get the same behaviour with the final search regex of Dail, built with the \K syntax

    So, to sum up : With the given example text, below, without any space inside :

    Folder1
    |--Folder1A	FileA
    |	|--FileB
    |	|--FileC
    |	|--FileD
    |--Folder1B	FileA
    |	|--FileB
    |	|--FileC
    |	|--Folder1B1	FileA
    |	|	|--FileB
    |	|	|--FileC
    |	|	|--FileD
    |	|--Folder1B2	FileA
    |	|	|--FileB
    

    And the common replacement regex \r\n$3$2$3 or \r\n\3\2\3 :

    • The DaveyD search regex (?<=[^\|])(\t)(.+$\r\n)([\|\t\-]+) works, with the Replace All button, ONLY

    • The Dail search regex (?<!\|)(\t)(.+$\r\n)([|\t-]+) works with, either, the Replace All or the Replace button

    • The Gerdb42 search regex (?<![|])(\t)(.+$\r\n)([\|\t\-]+) works with, either, the Replace All or the Replace button

    • The Dail search regex [^\|]\K(\t)(.+$\r\n)([\|\t\-]+) works with the Replace All button, ONLY


    Now, allow me to give you my own solution :

    Find what : (?<!\|)\t(.+)(\R[|\t-]+)

    Replace with : $2$1$2

    Notes :

    • As we don’t need the tabulation character, before FileA, we do not have to surround it with round brackets

    • We move the End of line characters, after FileA, into the final group 2. So, we do not need the part \r\n, at the beginning of the replacement regex

    • Inside the class range [|\t-], the escape character, before the pipe character, is useless

    • Inside the class range [|\t-], the escape character, before the minus character isn’t mandatory, too, if this minus sign begins or ends the class range !
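
    As a rough cross-check of this search and replacement outside Notepad++, here is a Python sketch. Two adaptations are assumptions on my side: `\R` does not exist in Python's `re`, so `\r?\n` is used, and `[^\r\n]+` replaces `.+` because Python's dot also matches a carriage return. `SEARCH`, `REPLACEMENT` and `tree` are illustrative names.

```python
import re

# Adapted from: (?<!\|)\t(.+)(\R[|\t-]+)  with replacement  $2$1$2
SEARCH = re.compile(r"(?<!\|)\t([^\r\n]+)(\r?\n[|\t-]+)")
REPLACEMENT = r"\2\1\2"

tree = (
    "Folder1\n"
    "|--Folder1B\tFileA\n"
    "|\t|--FileB\n"
    "|\t|--Folder1B1\tFileA\n"
    "|\t|\t|--FileB\n"
)

# Group 2 (line break plus the next line's prefix) is emitted twice:
# once in front of FileA and once in front of the following file.
fixed = SEARCH.sub(REPLACEMENT, tree)
print(fixed)
```

    Each FileA inherits the indentation prefix of the line below it, which is exactly the folder level it belongs to.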


    Remark :

    With the improved regex engine, built by François-R Boyer in June 2013 ( link below ), the search is done with true 32-bit code points instead of 16-bit ones, so DaveyD’s regex does work with the Replace button, too !

    https://sourceforge.net/projects/npppythonplugsq/files/Beta N%2B%2B regex code/

    Unfortunately, this improved regex engine, does not work anymore, since the 6.9.1 version of N++ :-((( Too sad !

    So, to use it, just replace the current SciLexer.dll by the version of François-R Boyer, based on Scintilla 2.2.7.0, in a N++ version, prior to the 6.9.1 one !

    Best Regards,

    guy038

    P.S :

    If you copy/paste the folder tree structure, above, to do some tests, you must perform the following S/R, first :

    Find what : \x20{1,4}

    Replace with : \t

    in order to change all the space characters into tabulation characters !
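
    The same space-to-tab conversion, sketched in Python for anyone preparing the test text in a script. The `pasted` string is an invented example line with three and four spaces where the tabs used to be.

```python
import re

# One to four consecutive spaces become a single tab, as in the S/R above.
pasted = "|   |--Folder1B1    FileA"
restored = re.sub(r"\x20{1,4}", "\t", pasted)
print(repr(restored))
```

    Note that a run of more than four spaces would be split into several tabs, so this only suits text where each tab was rendered as at most four spaces.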



  • @dail thanks for reminding me about the \K - I’ve used it in the past - it comes in handy when needing to use +*{} operators on lookbehinds.

    @guy038 - as always, thanks for the super-duper clarification! I’ve tried your expression as well and it works great!
    It would be nice if we can get all the working regex pieces together… :)

    Thanks to all!
    David



  • Hi DaveyD and All,

    Ah, Yes ! I can apply the Dail’s \K syntax to my previous regex. So, finally, here are my two solutions :

    Find what : (?<!\|)\t(.+)(\R[|\t-]+) which works with, either, the Replace All or the Replace button

    OR

    Find what : [^|]\t\K(.+)(\R[|\t-]+), which ONLY works with the Replace All button

    For these two search regexes, the replacement regex is :

    Replace with : $2$1$2

    I don’t think that we can shorten them, anymore !

    Notes :

    • When your search regex contains a \K form, the step by step replacement never works !! That’s the normal behaviour !

    • There are two cases, where the \K feature is mandatory and can NOT be replaced with lookbehinds :

      • When the regex, inside the lookbehind, could match non-fixed length strings, as, for instance, the regex (?<=\d+)abc

      • When the regex, inside the lookbehind contains alternatives, of different length, as, for instance, the regex (?<=(12|345|6789))abc

    • So, in order to get valid regular expressions, you must change them, respectively, into the two, below, which include, both, the \K syntax :

      • \d+\Kabc

      • (12|345|6789)\Kabc

    • On the contrary, for instance, the two regexes (?<=\d{3})abc and (?<=(00|99))abc are quite valid ones. Indeed, inside the lookbehind, the former refers to a three-digit number only and, in the latter, each alternative refers to a two-digit number :-)

    You may test these 4 regexes against this short example text, below :

    00abc
    12abc
    345abc
    6789abc
    99abc
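
    The fixed-length rule can also be observed in Python, whose `re` module rejects variable-length lookbehinds in a similar way. This is only an analogy with a different engine, not a test of Notepad++. Since `re` has no \K, the last line uses a capture group instead; the helper names and the `XYZ` replacement text are made up.

```python
import re

samples = ["00abc", "12abc", "345abc", "6789abc", "99abc"]

def compiles(pattern):
    """Return True if Python's re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

def matching(pattern):
    """List the sample lines in which the pattern is found."""
    return [s for s in samples if re.search(pattern, s)]

# Variable-length lookbehinds: rejected at compile time.
print(compiles(r"(?<=\d+)abc"))
print(compiles(r"(?<=(12|345|6789))abc"))

# Fixed-length lookbehinds: accepted.
print(matching(r"(?<=\d{3})abc"))
print(matching(r"(?<=(00|99))abc"))

# Capture-group stand-in for \d+\Kabc: the digits are written back.
print(re.sub(r"(\d+)abc", r"\1XYZ", "6789abc"))
```

    The three-digit lookbehind also finds abc in 6789abc, because only the three characters immediately before abc are checked.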
    

    Cheers,

    guy038

