Translate regex to notepad++'s dialect and 'Reverse expression'

litos81

Hello,
I have a file where each line contains a (not too good) concatenation of rules - both the features within each rule and the rules themselves are separated by commas ,:

SepticTank#No|Non-Compliant, SepticTank#Yes, StoredEffluent#No, ApproxAge#Pre1991|Advisory, ApproxAge#Pre1991, SepticTank#Yes|Advisory

In the previous line there are 3 rules:
SepticTank#No|Non-Compliant
SepticTank#Yes, StoredEffluent#No, ApproxAge#Pre1991|Advisory
ApproxAge#Pre1991, SepticTank#Yes|Advisory

Every rule always ends with either |Advisory or |Non-Compliant

My goal is to remove from every line any rule that doesn’t include a specific feature. For example, if I’m interested in rules that include the feature ApproxAge, I’d like the previous line to end up being:

SepticTank#Yes, StoredEffluent#No, ApproxAge#Pre1991|Advisory, ApproxAge#Pre1991, SepticTank#Yes|Advisory

(I removed the rule SepticTank#No|Non-Compliant)

I’m not an expert with regex so I’ve been trying different expressions in regex101.com and I came up with this:
(?<=Advisory,|Non-Compliant,)[^\|]*ApproxAge.*?\|[\w-,]+|^[^\|]*ApproxAge.*?\|[\w-,]+

It is not perfect but at least marks the rules I’m interested in. However, I have two problems:
1- I’m not able to translate that expression to notepad++'s regex engine.
2- Actually I’d need the reverse of that expression, so that I can replace the rules I don’t want with (Empty) in the Replace… dialog

Example:

SepticTank#No|Non-Compliant, SepticTank#Yes, StoredEffluent#No, ApproxAge#Pre1991|Advisory, ApproxAge#Pre1991, SepticTank#Yes|Advisory
ApproxAge#Post1991, SepticTank#No|Advisory, SepticTank#No, ApproxAge#Pre1991|Non-Compliant, SepticTank#No|Non-Compliant, ApproxAge#Pre1991, SepticTank#Yes|Advisory,  SepticTank#No, StoredEffluent#No, ApproxAge#Pre1991|Non-Compliant

After removing rules that do not contain the feature ApproxAge:

SepticTank#Yes, StoredEffluent#No, ApproxAge#Pre1991|Advisory, ApproxAge#Pre1991, SepticTank#Yes|Advisory
ApproxAge#Post1991, SepticTank#No|Advisory, SepticTank#No, ApproxAge#Pre1991|Non-Compliant, ApproxAge#Pre1991, SepticTank#Yes|Advisory,  SepticTank#No, StoredEffluent#No, ApproxAge#Pre1991|Non-Compliant

Any help would be very much appreciated.
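
For readers who want to check results outside Notepad++, the goal described above can be written as a small Python sketch. It is only a reference for the expected output, not a Notepad++ solution, and the helper names (split_rules, has_feature, keep_rules_with) are made up for this illustration. It assumes every rule ends with |Advisory or |Non-Compliant and that | appears nowhere else.

```python
import re

# A rule is everything up to and including its "|Advisory" or "|Non-Compliant" tag.
RULE = re.compile(r"[^|]+\|(?:Advisory|Non-Compliant)")

def split_rules(line):
    """Return the rules of one line, without the separating commas and spaces."""
    return [m.group(0).strip(" ,") for m in RULE.finditer(line)]

def has_feature(rule, feature):
    """True if one of the rule's feature#value items belongs to `feature`."""
    items = rule.split("|")[0].split(",")
    return any(item.strip().startswith(feature + "#") for item in items)

def keep_rules_with(line, feature):
    """Drop every rule of `line` that does not mention `feature`."""
    return ", ".join(r for r in split_rules(line) if has_feature(r, feature))

line = ("SepticTank#No|Non-Compliant, SepticTank#Yes, StoredEffluent#No, "
        "ApproxAge#Pre1991|Advisory, ApproxAge#Pre1991, SepticTank#Yes|Advisory")
print(keep_rules_with(line, "ApproxAge"))
```

Note that the sketch re-joins the kept rules with a single ", ", so any double spaces in the input are normalised.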

Terry R

@litos81 said in Translate regex to notepad++'s dialect and 'Reverse expression':

Any help would be very much appreciated.

Yes your regex is not quite correct for the Notepad++ (NPP) engine. I tried to decipher it and adjust to work within Notepad++ and came up with this which does select some various sections in the 2 example lines.
Have a look at my interpretation and see if it does appear to select what you intended.
(?<=Advisory,|ompliant,)[^\|]*ApproxAge.*?\|[\w\-,]+|^[^\|]*ApproxAge.*?\|[\w\-,]+

Note I had to shorten Non-Compliant because in NPP all the alternatives inside a lookbehind must have the same “fixed” length. Yours were 9 vs 14 characters (Advisory, vs Non-Compliant,), so they do NOT comply. See
https://community.notepad-plus-plus.org/topic/19006/regex-positive-look-behind-with/6
I also changed your [\w-,] to [\w\-,]. Inside a character class the - denotes a range (as in A-Za-z), so it has a special meaning and needs to be escaped to stand for the literal character. You could also have used [-\w,], because when the - is in the first position it is the character -, not the range operator.
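
Both points can be tried out quickly in Python. This is only a sketch: Python's re module is a different engine from the Boost one inside Notepad++, but it happens to impose a similar fixed-width restriction on look-behinds. The helper name compiles is made up for the illustration.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# The two original look-behind alternatives differ in length.
print(len("Advisory,"), len("Non-Compliant,"))             # 9 14

# Different lengths are rejected, equal lengths are accepted.
print(compiles(r"(?<=Advisory,|Non-Compliant,)x"))         # False
print(compiles(r"(?<=Advisory,|ompliant,)x"))              # True

# Both spellings of the character class match a literal hyphen.
print(bool(re.fullmatch(r"[\w\-,]+", "Non-Compliant,")))   # True
print(bool(re.fullmatch(r"[-\w,]+", "Non-Compliant,")))    # True
```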

So as it stands my interpretation does select some text, however is it the same as you wished? Of course this may not be any good as you say you really wanted the reverse of this anyways.

I think you need to better identify what you are attempting. If need be just go with one option you want to alter and fully explain that, in terms of what needs identifying and the parameters for it before you select and I guess remove if that seems to be your intention. Then repeat the explanation for the other options you also wish to change.

Terry

PS You say you are not an expert in regex, could have fooled me. This is quite a complex problem and you did actually solve it, albeit using a different regex engine. So good on you. We do welcome posters having attempted to solve it, coming here when they have hit the wall. Here’s hoping together we can help you!

litos81

@Terry-R said in Translate regex to notepad++'s dialect and 'Reverse expression':

(?<=Advisory,|ompliant,)[^\|]*ApproxAge.*?\|[\w\-,]+|^[^\|]*ApproxAge.*?\|[\w\-,]+

@Terry-R thanks so much for your reply.

The expression marks a few of the rules but not all the ones I wanted to keep. I’ll try to explain myself better, but please bear with me as I’m really bad at explaining stuff even in my own language.

I have a file with multiple lines, each of them containing features related to silage/slurry pits and septic tanks. Those features and their values (feature#value) are part of compliance rules that can contain one or more features. Every rule ends with either |Advisory or |Non-Compliant. In the example below there are three rules.

SepticTank#No|Non-Compliant, SepticTank#Yes, StoredEffluent#No, ApproxAge#Pre1991|Advisory, ApproxAge#Pre1991, SepticTank#Yes|Advisory

The 3 rules are:
SepticTank#No|Non-Compliant
SepticTank#Yes, StoredEffluent#No, ApproxAge#Pre1991|Advisory
ApproxAge#Pre1991, SepticTank#Yes|Advisory

I need to remove all the rules that do not contain the feature ApproxAge in them, so the line above should end up being:

SepticTank#Yes, StoredEffluent#No, ApproxAge#Pre1991|Advisory, ApproxAge#Pre1991, SepticTank#Yes|Advisory

(The rule SepticTank#No|Non-Compliant was removed because it doesn’t contain the feature ApproxAge)

Another example:

Before:

ApproxAge#Post1991, SepticTank#No|Advisory, SepticTank#No, ApproxAge#Pre1991|Non-Compliant, SepticTank#No|Non-Compliant, ApproxAge#Pre1991, SepticTank#Yes|Advisory,  SepticTank#No, StoredEffluent#No|Non-Compliant

After:

ApproxAge#Post1991, SepticTank#No|Advisory, SepticTank#No, ApproxAge#Pre1991|Non-Compliant, ApproxAge#Pre1991, SepticTank#Yes|Advisory

(The rules SepticTank#No|Non-Compliant and SepticTank#No, StoredEffluent#No|Non-Compliant were removed because they do not contain ApproxAge)

I started to solve the problem by trying to find the rules I wanted to keep.
Using your modified version (?<=Advisory,|ompliant,)[^\|]*ApproxAge.*?\|[\w\-,]+|^[^\|]*ApproxAge.*?\|[\w\-,]+ Notepad++ marks some of the rules I need to keep, but regex101.com still marks more (I guess it is the global/multiline modifiers I have selected in the website, but I’m not sure)

Notepad++ after ‘Marking All’ only marks one rule:

SepticTank#No|Non-Compliant, SepticTank#Yes, StoredEffluent#No, ApproxAge#Pre1991|Advisory, ApproxAge#Pre1991, SepticTank#Yes|Advisory

regex101.com correctly marks all the rules I want to keep (with /gm flags):
SepticTank#No|Non-Compliant, SepticTank#Yes, StoredEffluent#No, ApproxAge#Pre1991|Advisory, ApproxAge#Pre1991, SepticTank#Yes|Advisory

However, I realised that even if I get that pattern working it is not what I need because I don’t think there is an option in Notepad++ to delete what is not marked (I’m only aware of deleting entire lines not bookmarked)

So I guess that what I need is a regex pattern that excludes what I need to keep, in other words, the opposite of the regex I was trying to fix. Then I would be able to ‘Replace All’ with (Empty). I read about ?! but I’m totally incapable of negating that whole pattern with ?! I guess it is not as simple as (?!((?<=Advisory,|ompliant,)[^\|]*ApproxAge.*?\|[\w\-,]+|^[^\|]*ApproxAge.*?\|[\w\-,]+))

Thanks again for your time

Terry R

@litos81 said in Translate regex to notepad++'s dialect and 'Reverse expression':

Notepad++ to delete what is not marked

Well, in NPP the modifier that lets . match across lines (generally) goes at the start of a regex and is (?s). However, you appear to show the rules as NOT crossing lines and you use .*? which is a lazy (non-greedy) match, so I don’t think the regex will cross lines. I’m not actually on a PC currently so can’t verify my assumptions.

As the rules appear to be contained within lines, yet each line appears to have 3 independent rules (which you are okay to remove some portions of), why not convert each rule to a separate line? It appears we first look for a | followed by a number of characters which are covered by your group [\w-,]. The very next character can get replaced by \r\n, which is the general method of inserting a carriage return and line feed. Then you can use the “Mark” function and select a line based on it having the text “ApproxAge” in it. As you are aware that you can remove “unmarked” lines, would that suffice?

The example lines are hopefully a good representation of the real data, otherwise we may need to see better examples. If necessary hide or otherwise alter some of the data if it is of a sensitive or confidential nature.

Terry

guy038

Hello @litos81, @terry-R and All

Just do a test with the following regex S/R :

SEARCH (?-si)(\|(Advisory|Non-Compliant)|^)\K(,)?\h*((?!ApproxAge).)+?\|(?2)(,\x20)?
REPLACE (?3(?5,\x20))

Notes :

The regex catches the smallest non-null zone of any standard character ( in-line modifier (?-s) ) ending with, either, the string |Advisory or |Non-Compliant, with this exact case ( in-line modifier (?-i) ), as well as the possible comma, before and the possible string comma + space char, after, and which does not contain the exact string ApproxAge
In replacement, this zone is deleted and the string comma + space is rewritten ONLY IF a comma is present, both, before ( existing group 3 ) and after ( existing group 5 ), the zone to delete, due to the conditional replacement syntax !
The (?2) is a sub-routine call to group 2 which represents the simple regex Advisory|Non-Compliant, needed to end any searched zone
Of course, you may change the ApproxAge feature with any other one, just inserting its name in the negative look-ahead ( (?!......) ) which is tested at any location of a zone, after the first non-blank char, till the string |Advisory or |Non-Compliant
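
For anyone who wants to experiment with the same idea outside Notepad++, here is an approximate Python translation. It is a sketch, not an exact equivalent: Python's re module has no \K, \h, (?2) or conditional replacement, so fixed-width look-behinds stand in for the \K part, [ \t] stands in for \h, the alternation is written out a second time, and a small function plays the role of the conditional replacement. The names PATTERN and replace_all_once are made up, and the group numbers differ from the Notepad++ version (groups 1 and 2 here correspond to groups 3 and 5 there).

```python
import re

PATTERN = re.compile(r"""
    (?: (?<=\|Advisory) | (?<=\|Non-Compliant) | ^ )   # stands in for (...)\K
    (,)?                             # group 1: possible comma before the rule
    [ \t]*                           # stands in for \h*
    (?: (?!ApproxAge) . )+?          # shortest zone that never reaches ApproxAge
    \| (?:Advisory|Non-Compliant)    # written out instead of the (?2) call
    (,\x20)?                         # group 2: possible comma + space after the rule
""", re.VERBOSE | re.MULTILINE)

def _replacement(match):
    # Keep one comma + space only if a comma was found before AND after the rule.
    return ", " if match.group(1) and match.group(2) else ""

def replace_all_once(text):
    """One pass over the text, comparable to a single 'Replace All' click."""
    return PATTERN.sub(_replacement, text)

line = ("SepticTank#No|Non-Compliant, SepticTank#Yes, StoredEffluent#No, "
        "ApproxAge#Pre1991|Advisory, ApproxAge#Pre1991, SepticTank#Yes|Advisory")
print(replace_all_once(line))
```

On the first example line of this thread the sketch removes only the rule SepticTank#No|Non-Compliant. Edge cases may behave differently from Notepad++, since a look-behind is not the same thing as \K.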

Best Regards,

guy038

litos81

@guy038 said in Translate regex to notepad++'s dialect and 'Reverse expression':

Just do a test with the following regex S/R :

@guy038 Thank you so much for your time and your solution to the problem.
Your regex works as expected and it even takes good care of spaces and commas in the replacements. I’ll admit that I have yet to digest the expression…

@Terry-R thanks for your suggestion on breaking the rules into separate lines, I was so focused on trying to find an expression that I forgot to ‘break-up’ the problem. I was trying to record a macro with all the steps involved (create new blank lines between existing lines, break-up rules in separate lines, mark the lines ApproxAge|\n\s*\n to bookmark, inverse bookmark, remove bookmarked lines and put back together the remaining rules in lines…) I think it would have worked but I will use @guy038’s solution as it requires fewer modifications.

Thanks again to both of you for your time (especially being the weekend!). I was thinking in donating to the notepad++ project or perhaps any other nonprofit organisation or charity you may want to suggest as a humble payment for your efforts.

Kind Regards,
Carlos

guy038

Hi @litos81, @terry-R and All

In free-spacing mode, the search regex could be written as below :

(?x)                                      # Regex FREE-SPACING mode  ( SPACES are irrelevant, except if ESCAPED or the [ ] syntax )
                                          # Everything located AFTER the # character is IGNORED, too
(?-s)                                     # The . REGEX symbol matches a SINGLE STANDARD char, ONLY
(?-i)                                     # Search SENSITIVE to CASE
(  \|  (Advisory|Non-Compliant)  |  ^  )  # LEADING string "|Advisory" OR "|Non-Compliant" ( Group 2 ) OR ^ ( START of line )
\K                                        # CANCELS any match, so far and RESETS the engine working position
(,)?\h*                                   # A POSSIBLE COMMA ( Group 3 ), followed with POSSIBLE HORIZONTAL blank chars
( (?!ApproxAge). )+?                      # Any SMALLEST zone, NOT containing the string "ApproxAge", till...
\|                                        # A PIPE character and...
(?2)                                      # String "Advisory" OR "Non-Compliant" ( subroutine call to Group 2 ) and...
(,\x20)?                                  # A POSSIBLE ENDING string ", " ( Group 5 )

Simply, select all these lines and use the Ctrl + H shortcut
Add the Replace zone (?3(?5,\x20))

Cheers,

guy038

P.S. :

You may consider to donate to the N++ project ;-))

Oh… I forgot to mention that, because of the \K syntax, in the search regex, you need to use the Replace All button, ONLY, for an effective replacement !!

guy038

@litos81, @terry-R and All

You may have been intrigued by the (?2) syntax !

One example :

The regex (^|(ABC|DEF|GHI|JKL|MNO|PQR|STU|VWX))\d+((?2)|$) does match the four lines below :

ABC1234
DEF789743STU
11100001111JKL
0123456789

Whereas the regex (^|(ABC|DEF|GHI|JKL|MNO|PQR|STU|VWX))\d+(\2|$) just matches the first and last line !

Why ? Well, when we use the second regex, with the \2 syntax, which is a back-reference to the group 2 :

In the first line, the number 1234 is preceded by ABC and ends the line, matching the $ assertion => No problem
In the last line, the number 0123456789 begins and ends the line => No problem too, as the assertions ^ and $ are both verified
In the second line, the number 789743 is preceded by DEF, stored as group 2, but ends with STU. This cannot match the last part (\2|$) of the regex, as \2 represent the present value, stored in group 2, which is DEF !
In the third line, the number 11100001111 begins the line and matches the ^ assertion. Then, the group 2 is not defined. So, any further \2 syntax is not correct for our Boost regex engine ! As the number is followed with JKL, it does not match the $ assertion, too !

Now, in the first regex, the subroutine call syntax (?2) is, indeed, a subroutine to the regex itself, contained in group 2, that is to say, the regex ABC|DEF|GHI|JKL|MNO|PQR|STU|VWX

This means that it’s just as if you had written the regex :

(^|(ABC|DEF|GHI|JKL|MNO|PQR|STU|VWX))\d+(ABC|DEF|GHI|JKL|MNO|PQR|STU|VWX|$)

Which, of course, matches the four lines above !

So, if a long sub-regex is repeated, one or several times, in the overall regex, this syntax may reduce significantly the regex length. But, on the other hand, this syntax is more difficult to grasp !
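
The difference can also be reproduced with a short Python sketch. Python's re module has no (?2) subroutine call, so the sub-regex is simply inserted a second time, which is exactly what the call stands for. The variable names are made up for the illustration.

```python
import re

NAMES = "ABC|DEF|GHI|JKL|MNO|PQR|STU|VWX"

# \2 must repeat the very text that group 2 captured.
backref = re.compile(rf"(^|({NAMES}))\d+(\2|$)")
# No (?2) in Python's re, so the pattern of group 2 is pasted in again.
expanded = re.compile(rf"(^|({NAMES}))\d+({NAMES}|$)")

lines = ["ABC1234", "DEF789743STU", "11100001111JKL", "0123456789"]

print([bool(backref.search(s)) for s in lines])    # [True, False, False, True]
print([bool(expanded.search(s)) for s in lines])   # [True, True, True, True]
```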

Best Regards,

guy038

Terry R

@litos81 said in Translate regex to notepad++'s dialect and 'Reverse expression':

I was thinking in donating to the notepad++ project

I’m glad you are satisfied with the solution @guy038 has provided. In terms of donating here’s an actual link to the NPP home website with the donation page.
https://notepad-plus-plus.org/donate/
if you do feel so inclined.

As it was the weekend I was in an information-gathering state. Once on a computer to do the tests, I would have looked at a different type of solution, possibly the one I mentioned that puts each rule on a different line. I note you suggest the need to re-combine the rules again after removal of those unwanted ones. Can I take it from your statement that each line is actually a set of 3 “dependent” rules, perhaps about 1 subject or location?

I will give you my concept, obviously I’m not about to put this into a regex as you have a solution which works for you.

1. Number each line using the “column editor”
2. The line number in step 1 is copied to each | character. This will allow for easier re-combining after the unwanted rules have been removed.
3. Use a “Mark” function to mark either those rules you want to keep (generally an easier to code regex) or those unwanted.
4. Remove either the marked or unmarked lines depending on how the regex was coded in step 3.
5. Re-combine the rules according to the number given each rule (from the line number issued in step 1)
6. Tidy up the lines, especially removal of the line number inserted to each rule.
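
The steps above can be sketched in Python to show the idea of tagging each rule with its line number before filtering. This is only an illustration of the concept, not something Notepad++ runs; the function name filter_lines is made up, and it assumes every rule ends with |Advisory or |Non-Compliant.

```python
import re

RULE = re.compile(r"[^|]+\|(?:Advisory|Non-Compliant)")

def filter_lines(text, feature):
    lines = text.splitlines()
    # Steps 1-2: split every line into rules, each tagged with its line number.
    tagged = []
    for number, line in enumerate(lines, start=1):
        for m in RULE.finditer(line):
            tagged.append((number, m.group(0).strip(" ,")))
    # Steps 3-4: keep only the rules that mention the feature.
    kept = [(n, rule) for n, rule in tagged if (feature + "#") in rule]
    # Steps 5-6: re-combine the rules by line number and drop the tags.
    result = []
    for number in range(1, len(lines) + 1):
        result.append(", ".join(rule for n, rule in kept if n == number))
    return "\n".join(result)

sample = "A#1|Advisory, B#2|Non-Compliant\nB#1, A#2|Advisory\nB#3|Advisory"
print(filter_lines(sample, "A"))
```

A line whose rules are all removed stays in the output as an empty line, so the line count does not change.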

As you see, the ability to “break up” a large complex problem such as this may sometimes allow you to see the way forward rather than attempt to treat the whole problem in 1 step.

best wishes
Terry

litos81

@guy038 said in Translate regex to notepad++'s dialect and 'Reverse expression':

Oh… I forgot to mention that, because of the \K syntax, in the search regex, you need to use the Replace All button, ONLY, for an effective replacement !!

Hi, I just wanted to say that while the expression works I have to click the ‘Replace All’ at least a couple of times because in some cases the expression seems to miss rules that should be excluded. I’m perfectly fine with that, I don’t mind clicking the button several times but I just wanted to let you know. And if the expression was meant to be used with one click only I honestly don’t know how to fix it (I tried to add a $ to the last bit (,\x20|$)? but it didn’t change the output)

ApproxAge#Post1991|Advisory, Compliant#No, NotCompliantReason#Mechanical|Advisory, Compliant#No, NotCompliantReason#DischargeCapable|Advisory
ApproxAge#Pre1991|Advisory, Compliant#No, NotCompliantReason#Mechanical|Advisory, SepticTank#No|Non-Compliant
SepticTank#No, StoredEffluent#No, ApproxAge#Pre1991|Advisory, Compliant#No, NotCompliantReason#Mechanical|Advisory, SepticTank#No|Advisory

The button ‘Replace All’ needs to be clicked twice to remove the unwanted rules in the previous lines.

Regards,
Carlos

Alan Kilborn

@litos81 said in Translate regex to notepad++'s dialect and 'Reverse expression':

I have to click the ‘Replace All’ at least a couple of times because in some cases the expression seems to miss rules that should be excluded.

if the expression was meant to be used with one click only I honestly don’t know how to fix it

Sometimes, due to the nature of the data, this situation results.
Also sometimes, the search/replace expressions can be rewritten to be a “one-click” solution, sometimes not.
Often the complexity involved in such a rewrite is, well, extreme and “not worth it” – meaning not putting the time required to figure it out into it.

But, just in case you thought it was some kind of “bug”…it’s not.
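
The “click until nothing changes” behaviour can be imitated with a small Python sketch. It uses an approximate translation of the search regex from this thread (look-behinds instead of \K, [ \t] instead of \h, the alternation written out instead of (?2)), so it is an assumption-laden illustration rather than the Notepad++ engine itself; the function names are made up. The likely reason for the extra pass is that the tag and the ", " closing one removed rule are already consumed by that match, so a rule that follows immediately has nothing left to anchor on until the next pass.

```python
import re

PATTERN = re.compile(
    r"(?:(?<=\|Advisory)|(?<=\|Non-Compliant)|^)"
    r"(,)?[ \t]*(?:(?!ApproxAge).)+?\|(?:Advisory|Non-Compliant)(,\x20)?",
    re.MULTILINE,
)

def replace_all_once(text):
    """One pass, comparable to a single 'Replace All' click."""
    return PATTERN.sub(lambda m: ", " if m.group(1) and m.group(2) else "", text)

def replace_until_stable(text, max_passes=20):
    """Repeat the pass until the text stops changing; return text and pass count."""
    passes = 0
    while passes < max_passes:
        new_text = replace_all_once(text)
        if new_text == text:
            break
        text, passes = new_text, passes + 1
    return text, passes

line = ("ApproxAge#Post1991|Advisory, Compliant#No, NotCompliantReason#Mechanical|Advisory, "
        "Compliant#No, NotCompliantReason#DischargeCapable|Advisory")
print(replace_until_stable(line))   # ('ApproxAge#Post1991|Advisory', 2)
```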