Translate regex to notepad++'s dialect and 'Reverse expression'
-
@litos81 said in Translate regex to notepad++'s dialect and 'Reverse expression':
Any help would be very much appreciated.
Yes, your regex is not quite correct for the Notepad++ (NPP) engine. I tried to decipher it and adjust it to work within Notepad++, and came up with this, which does select various sections in the 2 example lines.
Have a look at my interpretation and see if it does appear to select what you intended.
(?<=Advisory,|ompliant,)[^\|]*ApproxAge.*?\|[\w\-,]+|^[^\|]*ApproxAge.*?\|[\w\-,]+
- Note I had to shorten the Non-Compliant alternative, as in NPP the lookbehind options must be a “fixed” length. As you had 9 vs 14 characters (Advisory, vs Non-Compliant,) it does NOT comply. See https://community.notepad-plus-plus.org/topic/19006/regex-positive-look-behind-with/6
- I changed your [\w-,] to [\w\-,], as inside a character class the - is the method of showing a range, as in A-Za-z, so it has special meaning. It therefore needed to be escaped to provide the actual character. You could also have used [-\w,], as when the - is in the first position it is the character -, not the range identifier.
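Both points can be tried outside Notepad++ as well. The sketch below uses Python's re module, which is a different engine from the Boost one in NPP, so treat it only as an analogy: it happens to reject unequal-length lookbehind alternatives and a hyphen placed between \w and a literal in the same way. The helper name compiles is made up for this illustration.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Lookbehind alternatives of 9 and 14 characters: rejected by Python's re.
unequal_lookbehind = compiles(r"(?<=Advisory,|Non-Compliant,)x")
# Both alternatives trimmed to 9 characters: accepted.
equal_lookbehind = compiles(r"(?<=Advisory,|ompliant,)x")

# An unescaped '-' between \w and ',' is read as a (bad) range: rejected.
ambiguous_class = compiles(r"[\w-,]")
# Escaping the hyphen, or putting it first, makes it a literal character.
escaped_class = compiles(r"[\w\-,]")
leading_hyphen_class = compiles(r"[-\w,]")
```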
So as it stands my interpretation does select some text, however is it the same as you wished? Of course this may not be any good as you say you really wanted the reverse of this anyways.
I think you need to better identify what you are attempting. If need be just go with one option you want to alter and fully explain that, in terms of what needs identifying and the parameters for it before you select and I guess remove if that seems to be your intention. Then repeat the explanation for the other options you also wish to change.
Terry
PS You say you are not an expert in regex, could have fooled me. This is quite a complex problem and you did actually solve it, albeit using a different regex engine. So good on you. We do welcome posters having attempted to solve it, coming here when they have hit the wall. Here’s hoping together we can help you!
-
@Terry-R said in Translate regex to notepad++'s dialect and 'Reverse expression':
(?<=Advisory,|ompliant,)[^\|]*ApproxAge.*?\|[\w\-,]+|^[^\|]*ApproxAge.*?\|[\w\-,]+
@Terry-R thanks so much for your reply.
The expression marks a few of the rules but not all the ones I wanted to keep. I’ll try to explain myself better, but please bear with me as I’m really bad at explaining stuff even in my own language.
I have a file with multiple lines, each of them containing features related to silage/slurry pits and septic tanks. Those features and their values (feature#value) are part of compliance rules that can contain one or more features. Every rule ends with either |Advisory or |Non-Compliant. In the example below there are three rules.
SepticTank#No|Non-Compliant, SepticTank#Yes, StoredEffluent#No, ApproxAge#Pre1991|Advisory, ApproxAge#Pre1991, SepticTank#Yes|Advisory
The 3 rules are:
SepticTank#No|Non-Compliant
SepticTank#Yes, StoredEffluent#No, ApproxAge#Pre1991|Advisory
ApproxAge#Pre1991, SepticTank#Yes|Advisory
I need to remove all the rules that do not contain the feature ApproxAge in them, so the line above should end up being:
SepticTank#Yes, StoredEffluent#No, ApproxAge#Pre1991|Advisory, ApproxAge#Pre1991, SepticTank#Yes|Advisory
(The rule SepticTank#No|Non-Compliant was removed because it doesn’t contain the feature ApproxAge)
Another example:
Before:
ApproxAge#Post1991, SepticTank#No|Advisory, SepticTank#No, ApproxAge#Pre1991|Non-Compliant, SepticTank#No|Non-Compliant, ApproxAge#Pre1991, SepticTank#Yes|Advisory, SepticTank#No, StoredEffluent#No|Non-Compliant
After:
ApproxAge#Post1991, SepticTank#No|Advisory, SepticTank#No, ApproxAge#Pre1991|Non-Compliant, ApproxAge#Pre1991, SepticTank#Yes|Advisory
(The rules SepticTank#No|Non-Compliant and SepticTank#No, StoredEffluent#No|Non-Compliant were removed because they do not contain ApproxAge)
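To pin down the requirement independently of any editor, here is a minimal Python sketch of the same filtering. It assumes every rule ends with |Advisory or |Non-Compliant and that a plain substring test for the feature name is good enough; the names RULE and keep_rules_with are invented for this illustration and are not part of any regex proposed in this thread.

```python
import re

# One rule = everything up to a pipe, then the rule's closing verdict.
RULE = re.compile(r"[^|]+\|(?:Advisory|Non-Compliant)")

def keep_rules_with(line: str, feature: str = "ApproxAge") -> str:
    """Keep only the rules of a line that mention the given feature."""
    # Each match may start with the ", " separator left by the previous rule.
    rules = [match.strip(", ") for match in RULE.findall(line)]
    return ", ".join(rule for rule in rules if feature in rule)

before = ("SepticTank#No|Non-Compliant, SepticTank#Yes, StoredEffluent#No, "
          "ApproxAge#Pre1991|Advisory, ApproxAge#Pre1991, SepticTank#Yes|Advisory")
after = keep_rules_with(before)
```

With the first example line as input, after holds the two rules that contain ApproxAge, joined by a comma and a space, which is the expected "After" text.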
I started to solve the problem by trying to find the rules I wanted to keep.
Using your modified version
(?<=Advisory,|ompliant,)[^\|]*ApproxAge.*?\|[\w\-,]+|^[^\|]*ApproxAge.*?\|[\w\-,]+
Notepad++ marks some of the rules I need to keep, but regex101.com still marks more (I guess it is the global/multiline modifiers I have selected in the website, but I’m not sure).
Notepad++ after ‘Mark All’ only marks one rule, SepticTank#Yes, StoredEffluent#No, ApproxAge#Pre1991|Advisory, in the line:
SepticTank#No|Non-Compliant, SepticTank#Yes, StoredEffluent#No, ApproxAge#Pre1991|Advisory, ApproxAge#Pre1991, SepticTank#Yes|Advisory
regex101.com correctly marks all the rules I want to keep (with /gm flags), that is both SepticTank#Yes, StoredEffluent#No, ApproxAge#Pre1991|Advisory and ApproxAge#Pre1991, SepticTank#Yes|Advisory, in the same line:
SepticTank#No|Non-Compliant, SepticTank#Yes, StoredEffluent#No, ApproxAge#Pre1991|Advisory, ApproxAge#Pre1991, SepticTank#Yes|Advisory
However, I realised that even if I get that pattern working it is not what I need, because I don’t think there is an option in Notepad++ to delete what is not marked (I’m only aware of deleting entire lines that are not bookmarked).
So I guess that what I need is a regex pattern that excludes what I need to keep, in other words, the opposite of the regex I was trying to fix. Then I would be able to ‘Replace All’ with (Empty). I read about ?! but I’m totally incapable of negating that whole pattern with ?!. I guess it is not as simple as
(?!((?<=Advisory,|ompliant,)[^\|]*ApproxAge.*?\|[\w\-,]+|^[^\|]*ApproxAge.*?\|[\w\-,]+))
Thanks again for your time
-
@litos81 said in Translate regex to notepad++'s dialect and 'Reverse expression':
Notepad++ to delete what is not marked
Well, in NPP the modifier that lets . match across line endings (generally) goes at the start of a regex and is (?s). However, as you appear to show the rules as NOT crossing lines, and you use .*? which is a lazy (non-greedy) quantifier, I don’t think the regex looks like it will cross lines. I’m not actually on a PC currently so can’t verify my assumptions.
As the rules appear to be contained within lines, yet each line appears to have 3 independent rules (which you are okay to remove some portions of), why not convert each rule to a separate line? It appears we first look for a | followed by a number of characters which are covered by your character class [\w\-,]. The very next character can get replaced by \r\n, which is the general method of inserting a carriage return and line feed. Then you can use the “Mark” function and select a line based on it having the text “ApproxAge” in it. As you are aware that you can remove “unmarked” lines, would that suffice?
The example lines are hopefully a good representation of the real data, otherwise we may need to see better examples. If necessary hide or otherwise alter some of the data if it is of a sensitive or confidential nature.
Terry
-
Hello @litos81, @terry-R and All
Just do a test with the following regex S/R :
- SEARCH (?-si)(\|(Advisory|Non-Compliant)|^)\K(,)?\h*((?!ApproxAge).)+?\|(?2)(,\x20)?
- REPLACE (?3(?5,\x20))
Notes :
- The regex catches the smallest non-empty zone of standard characters ( in-line modifier (?-s) ) ending with either the string |Advisory or |Non-Compliant, with this exact case ( in-line modifier (?-i) ), together with the possible comma before it and the possible comma + space string after it, and which does not contain the exact string ApproxAge
- In replacement, this zone is deleted and the string comma + space is rewritten ONLY IF a comma is present both before ( existing group 3 ) and after ( existing group 5 ) the zone to delete, due to the conditional replacement syntax !
- The (?2) is a subroutine call to group 2, which represents the simple regex Advisory|Non-Compliant, needed to end any searched zone
- Of course, you may replace the ApproxAge feature with any other one, just by inserting its name in the negative look-ahead ( (?!......) ), which is tested at every location of a zone, after the first non-blank char, up to the string |Advisory or |Non-Compliant
Best Regards,
guy038
-
-
@guy038 said in Translate regex to notepad++'s dialect and 'Reverse expression':
Just do a test with the following regex S/R :
@guy038 Thank you so much for your time and your solution to the problem.
Your regex works as expected and it even takes good care of spaces and commas in the replacements. I’ll admit that I have yet to digest the expression…
@Terry-R thanks for your suggestion on breaking the rules into separate lines. I was so focused on trying to find an expression that I forgot to ‘break up’ the problem. I was trying to record a macro with all the steps involved (create new blank lines between existing lines, break up the rules into separate lines, mark the lines matching ApproxAge|\n\s*\n to bookmark them, invert the bookmarks, remove the bookmarked lines and put the remaining rules back together in lines…). I think it would have worked, but I will use @guy038’s solution as it requires fewer modifications.
Thanks again to both of you for your time (especially it being the weekend!). I was thinking of donating to the notepad++ project, or perhaps any other nonprofit organisation or charity you may want to suggest, as a humble payment for your efforts.
Kind Regards,
Carlos
-
In free-spacing mode, the search regex could be written as below :
(?x)                                 # Regex FREE-SPACING mode ( SPACES are irrelevant, except if ESCAPED or in the [ ] syntax )
                                     # Everything located AFTER the # character is IGNORED, too
(?-s)                                # The . REGEX symbol matches a SINGLE STANDARD char, ONLY
(?-i)                                # Search SENSITIVE to CASE
( \| (Advisory|Non-Compliant) | ^ )  # LEADING string "|Advisory" OR "|Non-Compliant" ( Group 2 ) OR ^ ( START of line )
\K                                   # CANCELS any match, so far and RESETS the engine working position
(,)?\h*                              # A POSSIBLE COMMA ( Group 3 ), followed with POSSIBLE HORIZONTAL blank chars
( (?!ApproxAge). )+?                 # Any SMALLEST zone, NOT containing the string "ApproxAge", till...
\|                                   # A PIPE character and...
(?2)                                 # String "Advisory" OR "Non-Compliant" ( subroutine call to Group 2 ) and...
(,\x20)?                             # A POSSIBLE ENDING string ", " ( Group 5 )
- Simply, select all these lines and use the Ctrl + H shortcut
- Add the Replace zone (?3(?5,\x20))
Cheers,
guy038
P.S. :
You may consider to donate to the N++ project ;-))
Oh… I forgot to mention that, because of the \K syntax in the search regex, you need to use the Replace All button, ONLY, for an effective replacement !!
-
-
You may have been intrigued by the (?2) syntax !
One example :
The regex (^|(ABC|DEF|GHI|JKL|MNO|PQR|STU|VWX))\d+((?2)|$) does match the four lines below :
ABC1234
DEF789743STU
11100001111JKL
0123456789
Whereas the regex (^|(ABC|DEF|GHI|JKL|MNO|PQR|STU|VWX))\d+(\2|$) just matches the first and last line !
Why ? Well, when we use the second regex, with the \2 syntax, which is a back-reference to the group 2 :
- In the first line, the number 1234 is preceded by ABC and ends the line, matching the $ assertion => No problem
- In the last line, the number 0123456789 begins and ends the line => No problem too, as the assertions ^ and $ are both verified
- In the second line, the number 789743 is preceded by DEF, stored as group 2, but is followed by STU. This cannot match the last part (\2|$) of the regex, as \2 represents the present value stored in group 2, which is DEF !
- In the third line, the number 11100001111 begins the line and matches the ^ assertion. Then, the group 2 is not defined. So, any further \2 back-reference cannot match with our Boost regex engine ! As the number is followed by JKL, it does not match the $ assertion, either !
Now, in the first regex, the subroutine call syntax (?2) is, indeed, a call to the regex itself, contained in group 2, that is to say, the regex ABC|DEF|GHI|JKL|MNO|PQR|STU|VWX
This means that it’s just as if you had written the regex :
(^|(ABC|DEF|GHI|JKL|MNO|PQR|STU|VWX))\d+(ABC|DEF|GHI|JKL|MNO|PQR|STU|VWX|$)
Which, of course, matches the four lines above !
So, if a long sub-regex is repeated, one or several times, in the overall regex, this syntax may significantly reduce the regex length. But, on the other hand, this syntax is more difficult to grasp !
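The difference between a back-reference and a repeated sub-regex can also be checked outside Notepad++. The sketch below uses Python's re module, which has no (?2) subroutine call, so the first regex is represented by its written-out equivalent given above; the back-reference version is used as is. Python is not the Boost engine, so this is only an illustration under that assumption, and the variable names are made up for it.

```python
import re

NAMES = "ABC|DEF|GHI|JKL|MNO|PQR|STU|VWX"

# \2 must repeat the exact text that group 2 captured earlier in the match.
backref = re.compile(rf"(^|({NAMES}))\d+(\2|$)")
# Written-out equivalent of the (?2) version: any of the names may follow.
expanded = re.compile(rf"(^|({NAMES}))\d+({NAMES}|$)")

lines = ["ABC1234", "DEF789743STU", "11100001111JKL", "0123456789"]

backref_hits = [text for text in lines if backref.search(text)]
expanded_hits = [text for text in lines if expanded.search(text)]
```

Here backref_hits contains only the first and last line, while expanded_hits contains all four.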
Best Regards,
guy038
-
-
@litos81 said in Translate regex to notepad++'s dialect and 'Reverse expression':
I was thinking in donating to the notepad++ project
I’m glad you are satisfied with the solution @guy038 has provided. In terms of donating, here’s an actual link to the donation page on the NPP home website,
https://notepad-plus-plus.org/donate/
if you do feel so inclined.
As it was the weekend I was in an information-gathering state and, once on a computer to do the tests, would have looked at a different type of solution, possibly the one I mentioned with each rule on a different line. I note you suggest the need to re-combine the rules again after removal of the unwanted ones. Can I take it from your statement that each line is actually a set of 3 “dependent” rules, perhaps about 1 subject or location?
I will give you my concept, obviously I’m not about to put this into a regex as you have a solution which works for you.
1. Number each line using the “column editor”
2. The line number in step 1 is copied to each | character. This will allow for easier re-combining after the unwanted rules have been removed.
3. Use a “Mark” function to mark either those rules you want to keep (generally an easier regex to code) or those unwanted.
4. Remove either the marked or unmarked lines depending on how the regex was coded in step 3.
5. Re-combine the rules according to the number given to each rule (from the line number issued in step 1)
6. Tidy up the lines, especially removal of the line number inserted into each rule.
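The steps above can be sketched in Python to show how the numbering makes the re-combining safe. This is only one possible reading of the concept, not a Notepad++ procedure: the function name filter_in_steps and the RULE pattern are invented for the illustration, and it assumes every rule ends with |Advisory or |Non-Compliant.

```python
import re
from collections import defaultdict

RULE = re.compile(r"[^|]+\|(?:Advisory|Non-Compliant)")

def filter_in_steps(lines, feature="ApproxAge"):
    # Steps 1-2: tag every rule with the number of the line it came from.
    tagged = [
        (number, rule.strip(", "))
        for number, line in enumerate(lines, start=1)
        for rule in RULE.findall(line)
    ]
    # Steps 3-4: "mark" the rules that mention the feature, drop the rest.
    kept = [(number, rule) for number, rule in tagged if feature in rule]
    # Step 5: re-combine the surviving rules by their line number.
    by_line = defaultdict(list)
    for number, rule in kept:
        by_line[number].append(rule)
    # Step 6: drop the numbers again; a line with no surviving rule is empty.
    return [", ".join(by_line[n]) for n in range(1, len(lines) + 1)]
```

Because each rule carries its line number, rules from different lines cannot get mixed up when they are joined again, which is the point of steps 1 and 2.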
As you see, the ability to “break up” a large complex problem such as this may sometimes allow you to see the way forward rather than attempt to treat the whole problem in 1 step.
best wishes
Terry
-
@guy038 said in Translate regex to notepad++'s dialect and 'Reverse expression':
Oh… I forgot to mention that, because of the \K syntax, in the search regex, you need to use the Replace All button, ONLY, for an effective replacement !!
Hi, I just wanted to say that while the expression works, I have to click ‘Replace All’ at least a couple of times, because in some cases the expression seems to miss rules that should be excluded. I’m perfectly fine with that, I don’t mind clicking the button several times, but I just wanted to let you know. And if the expression was meant to be used with one click only, I honestly don’t know how to fix it (I tried to add a $ to the last bit, (,\x20|$)?, but it didn’t change the output)
ApproxAge#Post1991|Advisory, Compliant#No, NotCompliantReason#Mechanical|Advisory, Compliant#No, NotCompliantReason#DischargeCapable|Advisory
ApproxAge#Pre1991|Advisory, Compliant#No, NotCompliantReason#Mechanical|Advisory, SepticTank#No|Non-Compliant
SepticTank#No, StoredEffluent#No, ApproxAge#Pre1991|Advisory, Compliant#No, NotCompliantReason#Mechanical|Advisory, SepticTank#No|Advisory
The ‘Replace All’ button needs to be clicked twice to remove the unwanted rules in the previous lines.
Regards,
Carlos
-
@litos81 said in Translate regex to notepad++'s dialect and 'Reverse expression':
I have to click the ‘Replace All’ at least a couple of times because in some cases the expression seems to miss rules that should be excluded.
if the expression was meant to be used with one click only I honestly don’t know how to fix it
Sometimes, due to the nature of the data, this situation results.
Also sometimes, the search/replace expressions can be rewritten to be a “one-click” solution, sometimes not.
Often the complexity involved in such a rewrite is, well, extreme and “not worth it”, meaning it isn’t worth putting in the time required to figure it out.
But, just in case you thought it was some kind of “bug”… it’s not.
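For anyone curious why a second click can be needed, here is a rough Python analogue of the search/replace from earlier in the thread. It is not guy038's exact expression: Python's re module has no \K, no (?2) and no conditional replacement, so two fixed-width lookbehinds and a replacement function stand in for them, and the names UNWANTED, one_pass and until_stable are invented here. In this analogue each match also swallows the trailing comma + space, so an unwanted rule that directly follows another one no longer sits right after |Advisory or |Non-Compliant and is only caught on the next pass. That is consistent with the two clicks reported above, though it proves nothing about the Boost engine itself.

```python
import re

UNWANTED = re.compile(
    r"(?:(?<=\|Advisory)|(?<=\|Non-Compliant)|^)"  # stands in for (...)\K
    r"(,)?[ \t]*"                                  # possible comma before
    r"(?:(?!ApproxAge).)+?"                        # zone without ApproxAge
    r"\|(?:Advisory|Non-Compliant)"                # end of the rule
    r"(,\x20)?"                                    # possible ", " after
)

def one_pass(line: str) -> str:
    """One 'Replace All' over a single line."""
    def fix_separator(match):
        # Keep one ", " only if there was a separator on both sides.
        return ", " if match.group(1) and match.group(2) else ""
    return UNWANTED.sub(fix_separator, line)

def until_stable(line: str):
    """Repeat one_pass until nothing changes; return (text, passes used)."""
    passes = 0
    while True:
        result = one_pass(line)
        if result == line:
            return line, passes
        line, passes = result, passes + 1
```

Looping until the text stops changing is the scripted counterpart of clicking ‘Replace All’ again until it reports no more replacements.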