Bookmark sets of lines that does not meet criteria
-
I had another look at my (and @guy038's) solutions and thought there were too many steps; it should be easier. I did eventually come up with a much neater solution. My regex identifies correct sets for 1-, 2- and 3-part answers in one step (and can be expanded to as many parts as required). The final step is then to invert the bookmarks, which leaves the sets to check and possibly edit.
So search mode must be set to “regular expression”. Also tick “wrap around”.
- Mark the good lines: any question with 1 to 3 ____ blanks and one fewer / on the first answer line.
Use the Mark function, with “Bookmark line” also ticked.
Find What: (?-s)^\d+(?=.*\x5f)([^\x5f\r\n]*\x5f+)?+([^\x5f\r\n]*\x5f+)?+([^\x5f\r\n]*\x5f+)?+[^\x5f\r\n]*\R[^/\r\n]*(?(2)/)[^/\r\n]*(?(3)/)[^/\r\n]*\R(.+\R){3}\x20*\R?
This regex can easily be expanded to cover as many parts as required, although it obviously gets longer with each part added. Here is a 5-part regex:
(?-s)^\d+(?=.*\x5f)([^\x5f\r\n]*\x5f+)?+([^\x5f\r\n]*\x5f+)?+([^\x5f\r\n]*\x5f+)?+([^\x5f\r\n]*\x5f+)?+([^\x5f\r\n]*\x5f+)?+[^\x5f\r\n]*\R[^/\r\n]*(?(2)/)[^/\r\n]*(?(3)/)[^/\r\n]*(?(4)/)[^/\r\n]*(?(5)/)[^/\r\n]*\R(.+\R){3}\x20*\R?
So if you have some 4-part questions you can just use the 5-part one, as it checks any number of parts from 1 to 5.
So that is 2 steps: the first is bookmarking the good lines, the second is inverting the bookmarks to get the lines (question sets) to check and edit.
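As a rough illustration outside Notepad++, the two-step idea (mark the good sets, then invert) can be sketched in Python. This is an adaptation, not the exact regex above: Python's re module has no \R, so \n is assumed as the line ending, and the possessive ?+ groups are replaced by nested optional groups with _+(?!_) so that the sketch does not depend on possessive quantifiers. The helper name sets_to_check, and the assumption that question sets are separated by blank lines, are hypothetical.

```python
import re

# Adapted 1-to-3 blank check. Group 2 exists only if there is a second blank,
# group 3 only if there is a third, so each conditional demands one "/".
GOOD_SET = re.compile(
    r"\d+(?=.*_)"                             # question number; line must contain a blank
    r"([^_\n]*_+(?!_))"                       # group 1: text + first blank
    r"(?:([^_\n]*_+(?!_))"                    # group 2: text + second blank (optional)
    r"([^_\n]*_+(?!_))?)?"                    # group 3: text + third blank (optional)
    r"[^_\n]*\n"                              # rest of the question line
    r"[^/\n]*(?(2)/)[^/\n]*(?(3)/)[^/\n]*\n"  # first answer line: one "/" per extra blank
    r"(?:.+\n){3}"                            # three more non-empty answer lines
)


def sets_to_check(text):
    """Return the question sets that do NOT match, i.e. the inverted bookmarks."""
    # Assumption: sets are separated by one blank (or spaces-only) line.
    blocks = [b + "\n" for b in re.split(r"\n[ ]*\n", text.strip())]
    return [b for b in blocks if not GOOD_SET.fullmatch(b)]
```

Only the first answer line is checked for slashes, as in the regex above; the remaining three lines just have to be non-empty.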
Here is a description of what each part does:
(?-s)
…states that the dot character . cannot match a newline character.
^\d+(?=.*\x5f)
…find a line starting with digits, at least 1. If found, check that the line also contains a ____ string. Any question sets missing this are obviously wrong and will never be bookmarked.
([^\x5f\r\n]*\x5f+)?+
…look for some characters other than _ (or newline), then look for a run of _ characters, at least 1. The ?+ is possessive, so once found they are kept. This sub-expression is repeated for the number of ___ to be located.
[^\x5f\r\n]*\R
…look for some characters other than _ (or newline), then look for a newline character.
[^/\r\n]*(?(2)/)
…look for some characters other than / (or newline). Check if group 2 was created earlier; if so, look for a /. This sub-expression is repeated for the number of / to be found, which is 1 less than the number of ____ to be found.
[^/\r\n]*\R
…look for some characters other than / (or newline), followed by a newline.
(.+\R){3}
…look for 3 non-empty lines, each including its newline character.
\x20*\R?
…look for any following “empty” line, so it may contain nothing or 1 (or more) spaces.
Terry
-
Hello, @bá-hùng-lê, @terry-r and All,
When I first read the @terry-r post, I was intrigued by the (?(2)/) syntax. After a while, I understood that it is a conditional regex syntax. I must admit that I’ve never used this feature, yet, in my replies on this forum… and that’s a big mistake!
The general syntax of a conditional regex structure is:
(?(Condition)Regex if TRUE[|Regex if FALSE]), where Condition is one of:
- #, the number of a numbered group
- <Name> / 'Name', the name of a named group
- (?=••••) / (?!••••) / (?<=••••) / (?<!••••), a look-around assertion
- R, a recursive reference to the overall regex
- Rn, a recursive reference to a numbered group n
- R&Name, a recursive reference to a named group Name
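For readers who want to experiment outside Notepad++, here is a minimal Python sketch of the first two condition types. Note that Python's re module only supports conditions on a group number or a group name; the look-around and recursion conditions listed above are not available there. The pattern names PAREN and TAGGED and the helper is_match are made up for this illustration.

```python
import re

# Numbered-group condition: ")" is required only if group 1 captured "(".
PAREN = re.compile(r"(\()?\d+(?(1)\))")

# Named-group condition with both branches: ">" if "<" was captured, else ";".
TAGGED = re.compile(r"(?P<open><)?\w+(?(open)>|;)")


def is_match(pattern, text):
    """True if the whole text matches the pattern."""
    return pattern.fullmatch(text) is not None
```

With these patterns, "(42)" and "42" are accepted by PAREN while "(42" and "42)" are rejected, because the closing parenthesis is tied to the presence of the opening one.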
These conditional regexes, introduced by Terry, are a powerful method, especially when two sets of data are linked together. This is precisely the case in this topic, as:
- We have questions with one or more _____ areas in a first line
- Then, we have four answer lines, and each must contain as many answers as there are _____ areas in the question
So:
- In case of one area to fill in, the regex should detect this correct text:
1. 11111 _____ 22222
a. wwwww
b. xxxxx
c. yyyyy
d. zzzzz
- In case of two areas to fill in, the regex should detect this correct text:
1. 11111 _____ 22222 _____ 33333
a. wwwww / sssss
b. xxxxx / ttttt
c. yyyyy / uuuuu
d. zzzzz / vvvvv
- In case of three areas to fill in, the regex should detect this correct text:
1. 11111 _____ 22222 _____ 33333 _____ 44444
a. wwwww / sssss / ooooo
b. xxxxx / ttttt / ppppp
c. yyyyy / uuuuu / qqqqq
d. zzzzz / vvvvv / rrrrr
And so on…
Therefore, if we assume a text with three _____ areas, like above:
- The 11111_____ area is stored as group 1
- The 22222_____ area is stored as group 2
- The 33333_____ area is stored as group 3
Here is what the regex should look at:
- A letter a and a dot
- A space char and the wwwww area, as there is always ONE _____ area to fill in
- A space char and the / sssss area, if group 2 exists
- A space char and the / ooooo area, if group 3 exists
So, here is a regex, expressed with the free-spacing mode (?x), which suits from one to five _____ areas:

(?x-is)                                  #  FREE-SPACING mode + search SENSITIVE to CASE + DOT = STANDARD character

#  DEFINITION of groups 1, 2 and 3

( \x20 (?: [^_\r\n]+ \x20 )? )           #  (G1) = SPACE + [ ANY char(s), DIFFERENT from _ and EOL + SPACE ]
( \x20 [^/\r\n]+ )                       #  (G2) = SPACE + ANY char(s), DIFFERENT from / and EOL
( \x20 / )                               #  (G3) = SPACE + SLASH
¤                                        #  NEVER matches ( NON-EXISTING '¤' )

|                                        #  OR

#  FOR the QUESTION line :

^\d+ \.                                  #  DIGITS at START + DOT
(?=.*_)                                  #  IF an UNDERSCORE exists in CURRENT line
(?![\x20_]*$)                            #  IF CURRENT line does NOT contain UNDERSCORE(S) and SPACE(s), ONLY
(?1)_+                                   #         =   G1 + UNDERSCORE(S)
((?1)_+)?+                               #  (G4)?+ = [ G1 + UNDERSCORE(S) ] in an ATOMIC way
((?1)_+)?+                               #  (G5)?+ = [ G1 + UNDERSCORE(S) ] in an ATOMIC way
((?1)_+)?+                               #  (G6)?+ = [ G1 + UNDERSCORE(S) ] in an ATOMIC way
((?1)_+)?+                               #  (G7)?+ = [ G1 + UNDERSCORE(S) ] in an ATOMIC way
(?:\x20 [^_\r\n]+)? \R                   #  [ SPACE + ANY char(s), DIFFERENT from _ and EOL chars ] + EOL char(s)

#  END FOR

#  For EACH ANSWER line, below :
#      LOWERCASE letter [abcd] at START + DOT, if CURRENT line NOT BLANK
#      G2 + IF group 4 or 5 or 6 or 7 EXISTS, search G3 + G2 for each EXISTING group + EOL char(s)
#  END FOR

^a \. (?!\h*$) (?2) (?(4)(?3)(?2)) (?(5)(?3)(?2)) (?(6)(?3)(?2)) (?(7)(?3)(?2)) \R
^b \. (?!\h*$) (?2) (?(4)(?3)(?2)) (?(5)(?3)(?2)) (?(6)(?3)(?2)) (?(7)(?3)(?2)) \R
^c \. (?!\h*$) (?2) (?(4)(?3)(?2)) (?(5)(?3)(?2)) (?(6)(?3)(?2)) (?(7)(?3)(?2)) \R
^d \. (?!\h*$) (?2) (?(4)(?3)(?2)) (?(5)(?3)(?2)) (?(6)(?3)(?2)) (?(7)(?3)(?2)) \R

\x20* \R?                                #  [ SPACE char(s) ] + [ EOL char(s) ]
The regex, with its comments after the # symbol, may look large, but it is able to detect numerous syntax errors, as it does not match the following cases:
- Number of answers in a set different from the number of ____ areas ( the main condition )
- Missing space before and/or after any ______ area and any / symbol
- Missing number, letter, dot and/or space at beginning of line
- Lines without any content after the dot and space char
- Switching of two or more answer lines
- Missing or excess answer lines
- Excess blank lines between each section
How does this regex work? It contains two alternatives:
- The first alternative, below, is needed to store the groups 1, 2 and 3, by reference. So the exact regex is stored ( not its present value )
- The second one is the main regex, which looks for a correct section, with a question line and a set of four answer lines

(?x-is)                                  #  FREE-SPACING mode + search SENSITIVE to CASE + DOT = STANDARD character

#  DEFINITION of groups 1, 2 and 3

( \x20 (?: [^_\r\n]+ \x20 )? )           #  (G1) = SPACE + [ ANY char(s), DIFFERENT from _ and EOL + SPACE ]
( \x20 [^/\r\n]+ )                       #  (G2) = SPACE + ANY char(s), DIFFERENT from / and EOL
( \x20 / )                               #  (G3) = SPACE + SLASH
¤                                        #  NEVER matches ( NON-EXISTING '¤' )

|                                        #  OR
As you can see, in order to properly define these groups, I’m using a special symbol ¤, not used in the current file. Therefore, this first alternative will never match! But, fortunately, the groups 1, 2 and 3 are defined during this match attempt and, above all, remain available while trying the main second alternative: that’s the KEY point ;-))
You can see this first failed alternative as a group definition region, which is never part of the final match ;-))
Near the end of the main alternative, there are four very similar lines:
(?x-is) ^a\. (?!\h*$) (?2) (?(4)(?3)(?2)) (?(5)(?3)(?2)) (?(6)(?3)(?2)) (?(7)(?3)(?2)) \R
which shows that the regex needed to match a complete a.••••••• line grows as the number of existing groups 4, 5, 6 and 7 increases! The conditional regex syntaxes (?(#)•••••) make all this process elegant and almost obvious ;-))
I hope that the explanations, given in comments, will be enough to satisfy your curiosity
Best Regards,
guy038
-
-
@guy038 said in Bookmark sets of lines that does not meet criteria:
NEVER matches ( NON-EXISTING ‘¤’ )
I was intrigued by this part.
(Well, okay, I was intrigued by other parts of Guy’s posting, too!)
FWIW, if I ever want to never-match, I use (?!).
In truth, I haven’t examined the situation above to see if this would work, but I presume it would.
Guy, I hope we see further discussion of the conditional regex structure in future postings, because the one above may not make it clear how to use it, for regex beginners/intermediates.
-
@guy038 said in Bookmark sets of lines that does not meet criteria:
(?=••••) / (?!••••) / (?<=••••) / (?<!••••), a look-around assertion
Interestingly, the Boost documentation for “Conditional expressions” only discusses the first two (the look-aheads, not the look-behinds) of these four, but in my limited testing all four seem to work.
-
This post is deleted!
-
This post is deleted!
-
Sorry for the deleted posts. I see now you were talking about the “assertions” in the “Conditional expressions”, not the normal lookahead and lookbehind.
-
Hello, @bá-hùng-lê, @terry-r, @alan-kilborn, @peterjones and All,
First, Alan, I’m really sorry as you’re perfectly right about a NEVER match regex sequence. I always forget the other syntaxes :-(
Indeed, in the free-spacing regex of my previous post, you may replace the ¤ symbol with either:
- The empty negative look-ahead (?!), as it is impossible to NOT match an empty string
- The backtracking control verb (*F) or (*FAIL), which seems to be the official syntax to cancel the current match attempt and, possibly, try other parts of the overall regex for a successful match attempt
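The never-match behaviour of (?!) is easy to check in isolation. The following is a small Python sketch with made-up pattern names; Python's re module accepts (?!) but has no backtracking control verbs, so (*F) and (*FAIL) cannot be tried this way.

```python
import re

# Empty negative look-ahead: fails at every position of every subject string.
NEVER = re.compile(r"(?!)")

# The first alternative can never succeed, so only the second can produce a match.
TWO_ALTERNATIVES = re.compile(r"abc(?!)|xyz")
```

Searching "abc xyz" with TWO_ALTERNATIVES finds only xyz: the engine does try abc first, hits (?!), and falls through to the other alternative.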
Now, some points about the regex conditional structures :
I did some tests and, globally, I would say that this feature is not essential in “everyday” regexes. But, as I previously said, the conditional syntaxes are really interesting when dealing with correlated data !
For instance, let's start with the regex
(?-is)^.*Paul.*\R(.*Bob.*\R.*Alice.*\R|.*Alice.*\R.*Bob.*\R)
, which matches the two blocks below, where the last two lines can be switched!

Here is Paul Smith
Yesterday I saw Bob
who spoke with Alice, in the street

Here is Paul Smith
Yesterday I saw Alice
who spoke with Bob, in the street

Note that we can also simplify this regex as
(?-is)^.*Paul.*\R.*(Bob.*\R.*Alice|Alice.*\R.*Bob).*\R
But we can choose to use, for instance, the conditional structure (?(#)•••••••). So, our regex is changed into:
(?-is)^.*Paul.*\R.*((Bob)|Alice).*\R.*(?(2)Alice|Bob).*\R
As you can see:
- If group 2 exists, so Bob is in the second line, it will search for Alice in the third line
- Else, Alice has been found in the second line and, then, it will search for Bob in the third line
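The equivalence of the alternation form and the conditional form can be checked with a short Python sketch. It assumes \n line endings (Python's re has no \R) and drops the leading (?-is), since case-sensitive matching with a dot that stops at newlines is already the default there. The three sample blocks are the two from this post plus one hypothetical block where Bob appears twice.

```python
import re

# Original form: both orders spelled out as alternatives.
ALTERNATION = re.compile(
    r"^.*Paul.*\n(.*Bob.*\n.*Alice.*\n|.*Alice.*\n.*Bob.*\n)", re.M
)

# Conditional form: group 2 records that Bob came first, so Alice must follow;
# otherwise Alice came first and Bob must follow.
CONDITIONAL = re.compile(
    r"^.*Paul.*\n.*((Bob)|Alice).*\n.*(?(2)Alice|Bob).*\n", re.M
)

BOB_FIRST = "Here is Paul Smith\nYesterday I saw Bob\nwho spoke with Alice, in the street\n"
ALICE_FIRST = "Here is Paul Smith\nYesterday I saw Alice\nwho spoke with Bob, in the street\n"
BOB_TWICE = "Here is Paul Smith\nYesterday I saw Bob\nwho spoke with Bob, in the street\n"
```

Both patterns accept the first two blocks and reject the third.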
In terms of complexity, we can’t clearly see the advantages of conditionals !
Let’s play with the text, below, with each section to match :
A
X
a

AB
XY
ab

ABC
XYZ
abc

ABCD
XYZT
abcd
First, we can use the usual and sequential form, with four alternatives in order to match each block. In free-spacing mode, we have :
(?x-i) A \R X \R a \R | AB \R XY \R ab \R | ABC \R XYZ \R abc \R | ABCD \R XYZT \R abcd \R
Again, let’s use the approach with conditionals. We get :
(?x-i) A (B)? (C)? (D)? \R X (?(1)Y) (?(2)Z) (?(3)T) \R a (?(1)b) (?(2)c) (?(3)d) \R
As with the previous example, the gain from this new feature is not that obvious! Now, let’s imagine that some letters represent an important and/or complicated regex: we immediately see the benefit of the latter syntax, as each letter occurs just once in the complete regex!
You might retort that some parts of the former regex can be factorized. However, the irreducible form seems to be:
(?x-i) A ( \R X \R a | B \R XY \R ab | BC \R XYZ \R abc | BCD \R XYZT \R abcd ) \R
But, if X and a stand for very long regexes, this syntax remains tedious!
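The plain alternation and the conditional form for the multi-line blocks can be compared with a small Python sketch, again assuming \n instead of \R and using re.VERBOSE in place of (?x). One difference shows up when testing: because each of (B)?, (C)? and (D)? is independently optional, the conditional form also accepts a hypothetical block such as AC / XZ / ac, which the alternation rejects.

```python
import re

# Four spelled-out alternatives, one per block shape.
ALTERNATION = re.compile(
    r"A \n X \n a \n | AB \n XY \n ab \n | ABC \n XYZ \n abc \n | ABCD \n XYZT \n abcd \n",
    re.VERBOSE,
)

# Each optional group on the first line switches on its counterpart on lines 2 and 3.
CONDITIONAL = re.compile(
    r"A (B)? (C)? (D)? \n X (?(1)Y) (?(2)Z) (?(3)T) \n a (?(1)b) (?(2)c) (?(3)d) \n",
    re.VERBOSE,
)

BLOCKS = ["A\nX\na\n", "AB\nXY\nab\n", "ABC\nXYZ\nabc\n", "ABCD\nXYZT\nabcd\n"]
```

If that extra permissiveness matters, the optional groups can be nested, as in (B(C(D)?)?)?, with the group numbers in the conditionals adjusted to match.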
We could have the same reasoning with single-line blocks of text, like below :
AXa
ABXYab
ABCXYZabc
ABCDXYZTabcd
The normal regex syntax is
(?x-i) A X a \R | AB XY ab \R | ABC XYZ abc \R | ABCD XYZT abcd \R
Which could be simplified as :
(?x-i) A ( X a | B XY ab | BC XYZ abc | BCD XYZT abcd ) \R
Now, the same syntax, with conditionals, is :
(?x-i) A (B)? (C)? (D)? X(?(1)Y)(?(2)Z)(?(3)T) a(?(1)b)(?(2)c)(?(3)d) \R
Again, in case of complicated regexes, standing for the letters, this last syntax seems better !
Best Regards,
guy038
-
-
Hi, @bá-hùng-lê, @terry-r, @alan-kilborn, @peterjones and All,
In retrospect, my argument for conditional expressions seems a bit weak ! Indeed, if we consider this regex, without conditionals :
(?x-i) A \R X \R a \R | AB \R XY \R ab \R | ABC \R XYZ \R abc \R | ABCD \R XYZT \R abcd \R
Even if we suppose that, let’s say, the X and a stand for long regex sequences, we can still use the sub-routine call syntax to simplify this example as:
(?x-i) (X)(a)(*F) | A \R (?1) \R (?2) \R | AB \R (?1)Y \R (?2)b \R | ABC \R (?1)YZ \R (?2)bc \R | ABCD \R (?1)YZT \R (?2)bcd \R
Perhaps, we’ll come across examples, later, which clearly show some real advantages to use the conditional feature !
BR
guy038
-
Hello @alan-kilborn and All,
I’ve found a simple example of the advantage of the conditional feature!
Let’s suppose that you have a particular tag <guy> and that you want:
- To delete the starting tag <guy> with both its leading and trailing space chars
- To delete the ending tag </guy> with its leading space char, only
The simple and obvious solution is:
- SEARCH \x20<guy>\x20|\x20</guy>
- REPLACE Leave EMPTY
Now, this shorter regex S/R, with a conditional expression related to group 1, is:
- SEARCH \x20<(/)?guy>(?(1)|\x20)
- REPLACE Leave EMPTY
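Both search expressions can be tried in Python as they are, since they use only syntax that Python's re module also supports (including the empty TRUE branch in the conditional). The helper name strip_tags and the sample sentence in the note below are made up for this sketch.

```python
import re

# Plain form: start tag with both spaces, or end tag with its leading space.
PLAIN = re.compile(r"\x20<guy>\x20|\x20</guy>")

# Conditional form: if the "/" was captured (end tag), nothing more is needed;
# otherwise (start tag) a trailing space must follow.
CONDITIONAL = re.compile(r"\x20<(/)?guy>(?(1)|\x20)")


def strip_tags(pattern, text):
    """Delete every match of the pattern, i.e. replace with nothing."""
    return pattern.sub("", text)
```

Applied to "one <guy> two </guy> three", both patterns give "onetwo three": the start tag takes both of its surrounding spaces with it, the end tag only its leading one.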
I verified that the suppression of 500,000 starting tags and 500,000 ending tags, in one step, takes the same time, whichever regex syntax is used!
Best Regards,
guy038
-