    Bookmark sets of lines that does not meet criteria

      Bá Hùng Lê

      For example, a correct question set would look something like this:

      1. A ____ can bark. A ____ can catch mouses
      a. dog / cat
      b. fish / tiger
      c. human / lady
      d. table / chair
      

      But since the text is OCR output from an image, a fill-in gap ( ____ ) is sometimes missing, like this:

      1. A ____ can bark. A can catch mouses
      a. dog / cat
      b. fish / tiger
      c. human / lady
      d. table / chair
      

      So I’m trying to bookmark the wrongly converted sets by finding the choices separated by “/” and checking whether the question has enough gaps to match ( I would like to automate this process )
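For readers who want to prototype the check outside Notepad++, here is a minimal Python sketch of the idea described above. It assumes question lines start with a number and a dot, answer lines start with a lowercase letter and a dot, and each run of underscores is one gap. The function name `find_suspect_questions` is hypothetical and not part of any tool mentioned in this thread.

```python
import re

def find_suspect_questions(text):
    """Return the question numbers whose gap count disagrees with an answer line."""
    suspects = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        m = re.match(r"(\d+)\.", lines[i])
        if not m:
            i += 1
            continue
        # Each run of underscores on the question line counts as one gap.
        gaps = len(re.findall(r"_+", lines[i]))
        # Collect the answer lines that follow ("a. ...", "b. ...", and so on).
        answers = []
        j = i + 1
        while j < len(lines) and re.match(r"[a-z]\.", lines[j]):
            answers.append(lines[j])
            j += 1
        # An answer line needs one choice per gap, so (gaps - 1) slashes.
        if gaps == 0 or any(a.count("/") != gaps - 1 for a in answers):
            suspects.append(int(m.group(1)))
        i = j
    return suspects

sample = """1. A ____ can bark. A can catch mouses
a. dog / cat
b. fish / tiger
c. human / lady
d. table / chair

2. _____ you for helping me!
a. Hate
b. Love
c. Thank
d. Kiss
"""
print(find_suspect_questions(sample))  # only the OCR-damaged question 1 is reported
```

Counting in a script sidesteps the regex limitations discussed later in the thread, at the cost of leaving the editor.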

        Terry R @Bá Hùng Lê

        @Bá-Hùng-Lê said in Bookmark sets of lines that does not meet criteria:

        For example a right set of question would be sth like this

        Thanks for showing the examples in black boxes, it makes it so much easier to understand.
        I do have some additional questions.

        1. Do the lines with the _____ areas missing always start with a number?
        2. Are there always 2 gaps on this line? If so, we need to find lines that do NOT have 2 sets of ______ and that start with a number.

        Terry

          Bá Hùng Lê

          Man! you’re so polite! You’re helping me and you start with “thank you”? I wish the best of luck to you, my gregarious mate!
          And to answer your questions

          1. Yes the question lines (with ____ ) always start with a number then followed by four answer lines
          2. Unfortunately, there are NOT always two “____” in a question for example:
          2. _____ you for helping me!
          a. Hate
          b. Love
          c. Thank
          d. Kiss
          

          or even three gaps to fill in:

          3. Your _____ is so beautiful, I _____ it! Where can I _____ it ?
          a. dress / hate / sell
          b. hat / love / buy
          c. table / want / posses
          d. hand / love / take
          
            Terry R

            @Bá-Hùng-Lê said in Bookmark sets of lines that does not meet criteria:

            Unfortunately, there are NOT always two “____” in a question for example:

            I will thank you again, for answering my 2 questions precisely. You have no idea how good this conversation is. Sometimes we ask questions and never get an answer, or at least nothing that helps us. So yes, I can easily say thanks, even though it is me (or someone else) helping you.

            I had presumed that there may be questions that have other than a 2 part answer, so bear with me. I hope to supply some info shortly.

            Given the complexity, I decided that the best way forward is to bookmark the lines that do look good. Bookmarking can be carried out in steps. So you would bookmark the 2 part questions that are good, then bookmark the 3 part questions. My regex can be expanded out to as many part questions as you have. Do you have any idea whether 4 part questions exist?

            Once all the good question sets have been bookmarked, the Search > Bookmark > Inverse Bookmark command swaps the bookmarks over to the lines that don’t comply with the rules. These will be the question sets you need to check and possibly edit.

            I’ll be back shortly with the solution.

            Terry

              Terry R

              @Terry-R said in Bookmark sets of lines that does not meet criteria:

              I’ll be back shortly with the solution

              So as explained before I think it’s best to bookmark in steps. Look for good 3 part questions and bookmark them. Then look for good 2 part questions and bookmark those.
              Then using the “Inverse bookmark” option under Search, Bookmark you finish up with the bookmarks on question sets that need to be checked and possibly edited.
              So using the “Mark” function we have
              Find What:(?-s)\d+[^\x5f\r\n]+?\x5f+[^\x5f\r\n]+?\x5f+[^\x5f\r\n]+?\x5f+[^\x5f\r\n]+\R[^/\r\n]+?/[^/\r\n]+?/[^/\r\n]+\R(.+?/.+\R)*
              As these are regular expressions the Search Mode must be set to regular expression. Tick the “bookmark line” option. This will mark each line with a marker in the left column, generally a blue circle and the line itself is highlighted.
              When you have done the 3 part questions (the above regex) you can repeat with the 2 part question using
              Find What:(?-s)\d+[^\x5f\r\n]+?\x5f+[^\x5f\r\n]+?\x5f+[^\x5f\r\n]+\R[^/\r\n]+?/[^/\r\n]+\R(.+?/.+\R)*

              If you have any 4 part questions I’ll leave it up to you to edit the 3 part one. There are 2 repeated sets of subexpressions which get added again. It’s getting late so I can’t provide that right now. If you are still having problems I could help in about 12hrs or maybe someone else might chip in.

              For a bit of background info this (.+?/.+\R)* part of the regex is what highlights the remaining lines in each set if the first line is good. It selects each following line which has a / in it, stopping at a line which doesn’t (next question set or a blank line perhaps).
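To experiment with this mark-then-invert approach outside Notepad++, the two regexes can be tried in Python. This is only a sketch under some assumptions: Python's `re` has no `\R`, so `\r?\n` stands in for it, and `(?-s)` is dropped because `.` already excludes newlines by default. The helper name `lines_to_check` is hypothetical, and blank lines are simply ignored rather than bookmarked.

```python
import re

# The 3 part and 2 part regexes from the post, adapted for Python's re module.
THREE_PART = re.compile(
    r"\d+[^_\r\n]+?_+[^_\r\n]+?_+[^_\r\n]+?_+[^_\r\n]+\r?\n"
    r"[^/\r\n]+?/[^/\r\n]+?/[^/\r\n]+\r?\n(?:.+?/.+\r?\n)*"
)
TWO_PART = re.compile(
    r"\d+[^_\r\n]+?_+[^_\r\n]+?_+[^_\r\n]+\r?\n"
    r"[^/\r\n]+?/[^/\r\n]+\r?\n(?:.+?/.+\r?\n)*"
)

def lines_to_check(text):
    """Mark every line touched by a good match, then invert over non-blank lines."""
    # Character offset at which each line starts, plus the end of the text.
    line_starts = [0]
    for line in text.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))
    n_lines = len(line_starts) - 1
    marked = set()
    for pattern in (THREE_PART, TWO_PART):
        for m in pattern.finditer(text):
            for i in range(n_lines):
                if m.start() < line_starts[i + 1] and line_starts[i] < m.end():
                    marked.add(i)
    # "Inverse bookmark": report the 1-based numbers of unmarked, non-blank lines.
    lines = text.splitlines()
    return [i + 1 for i, line in enumerate(lines) if line.strip() and i not in marked]

sample = (
    "1. A ____ can bark. A ____ can catch mouses\n"
    "a. dog / cat\n"
    "b. fish / tiger\n"
    "c. human / lady\n"
    "d. table / chair\n"
    "\n"
    "2. A ____ can bark. A can catch mouses\n"
    "a. dog / cat\n"
    "b. fish / tiger\n"
    "c. human / lady\n"
    "d. table / chair\n"
)
print(lines_to_check(sample))  # the five lines of the damaged second set
```

Like the two regexes it wraps, this sketch does not recognise 1 part questions, so those would be reported as lines to check.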

              Terry

                Terry R

                @Terry-R said in Bookmark sets of lines that does not meet criteria:

                So as explained before I think it’s best to bookmark in steps. Look for good 3 part questions and bookmark them. Then look for good 2 part questions and bookmark those.

                I will just add that I tried to think of a method of finding the bad sets in 1 step, but couldn’t. There are some other forum members here with capabilities exceeding my own and possibly they might have an idea how that might be achieved. You see the primary problem is that regexes cannot count, at least not in a way that helps here as far as I can see. That meant looking at a solution in a different way, sort of a reverse idea, then inverting the answer.

                Terry

                  guy038

                  Hello, @bá-hùng-lê, @terry-r and All,

                  I haven’t looked at @terry-r’s solution yet, so as not to be influenced in any way ;-))

                  My general idea is to temporarily change your file contents by :

                  • Replacing any fill gap, _____, with a single / symbol

                  • Adding a / symbol at the end of each choice a., b., c., and d.

                  It becomes obvious that, after these first replacements, any section, made of the question line and the four answer lines, contains the same number of / symbols, on each line !

                  So, with the appropriate regex, we’ll bookmark, either :

                  • The true empty lines and lines with blank characters only

                  • All correct sections, successively, with one, two, three… fill gap(s)

                  => Now, by cumulative process, all correct sections should be bookmarked !

                  We’ll, then, run the option Search > Bookmark > Inverse Bookmark

                  => All the remaining blocks of bookmarked lines should be sections which need corrections !

                  • Once the number of / symbols is identical, on each line of each bookmarked section, we’ll get the initial layout of your file by :

                    • Replacing any single / symbol in question lines, beginning with a digit, with the string _____

                    • Deleting the last / symbol of any answer line, beginning with a lower-case letter


                  OK ! Let’s start with that initial text :

                  1. _____ you for helping me!                                        # CORRECT section
                  a. Hate
                  b. Love
                  c. Thank
                  d. Kiss
                  
                  2. A _____ can bark. A _____ can catch mouses                       # CORRECT section
                  a. dog / cat
                  b. fish / tiger
                  c. human / lady
                  d. table / chair
                  
                  3. Your _____ is so beautiful, I _____ it! Where can I _____ it ?   # CORRECT section
                  a. dress / hate / sell
                  b. hat / love / buy
                  c. table / want / posses
                  d. hand / love / take
                  
                  2. A _____ can bark. A _____ can catch mouses                       # WRONG section
                  a. dog / cat
                  b. fish / tiger
                  c. human  lady
                  d. table / chair
                  
                  1. you for helping me!                                              # WRONG section
                  a. Hate
                  b. Love
                  c. Tha / nk
                  d. Kiss
                  
                  2. A _____ can bark. A _____ can catch _____ mouses                 # WRONG section
                  a. dog / cat
                  b. fish / tiger
                  c. human / lady
                  d. table / chair
                  
                  3. Your _____ is so beautiful, I it! Where can I _____ it ?         # WRONG section
                  a. dress / hate / sell
                  b. hat / love / buy
                  c. table / want / pos / ses
                  d. hand / love / take
                  
                  1. _____ you for helping me!                                        # CORRECT section
                  a. Hate
                  b. Love
                  c. Thank
                  d. Kiss
                  

                  • Open the Replace dialog ( Ctrl + H )

                    • SEARCH (?-is)_+|^(\l.+[^/\r\n])$

                    • REPLACE (?1\1)/

                    • Tick the Wrap around option

                    • Select the Regular expression search mode

                    • Click on the Replace All button

                  You should get this text :

                  1. / you for helping me!                                        # CORRECT section
                  a. Hate/
                  b. Love/
                  c. Thank/
                  d. Kiss/
                  
                  2. A / can bark. A / can catch mouses                       # CORRECT section
                  a. dog / cat/
                  b. fish / tiger/
                  c. human / lady/
                  d. table / chair/
                  
                  3. Your / is so beautiful, I / it! Where can I / it ?   # CORRECT section
                  a. dress / hate / sell/
                  b. hat / love / buy/
                  c. table / want / posses/
                  d. hand / love / take/
                  
                  2. A / can bark. A / can catch mouses                       # WRONG section
                  a. dog / cat/
                  b. fish / tiger/
                  c. human  lady/
                  d. table / chair/
                  
                  1. you for helping me!                                              # WRONG section
                  a. Hate/
                  b. Love/
                  c. Tha / nk/
                  d. Kiss/
                  
                  2. A / can bark. A / can catch / mouses                 # WRONG section
                  a. dog / cat/
                  b. fish / tiger/
                  c. human / lady/
                  d. table / chair/
                  
                  3. Your / is so beautiful, I it! Where can I / it ?         # WRONG section
                  a. dress / hate / sell/
                  b. hat / love / buy/
                  c. table / want / pos / ses/
                  d. hand / love / take/
                  
                  1. / you for helping me!                                        # CORRECT section
                  a. Hate/
                  b. Love/
                  c. Thank/
                  d. Kiss/
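The replacement above can be mimicked in Python to see what it does. This is a sketch, not the Notepad++ behaviour itself: the Boost conditional replacement `(?1\1)/` becomes a small callback, `\l` becomes `[a-z]`, and the text is assumed to use plain `\n` line endings. The function name `to_slash_form` is hypothetical.

```python
import re

def to_slash_form(text):
    """Turn each gap into "/" and append "/" to each answer line."""
    def repl(m):
        # Group 1 is set only when a whole answer line matched: keep it, add "/".
        # Otherwise the match was a run of underscores, which becomes "/".
        return (m.group(1) or "") + "/"
    return re.sub(r"_+|^([a-z].+[^/\r\n])$", repl, text, flags=re.MULTILINE)

before = "2. A _____ can bark. A _____ can catch mouses\na. dog / cat\nb. fish / tiger\n"
print(to_slash_form(before))
```

Because an answer line that already ends with `/` no longer matches the second alternative, running the function twice gives the same result as running it once.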
                  

                  • Now, open the Mark dialog ( Ctrl + M )

                    • SEARCH ^(\h*\R)+

                    • Tick the two options Bookmark line and Wrap around

                    • Untick the two options Purge for each search and Backward direction

                    • Select the Regular expression search mode

                    • Click on the Mark All button

                  => This first action bookmarks all the empty and blank lines

                  • Then use the lines below, one at a time; each of them is a complete regex in free-spacing mode ( (?x) )

                    • In the Find what: zone, paste, successively, the first, second and third line, from (?x) to the end of each line, and run Mark All each time, in order to bookmark, in turn, the sections with one, two and three gaps !

                  (?x)  ^\d+ ( ( [^/\r\n] )+ /                                         (?2)* \R ) ( (?1) ){4}  #  To mark the "ONE   fill GAP" sections
                  (?x)  ^\d+ ( ( [^/\r\n] )+ / (?2)+ /                                 (?2)* \R ) ( (?1) ){4}  #  To mark the "TWO   fill GAP" sections
                  (?x)  ^\d+ ( ( [^/\r\n] )+ / (?2)+ / (?2)+ /                         (?2)* \R ) ( (?1) ){4}  #  To mark the "THREE fill GAP" sections
                  
                  (?x)  ^\d+ ( ( [^/\r\n] )+ / (?2)+ / (?2)+ / (?2)+ /                 (?2)* \R ) ( (?1) ){4}  #  To mark the "FOUR  fill GAP" sections
                  (?x)  ^\d+ ( ( [^/\r\n] )+ / (?2)+ / (?2)+ / (?2)+ / (?2)+ /         (?2)* \R ) ( (?1) ){4}  #  To mark the "FIVE  fill GAP" sections
                  (?x)  ^\d+ ( ( [^/\r\n] )+ / (?2)+ / (?2)+ / (?2)+ / (?2)+ / (?2)+ / (?2)* \R ) ( (?1) ){4}  #  To mark the "SIX   fill GAP" sections
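Python's `re` module has no subroutine calls such as `(?1)` or `(?2)`, so an equivalent pattern has to be spelled out by string concatenation. The sketch below builds the pattern for a given number of gaps; `section_pattern` is a hypothetical helper name, and `\r?\n` stands in for `\R`.

```python
import re

def section_pattern(gaps):
    """Build a pattern for one section whose five lines each hold `gaps` slashes."""
    c = r"[^/\r\n]"
    # One line: text, "/", then (gaps - 1) more "text /" pieces, optional text, EOL.
    line = c + "+/" + (c + "+/") * (gaps - 1) + c + r"*\r?\n"
    # The question line starts with digits and is followed by four answer lines.
    return re.compile(r"^\d+" + line + "(?:" + line + "){4}", re.MULTILINE)

good = (
    "2. A / can bark. A / can catch mouses\n"
    "a. dog / cat/\n"
    "b. fish / tiger/\n"
    "c. human / lady/\n"
    "d. table / chair/\n"
)
bad = good.replace("human / lady", "human  lady")

print(section_pattern(2).search(good) is not None)
print(section_pattern(2).search(bad) is not None)
```

Looping `gaps` from 1 to 6 would cover the same range as the six regexes listed above.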
                  

                  Now, run the command Search > Bookmark > Inverse Bookmark

                  You should get this text, where I added • characters to indicate the remaining blue bookmarked lines, which need corrections :

                   1. / you for helping me!                                        # CORRECT section
                   a. Hate/
                   b. Love/
                   c. Thank/
                   d. Kiss/
                   
                   2. A / can bark. A / can catch mouses                       # CORRECT section
                   a. dog / cat/
                   b. fish / tiger/
                   c. human / lady/
                   d. table / chair/
                   
                   3. Your / is so beautiful, I / it! Where can I / it ?   # CORRECT section
                   a. dress / hate / sell/
                   b. hat / love / buy/
                   c. table / want / posses/
                   d. hand / love / take/
                   
                  •2. A / can bark. A / can catch mouses                       # WRONG section
                  •a. dog / cat/
                  •b. fish / tiger/
                  •c. human  lady/
                  •d. table / chair/
                   
                  •1. you for helping me!                                              # WRONG section
                  •a. Hate/
                  •b. Love/
                  •c. Tha / nk/
                  •d. Kiss/
                   
                  •2. A / can bark. A / can catch / mouses                 # WRONG section
                  •a. dog / cat/
                  •b. fish / tiger/
                  •c. human / lady/
                  •d. table / chair/
                   
                  •3. Your / is so beautiful, I it! Where can I / it ?         # WRONG section
                  •a. dress / hate / sell/
                  •b. hat / love / buy/
                  •c. table / want / pos / ses/
                  •d. hand / love / take/
                   
                   1. / you for helping me!                                        # CORRECT section
                   a. Hate/
                   b. Love/
                   c. Thank/
                   d. Kiss/
                  

                  Now, correct all the problems, using the F2 and Shift + F2 shortcuts to reach the wrong sections of your file.

                  Regarding our simple example :

                  • In the first wrong section, a / is missing, between words human and lady

                  • In the second wrong section, a / is missing, after 1. and an excess / splits the word Thank

                  • In the third wrong section, an excess / is inserted between the words catch and mouses

                  • In the fourth wrong section, a / is missing, between words I and it and an excess / separates the word posses


                  Once your file is syntactically correct, simply use this final regex S/R to get the right syntax back :

                  • SEARCH (?-is)^(\l+.+)/$|(^\d+|(?!\A)\G)(.*?)/

                  • REPLACE (?1\1)?3\2\3_____

                  => All the /, in question lines, have been changed to _____ and the excess / symbol, at the end of each answer line, has been deleted as well !
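The same restoration can be sketched in Python. Since Python's `re` has no `\G` anchor, this version works line by line instead of translating the regex literally, so treat it as an equivalent under the thread's assumptions (question lines start with a digit, answer lines with a lowercase letter), not as the S/R itself. The name `from_slash_form` is hypothetical.

```python
import re

def from_slash_form(text):
    """Undo the temporary "/" layout: restore gaps and drop trailing slashes."""
    out = []
    for line in text.split("\n"):
        if re.match(r"\d", line):
            # Question line: every "/" goes back to a five-underscore gap.
            line = line.replace("/", "_____")
        elif re.match(r"[a-z]", line) and line.endswith("/"):
            # Answer line: remove only the "/" that was appended at the end.
            line = line[:-1]
        out.append(line)
    return "\n".join(out)

print(from_slash_form("1. / you for helping me!\na. Hate/\nb. Love/\n"))
```

As with the regex version, every gap comes back as exactly five underscores, whatever its width was before the first replacement.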


                  To end, you’ll find, above, the different regexes to mark up from one to six fill gaps sections. I suppose that this range is enough for your needs !

                  Best Regards,

                  guy038

                  P.S. : All the S/R, above, work once only. So, absolutely no trouble if you click twice or more on the Replace All button !

                    Bá Hùng Lê

                    OMG thank you guys. You guys have just REVOLUTIONIZED my process. Both of you did such PHENOMENAL jobs of intuitively explaining your solutions to a newbie like me. So thankyou thankyou thankyou!

                      guy038

                      Hi, @bá-hùng-lê :

                      If you want to explore the amazing world of regular expressions, here is a good starting point

                      Best Regards,

                      guy038

                        Terry R @Bá Hùng Lê

                        @Bá-Hùng-Lê

                        I had another look at my (and @guy038’s) solutions and thought there are too many steps, it should be easier. I did eventually come up with a much neater solution. My regex will identify correct sets for 1, 2 and 3 part answers in 1 step (can be expanded to as many as required). Then the final step is to invert the bookmarks, ending up with the sets to check and possibly edit.

                        So search mode must be set to “regular expression”. Also tick “wrap around”.

                        1. Mark the good lines, any with 1 to 3 ____ and one fewer / on the first answer line.
                          Mark function, have “Bookmark line” also ticked.
                          Find What:(?-s)^\d+(?=.*\x5f)([^\x5f\r\n]*\x5f+)?+([^\x5f\r\n]*\x5f+)?+([^\x5f\r\n]*\x5f+)?+[^\x5f\r\n]*\R[^/\r\n]*(?(2)/)[^/\r\n]*(?(3)/)[^/\r\n]*\R(.+\R){3}\x20*\R?

                        This regex can be easily expanded to suit as many parts as required to search for, obviously it gets longer with each set added. Here is a 5 part regex:
                        (?-s)^\d+(?=.*\x5f)([^\x5f\r\n]*\x5f+)?+([^\x5f\r\n]*\x5f+)?+([^\x5f\r\n]*\x5f+)?+([^\x5f\r\n]*\x5f+)?+([^\x5f\r\n]*\x5f+)?+[^\x5f\r\n]*\R[^/\r\n]*(?(2)/)[^/\r\n]*(?(3)/)[^/\r\n]*(?(4)/)[^/\r\n]*(?(5)/)[^/\r\n]*\R(.+\R){3}\x20*\R?
                        So if you have some 4 part questions you can just use the 5 part one as it checks any from 1 to 5.

                        So that is 2 steps, first is “bookmarking” the good lines, the second is inverse bookmarks to get the lines (question sets) to check and edit.

                        Here is the description of what each part is performing:
                        (?-s)…states that the dot character . cannot match a newline character
                        ^\d+(?=.*\x5f)…find a line starting with some numbers, at least 1. If found check that the line also contains a ____ string. So any question sets missing this are obviously wrong and will never be bookmarked.
                        ([^\x5f\r\n]*\x5f+)?+…look for some characters other than the _ (or newline), then look for a number of _ characters, at least 1 together. The ?+ is possessive, so if found keep them. This sub-expression is repeated for the number of ___ to be located.
                        [^\x5f\r\n]*\R…look for some characters other than the _ (or newline), then look for a newline character.
                        [^/\r\n]*(?(2)/)…look for some characters other than the / (or newline). Check if group 2 was created earlier, if so then look for a /. This sub-expression is repeated for the number of / to be found, which is 1 less than the number of ____ to be found.
                        [^/\r\n]*\R…look for some characters other than the / (or newline), followed by newline.
                        (.+\R){3}…look for 3 lines including a newline character.
                        \x20*\R?…look for any following “empty” line, so it may contain nothing or 1 (or more) spaces.
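The conditional `(?(2)/)` idea also works in Python's `re`, so the one-step check can be tried there. The sketch below is an adaptation, not the exact regex above: possessive quantifiers such as `?+` only exist in Python 3.11 and later, so the optional groups are nested and a `(?!_)` lookahead keeps a run of underscores from being split. Group numbers therefore differ from the post, `\r?\n` replaces `\R`, the trailing blank-line part is left out, and the names `RUN` and `GOOD_SET` are hypothetical.

```python
import re

RUN = r"[^_\r\n]*_+(?!_)"       # some text followed by one whole run of underscores
GOOD_SET = re.compile(
    r"^\d+" + RUN               # the first gap is mandatory
    + r"(?:(" + RUN + r")"      # group 1: a second gap, if present
    + r"(?:(" + RUN + r"))?)?"  # group 2: a third gap, only after a second one
    + r"[^_\r\n]*\r?\n"         # rest of the question line
    # First answer line: one "/" per extra gap that was found.
    + r"[^/\r\n]*(?(1)/)[^/\r\n]*(?(2)/)[^/\r\n]*\r?\n"
    + r"(?:.+\r?\n){3}",        # three more answer lines
    re.MULTILINE,
)

good = "1. A ____ can bark. A ____ can catch mouses\na. dog / cat\nb. fish / tiger\nc. human / lady\nd. table / chair\n"
damaged = "1. A ____ can bark. A can catch mouses\na. dog / cat\nb. fish / tiger\nc. human / lady\nd. table / chair\n"

print(GOOD_SET.search(good) is not None)
print(GOOD_SET.search(damaged) is not None)
```

Nesting the optional groups guarantees that group 2 can only be set when group 1 is, which is the role the possessive `?+` plays in the original.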

                        Terry

                          guy038

                          Hello, @bá-hùng-lê, @terry-r and All,

                          When I first read the @terry-r post, I was intrigued by the (?(2)/) syntax. After a while, I understood that it is a conditional regex syntax. I must admit that I’ve never used this feature, yet, in my replies on this forum… and that’s a big mistake !

                          The general syntax of a conditional regex structure is :

                          (?(Condition)Regex if TRUE[|Regex if FALSE]), where Condition is, either :

                          • #, a digit of a numbered group

                          • <Name> / 'Name', a name of a named group

                          • (?=••••) / (?!••••) / (?<=••••) / (?<!••••), a look-around assertion

                          • R, a recursive reference to the overall regex

                          • Rn, a recursive reference to a numbered group n

                          • R&Name, a recursive reference to a named group Name
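As a small illustration of the TRUE | FALSE form, here is a sketch in Python, whose `re` module accepts only a group number or a group name as the condition (the look-around and recursion conditions listed above belong to the Boost engine used by Notepad++). The pattern and the helper name `consistent` are hypothetical and check only a question line plus its first answer line.

```python
import re

# Group 1 is the optional second gap. The conditional demands "word / word"
# when group 1 took part in the match, and a single word otherwise.
PAIR_OR_SINGLE = re.compile(
    r"^[^_\n]*_+(?!_)([^_\n]*_+)?[^_\n]*\n"
    r"[a-z]\. (?(1)\w+ / \w+|\w+)$"
)

def consistent(question_and_first_answer):
    """True when the first answer line fits the number of gaps (one or two)."""
    return PAIR_OR_SINGLE.match(question_and_first_answer) is not None

print(consistent("1. A ____ can bark. A ____ can catch mouses\na. dog / cat"))
print(consistent("1. A ____ can bark. A can catch mouses\na. dog / cat"))
```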


                          These conditional regexes, introduced by Terry, are a powerful method, especially when two sets of data are linked together. This is precisely the case of this topic, as :

                          • We have one or more _____ areas in a first line, the question

                          • Then, we have four sets of answers and each set must contain as many answers as there are ______ areas, in the question

                          So :

                          • In case of one area to fill in, the regex should detect this correct text :
                          1. 11111 _____ 22222
                          a. wwwww
                          b. xxxxx
                          c. yyyyy
                          d. zzzzz
                          
                          • In case of two areas to fill in, the regex should detect this correct text :
                          1. 11111 _____ 22222 _____ 33333
                          a. wwwww / sssss
                          b. xxxxx / ttttt
                          c. yyyyy / uuuuu
                          d. zzzzz / vvvvv
                          
                          • In case of three areas to fill in, the regex should detect this correct text :
                          1. 11111 _____ 22222 _____ 33333 _____ 44444
                          a. wwwww / sssss / ooooo
                          b. xxxxx / ttttt / ppppp
                          c. yyyyy / uuuuu / qqqqq
                          d. zzzzz / vvvvv / rrrrr
                          

                          And so on…


                          Therefore, if we assume a text, with three areas _____, like above :

                          • The 11111_____ area stored as group 1
                          • The 22222_____ area stored as group 2
                          • The 33333_____ area stored as group 3

                          Here is what the regex should look for :

                          • A letter a and a dot

                          • A space char and the wwwww area, as there is always ONE _____ area to fill in

                          • A space char and the / sssss area, if group 2 exists

                          • A space char and the / ooooo area, if group 3 exists


                          So, here is a regex, expressed with the free-spacing mode (?x), which suits, from one to five _____ areas :

                          (?x-is)  # FREE-SPACING mode + search SENSITIVE to CASE + DOT = STANDARD character
                          
                                                            #  DEFINITION of groups 1, 2 and 3
                          
                            ( \x20 (?: [^_\r\n]+ \x20 )? )  #  (G1) = SPACE + [ ANY char(s), DIFFERENT from _ and EOL + SPACE ]
                            ( \x20 [^/\r\n]+ )              #  (G2) = SPACE +   ANY char(s), DIFFERENT from / and EOL
                            ( \x20 / )                      #  (G3) = SPACE + SLASH
                            ¤                               #  NEVER matches ( NON-EXISTING '¤' )
                          
                          |  #  OR
                          
                                                    #  FOR the QUESTION line :
                            ^\d+ \.                 #    DIGITS at START + DOT
                            (?=.*_)                 #      IF an UNDERSCORE exists in CURRENT line
                            (?![\x20_]*$)           #      IF CURRENT line DOESN'T contain UNDERSCORE(S) and SPACE(s), ONLY
                          
                             (?1)_+                 #           =   G1 + UNDERSCORE(S)
                            ((?1)_+)?+              #    (G4)?+ = [ G1 + UNDERSCORE(S) ] in an ATOMIC way
                            ((?1)_+)?+              #    (G5)?+ = [ G1 + UNDERSCORE(S) ] in an ATOMIC way
                            ((?1)_+)?+              #    (G6)?+ = [ G1 + UNDERSCORE(S) ] in an ATOMIC way
                            ((?1)_+)?+              #    (G7)?+ = [ G1 + UNDERSCORE(S) ] in an ATOMIC way
                          
                            (?:\x20 [^_\r\n]+)? \R  #    [ SPACE + ANY char(s), DIFFERENT from _ and EOL chars ] + EOL char(s)
                                                    #  END FOR
                          
                                #  For EACH ANSWER line, below :
                                #    LOWERCASE letter [abcd] at START + DOT, if CURRENT line NOT BLANK
                                #    G2  + IF group 4 or 5 or 6 or 7 EXISTS, search G3 + G2 for each EXISTING group + EOL char(s)
                                #  END FOR
                          
                            ^a \. (?!\h*$) (?2) (?(4)(?3)(?2)) (?(5)(?3)(?2)) (?(6)(?3)(?2)) (?(7)(?3)(?2)) \R
                            ^b \. (?!\h*$) (?2) (?(4)(?3)(?2)) (?(5)(?3)(?2)) (?(6)(?3)(?2)) (?(7)(?3)(?2)) \R
                            ^c \. (?!\h*$) (?2) (?(4)(?3)(?2)) (?(5)(?3)(?2)) (?(6)(?3)(?2)) (?(7)(?3)(?2)) \R
                            ^d \. (?!\h*$) (?2) (?(4)(?3)(?2)) (?(5)(?3)(?2)) (?(6)(?3)(?2)) (?(7)(?3)(?2)) \R
                          
                            \x20* \R?               #    [ SPACE char(s) ] + [ EOL char(s) ]
                          

                          The regex, with its comments after the # symbol, looks long, but it is able to detect numerous syntax errors, as it does not match the following cases :

                          • Number of answers in a set, different from the number of ____ areas ( the main condition )

                          • Missing space before and/or after any ______ area and any / symbol

                          • Missing number, letter, dot and/or space at beginning of line

                          • Lines without any content, after the dot and space char

                          • Switching of two or more answer lines

                          • Missing or excess answer lines

                          • Excess blank lines between each section


                          How does this regex work ? This regex contains two alternatives :

                          • The first alternative, below, is needed to store the groups 1, 2 and 3, by reference. So the exact regex is stored ( not its present value )

                          • The second one is the main regex which looks for a correct section, with a question line and a set of four answer lines

                          (?x-is)  # FREE-SPACING mode + search SENSITIVE to CASE + DOT = STANDARD character
                          
                                                            #  DEFINITION of groups 1, 2 and 3
                          
                            ( \x20 (?: [^_\r\n]+ \x20 )? )  #  (G1) = SPACE + [ ANY char(s), DIFFERENT from _ and EOL + SPACE ]
                            ( \x20 [^/\r\n]+ )              #  (G2) = SPACE +   ANY char(s), DIFFERENT from / and EOL
                            ( \x20 / )                      #  (G3) = SPACE + SLASH
                            ¤                               #  NEVER matches ( NON-EXISTING '¤' )
                          
                          |  #  OR
                          

                          As you can see, in order to properly define these groups, I’m using a special symbol ¤, not used in the current file. Therefore, this first alternative will never match ! But, fortunately, the groups 1, 2 and 3 are defined during this match attempt, and, above all, remain available while trying the main second alternative : that’s the KEY point ;-))

                          You can see this first failed alternative as a group definition region, which is never part of the final match ;-))


                          Near the end of the main alternative, there are four very similar lines :

                          (?x-is) ^a\. (?!\h*$) (?2) (?(4)(?3)(?2)) (?(5)(?3)(?2)) (?(6)(?3)(?2)) (?(7)(?3)(?2)) \R

                          which shows that the regex, needed to match a complete a.••••••• line, is increasing as the number of existing groups, 4, 5, 6 and 7, increases ! The conditional regex syntaxes (?(#)•••••) make all this process elegant and almost obvious ;-))

                          I hope that the explanations, given in comments, will be enough to satisfy your curiosity

                          Best Regards,

                          guy038

                            Alan Kilborn @guy038
                            last edited by

                            @guy038 said in Bookmark sets of lines that does not meet criteria:

                            NEVER matches ( NON-EXISTING ‘¤’ )

                            I was intrigued by this part.
                            (Well, okay, I was intrigued by other parts of Guy’s posting, too!)

                            FWIW, if I ever want to never-match, I use (?!).
                            In truth, I haven’t examined the situation above to see if this would work, but I presume it would.

                            Guy, I hope we see further discussion of the conditional regex structure in future postings, because the one above may not make it clear how to use it, for regex beginners/intermediates.

                            • Alan KilbornA
                              Alan Kilborn
                              last edited by

                              @guy038 said in Bookmark sets of lines that does not meet criteria:

                              (?=••••) / (?!••••) / (?<=••••) / (?<!••••), a look-around assertion

                              Interestingly, the Boost documentation for “Conditional expressions” only discusses the first two (the look-aheads, not the look-behinds) of these four, but in my limited testing all four seem to work.

                                  • PeterJonesP
                                    PeterJones @Alan Kilborn
                                    last edited by

                                    @Alan-Kilborn ,

                                    Sorry for the deleted posts. I see now you were talking about the “assertions” in the “Conditional expressions”, not the normal lookahead and lookbehind.

                                    • guy038G
                                      guy038
                                      last edited by guy038

                                      Hello, @bá-hùng-lê, @terry-r, @alan-kilborn, @peterjones and All,

                                      First, Alan, I’m really sorry, as you’re perfectly right about a NEVER-matching regex sequence. I always forget the other syntaxes :-(

                                      Indeed, in the free-spacing regex of my previous post, you may replace the ¤ symbol with either :

                                      • The empty negative look-ahead (?!), which always fails, as an empty pattern matches at every position

                                      • The backtracking control verb (*F) or (*FAIL), which seems to be the official syntax to force a failure at the current position, so that the engine backtracks and, possibly, tries other parts of the overall regex for a successful match
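The never-matching behaviour of (?!) can be checked quickly outside Notepad++. The sketch below uses Python's standard `re` module, which is not the Boost engine used by Notepad++ and, as far as I know, does not accept (*F), so only (?!) is shown. The sample strings are made up.

```python
import re

# (?!) is a negative look-ahead around an empty pattern: it fails at every position.
never = re.compile(r"(?!)")
print(never.search("any text at all"))

# A first alternative that always fails does not stop the second one from matching.
m = re.search(r"a(?!)|b", "ab")
found = m.group(0)
print(found)
```

The second search mirrors the structure of the "definition" alternative: the branch ending in (?!) is tried and abandoned, then the engine moves on to the other branch.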


                                      Now, some points about the regex conditional structures :

                                      I did some tests and, globally, I would say that this feature is not essential in “everyday” regexes. But, as I previously said, the conditional syntaxes are really interesting when dealing with correlated data !

                                      For instance, let’s start with the regex (?-is)^.*Paul.*\R(.*Bob.*\R.*Alice.*\R|.*Alice.*\R.*Bob.*\R), which matches the two blocks below, where the last two lines can be swapped !

                                      Here is Paul Smith
                                      Yesterday I saw Bob who spoke
                                      with Alice, in the street
                                      
                                      Here is Paul Smith
                                      Yesterday I saw Alice who spoke
                                      with Bob, in the street
                                      

                                      Note that we can also simplify this regex as (?-is)^.*Paul.*\R.*(Bob.*\R.*Alice|Alice.*\R.*Bob).*\R

                                      But we can choose to use, for instance, the conditional structure (?(#)•••••••) . So, our regex is changed into :

                                      (?-is)^.*Paul.*\R.*((Bob)|Alice).*\R.*(?(2)Alice|Bob).*\R

                                      As you can see :

                                      • If group 2 exists, i.e. Bob was found in the second line, the regex searches for Alice in the third line

                                      • Else, Alice was found in the second line and the regex then searches for Bob in the third line

                                      In terms of complexity, we can’t clearly see the advantages of conditionals !
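For readers who want to experiment with this conditional outside Notepad++, here is a minimal sketch with Python's standard `re` module, which also supports the (?(2)•••|•••) syntax. Two things are assumptions on my side: `\R` is replaced with `\n`, as Python's `re` has no `\R`, and a third block with Bob on both lines is added to show a non-match.

```python
import re

# Same regex as in the post, with \n instead of \R (assumes LF line endings).
PATTERN = re.compile(
    r"^.*Paul.*\n.*((Bob)|Alice).*\n.*(?(2)Alice|Bob).*\n",
    re.MULTILINE,
)

TEXT = (
    "Here is Paul Smith\n"
    "Yesterday I saw Bob who spoke\n"
    "with Alice, in the street\n"
    "\n"
    "Here is Paul Smith\n"
    "Yesterday I saw Alice who spoke\n"
    "with Bob, in the street\n"
    "\n"
    "Here is Paul Smith\n"          # made-up block: Bob twice, no Alice
    "Yesterday I saw Bob who spoke\n"
    "with Bob, in the street\n"
)

matches = [m.group(0) for m in PATTERN.finditer(TEXT)]
print(len(matches))
```

Only the first two blocks are matched: in the third one, group 2 is set by Bob, so the conditional insists on Alice in the last line.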


                                      Let’s play with the text below, where each block of three lines must be matched :

                                      A
                                      X
                                      a
                                      
                                      AB
                                      XY
                                      ab
                                      
                                      ABC
                                      XYZ
                                      abc
                                      
                                      ABCD
                                      XYZT
                                      abcd
                                      

                                      First, we can use the usual and sequential form, with four alternatives in order to match each block. In free-spacing mode, we have :

                                      (?x-i)
                                      A     \R  X    \R  a    \R  |
                                      AB    \R  XY   \R  ab   \R  |
                                      ABC   \R  XYZ  \R  abc  \R  |
                                      ABCD  \R  XYZT \R  abcd \R
                                      

                                      Again, let’s use the approach with conditionals. We get :

                                      (?x-i)
                                      A    (B)?     (C)?     (D)?   \R
                                      X  (?(1)Y)  (?(2)Z)  (?(3)T)  \R
                                      a  (?(1)b)  (?(2)c)  (?(3)d)  \R
                                      

                                      As with the previous example, the gain of this new feature is not that obvious ! Now, let’s imagine that some letters represent an important and/or complicated regex : we immediately see the benefit of the latter syntax, as each letter occurs just once in the complete regex !
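The conditional version can be tried with Python's standard `re` module, whose VERBOSE flag plays the role of the (?x) free-spacing mode. This is only a sketch: `\n` replaces `\R`, which Python's `re` does not know, and the last block of the sample text (AB / X / ab) is an invented faulty block.

```python
import re

# \n stands in for \R (assumes LF line endings); whitespace is ignored in VERBOSE mode.
BLOCK = re.compile(
    r"""
    A    (B)?     (C)?     (D)?   \n
    X  (?(1)Y)  (?(2)Z)  (?(3)T)  \n
    a  (?(1)b)  (?(2)c)  (?(3)d)  \n
    """,
    re.VERBOSE,
)

# The four blocks of the post, plus one invented faulty block at the end.
TEXT = "A\nX\na\n\nAB\nXY\nab\n\nABC\nXYZ\nabc\n\nABCD\nXYZT\nabcd\n\nAB\nX\nab\n"

blocks = [m.group(0) for m in BLOCK.finditer(TEXT)]
for block in blocks:
    print(repr(block))
```

The faulty block is skipped because group 1 captured B, so the second line must contain Y right after X.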

                                      You might retort that some parts of the former regex can be factored out. However, the irreducible form seems to be :

                                      (?x-i)
                                      A
                                      (
                                           \R  X    \R  a     |
                                      B    \R  XY   \R  ab    |
                                      BC   \R  XYZ  \R  abc   |
                                      BCD  \R  XYZT \R  abcd 
                                      )  \R
                                      

                                      But, if X and a stand for very long regexes, this syntax remains tedious !


                                      We could have the same reasoning with single-line blocks of text, like below :

                                      AXa
                                      
                                      ABXYab
                                      
                                      ABCXYZabc
                                      
                                      ABCDXYZTabcd
                                      

                                      The normal regex syntax is

                                      (?x-i)
                                      A     X     a     \R  |
                                      AB    XY    ab    \R  |
                                      ABC   XYZ   abc   \R  |
                                      ABCD  XYZT  abcd  \R
                                      

                                      Which could be simplified as :

                                      (?x-i)
                                      A
                                      (
                                           X     a     |
                                      B    XY    ab    |
                                      BC   XYZ   abc   |
                                      BCD  XYZT  abcd
                                      )  \R
                                      

                                      Now, the same syntax, with conditionals, is :

                                      (?x-i)
                                      A  (B)?   (C)?   (D)? 
                                      X(?(1)Y)(?(2)Z)(?(3)T)
                                      a(?(1)b)(?(2)c)(?(3)d)
                                      \R
                                      

                                      Again, in case of complicated regexes, standing for the letters, this last syntax seems better !

                                      Best Regards,

                                      guy038

                                      • guy038G
                                        guy038
                                        last edited by

                                        Hi, @bá-hùng-lê, @terry-r, @alan-kilborn, @peterjones and All,

                                        In retrospect, my argument for conditional expressions seems a bit weak ! Indeed, if we consider this regex, without conditionals :

                                        (?x-i)
                                        A     \R  X    \R  a    \R  |
                                        AB    \R  XY   \R  ab   \R  |
                                        ABC   \R  XYZ  \R  abc  \R  |
                                        ABCD  \R  XYZT \R  abcd \R
                                        

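                                        Even if we suppose that, let’s say, X and a stand for long regex sequences, we can still use the sub-routine call syntax (?1) and (?2), with groups 1 and 2 defined in a first alternative that always fails because of (*F), to simplify this example as :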

                                        (?x-i)
                                        (X)(a)(*F)
                                        |
                                        A     \R  (?1)    \R  (?2)    \R  |
                                        AB    \R  (?1)Y   \R  (?2)b   \R  |
                                        ABC   \R  (?1)YZ  \R  (?2)bc  \R  |
                                        ABCD  \R  (?1)YZT \R  (?2)bcd \R
                                        

                                        Perhaps we’ll come across examples, later, which clearly show some real advantages of using the conditional feature !
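As far as I know, Python's standard `re` module has neither (?1) sub-routine calls nor (*F), so the sketch below swaps in a different technique, plain string interpolation, to get the same "write X and a only once" effect. The names `X_PART` and `A_PART` are hypothetical and `\n` replaces `\R`.

```python
import re

# Hypothetical stand-ins for the "long" sub-regexes; here they are the literals X and a.
X_PART = r"X"
A_PART = r"a"

# Each fragment is written once and interpolated where the post uses (?1) and (?2).
PATTERN = re.compile(
    rf"""
    A     \n  {X_PART}     \n  {A_PART}     \n  |
    AB    \n  {X_PART}Y    \n  {A_PART}b    \n  |
    ABC   \n  {X_PART}YZ   \n  {A_PART}bc   \n  |
    ABCD  \n  {X_PART}YZT  \n  {A_PART}bcd  \n
    """,
    re.VERBOSE,
)

TEXT = "A\nX\na\n\nAB\nXY\nab\n\nABC\nXYZ\nabc\n\nABCD\nXYZT\nabcd\n"

found = [m.group(0) for m in PATTERN.finditer(TEXT)]
print(len(found))
```

Unlike a true sub-routine call, the fragment is physically copied into the final pattern, so this only helps the person writing the regex, not the engine.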

                                        BR

                                        guy038

                                        • guy038G
                                          guy038
                                          last edited by guy038

                                          Hello @alan-kilborn and All,

                                          I’ve found a simple example of the advantage of the conditional feature !

                                          Let’s suppose that you have a particular tag <guy> and that you want :

                                          • To delete the starting tag <guy> with, both, its leading and trailing space chars

                                          • To delete the ending tag </guy> with its leading space char, only


                                          • The simple and obvious solution is :

                                            • SEARCH \x20<guy>\x20|\x20</guy>

                                            • REPLACE Leave EMPTY

                                          • Now, an equivalent regex S/R, with a conditional expression related to group 1 and the tag name written only once, is :

                                            • SEARCH \x20<(/)?guy>(?(1)|\x20)

                                            • REPLACE Leave EMPTY


                                          I verified that deleting 500,000 starting tags and 500,000 ending tags, in one step, takes the same time, whichever regex syntax is used !
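Both search regexes can be compared on a small sample with Python's standard `re` module, which accepts the same (?(1)|\x20) conditional. This is a sketch only: the sample line is made up and Python's engine is not the one used by Notepad++.

```python
import re

# The two SEARCH regexes from the post; the replacement is empty in both cases.
SIMPLE = re.compile(r"\x20<guy>\x20|\x20</guy>")
CONDITIONAL = re.compile(r"\x20<(/)?guy>(?(1)|\x20)")

text = "abc <guy> def </guy> ghi"  # made-up sample line

simple_result = SIMPLE.sub("", text)
conditional_result = CONDITIONAL.sub("", text)
print(simple_result == conditional_result)
```

In the conditional form, group 1 captures the slash of an ending tag: when it is set, nothing more is consumed, otherwise the trailing space of the starting tag is consumed too.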

                                          Best Regards,

                                          guy038

                                          The Community of users of the Notepad++ text editor.