    Help with regex

    • DaveyD

      Hi, I am trying to build a regex for the following:
      I have a file that has sections that are titled as follows:

      =======      MY FIRST SECTION      ===================
      

      I need to extract only the section name (here, the words MY FIRST SECTION).
      As in this example, there can be multiple spaces within the section name, and there is no fixed number of words: there can be 1, 2, 3, 4, etc.

      I started with the following:

      =+\s+\K.+(?= )
      

      which returns the whole section name, including the spaces at the end.
      How can I get the words without the spaces?
      (If I put a ? after the \K.+, then it will only find the first word!)
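
Both behaviours can be reproduced outside Notepad++ with a small Python sketch. Python's standard re module does not support \K, so this sketch assumes a capture group as a stand-in: group 1 holds what \K would leave as the match. The sample line and the variable names are only illustrative.

```python
import re

# Sample header line in the format described above (illustrative).
line = "=======      MY FIRST SECTION      ==================="

# Python's re module has no \K, so group 1 stands in for
# "everything matched after \K".
greedy = re.search(r"=+\s+(.+)(?= )", line).group(1)
lazy = re.search(r"=+\s+(.+?)(?= )", line).group(1)

print(repr(greedy))  # section name, still followed by trailing spaces
print(repr(lazy))    # only the first word
```

The greedy .+ backs off only until the next character is a space, so it keeps trailing spaces; the lazy .+? stops at the first position that is followed by a space.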

      Anyone know how to do this?
      Thanks,
      David

      • Claudia Frank @DaveyD

        Hello David,

        =+\s+\K.+[^\s](?= )
        

        this seems to work although I don’t know why?

        =+ one or multiple equal signs followed by
        \s+ one or more whitespace chars
        \K reset position to current location
        .+ one or more chars
        [^\s] but not whitespace chars and
        (?= ) match space but consumes zero characters.

        Cheers
        Claudia

        • Claudia Frank

          nope - doesn’t work as expected

          • Claudia Frank

            This one seems to be working

            ^=+\s+\K.+[^\s](?= )
            

            But I’m still having trouble understanding why.

            Cheers
            Claudia

            • DaveyD

              Hi Claudia - Thanks!
              Works perfectly, and it’s simple! Why didn’t I think of this! :)
              The reason why it works is as follows:
              After resetting the position with \K, the regex continues:

              • find any characters (with the .) as much as it can find (+ without ?)
              • until and including a non-white_space character
              • followed by (and not including) a space

              Without the [^\s], the .+ only backs off until it is followed by a space, so the match still includes trailing spaces. Adding [^\s] in front of the non-consuming space (i.e., (?= )) forces the .+ to give back characters until the next character is a non-whitespace character that is itself followed by a space, which here is the last letter of the section name.

              I hope I explained that clearly enough…
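
As a rough check of this reasoning, here is a minimal Python sketch. It assumes a capture group in place of \K (which Python's standard re module does not support); the helper name section_name and the extra sample lines are hypothetical, made up for illustration.

```python
import re

# Group 1 stands in for what \K would leave as the match.
HEADER = re.compile(r"^=+\s+(.+[^\s])(?= )")

def section_name(line):
    """Return the section name from a header line, or None if it is not one."""
    m = HEADER.search(line)
    return m.group(1) if m else None

print(section_name("=======      MY FIRST SECTION      ==================="))
print(section_name("===   INTRO   ==="))
print(section_name("==  A   B  C ====="))
print(section_name("no header here"))
```

Spaces inside the name are kept, because only the last character of the match is forced to be non-whitespace.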

              Thanks a lot Claudia!
              All the best
              David

              • Claudia Frank @DaveyD

                Hello David,

                perfect, I now get it.

                I always interpreted [^\s] as “don’t find whitespace chars”, whereas it means
                “match one character that is anything except whitespace”.
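
A tiny Python sketch of that reading, assuming Python's standard re module behaves like the N++ engine on this point (the variable names are only illustrative):

```python
import re

# [^\s] consumes exactly one character, and it must not be whitespace.
one_letter = re.fullmatch(r"[^\s]", "N") is not None    # a single letter fits
one_space = re.fullmatch(r"[^\s]", " ") is not None     # a space is rejected
two_letters = re.fullmatch(r"[^\s]", "NN") is not None  # two chars are too many

# \S is the usual shorthand for the same class.
non_blank = re.findall(r"\S", "a b\tc")

print(one_letter, one_space, two_letters, non_blank)
```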

                Another step on my regex ladder.

                Thank you
                Claudia

                • DaveyD

                  Hi Claudia, happy to hear that I can be a help to you!
                  (After all, I am still so grateful that you pushed me into trying out python… never stopped since! :) - by the way, those lines of equal signs and section names… they’re produced with a python script! :) )

                  Although this is not the topic of discussion… for some reason the function list parser is only showing me the first 2 results of this regex.
                  I.e., I have 10 of these lines in my files, when I put the regex as a class_name in function list, it only finds the first 2 occurrences! And if I erase the first, then it finds the next 2…

                  Do you know what can be the cause of this?

                  Thanks,
                  David

                  • Claudia Frank @DaveyD

                    Hi David,

                    yes, Python rules. For me it is one of those languages that suits me. I can read it
                    and understand most of it instantly. Nice to see that you got infected too.

                    Regarding the function list, to be honest, no. I did a test (borrowed the Fortran configuration and modified it) with eleven sections
                    and it is working. See here.

                    Could it be that you modified the document without saving it? In that case it needs a click on refresh.
                    Otherwise … hmmm, how did you do the association? With a language ID, an extension, or user-defined?
                    Mine is using the ID.

                    Cheers
                    Claudia

                    • DaveyD

                      Never mind… I ‘sort of’ figured it out.
                      There was obviously some kind of overlap going on (not sure what), but when I changed the regex of the mainexpr a little bit, everything fell into place.

                      (To explain fully would take a lot of time and writing - and reading :) )

                      • Claudia Frank

                        good to see you did it ;-)

                        Have a nice day
                        Claudia

                        • DaveyD

                          Hi Claudia, thanks
                          I hadn’t seen your post when I wrote mine!

                          Regarding association, I did it by userDefinedLang ID
                          (By the way… thanks for that too! I couldn’t live without it! :)
                          [still waiting and hoping for version 3!] )

                          Thanks
                          David

                          (P.S. My replies are delayed since my reputation is under 2… :( - :) )

                          • guy038

                            Hi, Davey and Claudia,

                            I understood why the regex ^=+\s+\K.+[^\s](?= ) is correct and the regex =+\s+\K.+[^\s](?= ), without the ^ character, is NOT correct

                            Just imagine the subject text, below :

                            =======      MY FIRST SECTION      ===================
                            =======      MY SECOND SECTION      ===================
                            

                            You have to remember that the regex \s is strictly identical to the regex [\t\n\x0B\f\r\x20\xA0]


                            So, if you consider the wrong regex =+\s+\K.+[^\s](?= )

                            • A first click, on the Find Next button, matches the text MY FIRST SECTION. Nice !

                            • A second click, on the Find Next button, matches, wrongly, the string ======= MY SECOND SECTION. Why ?

                            Well, after the first click, the cursor is located between the last letter N and the space, in the first line


                            So, that wrong regex =+\s+\K.+[^\s](?= ) matches :

                            • The =================== string, at the end of the first line, due to the regex =+

                            • The EOL characters ( \r\n ) of the first line, which are, both, \s characters !

                            • The \K form discards everything matched so far, so the start of the match is reset to the position between the \n of the first line and the first = character of the second line

                            • Then, after backtracking, the ======= MY SECOND SECTIO string, due to the .+ part of the regex

                            • Finally, the N character, as it’s a NON BLANK character ( [^\s] )

                            • The space character, although not part of the final match, must be present, after the last N character, of the second line : that’s right

                            On the contrary, after the first match, when we use the regex ^=+\s+\K.+[^\s](?= ), the =+ part can only start matching at the beginning of a line, so the trailing = characters of the first line are skipped and the next match attempt correctly starts at the beginning of the second line !
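
The two cases above can be replayed with a short Python sketch. It is only an approximation of the N++ behaviour: Python's standard re module has no \K, so a capture group stands in for it, re.MULTILINE is assumed to play the role of the per-line ^ that N++ uses, and the sample text uses \n line endings instead of \r\n.

```python
import re

text = (
    "=======      MY FIRST SECTION      ===================\n"
    "=======      MY SECOND SECTION      ===================\n"
)

# Group 1 stands in for the text after \K (Python's re has no \K).
unanchored = re.compile(r"=+\s+(.+[^\s])(?= )")
anchored = re.compile(r"^=+\s+(.+[^\s])(?= )", re.MULTILINE)

unanchored_hits = [m.group(1) for m in unanchored.finditer(text)]
anchored_hits = [m.group(1) for m in anchored.finditer(text)]

# The second unanchored match begins on line 1, at the trailing = run.
second = list(unanchored.finditer(text))[1]
starts_on_first_line = second.start() < text.index("\n")

print(unanchored_hits)       # second hit drags in the leading ======= of line 2
print(anchored_hits)         # both section names, nothing else
print(starts_on_first_line)
```

Without the anchor, the trailing = run of line 1 plus the line break satisfy =+\s+, which is why the second hit begins with the = characters of line 2.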

                            Cheers,

                            guy038

                            • Claudia Frank @guy038

                              Hello guy038,

                              thank you for the detailed and good explanation.
                              When creating a regex, I still don’t pay enough attention to the current cursor position and the exact meaning of parts
                              of the regex like \K. Like you said, it is not only the reset but also the forgetting of what was previously matched that is important.
                              But I hope this improves as I do more and more of these. Yesterday I found a nice webpage
                              which helps me understand my regexes ;-)

                              Cheers
                              Claudia

                              • DaveyD

                                Hi guy038
                                Thanks for the extra clarity that you provided - as you always do!

                                All the best
                                Davey

                                • guy038

                                  Hello, Claudia and DaveyD,

                                  I had a look at the webpage that you found, Claudia. Really interesting, indeed !

                                  However, in order to get the same matches as the Notepad++ regex engine, we should take care of the following points, when using the site :

                                  https://regex101.com


                                  • We must use the default PCRE flavor, on the left part of the window, which seems to behave most like the N++ regex engine

                                  • In the gmixXsuUAJ field, you should, systematically, add the two modifiers gm

                                    • The g modifier means it will indicate all the matches of the test regex, and not only the first one ! ( just like the Find All button of the Mark tab ! )

                                    • The m modifier means that the anchors ^ and $ represent the beginning and the end of each line, as the N++ regex engine does ( implicit modifier (?m) )

                                  • In the gmixXsuUAJ field, you will add the i modifier, if you DON’T check the Match case option OR if your N++ regex begins with the (?i) form

                                  • In the gmixXsuUAJ field, you will add the s modifier, if you have CHECKED the . matches newline option OR if your N++ regex begins with the (?s) form

                                  • In the gmixXsuUAJ field, you will add the x modifier, if your N++ regex begins with the form (?x)

                                  • In the gmixXsuUAJ field, you will NOT indicate the m modifier, if your N++ regex begins with the form (?-m)
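
To get a feel for what the m, s, i and x modifiers change, here is a minimal Python sketch. It is an assumption-laden stand-in: Python's standard re module is neither the regex101 PCRE engine nor the N++ engine, and its flags re.MULTILINE, re.DOTALL, re.IGNORECASE and the inline (?x) are used here only as rough counterparts of those modifiers. The sample text is made up.

```python
import re

text = "first line\nSecond line"

# m: ^ and $ match at every line, not just at the ends of the whole text.
starts_default = re.findall(r"^\w+", text)
starts_m = re.findall(r"^\w+", text, re.MULTILINE)

# s: the dot also matches newline characters.
dot_default = re.search(r"first.+", text).group()
dot_s = re.search(r"first.+", text, re.DOTALL).group()

# i: case-insensitive matching (like leaving "Match case" unchecked).
case_default = re.findall(r"second", text)
case_i = re.findall(r"second", text, re.IGNORECASE)

# x: free-spacing mode, whitespace and comments in the pattern are ignored.
verbose = re.findall(r"(?x) ^ \w+   # first word of the text", text)

print(starts_default, starts_m)
print(repr(dot_default), repr(dot_s))
print(case_default, case_i)
print(verbose)
```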


                                  So, giving our last example of the test string, below,

                                  =======      MY FIRST SECTION      ===================
                                  =======      MY SECOND SECTION      ===================
                                  
                                  • If you enter the regex =+\s+\K.+[^\s](?= ), in the Regular Expression field and gm after the slash, in the modifiers field, you should have the strings MY FIRST SECTION and ======= MY SECOND SECTION, both, highlighted in blue

                                  • Now, just add the ^ anchor, at the beginning of the regex -> This time, you should have the two strings MY FIRST SECTION and MY SECOND SECTION, both, highlighted in blue !

                                  Cheers,

                                  guy038

                                  • Claudia Frank @guy038

                                    Hi guy038,

                                    thank you very much for the info. I was looking for a Boost.Regex tester but couldn’t find one.
                                    So this site and your explanation make it easier to find the misunderstandings when using
                                    regexes.

                                    Cheers
                                    Claudia

                                    • DaveyD

                                      Hi Loreia,
                                      Just to mention, when I first started, I found a website (similar to the one you mentioned above) that was really helpful, and I liked it mostly because they had a desktop app that included the same features!
                                      You can take a look here: http://regexr.com/
                                      I think the download is somewhere else www.gskinner.com - however, this is from memory
                                      I’m not saying that this is better than what you found, but just another nice option with a desktop app.

                                      I think everything guy038 mentioned above applies here too, except that here, the g modifier is enabled by default

                                      Davey

                                      • Claudia Frank @DaveyD

                                        Loreia ?? Did some UDL lately??
                                        Yeah, https://github.com/gskinner/regexr/
                                        Worth a try

                                        Thanks and cheers
                                        Claudia

                                        • DaveyD

                                          :) Yes, I did! - Sorry about that confusion!

                                          Let me know what you think - I thought it was great and it really helped me in the beginning.

                                          All the best
                                          Davey

                                          The Community of users of the Notepad++ text editor.