    Regex: Match the first three words from every line

    • Vasile Caraus

      Hello, I want to match the first three words from every line of my file, whether or not they have a dash.

      For example:

      Intr-o zi plecarea mea s-a amanat pentru toata viata.
      

      The output match of the first three words should be:

      Intr-o zi plecarea
      

      I tried several regexes, but none of them work:

      ^(\w+){3} or ^(\w+\s+){3} or ^(?=(\w+)){3}

      • guy038

        Hello, @vasile-caraus and All,

        Interesting problem, indeed! In this search, we must consider that the dash - also counts as a word character. So, in this example, a word character is defined as [\w-] and a word is defined as [\w-]+

        Thus, a non-word character, in this specific example, is defined by the regex [^\w-], but we usually use the regex [^\w\r\n-] instead, in order not to match EOL characters, too! Note that the dash should be placed last in the character class (or escaped as \-) because, between two characters inside square brackets, it would be read as a range operator.

        So, in this specific example, a non-empty range of non-word characters is matched with the regex [^\w\r\n-]+
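To make the difference between \w and [\w-] concrete, here is a minimal sketch in Python. It assumes Python's `re` module rather than the Notepad++ regex engine; for these simple character classes the two should behave the same, but that is an assumption of this illustration. The variable names are hypothetical.

```python
import re

line = "Intr-o zi plecarea mea s-a amanat pentru toata viata."

# With plain \w, the dash is not a word character, so "Intr-o" is split in two.
plain = re.findall(r"\w+", line)

# With [\w-], the dash is treated as part of the word.
dashed = re.findall(r"[\w-]+", line)

# The separators between words: anything that is not a word char, a dash or an EOL char.
separators = re.findall(r"[^\w\r\n-]+", line)

print(plain[:3])
print(dashed[:3])
print(separators[:2])
```

With the sample line from the first post, the first three items of `dashed` are the three words the question asks for, whereas `plain` breaks "Intr-o" at the dash.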


        Now, the first three words of each line can be described, in plain language, as:

        ^ Word range + Non-word range + Word range + Non-word range + Word range or, more simply:

        ^ ( Word range + Non-word range ) {2} + Word range

        which gives, when translated to a regex in free-spacing mode:

        SEARCH (?x) ^ ( [\w-]+ [^\w\r\n-]+ ) {2} [\w-]+

        So, the compact form is:

        SEARCH ^([\w-]+[^\w\r\n-]+){2}[\w-]+
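As a quick check outside the editor, the compact regex can be tried in Python. This is only a sketch: it assumes Python's `re` module with the `re.MULTILINE` flag, so that ^ matches at the start of every line as it does in a Notepad++ search, and the function name `first_three_words` is made up for this example.

```python
import re

# The compact regex from above: two "word + separator" groups, then a third word.
FIRST_THREE = re.compile(r"^([\w-]+[^\w\r\n-]+){2}[\w-]+", re.MULTILINE)

def first_three_words(text):
    """Return the matched first three words of every line that has at least three."""
    return [m.group(0) for m in FIRST_THREE.finditer(text)]

sample = "Intr-o zi plecarea mea s-a amanat pentru toata viata.\n"
print(first_three_words(sample))
```

Lines with fewer than three words produce no match at all, because the separator class excludes \r and \n and so the match cannot spill onto the next line.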


        @Vasile-caraus, you did not mention the case of lines containing only one or two words, as, for instance:

        Intr-o zi
        Intr-o
        

        If you also want to match these cases, prefer the following search regex:

        SEARCH ^([\w-]+[^\w\r\n-]+){0,2}[\w-]+
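The effect of the {0,2} quantifier can be sketched in Python as well. As before, this assumes Python's `re` module with `re.MULTILINE`, and the names `UP_TO_THREE` and `leading_words` are hypothetical.

```python
import re

# {0,2} makes the two leading "word + separator" groups optional,
# so lines with only one or two words are matched too.
UP_TO_THREE = re.compile(r"^([\w-]+[^\w\r\n-]+){0,2}[\w-]+", re.MULTILINE)

def leading_words(text):
    """Return up to three leading words of every non-empty line."""
    return [m.group(0) for m in UP_TO_THREE.finditer(text)]

lines = "Intr-o zi plecarea mea s-a amanat\nIntr-o zi\nIntr-o\n"
print(leading_words(lines))
```

Because the quantifier is greedy, a line with three or more words still yields exactly its first three words; shorter lines yield whatever words they have.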

        Best Regards,

        guy038

        • Vasile Caraus

          Thank you, @guy038. There is also another case: spaces at the beginning of the line.

          Intr-o zi plecarea mea s-a amanat pentru toata viata.
          
            Intr-o zi plecarea mea s-a amanat pentru toata viata.
          

          I tried adding \s\S to your regex, but it does not work: ^\s\S([\w-]+[^\w\r\n-]+){2}[\w-]+

          • guy038

            Hi, @vasile-caraus,

            Ah… OK! Note that you could have stated, in your initial post, that blank characters may occur before the first word!

            Moreover, the regexes that you provided in your first post were all anchored to the beginning of the line with ^!


            Now, we still need additional information: do you want to match these leading blank characters as well, along with the three “words”, or not?

            BR

            guy038

            • Vasile Caraus

              @guy038 said in Regex: Match the first three words from every line:

              non-word

              Blank spaces are non-words.

              So the match for those three words must not include any spaces in front of them. I don’t need to match the blank spaces :)

              • guy038

                Hi, @vasile-caraus,

                Then, use the following search regex :

                ^\h*\K([\w-]+[^\w\r\n-]+){2}[\w-]+

                And, if you need to do a replacement, use the Replace All button only, because of the \K syntax!
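For readers who want to reproduce this outside Notepad++, note that Python's `re` module supports neither \K nor \h. The sketch below therefore swaps in a different technique: the leading blanks are captured in one group and the three words in another, and [ \t]* stands in for \h* (an approximation, since \h also covers other horizontal whitespace). The function names and the bracket-wrapping replacement are hypothetical, chosen only for illustration.

```python
import re

# Group 1: leading blanks (approximation of \h*); group 2: the first three words.
# The capturing groups replace \K, which Python's re module does not provide.
INDENTED = re.compile(r"^([ \t]*)((?:[\w-]+[^\w\r\n-]+){2}[\w-]+)", re.MULTILINE)

def first_three_words(text):
    """Return the first three words of each line, without the leading blanks."""
    return [m.group(2) for m in INDENTED.finditer(text)]

def mark_first_three(text):
    """Hypothetical replacement: wrap the three words in brackets, keep the indentation."""
    return INDENTED.sub(r"\1[\2]", text)

sample = "Intr-o zi plecarea mea\n   Intr-o zi plecarea mea\n"
print(first_three_words(sample))
print(mark_first_three(sample))
```

Re-inserting group 1 in the replacement plays the role that \K plays in the Notepad++ regex: the leading blanks are consumed by the search but left untouched by the replacement.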

                Cheers,

                guy038

                The Community of users of the Notepad++ text editor.