Regex: Match the first three words from every line



  • Hello, I want to match the first three words from every line of my file, whether or not they have a dash.

    For example:

    Intr-o zi plecarea mea s-a amanat pentru toata viata.
    

    The output match of the first three words should be:

    Intr-o zi plecarea
    

    I tried some regexes, but none of them work:

    ^(\w+){3} or ^(\w+\s+){3} or ^(?=(\w+)){3}



  • Hello, @vasile-caraus and All,

    Interesting problem, indeed ! In this search, we must consider that the dash - also counts as a word character. So, in this example, a word char is defined as [\w-] and words are defined as [\w-]+

    Thus, a non-word char, in this specific example, is defined by the regex [^\w-]. However, we usually use the regex [^\w\r\n-] instead, so that EOL chars are not matched either ! Note that the dash is placed last in the character class so that it is taken literally, instead of defining a range between two characters (escaping it as \- would also work) !

    So, in this specific example, a non-empty range of non-word characters is matched with the regex [^\w\r\n-]+
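
    To see what treating the dash as a word character changes, here is a minimal sketch in Python. Python's re module is an assumption here (the thread itself is about an editor's search dialog), and the sample text and variable names are only illustrative.

```python
import re

sample = "Intr-o zi s-a"

# With the dash inside the class, hyphenated words stay in one piece.
dash_aware = re.findall(r"[\w-]+", sample)

# With plain \w+, every dash splits a word in two.
plain = re.findall(r"\w+", sample)

print(dash_aware)  # ['Intr-o', 'zi', 's-a']
print(plain)       # ['Intr', 'o', 'zi', 's', 'a']
```

    This is also why an attempt built only on \w cannot treat Intr-o as a single word.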


    Now, the first three words of each line can be expressed, in plain language, as :

    ^ Word range + Non-word range + Word range + Non-word range + Word range or, more simply :

    ^ ( Word range + Non-word range ) {2} + Word range

    which gives, when translated to regex, with the free-spacing mode :

    SEARCH (?x) ^ ( [\w-]+ [^\w\r\n-]+ ) {2} [\w-]+

    So, without the free-spacing mode, the compact form is :

    SEARCH ^([\w-]+[^\w\r\n-]+){2}[\w-]+
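
    For anyone who wants to try this pattern outside the editor, the following is a rough sketch using Python's re module (an assumption, not what the thread uses). The function name first_three_words and the sample text are hypothetical; re.MULTILINE is needed so that ^ anchors at every line start.

```python
import re

# Same pattern as the SEARCH regex above: word = [\w-]+, separator = [^\w\r\n-]+
FIRST_THREE = re.compile(r"^([\w-]+[^\w\r\n-]+){2}[\w-]+", re.MULTILINE)

def first_three_words(text):
    """Return the first three dash-aware words of every line that has at least three."""
    return [m.group(0) for m in FIRST_THREE.finditer(text)]

sample = "Intr-o zi plecarea mea s-a amanat pentru toata viata.\nIntr-o zi\n"
print(first_three_words(sample))  # ['Intr-o zi plecarea']
```

    The second sample line has only two words, so it yields no match: the separator class excludes \r and \n, which prevents the match from running on into the next line.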


    @Vasile-caraus, you did not mention the case of lines containing only one or two words, for instance :

    Intr-o zi
    Intr-o
    

    If you also want to match these cases, prefer the following search regex :

    SEARCH ^([\w-]+[^\w\r\n-]+){0,2}[\w-]+
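
    The effect of the {0,2} quantifier can be checked with a short sketch, again assuming Python's re module rather than the editor; the sample lines and variable names are illustrative only.

```python
import re

# {0,2} makes the "word + separator" group optional, so short lines match too.
UP_TO_THREE = re.compile(r"^([\w-]+[^\w\r\n-]+){0,2}[\w-]+", re.MULTILINE)

lines = "Intr-o zi plecarea mea\nIntr-o zi\nIntr-o\n"
matches = [m.group(0) for m in UP_TO_THREE.finditer(lines)]

print(matches)  # ['Intr-o zi plecarea', 'Intr-o zi', 'Intr-o']
```

    Because the quantifier is greedy, a line with three or more words still gives exactly its first three words.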

    Best Regards,

    guy038



  • Thank you, @guy038. There is also another case:

    Spaces at the beginning of the line.

    Intr-o zi plecarea mea s-a amanat pentru toata viata.
    
      Intr-o zi plecarea mea s-a amanat pentru toata viata.
    

    I tried adding \s\S to your regex, but it does not work: ^\s\S([\w-]+[^\w\r\n-]+){2}[\w-]+



  • Hi, @vasile-caraus,

    Ah… OK ! Note that you could have stated, in your initial post, that blank characters may occur before the first word !

    Moreover, the regexes that you provided, in your first post, were all anchored to the beginning of line with ^ !


    Now, we still need additional information : do you want to match these leading blank chars as well, along with the three “words”, or not ?

    BR

    guy038



  • @guy038 said in Regex: Match the first three words from every line:

    non-word

    Blank spaces are non-word characters.

    So, the match for those 3 words must not include the spaces in front of them. I don’t need to match the leading spaces :)



  • Hi, @vasile-caraus,

    Then, use the following search regex :

    ^\h*\K([\w-]+[^\w\r\n-]+){2}[\w-]+

    And, if you do a replacement, click on the Replace All button only, because of the \K syntax !
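
    As a side note for readers testing this outside the editor: Python's standard re module supports neither \h nor \K, so the sketch below swaps in a different technique. It approximates \h with [ \t] (spaces and tabs only, which is narrower than \h) and uses a capture group in place of \K to leave the leading blanks out of the result. The function name and sample text are hypothetical.

```python
import re

# [ \t]* stands in for \h*, and group 1 plays the role of "everything after \K".
LEADING_BLANKS = re.compile(
    r"^[ \t]*((?:[\w-]+[^\w\r\n-]+){2}[\w-]+)", re.MULTILINE
)

def first_three_words_trimmed(text):
    """Return the first three words of each line, without any leading spaces or tabs."""
    return [m.group(1) for m in LEADING_BLANKS.finditer(text)]

sample = "Intr-o zi plecarea mea\n  Intr-o zi plecarea mea\n"
print(first_three_words_trimmed(sample))
# ['Intr-o zi plecarea', 'Intr-o zi plecarea']
```

    Both the indented and the non-indented line give the same three words, which is the behaviour asked for above.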

    Cheers,

    guy038

