Regex: Match the first three words from every line

6 Posts 2 Posters 1.1k Views
    Vasile Caraus
    posted Mar 6, 2021, 8:04 AM, last edited by Vasile Caraus Mar 6, 2021, 8:06 AM

    Hello, I want to match the first three words from every line of my file, whether or not they have a dash.

    For example:

    Intr-o zi plecarea mea s-a amanat pentru toata viata.
    

    The output match of the first three words should be:

    Intr-o zi plecarea
    

    I tried a few regexes, but none of them work:

    ^(\w+){3} or ^(\w+\s+){3} or ^(?=(\w+)){3}
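A quick check of why the first two attempts miss, sketched in Python's `re` module rather than in Notepad++ (an assumption made for illustration only). `\w` does not include the dash, so `(\w+){3}` can only split "Intr" into three pieces, and `(\w+\s+){3}` fails because "Intr" is followed by a dash rather than whitespace. The third attempt wraps the word in a lookahead, which consumes no characters, so it cannot return the words either.

```python
import re

line = "Intr-o zi plecarea mea s-a amanat pentru toata viata."

# Attempt 1: three runs of word characters with nothing allowed between them.
attempt_1 = re.search(r'^(\w+){3}', line)

# Attempt 2: each word must be followed by whitespace, but "Intr" is followed by "-".
attempt_2 = re.search(r'^(\w+\s+){3}', line)

print(attempt_1.group(0))  # Intr
print(attempt_2)           # None
```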

      guy038
      posted Mar 6, 2021, 10:28 AM, last edited by guy038 Mar 6, 2021, 10:34 AM

      Hello, @vasile-caraus and All,

      Interesting problem, indeed! In this search, we must treat the dash - as a word character too. So, in this example, a word char is defined as [\w-] and a word as [\w-]+

      Thus, a non-word char, in this specific example, is defined by the regex [^\w-], but we usually use [^\w\r\n-] instead, so that EOL chars are not matched either. Note that the dash is placed last in the character class: between two characters it would define a range, so a literal dash must come first or last inside the square brackets, or be escaped.

      So, in this specific example, a non-empty range of non-word characters is matched with the regex [^\w\r\n-]+
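The two character classes can be tried on their own. The following is a minimal sketch in Python's `re` module (not Notepad++ itself); the names `WORD`, `NON_WORD` and `sample` are only illustrative.

```python
import re

# A "word" here is a run of word characters and dashes.
WORD = re.compile(r'[\w-]+')

# A "non-word range" is a run of anything else, except line-ending characters.
NON_WORD = re.compile(r'[^\w\r\n-]+')

sample = "Intr-o zi, plecarea\n"

words = WORD.findall(sample)
gaps = NON_WORD.findall(sample)

print(words)  # ['Intr-o', 'zi', 'plecarea']
print(gaps)   # [' ', ', ']
```

The trailing newline appears in neither list, which is the point of adding \r\n to the negated class.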


      Now, the first three words of each line can be expressed, in plain language, as:

      ^ Word range + Non-word range + Word range + Non-word range + Word range or, more simply :

      ^ ( Word range + Non-word range ) {2} + Word range

      which gives, when translated to regex, with the free-spacing mode :

      SEARCH (?x) ^ ( [\w-]+ [^\w\r\n-]+ ) {2} [\w-]+

      So, in its compact form:

      SEARCH ^([\w-]+[^\w\r\n-]+){2}[\w-]+
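For readers who want to test the expression outside Notepad++, here is a small sketch using Python's `re` module. The `re.MULTILINE` flag is needed so that ^ matches at the start of every line, and the helper name `first_three_words` is hypothetical.

```python
import re

# Same expression as above; MULTILINE makes ^ match at each line start.
FIRST_THREE = re.compile(r'^([\w-]+[^\w\r\n-]+){2}[\w-]+', re.MULTILINE)

def first_three_words(text):
    """Return the first three words of every line that has at least three."""
    return [m.group(0) for m in FIRST_THREE.finditer(text)]

sample = (
    "Intr-o zi plecarea mea s-a amanat pentru toata viata.\n"
    "Intr-o zi\n"
)

print(first_three_words(sample))  # ['Intr-o zi plecarea']
```

The second line has only two words, so it produces no match with this version of the regex.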


      @Vasile-caraus, you did not mention lines that contain only one or two words, such as:

      Intr-o zi
      Intr-o
      

      If you also want to match these cases, use the following search regex instead:

      SEARCH ^([\w-]+[^\w\r\n-]+){0,2}[\w-]+
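The effect of the {0,2} quantifier can be checked with a short Python sketch (again `re` with `re.MULTILINE`, not Notepad++; the helper name `up_to_three_words` is hypothetical). Because the quantifier is greedy, lines with three or more words still yield three words, while shorter lines yield whatever they contain.

```python
import re

# {0,2} allows zero, one or two "word + separator" pairs before the last word.
UP_TO_THREE = re.compile(r'^([\w-]+[^\w\r\n-]+){0,2}[\w-]+', re.MULTILINE)

def up_to_three_words(text):
    """Return up to the first three words of every non-empty line."""
    return [m.group(0) for m in UP_TO_THREE.finditer(text)]

sample = (
    "Intr-o zi plecarea mea\n"
    "Intr-o zi\n"
    "Intr-o\n"
)

print(up_to_three_words(sample))
# ['Intr-o zi plecarea', 'Intr-o zi', 'Intr-o']
```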

      Best Regards,

      guy038

        Vasile Caraus
        posted Mar 6, 2021, 12:21 PM

        Thank you, @guy038. There is one more case to handle:

        Leading spaces at the beginning of the line.

        Intr-o zi plecarea mea s-a amanat pentru toata viata.
        
          Intr-o zi plecarea mea s-a amanat pentru toata viata.
        

        I tried adding \s\S to your regex, but it does not work: ^\s\S([\w-]+[^\w\r\n-]+){2}[\w-]+
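As a side note on why this attempt fails: \s\S stands for exactly one whitespace character followed by exactly one non-whitespace character, not for "optional blanks". The sketch below uses Python's `re` module (an assumption for illustration, not Notepad++), and the variable names are hypothetical.

```python
import re

ATTEMPT = re.compile(r'^\s\S([\w-]+[^\w\r\n-]+){2}[\w-]+')

no_indent = "Intr-o zi plecarea mea"
one_space = " Intr-o zi plecarea mea"
two_spaces = "  Intr-o zi plecarea mea"

# No leading blank: \s has nothing to match.
print(ATTEMPT.search(no_indent))            # None

# Exactly one blank: it matches, but the blank is part of the match
# and \S has swallowed the first letter of the first word.
print(repr(ATTEMPT.search(one_space).group(0)))  # ' Intr-o zi plecarea'

# Two blanks: \S cannot match the second blank.
print(ATTEMPT.search(two_spaces))           # None
```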

          guy038
          posted Mar 6, 2021, 2:54 PM, last edited by guy038 Mar 6, 2021, 4:02 PM

          Hi, @vasile-caraus,

          Ah… OK! Note that you could have mentioned, in your initial post, that blank spaces may occur before the first word!

          Moreover, the regexes that you provided in your first post were all anchored to the beginning of the line with ^!


          Now, we still need additional information: do you want to match these leading blank characters as well, along with the three “words”, or not?

          BR

          guy038

            Vasile Caraus
            posted Mar 6, 2021, 3:08 PM

            @guy038 said in Regex: Match the first three words from every line:

            non-word

            Blank spaces are non-word characters.

            So the match for those three words must not include the spaces in front of them. I don’t need to match the leading spaces :)

              guy038
              posted Mar 6, 2021, 4:07 PM

              Hi, @vasile-caraus,

              Then, use the following search regex :

              ^\h*\K([\w-]+[^\w\r\n-]+){2}[\w-]+

              And, if you do a replacement, use the Replace All button only: because of the \K syntax, the step-by-step Replace button does not work with this regex!
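Outside Notepad++, the same idea may need a different spelling. Python's built-in `re` module, for instance, supports neither \K nor \h, so the sketch below approximates \h with [ \t] (spaces and tabs only, which is narrower than \h) and uses capture groups in place of \K. The group names `indent` and `words` and both helper functions are hypothetical.

```python
import re

# Leading blanks go in one group, the three words in another.
INDENT_THEN_THREE = re.compile(
    r'^(?P<indent>[ \t]*)(?P<words>(?:[\w-]+[^\w\r\n-]+){2}[\w-]+)',
    re.MULTILINE,
)

def first_three_words(text):
    """Return the first three words of each line, without leading blanks."""
    return [m.group('words') for m in INDENT_THEN_THREE.finditer(text)]

def mark_first_three(text):
    """Wrap the first three words of each line in brackets, keeping the indent."""
    return INDENT_THEN_THREE.sub(
        lambda m: m.group('indent') + '[' + m.group('words') + ']', text
    )

sample = (
    "Intr-o zi plecarea mea s-a amanat\n"
    "  Intr-o zi plecarea mea s-a amanat\n"
)

print(first_three_words(sample))
# ['Intr-o zi plecarea', 'Intr-o zi plecarea']
print(mark_first_three(sample))
```

Putting the indent back by hand in the replacement plays the role that \K plays in the Notepad++ regex: the leading blanks are consumed by the match but left unchanged in the result.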

              Cheers,

              guy038

              The Community of users of the Notepad++ text editor.