    Regex: Match the first three words from every line

    • Vasile Caraus

      Hello, I want to match the first three words from every line of my file, whether or not they have a dash.

      For example:

      Intr-o zi plecarea mea s-a amanat pentru toata viata.
      

      The output match of the first three words should be:

      Intr-o zi plecarea
      

      I tried a few regexes, but none of them work:

      ^(\w+){3} or ^(\w+\s+){3} or ^(?=(\w+)){3}
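A quick way to see why the first two attempts fall short is to run them against the sample line. The sketch below uses Python's `re` module as a stand-in for the Notepad++ regex engine (an assumption; the two engines agree on these simple constructs). The third attempt is left out because quantified lookaheads are handled differently across engines. The helper `try_pattern` is a hypothetical name used only for illustration.

```python
import re

line = "Intr-o zi plecarea mea s-a amanat pentru toata viata."


def try_pattern(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


# \w does not include the dash and the pattern allows no separators,
# so the match cannot extend past "Intr".
no_separator = try_pattern(r"^(\w+){3}", line)

# Requires word chars followed by whitespace, but "Intr" is followed by "-".
needs_space = try_pattern(r"^(\w+\s+){3}", line)

print(no_separator)  # Intr
print(needs_space)   # None
```

Both failures come from the same cause: `\w` alone does not treat the dash as part of a word.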

      • guy038

        Hello, @vasile-caraus and All,

        Interesting problem, indeed ! In this search, we must consider that the dash - is also a virtual word character. So, in this example, a word char is defined as [\w-] and words are defined as [\w-]+

        Thus, a non-word char, in this specific example, is defined by the regex [^\w-]. In practice, we use the regex [^\w\r\n-] instead, so that EOL chars are not matched either ! Note that the dash is placed last in the character class so that it is taken literally: between two characters, it would denote a range. Placing it first, or escaping it as \-, works too.

        So, in this specific example, a non-empty range of non-word characters is matched with the regex [^\w\r\n-]+
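The two character classes above can be tried out in isolation. This is a minimal sketch using Python's `re` module rather than Notepad++ itself; the names `WORD_RUN` and `NON_WORD_RUN` are illustrative and do not come from the thread.

```python
import re

line = "Intr-o zi plecarea mea s-a amanat pentru toata viata."

WORD_RUN = r"[\w-]+"           # word chars plus the dash
NON_WORD_RUN = r"[^\w\r\n-]+"  # everything else, except EOL chars

words = re.findall(WORD_RUN, line)
separators = re.findall(NON_WORD_RUN, line)

print(words)       # the nine "words", dashes included
print(separators)  # eight single spaces, then the final "."
```

With the dash inside the class, "Intr-o" and "s-a" each come out as a single word.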


        Now, the first three words of each line can be expressed, in plain language, as :

        ^ Word range + Non-word range + Word range + Non-word range + Word range or, more simply :

        ^ ( Word range + Non-word range ) {2} + Word range

        which gives, when translated to regex, with the free-spacing mode :

        SEARCH (?x) ^ ( [\w-]+ [^\w\r\n-]+ ) {2} [\w-]+

        So the minimal form :

        SEARCH ^([\w-]+[^\w\r\n-]+){2}[\w-]+
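For readers who want to check both forms outside Notepad++, here is a sketch in Python. It assumes `re.VERBOSE` as the counterpart of the `(?x)` free-spacing mode and `re.MULTILINE` so that `^` matches at the start of every line; the sample text is made up from the example in the question.

```python
import re

text = (
    "Intr-o zi plecarea mea s-a amanat pentru toata viata.\n"
    "Intr-o zi\n"
)

# Compact form, as given in the post.
COMPACT = re.compile(r"^([\w-]+[^\w\r\n-]+){2}[\w-]+", re.MULTILINE)

# Free-spacing form: whitespace outside character classes is ignored.
SPACED = re.compile(
    r"""
    ^                           # start of a line
    ( [\w-]+ [^\w\r\n-]+ ){2}   # word range + non-word range, twice
    [\w-]+                      # third word range
    """,
    re.MULTILINE | re.VERBOSE,
)

compact_hits = [m.group(0) for m in COMPACT.finditer(text)]
spaced_hits = [m.group(0) for m in SPACED.finditer(text)]

print(compact_hits)  # ['Intr-o zi plecarea']
print(spaced_hits)   # ['Intr-o zi plecarea']
```

The second line, which has only two words, produces no match with this version of the regex.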


        @Vasile-caraus, you did not mention the case of lines that contain only one or two words, for instance :

        Intr-o zi
        Intr-o
        

        If you also want to match these cases, prefer the following search regex :

        SEARCH ^([\w-]+[^\w\r\n-]+){0,2}[\w-]+
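The effect of the `{0,2}` quantifier can be checked with a short Python sketch (again using `re` with `re.MULTILINE` as a stand-in for the Notepad++ search; the three-line sample is an assumption built from the examples above).

```python
import re

UP_TO_THREE = re.compile(r"^([\w-]+[^\w\r\n-]+){0,2}[\w-]+", re.MULTILINE)

text = "Intr-o zi plecarea mea\nIntr-o zi\nIntr-o\n"

found = [m.group(0) for m in UP_TO_THREE.finditer(text)]

print(found)  # ['Intr-o zi plecarea', 'Intr-o zi', 'Intr-o']
```

Because the quantifier is greedy, lines with three or more words still yield exactly three.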

        Best Regards,

        guy038

        • Vasile Caraus

          Thank you, @guy038. There is also another case: spaces at the beginning of the line.

          Intr-o zi plecarea mea s-a amanat pentru toata viata.
          
            Intr-o zi plecarea mea s-a amanat pentru toata viata.
          

          I tried adding \s\S to your regex, but it is not working: ^\s\S([\w-]+[^\w\r\n-]+){2}[\w-]+

          • guy038

            Hi, @vasile-caraus,

            Ah… OK ! Note that you could have stated, in your initial post, that possible blank spaces may occur before the first word !

            Moreover, the regexes that you provided, in your first post, were all anchored to the beginning of line ^ !


            Now, we still need additional information : do you want to match these leading blank chars as well, along with the three “words”, or not ?

            BR

            guy038

            • Vasile Caraus

              @guy038 said in Regex: Match the first three words from every line:

              non-word

              Empty spaces are non-words.

              So, the match for those 3 words must not include the spaces in front of them. I don’t need to match the leading spaces :)

              • guy038

                Hi, @vasile-caraus,

                Then, use the following search regex :

                ^\h*\K([\w-]+[^\w\r\n-]+){2}[\w-]+

                And, if you run a replacement, use the Replace All button only, because of the \K syntax !
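Python's standard `re` module supports neither `\K` nor `\h`, so the regex above cannot be pasted into it unchanged. A rough equivalent, sketched under two assumptions: `[ \t]*` approximates `\h*` (spaces and tabs only), and an extra capture group plays the role of the text after `\K`.

```python
import re

# Assumption: [ \t]* stands in for \h*; group 1 stands in for "after \K".
LEADING_BLANKS = re.compile(
    r"^[ \t]*(([\w-]+[^\w\r\n-]+){2}[\w-]+)", re.MULTILINE
)

text = (
    "Intr-o zi plecarea mea s-a amanat pentru toata viata.\n"
    "   Intr-o zi plecarea mea s-a amanat pentru toata viata.\n"
)

# group(1) excludes the leading blanks; group(0) would include them.
found = [m.group(1) for m in LEADING_BLANKS.finditer(text)]

print(found)  # ['Intr-o zi plecarea', 'Intr-o zi plecarea']
```

The leading blanks are consumed by the pattern but left out of the captured text, which matches the stated goal of not selecting the spaces in front of the three words.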

                Cheers,

                guy038

                The Community of users of the Notepad++ text editor.