    Regex with unexpected repeat application

    • Richard Darwin

      Hello Notepad++ users:
      Could you please help me with a regex problem?

      First off, here is my Debug data:

      Notepad++ v8.6.2   (32-bit)
      Build time : Jan 14 2024 - 02:18:41
      Path : C:\Program Files (x86)\Notepad++\notepad++.exe
      Command Line : "E:\Linguistica\FrequencyList\lemma-pos.txt" 
      Admin mode : OFF
      Local Conf mode : OFF
      Cloud Config : OFF
      OS Name : Windows 10 Home (64-bit)
      OS Version : 22H2
      OS Build : 19045.4046
      Current ANSI codepage : 1252
      Plugins : 
          mimeTools (3)
          NppConverter (4.5)
          NppExport (0.4)
      

      Here is some sample data.  The first four ‘spaces’ are really TABs, while the ‘Inflections’ part is comma-separated.

      # LEMMA|POS LEMMA POS FREQUENCY INFLECTIONS
      underestimate|v underestimate v 35 underestimate, underestimated, underestimates, underestimating
      unique|j unique j 32 unique, uniquer, uniquest
      various|j various j 32 various
      vein|n vein n 32 vein, veins
      weep|v weep v 32 weep, weeping, weeps, wept
      whiskey n 32 whiskey, whiskeys, whiskies
      witty j 32 witty, wittier, wittiest
      worry|n worry n 32 worry, worries
      memorial|n memorial n 31 memorial, memorials
      

      I want to strip out the redundant first block of text in each line, the one containing ‘|’,  and the TAB after it.

      Here is how it should look:

      underestimate v 35 underestimate, underestimated, underestimates, underestimating
      unique j 32 unique, uniquer, uniquest
      various j 32 various
      vein n 32 vein, veins
      weep v 32 weep, weeping, weeps, wept
      whiskey n 32 whiskey, whiskeys, whiskies
      witty j 32 witty, wittier, wittiest
      worry n 32 worry, worries
      memorial n 31 memorial, memorials
      

      To accomplish this, I have tried using the following:
      Find/Replace expressions and settings

      • Find What = ^([a-z|]+\t)
      • Replace With = (empty)
      • Search Mode = REGULAR EXPRESSION
      • Dot Matches Newline = NOT CHECKED
        (I also have ‘Match case’ and ‘Wrap around’ OFF.)

      This regex does the job when I apply it with ‘Find Next’ and ‘Replace’, but when I apply it using ‘Replace All’, I get this output:

      35 underestimate, underestimated, underestimates, underestimating
      32 unique, uniquer, uniquest
      32 various
      32 vein, veins
      32 weep, weeping, weeps, wept
      32 whiskey, whiskeys, whiskies
      32 witty, wittier, wittiest
      32 worry, worries
      31 memorial, memorials
      

      I tried using ‘*’ instead of ‘+’ in the regex but got exactly the same unwanted result.

      Obviously the regex is being applied more than once per line. I don’t know why this is happening. I thought the regex should only apply once at the start of each line, given ‘^’. Is there a ‘global’ flag somewhere that I inadvertently set? If so, how do I access it?

      Any advice would be appreciated.

      –
      rick.darwin@gmail.com
      –Charles Darwin? He was my grandfather.  Oh, that Charles.  We share a common ancestor.

      • Terry R @Richard Darwin

        @Richard-Darwin said in Regex with unexpected repeat application:

        Obviously the regex is being applied more than once per line. I don’t know why this is happening.

        When I copied your example, there were no tabs, so I had to guess where they belonged; the header line helped. I used your regex without any alteration, but I did not clear “Dot matches newline” and the other options. You will find, as you look at solutions provided by members, that we prefer to use modifiers instead. The reference for these is in the online manual.
        The reason for doing so is that these modifiers override any settings the user might have set and forgotten to change; that way we (as solution providers) can be more certain that the regex we provide will work as expected.
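As a rough illustration outside Notepad++, the same idea can be sketched in Python, where an inline modifier also wins over a flag supplied by the caller. This is only an analogy: Python's `re` module needs the scoped form `(?-s:...)`, whereas the thread uses `(?-s)` at the start of the expression in Notepad++. The sample text is made up for the demonstration.

```python
import re

text = "first line\nsecond line"

# Caller-supplied flag: with DOTALL, '.' also matches the newline,
# so the match runs on into the second line.
with_flag = re.search(r"first.+", text, re.DOTALL).group(0)

# Scoped inline modifier: (?-s:...) switches DOTALL off inside the group,
# whatever flag the caller passed in, so the match stops at the line end.
with_modifier = re.search(r"(?-s:first.+)", text, re.DOTALL).group(0)

print(repr(with_flag))
print(repr(with_modifier))
```

The second search ignores the caller's DOTALL setting, which is the same reason a modifier embedded in the expression is more dependable than a checkbox the user may have left ticked.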

        Now, you say that when you used Find and then Replace repeatedly, it worked. Well, I tried the same and, as expected, it didn’t work. That process and Replace All should work exactly the same (EDIT: see the following post, where a setting elsewhere may have influenced your result). So your problem is that, with the cursor at the start of a line, it finds the string to remove, removes it, and afterwards the cursor is STILL at the start of the same line. The next iteration of Find/Replace will then find yet another occurrence on the same line.
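The effect described above can be modelled with a small Python sketch. It is a simplified model of the behaviour reported in this thread, not Notepad++'s actual implementation; the function names `strip_once` and `strip_repeatedly` are hypothetical, and the sample line is shortened from the original post with the tabs restored by assumption.

```python
import re

# The original expression without the ^ anchor; match() below
# already anchors at the start of the string.
BLOCK = re.compile(r"[a-z|]+\t")

def strip_once(line):
    """Remove a single leading block: the intended behaviour."""
    m = BLOCK.match(line)
    return line[m.end():] if m else line

def strip_repeatedly(line):
    """Model of the reported behaviour: after each removal the search
    position is still at the line start, so the pattern is tried again."""
    while True:
        m = BLOCK.match(line)
        if not m:
            return line
        line = line[m.end():]

sample = "underestimate|v\tunderestimate\tv\t35\tunderestimate, underestimated"
print(repr(strip_once(sample)))
print(repr(strip_repeatedly(sample)))
```

In the repeated version the loop only stops when the line starts with something outside `[a-z|]`, here the digits of the frequency column, which matches the unwanted output shown in the question.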

        So what you need to do is capture at least one further character and write it back in the replacement, so that the cursor isn’t at the start of a line afterwards. My modified regex actually captures the remainder of the line and writes it all back, which places the cursor at the end of the line.

        So my regex is Find What: (?-s)^[a-z|]+\t(.+) and Replace With: ${1}. The (?-s) means the same as clearing the Dot matches newline option.
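For anyone who wants to try the same expression outside Notepad++, here is a minimal Python sketch. It assumes the columns are tab-separated as described in the question, and it adapts the syntax: `(?m)` makes `^` match at every line start, `(?-s)` is dropped because `.` does not match newlines in Python by default, and the replacement is written `\1` instead of `${1}`.

```python
import re

# Two sample rows from the question, with the tabs restored by assumption.
text = (
    "vein|n\tvein\tn\t32\tvein, veins\n"
    "weep|v\tweep\tv\t32\tweep, weeping, weeps, wept\n"
)

# Match the leading block plus its tab, capture the rest of the line,
# and write only the captured part back.
result = re.sub(r"(?m)^[a-z|]+\t(.+)", r"\1", text)

print(result)
```

Because the whole line is consumed by each match, the next search starts at the line break, so only the first block of every line is removed.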

        Terry

        • Coises @Terry R

          @Richard-Darwin said in Regex with unexpected repeat application:

          This regex does the job when I apply it with ‘Find Next’ and ‘Replace’, but when I apply it using ‘Replace All’, I get this output:

          @Terry-R said in Regex with unexpected repeat application:

          Now you say when you used find, then replace repeatedly it worked. Well I tried the same and as expected it didn’t work. That process and the Replace All should work exactly the same.

          Most likely the original poster has Settings | Preferences… | Searching | Replace: Don’t move to the following occurrence unchecked (I believe unchecked is the default), and is literally repeating both Find Next and Replace.

          With the setting I mentioned unchecked, Replace does the next Find automatically, so the original poster is doing a second find after every replace. Replace All, of course, doesn’t do that.

          • Terry R @Coises

            @Coises said in Regex with unexpected repeat application:

            Most likely the original poster has Settings | Preferences… | Searching | Replace: Don’t move to the following occurrence unchecked (I believe unchecked is the default), and is literally repeating both Find Next and Replace.

            I never knew that setting was there and was trying to figure out why he had a different result to me. Thanks, good to know there are still some things to learn about NPP.

            Terry
