Regex with unexpected repeat application

Richard Darwin

Hello Notepad++ users:
Could you please help me with a regex problem?

First off, here is my Debug data:

Notepad++ v8.6.2   (32-bit)
Build time : Jan 14 2024 - 02:18:41
Path : C:\Program Files (x86)\Notepad++\notepad++.exe
Command Line : "E:\Linguistica\FrequencyList\lemma-pos.txt" 
Admin mode : OFF
Local Conf mode : OFF
Cloud Config : OFF
OS Name : Windows 10 Home (64-bit)
OS Version : 22H2
OS Build : 19045.4046
Current ANSI codepage : 1252
Plugins : 
    mimeTools (3)
    NppConverter (4.5)
    NppExport (0.4)

Here is some sample data. The first four ‘spaces’ are really TABs, while the ‘Inflections’ part is comma-separated.

# LEMMA|POS LEMMA POS FREQUENCY INFLECTIONS
underestimate|v underestimate v 35 underestimate, underestimated, underestimates, underestimating
unique|j unique j 32 unique, uniquer, uniquest
various|j various j 32 various
vein|n vein n 32 vein, veins
weep|v weep v 32 weep, weeping, weeps, wept
whiskey n 32 whiskey, whiskeys, whiskies
witty j 32 witty, wittier, wittiest
worry|n worry n 32 worry, worries
memorial|n memorial n 31 memorial, memorials

I want to strip out the redundant first block of text in each line, the one containing ‘|’, and the TAB after it.

Here is how it should look:

underestimate v 35 underestimate, underestimated, underestimates, underestimating
unique j 32 unique, uniquer, uniquest
various j 32 various
vein n 32 vein, veins
weep v 32 weep, weeping, weeps, wept
whiskey n 32 whiskey, whiskeys, whiskies
witty j 32 witty, wittier, wittiest
worry n 32 worry, worries
memorial n 31 memorial, memorials

To accomplish this, I have tried using the following:
Find/Replace expressions and settings

Find What = ^([a-z|]+\t)
Replace With = ``
Search Mode = REGULAR EXPRESSION
Dot Matches Newline = NOT CHECKED
(I also have ‘Match case’ and ‘Wrap around’ OFF.)

This regex does the job when I apply it with ‘Find Next’ and ‘Replace’, but when I apply it using ‘Replace All’, I get this output:

35 underestimate, underestimated, underestimates, underestimating
32 unique, uniquer, uniquest
32 various
32 vein, veins
32 weep, weeping, weeps, wept
32 whiskey, whiskeys, whiskies
32 witty, wittier, wittiest
32 worry, worries
31 memorial, memorials

I tried using ‘*’ instead of ‘+’ in the regex but got exactly the same unwanted result.

Obviously the regex is being applied more than once per line. I don’t know why this is happening. I thought the regex should only apply once at the start of each line, given ‘^’. Is there a ‘global’ flag somewhere that I inadvertently set? If so, how do I access it?

Any advice would be appreciated.

–
rick.darwin@gmail.com
–Charles Darwin? He was my grandfather. Oh, that Charles. We share a common ancestor.

Terry R

@Richard-Darwin said in Regex with unexpected repeat application:

Obviously the regex is being applied more than once per line. I don’t know why this is happening.

When I copied your example, there were no tabs so I had to interpret where I thought they might be and the first line helped. Whilst I used your regex without any alteration I did not clear the “dot matches newline” etc. You will find as you look at solutions provided by members that we prefer instead to use modifiers. Reference for this is in the online manual here.
The reason for doing so is that these modifiers override any settings the user might have set and forgotten to change. That way we (as solution providers) have more certainty that our provided regex will work as expected.

Now you say when you used find, then replace repeatedly it worked. Well I tried the same and as expected it didn’t work. That process and the Replace All should work exactly the same (EDIT: see the following post, where a setting elsewhere may have influenced your result). So your problem is that, with the cursor at the start of a line, it finds the string to remove, removes it, and afterwards the cursor is STILL at the start of the same line. The next iteration of Find/Replace then finds yet another occurrence on the same line, because the next block of letters and its TAB now sit at the line start and match the regex too.

So what you need to do is at least capture 1 further character and replace that (write it back) so the cursor isn’t at the start of a line. My modified regex actually captures the remainder of the line and writes it all back, this places the cursor at the end of a line.

So my regex is Find What: (?-s)^[a-z|]+\t(.+) and Replace With: ${1}. The (?-s) means the same as unchecking ‘Dot matches newline’.
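For anyone who wants to see the mechanism outside Notepad++, here is a rough Python sketch. It is only a model of the behaviour described above, not Notepad++'s actual code: the function name `replace_all_from_cursor` is made up, and the tabs in the sample lines are my guess at where they were. Python's `re` is also not the Boost engine Notepad++ uses, so the `(?-s)` prefix is dropped (Python's `.` already excludes newlines by default) and the replacement is written `\1` rather than `${1}`.

```python
import re

def replace_all_from_cursor(text, pattern, repl):
    """Hypothetical model: after each replacement, the next search resumes at
    the end of the replacement text in the *modified* buffer."""
    regex = re.compile(pattern, re.MULTILINE)
    pos = 0
    while True:
        m = regex.search(text, pos)
        if m is None:
            return text
        if m.start() == m.end():
            # Keeps the sketch simple and guarantees the loop terminates.
            raise ValueError("empty matches are not modeled in this sketch")
        new = m.expand(repl)
        text = text[:m.start()] + new + text[m.end():]
        # The "cursor" sits right after the replacement. With an empty
        # replacement at a line start, that is still the line start.
        pos = m.start() + len(new)

# Two sample lines, with tabs assumed between the first four fields.
SAMPLE = (
    "vein|n\tvein\tn\t32\tvein, veins\n"
    "weep|v\tweep\tv\t32\tweep, weeping, weeps, wept\n"
)

# Original regex: each removal leaves the cursor at the line start,
# so the next letter block on the same line matches again.
stripped_too_much = replace_all_from_cursor(SAMPLE, r"^([a-z|]+\t)", "")

# Modified regex: the rest of the line is captured and written back,
# so the cursor ends up at the end of the line.
fixed = replace_all_from_cursor(SAMPLE, r"^[a-z|]+\t(.+)", r"\1")

print(stripped_too_much)
print(fixed)
```

Under this model the original regex eats every leading letter block up to the number, while the modified one removes only the first block on each line.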

Terry

Coises

@Richard-Darwin said in Regex with unexpected repeat application:

This regex does the job when I apply it with ‘Find Next’ and ‘Replace’, but when I apply it using ‘Replace All’, I get this output:

@Terry-R said in Regex with unexpected repeat application:

Now you say when you used find, then replace repeatedly it worked. Well I tried the same and as expected it didn’t work. That process and the Replace All should work exactly the same.

Most likely the original poster has Settings | Preferences… | Searching | Replace: Don’t move to the following occurrence unchecked (I believe unchecked is the default), and is literally repeating both Find Next and Replace.

With the setting I mentioned unchecked, Replace does the next Find automatically, so the original poster is doing a second find after every replace. Replace All, of course, doesn’t do that.
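To make that sequence concrete, here is a small Python sketch of the dialog as I understand it. Treat it as a simplified guess, not as how Notepad++ is implemented: the class `FindReplaceModel`, its methods, and the `auto_advance` flag (standing in for the preference being unchecked) are all hypothetical, and the tabs in the sample are assumed.

```python
import re

class FindReplaceModel:
    """Hypothetical, simplified model of the Find/Replace dialog state."""

    def __init__(self, text, pattern, repl="", auto_advance=True):
        self.text = text
        self.regex = re.compile(pattern, re.MULTILINE)
        self.repl = repl                  # literal replacement text
        self.auto_advance = auto_advance  # True = preference unchecked
        self.caret = 0
        self.selection = None             # (start, end) of the selected match

    def find_next(self):
        # Search from the end of the current selection, else from the caret.
        start = self.selection[1] if self.selection else self.caret
        m = self.regex.search(self.text, start)
        if m is None:
            return False
        self.selection = (m.start(), m.end())
        return True

    def replace(self):
        if self.selection is None:
            return self.find_next()
        start, end = self.selection
        self.text = self.text[:start] + self.repl + self.text[end:]
        self.caret = start + len(self.repl)
        self.selection = None
        if self.auto_advance:
            self.find_next()  # Replace silently selects the next match
        return True

def alternate_find_and_replace(model):
    """The user clicks Find Next, then Replace, over and over."""
    while model.find_next():
        model.replace()
    return model.text

SAMPLE = (
    "vein|n\tvein\tn\t32\tvein, veins\n"
    "weep|v\tweep\tv\t32\tweep, weeping, weeps, wept\n"
)
PATTERN = r"^([a-z|]+\t)"

# Preference unchecked: the extra Find Next skips past the match that
# Replace had already selected on the same line.
with_auto = alternate_find_and_replace(FindReplaceModel(SAMPLE, PATTERN))

# Preference checked: every Find Next lands on the same line again.
without_auto = alternate_find_and_replace(
    FindReplaceModel(SAMPLE, PATTERN, auto_advance=False)
)

print(with_auto)
print(without_auto)
```

In this model the extra Find Next is what makes the manual sequence look correct: it steps over the second match on each line instead of replacing it.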

Terry R

@Coises said in Regex with unexpected repeat application:

Most likely the original poster has Settings | Preferences… | Searching | Replace: Don’t move to the following occurrence unchecked (I believe unchecked is the default), and is literally repeating both Find Next and Replace.

I never knew that setting was there and was trying to figure out why he had a different result to me. Thanks, good to know there are still some things to learn about NPP.

Terry