Regex with unexpected repeat application
-
Hello Notepad++ users:
Could you please help me with a regex problem? First off, here is my Debug data:
Notepad++ v8.6.2 (32-bit)
Build time : Jan 14 2024 - 02:18:41
Path : C:\Program Files (x86)\Notepad++\notepad++.exe
Command Line : "E:\Linguistica\FrequencyList\lemma-pos.txt"
Admin mode : OFF
Local Conf mode : OFF
Cloud Config : OFF
OS Name : Windows 10 Home (64-bit)
OS Version : 22H2
OS Build : 19045.4046
Current ANSI codepage : 1252
Plugins : mimeTools (3) NppConverter (4.5) NppExport (0.4)
Here is some sample data. The first four ‘spaces’ are really TABs, while the ‘Inflections’ part is comma-separated.
# LEMMA|POS LEMMA POS FREQUENCY INFLECTIONS
underestimate|v underestimate v 35 underestimate, underestimated, underestimates, underestimating
unique|j unique j 32 unique, uniquer, uniquest
various|j various j 32 various
vein|n vein n 32 vein, veins
weep|v weep v 32 weep, weeping, weeps, wept
whiskey n 32 whiskey, whiskeys, whiskies
witty j 32 witty, wittier, wittiest
worry|n worry n 32 worry, worries
memorial|n memorial n 31 memorial, memorials
I want to strip out the redundant first block of text in each line, the one containing ‘|’, and the TAB after it.
Here is how it should look:
underestimate v 35 underestimate, underestimated, underestimates, underestimating
unique j 32 unique, uniquer, uniquest
various j 32 various
vein n 32 vein, veins
weep v 32 weep, weeping, weeps, wept
whiskey n 32 whiskey, whiskeys, whiskies
witty j 32 witty, wittier, wittiest
worry n 32 worry, worries
memorial n 31 memorial, memorials
To accomplish this, I have tried using the following:
Find/Replace expressions and settings:
- Find What = ^([a-z|]+\t)
- Replace With = (empty)
- Search Mode = REGULAR EXPRESSION
- Dot Matches Newline = NOT CHECKED
(I also have ‘Match case’ and ‘Wrap around’ OFF.)
This regex does the job when I apply it with ‘Find Next’ and ‘Replace’, but when I apply it using ‘Replace All’, I get this output:
35 underestimate, underestimated, underestimates, underestimating
32 unique, uniquer, uniquest
32 various
32 vein, veins
32 weep, weeping, weeps, wept
32 whiskey, whiskeys, whiskies
32 witty, wittier, wittiest
32 worry, worries
31 memorial, memorials
I tried using ‘*’ instead of ‘+’ in the regex but got exactly the same unwanted result.
Obviously the regex is being applied more than once per line. I don’t know why this is happening. I thought the regex should only apply once at the start of each line, given ‘^’. Is there a ‘global’ flag somewhere that I inadvertently set? If so, how do I access it?
Any advice would be appreciated.
–
rick.darwin@gmail.com
–Charles Darwin? He was my grandfather. Oh, that Charles. We share a common ancestor.
-
@Richard-Darwin said in Regex with unexpected repeat application:
Obviously the regex is being applied more than once per line. I don’t know why this is happening.
When I copied your example, there were no tabs, so I had to guess where they belonged; the header line helped. I used your regex without any alteration, but I did not clear “Dot matches newline” and the other settings. As you look at solutions provided by members, you will find that we prefer to use modifiers instead. The reference for these is in the online manual.
The reason for doing so is that these modifiers override any settings the user might have set and forgotten to change, so we (as solution providers) can be more certain that our regex will work as expected.
Now, you say that when you used Find and then Replace repeatedly, it worked. Well, I tried the same and, as expected, it didn’t work. That process and Replace All should work exactly the same (EDIT: see the following post, where a setting elsewhere may have influenced your result). So your problem is that, with the cursor at the start of a line, it finds the string to remove, removes it, and afterwards the cursor is STILL at the start of the same line. The next iteration of Find/Replace then finds yet another occurrence on the same line.
So what you need to do is at least capture 1 further character and replace that (write it back) so the cursor isn’t at the start of a line. My modified regex actually captures the remainder of the line and writes it all back, this places the cursor at the end of a line.
So my regex is:
Find What: (?-s)^[a-z|]+\t(.+)
Replace With: ${1}
The (?-s) means the same as clearing the Dot matches newline checkbox.
Terry
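For anyone who wants to see the mechanism outside Notepad++, here is a minimal Python sketch of the behaviour described above. It is only a toy model, not Notepad++'s actual implementation: the function name `replace_all_rescanning` is hypothetical, and it assumes that each new search resumes in the already-modified text at the point where the previous replacement ended. Python's `re` has no global `(?-s)` switch, but the dot already excludes newlines by default, and the replacement is written `\1` rather than `${1}`.

```python
import re

def replace_all_rescanning(text, pattern, repl):
    """Toy model of a Replace All that resumes searching in the modified
    text at the position where the previous replacement ended."""
    rx = re.compile(pattern, re.MULTILINE)
    pos = 0
    while pos <= len(text):
        m = rx.search(text, pos)
        if m is None:
            break
        new = m.expand(repl)
        text = text[:m.start()] + new + text[m.end():]
        # Resume right after the inserted replacement text.
        pos = m.start() + len(new)
        # Guard against looping forever on an empty match with an empty replacement.
        if m.start() == m.end() and not new:
            pos += 1
    return text

sample = (
    "weep|v\tweep\tv\t32\tweep, weeping, weeps, wept\n"
    "vein|n\tvein\tn\t32\tvein, veins\n"
)

# Original regex: the replacement is empty, so the search resumes at the
# start of the same line, where ^ matches again and the next field is eaten.
stripped_too_much = replace_all_rescanning(sample, r"^[a-z|]+\t", "")

# Modified regex: the rest of the line is captured and written back, so the
# search resumes at the end of the line and ^ next matches on the following line.
stripped_once = replace_all_rescanning(sample, r"^[a-z|]+\t(.+)", r"\1")

print(stripped_too_much)
print(stripped_once)
```

In this model the first call leaves only `32` and the inflections on each line, while the second removes just the `lemma|pos` field. Python's own `re.sub` with `re.MULTILINE` scans the original string in a single pass, so it would not reproduce the repeated stripping; that is why the sketch uses an explicit loop.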
-
@Richard-Darwin said in Regex with unexpected repeat application:
This regex does the job when I apply it with ‘Find Next’ and ‘Replace’, but when I apply it using ‘Replace All’, I get this output:
@Terry-R said in Regex with unexpected repeat application:
Now you say when you used find, then replace repeatedly it worked. Well I tried the same and as expected it didn’t work. That process and the Replace All should work exactly the same.
Most likely the original poster has Settings | Preferences… | Searching | Replace: Don’t move to the following occurrence unchecked (I believe unchecked is the default), and is literally repeating both Find Next and Replace.
With the setting I mentioned unchecked, Replace does the next Find automatically, so the original poster is doing a second find after every replace. Replace All, of course, doesn’t do that.
-
@Coises said in Regex with unexpected repeat application:
Most likely the original poster has Settings | Preferences… | Searching | Replace: Don’t move to the following occurrence unchecked (I believe unchecked is the default), and is literally repeating both Find Next and Replace.
I never knew that setting was there and was trying to figure out why he had a different result to me. Thanks, good to know there are still some things to learn about NPP.
Terry