Problem with regex in multi-line search
-
Fellow Notepad++ Users,
Could you please help me with the following search-and-replace problem I am having?
Here is the debug info:
Notepad++ v8.7.5 (32-bit)
Build time : Dec 21 2024 - 05:11:15
Path : I:\Binaries\Notepad++\notepad++.exe
Command Line :
Admin mode : OFF
Local Conf mode : OFF
Cloud Config : OFF
Periodic Backup : OFF
Placeholders : OFF
DirectWrite : ON
Multi-instance Mode : multiInst
File Status Auto-Detection : cdEnabledNew (for current file/tab only)
Dark Mode : OFF
OS Name : Windows 11 Home (64-bit)
OS Version : 24H2
OS Build : 26100.3476
Current ANSI codepage : 1252
Plugins :
mimeTools (3.1)
NppConverter (4.6)
NppExport (0.4)
I am trying to massage some data about English syllable structure. I have parsed an IPA data file into phoneme classes: P represents [p]|[t]|[k], B represents the voiced counterparts, X represents [f]|[θ]|[s]|[ʃ], Ɣ the voiced fricatives, and so on. V obviously represents any vowel.
What I want to do is find adjacent lines that are identical except that the second one ends in an additional [Ɣ], [VƔ] or [X]. The second line will then be a plural version of the upper word, or a third-person verb: the original text might have been ‘limp /ˈɫɪmp/’, ‘limps /ˈɫɪmps/’, which are massaged into ‘ˈLVNP’, ‘ˈLVNPX’. I want to replace that suffix with ‘=Z’ (a generalization of =Zp ‘plural’ and =Z3 ‘3rd.person.singular’).
Here is the data I currently have (“before” data):
BRVNBVˈNVPV
BRVNVˈPVPVV
--snip--
ˌƔVXˈPVƔVPX
ˌƔVƔVˈBVRVPV
ˌƔVƔVˈPVVXVN
ˌƔVƔVˈPVVXVNƔ
ˌƔVƔVˈVNV
ˌƔVƔVˈVNVV
ˌƔVƔWVRVˈƔVVXVN
ˌƔVˈBRVBV
ˌƔVˈBRVX
ˌƔVˈBRVXVX
ˌƔVˈBVPV
ˌƔVˈBVRV
ˌƔVˈBVRVV
ˌƔVˈBVRVVƔ
ˌƔVˈBVX
ˌƔVˈBVXVX
Here is how I would like that data to look (“after” data):
BRVNBVˈNVPV
BRVNVˈPVPVV
--snip--
ˌƔVXˈPVƔVPX
ˌƔVƔVˈBVRVPV
ˌƔVƔVˈPVVXVN
ˌƔVƔVˈPVVXVN=Z
ˌƔVƔVˈVNV
ˌƔVƔVˈVNVV
ˌƔVƔWVRVˈƔVVXVN
ˌƔVˈBRVBV
ˌƔVˈBRVX
ˌƔVˈBRVXVX
ˌƔVˈBVPV
ˌƔVˈBVRV
ˌƔVˈBVRVV
ˌƔVˈBVRVV=Z
ˌƔVˈBVX
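To pin down the intended before/after relationship independently of any regex, here is a rough Python sketch of the rule as described above. It is only an illustration, not something Notepad++ runs: the function name `mark_inflected` and the `SUFFIXES` tuple are made up for this example, and it assumes the data has already been read into a list of lines with no line endings attached.

```python
# Hypothetical helper, not part of Notepad++: a plain line-by-line version
# of the rule "second line = first line + one of the suffixes".
SUFFIXES = ("VƔ", "Ɣ", "X")  # the three suffix shapes named in the post


def mark_inflected(lines):
    """Return a copy of `lines` where a line equal to the previous
    (original) line plus a suffix has that suffix replaced by '=Z'."""
    out = []
    prev = None
    for line in lines:
        marked = line
        # Skip empty previous lines so a bare suffix is never rewritten.
        if prev:
            for suffix in SUFFIXES:
                if line == prev + suffix:
                    marked = prev + "=Z"
                    break
        out.append(marked)
        prev = line  # always compare against the unmodified line
    return out


sample = ["ˌƔVƔVˈPVVXVN", "ˌƔVƔVˈPVVXVNƔ", "ˌƔVˈBRVX", "ˌƔVˈBRVXVX"]
print("\n".join(mark_inflected(sample)))
```

Note that ‘ˌƔVˈBRVXVX’ is left alone, because the extra ‘VX’ is not one of the three suffixes.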
To accomplish this, I have tried using the following Find/Replace expressions and settings:
- Find What =
^(.+)$\n^\1Ɣ$
- Replace With =
\1\n\1=Z
- Search Mode = REGULAR EXPRESSION
- Dot Matches Newline = CHECKED or NOT CHECKED
This regex is supposed to match the entire first line, the ‘\n’, and a second line with the suffix (initially just Ɣ). With ‘Dot matches Newline’ unset, this returns ‘Find: can’t find the text “^(.+)$\n^\1Ɣ$” in entire file’. With the setting checked, it either searches the entire file and selects the first line (blank) and all but the last three characters of the second line, or it says ‘Invalid Regular Expression’ and notes that ‘complexity exceeds predefined bounds’, depending on where I start the search (the file is 646k, 64k lines).
Taking out the carets from the regex did not seem to have any effect; taking out the ‘$’s with Dot unchecked led to ‘can’t find text’, and with Dot checked led to ‘invalid regex’. This did not produce the output I desired, and I’m not sure why. Could you please help me understand what went wrong and help me find the solution?
-
Most likely you are editing a file with Windows line endings, which are \r\n, not just \n.
I would suggest:
- Leave . matches newline unchecked: you don’t want that .+ to cross lines.
- Use \R to represent the break between lines: it matches \r, \n or \r\n.
- You don’t need the $ when it’s immediately followed by a line-ending character, nor the circumflex immediately after (though they don’t hurt, either).
So, try:
^(.+)\R\1Ɣ$
and see if that matches as desired; leave . matches newline unchecked. If you are using Windows line endings, use
\1\r\n\1=Z
to replace. (On the status bar at the bottom, towards the right side, you’ll see either Windows (CR LF), Unix (LF) or Macintosh (CR), which tells you the current line ending setting for your file.)
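If you want to check the line-ending explanation outside Notepad++, here is a minimal Python sketch. Treat it as an approximation under stated assumptions: Python’s `re` module is not the Boost engine Notepad++ uses, it has no \R, and its `.` and `$` treat \r differently, so the line break is spelled out by hand. The names `NAIVE`, `FIXED` and `mark_suffixes` are invented for this example, and the alternation covering all three suffixes goes one step beyond the single-Ɣ expression above.

```python
import re

# Assumption: Python's `re` stands in for Notepad++'s Boost regex engine.
# `re` has no \R, so the line break is written out and captured as group 2.
BREAK = r"(\r\n|\r|\n)"

# Literal-\n version: with CRLF text there is a \r sitting between the
# end of the first line and the \n, so this never matches.
NAIVE = re.compile(r"^([^\r\n]+)\n\1Ɣ", re.MULTILINE)

# Line-ending-agnostic version. The lookahead replaces `$`, because
# Python's `$` does not match before \r. Note that `^` in Python only
# anchors after \n, so CR-only files are not handled by this sketch.
FIXED = re.compile(
    r"^([^\r\n]+)" + BREAK + r"\1(?:VƔ|Ɣ|X)(?=[\r\n]|\Z)",
    re.MULTILINE,
)


def mark_suffixes(text):
    """Replace the suffix on the second line of each matching pair with
    '=Z', reusing whatever line break the file already has."""
    return FIXED.sub(r"\1\2\1=Z", text)


crlf_text = "ˌƔVƔVˈPVVXVN\r\nˌƔVƔVˈPVVXVNƔ\r\nˌƔVˈBRVX\r\nˌƔVˈBRVXVX\r\n"
lf_text = crlf_text.replace("\r\n", "\n")

print(NAIVE.search(crlf_text))        # no match on Windows line endings
print(repr(mark_suffixes(crlf_text)))
```

Reusing the captured line break in the replacement means the same expression works whether the file uses CR LF or LF, so there is no need to look at the status bar first.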