PythonScript regex error, but not as NPP regex

M Andre Z Eckenrode

I use PythonScript, mostly for extended sequences of regular-expression find & replace operations. I typically test my regex code directly in Notepad++ via the built-in Find/Replace dialog before putting it into a script. Both the files I operate on and my scripts are nearly always ANSI/Windows-1252, but I want my scripts to work on Unicode text as well. A new script I’m working on includes this problematic line of code:

editor.rereplace(r'(\)) ([[:alpha:]]+) ([[:alpha:]]+ — )', ur'\1\r\n\t\u\2 \u\3')

That line results in the following error message in PythonScript’s console:

File "C:\Users\MAZE\AppData\Roaming\Notepad++\plugins\Config\PythonScript\scripts\McCartney-Project.py", line 25
    editor.rereplace(r'(\)) ([[:alpha:]]+) ([[:alpha:]]+ x97 )', ur'\1\r\n\t\u\2 \u\3')
SyntaxError: (unicode error) 'rawunicodeescape' codec can't decode bytes in position 8-9: truncated \uXXXX

However, that exact regex works directly in NPP. Offhand, it looks like my em dash (—) is the root of the problem, since it’s replaced by x97 in the error message. But I use an ANSI em dash in several other regex operations earlier in the same script, and PythonScript doesn’t complain about any of them and DOES process them as expected (once I’ve commented out the offending line). Does anybody know why that particular line is a stumbling block?

Notepad++ v8.8.5 (32-bit)
Build time: Aug 14 2025 - 00:17:53
Scintilla/Lexilla included: 5.5.7/5.4.5
Boost Regex included: 1_85
Path: C:\Program Files (x86)\Notepad++\notepad++.exe
Command Line: "C:\Program Files\ArdfryImaging\PNGOUTWin\PNGOUTWin Reg Codes.txt"
Admin mode: OFF
Local Conf mode: OFF
Cloud Config: OFF
Periodic Backup: ON
Placeholders: OFF
Scintilla Rendering Mode: SC_TECHNOLOGY_DIRECTWRITE (1)
Multi-instance Mode: monoInst
asNotepad: OFF
File Status Auto-Detection: cdEnabledNew (for current file/tab only)
Dark Mode: OFF
Display Info:
primary monitor: 1920x1080, scaling 100%
visible monitors count: 1
installed Display Class adapters:
0000: Description - Intel® HD Graphics 620
0000: DriverVersion - 31.0.101.2111
0001: Description - NVIDIA GeForce 940MX
0001: DriverVersion - 30.0.15.1169
OS Name: Windows 10 Enterprise (64-bit)
OS Version: 22H2
OS Build: 19045.6216
Current ANSI codepage: 1252
Plugins:
BetterMultiSelection (1.5)
ColumnsPlusPlus (1.2)
ColumnTools (1.4.5.1)
ComparePlus (1.2)
DSpellCheck (1.5)
ExtSettings (1.3.1)
HTMLTag_unicode (1.5.4)
mimeTools (3.1)
MultiClipboard (2.1)
MultiReplace (4.3.2.28)
NppCalc (1.5)
NppConverter (4.6)
NppExport (0.4)
NPPJSONViewer (2.1.1)
NppTextFX (1.4.1)
NppXmlTreeviewPlugin (2)
PreviewHTML (1.3.3.2)
PythonScript (2.1)
RegexTrainer (1.2)
SessionMgr (1.4.4)

M Andre Z Eckenrode

Actually, I’m now thinking that my use of \u is the problem. I want it to make the next character come out in UPPER CASE, but it looks like Python expects \u to be followed by four hexadecimal digits specifying a Unicode code point.
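
For anyone who wants to see this outside Notepad++, here is a rough sketch that reproduces the same kind of error. It assumes the ur'' literal is decoded with the raw-unicode-escape codec, which is the codec named in the traceback; the helper name decode_like_ur_literal is made up for illustration.

```python
# The replacement text from the failing line, kept as raw bytes so that
# nothing in it has been interpreted yet.
replacement = br'\1\r\n\t\u\2 \u\3'


def decode_like_ur_literal(raw):
    """Decode raw bytes with the raw-unicode-escape codec.

    Hypothetical helper: returns the UnicodeDecodeError if decoding fails,
    or None if the bytes decode cleanly.
    """
    try:
        raw.decode('raw_unicode_escape')
    except UnicodeDecodeError as err:
        return err
    return None


# Offset of the first \u in the replacement text.
first_u = replacement.index(br'\u')

# \u is not followed by four hex digits, so decoding fails.
error = decode_like_ur_literal(replacement)

print(first_u)
print(error)
```

The first \u sits at offset 8 of the replacement string, which lines up with the "position 8-9" in the error message. So the complaint is about the second argument, not about the em dash in the search pattern.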

Ekopalypse

@M-Andre-Z-Eckenrode

Unfortunately, this is still an open issue.

In this specific case you can use something like this:

editor.rereplace(r'(\)) ([[:alpha:]]+) ([[:alpha:]]+ — )', lambda m: f'{m.group(1)}\r\n\t{m.group(2).title()} {m.group(3).title()}')

EDIT: oops, I just realized you are still using Python 2:

editor.rereplace(r'(\)) ([[:alpha:]]+) ([[:alpha:]]+ — )', lambda m: '{}\r\n\t{} {}'.format(m.group(1), m.group(2).title(), m.group(3).title()))
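
One caveat with .title(): it also lowercases the rest of each word, whereas \u in the Find/Replace dialog only uppercases the next character. If that difference matters, a callback along these lines might be closer. This is only a sketch: it uses Python's own re module as a stand-in for editor.rereplace so it can run outside Notepad++, [^\W\d_] is a rough substitute for [[:alpha:]], and the names upper_first and build_replacement, as well as the sample text, are made up for illustration.

```python
import re


def upper_first(text):
    # Uppercase only the first character and leave the rest untouched.
    return text[:1].upper() + text[1:]


def build_replacement(m):
    # Same shape as the lambda above, with upper_first() in place of .title().
    return u'{}\r\n\t{} {}'.format(
        m.group(1), upper_first(m.group(2)), upper_first(m.group(3)))


# Stand-in pattern for Python's re module; the em dash is written as an
# escape so the file encoding does not matter.
PATTERN = re.compile(
    r'(\)) ([^\W\d_]+) ([^\W\d_]+ ' + u'\u2014' + r' )', re.UNICODE)

sample = u'(1970) paul mcCartney \u2014 vocals'
result = PATTERN.sub(build_replacement, sample)
print(repr(result))
```

With the sample above, .title() would turn mcCartney into Mccartney, while upper_first() gives McCartney. Inside PythonScript, build_replacement could presumably be passed as the second argument of editor.rereplace in the same way as the lambda.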

M Andre Z Eckenrode

@Ekopalypse

Thanks much. My bad for even bringing it up, actually, since I already had back in 2021 and was advised about the lambda workaround at that time. Forgot about that.