regex: Match everything up to linebreak but not linebreak
-
Hello. This is the line from my Python code, with the regex I need to change a little bit:
words = re.findall(r'\w+', new_filename)
Basically, this selects the content of the
<title></title>
tag and saves it as an HTML file. For example:
<title>My name is Peter | Prince Justin (en)</title>
must be saved as:
my-name-is-peter.html
(so, without everything after the |).
My regex
\w+
does not stop at the linebreak | and also selects the words after it. I need to change this regex so that it selects only the words before the linebreak. I also tried these two regexes, but they are not good:
\w+.*\|
or
\w+.*?[\s\S]\|
Can anyone help me?
-
@Hellena-Crainicu It looks like you are asking about usage of Python’s regex machinery, and not the regex within Notepad++. Is this correct?
-
I work only with Notepad++; I just run the code in Python.
-
@Hellena-Crainicu But you’re asking about a regex to feed into a call to re.findall(), correct? Or are you asking how to convert lines of text that look like your
<title>..</title>
example that are in a text file loaded in the np++ editor? If it’s the latter, I have a solution, but I’m confused.
-
@Neil-Schipper I am using
\w+
as you can see. But I need to stop selecting at the linebreak |, otherwise I will get
my-name-is-peter-prince-justin.html
instead of
my-name-is-peter.html
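-
For anyone reproducing the problem, here is a minimal sketch of why \w+ on its own does not stop at the |. The sample title is the one from the first post; the variable names are only illustrative. Since | is not a word character, re.findall() simply skips it and keeps collecting the words that follow.

```python
import re

# Sample title from the first post, lowercased as in the original code.
title = "My name is Peter | Prince Justin (en)".lower()

# "|" is not matched by \w, so it acts like any other separator:
# findall() skips it and continues with the words after it.
words = re.findall(r"\w+", title)
print(words)

# Joining every word therefore includes the part after the "|".
print("-".join(words) + ".html")
```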
-
@Hellena-Crainicu I’m not getting the clarity I’m hoping for. Here are two very different things people do on computers:
- running a Python program that processes an input file, and maybe changes it or produces an output file, etc.
- having a file loaded in an editor, and running a search and replace operation on it
Which of these are you trying to do (that requires regex assistance as you described)?
-
It is just about the regex… maybe @guy038 can help me. He is the master of regex.
-
For my own amusement, I solved the problem in the editor.
I broke the problem into:
- consume from start line to first ‘>’
- capture everything up to and excluding (space followed by literal ‘|’) into group 1
- consume everything else up to and including EOL
The search phrase
^.*?>(.+?)(?= \|).*?$
does this. Then replace with
\1.html
Then a separate S&R can convert all spaces to ‘-’.
But I still don’t know what you’re asking for, because you refuse to tell me!
-
Again, for my own amusement (since I’ve never used re.sub() before, only match & split):
>>> import re
>>> t1 = re.sub(r"^.*?>(.+?)(?= \|).*?$", r"\1.html", "<title>My name is Peter | Prince Justin (en)</title>")
>>> t2 = re.sub(r"\s", r"-", t1)
>>> t2
'My-name-is-Peter.html'
-
I must split all html files, not just one. I don’t think I can use the replacement…
new_filename = title.get_text()
new_filename = new_filename.lower()
words = re.findall(r'\w+', new_filename)
new_filename = '-'.join(words)
new_filename = new_filename + '.html'
print(new_filename)
-
I have now tried this regex:
\w+.*(?= \|)
words = re.findall(r"\w+.*(?= \|)", new_filename)
It almost works, but I get:
my name is peter.html
(but without the little dashes)
-
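A short sketch of why the dashes go missing with that pattern, using the lowercased sample title from this thread (variable names are illustrative). The greedy .* makes the whole phrase one single match, so '-'.join() receives a one-element list and has nothing to join.

```python
import re

text = "my name is peter | prince justin (en)"

# \w+.* grabs "my" and then everything up to the last point where
# the lookahead (?= \|) still succeeds, so there is only ONE match.
one_match = re.findall(r"\w+.*(?= \|)", text)
print(one_match)       # ['my name is peter']

# Joining a single-element list inserts no separators at all.
joined = "-".join(one_match) + ".html"
print(joined)          # my name is peter.html
```

To get dashes, the pattern has to return each word as a separate match.
-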
You guys are OFF-TOPIC.
This is not an appropriate place to discuss Python’s regular expression engine.
Please find a more appropriate forum for that and confine discussions here to Notepad++ related topics.
Just because you write Python code in Notepad++ doesn’t make discussion of that code a Notepad++ topic.
-
I found the regex I needed:
\b\w+\b(?=[\w\s]+\|)
and in Python it should be:
words = re.findall(r'\b\w+\b(?=[\w\s]+\|)', new_filename)
Thanks @Neil-Schipper, you gave me a good idea ;)
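-
A hedged sketch for checking the lookahead regex above, next to one possible alternative. The function names slug_with_lookahead and slug_with_split are hypothetical, not from the thread. Because the lookahead [\w\s]+ allows only word characters and whitespace between a word and the |, a title with punctuation before the | loses the words in front of that punctuation. Cutting the string at the first | and then applying plain \w+ avoids this, and also still works when there is no | at all.

```python
import re

def slug_with_lookahead(title: str) -> str:
    # The regex from this thread: keep a word only if nothing but
    # word characters/whitespace separates it from a following "|".
    words = re.findall(r"\b\w+\b(?=[\w\s]+\|)", title.lower())
    return "-".join(words) + ".html"

def slug_with_split(title: str) -> str:
    # Alternative (an assumption, not the poster's method):
    # keep only the text before the first "|", then take the words.
    head = title.split("|", 1)[0]
    words = re.findall(r"\w+", head.lower())
    return "-".join(words) + ".html"

sample = "My name is Peter | Prince Justin (en)"
print(slug_with_lookahead(sample))  # my-name-is-peter.html
print(slug_with_split(sample))      # my-name-is-peter.html

# Punctuation before the "|" is where the two approaches differ.
tricky = "Hello, world | Site"
print(slug_with_lookahead(tricky))  # world.html
print(slug_with_split(tricky))      # hello-world.html
```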