regex: Match everything up to linebreak but not linebreak

13 Posts 3 Posters 2.9k Views
    Hellena Crainicu
    last edited by Oct 14, 2021, 7:06 AM

    hello. This is the line from my Python code, with the regex I must change a little bit:

    words = re.findall(r'\w+', new_filename)

    Basically, this selects the content of the <title></title> tag, which is then used to save the file as .html.

    For example:

    <title>My name is Peter | Prince Justin (en)</title>

    must be saved as:

    my-name-is-peter.html (so, without everything after | )

    My regex \w+ also selects the words after the | separator (what I am calling the linebreak). I need to change this regex so that it selects only the words before the |.

    I also tried these 2 regexes, but they are not good: \w+.*\| or \w+.*?[\s\S]\|

    Can anyone help me?

      Neil Schipper @Hellena Crainicu
      last edited by Oct 14, 2021, 8:58 AM

      @Hellena-Crainicu It looks like you are asking about usage of Python’s regex machinery, and not the regex within Notepad++. Is this correct?

        Hellena Crainicu
        last edited by Oct 14, 2021, 9:03 AM

        I work only with notepad++, just running the code in Python.

          Neil Schipper @Hellena Crainicu
          last edited by Oct 14, 2021, 9:28 AM

          @Hellena-Crainicu But you’re asking about a regex to feed into a call to re.findall(), correct? Or are you asking how to convert lines of text that look like your <title>..</title> example that are in a text file loaded in the np++ editor?

          If it’s the latter, I have a solution but I’m confused.

            Hellena Crainicu
            last edited by Oct 14, 2021, 9:30 AM

            @Neil-Schipper I am using \w+ as you can see. But I need to stop selecting at the | separator, otherwise I will get my-name-is-peter-prince-justin.html instead of my-name-is-peter.html

              Neil Schipper @Hellena Crainicu
              last edited by Oct 14, 2021, 9:36 AM

              @Hellena-Crainicu I’m not getting the clarity I’m hoping for. Here are two very different things people do on computers:

              1. running a python program that processes an input file, and maybe changes it or produces an output file, etc.

              2. having a file loaded in an editor, and running a search and replace operation on it

              Which of these are you trying to do (that requires regex assistance as you described)?

                Hellena Crainicu
                last edited by Oct 14, 2021, 9:39 AM

                It is just about the regex… maybe @guy038 can help me. He is the master of regex.

                  Neil Schipper
                  last edited by Oct 14, 2021, 9:48 AM

                  For my own amusement, I solved the problem in the editor.

                  I broke the problem into:

                  1. consume from start of line to first ‘>’
                  2. capture everything up to and excluding (space followed by literal ‘|’) into group 1
                  3. consume everything else up to and including EOL

                  The search phrase ^.*?>(.+?)(?= \|).*?$ does this. Then replace with \1.html. Then a separate S&R can convert all spaces to ‘-’.

                  But I still don’t know what you’re asking for, because you refuse to tell me!

                    Neil Schipper
                    last edited by Oct 14, 2021, 10:15 AM

                    Again, for my own amusement (since I’ve never used re.sub() before, only match & split):

                    >>> import re
                    >>> t1 = re.sub(r"^.*?>(.+?)(?= \|).*?$", r"\1.html", "<title>My name is Peter | Prince Justin (en)</title>")
                    >>> t2 = re.sub(r"\s", r"-", t1)
                    >>> t2
                    'My-name-is-Peter.html'
                    
                      Hellena Crainicu
                      last edited by Oct 14, 2021, 10:20 AM

                      I must split all html files, not just one. I don’t think I can use the replacement…

                          new_filename = title.get_text() 
                          new_filename = new_filename.lower()
                          words = re.findall(r'\w+', new_filename)
                          new_filename = '-'.join(words)
                          new_filename = new_filename + '.html'
                          print(new_filename)
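Handling many files does not by itself rule out a replacement: it can be wrapped in a function and called once per title. A minimal sketch, assuming the title text has already been extracted as a plain string; the helper name `title_to_filename` and the second sample title are made up for illustration:

```python
import re

def title_to_filename(title_text):
    """Build a dash-separated .html filename from the part of a title before '|'."""
    # Hypothetical helper: drop everything from the first '|' to the end of the string.
    head = re.sub(r"\s*\|.*", "", title_text, flags=re.DOTALL)
    # Lowercase, collect the words, and join them with dashes.
    words = re.findall(r"\w+", head.lower())
    return "-".join(words) + ".html"

titles = [
    "My name is Peter | Prince Justin (en)",
    "Another page title | Some Site (en)",
]
filenames = [title_to_filename(t) for t in titles]
print(filenames)
```

In a loop over HTML files, the same function could be called with the result of `title.get_text()` for each file.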
                      
                        Hellena Crainicu
                        last edited by Oct 14, 2021, 10:29 AM

                        I have now tried this regex: \w+.*(?= \|)

                        words = re.findall(r"\w+.*(?= \|)", new_filename)

                        It almost works, but I get: my name is peter.html (without the dashes)
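The dashes are most likely missing because `\w+.*(?= \|)` matches the whole phrase as one string, so `re.findall()` returns a one-element list and `'-'.join()` has nothing to join. A small sketch of the difference, using the lowercased example title as input:

```python
import re

text = "my name is peter | prince justin (en)"

# One long match: the list has a single element, so join() adds no dashes.
one_match = re.findall(r"\w+.*(?= \|)", text)
print(one_match)
print("-".join(one_match))

# Splitting that single match into words gives join() something to work with.
words = re.findall(r"\w+", one_match[0])
print("-".join(words) + ".html")
```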

                          Alan Kilborn
                          last edited by Oct 14, 2021, 10:56 AM

                          You guys are OFF-TOPIC.
                          This is not an appropriate place to discuss Python’s regular expression engine.
                          Please find a more appropriate forum for that and confine discussions here to Notepad++ related topics.
                          Just because you write Python code in Notepad++ doesn’t make discussion of that code a Notepad++ topic.

                            Hellena Crainicu
                            last edited by Hellena Crainicu Oct 14, 2021, 11:01 AM Oct 14, 2021, 11:00 AM

                            I found the regex I needed: \b\w+\b(?=[\w\s]+\|)

                            and in Python should be:

                            words = re.findall(r'\b\w+\b(?=[\w\s]+\|)', new_filename)

                            Thanks @Neil-Schipper, you gave me a good idea ;)
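For reference, a quick check of that pattern against the example title, next to a non-regex alternative that cuts the string at the `|` first. The variable names `head` and `alt_words` are illustrative, and the lookahead `[\w\s]+\|` assumes that only word characters and whitespace sit between each word and the `|`:

```python
import re

new_filename = "My name is Peter | Prince Justin (en)".lower()

# Each word must be followed by word/space characters and then a literal '|'.
words = re.findall(r"\b\w+\b(?=[\w\s]+\|)", new_filename)
print("-".join(words) + ".html")

# Alternative sketch: keep the text before the first '|', then collect the words.
head = new_filename.split("|", 1)[0]
alt_words = re.findall(r"\w+", head)
print("-".join(alt_words) + ".html")
```

Both forms give the same filename here. They can differ when punctuation such as a comma appears before the `|`: the lookahead version would then skip the words in front of that punctuation, while the split version keeps them.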
