    Change in the way end of line characters act in regex?

    • Daniel BrandonD
      Daniel Brandon
      last edited by Daniel Brandon

      I’ve been running a relatively straightforward regex to pull certain lines from working text files. I mark the lines I want by putting a hash (#) at the start of each.

      I start by organizing the file, deleting all \n characters and then replacing every (line start) with “\n(line start)” (that’s not literal, there are specific characters that start each line that I use as a cue).
      Then I do a complicated search and replace to put hashes in front of certain lines that I need to extract. ^(.*complicated string) => #\1
      Finally, I use ^[^#].+\n to delete all lines that don’t start with a hash mark.
      This has been working just fine for ages (seriously, I’ve been doing this for years). Suddenly, a week or so ago, it’s stopped working predictably-- it’s grabbing extra lines, or skipping ends or something. I tried using ^[^#].+$ instead, and that sort of works (although it leaves a lot of empty lines behind), but even that is somehow not grabbing what I expected. I haven’t changed my scripts at all. Does anyone have any idea if there was an update, or a change in a setting or checkbox or option or something that I might have missed somewhere?
      Feel free to suggest other methods for isolating/extracting lines if you must, but bear in mind that what I’m describing above is about the limit of my experience with regex. Anything involving other tools or more complicated patterns is not likely to be useful and may be summarily ignored.

      ETA: for whatever it’s worth, I just booted up an older text editor that we used to use, and the same regex is still working fine there; that’s feeding my suspicion that there was something Notepad++-specific involved. I’m going to try to check PeterJones’s suggestion of the “. matches newline” tickbox. Next week. After turkey. Thanks for the quick responses, though!

      • Alan KilbornA
        Alan Kilborn @Daniel Brandon
        last edited by Alan Kilborn

        @Daniel-Brandon said:

        Does anyone have any idea if there was an update, or a change in a setting or checkbox or option or something that I might have missed somewhere?

        There have been no recent changes to regex in Notepad++.
        The only options that affect regex are Match case and . matches newline, so start there, I guess. Other options can affect it too (replacing in selection, for example), but from your post I didn’t get the impression you were doing that; and if you were, I think you know what you’re doing.

        it’s stopped working predictably-- it’s grabbing extra lines, or skipping ends or something.

        Probably no one can offer any advice without a very specific example.
        But odds are, if you take the time to put that together for posting, you yourself may spot something you are doing wrong.

        • PeterJonesP
          PeterJones @Daniel Brandon
          last edited by

          @Daniel-Brandon said in Change in the way end of line characters act in regex?:

          Suddenly, a week or so ago, it’s stopped working predictably-- it’s grabbing extra lines, or skipping ends or something.

          Given that, and the fact that you said your regex had \n, I am wondering if your data source changed newline types.

          Windows newlines are \r\n; Linux newlines are \n; and the now-less-common ancient Mac newlines are \r. For the matching portion of a regex, you can use \R, which will match any of those three as a line-ender; but \R doesn’t work in the replacement, so you would have to pick one of the sequences to use for your replacement.

          If you are certain that your files have always been Windows EOL (and if they still are – you can use View > Show Symbol > Show All Characters so that you can see whether lines end with just [LF] or end with [CR][LF]), then it may be that you previously had . matches newline, which would have allowed the .+ in the ^[^#].+\n to match the [CR] (\r) before the [LF] (\n) that you were manually matching. So, if you turned off . matches newline unexpectedly, the . would stop matching the \r character, and the search would never match “one or more non-newline characters followed by exactly the LF character”.
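For anyone who would rather check a file's line endings outside the editor, here is a minimal Python sketch that counts the three sequences described above. It is only an illustration: the function name `count_line_endings` and the sample string are made up for this example, and Python's `re` module is a different engine from the one Notepad++ uses (it has no `\R`, so the alternation is spelled out).

```python
import re
from collections import Counter

# Try CRLF first so a Windows line ending is not counted as a CR plus an LF.
EOL_PATTERN = re.compile(r"\r\n|\r|\n")
EOL_NAMES = {"\r\n": "CRLF", "\n": "LF", "\r": "CR"}


def count_line_endings(text: str) -> Counter:
    """Return how many CRLF, LF, and lone CR line endings appear in text."""
    return Counter(EOL_NAMES[m.group()] for m in EOL_PATTERN.finditer(text))


# Hypothetical sample with mixed line endings.
sample = "one\r\ntwo\nthree\rfour\r\n"
counts = count_line_endings(sample)
print(counts["CRLF"], counts["LF"], counts["CR"])
```

When reading a real file, open it with `open(path, newline="")` so that Python does not translate the line endings before they are counted.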

          • Alan KilbornA
            Alan Kilborn @PeterJones
            last edited by Alan Kilborn

            @PeterJones said:

            and the fact that you said your regex had \n

            I considered that, but since OP has had the solution in place for a reasonably long time, I’d think he’d be aware.
            Probably using Unix line endings in his data from the start, and that means that using \n in the regex is fine.

            • Daniel BrandonD
              Daniel Brandon @PeterJones
              last edited by

              @PeterJones This sounds promising-- I thought I had checked that, but it’s worth checking again, and I didn’t know the trick to show the different line ends.

              • PeterJonesP
                PeterJones @Alan Kilborn
                last edited by

                I considered that, but since OP has had the solution in place for a reasonably long time, I’d think he’d be aware. Probably using Unix line endings in his data from the start.

                But that, in combination with the possibility that . matches newline was previously on and not now on, would have allowed his regex to “work” with Unix newlines even though his data had Windows newlines.

                And the “solution in place for a reasonably long time” doesn’t rule out the possibility I raised: that the data he’s getting has recently changed from Unix newlines to Windows newlines, in which case a regex that worked with the old Unix newlines wouldn’t work with the new Windows newlines. And if he doesn’t regularly show (or check) a particular file’s newlines, he might not have realized there was a change in the input data.

                • Terry RT
                  Terry R @Daniel Brandon
                  last edited by Terry R

                  @Daniel-Brandon said in Change in the way end of line characters act in regex?:

                  it’s grabbing extra lines, or skipping ends or something.

                  My immediate thought is that you might have ticked the “. matches newline” option. Read this from the online manual. Actual option is down about a page.

                  Otherwise, images of your replacement regex overlaying the file being worked on (if you don’t mind showing your real data) might be helpful. Also consider showing all characters, including line endings. That reference in the online manual is here.

                  Terry

                  PS writing on a tablet is so slow that others got there first!

                  • mkupperM
                    mkupper @Daniel Brandon
                    last edited by

                    @Daniel-Brandon

                    | Finally, I use ^[^#].+\n to delete all lines that don’t start with a hash mark.

                    That expression seems to have three possible issues.

                    1. It will match and delete \n\n# Some stuff you wanted to keep as [^#] matches \n in the data.
                    2. It won’t match \n#\n~ in the data though it should be removing a line that only has a # at the beginning.
                    3. While you use \n in your expressions, can you guarantee that there is never a lone \r or a \r\n pair in your data?

                    A safer expression is ^[^#\r\n].*\R.

                    Notepad++ regexp is a little different from some other regexp engines in that, in Notepad++, [^#] will also match end-of-line characters such as \n or \r. Also, the [^#] only matches one character. If you have a two-character CRLF in your data, it will match just the CR (\r), and the next character in the data stream is an LF (\n).
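The first issue in the list above can be reproduced outside the editor. The sketch below uses Python's `re` module, which is a different engine from the one in Notepad++, so treat it as an analogy rather than a statement about Notepad++ itself. Python has no `\R` and its `^` does not treat a lone CR as a line break, so the "safer" pattern is rewritten with hypothetical `EOL` and `BOL` helper fragments; the sample text is also made up.

```python
import re

# Python has no \R, so spell out the three line endings (CRLF first).
EOL = r"(?:\r\n|\r|\n)"
# Line start that also recognises a lone CR as a line break.
BOL = r"(?:\A|(?<=\n)|(?<=\r)(?!\n))"

# The original expression: [^#] is free to match a newline character.
naive = re.compile(r"^[^#].+\n", re.MULTILINE)
# Rough Python rendering of ^[^#\r\n].*\R
safer = re.compile(BOL + r"[^#\r\n][^\r\n]*" + EOL)

text = "drop me\n\n#keep me\nx\n#keep too\n"

# The empty line lets [^#] swallow its \n, so "#keep me" is deleted as well,
# while the one-character line "x" survives because .+ needs a second character.
naive_result = naive.sub("", text)

# Excluding \r and \n from the first character keeps every # line;
# empty lines are left in place.
safer_result = safer.sub("", text)

print(repr(naive_result))
print(repr(safer_result))
```

The safer form also copes with Windows and old Mac line endings, for example `safer.sub("", "drop\r\n#keep\r\n")` leaves only the `#keep` line.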

                    • Daniel BrandonD
                      Daniel Brandon @Daniel Brandon
                      last edited by

                      Welp, when I came back from holiday, I tried again with the next batch of data, making sure that “. matches newline” was unchecked, and everything is working as expected once again. Either that did somehow get checked, or gremlins got into my laptop (again). Thanks again for all the suggestions!

                      • Terry RT
                        Terry R @Daniel Brandon
                        last edited by

                        @Daniel-Brandon said in Change in the way end of line characters act in regex?:

                        Either that did somehow get checked, or gremlins got into my laptop (again)

                        Most of the regulars here use “search modifiers” instead of ticking the “. matches newline” box. That’s because we can’t be sure whether users of our regex will have that box ticked. The search modifier overrides the tick box, so we know exactly what our solution will do.

                        The reference in the online manual is here. (?s) is the same as ticking that box, and (?-s) is the same as leaving it unticked.
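To see why that setting matters for the expression in this thread, here is a small Python sketch. Python's `re` is not the Notepad++ engine: it accepts `(?s)` at the start of a pattern, but it has no global `(?-s)`, so the scoped form `(?-s:...)` is used instead to pin the behaviour. The sample text is invented for illustration.

```python
import re

text = "x1\n#keep\ny2\n"

# Dot does not match newline (Python's default): only the non-# lines go.
plain = re.compile(r"^[^#].+\n", re.MULTILINE)
plain_result = plain.sub("", text)

# (?s) lets dot match newline: .+ runs to the end of the text and backs up
# to the last \n, so one match swallows every line, # lines included.
dotall = re.compile(r"(?s)^[^#].+\n", re.MULTILINE)
dotall_result = dotall.sub("", text)

# Scoped (?-s:...) switches dot-matches-newline off inside the group,
# even though the DOTALL flag is passed from outside.
pinned = re.compile(r"(?-s:^[^#].+\n)", re.MULTILINE | re.DOTALL)
pinned_result = pinned.sub("", text)

print(repr(plain_result), repr(dotall_result), repr(pinned_result))
```

Putting the modifier in the expression itself means the result no longer depends on how a checkbox (or, here, an outside flag) happens to be set.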

                        Terry

                        • Daniel BrandonD
                          Daniel Brandon @Daniel Brandon
                          last edited by

                          @Daniel-Brandon It also appears that some (but not all!) of the files are indeed reaching me with \r Carriage Return characters instead of \n Line feed characters baked in, which should not be happening, and never happened before, but there you go. Thanks to everyone who suggested looking for them. Between that and matches newline, I’m really hoping I’ve got it sorted now.
