Community
    • Login

    Reg Ex to find lines with linefeeds in csv

    Scheduled Pinned Locked Moved Help wanted · · · – – – · · ·
    pipe-delimitedcsvregexfind-replace
    13 Posts 3 Posters 3.1k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • Mark YorkovichM
      Mark Yorkovich
      last edited by

      I’m working with a pipe-delimited csv. The column format is 10 columns with 9 pipes in a row and then a line feed - NOT CRLF, just LF. But there are LOTS of records that have line feeds within one or more columns in a single row. I need to replace the “interior” line feeds with a dummy/replacement string (#$%*).

      Some examples of valid rows (“[LF]” indicating the end of line Line Feed, of course)::

      • lorem|1|2|1|1|1|0|||9887544[LF]
      • dolor|6|6|1|1|0|0|||123456[LF]

      Examples of invalid records - they have a single column broken across multiple lines due to interior line feeds:

      dolor|0|3|2|0|0|0||Lorem ipsum dolor sit amet[LF]
      consectetur adipiscing elit[LF]
      sed do eiusmod tempor incididunt[LF]
      ut labore et dolore magna aliqua|[LF]

      lorem|9|1|1|0|0|0||Lorem ipsum dolor sit amet, consectetur adipiscing elit,[LF]
      sed do eiusmod tempor incididunt[LF]
      ut labore et dolore magna aliqua[LF]
      Ut enim ad minim veniam,[LF]
      quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.[LF]
      |[LF]

      I need a regex that will select lines 1 through 3 in the first example of an invalid record and lines 1 through 5 in the second example. There can be any number of Line Feeds within a given column in a single record.

      I have a regex that finds blocks of lines that have NO pipes: ^[^|]*$. That will select lines 2 and 3 in the first example and lines 2 through 4 in the second example, but I can’t figure out how to expand on that to include the first line in each record.

      Alan KilbornA 1 Reply Last reply Reply Quote 1
      • Alan KilbornA
        Alan Kilborn @Mark Yorkovich
        last edited by Alan Kilborn

        @Mark-Yorkovich

        I might be tempted to take a multistep approach on this one.

        Replace |[LF] with a [CR] -----> actual FInd: \|\n, Replace: \r

        Replace [LF] with a space (maybe, not sure what you want here) -----> actual Find: \n, Replace: \x20

        Replace [CR] with |[LF] -----> actual Find: \r, Replace: |\n

        Note that the [CR] is just used as a temporary real-end-of-line placeholder.

        I’m sure other solutions are quite possible.

        1 Reply Last reply Reply Quote 2
        • Alan KilbornA
          Alan Kilborn
          last edited by

          @Mark-Yorkovich said:

          I need to replace the “interior” line feeds with a dummy/replacement string (#$%*).

          I missed that part upon first read. Seems easy enough to extrapolate my solution to that, though. ;)

          1 Reply Last reply Reply Quote 0
          • Mark YorkovichM
            Mark Yorkovich
            last edited by Mark Yorkovich

            Thanks for the reply, Alan. A “multistep approach” isn’t ideal in this situation. The file is over 100k lines - something like 70k records and I’d like to have a one-step process where I enter the regex in the Find What box, set my replacement string and click Replace All. I’ve been manually going through the file, fixing the offending records, but that way will take an incredible amount of time.

            I see what you’re intending to do, which made me realize that there can be characters in the last column - after the last pipe and before the LF. I failed to mention that in my post.

            Alan KilbornA 1 Reply Last reply Reply Quote 0
            • Alan KilbornA
              Alan Kilborn @Mark Yorkovich
              last edited by Alan Kilborn

              @Mark-Yorkovich

              A “multistep approach” isn’t ideal in this situation

              It seems like my 3 steps could all be of the Replace All variety, thus making it only 3 actions for a file of arbitrary length. Not 3 steps you would have to repeat n times.

              there can be characters in the last column - after the last pipe and before the LF. I failed to mention that in my post.

              Actually, I think you did mention it, indirectly, by showing these lines, which have that attribute:

              lorem|1|2|1|1|1|0|||9887544[LF]
              dolor|6|6|1|1|0|0|||123456[LF]

              1 Reply Last reply Reply Quote 1
              • Mark YorkovichM
                Mark Yorkovich
                last edited by

                Okay - I’ll pull a bunch of lines out of the file into a test file and give it a whirl.

                Alan KilbornA 1 Reply Last reply Reply Quote 0
                • Alan KilbornA
                  Alan Kilborn @Mark Yorkovich
                  last edited by

                  @Mark-Yorkovich

                  I still think the best approach is to temporarily change the [LF] you want to keep into something else, then go back and change any [LF] remaining (the bogus ones) into what you want them to be. And finally revert the substituted line-ends back into real line-ends.

                  Note that I don’t think there is a good way to detect a bogus [LF] if it occurs in the final field (after the 9th pipe), so I hope that doesn’t occur for you.

                  This is rather a tough problem because the pattern can get destroyed quickly.

                  1 Reply Last reply Reply Quote 0
                  • Mark YorkovichM
                    Mark Yorkovich
                    last edited by

                    So, here’s a good example of what I’m working with. The multi-line records are the lines with asterisks by the line number. This forum text makes it kinda hard to read like this, but hopefully you’ll get the idea.

                    1 lorem62|8|5|1|1|0|0|||EQUI033[LF]
                    2 lorem62|8|6|1|1|0|0|||EQUI036[LF]
                    *3 lorem62|9|1|1|0|0|0||Lorem ipsum dolor sit amet, consectetur adipiscing elit[LF]
                    *4 quis nostrud exercitation ullamco[LF]
                    *5 laboris nisi ut aliquip ex ea commodo consequat.[LF]
                    *6 consectetur adipiscing elit|[LF]
                    7 lorem-44|1|1|1|1|0|0|||3003024[LF]
                    8 lorem-44|1|2|1|1|0|0|||3001476[LF]
                    9 lorem-44|1|3|1|1|0|0|||2001415[LF]
                    10 lorem47|8|4|1|1|0|0|||EQUI032[LF]
                    11 lorem47|8|5|1|1|0|0|||EQUI033[LF]
                    12 lorem47|8|6|1|1|0|0|||EQUI036[LF]
                    13 lorem47|9|1|1|0|0|0||Lorem ipsum dolor sit amet, consectetur adipiscing elit|[LF]
                    14 lorem62|1|1|1|1|0|0|||3003024[LF]
                    15 lorem62|1|2|1|1|0|0|||3001476[LF]
                    16 lorem-44|8|6|1|1|0|0|||EQUI036[LF]
                    *17 lorem-44|9|1|1|0|0|0||Lorem ipsum dolor sit amet, consectetur adipiscing elit[LF]
                    *18 quis nostrud exercitation ullamco[LF]
                    *19 laboris nisi ut aliquip ex ea commodo consequat.[LF]
                    *20 consectetur adipiscing elit[LF]
                    *21 |[LF]
                    22 lorem20|1|1|1|1|0|0|||3003024[LF]
                    23 lorem20|1|2|1|1|0|0|||3001476[LF]
                    24 lorem20|1|3|1|2|0|0|||3001894[LF]
                    25 lorem20|1|4|1|1|0|0|||2001443[LF]
                    26 lorem20|1|5|1|1|0|0|||2001439[LF]
                    27 lorem20|1|6|1|1|0|0|||3002274[LF]
                    28 lorem10|5|1|1|1|1|0|||MED0015[LF]
                    *29 lorem20|13|1|1|0|0|0||Lorem ipsum dolor sit amet, consectetur adipiscing elit[LF]
                    *30 quis nostrud exercitation ullamco[LF]
                    *31 laboris nisi ut aliquip ex ea commodo consequat.[LF]
                    *32 consectetur adipiscing elit|[LF]
                    33 lorem20|3|1|1|2|0|0|||3002407[LF]
                    34 lorem16|11|0|1|0|0|0|||[LF]

                    So, according to the steps, I should first replace pipe nand line feed |[LF] with [CR] (Find what: \n and Replace with: \r)
                    I did that and lines 6, 13, 21, 32 and 34 now end with [CR]. Right away, I see this won’t work, as we’re effectively removing the last column from those lines. Remember, the rows should be 10 columns, with 9 pipes separating the columns.

                    1 Reply Last reply Reply Quote 0
                    • Mark YorkovichM
                      Mark Yorkovich
                      last edited by

                      Also, step 2 replaces the rest of the [LF] with a dummy string, resulting in 5 long lines of text, with records/lines merged.

                      1 Reply Last reply Reply Quote 0
                      • Alan KilbornA
                        Alan Kilborn
                        last edited by

                        The earlier steps were made with some bogus assumptions due to not enough data.

                        And/or just some plain old mistakes. :)

                        So taking the new, lengthier data, which in reality is (I think) this, in a Linux formatted N++ tab:

                        lorem62|8|5|1|1|0|0|||EQUI033
                        lorem62|8|6|1|1|0|0|||EQUI036
                        lorem62|9|1|1|0|0|0||Lorem ipsum dolor sit amet, consectetur adipiscing elit
                        quis nostrud exercitation ullamco
                        laboris nisi ut aliquip ex ea commodo consequat.
                        consectetur adipiscing elit|
                        lorem-44|1|1|1|1|0|0|||3003024
                        lorem-44|1|2|1|1|0|0|||3001476
                        lorem-44|1|3|1|1|0|0|||2001415
                        lorem47|8|4|1|1|0|0|||EQUI032
                        lorem47|8|5|1|1|0|0|||EQUI033
                        lorem47|8|6|1|1|0|0|||EQUI036
                        lorem47|9|1|1|0|0|0||Lorem ipsum dolor sit amet, consectetur adipiscing elit|
                        lorem62|1|1|1|1|0|0|||3003024
                        lorem62|1|2|1|1|0|0|||3001476
                        lorem-44|8|6|1|1|0|0|||EQUI036
                        lorem-44|9|1|1|0|0|0||Lorem ipsum dolor sit amet, consectetur adipiscing elit
                        quis nostrud exercitation ullamco
                        laboris nisi ut aliquip ex ea commodo consequat.
                        consectetur adipiscing elit
                        |
                        lorem20|1|1|1|1|0|0|||3003024
                        lorem20|1|2|1|1|0|0|||3001476
                        lorem20|1|3|1|2|0|0|||3001894
                        lorem20|1|4|1|1|0|0|||2001443
                        lorem20|1|5|1|1|0|0|||2001439
                        lorem20|1|6|1|1|0|0|||3002274
                        lorem10|5|1|1|1|1|0|||MED0015
                        lorem20|13|1|1|0|0|0||Lorem ipsum dolor sit amet, consectetur adipiscing elit
                        quis nostrud exercitation ullamco
                        laboris nisi ut aliquip ex ea commodo consequat.
                        consectetur adipiscing elit|
                        lorem20|3|1|1|2|0|0|||3002407
                        lorem16|11|0|1|0|0|0|||
                        

                        The following 3 replace-all transforms seem to do the job:

                        F: (?-s)^([^|]*?\|){9}.*\K\n
                        R: \r

                        F: \n
                        R: LINEFEED <— your choice of text substitution here

                        F: \r
                        R: \n

                        After which one obtains:

                        lorem62|8|5|1|1|0|0|||EQUI033
                        lorem62|8|6|1|1|0|0|||EQUI036
                        lorem62|9|1|1|0|0|0||Lorem ipsum dolor sit amet, consectetur adipiscing elitLINEFEEDquis nostrud exercitation ullamcoLINEFEEDlaboris nisi ut aliquip ex ea commodo consequat.LINEFEEDconsectetur adipiscing elit|
                        lorem-44|1|1|1|1|0|0|||3003024
                        lorem-44|1|2|1|1|0|0|||3001476
                        lorem-44|1|3|1|1|0|0|||2001415
                        lorem47|8|4|1|1|0|0|||EQUI032
                        lorem47|8|5|1|1|0|0|||EQUI033
                        lorem47|8|6|1|1|0|0|||EQUI036
                        lorem47|9|1|1|0|0|0||Lorem ipsum dolor sit amet, consectetur adipiscing elit|
                        lorem62|1|1|1|1|0|0|||3003024
                        lorem62|1|2|1|1|0|0|||3001476
                        lorem-44|8|6|1|1|0|0|||EQUI036
                        lorem-44|9|1|1|0|0|0||Lorem ipsum dolor sit amet, consectetur adipiscing elitLINEFEEDquis nostrud exercitation ullamcoLINEFEEDlaboris nisi ut aliquip ex ea commodo consequat.LINEFEEDconsectetur adipiscing elitLINEFEED|
                        lorem20|1|1|1|1|0|0|||3003024
                        lorem20|1|2|1|1|0|0|||3001476
                        lorem20|1|3|1|2|0|0|||3001894
                        lorem20|1|4|1|1|0|0|||2001443
                        lorem20|1|5|1|1|0|0|||2001439
                        lorem20|1|6|1|1|0|0|||3002274
                        lorem10|5|1|1|1|1|0|||MED0015
                        lorem20|13|1|1|0|0|0||Lorem ipsum dolor sit amet, consectetur adipiscing elitLINEFEEDquis nostrud exercitation ullamcoLINEFEEDlaboris nisi ut aliquip ex ea commodo consequat.LINEFEEDconsectetur adipiscing elit|
                        lorem20|3|1|1|2|0|0|||3002407
                        lorem16|11|0|1|0|0|0|||
                        
                        1 Reply Last reply Reply Quote 3
                        • Mark YorkovichM
                          Mark Yorkovich
                          last edited by

                          IT WORKED! Actually I stopped at step 2, as I just need to get it to a point where each record is on a single line - the EOL character(s) don’t matter - it just needs to be able to be opened in Excel.

                          Dude, do you realize how much time you just saved me?! :)

                          Thanks!

                          Alan KilbornA 1 Reply Last reply Reply Quote 2
                          • Alan KilbornA
                            Alan Kilborn @Mark Yorkovich
                            last edited by

                            @Mark-Yorkovich said:

                            Thanks!

                            No problem. It was an interesting thing to think about. I like some minor challenges on a daily basis to keep the aging little grey cells (hopefully) fine-tuned…

                            Key to your problem was grabbing enough text such that the real line-ending [LF] could be isolated. Once that’s done, it becomes rather trivial…

                            1 Reply Last reply Reply Quote 2
                            • guy038G
                              guy038
                              last edited by

                              Hello, @mark-yorkovich, @alan-kilborn and All,

                              I was away from the N++ community since more than a month ( see my post below )

                              https://community.notepad-plus-plus.org/post/47895

                              So, Mark, my reply is quite out of date. However, just for info, here is a one-go regex, which does what you want :

                              SEARCH (?-s)([^|\n]*\|){9}.*\n|\n(?!([^|\n]*\|){9}.*\n)

                              REPLACE (?1$0:LINEFEED)

                              It’s the equivalent regex S/R to the Alan’s one. As you just want to replace with a single dummy char, use, for instance, the following Replace zone :

                              REPLACE (?1$0:#)

                              Notes :

                              • First, the part (?-s) forces the regex engine to consider any dot ( . ) as a single standard character ( Not an EOF one ! )

                              • Then, the search regex processes one of the two alternatives :

                                • (?-s)([^|\n]*\|){9}.*\n which looks for entire correct lines, containing 9 pipe characters. In replacement, as group 1 exists, the overall match ( $0 ) is just re-written

                                • \n(?!([^|\n]*\|){9}.*\n) which simply searches a line-feed char, only if NOT followed with a correct record. In replacement, as group 1 does not exist, the part after the colon ( LINEFEED or # ) replaces the \n character

                              Best Regards,

                              guy038

                              1 Reply Last reply Reply Quote 2
                              • First post
                                Last post
                              The Community of users of the Notepad++ text editor.
                              Powered by NodeBB | Contributors