How to replace CR/LF within fields in a CSV file
-
This post is deleted! -
This is a classic use-case for Generic Regex: Replacing in a Specific Zone of Text, which can be found in FAQ Desk: Generic Regular Expression formulas in our FAQ section
In your case, you want to search between comma-quote and quote (I used comma-quote as the start tag because that makes the regex less confusing rather than trying some other way to disambiguate a start-quote from end-quote), so the “parameters” of the formula are:
- FR =
\r\n
- looking for CRLF
- RR =
,
- replace with comma
- BSR =
,"
- but only in regions that start with comma-quote
- ESR =
"
- … and end with the next quote mark
Put it all together in that formula to get
- FIND WHAT =
(?-si:,"|(?!\A)\G)(?s-i:(?!").)*?\K(?-si:\r\n)
- REPLACE WITH =
,
- REPLACE ALL (as explained in that sub-FAQ, single REPLACE won’t work because of the
\K
)
----
Useful References
- Please Read Before Posting
- Template for Search/Replace Questions
- Formatting Forum Posts
- Notepad++ Online User Manual: Searching/Regex
- FAQ: Where to find other regular expressions (regex) documentation
–
edits: added descriptions of each piece, and the references list - FR =
-
@PeterJones said in How to replace CR/LF within fields in a CSV file:
(?-si:,“|(?!\A)\G)(?s-i:(?!”).)*?\K(?-si:\r\n)
Works like a charm, Peter.
Thanks so much for your help!
-
@PeterJones said in How to replace CR/LF within fields in a CSV file:
I used comma-quote as the start tag…
BSR =,"
What struck me about this was, what if the next person coming along with a similar problem has the issue in his CSV’s column “1” string data?
In that case the solution would have to be revised – because there would be no
,
starting off each line; the first string field, being the first, would simply start with"
.So let’s revise it right now; change the BSR to:
BSR =
(?:(?:^|,)")
Then, some data that looks like this will also process:
"1 one","Smith","John","Loves to skateboard","12/3/98" "2 two","Johnson","Mike","Eats worms","2/14/75" "3 three little pigs","Howard","Sally","Walks along the beach Likes bike rides Loves to cook","9/7/67" "4 four","Doe","Jane","Restores classic cars","7/4/74"
into:
"1,one","Smith","John","Loves to skateboard","12/3/98" "2,two","Johnson","Mike","Eats worms","2/14/75" "3,three,little,pigs","Howard","Sally","Walks along the beach,Likes bike rides,Loves to cook","9/7/67" "4,four","Doe","Jane","Restores classic cars","7/4/74"
-
@Alan-Kilborn ,
Actually, it could be worse than that, since CSV doesn’t require the use of quotes, except around strings, if the first or any field is a numeric with no quotes, it needs to test for a numeric prior to a,
.
I wrote a parsing program in dBASE 1-2 years ago, that would take a file that was over a million records that were CSV, and wouldn’t you know, the numeric fields weren’t quoted, so every field had to be tested first, to see if it was numeric, and if not, process it accordingly.Just an FYI.
-
@Lycan-Thrope said in How to replace CR/LF within fields in a CSV file:
Actually, it could be worse than that, since CSV doesn’t require the use of quotes, except around strings
Any field that has an embedded newline in it is by definition a string, and requires quotes (because of the newline). Hence, for purposes of changing an embedded newline to something else, what he has shown is sufficient.
And, despite what you said, strings are allowed to be non-quoted, as long as the don’t contain the separator character or newline sequences
-
@PeterJones ,
True, but I noticed he was quoting numerics, and figured that might change the search aspect. Ignore if I misread that regex application. I’m still not comfortable playing with the \A and \G, so wasn’t reading what yours was doing, but I saw his using quoted numerics so maybe I misread his regex for it. Still not sure what I see, but trying. :( It was, however an example where I used a program instead of a regex to parse out the lines. ::shrug:: Of course, I was pulling out the fields and assigning them to another tables structure, so that may be the difference.
Feel free to disregard my idiocy. :)Edit: Just went over my notes for that program, and it was a 1.7 million record file, and I was trying to convert it to a CSV format dBASE could use for importing because we needed to count characters to find the largest size of each field, so we could create the structure, and of course, fields were missing and the like. I hadn’t even started messing with regex back then, it’s 3 years now, so it wasn’t a tool in my box, back then. It might have made the job easier, in hindsight. :)
-
@Lycan-Thrope said in How to replace CR/LF within fields in a CSV file:
he was quoting numerics
My intent was not to quote numeric data. I was “making up” data based on the OP’s original data, in order to show a string as the first CSV “column”, so that the overall idea I was presenting abot the regex used would work. Sorry if there was some confusion about it. This might have been better sample data:
"one ONE","Smith","John","Loves to skateboard","12/3/98" "two TWO","Johnson","Mike","Eats worms","2/14/75"
and so on.
-
@Alan-Kilborn ,
Talk about an epiphany moment. Most applications were choking on that big .csv file, which is why I was trying to write that program to fix the file without having to open and display it so we could convert the problem aspects and store it in a database.I just did it with the current NPP, and it took a few seconds, but it opened it up completely, all 1.6m records in NPP. I could have used this capability 3 years ago. Maybe I could have had NPP find the field sizes so I could have known then how big to make the fields to hold the data and then import it. :)
-
Hello, @mark-yorkovich, @peterjones, @alan-kilborn, @lycan-thrope and All,
Of course, the generic regex satisfies the @mark-yorkovich’s goal ! But if we try to think more deeply about the problem, the solution is almost obvious !
Mark said :
I have a CSV file with comma-separated fields, and double quotes for each field, except the first field which is an ID field.
Thus, as a valid field is always surrounded by double-quotes, this means that the line-breaks which need to be replaced with a literal
,
must be :- Not immediately preceded with a double-quote character
AND
- Not immediately followed with a double-quote character
Hence, the following regex S/R :
SEARCH
(?<!")\r\n(?!")
REPLACE
,
-
Untick all box options of the Replace dialog
-
Select the
Regex expression
search mode -
Move to very beginning of your
CSV
file -
Click once on the
Replace All
button ( or several times on theReplace
button )
Best Regards,
guy038
P.S. :
-
As explained in a recent post, do not use the simplified search syntax
(?<!")\R(?!")
! -
Of course, this method does not work if fields, near the line break, may not be followed / preceded with a
"
character !