Finding and Removing LF within a comma delimited comments field
-
Hi, I have a comma delimited file with double quotes as a text qualifier. There are five columns. The last is a “comments” field where some users entered line feeds. That’s causing fragmented lines in my output file. I can’t fix the source, so I need to fix the output. A line of text basically goes:
“John”,“Smith”,“Chicago”,“Illinois”,“Likes to bicycle and [LF]
swim.”[LF]
I tried to Find using ([^"])$\n as a Regular Expression and replace with SPACE, but for some reason that selects the entire file (then replaced with a blank space, ha! fail). I appreciate any help!! -
You have a real problem here. If I had this problem I might try replacing a non-double-quote (which I capture into group #1) followed by a
\n
with the group #1 character. Here’s how to tell Notepad++ to do it:Find what box:
([^"])\n
Replace with box:\1
Search mode: Regular expressionI’m not sure your text always follows that specification, but from your description it appears as if it might.
Actually, looking back at what you were trying, I must say you were pretty close to what I came up with.
We always like to see those asking for help here show us what they’ve already tried and failed with, so hat’s off to ya for that! :)
-
That worked perfectly! Thank you so much Alan!!
-
Hello, @john-ditzel and All,
I tried to re-visit the general problem, when a list of records, in a delimited file, is completely unformatted and may contain :
-
Line-break(s) and/or blank character(s), possibly consecutive, inserted before and/or **after any field separator symbol ( generally a comma, semi-colon or colon char )
-
Blank character(s), possibly consecutive, located at beginning or end of any field contents, right before and/or after the delimiter symbol ( generally a double or single quote char )
-
Line-break(s), possibly consecutive, located inside any field contents, between two standard characters, different from, both, the delimiter and the separator symbols
-
Additional line-break(s) and/or blank character(s), possibly consecutive, between two records
-
Unnecessary space characters char(s)( or tabulation chars, when not a delimiter ) inside field contents
So, here is a way to get a neat formatted list :
-
Let SEP be the field separator ( The
,
character, in your case ) -
Let DEL be the starting AND ending character delimiter ( The
"
character, in your case )
IMPORTANT : The separator character is NEVER part of any field value !
-
IF the separator is not the tabulation character, use these four generic regex S/R, below, in that order :
-
SEARCH 1
[\h\r\n]*(
SEP)[\h\r\n]*|
DEL\K[\h\r\n]*(?=
DEL|\z)
-
REPLACE 1
?1
SEP:\r\n
-
SEARCH 2
[^
DELSEP\h\r\n]\K(\h*\R+)+\h*(?=[^
DELSEP\h\r\n])
-
REPLACE 2
\x20
-
SEARCH 3
(^|
SEP)
DEL\K[\h\r\n]+|[\h\r\n]+(?=
DEL(
SEP|$))
-
REPLACE 3
Leave EMPTY
-
SEARCH 4
(['-])\h+|\x20{2,}|\t+
-
REPLACE 4
?1\1:\x20
-
-
IF the separator is the tabulation character, use these four generic regex S/R, below, in that order :
-
SEARCH 1
[\x20\r\n]*(\t)[\x20\r\n]*|
DEL\K[\x20\r\n]*(?=
DEL|\z)
-
REPLACE 1
?1\t:\r\n
-
SEARCH 2
[^
DEL\h\r\n]\K(\x20*\R+)+\x20*(?=[^
DEL\h\r\n])
-
REPLACE 2
\x20
-
SEARCH 3
(^|\t)
DEL\K[\x20\r\n]+|[\x20\r\n]+(?=
DEL(\t|$))
-
REPLACE 3
Leave EMPTY
-
SEARCH 4
(['-])\x20+|\x20{2,}
-
REPLACE 4
?1\1:\x20
-
Practical example :
Assuming the sample list, below, containing
3
records, and, really, very, very twisted !"John","Smith","Chicago","Illinois","Likes to bicycle and swim. " " Mary Elisabeth " , " Ditzel " ," Los Angeles" , "California"," Love skiing and snow- boarding" "Andrew " , " Mc Millan " , "San Francisco " , " California" , " This is a small Test "
With the first regex S/R, below,
-
We delete any range of blank and/or EOL characters, located outside fields and which surrounds separators
-
We replace any range of blank and/or EOL characters; between two records, with a single line-break
SEARCH
[\h\r\n]*(,)[\h\r\n]*|"\K[\h\r\n]*(?="|\z)
REPLACE
?1,:\r\n
So, the text is changed into :
"John","Smith","Chicago","Illinois","Likes to bicycle and swim. " " Mary Elisabeth "," Ditzel "," Los Angeles","California"," Love skiing and snow- boarding" "Andrew "," Mc Millan ","San Francisco "," California"," This is a small Test "
Now, after running the second regex S/R, below, which replaces any range of blank and/or EOL chars, containing, at least, one EOL char, between
2
non-delimiter/blank/EOL characters, with a single space character :SEARCH
[^",\h\r\n]\K(\h*\R+)+\h*(?=[^",\h\r\n])
REPLACE
\x20
We get the changed list, below :
"John","Smith","Chicago","Illinois","Likes to bicycle and swim. " " Mary Elisabeth "," Ditzel "," Los Angeles","California"," Love skiing and snow- boarding" "Andrew "," Mc Millan ","San Francisco "," California"," This is a small Test "
Then, using this third regex S/R, which deletes any consecutive range of blank and/or EOL characters, beginning or ending any field :
SEARCH
(^|,)"\K[\h\r\n]+|[\h\r\n]+(?="(,|$))
REPLACE
Leave EMPTY
We get :
"John","Smith","Chicago","Illinois","Likes to bicycle and swim." "Mary Elisabeth","Ditzel","Los Angeles","California","Love skiing and snow- boarding" "Andrew","Mc Millan","San Francisco","California","This is a small Test"
From now on, we almost have our neat list, without unnecessary line-breaks ! Finally, with that fourth regex S/R, we normalize any consecutive range of blank characters to a single space character, except when right after a dash or apostrophe. In that case, the space char, wrongly added, by the second S/R above, is, then, deleted !
SEARCH
(['-])\h+|\x20{2,}|\t+
REPLACE
?1\1:\x20
And, here is our final and expected list :
"John","Smith","Chicago","Illinois","Likes to bicycle and swim." "Mary Elisabeth","Ditzel","Los Angeles","California","Love skiing and snow-boarding" "Andrew","Mc Millan","San Francisco","California","This is a small Test"
Regards
guy038
-