Hello, @john-ditzel and All,
I tried to re-visit the general problem, when a list of records, in a delimited file, is completely unformatted and may contain :
Line-break(s) and/or blank character(s), possibly consecutive, inserted before and/or **after any field separator symbol ( generally a comma, semi-colon or colon char )
Blank character(s), possibly consecutive, located at beginning or end of any field contents, right before and/or after the delimiter symbol ( generally a double or single quote char )
Line-break(s), possibly consecutive, located inside any field contents, between two standard characters, different from, both, the delimiter and the separator symbols
Additional line-break(s) and/or blank character(s), possibly consecutive, between two records
Unnecessary space characters char(s)( or tabulation chars, when not a delimiter ) inside field contents
So, here is a way to get a neat formatted list :
Let SEP be the field separator ( The , character, in your case )
Let DEL be the starting AND ending character delimiter ( The " character, in your case )
IMPORTANT : The separator character is NEVER part of any field value !
IF the separator is not the tabulation character, use these four generic regex S/R, below, in that order :
SEARCH 1 [\h\r\n]*(SEP)[\h\r\n]*|DEL\K[\h\r\n]*(?=DEL|\z)
REPLACE 1 ?1SEP:\r\n
SEARCH 2 [^DELSEP\h\r\n]\K(\h*\R+)+\h*(?=[^DELSEP\h\r\n])
REPLACE 2 \x20
SEARCH 3 (^|SEP)DEL\K[\h\r\n]+|[\h\r\n]+(?=DEL(SEP|$))
REPLACE 3 Leave EMPTY
SEARCH 4 (['-])\h+|\x20{2,}|\t+
REPLACE 4 ?1\1:\x20
IF the separator is the tabulation character, use these four generic regex S/R, below, in that order :
SEARCH 1 [\x20\r\n]*(\t)[\x20\r\n]*|DEL\K[\x20\r\n]*(?=DEL|\z)
REPLACE 1 ?1\t:\r\n
SEARCH 2 [^DEL\h\r\n]\K(\x20*\R+)+\x20*(?=[^DEL\h\r\n])
REPLACE 2 \x20
SEARCH 3 (^|\t)DEL\K[\x20\r\n]+|[\x20\r\n]+(?=DEL(\t|$))
REPLACE 3 Leave EMPTY
SEARCH 4 (['-])\x20+|\x20{2,}
REPLACE 4 ?1\1:\x20
Practical example :
Assuming the sample list, below, containing 3 records, and, really, very, very twisted !
"John","Smith","Chicago","Illinois","Likes to bicycle
and
swim. "
" Mary
Elisabeth " , "
Ditzel
" ," Los
Angeles" , "California"," Love skiing
and
snow-
boarding"
"Andrew " ,
"
Mc
Millan " , "San
Francisco " , " California"
,
" This is a small
Test "
With the first regex S/R, below,
We delete any range of blank and/or EOL characters, located outside fields and which surrounds separators
We replace any range of blank and/or EOL characters; between two records, with a single line-break
SEARCH [\h\r\n]*(,)[\h\r\n]*|"\K[\h\r\n]*(?="|\z)
REPLACE ?1,:\r\n
So, the text is changed into :
"John","Smith","Chicago","Illinois","Likes to bicycle
and
swim. "
" Mary
Elisabeth ","
Ditzel
"," Los
Angeles","California"," Love skiing
and
snow-
boarding"
"Andrew ","
Mc
Millan ","San
Francisco "," California"," This is a small
Test "
Now, after running the second regex S/R, below, which replaces any range of blank and/or EOL chars, containing, at least, one EOL char, between 2 non-delimiter/blank/EOL characters, with a single space character :
SEARCH [^",\h\r\n]\K(\h*\R+)+\h*(?=[^",\h\r\n])
REPLACE \x20
We get the changed list, below :
"John","Smith","Chicago","Illinois","Likes to bicycle and swim. "
" Mary Elisabeth ","
Ditzel
"," Los Angeles","California"," Love skiing and snow- boarding"
"Andrew ","
Mc Millan ","San Francisco "," California"," This is a small Test "
Then, using this third regex S/R, which deletes any consecutive range of blank and/or EOL characters, beginning or ending any field :
SEARCH (^|,)"\K[\h\r\n]+|[\h\r\n]+(?="(,|$))
REPLACE Leave EMPTY
We get :
"John","Smith","Chicago","Illinois","Likes to bicycle and swim."
"Mary Elisabeth","Ditzel","Los Angeles","California","Love skiing and snow- boarding"
"Andrew","Mc Millan","San Francisco","California","This is a small Test"
From now on, we almost have our neat list, without unnecessary line-breaks ! Finally, with that fourth regex S/R, we normalize any consecutive range of blank characters to a single space character, except when right after a dash or apostrophe. In that case, the space char, wrongly added, by the second S/R above, is, then, deleted !
SEARCH (['-])\h+|\x20{2,}|\t+
REPLACE ?1\1:\x20
And, here is our final and expected list :
"John","Smith","Chicago","Illinois","Likes to bicycle and swim."
"Mary Elisabeth","Ditzel","Los Angeles","California","Love skiing and snow-boarding"
"Andrew","Mc Millan","San Francisco","California","This is a small Test"
Regards
guy038