Open a pipe-delimited text file on Windows - where one field may contain CRLF

Hardy Merrill

I’m using Notepad++ to open a pipe-delimited text file on Windows - one of the fields may contain CRLF. Notepad++ thinks every CRLF is a end-of-line character now which looks odd because it caused the line to bump down and be continued on the next line whenever it encounters a CRLF. Is it possible to adjust Notepad++ so that it will recognize the actual end-of-line characters, but leave the data fields alone that contain CRLF characters? TIA!

PeterJones

@Hardy-Merrill ,

Notepad++ is a text editor. The text of your character-separated-values (CSV) file contains newlines, and Notepad++ is properly rendering them as newlines. Now from a spreadsheet/database perspective, the two newlines may have two different meanings (the ones in the fields are data/string newlines, which you seem to want to treat as not being newlines, whereas the ones at the ends of records are being used as a record separator, which you seem to want to treat as real newlines).

However, you can work around it while you are editing.

I am going to assume you have the following valid CSV, which has quotes around any field containing newlines:

record1|"This is a long string without newlines"|1234
record2|"This has a newline.
See."|5678

the newline after 1234 is a record separator newline. The newline after newline. is a string newline.

So to “hide” the newlines in the data, you could do a search like the following:

FIND = (\|"[^\r\n"]*)\r\n – this says look for a pipe, then a quote, then multiple characters that are not CR or LF or quote, store those in group1, then a newline sequence
REPLACE = ${1}¶ – this says replace that whole sequence with the contents of group1, followed by a pilcrow symbol.
SEARCH MODE = Regular expression
REPLACE ALL

That will give you

record1|"This is a long string without newlines"|1234
record2|"This has a newline.¶See."|5678

so now you can see the individual records, and know which fields contain newline sequences, without it breaking it up to look funny for you. (If you have a field with multiple newlines, you can just run the search-and-replace a few more times until it stops matching. There are complicated ways to get it in one fell swoop, but why debug and maintain a complicated regex when you can just hit REPLACE ALL on a simple regex a few times and be done.)

When you are done, and you want to go back to a valid pipe-delimited CSV, you can convert ¶ to \r\n with another regex search-and-replace.

Et voila, it’s usable for you, without requiring a huge coding change and having to wait for the developer to implement the niche feature for you (which wasn’t likely to happen even with an official feature request) and without waiting for someone to develop a plugin that does it for you – though it wouldn’t be unreasonable to see if @Bas-de-Reuver wanted to add such a feature to the CSV Lint plugin – because I can see that your feature (automatically change embedded newlines to pilcrow on file read, and automatically change pilcrow to embedded newlines on file save) would be useful to the pretty much the same set of users that CSV Lint is useful for.

(NB: if you have an invalid CSV, where field newlines don’t have quotes around it, this won’t work. If you have more complicated field data, you might have to research regex and come up with a more complicated regex on your own.)

----

Useful References

Bas de Reuver

I’m not exactly clear on what OP is asking. Notepad++ is a textfile editor and so it will always display the textfile as-is, including all the CRLF’s.

Like @PeterJones pointed out, you can install the CSV Lint plug-in which will add different colors to the columns, making it easier to see what is what (see video) But, it will still display the file as just a text file.

PeterJones

@Bas-de-Reuver ,

I was just thinking: could you add an option to CSV Lint that would hook up two callbacks:

On load, replace newline that’s inside a field (ie, inside matched quotes) with the pilcrow (so that it displays nice)
on save, replace pilcrow with a real newline (so that it has the real data whenever it’s saved to disk)

Because CSV files can have valid newlines inside a string field, as shown with the first of my two text boxes… and having your CSV Lint help with displaying that might be a nice feature (it’s not technically linting… but since the linter is there to make nice syntax highlighting for better being able to see your columns in CSV, displaying a newline that’s in a column as a pilcrow might be an optional “display my CSV better for me” feature).

If I find some time later today, I might hack up a PythonScript example of what I am talking about, in case you still don’t understand, or want a sequence that might help.

Hardy Merrill

@PeterJones Thanks Peter! This got me thinking in the right direction. Here’s an example - the file I’m opening in Notepad++ looks like this I believe (displaying special characters like CRLF):

{aaa}|bbb|cccCRLF
{ddd}|eeCRLFe|fffCRLF
{ggg}|hhh|iiiCRLF

But notice that eee has a CRLF actually as part of the text between the 2nd and 3rd e, so what I see in Notepad++ is this:

{aaa}|bbb|cccCRLF
{ddd}|eeCRLF
e|fffCRLF
{ggg}|hhh|iiiCRLF

To fix that, the Find and Replace that I used was this:

Find what: \r\n([^{])
Replace with: \1
Match case: unchecked
Wrap around: unchecked
Search mode
Normal: unselected
Extended: unselected
Regular expression: SELECTED
. matches newline: unselected

The find matches lines that end with CRLF where the next character (appears on the next line in Notepad++ although I believe it is on the same line in the file) is NOT left brace. Replace that with the captured next character on the next line that is NOT left brace - effectively removing the CRLF that came in as part of the text.

Hope this makes sense. Thanks for your help!

As I’m thinking about this - another question - will this replace ALL the instances found in a line (what if a text value contains more than one CRLF), or only the first one? Thanks again.

PeterJones

@Hardy-Merrill said in Open a pipe-delimited text file on Windows - where one field may contain CRLF:

But notice that eee has a CRLF actually as part of the text between the 2nd and 3rd e, so what I see in Notepad++ is this:
{aaa}|bbb|cccCRLF
{ddd}|eeCRLF
e|fffCRLF
{ggg}|hhh|iiiCRLF

@PeterJones mumbles something about poorly designed use of CSV format, feeling sorry for any who have to use it

To fix that, the Find and Replace that I used was this:

Find what: \r\n([^{])
Replace with: \1

Good job. Glad you came up with that. It helped that the first field in a record always starts with {.

As I’m thinking about this - another question - will this replace ALL the instances found in a line (what if a text value contains more than one CRLF), or only the first one? Thanks again.

{aaa}|bbb|ccc
{ddd}|ee
e|fff
{ggg}|h
h
h|iii

so the “eeCRLFe” has one newline, and “hCRLFhCRLFh” has two newlines

A single REPLACE ALL, given your regex, will end up with

{aaa}|bbb|ccc
{ddd}|eee|fff
{ggg}|hhh|iii

I believe that answers your “another question”.

PeterJones

@Hardy-Merrill said in Open a pipe-delimited text file on Windows - where one field may contain CRLF:

Replace with: \1

BTW: I highly recommend getting out of the habit of using the \# notation for Replace With expressions. Technically, Notepad++ and the Boost Regex library that it uses for regular expressions don’t specify the \# notation, because that backslash-digit notation is for “backreferences”: see npp usermanual and the official boost search syntax, and notice that notation is not listed in the boost replacement syntax.

Moreover, if you have ten or more matching groups, the backslash-digit will not work as you expect.

DATA:

ABCDEFHIJxABCDEFHIJyABCDEFHIJzABCDEFHIJx

FIND = (.)(.)(.)(.)(.)(.)(.)(.)(.)(.)
REPLACE = \10
result:

A0A0A0A0

… this is because the digit backref only work for digits 1-9. If you try for a 10th capture group, that notation will not work. What \10 literally says is “in the replacement, do a backreference to capture group 1, and then insert a literal 0 character”

compare that to
REPLACE = $10

xyzx

Which does what you expect. And
REPLACE = ${10}0

x0y0z0x0

… which lets you specify a literal digit to go in the replacement after the match group number…

If you get into the habit of \1, you will hit the end of your rope with nine capture groups. If you always use $1, then when you expand to 10 or more groups, it will still work. And if you get used to always ${1}, then when you expand to 10 or more, it won’t ever matter what the next literal character is in the replacement: it will always work as expected. This is why I try to make my example regex substitutions always use ${1} (and when I get lazy, they will occasionally just be $1), but I never use \1 in replacements, because is not the canonical way to refer to a capture group contents in the replacement. Saying “But \1 works for me, so I will just keep on using it” will eventually lead to “why doesn’t \10 work the way I want it to?” after you’ve forgotten because of thousands of repetitions putting \1 into muscle memory.

Hardy Merrill

@PeterJones Thanks again Peter - I did not know this about the \ limitation and $ for referring to capture groups.

Bas de Reuver

Maybe not exactly what you’re looking for but just fyi. I’ve updated the CSV Lint plug-in to v4.5.3 which includes a new option in the “Reformat Data” dialog to replace any carriage return/line feed characters (CRLF or CR or LF) with a given string, for example <br>.

You can install the latest plug-in manually by going to the releases page, download the zip and follow the steps as described here.