Using Notepad++ to sanitize data to be imported

  • @Amy-Wilson

    Sounds like you are looking for tabs

    If this is the case, then I don’t see any chance to format your data without
    corrupting it.

    Do you know whether the data you are checking follows a strict format?
    If so, maybe we could help you set up some find rules to check that the data is correct.


  • Claudia–I will verify this and return.

    Thank you for your help.


  • ElasticTabstops plugin?

    Do note it has performance issues with large files.

  • @Amy-Wilson What does your coder want you to do that can’t be done in code? You should not need to be looking at control codes. It’ll drive you, or anyone, batty.

    In Notepad++ there’s a blue paragraph symbol in the toolbar. Click that to enable viewing of control codes. I use this to take a quick look at the data when I’m having difficulty parsing it with code. I just look to see if there’s something about the data that’s normally invisible. Usually I spot the issue and click the blue paragraph symbol again to turn off viewing of the control codes. I would never want to be doing extensive editing/cleanup work with control code viewing enabled.

    Notepad++ can do some very cool things that can help with data cleanup, but you’ve explained very little of your project’s needs. Also, you seem to have a technical person available, “the coder”, who really should be the one documenting and walking you through the process, as he/she should understand the data well.

  • @Amy-Wilson

    Hi Amy,
    looks like @dail has the proper plugin for you.
    Nevertheless, it might also be useful if we can identify some rules to make your life easier.


  • dail–thank you. That appears to be just what I wanted. Unfortunately, the initial imports will be large files, so it still might not do the job.

    mkupper–the coder directed me to Notepad++ specifically for the blue paragraph symbol, the “Show All Characters” feature. His SSIS solution processes .txt files, so I need to save the .csv or .xlsx reports to a Unicode .txt file. When saving the reports as .txt files, the invisible content becomes visible in some cases, or is read as tabs or line breaks in other cases, thus moving the data into fields/records it should not be in.
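
    If it helps, stray control characters can also be spotted from a command line before a file ever reaches the import step. This is just a sketch, assuming GNU grep with -P (Perl-style regex) support; sample.txt is an invented file name:

```shell
# Create a sample line with a hidden vertical-tab character (0x0B)
# between two fields; it is invisible in most editors.
printf 'Widget\x0BBlue\t4\n' > sample.txt

# Report, with line numbers, any line containing a control character
# other than tab (0x09), LF (0x0A), or CR (0x0D).
grep -nP '[\x00-\x08\x0B\x0C\x0E-\x1F]' sample.txt
```

    grep exits with status 1 and prints nothing when the file is clean, so the same command doubles as a pass/fail check in a batch script.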

    The SSIS solution does two things so far: 1) Imports all data (records/fields) of the reseller reports and stores them in individual reseller tables. 2) Takes specific data of the reseller reports and combines them with all other like-data of other reseller reports in another table.

    I believe that more may be added to the SSIS solution than is currently being accomplished by query/insert in the database, such as: where the combined-data table has a transaction whose description indicates multiple line-items, break out the transaction line-items and assign each a general ledger account.

    As of this morning, we’ve given the task of preparing the reports for import back over to the coder so that they’re able to resolve the data integrity issues quickly and get the initial historical data into the database immediately. (We’ve passed a few internal deadlines already and want to have it in.) Perhaps the data integrity issues will become apparent to the coder then, and they’ll immediately recognize the solution(s).

    If not, I might return to dig further into how best Notepad++ can be utilized to sanitize the data for import.

    Thank you all again,

  • @Amy-Wilson I tend to use Notepad++ for one-off sanitation projects. If it’s something I’ll be doing regularly then my personal preference is sed which is available here.

    For example, if the lines of your file should have a specific format. Let’s say

    Qty <tab> Description <tab> $price

    then I’d do a sed script that contains

    /^$/ d
    /^[0-9]\+\t[^\t]\+\t\$[0-9]\+\.[0-9][0-9]$/ d
    =

    The first line deletes blank lines. The second line deletes all of the lines that match the expected format, which is a number that’s 1 to N digits, a tab, any number of characters other than tab, another tab, a $ sign, 1 to N digits, a decimal point, and two digits. The line with the = outputs the line number in the data file; sed then outputs the line that did not match our expected pattern.

    I can tighten things up more. For example, if it’s known the Qty will never be more than 3 digits and must not have leading zeroes, then I could use [1-9][0-9]\{0,2\}, which allows for 1 to 999 and flags anything else as an error.

    If I save the above in check-for-errors.txt and run `sed -f check-for-errors.txt data.txt >errors.txt` then errors.txt will contain a list of the lines that did not match the format I expected, along with their line numbers to make it easier to find the bad lines. I can use that to clean up data.txt until a run using check-for-errors.txt produces an empty file.
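
    To make that concrete, here is a hypothetical end-to-end run; the file names and the three data lines are invented for the example, and it assumes GNU sed (which understands \t in a regex):

```shell
# The checking script: drop blank lines, drop lines that match the
# expected Qty<tab>Description<tab>$price format, and print the line
# number (=) plus the line itself for whatever is left over.
cat > check-for-errors.txt <<'EOF'
/^$/ d
/^[0-9]\+\t[^\t]\+\t\$[0-9]\+\.[0-9][0-9]$/ d
=
EOF

# Sample data: line 2 is missing the $ sign in front of the price.
printf '3\tWidget\t$4.99\n12\tGadget\t5.00\n1\tDoodad\t$0.50\n' > data.txt

sed -f check-for-errors.txt data.txt > errors.txt
cat errors.txt   # line number 2, then the offending line
```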

    Be aware that while sed can process Unicode files, the sed scripts themselves should be plain ASCII files, though you can do Unicode matching using \uXXXX

  • @mkupper That is immensely helpful, thank you.

  • I realized when reading my earlier post that I can simplify one thing, as you may be new to both sed and Notepad++.

    By default sed uses what it calls basic regular expressions. To use operators such as ? + | () and {} you need to preface them with a backslash. You can also run sed with what it calls extended regular expressions (the -E or -r command-line switch). These are exactly the same as basic regular expressions except you don’t need the backslash for those operators. The advantage of extended regular expressions is that’s what Notepad++ uses by default. Thus it’ll be easier to try things out in Notepad++, and once you have something that works you can just copy/paste the expression into a sed script file without then needing to locate the special characters and prefix each one with \

    For example, here’s the example sed script using extended regular expressions. I also used the code that only allows Qty values of 1 to 999, as {} also needs to be entered as \{ \} in basic mode:

    /^$/ d
    /^[1-9][0-9]{0,2}\t.+\t\$[0-9]+\.[0-9][0-9]$/ d
    =

    I replaced the [^\t]+ I’d used in the first example with .+. Both skip over the description field and require that it have at least one character. I made that change because, while [^\t]+ works well for sed, which reads and processes one line at a time, it creates surprises when used in Notepad++, where [^\t]+ will scan past the end of the line and all the way to the end of the file looking for a character that’s not a tab. There’s a workaround in that I could have used [^\t\r\n]+, but .+ is far clearer.
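
    One detail worth spelling out: the extended-mode script has to be run with sed’s -E switch (or -r on older GNU sed), otherwise sed will still parse it as a basic regular expression. A sketch with invented file names:

```shell
cat > check-ere.txt <<'EOF'
/^$/ d
/^[1-9][0-9]{0,2}\t.+\t\$[0-9]+\.[0-9][0-9]$/ d
=
EOF

# Sample data: line 2 has a four-digit Qty, which the 1-999 rule flags.
printf '999\tWidget\t$4.99\n1000\tGadget\t$5.00\n' > data.txt

# -E selects extended regular expressions to match the script above.
sed -E -f check-ere.txt data.txt
```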

  • @mkupper Thank you for the simplification and additional info. Still immensely helpful.

  • Just a reminder. N++ under “Edit” has “Blank Operations”, and that has the option “TAB to Space”, which makes all the tabs (arrows) into spaces (represented by colored dots). Once done, you can easily do a search and replace of the (typically 3) spaces to either one space or whatever the tabs need to become. Usual caution: a global search and replace (Ctrl+H) will replace any set of three spaces, not just the ones that were formerly tabs.
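
    For what it’s worth, the same two-pass idea can be scripted with sed. A sketch, assuming GNU sed and the three-spaces-per-tab convention described above; report.txt is an invented file name:

```shell
printf 'Qty\tDescription\tPrice\n' > report.txt

# Pass 1: turn each tab into three spaces (what TAB to Space would do).
sed -i 's/\t/   /g' report.txt

# Pass 2: collapse each run of three spaces into one. Same caution as
# in Notepad++: this hits ANY three spaces, not just former tabs.
sed -i 's/   / /g' report.txt

cat report.txt   # Qty Description Price
```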

  • @mkupper

    I’ve several times over the years WANTED to use sed for stuff under Windows, but it seems that it always became too much of a struggle with getting all the escaping correct, first thru the CMD shell layer, and then the sed layer itself. Don’t misunderstand, I’ve been using Unix/DOS/Linux/CMD command lines since the early 1990s, and consider myself near expert level with them. It just seems that there is “something” about a Windows port of sed that makes me nuts trying to get what I’m trying to do to work. However, @mkupper, maybe your recent posts have encouraged me to give it another go. :)
