Using Notepad++ to sanitize data to be imported
-
Background:
My employer contracted a coder to build a database for warehousing sales transaction data of our reseller markets (i.e. the sales reports we receive from them, downloaded through their various customer dashboards). The coder was also asked to develop an SSIS solution that processes data files for import to such DB (imports to be done on a monthly basis). We’re low budget. I’m the staff member designated to prepare the data for import, once initially for all historical data and going forward, monthly. I know just enough to know that I don’t know nearly enough about sanitizing and standardizing the data for import.
- I download the reports (usually .csv or .xslx)
- Open them in Excel to give them a data integrity check (formatting as table then looking through filters for stray characters or data in incorrect columns)
- Save them as a Unicode .txt file
- Open the .txt in Excel again for second data integrity check (same as step 2)
- Save the .txt file from Excel
- Open the .txt file from Notepad++
- Use the “Show All Characters” feature to visually check the data, its control characters against the SSIS solution’s column mapping
Problem:
The coder is who turned me onto Notepad++ for using the “Show All Characters” feature. It’s probably the simplest thing to change, but visually checking the control characters is messy because nothing is aligned (snapped in alignment as if in a column, like how Excel would display the data). Googling has gotten me as far as the plugin “Code alignment,” which I thought might be used to align the text by control character. I can’t find that it does this, though. I found this Community page and I’m hoping not to get laughed off.
Thank you,
Amy -
can you tell me which control chars you are looking for in that file?
It sounds strange that your data needs to have those chars, especially
when you save excel data as txt file.Cheers
Claudia -
Claudia–I think I’ve misspoke. The “control characters” that I look for in Notepad++ are orange right-pointing arrow symbols that Notepad++ makes visible when the “Show All Characters” feature is being used. Each arrow represents a column, I understood (as told by the coder). They are visual markers for me to verify that columns of the data for import match the order and number of the columns mapped in the SSIS solution. I am not inserting a control character. But I do have to give the data these checks to remove pre-appended and trailing spaces or quotation marks or tabs–errant characters appended somewhere in this process (how or when they are appended is a frustrating mystery).
Thank you,
Amy -
-
Claudia–I will verify this and return.
Thank you for your help.
Goodnight,
Amy -
-
@Amy-Wilson What does your coder want you to do that can’t be done in code? You should not need to be looking at control codes. It’ll drive you, or anyone, batty.
In Notepad++ there’s a blue paragraph symbol in the toolbar. Click that to enable viewing of control code. I will use this to take a quick look at the data when I’m having difficulty parsing the data with code. I just look to see if there’s something about the data that’s normally invisible. Usually I spot the issue and click the blue paragraph symbol again to turn off viewing of the control code. I would never want to be doing extensive editing/cleanup work with control code viewing enabled.
Notepad++ can so some very cool things that can help with data cleanup but you’ve explained very little of your project’s needs. Also, you seem to have a technical person available, “the coder”, that really should be the one documenting and walking you through the process as he/she should understand the data well.
-
Hi Amy,
looks like @dail has the proper plugin for you.
Nevertheless, it might be useful too if we can identify some rules to make your life easier.Cheers
Claudia -
dail–thank you. That appears to be just what I wanted. Unfortunately, the initial imports will be large files, so it might not do the job still.
mkupper–the coder directed me to Notepad++ specifically for the blue paragraph symbol, the “Show All Characters” feature. His SSIS solution processes .txt files, so I need to save the .csv or .xlsx reports to a Unicode .txt file. When saving the reports as .txt files, the invisible content becomes visible in some cases, or are read as tabs or enters in other cases, thus moving the data into fields/records they should not be in.
The SSIS solution does two things so far: 1) Imports all data (records/fields) of the reseller reports and stores them in individual reseller tables. 2) Takes specific data of the reseller reports and combines them with all other like-data of other reseller reports in another table.
I believe that more may be added to the SSIS solution that is currently being accomplished by query/insert in the database, such as: Where in the combined-data table there is a transaction with description that indicates multiple line-items, break out transaction line-items and assign each a general ledger account.
As of this morning, we’ve given the task of preparing the reports for import back over to the coder so that they’re able to resolve the data integrity issues quickly and get the initial historical data into the database immediately. (We’ve passed a few internal deadlines already and want to have it in.) Perhaps the data integrity issues will become apparent to the coder then, and they’ll immediately recognize the solution(s).
If not, I might return to dig further into how best Notepad++ can be utilized to sanitize the data for import.
Thank you all again,
Amy -
@Amy-Wilson I tend to use Notepad++ for one-off sanitation projects. If it’s something I’ll be doing regularly then my personal preference is sed which is available here .
For example, if the lines of your file should have a specific format. Let’s say
Qty <tab> Description <tab> $price
then I’d do a sed script that contains
/^$/ d /^[0-9]\+\t[^\t]\+\t\$[0-9]\+\.[0-9][0-9]$/ d =
The first line deletes blank lines. The second line delete all of the lines that match the expected format which is a number that’s 1 to N digits, a tab, any number of characters other than tab, another tab, a $ sign, 1 to N digits, a decimal point, and two digits. The line with the
=
outputs the line number in the data file. sed then outputs the line that did not match our expected pattern.I can tighten things up more. For example, if it’s know the Qty will never be more than 3 digits and must not have leading zeroes then I could use
[1-9][0-9]\{0,2\}
which will allow for 1 to 999 and error out on anything else.If I save the above in check-for-errors.txt and run `sed -f check-for-errors.txt data.txt >errors.txt’ then errors.txt will contain a list of lines that did not match the format I expected along with the line numbers to make it easier to find the bad lines. I can use that to clean up data.txt until a run using check-for-errors.txt produces an empty file.
Be aware that while sed can process Unicode files that the sed scripts should be plain ascii files though you can do Unicode matching using \uXXXX
-
@mkupper That is immensely helpful, thank you.
-
I realized when reading my earlier post that I can simplify one thing as you may be new to both sed and Notepad++
By default sed uses what it calls basic regular expressions. To use operators such as
? + | ()
and{}
you need to preface them with a backslash. You can also use sed with what they call extended regular expressions. These are exactly the same as basic regular expressions except you don’t need the backslash for those operators. The advantage of extended regular expressions is that’s what Notepad++ uses by default. Thus it’ll be easier to try out things in Notepad++ and once you have an something that works you can just copy/paste the expression into a sed script file without then needing to locate and the special characters and to prefix each one with \For example, here’s the example sed script using extended regular expressions, I also used the code that only allows Qty values of 1 to 999 as {} also needs be entered as \{ \} in basic mode:
/^$/ d /^[1-9][0-9]{0,2}\t.+\t\$[0-9]+\.[0-9][0-9]$/ d =
I replaced the
[^\t]+
I’d used in the first example with.+
. Both skip over the description field and define that it must have at least one character. I made that change because while[^\t]+
works well for sed which reads and processes one line at a time it will create surprises when used in Notepad++ where[^\t]+
will scan past the end of the line and all the way to the end of the file looking for a character that’s not a tab. There’s a workaround in that I could have used[^\t\r\n]+
but.+
is far clearer. -
@mkupper Thank you for the simplification and additional info. Still immensely helpful.
-
Just a reminder. N++ under “Edit” has “Blank Operations” and that has the option of “Tab to Space” which makes all the tabs (arrows) into spaces (represented by colored dots). Once done, you can easily do a search and replace of the (typically 3) spaces to either one space or whatever the tabs need to become. Usual caution. A global search and replace (Control+R) will replace any set of three spaces, not just the ones that were formally tabs.
-
I’ve several times over the years WANTED to use sed for stuff under Windows, but it seems that it always became too much of a struggle with getting all the escaping correct, first thru the CMD shell layer, and then the sed layer itself. Don’t misunderstand, I’ve been using Unix/DOS/Linux/CMD command lines since the early 1990s, and consider myself near expert level with them. It just seems that there is “something” about a Windows port of sed that makes me nuts trying to get what I’m trying to do to work. However, @mkupper, maybe your recent posts have encouraged me to give it another go. :)