Community
    • Login

    Using Notepad++ to sanitize data to be imported

    Scheduled Pinned Locked Moved Help wanted · · · – – – · · ·
    15 Posts 6 Posters 11.7k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • Claudia FrankC
      Claudia Frank @Amy Wilson
      last edited by

      @Amy-Wilson

      can you tell me which control chars you are looking for in that file?
      It sounds strange that your data needs to have those chars, especially
      when you save excel data as txt file.

      Cheers
      Claudia

      1 Reply Last reply Reply Quote 1
      • Amy WilsonA
        Amy Wilson
        last edited by

        Claudia–I think I’ve misspoke. The “control characters” that I look for in Notepad++ are orange right-pointing arrow symbols that Notepad++ makes visible when the “Show All Characters” feature is being used. Each arrow represents a column, I understood (as told by the coder). They are visual markers for me to verify that columns of the data for import match the order and number of the columns mapped in the SSIS solution. I am not inserting a control character. But I do have to give the data these checks to remove pre-appended and trailing spaces or quotation marks or tabs–errant characters appended somewhere in this process (how or when they are appended is a frustrating mystery).

        Thank you,
        Amy

        Claudia FrankC 1 Reply Last reply Reply Quote 1
        • Claudia FrankC
          Claudia Frank @Amy Wilson
          last edited by

          @Amy-Wilson

          Sounds like you are looking for tabs

          If this is the case, then I don’t see any chance to format your data without
          corrupting it.

          Do you know whether there is a strict format of the data you check?
          If so, we maybe could help you to setup some find rules to see if the data is correct.

          Cheers
          Claudia

          1 Reply Last reply Reply Quote 1
          • Amy WilsonA
            Amy Wilson
            last edited by

            Claudia–I will verify this and return.

            Thank you for your help.

            Goodnight,
            Amy

            Claudia FrankC 1 Reply Last reply Reply Quote 0
            • dailD
              dail
              last edited by

              ElasticTabstops plugin?

              Do note it has performance issues with large files.

              1 Reply Last reply Reply Quote 1
              • mkupperM
                mkupper @Amy Wilson
                last edited by

                @Amy-Wilson What does your coder want you to do that can’t be done in code? You should not need to be looking at control codes. It’ll drive you, or anyone, batty.

                In Notepad++ there’s a blue paragraph symbol in the toolbar. Click that to enable viewing of control code. I will use this to take a quick look at the data when I’m having difficulty parsing the data with code. I just look to see if there’s something about the data that’s normally invisible. Usually I spot the issue and click the blue paragraph symbol again to turn off viewing of the control code. I would never want to be doing extensive editing/cleanup work with control code viewing enabled.

                Notepad++ can so some very cool things that can help with data cleanup but you’ve explained very little of your project’s needs. Also, you seem to have a technical person available, “the coder”, that really should be the one documenting and walking you through the process as he/she should understand the data well.

                1 Reply Last reply Reply Quote 0
                • Claudia FrankC
                  Claudia Frank @Amy Wilson
                  last edited by

                  @Amy-Wilson

                  Hi Amy,
                  looks like @dail has the proper plugin for you.
                  Nevertheless, it might be useful too if we can identify some rules to make your life easier.

                  Cheers
                  Claudia

                  1 Reply Last reply Reply Quote 0
                  • Amy WilsonA
                    Amy Wilson
                    last edited by Amy Wilson

                    dail–thank you. That appears to be just what I wanted. Unfortunately, the initial imports will be large files, so it might not do the job still.

                    mkupper–the coder directed me to Notepad++ specifically for the blue paragraph symbol, the “Show All Characters” feature. His SSIS solution processes .txt files, so I need to save the .csv or .xlsx reports to a Unicode .txt file. When saving the reports as .txt files, the invisible content becomes visible in some cases, or are read as tabs or enters in other cases, thus moving the data into fields/records they should not be in.

                    The SSIS solution does two things so far: 1) Imports all data (records/fields) of the reseller reports and stores them in individual reseller tables. 2) Takes specific data of the reseller reports and combines them with all other like-data of other reseller reports in another table.

                    I believe that more may be added to the SSIS solution that is currently being accomplished by query/insert in the database, such as: Where in the combined-data table there is a transaction with description that indicates multiple line-items, break out transaction line-items and assign each a general ledger account.

                    As of this morning, we’ve given the task of preparing the reports for import back over to the coder so that they’re able to resolve the data integrity issues quickly and get the initial historical data into the database immediately. (We’ve passed a few internal deadlines already and want to have it in.) Perhaps the data integrity issues will become apparent to the coder then, and they’ll immediately recognize the solution(s).

                    If not, I might return to dig further into how best Notepad++ can be utilized to sanitize the data for import.

                    Thank you all again,
                    Amy

                    mkupperM 1 Reply Last reply Reply Quote 0
                    • mkupperM
                      mkupper @Amy Wilson
                      last edited by

                      @Amy-Wilson I tend to use Notepad++ for one-off sanitation projects. If it’s something I’ll be doing regularly then my personal preference is sed which is available here.

                      For example, if the lines of your file should have a specific format. Let’s say

                      Qty <tab> Description <tab> $price
                      

                      then I’d do a sed script that contains

                      /^$/ d
                      /^[0-9]\+\t[^\t]\+\t\$[0-9]\+\.[0-9][0-9]$/ d
                      =
                      

                      The first line deletes blank lines. The second line delete all of the lines that match the expected format which is a number that’s 1 to N digits, a tab, any number of characters other than tab, another tab, a $ sign, 1 to N digits, a decimal point, and two digits. The line with the = outputs the line number in the data file. sed then outputs the line that did not match our expected pattern.

                      I can tighten things up more. For example, if it’s know the Qty will never be more than 3 digits and must not have leading zeroes then I could use [1-9][0-9]\{0,2\} which will allow for 1 to 999 and error out on anything else.

                      If I save the above in check-for-errors.txt and run `sed -f check-for-errors.txt data.txt >errors.txt’ then errors.txt will contain a list of lines that did not match the format I expected along with the line numbers to make it easier to find the bad lines. I can use that to clean up data.txt until a run using check-for-errors.txt produces an empty file.

                      Be aware that while sed can process Unicode files that the sed scripts should be plain ascii files though you can do Unicode matching using \uXXXX

                      1 Reply Last reply Reply Quote 1
                      • Amy WilsonA
                        Amy Wilson
                        last edited by

                        @mkupper That is immensely helpful, thank you.

                        1 Reply Last reply Reply Quote 0
                        • mkupperM
                          mkupper
                          last edited by

                          I realized when reading my earlier post that I can simplify one thing as you may be new to both sed and Notepad++

                          By default sed uses what it calls basic regular expressions. To use operators such as ? + | () and {} you need to preface them with a backslash. You can also use sed with what they call extended regular expressions. These are exactly the same as basic regular expressions except you don’t need the backslash for those operators. The advantage of extended regular expressions is that’s what Notepad++ uses by default. Thus it’ll be easier to try out things in Notepad++ and once you have an something that works you can just copy/paste the expression into a sed script file without then needing to locate and the special characters and to prefix each one with \

                          For example, here’s the example sed script using extended regular expressions, I also used the code that only allows Qty values of 1 to 999 as {} also needs be entered as \{ \} in basic mode:

                          /^$/ d
                          /^[1-9][0-9]{0,2}\t.+\t\$[0-9]+\.[0-9][0-9]$/ d
                          =
                          

                          I replaced the [^\t]+ I’d used in the first example with .+. Both skip over the description field and define that it must have at least one character. I made that change because while [^\t]+ works well for sed which reads and processes one line at a time it will create surprises when used in Notepad++ where [^\t]+ will scan past the end of the line and all the way to the end of the file looking for a character that’s not a tab. There’s a workaround in that I could have used [^\t\r\n]+ but .+ is far clearer.

                          Alan KilbornA 1 Reply Last reply Reply Quote 1
                          • Amy WilsonA
                            Amy Wilson
                            last edited by

                            @mkupper Thank you for the simplification and additional info. Still immensely helpful.

                            1 Reply Last reply Reply Quote 0
                            • John PilgeJ
                              John Pilge
                              last edited by John Pilge

                              Just a reminder. N++ under “Edit” has “Blank Operations” and that has the option of “Tab to Space” which makes all the tabs (arrows) into spaces (represented by colored dots). Once done, you can easily do a search and replace of the (typically 3) spaces to either one space or whatever the tabs need to become. Usual caution. A global search and replace (Control+R) will replace any set of three spaces, not just the ones that were formally tabs.

                              1 Reply Last reply Reply Quote 0
                              • Alan KilbornA
                                Alan Kilborn @mkupper
                                last edited by

                                @mkupper

                                I’ve several times over the years WANTED to use sed for stuff under Windows, but it seems that it always became too much of a struggle with getting all the escaping correct, first thru the CMD shell layer, and then the sed layer itself. Don’t misunderstand, I’ve been using Unix/DOS/Linux/CMD command lines since the early 1990s, and consider myself near expert level with them. It just seems that there is “something” about a Windows port of sed that makes me nuts trying to get what I’m trying to do to work. However, @mkupper, maybe your recent posts have encouraged me to give it another go. :)

                                1 Reply Last reply Reply Quote 0
                                • First post
                                  Last post
                                The Community of users of the Notepad++ text editor.
                                Powered by NodeBB | Contributors