How to find the Correct Regex pattern to extract data from a text file

PeterJones

Given that text file, it can easily be imported into Excel or another spreadsheet, because one of the options in the text-import is to manually define column widths in terms of number of characters, and you’ve definitely got a fixed-width format there.

Trying to parse a mangled report using regex, instead of using the original data, is really the wrong way to go about things. The right way is to ask for access to the original data in a machine-readable format, and go from there. Or if your goal is just to get a new copy of the report with “better” formatting, then ask for the report to be modified.

If it were my data, and you couldn’t convince anyone to give you access to the data instead of the mangled report, I’d go study the .htm version better – because if it was a well-formed HTML file, all the data for the Description would be in one <td> cell, rather than being split across multiple <tr> rows. If that’s the case, then parsing the HTML to generate a better report might be easier than parsing this split-row txt. But if the report was horrendously generated, and the HTML really does have the Description split into multiple HTML rows, then working from the text is definitely the way to go.

It would be possible to look for lines that begin with 27 spaces instead of beginning with a number using regex… but you’re going to still run into problems that will make the replacement difficult – because you would have to know the maximum number of lines that the Description field was ever split into. (The easiest would probably be to move the whole Description column to the far right, so it can be as long as you want it to be)

You claimed,

The Description column is sometimes split into two,three,four lines

Assuming 4 is the maximum number of lines for a description, I might do something like this two step process – all of these search/replace have Search Mode = Regular Expression set:

Make sure the “additional” lines all have enough space to make the next step easier:
- FIND = (?-s)^(\x20{27}.*?)\h*$
  Find 27 spaces followed by anything followed by spaces
- REPLACE = $1\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20
  Make sure there’s at least 21 spaces after the text
Join the additional lines into the Description field, and widen the description field when there are fewer than four lines
- FIND = (?-s)^(\S.{26})(.{21})(.)(.{60})(\R\x20{27}(.{0,21})\h*$)?(\R\x20{27}(.{0,21})\h*$)?(\R\x20{27}(.{0,21})\h*$)?
  Make some groups: the first 27 characters, then the 21-character description, then the separator column, then the remaining main line data, then the three optional Description additional texts
- REPLACE = $1$2(?6$6:$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3)(?8$8:$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3)(?10${10}:$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3)$3$4
  Include the main data; if the three optional lines are there, use them; otherwise, use the 21 instances of the separator character to pad in that space

But I think you won’t be completely happy with the results. I really think you should ask for the data in a machine-readable format that doesn’t have the line splits to begin with.

If you want to learn more about the syntax, read the official NPP Searching / Regex docs and the forum’s Regular Expression FAQ. You might be able to tweak the regex yourself to make it more to your liking. But really, starting with data that isn’t mangled would be better for you in this instance.

Good luck

Suresh Reddy

@PeterJones

Respected Sir,

Thank you for the very detailed explanation and guidance in the right direction. With the help of your code I was able to achieve desired output. It was an Amazing experience and now I am very curious to learn more about Regex.

The output is almost perfect expect when there are 5 rows ( Maximum Rows). Could you please let me know what changes should be done to the above code when the maximum rows are five. I am sorry as the data was huge I missed it.

Regarding

I tried importing the Fixed-Column Width in Excel but I had problem with the Description column which is getting split into Several Columns instead of One . Otherwise It would have been very easy task with the Text-join Function in excel.
It’s impossible to convince the people sitting at the helm of affairs of IT, who don’t understand the importance of clean data file. They still use the old legacy report formatting with dot matrix printers.
Pasring the HTML is completely a new task for me. I didn’t find any <td> or <tr> rows in the Html report. I just got curious with Notepad++ and Regex because at one instance I could easily extract the first lines of the transactions in the text file using (^\d). I got the first lines copied into excel very easily. Also I removed the header parts with simple find and replace feature of Notepad++.
I always cross verify the output with original data which in this case is the closing Balance , Total of Debits & Total of Credits which got tallied with the output of your Regex code.

Thank you

PeterJones

@Suresh-Reddy said in How to find the Correct Regex pattern to extract data from a text file:

The output is almost perfect expect when there are 5 rows ( Maximum Rows

You would add one more (\R\x20{27}(.{0,21})\h*$)? to the end, and in the replacement there would be a (?12${12}:$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3) before the $3$4 =>

FIND = (?-s)^(\S.{26})(.{21})(.)(.{60})(\R\x20{27}(.{0,21})\h*$)?(\R\x20{27}(.{0,21})\h*$)?(\R\x20{27}(.{0,21})\h*$)?(\R\x20{27}(.{0,21})\h*$)?
REPLACE = $1$2(?6$6:$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3)(?8$8:$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3)(?10${10}:$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3)(?12${12}:$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3$3)$3$4

Regarding

Sorry I wasn’t more clear; in the excel import, I meant to mention that it would require manual (or VBA) merging of the multiple cells for the Description. But since this isn’t an Excel help forum, I didn’t focus my attention on that.
I understand IT can be difficult. But really the best would be to get a CSV report or other machine-readable rather than a report generated for printing
“I didn’t find any <td> or <tr> rows in the Html report”: that surprises me, unless everything was in <pre>, which would be… well, I guess that shouldn’t surprise me.

Good luck.

Suresh Reddy

Thank you for the kind help.

The Description column got fully merged into single line, but the end of first line is getting forced into second line though there is sufficient space, with “Enter Symbol” I don’t know the exact name of the symbol. I will try to cut the Description column using “Alt + C” columnar option in Notepad++ and get it copied to excel.

As I am a newbie your code is completely Greek to me , I will try to analyze and learn it step by step and post query if I am not clear on any thing.

Suresh Reddy

@PeterJones

Sir,

I tried the same logic on Different file (It’s extension is .mht) but it’s not working as desired. I removed the headers using the find and replace option and also shifted the Description column to end using the “ALT + Shift” Column option so that there wouldn’t be a space problem when joining lines. I am attaching the new format. New Line always starts with a date, can we have a regex which can combine multiple rows of description till the next date ( Irrespective of the number of maximum rows they are split). In that way I can use the logic on every file for Extraction of data

 01/04/2019  01/04/2019                           0.00               0.00          812,430.84      B/F ...                 
 02/04/2019  02/04/2019 000000082847         50,000.00               0.00          762,430.84      SELF CASA. CHEQUE       
                                                                                                   WITHDRAWAL                     
 03/04/2019  03/04/2019                           7.20               0.00          762,423.64      SMS ALERT CHG INCL OF   
                                                                                                   GST-Q4 2018                     
 11/04/2019  11/04/2019 910116174465            740.00               0.00          761,683.64      CART CARR              
                                                                                                   PURCHASE-DELICACIES.                     
                                                                                                   ISHISHISHISHISIND                     
 15/04/2019  15/04/2019 910306119640          2,813.00               0.00          758,870.64      CART CARR              
                                                                                                   PURCHASE-FINESSE,,                     
                                                                                                   ISHISHISHISHISIND                     
 17/04/2019  17/04/2019                           0.00         500,000.00        1,258,870.64      UARTHKIEI               
                                                                                                   TRADERS-HDFCR5201904177                     
                                                                                                   4193606                     
 18/04/2019  18/04/2019                           0.00         100,000.00        1,358,870.64      IMPS CR.:XX5007:PRI     
                                                                                                   SKUANTESWPS:BILLPAY:                     
                                                                                                   18042019 095225:                     
                                                                                                   910809123296                     
 18/04/2019  18/04/2019 910804827306         10,000.00               0.00        1,348,870.64      ATM. CASH WITHDRAWAL.   
                                                                                                   ON US                     
 18/04/2019  18/04/2019 082848            1,304,334.00               0.00           44,536.64      DLXBR52019041800500142- 
                                                                                                   MANGEENI DOLEIERIES                     
                                                                                                   CO LTD

Thank you

PeterJones

@Suresh-Reddy ,

This forum isn’t a regex writing service. I’ll give you one last regex for this, but at some point, you need to start studying the regex syntax and figuring it out on your own.

Because it’s at the end of a line, rather than in the middle, it’s much easier. I would check for newline followed by 99 spaces, and replace that with nothing (or with one space):

FIND = \R\x20{99}
REPLACE = empty, or a single space, or \x20
MODE = regular expression

Determine whether to use a space or empty in the replacement based on whether your data has spaces between last alphanumeric and the line-ending.

PeterJones

@Suresh-Reddy ,

I forgot I hadn’t shared my generic regex advice. Please read and understand this for any future questions

----

Do you want regex search/replace help? Then please be patient and polite, show some effort, and be willing to learn; answer questions and requests for clarification that are made of you. All example text should be marked as plain text using the </> toolbar button or manual Markdown syntax. Screenshots can be pasted from the clipboard to your post using Ctrl+V to show graphical items, but any text should be included as literal text in your post so we can easily copy/paste your data. Show the data you have and the text you want to get from that data; include examples of things that should match and be transformed, and things that don’t match and should be left alone; show edge cases and make sure you examples are as varied as your real data. Show the regex you already tried, and why you thought it should work; tell us what’s wrong with what you do get… Read the official NPP Searching / Regex docs and the forum’s Regular Expression FAQ. If you follow these guidelines, you’re much more likely to get helpful replies that solve your problem in the shortest number of tries.

Suresh Reddy

@PeterJones

Thank you for the kind help. I am completely new to Notepad++ and Regex. As you have mentioned in the previous about the ease of having description column at end, I tried moving the column to end in Notepad++. However I was not able to figure out the Regex code. I would be patient enough and try the syntax before posting.

Suresh Reddy

@PeterJones

I have a basic query. How did you find out the number of spaces i.e 99 followed by a new line. Is it manual processes of counting or Does Notepad++ has any feature to find out the same?

Thank you

PeterJones

@Suresh-Reddy ,

I could have counted the spaces – easier to do with View > Show Symbol > Show Whitespace and TAB

Or, with my cursor at “WITHDRAWL” from your example data, Notepad++ says the cursor is at column 100, which means that there are 99 spaces before:

Or I could select all the spaces, and the status bar tells me the selection is 99 characters wide:

Any of those will work

Suresh Reddy

Thank you for the detailed explanation.