Join Lines that are breaking

Ryan Heatherly

Hi,

I’m exporting Oracle and other software reports into text, and I’m trying to import them into excel.

The problem I’m having is that the lines in these reports are split into two lines if the character count exceeds a certain threshold. For large volume of transactions its quite difficult to clear them up…

I was wondering if there’s any way to join these broken lines on Notepad++ before importing them into excel.

Here’s the file;
https://ufile.io/b1t52

Terry R

Here’s a starting point. It may not be your final solution as I’ve made some assumptions.
First assumption is that any line of 1 to 50 characters must be an extension of the line before.
Second assumption is that the max line length is 401 characters (from your example file).
Third assumption is that there will only be 1 line break, thus you would not have a line with 401 chars followed by another 401 chars and then a 3rd line of whatever amount of chars.
Fourth assumption is that the line line must be followed by the last line break. If not then a manual check might be required to check if it needs combining.

That given, the numbers can be easily altered in the regex expression.

Find:(.{401})\R(.{1,50})\R
Replace:\1\2

What it does is look for a line with exactly 401 characters. It must then be followed by a line with 1 to 50 characters. If the test is met then the line break between is removed (\R).

Terry R

If my previous idea doesn’t work due to completely variable line lengths, that is to say it’s possible for an actual line to be less than 50 characters, yet an extension to a line be more than 50 characters, then another idea is to:
Have the report include either another field which would not exist in any other area of the line, or have it print at the start of each line a special character sequence, i.e. “xZXz”.
Then do a search and replace looking for this sequence (lets use my example)and replacing with \RxZXz. At this point we know the start of the line must be a line break followed by the char sequence xZXz.
We then remove all line breaks which do NOT occur immediately before the character sequence. Lastly we can remove the special character sequence if you wish to get back to the original report.
So we are talking about 3 separate find/replace runs to complete the job.

So we have
Find:xZXz
Replace:\RxZXz
followed by
Find:\R(?!xZXz)
Replace: “this line is empty”
and lastly
Find:\RxZXz
Replace:\R
or
Find:xZXz
Replace: “this line is empty”

As always, it’s a good idea to back up the source data and test the result, generally by running it manually at first and also comparing to what you expected to get.

Terry

guy038

Hello, @ryan-heatherly, @terry-r and All,

Assuming that, in your reports, any second line, which is wrapped, does not exceed 30 characters, an easy solution is to use the following regex S/R :

SEARCH (?-s)\R(?=.{1,30}$)

REPLACE Leave EMPTY

Of course select the Regular expression search mode and tick the Wrap around option

Et voilà ! Your lines which are split in two parts should be joined and ready for additional treatments in Excel

This regex tries to match any line-break ( \R ) but ONLY IF the following line contains between 1 and 30 characters maximum !

If so, the line-break is, then, deleted => The two lines are joined

Best regards,

guy038

Ryan Heatherly

Hi @guy038 @terry ,

thanks for the Response.

Both of the search queries are finding the lines, but there’s an issue that sometimes the lines are split into 3 lines. The first allows a maximum of 400 characters, so I amended “Find:(.{401})\R(.{1,50})\R” with 400.

However it’s not functioning as perfect because the second line could have a maximum of 179 characters which would then be split into the third line. So effectively only the first and second line would be merged but the remaining third line wouldn’t.

Would it be easier if we searched for a row that starts with “00~000000~0000~” ? the zero characters could be any numbers in that series.
But then how would we have all the rows merged until it finds the next line that begins with “00~000000~0000~”.

PS. “~” is my delimiter for this text to be imported into Excel.

Terry R

You will see my first go at an answer did make some assumptions, one of which you have just confirmed. I think given that new information, it makes the line lengths too variable to use that to determine where the breaks were added. However the addition of another group of characters, namely your number group with “~” in the middle does help.

So my new suggestion is we search for any CR/LF character which is NOT followed by your “number grouping” and remove them.
Find: \R(?!\d{1,2}~\d{1,6}~\d{1,4}~)
Replace: \1

I’ve included the possibility that the number group you refer to could be shorter in length but not longer than your example. Thus 1~2~3~ would be just as valid as 12~345678~9012~. If that is not the case then change the {1,2} to {2}, the {1,6} to {6} and the {1,4} to {4}. I hope you get the concept.

I haven’t actually tested this, but give it a go. It’s sleepy time time, I’ll check the forum in another 8 hrs or so.

Terry

Ryan Heatherly

Hi Terry,

Awesome Job ! :)

That worked like a charm on data of 783 rows!

I used this;

Find: \R(?!\d{2}~\d{16}~\d{4}~)

Because my data would always be like the following format;
“01~000002~0003”

Thanks a million guys !!! This has literally saved me a lot of time, especially since i’ve got to keep running the report and updating it!

I can now move on to my next report which has a similar issue ! :) This has save

Ryan Heatherly

What would be the best way to go joining broken lines with the following format?

Status,Company Number,Order Description,Owner,Order Number,Ordered Date,Creation Date,Supplier Name,Supplier Number,Product Code,Product,Quantity Ordered,Quantity Unit,Net Unit Price,Required Delivery Date,Quantity Received,Actual Delivery Date,Cost Centre,Cost Centre Description,Account Code,Account Code Description,Analysis Code,Analysis Code Description,Sub Account Code,Sub Account Code Description,Line Net Sum,Line VAT Sum,Line Gross Sum,Cluster,Region,

(there are three commas after region, but it’s not showing up here)

Terry R

Glad my regex did so well, and I hadn’t even tested it. I’m actually new to this as well. I had a similar issue and others helped me so I feel it’s time to pay that forward (where I can).

As to your latest question. The concept of the regex I supplied will work with some minor amendments. Some assumptions are made:

the 3 commas will only appear at the end of the line
nothing will ever appear between the 3 commas

Find: (?!,)\R
Replace: <empty line> so nothing goes in the replace field

I had in my previous answer \1 in the replace field, I don’t actually think that was necessary, the same goes for this regex.

So the expression says look for a CR/LF and so long as it doesn’t appear directly behind 3 commas, then delete it.

I am a bit concerned though that you have 3 commas together and the format suggests that commas are used between all fields. Are you sure that the 3 commas will never have data between them?

Terry

Terry R

Slight typo in my last regex, it should have been
(?!,)\R
Somehow only 1 comma had appeared where there should have been 3 together.

Terry

Terry R

So I’m not going mad. Something is happening to my typing. I know for certain that I typed 3 commas together in that last regex. Yet on my screen it’s only showing 1.
This will be a test. In all cases 3 commas should appear together.
,
“,”
/,/,/,/
/, , ,/
Where I’m typing I have the 3 commas, on the right side is a preview, and that only shows 1 in the first 2 cases. The other 2 have another character between commas so they show correctly in preview.

Now I need some assistance. Can anyone provide me an answer to why my characters are going missing?

Terry

Terry R

Ryan, i have the answer as to why your 3 commas (and mine) didn’t show. The box you type in is an interpreter and displays what it believes you really wanted. I found another post which alerted me to what will be my most favourite character on this forum going forward, the grave accent. That’s the one on the same key as your ~, thus “`”.
If I type what my regex should have been using this around the characters it should come out correctly. It will also come out highlighted.

Find: (?!,,,)\R

My preview is showing it correctly, lets hope it posts it the same way.

Can anyone provide a FAQ on this markup/Markdown (I’ve seen something that says it’s called Markdown) interpreter? I’m looking but unsure of where it should be.

Terry

PeterJones

When you reply to a post, in the upper-right corner of the typing window is the word COMPOSE ?. Click on that question mark. It pops up a window with a link to the Markdown documentation.

guy038

Hi, @terry-r and All,

In addition to Peter’s information about Markdown syntax, there is also a N++ Markdown Viewer plugin, though I have not tested it, yet !

Refer to :

https://github.com/nea/MarkdownViewerPlusPlus

And the latest v.0.8.2 release, either in 32 and 64 bits, can be downloaded from below :

https://github.com/nea/MarkdownViewerPlusPlus/releases

Best Regards,

guy038

PeterJones

@guy038, @terry-r,

I occasionally use the MarkdownViewer++ plugin… it works fairly well for rendering markdown that I’m editing in Notepad++ (though, often for posts, I just edit them in the forum / browser window with the preview visible). But the MarkdownViewer++ plugin doesn’t collapse multiple commas, whereas this forum does (both at preview and after posting). I spent about 10min googling markdown comma and similar, but I wasn’t able to find any documentation that claims markdown does it. Unfortunately, nodebb comma didn’t find anything saying it’s the forum software, either. So the collapsing commas are a mystery to me.

Also, @Scott-Sumner was thinking about turning this old post into an entry on our FAQ Desk, but apparently hasn’t found the Round Tuit™ yet.

Scott Sumner

@PeterJones

Also, @Scott-Sumner was thinking about turning this old post into an entry on our FAQ Desk, but apparently hasn’t found the Round Tuit™ yet.

I have begun getting “roun-tuit” but until it is in a form that has at least as much value as the old post there is little value in publishing it. It’s a time-available thing…if you have time, you are of course free to make it way better than I ever could–I’d gladly delete my in-progress draft if something good by someone else magically appears as a FAQ Desk posting. :-)