GURU NEEDED - Stripping, reformatting, saving HTML...

Gabriele Cripezzi

Hello everybody! Gabriele posting here for the first time.

I have a huge txt file with product descriptions that I need for my online store. I’d like to export it to a separate HTML file after stripping out a bunch of other text I don’t need.

I’m posting screenshots of the BEFORE and AFTER, plus a link to a ZIP file with the source and destination files in txt and html.
I’d like the exported file to carry the CR and line breaks over to the HTML so that I get the very same formatting as shown in the destination.

Screenshots:

Files:

SOURCE
DESTINATION TXT
DESTINATION HTML FORMAT

I thank you all very much in advance!
G

guy038

Hello Gabriele,

After about two hours, I managed to build the right regular-expression Search/Replacement to change your source text into the HTML destination text :-) Quite tricky, indeed !

The main ideas, which are used, are :

All the lines, searched, are ENTIRE lines, with their EOL character(s).
Search, first, for a line containing the string CC1, followed by other stuff, which is to be modified, OR for an ENTIRE line, which is to be deleted.
The string to keep, in each line, begins at the first non-blank character after the regex .*CC1 +\d+ + and ends at the last non-blank character before any range of trailing spaces…
For the line containing CC1, search for the following number 1 or 22, to identify whether it's the first or the last line to change ( We must add , before the first #1 line and \r\n\r\n after the last #22 line ).
Try to match, after the other lines (#2 to #21), any following line(s) containing at least 70 spaces, which mark(s) the end of a paragraph ( In that case, we must add, at the end of the "standard" line, \r\n\r\n, or ELSE,  \r\n ).
As usual, in the search part, the \R syntax represents any EOL character(s) of a line ( \r\n in a Windows file, \n in a Unix/OSX file or \r in an old Mac file ).
In the replacement part, we use the special (?nxxx:yyy) syntax, which re-writes the string xxx if the group n was matched and re-writes the string yyy, if group n was NOT matched.

So, follow the few steps below :

Open your source text in a new tab of N++
Open the Replace dialog ( CTRL + H )
In the Find what zone, type (?-s).*CC1 +((1)|(22)|\d+) +(.+[^ ]) +\R(.+ {70,}.*\R)*|(.*(\R|\z))
In the Replace with zone, type (?6:(?2)\4(?3\r\n\r\n \r\n:(?5\r\n\r\n<\p>: \r\n)))
Select the Regular expression search mode
Finally, click on the Replace All button

Et voilà ! I obtained exactly the same text as in your link :

http://rc-santa.com/temp/destination-html.txt

Notes on the SEARCH regex :

The (?-s) syntax forces the dot symbol to represent a standard character only ( not an EOL character ).
The part .*CC1 matches the string CC1, preceded, from the beginning of the line, by any range of standard characters.
The part +((1)|(22)|\d+) + matches the number 1 or 22 or any OTHER number, surrounded by some blank characters.
The part (.+[^ ]) +\R matches any range of standard characters, ending with a non-blank character, then followed by a non-empty range of blank characters and the EOL character(s).
If this first alternative can’t match, the second alternative (.*(\R|\z)), then, matches any ENTIRE line, with its EOL character(s) OR at the very end of the file.
The different groups are
- The group 1, ((1)|(22)|\d+), which represents any number, after the CC1 string.
- The group 2, (1), which represents the number 1, after the CC1 string.
- The group 3, (22), which represents the number 22, after the CC1 string.
- The group 4, (.+[^ ]), which represents the text to rewrite in replacement.
- The group 5, (.+ {70,}.*\R), which represents any possible following line containing at least 70 blank characters, with its EOL character(s).
- The group 6, (.*(\R|\z)), which represents any ENTIRE line, with its EOL character(s) OR located at the very end of the file, which DOESN’T contain the string CC1.
- The group 7, (\R|\z), which represents some EOL character(s) OR the zero length string match at the very end of the file.
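To experiment with these groups outside Notepad++, here is a rough Python sketch of the same search regex. It is only an approximation: Python's re module has no \R, so the port assumes "\n" line endings, and it uses \Z (Python's end-of-string assertion) in place of \z. The sample lines are invented and merely shaped like the ones discussed in this thread.

```python
import re

# Port of the "Find what" regex. Assumptions: "\n" line endings only, so \R
# becomes \n, and \z becomes \Z. The (?-s) prefix is dropped because "." already
# excludes "\n" by default in Python.
SEARCH = re.compile(
    r".*CC1 +((1)|(22)|\d+) +(.+[^ ]) +\n(.+ {70,}.*\n)*|(.*(\n|\Z))"
)

# Invented sample, not taken from the real source file.
SAMPLE = "\n".join([
    "HEADER LINE TO DELETE",
    "AACR1035AAC11035M CC1  1 First line.   ",
    "AACR1035AAC11035M CC1  2 End of paragraph.   ",
    "AACR1035AAC11035M CC1  3" + " " * 75,   # line with 70+ spaces
    "AACR1035AAC11035M CC1 22 Last line.   ",
    "AACU1821AAC41821C SC3 24 ir/jl",        # last line, no EOL
])


def describe(text):
    """Return one (kind, kept_text) pair per non-empty match."""
    rows = []
    for m in SEARCH.finditer(text):
        if not m.group(0):
            continue  # zero-length match at the very end of the text
        if m.group(6) is not None:
            rows.append(("delete", None))               # line without CC1
        elif m.group(2):
            rows.append(("first", m.group(4)))          # number 1
        elif m.group(3):
            rows.append(("last", m.group(4)))           # number 22
        elif m.group(5):
            rows.append(("paragraph end", m.group(4)))  # followed by a 70-space line
        else:
            rows.append(("standard", m.group(4)))
    return rows


for row in describe(SAMPLE):
    print(row)
```

Each printed pair shows which branch of the regex fired for a line and, for the CC1 lines, the text captured by group 4.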

Notes on the REPLACEMENT regex :

If group 6 is matched, (?6:, we rewrite nothing ( it’s the text to be deleted ), ELSE :
If group 2 is matched, (?2) ,we re-write .
Then, in all cases, with \4, we rewrite the group 4 ( the user text ).
If group 3 is matched, (?3\r\n\r\n \r\n, we re-write the last paragraph and the formatting line, with its EOL character(s), ELSE :
If group 5 is matched, (?5\r\n\r\n<\p>: \r\n), we re-write the change of paragraphs, with a blank line between. If NOT, we re-write  \r\n.
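The same decision tree can be sketched in Python with a replacement callback, since Python's re module has no conditional replacement syntax. This is only an illustration: the four marker strings below are placeholders invented for the example, NOT the HTML tags used in the real replacement, and the pattern assumes "\n" line endings with \Z standing in for \z.

```python
import re

# Python port of the search regex (assumes "\n" line endings, \Z instead of \z).
SEARCH = re.compile(
    r".*CC1 +((1)|(22)|\d+) +(.+[^ ]) +\n(.+ {70,}.*\n)*|(.*(\n|\Z))"
)

# Placeholder markers, hypothetical: substitute the real markup you need.
FIRST_PREFIX = "[OPEN]"
LAST_SUFFIX = "[CLOSE]\n"
PARAGRAPH_SUFFIX = "[PARAGRAPH]\n\n"
LINE_SUFFIX = "[BREAK]\n"


def replacement(m):
    """Mirror the nested (?6:(?2...)\\4(?3...:(?5...:...))) logic."""
    if m.group(6) is not None:                 # group 6: line to delete
        return ""
    out = FIRST_PREFIX if m.group(2) else ""   # group 2: first line (#1)
    out += m.group(4)                          # group 4: the text to keep
    if m.group(3):                             # group 3: last line (#22)
        out += LAST_SUFFIX
    elif m.group(5):                           # group 5: end of a paragraph
        out += PARAGRAPH_SUFFIX
    else:                                      # any other CC1 line
        out += LINE_SUFFIX
    return out


def convert(text):
    return SEARCH.sub(replacement, text)


# Invented sample, not taken from the real source file.
SAMPLE = "\n".join([
    "HEADER LINE TO DELETE",
    "AACR1035AAC11035M CC1  1 First line.   ",
    "AACR1035AAC11035M CC1  2 End of paragraph.   ",
    "AACR1035AAC11035M CC1  3" + " " * 75,
    "AACR1035AAC11035M CC1 22 Last line.   ",
    "AACU1821AAC41821C SC3 24 ir/jl",
])

print(convert(SAMPLE))
```

Keeping the markers as named constants makes it easy to swap in other markup without touching the regex.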

Best Regards,

guy038

P.S. :

To get a first idea about it :

Uncheck the View - Word wrap option and set the View - Show Symbol - Show All Characters option
Click on the Find button, to notice the different consecutive matches found, once you reached the lines containing the string CC1 !

Gabriele Cripezzi

Looking forward to being in the office tomorrow to try. :)
I really appreciate the effort, man. I am speechless right now.

guy038

Hi, Gabriele,

I slightly changed my previous post so that the regex also matches the last line of your source text, which doesn’t end with any EOL character. I added the \z assertion, which matches at the very end of your source file. Otherwise, the last unwanted line, below :

AACU1821AAC41821C SC3 24 ir/jl

would have wrongly been left at the end of your HTML destination file.
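The effect of the end-of-file alternative can be seen in a small Python sketch. It assumes "\n" line endings and uses \Z, Python's end-of-string assertion, as the counterpart of \z; the first sample line is invented.

```python
import re

# The last line has no EOL character, as in the source file discussed above.
text = "some unwanted line\nAACU1821AAC41821C SC3 24 ir/jl"

# Without an end-of-text alternative, the final line is never matched,
# so it is left in place after the replacement.
without_eof = re.sub(r".*\n", "", text)

# With (\n|\Z), the final line is matched and deleted too.
with_eof = re.sub(r".*(\n|\Z)", "", text)

print(repr(without_eof))
print(repr(with_eof))
```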

Cheers

guy038

Gabriele Cripezzi

@guy038 said:

(?6:(?2)\4(?3\r\n\r\n \r\n:(?5\r\n\r\n<\p>: \r\n)))

guy, you are the man!

I ran it and it worked out as you expected, but not as I expected, but it’s my fault. I didn’t explain the job accurately enough.

My destination file has only one product, but the source file contains 3 products. So I need all those :)
Anyway, the file I need to process is huge, with 80k+ products, so I need your function to process them all. Maybe Notepad++ won’t be able to run it, so I’ll have to split the document into several parts, which won’t be a problem.

Another mistake I made was not mentioning that I need the product code (i.e.: AACR1035) at the beginning of every description block so that I have a reference when I import into the DB.

This said, the very best final result would be having a CSV formatted document like THIS ONE

Sorry again for not being very clear and, again, my compliments for such a result!
G

Gabriele Cripezzi

@guy038 said:

(?-s)

NOTE: I don’t need the last 2 lines:

sdw 6/12/01
ir/jl

guy038

Hello Gabriele,

Ah ! As usual, the problem is not the regex itself but, rather, fully understanding what exactly you need !

Firstly, when I opened your file destination-2.csv in Excel, all your text was curiously split into only two cells : column A contains practically all your text, except for the final tag , which is located in column B, due to the semi-colon of the form  . That semi-colon normally acts as a field separator in a CSV document !
Secondly, in column A, after the product code, you begin the text with a double-quote delimiter and end it, in column B, after the  tag, with the same delimiter. However, are you aware that you already have such delimiters in your last formatting line ( style="text-align:right" ) ?
Thirdly, you said :

My destination file has only one product, but the source file contains 3 products. So I need all those :)

I don’t quite understand what you mean. Of course, I would never change your source files. Just copy your source file as a destination file, then run the regex(es) on this destination file only !

Fourthly, when you spoke about the product code, you mentioned the value AACR1035, but the complete string is AACR1035AAC11035M. Of course, I saw that the number seems repeated. So, which string would you write in your destination file ?
Finally, and most importantly, to get a good idea of what has to be done, you could send me your file of 80k+ products, if you don’t mind and if it is NOT confidential, of course ! My e-mail address is :

Surely, this file will raise some other questions, but we’ll get closer to the solution !

BTW, don’t worry about having to split your file. I don’t think it’ll be necessary. However, the processing time of the final regex S/R may be significant. Anyway, if I manage to run it on my old XP laptop, it should be OK on a more recent configuration :-))

Cheers,

guy038

Gabriele Cripezzi

destination-2.csv
I don’t see where you got this file from. There is only destination.csv on my server. :)
The description you gave me of that file is not as I see it in Notepad++
I’ll send you the file RARed via email. It contains only one line of 2 cells separated by the comma with both values quoted (important) as the descriptions contain commas.

“are you aware that you, already, have such delimiters, in your last formatting line?”
No I was not aware. I didn’t pay attention to it. Thanks for noticing it.
So we need to have a different separator, I guess. “|” would be fine. I can use any separator I want so… let’s go for the |

Column A contains the product code “AACR1035” (the first part of the code, so that AACR1035AAC11035M becomes “AACR1035”). It’s the important part, the SKU.
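For anyone producing such a file with a script instead, here is a minimal Python sketch of the layout described above: two cells per row, separated by "|", both quoted. It assumes the SKU is always the first 8 characters of the full code (only one example is given in this thread), and the description text is invented. The csv module doubles any double quote inside a quoted cell, which takes care of values such as style="text-align:right".

```python
import csv
import io

SKU_LENGTH = 8  # assumption: the SKU is the first 8 characters of the code


def to_pipe_csv(products):
    """Write (full_code, description) pairs as quoted, pipe-separated rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="|", quoting=csv.QUOTE_ALL)
    for full_code, description in products:
        writer.writerow([full_code[:SKU_LENGTH], description])
    return buffer.getvalue()


# Invented description, only to show commas and embedded double quotes.
sample = [
    ("AACR1035AAC11035M", 'First line, with a comma and style="text-align:right"'),
]
output = to_pipe_csv(sample)
print(output)
```

Reading the result back with csv.reader and delimiter="|" returns the original two values, embedded quotes included.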

“My destination file has only one product, but the source file contains 3 products. So I need all those :)”
Sorry, I’ll try again… :) (even though at this point we don’t need that, since we can go for the CSV)
The destination file I provided was just to show you the result for ONE product, but the final destination file (CSV) needs to contain all the products.

Here is the link to a part of the complete file. It’s 1/5 of the original.
Here is the complete one.

Thanks a lot again!
G

guy038

Gabriele,

Really sorry, but your two links don’t seem to work.

The first one, relative to the 1/5 of the original file, doesn’t work at all :-((
The second one, relative to the complete file, opens the main page of the R.C… Santa site, but, even after clicking on some links, of this main page, I could not get your RAR archive ?!

BTW, what is the size of your complete file ? Maybe you could send me, by e-mail, part of it ( 1/5 or even less )

I did receive your destination.rar file, attached to your e-mail. Thanks. I will probably have to ask you some other questions about it, but first, I would prefer to get your file ( or a subset of it ) in order to have a general idea, about the tasks to do !

Cheers,

guy038

Gabriele Cripezzi

Here is the link to a part of the complete file. It’s 1/5 of the original.
Here is the complete one.

The first link doesn’t seem to be saved correctly by the script here on this website.
Links also sent via email.

guy038

Hi Gabriele,

Yeah ! This time, your links, sent by e-mail, are OK. So, I now have :

The technoteCOMPLETE.rar archive, from which I extracted the huge technote.txt file, of 183 MB !
The Technote1.rar archive, from which I extracted the technote1.csv file, of 21.8 MB

Just note that, in your last post, the link of the partial file is still wrong !

Finally, I think that the right syntax of these two links is, simply :

http://www.rc-santa.com/temp/technote1.rar

http://www.rc-santa.com/temp/technoteCOMPLETE.rar

Well. I’m going to take a look at your two huge files !

See you soon,

Cheers

guy038

Gabriele Cripezzi

“Just note that, in your last post, the link of the partial file is still wrong !”

Yeah… there must be something wrong with this forum script when parsing URLs. I tried to work on it, but after 180 secs you can’t edit anymore, so I couldn’t delete the links.