GURU NEEDED - Stripping, reformatting, saving HTML...

12 Posts 2 Posters 7.6k Views
  • G
    Gabriele Cripezzi
    last edited by Gabriele Cripezzi Oct 23, 2015, 6:08 PM Oct 23, 2015, 6:07 PM

    Hello everybody! Gabriele posting here for the first time.

    I have a huge TXT file with product descriptions that I need for my online store. I’d like to export them to a separate HTML file after stripping out a bunch of other text I don’t need.

    I’m posting screenshots of the BEFORE and AFTER, plus a link to a ZIP file with the source and destination files in TXT and HTML.
    I’d like the exported file to carry the CRs and line breaks over to the HTML, so that I get the very same formatting as shown in the destination.

    Screenshots:

    • SOURCE

    • DESTINATION

    • FINAL RESULT on store (after import)

    Files:

    SOURCE
    DESTINATION TXT
    DESTINATION HTML FORMAT

    I thank you all very much in advance!
    G

    • G
      guy038
      last edited by guy038 Oct 25, 2015, 9:57 AM Oct 24, 2015, 6:38 AM

      Hello Gabriele,

      After about two hours, I managed to build the right Search/Replacement, with regular expressions, to change your source text into the HTML destination text :-) Quite tricky, indeed !

      The main ideas, which are used, are :

      • All the lines, searched, are ENTIRE lines, with their EOL character(s).

      • Search, first, for a line containing the string CC1, followed by other stuff, which is to be modified, OR for an ENTIRE line, which is to be deleted.

      • The string to keep, in each line, begins at the first non blank character, after the regex .*CC1 +\d+ + and ends at the last non blank character, before any range of ending spaces…

      • For a line containing CC1, check whether the following number is 1 or 22, to identify whether it’s the first or the last line to change ( we must add <p> before the first line, #1, and </p>\r\n\r\n<p style="text-......"....</p> after the last line, #22 ).

      • After the other lines ( #2 to #21 ), try to match any following line(s) containing at least 70 consecutive spaces, which mark(s) the end of a paragraph ( in that case, we must add </p>\r\n\r\n<p> at the end of the "standard" line, or ELSE <br />\r\n ).

      • As usual, in the search part, the \R syntax represents any EOL character(s) of a line ( \r\n in a Windows file, \n in a Unix/OSX file or \r in an old Mac file ).

      • In the replacement part, we use the special (?nxxx:yyy) syntax, which re-writes the string xxx if the group n was matched and re-writes the string yyy, if group n was NOT matched.


      So, follow the few steps below :

      • Open your source text in a new tab of N++

      • Open the Replace dialog ( CTRL + H )

      • In the Find what zone, type (?-s).*CC1 +((1)|(22)|\d+) +(.+[^ ]) +\R(.+ {70,}.*\R)*|(.*(\R|\z))

      • In the Replace with zone, type (?6:(?2<p>)\4(?3</p>\r\n\r\n<p style="text-align\:right">&nbsp;</p>\r\n:(?5</p>\r\n\r\n<\p>:<br />\r\n)))

      • Select the Regular expression search mode

      • Finally, click on the Replace All button

      Et voilà ! I obtained exactly the same text as the one at your link :

      http://rc-santa.com/temp/destination-html.txt


      Notes on the SEARCH regex :

      • The (?-s) syntax forces the dot symbol to represent a standard character only ( not an EOL character ).

      • The part .*CC1 matches the string CC1, preceded, from the beginning of the line, by any range of standard characters, possibly empty.

      • The part +((1)|(22)|\d+) + matches the number 1 or 22 or any OTHER number, surrounded by some blank characters.

      • The part (.+[^ ]) +\R matches any range of standard characters ending with a non blank character, followed by at least one blank character and then the EOL character(s).

      • If this first alternative can’t match, the second alternative (.*(\R|\z)), then, matches any ENTIRE line, with its EOL character(s) OR at the very end of the file.

      • The different groups are

        • The group 1, ((1)|(22)|\d+), which represents any number, after the CC1 string.

        • The group 2, (1), which represents the number 1, after the CC1 string.

        • The group 3, (22), which represents the number 22, after the CC1 string.

        • The group 4, (.+[^ ]), which represents the text to rewrite in replacement.

        • The group 5, (.+ {70,}.*\R), which represents any possible following line containing at least 70 consecutive blank characters, with its EOL character(s).

        • The group 6, (.*(\R|\z)), which represents any ENTIRE line, with its EOL character(s) OR located at the very end of the file, which DOESN’T contain the string CC1.

        • The group 7, (\R|\z), which represents some EOL character(s) OR the zero length string match at the very end of the file.
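To see which groups take part in a match, here is a minimal Python sketch of the search regex. It is not the Notepad++ (Boost) regex itself: Python's re module has no \R or \z, so EOL and DOT below are stand-ins, and \Z replaces \z. The sample lines and their column layout are hypothetical, not taken from the real source file.

```python
import re

EOL = r"(?:\r\n|\n|\r)"  # stands in for \R, which Python's re module lacks
DOT = r"[^\r\n]"         # stands in for "." under (?-s)

SEARCH = re.compile(
    rf"{DOT}*CC1 +((1)|(22)|\d+) +({DOT}+[^ \r\n]) +{EOL}"  # groups 1 to 4
    rf"({DOT}+ {{70,}}{DOT}*{EOL})*"                         # group 5
    rf"|({DOT}*({EOL}|\Z))"                                  # groups 6 and 7
)

def participating_groups(text):
    """Return {group number: captured text} for the groups that took part in the match."""
    m = SEARCH.match(text)
    return {n: m.group(n) for n in range(1, 8) if m.group(n) is not None}

PREFIX = "AACR1035AAC11035M   CC1   "          # hypothetical line layout
first = PREFIX + "1  First line of text   \n"
middle = PREFIX + "7  Middle line   \n"
last = PREFIX + "22  Last line   \n"
junk = "AACU1821AAC41821C   SC3   24   ir/jl"  # no CC1 and no EOL

for name, line in [("first", first), ("middle", middle), ("last", last), ("junk", junk)]:
    print(name, participating_groups(line))
```

Only the first line sets group 2, only the last line sets group 3, and a line without CC1 sets groups 6 and 7 instead of groups 1 to 4.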


      Notes on the REPLACEMENT regex :

      • If group 6 is matched, (?6:, we rewrite nothing ( it’s the text to be deleted ), ELSE :

      • If group 2 is matched, (?2<p>), we re-write <p>.

      • Then, in all cases, with \4, we rewrite the group 4 ( the user text ).

      • If group 3 is matched, (?3</p>\r\n\r\n<p style="text-align\:right">&nbsp;</p>\r\n, we re-write the last paragraph and the formatting line, with its EOL character(s), ELSE :

      • If group 5 is matched, (?5</p>\r\n\r\n<\p>:<br />\r\n), we re-write the change of paragraphs, with a blank line between. If NOT, we re-write <br />\r\n.
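The replacement logic described in these notes can also be sketched outside Notepad++. Python has no (?n...:...) conditional replacement, so the sketch below emulates it with a callback; the EOL and DOT stand-ins, the rewrite and convert names, and the sample text are all assumptions made for illustration, not part of the original S/R.

```python
import re

EOL = r"(?:\r\n|\n|\r)"  # stands in for \R
DOT = r"[^\r\n]"         # stands in for "." under (?-s)

SEARCH = re.compile(
    rf"{DOT}*CC1 +((1)|(22)|\d+) +({DOT}+[^ \r\n]) +{EOL}"
    rf"({DOT}+ {{70,}}{DOT}*{EOL})*"
    rf"|({DOT}*({EOL}|\Z))"
)

def rewrite(m):
    """Emulate the nested (?6:(?2...)\\4(?3...:(?5...:...))) replacement."""
    if m.group(6) is not None:   # a line without a usable CC1 part: delete it
        return ""
    out = "<p>" if m.group(2) is not None else ""
    out += m.group(4)            # the text to keep
    if m.group(3) is not None:   # line #22: close the paragraph, add the formatting line
        out += '</p>\r\n\r\n<p style="text-align:right">&nbsp;</p>\r\n'
    elif m.group(5) is not None:  # followed by a 70-space line: change of paragraph
        out += "</p>\r\n\r\n<p>"
    else:
        out += "<br />\r\n"
    return out

def convert(text):
    return SEARCH.sub(rewrite, text)

PREFIX = "AACR1035AAC11035M   CC1   "  # hypothetical line layout
sample = (
    "AACR1035AAC11035M   SC3   24   sdw 6/12/01\n"
    + PREFIX + "1  First line of text   \n"
    + PREFIX + "2  second line   \n"
    + PREFIX + "3" + " " * 75 + "\n"   # paragraph separator line
    + PREFIX + "4  New paragraph   \n"
    + PREFIX + "22  Last line   \n"
    + "AACU1821AAC41821C   SC3   24   ir/jl"  # last line, no EOL
)

print(convert(sample))
```

A callback keeps each branch of the nested conditional on its own line, which makes it easier to check against the notes above than a single replacement string.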

      Best Regards,

      guy038

      P.S. :

      To get a first idea about it :

      • Uncheck the View - Word wrap option and set the View - Show Symbol - Show All Characters option

      • Click on the Find button, to notice the different consecutive matches found, once you reached the lines containing the string CC1 !

      • G
        Gabriele Cripezzi
        last edited by Oct 25, 2015, 2:28 AM

        Looking forward to being in the office tomorrow to try it. :)
        I really appreciate the effort, man. I am speechless right now.

        • G
          guy038
          last edited by guy038 Oct 25, 2015, 9:34 AM Oct 25, 2015, 9:28 AM

          Hi, Gabriele,

          I slightly changed my previous post, in order to match the last line of your source text, which doesn’t end with any EOL character. So, I added the \z assertion, to match the very end of your source file. Otherwise, the last unwanted line, below :

          AACU1821AAC41821C         SC3            24                                                                             ir/jl

          would have been wrongly rewritten, at the end of your HTML-destination file
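The effect of the \z assertion can be reproduced with a small Python sketch. Python spells the end-of-text assertion \Z, and the EOL pattern is a stand-in for \R. The two-line text is a made-up example whose last line has no EOL.

```python
import re

EOL = r"(?:\r\n|\n|\r)"  # stands in for \R

# Two unwanted lines; the last one has no EOL character
text = "some unwanted line\nAACU1821AAC41821C   SC3   24   ir/jl"

# Without the end-of-text alternative, the last line is never matched, so it survives
without_z = re.sub(rf"[^\r\n]*{EOL}", "", text)

# With (EOL | end of text), the last line is matched and deleted too
with_z = re.sub(rf"[^\r\n]*(?:{EOL}|\Z)", "", text)

print(repr(without_z))
print(repr(with_z))
```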

          Cheers

          guy038

          • G
            Gabriele Cripezzi
            last edited by Oct 25, 2015, 2:57 PM

            @guy038 said:

            (?6:(?2<p>)\4(?3</p>\r\n\r\n<p style="text-align\:right">&nbsp;</p>\r\n:(?5</p>\r\n\r\n<\p>:<br />\r\n)))

            guy, you are the man!

            I ran it and it worked out as you expected, but not as I expected, but it’s my fault. I didn’t explain the job accurately enough.

            My destination file has only one product, but the source file contains 3 products. So I need all those :)
            Anyway, the file I need to process is huge, with 80k+ products, so I need your function to process them all. Maybe Notepad++ won’t be able to run it, so I’ll have to split the document into several parts, which won’t be a problem.

            Another mistake I made was not mentioning that I need the product code (i.e.: AACR1035) at the beginning of every description block, so that I have a reference when I import into the DB.

            This said, the very best final result would be having a CSV formatted document like THIS ONE

            Sorry again for not being very clear and, again, my compliments on such a result!
            G

            • G
              Gabriele Cripezzi
              last edited by Oct 25, 2015, 3:19 PM

              @guy038 said:

              (?-s)

              NOTE: I don’t need the last 2 lines:

              sdw 6/12/01
              ir/jl

              • G
                guy038
                last edited by guy038 Jun 2, 2021, 8:56 PM Oct 26, 2015, 8:03 PM

                Hello Gabriele,

                Ah ! As usual, the problem is not the regex itself but, rather, fully understanding what exactly you need !

                • Firstly, when I opened your file destination-2.csv in Excel, all your text was curiously split into only two cells : column A contains practically all your text, except for the final tag </p>, which is located in column B, because of the semi-colon of the &nbsp; entity. Indeed, a semi-colon normally acts as a field separator in a CSV document !

                • Secondly, in column A, after the product code, you begin the text with a double quotes delimiter and end it, in column B, after the </p> tag, with this same delimiter. However, are you aware that you, already, have such delimiters, in your last formatting line ( style="text-align:right" ) ?

                • Thirdly, you said :

                My destination file has only one product, but the source file contains 3 products. So I need all those :)

                I don’t quite understand what you mean. Of course, I would never change your source files. Just copy your source file as a destination file. Then run the regex(es) on this destination file only !

                • Fourthly, when you spoke about the product code, you mentioned the value AACR1035, but the complete string is AACR1035AAC11035M. Of course, I saw that the number seems repeated. So, which string would you write in your destination file ?

                • Finally, and the most important, to get a right idea of the process to do, you could send me your file of 80k+ products, if you don’t mind and if it is NOT confidential, of course !. My e-mail address is :

                This file will surely raise some other questions for me, but we’ll get closer to the solution !

                BTW, don’t worry about having to split your file. I don’t think it’ll be necessary. However, the processing time of the final regex S/R may be significant. Anyway, if I manage to run it on my old XP laptop, it should be OK on a more recent configuration :-))
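The two CSV observations above (the semi-colon of &nbsp; and the double quotes inside the style attribute) can be illustrated with Python's csv module. The row below is hypothetical, written with both values quoted and the inner double quotes doubled, which is the usual CSV escaping; it is not a copy of the real destination file.

```python
import csv

# Hypothetical one-line CSV row: SKU and HTML description, both quoted,
# with the inner double quotes doubled as CSV quoting expects
raw = '"AACR1035","<p>Text</p><p style=""text-align:right"">&nbsp;</p>"'

# Parsed with ";" as the field separator, the row breaks at the ";" of &nbsp;
as_semicolon = next(csv.reader([raw], delimiter=";"))

# Parsed with "," as the field separator, the quoting is honoured
as_comma = next(csv.reader([raw], delimiter=","))

print(len(as_semicolon))  # number of fields found with ";" as separator
print(as_comma)
```

With the comma as separator, the description comes back as a single field with its inner quotes restored, so the embedded quotes are only a problem when they are not doubled or when the reader assumes a different separator.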

                Cheers,

                guy038

                • G
                  Gabriele Cripezzi
                  last edited by Oct 26, 2015, 10:49 PM

                  destination-2.csv
                  I don’t see where you got this file from. There is only destination.csv on my server. :)
                  The description you gave me of that file is not as I see it in Notepad++
                  I’ll send you the file RARed via email. It contains only one line of 2 cells separated by the comma with both values quoted (important) as the descriptions contain commas.

                  “are you aware that you, already, have such delimiters, in your last formatting line?”
                  No I was not aware. I didn’t pay attention to it. Thanks for noticing it.
                  So we need to have a different separator, I guess. “|” would be fine. I can use any separator I want so… let’s go for the |

                  Column A contains the product code “AACR1035” (the first part of the code, so that AACR1035AAC11035M becomes “AACR1035”). It’s the important part, the SKU.
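A minimal sketch of one such row, assuming the SKU is always the first 8 characters of the full code (only the AACR1035AAC11035M example is given, so this length is a guess) and using "|" as the separator with both values quoted. The helper names sku_of and to_row are made up for illustration.

```python
import csv
import io

def sku_of(full_code):
    # Assumption: the SKU is the first 8 characters of the full code
    return full_code[:8]

def to_row(full_code, html):
    """Build one '|'-separated line with both values quoted."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="|", quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow([sku_of(full_code), html])
    return buf.getvalue()

row = to_row("AACR1035AAC11035M", '<p style="text-align:right">&nbsp;</p>')
print(row)
```

The csv writer doubles the inner double quotes, so the style="..." attribute no longer clashes with the field delimiters.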

                  “My destination file has only one product, but the source file contains 3 products. So I need all those :)”
                  Sorry, I’ll try again… :) (even though at this point we don’t need that, since we can go for the CSV)
                  The destination file I provided was just to show you the result for ONE product, but the final destination file (CSV) needs to contain all the products.

                  Here is the link to a part of the complete file. It’s 1/5 of the original.
                  Here is the complete one.

                  Thanks a lot again!
                  G

                  • G
                    guy038
                    last edited by guy038 Oct 27, 2015, 4:36 PM Oct 27, 2015, 4:33 PM

                    Gabriele,

                    Really sorry, but your two links don’t seem to work.

                    • The first one, relative to the 1/5 of the original file, doesn’t work at all :-((

                    • The second one, relative to the complete file, opens the main page of the R.C… Santa site, but, even after clicking on some links, of this main page, I could not get your RAR archive ?!

                    BTW, what is the size of your complete file ? Maybe you could send me part of it by e-mail ( 1/5 or even less )

                    I did receive your destination.rar file, attached to your e-mail. Thanks. I will probably have to ask you some other questions about it, but first, I would prefer to get your file ( or a subset of it ) in order to have a general idea, about the tasks to do !

                    Cheers,

                    guy038

                    • G
                      Gabriele Cripezzi
                      last edited by Gabriele Cripezzi Oct 27, 2015, 4:48 PM Oct 27, 2015, 4:45 PM

                      Here is the link to a part of the complete file. It’s 1/5 of the original.
                      Here is the complete one.

                      The first link doesn’t seem to be saved correctly by the script here on this website.
                      Links also sent via email.

                      • G
                        guy038
                        last edited by guy038 Oct 27, 2015, 5:38 PM Oct 27, 2015, 5:37 PM

                        Hi Gabriele,

                        Yeah ! This time, your links, sent by e-mail, are OK. So, I now get :

                        • The technoteCOMPLETE.rar archive, from which I extracted the huge technote.txt file, of 183 MB !

                        • The Technote1.rar archive, from which I extracted the technote1.csv file, of 21.8 MB

                        Just note that, in your last post, the link of the partial file is still wrong !

                        Finally, I think that the right syntax of these two links is, simply :

                        http://www.rc-santa.com/temp/technote1.rar

                        http://www.rc-santa.com/temp/technoteCOMPLETE.rar

                        Well. I’m going to have a look at your two huge files !

                        See you soon,

                        Cheers

                        guy038

                        • G
                          Gabriele Cripezzi
                          last edited by Oct 28, 2015, 4:49 PM

                          “Just note that, in your last post, the link of the partial file is still wrong !”

                          Yeah… there must be something wrong with this forum script when parsing URLs. I tried to work on it, but after 180 secs you can’t edit anymore, so I couldn’t delete the links.
