@guy038 said in Exporting pages from Hits?:
Hi, @rjm5959, @alan-kilborn, @peterjones, @coises and All,
I must apologize to @coises! My reasoning about whether or not an atomic group was needed was completely wrong :-(( Indeed, I would have been right if the regex had been:
\x0C[^\x0C]+(?=\x0C)
But the @coises regex is slightly different:
\x0C[^~\x0C]+(?=\x0C)
And because the ~ character also belongs to the negated character class [^.......], whether or not an atomic group is used makes a significant difference for the pages containing the ~ character! Indeed:
The normal regex \x0C[^~\x0C]+(?=\x0C) forces the regex engine, as soon as a ~ is found, to backtrack one char at a time, re-testing the look-ahead at each step, back to the first character of the page, right after \x0C, in every page which contains the ~ character. As the next character is never \x0C at any of these positions, the attempt finally fails, and the engine moves on to search for the next \x0C char, further on, followed by some standard characters
Due to the atomic structure (the possessive quantifier ++ behaves like the atomic group (?>[^~\x0C]+)), the enhanced regex \x0C[^~\x0C]++(?=\x0C) fails right after reaching the ~ character, which immediately forces the regex engine to give up the current attempt and search, further on, for another \x0C character, followed by some standard chars!
Do note that, if the ~ character is near the beginning of each page, just after \x0C, you will not notice any difference!
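The point above can be checked with a small script. This is only a minimal sketch in Python, not the Notepad++ (Boost) engine used in this thread, and the three-page sample text is made up for illustration. It shows that both regexes return the same pages, so the possessive form only changes how quickly a page containing ~ is rejected, not what is matched. Python accepts the ++ syntax from version 3.11 on; for older versions the sketch falls back to the usual look-ahead plus back-reference emulation of an atomic group.

```python
import re
import sys

FF = "\x0c"  # form feed, the page separator discussed in the thread

# Hypothetical sample: three pages; the second one contains '~' and must be skipped.
text = FF + "page one\nline" + FF + "page two ~ marker\nline" + FF + "page three" + FF

# Greedy version: backtracks through the page before giving up on a '~' page.
greedy = re.compile(r"\x0C[^~\x0C]+(?=\x0C)")

# Possessive version: needs Python 3.11+. On older versions, emulate the
# atomic behaviour with a capturing look-ahead followed by a back-reference.
if sys.version_info >= (3, 11):
    possessive = re.compile(r"\x0C[^~\x0C]++(?=\x0C)")
else:
    possessive = re.compile(r"\x0C(?=([^~\x0C]+))\1(?=\x0C)")

greedy_hits = [m.group(0) for m in greedy.finditer(text)]
possessive_hits = [m.group(0) for m in possessive.finditer(text)]

# Both lists hold the first and third pages; the '~' page is absent from each.
print(greedy_hits == possessive_hits)
print(len(greedy_hits))
```

Since the results are identical, the only reason to prefer the possessive form is the saved backtracking, which grows with the distance between the start of a page and its first ~.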
I did verify that using an atomic group reduces the execution time for huge files! With a 30 MB file containing 159,000 lines, 1,325 of which contain the ~ char, located about 4,780 chars after the beginning of each page, the difference in execution time was already about 1.5s!!
In conclusion, @rjm5959, the initial @coises regex \x0C[^~\x0C]++(?=\x0C) is the regex to use with large files ;-))
BR
guy038
Thanks Guy038. I will give this a try. The file I have is 7.7 million lines and there will be 89,984 page hits for nationstar. Each page is 57 lines, so that’s about 5.1 million lines. Will this work with that much data?