Hi, @rjm5959, @alan-kilborn, @peterjones, @coises and All,
I must apologize to @coises! My reasoning about whether an atomic group is needed or not was completely wrong :-(( Indeed, I would have been right if the regex had been:
\x0C[^\x0C]+(?=\x0C)
But the @coises regex is slightly different:
\x0C[^~\x0C]+(?=\x0C)
And because the ~ character is excluded by the negated character class [^.......] too, using an atomic group or not is quite significant for the pages containing the ~ character! Indeed:
The normal regex \x0C[^~\x0C]+(?=\x0C) forces the regex engine, as soon as a ~ is found, to backtrack, one char at a time, down to the first character of the page, right after \x0C, in every page which contains the ~ character. At each step, the next character cannot be \x0C, as it was itself matched by [^~\x0C], so the look-ahead fails every time. Only then does the engine give up and search, further on, for the next \x0C char followed by some standard characters
Thanks to the possessive quantifier ++, which behaves like an atomic group, the enhanced regex \x0C[^~\x0C]++(?=\x0C) fails right after reaching the ~ character: the regex engine immediately gives up the current attempt and searches, further on, for another \x0C character followed by some standard chars!
Do note that, if the ~ character is near the beginning of a page, right after \x0C, you cannot notice any difference, as there is almost nothing to backtrack over!
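To make the point above concrete, here is a minimal Python sketch, not the Notepad++ engine itself, showing that both forms find exactly the same pages and differ only in how much work is wasted. The sample text is made up for illustration. The ++ syntax needs Python 3.11 or later, so the sketch assumes a fallback for older versions that emulates possessiveness with a look-ahead plus a backreference.

```python
import re

FF = "\x0c"  # form feed, the page separator used in this thread


def compile_possessive():
    """Compile the possessive form; the ++ syntax needs Python 3.11+."""
    try:
        return re.compile(r"\x0C[^~\x0C]++(?=\x0C)")
    except re.error:
        # Assumed fallback for older Pythons: a look-ahead captures the
        # longest run, and \1 consumes it without allowing backtracking.
        return re.compile(r"\x0C(?=([^~\x0C]+))\1(?=\x0C)")


greedy = re.compile(r"\x0C[^~\x0C]+(?=\x0C)")
possessive = compile_possessive()

# Made-up sample: page two contains a ~, pages one and three do not.
text = FF + "page one\n" + FF + "page ~ two\n" + FF + "page three\n" + FF

greedy_hits = [m.group(0) for m in greedy.finditer(text)]
possessive_hits = [m.group(0) for m in possessive.finditer(text)]

# Both forms skip the page with the ~ and keep the two others.
print(greedy_hits == possessive_hits, len(greedy_hits))
```

In both cases the page containing the ~ is skipped. The only difference is that the greedy form retries every shorter prefix of that page before giving up.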
I did verify that using an atomic group reduces the execution time for huge files! With a 30 MB file of 159,000 lines, 1,325 of which contain the ~ char, located about 4,780 chars after the beginning of each page, the difference in execution time was already about 1.5s!!
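For anyone who wants to try a similar measurement, here is a rough Python timing sketch. It is only an assumption-laden stand-in for the test above: the data is synthetic (the helper build_pages and its parameters are hypothetical), every page contains a ~ about 4,780 chars after its start, and Python's re engine is not the one used by Notepad++, so the absolute timings will not match the figures quoted here.

```python
import re
import time


def build_pages(n_pages=200, tilde_offset=4780):
    """Hypothetical data: each page is a form feed, filler, then a ~ line."""
    filler = "x" * 79 + "\n"  # 80-character lines
    body = (filler * (tilde_offset // len(filler) + 1))[:tilde_offset]
    page = "\x0c" + body + "~ marker\n"
    return page * n_pages + "\x0c"


def time_pattern(pattern, text, repeats=3):
    """Return (match_count, best_seconds) over a few runs."""
    rx = re.compile(pattern)
    best = float("inf")
    count = 0
    for _ in range(repeats):
        start = time.perf_counter()
        count = sum(1 for _ in rx.finditer(text))
        best = min(best, time.perf_counter() - start)
    return count, best


text = build_pages()
greedy_count, greedy_time = time_pattern(r"\x0C[^~\x0C]+(?=\x0C)", text)
print(f"greedy:     {greedy_count} matches, {greedy_time:.4f}s")

try:
    # The ++ syntax is only accepted by Python 3.11 and later.
    poss_count, poss_time = time_pattern(r"\x0C[^~\x0C]++(?=\x0C)", text)
    print(f"possessive: {poss_count} matches, {poss_time:.4f}s")
except re.error:
    poss_count, poss_time = None, None
    print("possessive quantifiers are not supported by this Python version")
```

Since every synthetic page contains a ~, neither form matches anything, which is the worst case for the greedy version. Increasing n_pages or tilde_offset should widen the gap between the two timings.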
In conclusion, @rjm5959, the initial @coises regex \x0C[^~\x0C]++(?=\x0C) is the one to use with large files ;-))
BR
guy038