Hello, @arpad-zsok, @scott-sumner and All,
Since Arpad’s text contains only one | character per row, I suppose that the simple regex S/R below would be enough!
SEARCH [^"]+\|
REPLACE Leave EMPTY
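For anyone who wants to try this S/R outside Notepad++, here is a minimal Python sketch. It assumes that Python's re module behaves like the Notepad++ engine for this simple pattern, and it uses one data row of Arpad's text as input.

```python
import re

# One data row of the sample text: the value and the flag share one field.
row = ('"A","B","C","D","E","20180328171007","VIDASPC01","F838-843",'
       '"F838-843","SPT","UP Salmonella","0.07|NEGATIVE","F","20180328115445",')

# SEARCH [^"]+\|   REPLACE (empty)
# [^"]+ cannot cross a " symbol, so only the 0.07| part can be matched.
cleaned = re.sub(r'[^"]+\|', '', row)
print(cleaned)
```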
But that solution isn’t the real point of this post! While running Scott’s search regex, below:
(?-i)^((?:".+?",){11}")[0-9.]+?\|(?=NEGATIVE|POSITIVE)
against the text :
"processingIdentifier","dateTime","header/instrumentName","patientIdentifier","lastName","birthdate","sex","location","specimenIdentifier","testIdentifier","dilution","testName","result/value","unit","abnormalFlags","resultStatus","operatorName","resultDateTime","result","simple/resultCode"
"A","B","C","D","E","20180328171007","VIDASPC01","F838-843","F838-843","SPT","UP Salmonella","0.07|NEGATIVE","F","20180328115445",
"A","B","C","D","E","20180328171008","VIDASPC01","F838-844","F838-844","SPT","UP Salmonella","0.25|POSITIVE","F","20180328115446",
"A","B","C","D","E","20180328171007","VIDASPC01","F838-843","F838-843","SPT","UP Salmonella","0.07|NEGATIVE","F","20180328115445",
I noticed that it took some time (~2s) before finding the first expected match:
"A","B","C","D","E","20180328171007","VIDASPC01","F838-843","F838-843","SPT","UP Salmonella","0.07|
And I tried to understand that behavior! First, I realized that the matching delay was due to the header line, which does not contain any | symbol. Thus, the problem could be simplified!
In a new tab, let’s build a single line, made of 3 copies of the initial header line joined together with commas, and surround it with two simple sentences:
This is a test
"ProcessingIdentifier","dateTime","header/instrumentName","patientIdentifier","lastName","birthdate","sex","location","specimenIdentifier","testIdentifier","dilution","testName","result/value","unit","abnormalFlags","resultStatus","operatorName","resultDateTime","result","simple/resultCode","processingIdentifier","dateTime","header/instrumentName","patientIdentifier","lastName","birthdate","sex","location","specimenIdentifier","testIdentifier","dilution","testName","result/value","unit","abnormalFlags","resultStatus","operatorName","resultDateTime","result","simple/resultCode","ProcessingIdentifier","dateTime","header/instrumentName","patientIdentifier","lastName","birthdate","sex","location","specimenIdentifier","testIdentifier","dilution","testName","result/value","unit","abnormalFlags","resultStatus","operatorName","resultDateTime","result","simple/resultCode"
This is a text
and let’s consider the generic regex ^(?-is)(".+?",){N}"z, where N is any number > 0. Of course, as this text does not contain any lowercase letter z, the regex engine always answers "Find: Can’t find the text…". On my old laptop, I obtained, depending on the regex used, the following times before the no-match answer:
With regex ^(?-is)(".+?",){1}"z => Immediate answer
With regex ^(?-is)(".+?",){2}"z => Immediate answer
With regex ^(?-is)(".+?",){3}"z => 1s
With regex ^(?-is)(".+?",){4}"z => 2s
With regex ^(?-is)(".+?",){5}"z => 7s, with a single wrong match (the whole file contents!)
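The same experiment can be roughly reproduced outside Notepad++. The sketch below is only an approximation under stated assumptions: it uses Python's re module instead of the Notepad++ engine, it drops the (?-is) prefix (Python has no global flag-off syntax, and its defaults are already case-sensitive with a dot that does not match line breaks), and it uses a single copy of the header line so that it stays quick. The helper time_no_match is a hypothetical name, and the absolute timings will differ from the ones listed above.

```python
import re
import time

# Field names of the header line (20 fields).
FIELDS = ["processingIdentifier", "dateTime", "header/instrumentName",
          "patientIdentifier", "lastName", "birthdate", "sex", "location",
          "specimenIdentifier", "testIdentifier", "dilution", "testName",
          "result/value", "unit", "abnormalFlags", "resultStatus",
          "operatorName", "resultDateTime", "result", "simple/resultCode"]
header = ",".join('"%s"' % name for name in FIELDS)

def time_no_match(n):
    """Return (match, seconds) for ^(".+?",){n}"z run against the header line."""
    pattern = re.compile(r'^(".+?",){%d}"z' % n)
    start = time.perf_counter()
    match = pattern.search(header)
    return match, time.perf_counter() - start

# No z in the header, so every search must fail; only the time changes.
results = {n: time_no_match(n) for n in range(1, 6)}
for n, (match, seconds) in results.items():
    print(f"{{{n}}}: match={match}, {seconds:.4f}s")
```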
Why does the time needed by the regex engine to realize that there is NO match become exponential? Well, this comes from the regex part ".+?". Indeed, to understand the process, let’s use the regex ^(?-is)(".+?",){1}"z, which can also be written ^(?-is)".+?","z
Actually, due to the lazy quantifier +?, this regex means: find the smallest range of standard characters, between two " symbols, which is followed by the string ,"z. Note that the dot also matches the " and , characters, so this range may span several headers!
So, let’s add a lowercase z letter at the beginning of the second column header, as below:
"ProcessingIdentifier","zdateTime","header/instrumentName","patientIdentifier","lastName","birthdate",......
The regex part ".+?" catches the string "ProcessingIdentifier" and the part ,"z matches the ,"z string
Now, let’s add, instead, a lowercase z letter at the beginning of the third column header, as below:
"ProcessingIdentifier","dateTime","zheader/instrumentName","patientIdentifier","lastName","birthdate",......
This time, the regex part ".+?" catches the string "ProcessingIdentifier","dateTime" and the part ,"z matches the ,"z string
And so on, until a z character is inserted at the beginning of the last header:
"ProcessingIdentifier","dateTime","header/instrumentName",...................,"resultDateTime","result","zsimple/resultCode"
This time, the regex part ".+?" catches the huge string "ProcessingIdentifier","dateTime"…"resultStatus","operatorName","resultDateTime","result" ( 855 chars ) and the part ,"z matches the ,"z string, before the last header of the text ( simple/resultCode )
So, depending on the location of that z letter, the quantifier +? takes as many characters as needed to reach the ,"z string! This regex engine behavior can be considered as a foretracking operation ( by analogy to the backtracking one! )
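This growth of the lazy part can be checked with a small Python sketch. It assumes Python's re module, and it adds a capture group around .+? (not present in the regex above) only to display what the lazy part swallowed, without the outer " symbols. The two test lines are shortened versions of the examples above.

```python
import re

# Same as ^".+?","z with a capture group added around the lazy part.
pattern = re.compile(r'^"(.+?)","z')

# z at the beginning of the second header, then of the third header.
second = '"ProcessingIdentifier","zdateTime","header/instrumentName","patientIdentifier"'
third = '"ProcessingIdentifier","dateTime","zheader/instrumentName","patientIdentifier"'

captured_second = pattern.search(second).group(1)
captured_third = pattern.search(third).group(1)

print(captured_second)  # ProcessingIdentifier
print(captured_third)   # ProcessingIdentifier","dateTime
```

The further away the z letter is, the more headers the single .+? has to cross before the rest of the regex can match.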
Now, everyone can understand why, if no z letter exists in the text, the time to answer becomes exponential when the {1} quantifier is replaced with greater values: the regex engine must try every possible combination before giving up. Even with the {5} quantifier, I personally got a catastrophic foretracking :-((
Don’t forget that, for instance, the regex ^(?-is)(".+?",){11}"z can be rewritten :
^(?-is)".+?",".+?",".+?",".+?",".+?",".+?",".+?",".+?",".+?",".+?",".+?","z
And, as any part ".+?" may match one header "....", two headers "....","....", three headers "....","....","....", and so on, it’s not difficult to guess that, because of the many possible combinations, trouble is not far away :-((
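To get a feeling for the number of combinations, here is a rough count in Python. It rests on an assumption of mine, not on a measurement of the engine: in the joined 3-header line, each ".+?", group can end at any ", pair, so the number of ways to place N groups is a binomial coefficient. This is an estimate of the candidate splits, not the exact number of engine steps.

```python
from math import comb

# Field names of the header line (letter case does not matter for the count).
FIELDS = ["processingIdentifier", "dateTime", "header/instrumentName",
          "patientIdentifier", "lastName", "birthdate", "sex", "location",
          "specimenIdentifier", "testIdentifier", "dilution", "testName",
          "result/value", "unit", "abnormalFlags", "resultStatus",
          "operatorName", "resultDateTime", "result", "simple/resultCode"]

# Three copies of the header line, joined with commas, as in the test text.
line = ",".join(['"%s"' % name for name in FIELDS] * 3)

# Every ", pair is a place where one ".+?", group may stop.
separators = line.count('",')

# Choosing N increasing stop positions among them gives the candidate splits.
splits = {n: comb(separators, n) for n in (1, 2, 3, 4, 5, 11)}
for n, count in splits.items():
    print(f"{{{n}}}: {count} candidate splits")
```

Under that assumption, the count climbs from a few dozen for {1} to several million for {5}, and far beyond that for {11}.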
Two obvious solutions to that problem are to use the regexes ^(?-is)("[^,]+?",){11}"z or ^(?-is)("[^"]+?",){11}"z
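As a quick check of the [^"] variant, here is a minimal Python sketch, again assuming Python's re module and dropping the (?-is) prefix. With [^"]+? each group can only cover exactly one header, so there is a single way to place the 11 groups and the no-match answer comes at once.

```python
import re

FIELDS = ["processingIdentifier", "dateTime", "header/instrumentName",
          "patientIdentifier", "lastName", "birthdate", "sex", "location",
          "specimenIdentifier", "testIdentifier", "dilution", "testName",
          "result/value", "unit", "abnormalFlags", "resultStatus",
          "operatorName", "resultDateTime", "result", "simple/resultCode"]

# Three copies of the header line, joined with commas.
line = ",".join(['"%s"' % name for name in FIELDS] * 3)

# [^"]+? cannot cross a " symbol, so one group = one header.
fixed = re.compile(r'^("[^"]+?",){11}"z')

# No z in the line: the search fails without trying other combinations.
no_z = fixed.search(line)

# z added at the beginning of the 12th header: a match is found.
with_z = fixed.search(line.replace('"testName"', '"ztestName"', 1))

print(no_z)
print(with_z.group(0))
```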
However, I also got another, stranger solution: to use the regex ^(?-is)(".+?",){11}+"z, with the possessive quantifier {11}+. Indeed, when no z letter exists in the list of headers, the negative answer of the regex engine is immediate!
At first sight, this {11}+ syntax seems quite weird, because the {11} quantifier is a fixed count and the backtracking/foretracking process does not seem to involve that quantifier! But, actually, it means that, once the regex part ".+?",".+?",".+?",".+?",".+?",".+?",".+?",".+?",".+?",".+?",".+?", has matched, the regex engine will not go back into it to try any other combination when the rest of the regex fails
And, if you add a z at the beginning of the 12th header ( "ztestName" ), you’ll get, with the regex ^(?-is)(".+?",){11}+"z, the only possible match :-))
In conclusion, Scott’s initial regex should be used with one of the three syntaxes below:
(?-i)^((?:"[^,]+?",){11}")[0-9.]+?\|(?=NEGATIVE|POSITIVE)
(?-i)^((?:"[^"]+?",){11}")[0-9.]+?\|(?=NEGATIVE|POSITIVE)
(?-i)^((?:".+?",){11}+")[0-9.]+?\|(?=NEGATIVE|POSITIVE)
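To see one of these regexes at work, here is a minimal Python sketch based on the [^"] syntax. Several points are assumptions of mine: Python's re module stands in for the Notepad++ engine, the (?-i) prefix is dropped (Python is case-sensitive by default), the re.MULTILINE flag makes ^ match at each line start, and the replacement \1 is only a guess, since Scott's replacement string is not shown in this post.

```python
import re

FIELDS = ["processingIdentifier", "dateTime", "header/instrumentName",
          "patientIdentifier", "lastName", "birthdate", "sex", "location",
          "specimenIdentifier", "testIdentifier", "dilution", "testName",
          "result/value", "unit", "abnormalFlags", "resultStatus",
          "operatorName", "resultDateTime", "result", "simple/resultCode"]
header = ",".join('"%s"' % name for name in FIELDS)

rows = [
    '"A","B","C","D","E","20180328171007","VIDASPC01","F838-843","F838-843",'
    '"SPT","UP Salmonella","0.07|NEGATIVE","F","20180328115445",',
    '"A","B","C","D","E","20180328171008","VIDASPC01","F838-844","F838-844",'
    '"SPT","UP Salmonella","0.25|POSITIVE","F","20180328115446",',
]
text = "\n".join([header] + rows)

# Group 1 = the first 11 fields plus the opening " of the 12th field.
pattern = re.compile(
    r'^((?:"[^"]+?",){11}")[0-9.]+?\|(?=NEGATIVE|POSITIVE)',
    re.MULTILINE,
)

# Assumed replacement: keep group 1, drop the number and the | symbol.
result = pattern.sub(r'\1', text)
print(result)
```

The header line is left untouched and is rejected at once, because no group can stretch over more than one field.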
Cheers,
guy038