Remove part of a column text in a csv file (variable text to remove and 2 possible text to keep)



  • Hi,
    I have a csv file with a header row and rows below with data.
    “processingIdentifier”,“dateTime”,“header/instrumentName”,“patientIdentifier”,“lastName”,“birthdate”,“sex”,“location”,“specimenIdentifier”,“testIdentifier”,“dilution”,“testName”,“result/value”,“unit”,“abnormalFlags”,“resultStatus”,“operatorName”,“resultDateTime”,“result”,“simple/resultCode”
    ,“20180328171007”,“VIDASPC01”,“F838-843”,“F838-843”,“SPT”,“UP Salmonella”,“0.07|NEGATIVE”,“F”,“20180328115445”,
    ,“20180328171008”,“VIDASPC01”,“F838-844”,“F838-844”,“SPT”,“UP Salmonella”,“0.25|POSITIVE”,“F”,“20180328115446”,

    1. the first row is not to be changed
    2. In the second row the 12th field contains “0.07|NEGATIVE”.
      here I want to keep only the NEGATIVE and remove the number and the pipe character.
    3. In the third row the 12th field contains “0.25|POSITIVE”.
      here I want to keep only the POSITIVE and remove the number and the pipe character.

    what regex should I use in replace function?
    I would be more than happy if someone could help!
    Warmest regards,
    Arpad



    • ☑ Regular Expression
    • Find what : “(\d\.\d\d\|)(POSITIVE|NEGATIVE)”
    • Replace with : “$2”

    Caveats:

    • The example you gave used silly smart-quotes rather than real ASCII quotes. This makes it invalid CSV. I am assuming that you have an invalid CSV, because that’s the data you gave us. If it’s really a normal ASCII " quote character, please let us know.
    • The example you gave put the “0.07|NEGATIVE” in the 8th field of the example data (there were seven commas before it), but you claimed it was the 12th field. I assumed that field number was not important, and what you’re really looking for is the pattern of “number, vertical bar, positive/negative”
    • This assumes numbers in exactly the form “a single digit, the decimal point, and two digits”. It could easily be made more generic, but since your two examples were in exactly that format, I made the assumption that you would have crafted your example differently if you wanted to allow more numeric formats. (if my assumption is wrong, please completely define valid numbers before the vertical bar)
    • This assumes either NEGATIVE or POSITIVE, and nothing else would be a valid match after the vertical bar. Please correct me if I’m wrong

    Explanation:
    Find what:

    • “...”: The smart quotes at the beginning and end are exact characters
    • “(...)(...)”: The two pairs of parentheses are for “grouping”: the contents of the first pair go into $1 in the ReplaceWith, and the contents of the second pair go into $2.
    • \d: matches the decimal digits 0-9
    • \.: matches the decimal point .. (The \ escapes the special meaning of the . character, which in a regex means “match exactly one character”
    • \|: matches the vertical bar |. (The \ escapes the special meaning of the | character, which in a regex means “match what’s before the | or alternately what’s after the bar”
    • POSITIVE|NEGATIVE: match either POSITIVE or NEGATIVE, case sensitively

    Replace with

    • “...”: the smart quotes will stay in place
    • $2: use the contents of the second parentheses pair from the Find What

    P.S. :
    (paraphrasing @guy038, the forum’s regex guru, who compiled these great regex resources, but hasn’t shared them in this thread yet):

    Here is a good starting point for NPP users unfamiliar with regular expression concepts and syntax:

    Modern Notepad++ (since v6.0) uses the Boost C++ Regex library, v1.55.0 (similar to the Perl Regular Common Expressions (PRCE), v5.8):

    Other sites with valuable regular expression information include:



  • @Arpad-Zsok

    Your example data isn’t the greatest. However, if we take you at your word about the field in question being the 12th field, and make an assumption that the data doesn’t have any extra commas or double quotes, then the following could work:

    Find what zone: (?-i)^((?:".+?",){11}")[0-9.]+?\|(?=NEGATIVE|POSITIVE)
    Replace with zone: \1
    Search mode: Regular expression

    Running it on this sample data:

    "processingIdentifier","dateTime","header/instrumentName","patientIdentifier","lastName","birthdate","sex","location","specimenIdentifier","testIdentifier","dilution","testName","result/value","unit","abnormalFlags","resultStatus","operatorName","resultDateTime","result","simple/resultCode""A","B","C","D","E","20180328171007","VIDASPC01","F838-843","F838-843","SPT","UP Salmonella","0.07|NEGATIVE","F","20180328115445",
    "A","B","C","D","E","20180328171008","VIDASPC01","F838-844","F838-844","SPT","UP Salmonella","0.25|POSITIVE","F","20180328115446",
    "A","B","C","D","E","20180328171007","VIDASPC01","F838-843","F838-843","SPT","UP Salmonella","0.07|NEGATIVE","F","20180328115445",
    

    Produces this result:

    "processingIdentifier","dateTime","header/instrumentName","patientIdentifier","lastName","birthdate","sex","location","specimenIdentifier","testIdentifier","dilution","testName","result/value","unit","abnormalFlags","resultStatus","operatorName","resultDateTime","result","simple/resultCode""A","B","C","D","E","20180328171007","VIDASPC01","F838-843","F838-843","SPT","UP Salmonella","NEGATIVE","F","20180328115445",
    "A","B","C","D","E","20180328171008","VIDASPC01","F838-844","F838-844","SPT","UP Salmonella","POSITIVE","F","20180328115446",
    "A","B","C","D","E","20180328171007","VIDASPC01","F838-843","F838-843","SPT","UP Salmonella","NEGATIVE","F","20180328115445",
    


  • Hello, @arpad-zsok, @scott-sumner and All,

    Noticing that Arpad’s text contains, only, one | character per row, I suppose that this simple regex S/R, below, would be enough !

    SEARCH [^"]+\|

    REPLACE Leave EMPTY


    But my solution isn’t the true goal of this post, indeed ! While running the Scott’s search regex, below :

    (?-i)^((?:".+?",){11}")[0-9.]+?\|(?=NEGATIVE|POSITIVE)

    against the text :

    "processingIdentifier","dateTime","header/instrumentName","patientIdentifier","lastName","birthdate","sex","location","specimenIdentifier","testIdentifier","dilution","testName","result/value","unit","abnormalFlags","resultStatus","operatorName","resultDateTime","result","simple/resultCode"
    "A","B","C","D","E","20180328171007","VIDASPC01","F838-843","F838-843","SPT","UP Salmonella","0.07|NEGATIVE","F","20180328115445",
    "A","B","C","D","E","20180328171008","VIDASPC01","F838-844","F838-844","SPT","UP Salmonella","0.25|POSITIVE","F","20180328115446",
    "A","B","C","D","E","20180328171007","VIDASPC01","F838-843","F838-843","SPT","UP Salmonella","0.07|NEGATIVE","F","20180328115445",
    

    I noticed that it took some time ( ~2s ), before finding the first expected zone :

    "A",“B”,“C”,“D”,“E”,“20180328171007”,“VIDASPC01”,“F838-843”,“F838-843”,“SPT”,“UP Salmonella”,"0.07|

    And I tried to understand that behavior ! First, I realized that the matching delay was due to the headers line, which does not contain any | symbol. Thus, the problem could be simplified !


    In a new tab, let’s build an unique line, with 3 initial “headers” lines, joined together, surrounded with two simple sentences :

    This is a test
    
    "ProcessingIdentifier","dateTime","header/instrumentName","patientIdentifier","lastName","birthdate","sex","location","specimenIdentifier","testIdentifier","dilution","testName","result/value","unit","abnormalFlags","resultStatus","operatorName","resultDateTime","result","simple/resultCode","processingIdentifier","dateTime","header/instrumentName","patientIdentifier","lastName","birthdate","sex","location","specimenIdentifier","testIdentifier","dilution","testName","result/value","unit","abnormalFlags","resultStatus","operatorName","resultDateTime","result","simple/resultCode","ProcessingIdentifier","dateTime","header/instrumentName","patientIdentifier","lastName","birthdate","sex","location","specimenIdentifier","testIdentifier","dilution","testName","result/value","unit","abnormalFlags","resultStatus","operatorName","resultDateTime","result","simple/resultCode"
    
    This is a text
    

    and let"s consider the generic regex ^(?-is)(".+?",){N}"z, where the letter N is any number > 0.Of course, as this text does not contain any lowercase letter z, the regex engine always answers Find: Can’t find the text…". On my old laptop, I obtained, depending of the used regex, the following times, for unmatching :

    • With regex ^(?-is)(".+?",){1}"z => Immediate answer

    • With regex ^(?-is)(".+?",){2}"z => Immediate answer

    • With regex ^(?-is)(".+?",){3}"z => 1s

    • With regex ^(?-is)(".+?",){4}"z => 2s

    • With regex ^(?-is)(".+?",){5}"z => 7s , with a wrong unique match ( all the file contents !)

    Why the time, for the regex engine, to realize, that there is NO match, becomes exponential ? Well, this comes from the regex part ".+?". Indeed, to understand the process, let’s use the regex ^(?-is)(".+?",){1}"z, which can, also, be written ^(?-is)".+?","z

    Actually, due to the lazy quantifier +?, this regex means : Find the smallest range of standard characters, between two " symbols, which is followed with the string ,"z !

    So, let’s add a z lowercase letter, at the beginning of the second column header, as below :

    "ProcessingIdentifier","zdateTime","header/instrumentName","patientIdentifier","lastName","birthdate",......
    

    The regex part ".+?" catches the string “ProcessingIdentifier” and the part ,"z matches the ,"z string

    Now, let’s add, instead, a z lowercase letter, at the beginning of the third column header, as below :

    "ProcessingIdentifier","dateTime","zheader/instrumentName","patientIdentifier","lastName","birthdate",......
    

    This time, the regex part ".+?" catches the string “ProcessingIdentifier”,“dateTime” and the part ,"z matches the ,"z string

    And so on…, till inserting a z character at beginning of the last header :

    "ProcessingIdentifier","dateTime","header/instrumentName",...................,"resultDateTime","result","zsimple/resultCode"
    

    This time, the regex part ".+?" catches the huge string “ProcessingIdentifier”,“dateTime”…“resultStatus”,“operatorName”,“resultDateTime”,“result” ( 855 chars ) and the part ,"z matches the ,"z string, before the last header of the text ( simple/resultCode )

    So , depending of the location of that z letter, the quantifier +? takes as many non-null characters as to meet the ,"z string ! This regex engine behavior can be considered as a foretracking operation ( by analogy to the backtracking one ! )


    Now, everyone can understand why, if no z letter exists in the text, the time to answer becomes exponential, when replacing the {1} quantifier with greater values and, even with the {5} quantifier, I, personally, got a catastrophic foretracking :-((

    Don’t forget that, for instance, the regex ^(?-is)(".+?",){11}"z can be rewritten :

    ^(?-is)".+?",".+?",".+?",".+?",".+?",".+?",".+?",".+?",".+?",".+?",".+?","z

    And, as any part ".+?" may concern any form "....", "....","....", "....","....","....",…, it’s not difficult to guess that, because of the multiple possible combinations, troubles are not very far :-((


    Two obvious solutions to that problem are to use the regexes ^(?-is)("[^,]+?",){11}"z or ^(?-is)("[^"]+?",){11}"z

    However, I, also, got an other strange solution : to use the regex ^(?-is)(".+?",){11}+"z, with the possessive quantifier {11}+. Indeed, when no z letter exists in the list of headers, the negative answer, of the regex engine, is immediate !

    At first sight, this {11}+ syntax seems quite weird, because the {11} quantifier cannot change and the backtracking/foretracking process does not seem to be involved with that quantifier ! But, actually, this means that the regex engine, will not try any combination, after a first unmatched search, in the regex part ".+?",".+?",".+?",".+?",".+?",".+?",".+?",".+?",".+?",".+?",".+?",

    And, if your add a z, at beginning of the 12th header ( “ztestName” ), you 'll get , with the regex ^(?-is)(".+?",){11}+"z, the only possible match :-))


    In conclusion, the initial Scott’s regex should be used, with one the three syntaxes, below :

    • (?-i)^((?:"[^,]+?",){11}")[0-9.]+?\|(?=NEGATIVE|POSITIVE)

    • (?-i)^((?:"[^"]+?",){11}")[0-9.]+?\|(?=NEGATIVE|POSITIVE)

    • (?-i)^((?:".+?",){11}+")[0-9.]+?\|(?=NEGATIVE|POSITIVE)

    Cheers,

    guy038


Log in to reply