Search for large numbers



  • I have a relatively large dataset (~1.5 million lines) that contains ~45 000 scientific abstracts. What I would like to do is locate all the abstracts that report 1000 or more patients, so I need some kind of search strategy that can help me do this.

    Search for: “digit” “digit” “digit” “space” “patients” (i.e. 1938 patients)

    AND

    Search for: “digit” “comma” “digit” “digit” “digit” “space” “patients” (i.e. 1,938 patients)

    Is this feasible??



  • You left out one ‘“digit”’ in your first search description.

    In Find dialog:

    Find What:

    \d,?\d{3}\spatients
    

    Search Mode: Regular Expression
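
    For anyone who wants to check this expression outside the editor, here is a minimal sketch in Python. It assumes Python's re module treats this simple pattern the same way the editor's regex engine does; the function name has_four_digit_count is only illustrative.

```python
import re

# Pattern from the post above: one digit, an optional comma, three digits,
# exactly one white-space character, then the word "patients".
PATTERN = re.compile(r"\d,?\d{3}\spatients")

def has_four_digit_count(text: str) -> bool:
    """Return True if the text contains at least one match for PATTERN."""
    return PATTERN.search(text) is not None

for sample in ["1938 patients", "1,938 patients", "938 patients", "12,345 patients"]:
    print(sample, "->", has_four_digit_count(sample))
```

    Because the pattern is not anchored, longer numbers such as 12,345 are found through their last four digits. Note that \s matches exactly one white-space character, so two spaces before “patients” would not match.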



  • You might try this search expression (be sure to check the “Regular expression” check box in the search dialog):

    [0-9,]+\w+patients
    

    This will find any string of one or more digits and commas followed by white space (1 or more spaces and/or tabs) and then the text “patients”. So, it should find strings like:

    4 patients
    1234    patients
    9,345,111 patients
    0         patients
    

    It would also find some nonsense strings like these, which I assume would not appear in the file:

    ,    patients
    ,,,, patients
    

    If the strings you are looking for are at the beginning of a line (possibly preceded by white space), you could use:

    ^\w*[0-9,]+ +patients
    

    The “^” character means the string must start at the beginning of a line. The “\w*” means there can be zero or more white space characters.



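  • Oops. Sorry. The “\w” in both expressions in my previous post should be “\s”, since “\w” matches word characters, not white space. The corrected expressions are [0-9,]+\s+patients and ^\s*[0-9,]+ +patients.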
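
    With “\s” in place of “\w”, the two expressions can be tried out with a short Python sketch. This is only an approximation of the editor's behaviour: the names ANYWHERE, LINE_START and found are made up for the example, and re.MULTILINE is added so that “^” matches at the start of every line, which is an assumption about how the editor treats it.

```python
import re

# Corrected expressions: \s (white space) instead of \w (word characters).
ANYWHERE = re.compile(r"[0-9,]+\s+patients")
LINE_START = re.compile(r"^\s*[0-9,]+ +patients", re.MULTILINE)

def found(pattern: re.Pattern, text: str) -> bool:
    """Return True if the pattern matches somewhere in the text."""
    return pattern.search(text) is not None

for text in ["4 patients", "1234    patients", "9,345,111 patients", ",,,, patients"]:
    print(repr(text), "->", found(ANYWHERE, text))
```

    As the sample strings show, this expression accepts any count, including “4 patients”, so on its own it does not restrict the hits to 1000 or more.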

    The stupid web interface wouldn’t let me post this for 20 minutes because I have no reputation. :-( Also, I never saw an edit button or link after I posted, so I’m assuming that’s because I’m a nobody too… Maybe some day I’ll be all powerful here. :-)



  • Hello, Eirik,

    If I assume that:

    • Any group of three digits in the numbers of your dataset, except for the leading digits, may or may NOT be preceded by a comma,

    • The space delimiter may or may NOT be present before the word patients, which can also be written in the singular,

    then another SEARCH regex could be  +\d+(,?\d{3})+ *patients? which begins with a space BEFORE the first + sign.


    That regex would match all of the following items (the lines with no number are only separators between groups and do not match):

    abc 1234         patients xyz
    abc 1,234        patients xyz
    abc              patients xyz
    abc 12345        patients xyz
    abc 12,345       patients xyz
    abc              patients xyz
    abc 123456       patients xyz
    abc 123,456      patients xyz
    abc              patients xyz
    abc 1234567      patients xyz
    abc 1,234567     patients xyz
    abc 1234,567     patient  xyz
    abc 1,234,567    patients xyz
    abc              patients xyz
    abc 12345678     patients xyz
    abc 12,345678    patients xyz
    abc 12345,678    patients xyz
    abc 12,345,678   patients xyz
    abc              patients xyz
    abc 123456789    patients xyz
    abc 123,456789   patient  xyz
    abc 123456,789   patient  xyz
    abc 123,456,789  patients xyz
    abc              patients xyz
    abc 1234567890   patients xyz
    abc 1,234567890  patients xyz
    abc 1,234,567890 patient  xyz
    abc 1,234567,890 patients xyz
    abc 1,234,567,890patient  xyz
    abc 1234,567890  patients xyz
    abc 1234,567,890 patients xyz
    abc 1234567,890  patients xyz
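
    A few of the lines above can be checked outside the editor with a short Python sketch. It assumes Python's re module handles this regex the same way the editor does; the helper name matched_text is hypothetical.

```python
import re

# The regex from this post, including its leading space.
REGEX = re.compile(r" +\d+(,?\d{3})+ *patients?")

def matched_text(line: str):
    """Return the matched text without surrounding spaces, or None."""
    m = REGEX.search(line)
    return m.group(0).strip() if m else None

lines = [
    "abc 1234         patients xyz",
    "abc 1,234        patients xyz",
    "abc 1234,567     patient  xyz",
    "abc 1,234,567,890patient  xyz",
    "abc              patients xyz",   # separator line: no number, no match
]
for line in lines:
    print(repr(line), "->", matched_text(line))
```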
    

    But it would ignore all natural numbers under 1000, as well as some odd syntaxes such as 1,2,3 or ,1a,23, as shown below:

    abc 0            patient  xyz
    abc 0,           patient  xyz
    abc 1            patient  xyz
    abc 1,           patient  xyz
    abc 12           patients xyz
    abc 12,          patients xyz
    abc 123          patients xyz
    abc 999          patients xyz
    abc 123,         patients xyz
    abc ,123         patients xyz
    abc 1,2,3        patients xyz
    abc 12,34        patients xyz
    abc 12,3,4       patients xyz
    abc 123,4        patients xyz
    abc ,            patients xyz
    abc ,,,,,,       patients xyz
    abc ,,4,,4,,     patients xyz
    abc              patients xyz
    abc ,1234        patients xyz
    abc ,1,234       patients xyz
    abc 123a456b789  patients xyz
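
    To go from individual hits to a list of line numbers in a large file, the same regex could be applied line by line. The sketch below is one possible way to do that in Python, not part of the editor: the names lines_with_large_counts and threshold are hypothetical, and the extra numeric check is an addition of mine that filters out numbers written with leading zeros (such as 0999), which the regex alone would accept.

```python
import re

# Same regex as above, with a capturing group around the number itself.
REGEX = re.compile(r" +(\d+(?:,?\d{3})+) *patients?")

def lines_with_large_counts(lines, threshold=1000):
    """Return (line_number, count) pairs for counts >= threshold."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for m in REGEX.finditer(line):
            count = int(m.group(1).replace(",", ""))  # "1,234" -> 1234
            if count >= threshold:
                hits.append((line_no, count))
    return hits

sample = [
    "abc 999          patients xyz",
    "abc 1,234        patients xyz",
    "abc 0999         patients xyz",
    "abc 12345678     patients xyz",
    "abc ,1234        patients xyz",
    "abc 123a456b789  patients xyz",
]
print(lines_with_large_counts(sample))
```

    For a real file, the list could be replaced by an open file object, since the function only needs something that yields one line at a time.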
    

    Best Regards,

    guy038

