Search for large numbers

Eirik Ikdahl

I have a relatively large dataset (~1.5 million lines) that contains ~45 000 scientific abstracts. What I would like to do is locate all the abstracts that mention 1000 or more patients, so I need some kind of search strategy that can help me do this.

Search for: “digit” “digit” “digit” “space” “patients” (i.e. 1938 patients)

AND

Search for: “digit” “comma” “digit” “digit” “digit” “space” “patients” (i.e. 1,938 patients)

Is this feasible??

Scott Sumner

You left out one ‘“digit”’ in your top search-for explanation.

In Find dialog:

Find What:

\d,?\d{3}\spatients

Search Mode: Regular Expression
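If you would rather check the expression outside the editor first, here is a minimal Python sketch. It assumes Python's `re` module treats `\d`, `\s` and `{3}` the same way as the Find dialog does for this pattern; the sample lines and the helper name `has_large_count` are made up for illustration and are not from the real dataset.

```python
import re

# The expression from the post above: one digit, an optional comma,
# three digits, one whitespace character, then "patients".
PATTERN = re.compile(r"\d,?\d{3}\spatients")

# Hypothetical sample lines, not taken from the real dataset.
samples = [
    "We enrolled 1938 patients in total.",
    "We enrolled 1,938 patients in total.",
    "A cohort of 12,345 patients was followed.",
    "Only 938 patients completed follow-up.",
]


def has_large_count(line: str) -> bool:
    """Return True if the pattern finds a four-digit (or longer) patient count."""
    return PATTERN.search(line) is not None


for line in samples:
    print(has_large_count(line), line)
```

Numbers above 9999 are still found because the pattern is not anchored: it simply matches the last four digits of the number.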

Jim Dailey

You might try this search expression (be sure to check the “Regular expression” check box in the search dialog):

[0-9,]+\w+patients

This will find any string of one or more digits and commas followed by white space (1 or more spaces and/or tabs) and then the text “patients”. So, it should find strings like:

4 patients
1234    patients
9,345,111 patients
0         patients

It would also find some nonsense strings like these, that I assume would not appear in the file:

,    patients
,,,, patients

If the strings you are looking for are at the beginning of a line (possibly preceded by white space), you could use:

^\w*[0-9,]+ +patients

The “^” character means the string must start at the beginning of a line. The “\w*” means there can be zero or more white space characters.

Jim Dailey

Oops. Sorry. The “\w” in my previous post should be “\s”, so the two expressions become [0-9,]+\s+patients and ^\s*[0-9,]+ +patients.
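A rough way to sanity-check the `\s` versions outside the editor is a short Python sketch like the one below. It assumes Python's `re` module behaves like the editor's regex engine for these simple patterns; the names `ANYWHERE` and `LINE_START` and the sample strings are only illustrative.

```python
import re

# The two expressions with \s in place of \w.
ANYWHERE = re.compile(r"[0-9,]+\s+patients")
# re.MULTILINE makes "^" match at the start of every line, as in the editor.
LINE_START = re.compile(r"^\s*[0-9,]+ +patients", re.MULTILINE)

# Hypothetical sample strings, including one of the "nonsense" cases.
samples = [
    "4 patients",
    "1234    patients",
    "9,345,111 patients",
    ",,,, patients",
    "no patients",
    "abc 42 patients",
]

for s in samples:
    print(repr(s), bool(ANYWHERE.search(s)), bool(LINE_START.search(s)))
```

Note that both expressions match any count, including numbers below 1000, so they would need a further filter to answer the original question.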

The stupid web interface wouldn’t let me post this for 20 minutes because I have no reputation. :-( Also, I never saw an edit button or link after I posted, so I’m assuming that’s because I’m a nobody too… Maybe some day I’ll be all powerful here. :-)

guy038

Hello, Eirik,

If I assume that:

Any group of three digits in the numbers of your dataset, except for the first group, may or may not be preceded by a comma,
The space delimiter before the word patients may or may not be present, and the word may also be written in the singular,

then another search regex could be : +\d+(,?\d{3})+ *patients?, with a space BEFORE the first + sign of that regex.

That regex matches every line below that contains a number (the lines without a number do not match and only separate the groups):

abc 1234         patients xyz
abc 1,234        patients xyz
abc              patients xyz
abc 12345        patients xyz
abc 12,345       patients xyz
abc              patients xyz
abc 123456       patients xyz
abc 123,456      patients xyz
abc              patients xyz
abc 1234567      patients xyz
abc 1,234567     patients xyz
abc 1234,567     patient  xyz
abc 1,234,567    patients xyz
abc              patients xyz
abc 12345678     patients xyz
abc 12,345678    patients xyz
abc 12345,678    patients xyz
abc 12,345,678   patients xyz
abc              patients xyz
abc 123456789    patients xyz
abc 123,456789   patient  xyz
abc 123456,789   patient  xyz
abc 123,456,789  patients xyz
abc              patients xyz
abc 1234567890   patients xyz
abc 1,234567890  patients xyz
abc 1,234,567890 patient  xyz
abc 1,234567,890 patients xyz
abc 1,234,567,890patient  xyz
abc 1234,567890  patients xyz
abc 1234,567,890 patients xyz
abc 1234567,890  patients xyz

But it would ignore all natural numbers under 1000, as well as some odd syntaxes such as 1,2,3 or ,1a,23, as shown below:

abc 0            patient  xyz
abc 0,           patient  xyz
abc 1            patient  xyz
abc 1,           patient  xyz
abc 12           patients xyz
abc 12,          patients xyz
abc 123          patients xyz
abc 999          patients xyz
abc 123,         patients xyz
abc ,123         patients xyz
abc 1,2,3        patients xyz
abc 12,34        patients xyz
abc 12,3,4       patients xyz
abc 123,4        patients xyz
abc ,            patients xyz
abc ,,,,,,       patients xyz
abc ,,4,,4,,     patients xyz
abc              patients xyz
abc ,1234        patients xyz
abc ,1,234       patients xyz
abc 123a456b789  patients xyz
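The two lists above can be spot-checked with a small Python sketch. This assumes Python's `re` module handles the expression the same way as the editor does; only a handful of the sample lines are copied in, and the variable names are illustrative.

```python
import re

# The expression from this post; note the literal space before the first "+".
PATTERN = re.compile(r" +\d+(,?\d{3})+ *patients?")

# A few lines taken from the "match" list.
should_match = [
    "abc 1234         patients xyz",
    "abc 1,234        patients xyz",
    "abc 1234,567     patient  xyz",
    "abc 12,345678    patients xyz",
    "abc 1,234,567,890patient  xyz",
]

# A few lines taken from the "ignore" list.
should_not_match = [
    "abc 0            patient  xyz",
    "abc 999          patients xyz",
    "abc 123,         patients xyz",
    "abc 12,34        patients xyz",
    "abc ,1234        patients xyz",
    "abc 123a456b789  patients xyz",
    "abc              patients xyz",
]

for line in should_match:
    assert PATTERN.search(line), line
for line in should_not_match:
    assert not PATTERN.search(line), line
print("all samples behave as described")
```

The leading space is what rejects a case like ,1234: the number must directly follow at least one space, so a number glued to a preceding comma is skipped.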

Best Regards,

guy038