Search for large numbers
-
I have a relatively large dataset (~1.5 million lines) that contains ~45 000 scientific abstracts. I would like to locate all the abstracts that report 1000 or more patients, so I need some kind of search strategy that can do this:
Search for: “digit” “digit” “digit” “space” “patients” (i.e. 1938 patients)
AND
Search for: “digit” “comma” “digit” “digit” “digit” “space” “patients” (i.e. 1,938 patients)
Is this feasible??
-
You left out one ‘“digit”’ in your top search-for explanation.
In Find dialog:
Find What:
\d,?\d{3}\spatients
Search Mode: Regular Expression
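For anyone who wants to try the pattern outside the editor, here is a minimal sketch in Python. It assumes Python's re module treats this simple pattern the same way the Find dialog does; the function name and the sample sentence are made up for illustration.

```python
import re

# Pattern from the post above: one digit, an optional comma, three digits,
# one whitespace character, then the word "patients".
PATTERN = re.compile(r"\d,?\d{3}\spatients")

def find_four_digit_counts(text):
    """Return every substring of `text` that matches PATTERN."""
    return PATTERN.findall(text)

sample = "We enrolled 1938 patients, then 1,938 patients, but only 193 patients remained."
print(find_four_digit_counts(sample))
```

The pattern is not anchored on the left, so in a longer number such as 12345 patients it matches only the tail (2345 patients). That is still enough to flag the abstract.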
-
You might try this search expression (be sure to check the “Regular expression” check box in the search dialog):
[0-9,]+\w+patients
This will find any string of one or more digits and commas followed by white space (1 or more spaces and/or tabs) and then the text “patients”. So, it should find strings like:
4 patients
1234 patients
9,345,111 patients
0 patients
It would also find some nonsense strings like these, that I assume would not appear in the file:
, patients
,,,, patients
If the strings you are looking for are at the beginning of a line (possibly preceded by white space), you could use:
^\w*[0-9,]+ +patients
The “^” character means the string must start at the beginning of a line. The “\w*” means there can be zero or more white space characters.
-
Oops. Sorry. The “\w” in my previous post should be “\s”.
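With that correction applied, the two expressions can be checked with a short Python sketch. This assumes Python's re module is close enough to the editor's regex engine for these patterns; the variable names and sample text are only for illustration.

```python
import re

# Corrected patterns: \s (whitespace) replaces the mistaken \w (word character).
ANYWHERE = re.compile(r"[0-9,]+\s+patients")
# re.MULTILINE makes ^ match at the start of every line, not only the string.
LINE_START = re.compile(r"^\s*[0-9,]+ +patients", re.MULTILINE)

text = "4 patients\nabout 9,345,111 patients\n  1234  patients\n, patients"

# Matches anywhere in the text, including the nonsense ", patients".
anywhere_hits = ANYWHERE.findall(text)
# Matches only at the start of a line; strip() drops the leading white space.
line_start_hits = [m.group().strip() for m in LINE_START.finditer(text)]

print(anywhere_hits)
print(line_start_hits)
```

Note that neither pattern checks how large the number is, so counts below 1000 are found as well.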
The stupid web interface wouldn’t let me post this for 20 minutes because I have no reputation. :-( Also, I never saw an edit button or link after I posted, so I’m assuming that’s because I’m a nobody too… Maybe some day I’ll be all powerful here. :-)
-
Hello, Eirik,
If I assume that:
- Any group of three digits in the numbers of your dataset, except the first group, may or may not be preceded by a comma,
- The space delimiter before the word patients may or may not be present, and the word may also be written in the singular,
another search regex could be:
+\d+(,?\d{3})+ *patients?
with a space BEFORE the first + sign of that regex.
That regex matches all of the following items (the digit-free abc patients xyz lines only separate the groups and are not matched):
abc 1234 patients xyz
abc 1,234 patients xyz
abc patients xyz
abc 12345 patients xyz
abc 12,345 patients xyz
abc patients xyz
abc 123456 patients xyz
abc 123,456 patients xyz
abc patients xyz
abc 1234567 patients xyz
abc 1,234567 patients xyz
abc 1234,567 patient xyz
abc 1,234,567 patients xyz
abc patients xyz
abc 12345678 patients xyz
abc 12,345678 patients xyz
abc 12345,678 patients xyz
abc 12,345,678 patients xyz
abc patients xyz
abc 123456789 patients xyz
abc 123,456789 patient xyz
abc 123456,789 patient xyz
abc 123,456,789 patients xyz
abc patients xyz
abc 1234567890 patients xyz
abc 1,234567890 patients xyz
abc 1,234,567890 patient xyz
abc 1,234567,890 patients xyz
abc 1,234,567,890patient xyz
abc 1234,567890 patients xyz
abc 1234,567,890 patients xyz
abc 1234567,890 patients xyz
But it would ignore all natural numbers under 1000 and some odd syntaxes such as 1,2,3 or ,1a,23, as shown below:
abc 0 patient xyz
abc 0, patient xyz
abc 1 patient xyz
abc 1, patient xyz
abc 12 patients xyz
abc 12, patients xyz
abc 123 patients xyz
abc 999 patients xyz
abc 123, patients xyz
abc ,123 patients xyz
abc 1,2,3 patients xyz
abc 12,34 patients xyz
abc 12,3,4 patients xyz
abc 123,4 patients xyz
abc , patients xyz
abc ,,,,,, patients xyz
abc ,,4,,4,, patients xyz
abc patients xyz
abc ,1234 patients xyz
abc ,1,234 patients xyz
abc 123a456b789 patients xyz
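If the abstracts are also available outside the editor, the same regex can be tried in a script. The following is only a sketch, assuming Python's re module handles this pattern like the editor does. The capturing group around the number, the function names and the 1000 threshold parameter are additions for illustration, not part of the regex above.

```python
import re

# The pattern from this post, including the leading space before the first "+".
# Assumption: the number is wrapped in a capturing group so it can be read back.
PATTERN = re.compile(r" +(\d+(?:,?\d{3})+) *patients?")

def large_patient_counts(abstract):
    """Return the patient counts (as ints) that the pattern finds in one abstract."""
    return [int(m.group(1).replace(",", "")) for m in PATTERN.finditer(abstract)]

def has_large_cohort(abstract, minimum=1000):
    """True if any matched count is at least `minimum`."""
    return any(n >= minimum for n in large_patient_counts(abstract))

abstracts = [
    "abc 1,234 patients xyz",
    "abc 999 patients xyz",
    "abc 12345,678 patients xyz",
    "abc 0123 patients xyz",
]
print([a for a in abstracts if has_large_cohort(a)])
```

Converting the match to a number adds one safeguard: a string with a leading zero such as 0123 patients satisfies the regex but is rejected by the numeric comparison.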
Best Regards,
guy038
-