Search for large numbers

    Eirik Ikdahl
    last edited by Sep 11, 2015, 10:26 AM

    I have a relatively large dataset (~1.5 million lines) that contains ~45 000 scientific abstracts. I would like to locate all the abstracts that report 1000 or more patients, so I need some kind of search strategy that can help me do this.

    Search for: “digit” “digit” “digit” “space” “patients” (i.e. 1938 patients)

    AND

    Search for: “digit” “comma” “digit” “digit” “digit” “space” “patients” (i.e. 1,938 patients)

    Is this feasible??

      Scott Sumner
      last edited by Sep 11, 2015, 11:08 AM

      You left out one ‘“digit”’ in your top search-for explanation.

      In Find dialog:

      Find What:

      \d,?\d{3}\spatients
      

      Search Mode: Regular Expression
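
For anyone who wants to check this pattern outside Notepad++, here is a minimal sketch using Python's `re` module. It assumes Python's regex syntax behaves the same as Notepad++'s for this particular pattern; the sample lines and the `PATTERN` name are made up for illustration and do not come from the real dataset.

```python
import re

# The pattern from the post above: one digit, an optional comma, three digits,
# one whitespace character, then the word "patients".
PATTERN = re.compile(r"\d,?\d{3}\spatients")

# Hypothetical sample lines, not taken from the real dataset.
samples = [
    "We enrolled 1938 patients in the study.",
    "We enrolled 1,938 patients in the study.",
    "We enrolled 12,345 patients in the study.",
    "We enrolled 938 patients in the study.",
]

# Keep only the lines in which the pattern is found somewhere.
matched = [line for line in samples if PATTERN.search(line)]
for line in matched:
    print(line)
```

Because the pattern is not anchored, it also hits the tail of longer numbers such as 12,345, which is harmless here since those are above 1000 as well. The line with 938 patients is skipped.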

        Jim Dailey
        last edited by Sep 11, 2015, 12:38 PM

        You might try this search expression (be sure to check the “Regular expression” check box in the search dialog):

        [0-9,]+\w+patients
        

        This will find any string of one or more digits and commas followed by white space (1 or more spaces and/or tabs) and then the text “patients”. So, it should find strings like:

        4 patients
        1234    patients
        9,345,111 patients
        0         patients
        

        It would also find some nonsense strings like these, that I assume would not appear in the file:

        ,    patients
        ,,,, patients
        

        If the strings you are looking for are at the beginning of a line (possibly preceded by white space), you could use:

        ^\w*[0-9,]+ +patients
        

        The “^” character means the string must start at the beginning of a line. The “\w*” means there can be zero or more white space characters.

          Jim Dailey
          last edited by Sep 11, 2015, 12:58 PM

          Oops. Sorry. The “\w” in my previous post should be “\s”.
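
With that correction applied, the two expressions become `[0-9,]+\s+patients` and `^\s*[0-9,]+ +patients`. Below is a rough sketch of how they behave, using Python's `re` module as a stand-in for Notepad++'s regex engine (that equivalence is an assumption). The variable names and test strings are made up for illustration.

```python
import re

# Corrected patterns, with \s (white space) in place of \w (word character).
ANYWHERE = re.compile(r"[0-9,]+\s+patients")
LINE_START = re.compile(r"^\s*[0-9,]+ +patients")

# Hypothetical test strings, each standing for one line of the file.
lines = [
    "4 patients",
    "1234    patients",
    "9,345,111 patients",
    ",,,, patients",        # nonsense string that is still matched
    "   42 patients",       # leading white space before the number
    "about 1234 patients",  # number is not at the start of the line
    "1234patients",         # no white space before "patients"
]

anywhere_hits = [s for s in lines if ANYWHERE.search(s)]
line_start_hits = [s for s in lines if LINE_START.search(s)]

print(anywhere_hits)
print(line_start_hits)
```

Neither pattern checks the size of the number, so both still need a follow-up filter if only counts of 1000 or more are wanted.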

          The stupid web interface wouldn’t let me post this for 20 minutes because I have no reputation. :-( Also, I never saw an edit button or link after I posted, so I’m assuming that’s because I’m a nobody too… Maybe some day I’ll be all powerful here. :-)

            guy038
            Sep 12, 2015, 3:27 PM (last edited by guy038 Sep 12, 2015, 3:28 PM)

            Hello, Eirik,

            If I assume that:

            • Any group of three digits, except for the first one, in the numbers of your dataset may or may NOT be preceded by a comma,

            • The space delimiter before the word patients may be present or NOT, and that word may also be written in the singular,

            then another SEARCH regex could be:  +\d+(,?\d{3})+ *patients?, with a space BEFORE the first + sign of that regex.


            That regex would match all the following items (the lines without a number are only separators and are not matched):

            abc 1234         patients xyz
            abc 1,234        patients xyz
            abc              patients xyz
            abc 12345        patients xyz
            abc 12,345       patients xyz
            abc              patients xyz
            abc 123456       patients xyz
            abc 123,456      patients xyz
            abc              patients xyz
            abc 1234567      patients xyz
            abc 1,234567     patients xyz
            abc 1234,567     patient  xyz
            abc 1,234,567    patients xyz
            abc              patients xyz
            abc 12345678     patients xyz
            abc 12,345678    patients xyz
            abc 12345,678    patients xyz
            abc 12,345,678   patients xyz
            abc              patients xyz
            abc 123456789    patients xyz
            abc 123,456789   patient  xyz
            abc 123456,789   patient  xyz
            abc 123,456,789  patients xyz
            abc              patients xyz
            abc 1234567890   patients xyz
            abc 1,234567890  patients xyz
            abc 1,234,567890 patient  xyz
            abc 1,234567,890 patients xyz
            abc 1,234,567,890patient  xyz
            abc 1234,567890  patients xyz
            abc 1234,567,890 patients xyz
            abc 1234567,890  patients xyz
            

            But it would ignore all natural numbers under 1000, as well as some odd syntaxes such as 1,2,3 or ,1a,23, like those below:

            abc 0            patient  xyz
            abc 0,           patient  xyz
            abc 1            patient  xyz
            abc 1,           patient  xyz
            abc 12           patients xyz
            abc 12,          patients xyz
            abc 123          patients xyz
            abc 999          patients xyz
            abc 123,         patients xyz
            abc ,123         patients xyz
            abc 1,2,3        patients xyz
            abc 12,34        patients xyz
            abc 12,3,4       patients xyz
            abc 123,4        patients xyz
            abc ,            patients xyz
            abc ,,,,,,       patients xyz
            abc ,,4,,4,,     patients xyz
            abc              patients xyz
            abc ,1234        patients xyz
            abc ,1,234       patients xyz
            abc 123a456b789  patients xyz
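
A quick way to sanity-check these two lists is to run the regex over a few of the sample lines. The sketch below uses Python's `re` module, assuming it treats this pattern the same way Notepad++ does. The capture group around the number and the helper name `patient_count` are additions made for illustration; they are not part of the regex given above.

```python
import re

# Same pattern as above, with a capture group added around the number
# (an addition for this sketch) so the count can be extracted.
PATTERN = re.compile(r" +(\d+(?:,?\d{3})+) *patients?")

def patient_count(line):
    """Return the patient count found in line as an int, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    # Drop the optional commas before converting to an integer.
    return int(m.group(1).replace(",", ""))

# A few lines taken from the two lists above.
should_match = [
    "abc 1234         patients xyz",
    "abc 1,234        patients xyz",
    "abc 1234,567     patient  xyz",
    "abc 1,234,567,890patient  xyz",
]
should_not_match = [
    "abc 999          patients xyz",
    "abc 12,34        patients xyz",
    "abc ,1234        patients xyz",
    "abc 123a456b789  patients xyz",
    "abc              patients xyz",
]

counts = [patient_count(s) for s in should_match]
misses = [patient_count(s) for s in should_not_match]
print(counts)
print(misses)
```

Every extracted count is at least 1000, because the pattern requires one or more digits followed by at least one full group of three digits.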
            

            Best Regards,

            guy038

            The Community of users of the Notepad++ text editor.