    Search for large numbers

    • Eirik Ikdahl

      I have a relatively large dataset (~1.5 million lines) that contains ~45,000 scientific abstracts. I would like to locate all the abstracts that mention 1000 or more patients, so I need some kind of search strategy that can help me do this.

      Search for: “digit” “digit” “digit” “space” “patients” (i.e. 1938 patients)

      AND

      Search for: “digit” “comma” “digit” “digit” “digit” “space” “patients” (i.e. 1,938 patients)

      Is this feasible?

      • Scott Sumner

        You left out one “digit” in your first search description.

        In Find dialog:

        Find What:

        \d,?\d{3}\spatients
        

        Search Mode: Regular Expression
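A minimal sketch of the same pattern outside the editor, assuming Python's `re` module behaves like the Notepad++ engine for this simple expression. The helper name `has_large_count` and the sample strings are made up for illustration.

```python
import re

# Pattern from the post above: one digit, an optional comma, three digits,
# exactly one whitespace character, then the word "patients".
PATTERN = re.compile(r"\d,?\d{3}\spatients")


def has_large_count(text: str) -> bool:
    """Return True if the text contains a four-digit (or longer) number
    directly followed by one whitespace character and 'patients'."""
    return PATTERN.search(text) is not None


if __name__ == "__main__":
    # Hypothetical sample sentences, not taken from the real dataset.
    for sample in (
        "We enrolled 1938 patients.",
        "We enrolled 1,938 patients.",
        "We enrolled 12,345 patients.",
        "We enrolled 938 patients.",
    ):
        print(sample, "->", has_large_count(sample))
```

Longer numbers are still found because the pattern only needs to match their last four digits. A number followed by two or more spaces is not found, since `\s` matches a single whitespace character.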

        • Jim Dailey

          You might try this search expression (be sure to check the “Regular expression” check box in the search dialog):

          [0-9,]+\w+patients
          

          This will find any string of one or more digits and commas followed by white space (1 or more spaces and/or tabs) and then the text “patients”. So, it should find strings like:

          4 patients
          1234    patients
          9,345,111 patients
          0         patients
          

          It would also find some nonsense strings like these, that I assume would not appear in the file:

          ,    patients
          ,,,, patients
          

          If the strings you are looking for are at the beginning of a line (possibly preceded by white space), you could use:

          ^\w*[0-9,]+ +patients
          

          The “^” character means the string must start at the beginning of a line. The “\w*” means there can be zero or more white space characters.

          • Jim Dailey

            Oops. Sorry. The “\w” in my previous post should be “\s”.
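To see why the correction matters, here is a rough sketch comparing the expression as first posted with the `\s` version. It assumes Python's `re` module treats `\w` and `\s` the same way the Notepad++ engine does for these inputs; the names `AS_POSTED`, `CORRECTED`, and `find_counts` are only illustrative.

```python
import re

# As first posted: \w matches word characters (letters, digits, underscore),
# so it cannot match the spaces between the number and "patients".
AS_POSTED = re.compile(r"[0-9,]+\w+patients")

# Corrected: \s matches whitespace (spaces, tabs, ...).
CORRECTED = re.compile(r"[0-9,]+\s+patients")


def find_counts(text, pattern=CORRECTED):
    """Return every substring of text matched by the given pattern."""
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == "__main__":
    print(find_counts("4 patients and 9,345,111 patients"))
    print(find_counts("1234    patients", AS_POSTED))  # no match with \w
    print(find_counts("1234patients", AS_POSTED))      # \w matches the digit "4"
    print(find_counts(",,,, patients"))                # the "nonsense" case
```

Note that neither version limits the number to 1000 or more; both also match counts such as `4 patients`.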

            The stupid web interface wouldn’t let me post this for 20 minutes because I have no reputation. :-( Also, I never saw an edit button or link after I posted, so I’m assuming that’s because I’m a nobody too… Maybe some day I’ll be all powerful here. :-)

              • guy038

              Hello, Eirik,

              If I assume that:

              • Any group of three digits in the numbers of your dataset, except for the leading group, may or may not be preceded by a comma,

              • The space before the word patients may or may not be present, and the word may also be written in the singular,

              another search regex could be: +\d+(,?\d{3})+ *patients?, with a space BEFORE the first + sign of that regex.


              That regex would match all of the following items (the lines without a number are only separators and are not matched):

              abc 1234         patients xyz
              abc 1,234        patients xyz
              abc              patients xyz
              abc 12345        patients xyz
              abc 12,345       patients xyz
              abc              patients xyz
              abc 123456       patients xyz
              abc 123,456      patients xyz
              abc              patients xyz
              abc 1234567      patients xyz
              abc 1,234567     patients xyz
              abc 1234,567     patient  xyz
              abc 1,234,567    patients xyz
              abc              patients xyz
              abc 12345678     patients xyz
              abc 12,345678    patients xyz
              abc 12345,678    patients xyz
              abc 12,345,678   patients xyz
              abc              patients xyz
              abc 123456789    patients xyz
              abc 123,456789   patient  xyz
              abc 123456,789   patient  xyz
              abc 123,456,789  patients xyz
              abc              patients xyz
              abc 1234567890   patients xyz
              abc 1,234567890  patients xyz
              abc 1,234,567890 patient  xyz
              abc 1,234567,890 patients xyz
              abc 1,234,567,890patient  xyz
              abc 1234,567890  patients xyz
              abc 1234,567,890 patients xyz
              abc 1234567,890  patients xyz
              

              However, it would ignore all natural numbers below 1000, as well as odd forms such as 1,2,3 or ,1a,23, like the ones below:

              abc 0            patient  xyz
              abc 0,           patient  xyz
              abc 1            patient  xyz
              abc 1,           patient  xyz
              abc 12           patients xyz
              abc 12,          patients xyz
              abc 123          patients xyz
              abc 999          patients xyz
              abc 123,         patients xyz
              abc ,123         patients xyz
              abc 1,2,3        patients xyz
              abc 12,34        patients xyz
              abc 12,3,4       patients xyz
              abc 123,4        patients xyz
              abc ,            patients xyz
              abc ,,,,,,       patients xyz
              abc ,,4,,4,,     patients xyz
              abc              patients xyz
              abc ,1234        patients xyz
              abc ,1,234       patients xyz
              abc 123a456b789  patients xyz
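The two lists above can be checked with a short script. This is only a sketch, assuming Python's `re` module agrees with the Notepad++ engine on this expression. The capturing group around the number and the helper `extract_count` are additions for illustration and are not part of the regex given in the post.

```python
import re

# Same regex as in the post (note the leading space), with one added
# capturing group around the number so its value can be read back.
PATTERN = re.compile(r" +(\d+(?:,?\d{3})+) *patients?")


def extract_count(line):
    """Return the patient count found in line as an int, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Drop the thousands separators before converting.
    return int(match.group(1).replace(",", ""))


if __name__ == "__main__":
    for line in (
        "abc 1,234        patients xyz",
        "abc 1234,567     patient  xyz",
        "abc 1,234,567,890patient  xyz",
        "abc 999          patients xyz",
        "abc 12,34        patients xyz",
        "abc ,1234        patients xyz",
        "abc 123a456b789  patients xyz",
    ):
        print(repr(line), "->", extract_count(line))
```

Converting the match to an integer also makes it possible to apply the "1000 or more" test numerically, which would filter out a value written with leading zeros.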
              

              Best Regards,

              guy038

              The Community of users of the Notepad++ text editor.