Interest in the concept of “big data” is growing rapidly. But what is it exactly? Wikipedia provides a simple definition: “a blanket term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data base management tools or traditional data processing applications.” A confluence of events – progressively more powerful computer hardware and software, the availability of data, and the demand for information – drives the creation of “big data” sets.
What counts as “big” today depends somewhat on the industry sector. A big data set can contain tens of thousands of elements or even billions of elements. It is the latter type of data set that tends to make the headlines. Below are several examples of “big data sets” of mind-boggling proportions, used to model customer behavior or project climate change:
- Facebook has 50 billion photos in its user database
- Walmart handles more than 1 million customer transactions every hour
- The NASA Center for Climate Simulation stores 32 petabytes of climate observations
(Note: 1 petabyte = 1,000 terabytes = 1,000,000 gigabytes)
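As a quick check on the conversion in the note above, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function and constant names are hypothetical, and it assumes the decimal (powers of 1,000) units used in the note rather than binary units.

```python
# Decimal storage units, as in the note above:
# 1 petabyte = 1,000 terabytes = 1,000,000 gigabytes.
TB_PER_PB = 1_000
GB_PER_TB = 1_000


def petabytes_to_terabytes(pb):
    """Convert petabytes to terabytes (decimal units)."""
    return pb * TB_PER_PB


def petabytes_to_gigabytes(pb):
    """Convert petabytes to gigabytes (decimal units)."""
    return pb * TB_PER_PB * GB_PER_TB


nasa_pb = 32  # figure quoted in the list above
print(petabytes_to_terabytes(nasa_pb))  # 32000
print(petabytes_to_gigabytes(nasa_pb))  # 32000000
```

In other words, the climate archive quoted above would be about 32,000 terabytes, or 32 million gigabytes.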
In US public health and health care, “big data” sets, while not as “big” as these examples, have existed for some time and in many forms: disease registries, vital statistics registries, longitudinal research studies, national surveys, and health insurance claims databases. JSI’s Health Services Division (HSD) has had several projects that involved collecting data on thousands or tens of thousands of people, creating data sets comprising hundreds of thousands of records (and, in the case of our California project, millions of records). The timeline below illustrates several examples of work in “big data” within US public health and HSD in particular.
Some key features of these HSD projects:
- The data sets were too numerous and too big to analyze using Excel, so SAS/SQL software was used instead
- The data were used to answer complex questions, such as:
  - Estimating the prevalence of PTSD symptoms in older veterans;
  - Estimating the rate of mother-to-child transmission of HIV, accounting for loss-to-follow-up;
  - Attributing patients to usual sources of care and then estimating the relative cost-effectiveness of community health centers as usual sources of care.
New types of public health and health care data systems are being developed and will likely be larger than those commonly used today: patient registries based on electronic health/medical records, all-payer claims databases, and social media data sets for public health (e.g., using Twitter to gauge flu prevalence). So stay tuned!