Using Big Data to Share Scientific Knowledge
Jan. 26, 2017 - Big data. The term has been a buzzword in the media and data management circles for years now, but what does it mean and how does it relate to modern science? In general, big data is defined as extremely large datasets that cannot be easily analyzed using traditional database methods. In today’s data-driven economy, business and media companies have embraced big data as a way to analyze how to better serve their customers.
Scientists look at big data from a different perspective. New tools and techniques have improved how we manage and share datasets, and also how we store, process and analyze scientific data. Having to manage and analyze large amounts of data is not new to science: Collecting and analyzing information is the foundation of scientific inquiry. What has changed is the sheer volume of digitized data available to scientists, distributed storage environments (i.e., the Cloud), and the challenge of how to integrate and broadcast those data.
In the past, scientists often distributed data by presenting at conferences or publishing in peer-reviewed scientific journals. That meant good science was collected in binders and placed on bookshelves in a physical location. In addition, scientists were not always forthcoming in sharing data because of the real fear of getting scooped, but the culture is changing, and scientists are seeing the benefits, to both the science community and the public, of sharing data earlier. These are a few of the challenges encountered in trying to address the unprecedented magnitude and complexity of data collected and available for environmental spill response and restoration.
Integrating environmental data
Our real-world experience with legacy data management systems, and with building new data management systems to work with those existing programs, has informed our entire approach to managing environmental data, now and in the future. For years, NOAA and ocean advocates have been talking about a concept known as “ecosystem-based management” for marine species and habitats. Put simply, ecosystem-based management is a way to find out what happens to the larger tapestry design and function when one thread is pulled from the cloth.
We were able to leverage “big data” techniques and develop a data warehouse and information portal built with open source tools for ingesting, integrating and organizing information. This tool, called the Data Integration Visualization, Exploration and Reporting (DIVER) application, allows scientific teams from different organizations to upload their field data and other key information related to their studies, such as scanned field notes, electronic data sheets, scanned images and photographs, and to filter and download results.
For instance, the data collected from the Deepwater Horizon (DWH) spill came in large quantities and from many sources, resulting in datasets of different types and structures. DIVER addresses this challenge by integrating standardized data and allowing users to query across multiple datasets simultaneously. To facilitate this process, the DIVER team developed common data models, which provide a consistent and standardized structure for managing and exchanging information. DIVER was developed to support data generated in the DWH oil spill response and assessment efforts. The DIVER data models and data warehouse approach have since expanded to serve the entire coastal and Great Lakes regions of the United States. The common data model concept is based upon creating data schemas, which serve as blueprints to organize and standardize information.
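To make the common data model idea concrete, here is a minimal Python sketch. It is not DIVER's actual schema or code: the SampleRecord fields, the two team formats, and all sample values are hypothetical and invented purely for illustration. The sketch shows two differently structured datasets being mapped onto one shared schema, after which a single query can run across both.

```python
from dataclasses import dataclass
from datetime import date


# Hypothetical common schema: field names are illustrative only,
# not taken from DIVER's real data models.
@dataclass(frozen=True)
class SampleRecord:
    source: str
    sample_date: date
    latitude: float
    longitude: float
    analyte: str
    value: float
    unit: str


def from_team_a(row: dict) -> SampleRecord:
    """Hypothetical Team A format: ISO dates, concentrations in mg/kg."""
    return SampleRecord(
        source="team_a",
        sample_date=date.fromisoformat(row["date"]),
        latitude=float(row["lat"]),
        longitude=float(row["lon"]),
        analyte=row["chemical"].strip().lower(),
        value=float(row["conc_mg_kg"]),
        unit="mg/kg",
    )


def from_team_b(row: dict) -> SampleRecord:
    """Hypothetical Team B format: month/day/year dates, values in ug/kg."""
    month, day, year = (int(part) for part in row["SampleDate"].split("/"))
    return SampleRecord(
        source="team_b",
        sample_date=date(year, month, day),
        latitude=float(row["Y"]),
        longitude=float(row["X"]),
        analyte=row["Analyte"].strip().lower(),
        value=float(row["Result_ug_kg"]) / 1000.0,  # convert ug/kg to mg/kg
        unit="mg/kg",
    )


def query(records, analyte, min_value):
    """Return records for one analyte at or above a threshold (mg/kg)."""
    return [r for r in records if r.analyte == analyte and r.value >= min_value]


# Made-up rows standing in for uploads from two field teams.
team_a_rows = [
    {"date": "2020-01-15", "lat": "10.0", "lon": "20.0",
     "chemical": "Naphthalene", "conc_mg_kg": "0.8"},
]
team_b_rows = [
    {"SampleDate": "01/20/2020", "Y": "10.5", "X": "20.5",
     "Analyte": "naphthalene ", "Result_ug_kg": "1200"},
    {"SampleDate": "01/21/2020", "Y": "10.6", "X": "20.6",
     "Analyte": "benzene", "Result_ug_kg": "300"},
]

# Once every row follows the shared schema, one query spans both datasets.
warehouse = [from_team_a(r) for r in team_a_rows] + [from_team_b(r) for r in team_b_rows]
hits = query(warehouse, "naphthalene", 0.5)
for record in hits:
    print(record.source, record.sample_date, record.value, record.unit)
```

The design choice worth noting is that all format-specific handling (date styles, column names, unit conversion) lives in the per-source mapping functions, so the query itself never needs to know where a record came from.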
Data integration systems like DIVER put all of that information in one place at one time, allowing users to look for causes and effects that they might not have ever known were there and then use that information to better manage species recovery. These data give us a new kind of power for protecting marine species. Systems like DIVER are set up to take advantage of quantum leaps in computing power and tools that were not available to the field of environmental conservation 10 years ago. These advances give DIVER the ability to accept reams of diverse and seemingly unrelated pieces of information and, over time, turn them into insight about the nature and location of the greatest threats to marine wildlife. Ultimately, all the advancement in data sharing benefits not only the science and academic communities but also the public.
Ben Shorr is a physical scientist with the Assessment and Restoration Division’s Spatial Data Branch.