Big data: $5M to widen ‘bottleneck to discovery’

September 24, 2015

[Image: An animated graphic illustrating the bottleneck effect.]

ANN ARBOR—Buried in troves of data that scientists have gathered, but not yet analyzed, could be key insights to improving cancer treatment, understanding Alzheimer’s, predicting climate change effects and developing cheaper, clean energy technologies.

Those are just a few of the countless examples of fields where our capacity to gather scientific data now far exceeds our capacity to crunch it—especially when collaborations span the globe. Some research projects are producing the equivalent of 1,000 consumer hard drives a month, for example.

“So many different areas of science can now produce these fire hoses of data, but we haven’t kept pace with the infrastructure to make analyzing it trivial or even transparent,” said Shawn McKee, a research scientist in physics at LSA, who leads the project.

A $5 million data storage and networking project led by U-M aims to change that—to widen what McKee describes as a bottleneck to scientific discovery.

The new Multi-Institutional Open Storage Research InfraStructure, shortened to MI-OSiRIS, is a regional pilot funded by the National Science Foundation. If it’s successful, scientists say it could dramatically speed up discovery and revolutionize the research cloud. In the future, it might also boost the usefulness of the burgeoning Internet of Things.

Through the new project, U-M, Michigan State University, Wayne State University and Indiana University will install advanced data storage software and hardware and open new frequencies on the high-speed research computing network that many of them already share.

Why the focus on storage? It’s not just that the pipes transporting data aren’t wide enough. The way data is arranged and stored can make a big difference in how quickly it can be categorized and searched.

The project will test the effectiveness of so-called software-defined storage coupled with advanced networking. Software-defined storage is a new approach to handling large amounts of information. It allows relatively inexpensive, off-the-shelf hard drives to be programmed with intelligent software that can automatically manage data in ways that make it easier to copy, analyze, change, search and share.

MI-OSiRIS will also incorporate what’s called software-defined networking and other tools developed primarily at IU to continually find optimal network paths between scientists and the data storage locations.

“Like getting directions from your favorite mapping software, the best route depends on distance as well as current traffic,” said Martin Swany, IU professor of informatics. “If the most direct route is ‘red,’ you may want to take an alternate path.”
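Swany's mapping analogy can be illustrated with a small shortest-path search. The sketch below is not the MI-OSiRIS software or the IU tools, whose internals the article does not describe. It assumes a toy cost model (distance multiplied by a congestion factor), and the site names, distances and congestion values are hypothetical.

```python
import heapq

def best_path(graph, source, target):
    """Dijkstra's algorithm with edge cost = distance * congestion.

    graph maps node -> {neighbor: (distance, congestion)}.
    Returns (total_cost, path), or (inf, []) if target is unreachable.
    """
    queue = [(0.0, source, [source])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, (distance, congestion) in graph.get(node, {}).items():
            if neighbor not in settled:
                step = distance * congestion
                heapq.heappush(queue, (cost + step, neighbor, path + [neighbor]))
    return float("inf"), []

# Hypothetical topology: names and numbers are illustrative only.
network = {
    "scientist": {"hub_a": (10, 1.0), "hub_b": (15, 1.0)},
    "hub_a": {"storage": (10, 5.0)},  # shorter but congested ("red")
    "hub_b": {"storage": (20, 1.0)},  # longer but clear
}

cost, path = best_path(network, "scientist", "storage")
print(cost, path)
```

With these made-up numbers the search avoids the congested link through hub_a and routes through hub_b, even though that path covers more distance. Re-running the search as congestion values change is one simple way to picture "continually" finding a good path.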

If it’s successful, MI-OSiRIS could serve as a template for other research hubs.

“What we’re trying to do here is expedite the time to discovery,” McKee said. “Scientists should be able to focus on their science without having to become experts in data management.”

To get a sense of the scale of data, consider the massive simulations of the HYbrid Coordinate Ocean Model performed by the U.S. Navy to predict conditions for its fleet. U-M’s Brian Arbic, an associate professor of physical oceanography, is involved in running it and he regularly gets requests from researchers around the world to access it.

The model will inform a NASA satellite mission to map ocean motion at high resolution for the study of marine ecosystems and the impact of the ocean on climate. HYCOM forecasts the water’s temperature, speed in two directions, salt concentration and pressure—five variables—every hour at 2.4 billion points around the globe. It produces about 600 terabytes (600 trillion bytes) of output per simulated year.
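A quick back-of-the-envelope calculation shows how those figures fit together. The sketch below assumes a 365-day simulated year and decimal terabytes. The article does not state the numeric precision or any compression used for the output, so the bytes-per-value figures are assumptions for comparison only.

```python
# Figures quoted in the article
VARIABLES = 5                          # temperature, two speed components, salt, pressure
GRID_POINTS = 2_400_000_000            # 2.4 billion points
OUTPUT_BYTES_PER_YEAR = 600 * 10**12   # about 600 terabytes

# Assumption: hourly output over a 365-day simulated year
HOURS_PER_YEAR = 24 * 365

def values_per_year():
    """Number of individual values produced per simulated year."""
    return VARIABLES * GRID_POINTS * HOURS_PER_YEAR

def implied_bytes_per_value():
    """Average storage per value implied by the quoted 600 TB."""
    return OUTPUT_BYTES_PER_YEAR / values_per_year()

def raw_terabytes(bytes_per_value):
    """Uncompressed size in terabytes for an assumed value width."""
    return values_per_year() * bytes_per_value / 10**12

print(values_per_year())
print(round(implied_bytes_per_value(), 2))
print(raw_terabytes(4), raw_terabytes(8))
```

Under these assumptions the model emits roughly 105 trillion values per simulated year, and the quoted 600 terabytes works out to a little under 6 bytes per value. That falls between the estimates for 4-byte values (about 420 terabytes) and 8-byte values (about 841 terabytes), so the published figure is consistent with the stated grid size and hourly output.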

McKee is one of the dozens of researchers at the institutions, several of which are in an alliance called the University Research Corridor, who have agreed to test the system. They’ll use it to work on projects in ocean modeling, biostatistics, cancer, degenerative diseases and aquatic biology.

“MI-OSiRIS is exciting as it will allow us to work with partner institutions to address the challenges of distributed big data that our research communities face and build a replicable model based on our experience,” said Andrew Keen, high-performance computing architect at MSU’s Institute of Cyber-Enabled Research.

For instance, Dr. Hiroko Dodge, professor of neurology at the U-M Medical School, and her colleagues at Wayne State will employ it in research studying early signs of Alzheimer’s. Sensors in the homes of seniors gather information around the clock about their walking speed, sleep patterns, and computer and phone usage. The project combines that with the seniors’ cognitive test scores, MRI results, genetic tests and more. Processing all of that into a form that can be analyzed can take a month. Then it must be analyzed.

Roger Pique-Regi, assistant professor of molecular medicine and genetics at Wayne State, will utilize MI-OSiRIS as he develops new computational methods that could provide insights into how human populations adapted to different environments during evolution. Some months, the projects he’s involved in generate a terabyte of data. The findings will illuminate the genetic architecture of complex traits such as cardiovascular disease.

“Direct access to data between our sister institutions will eliminate hours and even days lost copying massive files from one place to another,” said Patrick Gossman, deputy chief information officer for research at Wayne State. “The end result will be improved research productivity in health, aging, the environment and other areas important to us all.”

The Grand Rapids-based Van Andel Research Institute, which conducts biomedical research, also will be involved. It will house storage and network performance monitoring nodes that will allow partners to analyze data stored at each institution without having to move it.

“VARI’s Bioinformatics and Biostatistics Core receives data produced not only in the institute’s labs but also from MSU, U-M, WSU and other institutions across the country,” said Dr. Mary Winn, manager of VARI’s Bioinformatics and Biostatistics Core. “Improved connectivity will allow bioinformaticians and biostatisticians to analyze and deliver results more efficiently and effectively, ultimately allowing researchers to develop and test more hypotheses at the bench.

“The impacts on human disease brought about by enhanced data-sharing and improved collaborative efforts could be transformative.”

The new project begins on the heels of U-M’s recent announcement that it will invest $100 million in a Data Science Initiative over the next five years. Through the Michigan Institute for Data Science, the university will hire up to 35 new faculty members, support interdisciplinary research, provide new educational opportunities for students and expand U-M’s research computing capacity.

U-M also recently received a separate $2.4 million National Science Foundation grant to help establish a unique facility for refining complex, physics-based computer models with big data techniques, closing a gap in the U.S. research computing infrastructure. U-M’s Advanced Research Computing-Technology Services is building the computing infrastructure for MI-OSiRIS, the Data Science Initiative and the complex physics project.

 
