Cenic.org

Big Data: How the PRP Enables Scientists to Unlock Genetic Secrets

With innovative networking tools available on the Pacific Research Platform, Professor Alex Feltus at Clemson University in South Carolina analyzes massive genomics datasets to better understand how genes interact to cause disease in humans. His dream is for everyone to have access to the PRP, which would drastically accelerate scientific discovery.

Prof. Alex Feltus

“I have always wanted to help cure cancer,” Feltus said. “But to do a lot of that research, we have to develop bioinformatics software and advanced cyberinfrastructure. My dream is for anyone to be able to have access to the data, computational platforms, and workflows to do high-performance science. I’d like to have the bored kid in the back of a high school classroom analyzing data sets from The Cancer Genome Atlas (TCGA) on PRP rather than looking at Facebook.”

The PRP is a partnership of more than 50 institutions, led by researchers at the University of California campuses at San Diego and Berkeley. The PRP is supported by the National Science Foundation and includes the US Department of Energy’s Energy Sciences Network (ESnet), as well as scores of research universities in the nation and around the world. The PRP builds on the optical backbone of Pacific Wave, a joint project of CENIC and the Pacific Northwest GigaPOP (PNWGP), to create a seamless research platform that enables collaboration in a broad range of data-intensive fields and projects.

Interest in the PRP has sparked a call to scale the project up to a National Research Platform and even a Global Research Platform. As part of a new monthly webinar series that highlights PRP success stories, Feltus demonstrated in October how to run genomics workflows on the PRP’s Nautilus cluster. About 60 researchers and computer scientists attended the webinar to learn how PRP can support more ambitious research and realize new innovations.

“We want researchers to imagine doing more — expanding their research goals and imagining greater possibilities. The PRP is a platform to help them achieve more,” said Larry Smarr, PRP principal investigator and director of the California Institute for Telecommunications & Information Technology (Calit2) at UC San Diego.

“Nautilus has become a sort of potluck supercomputer for machine learning — researchers are bringing their computational and data resources to the Nautilus party to build a shared platform they own and control,” said Tom DeFanti, PRP co-principal investigator. “The 100-gigabit research networks like CENIC that we depend on remove the bottlenecks to long-distance data sharing, so this group-owned compute resource is now not only highly functional, but also inexpensively scalable, adaptable to new types of computing, and very clonable. It provides an on-ramp for researchers like Alex Feltus to learn how to efficiently access the much larger resources of NSF supercomputer centers and commercial clouds when they really get up to speed.”


Staff at Alex Feltus’s lab point to Clemson University’s "East of the Mississippi" Pacific Research Platform node. From left to right, master’s degree student Reed Bender, PhD candidate Ben Shealy, and cloud architect Cole McKnight. Source: Alex Feltus 2019.

Running Workflows at the Petascale

A trailblazer in the field of bioinformatics, which combines biology and computer science, Feltus uses the Pacific Research Platform’s Nautilus cluster to quickly and efficiently run deep-learning oncogenomics workflows on DNA datasets that amount to a petabyte. To comprehend the size of this mountain of data, consider that one petabyte of storage could hold roughly 11,000 movies, which would take two and a half years of nonstop viewing to watch.
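As a rough sanity check on these figures, the arithmetic behind the comparison can be sketched in a few lines of Python. The sketch assumes a decimal petabyte (10^15 bytes) and, for the transfer estimate, an idealized 100-gigabit link running at full line rate with no protocol overhead. Both are simplifying assumptions for illustration, not measurements from the PRP.

```python
# Back-of-the-envelope check of the petabyte comparison.
# Assumptions (not from the article): decimal units, ideal link throughput.

PETABYTE_BYTES = 10**15          # 1 PB in decimal units
MOVIES = 11_000                  # number of movies quoted in the article
YEARS_OF_VIEWING = 2.5           # nonstop viewing time quoted in the article
LINK_BITS_PER_SECOND = 100 * 10**9  # idealized 100-gigabit research network link

# Implied size of each movie, in gigabytes (about 91 GB).
gb_per_movie = PETABYTE_BYTES / MOVIES / 10**9

# Implied running time of each movie, in hours (about 2 hours).
hours_per_movie = YEARS_OF_VIEWING * 365 * 24 / MOVIES

# Time to move the whole petabyte over the idealized link (about 22 hours).
transfer_hours = PETABYTE_BYTES * 8 / LINK_BITS_PER_SECOND / 3600

print(f"Implied movie size:   {gb_per_movie:.1f} GB")
print(f"Implied movie length: {hours_per_movie:.2f} hours")
print(f"Transfer at 100 Gbps: {transfer_hours:.1f} hours")
```

Under these assumptions the two quoted figures agree with each other: about 11,000 two-hour movies add up to roughly two and a half years of viewing. Real transfers would take longer than the idealized estimate because of protocol overhead and storage speed.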

“My lab is trying to move away from a reductionist approach to science and work toward a holistic approach as things get more complex,” Feltus said. “I want to see the real complexity of the system. We get frustrated doing this kind of work, but at the same time we get closer to reality.”

The Feltus lab is focused on fighting disease and improving agricultural production. The lab’s active projects include: discovery of genetic subsystems in legume-microbe symbiosis that can be engineered into other plants so they can make their own fertilizer; elucidation of gene expression patterns in brain tissue for better diagnosis of intellectual disability; and detection of tumor-specific gene alterations in kidney and other tumors of relevance to precision medicine.

Demonstrating the Power of Nautilus

Clemson added a node this year to the PRP’s Nautilus cluster, a cost-effective hypercluster for running containerized big-data applications. Nautilus is equipped with a cloud of compute and storage nodes, hundreds of graphics processing units (GPUs), and thousands of kilometers of high-bandwidth fiber paths between institutions and regional/national networks around the world. Nautilus also uses Kubernetes, the open-source container-orchestration system originally developed by Google, to automate the deployment, scaling, and management of applications.

The Feltus lab’s interdisciplinary team runs workflows that pull data from deep public repositories such as TCGA and the National Center for Biotechnology Information (NCBI), move data across advanced research and education networks such as Pacific Wave and Internet2, and test and store machine learning workflows on various classic- and cloud-based systems, including the PRP Nautilus cluster. The lab uses open-source genomics workflows including GEMmaker, KINC, and Gene Oracle. “Since the Nautilus cluster came online, we’ve been using it to do a lot of work,” said Feltus. “I’m very interested in scaling it out because that helps us expand into the cloud.”

During the webinar, Feltus gave a live demonstration of the PRP by running a realistic workflow to find gene co-expression network relationships in yeast. Feltus used his Nautilus Kubernetes namespace, DeepGTEx-PRP, where his lab has designated storage and can add collaborators. In less than 20 minutes, Nautilus provided results.
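For readers unfamiliar with the idea, the core of a gene co-expression analysis can be illustrated with a small sketch: compute a correlation between every pair of genes across samples, and keep the strongly correlated pairs as network edges. The example below is a generic illustration using Pearson correlation and a fixed threshold. It is not the implementation used by KINC or the workflow Feltus ran, and the gene names, expression values, and the `coexpression_edges` helper are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coexpression_edges(expression, threshold=0.9):
    """Return (gene_a, gene_b, r) for every gene pair with |r| >= threshold."""
    edges = []
    for gene_a, gene_b in combinations(sorted(expression), 2):
        r = pearson(expression[gene_a], expression[gene_b])
        if abs(r) >= threshold:
            edges.append((gene_a, gene_b, round(r, 3)))
    return edges


# Made-up expression levels for four hypothetical genes across five samples.
toy_expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],  # rises with geneA
    "geneC": [5.0, 4.1, 2.9, 2.0, 1.1],  # falls as geneA rises
    "geneD": [3.0, 1.0, 4.0, 1.0, 5.0],  # no clear relationship to the others
}

edges = coexpression_edges(toy_expression)
for gene_a, gene_b, r in edges:
    print(f"{gene_a} -- {gene_b}: r = {r}")
```

With four genes there are only six pairs to check, but the number of pairs grows roughly with the square of the number of genes, so datasets with tens of thousands of genes benefit from a cluster such as Nautilus.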

“We have used Nautilus as a research platform to screen through thousands of gene combinations associated with normal human tissue development and shifts to a tumor state, for example. We also use Nautilus to develop open-source scientific workflows on a democratized Kubernetes platform before we shift to commercial cloud platforms for scale-up. This dramatically reduces our development costs. Further, Nautilus is an easy way to teach students how to run scientific workflows in the cloud, which is an important skill for workforce development,” Feltus explained. “We love Nautilus so much that we leveraged funds from our NSF-funded Scientific Data Analysis at Scale (SciDAS) project to pay for the node at Clemson.”

The Future of Big Data Collaboration

Expanding the PRP to a national and global scale is critical to addressing end-to-end data sharing as researchers require larger, faster systems for growing datasets. “This is not just a genomics problem,” Feltus said. “Every field has a scale-up issue.”

The PRP is the future of big data collaboration. The platform supports a broad range of data-intensive research projects that will impact science and technology worldwide, including projects on galaxy formation and evolution, telescope surveys, particle physics data analysis, simulations for earthquakes and natural disasters, climate modeling, and virtual reality and ultra-high-resolution video development.

Watch the monthly webinar series to discover more success stories like that of the Feltus lab.

Watch the presentation, “The National Research Platform: An Update on Progress Towards Scaling,” from CENIC’s 2019 Conference.
