
Drowning in Data

New web-based framework helps scientists analyze and integrate data

By Emily Kummerfeld | Bond LSC

Large-scale data analysis on computers is not exactly what comes to mind when thinking about biological research.

But these days, the potential benefit of work done in the lab or the field depends on it. That’s because research often doesn’t focus on a single biological process in isolation; each process must be viewed within the context of others.

Known as multi-omics, this particular field of study seeks to draw a clearer picture of dynamic biological interactions from gigantic amounts of data. But, how exactly can scientists suitably weave multiple streams of information together, especially considering technology limits and other biological variables?

Trupti Joshi and her team are seeking to find a solution to that problem.

Joshi, as part of the Interdisciplinary Plant Group faculty, works on translational bioinformatics to develop a web-based framework that can analyze large multi-omics data sets, appropriately entitled “Knowledge Base Commons” or KBCommons for short. She describes KBCommons as “a universal, comprehensive web resource for studying everything from genomics data including gene and protein expression, all the way to metabolites and phenotypes.”

Her work began about eight years ago with soybeans. For the Soybean Knowledge Base (SoyKB), her team developed many of its own data analysis tools for soybean research, then realized the same tools could help research on other organisms. From there sprouted the Knowledge Base Commons, intended for looking at plant, animal, crop or disease datasets without the need to “reinvent the wheel” each time.


Soybean plants used in research that utilizes the SoyKB web-based network. | Emily Kummerfeld, Bond LSC

“Our main focus has been in enabling translational genomics research and applications from a biological user’s perspective, and so our development has been providing graphic visualization tools,” Joshi said.

Those tools provide an array of colorful graphics from basic bar graphs to assorted colored pie charts to help the researcher better analyze the data once data has been added to the KBCommons.

Colorful graphs and comparisons let researchers look past the lines of text and tables full of numbers that represent genes, plant traits or other experimental results, making the interpretation of data much easier and more efficient.

One particular tool allows the researcher to look at the differential genes of four different comparisons or samples at the same time. Differential genes are the genes in a cell that respond differently under different experimental conditions. For example, a blood cell and a skin cell both have the same DNA, but some genes that are expressed in the skin cell are not expressed in the blood cell. With this KBCommons tool, a researcher can examine genes to see “what are the common ones, what are the unique ones to that, and at the same time look at the list of the genes and their functions directly on the website, without having to really go and pull these from different websites or be working with Excel sheets,” Joshi explained.
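The common-versus-unique comparison Joshi describes boils down to set operations. The following is a minimal Python sketch, not KBCommons code: the gene identifiers, the comparison names and the helper `common_and_unique` are hypothetical and exist only for illustration.

```python
# Hypothetical illustration: not part of KBCommons or SoyKB.
def common_and_unique(gene_sets):
    """Return genes shared by every comparison and genes unique to each one."""
    names = list(gene_sets)
    # Genes that appear in all comparisons.
    common = set.intersection(*(set(gene_sets[n]) for n in names))
    unique = {}
    for name in names:
        # Union of every other comparison's genes.
        others = set().union(*(set(gene_sets[o]) for o in names if o != name))
        unique[name] = set(gene_sets[name]) - others
    return common, unique


# Made-up gene IDs for four made-up comparisons.
comparisons = {
    "drought_vs_control": {"geneA", "geneB", "geneC"},
    "heat_vs_control": {"geneA", "geneC", "geneD"},
    "flood_vs_control": {"geneA", "geneE"},
    "salt_vs_control": {"geneA", "geneC", "geneF"},
}

common, unique = common_and_unique(comparisons)
print("common:", sorted(common))
for name, genes in unique.items():
    print(name, "unique:", sorted(genes))
```

A gene that shows up in two or three of the four lists, such as geneC here, lands in neither group, which is why a four-way diagram also needs the in-between regions.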

She envisions KBCommons as a tool to enable translational research as well. Users will be able to compare crops, such as legumes and maize for food security studies, or link research between veterinary medicine and human clinical studies for better therapies.

KBCommons is intended for a wide range of users, and Joshi is keenly aware of its potential users right here at MU.

One current user of the Soybean Knowledge Base (SoyKB) system is Gary Stacey, whose lab at Bond Life Sciences Center studies soybean genomics and is, to date, the longest-standing user of the SoyKB resource. Like many researchers, Stacey explained the need for a program like SoyKB that can process enormous amounts of data.

“The reason it’s called ‘Knowledge Base’ is the idea that we’re putting information in, and what we hope to get out is knowledge, because information is different than knowledge,” he said. “We don’t just want to collect stamps, we want to be able to actually make some sense out of it … By having a place to store the data, and then more importantly have a place to analyze it and integrate it, it allows us to ask better questions.”

This is essential, given that one soybean genome is 1.15 GB in size, and one thousand soybean genome sequences could generate 30 to 50 TB of raw sequencing data and tens of millions of genomic variations (SNPs).
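To put those figures in per-genome terms, here is a quick back-of-the-envelope calculation in Python. It uses only the numbers quoted above and assumes decimal units (1 TB = 1,000 GB); the variable names are ours.

```python
# Back-of-the-envelope arithmetic using only the figures quoted in the article.
# Assumption: decimal units, 1 TB = 1,000 GB.
GB_PER_TB = 1000

genomes = 1000
raw_tb_low, raw_tb_high = 30, 50  # raw sequencing data for 1,000 genomes

# Average raw sequencing data per genome, in GB.
per_genome_low_gb = raw_tb_low * GB_PER_TB / genomes
per_genome_high_gb = raw_tb_high * GB_PER_TB / genomes

print(f"Raw data per genome: {per_genome_low_gb:.0f} to {per_genome_high_gb:.0f} GB")
```

That works out to an average of roughly 30 to 50 GB of raw sequencing data for each genome.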

But such numbers are modest compared to the program’s true capabilities.

“The KBCommons system is so powerful that it can allow you to run thousands of genomes at the same time using our XSEDE gateway allocations,” Joshi said. “This whole scalability is a unique feature of KBCommons, which a lot of databases do not provide, and we are happy we have been able to bring that to our MU Faculty collaborators on these projects, so that they can really utilize the remote high performance computing (HPC), cloud storage and new evolving techniques in the field.”


KBCommons is a new web-based network for biological data analysis and integration developed by students. | Emily Kummerfeld, Bond LSC

Mass data capability and colorful graphs aside, her favorite part is who exactly is designing the program.

“What I like most about KBCommons is that it serves as a training and development ground and is developed by students, undergraduate and graduate students from computer science and our MUII informatics program.”

KBCommons is still under development, but publication and access for all users are planned for the end of this year or early 2018. Users will be able not only to view public data sets, but also to add their own private data sets and establish collaborative groups to share data.

Dr. Trupti Joshi is an Assistant Professor in the Department of Health Management Informatics, the Director for Translational Bioinformatics with the School of Medicine, and core faculty of the MU Informatics Institute, the Department of Computer Science and the Interdisciplinary Plant Group.

 

Seminal work

How unruly data led MU scientists to discover a new microbiome
By Roger Meissen | MU Bond Life Sciences Center


This seminal vesicle contains a newly discovered microbiome in mice. Some of its bacteria, like P. acnes, could lead to higher occurrences of prostate cancer. | contributed by Cheryl Rosenfeld

It’s a strange place to call home, but seminal fluid offers the perfect environment for particular types of bacteria.

Researchers at MU’s Bond Life Sciences Center recently identified new bacteria that thrive there.

“It’s a new microbiome that hasn’t been looked at before,” said Cheryl Rosenfeld, a Bond LSC investigator and corresponding author on the study. “Resident bacteria can help us or be harmful, but one we found called P. acnes is very important from the standpoint of men. It can cause chronic prostatitis that results in prostate cancer. We’re speculating that the seminal vesicles could be a reservoir for this bacteria and when it spreads it can cause disease.”

Experiments published in Scientific Reports, a journal published by Nature, indicate these bacteria may start disease leading to prostate cancer in mice and could pass from father to offspring.


A place to call home

From the gut to the skin and everywhere in between, bacterial colonies can both help and hurt the animals or humans they live in.

Seminal fluid offers an attractive microbiome, a niche environment where specific bacteria flourish and impact their hosts. Not only is this component of semen chock-full of sugars that bacteria eat, it offers a warm, protected atmosphere.

“Imagine a pond where bacteria live: it’s wet, it’s warm and there’s food there. That’s what this is, except it’s inside your body,” said Rosenfeld. “Depending on where they live, these bacteria can influence our cells, produce hormones that replicate our own hormones, but can also consume our sugars and metabolize them or even cause disease.”

Rosenfeld’s team wasn’t trying to find the perfect vacation spot for a family of bacteria. They initially wanted to know what bacteria in seminal fluid might mean for offspring of the mice they studied.

“We were looking at the epigenetic effects — the impact the father has on the offspring’s disease risk — but what we saw in the data led us to focus more on the effects this bacterium, P. acnes, has on the male itself,” Rosenfeld said. “We were thinking more about effect on offspring and female reproduction — we weren’t even considering the effect the bacteria that live in this fluid could have on the male — but this could be one of the more fascinating findings.”

But, how do you figure out what might live in this unique ecosystem and whether it’s harmful?

First, her team found a way to extract seminal fluid without contamination from potential bacteria in the urinary tract.

“We gowned up just like for surgery and we had to extract the fluid directly from the seminal vesicles to avoid contamination,” said Angela Javurek, primary author on the study and recent MU graduate. “You only have a certain amount of time to collect the fluid because it hardens like glue.”

Once they obtained these samples, they turned to a DNA approach, sequencing them using MU’s DNA Core.

They compared the results to bacteria in fecal samples from the same mice to see whether the bacteria in seminal fluid were unique. They also compared samples from normal mice and from mice in which estrogen receptor genes had been removed.


The difference in the data

It sounds daunting to sort and compare millions of DNA sequences, right? But, the right approach can make all the difference.

“A lot of it looks pretty boring, but bioinformatics allow us to decipher large amounts of data that can otherwise be almost incomprehensible,” said Scott Givan, the associate director of the Informatics Research Core Facility (IRCF) that specializes in complicated analysis of data. “Here we compared seminal fluid bacterial DNA samples to publicly available databases that come from other large experiments and found a few sequences that no one else has discovered or at least characterized, so we’re in completely new territory.”

The seminal microbiome continued to stand out when compared to mouse poop, revealing 593 unique bacteria.

One of the most important was P. acnes, a bacterium known to cause chronic prostatitis that can lead to prostate cancer in man and mouse. It was abundant in the seminal fluid, and even more so when estrogen receptor genes were present.

“We’re essentially doing a lot of counting, especially across treatments to see if particular bacteria species are more common than others,” said Bill Spollen, a lead bioinformatics analyst at the IRCF. “The premise is that the more abundant a species is, the more often we’ll see its DNA sequence and we can start making some inferences to how it could be influencing its environment.”
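The counting Spollen describes can be sketched in a few lines of Python. This is an illustration only, not the IRCF pipeline: the read assignments, the sample names and the helper `relative_abundance` are invented, and a real analysis would first have to match each sequence against a reference database.

```python
from collections import Counter


def relative_abundance(assigned_reads):
    """Turn a list of per-read species labels into fractions of the sample."""
    counts = Counter(assigned_reads)
    total = sum(counts.values())
    return {species: n / total for species, n in counts.items()}


# Invented data: each entry is the species one sequenced read was assigned to.
samples = {
    "treatment_1": ["P. acnes"] * 6 + ["species_X"] * 3 + ["species_Y"] * 1,
    "treatment_2": ["P. acnes"] * 2 + ["species_X"] * 6 + ["species_Y"] * 2,
}

abundance = {name: relative_abundance(reads) for name, reads in samples.items()}
for name, fractions in abundance.items():
    print(name, fractions)
```

Dividing by the total number of reads in each sample keeps the comparison fair when one sample was sequenced more deeply than another.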

Although this discovery excites Rosenfeld, much is unknown about how this new microbiome might affect males and their offspring.

“We do have this bacteria that can affect the male mouse’s health, that of his partner and his offspring,” Rosenfeld said. “But we’ve been studying microbiology for a long time and we still find bacteria within our own bodies that nobody has seen before. That blows my mind.”

The study, “Discovery of a Novel Seminal Fluid Microbiome and Influence of Estrogen Receptor Alpha Genetic Status,” was recently published in Scientific Reports, a journal published by Nature.

 

SoyKB: Leading the convergence of wet and dry science in the era of Big Data

Yaya Cui, an investigator in plant sciences at the Bond Life Sciences Center, examines data on fast neutron soybean mutants that are represented on the SoyKB database.


The most puzzling scientific mysteries may be solved on the same kind of machine on which you’re likely reading this sentence.

In the era of “Big Data,” many significant scientific discoveries (the development of new drugs to fight diseases, strategies of agricultural breeding to solve world hunger and figuring out why the world exists) are being made without ever setting foot in a lab.

Developed by researchers at the Bond Life Sciences Center, SoyKB.org allows international researchers, scientists and farmers to chart the unknown territory of soybean genomics together — sometimes continents away from one another — through that data.

 

Digital solutions to real-world questions

As part of the Obama Administration’s $200 million “Big Data” Initiative, SoyKB (Soy Knowledge Base) was born.

The digital infrastructure changes the way researchers conduct their experiments dramatically, according to plant scientists like Gary Stacey, Bond LSC researcher, endowed professor of soybean biotechnology and professor of plant sciences and biochemistry.

“It’s very powerful,” Stacey said. “Humans can only look at so many lines in an Excel spreadsheet, then it just kind of blurs. So we need these kinds of tools to be able to deal with this high-throughput data.”

The website, managed by Trupti Joshi, an assistant research professor in computer science at MU’s College of Engineering, enables researchers to develop important scientific questions and theories.

“There are people that, during their entire career, don’t do any bench work or wet science, they just look at the data,” Stacey said.

The Gene Pathway Viewer available on SoyKB shows different signaling pathways and points to the function of specific genes so that researchers can develop improvements for badly performing soybean lines.

“It’s much easier to grasp this whole data and narrow it down to basically what you want to focus on,” Joshi said.

A 3D protein modeling tool lends itself especially to drug design. A pharmaceutical company could test a hypothesis formulated solely by data analysis, and in some situations the proposed drug turns out to yield the expected results.

The Big Data initiative drives a blending of “wet science” — conducting experiments in the lab and gathering original data — and “dry science” — using computational methods.

Testament of the times?

“Oh, absolutely,” Joshi said.

 

Collaboration between the “wet” and “dry” sciences

Before SoyKB, data from numerous experiments would be gathered and disregarded, with only the desired results analyzed. The website makes it easy to dump all of the data gathered to then be repurposed by other researchers.

“With these kinds of databases now, all the data is put there so something that’s not valuable to me may be valuable to somebody else,” Stacey said.

Joshi said infrastructure like SoyKB is becoming more necessary in all realms of scientific discovery.

“(SoyKB) has turned out to be a very good public resource for the soybean community to cross reference that and check the details of their findings,” she said.

Computer science spares researchers from having to reinvent the wheel with their own digital platforms. SoyKB has a translational infrastructure with computational methods and tools that can be used for many disciplines like health sciences, animal sciences, physics and genetic research.

“I think there’s more and more need for these types of collaborations,” Joshi said. “It can be really difficult for biologists to handle the large scope of data by themselves and you really don’t want to spend time just dealing with files. You want to focus more on the biology, so these types of collaborations work really well.

“It’s a win-win situation for everyone,” she said.

The success of SoyKB was perhaps catalyzed by Joshi. She adopted the website and the compilation of data in its infant stages as her PhD dissertation.

Joshi is unique because she has both a biology degree and a computer science background. Stacey said Joshi, who has “had a foot in each camp,” serves as an irreplaceable translator.

Most recently, the progress of SoyKB as part of the Big Data Initiative was presented at the International Conference on Bioinformatics and Biomedicine in Shanghai in December 2013. The ongoing project is funded by NSF grants.