GenomeQuest on Cloud Mine for Next-Generation Sequencing Data




By Kevin Davies

July 29, 2009 | GenomeQuest today announced the launch of GenomeQuest 6.0Beta, a new sequence data management solution that provides a web-accessible, cloud computing environment for researchers to “align and mine” next-generation sequencing data.  

“There’s a lot of interest in the cloud,” says president/CEO Ron Ranauro. “In a sense, GenomeQuest has built the first commercial application-specific cloud for biocomputing.”

Users can access the cloud from any internet-connected client. “Sitting behind all this is a 500-CPU compute farm for processing that’s purpose-built for processing volumes of sequence data,” Ranauro says.

“We’ve always had a platform technology, but when I got to the company in 2002, the market wasn’t really ready for another platform,” Ranauro told Bio-IT World.  “The Human Genome Project had crested by then, a lot of pharma, biotech and major academic labs had defined their platform over the preceding five years. What we did, which was a good strategy, was focus our application on … mining genetic sequence data for IP.”

The strategy netted more than 100 customers, including 16 of 20 big pharma companies and several large agricultural science customers, who recognized GenomeQuest as a powerful search engine. “The strategy has always been to create an enterprise sequence data management platform,” Ranauro says. “The question we’d been facing is when would the market be ready?” The launch of next-gen machines from 454, Illumina and Applied Biosystems in 2006-07 marked what Ranauro calls the “catalytic event for causing the enterprise and academic markets to rethink the way they’re managing sequence data.”

Easy Button

GenomeQuest 6.0Beta is the culmination of efforts to extend GenomeQuest’s core technology into a broad platform capable of managing sequence data from raw FASTA to high-level pathway information. Ranauro explains: “Sequence data is (sic) not structured data, so it doesn’t lend itself to data management strategies that are organized to handle structured data very well. From the beginning, we took a distributed computing dataflow model for managing the unstructured sequence data. That gives us the scalability.”

Using the GenomeQuest Engine to provide scalability, GenomeQuest 6.0 addresses the needs of three key constituencies -- the researcher, the bioinformatician, and the IT manager.

Researchers “don’t care so much about bioinformatics,” says Ranauro. “The early visionary market for next-gen sequencing wants to do everything, but the mainstream market wants ‘the Easy button.’ They also want some flexibility to tune parameters. They’re not interested in managing data but want common workflows.”

GenomeQuest delivers the two largest production workflows: gene expression and variant (SNP) discovery. “Any researcher can self-register and upload a file, or use the sample file and start getting results very quickly.”

Bioinformaticians, on the other hand, “have to be able to access the data models and the algorithms through an open API. We’ve put a tremendous investment in exposing the application programming interface at multiple levels. Since it’s a web application, there’s a URL API used to script and access any data or workflow or database in the system. There’s a scripted command line API which most bioinformatics developers will prefer, which also has this very nice property of providing access to data, workflows, results and analytics while hiding the details of the computing and the reference data itself. A bioinformatics [specialist] can use the command line API to focus on the task at hand, and not the specifics of the IT.”

And from the perspective of the IT manager, scalability is critical. “The volumes of these next-gen machines just continues to escalate,” says Ranauro. “A system that won’t scale is going to be a difficult investment to justify.”

Web Gem

Ranauro half-jokingly says GenomeQuest is becoming a web company. Normally, offering researchers a demo requires multiple steps involving a salesman, a web demo, and registering for an account. “Now the researchers can come to the site, self-register, use a sample data set or upload their own, run workflows and mine the results.” The available sample data include donations from Illumina, Life Technologies and 454: metagenomic pathogen data (454), and variant detection workflows and gene expression data (Illumina, Life Technologies).

GenomeQuest 6.0 fits into the next-gen workflow from the generation of the raw data. Ranauro describes the pipeline: “We would pick it up from the raw FASTA files, post image processing – it’s the read and an ID... That file can be uploaded. A multigigabyte file can take half a day. If it’s an even bigger file, they can sneakernet it to us.” (GenomeQuest is currently using “fairly rudimentary” compression, but Ranauro acknowledges “there are better ways of doing it,” and is open to leveraging data-transfer services from companies such as Aspera.)
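To make the "read and an ID" structure concrete, here is a minimal sketch of reading FASTA records in Python. It is an illustration only, not GenomeQuest's code: the function name `parse_fasta` and the sample reads are hypothetical, and the sketch assumes plain FASTA in which each record starts with a `>` header line.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, sequence) pairs from FASTA-formatted lines."""
    read_id = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A header line starts a new record; emit the previous one first.
            if read_id is not None:
                yield read_id, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the ID.
            header = line[1:].split()
            read_id = header[0] if header else ""
            chunks = []
        elif read_id is not None:
            # Sequence lines may be wrapped; collect them until the next header.
            chunks.append(line)
    if read_id is not None:
        yield read_id, "".join(chunks)


# Hypothetical sample data, invented for this illustration.
sample = """>read_1 sample
ACGTAC
GTTA
>read_2
TTGACA
""".splitlines()

records = list(parse_fasta(sample))
for rid, seq in records:
    print(rid, len(seq))
```

Because the function is a generator, it handles one record at a time, which matters for the multigigabyte files mentioned above.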

“The end user is presented with a simple web application where (s)he can select the reference genome… They can also select how much extended annotation they want. Do they want to know if the variants found are novel related to dbSNP? Are the variants falling inside coding regions..? The result file is a sequence database of the assembly which can be mined according to those properties. You might say, give me only the novel SNPs in coding regions of very high quality.

“Being able to mine and filter the results is the secret sauce of the scalable engine. Now the biologists can do this work without needing to be a programmer, through a very simple web application. That’s the contribution we’re making – allowing a broader, mainstream audience to participate fully in next-generation sequencing.”
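The kind of query Ranauro describes ("only the novel SNPs in coding regions of very high quality") can be sketched as a simple filter. This is a hedged illustration, not GenomeQuest's data model or API: the `Variant` fields, the function name, and the quality threshold of 30 are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Variant:
    # All fields are hypothetical stand-ins for annotations a result set might carry.
    position: int
    in_dbsnp: bool          # True if the variant is already listed in dbSNP
    in_coding_region: bool  # True if the variant falls inside a coding region
    quality: float          # an assumed quality score; higher is better


def novel_coding_high_quality(
    variants: List[Variant], min_quality: float = 30.0
) -> List[Variant]:
    """Keep variants that are absent from dbSNP, in coding regions, and high quality."""
    return [
        v
        for v in variants
        if not v.in_dbsnp and v.in_coding_region and v.quality >= min_quality
    ]


# Invented example records, for illustration only.
variants = [
    Variant(1001, in_dbsnp=True, in_coding_region=True, quality=45.0),
    Variant(2002, in_dbsnp=False, in_coding_region=True, quality=52.0),
    Variant(3003, in_dbsnp=False, in_coding_region=False, quality=60.0),
    Variant(4004, in_dbsnp=False, in_coding_region=True, quality=12.0),
]

selected = novel_coding_high_quality(variants)
print([v.position for v in selected])
```

In the product described here, the equivalent selection is made through the web application rather than in code.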

Biologists can select and create custom views of the appropriate reference sequence or subsets thereof. “It’s providing data management, but data isn’t really moving around or up and down from the server to the PC. All the manipulation is happening in the cloud but the user is able to manipulate [it].”

The web architecture enables everything to be shared, including workflow, result databases, and selected views on results. “Those can be used as hypothesis drivers for the next set of experiments,” he says.

Upside and Roadmaps

While Ranauro has his sights set on mainstream users, he sees upside elsewhere. “In the fullness of time, a genome center is going to want to get onto the cloud, because they have to lower their costs, just the same as anybody else, to get to the $1000 genome. It might be that GenomeQuest’s platform provides a smoother path onto the cloud than taking all the in-house infrastructure and trying to recreate it on Amazon… We see ourselves providing the on-ramp to the cloud.”

While the GenomeQuest platform currently runs on a homegrown datacenter cloud, Ranauro says, “we’re actively looking at scaling options that might include Amazon. Hosting this on Amazon is a very real possibility, but it’s not currently on our roadmap.”

De novo assembly functionality is on the roadmap, however, for the second half of 2009. “We’ll provide the computational and alignment engine but we’ll rely on the industry for the assembly. There are important assemblers, such as 454 Newbler, today. For short reads, later this year – there we’ll rely probably on Velvet or ABySS.”

Ranauro also sees a rich environment for next-gen software companies such as CLC bio and DNAStar to add value. “Those tools have a very rich feature set. There’s always going to be a researcher that can benefit. The problem we’re solving is, having that data on the PC is having it siloed again and the industry goes back to where it was ten years ago, with silos of data.”

Ranauro says he’s actively looking for feedback from early users. That will go a long way to determining how long the ‘beta’ designation lasts, but he says early users “are loving it.” He continues: “We’re actually giving a very powerful sequence data management capability away for free. You don’t have to do next-gen sequencing to get high value from this web site!”

“This is the only product that can process the data and then mine it using an easy-to-use web-based platform,” says Ranauro. “There’s a reason why the IT industry went from client-server to web-based. It provides centralized management, local control, more of a tractable knowledge engineering environment for an enterprise. We don’t see our customers wanting to move data up and down between PCs and servers or across networks. They really want to have it stored centrally but be able to manipulate it easily. We’re really the only company offering that.

“The ability to align the data and mine the alignments using sequence analysis and annotation simultaneously in a scalable way -- no-one has that!”

