Martin Farach-Colton, Professor of Computer Science at Rutgers University, says:
The human genome isn't that big. At three billion base pairs, each of which takes two bits to represent even uncompressed, the entire genome can be represented in roughly 750 MB, and even less when compressed. So why is genomic data considered to be Big Data?
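As a quick sanity check on that figure, here is the back-of-the-envelope arithmetic in a few lines of Python (a sketch added for illustration, not part of the lab's tooling):

```python
# Back-of-the-envelope size of one uncompressed human genome,
# using the two-bits-per-base-pair encoding described above.
base_pairs = 3_000_000_000   # ~3 billion base pairs
bits_per_base = 2            # A, C, G, T fit in 2 bits
total_bytes = base_pairs * bits_per_base / 8
print(f"{total_bytes / 1e6:.0f} MB")  # -> 750 MB (decimal megabytes)
```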
Modern genomic analysis does not depend on the DNA sequence of a single individual. Rather, many lines of research involve comparative analysis of the DNA and other features of a population of individuals. For example, the Philip Awadalla lab at the University of Montreal focuses on Medical and Population Genomics. The lab consolidates genetic markers, such as single-nucleotide polymorphisms (SNPs), with other data, such as gene-expression profiles, and then correlates these data with the expression patterns of diseases. In this way it seeks to address questions about how genetics and the environment influence the frequency and severity of diseases in human populations.
The key word is populations. One genome may be small, but collect data on enough people and it adds up. It's pretty easy to bring the lab's database infrastructure to its knees, and since researchers rely heavily on querying the data, a slow database can seriously hold back research progress.
For example, researcher Thibault de Malliard, who oversees the lab's data, points out that he adds hundreds of thousands of records to the lab's MySQL database. But as the database grew to 200 GB, its performance plummeted, and the lab hopes to accumulate more than 1 TB of data.
Within the database, the bottleneck turned out to be the MyISAM storage engine. De Malliard tried out Tokutek's TokuDB storage engine, which he had heard offered better performance on large data. He set up two otherwise identical MySQL instances, one running MyISAM and the other TokuDB, then tested each with a 200 GB table containing two billion records, representing around 1,500 samples with 1.3M positions, the lab's current SNP set for CARTaGENE RNAseq. All tests ran on a CentOS 5 server with 48 GB of RAM and six Intel® Xeon® 2.27 GHz CPUs.
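The post does not include the schema or commands the lab used, but in MySQL an engine comparison like this comes down to creating the same table under each storage engine. Below is a minimal sketch of that setup with hypothetical connection details, table names, and columns, and assuming the TokuDB plugin is installed so that `ENGINE=TokuDB` is accepted:

```python
import mysql.connector  # pip install mysql-connector-python

# Hypothetical connection details and schema -- the post does not show the
# lab's actual table layout.
conn = mysql.connector.connect(host="localhost", user="bench",
                               password="secret", database="snp_bench")
cur = conn.cursor()

# Create the same table twice, once per storage engine under test.
for engine in ("MyISAM", "TokuDB"):
    cur.execute(f"""
        CREATE TABLE IF NOT EXISTS genotypes_{engine.lower()} (
            sample_id INT NOT NULL,
            position  BIGINT NOT NULL,
            genotype  CHAR(2) NOT NULL,
            PRIMARY KEY (sample_id, position)
        ) ENGINE={engine}
    """)
conn.commit()
cur.close()
conn.close()
```

From there, the same insert and query workload can be run against both tables and timed.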
TokuDB won out over MyISAM for the following reasons:
- Faster inserts -- Adding 1M records took 51 minutes with MyISAM but 1 minute with TokuDB. So inserting one sequencing batch of 48 samples with 1.5M positions would take about 2.5 days with MyISAM versus roughly an hour with TokuDB (see the arithmetic sketched after this list). The MyISAM figure keeps growing as the tables grow, while TokuDB's stays essentially constant.
- Flexibility -- With MyISAM, any change to a table's structure locks that table until the alteration completes. TokuDB lets the lab keep using the database while a table is being altered.
- Compression -- TokuDB compresses the data (the lab uses the default compression setting), so less data goes over the network and less is written to storage.
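The 2.5-day and one-hour figures in the first bullet follow directly from the measured per-million-record insert times. A quick sketch of that extrapolation:

```python
# Extrapolating one sequencing batch (48 samples x 1.5M positions)
# from the measured insert times of 51 min (MyISAM) and 1 min (TokuDB)
# per million records.
records = 48 * 1_500_000                 # 72 million records per batch
myisam_minutes = 51 * records / 1e6      # ~3,672 min, about 2.5 days
tokudb_minutes = 1 * records / 1e6       # ~72 min, roughly an hour
print(f"MyISAM: {myisam_minutes / (60 * 24):.1f} days, "
      f"TokuDB: {tokudb_minutes:.0f} minutes")
```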
“Data management is very important for a genomic research lab. The researchers make a lot of queries, and they want their data at their fingertips. Finding the rare record, one that has not been seen in any other database in the world, can mean the discovery of a new mutation or a gene marker that causes a disease,” noted de Malliard. “With epidemiology data, we search for what characterizes people who have a condition by comparing that subset against all the other people. TokuDB uniquely enables us to advance this research.”
Prof. Farach-Colton is an expert in algorithmics and information retrieval. He was an early employee at Google, where he worked on improving the performance of the web crawl and developed the prototype for the AdSense system. He received his MD from the Johns Hopkins School of Medicine and his PhD in Computer Science from the University of Maryland.