Manihot esculenta (cassava)



About the genome:


Overview

Cassava (Manihot esculenta) is grown throughout tropical Africa, Asia and the Americas for its starchy storage roots, and feeds an estimated 750 million people each day. Farmers choose it for its high productivity and its ability to withstand a variety of environmental conditions (including significant water stress) in which other crops fail. However, it has very low protein content, and is susceptible to a range of biotic stresses. Despite these problems, the crop production potential for cassava is enormous, and its capacity to grow in a variety of environmental conditions makes it the plant of the future for emerging tropical nations. Cassava is also an excellent energy source - its roots contain 20-40% starch that costs 15-30% less to produce per hectare than starch from corn, making it an attractive and strategic source of renewable energy.

The goals of the Cassava Genome Project are to generate a draft sequence of the cassava genome, and because of the humanitarian importance of the crop, to make that sequence available to all - freely and rapidly. Much of the utility of the genome sequence will come from the development of breeding tools, and as such a perfect reference genome sequence is not needed. Our sequencing strategies have been selected accordingly. The project has built upon a pilot initiated through the DOE-JGI Community Sequencing Program (CSP) by a 14-member consortium led by Claude Fauquet, Joe Tohme and Pablo Rabinowicz. This pilot project produced a little under 1x coverage from over 700,000 Sanger shotgun reads using plasmid and fosmid libraries, and it provided insights into the overall characteristics of the cassava genome, and a valuable source of Sanger paired-end sequences to be used later.

The main phase of the project, led by Steve Rounsley, Dan Rokhsar, Chinnappa Kodira, and Tim Harkins, began in Spring 2009 when 454 Life Sciences, a Roche company, partnered with DOE-JGI to provide the resources for whole genome shotgun sequencing of cassava using the 454 GS FLX Titanium platform. Nearly 61 million 454 reads (single and paired-end) were generated and combined with the Sanger data from the pilot project as input for genome assembly. The resulting assembly and its annotation are available through Phytozome and have also been deposited in GenBank in accordance with our commitment to early access and the Fort Lauderdale genome data release policy.

The University of Arizona has recently been awarded a 3-year $1.3 million grant by the Bill & Melinda Gates Foundation to expand and improve upon the initial cassava genome sequence. With its partners, DOE-JGI, 454 Life Sciences and University of Maryland, Baltimore, the newly funded project seeks targeted improvement of the genome assembly and SNP discovery via resequencing of many varieties of cassava. The SNP resource will also be accessible through Phytozome.

Statistics


Genome Size
This version (Cassava1) of the assembly consists of 11,243 scaffolds spanning 416Mb. Half of the assembled sequence is in the largest 514 scaffolds, each 180kb or larger.
Although cassava has an estimated genome size of ~760Mb, this initial assembly spans only 416Mb. We believe that the 416Mb represents nearly all of the genic regions of the genome, and that the missing portion is repetitive sequence that could not be assembled. Two pieces of evidence support this. (1) A large fraction of reads (both Sanger and 454) were not used by the assembly software, and these were primarily repetitive in nature. (2) Transcripts assembled from publicly available cassava ESTs (P. Rabinowicz, unpublished) were mapped to the genome assembly. We were able to map 95% of the transcripts, which indicates near-complete coverage of protein-coding genes in the assembly.
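The statement that half of the assembled sequence lies in the largest 514 scaffolds, each 180kb or larger, is the kind of figure usually reported as N50/L50. The following is a minimal sketch of how such a statistic can be computed from a list of scaffold lengths. The function name and the toy lengths are illustrative assumptions, not the Cassava1 data or the project's own tooling.

```python
def n50_stats(lengths):
    """Return (n50, l50) for a list of scaffold lengths.

    n50 is the length of the scaffold at which the running total, taken
    from the largest scaffold downwards, first reaches half of the
    assembly; l50 is the number of scaffolds needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no scaffold lengths given")


# Toy scaffold lengths (hypothetical, NOT the Cassava1 scaffolds).
toy_lengths = [500, 300, 200, 100, 50, 50]
n50, l50 = n50_stats(toy_lengths)
print(f"N50 = {n50}, reached with the largest {l50} scaffolds")
```

Applied to the real scaffold lengths, the same calculation would be expected to give figures comparable to the 514 scaffolds and 180kb quoted above.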
Loci
The current gene set (Cassava1.1) integrates 1.5 million ESTs with homology and ab initio-based gene predictions. 47,164 protein-coding loci have been predicted, of which 24,388 have ESTs covering more than 25% of their length.
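As an illustration of the "more than 25% of their length" criterion, the sketch below computes the fraction of a locus that is covered by at least one EST alignment. It assumes half-open coordinates and measures coverage against the genomic span of the locus. The original analysis may have measured against transcript length instead, so treat this as one possible reading rather than the method actually used. All names and coordinates are hypothetical.

```python
def covered_fraction(gene_start, gene_end, est_intervals):
    """Fraction of [gene_start, gene_end) covered by at least one EST.

    Coordinates are half-open. Overlapping ESTs are merged so that no
    base is counted twice.
    """
    gene_len = gene_end - gene_start
    if gene_len <= 0:
        raise ValueError("gene_end must be greater than gene_start")

    # Keep only ESTs that overlap the gene, clipped to the gene span.
    clipped = sorted(
        (max(start, gene_start), min(end, gene_end))
        for start, end in est_intervals
        if start < gene_end and end > gene_start
    )

    covered = 0
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / gene_len


# Hypothetical locus and EST alignments.
fraction = covered_fraction(1000, 2000, [(900, 1100), (1050, 1200), (1800, 2100)])
print(fraction, fraction > 0.25)
```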

FAQ

How was the genome sequenced?

Whole genome shotgun methodology
Although the first plant and animal genomes were sequenced by a BAC-by-BAC approach, almost all current animal and fungal genome sequencing projects use the whole genome shotgun strategy, in which the entire genome is randomly sheared, subcloned, and redundantly sequenced. The ease, cost-efficiency, and speed of the whole genome shotgun approach have made it the method of choice in many cases. However, such projects are still expensive with Sanger sequencing. Next generation sequencing platforms can increase the cost-efficiency dramatically, but present assembly challenges due to limited read length. This is one of the first publicly available plant genomes sequenced primarily with 454 technology.
How was the assembly generated?
This initial assembly of 61 million reads was produced by a collaboration between Steve Rounsley at University of Arizona and 454 Life Sciences using the Newbler assembly software. An incremental approach was taken where single end reads were assembled alone, and then paired end reads (Sanger and 454) were added.
Is it complete?
Comparison with a cassava unigene set (from a different genotype) suggests that 95% of known cassava genes are represented in the assembly. This result supports the claim that Cassava1.1 is largely complete with respect to "gene space." In addition to the gene space, repeat masking also shows that over 100Mb of repetitive sequence was assembled. We will be pursuing further improvements in the assembly with improved assembly algorithms as they become available.
Which germplasm was sequenced?
Both the Sanger and 454 sequence data were generated from a partially inbred line called AM560-2 which was generated at CIAT (International Center for Tropical Agriculture) in Cali, Colombia.

How do I find my favorite genes?

BLAST
To BLAST against the cassava genome with peptide or nucleotide probes, click here. The default BLAST database is a cassava genome assembly that has been masked for high copy repeats, and default BLAST parameters are suitable for use with dicot peptides and coding sequences. You can view your BLAST alignment against the genome by clicking on the hit of interest to see the detailed alignment, and then clicking on the scaffold name (shown in blue). If you're interested in transposable element families in the cassava genome, please DO NOT BLAST these; it'll just clog up our BLAST queue! Similarly, please don't BLAST entire BACs. Instead, download the assembly FASTA sequence and perform such BLASTs locally.
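For large queries such as transposable element families or whole BACs, a local search could look roughly like the sketch below. It assumes the NCBI BLAST+ command-line tools (makeblastdb and blastn) are installed, which is our assumption and not something this page specifies. The file names are placeholders. The script only builds the command lines, and runs them only if both tools are found on the PATH.

```python
import shutil
import subprocess


def makeblastdb_cmd(fasta_path, db_name):
    """Command line that builds a nucleotide BLAST database."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "nucl", "-out", db_name]


def blastn_cmd(query_path, db_name, out_path, evalue="1e-10"):
    """Command line for a nucleotide search with tabular (-outfmt 6) output."""
    return [
        "blastn",
        "-query", query_path,
        "-db", db_name,
        "-evalue", evalue,
        "-outfmt", "6",
        "-out", out_path,
    ]


def run_local_blast(fasta_path, query_path, db_name="cassava_local", out_path="hits.tsv"):
    """Run both steps if BLAST+ is installed; always return the commands."""
    commands = [
        makeblastdb_cmd(fasta_path, db_name),
        blastn_cmd(query_path, db_name, out_path),
    ]
    if shutil.which("makeblastdb") is None or shutil.which("blastn") is None:
        return commands  # BLAST+ not found: nothing is executed
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return commands


# Placeholder file names (assumptions): show the search command without running it.
print(" ".join(blastn_cmd("te_family.fa", "cassava_local", "hits.tsv")))
```

Calling run_local_blast("assembly.fa", "te_family.fa") with real file paths would run both steps. The E-value cutoff here is an arbitrary starting point and should be tuned to the query.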
Search
We have pre-aligned known cassava ESTs to the cassava sequence, along with current proteomes of poplar, soybean, castor bean, rice and Arabidopsis. If you enter into the Gbrowser "Search" box text keywords from common gene names like "nod1" or "agamous", or gene identifiers like "At1g12340," the result will be a list of genomic regions that hit ESTs or poplar/soybean/castor bean/rice/Arabidopsis genes that are associated with these words/identifiers. Clicking on the red diamonds will then bring you to the specific region of interest. Note that you may need to zoom in to see details, which are only shown over regions shorter than 70 kb.

How do I work with the cassava genome browser?

How can I view the cassava sequence and various genomic features?
To facilitate early use of the cassava genome the DOE-JGI and the UC Berkeley Center for Integrative Genomics have developed a simple genome browser using the Gmod/Gbrowse software. Due to the density of information, detailed features are only visible when looking at 70 kb or smaller regions. You may need to zoom in to get to this size. Typically, clicking on a feature will reveal its sequence and alignment to the genome. For gene models, you can also click to bo
How do I retrieve cassava sequence of interest to me?
From the browser, locate the region of interest. With your region in view, select "Download Sequence" from the menu above the Scroll/Zoom bar. Then click the "Go" button and the sequence will be displayed in your browser, ready to cut and paste. If you click on a gene model, you can retrieve the predicted peptide and coding sequence.
What happens when I click on a gene on the browser?
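If you have downloaded the assembly FASTA file, the same kind of region retrieval can be done offline. The sketch below uses only the Python standard library and assumes 1-based inclusive coordinates. The scaffold names and sequences are toy data, not real cassava scaffolds.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into a dict mapping record name -> sequence."""
    records = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            # Use the first word of the header as the record name.
            name = line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def fetch_region(records, scaffold, start, end):
    """Return bases start..end (1-based, inclusive) of the named scaffold."""
    seq = records[scaffold]
    if not (1 <= start <= end <= len(seq)):
        raise ValueError("region lies outside the scaffold")
    return seq[start - 1:end]


# Toy FASTA content (hypothetical); a real file would be opened with open(path).
toy = io.StringIO(">scaffold_1 toy\nACGTAC\nGTTT\n>scaffold_2\nGGGCCC\n")
records = read_fasta(toy)
print(fetch_region(records, "scaffold_1", 3, 6))
```

This reads the whole file into memory, which is acceptable for an assembly of this size. An indexed reader would be a better fit for repeated lookups.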
You'll see a web page that displays the predicted peptide, genomic span of the gene with coding exons shaded, and the (spliced) coding sequence.  From this page you can also launch BLAST vs. the NCBI non-redundant protein database or Phytozome.

Where do the various tracks on the genome browser come from?
How were repeats identified?
Sixteen-base-pair "words" (16-mers) that are over-represented in and clustered on the genome were used to define repetitive regions (J. Chapman, unpublished). These typically represent recently active retrotransposons and simple-sequence-repeats in the cassava genome. Nearly 16% of the assembly appears to be covered by such clustered/over-represented regions.
In a subsequent analysis, RepeatScout (Price & Pevzner, 2005) was used to generate a catalog of 372 over-represented sequences over 500nt long with homology to known transposons in the nr database at NCBI. Known transposon sequences from RepBase (version from 20090604) from Viridiplantae were added to make a custom library of repeats that was used to further mask 9.4% of the genome with RepeatMasker.
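The over-represented word idea can be illustrated with a small sketch: count every k-mer in a sequence, then soft-mask (lower-case) each base that falls inside a k-mer seen at least a threshold number of times. This is a simplified illustration only. It ignores the clustering step and reverse-complement matches, and the threshold is an arbitrary assumption, so it should not be read as the unpublished method cited above.

```python
from collections import Counter


def soft_mask_overrepresented(seq, k=16, min_count=3):
    """Lower-case every base covered by a k-mer occurring >= min_count times."""
    seq = seq.upper()
    n_kmers = len(seq) - k + 1  # may be <= 0 for short inputs
    counts = Counter(seq[i:i + k] for i in range(n_kmers))

    masked = [False] * len(seq)
    for i in range(n_kmers):
        if counts[seq[i:i + k]] >= min_count:
            for j in range(i, i + k):
                masked[j] = True

    return "".join(
        base.lower() if is_masked else base
        for base, is_masked in zip(seq, masked)
    )


# Toy example with a short word size so the effect is visible.
print(soft_mask_overrepresented("ACGTACGTACGTTTGCA", k=4, min_count=3))
```

In the toy sequence only the 4-mer ACGT occurs three times, so the twelve bases it covers are lower-cased and the remaining bases stay upper-case.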
How were ESTs aligned?
We aligned the cassava EST sequences using Brian Haas's PASA pipeline, which aligns each EST to its best location in the genome via GMAP and then filters hits to ensure proper splice boundaries.
How were plant peptides aligned?
Rice and castor bean peptides were downloaded from TIGR; Arabidopsis peptides were downloaded from TAIR; poplar and soybean proteins were generated in our annotation pipeline. All peptides were aligned to the soft-masked genome using gapped BLASTX; high-scoring sequence pairs (HSPs) are shown. Note that gapped BLAST was used to increase sensitivity, so that in many cases the HSP (shown in orange) spans adjacent exons and the intervening intron(s). Also, small exons are often missed.

How did you determine the cassava gene set?

Gene prediction
To produce the current "Cassava1.1" gene set, we used predictions from the homology-based gene prediction program FgenesH, provided by Asaf Salamov at JGI, along with the PASA program to integrate over 1.5 million cassava ESTs. The gene set shown on the browser was produced by Simon Prochnik at JGI. Briefly, peptides from diverse angiosperms and cassava EST assemblies (from PASA) were aligned to the genome, and their overlaps were used to define putative protein-coding gene loci. The corresponding genomic regions were extended by 1kb in each direction and submitted to FgenesH, along with related angiosperm peptides and/or ORFs from the overlapping EST assemblies. FgenesH identifies likely protein-coding exons, favoring regions that align well to the given homologous peptides. These homology-based predictions were then integrated with cassava EST evidence using PASA (Haas et al. 2003). The results were filtered to remove genes identified as transposon-related. Genes with apparently truncated ORFs may be prediction errors or pseudogenes.
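The locus-definition step described above (overlapping alignments define a locus, which is then extended by 1kb in each direction) can be sketched as follows. This is a simplified illustration under stated assumptions: coordinates are 1-based and inclusive, strand is ignored, and extended regions are clipped to the scaffold ends. The function name and input data are hypothetical, and this is not the pipeline code used at JGI.

```python
def define_loci(alignments, scaffold_lengths, flank=1000):
    """Merge overlapping alignments per scaffold and extend each by `flank`.

    alignments: iterable of (scaffold, start, end), 1-based inclusive.
    scaffold_lengths: dict mapping scaffold name -> length in bases.
    Returns a list of (scaffold, start, end) tuples for the extended loci.
    """
    by_scaffold = {}
    for scaffold, start, end in alignments:
        by_scaffold.setdefault(scaffold, []).append((start, end))

    loci = []
    for scaffold in sorted(by_scaffold):
        merged = []
        for start, end in sorted(by_scaffold[scaffold]):
            if merged and start <= merged[-1][1]:
                # Overlaps the current locus: grow it if needed.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])

        limit = scaffold_lengths[scaffold]
        for start, end in merged:
            # Extend by the flank, clipped to the scaffold boundaries.
            loci.append((scaffold, max(1, start - flank), min(limit, end + flank)))
    return loci


# Hypothetical peptide/EST alignment spans.
hits = [("s1", 5000, 6000), ("s1", 5500, 7000), ("s1", 20000, 21000), ("s2", 300, 900)]
for locus in define_loci(hits, {"s1": 21500, "s2": 5000}):
    print(locus)
```

Note that two neighbouring loci can overlap after extension. This sketch leaves them separate, and how the real pipeline treated such cases is not stated here.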
How come my gene is wrong?
FgenesH is one of the better homology-based gene predictors available but, like all computational gene modeling algorithms, it is imperfect. In addition, EST and cDNA data are often incomplete. We hope that the aggravation of having an imperfect gene set is partially compensated by the rapid release of the data.
Future gene sets will improve as assembly quality improves along with expressed sequence data and genomic data from related species. But the lesson from the annotation of other well-curated genomes like Arabidopsis and rice is that it can take years to fine tune a gene set even given a high quality genome assembly. So please be patient!

What can I do with the cassava dataset?

I would like to use this data to help clone a gene, analyse a gene family, etc.
Wonderful! Please feel free to use this data to advance your studies of cassava and other Malpighiales. Please cite "Cassava Genome Project 2009, http://www.phytozome.net/cassava".
I think I found an error. What should I do?
If you would like to bring any items to our attention, please send email to phytozome@jgi-psf.org.
I would like to do a large-scale comparison of cassava to other genomes, and/or a global analysis of its gene content.
The Fort Lauderdale guidelines for large-scale sequencing projects aim to balance the value of rapid data release for the user community with respect for the scientific interests of the generators of the data. We have released the data prior to any publication because of the importance of this data to the cassava community. Our plans for publication of the genome sequence and associated analyses are still developing, and we encourage any members of the cassava community who wish to contribute genome-wide or family-specific analyses to contact us at CassavaGenomics.org.
©2010 University of California Regents. All rights reserved