Glycine max (Soybean)



About the genome:


Overview

The Soybean (Glycine max) genome project was initiated through the DOE-JGI Community Sequencing Program (CSP) by a consortium led by Gary Stacey, Randy Shoemaker, Scott Jackson, Jeremy Schmutz, and Dan Rokhsar.

Large-scale shotgun sequencing of soybean began in mid-2006 and was completed in early 2008. A total of ~13 million attempted Sanger shotgun reads were produced and deposited in the NCBI Trace Archive in accordance with our commitment to early access and the Fort Lauderdale genome data release policy. See below for information on the 2010 publication of the soybean genome.

The present assembly (Glyma1) is the first chromosome-scale assembly of the soybean genome. The current gene set (Glyma1.0) integrates ~1.6 million ESTs with homology and ab initio-based gene predictions. Protein-coding genes have been given identifiers using the convention adopted by the Arabidopsis community. The identifiers are of the form Glyma%%g####, where %% is the chromosome number and #### is a numerical index that increases along each chromosome. We expect that these identifiers will be preserved in future releases.
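As an illustration of the identifier convention, the sketch below parses identifiers of the form Glyma%%g#### and sorts them by chromosome and index. It assumes a two-digit chromosome number and leaves the width of the numerical index open, since the text does not fix it. The function names and the example identifiers are hypothetical, chosen only for illustration.

```python
import re

# Assumed pattern: "Glyma", a two-digit chromosome number, "g", then a numeric index.
# The index width is left open because the text above does not specify it.
GENE_ID = re.compile(r"^Glyma(\d{2})g(\d+)$")


def parse_gene_id(identifier):
    """Return (chromosome, index) as integers, or None if the form does not match."""
    match = GENE_ID.match(identifier)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def sort_by_position(identifiers):
    """Order matching identifiers by chromosome, then by index along the chromosome."""
    parsed = [(parse_gene_id(i), i) for i in identifiers]
    return [i for key, i in sorted(p for p in parsed if p[0] is not None)]


# Hypothetical identifiers, used only to show the call.
print(parse_gene_id("Glyma01g00210"))
print(sort_by_position(["Glyma10g00100", "Glyma02g00500", "Glyma02g00020"]))
```

Because the index increases along each chromosome, sorting on the parsed pair gives the order of genes along the chromosome without consulting coordinates.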

Statistics

Genome Size
Approximately 975 Mb is captured in 20 chromosomes, with a small additional amount of mostly repetitive sequence in unmapped scaffolds.
Loci
66,153 protein-coding loci have been predicted. Each gene was assigned a letter code indicating its highest level of support:
Code   Genes with this code   Code definition
F      3305                   full-length cDNA consistent
E      32317                  EST consistent
Ei     7361                   EST overlap, but model does not match all EST boundaries
Ea     1832                   gene generated from the longest ORF of EST evidence, as modeling programs failed to produce a model at locus
Hs     13704                  Homology and Solexa support
H      7634                   Homology to other plant peptide
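The counts listed above can be checked against the stated total of 66,153 loci, and turned into fractions, with a short tally. This is only an arithmetic sketch; the dictionary and function names are illustrative.

```python
# Counts per support code, copied from the list above.
SUPPORT_COUNTS = {
    "F": 3305,
    "E": 32317,
    "Ei": 7361,
    "Ea": 1832,
    "Hs": 13704,
    "H": 7634,
}


def support_fractions(counts):
    """Return the total number of genes and the fraction carrying each code."""
    total = sum(counts.values())
    return total, {code: n / total for code, n in counts.items()}


total, fractions = support_fractions(SUPPORT_COUNTS)
for code, fraction in fractions.items():
    print(f"{code:<3}{SUPPORT_COUNTS[code]:>7}  {fraction:6.1%}")
print("total", total)
```

The six counts sum to 66,153, matching the number of predicted loci, so every locus carries exactly one code.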

FAQ

How was the genome sequenced?

Whole genome shotgun methodology
Although the first plant and animal genomes were sequenced by a BAC-by-BAC approach, almost all current animal and fungal genome sequencing projects use the whole genome shotgun strategy, in which the entire genome is randomly sheared, subcloned, and redundantly sequenced. The ease, cost-efficiency, and speed of the whole genome shotgun approach have made it the method of choice in many cases, but there are lingering concerns about its effectiveness for large repeat-rich plant genomes, especially grasses. Soybean is the most complex plant genome sequenced to date by this strategy.
How was the assembly generated?
The Glyma1 release was produced by Jeremy Schmutz at JGI-Stanford Human Genome Center using the Arachne2 assembler in a mode tuned to the highly repetitive soybean genome. These sequence scaffolds were then integrated with soybean genetic and physical maps in collaboration with Steve Cannon and his group at the University of Minnesota.
Is it complete?
Comparison with the soybean EST set suggests that more than 98% of known soybean protein-coding genes are represented in the assembly (many of those that are missing are turning out to be contaminants in EST libraries). This result supports the claim that Glyma1 is largely complete with respect to "gene space." You'll find that vast tracts of repetitive sequence are also assembled.
Is it accurate?
The vast majority of Glycine max ESTs align to the genome at nearly 100% identity, suggesting that Glyma1 is highly accurate in genic regions. We are currently evaluating the base-pair-level accuracy in repetitive regions by comparing the assembly with BAC clones produced for the project. Discrepancies between the shotgun assembly and the independently obtained genetic and physical maps have been manually reviewed and corrected, so there should be no errors in the large-scale structure of the genome.
What about polyploidy?
The soybean genome experienced a tetraploidization event an estimated 10-15 million years ago. Homologous regions have diverged sufficiently, however, that they can be assembled apart from one another in the shotgun assembly. Thus both homologs are typically represented in the Glyma1 sequence.

How do I find my favorite genes?

BLAST
To BLAST against the soybean genome with peptide or nucleotide probes, click here and select the Glycine max node on the tree. The default BLAST database is a soybean genome assembly that has been masked for high fidelity repeats, and default BLAST parameters are suitable for use with grass peptides and coding sequences. You can view your BLAST alignment against the genome by clicking on the hit of interest to see the detailed alignment, and then clicking on the scaffold name (shown in blue). If you're interested in transposable element families in the soybean genome, please DO NOT BLAST these; it will just clog up our BLAST queue! Similarly, please don't BLAST entire BACs. Download the assembly FASTA sequence and perform such BLASTs locally.
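For large queries such as repeat families or whole BACs, a local search might be set up as sketched below. This assumes the NCBI BLAST+ command-line tools (makeblastdb and blastn) are installed and on the PATH. The file names, the database name, and the e-value cutoff are hypothetical placeholders, not values recommended by the project.

```python
import subprocess


def build_blast_commands(assembly_fasta, query_fasta, out_tsv,
                         db_name="glyma_local", evalue="1e-10"):
    """Build argument lists for indexing the assembly and searching it with blastn."""
    # Index the downloaded assembly as a nucleotide database.
    make_db = ["makeblastdb", "-in", assembly_fasta, "-dbtype", "nucl", "-out", db_name]
    # Search the query against it; "-outfmt 6" requests tab-separated hits.
    search = ["blastn", "-query", query_fasta, "-db", db_name,
              "-evalue", evalue, "-outfmt", "6", "-out", out_tsv]
    return make_db, search


def run_local_blast(assembly_fasta, query_fasta, out_tsv):
    """Run both steps in order; raises CalledProcessError if either tool fails."""
    for command in build_blast_commands(assembly_fasta, query_fasta, out_tsv):
        subprocess.run(command, check=True)


# Hypothetical file names, shown only to illustrate the call:
# run_local_blast("Glyma1_assembly.fasta", "my_bac.fasta", "my_bac_hits.tsv")
```

Keeping command construction separate from execution makes it easy to inspect or log the exact commands before running a long search.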
Search
We have pre-aligned known soybean, Medicago, and lotus ESTs to the soybean sequence, along with current proteomes of rice and Arabidopsis. If you enter text keywords from common gene names like "nod1" or "agamous," or gene identifiers like "At1g12340," into the GBrowse "Search" box, the result will be a list of genomic regions that hit ESTs or rice/Arabidopsis genes associated with these words or identifiers. Clicking on the red diamonds will then bring you to the specific region of interest. Note that you may need to zoom in to see details, which are only shown over regions shorter than 70 kb.
NOTE
The chromosomal coordinates of Glyma1 are unrelated to the "super" location and coordinates from Glyma0.

How do I work with the soybean genome browser?

How can I view the soybean sequence and various genomic features?
A graphical view of the soybean genome is available here. Detailed features are only visible when looking at 100 kb or smaller regions. You may need to zoom in to get to this size. Typically, clicking on a feature will reveal its sequence and alignment to the genome.
How do I retrieve soybean sequence of interest to me?
From the browser, locate the region of interest. With your region in view, select "Download Sequence" from the menu above the Scroll/Zoom bar. Then click the "Go" button and the sequence will appear in your browser, ready to cut and paste. If you click on a gene model, you can retrieve the predicted peptide and coding sequence.
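If you have downloaded the assembly instead, the same kind of region can be cut out of the FASTA text directly. The sketch below is a minimal, dependency-free example; it assumes 1-based, inclusive coordinates, and the sequence names ("Gm01", "Gm02") are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA text lines into {name: sequence}; the name is the first word of the header."""
    records, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def extract_region(records, name, start, end):
    """Return bases start..end (1-based, inclusive) of the named sequence."""
    sequence = records[name]
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("region lies outside the sequence")
    return sequence[start - 1:end]


# Toy input with hypothetical sequence names.
demo = [">Gm01 demo", "ACGTAC", "GTTT", ">Gm02", "GGGG"]
records = read_fasta(demo)
print(extract_region(records, "Gm01", 3, 8))
```

For a real assembly file, pass an open file handle to read_fasta in place of the list; for genome-scale work an indexed reader would be a better choice than loading everything into memory.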
What happens when I click on a gene on the browser?
You'll see a web page that displays the predicted peptide, genomic span of the gene with coding exons shaded, and the (spliced) coding sequence.  From this page you can also launch BLAST vs. Phytozome organisms and gene families or the NCBI non-redundant protein database.

Where do the various tracks on the genome browser come from?
How were repeats identified?
Sixteen-base-pair "words" (16-mers) that are over-represented in and clustered on the genome were used to define repetitive regions (J. Chapman, unpublished). These typically represent recently active retrotransposons and simple sequence repeats in the soybean genome. Nearly 40% of the genome appears to be covered by such clustered/over-represented regions. In a parallel effort, a catalog of DNA transposons and LTR transposable elements was produced by Jianxin Ma. A catalog of short tandem repeats was provided by Steve Cannon.
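The general idea of flagging over-represented words can be sketched as follows. This is not the unpublished method cited above: it only counts k-mers and marks bases covered by frequent ones, and it ignores the clustering criterion entirely. The min_count threshold and the function names are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(sequence, k=16):
    """Count every overlapping k-mer in the sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def mask_overrepresented(sequence, k=16, min_count=3):
    """Return a list of booleans marking bases covered by any k-mer seen at least min_count times."""
    sequence = sequence.upper()
    counts = count_kmers(sequence, k)
    covered = [False] * len(sequence)
    for i in range(len(sequence) - k + 1):
        if counts[sequence[i:i + k]] >= min_count:
            for j in range(i, i + k):
                covered[j] = True
    return covered


def covered_fraction(sequence, k=16, min_count=3):
    """Fraction of bases flagged as repetitive under the simple count threshold."""
    covered = mask_overrepresented(sequence, k, min_count)
    return sum(covered) / len(covered) if covered else 0.0


# Toy example with a short word size so the effect is visible.
toy = "ACGTACGTACGT" + "GGCATC"
print(covered_fraction(toy, k=4, min_count=3))
```

On a real genome the counting step would need a memory-efficient k-mer counter, and the threshold would have to be chosen relative to sequencing depth and genome size.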
How were ESTs aligned?
We aligned the consensus EST sequences of Glycine max, Medicago truncatula, and Lotus japonicus from the TIGR Plant Transcript Assemblies (PlantTA) database to the soybean genome using Jim Kent's BLAT and filtered for the best hit to the genome, along with any hit within 97% coverage of that hit to account for genome duplication. For final gene verification, we aligned G. max ESTs using Brian Haas's PASA pipeline, which aligns ESTs to the best place in the genome via GMAP, then filters hits to ensure proper splice boundaries.
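One way to read the best-hit filter is sketched below: for each EST, keep the best hit plus any hit whose coverage is at least 97% of the best hit's coverage, so that duplicated regions are retained. That interpretation of "within 97% coverage" is an assumption, as are the tuple layout and the example values.

```python
from collections import defaultdict


def filter_hits(hits, relative_cutoff=0.97):
    """Keep, per EST, the best hit and any hit with coverage >= relative_cutoff * best.

    hits: iterable of (est_id, target, coverage) tuples, coverage between 0 and 1.
    """
    by_est = defaultdict(list)
    for hit in hits:
        by_est[hit[0]].append(hit)
    kept = []
    for est_hits in by_est.values():
        best = max(h[2] for h in est_hits)
        kept.extend(h for h in est_hits if h[2] >= relative_cutoff * best)
    return kept


# Hypothetical hits: est1 aligns almost equally well to two duplicated regions.
hits = [
    ("est1", "Gm01", 0.99),
    ("est1", "Gm11", 0.97),
    ("est1", "Gm05", 0.60),
    ("est2", "Gm03", 0.80),
]
print(filter_hits(hits))
```

In the toy data the weak third hit for est1 is dropped, while both near-equal hits survive, which is the behavior one would want for a duplicated genome.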
How were rice and Arabidopsis peptides aligned?
The Arabidopsis and rice peptides were downloaded from NCBI RefSeq and aligned to the (unmasked) genome by gapped BLASTX; high-scoring segment pairs (HSPs) are shown. Note that gapped BLAST was used to increase sensitivity, so that in many cases the HSP (shown in yellow) spans adjacent exons and the intervening intron(s). Also, small exons (evident from the soybean/Medicago/lotus ESTs) are often missed.

How did you determine the soybean gene set?

Gene prediction
To produce the current "Glyma1.0" gene set, we used the homology-based gene prediction program GenomeScan from Chris Burge and FgenesH predictions provided by Asaf Salamov at JGI, along with the PASA program to integrate over 1.6 million soybean ESTs. The gene set shown on the browser was produced by Therese Mitros at UC Berkeley. Briefly, peptides from diverse angiosperms and TIGR legume EST assemblies were aligned to the genome, and their overlaps were used to define putative protein-coding gene loci. The genomic region corresponding to each locus was submitted to GenomeScan and FgenesH, along with related angiosperm peptides and/or ORFs from the overlapping EST assemblies. GenomeScan identifies likely protein-coding exons, favoring regions that align well to the given homologous peptides. These homology-based predictions were then integrated with expressed sequence information from legume ESTs using PASA (Haas et al. 2003). The results were filtered to remove genes identified as transposon-related. Genes with apparently truncated ORFs may be prediction errors or pseudogenes.
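The step in which overlapping alignments define putative loci can be pictured as interval merging. The sketch below is a simplified illustration, not the pipeline's actual code: it ignores strand and alignment quality, and the coordinates and chromosome names are hypothetical.

```python
def define_loci(alignments):
    """Merge overlapping alignment spans into putative loci.

    alignments: iterable of (chromosome, start, end) with start <= end.
    Returns a sorted list of merged (chromosome, start, end) spans.
    """
    loci = []
    for chrom, start, end in sorted(alignments):
        if loci and loci[-1][0] == chrom and start <= loci[-1][2]:
            # Overlaps the current locus: extend it if this span reaches further.
            last = loci[-1]
            loci[-1] = (chrom, last[1], max(last[2], end))
        else:
            # No overlap (or a new chromosome): start a new locus.
            loci.append((chrom, start, end))
    return loci


# Hypothetical peptide and EST alignment spans.
spans = [
    ("Gm01", 100, 500),
    ("Gm01", 450, 900),
    ("Gm01", 2000, 2500),
    ("Gm02", 100, 200),
]
print(define_loci(spans))
```

Each merged span would then correspond to one genomic region handed to the gene predictors together with its supporting peptides or EST assemblies.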
How come my gene is wrong?
GenomeScan and FgenesH are two of the better homology-based gene predictors available, but like all computational gene modeling algorithms, they are imperfect. Similarly, EST and cDNA data are often incomplete. We hope that the aggravation of an imperfect gene set is partially compensated by the rapid release of the data.
Future gene sets will improve as assembly quality improves, along with associated expressed sequence data and genomic data from related species. But the lesson from the annotation of other well-curated genomes like Arabidopsis and rice is that it can take years to fine-tune a gene set, even given a high-quality genome assembly. So please be patient!

What can I do with the soybean dataset?

I would like to use this data to help clone a gene, analyse a gene family, etc.
Wonderful! Please feel free to use this data to advance your studies of soybean and other legumes. Please include the reference below as your citation.
I think I found an error. What should I do?
If you would like to bring any items to our attention, please send email to phytozome@jgi-psf.org.

Where can I find the soybean genome publication?

The publication of the soybean genome is available from Nature:
Schmutz J, et al. (2010). "Genome sequence of the palaeopolyploid soybean." Nature 463, 178-183 (14 January 2010) | doi:10.1038/nature08670
  ©2011 University of California Regents. All rights reserved