Glycine max (Soybean)

About the genome:


The Soybean (Glycine max) genome project was initiated through the DOE-JGI Community Sequencing Program (CSP) by a consortium led by Gary Stacey, Randy Shoemaker, Scott Jackson, Jeremy Schmutz, and Dan Rokhsar.

Large-scale shotgun sequencing of soybean began in the middle of 2006 and was completed early in 2008. A total of ~13 million attempted Sanger shotgun reads were produced and deposited in the NCBI Trace Archive in accordance with our commitment to early access and the Fort Lauderdale genome data release policy. See below for information on the 2010 publication of the soybean genome.

The present assembly (Glyma1) is the first chromosome-scale assembly of the soybean genome. The v1.1 gene set integrates ~1.6 million ESTs and 1.5 billion paired-end Illumina RNA-seq reads with homology- and ab initio-based gene predictions. Protein-coding genes have been given identifiers using the convention adopted by the Arabidopsis community. The identifiers are of the form Glyma%%g####, where %% is the chromosome number and #### is a numerical index that increases along each chromosome. Gene locus and transcript identifiers from v1.0 that could be unambiguously mapped forward to corresponding v1.1 models were retained (more details below).
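The identifier convention above can be parsed mechanically. The following is a minimal sketch, not an official parser: it assumes that %% is always two digits, that #### is one or more digits, and that transcripts may carry an optional ".N" suffix. The suffix, the names `parse_glyma_id` and `sort_along_chromosomes`, and the example identifiers are all hypothetical.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed pattern: "Glyma", a two-digit chromosome number, "g", a numeric
# index, and an optional ".N" transcript suffix (the suffix is an assumption).
GLYMA_ID = re.compile(r"^Glyma(\d{2})g(\d+)(?:\.(\d+))?$")


class GlymaId(NamedTuple):
    chromosome: int
    index: int
    transcript: Optional[int]


def parse_glyma_id(identifier: str) -> Optional[GlymaId]:
    """Return the parsed parts of an identifier, or None if it does not match."""
    match = GLYMA_ID.match(identifier)
    if match is None:
        return None
    chrom, index, transcript = match.groups()
    return GlymaId(int(chrom), int(index), int(transcript) if transcript else None)


def sort_along_chromosomes(identifiers: List[str]) -> List[str]:
    """Order identifiers by chromosome, then by index along the chromosome."""
    parsed = [(parse_glyma_id(i), i) for i in identifiers]
    valid = [(p.chromosome, p.index, i) for p, i in parsed if p is not None]
    return [i for _, _, i in sorted(valid)]
```

Because the index increases along each chromosome, sorting on (chromosome, index) recovers the order of loci along the genome without consulting coordinates.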


Note: the Soybean Glyma1.0 annotation is still available as part of our gene family clusters at this time.

Genome Size
Approximately 975 Mb is captured in 20 chromosomes, with a small additional amount of mostly repetitive sequence in unmapped scaffolds.
54,175 protein-coding loci and 73,320 transcripts have been predicted.


How was the genome sequenced?

Whole genome shotgun methodology
Although the first plant and animal genomes were sequenced by a BAC-by-BAC approach, almost all current animal and fungal genome sequencing projects use the whole genome shotgun strategy, in which the entire genome is randomly sheared, subcloned, and redundantly sequenced. The ease, cost-efficiency, and speed of the whole genome shotgun approach have made it the method of choice in many cases, but there are lingering concerns about its effectiveness for large repeat-rich plant genomes, especially grasses. Soybean is the most complex plant genome sequenced to date by this strategy.
How was the assembly generated?
The Glyma1 release was produced by Jeremy Schmutz at JGI-Stanford Human Genome Center using the Arachne2 assembler in a mode tuned to the highly repetitive soybean genome. These sequence scaffolds were then integrated with soybean genetic and physical maps in collaboration with Steve Cannon and his group at the University of Minnesota.
Is it complete?
Comparison with the soybean EST set suggests that more than 98% of known soybean protein-coding genes are represented in the assembly (many of those that aren't are turning out to be contamination of EST libraries). This result supports the claim that Glyma1 is largely complete with respect to "gene space." You'll find that vast tracts of repetitive sequence are also assembled.
Is it accurate?
The vast majority of Glycine max ESTs align to the genome at nearly 100% identity, suggesting that Glyma1 is highly accurate in genic regions. We are currently evaluating the base-pair-level accuracy in repetitive regions by comparing the assembly with BAC clones produced for the project. Discrepancies between the shotgun assembly and the independently obtained genetic and physical maps have been manually reviewed and corrected, so there should be no errors in the large-scale structure of the genome.
What about polyploidy?
The soybean genome experienced a tetraploidization event an estimated 10-15 million years ago. Homologous regions have diverged sufficiently, however, that they can be assembled apart from one another in the shotgun assembly. Thus both homologs are typically represented in the Glyma1 sequence.

How do I find my favorite genes?

To BLAST against the soybean genome with protein or nucleotide probes, click here and select the Glycine max node on the tree. The default BLAST database is a soybean genome assembly that has been masked for high fidelity repeats, and default BLAST parameters are suitable for use with grass proteins and coding sequences. You can view your BLAST alignment against the genome by clicking on the hit of interest to see the detailed alignment, and then clicking on the scaffold name (shown in blue). If you're interested in transposable element families in the soybean genome, please DO NOT BLAST these; it will just clog up our BLAST queue! Similarly, please don't BLAST entire BACs. Download the assembly FASTA sequence and perform such BLASTs locally.
We have pre-aligned known soybean, Medicago, and Lotus ESTs to the soybean sequence, along with current proteomes of rice and Arabidopsis. If you enter text keywords from common gene names like "nod1" or "agamous", or gene identifiers like "At1g12340", into the GBrowse "Search" box, the result will be a list of genomic regions that hit ESTs or rice/Arabidopsis genes associated with these words or identifiers. Clicking on the red diamonds will then bring you to the specific region of interest. Note that you may need to zoom in to see details, which are only shown over regions shorter than 70 kb.
The chromosomal coordinates of Glyma1 are unrelated to the "super" location and coordinates from Glyma0.

How do I work with the soybean genome browser?

How can I view the soybean sequence and various genomic features?
A graphical view of the soybean genome is available here. Detailed features are only visible when looking at 100 kb or smaller regions. You may need to zoom in to get to this size. Typically, clicking on a feature will reveal its sequence and alignment to the genome.
How do I retrieve soybean sequence of interest to me?
From the browser, locate the region of interest. With your region in view, select "Download Sequence" from the menu above the Scroll/Zoom bar. Then click the "Go" button and the sequence will be displayed in your browser, ready to cut and paste. If you click on a gene model, you can retrieve the predicted protein and coding sequence.

How did you determine the soybean gene set?

Gene prediction
113,859 transcript assemblies were constructed from approximately 1.5 billion paired-end Illumina RNA-seq reads using PERTRAN (Shu et al., manuscript in preparation). 161,995 transcript assemblies were then constructed using PASA (Haas, 2003) from 1,776,021 sequences in total, consisting of the RNA-seq transcript assemblies above as well as Sanger and Roche/454 ESTs. Loci were determined by transcript assembly alignments and/or EXONERATE alignments of proteins from Arabidopsis (Arabidopsis thaliana), Medicago, grape, and poplar to the Glycine max genome, which was soft-masked for repeats using RepeatMasker (Smit, 1996-2012). Each locus was extended by up to 2 kb on both ends unless the extension ran into another locus on the same strand. Gene models were predicted by the homology-based predictors FGENESH+ (Salamov, 2000), FGENESH_EST (similar to FGENESH+, but with ESTs as splice site and intron input instead of protein/translated ORF), and GenomeScan (Yeh, 2001).
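The locus extension step mentioned above (up to 2 kb on each end, without running into another locus on the same strand) might look roughly like the sketch below. It is an illustration under stated assumptions, not the pipeline's actual code: the extension is assumed to be clipped at the neighboring locus rather than skipped entirely, neighbors are bounded by their original (unextended) coordinates, coordinates are 1-based inclusive, and `Locus` and `extend_loci` are hypothetical names.

```python
from typing import List, NamedTuple

MAX_EXTENSION = 2000  # "up to 2 kb" on each end


class Locus(NamedTuple):
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def extend_loci(loci: List[Locus], seq_length: int,
                max_ext: int = MAX_EXTENSION) -> List[Locus]:
    """Extend each locus by up to max_ext bp per side, stopping short of any
    same-strand neighbor (original coordinates) and of the sequence ends."""
    extended = []
    for locus in loci:
        left_limit, right_limit = 1, seq_length
        for other in loci:
            if other is locus or other.strand != locus.strand:
                continue  # loci on the opposite strand do not block extension
            if other.end < locus.start:
                left_limit = max(left_limit, other.end + 1)
            elif other.start > locus.end:
                right_limit = min(right_limit, other.start - 1)
        new_start = max(locus.start - max_ext, left_limit)
        new_end = min(locus.end + max_ext, right_limit)
        extended.append(Locus(new_start, new_end, locus.strand))
    return extended
```

For example, two "+" strand loci at 5000-6000 and 7000-8000 on a 9000 bp sequence each stop at the other's boundary, while a "-" strand locus between them is extended by the full 2 kb on both sides.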

The highest-scoring predictions for each locus were selected using multiple positive factors, including EST and protein support, and one negative factor: overlap with repeats. The selected gene predictions were improved by PASA; improvements include adding UTRs, correcting splicing, and adding alternative transcripts. Proteins from the PASA-improved gene models were subjected to protein homology analysis against the above-mentioned proteomes to obtain a Cscore and protein coverage. Cscore is the ratio of a protein's BLASTP score to its MBH (mutual best hit) BLASTP score, and protein coverage is the highest percentage of the protein aligned to the best of its homologs. PASA-improved transcripts were selected based on Cscore, protein coverage, EST coverage, and the overlap of their CDS with repeats. A transcript whose CDS overlaps repeats by less than 20% was selected if its Cscore is at least 0.5 and its protein coverage is at least 0.5, or if it has EST coverage. A gene model whose CDS overlaps repeats by more than 20% was selected only if its Cscore is at least 0.9 and its homology coverage is at least 70%. The selected gene models were subjected to Pfam analysis, and gene models whose protein is more than 30% in Pfam TE domains were removed. Based on soybean community input, some Glyma1.0 gene models were added to this version.
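The selection thresholds described above can be written as a small decision function. This is a hedged sketch of the stated rules only, not the production annotation code: the `Transcript` fields and function names are hypothetical, all fractions are expressed on a 0-1 scale, and a CDS repeat overlap of exactly 20% (which the text does not specify) is assumed to fall in the low-repeat branch.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    cscore: float               # BLASTP score / mutual-best-hit BLASTP score
    protein_coverage: float     # fraction of protein aligned to best homolog
    has_est_support: bool       # any EST coverage
    cds_repeat_overlap: float   # fraction of CDS overlapping repeats
    pfam_te_fraction: float = 0.0  # fraction of protein in Pfam TE domains


def cscore(blastp_score: float, mbh_score: float) -> float:
    """Ratio of a protein's BLASTP score to its mutual-best-hit BLASTP score."""
    return blastp_score / mbh_score if mbh_score > 0 else 0.0


def passes_homology_filter(t: Transcript) -> bool:
    if t.cds_repeat_overlap > 0.20:
        # Repeat-rich CDS: stricter homology evidence is required.
        return t.cscore >= 0.9 and t.protein_coverage >= 0.70
    # Low-repeat CDS (assumed to include exactly 20%): homology or EST support.
    homology_ok = t.cscore >= 0.5 and t.protein_coverage >= 0.5
    return homology_ok or t.has_est_support


def is_kept(t: Transcript) -> bool:
    """Apply the homology/EST filter, then drop models dominated by TE domains."""
    return passes_homology_filter(t) and t.pfam_te_fraction <= 0.30
```

Keeping the Pfam TE check as a separate final step mirrors the order in the text, where it is applied to models that have already passed selection.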


Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31: 5654-5666.

Smit, A.F.A., Hubley, R., and Green, P. (1996-2011) RepeatMasker Open-3.0.

Yeh, R.-F., Lim, L.P., and Burge, C.B. (2001) Computational inference of homologous gene structures in the human genome. Genome Res. 11: 803-816.

Salamov, A.A. and Solovyev, V.V. (2000) Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10: 516-522.

Locus name and transcript name mapping from previous annotation version
Mapping was attempted only in cases where v1.0 and v1.1 loci overlapped in pairs. The locus name of a v1.0 gene is mapped forward to a corresponding v1.1 gene if 1) the v1.0 and v1.1 loci overlap uniquely and appear on the same strand, and 2) at least one pair of translated transcripts from the old and new loci are MBHs (mutual best hits) with at least 70% normalized identity in a BLASTP alignment (normalized identity is defined as the number of identical residues divided by the length of the longer sequence). For a given pair of v1.0 and v1.1 transcripts at loci that map, model names are mapped forward if either a) an MBH relationship exists between the two proteins with at least 90% normalized identity, or b) the proteins have at least 90% normalized identity but are not MBHs, while the corresponding transcript sequences are MBHs (also with at least 90% normalized identity). This latter rule specifically handles cases where the v1.0 and v1.1 models differ mainly by the addition or extension of UTR in the v1.1 model. These rules allowed the model names of approximately 83% of non-TE-associated transcripts in v1.0 to be assigned to corresponding v1.1 transcripts.
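The name-mapping rules above reduce to a normalized identity calculation and two boolean tests. The sketch below is one reading of those rules and not the code used for the release: the function names and argument layout are hypothetical, identities are expressed as fractions between 0 and 1, and the MBH and overlap determinations are assumed to have been made upstream.

```python
from typing import Iterable


def normalized_identity(identical: int, len_a: int, len_b: int) -> float:
    """Identical residues divided by the length of the longer sequence."""
    longer = max(len_a, len_b)
    return identical / longer if longer else 0.0


def locus_name_maps(overlap_unique: bool, same_strand: bool,
                    mbh_protein_identities: Iterable[float]) -> bool:
    """Locus rule: unique same-strand overlap, plus at least one MBH protein
    pair between the two loci with >= 70% normalized identity."""
    return (overlap_unique and same_strand
            and any(identity >= 0.70 for identity in mbh_protein_identities))


def transcript_name_maps(protein_identity: float, proteins_are_mbh: bool,
                         transcript_identity: float,
                         transcripts_are_mbh: bool) -> bool:
    """Transcript rule, applied only at loci whose names already map."""
    if protein_identity < 0.90:
        return False
    if proteins_are_mbh:
        return True  # rule (a)
    # Rule (b): proteins are not MBHs, but the transcript sequences are.
    return transcripts_are_mbh and transcript_identity >= 0.90
```

Dividing by the longer sequence penalizes pairs in which one model is a truncated version of the other, so a short fragment cannot claim the name of a full-length model on identity alone.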

What can I do with the soybean dataset?

I would like to use this data to help clone a gene, analyse a gene family, etc.
Wonderful! Please feel free to use this data to advance your studies of soybean and other legumes. Please include the reference below as your citation.
I think I found an error. What should I do?
If you would like to bring any items to our attention, please send email to

Where can I find the soybean genome publication?

The publication of the soybean genome is available from Nature:
Schmutz J, et al. (2010). "Genome sequence of the palaeopolyploid soybean." Nature 463, 178-183 (14 January 2010) | doi:10.1038/nature08670