Sorghum bicolor (Cereal grass)
About the genome:
Overview
The Sorghum bicolor genome project was initiated through the DOE-JGI Community Sequencing Program (CSP) by a consortium led by Andy Paterson, John Bowers, Steve Kresovich, C. Thomas Hash, Jo Messing, Daniel Peterson, Jeremy Schmutz, and Dan Rokhsar.
Large-scale shotgun sequencing of sorghum began at the end of 2005 and was completed on January 25th, 2007. A total of 10,717,203 shotgun reads were collected. All raw trace data are deposited in the NCBI Trace Archive in accordance with our commitment to early access and the Fort Lauderdale genome data release policy.
The present v1.0 release, comprising the Sbi1 assembly and the Sbi1.4 gene set, is the assembly and annotation used in the sorghum genome paper. In all subsequent releases, chromosome and gene identifiers will be mapped forward whenever possible. This assembly was built with Arachne v20070201 from a data freeze of January 25th, 2007. After the build, 28 breaks were made and 108 manual joins were performed. Ten of these joins were across centromeres. The size of the centromere was estimated for each chromosome from the amount of centromeric sequence already assembled. The main genome is in 10 chromosomes, with many small unmapped pieces, some of which contain annotated genes.
The Sorghum bicolor genome has been published and is available from Nature:
- Paterson AH, et al. (2009). "The Sorghum bicolor genome and the diversification of grasses." Nature 457, 551-556 (29 January 2009) | doi:10.1038/nature07723
Statistics
- Genome Size
- 697,578,683 base pairs arranged in 2n=20 chromosomes
- Loci
- 34,496 loci containing protein-coding transcripts
- Transcripts
- 36,338 protein-coding transcripts
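As a quick illustration of how the locus and transcript counts relate, the following minimal Python sketch works out how many transcripts exist beyond one per locus and the mean number of transcripts per locus. It assumes only the two figures listed above and that every locus carries at least one transcript; the variable names are illustrative.

```python
# Figures copied from the Statistics list above.
loci = 34_496
transcripts = 36_338

# Transcripts beyond the first one at each locus (assumes >= 1 per locus).
extra_transcripts = transcripts - loci
mean_per_locus = transcripts / loci

print(f"transcripts beyond one per locus: {extra_transcripts:,}")  # 1,842
print(f"mean transcripts per locus: {mean_per_locus:.3f}")         # 1.053
```

The difference, 1,842, is the same figure the FAQ gives for the alternatively spliced alignments produced by PASA.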
FAQ
How was the genome sequenced?
How do I find my favorite genes?
How do I work with the sorghum genome browser?
- How can I view the sorghum sequence and various genomic features?
- The sorghum genome is available here. Detailed features are only visible when viewing a region of 100 kb or smaller, so you may need to zoom in to reach this size. Typically, clicking on a feature will reveal its sequence and its alignment to the genome.
- How do I retrieve sorghum sequence of interest to me?
- From the browser, locate the region of interest. With your region in view, select "Download Sequence" from the menu above the Scroll/Zoom bar. Then click the "Go" button, and the sequence will be displayed in your browser for you to cut and paste. If you click on a gene model, you can retrieve the predicted peptide and coding sequence.
- What happens when I click on a gene on the browser?
- You'll see a web page that displays the predicted peptide, genomic span of the gene with coding exons shaded, and the (spliced) coding sequence. From this page you can also launch BLAST vs. Phytozome organisms and gene families or the NCBI non-redundant protein database.
- Where do the various tracks on the genome browser come from?
- How were repeats identified?
- The genome was masked using RepeatMasker. Nearly 66% of the genome appears to be covered by the clustered/over-represented regions identified in this way. This is clearly an underestimate of the repeat content of sorghum, as many older/more diverged transposable element "fossils", as well as low-copy elements, have not been characterized yet.
- What is a SAMI?
- "Sorghum assembled methyl-filtered islands" represent assemblies of methyl-filtered sorghum shotgun sequences, obtained from Pat Schnable's MAGI/SAMI analysis. These are enriched for genic regions but only cover portions of genes.
- How were ESTs aligned?
- We aligned the consensus EST sequences of sorghum, sugarcane, and maize from the TIGR PlantTA database to the sorghum genome using Jim Kent's BLAT and NCBI BLAST.
- How were rice and Arabidopsis peptides aligned?
- The Arabidopsis and rice peptides were downloaded from NCBI RefSeq and aligned to the (unmasked) genome by gapped BLASTX; high-scoring segment pairs (HSPs) are shown. Note that gapped BLAST was used to increase sensitivity, so in many cases the HSP (shown in yellow) spans adjacent exons and the intervening intron(s). Also, small exons (evident from the maize/sorghum/sugarcane ESTs) are often missed.
How did you get the gene set for sorghum?
- Where did the gene set come from?
- Consensus gene predictions were built around several evidence sources. TIGR transcript assemblies were mapped onto the repeat-masked genome sequence using GenomeThreader with a maize splice site model. Assemblies and ESTs of the following species were mapped: Allium cepa, Ananas comosus, Avena sativa, Brachypodium distachyon, Curcuma longa, Hordeum vulgare, Oryza sativa, Saccharum officinarum, Secale cereale, Sorghum bicolor, Sorghum halepense, Sorghum propinquum, Triticum aestivum, Zea mays, and Zingiber officinale. We also generated optimal spliced alignments (OSAs) as well as BLASTX alignments for a reference set of proteins consisting of the SWISSPROT database and the Arabidopsis (TAIR6), Saccharomyces cerevisiae, and rice (RAP2) proteomes. For each OSA, possible reading frames of ≥50 amino acids were collected as candidates for gene models. In addition, we identified gene models on the repeat-masked genomic sequence by ab initio methods (Fgenesh++, GeneID, GenomeScan/PASA).
Next, we applied Jigsaw as a statistical combiner of all the supporting information above. A decision tree was trained on a set of 987 gene models that had been manually edited in the Apollo Genome Browser. All models, including those obtained from the first analysis series, were scored by blastp against the UniRef90 protein database, and for each locus the best-fitting model, i.e. the model with the highest bitscore, was selected.
In the final step, these predictions were rerun through the PASA pipeline in order (i) to predict UTRs from maize, sorghum and sugarcane ESTs, (ii) to identify possible alternative splicing patterns and (iii) to fit all predicted models to the splice sites suggested by EST evidence from closely related species.
This pipeline yielded 36,338 transcript models at 34,496 loci. In addition to the 28,003 complete models, we predicted 6,493 candidate genes that lack a start and/or stop codon; these are therefore designated partial models. We only included such models in our annotation if they did not overlap complete predictions. Note that partial gene models may result from several reasons that are not mutually exclusive: (i) sequencing or assembly errors may prevent both ab initio and homology-based predictors from deducing a correct ORF; (ii) transposon activity may have truncated gene models; (iii) there may be insufficient evidence from ab initio predictions or EST matches to provide a complete gene model.
- How were UTRs identified in gene predictions?
- The Program to Assemble Spliced Alignments, PASA (B. Haas), was run on the gene prediction set with all available sorghum ESTs. This produced 1,842 alternatively spliced alignments and added UTRs to 17,744 transcripts.
- Why do models sometimes disagree with "obvious" exons from ESTs or homologous rice genes?
- Two reasons. First, while gene prediction programs do take homology information into account, they also adhere to an internal statistical model for what coding sequences in maize and related grasses "should" look like. So homology evidence may be "overridden" if it is inconsistent with expected codon usage, etc. A second and related problem is that ESTs are imperfect and sometimes grossly wrong, as they may include unspliced (retained) introns and/or genomic contamination of the cDNA library. By using a statistical model, gene predictors are able to reject such false data in some cases.
- Why don't all the open reading frames (ORFs) start with methionine? Why don't all the ORFs end with a stop codon? How come my gene is only partially predicted?
- GenomeScan is one of the better homology-based gene predictors available, but like all computational gene modeling algorithms, it is imperfect. Also, to avoid "run-on" models that inappropriately join adjacent genes, we only provided GenomeScan with our best guess for the genomic extent of a locus. If the statistical model of GenomeScan does not encounter what it believes to be the true start or end of a gene in our locus, the initial ATG or terminal stop codon may not be present in the model. So it's partially GenomeScan's fault, and partially ours.
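Two ideas from the gene-set answers above lend themselves to a short illustration: keeping the model with the highest bitscore at each locus, and labelling a model as partial when its coding sequence lacks a start and/or stop codon. The Python sketch below is not the annotation pipeline itself. The `GeneModel` record, its field names, the function names, and the toy sequences are all hypothetical, and the check looks only at the first and last three bases of the spliced coding sequence without validating the reading frame.

```python
from dataclasses import dataclass

STOP_CODONS = {"TAA", "TAG", "TGA"}


@dataclass
class GeneModel:
    """Hypothetical record for one candidate gene model."""
    locus: str       # illustrative locus identifier
    model_id: str    # illustrative model identifier
    cds: str         # spliced coding sequence, 5' to 3'
    bitscore: float  # best blastp bitscore for the model's peptide


def classify_model(cds: str) -> str:
    """Label a coding sequence as complete or partial."""
    seq = cds.upper()
    has_start = seq.startswith("ATG")
    has_stop = len(seq) >= 3 and seq[-3:] in STOP_CODONS
    if has_start and has_stop:
        return "complete"
    if has_start:
        return "partial (no stop codon)"
    if has_stop:
        return "partial (no start codon)"
    return "partial (no start or stop codon)"


def select_best_models(models):
    """Keep, for each locus, the model with the highest bitscore."""
    best = {}
    for model in models:
        current = best.get(model.locus)
        if current is None or model.bitscore > current.bitscore:
            best[model.locus] = model
    return best


# Toy data, invented for illustration only.
models = [
    GeneModel("locusA", "A.1", "ATGGCTTGA", 120.0),
    GeneModel("locusA", "A.2", "ATGGCTGCTTAA", 250.5),
    GeneModel("locusB", "B.1", "GCTGCTTAG", 80.0),
]

for locus, model in sorted(select_best_models(models).items()):
    print(locus, model.model_id, classify_model(model.cds))
```

With the toy data, model A.2 is kept for locusA because it has the higher bitscore, and B.1 is reported as partial because its sequence does not begin with ATG.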


