Phytozome
Overview
Phytozome is a joint project of the Department of Energy's Joint Genome Institute and the Center for Integrative Genomics to facilitate comparative genomic studies amongst green plants. Clusters of orthologous and paralogous genes that represent the modern descendants of ancestral gene sets are constructed at key phylogenetic nodes. These clusters allow easy access to clade-specific orthology/paralogy relationships as well as clade-specific genes and gene expansions. As of version 3.0, Phytozome provides access to nine sequenced and annotated green plant genomes, eight of which have been clustered into gene families at six evolutionarily significant nodes. Where possible, each gene has been annotated with PFAM, KOG, KEGG, and PANTHER assignments, and publicly available annotations from RefSeq, UniProt, TAIR, and JGI are hyperlinked and searchable.
Included Organisms
The proteomes of the following organisms are clustered in release 2.0.4 of Phytozome:
| Organism | Common name | Source |
| Arabidopsis thaliana | Mouse-ear cress | TAIR release 7 acquired from TAIR |
| Populus trichocarpa | Poplar | JGI v1.1 annotation of the v1.0 assembly |
| Vitis vinifera | Grape | Sept 2007 annotation from Genoscope |
| Sorghum bicolor | Sweet Sorghum | Sbi1.4 models from MIPS/PASA on v1.0 assembly Preliminary Genomescan annotation of assembly sbi0 |
| Oryza sativa | Rice | TIGR Release 5 of the Rice Genome Annotation |
| Selaginella moellendorffii | Spikemoss | JGI v1.0 assembly and annotation |
| Physcomitrella patens | Moss | JGI v1.1 assembly and annotation |
| Chlamydomonas reinhardtii | Green alga | JGI v3.0 assembly and annotation |
Access is also provided to the sequence and annotation of soybean, though it is not yet included in Phytozome gene families.
| Glycine max | Soybean | Preliminary Genomescan/FgenesH/PASA annotation Glyma0.1 of assembly Glyma0 |
Nodes
Clustering is used to group extant genes into sets representing the ancestral genes that existed just prior to various significant evolutionary events (nodes). Extant genes have been clustered at nodes representing the following speciation events:
| Viridiplantae (~475 Mya): | Genes representing the most recent common ancestor of Embryophytes and Chlorophyta (represented by the alga Chlamydomonas). |
| Embryophyte (~450 Mya): | Genes representing the most recent common ancestor of Tracheophytes and Bryophyta (represented by Physcomitrella). |
| Tracheophyte (~420 Mya): | Genes representing the most recent common ancestor of Angiosperms and Lycopodiophyta (represented by Selaginella). |
| Angiosperm (~160 Mya): | Genes representing the most recent common ancestor of grasses and Rosids. |
| Rosid (~120 Mya): | Genes representing the most recent common ancestor of Arabidopsis, Poplar and Grape. |
| Grass (~75 Mya): | Genes representing the most recent common ancestor of Sorghum and Rice. |
For completeness, 3 additional nodes are included, representing more recent paralogy in Selaginella (more recent than the Tracheophyte node), Physcomitrella (more recent than the Embryophyte node), and Chlamydomonas (more recent than the Viridiplantae node).
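The node hierarchy above can be written down as a small tree, which makes it easy to list the organisms that sit below each node. The following is only an illustrative sketch: the names `TREE`, `organisms_under`, and `branches` are hypothetical and not part of Phytozome, and the placement of Selaginella directly below the Tracheophyte node is inferred from the note on the additional paralogy nodes.

```python
# Hypothetical encoding of the six clustering nodes; leaves are organisms.
# Assumption: Selaginella branches off at the Tracheophyte node, as implied
# by the note that its paralogy node is more recent than Tracheophyte.
TREE = {
    "Viridiplantae": ["Embryophyte", "Chlamydomonas"],
    "Embryophyte": ["Tracheophyte", "Physcomitrella"],
    "Tracheophyte": ["Angiosperm", "Selaginella"],
    "Angiosperm": ["Rosid", "Grass"],
    "Rosid": ["Arabidopsis", "Poplar", "Grape"],
    "Grass": ["Sorghum", "Rice"],
}


def organisms_under(node):
    """Return the set of organisms reachable from a node or leaf."""
    if node not in TREE:
        # Anything that is not an internal node is treated as an organism.
        return {node}
    found = set()
    for child in TREE[node]:
        found |= organisms_under(child)
    return found


def branches(node):
    """Map each branch leaving a node to the organisms on that branch."""
    return {child: organisms_under(child) for child in TREE[node]}


if __name__ == "__main__":
    for name in TREE:
        print(name, sorted(organisms_under(name)))
```

With this encoding, the root node covers all eight clustered organisms, and each branch of a node gives one group of organisms that can be compared against the organisms on the other branches.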
Clustering Methodology
All-against-all blastp alignments were performed for all 8 plants to be clustered. The bit score per unit peptide length was chosen as the similarity metric between two peptides. Clustering was performed hierarchically, from the crown nodes to the root, creating in-group paralogous clusters and merging in-group and outgroup clusters across nodes. (All organisms reachable via the same branch from a given node are considered in-group with respect to that node; organisms not reachable via that branch constitute an outgroup to the in-group organisms.) First, paralogous single-organism in-group clusters are constructed for each organism by comparing intra-organism similarity against inter-organism similarity; only those peptides more similar to each other than either is to any outgroup peptide are joined into clusters (the actual thresholding rule is more complicated, to avoid spurious creation of large paralogous clusters of weakly similar peptides). Then, clusters are merged across nodes via a mutual-best-hit criterion. This process continues down to the root, with paralogous clusters being merged via comparison of in-group to outgroup similarity, and mutual best hits being used to merge clusters across nodes. Minimum coverage thresholds are used to minimize the clustering of multi-domain proteins that may share only a single common domain, or the clustering of peptides from fragmentary gene predictions. The clustering algorithm will be discussed in detail in an upcoming publication.
Note that, by construction, every gene from an organism present at a particular node is in one and only one cluster at that node. Some clusters may contain only one extant gene (singletons). Singletons can arise from "fast" evolution leading to so much sequence divergence that sequence-similarity-based clustering is confounded, from gene loss, or from gene-calling errors.
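The similarity metric, the paralog test, and the mutual-best-hit merge described in this section can be illustrated with a short sketch. It is not the Phytozome implementation: every function name here is hypothetical, the choice of which peptide length to divide by is left to the caller because the text does not specify it, and the paralog test is the simple rule stated above rather than the more complicated thresholding actually used. Coverage thresholds are omitted.

```python
def normalized_score(bit_score, peptide_length):
    """Bit score per unit peptide length.

    Assumption: the caller decides which length to use (query, subject,
    or some combination); the text does not say.
    """
    if peptide_length <= 0:
        raise ValueError("peptide_length must be positive")
    return bit_score / peptide_length


def symmetrize(pair_scores):
    """Make a {(a, b): score} table usable in both directions."""
    table = dict(pair_scores)
    for (a, b), score in pair_scores.items():
        table.setdefault((b, a), score)
    return table


def best_hit(query, targets, scores):
    """Return the target most similar to query, or None if it has no hits.

    Ties are broken by the larger identifier, an arbitrary choice.
    """
    hits = [(scores[(query, t)], t) for t in targets if (query, t) in scores]
    return max(hits)[1] if hits else None


def mutual_best_hits(group_a, group_b, scores):
    """Pairs (a, b) where a's best hit in group_b is b, and vice versa."""
    pairs = []
    for a in group_a:
        b = best_hit(a, group_b, scores)
        if b is not None and best_hit(b, group_a, scores) == a:
            pairs.append((a, b))
    return pairs


def closer_than_outgroup(a, b, outgroup, scores):
    """Simplified paralog test: a and b are more similar to each other
    than either is to any outgroup peptide."""
    pair = scores.get((a, b))
    if pair is None:
        return False
    outgroup_scores = [
        scores[(x, o)] for x in (a, b) for o in outgroup if (x, o) in scores
    ]
    return all(pair > s for s in outgroup_scores)


# Toy similarity table (already normalized); identifiers are made up.
raw = {
    ("a1", "b1"): 2.0,
    ("a1", "b2"): 1.0,
    ("a2", "b1"): 1.5,
    ("a2", "b2"): 0.5,
    ("a1", "a2"): 2.5,
}
scores = symmetrize(raw)
pairs = mutual_best_hits(["a1", "a2"], ["b1", "b2"], scores)
paralogs = closer_than_outgroup("a1", "a2", ["b1", "b2"], scores)
```

In the toy table, `a1` and `b1` pick each other as best hits, while `a2` also prefers `b1` but is not chosen in return, so only one pair qualifies for merging.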
Clustering Statistics
| Node | Gene Families | Singletons | Median Family size |
| Viridiplantae | 11781 | 70058 | 4 |
| Embryophyte | 3364 | 73572 | 4 |
| Tracheophyte | 1057 | 76570 | 3 |
| Angiosperm | 1057 | 67616 | 3 |
| Rosid | 136 | 49076 | 3 |
| Grass | 372 | 31613 | 2 |
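Statistics of this kind can be derived from a cluster assignment in a few lines. The sketch below is only an illustration under stated assumptions: the function name `cluster_stats` and the input layout (cluster identifier mapped to a list of gene identifiers) are hypothetical, and it assumes that a gene family is a cluster with at least two genes and that the median is taken over families only, excluding singletons. The source does not confirm either convention.

```python
from statistics import median


def cluster_stats(clusters):
    """Summarize one node's clusters.

    clusters: dict mapping a cluster id to the list of genes it contains.
    Assumption: families are clusters with two or more genes, and the
    median family size ignores singletons.
    """
    sizes = [len(genes) for genes in clusters.values()]
    family_sizes = [size for size in sizes if size >= 2]
    return {
        "gene_families": len(family_sizes),
        "singletons": sum(1 for size in sizes if size == 1),
        "median_family_size": median(family_sizes) if family_sizes else None,
    }


# Made-up example with two families and one singleton.
example = {
    "c1": ["g1", "g2", "g3"],
    "c2": ["g4"],
    "c3": ["g5", "g6"],
}
stats = cluster_stats(example)
```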
Phytozome Team
| Software: | David M. Goodstein, Rusty Howson, Rochak Neupane, Shengqiang Shu |
| Analysis: | Bill Dirks, Uffe Hellsten, Therese Mitros, Dan Rokhsar |