Phytozome Tutorials and Help

Video Tutorials

Help

Nodes, Clusters and Consensus Sequences

Nodes and Clusters:

Please go to the info page for information on nodes and clustering.

Consensus Sequences

A consensus peptide sequence is constructed for each cluster from the MSA (multiple sequence alignment). The consensus is that sequence which maximizes the sum of the pairwise scores with each cluster member's peptide sequence. For Phytozome, BLOSUM62 was used as the scoring matrix, with gap opening and extension costs of 11 and 2, respectively. This relatively simple approach produces a cluster consensus sequence that is comparable to more sophisticated profile construction algorithms, in terms of its ability to post facto correctly assign (via BLAST) cluster members to their correct clusters.
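The idea of choosing the sequence that maximizes the summed score against every cluster member can be sketched column by column. The following is only an illustration, not the Phytozome implementation: `toy_score` is a hypothetical stand-in for a BLOSUM62 lookup, the gap penalty is a flat value rather than the affine 11/2 costs described above, and the function names are invented for this example.

```python
GAP = "-"


def toy_score(a: str, b: str) -> int:
    """Toy stand-in for a BLOSUM62 lookup (assumption, not the real matrix)."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -4  # flat gap penalty; the affine 11/2 costs are not modeled
    return 4 if a == b else -1


def column_consensus(msa):
    """Pick, per alignment column, the symbol with the best summed score.

    `msa` is a list of equal-length aligned sequences. Columns whose best
    symbol is a gap are dropped from the returned consensus.
    """
    if not msa:
        return ""
    width = len(msa[0])
    if any(len(row) != width for row in msa):
        raise ValueError("all MSA rows must have the same length")

    consensus = []
    for col in range(width):
        column = [row[col] for row in msa]
        candidates = sorted(set(column))
        best = max(candidates, key=lambda c: sum(toy_score(c, r) for r in column))
        if best != GAP:
            consensus.append(best)
    return "".join(consensus)


if __name__ == "__main__":
    print(column_consensus(["MKT-A", "MKTLA", "MRT-A", "MKT-A"]))
```

Scoring each column independently is a simplification: with affine gap costs the best choice in one column depends on its neighbors, so a faithful version would need to score whole candidate sequences.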

Viewing Gene Family Details

The Cluster Summary page provides a detailed picture of a given cluster's constituent genes. This page is accessed by clicking the "magnifying glass" icon next to the cluster of interest on the Search Results or BLAST Results pages.

Family Naming and Classification
A brief summary and high-level classification of the ancestral gene represented by this family. The summary includes the node name, the number of crown (extant organism) genes in the family, an automatically generated family name, and, where possible, both KOG letter and KEGG Brite classification of the cluster (also referred to as "Cluster KEGG Orthology").

Names for non-singleton clusters are either KOG-based (if more than 50% of a cluster's member genes are annotated with the same KOG, the cluster name is the KOG description) or SwissProt-based (if a cluster is purely orthologous, meaning all organisms at that node have one and only one member in the cluster, then the cluster will be named according to the SwissProt Description Entry of a "prominent", meaning well-annotated member). If neither of these cases applies, then the SwissProt or Trembl name of the member that is most similar to the cluster consensus sequence will be used. If this still does not yield a name, then the cluster is named simply "Hypothetical Gene." Singleton clusters are named with the definition line of their sole member.
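The naming rules above form a simple decision cascade, which can be sketched as follows. This is an illustration only: the `Member` fields, the function name, and the reading of the fallback step (take the member most similar to the consensus and use its name if it has one) are assumptions made for this example, not a description of the actual Phytozome code.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Member:
    """Hypothetical record for one cluster member."""
    organism: str
    defline: str
    kog_description: Optional[str] = None
    protein_db_name: Optional[str] = None  # SwissProt or Trembl name, if any
    well_annotated: bool = False
    consensus_similarity: float = 0.0


def name_cluster(members: List[Member], node_organisms: List[str]) -> str:
    if not members:
        raise ValueError("a cluster needs at least one member")

    # Singleton clusters take the definition line of their sole member.
    if len(members) == 1:
        return members[0].defline

    # KOG-based: more than 50% of members share the same KOG.
    kog_counts = Counter(m.kog_description for m in members if m.kog_description)
    if kog_counts:
        description, count = kog_counts.most_common(1)[0]
        if count > len(members) / 2:
            return description

    # SwissProt-based: every organism at the node has exactly one member.
    per_organism = Counter(m.organism for m in members)
    purely_orthologous = (
        set(per_organism) == set(node_organisms)
        and all(n == 1 for n in per_organism.values())
    )
    if purely_orthologous:
        for m in members:
            if m.well_annotated and m.protein_db_name:
                return m.protein_db_name

    # Fallback: name of the member closest to the consensus, if it has one.
    closest = max(members, key=lambda m: m.consensus_similarity)
    return closest.protein_db_name or "Hypothetical Gene"
```

For example, a three-member cluster in which two members carry the KOG description "Kinase" would be named "Kinase", because two of three is more than 50%.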

The KOG Class assignment follows the same rule as KOG naming, described above. The KEGG Brite classification is done similarly, with the modification that only 50% of the member genes that could possibly have a KO (KEGG Orthology) are required to agree. This modification is due to the fact that not all organisms have been analyzed via KO, which is a prerequisite for KEGG Brite classification. The two shallowest levels of the KEGG Brite classification are hyperlinked. Clicking on the links will find all clusters at the current node assigned this KEGG Brite annotation.

Note that for composite clusters, naming and classification information are not provided.

Genes in this Family
Information on each member gene of this cluster is available in this section. This information includes: the species code, the genomic location (chromosome/scaffold, with the start and end coordinates available via mouse hovering), reference identifiers for this gene in other datasets (e.g., RefSeq, Unigene, Uniprot, Ensembl, JGI), gene symbol(s), defline, a cartoon of any PFAM domains found on this gene's product, and a cartoon of the upstream and downstream neighbors of this gene. Note that each of these columns can be hidden or made visible by selecting it in the Display Options panel (see below). If an expand icon is visible next to a row, the row can be expanded to show more information.

The first id in the DbXREF column is always from the primary source dataset (the dataset from which this gene was obtained, typically Ensembl or JGI). If available in other datasets, their identifiers are listed as well (you will need to expand the row to see these other identifiers). Where possible, all reference identifiers are hyperlinked to an information page provided by their source's curators. A two-letter code is used to indicate the source database of a given identifier. The codes are

RS    NCBI RefSeq
ST    SwissProt/Trembl
UG    NCBI UniGene
EG    NCBI EntrezGene
HU    HUGO Gene Nomenclature Database
EM    EMBL

The Domains tab provides a cartoon view of any PFAM domains called on this gene's peptide. The same PFAM domain in different peptides will be rendered in the same color. Mouse over a domain to see the PFAM id, description, and domain coordinates displayed in a pop-up to the left of the domains. Click on a domain to highlight it in all rows in which it appears. Note that all peptides in the cluster are scaled to the same length for viewing.

The Synteny tab provides a view of the 5 upstream and 5 downstream neighbors ("syntenic block") of this gene (known as the "anchor gene", which is always rendered in black, except in the case of composite clusters, where the anchor genes are not necessarily from the same cluster). The syntenic blocks are oriented so that anchor genes are always on the same strand (consistent with their implied descent from a common ancestral gene). Mousing over any syntenic gene will produce an info box displaying the gene's primary id, and the name and id of the cluster containing that gene. The box also includes a link to that cluster's summary page. To access the link, click on the syntenic gene (which freezes the info box and highlights all other genes that are members of the same cluster), then move the cursor over to the hyperlink and click. To hide the info box and deselect any highlighted genes, simply click the "close" link at the upper right-hand corner of the info box.

Functional Analysis
The functional and domain annotations (e.g., KOG, KEGG, GO, PFAM, PANTHER) that have been assigned to members of this cluster are displayed here. For each annotation type, the identifier and description are provided, as well as this annotation's phylogenetic fingerprint (i.e., how many of the genes in this cluster have been assigned this annotation, broken down by organism).
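A phylogenetic fingerprint of this kind is essentially a per-annotation count of genes, grouped by organism. A minimal sketch follows; the input layout (tuples of organism code, gene id, and a set of annotation ids) and the function name are assumptions for illustration.

```python
from collections import defaultdict


def phylogenetic_fingerprint(gene_annotations):
    """Count, per annotation, how many genes of each organism carry it.

    `gene_annotations` is an iterable of (organism, gene_id, annotation_ids).
    Returns {annotation_id: {organism: gene_count}}.
    """
    fingerprint = defaultdict(lambda: defaultdict(int))
    for organism, _gene_id, annotations in gene_annotations:
        # set() guards against the same annotation being listed twice per gene
        for annotation in set(annotations):
            fingerprint[annotation][organism] += 1
    return {ann: dict(counts) for ann, counts in fingerprint.items()}


if __name__ == "__main__":
    genes = [
        ("Ath", "g1", {"PF00069"}),
        ("Ath", "g2", {"PF00069", "KOG0192"}),
        ("Osa", "g3", {"PF00069"}),
    ]
    print(phylogenetic_fingerprint(genes))
```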

Multiple Sequence Alignment
A clustalw Multiple Sequence Alignment (MSA) has been precalculated for each gene family. You can view the MSA in this panel, as well as download a conservation-colored html file of the alignment (please use the Get Data tab if you want the raw clustalw output). Note that any organisms which have been hidden in the Display Options will also be excluded from the MSA, though the MSA will not be recalculated. If you want to recalculate the MSA with certain sequences excluded or modified, you should go to the Align family members tab and launch Jalview.

Note that MSAs are not pre-calculated for composite clusters. If a composite cluster has fewer than 75 members, the Multiple Sequence Alignment tab will be visible, and clicking on it will launch a real-time alignment. For composite clusters with more than 75 members, the MSA tab will not be accessible.


Family History
All ancestors and descendants of the current gene family are shown in the Family History tab. The current gene family is highlighted in gray.

Align family members
This tab provides access to the Jalview Multiple Sequence Viewer. Click on "Align Member protein sequences" to load this cluster's peptide sequences into Jalview. Click on "Align Member coding sequences" to launch Jalview with the cluster's coding sequences instead. For all "reasonably" sized clusters, the Clustalw Multiple Sequence Alignment has been pre-computed, and will automatically load when Jalview is launched (protein sequences only). Otherwise, you can apply Clustalw (or MUSCLE, a similar Multiple Sequence Aligner) within Jalview. Once you have an alignment, you can build Neighbor-Joining or Maximum Likelihood phylogenetic trees. All alignments, sequences, and trees can be downloaded from Jalview in multiple formats. Please see the Analyzing Cluster Sequences section for more information.

There are several methods available for finding clusters related to the current cluster by descent, sequence similarity, or functional annotation.
Families related by descent:
Click on the View Ancestor link to be taken to the cluster summary page of the parent (immediate ancestor) of the current cluster. This cluster is guaranteed to contain all the members of the current cluster. If the current cluster is a root (i.e., most ancient) node cluster, of course, no ancestor exists. Click on the View descendants link to find the children (immediate descendants) of the current cluster. The union of these child clusters exactly reconstructs the current cluster. We currently do not include crown nodes (nodes consisting solely of a single extant organism). If you are already at a terminal (most modern) node, you won't see a link for descendant clusters. If you are interested in tracing the ancestry of only a particular subset of the members of the current cluster, select them (by checkbox) and click "Find all clusters with selected gene(s)". This will return only those clusters containing all of the selected genes.
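Both properties described here (child clusters whose union reconstructs the parent, and the search for clusters containing all selected genes) are set relationships. The sketch below illustrates them with plain Python sets; the cluster ids, gene ids, and function names are hypothetical.

```python
def clusters_with_all_genes(clusters, selected_genes):
    """Return ids of clusters that contain every selected gene.

    `clusters` maps a cluster id to an iterable of member gene ids.
    """
    wanted = set(selected_genes)
    return sorted(cid for cid, members in clusters.items() if wanted <= set(members))


def children_reconstruct_parent(parent_members, child_clusters):
    """Check that the union of the child clusters equals the parent cluster."""
    union = set()
    for members in child_clusters:
        union |= set(members)
    return union == set(parent_members)


if __name__ == "__main__":
    clusters = {
        "root_1": {"g1", "g2", "g3", "g4"},
        "mid_1": {"g1", "g2"},
        "mid_2": {"g3", "g4"},
    }
    print(clusters_with_all_genes(clusters, ["g1", "g2"]))
    print(children_reconstruct_parent(clusters["root_1"],
                                      [clusters["mid_1"], clusters["mid_2"]]))
```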
Families related by homology:
Each (non-composite) cluster is represented by a consensus peptide sequence, which is based on a residue-by-residue consensus constructed from the multiple sequence alignment of the cluster members' peptide sequences. One can search for clusters whose consensus sequence is similar to that of the current cluster by clicking the "BLAST for similar clusters" link. This link will load the blast search page with the current node selected. One can use consensus sequences from a different node as the target database by using the node selector on the blast page.

For composite clusters of fewer than 75 members, an MSA and consensus sequence are calculated on-the-fly, and the "BLAST for similar clusters" link functions exactly as for non-composite clusters. For composite clusters with more than 75 members, however, this link is not available.

Families related by functional annotation:
Use the checkboxes in the Functional Analysis section of the Cluster summary to select one or more functional annotations that have been assigned to the current cluster. Click on "Find node clusters with selected annotation(s)" to find all clusters at the current node which also have been assigned all the selected annotations.
Get Data
Use this tab to download sequences, annotations, and functional assignments associated with a given gene family/cluster. Using the "Get Sequences (basic)" link, you can download the peptide or nucleotide (CDS) sequence for each cluster member, the consensus sequence for the cluster, or the raw clustalw Multiple Sequence Alignment. Choose "View" to load the fasta sequence into a browser window, or "Download" to save it to a file. Note that any species hidden via Display Options will not be included in the "Cluster Sequences" download, though they will be included in the "Raw CLUSTALW alignment." Note that for composite clusters with more than 75 members, neither the Raw CLUSTALW alignment nor the consensus sequence is available.

To retrieve more detailed cluster or cluster member data (annotations, functional assignments, transcripts with flanking sequence, etc.), click on the "Get cluster data (advanced)" link and launch BioMart. This will take you to a BioMart interface that is prefiltered for the selected clusters (though you can change the cluster selections here as well). Here you can apply additional filtering and select exactly which attributes you want to view or download. Please see the BioMart section for more information.

Display options
Click on "columns" to select which columns are displayed in the "Genes in this cluster" section. The "Graphical Analyses" column refers to the Domain and Synteny views. The synteny color control refers to how many of the (displayed and hidden) syntenic blocks must contain members of a cluster for that cluster's members to be rendered in color (all members of the same cluster will be rendered in the same non-white color). By default this number is 2, but can be increased or decreased by clicking "+" or "-" in the column heading.
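The synteny color control amounts to a threshold on the number of syntenic blocks in which a cluster appears. A minimal sketch, assuming each block is given as a list of cluster ids (one per neighboring gene, `None` for genes without a cluster) and that a cluster is colored when it appears in at least `threshold` blocks:

```python
from collections import Counter


def clusters_to_color(blocks, threshold=2):
    """Return the cluster ids that appear in at least `threshold` blocks.

    A cluster is counted once per block, however many of its members
    the block contains.
    """
    block_counts = Counter()
    for block in blocks:
        for cluster_id in set(block):
            if cluster_id is not None:
                block_counts[cluster_id] += 1
    return {cid for cid, n in block_counts.items() if n >= threshold}


if __name__ == "__main__":
    blocks = [["c1", "c2", None], ["c1", "c3"], ["c2", "c1", "c1"]]
    print(sorted(clusters_to_color(blocks)))               # default threshold of 2
    print(sorted(clusters_to_color(blocks, threshold=3)))  # stricter setting
```

Raising the threshold leaves only clusters that are shared across many blocks in color, which makes widely conserved neighbors easier to spot.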

The Species Visible section of the Display Options allows the user to hide results from particular species. Unchecking a species' checkbox will cause information for that organism's genes to be removed from the cluster display. This affects the "Genes in this Cluster", "Functional Analysis", and "Multiple Sequence Alignment" tabs. If you wish to make these filter choices permanent, click on the "Save Species Setting" button.

Viewing Gene Details

The Protein page provides detailed information about a given protein. This page is accessed by clicking the "Gene Page" link next to the gene of interest on the "Genes in this cluster" tab of the Cluster page, or from a protein BLAST result.
About this gene
Information about the gene and its products is available in this section.

Info and Functional Annotations

This information is organized in two columns. The left column provides basic gene information: the locus name, transcript name, description, and external links are available in this box. If any exist, alternative transcripts are listed here. Alternative transcripts DO NOT have a "Peptide Homologs" or "Gene Ancestry" page, but annotations, domains view, and genome view are present. The right column is dedicated to the functional annotations (e.g., PFAM, PANTHER) that have been assigned to the protein. For each annotation type, the identifier and description are provided. If there are many annotations, the first few are displayed, and a bar below the list is provided to toggle between viewing all annotations and the first few. If assigned, PFAM domains are always displayed first, and sorted by best match.

Protein Domain View

This section provides a visual representation of assigned PFAM domains. Each PFAM domain is represented by a colored box that spans the matching region on the protein sequence. A matching colored box is also displayed next to the PFAM identifier under Functional annotations. Hovering over a domain will display the PFAM id, description, and sequence start and end of the domain hit.

Genomic View

A Gbrowse image of the gene is displayed in this section. A direct link to the Phytozome Gbrowse environment is also available.

Sequences
This section provides the sequences associated with this gene: the genomic sequence, transcript sequence, CDS sequence, and peptide sequence. The top of the section lists links to show a particular type of sequence. The Genomic sequence is displayed by default. Clicking on the type of sequence of interest will fold any other open sequences and display only the type selected. All sequence types can also be displayed by clicking on "Show All". Sequences are color coded; the legend for the code is shown at the top right corner, next to the sequence links. Each sequence type provides two BLAST links: the "Phytozome" link takes you to the Phytozome BLAST page, and the "NCBI" link to NCBI BLAST.

For the Genomic sequence, you can fetch upstream and downstream flanking sequences by entering the desired length in the "upstream" and "downstream" input boxes and clicking on "Update sequence". Once "Update sequence" is clicked, the sequence is updated without a page reload: the flanking sequences are added at the beginning and/or the end of the existing sequence body.

Aside from the links at the top, clicking on the expand control for a particular sequence type will unfold that sequence type while keeping any already open sequences open.

Peptide Homologs
An all-against-all Smith-Waterman alignment was run, and any peptide with a hit to this gene is listed on this page. Hits are sorted by score in descending order. The defline, three-letter organism code, Most Recent Shared Family (MRSF), score, and percent similarity are displayed for all hits to this gene. For further analysis of a gene with a hit, you can navigate to its gene page by clicking on the "GENE PAGE" link. If you would like further information on the alignment, click on the link to view the raw alignment. In addition to the alignment, you will also see the e-value, percent identity, and percent coverage in the raw-alignment view.
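Percent identity and percent coverage can be derived from a pairwise alignment in several ways, and this page does not state which definitions Phytozome uses. The sketch below uses one common convention as an assumption: identity is the fraction of alignment columns with identical residues, and coverage is the fraction of the query's residues that fall inside the alignment. The function and parameter names are hypothetical.

```python
def alignment_stats(aligned_query, aligned_hit, query_length):
    """Compute percent identity and query coverage for one gapped alignment.

    `aligned_query` and `aligned_hit` are equal-length strings using "-"
    for gaps; `query_length` is the full, ungapped query length.
    """
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query or query_length <= 0:
        raise ValueError("alignment and query must be non-empty")

    columns = len(aligned_query)
    identical = sum(
        1 for q, h in zip(aligned_query, aligned_hit) if q == h and q != "-"
    )
    query_residues = sum(1 for q in aligned_query if q != "-")
    return {
        "percent_identity": 100.0 * identical / columns,
        "percent_coverage": 100.0 * query_residues / query_length,
    }


def sort_hits_by_score(hits):
    """Order hits by score, best first, as in the homolog listing."""
    return sorted(hits, key=lambda hit: hit["score"], reverse=True)
```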

A schematic representation of the alignment is displayed in the right-most column. In this cartoon, a blue bar represents a hit to this gene, a break in a bar represents a gap in the alignment, and a dark blue bar represents an insertion in the homolog.

Gene Ancestry
This section lists all families that contain this gene. The list shows the size of each gene family, the node it belongs to, and its description, along with the organism footprint. If you would like to study a family further, you can click on the family of interest to load its Gene Family page.

Alignment Queries

BLAST Search:

The BLAST search implements NCBI Blast (v2.2.13) to enable sequence similarity searches of both individual organism genomes, as well as cluster consensus sequences defined at each node. Simply paste your sequence (with or without a fasta header) into the Query Sequence text box. If you are mainly interested in analyzing your sequence in the genomic context of a particular organism, use the "Organism Genome" target type. To analyze the evolutionary history and possible orthologs of your sequence, select the "Node Consensus" target type. Note the former is a nucleotide BLAST database, while the latter is a protein database.
BLAST Options:
The available options are mostly standard NCBI options. These include:
Allow Gapped Alignment
Comparison Matrix: Substitution matrix that determines the cost of each possible residue mismatch between query and target sequence.
Word Length: The minimum number of consecutive residues that must match identically between the query and target sequence in order to seed an alignment.
E threshold: The maximum expectation value of retained alignments.
# of alignments to show: How many top-scoring alignments should be displayed in the result set.
Filter options: Whether to remove low complexity regions from the query sequence, using DUST for blastn searches, and SEG for all others.
For Node Consensus searches, there is also the option of whether to include singletons in the target database. Singletons are clusters of size 1, whose consensus sequence is simply the single member's protein sequence.
BLAST results
By default, BLAST results are displayed in the browser. The Results page will be automatically reloaded until the search results are successfully retrieved. If you expect your job to take a long time to complete, you can select the "Notify by email when job completes" option next to the "Run BLAST" button. Enter your email address, and click "Run BLAST." You will be emailed a link to your BLAST results when the job completes, which allows you to navigate away from the BLAST window while you await the completion of your job.

BLAST results are organized into a table containing color-coded HSPs on the right (red being the most significant alignments and lighter colors being less so), with target information (name, cluster size for node consensus BLASTs, score and e-value) on the left. A link to an expanded view of the HSP (including the alignment in text form) and a link to either a cluster summary or Gbrowse genomic view of the target are included as well. If you click on an HSP image, the BLAST text report for that HSP becomes visible.

Analysis of BLAST results: Data Retrieval and Composite Cluster creation (Node Consensus BLAST only)
Click on the "Analyze Results/ Get Data" tab to display analysis and data retrieval options. Using the checkboxes to the left of the HSP list, you can select one or more clusters and either align the associated sequences in Jalview or retrieve cluster and sequence data. For Jalview, select which kind of cluster sequence you'd like to analyze (protein, coding, or cluster consensus sequence), and whether you want to include the BLAST query sequence in the alignment. When loading more than one cluster into Jalview, each cluster's sequences will be shaded in the same color so they can be readily distinguished from sequences from other clusters. All alignments, sequences, and trees can be downloaded from Jalview in multiple formats. Please see the Analyzing Cluster Sequences section for more information.

In order to retrieve sequences, annotations, functional assignments, etc., for your selected clusters, click on "Get data for selected clusters" and then click the "Launch Biomart" button. This will take you to a BioMart interface that is prefiltered for the selected clusters (though you can change the cluster selections here as well). Please see the BioMart section for more information.

You can also create a composite cluster by selecting two or more clusters from the BLAST results and clicking "View selected clusters as a single composite cluster." This will immediately take you to a cluster summary page that contains all the members of the selected clusters, arranged for viewing as a single, composite cluster.

BLAT Search:

The BLAT search implements BLAT (v 34) to enable sequence similarity searches of individual organism genomes. Simply paste your sequence (with or without a fasta header) into the Query Sequence text box, then select the target organism from the "Target" drop down menu. You can adjust the search settings under "Parameters"; default values are already set up here. Once you have configured the input parameters, click on "Run BLAT" to submit the alignment request.
BLAT Options:
The available options are:
Minimum Number of Matches: The number of tile matches. Usually set from 2 to 4.
Minimum Score: The minimum score. This is the matches minus the mismatches minus some sort of gap penalty.
Minimum Identity (%): The minimum sequence identity (in percent).
Maximum Gap: The size of maximum gap between tiles in a clump. Usually set from 0 to 3. Only relevant for MinMatch > 1.
Tile Size: The size of match that triggers an alignment. Usually between 8 and 12.
Maximum Intron Size: The maximum spacing between HSPs that will be joined together.
Mask Query Sequence: Mask out repetitive sequences in the BLAT target.
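To make the Minimum Score and Minimum Identity options more concrete, the sketch below scores a hit as matches minus mismatches minus a gap penalty and then applies both thresholds. The exact gap penalty is not specified on this page; charging one point per gap is an assumption made here, as are the simple identity formula, the threshold values in the example, and all function and parameter names.

```python
def blat_like_score(matches, mismatches, query_gaps, target_gaps, gap_penalty=1):
    """Matches minus mismatches minus a per-gap penalty (penalty is assumed)."""
    return matches - mismatches - gap_penalty * (query_gaps + target_gaps)


def percent_identity(matches, mismatches):
    """Simple identity over aligned, ungapped positions (an assumption)."""
    aligned = matches + mismatches
    return 100.0 * matches / aligned if aligned else 0.0


def passes_filters(hit, min_score, min_identity):
    """Apply Minimum Score and Minimum Identity (%) style thresholds to a hit."""
    score = blat_like_score(
        hit["matches"], hit["mismatches"], hit["query_gaps"], hit["target_gaps"]
    )
    identity = percent_identity(hit["matches"], hit["mismatches"])
    return score >= min_score and identity >= min_identity


if __name__ == "__main__":
    hit = {"matches": 100, "mismatches": 5, "query_gaps": 2, "target_gaps": 1}
    print(blat_like_score(100, 5, 2, 1))
    print(passes_filters(hit, min_score=30, min_identity=90.0))
```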
BLAT results
By default, BLAT results are displayed in the browser. The Results page will be automatically reloaded until the search results are successfully retrieved. If you expect your job to take a long time to complete, or would like to navigate back to the results later, you can select the "Notify by email when job completes" option next to the "Run BLAT" button. Enter your email address, and click "Run BLAT." You will be emailed a link to your BLAT results when the job completes, which allows you to navigate away from the BLAT window while you await the completion of your job.

BLAT results are organized into a table containing color-coded HSPs on the right (red being the most significant alignments and lighter colors being less so), with score and e-value on the left. The aligning coordinate is displayed on the far right of the table. A link to an expanded view of the HSP (including the alignment in text form) and a link to the Gbrowse genomic view of the target are included as well. If you click on an HSP image, the text report for that HSP becomes visible.

Analyzing Cluster Sequences

Jalview is used for sequence viewing, alignment, and tree-building. When you launch Jalview (having selected one or more clusters), the protein or coding sequences of each cluster member are loaded into an alignment panel. If the set of sequences corresponds to a single cluster, the pre-computed MSA (multiple sequence alignment) is also retrieved and loaded into another alignment panel. Otherwise, you can launch a CLUSTALW or MUSCLE MSA yourself (under the "Align" menu in Jalview). Sequences are grouped by greatest pairwise similarity after an alignment. Once you have an MSA, you can build a neighbor-joining or maximum likelihood tree from the aligned sequences (under the "Tree" menu).

You can always remove a sequence from the set by highlighting the sequence name and choosing "Edit->Delete" from the menu. If you'd like to add one or more sequences to the set, choose "Edit->Add Sequence(s)". It's important to re-align the set after you add or delete sequences.

The Features menu allows you to visualize PFAM domains directly on the sequences in the alignment panels. Simply select "Features->PAC Protein Domains", and a list of PFAM identifiers and descriptions will appear in a panel to the right. Clicking on any one of these entries will highlight (in blue) that particular PFAM domain on all sequences in the panel.

If you would like to save an MSA or tree, choose the "File->Save As" menu item, and specify the desired file format (Fasta, clustal, MSF, etc.). If you'd like an image or HTML page of the alignment, choose the "File->Export" menu item instead.

More help on Jalview is available here.

Retrieving Data with BioMart

Overview
BioMart is used for the retrieval of complete or partial Gene Family (cluster) or individual genome data sets. You can apply filters (analogous to queries) that allow you to select only those gene families or genome annotations matching specific constraints (e.g., gene families annotated with KOG, gene families containing a certain number of members from particular genomes) and select which attributes (types of data) you want to include in your result set. Once you've specified the data set and filters, you can choose to retrieve the results as a text file or navigable web page, or, for large result sets, as a compressed file suitable for downloading.

Typical Use Case - Gene Family Data
Click the "BioMart" link on the menu of any page, or, if already in BioMart, click "New". Select the "Families" dataset with the "CHOOSE DATASET" pulldown. Click on "Filters" in the left panel, and filter on one or more of the following: Clusters, Cluster Members, or Annotation. Within Clusters you can specify the Node, Cluster Name (defline), the Cluster size, or a list of one or more Cluster identifiers. If you choose to filter based on Cluster Members, you can specify how many members (within a range) of each species should be present in the clusters you retrieve. If you choose to filter by Cluster Annotation, you can specify that only those clusters with (or without) assigned KEGG Orthologies or KOG classifications be included. You can also filter on specific KO or KOG Letter assignments.

Once you've selected your filters, click the "Attributes" heading in the left panel and specify the data you'd like included in your result set: Features, Sequences, or Consensus Sequence. Features can include the Cluster Id, name, size and node, the counts of membership by species, any KOG or KO (KEGG) annotations assigned to the cluster, as well as a list of cluster members by species and internal transcript id. If you choose to retrieve Sequence data, you can specify the type of sequence, whether flanking sequence is included, and what information you'd like included in the fasta header for each sequence. Finally, you can also choose to retrieve the Cluster (family) consensus sequence, specifying as well what information should be included in the fasta header.

With filters and attributes specified, it's always a good idea to click the "Count" button in the left panel to get a sense of how large the result set will be (to allow you to choose sensibly between web viewing and downloading a file of the results). You can now click the "Results" button and get a preview of your result set. To save your result set, choose the data format and file export method on this page and click "Go".
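The filter, count, and attribute steps described in this use case can be pictured as operations on a table of cluster records: filters choose rows, "Count" reports how many rows remain, and attributes choose columns. The sketch below is a conceptual illustration in plain Python, not BioMart's API; the record fields (`cluster_id`, `node`, `size`, `kog`) and function names are hypothetical.

```python
def apply_filters(records, filters):
    """Keep only the records that satisfy every filter predicate."""
    return [r for r in records if all(predicate(r) for predicate in filters)]


def select_attributes(records, attributes):
    """Project each record onto the requested attribute names."""
    return [{name: r.get(name) for name in attributes} for r in records]


if __name__ == "__main__":
    clusters = [
        {"cluster_id": 1, "node": "Rosid", "size": 12, "kog": "T"},
        {"cluster_id": 2, "node": "Rosid", "size": 3, "kog": None},
        {"cluster_id": 3, "node": "Grass", "size": 20, "kog": "T"},
    ]
    filters = [
        lambda r: r["node"] == "Rosid",   # filter on Node
        lambda r: r["kog"] is not None,   # only clusters with a KOG assignment
    ]
    matching = apply_filters(clusters, filters)
    print(len(matching))                  # the "Count" step
    print(select_attributes(matching, ["cluster_id", "size"]))
```

Checking the count before projecting attributes mirrors the advice above: it is cheap, and it tells you whether the result is small enough to view in the browser.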

Typical Use Case - Single Genome Data
Please read the above information for the "Gene Family Data" use case. Select the "Genomes" dataset with the "CHOOSE DATASET" pulldown. The filters available for individual genomes include organism, gene identifier list, and gene functional annotation (KOG, PFAM, Panther, etc.). The attributes include external identifiers (e.g., RefSeq, SwissProt), annotations (PFAM, Panther, etc.), gene structure (exon coordinates) and gene and peptide sequences.

Typical Use Case - Multiple Genome Data accessed via Gene Family Dataset
If you want to obtain detailed genome data (e.g., functional annotations of individual genes) for the members of a cluster, you can take advantage of BioMart's intersection capability, which allows you to grab data based on the intersection of results from two datasets. To obtain detailed information about the members of a particular cluster (or set of clusters):
  • Choose "Genomes" from the "CHOOSE DATASET" pulldown.
  • Select the attributes you want to retrieve.
  • Now, click on the "Dataset" link in the left panel. A new pulldown will appear that reads "CHOOSE ADDITIONAL DATASET". Select "Families" from this pulldown.
  • Specify the cluster (or clusters) of interest in the Filter set for this dataset.
  • Click on "Results", and you'll get a result set which contains the detailed genome data for all the members of the cluster(s) specified in the second dataset.
These dual dataset queries can be time-consuming, and the order in which they are performed is not always the most efficient. It's often useful, after setting up the two datasets, to click "Count" and look in the left panel to see how many Entries are returned from each dataset. It's always faster if the second dataset has fewer entries (i.e., is more restrictive) than the first. If that isn't the case, you might consider redoing your query with the order of the datasets reversed (click "New" and start over, specifying the more restrictive query/dataset second).

More help on BioMart is available here.

  ©2011 University of California Regents. All rights reserved