cleverGO - Documentation

GO enrichment calculation

Based on the organism of the submitted data, we evaluate over- and under-represented GO terms using Fisher’s exact test. The universe datasets compiled for each organism are listed in “Appendix 1: Sources of universe datasets”. Annotation data for each organism were collected from the latest data published by the Gene Ontology Consortium [1].
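As an illustration of the test described above, the following minimal Python sketch computes one-sided Fisher's exact test p-values for a single GO term from the hypergeometric distribution. The function names, argument names and example counts are hypothetical and chosen for this illustration only; this is not the tool's actual implementation.

```python
from math import comb

def hypergeom_pmf(k, universe, annotated, sample):
    """Probability of exactly k annotated items in a sample drawn without replacement."""
    return comb(annotated, k) * comb(universe - annotated, sample - k) / comb(universe, sample)

def fisher_enrichment(k, universe, annotated, sample):
    """Return (p_over, p_under) for one GO term.

    k         - items in the submitted dataset annotated with the term
    universe  - total number of items in the universe
    annotated - items in the universe annotated with the term
    sample    - size of the submitted dataset
    """
    lowest = max(0, sample - (universe - annotated))
    highest = min(annotated, sample)
    # Over-representation: probability of seeing k or more annotated items.
    p_over = sum(hypergeom_pmf(i, universe, annotated, sample) for i in range(k, highest + 1))
    # Under-representation: probability of seeing k or fewer annotated items.
    p_under = sum(hypergeom_pmf(i, universe, annotated, sample) for i in range(lowest, k + 1))
    return p_over, p_under

# Made-up counts: 8 of 40 submitted proteins carry a term that
# annotates 50 of the 1000 proteins in the universe.
p_over, p_under = fisher_enrichment(k=8, universe=1000, annotated=50, sample=40)
print(p_over, p_under)
```

A small p_over suggests enrichment of the term in the submitted dataset, a small p_under suggests depletion.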

To correct for multiple testing, we apply the Bonferroni correction [5], which controls the family-wise error rate (FWER). Correcting for multiple testing is especially important for Gene Ontology results, as individual terms are related and often part of the same hierarchical trees.
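The Bonferroni adjustment itself is simple: each raw p-value is multiplied by the number of tests performed and capped at 1. The sketch below shows this in Python. The GO identifiers, p-values and the 0.05 significance level are made up for illustration and are assumptions, not values used by the tool.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests and cap the result at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Illustrative raw p-values for three tested terms (made-up numbers).
raw = {"GO:0006412": 0.0004, "GO:0008152": 0.03, "GO:0016020": 0.2}
adjusted = dict(zip(raw, bonferroni(list(raw.values()))))

# Keep the terms that remain below the assumed significance level.
significant = [term for term, p in adjusted.items() if p < 0.05]
print(adjusted)
print(significant)
```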

The first view of the tool is a classic enrichment table. It shows the enriched GO term, its coverage in the population and in the test dataset, and an enrichment/depletion indicator. In addition to this basic information, we provide the term depth taken from the acyclic GO graph provided by [1]. Users can browse the enrichment data with interactive filters: they can match text in the description field, sort by probabilities, or exclude terms based on their level.

Semantic similarity visualisation

Most methods for visualising GO enrichment show results only in the form of an enrichment table. However, many GO terms are related, and showing results on the enrichment basis alone loses the rich linking structure of GO.

To illustrate relationships between GO terms, we show them not only in tabular form but also as a graph of term-to-term similarity relations and their strengths. To calculate the strength of the relation between individual GO terms, we use a measure called “Semantic Similarity” [2], specifically the metric by Wang [6]. The Wang method is graph-based: it calculates the similarity of two GO terms from both the location of the terms in the GO graph and their relations with their ancestor terms.
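To make the Wang measure more concrete, here is a minimal Python sketch on a toy graph. Each ancestor of a term receives an S-value equal to the largest product of edge weights along any path from the term up to that ancestor, and the similarity of two terms is the sum of the S-values of their shared ancestors divided by the sum of all S-values of both terms. The edge weights of 0.8 for is_a and 0.6 for part_of are the values suggested in [6]. The toy graph, the term names and the function names are hypothetical, and the tool itself relies on an existing implementation [2] rather than on this code.

```python
# Semantic contribution factors per relation type, as suggested in [6].
WEIGHTS = {"is_a": 0.8, "part_of": 0.6}

def s_values(term, parents):
    """S-value of the term itself and of each of its ancestors."""
    s = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, relation in parents.get(child, []):
            contribution = WEIGHTS[relation] * s[child]
            # Keep the largest contribution reaching this ancestor.
            if contribution > s.get(parent, 0.0):
                s[parent] = contribution
                stack.append(parent)
    return s

def wang_similarity(term_a, term_b, parents):
    """Wang semantic similarity of two terms in the given graph."""
    s_a = s_values(term_a, parents)
    s_b = s_values(term_b, parents)
    shared = set(s_a) & set(s_b)
    numerator = sum(s_a[t] + s_b[t] for t in shared)
    return numerator / (sum(s_a.values()) + sum(s_b.values()))

# Toy graph (hypothetical terms): child -> list of (parent, relation).
toy_parents = {
    "B": [("root", "is_a")],
    "C": [("B", "is_a")],
    "D": [("B", "part_of")],
}
print(wang_similarity("C", "D", toy_parents))
```

Computing this value for every pair of terms of interest gives a symmetric similarity matrix with 1 on the diagonal.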

The semantic similarity calculation yields a similarity matrix for the GO terms of interest. From this matrix, we construct a graph showing which terms are connected to which. We chose a force-directed graph [3] to take the connection strength into account and form context-based clusters. We apply spring-like forces to connections, a gravity-like force to keep nodes centered, and a repulsion force to separate unconnected nodes. The user can interactively select a filtering threshold for the graph rendering. For example, a low cutoff threshold yields a higher number of connections and a potentially more connected graph, while more stringent criteria yield more specific clusters.
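The effect of the cutoff threshold can be sketched in Python as follows. The sketch keeps only the connections whose similarity reaches the cutoff and then groups terms into connected components. Treating connected components as clusters is an assumption made for this illustration, and the term names and similarity scores are made up. The force-directed layout itself is handled by the visualisation library [3] and is not reproduced here.

```python
def clusters_at_cutoff(terms, similarity, cutoff):
    """Group terms into connected components using edges with similarity >= cutoff."""
    neighbours = {t: set() for t in terms}
    for (a, b), score in similarity.items():
        if score >= cutoff:
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen, clusters = set(), []
    for start in terms:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from this term.
        component, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(component))
    return clusters

# Made-up pairwise similarities between four hypothetical terms.
terms = ["A", "B", "C", "D"]
similarity = {("A", "B"): 0.9, ("B", "C"): 0.5, ("C", "D"): 0.2,
              ("A", "C"): 0.3, ("A", "D"): 0.1, ("B", "D"): 0.1}

print(clusters_at_cutoff(terms, similarity, 0.4))  # looser cutoff, larger clusters
print(clusters_at_cutoff(terms, similarity, 0.8))  # stricter cutoff, smaller clusters
```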

At any point, the user can interact with the graph in the following ways:
Hovering over a node with the cursor shows information about the node.
Clicking a node opens an information panel about the cluster the node is part of. The panel lists all GO terms involved, as well as the proportion of the input set annotated by the terms in the cluster.
Each information panel also contains a link to download the list of identifiers from the input dataset that the cluster contains.
For each GO term cluster, we provide an at-a-glance visualisation in the form of a word cloud [4].

Each of the operations above depends on the current context: the cluster connectedness and information panel contents are recalculated for each cutoff threshold the user specifies.
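The contents of an information panel could be derived along the lines of the following Python sketch: for the terms of one cluster, collect the input identifiers annotated by any of them and report what proportion of the input set they make up. The function name, the data layout and the identifiers are hypothetical and only illustrate the idea.

```python
def cluster_summary(cluster_terms, annotations, input_ids):
    """Summarise one cluster: its terms, the input identifiers it covers, and their proportion."""
    input_ids = set(input_ids)
    covered = set()
    for term in cluster_terms:
        # Only identifiers that are part of the submitted dataset are counted.
        covered |= annotations.get(term, set()) & input_ids
    proportion = len(covered) / len(input_ids) if input_ids else 0.0
    return {
        "terms": sorted(cluster_terms),
        "identifiers": sorted(covered),
        "proportion": proportion,
    }

# Made-up annotations (term -> annotated identifiers) and input dataset.
annotations = {"T1": {"P1", "P2"}, "T2": {"P2", "P3"}, "T3": {"P9"}}
input_ids = ["P1", "P2", "P3", "P4"]

summary = cluster_summary(["T1", "T2"], annotations, input_ids)
print(summary)
```

Because clusters change with the cutoff threshold, such a summary would need to be recomputed whenever the threshold changes.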

Precision calculation

To calculate the expressiveness and specificity of each term, we have implemented the precision calculation according to [7].

References

[1] GO consortium - http://geneontology.org/
[2] GOSemSim: GO-terms Semantic Similarity Measures, Guangchuang Yu, College of Life Science and Technology, Jinan University, Guangzhou, China
[3] d3.js force layout - https://github.com/mbostock/d3/wiki/Force-Layout
[4] d3.js cloud visualisation - https://github.com/jasondavies/d3-cloud
[5] Bonferroni correction - http://en.wikipedia.org/wiki/Bonferroni_correction
[6] James Z. Wang, Zhidian Du, Rapeeporn Payattakool, Philip S. Yu, and Chin-Fu Chen. A new method to measure the semantic similarity of GO terms. Bioinformatics (2007) 23 (10): 1274-1281, first published online March 7, 2007, doi:10.1093/bioinformatics/btm087
[7] Herrmann C., Bérard S., Tichit L. SimCT: a generic tool to visualize ontology-based relationships for biological objects. Bioinformatics (2009) 25 (23): 3197-3198

Appendix 1: Sources of universe datasets

To evaluate enrichment of GO terms, each organism / ID system requires a universe. At the moment, we do not allow submission of a custom universe; we provide a general-purpose universe for each organism.

All of the proteomes except E. coli were downloaded from the UniProt proteome database:
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/proteomes/

The E. coli proteome was downloaded from the EBI database:
ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO/

The tool works with protein identifiers and protein universes, except for H. sapiens, where a gene-based calculation is also supported.

Supported organisms:

H. sapiens (human)
H. sapiens (humangene)
M. musculus (mouse)
C. elegans (caeel)
D. melanogaster (drome)
E. coli (ecoli)
D. rerio (danre)