
Rationale

150 years ago, Charles Darwin sailed the seas on board HMS Beagle, exploring life's morphological diversity. Today, ships such as the Sorcerer II sample the oceans in the hope of uncovering some of life's molecular diversity through metagenomic sequencing.

CAMERA[1] participants have sequenced random DNA fragments extracted from dozens of ocean water samples, from the Seychelles to New Caledonia and the Sargasso Sea. This has produced over 1.045 billion nucleotides of microbial DNA sequence, submitted to the public sequence databanks without any annotation other than geophysical properties of the water sample (e.g. GPS position, sample depth etc.). Bioinformatics provides the tools of choice to observe biodiversity at this molecular scale!

Your mission, should you accept it, is to attempt to identify the microbial origin of these sequences (archaea, protists, algae, viruses?), as well as to determine the putative functions of the coding sequences contained therein. Some sequences look rather familiar, whilst others are totally novel or very strange indeed!

More background on the "Global Ocean Sampling" expedition can be found in the Open Access PLoS special issue.

[1] CAMERA stands for Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis



Team information

Freelance volunteer annotator

The "Open access" team is open to all volunteer explorers! To join, please open an account in the "Open Access" team using the "Create New Account" tab at the top of this page.

Students taking a course running an Annotathon team

You are invited to open an account in your course's specific team (e.g. BioCell2007) using the "Create New Account" tab at the top of this page.

Teachers who wish to lead an Annotathon student team

If you teach and wish to create a team for your students, you will find all necessary information in the Annotathon Instructor Manual.

Aims

Your team will collectively annotate distinct DNA fragments randomly distributed from the available public sequence pool. Each registered annotator will be responsible for annotating a specific set of sequence fragments. For each fragment, annotators will produce a full report specifying if it is likely to be coding, the putative function of the protein product, as well as the most likely taxonomic classification of the host organism.

By practicing the core bioinformatics tools on a number of distinct sequences, each one a live piece of experimental data, you will become familiar with running and interpreting fundamental sequence analyses. Experience shows that after two or three distinct analyses, the focus shifts from bioinformatics to biological issues! All tools are available online, so all you need to start is a web browser.

User Guide

You can access the Annotathon from any computer connected to the Internet, irrespective of operating system (Mac, Windows or Linux).

Authentication

We recommend that you simultaneously open the following pages (in different windows or tabs) in your browser:


If you don't have an Annotathon account (i.e. it is your first session), click on the "New account" tab in the permanent menu at the top of the Annotathon pages. Follow the instructions to open a new account; make sure you select the appropriate affiliation or you might end up being supervised inside another team, possibly using a different language... If you don't know your Team code, please ask your instructor for it. You are required to enter at least one first name/last name pair, and one email address in order to receive Annotathon specific notifications. Your email address is secure and will under no circumstance be made public or passed on to any third party. Only low traffic messages specific to your course duration will be mailed to this address; no further messages will be sent after the course is completed.

Finally, a click on "Open the account" should be followed by the message "Account 'XYZ' has been created". Enter your 'username' and 'password' and click "Connect" in the form at the top of the page to open an Annotathon session. You will be reminded that your email address is not validated until you have followed the special link included in an email automatically sent to you at account creation.

The home page (also available by clicking the "Home" tab) gives an overview of the team's annotation progress. Note that once connected you will be able to locate your position in the team at the bottom of the stats page (after your annotations start being evaluated).

Cart and sequence fragments

Each annotator can list the sequence fragments under their annotation responsibility by clicking on the "Cart" tab. Since your cart is initially empty, select an ocean sampling location (e.g. Caribbean Sea: Rosario Bank) and click on "Add a new sequence to your cart".

You can only add new sequence fragments to your cart when it is empty, or when you have already annotated all available sequences. Add new sequence fragments at your discretion (or until you reach the upper limit set by your supervisor).

View annotations

Click the icon opposite the sequence you wish to view in your cart. The initial annotation is minimal: apart from the sequence itself and its geographic origin, each sequence fragment only has a unique Annotathon accession number. The remaining annotation is your responsibility!

Modify annotations

Click the icon opposite the fragment you wish to edit. After modifying any annotations, remember to save your work on the Annotathon server by clicking the "Save your annotations" button! Should you leave this editing form without submitting, all modifications since the last save will be lost... Since you can submit your work as often as you wish, it is recommended that you save your work regularly.

Fragment accession numbers

The accession numbers assigned to sequence fragments (e.g. GOS_5421290.1) are arbitrary and internal to the Annotathon; the digits after the dot correspond to the annotation version (which starts at 1 and increments by one at each save). You can view previous annotation versions by selecting the appropriate version in the popup list at the top of the visualisation sheets ( icon).

Submit annotations for evaluation

When your annotations are completed, click the icon opposite the fragment. Its status will then shift from 'Annotation 1' to 'Evaluation 1' and it will be closed for editing until your work has been evaluated. After this initial evaluation, the fragment status shifts from 'Evaluation 1' to 'Annotation 2'; you are then invited to update your initial annotations following the evaluator's comments. When your second annotation pass is completed, click the icon to submit your annotation for the second and final round of evaluations.

Discussion forum

The "Forum" tab opens access to the Annotathon internal forum (an icon signals that a new unread message has been posted to the forum). Click on a message subject to see its content. If you wish to reply to a message, use the form immediately under it and click "Post message". IMPORTANT! Only use this method to DIRECTLY REPLY TO THE CONTENT OF A SPECIFIC MESSAGE!

If you wish to open a new discussion thread, you MUST use the special new thread form available at the top of each of your annotation records ( icon in your cart)! You can then select the appropriate forum for your new thread (e.g. Searching for homologues: BLAST). A link to your specific sequence fragment and associated annotations will automatically be included with your post. Note that the messages you post are also emailed directly to your supervisors and fellow annotators.

Although messages are often answered by supervisors, trainees who wish to offer help by answering fellow trainees' questions are strongly encouraged to do so. Constructive replies will be taken into account in trainee evaluations.

Announcements

Announcements from your supervisors will be displayed at the top of Annotathon pages. Once read, tick the 'Read' box to transfer the message to your archive (available at the bottom of the Forum tab).

Sequence Annotations

  1. General principles
  2. ORF finding
  3. Molecular weight
  4. Conserved protein domains
  5. BLAST homolog search
  6. Multiple sequence alignment
  7. Phylogenetic tree
  8. Taxonomy
  9. Biological Process & Molecular Function
  10. Gene Symbol
  11. Conclusion

General principles

The sequence annotation editing form has two types of fields:
  1. raw results (i.e. ORFfinder, BLAST, multiple alignment, phylogenetic tree)
  2. interpretations and conclusions (i.e. molecular function, biological process, gene symbol, taxonomy, conclusion)
The Annotathon editing form is hence both a digital "lab book" (type 1 fields) and an "annotation report" (type 2 fields).

IMPORTANT: for type 1 fields (raw results), boxes are initially filled with a standard template of the form:

PROTOCOL:

---------------------------------------------------------------------------------------------------
RESULTS ANALYSIS:

---------------------------------------------------------------------------------------------------
RAW RESULTS:

Under the "PROTOCOL" heading, specify the minimum information necessary to reproduce the exact same results. Usually, this would entail giving the name of the tool used, together with its run parameters. For instance, for the ORF finding results field, the protocol line could read:

PROTOCOL:

SMS ORFinder / direct strand / frames 1, 2 & 3 / min 60 AA / 'any codon' initiation / 'universal' genetic code 

Copy & paste the raw results of the analysis, in full, under the "RAW RESULTS" heading. If you have carried out more than one analysis (for instance two SMS ORF Finder runs, one on the forward and one on the reverse strand), then reference the two analyses using an index exactly as follows:

PROTOCOL:

a) SMS ORFinder / forward strand / frames 1, 2 & 3 / min 60 AA / 'any codon' initiation / 'universal' genetic code 
b) SMS ORFinder / reverse strand / frames 1, 2 & 3 / min 60 AA / 'any codon' initiation / 'universal' genetic code 
---------------------------------------------------------------------------------------------------
RESULTS ANALYSIS:

[enter your observations here]

---------------------------------------------------------------------------------------------------
RAW RESULTS:

a) forward strand

>ORF number 1 in reading frame 1 on the direct strand extends from base 511 to base 744.

CGAGTGATAACTGGTCCAGTAATCGCGATACCGATCATCTTGTTGCGGATTGACGATGTT
AAAATCCCGATCAGGGCGGATATCCAGCCCCAGCCTTTCACAACGTTGCTGAATCACTTC
GGGGCGGCCTATGACGATGGGAACTTCGCTGGTTTCTTCCAAAACGGCCTGAGCGGCGCG
CAGCACCCGCTCGTCTTCGCCCTCGGCAAACACAATCCGTCGAGCGCTGCTTGA

>Translation of ORF number 1 in reading frame 1 on the direct strand.
RVITGPVIAIPIILLRIDDVKIPIRADIQPQPFTTLLNHFGAAYDDGNFAGFFQNGLSGA
QHPLVFALGKHNPSSAA*

---------------------------------------------------------------------------------------------------
b) reverse strand

>ORF number 1 in reading frame 1 on the reverse strand extends from base 517 to base 855.
CCTGATCTGTGGCGCTGTGGGCGAATTCAGATGGCATCTGAATTATATCGAGCAAATTTT
AGGCAGCAAAACCTTATCGCCAAGCGGCGCGCTGTCTTTGATGATTTTAGAAGACGGGCC
TCTGTTCATCGCAGACACCCACGTCTGGGCGGATCCCACCCCCATGCAAATTGCCCAAAC
CGCCAAAGGGGCCGCGCGCCATGTGCGCCGTTTTGGCATAGAGCCACAAGTCGCGCTGTG
CTCGCAATCACAATTTGGAAATCTGAACAGCGAGACTGGCAAGAAAATGCGCCAAGCATT
GGATATTCTCGATACCGAAAAGGTGACGTTTACCTATGA

>Translation of ORF number 1 in reading frame 1 on the reverse strand.
PDLWRCGRIQMASELYRANFRQQNLIAKRRAVFDDFRRRASVHRRHPRLGGSHPHANCPN
RQRGRAPCAPFWHRATSRAVLAITIWKSEQRDWQENAPSIGYSRYRKGDVYL*

Finally, use the "RESULTS ANALYSIS" section of the type 1 fields to set out your observations of the raw results. Results analysis, a pivotal part of scientific discourse, answers the question "what did we see that is notable when we carried out the experiment described in the protocol?". These rigorous factual observations, usually accompanied by precise numerical values (percentages, E-values, number of hits, number of amino acids etc.), are offered without far-reaching discussion. Keep the main discussion and interpretations for the "Conclusion" field.

Note: the last "Notepad" field at the bottom of the sequence editing form is available to store any data that isn't accommodated by other specific annotation fields. Use the Notepad to store data that can be useful for subsequent re-analyses (e.g. store your set of FASTA formatted homolog sequences here). The Notepad is your private space and is not consulted during evaluation.

Brief contextual help is available for each annotation field of the editing form by clicking the icons. The information expected for each annotation field is described below.

Remember that a Frequently Asked Questions page is available for in-depth explanations, tutorials and screen shots of each of the bioinformatics analyses needed to perform the sequence annotations.

Always keep in mind during your analyses the two main focal points of your annotation, which consist in proposing:

No single bioinformatics tool can by itself answer any of these questions; answers will be built by cross-checking and synthesizing all available results.

The basic rule set below can be overridden by more specific or alternative rules given to you by your instructors. If in doubt, always consult your instructors.

ORF finding

The first investigation for each DNA fragment will involve the identification of putative Open Reading Frames (ORFs). There are many tools to tackle this issue. For this study, you will only consider ORFs that satisfy the following criteria:
  1. do not contain any STOP codons (basic ORF definition...)
  2. contain at least 60 codons
  3. can be on either direct or reverse strands
  4. can be in frames 1, 2 or 3
  5. can be incomplete at the 5' or 3' ends, or both!
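
The criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm used by the SMS or NCBI ORF finders: the function name find_orfs, its return format and the use of 'X' for ambiguous codons are hypothetical choices, the 'universal' genetic code is assumed, and any codon is accepted as initiation, so an ORF is simply a stop-free stretch of at least 60 codons.

```python
# Standard ("universal") genetic code, built from the usual TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=60):
    """List stop-free stretches of at least min_codons codons in all six frames.

    Each ORF is (strand, frame, start, end, protein). Positions are 1-based
    and given on the strand that carries the ORF; 'end' includes the STOP
    codon when the ORF is complete at its 3' end.
    """
    orfs = []
    strands = (("direct", dna.upper()), ("reverse", reverse_complement(dna)))
    for strand, seq in strands:
        for frame in range(3):
            codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
            first = 0  # index of the first codon of the current stretch
            # The trailing None marks the end of the fragment.
            for idx, codon in enumerate(codons + [None]):
                is_stop = codon is not None and CODON_TABLE.get(codon) == "*"
                if not is_stop and codon is not None:
                    continue
                if idx - first >= min_codons:
                    protein = "".join(CODON_TABLE.get(c, "X")
                                      for c in codons[first:idx])
                    start = frame + 3 * first + 1
                    end = frame + 3 * idx + (3 if is_stop else 0)
                    orfs.append((strand, frame + 1, start, end,
                                 protein + ("*" if is_stop else "")))
                first = idx + 1
    return orfs
```

Calling find_orfs(fragment) on a fragment sequence returns candidate ORFs on both strands; the longest one can then be picked with max(orfs, key=lambda o: len(o[4])). The output of such a script does not replace the raw results of the ORF finding tool named in your PROTOCOL line.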

Copy & paste the raw, UNEDITED ORF finding output into the 'ORF finding' field of the Annotathon editing form. Remember to conduct the analysis in all SIX frames, and do include the PROTOCOL line above each raw result.

If your sequence contains several ORFs, arbitrarily select the longest one for all subsequent analyses.

You must decide if the best (longest) ORF is likely to be a true or a false positive. The key elements in support of a true positive ORF are:

If homologs clearly exist, you can conclude that the sequence is coding DNA whatever the ORF size. Otherwise, the true or false positive nature of the ORF will essentially depend on the ORF size[1]. There is no hard threshold, but it is very unlikely that a 150 amino acid ORF is a false positive...

-If the DNA fragment doesn't appear to contain any ORFs, tick the 'non-coding' box of the 'Status' field. The annotation of this fragment will be limited to populating the ORF finding and BLAST fields, as well as the conclusion of course!

-If the sequence appears to carry a true coding ORF, tick the 'coding' box of the 'Status' field. Indicate in the appropriate fields the start and end positions of the ORF[2], as well as the strand. Note that if the ORF is complete at the 3' end (i.e. finishes with a STOP), you need to subtract the 3 STOP codon nucleotides from the end position. Validate this ORF by clicking "Save annotations".

If the ORF satisfies the rules above, the translation will automatically be displayed; otherwise an error message will help you pinpoint the problem. The ORF can be incomplete, in which case simple informational messages to this effect are displayed.

[1] Indeed, the absence of homologs in public protein databases does not suggest that a sequence is non-coding; it merely means that there is currently no known homolog. There exist other, so-called ab initio, approaches to identify true positive coding ORFs (for instance based on statistical codon usage biases), but these methods usually require organism specific training sets of known genes or large chunks of genome sequence, and are hence not well suited to metagenome exploration.

[2] Important note on the ORF coordinate system: The ORF start & end positions must be given on the strand which carries the ORF! The ORF positions given by the SMS ORF finder can be entered as is, whereas ORF locations on the reverse strand provided by the NCBI ORF finder need to be converted (fragment length - position +1)...
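
The two coordinate rules (conversion to the reverse strand, and removal of the STOP codon from the end position) amount to simple arithmetic, sketched below. The function names are hypothetical, and the 900 nucleotide fragment length is an assumption chosen purely for illustration.

```python
def direct_to_reverse(position, fragment_length):
    """Convert a 1-based position on the direct strand to the reverse strand.

    Implements the rule: fragment length - position + 1.
    """
    return fragment_length - position + 1


def end_without_stop(end, ends_with_stop):
    """Drop the 3 STOP codon nucleotides from an ORF end position, if any."""
    return end - 3 if ends_with_stop else end


# Assumed example: a tool reports a reverse-strand ORF as 384..46 in
# direct-strand coordinates, on a fragment assumed to be 900 nt long.
FRAGMENT_LENGTH = 900
start = direct_to_reverse(384, FRAGMENT_LENGTH)  # 900 - 384 + 1
end = direct_to_reverse(46, FRAGMENT_LENGTH)     # 900 - 46 + 1
print(start, end)

# An ORF reported from base 511 to base 744 that finishes with a STOP codon
# is entered with an end position three nucleotides shorter.
print(end_without_stop(744, ends_with_stop=True))
```

With these assumed values, the reverse-strand ORF is entered as 517 to 855 before any STOP codon correction, and the direct-strand ORF ending at base 744 is entered with end position 741.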

Please refer to the Frequently Asked Questions for further details on ORF finding, in particular on the subtle issue of the exact determination of the ORF start position...

Molecular weight

If the ORF is complete at both ends, compute its theoretical polypeptide molecular weight using for instance:
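
As an illustration of what such tools compute, the theoretical molecular weight of a complete polypeptide is the sum of its residue masses plus one water molecule. The sketch below uses commonly tabulated average residue masses (in daltons); the table and the function name molecular_weight are assumptions made here for illustration, and the value you report should come from the tool named in your protocol.

```python
# Average residue masses in daltons (amino acid mass minus one water).
# These are commonly tabulated values, quoted here as an assumption.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594,
    "N": 114.1038, "D": 115.0886, "Q": 128.1307, "K": 128.1741,
    "E": 129.1155, "M": 131.1926, "H": 137.1411, "F": 147.1766,
    "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one H2O for the free N- and C-termini


def molecular_weight(protein):
    """Return the average molecular weight (Da) of a protein sequence.

    A trailing '*' (STOP symbol from a translation) is ignored.
    """
    seq = protein.strip().upper().rstrip("*")
    if not seq:
        raise ValueError("empty protein sequence")
    unknown = set(seq) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue symbols: {sorted(unknown)}")
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS


# Usage: paste the translation of a complete ORF as a single string.
print(round(molecular_weight("MKV*") / 1000, 2), "kDa")
```

The calculation only makes sense for an ORF that is complete at both ends, since a truncated translation would underestimate the weight.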

Conserved protein domains

Find out whether your ORF contains any of the known conserved domains stored in one of the domain databases: INTERPRO is a good choice since, as a federation of all other databases, it contains all known domains; INTERPRO analyses can however be on the slow side.

Only submit to the Annotathon domains that you have good reasons to believe are significant, that is to say:

  1. those that are not expected to be found easily purely by chance (i.e. which have sufficiently specific profiles/signatures)
  2. those whose predicted functions are coherent with the other bioinformatics results (e.g. a DNA binding domain for an ORF whose BLAST homologs are transcription factors)
  3. domains that are non-redundant (and non-overlapping) with other domains you have submitted to the Annotathon
Enter at most four domains that you are convinced are significant: give their names and coordinates in the Annotathon 'domains' field. Take care not to repeat essentially the same domain represented under multiple accession numbers in distinct databases (it is common for domains to be present in all three of the PROSITE, INTERPRO and Pfam databases).

Please refer to the Frequently Asked Questions for further details on running BLAST and most importantly on identifying conserved domains.

BLAST homolog search

Use BLAST to identify putative sequence homologs of your ORF in public sequence databases. You can find online BLAST services at:

Two approaches can be used to identify homologs of your sequence:

You should query the two following protein databases:

Copy & paste in the 'BLAST' field:
  1. the BLAST header (or insert your protocol: program name, database queried and any other modified parameters)
  2. the complete unabridged hit list (the summary sequence list next to the two 'Score' and 'E-Value' columns)
  3. the first sequence alignments (all alignments if there are few hits, otherwise just the first dozen or so)
  4. copy the full BLAST Lineage Report (NCBI BLAST only, "Taxonomy report" link in the BLAST results header) in the Annotathon 'Taxonomy report' field. Only include the first chapter called Lineage Report!

If homologs of your ORF exist, indicate what you consider the E-value threshold that separates true positive homologs from false positive non-homologs.

Use the BLAST results (the lineage report is your friend here) to build two groups of homolog sequences which will serve, after multiple alignment, as a basis for phylogenetic tree reconstruction:

IMPORTANT: Remember that ALL sequences selected for inclusion in the study and external groups must be homologs of your ORF, i.e. their BLAST E-value must be below (better than) the E-value threshold determined above.
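
The selection rule can be expressed as a simple filter on E-values, sketched below. The threshold of 1e-10 and the last (weak) hit are made up for illustration; the other hits reuse values from the example list in this section. The function name select_homologs is hypothetical.

```python
# (accession, short label, E-value); smaller E-values are more significant.
hits = [
    ("ref|ZP_01264926.1|", "Cpelagibacter", 5e-89),
    ("sp|Q8CXD9|", "Aaquariorum", 2e-41),
    ("emb|CAD31286.1|", "Synechocystis", 9e-21),
    ("made_up_accession", "Unrelated", 0.35),  # invented weak hit
]

# Assumed threshold: you must choose and justify your own from the hit list.
EVALUE_THRESHOLD = 1e-10


def select_homologs(hits, threshold):
    """Keep only the hits whose E-value does not exceed the threshold."""
    return [hit for hit in hits if hit[2] <= threshold]


for accession, label, evalue in select_homologs(hits, EVALUE_THRESHOLD):
    print(f"{accession}\t{label}\t{evalue:.0e}")
```

Only sequences that pass this filter are candidates for the study and external groups; which of them you actually retain remains a biological judgment.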

IMPORTANT: Include under the RESULTS ANALYSIS heading of the Taxonomy report the COMPREHENSIVE list of all the sequences you have selected in the study and external groups: for each sequence, provide its accession number, the short name you have chosen for it (see below Multiple alignment of protein sequences), its BLAST E-value and score and its taxonomic group. For instance:

PROTOCOL:

BLASTp versus NR, NCBI default parameters apart from "Number of descriptions=500"
---------------------------------------------------------------------------------------------------
RESULTS ANALYSIS:

[describe here your analysis of the taxonomy report, followed by the selected list of homologs carried over to multiple sequence alignment:]

In group: proteobacteria
ref|ZP_01264926.1|	Cpelagibacter	5e-89	Candidatus Pelagibacter ubique HTCC1002	(a-proteobacteria)
gb|AAI55631.1|		Bcaryophylli	3e-79	Burkholderia caryophylli		(b-proteobacteria)
gb|AAI55631.1|		Gsulfurreducens	7e-59	Geobacter sulfurreducens		(d-proteobacteria)
sp|Q8CXD9|		Aaquariorum	2e-41	Aeromonas aquariorum			(g-proteobacteria)
[...]

Out group: other bacteria (=non proteobacteria: Firmicutes, Cyanobacteria, Thermotogales)
ref|YP_249980.1|	Linnocua	8e-38	Listeria innocua		(firmicutes)
ref|ZP_01002095.1|	Tmaritima	5e-22	Thermotoga maritima		(thermotogales)
emb|CAD31286.1|		Synechocystis	9e-21	Synechocystis sp. PCC 6803	(cyanobacteria)
[...]

---------------------------------------------------------------------------------------------------
RAW RESULTS:

cellular organisms
. Bacteria           [bacteria]
. . Proteobacteria     [proteobacteria]
. . . Sinorhizobium meliloti --------------------------------------------  348 2 hits [a-proteobacteria]       Sarcosine oxidase subunit alpha (Sarcosine oxidase subunit)
. . . Francisella novicida U112 .........................................   80 1 hit  [g-proteobacteria]       Aminomethyltransferase (Glycine cleavage system T protein)
. . . Bdellovibrio bacteriovorus ........................................   80 1 hit  [d-proteobacteria]       Aminomethyltransferase (Glycine cleavage system T protein)
[...]

Please refer to the Frequently Asked Questions for further details on running BLAST and most importantly on the sensitive issue of sequence selection for study and external groups.

Multiple sequence alignment

The aim of the multiple alignment is first to verify that the ORF integrates convincingly into its presumed homolog family: the alignment must hence present clear, well conserved regions. Secondly, the multiple alignment will serve as the basis for the phylogenetic tree inference: the alignment must therefore contain a sufficient number of mutations (informative positions) to allow the reconstruction of the evolutionary history! Be careful not to include sequences that are too partial, as these can dramatically reduce the number of informative positions in the alignment.

It is common to have to reiterate the building of the multiple alignment many times, adding or taking away more or less divergent sequences, in order to finally obtain a satisfactory result.

IMPORTANT: before proceeding to the multiple alignment, insert a legible label directly in the FASTA header of each sequence in order to create useful labels for both the alignment and the phylogenetic tree. Collect FASTA formatted homolog sequences in the Notepad and insert sequence labels as follows:

FASTA sequence as produced by the NCBI (if left untouched, the sequence label will be a cryptic "gi|5581978"):

>gi|55819784|ref|YP_143054.1| serine protease inhibitor [Acanthamoeba mimivirus]
MDYSHKYIKYKKKYLSLRNKLDRENTPVIISRIEDNFSIDDKITQSNNNFTNNVFYNFDTSANIFSPMSL
TFSLALLQLAAGSETDKSLTKFLGYKYSLDDINYLFNIMNSSIMKLSNLLVVNNKYSINQEYRSMLNGIA
VIVQDDFITNKKLISQKVNEFVESETNAMIKNVINDSDIDNKSVFIMVNTIYFKANWKHKFPVDNTTKMR
FHRTQEDVVDMMYQVNSFNYYENKALQLIELPYNDEDYVMGIILPKVYNTDNVDYTINNV

FASTA sequence after insertion of a legible sequence label (the label is formed by the letters directly following the ">" sign up to the first space or up to 10 characters, whichever comes first):

>Amimivirus gi|55819784|ref|YP_143054.1| serine protease inhibitor [Acanthamoeba mimivirus]
MDYSHKYIKYKKKYLSLRNKLDRENTPVIISRIEDNFSIDDKITQSNNNFTNNVFYNFDTSANIFSPMSL
TFSLALLQLAAGSETDKSLTKFLGYKYSLDDINYLFNIMNSSIMKLSNLLVVNNKYSINQEYRSMLNGIA
VIVQDDFITNKKLISQKVNEFVESETNAMIKNVINDSDIDNKSVFIMVNTIYFKANWKHKFPVDNTTKMR
FHRTQEDVVDMMYQVNSFNYYENKALQLIELPYNDEDYVMGIILPKVYNTDNVDYTINNV

Choose an easily recognisable label, such as "Ecoli" for "Escherichia coli". It is crucial that your sequence labels are unique, or the following steps (multiple alignments and tree) will likely fail! If you have two "Ecoli" sequences, use for instance "Ecoli1" and "Ecoli2".
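
Labels can of course be typed by hand in the Notepad. For larger sets, the sketch below shows one possible way to automate the step, assuming NCBI-style headers that end with the organism name in square brackets. The naming scheme (genus initial plus species name, cut to 10 characters, numbered when repeated) and the function names are assumptions, and the resulting labels should still be checked by eye.

```python
import re
from collections import Counter

MAX_LABEL = 10  # only the first 10 characters of a label are kept downstream


def base_label(header):
    """Build a short label from the '[Genus species]' part of a FASTA header."""
    match = re.search(r"\[([^\]]+)\]\s*$", header)
    if not match:
        raise ValueError(f"no [organism] found in header: {header}")
    words = match.group(1).split()
    raw = words[0][0].upper() + (words[1] if len(words) > 1 else words[0][1:])
    return re.sub(r"[^A-Za-z0-9]", "", raw)[:MAX_LABEL]


def relabel_fasta(fasta_text):
    """Insert a unique, legible label right after '>' in every header line."""
    lines = fasta_text.splitlines()
    bases = [base_label(line[1:]) for line in lines if line.startswith(">")]
    totals, seen, used = Counter(bases), Counter(), set()
    labels = iter(bases)
    out = []
    for line in lines:
        if not line.startswith(">"):
            out.append(line)
            continue
        label = next(labels)
        if totals[label] > 1:  # repeated organism: Ecoli1, Ecoli2, ...
            seen[label] += 1
            suffix = str(seen[label])
            label = label[:MAX_LABEL - len(suffix)] + suffix
        if label in used:
            raise ValueError(f"duplicate label {label}: please rename by hand")
        used.add(label)
        out.append(f">{label} {line[1:]}")
    return "\n".join(out)
```

Applied to the header shown above, this scheme yields the label "Amimivirus", and two Escherichia coli sequences become "Ecoli1" and "Ecoli2". Headers such as "Synechocystis sp. PCC 6803" give less readable labels and are better renamed manually.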

Build a multiple alignment (including all the in and out group sequences, as well as your ORF, naturally) using an online version of one of the following programs: ClustalW (widely used), MUSCLE (fast and a little more efficient) or T-COFFEE (slower but highly robust method with very useful colored conserved alignment blocks). These methods are available on the web site of:

The limitation in the number of sequences to align is simply due to the computation time of multiple alignment programs, as well as of the subsequent phylogenetic tree reconstruction. Computation time is reasonable up to around thirty to fifty sequences of a few hundred residues.

Copy & paste the "ClustalW" formatted multiple alignment in the 'Multiple Alignement' Annotathon field.

Phylogenetic tree

Use the above multiple alignment to infer a phylogenetic tree using two distinct tree reconstruction approaches. You can use the online phylogeny.fr dedicated service (recommended, includes both the BioNJ and PhyML programs), or the Pasteur Institute Mobyle online portal (includes the BioNJ and Phylip protdist/neighbor programs).

Please refer to the Frequently Asked Questions for further details and screen shots on running phylogenetic analyses.

Copy & paste the textual tree representation in the 'Tree' Annotathon field. Remember to include a protocol line in the 'Tree' field that includes the program name and run parameters (e.g. 'Phylip / Protdist+neighbor / Randomized input - Random number seed = 11 / rooted on: Coccidioides immitis (ascomycetes)').

Add after each tree leaf label the taxonomic group in brackets, e.g. (alpha-proteobacteria). Your textual tree representation should look like this - notice the (taxonomic group) labels added:

PROTOCOL:

a) Phylogeny.fr / BioNJ method / out group: Coccidioides immitis (ascomycetes)
b) Phylogeny.fr / PhyML method / no bootstrap / default substitution model / out group: Coccidioides immitis (ascomycetes)
---------------------------------------------------------------------------------------------------
RESULTS ANALYSIS:

[for each tree produced, explain:
-is the tree coherent with the reference phylogeny of species?
-is the tree coherent with the tree produced by the alternate method?
-to which taxonomic group does the metagenomic sequence appear to belong?]

---------------------------------------------------------------------------------------------------
RAW RESULTS:

a) BioNJ
          +---------------------Roseovarius                                 (alpha-proteobacteria)
          !  
          !     +----------------------------------aproteobac               (alpha-proteobacteria)
  +------10     !  
  !       !     !  +-------------------------------Bparapertu               (beta-proteobacteria)
  !       !     !  !  
  !       +----13  !                       +----------Jannaschia            (alpha-proteobacteria)
  !             !  !                 +-----5 
  !             !  !        +--------8     +-----------Ogranulos            (alpha-proteobacteria) 
  !             !  !        !        ! 
  !             !  !        !        +---------------------GOS_OT2311
  !             +-15        !  
  !                !        !                +-------Rhodobact              (alpha-proteobacteria)
  !                !  +----12        +-------6 
  !                !  !     !  +-----9       +----Roseobacter               (alpha-proteobacteria)
  !                !  !     !  !     ! 
  !                !  !     !  !     +--------------Obatsensis              (alpha-proteobacteria)
  !                !  !     +-11  
  !                +-14        !                      +Paerugin1            (gamma-proteobacteria)
  !                   !        !        +-------------1 
  !                   !        +--------7             +Paerugin2            (gamma-proteobacteria)
  !                   !                 ! 
  !                   !                 +------------Bcenocepacia           (beta-proteobacteria)
  !                   !  
  !                   +----------------------------------------Oceanobacter (gamma-proteobacteria)
  ! 
  !            +--Aspergillus terreus                                       (ascomycetes)
  !      +-----2 
  4------3     +---Aspergillus niger                                        (ascomycetes)
  !      ! 
  !      +-----------Aspergillus oryzae                                     (ascomycetes)
  ! 
  +-------------Coccidioides immitis                                        (ascomycetes) 

b) PhyML
[...]

Taxonomy

After you have analysed the phylogenetic tree produced, specify the most likely taxonomic group (e.g. "Alphaproteobacteria") to which the organism carrying your DNA fragment belongs. To specify this group in the 'Taxonomy' Annotathon field you have two options: fill in either the "Scientific Name" box or the "NCBI numerical identifier" box. Save your annotations and make sure that the box you left blank has been correctly and automatically populated; for instance, if you chose to indicate "Alphaproteobacteria" in the "Scientific Name" box, once saved the code "28211" should appear in the "NCBI numerical identifier" box.

Note that the "NCBI numerical identifier" box has precedence over the "Scientific Name" box, so if you wish to change the taxonomic classification of your sequence you must delete the numeric code in order to enter a new value in the "Scientific Name" box.

Once the taxonomic group is correctly specified, the full lineage should appear:


Rhodobacterales
Rank: order - Genetic Code: Bacterial and Plant Plastid - NCBI Identifier: 204455
Kingdom: Bacteria - Phylum: Proteobacteria - Class: Alphaproteobacteria - Order: Rhodobacterales
Bacteria; Proteobacteria; Alphaproteobacteria; Rhodobacterales; 

IMPORTANT: unless your DNA sequence is 100% identical to an existing GENBANK entry, you should probably not specify a precise species! Since the precise taxonomic definition of the organism carrying the metagenomic DNA fragment is impossible without further evidence, specify as the likely taxonomic group the node immediately above your sequence in the phylogenetic tree.

Biological process & molecular function

When your ORF's homologs have known functions, or if the ORF presents known conserved domains, select from the available "Biological Process" & "Molecular Function" lists the terms that most specifically describe your functional hypotheses for the ORF. These terms are a subset of the comprehensive and hierarchical "Gene Ontology", and are most often referred to as GO annotations. GO annotations are frequently assigned to records in well annotated databanks such as SWISSPROT or INTERPRO; use the GO terms associated with your ORF's closest homologs or conserved domains to help you assign the most appropriate terms.

Gene symbol

In the event that your ORF has highly convincing homology with a family of well characterized proteins of known function, whose gene symbol nomenclature appears uniform and stable, you can propose in the Gene symbol field a putative gene symbol for your ORF. If the homologs have no gene symbol, or if their symbols vary to a large extent, do not invent a new symbol, just leave this field empty!

For gene symbol examples, check out those already attributed to metagenome fragments during the Annotathon on Metagenes.

Conclusion

This field is central to your evaluation: write up your interpretations and hypotheses based on the observations you have made in the preceding "RESULTS ANALYSIS" sections. Imagine you are trying to convince a very sceptical colleague: use rigorous argumentation, cite precise evidence and numerical values whenever possible, highlight important findings, and cross-check information from independent sources. Remember that in silico analyses generally do not constitute final proof, only suggestions. Terms such as "putative", "suggests" or "probably" can show understanding of the limitations of computational biology results.

Make sure you have at least covered:

Some common pitfalls to avoid at all cost:

Concentrate on producing a scientific, structured, synthetic and rigorous argumentation that will hold up to peer scrutiny!

Evaluations

Due to lack of manpower, we are no longer able to offer evaluations of annotations outside of specific university teams!

Annotation evaluation check list

To help you anticipate potential annotation pitfalls, here is a (non-exhaustive) list of the most common criticisms made about annotations submitted for evaluation:
Analysis Category Criticism
ORF analysis An ORF found with "any codon" as initiation codon and a start position above 3 bp cannot be incomplete at the 5' end (there is a STOP codon just before)!
ORF analysis Discuss if there were any other potentially significant ORFs in the metagenomic sequence
ORF analysis Errors in ORF definition (contains stop codons, larger ORF exists etc.)
ORF analysis Please analyze the ORF results (number of putative ORFs? 5'/3' incomplete? which ORF did you select?)
ORF analysis Unlikely to be non-coding considering the ORF size?
ORF results Incomplete results (missing strand or phases)
ORF results Missing protocol (strand, initiation codons, genetic code, min ORF size...)
blast analysis Incomplete description of BLAST results (nb of hits, E-value distribution, location of HSPs along query...)
blast analysis Incorrect analysis and interpretation of the BLAST results
blast analysis List all protocols used under PROTOCOL (cf Rule Book)
blast analysis No analysis of functional information derived from homologues detected by BLAST
blast analysis You are confusing "similarity" with "homology"!
blast results Incorrect presentation of results (incomplete sequence list, too few or too many alignments, copy&paste error...)
blast results Missing protocol (BLAST type, database)
blast results Some BLAST's are missing (SP/NR, BLASTx, modified parameters ...)
blast results Too many pairwise alignments!
blast taxonomy Discuss your choice of Study Group
blast taxonomy Incorrect description of the BLAST taxonomy Lineage Report
blast taxonomy Incorrect selection of external group
blast taxonomy Incorrect selection of homologues (non represented groups, and/or over-represented groups...)
blast taxonomy Please fully describe the set of sequences carried over to the multiple alignment, with their BLAST scores and identifiers (cf Rule Book)
blast taxonomy To correctly identify an external group, you need to resubmit a BLAST asking for more than the first 100 hits (250, 500 or more)
conclusion domains Incorrect comparison of functional info found through BLAST and INTERPRO
conclusion hypotheses Justify your selection of Gene Ontology terms!
conclusion hypotheses No functional hypothesis
miscellaneous analysis Plagiarism
miscellaneous miscellaneous
domains analysis A number of domains listed under RAW RESULTS are not discussed at all?
domains analysis Discuss the predicted conserved domains E-values!
domains analysis Incorrect conserved domains identification (non annotated true positives, redundant domains, false positive domains selected...)
domains analysis Incorrect functional interpretation from conserved domains identified
domains analysis Missing conserved domain analysis
domains domains
domains results Incorrect presentation of domains
domains results Missing raw Interpro textual output (RAW OUTPUT button in Interpro results page)
molecular weight results Not calculated or not applicable (if partial ORF)
multiple aln ORF Error in the interpretation of ORF start position (too long or too short in 5')
multiple aln analysis Are all sequences in the multiple alignment of similar length?
multiple aln analysis Incorrect analysis of Multiple Alignment (conserved/divergent regions, coherence with INTERPRO conserved domains...)
multiple aln analysis You have not discussed your ORF's start position compared to its homologs
multiple aln results Alignment contains non-homologous sequences
multiple aln results Incorrect multiple alignment presentation (CLUSTAL format, legible sequence names...)
multiple aln results Multiple alignment contains some sequences which are too partial (incomplete at one or both ends)
multiple aln results Several identical sequences
multiple aln results Where is your ORF?
ontologies analysis Incorrect Biological Process
ontologies analysis Incorrect Molecular Function
ontologies analysis No selection of Gene Ontology terms
phylogeny analysis Incorrect specification of Duplication/Speciation events on tree nodes
phylogeny analysis Incorrect tree interpretation (HGT missed, ORF assigned to wrong group etc...)
phylogeny analysis Missing discussion on tree topology? Congruence if more than one tree?
phylogeny analysis Missing most likely taxonomic classification of organism carrying ORF
phylogeny results Add the taxonomic groups after the leaf names on the tree, in the form [alpha-proteobacteria]
phylogeny results Incorrect presentation (leaves not reformatted in Genus species format, e.g. 'Ecolix'...)
phylogeny results Missing alternative tree reconstruction method
phylogeny results Missing protocol (method type, ext group used...)
phylogeny taxonomy Select a most likely taxonomic group (Taxonomy field)
writing Please respect the recommended presentation for RESULTS fields (cf Rule Book)
writing Conclusion should be better structured
writing Conclusion should be more concise
writing Insufficient attention to spelling
writing Lacks rigor. Cite evidence in support of your hypotheses! Refer to specific numeric values!
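
The ORF items in this check list about 5'/3' incompleteness can be illustrated with a small sketch. It assumes the standard genetic code stop codons (TAA, TAG, TGA) and scans a single forward reading frame for stretches with no stop codon, which is what an ORF search with "any codon" as initiation codon reports. The function and field names are hypothetical and are not those of any Annotathon tool.

```python
from typing import Dict, List

# Assumption: standard genetic code; other codes use different stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_to_stop_orfs(seq: str, frame: int = 0) -> List[Dict]:
    """List the open stretches of one forward reading frame (0, 1 or 2).

    Positions are 1-based and inclusive; the stop codon is not included.
    'upstream_stop' is False when the stretch starts at the fragment edge,
    'downstream_stop' is False when it runs off the end of the fragment.
    """
    seq = seq.upper()
    # Index just past the last complete codon of this frame.
    frame_end = frame + ((len(seq) - frame) // 3) * 3
    orfs = []
    start = frame
    for i in range(frame, frame_end, 3):
        if seq[i:i + 3] in STOP_CODONS:
            if i > start:  # skip empty stretches between adjacent stops
                orfs.append({"start": start + 1, "end": i,
                             "upstream_stop": start > frame,
                             "downstream_stop": True})
            start = i + 3
    if frame_end > start:  # trailing stretch with no stop before the end
        orfs.append({"start": start + 1, "end": frame_end,
                     "upstream_stop": start > frame,
                     "downstream_stop": False})
    return orfs


# Made-up fragment, for illustration only.
fragment = "GCTGCTTAAATGAAATGAGCTGC"
for orf in stop_to_stop_orfs(fragment, frame=0):
    print(orf)
```

In this sketch, any stretch whose start position is above 3 necessarily has an in-frame stop codon just before it, so it cannot extend further upstream; only stretches starting at positions 1 to 3 can be incomplete at the 5' end, and only stretches without a downstream stop can be incomplete at the 3' end.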