From September 2009 to December 2013, the oceanographic ship Tara sampled plankton across all of the planet's oceans (right, bottom). Ocean plankton produce half the oxygen we breathe. If terrestrial forests fill up our first lung, oceans fill up the second. The ocean's microscopic planktonic life is also an important carbon sink, with far-reaching implications for climate change scenarios. Hence, the Tara Oceans pan-oceanic international expedition aims to define the current state of plankton biodiversity, from tropical coral reefs to Antarctica, from viruses to fish larvae.
Metagenomic sequencing of the plankton DNA contained in the thousands of Tara Oceans samples has begun at the GENOSCOPE (France). These short pieces of DNA contain precious information on the identity of the plankton species which populate seawater, as well as hints to the metabolic functions at work in these tiny ocean drifters. At this molecular scale, bioinformatics is the essential tool to shed light on the information locked up in the DNA sequences!
Your mission is to sift through this precious, gigantic heap of DNA sequence data in order to identify the microbial origin of these sequences (archaeal, bacterial, viral?), to predict whether they code for proteins and, if so, what the function of these protein-coding genes might be.
Your team will collectively annotate distinct DNA fragments randomly distributed from the available public sequence pool. Each registered annotator will be responsible for annotating a specific set of sequence fragments. For each fragment, annotators will produce a full report specifying if it is likely to be coding, the putative function of the protein product, as well as the most likely taxonomic classification of the host organism.
By practicing the core bioinformatics tools on a number of distinct sequences, each one a live piece of experimental data, you will become familiar with the running and interpretation of fundamental sequence analysis. Experience shows that after two or three distinct analyses, the focus shifts from bioinformatics to biological issues!
All tools are available online, so all you need to get started is a web browser.
User Guide
You can access the Annotathon from any computer connected to the Internet, irrespective of operating system (Mac, Windows or Linux).
We recommend that you simultaneously open the following pages (in different windows or tabs) in your browser:
If you don't have an Annotathon account (i.e. it is your first session), click on the "New account" tab in the permanent menu at the top of the Annotathon pages. Follow the instructions to open a new account; make sure you select the appropriate affiliation or you might end up being supervised inside another team, possibly using a different language. If you don't know your Team code, please ask your instructor for it. You are required to enter at least one first name/last name pair, and one email address in order to receive Annotathon-specific notifications. Your email address is secure and will under no circumstances be made public or passed on to any third party. Only low-traffic messages specific to your course duration will be mailed to this address; no further messages will be sent after the course is completed.
Finally, clicking "Open the account" should be followed by the message "Account 'XYZ' has been created". Use your 'username' and 'password' and click "Connect" in the form at the top of the page to open an Annotathon session. You will be reminded that your email address is not validated until you have followed the special link included in an email automatically sent to you at account creation.
The home page (also available by clicking the "Home" tab) gives an overview of the team's annotation progress. Note that once connected you will be able to locate your position in the team at the bottom of the stats page (after your annotations start being evaluated).
You can only add new sequence fragments to your cart when it is empty, or when you have already annotated all available sequences. Add new sequence fragments at your discretion (or until you reach the upper limit set by your supervisor).
Click the icon opposite the sequence you wish to view in your cart. The initial annotation is minimal: apart from the sequence itself and its geographic origin, each sequence fragment only has a unique Annotathon accession number. The remaining annotation is your responsibility!
Click the icon opposite the fragment you wish to edit. After modifying any annotations, remember to save your work on the Annotathon server by clicking the "Save your annotations" button! Should you leave this editing form without saving, all modifications since the last save will be lost. Since you can save your work as often as you wish, it is recommended that you do so regularly.
When your annotations are completed, click the icon opposite the fragment. Its status will then shift from 'Annotation 1' to 'Evaluation 1' and it will be closed for editing until your work has been evaluated. After this initial evaluation, the fragment status shifts from 'Evaluation 1' to 'Annotation 2'; you are then invited to update your initial annotations following the evaluator's comments. When your second annotation pass is completed, click the icon to submit your annotation for the second and final round of evaluations.
The "Forum" tab opens access to the Annotathon internal forum (an icon signals that a new unread message has been posted to the forum). Click on a message subject to see its content. If you wish to reply to a message, use the form immediately under it and click "Post message". IMPORTANT! Only use this method to DIRECTLY REPLY TO THE CONTENT OF A SPECIFIC MESSAGE!
If you wish to open a new discussion thread, you MUST use the special new thread form available at the top of each of your annotation records (icon in your cart)! You can then select the appropriate forum for your new thread (e.g. Searching for homologues: BLAST). A link to your specific sequence fragment and associated annotations will automatically be included with your post. Note that the messages you post are also emailed directly to your supervisors and fellow annotators.
Although messages are often answered by supervisors, trainees who wish to help by answering fellow trainees' questions are strongly encouraged to do so. Constructive replies will be taken into account in trainee evaluations.
Announcements from your supervisors will be displayed at the top of Annotathon pages. Once read, tick the 'Read' box to transfer the message to your archive (available at the bottom of the Forum tab).
Sequence Annotations
The sequence annotation editing form has three types of fields:
The Annotathon editing form is hence both an electronic "lab book" (Protocols & Raw Results) and an "annotation report" (Results Analysis & Ontologies).
IMPORTANT: Some free text fields are initially filled with a standard template looking like this:
Under the "PROTOCOL" heading, specify the minimum information necessary to reproduce the exact same results. Usually, this would entail giving the name of the tool used, the URL of the web page, together with its run parameters. For instance, for the ORF finding results field, the protocol line could read:
SMS ORFinder, http://annotathon.org/sms2/orf_find.html :
forward strand, frames 1, 2 & 3, min 60 AA, 'any codon' initiation, 'universal' genetic code
Copy & paste the raw results of the analysis, in extenso, under the "RAW RESULTS" heading. If you have carried out more than one analysis (for instance two SMS ORFinder runs, one on the forward and one on the reverse strand), then reference the two analyses using an index exactly as follows:
a) SMS ORFinder, http://annotathon.org/sms2/orf_find.html :
forward strand, frames 1, 2 & 3, min 60 AA, 'any codon' initiation, 'universal' genetic code
b) SMS ORFinder, http://annotathon.org/sms2/orf_find.html :
reverse strand, frames 1, 2 & 3, min 60 AA, 'any codon' initiation, 'universal' genetic code
[enter your observations here]
[enter your observations here]
a) forward strand
>ORF number 1 in reading frame 1 on the direct strand extends from base 511 to base 744.
>Translation of ORF number 1 in reading frame 1 on the direct strand.
b) reverse strand
>ORF number 1 in reading frame 1 on the reverse strand extends from base 517 to base 855.
>Translation of ORF number 1 in reading frame 1 on the reverse strand.
Finally, use the "RESULTS ANALYSIS" sections to present your observations and interpretations of the raw results. Results analysis, a pivotal part of scientific discourse, answers the question "what did we see that is notable when we carried out the experiment described in the protocol?". These rigorous factual observations, usually accompanied by precise numerical values (percentages, E-values, number of hits, number of conserved amino acids etc.), are offered without far-reaching discussion. Keep the main discussion and interpretations for the "Conclusion" field.
Note: the last "Notepad" field at the bottom of the sequence editing form is available to store any data that isn't accommodated by other specific annotation fields. Use the Notepad to store data that can be useful for subsequent re-analyses (e.g. store your set of FASTA formatted homolog sequences here). The Notepad is your private space and is not consulted during evaluation.
Brief contextual help is available for each annotation field of the editing form by clicking the icons. The information expected for each annotation field is described below.
Remember that a Frequently Asked Questions page is available for in-depth explanations, tutorials and screenshots of each of the bioinformatics analyses needed to perform the sequence annotations.
Always keep in mind during your analyses the three main focal points of your annotation, which consist in proposing:
No single bioinformatics tool can by itself answer any of these questions; answers will be built by cross-checking and synthesizing all available results.
The basic rule set below can be overridden by more specific or alternative rules given to you by your instructors. If in doubt, always consult your instructors.
The first investigation for each DNA fragment will involve the identification of putative Open Reading Frames (ORFs). There are many tools to tackle this issue, including the following:
For this study, you will only consider ORFs that meet the following criteria:
Copy & paste the raw, unedited ORF finding output in the 'ORF finding' field of the Annotathon editing form. Remember to conduct the analysis in all SIX frames, and to include a full PROTOCOL line for each raw result. If you have carried out more than one analysis, as for ORFinder run on both strands, or for BLASTp against SWISSPROT and against NR, then you must label each distinct Protocol line with a letter matching a corresponding header in the raw results section (as shown above for the ORFinder example).
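To make the six-frame requirement concrete, here is a minimal Python sketch of what an 'any codon' initiation search amounts to: every stretch free of STOP codons is reported when it reaches the minimum length, in three frames on each strand. This is not the SMS ORF Finder implementation; the function name `open_stretches`, the tuple layout and the choice to report end positions without the STOP codon are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # STOP codons of the universal genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def open_stretches(dna, min_codons=60):
    """Yield (strand, frame, start, end, ends_with_stop) for every stretch
    of at least min_codons codons that contains no STOP codon.

    start and end are 1-based and counted on the strand carrying the stretch;
    end is the last base before the STOP codon, or the last full codon of
    the fragment when no STOP is reached.
    """
    strands = (("direct", dna.upper()), ("reverse", reverse_complement(dna)))
    for strand, seq in strands:
        for offset in range(3):
            run_start = offset  # 0-based index where the current stretch begins
            for i in range(offset, len(seq) - 2, 3):
                if seq[i:i + 3] in STOP_CODONS:
                    if (i - run_start) // 3 >= min_codons:
                        yield (strand, offset + 1, run_start + 1, i, True)
                    run_start = i + 3
            # Stretch still open when the fragment ends (incomplete at its 3' end)
            last = offset + (len(seq) - offset) // 3 * 3
            if (last - run_start) // 3 >= min_codons:
                yield (strand, offset + 1, run_start + 1, last, False)


# Hypothetical toy fragment; real runs would keep the default of 60 codons.
toy = "ATG" + "GCA" * 5 + "TAA" + "CC"
for hit in open_stretches(toy, min_codons=5):
    print(hit)
```

Results from a sketch like this are only a sanity check; the raw output pasted in the form should still come from the tool named in your PROTOCOL line.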
Please synthesize all the ORFinder results in a synoptic table as follows (use the field text editor icons to add a table):
Important note: Tables should be numbered incrementally (e.g. Table 1, Table 2, etc.) across all annotation items, and should have titles.
You should also draw a figure that summarizes the position of each ORF in the DNA fragment (an example is given below, but it is not linked to the previous table).
If your sequence contains several ORFs, arbitrarily select either the longest one or the one obtaining the highest number of BLAST hits (see below) for all subsequent analyses.
You should also classify each putative ORF into one of the following categories:
In order to classify your ORFs, the following elements should be considered:
If homologs clearly exist, you can conclude that the sequence is coding DNA whatever the ORF size. Otherwise, the true or false positive nature of the ORF will essentially depend on the ORF size. There is no real hard threshold, but it is very unlikely that a 150+ amino acid ORF is a false positive.
-If the DNA fragment doesn't appear to contain any ORFs and is too short to be convincing, tick the 'non-coding' box of the 'Status' field. The annotation of this fragment will be limited to populating the ORF finding and BLAST fields, as well as the conclusion of course! However, before you conclude that an ORF is non-coding, we recommend that you first look for homologs in the "environmental databases". Ask your instructor how to proceed, since this is quite exceptional. If you still find no homologs in "environmental databases", you may save your annotations and add a new (hopefully less obscure) sequence to your cart.
-If the sequence appears to carry a true coding ORF (either a very long ORF, or one with many homologs, or both!), tick the 'coding' box of the 'Status' field. Indicate in the appropriate fields the start and end positions of the ORF, as well as the strand. Note that if the ORF is complete at the 3' end (i.e. finishes with a STOP), you need to subtract the 3 STOP codon nucleotides from the end position. Validate this ORF by clicking "Save annotations".
If the ORF satisfies the rules above, the translation will automatically be displayed; otherwise a red error message will help you pinpoint the problem. The ORF can be incomplete, in which case simple green informational messages to this effect are displayed. You should correct the ORF strand, start and end positions until you no longer observe any error messages after saving your annotations!
Note that the absence of homologs in public protein databases does not imply that a sequence is non-coding; it merely means that there is currently no known homolog. Other, so-called ab initio, approaches exist to identify true positive coding ORFs (for instance based on statistical codon usage biases), but these methods usually require organism-specific training sets of known genes or large chunks of genome sequence. They are hence difficult to apply to metagenome exploration where, by definition, the organisms from which the sequences derive are unknown.
Important note on the ORF coordinate system: The ORF start & end positions must be given on the strand which carries the ORF! The ORF positions given by the SMS ORF finder can be entered as is, whereas ORF locations on the reverse strand provided by the NCBI ORF finder need to be converted (fragment length - position + 1).
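The conversion rule above (fragment length - position + 1) can be sketched in a few lines of Python. The function name `to_reverse_strand` and the 900-base fragment are hypothetical, and the sketch assumes that the positions to convert are 1-based and counted from the 5' end of the direct strand.

```python
def to_reverse_strand(position, fragment_length):
    """Convert a 1-based direct-strand position into the matching
    1-based position counted on the reverse strand."""
    if not 1 <= position <= fragment_length:
        raise ValueError("position must lie within the fragment")
    return fragment_length - position + 1


# Hypothetical example: a 900-base fragment with a reverse-strand ORF
# reported as 855..517 in direct-strand coordinates.
start = to_reverse_strand(855, 900)
end = to_reverse_strand(517, 900)
print(start, end)  # 46 384
```

Applying the function twice returns the original position, which is a quick way to check a conversion done by hand.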
Please refer to the Frequently Asked Questions for further details on ORF finding, in particular on the subtle issue of determining the exact ORF start position.
Here is an example of how to structure the analysis of the results:
1- ORFs Classification
1.1- Justify KNOWN ORFs (if present)
1.2- Justify NOVEL ORFs (if present)
1.3- Justify ORFan ORFs (if present)
1.4- Justify False positive ORFs (if present)
-> Give detailed justification of each classification
-> Refer to Table 1!
-> Cite your sources of information with web links (e.g. "homologs are epimerase (cf. SWISSPROT entry MJ0211)")
2- ORF selected for annotation in the rest of the report
-> Justify your selection!
-> Could other ORFs also be subjected to bioinformatics analysis?
3- Extremities of the selected ORF
-> Discuss the start and end positions. If possible, estimate the number of amino acids missing from a complete, full-length protein (refer to the multiple alignment section).
Only if the ORF is complete at both ends, compute its theoretical polypeptide molecular weight using for instance:
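As a cross-check on whichever tool you use, the theoretical average molecular weight can be approximated by summing the average residue masses and adding one water molecule for the free N- and C-termini. The sketch below is an illustration only: it assumes a sequence made of the 20 standard amino acids, uses commonly tabulated average residue masses, and the function name `molecular_weight` is hypothetical.

```python
# Average residue masses in daltons (amino acid minus one water molecule)
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.015  # one water molecule for the free termini


def molecular_weight(protein):
    """Return the approximate average molecular weight (Da) of a polypeptide."""
    protein = protein.upper()
    unknown = sorted(set(protein) - set(RESIDUE_MASS))
    if unknown:
        # Refuse ambiguous residues (X, B, Z) and a trailing '*' STOP symbol
        raise ValueError("unsupported residues: " + ", ".join(unknown))
    return sum(RESIDUE_MASS[aa] for aa in protein) + WATER_MASS


print(round(molecular_weight("GA"), 2))
```

Remove any terminal STOP symbol from the translation before calling the function, and expect small differences with online tools that use slightly different mass tables.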
Find out whether your ORF contains any of the known conserved domains stored in one of the domain databases:
INTERPRO is a good choice since, as a federation of all the other databases, it contains all known domains; INTERPRO analyses can however be on the slow side.
Only submit to the Annotathon domains that you have good reasons to believe are significant, that is to say:
If you are convinced of the likelihood of at most four domains (because of the short length of metagenome sequences, it is very rare to have more than one non-redundant domain), enter their names and coordinates in the Annotathon 'domains' field. Take care not to repeat essentially the same domain represented under multiple accession numbers in distinct databases (it is common for domains to be present in all three of the PROSITE, PRINTS & PFam databases).
In the "Raw results" of the INTERPROscan analysis, copy the tool's results in the following form only ("Export" -> "TSV"):
TO82S_4665010 35c27f 205 SUPERFAMILY SSF52833 79 162 3.44E-7 T 30-09-2014 IPR012336 Thioredoxin-like fold
TO82S_4665010 35c27f 205 Pfam PF14595 Thioredoxin 44 167 6.3E-32 T 30-09-2014
TO82S_4665010 35c27f 205 Gene3D G3DSA:22.214.171.124 18 205 4.4E-36 T 30-09-2014 IPR012336 Thioredoxin-like fold
Please synthesize this difficult to read raw result by producing (yet another) nice table under the "Results Analysis" section, for instance:
Table 2: List of conserved protein domains identified by InterproScan
| Interpro code | Database | start | end | E-value | Original DB description | Interpro description |
| (IPRxxxxxx) | origin | position | position | | (first on raw res. line) | (last on raw res. line)|
| | | | | | | |
| IPR012336 | SUPERFAMILY | 79 | 162 | 3.44E-7 | SSF52833 | Thioredoxin-like fold |
| None | Pfam | 44 | 167 | 6.3E-32 | Thioredoxin | None |
| IPR012336 | Gene3D | 18 | 205 | 4.4E-36 | G3DSA:126.96.36.199 | Thioredoxin-like fold |
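Building such a table by hand is error-prone, so here is a minimal Python sketch that turns TSV lines into table rows. It assumes the export is tab-separated with the usual InterProScan column order (protein, MD5, length, analysis, signature accession, signature description, start, end, score, status, date, InterPro accession, InterPro description) and that missing values are empty or absent fields. The function name `domain_table` and the dictionary keys are hypothetical.

```python
COLUMNS = [
    "protein", "md5", "length", "analysis", "signature_acc", "signature_desc",
    "start", "end", "score", "status", "date", "interpro_acc", "interpro_desc",
]


def domain_table(tsv_text):
    """Return one dictionary per TSV line, ready to be copied into a table."""
    rows = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        fields += [""] * (len(COLUMNS) - len(fields))  # pad missing trailing columns
        record = dict(zip(COLUMNS, fields))
        rows.append({
            "interpro": record["interpro_acc"] or "None",
            "database": record["analysis"],
            "start": int(record["start"]),
            "end": int(record["end"]),
            "evalue": record["score"],  # kept as text, exactly as exported
            "db_description": record["signature_desc"] or record["signature_acc"],
            "interpro_description": record["interpro_desc"] or "None",
        })
    return rows


# Two lines of the raw result shown above, with tabs restored (assumed layout)
example = (
    "TO82S_4665010\t35c27f\t205\tSUPERFAMILY\tSSF52833\t\t79\t162\t3.44E-7\tT\t"
    "30-09-2014\tIPR012336\tThioredoxin-like fold\n"
    "TO82S_4665010\t35c27f\t205\tPfam\tPF14595\tThioredoxin\t44\t167\t6.3E-32\tT\t"
    "30-09-2014\n"
)
for row in domain_table(example):
    print(row)
```

When a signature has no description of its own, the sketch falls back on its accession number, which mirrors the SUPERFAMILY row of Table 2.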
Please refer to the Frequently Asked Questions for further details on running InterproScan and most importantly on identifying conserved domains.
1. Selected domains (if present)
-> Which domain predictions do you select to annotate the ORF? Specify their sizes and E-values! Justify your selection!
-> Refer clearly to Table 2 for your detailed analysis.
2. Rejected domains (if present)
-> Why are some functional domains rejected? (high E-value? No IPR domains?)
3. Biological function
-> Give details of the biological functions associated with your retained domains (enzymatic activities, molecular functions, biological processes, ...)
-> Cross-check your results against the BLAST results (in particular Swissprot)
-> All your sources should be cited (for example web link to Interpro or Pfam entry)
Use BLAST to identify putative sequence homologs of your ORF in public sequence databases. You can find online BLAST services at:
Two approaches can be used to identify homologs of your sequence:
You should query the two following protein databases:
Copy & paste in the 'BLAST' Raw Results field (Important note: a text version of the BLAST results is available via the "Reformat" button on the NCBI website):
If homologs of your ORF exist, indicate what you consider the E-value threshold that separates true positive homologs from false positive non-homologs.
Present under the BLAST "Results Analysis" section a synthetic table looking like this:
| Database | Number of results | Min E-value | Max E-value | E-value threshold |
| NR | 3124 | 5e-61 | 10 | 4e-07 |
| SP | 105 | 3e-05 | 10 | < 3e-05 |
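The figures in this table can be derived mechanically from the list of E-values of a BLAST run. The Python sketch below shows one way to do it; the function name `summarize_evalues`, the dictionary keys and the short list of E-values are hypothetical, and the threshold itself remains a judgment you must justify in your analysis.

```python
def summarize_evalues(evalues, threshold):
    """Summarize the E-values of one BLAST run against one database.

    evalues: E-values as numbers or as text (e.g. "5e-61"), one per hit.
    threshold: E-value at or below which hits are considered true homologs.
    """
    values = [float(e) for e in evalues]
    if not values:
        raise ValueError("no BLAST hits to summarize")
    return {
        "results": len(values),
        "min_evalue": min(values),
        "max_evalue": max(values),
        "threshold": threshold,
        "putative_homologs": sum(v <= threshold for v in values),
    }


# Hypothetical hit list, for illustration only
print(summarize_evalues(["5e-61", "4e-07", "0.002", "10"], threshold=4e-07))
```

The count of hits at or below the threshold is also a convenient number to quote when deciding whether enough homologs exist to continue the analysis.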
With the help of the "Definition List" tool, also include a table which lists all the distinct functions of the homologs detected by BLAST, with their range of E-values. The Definition List is of great help, but not perfect. In some cases, you may need to simplify the list by grouping some definitions on one line, e.g. DNA polymerase B from :
Provide a table of the following form:
You must discuss in the "Results analysis" section whether you think this list of homolog functions is coherent or not (i.e. are they essentially synonyms?), and whether they are coherent with the functions of the conserved domains identified by INTERPRO!
Do not continue the analysis of the fragment if:
- No homologs (or a very small number of homologs in the NR database, i.e. <100)
- Your gene is already present in the NR biological database (nucleotide BLASTn with ID > 95%)
Proposed structure of your analysis section:
1. Overview of the alignments
-Synthetic description of your alignment results (number, known functions, quality of the alignments, ...)
-Give details (E-value range, % identity/similarity range, indels, alignment coverage, ...)
2. Identification of protein homologs
-> Justify E-value thresholds (NR & SP) (Evalue cutoffs, changes in putative homolog functions), refer to tables 3 & 4.
3. Function of homologs from SWISSPROT analysis
-> From SP entries, give details on the functions of the closest homologs and on the specific role of important amino acids involved in catalytic function; cross-check your analysis against the protein domain analysis. In all cases, cite your sources with web links.
Please refer to the Frequently Asked Questions for further details on running BLAST.
Follow the instructions on the TaxReports tool to extract taxonomic information from your list of BLASTp hits.
Copy the full BLAST Lineage Report in the Annotathon 'Taxonomy report' "Raw results" field. Only include the first chapter called Lineage Report:
. . Prochlorales
. . .Prochlorococcaceae
. . . Prochlorococcus
. . . .Prochlorococcus marinus str. MIT 9515........ 315 4e-103 2 hits Bacteria:Cyanobacteria:Prochlorales: phytoene desaturase [Prochlorococcus mari...
. . . .Prochlorococcus marinus str. MIT 9301........ 305 3e-99 2 hits Bacteria:Cyanobacteria:Prochlorales: phytoene desaturase [Prochlorococcus mari...
. . . .Prochlorococcus marinus str. MIT 9215........ 303 8e-99 2 hits Bacteria:Cyanobacteria:Prochlorales: phytoene desaturase [Prochlorococcus mari...
. . . .Prochlorococcus marinus str. AS9601.......... 301 4e-98 2 hits Bacteria:Cyanobacteria:Prochlorales: phytoene desaturase [Prochlorococcus mari...
. . . .Prochlorococcus marinus str. NATL1A.......... 261 2e-82 2 hits Bacteria:Cyanobacteria:Prochlorales: phytoene desaturase [Prochlorococcus mari...
. . . .Prochlorococcus marinus str. MIT 9303........ 249 1e-77 2 hits Bacteria:Cyanobacteria:Prochlorales: phytoene desaturase [Prochlorococcus mari...
. . Synechococcus sp. WH 8109....................... 251 1e-78 1 hit Bacteria:Cyanobacteria:Chroococcales: Carotene 7,8-desaturase [Synechococcus sp. WH ...
. . Synechococcus sp. WH 7803....................... 251 2e-78 3 hits Bacteria:Cyanobacteria:Chroococcales: phytoene dehydrogenase [Synechococcus sp....
. . Synechococcus sp. CB0205........................ 250 3e-78 1 hit Bacteria:Cyanobacteria:Chroococcales: 15-cis-phytoene desaturase [Synechococcus...
. . Synechococcus sp. BL107......................... 250 3e-78 2 hits Bacteria:Cyanobacteria:Chroococcales: 15-cis-phytoene desaturase [Synechococcus...
. . Synechococcus sp. WH 8016....................... 250 4e-78 2 hits Bacteria:Cyanobacteria:Chroococcales: 15-cis-phytoene desaturase [Synechococcus...
. . Synechococcus sp. CC9311........................ 250 4e-78 6 hits Bacteria:Cyanobacteria:Chroococcales: phytoene desaturase [Synechococcus sp. CC931...
. . Synechococcus sp. RS9916........................ 249 1e-77 2 hits Bacteria:Cyanobacteria:Chroococcales: 15-cis-phytoene desaturase [Synechococcus...
. . Synechococcus sp. CB0101........................ 248 2e-77 1 hit Bacteria:Cyanobacteria:Chroococcales: 15-cis-phytoene desaturase [Synechococcus...
. . Synechococcus sp. RCC307........................ 236 2e-72 3 hits Bacteria:Cyanobacteria:Chroococcales: phytoene dehydrogenase [Synechococcus sp....
. . Synechococcus sp. PCC 7002...................... 217 2e-65 3 hits Bacteria:Cyanobacteria:Chroococcales: phytoene dehydrogenase [Synechococcus sp....
. . Cyanobium sp. PCC 7001.......................... 249 7e-78 2 hits Bacteria:Cyanobacteria:Chroococcales: 15-cis-phytoene desaturase [Cyanobium sp....
. . Crocosphaera watsonii........................... 231 1e-70 1 hit Bacteria:Cyanobacteria:Chroococcales: 15-cis-phytoene desaturase [Crocosphaera ...
Under the Taxonomy report "Results Analysis" section, build (with the Taxonomy List tool) a synthetic table of your observations as follows:
Use the BLAST results (the lineage report is your friend here) to build two groups of homolog sequences which will serve, after multiple alignment, as a basis for phylogenetic tree reconstruction:
IMPORTANT: Remember that ALL sequences selected for inclusion in the study and external groups must be homologs of your ORF, i.e. their BLAST E-value must be below the E-value threshold determined above.
IMPORTANT: Include under the RESULTS ANALYSIS heading of the Taxonomy report the COMPREHENSIVE list of all the sequences you have selected in the study and external groups: for each sequence, provide its accession number, the short name you have chosen for it (see below Multiple alignment of protein sequences), its BLAST E-value and score, and its taxonomic group. For this purpose, you are welcome to use the headers of the FASTA files you have obtained from the Tax Report. For instance:
TaxReport tool (http://oceans.embl.de/Annotathon_outils/blast_tax_report2.php) of BLASTp versus NR, default parameters
[ insert here your synoptic table (see above) ]
[ write here your description of the taxonomy report, justify your INGROUP and your OUTGROUP. Do mention the E-value differential between the in- and out-groups! Then list as shown below the sequences you selected to represent your in- and out-groups: ]
>Bac_Cya_Pro_3 [Bacteria Cyanobacteria Prochlorales] E-value=1e-15 Bacteria;Cyanobacteria;Prochlorales;Prochlorococcaceae;Prochlorococcus; gi|488894830|ref|WP_002805954.1| zeta-carotene desaturase [Prochlorococcus marinus]
>Bac_Cya_Chr_2 [Bacteria Cyanobacteria Chroococcales] E-value=7e-78 Bacteria;Cyanobacteria;Chroococcales;Cyanobium; gi|493968054|ref|WP_006911325.1| 15-cis-phytoene desaturase [Cyanobium sp. PCC 7001]
>Bac_Cya_Chr_3 [Bacteria Cyanobacteria Chroococcales] E-value=1e-70 Bacteria;Cyanobacteria;Chroococcales;Crocosphaera; gi|494523610|ref|WP_007313063.1| 15-cis-phytoene desaturase [Crocosphaera watsonii]
>Bac_Cya_Chr_4 [Bacteria Cyanobacteria Chroococcales] E-value=9e-68 Bacteria;Cyanobacteria;Chroococcales;Cyanothece; gi|218438147|ref|YP_002376476.1| phytoene desaturase [Cyanothece sp. PCC 7424]
>Bac_Cya_Chr_5 [Bacteria Cyanobacteria Chroococcales] E-value=1e-64 Bacteria;Cyanobacteria;Chroococcales;Synechocystis; gi|16330439|ref|NP_441167.1| phytoene desaturase [Synechocystis sp. PCC 6803]
>Bac_Cya_Osc_1 [Bacteria Cyanobacteria Oscillatoriales] E-value=3e-72 Bacteria;Cyanobacteria;Oscillatoriales; gi|497454285|ref|WP_009768483.1| phytoene desaturase [Oscillatoriales cyanobacterium JSC-12]
>Bac_Cya_Osc_3 [Bacteria Cyanobacteria Oscillatoriales] E-value=1e-16 Bacteria;Cyanobacteria;Oscillatoriales;Microcoleus; gi|493682519|ref|WP_006632676.1| zeta-carotene desaturase [Microcoleus vaginatus]
>Bac_Cya_Nos_1 [Bacteria Cyanobacteria Nostocales] E-value=1e-70 Bacteria;Cyanobacteria;Nostocales;Nostocaceae;Trichormus; gi|298491654|ref|YP_003721831.1| phytoene desaturase ['Nostoc azollae' 0708]
>Bac_Cya_Nos_2 [Bacteria Cyanobacteria Nostocales] E-value=5e-14 Bacteria;Cyanobacteria;Nostocales;Nostocaceae;Trichormus; gi|298492908|ref|YP_003723085.1| carotene 7,8-desaturase ['Nostoc azollae' 0708]
>Bac_Cya_Nos_3 [Bacteria Cyanobacteria Nostocales] E-value=2e-70 Bacteria;Cyanobacteria;Nostocales;Nostocaceae;Anabaena; gi|414079384|ref|YP_007000808.1| phytoene desaturase [Anabaena sp. 90]
>Bac_Cya_Sti_1 [Bacteria Cyanobacteria Stigonematales] E-value=2e-68 Bacteria;Cyanobacteria;Stigonematales;Fischerella; gi|497072507|ref|WP_009458406.1| 15-cis-phytoene desaturase [Fischerella]
Out-group: other bacteria which are not Cyanobacteria (Proteobacteria, Chloroflexi, Chlorobi, Acidobacteria, ....)
>Bac_Chl_Chl_1 [Bacteria Chloroflexi Chloroflexales] E-value=3e-32 Bacteria;Chloroflexi;Chloroflexales;Chloroflexaceae;Chloroflexus; gi|163847906|ref|YP_001635950.1| carotene 7,8-desaturase [Chloroflexus aurantiacus J-10-fl]
>Bac_Chl_Chl_2 [Bacteria Chlorobi Chlorobia] E-value=2e-30 Bacteria;Chlorobi;Chlorobia;Chlorobiales;Chlorobiaceae;Chlorobaculum; gi|193212415|ref|YP_001998368.1| carotene 7,8-desaturase [Chlorobaculum parvum NCIB 8327]
>Bac_Aci_Can_1 [Bacteria Acidobacteria Candidatus Chloracidobacterium] E-value=2e-27 Bacteria;Acidobacteria;Candidatus Chloracidobacterium; gi|347753771|ref|YP_004861335.1| hypothetical protein [Candidatus Chloracidobacterium thermophilum B]
>Bac_Fir_Bac_1 [Bacteria Firmicutes Bacillales] E-value=2e-14 Bacteria;Firmicutes;Bacillales;Bacillaceae;Bacillus; gi|407961641|dbj|BAM54881.1| zeta-carotene desaturase [Bacillus subtilis BEST7613]
>Bac_Pla_Pla_1 [Bacteria Planctomycetes Planctomycetacia] E-value=2e-11 Bacteria;Planctomycetes;Planctomycetacia;Planctomycetales;Planctomycetaceae;Singulisphaera; gi|430745940|ref|YP_007205069.1|
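Since every selected sequence must fall below your E-value threshold, it can help to check the list automatically. The Python sketch below parses headers written in the form shown above (label, bracketed taxonomic group, `E-value=` field) and reports the labels that exceed a threshold. The regular expression only covers that header layout, and the function names `parse_header` and `labels_above_threshold`, as well as the threshold used in the example, are hypothetical.

```python
import re

# >label [taxonomic group] E-value=number ...
HEADER = re.compile(r"^>(\S+) \[([^\]]*)\] E-value=(\S+)")


def parse_header(line):
    """Extract the label, taxonomic group and E-value from one header line."""
    match = HEADER.match(line)
    if match is None:
        raise ValueError("unexpected header format: " + line[:40])
    label, group, evalue = match.groups()
    return {"label": label, "group": group, "evalue": float(evalue)}


def labels_above_threshold(headers, threshold):
    """Return the labels of sequences whose E-value exceeds the threshold."""
    parsed = [parse_header(line) for line in headers]
    return [p["label"] for p in parsed if p["evalue"] > threshold]


headers = [
    ">Bac_Cya_Pro_3 [Bacteria Cyanobacteria Prochlorales] E-value=1e-15 "
    "Bacteria;Cyanobacteria;Prochlorales;Prochlorococcaceae;Prochlorococcus; "
    "gi|488894830|ref|WP_002805954.1| zeta-carotene desaturase [Prochlorococcus marinus]",
    ">Bac_Chl_Chl_1 [Bacteria Chloroflexi Chloroflexales] E-value=3e-32 "
    "Bacteria;Chloroflexi;Chloroflexales;Chloroflexaceae;Chloroflexus; "
    "gi|163847906|ref|YP_001635950.1| carotene 7,8-desaturase [Chloroflexus aurantiacus J-10-fl]",
]
# 1e-20 is an arbitrary threshold chosen for the example
print(labels_above_threshold(headers, threshold=1e-20))
```

Any label returned by the second function points to a sequence that should be removed from the in- or out-group, or to a threshold that needs to be reconsidered.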
The aim of the multiple alignment is first to verify that the ORF integrates convincingly into its presumed homolog family: the alignment must hence present clear, well-conserved regions. Secondly, the multiple alignment will serve as the basis for the phylogenetic tree inference: the alignment must therefore contain a sufficient number of mutations (informative positions) to allow the reconstruction of the evolutionary history! Be careful not to include sequences that are too partial, as these can dramatically reduce the number of informative positions in the alignment.
It is common to have to reiterate the building of the multiple alignment many times, adding or taking away more or less divergent sequences, in order to finally obtain a satisfactory result.
IMPORTANT: before proceeding to the multiple alignment, make sure legible labels are present in the sequence FASTA format in order to create useful labels both for alignment and phylogenetic tree. If you have obtained your sequence FASTA from the TaxReports tool, the sequences should already have legible labels (in red, just after the ">" sign and before the first space). It is crucial that your sequence labels are unique, or the following steps (multiple alignments and tree) will likely fail!
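Label uniqueness is easy to verify before launching the alignment. The Python sketch below treats the text between ">" and the first space as the label, as described above, and lists every label that occurs more than once. The function name `duplicate_labels` and the toy sequences are hypothetical.

```python
from collections import Counter


def duplicate_labels(fasta_text):
    """Return the sorted list of sequence labels that appear more than once."""
    labels = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            labels.append(parts[0] if parts else "")  # empty label if nothing follows ">"
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n > 1)


# Toy FASTA with made-up sequences, for illustration only
fasta = """>Bac_Cya_Chr_2 first toy entry
MKTAYIAK
>Bac_Cya_Chr_3 second toy entry
MKSAYLAK
>Bac_Cya_Chr_2 label reused by mistake
MRTAYIGK
"""
print(duplicate_labels(fasta))  # ['Bac_Cya_Chr_2']
```

An empty list means every label is unique and the FASTA file is safe to submit to the alignment and tree tools.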
Note that the sequence label "AEMMMM1" is made up of the first letters of the first 5 classification levels (Archaea Euryarchaeota Methanomicrobia Methanosarcinales Methanosarcinaceae). It can sometimes be useful to distinguish the in- and out-groups by adding "ex" to the out-group sequence labels, as follows:
Build a multiple alignment (including all the in- and out-group sequences, as well as your ORF, naturally) using an online version of one of the following programs: ClustalW (widely used), MUSCLE (fast and a little more efficient) or T-COFFEE (a slower but highly robust method with very useful colored conserved alignment blocks). These methods are available on the web site of:
The limitation in the number of sequences to align is simply due to the computation time of multiple alignment programs, as well as of the subsequent phylogenetic tree reconstruction. Computation time is reasonable up to around thirty to fifty sequences of a few hundred residues.
Copy & paste the "ClustalW" formatted multiple alignment in the 'Multiple Alignment' Annotathon field.
Also copy & paste the full multiple alignment obtained after curation (GBlocks output) in the "Raw Results" section. Please make sure you include the end (footer) of the GBlocks output as this contains crucial estimates of the number of informative positions in your alignment.
1. Quality of the multiple alignment
-> Can you confirm that the sequences are really homologs? Similar lengths? How many identical positions? How many positions with conservative substitutions? Number of indels? Does the conservation of sequences within the alignment reflect the subgroups (in- and out-groups)?
-> After curation with GBLOCKS, what is the number of conserved homolog positions (informative sites) for phylogenetic reconstruction? Is it enough?
2. Identification of conserved blocks
-> You can annotate well conserved blocks in your alignment with codes (such as A, B, C etc.) and refer to them in your analysis.
-> Are there any conserved amino acids that are known as active sites for this protein family? If yes, position in alignment, function, activity?
3. N and C-termini of the studied ORF
-> Analyse the N- and C-terminal ends of the alignment (is the ORF complete? is there a start codon? how many amino acids are potentially missing at the N- and C-termini?)
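The counts asked for in point 1 of the plan above can be sketched in a few lines of Python. This is only an illustrative sketch: the toy alignment, the sequence labels and the function names `column_stats` and `ungapped_lengths` are hypothetical, and the script is no substitute for reading the ClustalW or GBlocks output itself. It counts fully identical columns, columns containing at least one gap, and the ungapped length of each sequence.

```python
# Hypothetical toy alignment: labels and residues are invented for illustration.
# Out-group labels carry the "ex" prefix, as suggested earlier in this guide.
alignment = {
    "ORF":    "MKV-LGAD",
    "InSeq1": "MKVALGAD",
    "InSeq2": "MRVALGSD",
    "exOut1": "-KIALGSE",
}


def column_stats(aln):
    """Count identical, gapped and variable columns of an aligned sequence set."""
    seqs = list(aln.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    identical = gapped = 0
    for column in zip(*seqs):          # iterate over alignment columns
        if "-" in column:
            gapped += 1                # at least one sequence has an indel here
        elif len(set(column)) == 1:
            identical += 1             # same residue in every sequence
    return {
        "length": length,
        "identical": identical,
        "gapped": gapped,
        "variable": length - identical - gapped,
    }


def ungapped_lengths(aln):
    """Length of each sequence once the gap characters are removed."""
    return {name: len(seq.replace("-", "")) for name, seq in aln.items()}


stats = column_stats(alignment)
lengths = ungapped_lengths(alignment)
print(stats)
print(lengths)
```

Conservative substitutions are deliberately not counted here, since that requires choosing a substitution matrix or residue grouping; the alignment program's own conservation line is the safer source for that figure.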
Use the above multiple alignment to infer a phylogenetic tree using two distinct tree reconstruction approaches:
You can use:
Please refer to the Frequently Asked Questions for further details and screen shots on running phylogenetic analyses.
IMPORTANT NOTICE: please use this specific rerooting tool to re-root your phylogenetic trees in "TEXT" format (the re-rooting of trees on the phylogeny.fr website is not 100% functional!). This tool will allow you to retain the node robustness values, as well as control the width of the tree.
Copy & paste the textual tree representation in the 'Tree' Annotathon field. Remember to include a protocol line in the 'Tree' field that includes the program name and run parameters (e.g. 'Phylip / Protdist+neighbor / Randomized input - Random number seed = 11 / rooted on: Coccidioides immitis (ascomycetes)').
a) Phylogeny.fr / PhyML method / no bootstrap / default substitution model / out group: Firmicutes
b) Phylogeny.fr / BioNJ method / out group: Firmicutes
Important: Use codes such as "A, B, ..." (or colors) to mark the major branches in your trees, and refer to them in your analysis.
Suggested plan for the analysis section:
1. Tree topologies
-> Describe the topology of each tree. What are the monophyletic groups?
-> Do the two independent trees describe the same evolutionary history? The same topology? Similar or different clades?
-> Identify the commonalities as well as the potential inconsistencies.
2. Coherence with reference trees
-> Are the in- and out-groups correctly separated?
-> Are your gene trees coherent with the reference species trees ("tree of life")?
-> Identify each discrepancy with the reference species tree, and suggest some explanations (HGTs, gene duplications...).
3. Predict the most likely taxonomic origin of the metagenomic ORF
-> In which monophyletic clade does the metagenomic sequence seem to emerge?
-> Propose a hypothetical taxonomic classification for the metagenomic ORF!
-> Provide a detailed justification of your hypothesis; do not under- or over-interpret the inferred phylogenetic trees!
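Questions such as "are the in- and out-groups correctly separated?" or "in which clade does the ORF emerge?" can be double-checked on the rooted text tree with a short script. The following is a minimal sketch under stated assumptions: the Newick string and its labels are invented, the helper names (`parse_newick`, `clades`, `is_monophyletic`, `smallest_clade_with`) are hypothetical, and the tiny parser ignores branch lengths and support values and assumes labels without spaces or quotes. It only makes sense on a tree that has already been rooted on the out-group.

```python
import re

PUNCTUATION = {"(", ")", ",", ";"}


def parse_newick(text):
    """Parse a Newick string into nested lists of leaf labels."""
    text = re.sub(r"\s+", "", text)
    tokens = re.findall(r"[(),;]|[^(),;]+", text)
    pos = 0

    def subtree():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                       # skip "("
            children = [subtree()]
            while tokens[pos] == ",":
                pos += 1
                children.append(subtree())
            pos += 1                       # skip ")"
            # drop an optional support value / branch length after ")"
            if pos < len(tokens) and tokens[pos] not in PUNCTUATION:
                pos += 1
            return children
        label = tokens[pos].split(":")[0]  # drop the branch length of a leaf
        pos += 1
        return label

    return subtree()


def leaves(node):
    """Set of leaf labels found below a node."""
    if isinstance(node, str):
        return {node}
    found = set()
    for child in node:
        found |= leaves(child)
    return found


def clades(node):
    """Yield the leaf set of every internal node of the rooted tree."""
    if isinstance(node, list):
        yield leaves(node)
        for child in node:
            yield from clades(child)


def is_monophyletic(tree, group):
    """True if some internal node has exactly this set of leaves."""
    return any(clade == set(group) for clade in clades(tree))


def smallest_clade_with(tree, leaf):
    """Smallest clade (two leaves or more) that contains the given leaf."""
    return min((c for c in clades(tree) if leaf in c), key=len)


# Hypothetical tree, already rooted on the "ex" out-group sequences.
newick = ("((exFirm1:0.4,exFirm2:0.5)0.98:0.3,"
          "((Alpha1:0.1,(Alpha2:0.1,ORF:0.2)0.85:0.05)0.90:0.1,Gamma1:0.3)0.75:0.2);")
tree = parse_newick(newick)

outgroup_ok = is_monophyletic(tree, {"exFirm1", "exFirm2"})
orf_clade = smallest_clade_with(tree, "ORF")
print(outgroup_ok, sorted(orf_clade))
```

A positive answer from such a check says nothing about robustness: the support values on the corresponding nodes still have to be read from the tree and discussed.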
After you have analysed the phylogenetic trees produced, specify the most likely taxonomic group (e.g. "Alphaproteobacteria") to which the organism carrying your DNA fragment belongs. To specify this group in the 'Taxonomy' Annotathon field you have two options:
Save your annotations and make sure that the one field above that you left blank has been automatically and correctly populated; for instance, if you chose to indicate "Alphaproteobacteria" in the "Scientific Name" box, once saved the code "28211" should appear in the "NCBI numerical identifier" box.
Note that the "NCBI numerical identifier" box has precedence over the "Scientific Name" box, so if you wish to change the taxonomic classification of your sequence you must delete the numeric code in order to enter a new value in the "Scientific Name" box.
Once the taxonomic group is correctly specified, the full lineage should appear:
Rank: order - Genetic Code: Bacterial and Plant Plastid - NCBI Identifier: 204455
Kingdom: Bacteria - Phylum: Proteobacteria - Class: Alphaproteobacteria - Order: Rhodobacterales
Bacteria; Proteobacteria; Alphaproteobacteria; Rhodobacterales;
IMPORTANT: unless your DNA sequence is 100% identical to an existing GENBANK entry, you should probably not specify a precise species! Without further evidence, a precise taxonomic definition of the organism carrying the metagenomic DNA fragment is impossible, so specify as the likely taxonomic group the node immediately above your sequence in the phylogenetic tree.
When your ORF's homologs have known functions, or if the ORF presents known conserved domains, select from the available "Biological Process" & "Molecular Function" lists the terms that most specifically describe your proposed functional hypotheses for the ORF. These terms are a subset of the comprehensive and hierarchical "Gene Ontology", most often referred to as GO annotations:
These GO annotations are frequently assigned to records in well annotated databanks such as SWISSPROT or INTERPRO; use the GO terms associated with your ORF's closest homologs or conserved domains to help you assign the most appropriate terms.
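One way to picture "the node immediately above your sequence" is as the deepest taxon shared by the reference sequences that sit in the same clade as the ORF. The sketch below is hypothetical throughout: the function name `shared_lineage` and the list of neighbours are invented, and it assumes that lineages are written as semicolon-separated strings listing the same ranks in the same order, as in the lineage display shown above.

```python
def shared_lineage(lineages):
    """Return the taxa shared, from the root down, by every lineage given."""
    split = [
        [taxon.strip() for taxon in lineage.split(";") if taxon.strip()]
        for lineage in lineages
    ]
    shared = []
    for ranks in zip(*split):          # compare the lineages rank by rank
        if len(set(ranks)) != 1:
            break                      # first rank where the lineages disagree
        shared.append(ranks[0])
    return shared


# Hypothetical lineages of the reference sequences grouped with the ORF.
neighbours = [
    "Bacteria; Proteobacteria; Alphaproteobacteria; Rhodobacterales;",
    "Bacteria; Proteobacteria; Alphaproteobacteria; Rhizobiales;",
]

shared = shared_lineage(neighbours)
proposal = shared[-1] if shared else None
print(proposal)
```

In this toy case the two neighbours disagree at the order level, so the cautious proposal is the class they share rather than either order.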
In the event that your ORF has highly convincing homology with a family of well characterized proteins of known function, whose gene symbol nomenclature appears uniform and stable, you can propose in the Gene symbol field a putative gene symbol for your ORF. If the homologs have no gene symbol, or if their symbols vary to a large extent, do not invent a new symbol, just leave this field empty!
For gene symbol examples, check out those already attributed to metagenome fragments during the Annotathon on Metagenes.
This field is central to your evaluation: write up your interpretations and hypotheses based on the observations you have made in the preceding "RESULTS ANALYSES". Imagine you are trying to convince a very sceptical colleague: use rigorous argumentation, cite precise evidence and numerical values whenever possible, highlight important findings, and cross-check information from independent sources. Remember that in silico analyses generally do not constitute final proof, only suggestions. Terms such as "putative", "suggests" or "probably" show that you understand the limitations of computational biology results.
Make sure you have at least covered:
Some common pitfalls to avoid at all cost:
Concentrate on producing a scientific, structured, synthetic and rigorous argumentation that will hold up to peer scrutiny!
Due to lack of manpower, we are no longer able to offer evaluations of annotations outside of specific university teams!