Annotathon!


Welcome

150 years ago Charles Darwin sailed the high seas on board HMS Beagle (right, top) exploring life's morphological diversity. Today, ships such as Tara (right, middle) sample the oceans to uncover life's molecular diversity through metagenomic sequencing.

From September 2009 to December 2013, the oceanographic ship Tara sampled plankton across all of the planet's oceans (right, bottom). Ocean plankton produce half the oxygen we breathe. If terrestrial forests fill up our first lung, oceans fill up the second. The ocean's microscopic planktonic life is also an important carbon sink, with far reaching implications in climate change scenarios. Hence, the Tara Oceans pan-oceanic international expedition aims to define the current state of plankton biodiversity, from tropical coral reefs to Antarctica, from viruses to fish larvae.

Metagenomic sequencing of the plankton DNA contained in the thousands of Tara Oceans samples has begun at the GENOSCOPE (France). These short pieces of DNA contain precious information on the identity of the plankton species which populate seawater, as well as hints to the metabolic functions at work in these tiny ocean drifters. At this molecular scale, bioinformatics is the essential tool to shed light on the information locked up in the DNA sequences!

Your mission is to sift through this precious, gigantic heap of DNA sequence data in order to identify the microbial origin of these sequences (archaeal, bacterial, viral?), to predict whether they code for proteins and, if so, what the function of these protein-coding genes might be.

Team information

Freelance volunteer annotator

The "Open access" team is open to all volunteer explorers! To join, please open an account in the "Open Access" team using the "Create New Account" tab at the top of this page.

Students taking a course running an Annotathon team

You are invited to open an account in your course's specific team (e.g. BioCell2007) using the "Create New Account" tab at the top of this page.

Teachers who wish to lead an Annotathon student team

If you teach and wish to create a team for your students, you will find all necessary information in the Annotathon Instructor Manual.

Aims

Your team will collectively annotate distinct DNA fragments randomly distributed from the available public sequence pool. Each registered annotator will be responsible for annotating a specific set of sequence fragments. For each fragment, annotators will produce a full report specifying if it is likely to be coding, the putative function of the protein product, as well as the most likely taxonomic classification of the host organism.

By practicing the core bioinformatics tools on a number of distinct sequences, each one a live piece of experimental data, you will become familiar with the running and interpretation of fundamental sequence analysis. Experience shows that after two or three distinct analyses, the focus shifts from bioinformatics to biological issues! All tools are available online, so all you need to start is a web browser.

User Guide

You can access the Annotathon from any computer connected to the Internet, irrespective of operating system (Mac, Windows or Linux)...

Authentication

We recommend that you simultaneously open the following pages (in different windows or tabs) in your browser:

The main Annotathon work page
This Rule Book
The Frequently Asked Questions page with advanced illustrated guides to the tools that will be used during the practical

If you don't have an Annotathon account (i.e. it is your first session), click on the "New account" tab in the permanent menu at the top of the Annotathon pages. Follow the instructions to open a new account; make sure you select the appropriate affiliation or you might end up being supervised inside another team, possibly using a different language... If you don't know your Team code, please ask your instructor for it. You are required to enter at least one firstname/lastname pair, and one email address in order to receive Annotathon specific notifications. Your email address is secure and will under no circumstance be made public or passed on to any third party. Only low traffic messages specific to your course duration will be mailed to this address; no further messages will be sent after the course is completed.

Finally, clicking "Open the account" should be followed by the message "Account 'XYZ' has been created". Use your 'username' and 'password' and click "Connect" in the form at the top of the page to open an Annotathon session. You will be reminded that your email address is not validated until you have followed the special link included in an email automatically sent to you at account creation.

The home page (also available by clicking the "Home" tab) gives an overview of the team's annotation progress. Note that once connected you will be able to locate your position in the team at the bottom of the stats page (after your annotations start being evaluated).

Cart and sequence fragments

Each annotator can list the sequence fragments they are responsible for annotating by clicking on the "Cart" tab. Since your cart is initially empty, select an ocean sampling location (e.g. Caribbean Sea: Rosario Bank) and click on "Add a new sequence to your cart".

You can only add new sequence fragments to your cart when it is empty, or when you have already annotated all available sequences. Add new sequence fragments at your discretion (or until you reach the upper limit set by your supervisor).

View annotations

Click the icon opposite the sequence you wish to view in your cart. The initial annotation is minimal: apart from the sequence itself and its geographic origin, each sequence fragment only has a unique Annotathon accession number. The remaining annotation is your responsibility!

Modify annotations

Click the icon opposite the fragment you wish to edit. After having modified any annotations, remember to save your work on the Annotathon server by clicking the "Save your annotations" button! Should you leave this editing form without submitting, all modifications since the last save will be lost... Since you can submit your work as often as you wish, it is recommended that you save your work regularly.

Fragment accession numbers

The accession numbers assigned to sequence fragments (e.g. GOS_5421290.1) are arbitrary and internal to the Annotathon; the digits after the dot correspond to the annotation version (it starts at 1 and increments by one at each save). You can view previous annotation versions by selecting the appropriate version in the popup list at the top of the visualisation sheets (icon).
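As an illustration of this naming scheme, the short Python sketch below splits an accession number into its base identifier and its annotation version. The function name split_accession is hypothetical and is not part of the Annotathon itself.

```python
def split_accession(accession):
    """Split e.g. 'GOS_5421290.1' into ('GOS_5421290', 1).

    Hypothetical helper: the digits after the last dot are read as
    the annotation version, as described in the text above.
    """
    base, _, version = accession.rpartition(".")
    return base, int(version)


print(split_accession("GOS_5421290.1"))
```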

Submit annotations for evaluation

When your annotations are completed, click the icon opposite the fragment. Its status will then shift from 'Annotation 1' to 'Evaluation 1' and it will be closed for editing until your work has been evaluated. After this initial evaluation, the fragment status shifts from 'Evaluation 1' to 'Annotation 2'; you are then invited to update your initial annotations following the evaluator's comments. When your second annotation pass is completed, click the icon to submit your annotation for the second and final round of evaluations.

Discussion forum

The "Forum" tab opens access to the Annotathon internal forum (an icon signals that a new unread message has been posted to the forum). Click on a message subject to see its content. If you wish to reply to a message, use the form immediately under it and click "Post message". IMPORTANT! Only use this method to DIRECTLY REPLY TO THE CONTENT OF A SPECIFIC MESSAGE!

If you wish to open a new discussion thread, you MUST use the special new thread form available at the top of each of your annotation records (icon in your cart)! You can then select the appropriate forum for your new thread (e.g. Searching for homologues: BLAST). A link to your specific sequence fragment and associated annotations will automatically be included with your post. Note that the messages you post are also emailed directly to your supervisors and fellow annotators.

Although messages are often answered by supervisors, trainees who wish to offer help by answering fellow trainee questions are strongly encouraged to do so. Constructive replies will be taken into account in trainee evaluations.

Announcements

Announcements from your supervisors will be displayed at the top of Annotathon pages. Once read, tick the 'Read' box to transfer the message to your archive (available at the bottom of the Forum tab).

Sequence Annotations

General principles

The sequence annotation editing form has three types of fields:

Protocol & Results Analysis (free text fields)
Raw Results (i.e. ORFfinder, BLAST, multiple alignment, phylogenetic tree)
Ontologies (i.e. molecular function, biological process, gene symbol, taxonomy)

The Annotathon editing form is hence both a digital "lab book" (Protocols & Raw Results) and an "annotation report" (Results Analysis & Ontologies).

IMPORTANT: Some free text fields are initially filled with a standard template looking like this:

PROTOCOL:

---------------------------------------------------------------------------------------------------

RESULTS ANALYSIS:

---------------------------------------------------------------------------------------------------

RAW RESULTS:

Under the "PROTOCOL" heading, specify the minimum information necessary to reproduce the exact same results. Usually, this would entail giving the name of the tool used, the URL of the web page, together with its run parameters. For instance, for the ORF finding results field, the protocol line could read:

PROTOCOL:

SMS ORFinder, http://annotathon.org/sms2/orf_find.html :

    forward strand, frames 1, 2 & 3, min 60 AA, 'any codon' initiation, 'universal' genetic code

Copy & paste the raw results of the analysis, in-extenso, under the "RAW RESULTS" heading. If you have carried out more than one analysis (for instance two SMS ORFfinder runs, one on forward and one on reverse strand), then reference the two analyses using an index exactly as follows:

PROTOCOL:

a) SMS ORFinder, http://annotathon.org/sms2/orf_find.html :

   forward strand, frames 1, 2 & 3, min 60 AA, 'any codon' initiation, 'universal' genetic code

b) SMS ORFinder, http://annotathon.org/sms2/orf_find.html :

   reverse strand, frames 1, 2 & 3, min 60 AA, 'any codon' initiation, 'universal' genetic code

RESULTS ANALYSIS: [enter your observations here]

RAW RESULTS:

a) forward strand

>ORF number 1 in reading frame 1 on the direct strand extends from base 511 to base 744.

CGAGTGATAACTGGTCCAGTAATCGCGATACCGATCATCTTGTTGCGGATTGACGATGTT

AAAATCCCGATCAGGGCGGATATCCAGCCCCAGCCTTTCACAACGTTGCTGAATCACTTC

GGGGCGGCCTATGACGATGGGAACTTCGCTGGTTTCTTCCAAAACGGCCTGAGCGGCGCG

CAGCACCCGCTCGTCTTCGCCCTCGGCAAACACAATCCGTCGAGCGCTGCTTGA

>Translation of ORF number 1 in reading frame 1 on the direct strand.

RVITGPVIAIPIILLRIDDVKIPIRADIQPQPFTTLLNHFGAAYDDGNFAGFFQNGLSGA

QHPLVFALGKHNPSSAA*

---------------------------------------------------------------------------------------------------

b) reverse strand

>ORF number 1 in reading frame 1 on the reverse strand extends from base 517 to base 855.

CCTGATCTGTGGCGCTGTGGGCGAATTCAGATGGCATCTGAATTATATCGAGCAAATTTT

AGGCAGCAAAACCTTATCGCCAAGCGGCGCGCTGTCTTTGATGATTTTAGAAGACGGGCC

TCTGTTCATCGCAGACACCCACGTCTGGGCGGATCCCACCCCCATGCAAATTGCCCAAAC

CGCCAAAGGGGCCGCGCGCCATGTGCGCCGTTTTGGCATAGAGCCACAAGTCGCGCTGTG

CTCGCAATCACAATTTGGAAATCTGAACAGCGAGACTGGCAAGAAAATGCGCCAAGCATT

GGATATTCTCGATACCGAAAAGGTGACGTTTACCTATGA

>Translation of ORF number 1 in reading frame 1 on the reverse strand.

PDLWRCGRIQMASELYRANFRQQNLIAKRRAVFDDFRRRASVHRRHPRLGGSHPHANCPN

RQRGRAPCAPFWHRATSRAVLAITIWKSEQRDWQENAPSIGYSRYRKGDVYL*

Finally, use the "RESULTS ANALYSIS" sections to present your observations and interpretations of the raw results. Results analysis, a pivotal part of scientific discourse, answers the question "what did we see that is notable when we carried out the experiment described in the protocol?". These rigorous factual observations, usually accompanied by precise numerical values (percentages, E-values, number of hits, number of conserved amino acids etc.), are offered without far-reaching discussion. Keep the main discussion and interpretations for the "Conclusion" field.

Note: the last "Notepad" field at the bottom of the sequence editing form is available to store any data that isn't accommodated by other specific annotation fields. Use the Notepad to store data that can be useful for subsequent re-analyses (e.g. store your set of FASTA formatted homolog sequences here). The Notepad is your private space and is not consulted during evaluation.

Brief contextual help is available for each annotation field of the editing form by clicking the icons. The information expected for each annotation field is described below.

Remember that a Frequently Asked Questions page is available for in-depth explanations, tutorials and screenshots of each of the bioinformatics analyses needed to perform the sequence annotations.

Always keep in mind during your analyses the three main focal points of your annotation:

whether the DNA is coding or not and, if it is, the ORF's start & end positions
a functional hypothesis for the putative protein encoded by the DNA fragment
a taxonomic hypothesis regarding the organism likely to carry the DNA fragment

No single bioinformatics tool can by itself answer any of these questions; answers are built by cross-checking and synthesizing all available results.

The basic rule set below can be overridden by more specific or alternative rules given to you by your instructors. If in doubt, always consult your instructors.

ORF finding

The first investigation for each DNA fragment will involve the identification of putative Open Reading Frames (ORFs). There are many tools to tackle this issue, including the following:

Sequence Manipulation Suite (SMS) (recommended)
NCBI ORF finder

For this study, you will only consider ORFs that meet the following criteria:

do not contain any STOP codons (basic ORF definition...)
contain at least 60 codons
can be on either the direct or the reverse strand
can be in frames 1, 2 or 3 on each strand
can be complete or incomplete at the 5' or 3' ends, or both!
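The criteria above can be sketched in a few lines of Python. This is a minimal illustration, not the SMS or NCBI implementation: it assumes the stop codons of the standard genetic code (TAA, TAG, TGA), 'any codon' initiation and an unambiguous A/C/G/T sequence, and the function names are hypothetical. Positions are 1-based, given on the strand that carries the ORF, and exclude the STOP codon.

```python
STOPS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=60):
    """List (strand, frame, start, end) for stop-free runs of >= min_codons codons.

    Runs may touch either end of the fragment, i.e. ORFs can be
    incomplete in 5', in 3', or both.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("direct", seq), ("reverse", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame  # 0-based index of the first base of the current run
            for i in range(frame, len(s) - 2, 3):
                if s[i:i + 3] in STOPS:
                    if (i - run_start) // 3 >= min_codons:
                        orfs.append((strand, frame + 1, run_start + 1, i))
                    run_start = i + 3
            # Trailing run that reaches the fragment end without a STOP
            last = frame + ((len(s) - frame) // 3) * 3
            if (last - run_start) // 3 >= min_codons:
                orfs.append((strand, frame + 1, run_start + 1, last))
    return orfs


if __name__ == "__main__":
    demo = "ATG" + "GCT" * 60 + "TAA"  # made-up sequence, for illustration only
    for orf in find_orfs(demo):
        print(orf)
```

Because initiation is on 'any codon', a short artificial sequence like the demo reports stop-free runs in several frames; deciding which of them is a real gene is the subject of the classification rules further down.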

Copy & paste the raw, unedited ORF finding output into the 'ORF finding' field of the Annotathon editing form. Remember to conduct the analysis in all SIX frames, and to include a full PROTOCOL line for each raw result. If you have carried out more than one analysis, as for ORFinder calculated on both strands, or for BLASTp against SWISSPROT and against NR, then you must label each distinct Protocol line with a letter matching a corresponding header in the raw results section (as shown above for the ORFinder example).

Please summarize all the ORFinder results in a synoptic table as follows (use the field text editor icons to add a table):

Table 1: List of putative ORFs present within the metagenomic DNA fragment

	Size (nt)	Size (aa)	Strand	Start position	End position	ORF complete in 5'	ORF complete in 3'	Number of Hits Blast NR Ev < 1e-10	ORF classification
ORF1	267	88	direct	95	361	yes	yes	0	Less probable ORFan
ORF2	891	297	reverse	120	1010	yes	no	4256	KNOWN (studied here)

Important note: Tables should be numbered incrementally (e.g. Table 1, Table 2, etc.) across all annotation items, and should have titles.

You should also draw a figure that summarizes the positions of all ORFs in the DNA fragment (an example is given below; it is not related to the previous table).

Figure 1: Localization of ORFs on TO72D_5186010 DNA fragment

               (50)==ORF1==>(249) (268)====ORF2====>(579)      (744)========ORF3======>(1068)
DIRECT  :    1 ------------------------------------------------------------------------------ 1070
REVERSE : 1070 ------------------------------------------------------------------------------ 1
               (1068)<====================ORF4====================(394) (407)<==========ORF5==========(134)

Legend:
==ORFx==> False positive
==ORFx==> KNOWN (Studied ORF)
==ORFx==> NOVEL (Not studied here)

If your sequence contains several ORFs, arbitrarily select either the longest one or the one obtaining the highest number of BLAST hits (see below) for all subsequent analyses.

You should also classify each putative ORF in each one of the following categories:

False positive (a mere succession of codons without a STOP)
ORFan (protein coding gene without homologs at the date of your analysis)
Novel (protein coding gene having homologs with unknown function)
Known (protein coding gene having homologs with known functions)

In order to classify your ORFs, the following elements should be considered:

the presence of homologous proteins (see the BLAST item; E-value < 1E-10 ==> NOVEL or KNOWN)
the ORF size (an ORF > 100 aa without homologs is probably an ORFan)
the overlap with other ORFs (an ORF without homologs is certainly a False positive if most of it overlaps another ORF that has homologs)

If homologs clearly exist, you can conclude that the sequence is coding DNA whatever the ORF size. Otherwise, the true or false positive nature of the ORF will essentially depend on the ORF size[1]. There is no real hard threshold, but it is very unlikely that a 150+ amino acid ORF is a false positive...
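The decision rules above can be written down as a small Python sketch. It is only an aid to reasoning, not an official Annotathon classifier: the function name, its parameters and the wording of the fallback label are assumptions, and the 100 aa cut-off is the rule of thumb quoted above rather than a hard threshold.

```python
def classify_orf(length_aa, n_homologs, homologs_have_known_function=False,
                 overlaps_orf_with_homologs=False):
    """Suggest an ORF category from the rules of thumb in the text.

    n_homologs: number of BLAST hits with E-value < 1e-10.
    overlaps_orf_with_homologs: True if most of this ORF overlaps
    another ORF that does have homologs.
    """
    if n_homologs > 0:
        # Homologs exist: coding, whatever the ORF size
        return "KNOWN" if homologs_have_known_function else "NOVEL"
    if overlaps_orf_with_homologs:
        return "False positive"
    if length_aa > 100:
        return "ORFan"
    # Short, no homologs, no overlap: no clear-cut answer
    return "Less probable ORFan"


print(classify_orf(297, 4256, homologs_have_known_function=True))
print(classify_orf(88, 0))
```

The two calls reproduce the classifications of ORF2 and ORF1 in Table 1; borderline cases still need to be argued in the Results Analysis.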

-If the DNA fragment doesn't appear to contain any ORFs and is too short to be convincing, tick the 'non-coding' box of the 'Status' field. The annotation of this fragment will be limited to populating the ORF finding and BLAST fields, as well as the conclusion of course! However, before you conclude that an ORF is non-coding, we recommend that you first look for homologs in the "environmental databases". Ask your instructor for instructions on how to proceed, since this is quite exceptional. If you still find no homologs in "environmental databases", you may save your annotations and add a new (hopefully less obscure) sequence to your cart.

-If the sequence appears to carry a true coding ORF (either very long ORF, or with many homologs, or both!), tick the 'coding' box of the 'Status' field. Indicate in the appropriate fields the start and end positions of the ORF[2], as well as the strand. Note that if the ORF is complete at the 3' end (i.e. finishes with a STOP), you need to subtract the 3 STOP codon nucleotides from the end position. Validate this ORF by clicking "Save annotations".

If the ORF satisfies the rules above, the translation will automatically be displayed; otherwise a red error message will help you pinpoint the problem. The ORF can be incomplete, in which case simple green informational messages to this effect are displayed. You should correct the ORF strand, start and stop positions until you no longer observe any error messages after saving your annotations!

[1] indeed the absence of homologs in public protein databases does not suggest that a sequence is non-coding; it merely means that there is currently no known homolog. There exist other so-called ab initio approaches to identify true positive coding ORFs (for instance based on statistical codon usage biases) but these methods usually require organism-specific known gene training sets or large chunks of genome sequence, and are hence difficult to apply to metagenome exploration where by definition the organisms from which the sequences derive are unknown.

[2] Important note on the ORF coordinate system: The ORF start & end positions must be given on the strand which carries the ORF! The ORF positions given by the SMS ORF finder can be entered as is, whereas ORF locations on the reverse strand provided by the NCBI ORF finder need to be converted (fragment length - position +1)...
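The two coordinate adjustments described in this section (conversion of a direct-strand position to the reverse strand, and removal of the STOP codon from the end position) amount to simple arithmetic. The Python sketch below illustrates them; the function names are hypothetical and the example numbers are chosen only for illustration.

```python
def to_reverse_strand(position, fragment_length):
    """Convert a 1-based direct-strand position to the reverse strand
    (fragment length - position + 1)."""
    return fragment_length - position + 1


def end_without_stop(end_position, complete_in_3prime):
    """Drop the 3 STOP codon nucleotides when the ORF ends with a STOP."""
    return end_position - 3 if complete_in_3prime else end_position


# On a 1070 nt fragment, direct-strand position 394 is reverse-strand position 677
print(to_reverse_strand(394, 1070))
# An ORF ending at 361 with its STOP codon included ends at 358 without it
print(end_without_stop(361, True))
```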

Please refer to the Frequently Asked Questions for further details on ORF finding, in particular on the subtle issue of exact determination of the ORF start position...

RESULTS ANALYSIS

Here is an example of how to structure the analysis of the results 
1- ORFs Classification
  1.1- Justify KNOWN ORFs (if present)
  1.2- Justify NOVEL ORFs (if present)
  1.3- Justify ORFan ORFs (if present)
  1.4- Justify False positive ORFs (if present)

     -> Give detailed justification of each classification
     -> Refer to Table 1!
     -> Cite your sources of information with web links (e.g. "homologs are epimerases (cf. SWISSPROT entry MJ0211)")

2- ORF selected for annotation in the rest of the report
     -> Justify your selection!
     -> Could other ORFs also be subjected to bioinformatics analysis?

3- Extremities of the selected ORF
     -> Discuss the start and end positions. If possible, estimate the number of amino acids missing from a complete, full-length protein (refer to the multiple alignment section).

Molecular weight

Only if the ORF is complete at both ends, compute its theoretical polypeptide molecular weight using for instance:
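Whatever tool you use, the result can be cross-checked by hand: the theoretical weight is the sum of the average residue masses plus one water molecule for the free termini. The Python sketch below illustrates this calculation; it is not an Annotathon tool, the function name is hypothetical, and the mass table lists the commonly tabulated average residue masses (in Da) for the 20 standard amino acids only.

```python
# Average residue masses in Da (amino acid minus one water)
AVERAGE_RESIDUE_MASS = {
    "A": 71.0788, "R": 156.1875, "N": 114.1038, "D": 115.0886,
    "C": 103.1388, "E": 129.1155, "Q": 128.1307, "G": 57.0519,
    "H": 137.1411, "I": 113.1594, "L": 113.1594, "K": 128.1741,
    "M": 131.1926, "F": 147.1766, "P": 97.1167, "S": 87.0782,
    "T": 101.1051, "W": 186.2132, "Y": 163.1760, "V": 99.1326,
}
WATER = 18.01528  # one H2O added for the N- and C-terminal groups


def molecular_weight(protein):
    """Approximate average molecular weight (Da) of an unmodified polypeptide."""
    protein = protein.upper().rstrip("*")  # drop a trailing STOP symbol if present
    unknown = set(protein) - set(AVERAGE_RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue(s): {sorted(unknown)}")
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in protein) + WATER


print(round(molecular_weight("GA*"), 2))
```

Small differences with online tools are expected, since each tool uses its own mass table and rounding.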

Conserved protein domains

Find out whether your ORF contains any of the known conserved domains stored in one of the domain databases:

INTERPRO (recommended)
PFam

INTERPRO is a good choice since, as a federation of the other domain databases, it gathers their domains in one place; INTERPRO analyses can however be on the slow side.

Only submit to the Annotathon domains that you have good reasons to believe are significant, that is to say:

those that are not expected to be found easily purely by chance (i.e. which have sufficiently specific profiles/signatures), check the E-values!
those whose predicted functions are coherent with the other bioinformatics results (e.g. a DNA binding domain for an ORF whose BLAST homologs are transcription factors)
domains that are non-redundant (and non-overlapping) with other domains you have submitted to the Annotathon

If you are convinced of the likelihood of at most four domains (although, because of the short length of metagenome sequences, it is very rare to have more than one non-redundant domain), enter their names and coordinates in the Annotathon 'domains' field. Take care not to repeat essentially the same domain represented under multiple accession numbers in distinct databases (it is common for domains to be present in all three PROSITE, PRINTS & PFam databases).

In the "Raw results" of the INTERPROscan analysis, copy the tool's results in the following form only ("Export" -> "TSV"):

RAW RESULTS:

TO82S_4665010 35c27f 205 SUPERFAMILY SSF52833 79 162 3.44E-7 T 30-09-2014 IPR012336 Thioredoxin-like fold

TO82S_4665010 35c27f 205 Pfam PF14595 Thioredoxin 44 167 6.3E-32 T 30-09-2014

TO82S_4665010 35c27f 205 Gene3D G3DSA:3.40.30.10 18 205 4.4E-36 T 30-09-2014 IPR012336 Thioredoxin-like fold

Please synthesize this difficult to read raw result by producing (yet another) nice table under the "Results Analysis" section, for instance:

Table 2: List of conserved protein domains identified by InterProScan

______________________________________________________________________________________________________________________________

|  Interpro code  |  Database   |  start     |  end       |  E-value  |  Original DB description   |   Interpro description   |

|   (IPRxxxxxx)   |   origin    |  position  |  position  |           |  (first on raw res. line)  |   (last on raw res. line)|

|_________________|_____________|____________|____________|___________|____________________________|__________________________|

|                 |             |            |            |           |                            |                          |

|  IPR012336      | SUPERFAMILY |      79    |     162    |  3.44E-7  |   SSF52833                 |   Thioredoxin-like fold  |

|_________________|_____________|____________|____________|___________|____________________________|__________________________|

|      None       |     Pfam    |      44    |     167    |  6.3E-32  |   Thioredoxin              |         None             |

|_________________|_____________|____________|____________|___________|____________________________|__________________________|

|  IPR012336      |    Gene3D   |      18    |     205    |  4.4E-36  |   G3DSA:3.40.30.10         |   Thioredoxin-like fold  |

|_________________|_____________|____________|____________|___________|____________________________|__________________________|
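Going from the raw TSV export to the rows of such a table can be partly automated. The Python sketch below is one possible way to do it, under stated assumptions: it expects the usual tab-separated InterProScan layout (protein, MD5, length, analysis, signature accession, signature description, start, end, E-value, status, date, InterPro accession, InterPro description), the sample lines are the three shown above with their tabs restored by hand, the function names are hypothetical, and the 1e-5 cut-off is an arbitrary illustration rather than a recommended threshold.

```python
COLUMNS = [
    "protein", "md5", "length", "analysis", "signature_acc", "signature_desc",
    "start", "end", "evalue", "status", "date", "interpro_acc", "interpro_desc",
]

SAMPLE = (
    "TO82S_4665010\t35c27f\t205\tSUPERFAMILY\tSSF52833\t\t79\t162\t3.44E-7\tT\t30-09-2014\tIPR012336\tThioredoxin-like fold\n"
    "TO82S_4665010\t35c27f\t205\tPfam\tPF14595\tThioredoxin\t44\t167\t6.3E-32\tT\t30-09-2014\n"
    "TO82S_4665010\t35c27f\t205\tGene3D\tG3DSA:3.40.30.10\t\t18\t205\t4.4E-36\tT\t30-09-2014\tIPR012336\tThioredoxin-like fold\n"
)


def parse_interproscan_tsv(text):
    """Turn TSV lines into dictionaries keyed by column name."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        fields += [""] * (len(COLUMNS) - len(fields))  # pad missing trailing columns
        row = dict(zip(COLUMNS, fields))
        row["start"], row["end"] = int(row["start"]), int(row["end"])
        try:
            row["evalue"] = float(row["evalue"])
        except ValueError:
            row["evalue"] = None  # some analyses report no E-value
        rows.append(row)
    return rows


def summary_rows(rows, max_evalue=1e-5):
    """Keep hits under the (illustrative) cut-off, formatted like Table 2."""
    return [
        (r["interpro_acc"] or "None", r["analysis"], r["start"], r["end"],
         r["evalue"], r["signature_desc"] or r["signature_acc"],
         r["interpro_desc"] or "None")
        for r in rows
        if r["evalue"] is not None and r["evalue"] <= max_evalue
    ]


for row in summary_rows(parse_interproscan_tsv(SAMPLE)):
    print(row)
```

The filter only handles the E-value criterion; redundancy between databases and coherence with the BLAST results still have to be judged by the annotator.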

Please refer to the Frequently Asked Questions for further details on running InterproScan and most importantly on identifying conserved domains.

RESULTS ANALYSIS

1. Selected domains (if present)
    -> Which domain predictions do you select to annotate the ORF? Specify their sizes and E-values! Justify your selection!
    -> Refer clearly to Table 2 for your detailed analysis.

2. Rejected domains (if present)
    -> Why are some functional domains rejected? (high E-value?  No IPR domains?)

3. Biological function
    -> Give details of the biological functions associated with your retained domains (enzymatic activities, molecular functions, biological processes, ...)
    -> Cross-check your results with the Blast results (in particular Swissprot)
    -> All your sources should be cited (for example web link to Interpro or Pfam entry)

BLAST homolog search

Use BLAST to identify putative sequence homologs of your ORF in public sequence databases. You can find online BLAST services at:

Two approaches can be used to identify homologs of your sequence:

BLASTp: ORF protein sequence against protein database
BLASTx: submit your nucleotide sequence and BLAST will translate it in all six frames before comparing it with a protein database; use BLASTx if you are in doubt as to the ORF location on your DNA fragment, or if your ORF search did not yield any convincing ORFs (BLASTx is tolerant of small sequencing errors that can introduce frameshifts, which confuse simple ORF finding programs)

You should query the two following protein databases:

NR the most comprehensive database available (useful for the subsequent phylogenetic analysis)
SWISSPROT very small database with highly accurate and informative annotations (useful for subsequent functional hypotheses)

Copy & paste in the 'BLAST' Raw Results field (Important note: a text version of the BLAST results is available via the "Reformat" button on the NCBI website):

the BLAST header (or insert your protocol: program name, database queried and any other modified parameters)
the complete unabridged hit list (the summary sequence list next to the two 'Score' and 'E-Value' columns)
the first sequence alignments (all alignments if there are few hits, otherwise just the first dozen or so)

If homologs of your ORF exist, indicate what you consider the E-value threshold that separates true positive homologs from false positive non-homologs.

Present under the BLAST "Results Analysis" section a synthetic table looking like this:

Table 3: Number and quality of Blastp alignments vs NR and SWISSPROT

__________________________________________________

      |  number of  |   min   |   max   | e-value |

      |  results    | e-value | e-value |threshold|

______|_____________|_________|_________|_________|

      |             |         |         |         |

  NR  |    3124     |  5e-61  |    10   |  4e-07  |

______|_____________|_________|_________|_________|

      |             |         |         |         |

  SP  |     105     |  3e-05  |    10   | < 3e-05 |

______|_____________|_________|_________|_________|

With the help of the "Definition List" tool, also include a table which lists all the distinct functions of the homologs detected by BLAST, with their range of E-values. The Definition List is of great help, but not perfect. In some cases, you may need to simplify the list by grouping several definitions into one line, e.g. "DNA polymerase B" from:

"DNA polymerase B"
"DNA polymerase B PolB"
"Putative DNA polymerase B"
"DNA polymerase B family proteins"
"DNA polymerase B, partial"

Provide a table of the following form:

Table 4: Catalog of protein homolog functions from BLASTp vs NR (With the help of "Definition List" tool)
_____________________________________________________________________________________________________
| descriptions :                                                        | min e-value | max e-value |
|_______________________________________________________________________|_____________|_____________|
| • carbamoyl phosphate synthase large subunit                          | 5e-61       | 10          |
| • transcriptional regulator                                           | 7e-33       | 2e-29       |
| • haloacid dehalogenase                                               | 2e-31       | 3e-10       |
| • UDP-phosphate galactose phosphotransferase                          | 5e-30       | 0.35        |
| • pilin glycosyl transferase B2                                       | 2e-28       | 4e-18       |
| • carboxylate-amine ligase                                            | 9e-25       | 8.3         |
| • sialic acid O-acetyltransferase NeuD family sugar O-acyltransferase | 5e-19       | 6e-19       |
| • NAD-dependent epimerase/dehydratase                                 | 1e-17       | 0.040       |
| • biotin carboxylase                                                  | 7e-17       | 9.9         |
| • carboxyltransferase                                                 | 5e-16       | 4.1         |
| • DNA polymerase                                                      | 2e-08       | 9.9         |
|_______________________________________________________________________|_____________|_____________|
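The grouping of near-synonymous descriptions and the min/max E-value columns of such a catalog can be sketched in Python as follows. This is not the "Definition List" tool: the function names and normalisation rules are assumptions chosen to match the DNA polymerase B example, and the E-values in the demo are made up for illustration.

```python
import re


def normalise(description):
    """Reduce a BLAST hit description to a simple grouping key."""
    d = description.strip().lower()
    d = re.sub(r"^(putative|probable|predicted)\s+", "", d)  # leading qualifiers
    d = re.sub(r",\s*partial$", "", d)                       # ", partial" suffix
    d = re.sub(r"\s+family proteins?$", "", d)               # "family protein(s)" suffix
    return d


def function_catalog(hits):
    """Group (description, e-value) pairs; return (key, (min, max)) sorted by min e-value."""
    ranges = {}
    for description, evalue in hits:
        key = normalise(description)
        lo, hi = ranges.get(key, (evalue, evalue))
        ranges[key] = (min(lo, evalue), max(hi, evalue))
    return sorted(ranges.items(), key=lambda item: item[1][0])


# Made-up hit list, for illustration only
demo_hits = [
    ("DNA polymerase B", 2e-08),
    ("Putative DNA polymerase B", 3e-05),
    ("DNA polymerase B, partial", 9.9),
    ("transcriptional regulator", 7e-33),
]
for key, (lo, hi) in function_catalog(demo_hits):
    print(f"{key:30s} {lo:g} {hi:g}")
```

Simple rules like these will not catch every variant (a description such as "DNA polymerase B PolB" would still need to be merged by hand), so the resulting list should always be reviewed.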

You must discuss in the "Results analysis" section whether you think this list of homolog functions is coherent or not (i.e. are they essentially synonyms), and whether they are coherent with the functions of the conserved domains identified by INTERPRO!

RESULTS ANALYSIS

Do not continue the analysis of the fragment if:

- No homologs (or a very small number of homologs in the NR database, i.e. <100)

- Your gene is already present in the NR database (nucleotide BLASTn with identity > 95%)

Proposed structure of your analysis section:

1. Overview of the alignments
-Synthetic description of your alignment results (number, known functions, quality of the alignments, ...)
-Give details on the E-value range, the % identity/similarity range, indels, alignment coverage, ...

2. Identification of protein homologs
    -> Justify E-value thresholds (NR & SP) (Evalue cutoffs, changes in putative homolog functions), refer to tables 3 & 4.

3. Function of homologs from SWISSPROT analysis
   -> From SP entries, give details on the closest homolog functions and on the specific role of important amino acids involved in catalytic function; cross-check your analysis with the protein domain analysis. In all cases, cite your sources with web links.

Please refer to the Frequently Asked Questions for further details on running BLAST.

BLAST taxonomic report

Follow the instructions on the TaxReports tool to extract taxonomic information from your list of BLASTp hits.

Copy the full BLAST Lineage Report into the Annotathon 'Taxonomy report' "Raw results" field. Only include the first chapter, called Lineage Report:

---------------------------------------------------------------------------------------------------

RAW RESULTS:

Lineage report

.LUCA

. Bacteria

. .Cyanobacteria

. . Prochlorales

. . .Prochlorococcaceae

. . . Prochlorococcus

. . . .Prochlorococcus marinus str. MIT 9515........ 315  4e-103 2 hits  Bacteria:Cyanobacteria:Prochlorales:     phytoene desaturase [Prochlorococcus mari...  

. . . .Prochlorococcus marinus str. MIT 9301........ 305  3e-99  2 hits  Bacteria:Cyanobacteria:Prochlorales:     phytoene desaturase [Prochlorococcus mari...  

. . . .Prochlorococcus marinus str. MIT 9215........ 303  8e-99  2 hits  Bacteria:Cyanobacteria:Prochlorales:     phytoene desaturase [Prochlorococcus mari...  

. . . .Prochlorococcus marinus str. AS9601.......... 301  4e-98  2 hits  Bacteria:Cyanobacteria:Prochlorales:     phytoene desaturase [Prochlorococcus mari...  

. . . .Prochlorococcus marinus str. NATL1A.......... 261  2e-82  2 hits  Bacteria:Cyanobacteria:Prochlorales:     phytoene desaturase [Prochlorococcus mari...  

. . . .Prochlorococcus marinus str. MIT 9303........ 249  1e-77  2 hits  Bacteria:Cyanobacteria:Prochlorales:     phytoene desaturase [Prochlorococcus mari...  

. . Synechococcus sp. WH 8109....................... 251  1e-78  1 hit   Bacteria:Cyanobacteria:Chroococcales:    Carotene 7,8-desaturase [Synechococcus sp. WH ...  

. . Synechococcus sp. WH 7803....................... 251  2e-78  3 hits  Bacteria:Cyanobacteria:Chroococcales:    phytoene dehydrogenase [Synechococcus sp....  

. . Synechococcus sp. CB0205........................ 250  3e-78  1 hit   Bacteria:Cyanobacteria:Chroococcales:    15-cis-phytoene desaturase [Synechococcus...  

. . Synechococcus sp. BL107......................... 250  3e-78  2 hits  Bacteria:Cyanobacteria:Chroococcales:    15-cis-phytoene desaturase [Synechococcus...  

. . Synechococcus sp. WH 8016....................... 250  4e-78  2 hits  Bacteria:Cyanobacteria:Chroococcales:    15-cis-phytoene desaturase [Synechococcus...  

. . Synechococcus sp. CC9311........................ 250  4e-78  6 hits  Bacteria:Cyanobacteria:Chroococcales:    phytoene desaturase [Synechococcus sp. CC931...  

. . Synechococcus sp. RS9916........................ 249  1e-77  2 hits  Bacteria:Cyanobacteria:Chroococcales:    15-cis-phytoene desaturase [Synechococcus...  

. . Synechococcus sp. CB0101........................ 248  2e-77  1 hit   Bacteria:Cyanobacteria:Chroococcales:    15-cis-phytoene desaturase [Synechococcus...  

. . Synechococcus sp. RCC307........................ 236  2e-72  3 hits  Bacteria:Cyanobacteria:Chroococcales:    phytoene dehydrogenase [Synechococcus sp....  

. . Synechococcus sp. PCC 7002...................... 217  2e-65  3 hits  Bacteria:Cyanobacteria:Chroococcales:    phytoene dehydrogenase [Synechococcus sp....  

. . Cyanobium sp. PCC 7001.......................... 249  7e-78  2 hits  Bacteria:Cyanobacteria:Chroococcales:    15-cis-phytoene desaturase [Cyanobium sp....  

. . Crocosphaera watsonii........................... 231  1e-70  1 hit   Bacteria:Cyanobacteria:Chroococcales:    15-cis-phytoene desaturase [Crocosphaera ...  

[...]

Under the Taxonomy report "Results Analysis" section, build (with the Taxonomy List tool) a summary table of your observations as follows:

Table 5: Overview of taxonomic classifications of homologs identified by BLASTp vs NR

| Phylum                | Class             | Order                    | E-value range  | Number of results |
|-----------------------|-------------------|--------------------------|----------------|-------------------|
| [Firmicutes]          | [Clostridia]      | [Clostridiales]          | 5e-61 - 8.7    | 117               |
|                       |                   | [Thermoanaerobacterales] | 7e-39 - 0.21   | 9                 |
|                       |                   | [Natranaerobiales]       | 2e-27          | 1                 |
|                       |                   | [Halanaerobiales]        | 3e-04 - 0.56   | 3                 |
|                       | [Erysipelotrichi] | [Erysipelotrichales]     | 2e-43 - 5.1    | 12                |
|                       | [Bacilli]         | [Bacillales]             | 3e-38 - 9.7    | 104               |
|                       |                   | [Lactobacillales]        | 9e-25 - 4.9    | 109               |
|                       | [Negativicutes]   | [Selenomonadales]        | 2e-18 - 9.3    | 14                |
| [Deinococcus-Thermus] | [Deinococci]      | [Deinococcales]          | 2e-50 - 1e-15  | 2                 |
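
The tally behind a table like Table 5 can be sketched in a few lines of Python. This is only an illustration, not the Taxonomy List tool itself: the regular expression assumes that leaf lines of the lineage report look like the Prochlorococcus and Synechococcus lines shown earlier, and the "number of results" is assumed to be the sum of the "hits" column. The names `LEAF` and `summarise` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed layout of a leaf line in the lineage report:
# <dots/spaces><organism><padding dots> <score> <E-value> <n> hit(s) <lineage> <description>
LEAF = re.compile(
    r"^[.\s]*(?P<organism>\S.*?)\.{2,}\s+(?P<score>\d+)\s+(?P<evalue>\S+)\s+"
    r"(?P<hits>\d+)\s+hits?\s+(?P<lineage>\S+)"
)

def summarise(report_text):
    """Return {lineage: (best E-value, worst E-value, total hits)}."""
    groups = defaultdict(lambda: {"hits": 0, "evalues": []})
    for line in report_text.splitlines():
        match = LEAF.match(line)
        if match is None:
            continue  # internal tree nodes such as ". Bacteria" carry no score
        key = match.group("lineage").strip(":")
        groups[key]["hits"] += int(match.group("hits"))
        groups[key]["evalues"].append(float(match.group("evalue")))
    return {
        key: (min(g["evalues"]), max(g["evalues"]), g["hits"])
        for key, g in groups.items()
    }

# Three leaf lines copied from the example lineage report.
sample = """\
. . . .Prochlorococcus marinus str. MIT 9515........ 315  4e-103 2 hits  Bacteria:Cyanobacteria:Prochlorales:     phytoene desaturase [Prochlorococcus mari...
. . . .Prochlorococcus marinus str. MIT 9301........ 305  3e-99  2 hits  Bacteria:Cyanobacteria:Prochlorales:     phytoene desaturase [Prochlorococcus mari...
. . Synechococcus sp. WH 8109....................... 251  1e-78  1 hit   Bacteria:Cyanobacteria:Chroococcales:    Carotene 7,8-desaturase [Synechococcus sp. WH ...
"""

for lineage, (best, worst, hits) in summarise(sample).items():
    print(f"{lineage:45s} {best:.0e} - {worst:.0e} {hits:4d}")
```

Grouping by the full lineage string keeps the sketch independent of how many ranks the report prints; for a real table you would still split the lineage into phylum, class and order columns by hand.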

Use the BLAST results (the lineage report is your friend here) to build two groups of homolog sequences which will serve, after multiple alignment, as a basis for phylogenetic tree reconstruction:

a study/in group (around 20-30 sequences) that represents homologs belonging to the same taxonomic group as your ORF. These sequences are sampled from across the list of homologs, ranging from the best hit (the first homolog) to the weakest hit (the last homolog)
an external/out group (around 10-15 sequences) that represents the closest homologs not belonging to the study group (it will serve to root the phylogenetic tree). The out group is usually derived mechanically by going one node up from the selected in group in the tree of life. This is arguably the most difficult part of the whole annotation procedure! You must read the Frequently Asked Questions section concerning in/out-group selection!

IMPORTANT: Remember that ALL sequences selected for inclusion in the study and external groups must be homologs of your ORF, i.e. their BLAST E-value must be below the E-value threshold determined above.

IMPORTANT: Include under the RESULTS ANALYSIS heading of the Taxonomy report the COMPREHENSIVE list of all the sequences you have selected in the study and external groups: for each sequence, provide its accession number, the short name you have chosen for it (see below Multiple alignment of protein sequences), its BLAST E-value and score, and its taxonomic group. You are welcome to use the headers of the FASTA files you obtained from the Tax Report for this purpose. For instance:

PROTOCOL:

TaxReport tool (http://oceans.embl.de/Annotathon_outils/blast_tax_report2.php) of BLASTp versus NR, default parameters

RESULTS ANALYSIS: [ insert here your synoptic table (see above) ] [ write here your description of the taxonomy report, and justify your INGROUP and your OUTGROUP. Do mention the E-value differential between the in- and out-groups! Then list, as shown below, the sequences you selected to represent your in- and out-groups: ]

In-group: Cyanobacteria

>Bac_Cya_Pro_3 [Bacteria Cyanobacteria Prochlorales]  E-value=1e-15  Bacteria;Cyanobacteria;Prochlorales;Prochlorococcaceae;Prochlorococcus; gi|488894830|ref|WP_002805954.1| zeta-carotene desaturase [Prochlorococcus marinus] 
>Bac_Cya_Chr_2 [Bacteria Cyanobacteria Chroococcales]  E-value=7e-78  Bacteria;Cyanobacteria;Chroococcales;Cyanobium; gi|493968054|ref|WP_006911325.1| 15-cis-phytoene desaturase [Cyanobium sp. PCC 7001] 
>Bac_Cya_Chr_3 [Bacteria Cyanobacteria Chroococcales]  E-value=1e-70  Bacteria;Cyanobacteria;Chroococcales;Crocosphaera; gi|494523610|ref|WP_007313063.1| 15-cis-phytoene desaturase [Crocosphaera watsonii] 
>Bac_Cya_Chr_4 [Bacteria Cyanobacteria Chroococcales]  E-value=9e-68  Bacteria;Cyanobacteria;Chroococcales;Cyanothece; gi|218438147|ref|YP_002376476.1| phytoene desaturase [Cyanothece sp. PCC 7424] 
>Bac_Cya_Chr_5 [Bacteria Cyanobacteria Chroococcales]  E-value=1e-64  Bacteria;Cyanobacteria;Chroococcales;Synechocystis; gi|16330439|ref|NP_441167.1| phytoene desaturase [Synechocystis sp. PCC 6803] 
>Bac_Cya_Osc_1 [Bacteria Cyanobacteria Oscillatoriales]  E-value=3e-72  Bacteria;Cyanobacteria;Oscillatoriales; gi|497454285|ref|WP_009768483.1| phytoene desaturase [Oscillatoriales cyanobacterium JSC-12] 
>Bac_Cya_Osc_3 [Bacteria Cyanobacteria Oscillatoriales]  E-value=1e-16  Bacteria;Cyanobacteria;Oscillatoriales;Microcoleus; gi|493682519|ref|WP_006632676.1| zeta-carotene desaturase [Microcoleus vaginatus] 
>Bac_Cya_Nos_1 [Bacteria Cyanobacteria Nostocales]  E-value=1e-70  Bacteria;Cyanobacteria;Nostocales;Nostocaceae;Trichormus; gi|298491654|ref|YP_003721831.1| phytoene desaturase ['Nostoc azollae' 0708] 
>Bac_Cya_Nos_2 [Bacteria Cyanobacteria Nostocales]  E-value=5e-14  Bacteria;Cyanobacteria;Nostocales;Nostocaceae;Trichormus; gi|298492908|ref|YP_003723085.1| carotene 7,8-desaturase ['Nostoc azollae' 0708] 
>Bac_Cya_Nos_3 [Bacteria Cyanobacteria Nostocales]  E-value=2e-70  Bacteria;Cyanobacteria;Nostocales;Nostocaceae;Anabaena; gi|414079384|ref|YP_007000808.1| phytoene desaturase [Anabaena sp. 90] 
>Bac_Cya_Sti_1 [Bacteria Cyanobacteria Stigonematales]  E-value=2e-68  Bacteria;Cyanobacteria;Stigonematales;Fischerella; gi|497072507|ref|WP_009458406.1| 15-cis-phytoene desaturase [Fischerella]

Out-group: other bacteria which are not Cyanobacteria (Proteobacteria, Chloroflexi, Chlorobi, Acidobacteria, ....)

>Bac_Chl_Chl_1 [Bacteria Chloroflexi Chloroflexales]  E-value=3e-32  Bacteria;Chloroflexi;Chloroflexales;Chloroflexaceae;Chloroflexus; gi|163847906|ref|YP_001635950.1| carotene 7,8-desaturase [Chloroflexus aurantiacus J-10-fl] 
>Bac_Chl_Chl_2 [Bacteria Chlorobi Chlorobia]  E-value=2e-30  Bacteria;Chlorobi;Chlorobia;Chlorobiales;Chlorobiaceae;Chlorobaculum; gi|193212415|ref|YP_001998368.1| carotene 7,8-desaturase [Chlorobaculum parvum NCIB 8327] 
>Bac_Aci_Can_1 [Bacteria Acidobacteria Candidatus Chloracidobacterium]  E-value=2e-27  Bacteria;Acidobacteria;Candidatus Chloracidobacterium; gi|347753771|ref|YP_004861335.1| hypothetical protein [Candidatus Chloracidobacterium thermophilum B] 
>Bac_Fir_Bac_1 [Bacteria Firmicutes Bacillales]  E-value=2e-14  Bacteria;Firmicutes;Bacillales;Bacillaceae;Bacillus; gi|407961641|dbj|BAM54881.1| zeta-carotene desaturase [Bacillus subtilis BEST7613]
>Bac_Pla_Pla_1 [Bacteria Planctomycetes Planctomycetacia]  E-value=2e-11  Bacteria;Planctomycetes;Planctomycetacia;Planctomycetales;Planctomycetaceae;Singulisphaera; gi|430745940|ref|YP_007205069.1|
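
As a rough aid for compiling this list, the sketch below pulls the label, taxonomic group, E-value and accession out of headers shaped like the ones above, and keeps only those below a homology threshold. The header layout is inferred from the examples, the threshold value is purely hypothetical (use the one you determined from your own BLAST results), and `HEADER`, `parse_header` and `select_homologs` are names invented for this illustration.

```python
import re

# Assumed header layout, inferred from the example headers:
# >label [taxonomic group]  E-value=x  lineage; gi|number|db|accession| description
HEADER = re.compile(
    r"^>(?P<label>\S+)\s+\[(?P<group>[^\]]*)\]\s+E-value=(?P<evalue>\S+)"
    r".*?gi\|\d+\|\w+\|(?P<accession>[^|\s]+)\|"
)

def parse_header(line):
    """Return the fields of one FASTA header as a dict, or None if it does not match."""
    match = HEADER.match(line)
    if match is None:
        return None
    return {
        "label": match.group("label"),
        "group": match.group("group"),
        "evalue": float(match.group("evalue")),
        "accession": match.group("accession"),
    }

def select_homologs(headers, threshold):
    """Keep only the parsed headers whose E-value is below the homology threshold."""
    parsed = (parse_header(header) for header in headers)
    return [p for p in parsed if p is not None and p["evalue"] < threshold]

headers = [
    ">Bac_Cya_Chr_2 [Bacteria Cyanobacteria Chroococcales]  E-value=7e-78  "
    "Bacteria;Cyanobacteria;Chroococcales;Cyanobium; "
    "gi|493968054|ref|WP_006911325.1| 15-cis-phytoene desaturase [Cyanobium sp. PCC 7001]",
    ">Bac_Fir_Bac_1 [Bacteria Firmicutes Bacillales]  E-value=2e-14  "
    "Bacteria;Firmicutes;Bacillales;Bacillaceae;Bacillus; "
    "gi|407961641|dbj|BAM54881.1| zeta-carotene desaturase [Bacillus subtilis BEST7613]",
]

THRESHOLD = 1e-20  # hypothetical value, for illustration only
for entry in select_homologs(headers, THRESHOLD):
    print(entry["accession"], entry["label"], entry["evalue"], entry["group"], sep="\t")
```

The BLAST score is not part of these headers, so it still has to be copied from the BLAST output itself.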

Multiple sequence alignment

The aim of the multiple alignment is first to verify that the ORF integrates convincingly in its presumed homolog family: the alignment must hence present clear, well conserved regions. Secondly, the multiple alignment will serve as the basis for the phylogenetic tree inference: the alignment must therefore suggest a sufficient number of mutations (informative positions) to allow the reconstruction of the evolutionary history! Beware not to include sequences that are too partial, as these can dramatically reduce the number of informative positions in the alignment.

It is common to have to reiterate the building of the multiple alignment many times, adding or taking away more or less divergent sequences, in order to finally obtain a satisfactory result.

IMPORTANT: before proceeding to the multiple alignment, make sure each sequence in your FASTA file has a legible label, so that both the alignment and the phylogenetic tree are easy to read. If you have obtained your sequence FASTA from the TaxReports tool, the sequences should already have legible labels (in red, just after the ">" sign and before the first space). It is crucial that your sequence labels are unique, or the following steps (multiple alignments and tree) will likely fail!

>AEMMMM1 [Archaea Euryarchaeota Methanomicrobia Methanosarcinales Methanosarcinaceae Methanosarcina] E-value=1e-85 Archaea; Euryarchaeota; Methanomicrobia; Methanosarcinales; Methanosarcinaceae; Methanosarcina; gi|851310952|ref|WP_048174166.1| UDP-glucose 4-epimerase [Methanosarcina siciliae]
MSFNLADYAELLEDLSPHSQNALQANWHEATKVFSPRGLDNYLKGAAAIRGLGKGDSLVETWIEKAPMVAKEVGEDVVGD
LATASLELASRTSGTVIELLLATSAIAANRLGDAELFIKYLQFINTLIAQAPRGVRPMLDKLEVLFQHLTLGGLRRWALW
GAHAHRTNYEEQIRYFSLDSKESMAMLQKERKGTLLVDVQRRINMYLRALWARDFFMRPTSGDFETREGYRPYIEDYLLH
VPDAFDDFTVEGQEPVSGLELYRATAAHCAAHVVETKLPISAEALNPMQIAVISVIEDARVETLSIRRFPGLKQLWSKLH
TATPEMNGSMGDYLNRLARALLDESYKDKDPWIVEARALFALAQEKLDSNLTSWDIGVQLAHSFGQKRIPFNPRTDLLTA
PYRDDNRYFWEFEEFDFNKAASAGYESIKQVRKYVSVMEMANEIDVETAGDDAEEIWVLGTELFPYENIGDESGGKSFNE
LEGKEPVSDPFHYSEWDYQIQLERPAWATVLEKRAKAGDLQIIEAITAQYKREIHRMKFLLDAMQPQGVQRIRRLEDGDE
IDINAAISSLTDIRLGNQPDPRIMMRSVRKTRDFSILVLLDLSESTNEKVQDQEYSVRELTQQACVLLADAINKVGDPFA
IHGFCSDGRHDVEYYRFKDFDQHWDETPKSRLAGMTGQLSTRMGAAIRHAGHHLQLQRSAKKLLIVITDGEPADVDVRDP
QYLRYDTKKAVEEVAKLGVTTYCMSLDPRADNYVSRIFGQKNYMVVDHVQRLPEKLPLLYAGLTR

Note that the sequence label "AEMMMM1" is made up of the first letter of each of the six classification levels (Archaea, Euryarchaeota, Methanomicrobia, Methanosarcinales, Methanosarcinaceae, Methanosarcina), followed by a number. It can also be useful to distinguish in-group from out-group sequences by adding "ex" to the out-group labels, as follows:

>exAEMMMM1 [Archaea Euryarchaeota Methanomicrobia Methanosarcinales Methanosarcinaceae Methanosarcina] E-value=1e-85 Archaea; Euryarchaeota; Methanomicrobia; Methanosarcinales; Methanosarcinaceae; Methanosarcina; gi|851310952|ref|WP_048174166.1| UDP-glucose 4-epimerase [Methanosarcina siciliae]
MSFNLADYAELLEDLSPHSQNALQANWHEATKVFSPRGLDNYLKGAAAIRGLGKGDSLVETWIEKAPMVAKEVGEDVVGD
LATASLELASRTSGTVIELLLATSAIAANRLGDAELFIKYLQFINTLIAQAPRGVRPMLDKLEVLFQHLTLGGLRRWALW
GAHAHRTNYEEQIRYFSLDSKESMAMLQKERKGTLLVDVQRRINMYLRALWARDFFMRPTSGDFETREGYRPYIEDYLLH
VPDAFDDFTVEGQEPVSGLELYRATAAHCAAHVVETKLPISAEALNPMQIAVISVIEDARVETLSIRRFPGLKQLWSKLH
TATPEMNGSMGDYLNRLARALLDESYKDKDPWIVEARALFALAQEKLDSNLTSWDIGVQLAHSFGQKRIPFNPRTDLLTA
PYRDDNRYFWEFEEFDFNKAASAGYESIKQVRKYVSVMEMANEIDVETAGDDAEEIWVLGTELFPYENIGDESGGKSFNE
LEGKEPVSDPFHYSEWDYQIQLERPAWATVLEKRAKAGDLQIIEAITAQYKREIHRMKFLLDAMQPQGVQRIRRLEDGDE
IDINAAISSLTDIRLGNQPDPRIMMRSVRKTRDFSILVLLDLSESTNEKVQDQEYSVRELTQQACVLLADAINKVGDPFA
IHGFCSDGRHDVEYYRFKDFDQHWDETPKSRLAGMTGQLSTRMGAAIRHAGHHLQLQRSAKKLLIVITDGEPADVDVRDP
QYLRYDTKKAVEEVAKLGVTTYCMSLDPRADNYVSRIFGQKNYMVVDHVQRLPEKLPLLYAGLTR
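
If you need to build such labels yourself, or want to check that they are unique before launching the alignment, something along the following lines may help. It is a minimal sketch: `make_label` and `duplicate_labels` are hypothetical helper names, and the labelling rule (first letter of each classification level, a running number, optional "ex" prefix) is simply read off the two examples above.

```python
from collections import Counter

def make_label(lineage, index, outgroup=False):
    """First letter of each lineage level plus a running number; 'ex' marks the out-group."""
    initials = "".join(level[0].upper() for level in lineage if level)
    prefix = "ex" if outgroup else ""
    return f"{prefix}{initials}{index}"

def duplicate_labels(fasta_text):
    """Return the sorted labels (text between '>' and the first space) that occur more than once."""
    labels = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                labels.append(parts[0])
    return sorted(label for label, count in Counter(labels).items() if count > 1)

lineage = ["Archaea", "Euryarchaeota", "Methanomicrobia",
           "Methanosarcinales", "Methanosarcinaceae", "Methanosarcina"]
print(make_label(lineage, 1))                 # in-group style label
print(make_label(lineage, 1, outgroup=True))  # out-group style label

# Toy FASTA with a deliberately repeated label.
toy_fasta = ">A1 first\nMKT\n>B1 second\nMLS\n>A1 third\nMMT\n"
print(duplicate_labels(toy_fasta))
```

An empty list from `duplicate_labels` means every label is unique; anything else should be renamed before running the alignment.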

Build a multiple alignment (including all the in and out group sequences, as well as your ORF, naturally) using an online version of one of the following programs: ClustalW (widely used), MUSCLE (fast and a little more efficient) or T-COFFEE (a slower but highly robust method with very useful colored conserved alignment blocks). These methods are available on the web sites of:

Phylogeny.fr (recommended)
EBI

The limit on the number of sequences to align is simply due to the computation time of multiple alignment programs, as well as of the subsequent phylogenetic tree reconstruction. Computation time is reasonable up to around thirty to fifty sequences of a few hundred residues.

Copy & paste the "ClustalW" formatted multiple alignment in the 'Multiple Alignement' Annotathon field.

Also copy & paste the full multiple alignment obtained after curation (GBlocks output) in the "Raw Results" section. Please make sure you include the end (footer) of the GBlocks output as this contains crucial estimates of the number of informative positions in your alignment.

RESULTS ANALYSIS:

1. Quality of the multiple alignment 
   -> Can you confirm that the sequences are really homologs? Similar lengths? How many identical positions? How many conservative substitution positions? Number of indels? Does the conservation of sequences within the alignment reflect the subgroups (in and out groups)? 
   -> After curation with GBLOCKS, what is the number of conserved homolog positions (informative sites) for phylogenetic reconstruction? Is it enough? 

2. Identification of conserved blocks
    -> You can annotate well conserved blocks in your alignment with codes (such as A, B, C etc.) and refer to them in your analysis. 
    -> Are there any conserved amino acids that are known as active sites for this protein family? If yes, position in alignment, function, activity? 

3. N and C-termini of the studied ORF
   -> Analysis of the N-ter/C-term of the alignment (complete? start codon? potentially missing number of amino acids in N and C-termini?)
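
To put rough numbers on some of these questions, the columns of the alignment can be counted with a short script. The sketch below assumes the aligned sequences are already available as equal-length strings with "-" for gaps; the function name `column_stats` and the three toy sequences are invented for illustration. It does not reproduce GBlocks curation, and it does not distinguish conservative from non-conservative substitutions.

```python
def column_stats(aligned):
    """Count identical, gapped and variable columns in a dict {label: aligned sequence}."""
    seqs = list(aligned.values())
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all aligned sequences must have the same length")
    identical = gapped = 0
    for column in zip(*seqs):
        if "-" in column:
            gapped += 1          # at least one sequence has an indel here
        elif len(set(column)) == 1:
            identical += 1       # same residue in every sequence
    return {
        "length": length,
        "identical": identical,
        "gapped": gapped,
        "variable": length - identical - gapped,
    }

# Toy alignment, not real data.
toy_alignment = {
    "ORF": "MKT-LV",
    "H1":  "MKSALV",
    "H2":  "MRTALV",
}
print(column_stats(toy_alignment))
```

Running the same count separately on the in-group and on the out-group gives a quick feel for whether conservation follows the subgroups.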

Phylogenetic tree

Use the above multiple alignment to infer a phylogenetic tree using two distinct tree reconstruction approaches:

'distance' method (e.g. 'neighbor-joining (NJ)', 'BioNJ' or 'Phylip protdist/neighbor')
'maximum likelihood' method (e.g. 'PhyML')

You can use:

phylogeny.fr dedicated service (recommended, includes both the BioNJ and PhyML programs)
Pasteur Institute Mobyle online portal (includes the BioNJ and Phylip protdist/neighbor programs).

Please refer to the Frequently Asked Questions for further details and screen shots on running phylogenetic analyses.

IMPORTANT NOTICE: please use this specific rerooting tool to re-root your phylogenetic trees in "TEXT" format (the re-rooting of trees on the phylogeny.fr website is not 100% functional!). This tool will allow you to retain the node robustness values, as well as control the width of the tree.

Copy & paste the textual tree representation in the 'Tree' Annotathon field. Remember to include a protocol line in the 'Tree' field that includes the program name and run parameters (e.g. 'Phylip / Protdist+neighbor / Randomized input - Random number seed = 11 / rooted on: Coccidioides immitis (ascomycetes)').

PROTOCOL:

a) Phylogeny.fr / PhyML method / no bootstrap / default substitution model / out group: Firmicutes

b) Phylogeny.fr / BioNJ method / out group: Firmicutes

RESULTS ANALYSIS:

[...]

RAW RESULTS:

a)PhyML

                                                          ,--------------+ BARSPP1 Bacteria Actinobacteria Rubrobacteridae Solirubrobactera
                                                  ,-------+ 0.92
                                                  |       '--------------------+ BAAAPP1 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                                  |
                                                  |          ,-------------+ BAAACN1 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                                  |          |
                                                  |     ,----+ 0.87           ,-------+ BAAAMI1 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                                  |     |    '----------------+ 0.99
                                                  |,----+ 0.85                '------+ BAAACN2 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                                  ||    |
                                                  ||    '-----------------+ BAAASS1 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                                  ||
                                                  ||                                                ,-----+ BAAAPP14 Bacteria Actinobacteria Actinobacteridae Actinomycetale
                                                  ||                                            ,---+ 0.52
                                               ,--+|0.69                            ,-----------+ 0.97------------+ BAAAPP8 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                               | ||                                |           |
                                               | ||                      ,---------+ 0.98      '--------+ BAAAPP10 Bacteria Actinobacteria Actinobacteridae Actinomycetale
                                               | ||                      |         |
                                               | ||                      |         '---------+ BAAAPP2 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                               | ||                ,-----+ 0.9
                                               | ||                |     |     ,------------------+ BAAAPP13 Bacteria Actinobacteria Actinobacteridae Actinomycetale
                                               | ||                |     |     |
                                               | ||            ,---+ 0.81'-----+ 0.83     ,----------------+ BAAAPP4 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                               | ||            |   |           '----------+ 0.92
                                               | ||            |   |                      '-------------+ BAAAPP3 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                               | ||            |   |
                                               | ||        ,---+ 0.74-------------------------+ BAAAPN1 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                               | ||        |   |
                                               | '+ 0.55   |   |   ,----------------------+ BAAAMM3 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                               |   |        |   |   |
                                               |   |        |   '---+ 0.82           ,-------+ BAAAPP7 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                               |   |        |       |       ,--------+ 0.9
                                               |   |        |       '-------+ 0.89   '---+ BAAAPP5 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                               |   |        |               |
                                               |   |        |               '------------------------+ BAAAMM4 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                               |   |        |
                                               |   |   ,----+ 0.82   ,---------------------+ BAAASS2 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                               |   |   |    |        |
                                               |   |   |    |    ,---+ 0.23     ,----------+ BAAAMI3 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                               |   |   |    |    |   '----------+ 0.94
                                               |   |   |    |    |              '----+ BAAAMI2 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                           ,---+ 0.76 |    |    |
                                           |   |   |   |    |    |                    ,+ BAAAMM2 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                           |   |   |   |    |    |    ,---------------+ 1
                                           |   |   |   |    |    |    |               ' BAAAMM1 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                           |   |   |   |    '----+ 0.83
                                           |   |   |   |         |    |      ,------------------+ BAAAPP9 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                           |   |   |   |         | ,--+ 0.71 |
                                           |   |   '---+ 0.81    | | | ,----+ 0.88----------+ BAAAPP6 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                           |   |       |         | | | |    | |
                                           |   |       |         | | | |    '-+ 0.39----------+ BAAAPP11 Bacteria Actinobacteria Actinobacteridae Actinomycetale
                                           |   |       |         | | '-+ 0.7 '--+ 0.076
                                           |   |       |         '-+ 0.35         '-----------------------+ BAAAAA2 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                           |   |       |           |    |
                                           |   |       |           |    '------------------------+ BAAAPP12 Bacteria Actinobacteria Actinobacteridae Actinomycetale
                                           |   |       |           |
                                           |   |       |           |    ,--------------+ BAAAFG1 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                           |   |       |           '----+ 0.52
                                  ,--------+ 0.83 D    |                '---------+ BAAAST1 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                  |        |   |       |
                                  |        |   |       '---------------------------------+ BAAACN3 Bacteria Actinobacteria Actinobacteridae Actinomycetales
                                  |        |   |
                                  |        |   |                   ,-------------+ BAA1 Bacteria Actinobacteria Acidimicrobidae E-value3e-75 Bacte
                                  |        |   |            ,------+ 0.2 J
                                  |        |   |            |      '------------------------+ BA3 Bacteria Actinobacteria E-value1e-68 Bacteria Actinobacteri
                                  |        |   |     ,------+ 0.82
                                  |        |   |     |      |      ,-------+ BA2 Bacteria Actinobacteria E-value7e-82 Bacteria Actinobacteri
                                  |        |   |     |      '------+ 0.84   I
                       ,----------+ 0.83 C |   '-----+ 0.88        '-----------------+ BAC1 Bacteria Actinobacteria Candidatus Microthrix E-value4e-74
                       |          |        |         |
                       |          |        |         |       ,----+ BA1 Bacteria Actinobacteria E-value5e-101 Bacteria Actinobacter
                       |          |        |         '-------+ 0.94    E
                       |          |        |                 '-+ ORF7 Translation of ORF number 2 in reading frame 3 on the rever
                       |          |        |
,---------------------+ 1 B      |        |           ,------------------+ exBCCCCC1 Bacteria Chloroflexi Caldilineae Caldilineales Caldili
|                     |          |        '-----------+ 0.91
|                     |          |                    '------------------------------------------+ exBFLSS1 Bacteria Firmicutes Lactobacillales Streptococcaceae St
|                     |          |
|                     |          '-----------------------------+ exBCNSS1 Bacteria Cyanobacteria Nostocales Scytonemataceae Scyto
|                     |
|                     '------------------------+ exBCPPP1 Bacteria Cyanobacteria Prochlorales Prochlorococcaceae
|
|                                                  ,-------------------------------------------------------------+ exBFBBA1 Bacteria Firmicutes Bacillales Bacillaceae Anoxybacillu
|                                         ,--------+ 0.77
|                                  ,------+ 0.49   '------------------------------+ exBFBPV1 Bacteria Firmicutes Bacillales Planococcaceae Viridibac
|                                  |      |
|                                  |      '---------------------------------------------+ exBFCCPD2 Bacteria Firmicutes Clostridia Clostridiales Peptococc
|                                  |
=+ A                               |                 ,-----------------------+ exBFCCPD1 Bacteria Firmicutes Clostridia Clostridiales Peptococc
|                                  |                 |
|                                  |   ,-------------+ 0.68                               ,+ exBFBPAA2 Bacteria Firmicutes Bacillales Paenibacillaceae Aneuri
|                         ,--------+ 0.75            '------------------------------------+ 1
|                         |        |   |                                                  '------+ exBFBPAA1 Bacteria Firmicutes Bacillales Paenibacillaceae Aneuri
|                         |        |   |
|                         |        | ,+ 0                                  ,--------------+ exBFBPS1 Bacteria Firmicutes Bacillales Planococcaceae Sporosarc
|                         |        | ||                         ,----------+ 0.85
|                         |        | ||                 ,-------+ 0.72     '--------------------+ exBFBBV1 Bacteria Firmicutes Bacillales Bacillaceae Virgibacillu
|                         |        | ||                 |       |
|                         |        '--+'0.69-------------+ 0.9   '-------------+ exBFBBC1 Bacteria Firmicutes Bacillales Bacillaceae Caldalkaliba
|                         |           |                  |
|                         |           |                  '------------------------+ exBFBBP1 Bacteria Firmicutes Bacillales Bacillaceae Pontibacillu
|                         |           |
'-------------------------+ 1         '-------------------------------------+ exBFCC1 Bacteria Firmicutes Clostridia Clostridiales E-value7e-
                           |
                           |                                         ,---------------+ exBFBPU1 Bacteria Firmicutes Bacillales Planococcaceae Ureibacil
                           |                                      ,--+ 0.54
                           |                                      | '-------------------+ exBFBPS2 Bacteria Firmicutes Bacillales Planococcaceae Solibacil
                           |                     ,----------------+ 0.96
                           |                     |                |          , exBFBBL2 Bacteria Firmicutes Bacillales Bacillaceae Lysinibacill
                           |                     |                '----------+ 0.91
                           |      ,--------------+ 0.88                      '----+ exBFBBL1 Bacteria Firmicutes Bacillales Bacillaceae Lysinibacill
                           |      |              |
                           '------+ 0.78         '--------------------------------------+ exBFBPK1 Bacteria Firmicutes Bacillales Planococcaceae Kurthia E
                                  |
                                  |                               , exBFCC2 Bacteria Firmicutes Clostridia Clostridiales E-value8e-
                                  '-------------------------------+ 0.99
                                                                  '-+ exBFCCCP1 Bacteria Firmicutes Clostridia Clostridiales Clostridi

|--------------------------|---------------------------|--------------------------|---------------------------|---
0                       0.25                         0.5                       0.75                           1
substitutions/site

b) BioNJ [...]

Important: Use codes such as "A, B, ..." (or colored) to locate major branches in your trees and refer to them within your analysis

Suggested plan for the analysis section:

1. Tree topologies
    -> Describe the topology of each tree. What are the monophyletic groups?
    -> Do the two independent trees describe the same evolutionary history? The same topology? Similar or different clades? 
    -> Identify the commonalities as well as the potential inconsistencies.

2. Coherence with reference trees
    -> Are the in- and out-groups correctly separated?
    -> Are your gene trees coherent with the reference species trees ("tree of life")? 
    -> Identify each discrepancy with the reference species tree, and suggest some explanations (HGTs, gene duplications...).

3. Predict the most likely taxonomic origin of the metagenomic ORF
    -> In which monophyletic clade does the metagenomic sequence seem to emerge?
    -> Propose a hypothetical taxonomic classification for the metagenomic ORF!
    -> Provide detailed justification of your hypothesis, do not under/over interpret the inferred phylogenetic trees!
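
Some of these checks can be scripted if the rooted tree is also available in Newick format; that is an assumption here, since the guide itself only asks for the textual rendering. The sketch below uses a deliberately minimal parser that keeps leaf names and discards branch lengths and support values. The function names and the toy tree are invented for illustration (the leaf labels are borrowed from the example tree, the branch lengths are not real).

```python
import re

def parse_newick(text):
    """Parse a Newick string into nested lists of leaf names (lengths and supports dropped)."""
    tokens = re.findall(r"[(),;]|[^(),;]+", text.strip())
    stack, current, prev = [], [], None
    for tok in tokens:
        tok = tok.strip()
        if not tok:
            continue
        if tok == "(":
            stack.append(current)
            current = []
        elif tok == ")":
            node, current = current, stack.pop()
            current.append(node)
        elif tok not in (",", ";") and prev != ")":
            current.append(tok.split(":")[0].strip())  # leaf; text after ")" is support/length
        prev = tok
    return current[0]

def leaves(node):
    """Set of leaf names below a node."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result

def is_monophyletic(tree, group):
    """True if some node of the rooted tree has exactly `group` as its leaves."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)

def smallest_clade(tree, name):
    """Leaves of the smallest internal node containing `name`, or None if absent."""
    if isinstance(tree, str):
        return None
    for child in tree:
        if name in leaves(child):
            inner = smallest_clade(child, name)
            return inner if inner is not None else leaves(tree)
    return None

toy = "(((ORF7:0.01,BA1:0.05)0.94:0.1,BAC1:0.2)0.88:0.1,(exBFBBA1:0.6,exBFBPV1:0.3)0.77:0.2);"
tree = parse_newick(toy)
outgroup = {leaf for leaf in leaves(tree) if leaf.startswith("ex")}
print("out-group monophyletic:", is_monophyletic(tree, outgroup))
print("clade around ORF7:", sorted(smallest_clade(tree, "ORF7")))
```

Such a check only tells you what the topology says; the node support values still decide how much weight a given clade deserves.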

Taxonomy

After you have analysed the phylogenetic tree produced, specify the most likely taxonomic group (e.g. "Alphaproteobacteria") to which the organism carrying your DNA fragment belongs. To specify this group in the 'Taxonomy' Annotathon field you have two options:

specify in the 'NCBI numerical identifier' box the taxonomic group code (for instance 204455 for Rhodobacterales, these codes are found in GENBANK records in the feature table, such as /db_xref="taxon:204455" or can be found by querying the NCBI taxonomy database using the link in the Help tab)
specify the exact scientific name for this group in the 'Scientific name' box (e.g. Rhodobacterales)

Save your annotations and make sure that the one field above that you left blank has correctly been automatically populated; for instance if you chose to indicate "Alphaproteobacteria" in the "Scientific Name" box, once saved the code "28211" should appear in the "NCBI numerical identifier" box.

Note that the "NCBI numerical identifier" box has precedence over the "Scientific Name" box, so if you wish to change the taxonomic classification of your sequence you must delete the numeric code in order to enter a new value in the "Scientific Name" box.

Once the taxonomic group is correctly specified, the full lineage should appear:

Rhodobacterales
Rank: order - Genetic Code: Bacterial and Plant Plastid - NCBI Identifier: 204455
Kingdom: Bacteria - Phylum: Proteobacteria - Class: Alphaproteobacteria - Order: Rhodobacterales
Bacteria; Proteobacteria; Alphaproteobacteria; Rhodobacterales;

IMPORTANT: unless your DNA sequence is 100% identical to an existing GENBANK entry, you should probably not specify a precise species! Without further evidence, the precise taxonomic identity of the organism carrying the metagenomic DNA fragment cannot be determined, so specify as the likely taxonomic group the node immediately above your sequence in the phylogenetic tree.
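
One way to picture "the node immediately above your sequence" is as the most specific group shared by the lineages of the sequences that cluster with your ORF in the tree. The sketch below assumes you have already read those neighbouring leaves off the tree and looked up their lineages; the function name and the example lineages are hypothetical.

```python
def deepest_shared_group(lineages):
    """Return the longest common leading part of several taxonomic lineages.

    Each lineage is a list of group names ordered from the broadest
    (e.g. "Bacteria") to the most specific.
    """
    if not lineages:
        return []
    shared = []
    # zip stops at the shortest lineage, so ranks are compared position by position
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared

# Hypothetical lineages of the sequences forming a clade with the ORF
neighbours = [
    ["Bacteria", "Proteobacteria", "Alphaproteobacteria", "Rhodobacterales"],
    ["Bacteria", "Proteobacteria", "Alphaproteobacteria", "Rhizobiales"],
    ["Bacteria", "Proteobacteria", "Alphaproteobacteria", "Rhodobacterales"],
]
shared = deepest_shared_group(neighbours)
print("; ".join(shared))  # the last element is the most specific shared group
```

This only summarises what the tree already shows: a clade with weak support values, or one that mixes distant groups, still has to be discussed before a group is retained.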

Biological process & molecular function

When your ORF's homologs have known functions, or if the ORF presents known conserved domains, select from the available "Biological Process" & "Molecular Function" lists the terms that most specifically describe your proposed functional hypotheses for the ORF. These terms are a subset of the comprehensive and hierarchical "Gene Ontology", most often referred to as GO annotations:

Molecular Function: biochemical activity of the protein (e.g. kinase)
Biological Process: role of this activity in the cell (e.g. signal transduction)

These GO annotations are frequently assigned to records in well annotated databanks such as SWISSPROT or INTERPRO; use the GO terms associated with your ORF's closest homologs or conserved domains to help you assign the most appropriate terms.
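
To see which GO terms are best supported, it can help to count how many of the closest homologs carry each term. The sketch below assumes the terms have been copied by hand from the homolog records; the homolog labels and term assignments are invented for illustration, and the function name is hypothetical.

```python
from collections import Counter

def tally_go_terms(homolog_terms):
    """Count in how many homologs each GO term appears.

    homolog_terms maps a homolog identifier to the GO terms
    listed in its database record.
    """
    counts = Counter()
    for terms in homolog_terms.values():
        counts.update(set(terms))  # count each term once per homolog
    return counts.most_common()

# Hypothetical homologs and their GO terms
homologs = {
    "homolog_A": {"kinase activity", "signal transduction"},
    "homolog_B": {"kinase activity", "signal transduction"},
    "homolog_C": {"kinase activity"},
}
for term, n in tally_go_terms(homologs):
    print(f"{term}: {n}/{len(homologs)} homologs")
```

A term carried by most close homologs is a reasonable candidate, but the choice still needs to be justified with the underlying BLAST and conserved domain evidence.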

Gene symbol

In the event that your ORF has highly convincing homology with a family of well characterized proteins of known function, whose gene symbol nomenclature appears uniform and stable, you can propose in the Gene symbol field a putative gene symbol for your ORF. If the homologs have no gene symbol, or if their symbols vary to a large extent, do not invent a new symbol, just leave this field empty!

Conclusion

This field is central to your evaluation: write up your interpretations and hypotheses based on the observations you have made in the preceding "RESULTS ANALYSES". Imagine you are trying to convince a very sceptical colleague: use rigorous argumentation, cite precise evidence and numerical values whenever possible, highlight important findings, and cross-check information from independent sources. Remember that in silico analyses generally do not constitute final proof, only suggestions. Terms such as "putative", "suggests" or "probably" can show understanding of the limitations of computational biology results.

Make sure you have at least covered:

arguments supporting your coding versus non-coding hypothesis; make sure you discuss the start position of your ORF (refer to the FAQ for all the subtleties and pitfalls)!
functional predictions for the protein, both at the biochemical level (e.g. "ubiquitin conjugation enzyme"), and the biological role at the organism level (e.g. "implicated in the control of the cell cycle"). Make careful use of annotations available for sequence homologs or conserved domains
your taxonomic classification hypothesis for the organism carrying this DNA fragment.

Some common pitfalls to avoid at all costs:

explain the theoretical aims and methodology of the tools used (please consider the readers knowledgeable in these matters)
describe on which button you clicked (please consider that readers are highly familiar with running BLASTs)
write in telegraphic style
dilute, inflate, digress in the hope that evaluation is proportional to word volume
copy raw results in extenso in the discussion when the reader has direct access to them in the appropriate fields
write linearly without any structure, or purely chronologically
isolate the analyses from each other (you can, and certainly should, make reference to the multiple alignment while discussing the ORF boundaries)
conclude without references to the results and observations
propose hypotheses without providing the supporting evidence
make approximate statements, such as citing BLAST homologs or conserved domains without citing their respective E-values

Concentrate on producing a scientific, structured, synthetic and rigorous argumentation that will hold up to peer scrutiny!

Evaluations

Due to lack of manpower, we are no longer able to offer evaluations of annotations outside of specific university teams!

Annotation evaluation check list

To help you anticipate potential annotation pitfalls, here is a (non-comprehensive) list of the most common criticisms made about annotations submitted for evaluation:

Analysis	Category	Criticism
ORF	ORF	TB
ORF	analysis	Please discuss the specific issue of the N- and C-termini of your ORF (you can/must refer to the multiple sequence alignment).
ORF	analysis	An ORF found with "any codon" as initiation codon with a start position above 3bp cannot be incomplete at the 5' end (there is a STOP codon just before)!
ORF	analysis	Discuss if there were any other potentially significant ORFs in the metagenomic sequence
ORF	analysis	Errors in ORF definition (contains stop codons, larger ORF exists etc.)
ORF	analysis	Explain which (if any) of the other ORFs appear to be potential true positive protein coding genes (justify).
ORF	analysis	Please analyze the ORF results (number of putative ORFs? 5'/3' incomplete? which ORF did you select?)
ORF	analysis	Unlikely to be non-coding considering the ORF size?
ORF	results	Incomplete results (missing strand or phases)
ORF	results	Missing protocol (strand, initiation codons, genetic code, min ORF size...)
ORF	results	Protocol: please include the URL of the website used to carry out the analysis.
blast	ORF	Discuss choice of E-values
blast	analysis	Incomplete description of BLAST results (nb of hits, E-value distribution, location of HSPs along query...)
blast	analysis	Incorrect analysis and interpretation of the BLAST results
blast	analysis	List under PROTOCOL all the protocols used (cf Rule Book)
blast	analysis	No analysis of functional information derived from homologues detected by BLAST
blast	analysis	Please describe & discuss the best pairwise alignments produced by BLASTp (similarities, identities, INDELS etc.)
blast	analysis	When using percentages to quantify alignment qualities (% identity or % similarity), always provide the alignment lengths
blast	analysis	You are confusing "similarity" with "homology"!
blast	results	Incorrect presentation of results (incomplete sequence list, too few or too many alignments, copy&paste error...)
blast	results	Missing protocol (BLAST type, database)
blast	results	Some BLAST's are missing (SP/NR, BLASTx, modified parameters ...)
blast	results	Too many pairwise alignments!
blast	taxonomy	Discuss your choice of Study Group
blast	taxonomy	Incorrect description of the BLAST taxonomy Lineage Report
blast	taxonomy	Incorrect selection of external group
blast	taxonomy	Incorrect selection of homologues (non represented groups, and/or over-represented groups...)
blast	taxonomy	Please fully describe the set of sequences carried over to multiple alignment, with their BLAST scores and identifiers (cf Rule Book)
blast	taxonomy	Please provide a panoramic diversity overview of the taxonomic origins of the BLAST hits
blast	taxonomy	Provide the list of in/out-group sequences with their names and E-values, but NOT the full FASTA format
blast	taxonomy	To correctly identify an external group, you need to resubmit a BLAST asking for more than the first 100 hits (250, 500 or more)
blast	taxonomy	Discuss the E-value log difference between in- and out-group sequences
conclusion	ORF	Specify if the ORF is complete, and if relevant estimate number of missing amino acids.
conclusion	domains	Incorrect comparison of functional info found through BLAST and INTERPRO
conclusion	hypotheses	Justify your selection of Gene Ontology terms!
conclusion	hypotheses	Lacks rigor. Cite evidence in support of your hypotheses! Refer to specific numeric values!
conclusion	hypotheses	No functional hypothesis
conclusion	hypotheses	A conclusion was expected discussing the putative relationship of the study virus with the human coronaviral pandemics.
misc	analysis	Plagiarism
misc	misc
domains	analysis	A number of domains listed under RAW RESULTS are not discussed at all?
domains	analysis	Discuss the predicted conserved domains E-values!
domains	analysis	Incorrect conserved domains identification (non annotated true positives, redundant domains, false positive domains selected...)
domains	analysis	Incorrect functional interpretation from conserved domains identified
domains	analysis	Missing conserved domain analysis
domains	analysis	Enter only the selected predictions (not all the predictions) in the specific "Domains Form" of the Annotathon
domains	analysis	Please compare the conserved domains prediction functions with functional information derived from the BLAST results
domains	analysis	Please discuss why some predicted domains have been excluded (redundant, high E-value etc.)
domains	analysis	Please provide some details of the biological function of the predicted conserved domains.
domains	domains
domains	results	Incorrect presentation of domains
domains	results	Missing raw Interpro textual output (RAW OUTPUT button in Interpro results page)
domains	results	The protocol is incomplete (include method name, website URL and parameters).
molecular weight	results	Not calculated or not applicable (if partial ORF)
multiple aln	ORF	Discuss the start/end of the ORF compared to its homologs (e.g. number of residues missing, or putative location of initiation codon)
multiple aln	ORF	Error in the interpretation of ORF start position (too long or too short in 5')
multiple aln	analysis	Are all sequences in the multiple alignment of similar length?
multiple aln	analysis	Incorrect analysis of Multiple Alignment (conserved/divergent regions, coherence with INTERPRO conserved domains...)
multiple aln	analysis	Map important residues/regions annotated in homolog database records (in particular SWISSPROT) onto the multiple sequence alignment and check their conservation
multiple aln	analysis	You have not discussed your ORF's start position compared to its homologs
multiple aln	results	Alignment contains non-homologous sequences
multiple aln	results	Incorrect multiple alignment presentation (CLUSTAL format, legible sequence names...)
multiple aln	results	Incorrect/incomplete Protocol
multiple aln	results	Multiple alignment contains some sequences which are too partial (incomplete at one or both ends)
multiple aln	results	Several identical sequences
multiple aln	results	Where is your ORF?
ontologies	analysis	Incorrect Biological Process
ontologies	analysis	Incorrect Molecular Function
ontologies	analysis	No selection of Gene Ontology terms
phylogeny	analysis	Incorrect specification of Duplication/Speciation events on tree nodes
phylogeny	analysis	Incorrect tree interpretation (HGT missed, ORF assigned to wrong group etc...)
phylogeny	analysis	Missing discussion on tree topology? Congruence if more than one tree?
phylogeny	analysis	Missing most likely taxonomic classification of organism carrying ORF
phylogeny	analysis	You must discuss the branch/node robustness values
phylogeny	results	On the tree, add the taxonomic groups after the leaf names, in the form [Alphaproteobacteria]
phylogeny	results	Incorrect presentation (leaves not reformatted with Genus species format, e.g. 'Ecolix'...)
phylogeny	results	Missing alternative tree reconstruction method
phylogeny	results	Missing protocol (method type, ext group used...)
phylogeny	results	Please add to the protocol that you collapsed (or not) the branches with weak support values (in which case provide this threshold).
phylogeny	results	Please color the branches according to taxonomic classification (see Rule Book)
phylogeny	taxonomy	Assign the most probable taxonomic classification, not the full taxonomic classification of the closest homolog!
phylogeny	taxonomy	Select a most likely taxonomic group (Taxonomy field)
writing		Please respect the recommended presentation for RESULTS fields (cf Rule Book)
writing		Conclusion should be better structured
writing		Conclusion should be more concise
writing		Insufficient attention to spelling
writing		Lacks rigor. Cite evidence in support of your hypotheses! Refer to specific numeric values!
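
Two of the checklist items above lend themselves to a quick calculation: the E-value log difference between in- and out-group sequences, and percentages reported together with the alignment length. The sketch below uses made-up values, and the function names are hypothetical; it is only meant to show the kind of numbers worth quoting.

```python
import math

def evalue_log_gap(ingroup_evalue, outgroup_evalue):
    """Return log10(out-group E-value) minus log10(in-group E-value).

    A positive result is the number of orders of magnitude by which
    the out-group hit is weaker than the in-group hit.
    """
    if ingroup_evalue <= 0 or outgroup_evalue <= 0:
        # An E-value displayed as 0 has no logarithm
        raise ValueError("E-values must be strictly positive")
    return math.log10(outgroup_evalue) - math.log10(ingroup_evalue)

def identity_statement(identical, aligned_length):
    """Format a percent identity together with the alignment length."""
    pct = 100.0 * identical / aligned_length
    return f"{pct:.0f}% identity over {aligned_length} aligned positions ({identical}/{aligned_length})"

# Made-up values for illustration only
print(round(evalue_log_gap(1e-80, 1e-20)))  # prints 60
print(identity_statement(112, 250))         # prints 45% identity over 250 aligned positions (112/250)
```

Quoting the alignment length matters because the same percentage means much less over 30 positions than over 250.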