Alignment methods (Part 4: computer lab)
Previous class check-up
- We reviewed the algorithms for pairwise and multiple sequence alignments
Learning objectives
At the end of today’s session, you
- will learn to use different software options: ClustalW, T-Coffee and MUSCLE
No pre-class work.
Let’s take a look at data from last year’s projects here.
MSA software key insights
- No perfect method
- No automatic method
- All methods require manual work of comparing results from different alignment parameters and from different sofware
- Take notes of the choices you made and keep track of all comparisons to justify final choice
- We probably don’t spend as much time as we should on the alignment step of the phylogenomic pipeline
- We want a blackbox that does not exist yet!
Other algorithms
Genetic algorithm
- SAGA uses the WSP objective function but uses genetic algorithms instead of dynamic programming (individual=alignment)
- very accurate
- more scalable than T-coffee, but still not super scalable
Hidden markov model
Simultaneous estimation tree/alignment
Other considerations
- Nucleotide vs amino acid sequence
- when there is choice (protein-coding genes), amino acid alignments are easier to carry out and less ambiguous; also nucleotide alignments do not recognize codon as a unit and can break up the reading frame; typically, you align the amino acids and then generate the corresponding nucleotide sequence alignment
- when sequences are not protein coding, only choice is to align nucleotides
- Manual editing and visualizing alignments
- manual editing is scary because it is not reproducible
- but many times it is necessary because automatic alignment methods are not as accurate as they should be
Which program to choose?
- Not a clear answer
- Scalability vs accuracy
- HW reading: Alignathon
- Strategy:
- Run multiple programs and parameter choices
- Filtering is more important than the specific program used (more on filtering later)
- Read program papers and documentation carefully
- Take good note of choices and keep track of all comparisons to justify final choice
Software tools for alignment
ClustalW
DATA (sequences)
-INFILE=file.ext :input sequences.
-PROFILE1=file.ext and -PROFILE2=file.ext :profiles (old alignment).
VERBS (do things)
-OPTIONS :list the command line parameters
-HELP or -CHECK :outline the command line params.
-FULLHELP :output full help content.
-ALIGN :do full multiple alignment.
-TREE :calculate NJ tree.
-PIM :output percent identity matrix (while calculating the tree)
-BOOTSTRAP(=n) :bootstrap a NJ tree (n= number of bootstraps; def. = 1000).
-CONVERT :output the input sequences in a different file format.
PARAMETERS (set things)
***General settings:****
-INTERACTIVE :read command line, then enter normal interactive menus
-QUICKTREE :use FAST algorithm for the alignment guide tree
-TYPE= :PROTEIN or DNA sequences
-NEGATIVE :protein alignment with negative values in matrix
-OUTFILE= :sequence alignment file name
-OUTPUT= :GCG, GDE, PHYLIP, PIR or NEXUS
-OUTORDER= :INPUT or ALIGNED
-CASE :LOWER or UPPER (for GDE output only)
-SEQNOS= :OFF or ON (for Clustal output only)
-SEQNO_RANGE=:OFF or ON (NEW: for all output formats)
-RANGE=m,n :sequence range to write starting m to m+n
-MAXSEQLEN=n :maximum allowed input sequence length
-QUIET :Reduce console output to minimum
-STATS= :Log some alignents statistics to file
***Multiple Alignments:***
-NEWTREE= :file for new guide tree
-USETREE= :file for old guide tree
-MATRIX= :Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename
-DNAMATRIX= :DNA weight matrix=IUB, CLUSTALW or filename
-GAPOPEN=f :gap opening penalty
-GAPEXT=f :gap extension penalty
-ENDGAPS :no end gap separation pen.
-GAPDIST=n :gap separation pen. range
-NOPGAP :residue-specific gaps off
-NOHGAP :hydrophilic gaps off
-HGAPRESIDUES= :list hydrophilic res.
-MAXDIV=n :% ident. for delay
-TYPE= :PROTEIN or DNA
-TRANSWEIGHT=f :transitions weighting
-ITERATION= :NONE or TREE or ALIGNMENT
-NUMITER=n :maximum number of iterations to perform
-NOWEIGHTS :disable sequence weighting
What are the default values of these options?
Another thing to notice:
==ITERATION==
A remove first iteration scheme has been added. This can be used to improve the final
alignment or improve the alignment at each stage of the progressive alignment. During the
iteration step each sequence is removed in turn and realigned. If the resulting alignment
is better than the previous alignment it is kept. This process is repeated until the score
converges (the score is not improved) or until the maximum number of iterations is
reached. The user can iterate at each step of the progressive alignment by setting the
iteration parameter to TREE or just on the final alignment by seting the iteration
parameter to ALIGNMENT. The default is no iteration. The maximum number of iterations can
be set using the numiter parameter. The default number of iterations is 3.
-ITERATION= :NONE or TREE or ALIGNMENT
-NUMITER=n :Maximum number of iterations to perform
Further reading: Read the ClustalW documentation. Is it clear what the algorithm is doing and how to best select the parameters involved? What is missing (if any) from this documentation?
T-Coffee
$ t_coffee sample_seq1.fasta
- When aligning, T-Coffee will always at least generate three files:
sample_seq1.aln
: Multiple Sequence Alignment (ClustalW format by default)sample_seq1.dnd
: guide tree (Newick format)sample_seq1.html
: colored MSA according to consistency (html format)
T-Coffee
also has a great documentation- Specifically, it has a good description of the parameters, see here
- Don’t forget to check out M-Coffee that combines the output of eight aligners (MUSCLE, ProbCons, POA, DIALIGN-T, MAFFT, ClustalW, PCMA and T-Coffee)
MUSCLE
- MUSCLE documentation
- The different parameter options are explained here
- The default settings are chosen for maximum accuracy (how?), but you can change the settings for more speed (see here)
In-class exercise
Instructions: Run and compare the results of all three MSA software on the toy dataset of primates (see the class repo data folder). You can use my reproducible script as guideline: notebook-log.md.
ClustalW
- Download ClustalW
- Download the
primatesAA.fasta
file from the Phylogenetic Handbook website (22 primate aminoacid sequences). The website stopped working at some point, so we have the file in the class repo data folder. - Run
ClustalW
, see docs
T-Coffee
MUSCLE
Homework: Check out the alignment HW here.
Learn more!
- Read Chapter 3 of the Phylogenetic Handbook (HB)
- Read Sections 9.1-9.5, 9.11, 9.12, 9.13 of Computational phylogenetics (Warnow)
- Read HAL 2.3 on a new alignment method
MACSE