Alignment methods (Part 2)
Previous class check-up
- We reviewed the algorithms for pairwise and multiple sequence alignments (Needleman-Wunsch algorithm)
Learning objectives
At the end of today’s session, you
- will be able to explain the most widely used algorithms for multiple sequence alignment
No pre-class work.
3. Multiple sequence alignment
- The Needleman-Wunsch is the magic algorithm that allows us to align two sequences
- We want to expand the pairwise sequence alignment to multiple sequence alignment
- Progressive alignment: the most widely used algorithm (e.g. ClustalW)
- Consistency-based scoring: improvement over progressive alignment by using a more strict score function (e.g. T-Coffee)
- Iterative refinement algorithm: improvement over progressive alignment by doing sequential alignments until convergence of score (e.g. mafft, muscle)
Progressive alignment
- Compute rooted binary tree (guide tree) from pairwise distances
- Build MSA from the bottom (leaves) up (root)
What is a rooted binary tree?
Figure 9.9 in Warnow (2018) Computational phylogenetics
Progressive alignment algorithm
- Align all pairs of sequences using the Needleman-Wunsch algorithm
- For every pairwise alignment, we calculate its cost based on the cost of gap (e.g. unit cost) and the cost of substitution (e.g. unit cost)
- We estimate the tree from distances: we will learn this in Lecture 8. Let’s pretend we already have the tree
- We build the alignments on the tree from the leaves to the root (bottom-up)
- For the leaves, we build the pairwise alignments for (a,b) and for (d,e) using the Needleman-Wunsch algorithm
- For internal nodes, we need to know how to align alignments
What are the ingredients that we need to know to perform MSA via progressive alignment?
- Perform pairwise sequence alignment via Needleman-Wunsch (check!)
- Calculate the cost of a pairwise sequence alignment (check!)
- Calculate a tree from distances (Lecture 8)
- Perform alignment of alignments (missing)
How to align alignments
We need a new concept called “profile”.
Aligning alignments
- Construct profiles
- Define the cost of putting $a_i, b_j$ together. We want to minimize the expected cost between profiles
- Use Needleman-Wunsch to align $P_1$ and $P_2$ based on the costs
Aligning alignments: defining the costs
Treat $a_i$ in $P_1$ and $b_j$ in $P_2$ as probability models: $P(x \vert a_i)$ is the probability of observing nucleotide $x$ in position $i$ on the profile $P_1$ (Example: What is $P(A \vert a_1)$?)
We define the cost as
In-class exercise: What is the $cost(a_3,b_2)$?
Homework
Instructions: Build the cost matrix for the two following profiles. This means that you want to calculate $cost(a_i,b_j)$ for all $i$ and $j$.
Aligning the alignments: we have the cost matrix, now what?
Assume we got the following cost matrix
a1 a2 a3 a4 a5
b1 [ 1/3 1 1 1 8/15 ]
b2 [ 1 1 1/4 2/3 1 ]
b3 [ 1 0 3/4 1/3 1 ]
b4 [ 1 1 1/4 2/3 1 ]
b5 [ 1 0 3/4 1/3 1 ]
b6 [ 1/3 1 9/12 8/9 11/15]
and we will use it to align the two profiles $P_1 = a_1 a_2 a_3 a_4 a_5$ and $P_2 = b_1 b_2 b_3 b_4 b_5 b_6$ with Needleman-Wunsch. The cost matrix above provides the costs of substitutions and we assume a cost of gap of 1.
The video on canvas has two errors: $cost(a3,b1)=1/4$ instead of 1 and $cost(a4,b6)=7/9$ instead of 8/9.
In-class activity: Let’s recall Needleman-Wunsch: we need the $F(i,j)$ matrix and then trace back the alignment. Let’s do here together some of the entries of the $F(i,j)$ matrix.
Homework
Instructions: Finish Needleman-Wunsch on the two profiles.
- Build the F matrix
- Trace back the alignment from the bottom right corner
Solution: You should get the following alignment which we can translate back to the original sequences.
MSA key insights Needleman-Wunsch lies at the core of MSA: if we have two sequences, we align them with Needleman-Wunsch; if we have two alignments, we first convert them to profiles, and then align the profiles with Needleman-Wunsch. The final alignment will depend on the assumptions on the cost of substitutions and costs of gaps
Homework recap here.
For next class: Read the paper corresponding to your group (in canvas): ClustalW, MUSCLE, T-Coffee