There is a rich diversity of species on earth today specialized to live in a wide range of environments. However, each of these species evolved from very primitive species billions of years ago. A good way to visualize how these spcecies evolved through time is to view them in a tree. The tree to the right displays evolutionary relationships among a small sample of different animal species that exist today. The leaf nodes at the tips of the tree are labeled and represent groups of descendant species and each branch node represents a common ancestor of all descendant children.
One metric commonly used to compare relatedness of species is to view their most recent common ancestor. For example, you can see how closely related humans are to lizards by looking at their most recent common ancestor . This also allows us to look into how two groups of species compare to humans. Notice that turkeys and humans are related just as much as lizards and humans are since these two pairings have the same most recent common ancestor. This can be confusing since there are more ancestral species between humans and turkeys than there are between humans and lizards. Notice that humans and platypi are more related than humans and turkeys or humans and lizards are, since the most recent common ancestor between humans and platypi is more recent than the one between humans and turkeys/lizards.
But how did we know where to place species on the tree? There are many ways to measure similarities between two species, but a popular way is to compare the DNA sequences of two species. Two species that are similar genetically contain very similar subsequences of DNA, and we will explore a method of analyzing the similarities between two strings.
A useful heuristic in computing the different between two strings is called the Levenshtein Distance. Informally, this distance is defined as the minimum number of single-character edits required to change one word to another, where each edit is one of the following:
An insertion is defined as adding a character to one of the strings being compared to make them more equal. For example, a T can be added to the end of "ATGT" to create "ATGTT" (dashes represent no character). Note that blue indicates that characters are aligned in a matching position and orange indicates that characters are not in matching positions.
A deletion is defined as deleting a character from one of the strings being compared to make them more equal. For example, a deletion of T from "ATGTT" can create "ATGT".
A substitution is defined as replacing a single character in one of the strings being compared with another character to make them more equal. For example, if the strings being compared are "ATGTT" and "GTGTT", then the "A" in "ATGTT" can be substituted for a "G" to create "GTGTT".
Since we can make many combinations of single-character edits (some of which are not as helpful as others), we can impose a cost for each edit and then try to minimize the total cost of some edits to find the optimal (or one of the optimal) ways to transform one string into another, and use this metric to figure out how similar two strings are. As the minimum total cost increases, the strings are more and more dissimilar to each other.
Consider two strings: TGTTACGG and AGGTTGACTA. Let’s analyze the total cost of converting TGTTACGG into AGGTTGACTA. Note that we use the same color scheme described above, but also introduce a dark blue color to indicate the most recently matched character(s).
Total Cost: 0.00
This resulted in a final total cost of 5. But can we do better in terms of the cost? It turns out that we can’t, but this is an interesting problem to attempt to optimize.
Notice that we are finding the optimal alignment taking the entire length of both of the strings into account. Often in biology, there are targeted substrings of two DNA sequences that provide some evolutionary significance.
In order to observe the difference between substrings we will first have to define a scoring function so that we can determine which substrings match more closesly than others. In this scoring function we will give any two matching characters a +2 if the characters at some index match and -1 otherwise. A single-character change still costs -1 like in the above example. Notice that if two characters are not aligned (contributing a score of -1 to our scoring function), but we can align them with a single-character change, we can, in effect, “pay” a cost, -1, in order to gain a +2 from the two-characters becoming aligned.
As an example, take the two Strings ACTG and AGTG. If we were to aggregate their character scores it would look something like:
Resulting in a total score of 2 - 1 + 2 + 2 = 5.
Note that a scoring from one specific state of two strings could be improved by performing a single-character edit (shift, deletion, insertion), but performing each of these operations will still cost us -1 (in other words, cause our score to decrease by 1).
Our goal is to find an optimal alignment between two substrings that maximizes the scoring function.
One approach to finding an optimal alignment would be to generate every single possible substring of the two strings and compare them against each other. A substring can be created from a longer string by choosing a starting and ending position within the longer string, and treating that as a shorter string. Creating all possible substrings of a string is computationally expensive, as for every starting position we choose, we can choose any ending position along the string. The longer the original string, the more possible substrings can be created from it. In addition to generating every possible substring, it would also be necessary to generate every possible variation of each of these substrings, in which one or more dashes are inserted in order to simulate the insertion/deletion of a character, to match one string against another. Once we have all of these potential substrings generated, we then have to go through all possible pairings of two substrings to calculate their alignment score, according to the scoring function we use. Overall, we can see that calculating all of these alignment scocres would be computationally expensive because of the sheer amount of potential substrings and combinations.
The essence of the Smith-Waterman Algorithm is to improve this naive approach using dynamic programming. This algorithm takes the two strings that we want to find optimal substring alignments for and aligns them into a matrix. This might seem strange at first, but this approach actually makes the problem much more efficient.
This graph displays a comparison between a naive approach to find optimal substring alignment and the Smith-Waterman algorithm. The naive, brute-force approach entails comparing every single possible edit (i.e. insertion of a dash) of every single possible substring of the two strings against each other. When we consider the lengths of the two strings to be n and m, the naive approach has a runtime of O(n3m3), while the Smith-Waterman algorithm has a runtime O(nm).
Enter two strings to find their optimal alignment.