Parsimony implies that simpler hypotheses are preferable to more complicated ones.
Maximum parsimony is a character-based method that infers a phylogenetic tree by minimizing the total number of evolutionary steps required to explain a given set of data, or in other words by minimizing the total tree length.
The steps may be base or amino-acid substitutions for sequence data, or gain and loss events for restriction site data.
Maximum parsimony, when applied to protein sequence data either considers each site of the sequence as a multistate unordered characterd with 20 possible states (the amino-acids) (Eck and Dayhoff, 1966), or may take into account the genetic code and the number of mutations, 1, 2 or 3, that is required to explain an observed amino-acid substitution. The latter method is implemented in the PROTPARS program (Felsenstein, 1993).
The maximum parsimony method searches all possible tree topologies for the optimal (minimal) tree. However, the number of unrooted trees that have to be analysed rapidly increases with the number of OTUs. The number of rooted trees (Nr) for n OTUs is given by:
The number of unrooted trees (Nr) for n OTUs is given by:
This is shown in the following table:
Number of OTUs |
Number of unrooted trees |
Number of rooted trees |
2 |
1 |
1 |
3 |
1 |
3 |
4 |
3 |
15 |
5 |
15 |
105 |
6 |
105 |
945 |
7 |
954 |
10,395 |
8 |
10,395 |
135,135 |
9 |
135,135 |
34,459,425 |
10 |
34,459,425 |
2.13E15 |
15 |
2.13E15 |
8.E21 |
This rapid increase in number of trees to be analysed may make it impossible to apply the method to very large datasets. In that case the parsimony method may become very time consuming, even on very fast computers.
An example of the maximum parsimony method for a dataset of 4 nucleic-acid sequences is given below.
Consider the following set of homologous sequences:
Site _________________________ Sequence 1 2 3 4 5 6 7 8 9 1 A A G A G T G C A 2 A G C C G T G C G 3 A G A T A T C C A 4 A G A G A T C C G
For four OTUs there are three possible unrooted trees. The trees are then analysed by searching for the ancestral sequences and by counting the number of mutations required to explain the respective trees as shown below:
(1) AAGAGTGCA AGATATCCA (3) \4 2/ Number of mutations \ 4 / AGCCGTGCG --- AGAGATCCG Tree I: 11 / \ /0 0\ (2) AGCCGTGCG AGAGATCCG (4) (1) AAGAGTGCA AGCCGTGCG (2) \1 3/ \ 5 / AGGAGTGCA --- AGAGGTCCG Tree II: 14 / \ /4 1\ (3) AGATATCCA AGAGATCCG (4) (1) AAGAGTGCA AGCCGTGCG (2) \1 3/ \ 5 / AGGAGTGCA --- AGATGTCCG Tree III: 16 / \ /5 2\ (4) AGAGATCCG AGATATCCA (3)
Tree I has the topology with the least number of mutations and thus
is the most parsimonious tree.
NB: The above analysis is based on all the sites in the sequence
alignment . However, a number of the sites are non-informative and,
therefore, do not have to be included in the analysis. When only
informative sites are included a much lesser number of sites can be
analysed, which means in the case of large datasets a considerable
gain in CPU time.
The definition of an informative site is as follows. A site is informative only when there are at least two different kinds of nucleotides at the site, each of which is represented in at least two of the sequences under study.
To illustrate the distinction between informative and non-informative sites, lets have a look the same four hypothetical sequences as above.
Site _________________________ Sequence 1 2 3 4 5 6 7 8 9 1 A A G A G T G C A 2 A G C C G T G C G 3 A G A T A T C C A 4 A G A G A T C C G * * *
There are three possible unrooted trees for four OTUs (tree I, II and III, see figure below). Site 1 is not informative because all sequences at this site have A, so that no change is required in any of the three possible trees. At site 2, sequence 1 has A while all other sequences have G, and so a simple assumption is that the nucleotide has changed from G to A in the lineage leading to sequence 1. Thus, this site is also not informative, because each of the three possible trees requires 1 change. As shown in the figure, for site 3 each of the three possible trees requires 2 changes and so it is also not informative. Note that if we assume that the nucleotide at the node connecting OTUs 1 and 2 in tree I is C (or A) instead of G, the number of changes required for the tree remains 2. The figure shows that for site 4 each of the three trees requires 3 changes and thus site 4 is also non-informative. For site 5, tree I requires only 1 change, whereas trees II and III require 2 changes each (Figure c). Therefore, this site is informative.
From these examples, we see that, as far as molecular data are concerned, a site is informative only when there are at least two different kinds of nucleotides at the site, each of which is represented in at least two of the sequences under study. In the above example, informative sites are indicated by an asterisk (*).
Below you see the four sequences and their corresponding three possible trees made with only the informative sites :
1 GGA 2 GGG 3 ACA 4 ACG *** (1) GGA ACA (3) \1 1/ Number of mutations \ 2 / GGG --- ACG Tree I: 4 / \ /0 0\ (2) GGG ACG (4) (1) GGA GGG (2) \1 1/ \ 1 / GCA --- GCG Tree II: 5 / \ /1 1\ (3) ACA ACG (4) (1) GGA GGG (2) \2 1/ \ 0 / GCG --- GCG Tree III: 6 / \ /1 2\ (4) ACG ACA (3)
To infer a maximum parsimony tree, for each possible tree we
calculate the minimum number of substitutions at each informative
site. In the above example, for sites 5, 7, and 9, tree I requires in
total 4 changes, tree II requires 5 changes, and tree III requires 6
changes. In the final step, we sum the number of changes over all the
informative sites for each tree and choose the tree associated with
the smallest number of substitutions. In our case, tree I is chosen
because it requires the smallest number of changes (4) at the
informative sites.
In the case of four OTUS, an informative site favours only one of the three possible alternative trees. For example, site 5 favours tree I over trees II and III, and is said to support tree I. It is easy to see that the tree supported by the largest number of informative sites is the maximum parsimony tree. For instance, in the above example, tree I is supported by 2 sites, tree II by one site, and tree III by none.
Maximum parsimony searches for the optimal (minimal) tree. In this process more than one minimal trees may be found. In order to guarantee to find the best possible tree an exhaustive evaluation of all possible tree topologies has to be carried out. However, this becomes impossible when there are more than 12 OTUs in a dataset.
Branch and Bound: is a variation on maximum parsimony that garantees to find the minimal tree without having to evaluate all possible trees. This way a larger number of taxa can be evaluated but the method is still limited.
Heuristic searches is a method with step-wise addition and rearrangement (branch swapping) of OTUs. Here it is not guaranteed to find the best tree.
Since, in view of the size of the dataset, it is often not
possible to carry out an exhaustive or other search for the best
tree, it is adviced to change the order of the taxa in the dataset
and to repeat the analysis, or to indicate to the program to do this
for you by providing a so-called jumble factor to the
program.
Since the Maximum Parsimony method may result in more than one equally parsimonious tree, a consensus tree should be created. For the creation of a consensus tree see bootstrapping.
Let's assume that we have a set of 3 possible trees for 4 OTUs that relate to only one site and that all describe the same final state by assuming a total of 3 steps. However, each final state is arrived at via a different route. It is immediately obvious that each of the three trees is equally valid, but that the number of steps along the indiviual branches (or the length of each branch) is not deteremined. For this reason branch lengths are not given in parsimony, but only the total number of steps for a tree.
(1) G A (3) \1 0/ \ 1 / C -----A / \ /0 1\ (2) C T (4) (1) G A (3) \0 1/ \ 1 / G -----T / \ /1 0\ (2) C T (4) (1) G A (3) \1 1/ \ 1 / C -----A / \ /0 0\ (2) C A (4)