Maximum Parsimony analysis

Parsimony implies that simpler hypotheses are preferable to more complicated ones.

Maximum parsimony is a character-based method that infers a phylogenetic tree by minimizing the total number of evolutionary steps required to explain a given set of data, or in other words by minimizing the total tree length.

The steps may be base or amino-acid substitutions for sequence data, or gain and loss events for restriction site data.

Maximum parsimony, when applied to protein sequence data either considers each site of the sequence as a multistate unordered characterd with 20 possible states (the amino-acids) (Eck and Dayhoff, 1966), or may take into account the genetic code and the number of mutations, 1, 2 or 3, that is required to explain an observed amino-acid substitution. The latter method is implemented in the PROTPARS program (Felsenstein, 1993).

The maximum parsimony method searches all possible tree topologies for the optimal (minimal) tree. However, the number of unrooted trees that have to be analysed rapidly increases with the number of OTUs. The number of rooted trees (Nr) for n OTUs is given by:

The number of unrooted trees (Nr) for n OTUs is given by:

This is shown in the following table:

 Number of OTUs

 Number of unrooted trees

 Number of rooted trees

 2

 1

 1

 3

 1

 3

 4

 3

 15

 5

 15

 105

 6

 105

 945

 7

 954

 10,395

 8

 10,395

 135,135

 9

 135,135

 34,459,425

 10

 34,459,425

 2.13E15

 15

 2.13E15

 8.E21

This rapid increase in number of trees to be analysed may make it impossible to apply the method to very large datasets. In that case the parsimony method may become very time consuming, even on very fast computers.


An example of the maximum parsimony method for a dataset of 4 nucleic-acid sequences is given below.

Consider the following set of homologous sequences:

          Site

          _________________________ 


Sequence  1  2  3  4  5  6  7  8  9  

  1       A  A  G  A  G  T  G  C  A  

  2       A  G  C  C  G  T  G  C  G  

  3       A  G  A  T  A  T  C  C  A  

  4       A  G  A  G  A  T  C  C  G  

For four OTUs there are three possible unrooted trees. The trees are then analysed by searching for the ancestral sequences and by counting the number of mutations required to explain the respective trees as shown below:

 

(1) AAGAGTGCA         AGATATCCA (3)
           \4        2/               Number of mutations
            \    4   /
       AGCCGTGCG --- AGAGATCCG         Tree I:   11
            /        \
           /0        0\ 
(2) AGCCGTGCG         AGAGATCCG (4)


(1) AAGAGTGCA         AGCCGTGCG (2) 
           \1        3/               
            \    5   /
      AGGAGTGCA --- AGAGGTCCG          Tree II:  14
            /        \
           /4        1\ 
(3) AGATATCCA         AGAGATCCG (4) 


(1) AAGAGTGCA         AGCCGTGCG (2) 
           \1        3/               
            \    5   /
      AGGAGTGCA --- AGATGTCCG          Tree III: 16
            /        \
           /5        2\ 
(4) AGAGATCCG         AGATATCCA (3)


Tree I has the topology with the least number of mutations and thus is the most parsimonious tree.

NB: The above analysis is based on all the sites in the sequence alignment . However, a number of the sites are non-informative and, therefore, do not have to be included in the analysis. When only informative sites are included a much lesser number of sites can be analysed, which means in the case of large datasets a considerable gain in CPU time.

Informative sites

The definition of an informative site is as follows. A site is informative only when there are at least two different kinds of nucleotides at the site, each of which is represented in at least two of the sequences under study.

To illustrate the distinction between informative and non-informative sites, lets have a look the same four hypothetical sequences as above.

 

          Site

          _________________________ 


Sequence  1  2  3  4  5  6  7  8  9  

  1       A  A  G  A  G  T  G  C  A  

  2       A  G  C  C  G  T  G  C  G  

  3       A  G  A  T  A  T  C  C  A  

  4       A  G  A  G  A  T  C  C  G  


                      *     *     *

There are three possible unrooted trees for four OTUs (tree I, II and III, see figure below). Site 1 is not informative because all sequences at this site have A, so that no change is required in any of the three possible trees. At site 2, sequence 1 has A while all other sequences have G, and so a simple assumption is that the nucleotide has changed from G to A in the lineage leading to sequence 1. Thus, this site is also not informative, because each of the three possible trees requires 1 change. As shown in the figure, for site 3 each of the three possible trees requires 2 changes and so it is also not informative. Note that if we assume that the nucleotide at the node connecting OTUs 1 and 2 in tree I is C (or A) instead of G, the number of changes required for the tree remains 2. The figure shows that for site 4 each of the three trees requires 3 changes and thus site 4 is also non-informative. For site 5, tree I requires only 1 change, whereas trees II and III require 2 changes each (Figure c). Therefore, this site is informative.

From these examples, we see that, as far as molecular data are concerned, a site is informative only when there are at least two different kinds of nucleotides at the site, each of which is represented in at least two of the sequences under study. In the above example, informative sites are indicated by an asterisk (*).

Below you see the four sequences and their corresponding three possible trees made with only the informative sites :

 

1 GGA 
2 GGG 
3 ACA 
4 ACG
  ***


(1)  GGA         ACA (3)
       \1        1/               Number of mutations
        \    2   /
        GGG --- ACG                 Tree I:   4
        /        \
       /0        0\ 
(2)  GGG         ACG (4)  


(1)  GGA         GGG (2)  
       \1        1/             
        \    1   /
        GCA --- GCG                 Tree II:  5
        /        \
       /1        1\ 
(3)  ACA         ACG (4)  


(1)  GGA         GGG (2)  
       \2        1/               
        \    0   /
        GCG --- GCG                 Tree III: 6
        /        \
       /1        2\ 
(4)  ACG         ACA (3)


To infer a maximum parsimony tree, for each possible tree we calculate the minimum number of substitutions at each informative site. In the above example, for sites 5, 7, and 9, tree I requires in total 4 changes, tree II requires 5 changes, and tree III requires 6 changes. In the final step, we sum the number of changes over all the informative sites for each tree and choose the tree associated with the smallest number of substitutions. In our case, tree I is chosen because it requires the smallest number of changes (4) at the informative sites.

In the case of four OTUS, an informative site favours only one of the three possible alternative trees. For example, site 5 favours tree I over trees II and III, and is said to support tree I. It is easy to see that the tree supported by the largest number of informative sites is the maximum parsimony tree. For instance, in the above example, tree I is supported by 2 sites, tree II by one site, and tree III by none.

Maximum parsimony searches for the optimal (minimal) tree. In this process more than one minimal trees may be found. In order to guarantee to find the best possible tree an exhaustive evaluation of all possible tree topologies has to be carried out. However, this becomes impossible when there are more than 12 OTUs in a dataset.

Branch and Bound: is a variation on maximum parsimony that garantees to find the minimal tree without having to evaluate all possible trees. This way a larger number of taxa can be evaluated but the method is still limited.

Heuristic searches is a method with step-wise addition and rearrangement (branch swapping) of OTUs. Here it is not guaranteed to find the best tree.

Since, in view of the size of the dataset, it is often not possible to carry out an exhaustive or other search for the best tree, it is adviced to change the order of the taxa in the dataset and to repeat the analysis, or to indicate to the program to do this for you by providing a so-called jumble factor to the program.

Consensus tree

Since the Maximum Parsimony method may result in more than one equally parsimonious tree, a consensus tree should be created. For the creation of a consensus tree see bootstrapping.



Parsimony and branch lengths

Let's assume that we have a set of 3 possible trees for 4 OTUs that relate to only one site and that all describe the same final state by assuming a total of 3 steps. However, each final state is arrived at via a different route. It is immediately obvious that each of the three trees is equally valid, but that the number of steps along the indiviual branches (or the length of each branch) is not deteremined. For this reason branch lengths are not given in parsimony, but only the total number of steps for a tree.



(1)  G             A (3)
       \1        0/               
        \    1   /
         C -----A               
        /        \
       /0        1\ 
(2)  C             T (4)  


(1)  G             A (3)
       \0        1/               
        \    1   /
         G -----T               
        /        \
       /1        0\ 
(2)  C             T (4)  


(1)  G             A (3)
       \1        1/               
        \    1   /
         C -----A               
        /        \
       /0        0\ 
(2)  C             A (4)  


Some final notes on Maximum Parsimony


Last updated: 9 September 1997.

created by :Fred Opperdoes