Bootstrapping

Bootstrapping is a way of testing how reliably a dataset supports a result. It works by creating pseudoreplicate datasets through resampling, and it lets you assess whether the distribution of characters has been influenced by stochastic effects. In phylogenetic analyses, nonparametric bootstrapping is the most commonly used method. The pseudoreplicate datasets are generated by randomly sampling the characters (columns) of the original character matrix, with replacement, to create new matrices of the same size as the original. A tree is built from each pseudoreplicate, and the frequency with which a given branch is found among these trees is recorded as the bootstrap proportion. These proportions can be used as a measure of the reliability (within limitations) of individual branches in the optimal tree.

Thus bootstrap analysis:

NB: If the entire dataset is compatible and has not been biased by stochastic effects, all bootstrap trees should in principle have the same topology!

However, if the original dataset is biased, a cluster may be regarded as statistically significant even if it is wrong!

How are bootstrapping and the construction of a consensus tree carried out in practice?

Take a dataset consisting of n sequences with m sites each. A number of resampled datasets of the same size (n x m) as the original dataset are produced. Each site is drawn at random from the original sites, with replacement, until m sites have been drawn; as a result, some original sites appear several times in a resampled dataset and others not at all. For the results to be statistically meaningful, the number of resampled datasets should be high: equal to or higher than the number of individual sites in the dataset.
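The resampling step can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular package: the function name `bootstrap_replicate` and the seeded `random.Random` generator are assumptions made for the example. Sites (columns) are drawn with replacement, so the replicate keeps the original n x m size.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Return one pseudoreplicate: m sites drawn with replacement.

    `alignment` maps a taxon name to its sequence (all of equal length).
    """
    n_sites = len(next(iter(alignment.values())))
    # Draw m column indices at random, with replacement.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # Sorting only restores the original site order; it does not change the sample.
    columns.sort()
    return {taxon: "".join(seq[i] for i in columns)
            for taxon, seq in alignment.items()}

# The example dataset of this section (4 sequences, 10 sites).
alignment = {
    "A": "AGGCUCCAAA",
    "B": "AGGUUCGAAA",
    "C": "AGCCCCGAAA",
    "D": "AUUUCCGAAC",
}

rng = random.Random(1997)  # fixed seed (arbitrary) so the sketch is repeatable
replicate = bootstrap_replicate(alignment, rng)
for taxon, seq in replicate.items():
    print(taxon, " ".join(seq))
```

Calling `bootstrap_replicate` repeatedly with the same generator yields as many pseudoreplicates as needed; each one would then be passed to the tree-building method of choice.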

Our example dataset consists of 4 sequences with 10 sites each (see below). When three new datasets are prepared by random sampling of sites, three sample datasets such as the following can be obtained. In each sample, the original sequences are shown on the left and the resampled sequences on the right:

Sample 1  0 1 2 0 3 0 1 2 0 1       (<- number of times each site is sampled)
          ___________________
  
  A       A G G C U C C A A A       A     G G G U U U C A A A
  B       A G G U U C G A A A       B     G G G U U U G A A A
  C       A G C C C C G A A A       C     G C C C C C G A A A
  D       A U U U C C G A A C       D     U U U C C C G A A C
             

 

          A   B   C
  B       1
  C       6   5
  D       8   7   4
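The resampled sequences and the pairwise values listed for Sample 1 can be reproduced with a short Python sketch. It assumes, as a check of the figures suggests, that each value is the number of sites at which two resampled sequences differ; the function names `resample` and `differences` are illustrative only.

```python
from itertools import combinations

def resample(alignment, counts):
    """Repeat site i of every sequence counts[i] times (0 drops the site)."""
    return {taxon: "".join(base * k for base, k in zip(seq, counts))
            for taxon, seq in alignment.items()}

def differences(seq1, seq2):
    """Number of sites at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(seq1, seq2))

alignment = {
    "A": "AGGCUCCAAA",
    "B": "AGGUUCGAAA",
    "C": "AGCCCCGAAA",
    "D": "AUUUCCGAAC",
}

counts = [0, 1, 2, 0, 3, 0, 1, 2, 0, 1]  # sampling counts of Sample 1
sample1 = resample(alignment, counts)
for taxon, seq in sample1.items():
    print(taxon, " ".join(seq))

# Pairwise difference counts between the resampled sequences.
matrix = {(x, y): differences(sample1[x], sample1[y])
          for x, y in combinations(sample1, 2)}
for (x, y), d in matrix.items():
    print(x, y, d)
```

With the counts of Sample 1, sequence A becomes G G G U U U C A A A, and the pairwise differences are 1 (A-B), 6 (A-C), 5 (B-C), 8 (A-D), 7 (B-D) and 4 (C-D), in agreement with the values listed for this sample.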

 

 

 


Sample 2  1 0 0 0 2 2 2 0 0 3
          ___________________
  
  A       A G G C U C C A A A       A     A U U C C C C A A A
  B       A G G U U C G A A A       B     A U U C C G G A A A
  C       A G C C C C G A A A       C     A C C C C G G A A A
  D       A U U U C C G A A C       D     A C C C C G G C C C
                        


 

          A   B   C
  B       2
  C       4   2
  D       7   5   3

 


Sample 3  1 0 0 0 2 2 2 0 0 3
          ___________________
  
  A       A G G C U C C A A A       A     A U U C C C C A A A
  B       A G G U U C G A A A       B     A U U C C G G A A A
  C       A G C C C C G A A A       C     A C C C C G G A A A
  D       A U U U C C G A A C       D     A C C C C G G C C C


 

          A   B   C
  B       1
  C       3   2
  D       6   3   4

 


A large number of datasets (between one hundred and one thousand, depending on computer power) and the same number of trees are generated in this way. In this specific case, taxa A and B form a cluster in all three trees, while C clusters with D in only one tree. Specialised programs, such as Consense in the Phylip package of Joe Felsenstein, can analyse all the resulting trees and prepare a consensus tree from them.
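Counting how often a cluster recurs is simple once each bootstrap tree is reduced to the set of clusters it contains. The Python sketch below uses a hypothetical encoding, not the input format of Consense: each tree is a set of clusters, and only the clusters mentioned in the text (A with B in all three trees, C with D in one) are encoded. Which of the three trees contains the C-D cluster is not stated, so its placement here is arbitrary.

```python
from collections import Counter

# One entry per bootstrap tree; each cluster is a frozenset of taxon names.
# Only the clusters named in the text are encoded (an assumption of this sketch).
trees = [
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB")},
    {frozenset("AB")},
]

def bootstrap_proportions(trees):
    """Fraction of trees in which each cluster occurs."""
    counts = Counter(cluster for tree in trees for cluster in tree)
    return {cluster: n / len(trees) for cluster, n in counts.items()}

proportions = bootstrap_proportions(trees)
for cluster, p in proportions.items():
    print("".join(sorted(cluster)), f"{p:.2f}")
```

For these three trees the A-B cluster has a bootstrap proportion of 3/3 and the C-D cluster 1/3. With only three replicates such figures carry little weight; they serve only to show the bookkeeping.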

The resulting consensus tree for our small dataset is shown below. The number of times each branch point or node occurred (the so-called bootstrap proportion) is indicated at each node.

Result         
 
  
  A       A G G C U C C A A A       
  B       A G G U U C G A A A      
  C       A G C C C C G A A A       
  D       A U U U C C G A A C      
                        

 

          A   B   C
  B       2
  C       3   3
  D       6   4   4


Last updated: 8 August 1997.
Created by: Fred Opperdoes