PROTML: Maximum Likelihood Inference of Protein Phylogeny

Jun Adachi 1) and Masami Hasegawa 2)

1) Department of Statistical Science,
The Graduate University for Advanced Study
4-6-7 Minami-Azabu, Minato-ku, Tokyo 106, Japan

2) The Institute of Statistical Mathematics
4-6-7 Minami-Azabu, Minato-ku, Tokyo 106, Japan



PROTML is a PASCAL program for inferring evolutionary treesfrom protein (amino acid) sequences by using maximum likelihood.

A maximum likelihood method for inferring trees from DNA orRNA sequences was developed by Felsenstein (1981). The methoddoes not impose any constraint on the constancy of evolutionaryrate among lineages. He wrote a program (DNAML) implementing themethod, and included it in his program package, PHYLIP. Theprogram has been used extensively and has proved of great use inphylogenetic studies (Hasegawa and Yano, 1984a; Hasegawa et al.,1985, 1990a; Hasegawa and Kishino, 1989; Kishino and Hasegawa,1989; Zillig et al., 1989; Hasegawa, 1990, 1991; Golenberg etal., 1990; Adkins and Honeycutt, 1991; Doebley et al., 1991;Edwards et al., 1991; Les et al., 1991; Ruvolo et al., 1991;Disotell et al., 1992; Lockhart et al., 1992). Computersimulations have demonstrated that the method is highly efficientin estimating a true tree under various situations such as aviolation of rate constancy among lineages (Hasegawa and Yano,1984b; Hasegawa et al., 1991).

DNAML, however, is confined to DNA or RNA sequence data, andis not applicable to protein sequence data. In phylogeneticstudies of deep branchings, such as those among the three majorkingdoms of eukaryotes, archaebacteria and eubacteria, and thosein the early evolution of eukaryotes, ribosomal RNA sequence datahas been used widely (e.g., Woese, 1989; Sogin et al., 1989). Inspite of many works on the ribosomal RNAs, the universal root ofall contemporary organisms on the earth including eukaryotes,archaebacteria and eubacteria remained uncertain. Miyata and hiscoworkers demonstrated the usefulness of using amino acidsequence data encoded by duplicated genes (duplicated prior tothe divergence among the major kingdoms) in establishing theuniversal root (Iwabe et al., 1989; Hasegawa et al., 1990b;Miyata et al., 1991). Furthermore, an evolutionary tree inferredfrom ribosomal RNA data is sometimes misleading when basecomposition differs widely among lineages, and a tree inferredfrom protein sequences is more reliable in such cases (Loomis andSmith, 1990; Hasegawa et al., 1993).

Because no program was available for inferring a proteintree by maximum likelihood based on a reasonable model of aminoacid substitutions, many authors used DNAML in analyzingprotein-encoding DNA sequences. As is well known, the thirdposition of codons evolve more rapidly than other positions, andtherefore DNAML was designed so that a user could specify therelative rates of substitutions in several categories ofpositions. This approach seems to be good in many cases when oneis interested in phylogenetic relationships among closely relatedspecies.

Even if the rate difference among positions in a codon aretaken into account, however, inclusion of the third positions inthe analysis can sometimes be misleading when the pattern ofcodon usage differs among lineages. Furthermore, the assumption(in DNAML) of independent evolution among three positions of acodon can be a serious defect when one is interested in tracingdeep branchings, because a (negative) selection is likely to beoperating at the codon level, rather than at the individual sitesin the codon. Even if nucleotide frequencies of protein-encodinggenes differ among lineages, amino acid frequencies may notdiffer significantly (Adachi and Hasegawa, 1992). Therefore, ifthe amino acid substitution process can be represented by anappropriate model, it seems to be better to handle amino acidsequences rather than nucleotide sequences in estimating ordersof deep branchings from data of a protein-encoding gene, andthere is an increasing demand for a maximum likelihood program toinfer protein phylogenies.

Kishino et al. (1990) developed a maximum likelihood methodfor inferring protein phylogenies that takes account of unequaltransition probabilities among pairs of amino acids by using anempirical transition matrix compiled by Dayhoff et al. (1978),and the model is called the Dayhoff model (Hasegawa et al.,1992). Although the transition matrix was constructed from alimited data set (accumulated up to 1978) of proteins encoded innuclear DNA, the Dayhoff model is not necessarily specific onlyto those proteins, but is appropriate in approximating the aminoacid substitutions of wider protein species includingmitochondrial ones (Hasegawa et al., 1993; Adachi and Hasegawa,1992; Adachi et al., 1992).

The original program for private use in Kishino et al.(1990), Mukohata et al. (1990), Hasegawa et al. (1990b), Iwabe etal. (1991), and Miyata et al. (1991) was written in FORTRAN andthe number of species in the maximum likelihood analysis wasconfined to five. In writing this program "PROTML" for publicuse, we took advantage of another computer language, PASCAL, torepresent the tree structure of the data. In this program, thereis no limit on the number of species, provided the computer isbig enough.

Since the number of possible tree topologies increasesexplosively as the number of species increases (Felsenstein,1978a), it is a serious problem to find the best tree among thehuge number of alternatives. We have developed a novel algorithmfor searching tree topologies, called "star decomposition", whichseems to be effective in finding the best tree.

The parsimony method has been used widely in molecularphylogenetics, but it may be positively misleading when theevolutionary rate differs among lineages (Felsenstein, 1978b).PROTML has proved of great use in inferring evolutionary treeseven in such situations (Hasegawa et al., 1992), and has beenapplied to several phylogenetic problems (Hasegawa et al., 1993;Adachi and Hasegawa, 1992; Adachi et al., 1992; Hashimoto et al.,1993).

The overall structure of PROTML is similar to that ofFelsenstein's DNAML. We owe very much to DNAML in the writingPROTML, and have adopted several fundamental routines from theDNAML program. Furthermore, the input format of PROTML is quitesimilar to that of DNAML. Features where PROTML differs fromDNAML (up to version 3.4) are as follows:

(1) Amino acid sequence data are analyzed based on Dayhoff's model(1978). (2) The likelihood of multifurcating trees can be estimated. (3) A novel method of topology search ("star decomposition") is adopted. (4) The Newton method is adopted in the maximization of likelihood. (5) Bootstrap probabilities of candidate trees can be estimated.



Topological Data Structure

Felsenstein considered a data structure representing theunrooted tree, where each internal node (excluding external nodesor tips) is decomposed into elements, the number of whichcoincides with those of branches stemming from the node. Theelements are connected circularly through the pointers.

By adopting such a data structure, we can store a partiallikelihood of a sub-tree stemming from the node. This means that,when we estimate the likelihood of the tree, we need notcalculate likelihood through iteration of a loop by the times ofthe number of nodes in revising the estimate of each branchlength, but need only revise the partial likelihoods of two nodesof each branch.

We extend this data structure so that a multifurcating treecan be represented. Since branches are connected dynamically bypointers, the data structure can easily be revised when adifferent tree topology is adopted, and furthermore not onlybifurcating trees but also multifurcating trees can berepresented quite easily. The extreme limit of a multifurcatingtree is the star-like tree.

Automatic Topology Search by Star Decomposition

The straightforward approach to inferring a tree would be toevaluate all possible tree topologies one after another and pickthe one which gives the highest likelihood. This would not bepossible for more than a small number of species, since thenumber of possible tree topologies is enormous (Felsenstein,1978a).

The strategy that Felsenstein's DNAML employs is as follows:the species are taken in the order in which they appear in theinput file. The first three are taken and an unrooted tree isconstructed containing only those three. Then, the fourth speciesis taken, and it is evaluated to see where it might best be addedto the tree. All possibilities (bifurcating trees) for adding thefourth species are examined. The best one under the likelihoodcriterion is chosen to be the basis for further operations. Thenthe fifth species is added, and again the best placement of it ischosen, and so on. At each step, local rearrangements of a treeare examined. This procedure is continued until a bifurcatingtree connecting all the species is obtained (Felsenstein, 1992).

The resulting tree from this procedure generally depends onthe order of the input species. Hence, Felsenstein recommendsperforming a number of runs with different orderings of the inputspecies.

The alternative strategy which we employ in the automaticand semi-automatic search options of PROTML is called "stardecomposition". This is similar to the procedure employed in theneighbor-joining method using a distance matrix (Saitou and Nei,1987). This method starts with a star-like tree. Decomposing thestar-like tree step by step, we finally obtain a bifurcating treeif all multifurcations can be resolved with statisticalconfidence. Since the information from all of the species underanalysis is used from the beginning, the inference of the treetopology is likely to be stable by this procedure.

Let be the number of species under analysis. At first, astar-like tree containing species is constructed. Then, a pair ofspecies is separated from others. Among all possible pairwisecombinations of species, a pairing that gives the highestlikelihood is chosen. The resulting tree can be regarded as astar-like tree with groups (a single species may form a group),if the selected pair is regarded to form a group. This procedureis continued until all multifurcating nodes are resolved intobifurcating ones.

When the information content of the data is not large enoughto discriminate among alternative branching orders, it might bemisleading to resolve all the multifurcations into bifurcations.Hence, by using "Akaike Information Criteria (AIC)" (Akaike,1974), the program decides whether the multifurcation shouldfurther be resolved or not.




The program allows various options that alter the amount ofinformation the program is provided or what it is to be done withthe information. The program is notified that an option has beeninvoked by the presence of one or more letters after the lastnumber on the first line of the input file. These letters may ormay not be separated from each other by blanks, though it isusually necessary to separate them from the number by a blank.They can be in any order. Thus to invoke options U, W and B, theinput file starts with the line:

19 409 UWB


19 409 W U B

This program has three mode of topology search; i.e.,Automatic mode, Semi-automatic mode and User tree (manual) mode.

Automatic mode. Unless specified otherwise, the procedure uses automatic mode, so it starts with a star-like tree.

"S" : Semi-automatic mode. Semi-automatic mode starts with a multifurcating tree that a user designates.

"U" : User tree mode. User tree (manual) mode is similar to the "U" option in Felsenstein's DNAML. This mode calculates the likelihood of all user defined topologies. Different from DNAML, this program allows multifurcating trees as user trees.

"W" : Write option. Using this option, the program will produce more information than it dose for standard output.

"B" : Bootstrap option. This option gives the approximate bootstrap probabilities of candidate trees by a resampling of estimated likelihood (RELL) method (Kishino et al., 1990).

Format of input data file

We have tried to adhere to a rather stereotyped input formatsimilar to that of Felsenstein's programs. The simplest versionof the input file looks something like this:


The first line of the input file contains the number ofspecies and the length of amino acid sequences, in free format,separated by blanks. The information for each species follows,starting with a ten-character species name (which can includepunctuation marks), and continuing with the characters for thatspecies.

An input file has three parts of data; i.e., arguments,sequences and topologies.

1. Arguments The first line of the file gives number of species, sequence length, and options.

2. Sequences The following lines give species names and amino acid sequence data. The amino acids must be specified by the one letter codes adopted by IUPAC-IUB Commission on Biochemical Nomenclature (1968). The amino acid code must be one of the twenty.

3. Topologies If one specifies User or Semi-automatic options, one mast specify the number of topologies followed by the topologies themselves.

This program allows the option U, which signals that user-defined tree(s) are provided. The topologies of these trees aresupplied AFTER the species and sequence data, rather than beforethem. The letter U appears on the first line of the file. Afterthe species and sequence data, a line containing the number ofuser-defined trees appears. Each user-defined tree starts on anew line. Here is an example with three user-defined trees:

     5   40   U   B 
3 (((species1,species2),species3),species4,species5) 
An example of semi-auto mode is as follows:
     5   40   S 

The tree topology is specified by nested pairs ofparentheses, enclosing species names and separated by commas.Trailing blanks in the name may be omitted. The pattern of theparentheses indicates the pattern of the tree by having each pairof parentheses enclose all the members of a monophyletic group.The entire tree is enclosed in an outermost pair of parentheses.Note that the tree is an unrooted one, and therefore its basemust be multifurcation with a multiplicity of greater than orequal to three. A specification of a tree ends with a semicolonwhich may be omitted.

Program Constants

The CONSTants in program that may be changed by a user are:

CONST maxsp : maximum number of species maxnode : maxsp * 2 - 3 maxpair : maxsp * (maxsp-1) / 2 maxsite : maximum number of sites maxptrn : maximum number of different site patterns maxtree : maximum number of user trees maxsmooth : number of smoothing algorithm maxiterat : number of iterates of Newton method epsilon : stopping value of error minarc : lower limit on number of substitutions per branch maxarc : upper limit on number of substitutions per branch prprtn : proprtion of branch length maxboot : number of bootstrap replications maxexe : number of jobs maxline : length of sequence output per line maxname : maximum number of characters in species name maxami : number of amino acids minreal : if job is in underflow error, increase this value seqfname : input file of sequence data tpmfname : input file of transition probability lklfname : output file of log-likelihood

Output Format

The output usually consists of (1) the name of the program and its version number, (2) the input information printed out, and (3) a series of trees,some with associated information indicating how much change therewas in each character or on each part of the tree.

The tree grows from left to right and has branches that areapproximately proportional in length to the lengths that theprogram estimates. In some cases when branches are estimated tobe very short, the output makes them three spaces long so thatthe topology is clearly shown. Here is what a typical tree lookslike:

   :-----------1.Tabac.chl  0:   :        :-------2.Prochloro   :   :----6   :   :    :---3.Anacystis   :---7   :   :------------------5.Synechocy   :   :------4.Fremyella
  No.            number   Length       S.E.  ----------------------------------------------      Tabac.chl     1     9.44861  (  1.63423 )      Prochloro     2     5.69634  (  1.30862 )      Anacystis     3     1.57704  (  0.74325 )      Fremyella     4     4.92061  (  1.24297 )      Synechocy     5    16.05818  (  2.24639 )                    6     2.13260  (  0.86082 )                    7     1.01070  (  0.63908 )  ----------------------------------------------   ln L : -1813.614 (  66.205 )  AIC : 3641.229  ----------------------------------------------

Length refers to the estimated number of substitutions per100 amino acid sites along the branch leading to the node (orleaf) indicated by the number, and S.E. refers to its standarderror estimated by the formula of Kishino and Hasegawa (1989).


Installing PROTML and Executive Environment

Personal computer with MS-DOS + Turbo Pascal(Borland): e.g.IBM PCs and compatibles, NEC PC-98x. Please remove or changecomments marked as shown below:

(* <statements> TURBO Pascal *)

UNIX Workstation + standard Pascal compiler: e.g. SUN.Please remove or change comments marked as shown below:

(* <statements> SUN Pascal *)

Mainframe computer (IBM and compatibles) + standard Pascalcompiler. For example, JCL (Job Control Language) of batch job.



How to contact developers

The best way to contact developers is to send an E-mail.


If you prefer, write a letter with your comments and send it to

Jun Adachi Department of Statistical Science, The Graduate University for Advanced Study, 4-6-7 Minami-Azabu, Minato-ku, Tokyo 106, Japan FAX: +81-3-3446-1695

Please send a mail with the following information

1. Computer brand, model. 2. The brand and version number of Pascal compiler. 3. Operating system and version number. 4. The input file of sequence data. 5. The output file.



We are particularly grateful to Dr. H. Kishino forinvaluable advices during the course of this work, and to Dr. J.Felsenstein for generously permitting us to use routines inDNAML. We also thank Drs. T. Hashimoto, T. Miyata and T. Yano fordiscussions and comments. This work was carried out under theInstitute of Statistical Mathematics Cooperative Research Program(90-ISM-57, 91-ISM-69), and was supported by grants from theMinistry of Education, Science, and Culture of Japan.



Adachi, J., Hasegawa, M. (1992) Amino acid substitution of proteins coded for in mitochondrialDNA during mammalian evolution. Jpn. J. Genet., 67:187-197.

Adachi, J., Cao, Y., Hasegawa, M. (1993) Tempo and mode of mitochondrial DNA evolution in vertebrates atthe amino acid sequence level: rapid evolution in warm-bloodedvertebrates. J. Mol. Evol., (in press).

Adkins, R.M., Honeycutt, R.L. (1991) Molecular phylogeny of the superorder Archonta. Proc. Natl.Acad. Sci. US., 88:10317-10321.

Akaike, H. (1974) A new look at the statistical model identification. IEEE Trans.Autom. Contr., 19:716-723.

Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C. (1978) A model of evolutionary change in proteins. In: Dayhoff, M.O.(ed.) Atlas of Protein Sequence Structur., Vol~5, Suppl~3.National Biomedical Research Foundation, Washington DC, pp.~345-352.

Disotell, T.R., Honeycutt, R.L., Ruvolo, M. (1992) Mitochondrial DNA phylogeny of the Old-World monkey tribePapionini. Mol. Biol. Evol., 9:1-13.

Doebley, J., Durbin, M., Golenberg, E.M., Clegg, M.T., Ma D.-P.(1990) Evolutionary analysis of the large subunit of carboxylase ( rbcL) nucleotide sequence among the grasses (Gramineae). Evolutio.,44:1097-1108.

Edwards, S.V., Arctander, P., Wilson, A.C. (1991) Mitochondrial resolution of a deep branch in the genealogicaltree for perching birds. Proc. Roy. Soc. Londo., B243:99-107.

Felsenstein. J. (1978a) The number of evolutionary trees. System. Zool., 27:27-33.

Felsenstein. J. (1978b) Cases in which parsimony and compatibility methods will bepositively misleading. System. Zool., 27:401-410

Felsenstein, J. (1981) Evolutionary trees from DNA sequences: a maximum likelihoodapproach. J. Mol. Evol., 17:368-376

Felsenstein, J. (1985) Confidence limits on phylogenies: an approach using thebootstrap. Evolutio., 39:783-791.

Felsenstein, J. (1992) Phylogenies from restriction sites: a maximum-likelihoodapproach. Evolutio., 46:159-173.

Golenberg, E.M., Giannasi, D.E., Clegg, M.T., Smiley, C.J.,Durbin, M., Henderson, D., Zurawski, G. (1990) Chlorolplast DNA sequence from a Miocene Magnolia species.Natur., 344:656-658.

Hasegawa, M., Yano, T. (1984a) Phylogeny and classification of Hominoidea as inferred from DNAsequence data. Proc. Japan Acad., B60:389-392.

Hasegawa, M., Yano, T. (1984b) Maximum likelihood method of phylogenetic inference from DNAsequence data. Bull. Biomet. Soc. Jp., 5:1-7.

Hasegawa, M., Iida, Y., Yano, T., Takaiwa, F., Iwabuchi, M.(1985) Phylogenetic relationships among eukaryotic kingdoms inferredfrom ribosomal RNA sequences. J. Mol. Evol., 22:32-38.

Hasegawa, M., Kishino, H. (1989) Confidence limits on the maximum-likelihood estimate of thehominoid tree from mitochondrial- DNA sequences. Evolutio.,43:672-677.

Hasegawa, M. (1990) Phylogeny and molecular evolution in primates. Jpn. J. Genet.,65:243-265.

Hasegawa, M. (1991) Molecular phylogeny and man's place in Hominoidea. J. Anthrop.Soc. Nippo., 99:49-61.

Hasegawa, M., Kishino, H., Hayasaka, K., Horai, S. (1990a) Mitochondrial DNA evolution in primates: Transition rate hasbeen extremely low in lemur. J. Mol. Evol., 31:113-121.

Hasegawa, M., Iwabe, N., Mukohata, Y., Miyata, T. (1990b) Close evolutionary relatedness of archaebacteria, Methanococcusand Halobacteriu., to eukaryotes demonstrated by compositephylogenetic trees of elongation factors EF-Tu and EF-G: eocytetree is unlikely. Jpn. J. Genet., 65:109-114.

Hasegawa, M., Kishino, H., Saitou, N. (1991) On the maximum likelihood method in molecular phylogenetics. J.Mol. Evol., 32:443-445.

Hasegawa, M., Cao, Y., Adachi, J., Yano, T. (1992) Rodent polyphyly? Natur., 355:595-595.

Hasegawa, M., Hashimoto, T., Adachi, J., Iwabe, N., Miyata, T.(1993) Early divergences in the evolution of eukaryotes: ancientdivergence of Entamoeba that lacks mitochondria revealed byprotein sequence data. J. Mol. Evol., (in press).

Hashimoto, T., Otaka, E., Adachi, J., Mizuta, K., Hasegawa, M.(1993) The giant panda is most close to a bear, judged by - and-hemoglobin sequences. J Mol Evol. (in press).

IUPAC-IUB Commission on Biochemical Nomenclature (1968) A one-letter notation for amino acid sequences, tentative rules.J. Biol. Chem., 243:3557-3559.

Iwabe, N., Kuma, K., Hasegawa, M., Osawa, S., Miyata, T. (1989) Evolutionary relationship of archaebacteria, eubacteria, andeukaryotes inferred from phylogenetic trees of duplicated genes.Proc. Natl. Acad. Sci. US., 86:9355-9359.

Iwabe, N., Kuma, K., Kishino, H., Hasegawa, M., Miyata, T. (1991) Evolution of RNA polymerases and branching patterns of the threemajor groups of archaebacteria. J. Mol. Evol., 32:70-78.

Kishino, H., Hasegawa, M. (1989) Evaluation of the maximum likelihood estimate of theevolutionary tree topologies from DNA sequence data, and thebranching order in Hominoidea. J. Mol. Evol., 29:170-179.

Kishino, H., Miyata, T., Hasegawa, M. (1990) Maximum likelihood inference of protein phylogeny and the originof chloroplasts. J. Mol. Evol., 30:151-160.

Les, D.H., Garvin, D.K., Wimpee, C.F. (1991) Molecular evolutionary history of ancient aquatic angiosperms.Proc. Natl. Acad. Sci. US., 88:10119-10123.

Lockhart, P.J., Howe, C.J., Bryant, D.A., Beanland, T.J., Larkum,A.W.D. (1992) Substitutional bias confounds inference of cyanelle origins fromsequence data. J. Mol. Evol., 34:153-162.

Loomis, W.F., Smith, D.W. (1990) Molecular phylogeny of Dictyostelium discoideum by proteinsequence comparison. Proc. Natl. Acad. Sci. US., 87:9093-9097.

Miyata, T., Iwabe, N., Kuma, K., Kawanishi, Y., Hasegawa, M.,Kishino, H., Mukohata, Y., Ihara, K., Osawa, S. (1991) Evolution of archaebacteria: Phylogenetic relationships amongarchaebacteria, eubacteria, and eukaryotes. In: Osawa, S., Honjo,T. (eds.) Evolution of Life: Fossils, Molecules, and Culture .Springer-Verlag, Tokyo, pp. 337-351.

Mukohata, Y., Ihara, K., Kishino, H., Hasegawa, M., Iwabe, N.,Miyata, T. (1990) Close evolutionary relatedness of archaebacteria witheukaryotes. Proc. Japan Acad., B66:63-67.

Saitou, N, Nei, M. (1987) The neighbor-joining method: a new method for reconstructingphylogenetic trees. Mol. Biol. Evol., 4:406-425.

Ruvolo, M., Disotell, T.R., Allard, M.W., Brown, W.M., Honeycutt,R.L. (1991) Resolution of the African hominoid trichotomy by use of amitochondrial gene sequence. Proc. Natl. Acad. Sci. US.,88:1570-1574.

Sogin, M.L., Edman, U., Elwood, H. (1989) A single kingdom of eukaryotes. In: Fernholm, B., Bremer, K., J rnvall, H. (eds.) The Hierarchyof Life . Elsevier Science Publisher, Amsterdam, pp. 133-143.

Woese, C.R. (1989) Bacterial evolution. Microbiol. Rev., 51:221-271.

Zillig, W., Klenk, H.-P., Palm, P., Leffers, H., P hler, G.,Gropp, F., Garrett, R.A. (1989) Did eukaryotes originate by a fusion event? Endocytobiosis CellRes., 6:1-25

Last updated: 8 August 1997.
created by :Fred Opperdoes