
FERGUS(S)ON DNA PROJECT

Tom, Dick & Harry's Phylogenetic Tree


Introduction

The tree at right serves to define a phylogenetic tree. In this instance it is called a phylogram because the branch lengths are measures of time.
The genetic genealogy community at large makes much use of cladograms, which are phylogenetic trees designed simply to show clusters of related people known as clades; their branch lengths have no correlation with time. The same software that creates cladograms can also create phylograms, but this feature has not been widely used. The underlying reasons seem to be the lack of an online introduction to the subject and the fact that the software generates labels that are not self-explanatory.
The main purpose of this page is to define the labels in the outfile generated by the PHYLIP executable file kitsch.exe and to show how it can be used to estimate the age of a progenitor or of a clade thereof.

TMRCA Tables

Here's where we begin. This HTML table and the PHYLIP compatible TMRCA table were generated using Dean McGee's Y-DNA Comparison Utility.
Time to Most Recent Common Ancestor (Years)
ID         Thomas  Richard  Harry
Thomas         67      360    180
Richard       360       67    180
Harry         180      180     25
Cell color legend: 0-270 Years, 300-570 Years, 600-870 Years, 900-1170 Years
- Infinite allele mutation model is used
- Average mutation rate: 0.0024
- Values on the diagonal indicate number of markers tested
- Probability is 50% that the TMRCA is no longer than indicated
- Average generation: 30 years
PHYLIP compatible TMRCA table
3
    Thomas 0 360 180
   Richard 360 0 180
     Harry 180 180 0
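
As a rough illustration of the layout above, the following Python sketch produces a table of the same shape: a first line giving the number of persons, then one row per person with the name right-aligned in a 10-character field followed by the distances. The function name format_phylip_table is hypothetical, and the field width is inferred from the table shown here rather than taken from the Y-DNA Comparison Utility.

```python
def format_phylip_table(names, distances):
    """Return a PHYLIP-style distance table as text.

    Layout inferred from the table above (an assumption): the first line is
    the number of persons, then each row holds the name right-aligned in a
    10-character field followed by space-separated distances.
    """
    lines = [str(len(names))]
    for name, row in zip(names, distances):
        lines.append(f"{name:>10} " + " ".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


names = ["Thomas", "Richard", "Harry"]
tmrca_years = [
    [0, 360, 180],
    [360, 0, 180],
    [180, 180, 0],
]

table_text = format_phylip_table(names, tmrca_years)
print(table_text, end="")

# To save the text under the file name used on this page (no extension):
# with open("infile", "w") as handle:
#     handle.write(table_text)
```

The write to disk is left commented out so that running the sketch does not overwrite an existing infile.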

Phylogenetic Tree

The PHYLIP compatible TMRCA table is stored as a text file named infile, with no extension, and is read by the PHYLIP executable file kitsch.exe. This program uses the Fitch-Margoliash and least squares methods with an evolutionary clock. The clock constrains the resultant tree so that the sum of branch lengths from the root node to each person is the same. In using this method one is forced to assume that all persons in a surname project share the same birthdate. The program generates a text file named outfile, which is shown below.

Outfile

   3 Populations

Fitch-Margoliash method with contemporary tips, version 3.66

                  __ __             2
                  \  \   (Obs - Exp)
Sum of squares =  /_ /_  ------------
                                2
                   i  j      Obs

negative branch lengths not allowed


            +-------------------------------------------------  Richard 
  +---------1 
--2         +-------------------------------------------------    Thomas
  ! 
  +----------------------------------------------------------    Harry 


Sum of squares =      0.400

Average percent standard deviation =  31.62278

From     To            Length          Height
----     --            ------          ------

   1     Richard      90.00000       108.00000
   2      1           18.00000        18.00000
   1       Thomas     90.00000       108.00000
   2       Harry     108.00000       108.00000


Outfile Discussion

A discussion is necessary because it's not obvious what the output data represent.

Length, Height

Length would more properly be called branch length and represents the time in years between the nodes named in the "From" and "To" columns. The calculations below illustrate how to sum branch lengths from the root node 2 to the present for any individual. Notice that the answer in each case is the same, as imposed by the evolutionary clock constraint.
Richard: 18 + 90 = 108
Thomas: 18 + 90 = 108
Harry: 108
The simple calculation above illustrates what is meant by height; it is the sum of branch lengths from the root node to the node named in the "To" column.
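
The same bookkeeping can be sketched in Python. The branch table below is copied from the outfile; the helper name height and the dictionary layout are illustrative choices, not part of PHYLIP.

```python
# Branch table copied from the outfile: (from_node, to_node) -> length.
branches = {
    ("1", "Richard"): 90.0,
    ("2", "1"): 18.0,
    ("1", "Thomas"): 90.0,
    ("2", "Harry"): 108.0,
}


def height(node, root="2"):
    """Sum the branch lengths on the path from the root down to `node`."""
    total = 0.0
    while node != root:
        # Every non-root node appears exactly once in the "To" column.
        parent, length = next(
            (frm, ln) for (frm, to), ln in branches.items() if to == node
        )
        total += length
        node = parent
    return total


heights = {to: height(to) for (_, to) in branches}
for name, value in heights.items():
    print(f"{name:>8}: {value:.1f}")
```

Every person comes out at 108 and node 1 at 18, which agrees with the Height column of the outfile.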

Ages

This is the sum of the branch lengths from the root node to any of the individuals, multiplied by two. Hence in the present example the common ancestor of Tom, Dick and Harry lived 2 x 108 = 216 years before they were born. If Tom, Dick and Harry are assumed to be 50 years old, one concludes that the progenitor was born 266 years ago.
Tom and Dick are a clade. The birth year of the clade progenitor is twice the branch length from node 1 to Richard plus Richard's age: (2 x 90) + 50 = 230 years ago.
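
A minimal sketch of this arithmetic follows. The function name years_ago_born is hypothetical, and the age of 50 is the assumption made in the text, not a measured value.

```python
def years_ago_born(branch_sum, present_age):
    """Years before the present at which an ancestor was born.

    branch_sum: sum of outfile branch lengths from the ancestor's node to
    any one of its descendants; it is doubled as described in the text.
    present_age: assumed age today of the tested persons.
    """
    return 2 * branch_sum + present_age


ASSUMED_AGE = 50  # assumption from the text

# Root node 2 to any person: 108 (the Height column of the outfile).
progenitor_years_ago = years_ago_born(108, ASSUMED_AGE)

# Node 1 to Richard: 90 (the Length column of the outfile).
clade_years_ago = years_ago_born(90, ASSUMED_AGE)

print(progenitor_years_ago, clade_years_ago)
```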

Why the factor of two?

The factor of two is necessary because the distances between terminus nodes of the present generation are defined in PHYLIP as the distance to the common ancestor and back again; i.e. twice the branch length and twice the TMRCA. Consider the simple tree below:
[Figure: Simple Tree]
The Fitch and Margoliash method works with distances defined as follows:
X = ( D_AB + D_AC - D_BC ) / 2
Y = ( D_AB + D_BC - D_AC ) / 2
Z = ( D_AC + D_BC - D_AB ) / 2

With these definitions the distance from A to B is twice the TMRCA for A and B; i.e. D_AB = X + Y. Since the PHYLIP compatible TMRCA table output by the Y-DNA Comparison Utility is supposed to be a matrix of the distances between all terminus nodes, it should have contained values that are twice the TMRCA between those nodes. Since it did not, one must multiply the branch lengths in the resultant output file by two in order to interpret them as times.
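
The three formulas can be checked with a short Python sketch. The function name three_point_branches is hypothetical. The distances used are the tip-to-tip path lengths read off the outfile tree (Richard to Thomas: 90 + 90 = 180; either of them to Harry: 90 + 18 + 108 = 216), taking A = Richard, B = Thomas and C = Harry as an illustrative labelling.

```python
def three_point_branches(d_ab, d_ac, d_bc):
    """Branch lengths X, Y, Z from the pairwise distances of three leaves."""
    x = (d_ab + d_ac - d_bc) / 2  # branch from the internal node to A
    y = (d_ab + d_bc - d_ac) / 2  # branch from the internal node to B
    z = (d_ac + d_bc - d_ab) / 2  # branch from the internal node to C
    return x, y, z


# Path lengths between persons, summed from the outfile branch lengths.
d_ab, d_ac, d_bc = 180.0, 216.0, 216.0

x, y, z = three_point_branches(d_ab, d_ac, d_bc)
print(x, y, z)

# Each pairwise distance is recovered as the sum of two branches,
# i.e. the path up to the shared node and back down again.
print(x + y == d_ab, x + z == d_ac, y + z == d_bc)
```

Here X and Y are the 90-unit branches from node 1, and Z is Harry's path to node 1 through the root (108 + 18 = 126).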

Standard Deviation

The criterion for finding the best tree used in the Fitch-Margoliash method is to minimize the sum of squares, which is equivalent to minimizing the percent standard deviation, both of which are defined as follows:
[Figure: Sum of Squares]
In these formulae n is the number of persons or leaves in the tree. The distances D_ij are as defined in the simple tree discussed earlier and are the observed values; i.e. twice the TMRCA between person i and person j. The distances E_ij are also as defined in the simple tree except that they are the estimated values; i.e. the sum of the branch lengths between person i and person j in any proposed tree.
In a perfect tree the sum of squares is zero because the estimated distances are equal to the observed distances. The Fitch-Margoliash method assumes the best tree is the one in which the sum of squared differences between the observed and estimated values is as small as possible. One way to discover the best tree would be to construct each and every possible tree and then choose the one with the smallest sum of squares. The Fitch-Margoliash method in particular is a mathematical construct for estimating the best tree without having to calculate each and every possible tree, which for large datasets would be too time consuming.
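
The two quantities can be sketched in Python as follows. The sum of squares follows the formula printed at the top of the outfile. Two details are assumptions rather than statements from this page: that the sum runs over every off-diagonal entry of the square matrix (so each pair is counted twice), and that the average percent standard deviation divides by the number of off-diagonal entries minus two. The second assumption does reproduce the pair of values reported in the outfile (0.400 and 31.62278). The function names and the small matrices are hypothetical.

```python
import math


def sum_of_squares(observed, expected):
    """Sum of (Obs - Exp)^2 / Obs^2 over all off-diagonal entries."""
    n = len(observed)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                diff = observed[i][j] - expected[i][j]
                total += diff ** 2 / observed[i][j] ** 2
    return total


def average_percent_sd(ss, n):
    """Assumed form: 100 * sqrt(ss / (n * (n - 1) - 2)) for n leaves."""
    off_diagonal = n * (n - 1)
    return 100 * math.sqrt(ss / (off_diagonal - 2))


# Hypothetical observed distances and two candidate sets of tree distances.
observed = [[0, 200, 400], [200, 0, 400], [400, 400, 0]]
perfect_fit = [[0, 200, 400], [200, 0, 400], [400, 400, 0]]
imperfect_fit = [[0, 220, 400], [220, 0, 400], [400, 400, 0]]

ss_perfect = sum_of_squares(observed, perfect_fit)
ss_imperfect = sum_of_squares(observed, imperfect_fit)
print(ss_perfect, ss_imperfect)

# Sum of squares reported in the outfile for three persons.
apsd_from_outfile = average_percent_sd(0.400, 3)
print(round(apsd_from_outfile, 5))
```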
The standard deviation provides a measure by which one can state the uncertainty in a constructed tree. According to Chebyshev's inequality, at least 50% of the estimated branch lengths are within 1.41 standard deviations of their observed values and at least 89% are within 3 standard deviations. In a tree typical of the Fergus(s)on project the standard deviation is about 20%, and 1.41 times that is 28%. For a time scale calculated to be 300 years this equates to 85 years. Hence if one calculated the age of a clade progenitor to be 300 years, one can say the probability that the actual age falls within 300 ± 85 years is at least 50%. It's important to note that this is only the uncertainty that can be attributed to the phylogenetic tree as it fits the input TMRCA. Those TMRCA are themselves uncertain for a variety of reasons, foremost of which is the treatment of palindromic markers, and that is a subject unto itself.
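
The figures in this paragraph can be reproduced with a few lines of Python. The function name chebyshev_lower_bound is hypothetical; the 20% standard deviation and the 300-year age are the example values from the text.

```python
def chebyshev_lower_bound(k):
    """Chebyshev's inequality: at least 1 - 1/k^2 of values lie within
    k standard deviations of the mean (meaningful for k > 1)."""
    return 1 - 1 / k ** 2


bound_141 = chebyshev_lower_bound(1.41)  # roughly one half
bound_3 = chebyshev_lower_bound(3)       # roughly 0.89

age_years = 300    # example clade progenitor age from the text
percent_sd = 0.20  # typical standard deviation quoted in the text
k = 1.41

uncertainty_years = age_years * percent_sd * k

print(f"within {k} SD: at least {bound_141:.0%}")
print(f"within 3 SD: at least {bound_3:.0%}")
print(f"{age_years} +/- {uncertainty_years:.0f} years")
```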
See also TMRCAs in Perspective.



Copyright © 2008 Fergus(s)on DNA Project