A Generalized Approach

This section details how the Big Tree can be used to count mutations in a manner consistent, in principle, with the method described by Adamov et al. (2015). The diagram below is a simplified example for the purposes of discussion. It shows four individuals whose most recent common ancestor is represented by the parent clade. The numbers in red are the countable SNP in each block. The subclade blocks hold shared SNP counts, whereas the gray blocks hold the private SNP counts unique to one individual. Because testing is imperfect, an individual's test results may not actually show every shared SNP. To account for this, all the SNP in every subclade block between the parent clade and the individual are counted for that individual.

SNP Count
This table shows the mutation counts and computed ages corresponding to the diagram above.
ID       Private Count  Private Age (years)  Shared Count  Shared Age (years)  Total Count  Total Age (years, private + shared)
1        0              0                    3             439                 3            439
2        3              469                  3             439                 6            908
3        2              313                  1             146                 3            459
4        5              782                  0             0                   5            782
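The counting rule can be sketched in a few lines of Python. The tree shape used here is an assumption: it is one arrangement of subclade blocks that reproduces the shared counts in the table, and the actual diagram may be arranged differently. The block names "A" and "B" and the helper `shared_count` are hypothetical.

```python
# Hypothetical tree: block name -> (parent block, countable SNP in the block).
# The parent clade contributes 0 because its own SNP predate the common ancestor.
BLOCKS = {
    "parent": (None, 0),
    "A": ("parent", 1),
    "B": ("A", 2),
}

# Tester ID -> (terminal block, private SNP count), private counts from the table.
TESTERS = {1: ("B", 0), 2: ("B", 3), 3: ("A", 2), 4: ("parent", 5)}


def shared_count(block):
    """Sum the countable SNP of every block from `block` up to the parent clade."""
    total = 0
    while block is not None:
        parent, count = BLOCKS[block]
        total += count
        block = parent
    return total


for tester_id, (block, private) in TESTERS.items():
    shared = shared_count(block)
    print(tester_id, private, shared, private + shared)
```

Every tester under a block is credited with the full block count, whether or not the test actually reported each SNP, which is the point of filling in from the haplotree.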

When YFull computes an age, it excludes any SNP found outside a region it defines as the combBED region. This removes most of the problematic SNP from the calculation. Coverage of that region varies from test to test. Subclade blocks and FGC tests are treated as covering 100% of the base pairs in the region, whereas BigY tests are observed to cover 93% of the region on average. The specific numbers input into the computation are given in the table below.

This table shows the assumed input parameters.

Parameter          Value     Units                                   Note
Private coverage   7900000   base pairs                              93% of the combBED region
Shared coverage    8450000   base pairs                              100% of the combBED region
Rate               2.55E-08  mutations per base pair per generation  Used by YFull
Generation length  31.5      years per generation                    Used by YFull

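The table below shows the results of the computation. The TMRCA is the average of the per-tester total ages. The standard error about the mean is computed as 1.96 * (σ/N) * TMRCA / sqrt(N - 1), where N is the average number of mutations per tester and σ is the standard deviation in mutation count.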

Results

N (average mutations per tester)  σ (standard deviation in mutation count)  Standard Error (± years)  TMRCA (years)
4.25                              1.30                                      215                       647
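The results row can be reproduced from the per-tester totals with a short Python sketch. Two things here are assumptions rather than statements from the text: the standard error expression is read as 1.96 * (σ/N) * TMRCA / sqrt(N - 1) with N the average mutation count, and σ is taken as the population standard deviation. Both choices reproduce the tabulated values. The function name `tmrca_summary` is illustrative.

```python
import math
import statistics


def tmrca_summary(total_counts, total_ages):
    """Return (N, sigma, standard error, TMRCA) for a set of testers."""
    n_avg = statistics.fmean(total_counts)    # N: average mutations per tester
    sigma = statistics.pstdev(total_counts)   # population standard deviation (assumed)
    tmrca = statistics.fmean(total_ages)      # average of the per-tester total ages
    # Assumed reading of the standard error expression.
    std_err = 1.96 * (sigma / n_avg) * tmrca / math.sqrt(n_avg - 1)
    return n_avg, sigma, std_err, tmrca


# Total counts and total ages for the four testers in the simplified example.
n_avg, sigma, std_err, tmrca = tmrca_summary([3, 6, 3, 5], [439, 908, 459, 782])
print(round(n_avg, 2), round(sigma, 2), round(std_err), round(tmrca))
```

Note that the expression divides by sqrt(N - 1) with N the average mutation count, not the number of testers, so adding testers narrows the error only through its effect on σ and N.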

Countable SNP

The YFull method uses eight criteria to define countable SNP for any one individual:

  1. “Reg” criterion - The coordinates of the SNPs must fall within the combBED regions
  2. “Indel” criterion - Insertions and deletions (called "Indels") are excluded, as are multiple nucleotide polymorphisms (SNPs with more than one base position).
  3. “Locs” criterion - Variants detected in more than five different "localizations" are excluded.
  4. “Reads” criterion - SNPs with only one or two "reads" are excluded.
  5. “Qual” criterion - SNPs are excluded if the "read quality" is less than 90% according to YFull's proprietary SNP rating system.
  6. “Post mortem” criterion - Used for ancient samples and not applicable here
  7. “Single SNP” criterion - Exclusion of variants with Double Nucleotide Polymorphisms (DNP)
  8. “Trash” criterion - In general, these are variants in palindromic segments and segments with repetitive copies at other Y-chromosome segments.

By the nature of the tree, one can assume that with complete coverage and high-quality reads an individual's test result would show all the SNP that define a subclade block. Thus criteria 4 and 5 do not apply when using subclade blocks, but they do apply to private SNP.

It is assumed that these four criteria are sufficient for the Big Tree blocks (a spreadsheet is available to facilitate the count):

  1. “Reg” criterion - Exclude any variant outside the combBED region.
  2. “Indel” criterion - Exclude any variant that alters the length of the REF allele.
  3. “Locs” criterion - Use the YFull SNP search engine and exclude any variant showing more than five localizations.
  4. “Trash” criterion - Exclude any variant that falls within the Problem regions listed in MikeW's Discovery V1 spreadsheet.
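The four criteria above can be sketched as a simple filter. Everything concrete in this sketch is hypothetical: the `Variant` record, the region lists, and the coordinates are made up for illustration, and the localization count is assumed to have been looked up by hand in the YFull SNP search engine. Regions are treated as inclusive (start, end) pairs for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    position: int        # Y-chromosome coordinate
    ref: str             # reference allele
    alt: str             # alternate allele
    localizations: int   # number of localizations, entered by hand


def in_any_region(position, regions):
    """True if position falls inside any inclusive (start, end) interval."""
    return any(start <= position <= end for start, end in regions)


def is_countable(variant, comb_bed, problem_regions, max_localizations=5):
    if not in_any_region(variant.position, comb_bed):       # "Reg"
        return False
    if len(variant.ref) != len(variant.alt):                # "Indel"
        return False
    if variant.localizations > max_localizations:           # "Locs"
        return False
    if in_any_region(variant.position, problem_regions):    # "Trash"
        return False
    return True


# Made-up regions and variants, for illustration only.
COMB_BED = [(2_700_000, 2_900_000)]
PROBLEM_REGIONS = [(2_800_000, 2_810_000)]
block = [
    Variant("ok", 2_750_000, "A", "G", 1),
    Variant("outside", 1_000_000, "C", "T", 1),
    Variant("indel", 2_760_000, "A", "AT", 1),
    Variant("many_locs", 2_770_000, "G", "A", 6),
    Variant("trash", 2_805_000, "T", "C", 1),
]
countable = [v.name for v in block if is_countable(v, COMB_BED, PROBLEM_REGIONS)]
print(countable)
```

The countable SNP for a block is then simply the number of variants that survive all four checks.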

Example - M222 Block

The table below compares the YFull M222 block with the Big Tree M222 block. The two trees list different totals of SNP, presumably because they apply different Reads and Quality criteria, but the exclusion criteria mitigate this effect and lead to the same number of countable SNP.
R-M222                 YFull YTree v4.08  Big Tree 9/23/16
Total SNP              38 (1)             45 (2)
SNP in combBED Region  23                 27
Countable SNP          22                 22
Samples                96                 238

  1. Includes 4 not listed in the Big Tree.
  2. Includes 11 not listed in the YFull YTree. Two of these are M11666/CTS11001 and M2626/S629, which are excluded by the YFull “Locs” criterion.

It is worth noting that the block age in the YFull YTree is the difference between two dates: the TMRCA of the block together with all its brother blocks (the "formed" date) and the TMRCA of all the sons of the block. In this case that difference is 4300 - 1960 = 2340 years, even though the 22 countable SNP in the block should have equated to 3540 years. The reasons for this include a convoluted constraint that a parent clade cannot be younger than a child clade. Hence whenever the TMRCA computed for a block exceeds its "formed" date, YFull sets TMRCA = "formed" and adds a footnote to the "info table".
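A minimal sketch of this block-age arithmetic and of the constraint as described here follows. The function names are hypothetical and this is only a reading of the behavior described above, not YFull's actual implementation. The value 4500 in the example is made up to show the constraint taking effect.

```python
def block_age(formed, tmrca_of_sons):
    """Block age as the difference between the "formed" date and the sons' TMRCA."""
    return formed - tmrca_of_sons


def constrained_tmrca(computed_tmrca, formed):
    """A clade's TMRCA is capped at its "formed" date (assumed reading of the rule)."""
    return min(computed_tmrca, formed)


print(block_age(4300, 1960))          # M222 figures quoted above
print(constrained_tmrca(4500, 4300))  # hypothetical computed TMRCA above "formed"
```

Because the cap only ever lowers a TMRCA, applying it can shorten a block relative to what its SNP count alone would suggest.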

Inspection of the info table for R-DFZ2961 reveals that the contribution of R-M222 to the TMRCA of the sons of R-DFZ2961 is 4905 years, with an average SNP count of 31.2. The TMRCA of the sons of R-M222 is 2000 years, with an average SNP count of 12.6. The average SNP count attributed to the M222 block is therefore 31.2 - 12.6 = 18.6. Compared with the 22 countable SNP in the block, YFull missed on average 22 - 18.6 = 3.4 SNP within that block. This happens because the ages are computed as the average over individuals within a block, without the constraint that everyone in a block has the same number of SNP as dictated by the haplotree.
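The arithmetic behind the shortfall is small enough to write out. The variable names are illustrative, and the values are those quoted in the text.

```python
COUNTABLE_IN_BLOCK = 22      # countable SNP in the M222 block
avg_through_block = 31.2     # average SNP count, sons of R-DFZ2961 via R-M222
avg_below_block = 12.6       # average SNP count, sons of R-M222

# Average number of block SNP actually credited per individual.
avg_in_block = avg_through_block - avg_below_block

# Average number of block SNP missed per individual.
missed = COUNTABLE_IN_BLOCK - avg_in_block

print(round(avg_in_block, 1), round(missed, 1))
```

Filling in from the haplotree would credit every individual with all 22 block SNP, which removes this shortfall.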

Example - TMRCA under R-A223

Count

The figure above shows the SNP counts for the different blocks under haplogroup R-A223 in the Big Tree.

ID       Private Count  Private Age (years)  Shared Count  Shared Age (years)  Total Count  Total Age (years, private + shared)
23702    5              782                  5             731                 10           1513
H1936    0              0                    5             731                 5            731
21609    3              469                  5             731                 8            1200
235983   2              313                  5             731                 7            1044
B41860   6              938                  3             439                 9            1377
7213     2              313                  5             731                 7            1044
181988   5              782                  5             731                 10           1513
9399     4              625                  2             292                 6            918
229321   4              625                  2             292                 6            918
N118456  4              625                  4             585                 8            1210
223269   5              782                  4             585                 9            1367
12701    4              625                  2             292                 6            918
229652   2              313                  2             292                 4            605
294717   1              156                  4             585                 5            741
56323    6              938                  6             877                 12           1815
366286   8              1251                 1             146                 9            1397
N113724  7              1095                 0             0                   7            1095
160628   10             1564                 0             0                   10           1564

For comparison, the results are set against those in the YFull YTree in the table below. There is no reason to expect a perfect match because the data sets differ, but the two estimates do overlap within their standard errors. It is suggested that the YFull TMRCA is an underestimate because that computation does not fill in missing SNP that can be deduced from an individual's position within the haplotree.

Results

Source              Samples  N (average mutations per tester)  σ (standard deviation in mutation count)  Standard Error (± years)  TMRCA (years)
This Approximation  18       7.67                              2.08                                      240                       1165
YFull YTree v4.08   9        6                                 1.98                                      275                       950
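The "This Approximation" row can be recomputed end to end from the private and shared counts in the R-A223 table. The sketch below rests on three assumptions: years per SNP equal the years per generation divided by rate times coverage, σ is the population standard deviation, and the standard error is read as 1.96 * (σ/N) * TMRCA / sqrt(N - 1) with N the average mutation count. These choices reproduce the tabulated figures for both rows. The names are illustrative.

```python
import math
import statistics

# (kit ID, private count, shared count), copied from the R-A223 table above.
TESTERS = [
    ("23702", 5, 5), ("H1936", 0, 5), ("21609", 3, 5), ("235983", 2, 5),
    ("B41860", 6, 3), ("7213", 2, 5), ("181988", 5, 5), ("9399", 4, 2),
    ("229321", 4, 2), ("N118456", 4, 4), ("223269", 5, 4), ("12701", 4, 2),
    ("229652", 2, 2), ("294717", 1, 4), ("56323", 6, 6), ("366286", 8, 1),
    ("N113724", 7, 0), ("160628", 10, 0),
]

# Years per SNP from the assumed input parameters (rate 2.55E-08 per base pair
# per generation, 31.5 years per generation, coverage in base pairs).
YEARS_PER_PRIVATE_SNP = 31.5 / (2.55e-8 * 7_900_000)
YEARS_PER_SHARED_SNP = 31.5 / (2.55e-8 * 8_450_000)


def standard_error(n_avg, sigma, tmrca):
    """Assumed reading of the standard error expression."""
    return 1.96 * (sigma / n_avg) * tmrca / math.sqrt(n_avg - 1)


totals = [private + shared for _, private, shared in TESTERS]
ages = [private * YEARS_PER_PRIVATE_SNP + shared * YEARS_PER_SHARED_SNP
        for _, private, shared in TESTERS]

n_avg = statistics.fmean(totals)     # N: average mutations per tester
sigma = statistics.pstdev(totals)    # population standard deviation (assumed)
tmrca = statistics.fmean(ages)       # average of the per-tester total ages
std_err = standard_error(n_avg, sigma, tmrca)

print(round(n_avg, 2), round(sigma, 2), round(std_err), round(tmrca))
# Same expression applied to the YFull row (N = 6, sigma = 1.98, TMRCA = 950).
print(round(standard_error(6, 1.98, 950)))
```

The two intervals, roughly 925 to 1405 years and 675 to 1225 years, overlap, which is the sense in which the estimates agree within their standard errors.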

References

  1. Defining a New Rate Constant for Y-Chromosome SNPs based on Full Sequencing Data by Adamov, Guryanov, Korzhavin, Tagankin, Urasin (2015)




Copyright © 2006 Fergus(s)on Y-DNA Project