Technical Report 2012-026

Inferring Haplotypes and Parental Genotypes in Larger Full Sib-Ships and Other Pedigrees with Missing or Erroneous Genotype Data

Carl Nettelblad

September 2012


Background: In many contexts, pedigrees for individuals are known even though not all individuals have been fully genotyped. In one extreme case, the genotypes for a set of full siblings are known, with no knowledge of parental genotypes. We propose a method for inferring phased haplotypes and genotypes for all individuals, even those with missing data, in such pedigrees, allowing a multitude of classic and recent methods for linkage and genome analysis to be used more efficiently.

Results: By artificially removing the founder generation genotype data from a well-studied simulated dataset, the quality of reconstructed genotypes in that generation can be verified. For the full structure of repeated matings with 15 offspring per mating, 10 dams per sire, 99.89% of all founder markers were phased correctly, given only the unphased genotypes for offspring. The accuracy was reduced only slightly, to 99.51%, when introducing a 2% error rate in offspring genotypes. When reduced to only 5 full-sib offspring in a single sire-dam mating, the corresponding percentage is 92.62%, which compares favorably with 89.28% from the leading Merlin package. Furthermore, Merlin is unable to handle more than approximately 10 sibs, as the number of states tracked rises exponentially with family size, while our approach has no such limit and handles 150 half-sibs with ease in our experiments.

Conclusions: Our method is able to reconstruct genotypes for parents when genotype data is only available for offspring individuals, as well as haplotypes for all individuals. Compared to the Merlin package, we can handle larger pedigrees and produce superior results, mainly due to the fact that Merlin uses the Viterbi algorithm on the state space to infer the genotype sequence. Tracking of haplotype and allele origin can be used in any application where the marker set does not directly influence genotype variation influencing traits. Inference of genotypes can also reduce the effects of genotyping errors and missing data. The cnF2freq codebase implementing our approach is available under a BSD-style license.

Available as PDF (738 kB, no cover)

Download BibTeX entry.