In genome-wide association studies, results have been improved through imputation of a denser marker set based on reference haplotypes and phasing of the genotype data. To better handle very large sets of reference haplotypes, pre-phasing with only study individuals has been suggested. We describe a potential problem that is aggravated when pre-phasing strategies are used, and suggest a modification that avoids it, applied to the MaCH tool.
We evaluate the effectiveness of our remedy on a subset of HapMap data, comparing the original version of MaCH with our modified approach. Improvements are demonstrated on the original data (phase switch error rate decreasing by 10%), but the differences are more pronounced when the data is augmented to represent the presence of closely related individuals, especially siblings (30% reduction in switch error rate in the presence of children, 47% reduction in the presence of siblings). When siblings are introduced, the switch error rate in results from the unmodified version of MaCH increases significantly compared to the original data.
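The abstract does not spell out how the phase switch error rate is computed. As a minimal sketch, assuming the common definition (the fraction of consecutive heterozygous sites whose relative phase differs between the inferred and the true haplotypes), it could be calculated for one individual as follows. The function name `switch_error_rate` and the 0/1 allele encoding are illustrative assumptions, not part of MaCH.

```python
def switch_error_rate(true_hap, inferred_hap):
    """Fraction of adjacent heterozygous site pairs with a phase switch.

    Hypothetical helper for illustration; each argument is a pair of
    equal-length allele sequences (for example lists of 0/1), one per strand.
    """
    t1, t2 = true_hap
    i1, i2 = inferred_hap
    orientation = []
    for a, b, c, d in zip(t1, t2, i1, i2):
        if a == b:
            continue  # homozygous sites carry no phase information
        if {a, b} != {c, d}:
            raise ValueError("inferred genotype differs from true genotype")
        # True when inferred strand 1 carries the allele of true strand 1
        orientation.append(a == c)
    if len(orientation) < 2:
        return 0.0  # fewer than two heterozygous sites: no switch possible
    # A switch occurs wherever the orientation flips between neighbours
    switches = sum(x != y for x, y in zip(orientation, orientation[1:]))
    return switches / (len(orientation) - 1)


truth = ([0, 1, 0, 1, 0], [1, 0, 1, 0, 1])
inferred = ([0, 1, 1, 0, 1], [1, 0, 0, 1, 0])
rate = switch_error_rate(truth, inferred)  # one switch over four intervals
```

Under this definition, the reductions quoted above would be relative changes in this rate between the original and the modified version of MaCH.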
The main conclusion of this investigation is that existing statistical methods for phasing and imputation of unrelated individuals might give subpar results if a subset of study individuals are nonetheless related. As the populations collected for general genome-wide association studies grow in size, including relatives might become more common. If a general GWAS framework for unrelated individuals is employed on datasets that include sub-populations originally collected as familial case-control sets, caution should also be taken regarding the quality of haplotypes.
Our modification to MaCH is available on request and straightforward to implement. We hope that this mode, if found to be of use, could be integrated as an option in future standard distributions of MaCH.