The goal of this project was to select, from the millions of SNPs already identified in the human genome, a small subset that can predict ancestry with a minimal error rate.
An individual’s genotypes at a group of Single Nucleotide Polymorphisms (SNPs) can be used to predict that individual’s ethnicity or ancestry. In medical studies, knowledge of a subject’s ancestry can minimize possible confounding, and in forensic applications, such knowledge can help direct investigations. The variable selection procedure tested here took the following general form: estimate the expected error rate for each candidate set of SNPs using a training dataset, then consider the sets with the lowest error rates for their size. The quality of the selected SNPs therefore depended on the quality of the error-rate estimate. Since the apparent error rate performs poorly when either the number of SNPs or the number of populations is large, this project proposes a new estimate, the “Improved Bayesian Estimate.” This project demonstrates that selection procedures based on this estimate produce small sets of SNPs that can accurately predict ancestry. A list is provided of the 100 optimal SNPs for identifying ancestry. (publisher abstract modified)
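
The general form of the selection procedure described above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the project: genotypes are coded as 0, 1, or 2 copies of an allele, the classifier is a naive Bayes model with add-alpha smoothing, and the search is a greedy forward selection. The error estimate used is the apparent (resubstitution) error rate, which the abstract notes performs poorly when the number of SNPs or populations is large. The Improved Bayesian Estimate is not defined in this abstract, so it is not implemented; the hypothetical `error_fn` parameter marks where such an estimate would be substituted.

```python
import math
from collections import Counter, defaultdict


def fit_genotype_freqs(genotypes, labels, snp, alpha=1.0):
    """Smoothed genotype frequencies (codes 0/1/2) at one SNP, per population."""
    counts = defaultdict(Counter)
    for row, pop in zip(genotypes, labels):
        counts[pop][row[snp]] += 1
    freqs = {}
    for pop, c in counts.items():
        total = sum(c.values()) + 3 * alpha  # three possible genotype codes
        freqs[pop] = {g: (c[g] + alpha) / total for g in (0, 1, 2)}
    return freqs


def apparent_error_rate(genotypes, labels, snps, alpha=1.0):
    """Resubstitution error of a naive Bayes classifier restricted to `snps`.

    Stand-in estimate only: the training data are reused for evaluation.
    """
    pops = sorted(set(labels))
    prior = Counter(labels)
    tables = {s: fit_genotype_freqs(genotypes, labels, s, alpha) for s in snps}
    errors = 0
    for row, truth in zip(genotypes, labels):
        # Log posterior (up to a constant) of each population for this individual.
        scores = {
            p: math.log(prior[p])
            + sum(math.log(tables[s][p][row[s]]) for s in snps)
            for p in pops
        }
        predicted = max(pops, key=scores.get)
        errors += predicted != truth
    return errors / len(labels)


def greedy_select(genotypes, labels, max_size, error_fn=apparent_error_rate):
    """Greedy forward selection (an assumed search strategy).

    Returns one (snp_set, estimated_error) pair for each set size 1..max_size.
    """
    n_snps = len(genotypes[0])
    chosen, path = [], []
    for _ in range(max_size):
        candidates = [s for s in range(n_snps) if s not in chosen]
        best = min(candidates, key=lambda s: error_fn(genotypes, labels, chosen + [s]))
        chosen.append(best)
        path.append((list(chosen), error_fn(genotypes, labels, chosen)))
    return path


# Toy training data (invented): rows are individuals, columns are SNPs.
toy_genotypes = [
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 2],
    [1, 2, 1],
    [0, 2, 0],
    [1, 2, 2],
]
toy_labels = ["A", "A", "A", "B", "B", "B"]

for snp_set, err in greedy_select(toy_genotypes, toy_labels, max_size=2):
    print(snp_set, round(err, 3))
```

In this toy dataset the second SNP (index 1) separates the two populations completely, so a selection based on any reasonable error estimate should choose it first. With real data, the choice of `error_fn` would matter far more than it does here.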