Short Tandem Repeats (STRs) are ideal for human identification: not only do they vary among individuals more than other genomic regions, but they can also be classified without obtaining an actual sequence. Because different versions of the same STR vary in the number of times the underlying short sequence is repeated, versions can be identified by length. This is easily accomplished in the lab once enough copies of the STRs are available, which is achieved using polymerase chain reaction (PCR). Making copies of a particular region of DNA using PCR is one of the most reliable laboratory processes used by genomic scientists. However, the copying process is known to produce artifacts that make the results difficult to read. Indeed, the repetitive nature of STRs can cause a small portion of the PCR product to contain “stutter” (one fewer or one more repeat of the motif), thus complicating the interpretation of the DNA sample.
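The idea that a version of an STR is defined by its repeat count, and that stutter shows up as a neighboring count, can be sketched in a few lines of Python. This is only an illustration: the GATA motif, the flanking bases, and the function names are hypothetical and are not taken from the research described in this article.

```python
import re


def count_tandem_repeats(sequence, motif):
    """Return the longest uninterrupted run of `motif`, measured in repeat units."""
    # Find every stretch made of one or more back-to-back copies of the motif.
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)


def stutter_candidates(repeat_count):
    """Return the repeat counts that stutter from a true version would mimic."""
    # Stutter is described above as one fewer or one more repeat of the motif.
    return (repeat_count - 1, repeat_count + 1)


# Hypothetical versions of one STR: same flanks, different repeat counts.
allele_a = "ACGT" + "GATA" * 9 + "TTCA"
allele_b = "ACGT" + "GATA" * 12 + "TTCA"

count_a = count_tandem_repeats(allele_a, "GATA")
count_b = count_tandem_repeats(allele_b, "GATA")

# The two versions differ in length by three repeat units (12 bases).
length_difference = len(allele_b) - len(allele_a)
```

Because each repeat unit adds a fixed number of bases, sorting fragments by length is enough to tell these two versions apart, which is why traditional analysis does not need the sequence itself.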
Besides the introduction of stutter during PCR, traditional STR analysis has other challenges. The more STRs analyzed, the more discriminating the final profile, but doing separate analyses for each STR uses precious amounts of the sample, and can be cost- and time-prohibitive. Thus, many different STRs are copied and sorted in the same reaction. There is a limit to the number of STRs that can be tested at once, and scientists typically restrict their analyses to about 20 to 30 STRs. This is a low number of markers to analyze by modern standards, but it produces a manageable data set that is powerful enough to identify individuals. All of this changes, however, when samples contain mixtures from more than one person. Untangling their profiles and avoiding stutter errors can be exceedingly difficult.
Next-generation sequencing (NGS) is a relatively new method used for sequencing genomes, or portions of genomes, with a high degree of accuracy. Millions of reads, or snippets of sequence, are generated across as much of the genome as desired, including many thousands of different STRs. Reads — often no more than a couple of hundred bases — are recorded randomly across the genome and then assembled algorithmically based upon their overlap. Any one region of the genome can be covered by hundreds of reads, enough to provide reasonable certainty in the sequence, despite artifacts introduced during PCR and sequencing.
A common solution to the problems of sequencing STRs using NGS is to target and enrich STR regions before sequencing. Typically, genomic DNA extracted from a sample is randomly sheared into fragments of a size suitable for the sequencing technology; if fragments containing STRs are isolated first, only those fragments need to be sequenced. When samples contain mixtures of individuals, the sequences of minor contributors may become so scarce that it is challenging to distinguish them from artifacts introduced during amplification and sequencing. (Minor contributors are donors contributing less than 50% of a DNA sample.)
However, current enrichment techniques also have problems. Randomly fragmenting the genome rarely yields full STR sequences, which are needed because fragments of repetitive DNA are impossible to align and assemble accurately. In addition, the molecular probes used to separate STR sequences before sequencing are often misled by the repetitive sequence, leading to increased errors.
Supported by a 2013 National Institute of Justice grant, Hanlee Ji, Associate Professor of Medicine at Stanford University, and his colleagues tackled this suite of problems by developing a method for isolating, enriching, and sequencing STRs. They began by using CRISPR-Cas9 technology, which is a powerful molecular tool that allows scientists to cut DNA molecules at any location they choose. By designing probes that align to the flanking regions of each targeted STR, they were able to cut out full STRs, attach unique molecular tags, and isolate them before sequencing. They then sequenced the STRs without amplifying them beforehand, thus dramatically reducing the introduction of stutter artifacts.
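Why cutting at the flanking regions matters can be shown with a simple string analogy. The sketch below is not the researchers' laboratory or analysis method; it only mimics, in software, the requirement that a usable fragment span both flanks of an STR. The flank sequences, the GATA motif, and the function name are hypothetical.

```python
def excise_between_flanks(fragment, left_flank, right_flank):
    """Return the bases between the two flanks, or None if either flank is missing."""
    start = fragment.find(left_flank)
    if start == -1:
        return None  # the fragment does not contain the left flank
    start += len(left_flank)
    end = fragment.find(right_flank, start)
    if end == -1:
        return None  # the fragment ends before reaching the right flank
    return fragment[start:end]


# Hypothetical flanks and a ten-repeat STR between them.
LEFT = "CCTGAC"
RIGHT = "TTCAGG"
full_fragment = "AA" + LEFT + "GATA" * 10 + RIGHT + "GG"

# A randomly sheared fragment that stops partway through the repeat.
partial_fragment = "AA" + LEFT + "GATA" * 6

full_str = excise_between_flanks(full_fragment, LEFT, RIGHT)
partial_str = excise_between_flanks(partial_fragment, LEFT, RIGHT)
```

The partial fragment yields no result because its true repeat count cannot be known: six repeats are visible, but the fragment gives no evidence of where the STR ends.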
Ji also added a key step to aid in identifying minor contributors to DNA mixtures. This is important because the amplification-free method still has stutter introduced by the sequencing technology. Identifying reads from minor contributors is impossible when erroneous reads are present in similar numbers. Ji’s team illustrates this problem with the example of a mixed sample from two individuals in which one of them has contributed only 1 out of 1,000 DNA molecules. After sequencing, a particular STR for the main contributor has 700 and 500 reads for its two versions, one inherited from each parent, while the minor contributor has fewer than five reads for each version, similar to the number of stutter reads. The step the researchers added was to include, with the excised STR sequence, a Single Nucleotide Polymorphism (SNP) located 100 bases or fewer from the STR. A SNP is a single site in the genome that tends to vary among individuals, and whenever these SNPs differed among the contributors to a mixed sample, their STR sequences could be distinguished in the final reads. Indeed, Ji’s team was able to successfully identify STR variants from 1-in-1,000 contributors that would otherwise be discarded as stutter.
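The effect of the linked SNP can be sketched with a small tally. The 700 and 500 read counts for the main contributor come from the example above; everything else (the SNP bases, the repeat counts, the stutter and minor-contributor read counts, and the function name) is a hypothetical assumption chosen for illustration. The sketch also assumes the simplest case, in which the minor contributor carries a SNP base that the main contributor does not.

```python
from collections import Counter, defaultdict


def tally_by_snp(reads):
    """Count reads per STR repeat count, separately for each linked SNP base."""
    tallies = defaultdict(Counter)
    for snp_base, repeat_count in reads:
        tallies[snp_base][repeat_count] += 1
    return {snp_base: dict(counts) for snp_base, counts in tallies.items()}


# Each read is (SNP base next to the STR, number of repeats observed).
reads = (
    [("A", 10)] * 700    # main contributor, first version (count from the text)
    + [("A", 13)] * 500  # main contributor, second version (count from the text)
    + [("A", 9)] * 4     # hypothetical stutter from the 10-repeat version
    + [("G", 9)] * 3     # hypothetical minor contributor, first version
    + [("G", 11)] * 2    # hypothetical minor contributor, second version
)

# Without the SNP, all 9-repeat reads are pooled and look like stutter.
pooled = Counter(repeat_count for _, repeat_count in reads)

# With the SNP, the minor contributor's reads form their own group.
by_snp = tally_by_snp(reads)
```

In the pooled tally, the seven 9-repeat reads sit one repeat below a 700-read peak and would plausibly be dismissed as stutter. Split by SNP base, three of them fall into a separate group alongside the 11-repeat reads, pointing to a second contributor.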
This new method for genotyping STRs using NGS, dubbed “STR-Seq,” was not only successful at distinguishing minor contributors at the 0.1% level, but it also outperformed other NGS methods using the same sequencing platform in terms of the number of STRs sequenced and the accuracy rate in their identification. STR-Seq can simultaneously characterize variants for over 2,500 different STRs with over 83% accuracy; combined with powerful tools for mixture analysis, it marks a major step forward in using NGS for human identification. The method was described in the February 7, 2017 edition of Nature Communications, 8, 14291 (2017), https://nature.com/articles/ncomms14291.
About This Article
The research described in this article was funded by NIJ grant 2013-DN-BX-K010 awarded to Stanford University. This article is based on the final summary overview for the award, which was entitled “Highly Parallel Analysis of Complex Genetic Mixtures,” by Hanlee P. Ji, principal investigator, Department of Medicine, Stanford University.