Description of original award (Fiscal Year 2019, $141,856)
There has been a great deal of recent interest in using "black box" and "white box" techniques to evaluate decisions made in a variety of forensic disciplines. This proposal is to prepare a publishable manuscript and updated briefing that will dissect the details of conducting such evaluations, based on the lessons learned from the design, data selection, test administration, and analysis of eight evaluations of forensic examiners, in five disciplines: latent print examination, footwear examination, handwriting examination, bullet (mark) comparison, and bloodstain pattern analysis. The paper will have broader implications for other forensic science disciplines as well. The primary purpose of black box studies is to demonstrate the accuracy and reliability of a given process. Black box evaluations are conducted by assessing examiners' decisions without regard to how those decisions are made, and can provide a useful overall understanding of the accuracy, reproducibility, and repeatability of the decisions made in response to a given task. Such evaluations do not attempt to assess how a specific examiner performs on specific data, but they are a necessary first step towards such detailed tests. Black box evaluations provide a means of quantifying forensic examinations for which quantitative models do not (yet) exist and, therefore, provide both an interim solution while such models are under development and a means of validating such models. Conversely, white box evaluations are conducted to gain an understanding of how and why examiners make decisions. White box evaluations are detailed assessments of the bases of examiners' decisions, focused not just on the end decisions but also on the features and attributes used by the examiners in rendering conclusions.
While analyses of black box results deal with the inter-examiner variability of decisions, white box analyses also deal with inter-examiner variability of the detection of features and other attributes. Each discipline adds its own complexities in terms of data selection and test design. The paper will discuss these discipline-specific issues and their resolutions. NIJ funding will be used to prepare the publishable manuscript and to update a previously presented briefing with new lessons learned and deliver it at AAFS.
Forensic latent print examiners do not always reproduce each other's conclusions. This proposal uses the results from the Latent Print Examiner White Box Eye Tracking Study, and new analyses of the results from the Latent Print Examiner Black Box Study, to explore why latent print examiners disagree. We show the extent to which differing conclusions can be explained in terms of the images, in terms of the examiners' tendencies or biases towards different types of conclusions, and as a result of using categorical conclusion scales. The level of agreement among examiners can be explained in large part by the fingerprint images themselves. Examiners generally agree with the conclusion of the majority of examiners for a given image (or image pair). The quality of the latent print is a strong indicator of the level of agreement, and particularly whether examiners' determinations are unanimous. Conversely, some fingerprint images or image pairs are disproportionately associated with disagreements among examiners, and (in some cases) increased error rates. Examiners often show tendencies or biases towards different types of conclusions, which we describe as implicit individual decision thresholds. We demonstrate that these thresholds are measurable and vary notably among examiners. Much of the remaining variability relates to inconsistency of the examiners themselves: conclusions close to personal thresholds often were not repeated by the examiners themselves, indicating that as examiners become less certain in their assessment of a latent print or pair of prints, the probability of repeating an assessment decreases. Conclusions close to personal thresholds tend to be rated by those examiners as difficult, and are associated with slower analysis times.
A few examiners have significantly higher error rates than most: pooled error rates among many examiners are not necessarily representative of individual examiners, and measurement of rare events (notably erroneous identifications) can be disproportionately affected by individual examiners. These findings are operationally relevant to staffing (to understand differences among examiners), quality assurance (to mitigate the effects of differences), and the legal community (to understand the implications of disagreements among experts in court). NIJ funding will be used to complete and publish the manuscript currently underway, which details the results summarized here. The manuscript is currently in draft form and still requires additional statistical data analyses, charts/graphics, and supporting detailed information, as previously published with our Latent Print Black Box and White Box manuscripts (see several manuscripts by Ulery et al. and Hicklin et al. from 2011-2019).
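The point above about pooled error rates can be illustrated with a minimal sketch. The examiner labels and counts below are hypothetical, chosen only to show the arithmetic; they are not data from any of the studies described here, and the function name `error_rates` is an assumption introduced for this example.

```python
from collections import defaultdict

# Hypothetical (examiner_id, is_error) records, invented for illustration only.
decisions = (
    [("A", False)] * 99 + [("A", True)] * 1
    + [("B", False)] * 100
    + [("C", False)] * 90 + [("C", True)] * 10
)


def error_rates(records):
    """Return the pooled error rate and a dict of per-examiner error rates."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for examiner, is_error in records:
        totals[examiner] += 1
        errors[examiner] += int(is_error)
    per_examiner = {e: errors[e] / totals[e] for e in totals}
    pooled = sum(errors.values()) / sum(totals.values())
    return pooled, per_examiner


pooled, per_examiner = error_rates(decisions)
print(f"pooled error rate: {pooled:.3f}")
for examiner, rate in sorted(per_examiner.items()):
    print(f"examiner {examiner}: {rate:.3f}")
```

In this made-up example, one examiner accounts for most of the errors, so the pooled rate describes none of the three examiners well.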