
Best Practices for Improving the Use of Criminal Justice Risk Assessments: Insights From NIJ’s Recidivism Forecasting Challenge Winners Symposium

Winners from NIJ’s Recidivism Forecasting Challenge share their approaches to employing risk assessments and recommendations for practitioners and scientists.
Date Published: January 11, 2024

Although the practice of forecasting recidivism is not new in criminal justice, there is growing interest in incorporating statistical algorithms to predict future criminal behavior among individuals involved in the system. Such predictions can guide decision-making about the appropriateness of pre-trial release, security level and access to programming during incarceration, and levels of supervision following release. [1] These statistical algorithms are referred to as actuarial risk assessments. Compared to clinical or subjective judgments, which are more prone to error, actuarial risk assessments produce more precise predictions and have higher validity. [2] Specifically, an individual’s risk prediction is calculated based on factors such as their drug abuse, criminal history, employment status, and participation in certain correctional programs. Individuals are then classified according to their risk of re-offending using a continuum ranging from high- to low-risk. The ability of these instruments to identify factors that potentially contribute to recidivism may also help reduce mass imprisonment, enhance public safety, and decrease crime. [3] 
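As a minimal illustration of how an actuarial instrument combines factors into a risk classification, consider the sketch below. The factors, weights, and cut-off points are entirely hypothetical and are not drawn from any validated instrument; real tools derive weights from statistical analysis of outcome data.

```python
# Hypothetical actuarial risk score: a weighted sum of factors,
# with cut-off points mapping the score to a risk category.
def risk_score(prior_arrests, employed, completed_program):
    score = 0.0
    score += 0.4 * min(prior_arrests, 10)       # criminal history, capped
    score -= 1.0 if employed else 0.0           # employment as a protective factor
    score -= 0.5 if completed_program else 0.0  # program participation
    return score

def risk_category(score, low_cut=1.0, high_cut=3.0):
    if score < low_cut:
        return "low"
    return "high" if score >= high_cut else "medium"

print(risk_category(risk_score(prior_arrests=6, employed=False,
                               completed_program=False)))  # medium
```

The continuum from low to high risk described above corresponds to where the cut-off points fall, which is itself a design choice made by tool developers.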

Despite their promise, actuarial risk assessments pose lingering questions and concerns, particularly as methods and measures used to predict recidivism have evolved. Critics argue that the accuracy of these tools may be exaggerated at times and that the tools often lack transparency and fairness. [4] Additionally, both researchers and practitioners debate about the best methods and data with which to predict recidivism, define recidivism, and reduce racial and gender disparities. With the increased use of actuarial risk assessments in criminal justice operations, agencies must determine which processes and information to incorporate into the risk assessment tool and decide how they should be implemented, including when human overrides are acceptable. A further challenge is tempering public expectations of risk assessments, which can be either overly optimistic or unrealistically exacting.

In 2021, the National Institute of Justice (NIJ) created the “Recidivism Forecasting Challenge” to advise on these issues related to risk assessment. [5] Following the competition, NIJ held a symposium for Challenge winners to share strategies for developing, implementing, and refining risk assessments to address biases and uncertainties. This article summarizes these insights as guidance for tool developers, practitioners, policymakers, and the interested public.

NIJ Recidivism Forecasting Challenge Winners Symposium

In December 2021, NIJ hosted the virtual “Recidivism Forecasting Challenge Winners Symposium.” The main goals of the symposium were to provide winners a chance to share their prediction methods and challenges encountered during their participation (for example, limitations of the dataset and algorithmic risk assessments) and translate the challenge results into guidance for practitioners and policymakers. The following topics were covered over two days:

  • Forecasting methods.
  • Racial bias in risk predictions.
  • Gender-specific needs.
  • Implications of lessons learned for practitioners.

Through this well-attended virtual symposium, attendees gleaned lessons about reentry, bias and fairness, measurement, algorithm advancement, and practical applications of algorithmic risk assessment.

Key Findings From Day One

During day one of the symposium, three winning teams presented the methods they used to predict recidivism for the challenge. Select teams then participated in two roundtable discussions: the first focused on racial bias in risk assessment, and the second on gender responsivity in risk assessment. Below is a summary of the themes from the winners' presentations and roundtable discussions.

Racial Bias in Risk Assessment

  • Eliminating Bias: Addressing fairness in risk assessments is difficult because bias is hard to identify and quantify. For example, an algorithm cannot sufficiently reduce racial bias simply by excluding race as a variable, because the dataset may contain other variables, such as criminal history, that capture bias from earlier criminal legal processes. For this reason, eliminating the effects of historical and systemic racism may not be possible when building algorithmic risk prediction tools. Even so, methods exist to reduce bias, and applying them is important to ensure that bias is mitigated rather than perpetuated in later stages of the criminal legal process.
  • Improving Fairness in Risk Assessment: In pursuing fairness in actuarial risk assessment, researchers must examine whether the tool, the dataset, or both exhibit bias.
    • Balancing the Ethical Tradeoff: When improving fairness, researchers must often choose between equalizing error rates across groups and maximizing overall accuracy, because the two generally cannot be optimized simultaneously. There is an ethical tradeoff between aiming for equal outcomes between groups and treating everyone identically in models. For example, one may seek to create equivalent outcomes between racial and ethnic groups, or one may aim to reduce prediction errors, such as results indicating that an individual will re-offend when they do not.
    • Accounting for Context: Addressing fairness also means considering how models will be used. Although fairness measures may be integrated into risk assessment tools, it is important to acknowledge the possibility of humans overriding risk assessment recommendations, which may introduce additional bias.
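The tradeoff the panelists describe can be made concrete by measuring error rates separately for each group rather than only in aggregate. The sketch below uses made-up labels and predictions purely for illustration: two groups with plausible-looking predictions can still have very different false positive rates.

```python
# Per-group false positive rate and overall accuracy for a binary
# recidivism prediction (1 = predicted/actual re-offense).
def false_positive_rate(y_true, y_pred):
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    negatives = sum(1 for t in y_true if t == 0)
    return fp / negatives if negatives else 0.0

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical predictions: false positive rates differ across groups.
group_a = {"y_true": [0, 0, 0, 1], "y_pred": [1, 0, 0, 1]}  # FPR = 1/3
group_b = {"y_true": [0, 0, 0, 1], "y_pred": [0, 0, 0, 1]}  # FPR = 0
for name, g in [("A", group_a), ("B", group_b)]:
    print(name, accuracy(g["y_true"], g["y_pred"]),
          false_positive_rate(g["y_true"], g["y_pred"]))
```

Constraining the model so the two false positive rates match typically costs some overall accuracy, which is the ethical tradeoff described above.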

Gender Responsiveness in Risk Assessment

  • Addressing Gender-Specific Needs: Participants in the challenge predicted female recidivism better than male recidivism. Still, panel participants suggested that more variables representative of females' specific needs should be included; otherwise, the results may not accurately predict female recidivism because information relevant to their success or failure is missing (known as omitted variable bias).
  • Considering Implications of Gender-Specific Needs for Programming: Gender-targeted programming can improve by considering important risk prediction variables. For example, a person’s relationship with their children (or other dependents) and their recidivism risk can be used to target programming (for example, family-oriented programming) for males and females.

Key Findings From Day Two

On day two, participants split into multiple focus groups. A key objective of day two was to gather recommendations for applying the challenge results in the field and to draw out general data science lessons. The focus group participants discussed various topics regarding risk assessments based on their professional expertise and the techniques, primarily machine learning methods, used to create their winning submissions. The topics fit into three major themes: (1) designing risk assessments, (2) implementing risk assessments, and (3) increasing public and practitioner understanding. A summary of guidelines on these themes follows.

Design Considerations

  • Building Working Partnerships: Scientists are encouraged to partner with agencies to facilitate the successful design of the risk assessment instrument. This partnership is mutually beneficial: the implementing agency's involvement in the design phase will enhance its understanding of the tool and how to use it in practice, while feedback from the agency can help scientists revalidate and improve the instrument based on its performance in the field.
  • Incorporating Contextual and Real-Time Information: To achieve seamless integration of risk instruments across different domains, scientists are urged to customize tools depending on the jurisdiction and population. Practitioners are expected to report on the effectiveness of the models in the communities they serve and to validate the instrument continuously. At the same time, scientists should train models using multiple data sources and calibrate them according to the locality where the tool will be deployed. Models must also be constructed using dynamic rather than static factors, with time-stamped variables to determine what increases or decreases risk.
  • Anticipating Errors and Determining Model Complexity: To reduce the likelihood of false positives and false negatives, scientists should scrutinize the data and quantify error rates. When determining whether to use simple or complex models, scientists must weigh the trade-off between accuracy and utility. Simple models are less accurate but are easier to employ and may better identify the factors that contribute most to recidivism, whereas complex models are more accurate but harder to comprehend and implement.
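Quantifying error rates, as recommended above, starts with separating the four outcome cells of a binary prediction so false positives and false negatives can be examined independently. A minimal sketch, with hypothetical labels:

```python
# Tally the four outcome cells for a binary recidivism prediction
# so false positives and false negatives can be quantified separately.
def confusion_counts(y_true, y_pred):
    cells = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            cells["tp"] += 1   # correctly predicted re-offense
        elif t == 0 and p == 1:
            cells["fp"] += 1   # false positive: predicted re-offense, none occurred
        elif t == 1 and p == 0:
            cells["fn"] += 1   # false negative: missed re-offense
        else:
            cells["tn"] += 1   # correctly predicted no re-offense
    return cells

print(confusion_counts([1, 0, 1, 0, 0], [1, 1, 0, 0, 0]))
# {'tp': 1, 'fp': 1, 'fn': 1, 'tn': 2}
```

The simple-versus-complex decision then becomes concrete: a simple model may produce a worse confusion matrix but allows practitioners to see which factors drive each cell, while a complex model may shrink the error cells at the cost of that interpretability.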

Implementation

  • Clarifying Goals at the Outset: At the start of implementation, tool developers and practitioners should align their goals for using a particular risk instrument. They should further collaborate in the development of a clear implementation plan. This plan should have a targeted focus on changing organizational needs. In addition, prediction models should undergo rigorous review (for example, by institutional review boards), and decision-makers should understand the models before incorporating them into practice.
  • Attending to Risk and Need: Agencies need to receive guidelines on what decisions to make based on the risk assessment results of individuals determined to be at a lower or higher risk of recidivism. In addition to focusing on risk, practitioners must also consider factors in the person’s life that may contribute to recidivism (criminogenic needs) and factors that may mitigate risk (protective factors). Scientists can assist agencies in designing personalized, holistic interventions that reflect the needs of supervised individuals by building models based on micro-level data instead of relying solely on aggregate data. Micro-level approaches may include information on supervised individuals' life trajectories, residences, and reentry expectations.
  • Managing Practitioner (Non)Compliance: Practitioners should avoid underestimating or overestimating risk prediction scores and use discretion — which may introduce bias — sparingly. Agencies should continuously monitor practitioners' use of risk assessment tools to encourage their use and reduce misapplication. Keeping agency resources in mind, practitioners also need training and guidelines to improve the tool's implementation and facilitate compliance.

Increasing Awareness and Understanding

  • Fostering Understanding, Community Trust, and Transparency: Researchers with expertise in risk assessment and risk communication are responsible for sharing their knowledge with stakeholders about how and why risk instruments are used and setting reasonable expectations for their performance. Once implementing agencies understand what drives a risk prediction score, they can help build public trust in the tool by offering accessible explanations for how the instrument functions, the data it employs, its results, its predictive performance, and its impact on crime and public safety.
  • Communicating Risk: Although practitioners and the public often prefer categorical risk results (for example, low, high), [6] these categories can be misleading and vary depending on factors such as the cut-off points selected by tool developers. [7] This may lead practitioners to overestimate recidivism risk. [8] As an alternative to risk categories, scientists and practitioners can use a ranking system based on recidivism likelihood, which is updated based on dynamic, real-time information that indicates changes in the risk profile of those under supervision. Focus can then be placed on a percentile ranking of the highest-risk individuals.
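One simple way to produce the percentile ranking the panelists suggest is to rank individuals by their current risk scores and report each person's relative position rather than a category. The sketch below is a bare illustration (ties are ordered arbitrarily, and the scores are made up); in practice the ranking would be recomputed as dynamic, real-time information updates each score.

```python
# Convert raw risk scores to percentile ranks within a supervised
# population; a higher percentile means higher relative risk.
def percentile_ranks(scores):
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for pos, i in enumerate(order):
        ranks[i] = 100.0 * (pos + 1) / len(scores)
    return ranks

scores = [0.12, 0.80, 0.33, 0.55]
print(percentile_ranks(scores))  # [25.0, 100.0, 50.0, 75.0]
```

Attention can then be focused on, say, the top decile, without relying on cut-off points that may vary between tools.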

Conclusion and Implications

Overall, the 2021 NIJ “Recidivism Forecasting Challenge Winners Symposium” revealed several issues that can impact recidivism forecasts:

  • Attempts to eliminate racial bias in risk assessments are hindered because it is not always simple to identify the source of such bias, and removing race or proxy variables from the instrument is insufficient.
  • Along with race, recidivism predictions may vary based on gender. Participants in this challenge generally predicted recidivism better for females, but they did not find that certain variables mattered for females more than males. Based on past research, however, data scientists noted that risk instruments without gender-specific variables might not accurately reflect gender-specific needs.
  • Risk instruments may be incorrectly configured if (1) tool developers do not establish partnerships with the tool’s users, which is necessary for the instrument's revalidation, and (2) developers fail to consider the larger system in which the tool will operate, potential errors that could occur, and usability problems for practitioners.
  • Practitioners often lack formal training or guidelines on effectively using risk instruments. The consequences of this can be devastating for individuals if they are not granted resources based on their specific needs and risk levels. [9] 
  • Risk assessments cannot function effectively if misunderstood. Practitioners and the public may have misguided expectations about what risk instruments can and cannot do, which may lead to distrust, inaccurate predictions, and misapplication.

Important implications can be drawn from these issues for recommendations on using risk assessments in applicable settings like community supervision. The data scientists participating in the symposium offered several scientific and practical strategies for improving risk instrument design, implementation, and reception by stakeholders while also addressing racial bias and gender responsiveness. These strategies are intended primarily for scientists and implementing agencies who are encouraged to collaborate throughout the design and implementation phases. They include:

  • Carefully weighing the tradeoff between error rates versus accuracy rates when attempting to enhance fairness in risk instruments.
  • Incorporating variables that are responsive to gender-specific needs and differences in recidivism.
  • Subjecting risk instruments to rigorous review and repeated testing and validation based on user feedback.
  • Monitoring practitioner applications of the tools to ensure appropriate delivery of interventions.
  • Providing open access to risk instruments’ results in a transparent and easily interpreted manner.

The hope is that these and other suggestions better integrate the responsible, equitable, and objective use of risk instruments for the benefit of public safety and the treatment of individuals involved in the criminal legal system.

About This Article

This article summarizes recommendations from 27 data scientists who participated in focus group discussions held during the NIJ “Recidivism Forecasting Challenge Winners Symposium” on December 1-2, 2021. These recommendations are elaborated in two peer-reviewed journal articles, one forthcoming and the other available here:

D. Michael Applegarth, Raven A. Lewis, and Rachael M. Rief, “Imperfect Tools: A Research Note on Developing, Applying, and Increasing Understanding of Criminal Justice Risk Assessments,” Criminal Justice Policy Review 34, no. 4 (2023), https://doi.org/10.1177/08874034231180505.
