Description of original award (Fiscal Year 2023, $64,003)
The rapid development of AI-powered audio generation has made it increasingly difficult
to distinguish between real and fake audio, leading to the use of deepfakes for criminal
activities. Therefore, there is a pressing need to develop audio deepfake detection
(ADD) systems that can filter out deepfakes. To achieve accurate detection results, ADD
systems need to be robust against new and unknown deepfake techniques, provide
evidence to support their findings, and be compatible with other fake detection tools.
This grant proposal aims to investigate three key research questions (RQs) in ADD for
forensics and security:
1. How can we design ADD systems that are robust to new and unknown deepfake techniques?
2. How can we improve the convincingness and explainability of ADD results?
3. Can we leverage other modalities, such as vision, to enhance ADD performance?
To address the first RQ, the applicant proposes a novel training strategy called multi-center
one-class learning, which accounts for the distribution mismatch between training and
evaluation data. The proposed strategy encourages the embedding space of speech
representations to separate deepfakes from real recordings and to cluster real recordings
from diverse settings around multiple centroids, yielding ADD systems that can outperform
existing models in detecting novel deepfake techniques.
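To illustrate the idea, a minimal sketch of a multi-center one-class objective is given below, assuming L2-normalized speech embeddings and a fixed set of centroids; the margin values `m_real` and `m_fake` are hypothetical hyperparameters, not values from the proposal.

```python
import numpy as np

def multi_center_one_class_loss(emb, labels, centers, m_real=0.9, m_fake=0.2):
    """Toy multi-center one-class loss.

    Real samples (label 1) are pulled toward their nearest centroid
    (cosine similarity should exceed m_real); fake samples (label 0)
    are pushed away from all centroids (similarity should stay below
    m_fake). Embeddings and centers are assumed L2-normalized, so the
    dot product is cosine similarity.
    """
    sims = emb @ centers.T          # (n_samples, n_centers) similarities
    nearest = sims.max(axis=1)      # similarity to the closest centroid
    loss = np.where(
        labels == 1,
        np.maximum(0.0, m_real - nearest),   # pull real toward a center
        np.maximum(0.0, nearest - m_fake),   # push fake away from all centers
    )
    return float(loss.mean())
```

Because only real recordings are modeled by the centroids, unseen deepfake techniques need not resemble any training-time attack: anything far from every centroid scores as suspicious.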
To address the second RQ, the applicant proposes to identify fingerprints of the audio
generation process and use them as evidence for convincing ADD results. The
applicant proposes a proactive steganography framework that embeds the description
of deepfake generation algorithms into the generated audio, enabling authorized agents
to accurately determine the origin of fraudulent speech, even for novel deepfakes.
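As a simplified illustration of the proactive idea, the sketch below embeds an identifying bit string into generated audio with a keyed spread-spectrum watermark and recovers it by correlation. This is a toy stand-in, not the proposed framework; the function names, `key`, and `strength` parameters are hypothetical, and in the proposal the embedded bits would describe the deepfake generation algorithm.

```python
import numpy as np

def embed_watermark(audio, bits, key, strength=0.05):
    """Add a keyed pseudo-random noise pattern per bit (+pn for 1, -pn for 0)."""
    rng = np.random.default_rng(key)
    chunk = len(audio) // len(bits)
    pn = rng.standard_normal(len(bits) * chunk)
    marked = audio.copy()
    for i, b in enumerate(bits):
        seg = slice(i * chunk, (i + 1) * chunk)
        marked[seg] += (1.0 if b else -1.0) * strength * pn[seg]
    return marked

def extract_watermark(audio, n_bits, key):
    """Recover each bit from the sign of the correlation with the keyed pattern."""
    rng = np.random.default_rng(key)
    chunk = len(audio) // n_bits
    pn = rng.standard_normal(n_bits * chunk)
    out = []
    for i in range(n_bits):
        seg = slice(i * chunk, (i + 1) * chunk)
        out.append(int(audio[seg] @ pn[seg] > 0))
    return out
```

Only an agent holding the key can regenerate the pseudo-random pattern, which is what restricts origin attribution to authorized parties in this toy scheme.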
To address the third RQ, the applicant proposes to leverage cross-modal learning to
enhance ADD performance with information from other senses such as vision. The
applicant proposes to model cross-modal mismatches in the synchronization of
general audio-visual events in complex scenes, including talking-head and surveillance
videos. Incorporating knowledge-graph reasoning over acoustic properties can further
improve ADD performance.
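A minimal sketch of the synchronization check is shown below, assuming frame-aligned audio and visual embeddings from pretrained encoders (hypothetical inputs); the mismatch threshold is an illustrative assumption, not a value from the proposal.

```python
import numpy as np

def av_sync_score(audio_emb, visual_emb):
    """Mean per-frame cosine similarity between aligned audio and visual embeddings."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * v, axis=1)))

def detect_mismatch(audio_emb, visual_emb, threshold=0.5):
    """Flag a video as suspicious when audio and visual streams fail to agree."""
    return av_sync_score(audio_emb, visual_emb) < threshold
```

Genuine footage should yield consistently high frame-level agreement, while dubbed or synthesized audio tends to drift out of sync with the visible events, lowering the score.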
The proposed research will contribute to the development of effective ADD systems that
are robust against new and unknown deepfake techniques, provide evidence to support
their findings, and are compatible with other fake detection tools. As deepfakes become
more prevalent in crimes, the efficiency of audio forensics in juvenile justice can be
significantly improved through the development of accurate and robust ADD techniques.
The publications resulting from this research will be made publicly available to
interested juvenile justice practitioners and the broader public. CA/NCF