TalkingHeadBench: A Multi-Modal Benchmark & Analysis of Talking-Head DeepFake Detection

¹University of North Carolina at Chapel Hill, ²Michigan State University
*Equal contribution. Corresponding authors: xxiong@cs.unc.edu, ronisen@cs.unc.edu
Figure: Overview of the TalkingHeadBench pipeline.

A new benchmark for talking-head deepfake detection.

Abstract

The rapid advancement of talking-head deepfake generation, driven by modern generative models, has elevated the realism of synthetic videos to a level that poses substantial risks in domains such as media, politics, and finance. However, current benchmarks for talking-head deepfake detection fail to reflect this progress, relying on outdated generators and offering limited insight into model robustness and generalization.

We introduce TalkingHeadBench, a new benchmark designed to address this gap, featuring talking-head videos from six modern generators, with an additional two emerging generators used exclusively for testing generalization. The dataset is built on an expert-led curation process that filters out over 60% of generated samples to remove videos with noticeable artifacts, presenting a more difficult challenge for detectors. Our evaluation protocols are designed to measure generalization across identity and generator shifts. Benchmarking seven state-of-the-art detectors reveals that models with high accuracy on older datasets like FaceForensics++ show a significant performance drop on our curated data, particularly at strict false positive rates (e.g., TPR@FPR=0.1%). In addition, Grad-CAM visualizations reveal a consistent trend in which detectors attend to background cues rather than facial features.

The dataset is hosted on Hugging Face with all data splits and protocols. Our benchmark aims to accelerate research towards more robust and generalizable detection models in the face of rapidly evolving generative techniques.
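
The dataset can be pulled directly with the huggingface_hub client. The snippet below is a minimal sketch only; the repository id shown is a placeholder, so check the dataset's Hugging Face page for the actual id.

# Minimal sketch: download the full dataset with the huggingface_hub client.
# NOTE: the repo_id below is a placeholder, not the official repository name.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<org>/TalkingHeadBench",  # placeholder; replace with the real id
    repo_type="dataset",
)
print("Dataset files downloaded to:", local_dir)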

Dataset Samples

Example videos are provided for real footage and for each generator in the benchmark: Hallo, Hallo2, Hallo3, EmoPortraits, AniPortrait (audio-driven), AniPortrait (video-driven), and LivePortrait.

Benchmark Overview & Results

To rigorously evaluate the generalization of SOTA detectors, we design three evaluation protocols and adopt strict metrics following standard face recognition benchmarks; illustrative sketches of both are shown after the lists below.

Evaluation Protocols

  • Protocol 1: Identity Shift (P1)
    Evaluates generalization to unseen identities within the same generator.
  • Protocol 2: Generator Shift (P2)
    Evaluates generalization to an unseen generator.
  • Protocol 3: ID & Generator Shift (P3)
    Evaluates generalization to both unseen generators and identities (the most challenging "in-the-wild" setting).
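
As a rough illustration, the sketch below shows one way these protocols can be realized as data splits over samples tagged with an identity and a generator. The sample schema and the held-out choices are assumptions made for illustration; the official split files shipped with the dataset are the reference.

# Illustrative only: a minimal sketch of the three protocols as data splits.
# The sample schema (dicts with "identity" and "generator" keys) and the
# held-out sets are assumptions, not the benchmark's official split files.
from typing import Dict, List, Set, Tuple

Sample = Dict[str, str]  # e.g. {"path": "...", "identity": "id_017", "generator": "Hallo2"}

def split(samples: List[Sample],
          held_out_ids: Set[str],
          held_out_gens: Set[str],
          protocol: str) -> Tuple[List[Sample], List[Sample]]:
    """Return (train, test) lists for protocol 'P1', 'P2', or 'P3'."""
    train, test = [], []
    for s in samples:
        unseen_id = s["identity"] in held_out_ids
        unseen_gen = s["generator"] in held_out_gens
        if protocol == "P1":      # identity shift: same generators, new identities
            (test if unseen_id else train).append(s)
        elif protocol == "P2":    # generator shift: unseen generator at test time
            (test if unseen_gen else train).append(s)
        elif protocol == "P3":    # both unseen identities and unseen generators
            if unseen_id and unseen_gen:
                test.append(s)
            elif not unseen_id and not unseen_gen:
                train.append(s)
            # samples mixing seen and unseen factors are simply dropped here
    return train, test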

Metrics

  • AUC (Area Under the ROC Curve) ↑: Measures the overall separability of real and fake videos.
  • Brier Score ↓: Measures the accuracy of the predicted probabilities (calibration); a lower score indicates better-calibrated, more accurate probability estimates.
  • TPR @ FPR (T1, T0.1) ↑: The key metric for deployment. T1 and T0.1 report the true positive rate at 1% and 0.1% false positive rates, respectively; T0.1 measures how many deepfakes are caught when the system is allowed only one false alarm per 1,000 real videos.
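
A minimal sketch of how these metrics can be computed with scikit-learn and NumPy from per-video scores. The variable names and the TPR@FPR convention used here (highest TPR with FPR at or below the target) are assumptions; the benchmark's own evaluation scripts are the reference implementation.

# Illustrative only: AUC, Brier score, and TPR@FPR from per-video scores.
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score, roc_curve

def evaluate(y_true: np.ndarray, y_score: np.ndarray) -> dict:
    """y_true: 1 = fake, 0 = real; y_score: predicted probability of fake."""
    fpr, tpr, _ = roc_curve(y_true, y_score)

    def tpr_at_fpr(target_fpr: float) -> float:
        # Highest TPR achievable while keeping FPR at or below the target.
        mask = fpr <= target_fpr
        return float(tpr[mask].max()) if mask.any() else 0.0

    return {
        "AUC": roc_auc_score(y_true, y_score),
        "Brier": brier_score_loss(y_true, y_score),
        "T1": tpr_at_fpr(0.01),     # TPR @ 1% FPR
        "T0.1": tpr_at_fpr(0.001),  # TPR @ 0.1% FPR
    }

# Example with random scores (illustrative only):
# rng = np.random.default_rng(0)
# print(evaluate(rng.integers(0, 2, 1000), rng.random(1000)))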

Main Results

Table 1. Benchmark results across the three protocols. Performance drops significantly when detectors face unseen generators (Protocols 2 and 3). Performance levels: Excellent: ≥0.99, Good: 0.95–0.99, Reasonable: 0.85–0.95, Fair: 0.75–0.85, Poor: <0.75.

BibTeX

@article{xiong2025talkingheadbench,
  title={TalkingHeadBench: A Multi-Modal Benchmark \& Analysis of Talking-Head DeepFake Detection},
  author={Xiong, Xinqi and Patel, Prakrut and Fan, Qingyuan and Wadhwa, Amisha and Selvam, Sarathy and Guo, Xiao and Qi, Luchao and Liu, Xiaoming and Sengupta, Roni},
  journal={arXiv preprint arXiv:2505.24866},
  year={2025}
}