A comprehensive survey with critical analysis for deepfake speech detection: Lam Pham et al

ABSTRACT

Recent advancements in deep learning have significantly enhanced speech generation systems, enabling their integration into a wide range of real-world applications such as text-to-speech support for individuals with speech impairments, voice-enabled customer service chatbots, and cross-linguistic speech translation. Despite their beneficial uses, these technologies also introduce serious risks when exploited for malicious purposes, particularly through the generation of highly realistic synthetic speech. This concern has led to the emergence of the Deepfake Speech Detection (DSD) task, which aims to identify speech synthesized by deep learning models. Given the recent development of this field, the existing body of survey literature remains limited, with most works primarily cataloging techniques rather than offering critical evaluations.

To address this gap, our study presents a comprehensive and analytically rigorous survey of the current landscape of Deepfake Speech Detection. This work, undertaken as part of the STARLIGHT, EUCINF, and DEFAME FAKEs projects, systematically examines key aspects including competitive challenges, publicly available datasets, and state-of-the-art deep learning methods designed to enhance detection performance. Building on this analysis, we formulate and test hypotheses regarding the integration of specific deep learning techniques to improve the robustness of DSD systems. In addition to the survey, we conduct extensive experiments to validate our hypotheses and introduce a competitive detection model. Based on our findings, we further outline promising future research directions to advance the field of Deepfake Speech Detection.

Authors: Lam Pham, Phat Lam, Dat Tran, Hieu Tang, Tin Nguyen, Alexander Schindler, Florian Skopik, Alexander Polonsky, Hai Canh Vu

Publisher: Computer Science Review (ELSEVIER)

Citation: Lam Pham, Phat Lam, Dat Tran, Hieu Tang, Tin Nguyen, Alexander Schindler, Florian Skopik, Alexander Polonsky, Hai Canh Vu, A comprehensive survey with critical analysis for deepfake speech detection, Computer Science Review, Volume 57, 2025, 100757, ISSN 1574-0137, https://doi.org/10.1016/j.cosrev.2025.100757.

Background & Objectives

This paper offers a comprehensive survey of deepfake speech technologies, analyzing the current landscape, detection challenges, and evolving countermeasures. It investigates major deep-learning architectures, benchmark datasets, and industry challenges, providing a structured synthesis of the field.

Scope & Methodology

  • Challenge Competitions: Reviews recent academic and industry-led competitions (e.g., the ASVspoof and ADD audio deepfake detection challenges).
  • Datasets: Compares prominent public datasets such as VCTK, LibriSpeech, and FakeAVCeleb, and evaluates how each has been used in detection-model development.
  • Techniques: Critically analyzes state-of-the-art deep-learning methods: convolutional neural networks (CNNs), recurrent neural networks (RNNs), generative adversarial networks (GANs), and transformers.

Key Findings & Insights

  • Architectural trends: Transformer-based models increasingly outperform traditional CNN- and RNN-based detection systems in both accuracy and robustness.
  • Dataset gaps: Current datasets insufficiently reflect real-world noisy and multilingual conditions, limiting model generalizability.
  • Benchmark disparities: Challenge benchmarks report differing performance metrics and protocols, and few models demonstrate consistent cross-dataset reliability.
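The metric most commonly reported across these benchmarks (e.g., ASVspoof) is the equal error rate (EER): the operating point at which the false-acceptance rate on spoofed speech equals the false-rejection rate on bona fide speech. A minimal sketch in pure Python, using hypothetical detector scores (higher score = more likely bona fide), illustrates the threshold sweep:

```python
def eer(bonafide_scores, spoof_scores):
    """Equal error rate: find the threshold where the false-acceptance
    rate (spoof accepted as bona fide) and the false-rejection rate
    (bona fide rejected) are closest, and return their average."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer

# Hypothetical scores on a toy evaluation set (not from the paper).
bonafide = [0.9, 0.8, 0.7, 0.6]
spoof = [0.4, 0.3, 0.65, 0.2]
print(f"EER = {eer(bonafide, spoof):.2%}")  # → EER = 25.00%
```

Because EER is derived from the full score distributions rather than a single fixed threshold, it allows systems to be compared on one dataset; the cross-dataset disparities noted above arise because evaluation conditions and protocols still differ between benchmarks.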

Critical Analysis & Recommendations

  • The authors highlight blind spots in dataset diversity, calling for multilingual, real-world audio samples.
  • Detection algorithms must evolve alongside generation methods, so benchmarks require continuous updates.
  • The authors recommend integrated approaches that combine audio and visual verification to improve robustness against adaptive deepfake techniques.

This survey provides a valuable roadmap for researchers and practitioners, outlining current achievements and essential directions for future work. By identifying limitations in datasets and detection paradigms, it offers actionable guidance for building more resilient deepfake speech detection systems.