ACM MULTIMEDIA AT-ADD CHALLENGE 2026
The Grand Challenge on All-Type Audio Deepfake Detection
Introduction
With the rapid advancement of Audio Language Models (ALMs), creators are now able to generate high-fidelity audio at low cost using AI tools. The generated content is increasingly diverse, extending beyond conventional speech to include environmental sounds, singing voices, and music. While these capabilities greatly enhance content creation and multimedia production, they also introduce significant security and trust concerns. High-quality audio deepfakes can now be generated and disseminated at scale, posing emerging risks to society.
Audio deepfakes can be exploited in various malicious scenarios. They may be used to impersonate identities or fabricate voice commands for fraudulent activities, synthesize environmental sounds to create misleading or fabricated events, and manipulate public perception and information dissemination. With the advancement of singing voice and music generation, synthetic audio also raises concerns regarding copyright infringement and content manipulation. In high-risk applications, such technologies may even bypass voice authentication systems, control intelligent devices, or fabricate critical evidence, thereby posing serious threats to social security and trust infrastructures.
Despite these growing risks, existing audio deepfake detection (ADD) methods and evaluation benchmarks remain largely speech-centric and are typically developed under relatively idealized experimental conditions, such as clean speech data and limited generation methods. This setting deviates significantly from real-world scenarios, where audio is captured by diverse devices (e.g., mobile phones, in-vehicle systems, and wearable devices) and is often affected by complex acoustic distortions, including noise, reverberation, and compression. Meanwhile, attack strategies and generative models are evolving rapidly, covering a broader range of generation mechanisms and audio modalities. As a result, current methods exhibit limited robustness in complex environments and insufficient generalization to emerging spoofing techniques and heterogeneous audio types.
To address these challenges, we introduce the AT-ADD: The Grand Challenge on All-Type Audio Deepfake Detection at ACM Multimedia 2026, aiming to bridge the gap between controlled laboratory settings and real-world multimedia forensics. This challenge is designed to promote the development of robust, generalizable, and deployable audio deepfake detection systems across diverse conditions, unseen generation methods, and multiple audio types.
Challenge Tracks
To address the challenges outlined in the introduction, the AT-ADD challenge consists of two complementary tracks: robust speech deepfake detection (Track 1) and universal audio deepfake detection across multiple audio types (Track 2).
Track 1: Robust Speech Deepfake Detection
Goal
Track 1 aims to bridge the gap between existing benchmarks and real-world deployment scenarios for speech deepfake detection. It evaluates whether a detector can remain reliable under realistic domain shifts (e.g., multi-device acquisition, multilingual speech) and practical post-processing effects, while maintaining strong performance against modern high-fidelity synthesis systems.
Task definition
Given an input speech utterance, participants are required to predict whether the input is real or fake. In this task, fake refers specifically to deepfake speech generated using deep neural network-based methods, while real refers to non-deepfake speech. It should be noted that signal distortions or transformations, such as compression, resampling, speed perturbation, and pitch shifting, as well as replay-based attacks, do not change the original real/fake label in this task. The training and development data are fully provided by the organizers, and the use of external data is not allowed under the closed setting.
The evaluation set includes deepfake samples generated by methods that are unseen during training and reflect recent state-of-the-art generation techniques. Meanwhile, the real speech in the evaluation set is collected under realistic conditions, involving variations in recording devices, acoustic environments, languages, and other real-world factors.
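Detection systems of this kind typically output a real/fake score per utterance and are ranked by a threshold-free metric such as the equal error rate (EER), as in prior audio deepfake detection challenges; the specific metric for AT-ADD is not stated here, so the following is only an illustrative sketch of how EER is computed from scores, assuming higher scores indicate real speech.

```python
import numpy as np

def compute_eer(real_scores: np.ndarray, fake_scores: np.ndarray) -> float:
    """Equal error rate: the operating point where the miss rate on real
    speech equals the false-acceptance rate on fake speech.
    Convention (an assumption): higher score = more likely real."""
    scores = np.concatenate([real_scores, fake_scores])
    labels = np.concatenate([np.ones(len(real_scores)),
                             np.zeros(len(fake_scores))])
    order = np.argsort(scores)          # sweep the threshold upward
    labels = labels[order]
    n_real = labels.sum()
    n_fake = len(labels) - n_real
    # Rejecting everything at or below scores[i]:
    frr = np.cumsum(labels) / n_real          # real rejected (miss rate)
    far = 1.0 - np.cumsum(1 - labels) / n_fake  # fake accepted
    idx = np.nanargmin(np.abs(frr - far))
    return float((frr[idx] + far[idx]) / 2)
```

With perfectly separated scores the EER is 0; with fully overlapping score distributions it approaches 0.5 (chance level).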
Track 2: All-Type Audio Deepfake Detection
Goal
Track 2 targets universal audio deepfake detection across heterogeneous audio types and aims to develop type-agnostic detectors that generalize across both audio types and unseen generation methods.
Task definition
Given an input audio clip of unknown type, participants are required to determine whether it is real or fake. In this task, fake denotes deepfake audio generated by deep neural network-based methods, whereas real denotes non-deepfake audio. Notably, in Track 2, audio-type labels (i.e., speech, sound, singing, and music) are not available at test time, reflecting realistic deployment scenarios.
Similar to Track 1, this track follows a closed setting, where participants must use only the provided training and development data, without access to external resources.
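In practice, a type-agnostic detector means a single front-end and scorer applied identically to every clip, since no speech/sound/singing/music label is available at test time. The hypothetical sketch below (toy features and a linear scorer, not any organizer-provided baseline) illustrates that interface: the same code path scores any audio type.

```python
import numpy as np

def extract_features(waveform: np.ndarray) -> np.ndarray:
    """Toy shared front-end: frame-level log-energy statistics.
    The same computation is applied to speech, sound, singing,
    and music -- no audio-type label is consulted."""
    frame = 1024
    n = len(waveform) // frame * frame
    frames = waveform[:n].reshape(-1, frame)
    log_energy = np.log(np.mean(frames ** 2, axis=1) + 1e-9)
    return np.array([log_energy.mean(), log_energy.std()])

def score_clip(waveform: np.ndarray,
               weights: np.ndarray, bias: float) -> float:
    """Hypothetical linear scorer on the shared features.
    Higher score = more likely real (a convention, not a spec)."""
    return float(extract_features(waveform) @ weights + bias)
```

A real system would replace the toy features with a learned representation trained on the provided data, but the key property is unchanged: one model, one score, regardless of audio type.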