ByteDance Unit Unveils Seedance 1.5 Pro With Native Audio-Visual Synchronization

ByteDance’s cloud services division, Volcano Engine, has released Seedance 1.5 pro, a new video generation model capable of native audio-visual synchronization. Unveiled at the Volcano Engine 2025 Winter FORCE Conference, the model introduces significant upgrades in multi-language lip-syncing, complex instruction following, and motion control compared to its predecessor.
The system is currently accessible via the Volcano Ark Experience Center, with API availability for enterprise clients scheduled for December 23. Individual users can access the model through the Jimeng web platform and the Doubao application.
Audio-Visual Integration and Dialect Support
The primary advancement in Seedance 1.5 pro is its ability to generate video and audio simultaneously within a single workflow, rather than treating sound as a post-production addition. According to demonstration materials reviewed by toolmesh.ai, the model supports precise lip synchronization across multiple languages and dialects.
The system can handle complex auditory scenarios, including environmental noise, mechanical sounds, and background music. Notably, the model supports 16 Chinese dialects, such as Shaanxi, Sichuan, and Cantonese, alongside standard Mandarin and English. In demonstrations involving generated footage of public figures, the model maintained millisecond-level lip-sync accuracy while switching between American-accented English and regional Chinese dialects.
Beyond speech, the model exhibits improved capabilities in sound layering. In tests simulating ASMR (Autonomous Sensory Meridian Response) content, the system successfully differentiated and synchronized distinct audio layers, such as mechanical keyboard typing, breathing, and blowing into a microphone, aligning them with visual cues.
Performance and Instruction Adherence
Internal benchmarks presented by Volcano Engine suggest that Seedance 1.5 pro outperforms competitors such as Google’s Veo 3.1 and Kuaishou’s Kling 2.6 in specific audio metrics, including generation quality, synchronization, and expressiveness. In Text-to-Video (T2V) assessments, the model reportedly leads in alignment metrics, which measure how accurately the video reflects the user's text prompt.
The update addresses the "gacha" problem common in generative video—a reference to the randomness that often requires users to generate multiple iterations to achieve a usable result. Seedance 1.5 pro demonstrates high instruction adherence, allowing for complex prompts involving camera movements, lighting changes, and specific character actions to be executed in a single attempt.
Demonstrations included a commercial scenario for a Tesla advertisement, where the model followed a lengthy prompt describing a minimalist box opening to reveal a vehicle, accompanied by the assembly of a showroom. The output adhered to abstract requests for "tech feel" and specific lighting transitions without requiring multiple retries.
MMDiT Architecture and Training
Technically, Seedance 1.5 pro is built on a unified modeling framework based on the Multi-Modal Diffusion Transformer (MMDiT) architecture. This structure enables deep cross-modal interaction, keeping visual and auditory signals temporally synchronized and semantically consistent.
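ByteDance has not published the model's internals, but the general MMDiT idea can be illustrated with a minimal sketch: video and audio tokens keep modality-specific projections while sharing a single attention operation, which is what lets the two streams exchange information at every layer. The PyTorch block below is a hypothetical simplification; the class, parameter names, and dimensions are illustrative assumptions, not Seedance's actual code.

```python
# Illustrative sketch only: the real Seedance 1.5 pro architecture is not public.
# It shows the MMDiT-style pattern of joint attention over video and audio tokens,
# with modality-specific projections feeding one shared attention operation.
import torch
import torch.nn as nn


class JointAVAttentionBlock(nn.Module):
    """Hypothetical MMDiT-style block: video and audio tokens are normalized and
    projected separately but attend over the concatenated sequence, so the two
    modalities can stay aligned in time and meaning."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.norm_v1, self.norm_a1 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm_v2, self.norm_a2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp_v = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp_a = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # Attention runs over the concatenated sequence so audio tokens can
        # attend to video tokens (and vice versa) at every layer.
        v, a = self.norm_v1(video_tokens), self.norm_a1(audio_tokens)
        joint = torch.cat([v, a], dim=1)
        mixed, _ = self.attn(joint, joint, joint)
        n_v = video_tokens.shape[1]
        video_tokens = video_tokens + mixed[:, :n_v]
        audio_tokens = audio_tokens + mixed[:, n_v:]
        # Modality-specific feed-forward layers after the shared attention.
        video_tokens = video_tokens + self.mlp_v(self.norm_v2(video_tokens))
        audio_tokens = audio_tokens + self.mlp_a(self.norm_a2(audio_tokens))
        return video_tokens, audio_tokens


if __name__ == "__main__":
    block = JointAVAttentionBlock()
    video = torch.randn(1, 256, 512)   # e.g. 256 spatio-temporal video tokens
    audio = torch.randn(1, 64, 512)    # e.g. 64 audio tokens on the same timeline
    v_out, a_out = block(video, audio)
    print(v_out.shape, a_out.shape)
```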
The training pipeline involved a three-stage process, sketched in code after the list:
Joint Pre-training: The model was trained on large-scale mixed-modal datasets.
Supervised Fine-Tuning (SFT): The system was refined using high-quality audio-video datasets.
Reinforcement Learning from Human Feedback (RLHF): The team applied RLHF algorithms customized for audio-video scenarios to optimize motion quality and visual aesthetics.
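Volcano Engine has not released training code, so the following Python sketch is only a conceptual rendering of that three-stage recipe under stated assumptions: a toy denoising objective stands in for the real diffusion loss, a random placeholder stands in for the customized reward model, and the stage names are the only elements taken from the list above; everything else is hypothetical.

```python
# Conceptual sketch of the three-stage recipe described above. All components
# are hypothetical stand-ins; Volcano Engine has not published its training
# code, datasets, or the exact RLHF objective.
from dataclasses import dataclass
from typing import Callable

import torch
import torch.nn as nn


@dataclass
class Stage:
    name: str
    objective: Callable[[nn.Module, torch.Tensor], torch.Tensor]
    lr: float
    steps: int


def denoising_loss(model: nn.Module, latents: torch.Tensor) -> torch.Tensor:
    """Stage 1/2 objective: predict the noise mixed into audio-video latents."""
    noise = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], 1)          # random timestep per sample
    noisy = (1 - t) * latents + t * noise        # simple interpolation schedule
    return nn.functional.mse_loss(model(noisy), noise)


def reward_weighted_loss(model: nn.Module, latents: torch.Tensor) -> torch.Tensor:
    """Stage 3 stand-in: weight each sample's loss by a (dummy) reward score,
    a crude proxy for RLHF tuned to motion quality and aesthetics."""
    noise = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], 1)
    noisy = (1 - t) * latents + t * noise
    per_sample = ((model(noisy) - noise) ** 2).mean(dim=-1)
    rewards = torch.rand(latents.shape[0])       # placeholder reward-model output
    return (torch.softmax(rewards, dim=0) * per_sample).sum()


def run_pipeline(model: nn.Module, data: torch.Tensor) -> None:
    stages = [
        Stage("joint_pretraining", denoising_loss, lr=1e-4, steps=100),
        Stage("supervised_finetuning", denoising_loss, lr=1e-5, steps=20),
        Stage("rlhf", reward_weighted_loss, lr=1e-6, steps=10),
    ]
    for stage in stages:
        opt = torch.optim.AdamW(model.parameters(), lr=stage.lr)
        for _ in range(stage.steps):
            loss = stage.objective(model, data)
            opt.zero_grad(); loss.backward(); opt.step()


if __name__ == "__main__":
    toy_model = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
    run_pipeline(toy_model, torch.randn(8, 64))
```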
To improve efficiency, the engineering team implemented a multi-stage distillation framework. Combined with quantization and parallel computing optimizations, this approach reportedly achieved a ten-fold increase in end-to-end inference speed while reducing the computational load required for generation.
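The specifics of that distillation framework have not been disclosed. As a rough illustration of the general technique, the sketch below shows step distillation, one common way to trade a teacher's many denoising steps for a student's single step; the models, step counts, and loss here are toy assumptions, not Seedance's pipeline.

```python
# Minimal step-distillation sketch: a student learns to reproduce in one step
# what the teacher produces over several. Seedance's actual multi-stage
# distillation, quantization, and parallelism details are not public.
import torch
import torch.nn as nn


def teacher_denoise(teacher: nn.Module, x: torch.Tensor, steps: int = 8) -> torch.Tensor:
    """Run the teacher for several small denoising steps (the slow path)."""
    for _ in range(steps):
        x = x - teacher(x) / steps
    return x


def distill_step(student: nn.Module, teacher: nn.Module,
                 opt: torch.optim.Optimizer, noisy: torch.Tensor) -> float:
    """Train the student to jump straight to the teacher's multi-step result."""
    with torch.no_grad():
        target = teacher_denoise(teacher, noisy)
    pred = noisy - student(noisy)                 # single-step student prediction
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()


if __name__ == "__main__":
    teacher = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
    student = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 64))
    opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
    for _ in range(5):
        print(distill_step(student, teacher, opt, torch.randn(16, 64)))
```

Quantization and parallel execution would then cut per-step cost further, but those engineering details were likewise not disclosed.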
Complex Motion and Cinematic Control
The model shows enhanced capabilities in handling high-dynamic scenes. In generated footage of Formula One racing and World War I battlefields, Seedance 1.5 pro maintained physical consistency while rendering motion blur and film grain typical of cinematic productions. The system also managed complex camera movements, such as handheld tracking shots and rapid zooms, without breaking the logical continuity of the scene.
For character performance, the model generated nuanced facial expressions—ranging from laughter to exhaustion—based on static input images. This capability extends to multi-character interactions, where the model successfully coordinated dialogue and reactions between two characters speaking different languages in a single continuous shot.
As the generative video sector moves toward production-grade tools, the release of Seedance 1.5 pro positions ByteDance to compete directly with global heavyweights in the creative AI space, targeting both individual creators and professional production workflows.