ByteDance's Volcano Engine Unveils Seedance 1.5 Pro AI Video Generation Model

Volcano Engine, ByteDance's enterprise technology service platform, has officially launched Seedance 1.5 Pro, its latest video generation model. The new model introduces native audio-visual synchronization, advanced camera movements, and multi-character, multi-language dialogue capabilities.
The release follows a period of rapid development in AI video models, both domestically and internationally, with video generation emerging as a key application scenario.
Core Capabilities of Seedance 1.5 Pro
A primary enhancement in Seedance 1.5 Pro is its native audio-visual synchronization. Previous models typically generated visuals only, requiring creators to add audio in post-production. Seedance 1.5 Pro supports simultaneous generation of audio and visuals, addressing the challenge of producing realistic mouth movements for characters and fitting environmental sound.
The model also offers more complex camera movements and dynamic scenes. It can handle intricate camera choreography, including long-shot tracking, rapid perspective shifts, and emotionally driven camera pushes. In dynamic scenarios, the model adjusts sound in response to high-speed character movement, camera changes, and shifting backgrounds.
Seedance 1.5 Pro supports multi-character dialogue across various languages, including Mandarin, English, Korean, and Chinese dialects such as Sichuanese and Cantonese. This enables more organic interactions between characters in AI-generated videos.
The model emphasizes expressive completeness beyond mere visual generation. It focuses on subtle micro-expressions in close-ups and synchronizes sound rhythm with visual progression in dialogue scenes.
Seedance 1.5 Pro is currently available in the Volcano Ark Experience Center, with an enterprise API scheduled for release on December 23. Individual users can access it via the Doubao and Jiemeng applications.
Technical Insights from Research
According to a technical report reviewed by toolmesh.ai, Seedance 1.5 Pro is designed as a native audio-visual co-generation foundation model that treats video and audio as a single generation task. The approach aims to move beyond producing raw visual material toward directly usable, integrated audio-visual works.
Technically, the model's framework encompasses four main areas: data, architecture, post-training, and inference acceleration.
The data framework is built for high-quality audio-visual generation, combining multi-stage cleaning pipelines, detailed captioning, and infrastructure for large-scale multimodal processing. Key aspects include audio-visual consistency checks and curriculum-based data scheduling, which gradually increases data complexity over the course of training. The captioning system describes camera work, sound status, and emotional context, enabling the model to translate user prompts into complete visual and sound sequences.
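The report does not disclose the scheduler itself, but curriculum-based data scheduling is a well-established technique. The sketch below illustrates the general idea under stated assumptions: each clip carries a hypothetical complexity score, and the admissible complexity ceiling rises linearly as training progresses.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    video_path: str
    audio_path: str
    caption: str
    complexity: float  # assumed: combined motion/dialogue/camera difficulty in [0, 1]

def curriculum_filter(clips: list[Clip], step: int, total_steps: int) -> list[Clip]:
    """Admit progressively more complex clips as training advances.

    Hypothetical linear ramp: early steps see only simple, clean clips;
    by the final steps the full data distribution is available.
    """
    ceiling = min(1.0, 0.3 + 0.7 * step / max(total_steps, 1))
    return [c for c in clips if c.complexity <= ceiling]
```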
The architecture employs a unified multimodal co-generation framework based on MMDiT, using deep cross-modal interaction to ensure temporal synchronization and semantic consistency. Concretely, this involves a "two-branch Diffusion Transformer" and a cross-modal joint module that processes the audio and video paths together during generation.
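The report names the two-branch design and the joint module but does not publish their internals. The PyTorch sketch below shows one common way such a block could be wired, with joint self-attention over the concatenated audio and video token streams; all module names, dimensions, and the residual layout are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class JointAVBlock(nn.Module):
    """Illustrative two-branch block: per-modality projections that meet
    in one shared attention step (dimensions are made up)."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.video_mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))
        self.audio_mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))
        # Joint attention runs over the concatenation of both token streams,
        # so each modality can attend to the other at every block.
        self.joint_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        tokens = torch.cat([video_tokens, audio_tokens], dim=1)
        attended, _ = self.joint_attn(tokens, tokens, tokens)
        n_video = video_tokens.shape[1]
        video_out = video_tokens + self.video_mlp(attended[:, :n_video])
        audio_out = audio_tokens + self.audio_mlp(attended[:, n_video:])
        return video_out, audio_out

# Example: a batch of 2 samples with 16 video tokens and 8 audio tokens each.
v, a = torch.randn(2, 16, 768), torch.randn(2, 8, 768)
v_out, a_out = JointAVBlock()(v, a)
```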
Post-training combines supervised fine-tuning (SFT) on high-quality data with Reinforcement Learning from Human Feedback (RLHF) tailored to audio-visual scenarios, using a multi-dimensional reward model that evaluates motion quality, aesthetics, audio fidelity, synchronization, and expressiveness. The report notes that RLHF infrastructure optimizations improved training speed nearly threefold.
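The reward model's structure and weights are not public. As a minimal sketch, assuming each dimension yields a normalized score, the per-dimension outputs could be collapsed into the single scalar an RLHF update needs via a weighted sum (the weights below are illustrative):

```python
# Hypothetical weights; the actual reward model is not public.
REWARD_WEIGHTS = {
    "motion_quality": 0.25,
    "aesthetics": 0.20,
    "audio_fidelity": 0.20,
    "av_sync": 0.20,
    "expressiveness": 0.15,
}

def aggregate_reward(scores: dict[str, float]) -> float:
    """Collapse per-dimension scores (each assumed in [0, 1]) into one
    scalar that a policy-gradient RLHF step can optimize."""
    return sum(REWARD_WEIGHTS[k] * scores[k] for k in REWARD_WEIGHTS)

print(aggregate_reward({"motion_quality": 0.8, "aesthetics": 0.7,
                        "audio_fidelity": 0.9, "av_sync": 0.85,
                        "expressiveness": 0.6}))  # ≈ 0.78
```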
Inference acceleration techniques include multi-stage distillation to reduce diffusion sampling steps, combined with quantization and parallel engineering optimizations, resulting in a more than tenfold increase in end-to-end inference speed while maintaining performance.
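Where such gains come from is easiest to see as compounding factors. The back-of-envelope sketch below is illustrative only; the report does not break down per-technique contributions, so every number here is an assumption:

```python
# All numbers are assumptions for illustration, not reported figures.
baseline_steps = 50        # a typical diffusion sampling budget
distilled_steps = 8        # after multi-stage step distillation

step_speedup = baseline_steps / distilled_steps   # 6.25x fewer model calls
kernel_speedup = 1.7                              # assumed gain from quantized kernels
total_speedup = step_speedup * kernel_speedup     # ~10.6x end to end

print(f"steps: {step_speedup:.2f}x, total: {total_speedup:.1f}x")
```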
Evaluation and Prompt Adherence
The evaluation criteria for Seedance 1.5 Pro introduce a metric called "video vividness," which assesses a video's dynamic expressiveness. This targets a common industry trade-off in which models buy stability by slowing down action, sacrificing dynamism. Vividness is assessed across four dimensions: action, camera, atmosphere, and emotion. The action dimension covers micro-expressions, body posture, detailed movements, and natural character-environment interaction; the camera dimension covers composition and movement in relation to the narrative.
Audio evaluation is divided into four parts: audio prompt adherence, audio quality, audio-visual synchronization, and audio expressiveness. This includes checking for missing sound effects, inaccurate language/dialect, audio-visual mismatches, distortion, spatial sense, and appropriate background music.
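Neither rubric's scoring scale is published. Purely as a sketch, the two groups of axes above could be encoded as follows, with an assumed 0-5 rating per axis averaged within each group:

```python
from statistics import mean

# Axis names come from the article; the 0-5 scale and unweighted
# averaging are assumptions.
RUBRIC = {
    "vividness": ["action", "camera", "atmosphere", "emotion"],
    "audio": ["prompt_adherence", "quality", "av_sync", "expressiveness"],
}

def score_clip(ratings: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average the 0-5 ratings within each rubric group."""
    return {group: mean(ratings[group][axis] for axis in axes)
            for group, axes in RUBRIC.items()}

example = {
    "vividness": {"action": 4, "camera": 3.5, "atmosphere": 4, "emotion": 3},
    "audio": {"prompt_adherence": 5, "quality": 4, "av_sync": 4.5,
              "expressiveness": 4},
}
print(score_clip(example))  # {'vividness': 3.625, 'audio': 4.375}
```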
Prompt adherence has been redefined around user intent rather than literal keyword matching. The model is allowed to make intent-consistent creative extensions, such as filling in missing details or optimizing narrative structure, provided they align with the prompt's core intent.