ByteDance's Seedance 1.5 Pro AI Model Generates Dialect Video with Enhanced Audio-Visual Sync

Emily Carter
Image: A stylized digital brain with glowing connections, representing advanced AI video generation and dialect processing.

ByteDance's Volcengine unit has introduced Seedance 1.5 Pro, a new video generation model that features advanced audio-visual synchronization and robust capabilities in Chinese and dialect content creation. The model, unveiled at the Volcengine Force Original Power Conference, is accessible through platforms such as Doubao, Jiemeng, and Volcengine Ark.

Users can access Seedance 1.5 Pro on Doubao under "Video Generation" or "Animate Photos." On Jiemeng, the "Generate Video" function uses the 3.5 Pro model, which is built on Seedance 1.5 Pro's underlying technology; it supports text-to-video generation, single reference-image inputs, and first/last-frame controls. Volcengine Ark also offers a demo experience in Peking Opera and famous-painting styles, and an API for the model is currently available for reservation.

Key Capabilities of Seedance 1.5 Pro

The model's core features include sophisticated audio-visual synchronization in complex scenes, the ability to generate content in various Chinese dialects, and enhanced emotional expressiveness.

Testing of audio-visual synchronization demonstrated precise lip-syncing in scenarios ranging from a monkey rapping in a studio to multi-person dialogues. For instance, a prompt for a rapping monkey produced accurate lip movements, though the rap rhythm was noted as an area for potential improvement. The model also successfully assigned dialogue lines to specific characters in multi-person scenes, although it misinterpreted a request for "canned laughter" by generating actual cans.

Seedance 1.5 Pro supports multi-shot audio-visual synchronization and can generate videos up to 12 seconds in length. This allows for the creation of short advertisements or storyboards using reference images and text. An example involved generating a 12-second plot featuring characters Rick and Morty, which accurately matched dialogue, sound effects, camera cuts, and movements.

The model's Chinese and dialect capabilities are a significant highlight. It supports multiple languages including English, Japanese, Korean, and Spanish. However, its proficiency in Chinese dialects, such as Cantonese, Sichuanese, Shanghainese, Northeastern dialect, and Taiwanese accent, is particularly strong. The model can generate conversations where characters speak in their respective dialects or switch between them. To achieve specific language or dialect outputs, users must provide prompts in the target language.
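Since dialect output is steered by writing the prompt in the target language, one practical pattern is to tag each dialogue line with its speaker and dialect while keeping the text itself in that dialect. The bracketed tagging convention below is an illustrative assumption, not Seedance's documented prompt grammar:

```python
def dialect_dialogue_prompt(lines: list[tuple[str, str, str]]) -> str:
    """Compose a multi-speaker prompt where each entry names the speaker,
    the dialect to use, and the dialogue text written in that dialect.

    The "[speaker, speaking dialect]" tag format is a hypothetical
    convention; the model's actual prompt conventions may differ.
    """
    return "\n".join(
        f"[{speaker}, speaking {dialect}]: {text}"
        for speaker, dialect, text in lines
    )

prompt = dialect_dialogue_prompt([
    ("Grandfather", "Cantonese", "食咗饭未呀？"),        # "Have you eaten yet?"
    ("Granddaughter", "Sichuanese", "还没得，马上就吃。"),  # "Not yet, about to."
])
print(prompt)
```

Keeping each line's text in its own dialect follows the article's note that the model needs prompts in the target language to produce that language's output.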

Seedance 1.5 Pro also exhibits improved emotional expressiveness. By specifying different contexts, the model can deliver varied emotional tones for the same dialogue line, down to subtle details like trembling mouth corners, forced smiles, and changes in vocal tone. The model can also generate suitable performances based solely on sentence content, weaving emotional nuance into background music, sound effects, and camera movements; an example of a first-person fighter-jet cockpit clip demonstrated how these elements combine into a rich, immersive video.

Volcengine also mentioned a forthcoming "draft samples" feature, which will produce lower-resolution drafts for user confirmation before rendering the high-definition final product, reducing costly iterations in the video creation process.

The Seedance 1.5 Pro update is positioned to enhance the industrialization of AI video by enabling the generation of advertising-level or even film-level content that integrates images, dialogue, sound effects, rhythm, and emotion. This development is expected to influence future creative methods and concepts in AI video production, incorporating sound as a key consideration alongside visuals.