ByteDance's Seedance 1.5 Pro Model Features Dialect Speech and Enhanced A/V Sync

ByteDance has released Seedance 1.5 Pro, a video generation model that produces synchronized audio and visual output, with a particular focus on localized content. The release bundles several upgrades, including dialect speech, stronger semantic understanding, and autonomous camera control.
Key Model Enhancements
The Seedance 1.5 Pro model supports synchronized audio and visual generation, including speech in various Chinese dialects, with improved lip-sync and intonation alignment. Semantic understanding has also been enhanced, allowing the model to better interpret narrative context and to control emotional expression and character performance in step with the sound and visuals.
The model also features precise camera control, enabling autonomous camera movements such as long takes, dolly zooms, and Hitchcock zooms. Users can generate videos from specified start and end frames, with duration options of 5 or 10 seconds and a maximum of 12 seconds per generation.
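To make these options concrete, the sketch below shows what such a generation request might look like. It is an illustration only: the field names, model identifier, and URLs are assumptions for exposition, not documented Seedance 1.5 Pro API parameters.

```python
# Hypothetical request payload for a start/end-frame video generation job.
# All field names and the model ID are illustrative assumptions, not
# confirmed Seedance 1.5 Pro API parameters.
payload = {
    "model": "seedance-1-5-pro",  # hypothetical model identifier
    "prompt": (
        "An elderly man eats noodles at a street stall and chats "
        "in Shaanxi dialect; handheld camera, natural light."
    ),
    "duration": 12,  # seconds; 5 and 10 are also supported, 12 is the maximum
    "first_frame_url": "https://example.com/start.png",  # optional start frame
    "last_frame_url": "https://example.com/end.png",     # optional end frame
    "audio": True,  # request synchronized audio alongside the video
}
```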
Dialect Generation Capabilities
A notable feature of Seedance 1.5 Pro is its ability to directly output dialects with synchronized audio and visuals. This capability could expand applications in film and television by allowing characters to speak in authentic local dialects.
For instance, a test prompt depicting an elderly Shaanxi man eating noodles and speaking in Shaanxi dialect demonstrated the model's handling of dialect-specific vocabulary and intonation. The model rendered the scene accurately, including the character's actions and speech rhythm, even while the character was drinking and speaking at the same time.
Another test involved a complex scene with three characters of different ages and genders speaking Sichuan dialect at a mahjong table. The model generated appropriate voices and intonations for each character, including unique Sichuan dialect phrases. The camera work in this generation was also dynamic, automatically rotating to focus on the speaker and mimicking a handheld perspective.
A Cantonese dialect test in a restaurant setting also showed strong adherence to the prompt, capturing details like the setting, character actions, and dialogue. The model maintained visual and temporal consistency, rendering Cantonese pronunciations accurately.
According to information reviewed by toolmesh.ai, the emphasis on dialects reflects attention to local culture, an area of particular importance for domestically developed models.
Non-Human Audio-Visual Synchronization
The model's synchronization capabilities extend to non-human subjects, such as pets. A test of a cat eating a pan-fried bun, producing "kazi kazi" crunching sounds, showed the sound effects accurately matched to the on-screen chewing, along with realistic feline expressions.
Another test involved a Ragdoll cat speaking human language in a baby voice. The model preserved the cat's anatomical structure while producing the specified voice and rhythm, even conveying sleepiness through the delivery of the speech.
Emotional Expression
Seedance 1.5 Pro demonstrates realistic emotional expression through synchronized audio and visuals. A prompt describing a survivor in a bunker expressing fear and pleading showed the model's ability to convey complex emotions through facial expressions, voice modulation, and subtle physical details like Adam's apple movement and saliva.
A cyberpunk-themed prompt with a mechanic and a robot, using English and a stylized art style, tested lip-sync and emotional changes in a 2D context. The model maintained stable lip-sync and expressions, even with a side profile, and integrated sobbing sounds with facial movements.
Camera Control
The model shows improved performance in complex camera movement control. A test of a "Hitchcock zoom" in a European castle corridor, in which the background appears to compress while the subject stays the same size in frame, executed over a continuous 12-second shot. The model maintained background coherence and synchronized sound effects with the character's breathing to convey nervousness.
A long-take test, tracking a warehouse picker pushing a cart through a logistics warehouse, demonstrated the model's ability to maintain physical consistency and stable camera movement through various sections, including turns and a final zoom to the character's face.
Conclusion
While the model shows strong performance in most areas, some minor issues were identified, such as occasional confusion when a dialect is very close to Mandarin and difficulty maintaining consistent voice timbre across scenes in longer videos.
However, Seedance 1.5 Pro's ability to generate synchronized audio and video purely from text, combined with autonomous camera movement, significantly reduces video production complexity. This advancement is expected to drive further development of video generation products and video agents.
Individual users can access Seedance 1.5 Pro on Jiemeng AI, Doubao App, and the Volcano Ark Experience Center. Enterprise users can utilize the model API on Volcano Engine starting December 23rd.
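For orientation, the snippet below sketches how an enterprise user might submit a generation task, assuming the Volcano Engine Ark Python SDK (volcenginesdkarkruntime) exposes Seedance 1.5 Pro through the same asynchronous content-generation task interface used by earlier Seedance models on Ark; the model ID is a placeholder to be replaced with the actual endpoint ID from the Volcano Engine console.

```python
import os
import time

# Assumes the Volcano Engine Ark Python SDK:
#   pip install "volcengine-python-sdk[ark]"
from volcenginesdkarkruntime import Ark

client = Ark(api_key=os.environ["ARK_API_KEY"])

# Submit an asynchronous text-to-video task. The model ID is a placeholder;
# substitute the real Seedance 1.5 Pro endpoint ID from the console.
task = client.content_generation.tasks.create(
    model="seedance-1-5-pro",
    content=[{
        "type": "text",
        "text": "A cat crunches a pan-fried bun, close-up, synchronized sound.",
    }],
)

# Video generation on Ark is asynchronous: poll the task until it finishes.
while True:
    result = client.content_generation.tasks.get(task_id=task.id)
    if result.status in ("succeeded", "failed"):
        break
    time.sleep(5)

if result.status == "succeeded":
    # Attribute path follows the pattern of earlier Seedance tasks on Ark.
    print(result.content.video_url)
else:
    print("Generation failed")
```

Because generation is not instantaneous, the submit-and-poll pattern above avoids blocking on a single long-running request.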