Alibaba's Tongyi Wanxiang 2.6 Video Model Introduces Role-Playing, Storyboard Features

Alibaba has launched its Tongyi Wanxiang 2.6 series models, adding new features such as role-playing and storyboard control. The company says these additions make it the industry's most feature-rich video model.
The release follows the September debut of Wanxiang 2.5, which ranked first in China for text-to-video generation on the LMArena benchmark.
Enhanced Capabilities for Professional Production
Tongyi Wanxiang 2.6 has been upgraded for professional film and television production, adding a role-playing function: users supply the likeness and voice of a specific person, character, or object, and the model keeps both consistent throughout the generated video.
In one example of role-playing, a user uploaded a personal video and entered a sci-fi suspense prompt; Wanxiang 2.6 completed the storyboard design, character performance, and voiceover in minutes.
Integrated Audio-Visual Role-Playing
The model's integrated audio-visual role-playing function produces realistic characters, voices, and sound effects. According to Alibaba, Wanxiang 2.6 is the second model globally, and the first from China, to offer this feature. The company notes that while Sora 2 previously introduced a similar capability, Wanxiang 2.6 combines role-playing with cinematic visual quality.
Tongyi Wanxiang's model structure integrates multiple technologies. It performs multi-modal joint modeling on an input reference video, capturing the subject's emotions, postures, and visual features together with their temporal ordering, and it extracts acoustic features such as timbre and speech rate. These features serve as reference conditions during generation, so both appearance and voice transfer consistently from the reference video to the output. Any person or object can thereby become the protagonist of a solo performance or a duet.
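Alibaba has not published the internals of this conditioning pipeline, but the description above can be sketched abstractly: visual features are kept in temporal order, acoustic features are pooled per speaker, and both are bundled into a single conditioning payload. All class and field names below are hypothetical illustrations, not Wanxiang's API.

```python
from dataclasses import dataclass

@dataclass
class VisualReference:
    """Features sketched from one frame of a reference video (hypothetical fields)."""
    appearance: list[float]   # subject's visual-feature embedding
    emotion: str              # e.g. "calm", "tense"
    posture: str              # e.g. "standing", "seated"

@dataclass
class AcousticReference:
    """Voice features extracted from the reference audio track (hypothetical)."""
    timbre: list[float]       # speaker-identity embedding
    speech_rate_wps: float    # speech rate, words per second

def build_conditioning(frames: list[VisualReference],
                       voice: AcousticReference) -> dict:
    """Combine visual and acoustic references into one conditioning payload.

    The frame index preserves temporal information for the visual stream,
    while the acoustic features are global to the speaker.
    """
    return {
        "visual": [(i, f.appearance, f.emotion, f.posture)
                   for i, f in enumerate(frames)],
        "acoustic": {"timbre": voice.timbre,
                     "speech_rate_wps": voice.speech_rate_wps},
    }
```

A generator conditioned on such a payload could then reproduce the subject's look and voice in new scenes, which is the consistency-and-transfer behavior the article describes.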
The new version's text-to-video and image-to-video functions support videos up to 15 seconds long, giving narratives more spatio-temporal room. Improved instruction following also contributes to visual quality.
Wanxiang 2.6 can also generate non-human videos, supporting pets, cartoon IPs, figurines, objects, or buildings as subjects. Examples include a lifelike Crayon Shin-chan based on a reference video and a scene of Santa Claus riding a kitten. The model also features multi-subject co-shooting, allowing multiple characters to interact in a digital space, such as Guan Yu petting a cat.
Intelligent Storyboarding and Multi-Shot Formula
A significant upgrade in Wanxiang 2.6 is its support for intelligent storyboarding. The model can convert user prompts into multi-shot scripts, generating coherent narrative videos with multiple shots while maintaining consistency in subjects, scenes, and atmosphere.
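At a high level, intelligent storyboarding expands one prompt into an ordered shot list whose shots share the same subject and scene description, under the model's 15-second cap. The sketch below is an illustrative approximation of that expansion step only; the function and field names are assumptions, not Wanxiang's interface.

```python
def storyboard(prompt: str, shot_types: list[str],
               seconds_per_shot: float = 5.0,
               max_seconds: float = 15.0) -> list[dict]:
    """Expand a single prompt into a multi-shot script.

    Every shot reuses the same prompt text as its description, which is how
    consistency of subject, scene, and atmosphere carries across shots in
    this sketch. Shots that would exceed the duration cap are dropped.
    """
    shots, t = [], 0.0
    for i, camera in enumerate(shot_types, start=1):
        if t + seconds_per_shot > max_seconds:
            break
        shots.append({"shot": i, "camera": camera,
                      "start": t, "end": t + seconds_per_shot,
                      "description": prompt})
        t += seconds_per_shot
    return shots
```

For example, a neon-alley prompt with close-up, medium, and extreme-long shot types yields three 5-second shots filling the full 15-second budget.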
Tests were conducted to evaluate Wanxiang 2.6's capabilities in multi-shot switching, logical consistency, and cinematic aesthetics.
In a "Cyberpunk Film Noir" test, the model was evaluated on its ability to consistently model light source position, ground reflections, and character attire details during shot transitions in a complex neon-lit, rainy environment. The test required specific shot compositions (close-up, medium shot, extreme long shot) and consistency, such as a holographic advertisement in a reflection appearing as a physical background in a long shot. Wanxiang 2.6 met these requirements.
A "Victorian Twilight" test assessed the model's stability in relative object positions and character facial features from different angles. This involved an extreme close-up of a hand, a circling shot of a profile portrait, and a pull-out shot of a room. The challenge included maintaining the sheen of a velvet dress and the constant direction of a "sunset" light source across shots. The model performed well, with natural faces and light/shadow changes.
An "Interstellar Explorer" test challenged the model's semantic understanding of large-scale scene construction and seamless switching from microscopic surface details to macroscopic planetary views. The video featured dynamic continuous camera movement, including a low-angle follow, a frontal subjective reflection in an astronaut's visor, and a God's eye view. The model preserved continuous footprint tracks and aligned light, shadow, and spatial logic.
The Wanxiang team has released prompt formulas for role-playing and storyboard control to get the most out of the model. These formulas give precise control over shot structure, camera position, and timing, while maintaining consistency across multiple shots.
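The published formulas themselves are not reproduced in this article, but a structured shot-by-shot prompt of the kind described can be mocked up as follows. The bracketed field order (shot / camera / duration / action) and both function names are illustrative assumptions, not Wanxiang's published syntax.

```python
def shot_line(index: int, camera: str, duration_s: int, action: str) -> str:
    """Render one storyboard-control line: shot number, camera position,
    timing, then the action. Field order here is hypothetical."""
    return f"[Shot {index}] [{camera}] [{duration_s}s] {action}"

def storyboard_prompt(subject: str, shots: list[tuple[str, int, str]]) -> str:
    """Prefix a shared subject description so it applies to every shot,
    then list the shots in order, one per line."""
    lines = [f"Subject: {subject}"]
    lines += [shot_line(i, camera, duration, action)
              for i, (camera, duration, action) in enumerate(shots, start=1)]
    return "\n".join(lines)
```

Putting the shared subject description first is one simple way a formula can enforce cross-shot consistency: every shot inherits the same subject rather than redescribing it.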
Wanxiang 2.6 retains its text-to-video and image-to-video functions, and in generated content it synchronizes facial expressions and micro-expressions, such as eyebrow raises and smiles, with speech.
Industry Impact
Alibaba states that Wanxiang 2.6's role-playing and storyboard control capabilities open new possibilities for film and television production. Tasks that previously required professional team collaboration can now be initiated by individuals using a single prompt.