Tongyi Bailin Releases Upgraded Voice Models, Open-Sourcing Fun-CosyVoice3 and Fun-ASR-Nano

Emily Carter

Tongyi Bailin has announced significant upgrades to its voice models, including enhancements to Fun-CosyVoice3 and Fun-ASR, alongside the open-sourcing of Fun-CosyVoice3 (0.5B) and Fun-ASR-Nano (0.8B). The updates aim to improve performance in areas such as voice cloning, multilingual support, and speech recognition in challenging environments.

The upgraded Fun-CosyVoice3 model now supports voice cloning across 9 languages and 18 dialects, with emotion control capabilities. It can clone a voice from a 3-second audio sample and apply it to new speech in various languages, including Chinese, Cantonese, Japanese, and English, while maintaining emotional nuances.

The Fun-ASR model has enhanced speech recognition capabilities, achieving 93% accuracy in noisy environments. It supports lyrics and rap recognition, handles free mixing of 31 languages, and covers numerous Chinese dialects and accents. First-word latency for the streaming recognition models has been reduced to 160ms.

Model Enhancements and Open-Source Releases

The Fun-CosyVoice3 model has received several key upgrades. Its first-packet latency has been cut by 50%, enabling bidirectional streaming synthesis for real-time applications such as voice assistants and live dubbing. The Chinese-English mixed word error rate (WER) has dropped by 56.4%, improving accuracy on sentences that contain technical terms or code-switching. In zero-shot text-to-speech (TTS) evaluations, content consistency and timbre similarity have both improved, with a 26% relative reduction in character error rate (CER) in complex scenarios. The model now supports 9 common languages, 18 Chinese dialects, and 9 emotion controls, along with cross-language timbre replication.
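To be clear about how such relative-reduction figures are read (the numbers below are illustrative, not from the release): a 56.4% relative WER reduction means the new error rate is 43.6% of the old one, not that 56.4 percentage points were subtracted.

```python
def relative_reduction(old: float, new: float) -> float:
    """Relative reduction of an error rate, as a percentage of the old rate."""
    return (old - new) / old * 100.0

# Illustrative only: a baseline mixed-language WER of 5.0% falling to 2.18%
# is a 56.4% relative reduction.
print(round(relative_reduction(5.0, 2.18), 1))  # 56.4

# Likewise, a CER going from 10.0% to 7.4% is a 26% relative reduction.
print(round(relative_reduction(10.0, 7.4), 1))  # 26.0
```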

Tongyi Bailin has also open-sourced Fun-CosyVoice3-0.5B, which provides zero-shot timbre cloning. This version allows users to replicate a voice from a 3-second audio sample for new speech synthesis and supports local deployment and secondary development. It is available on platforms including ModelScope, HuggingFace, and GitHub.

Advanced Speech Recognition with Fun-ASR

The Fun-ASR model, designed for end-to-end speech recognition, has undergone comprehensive upgrades. It is trained on millions of hours of speech data and is used in applications such as DingTalk's "AI Meeting Notes." The latest enhancements focus on robustness in noisy environments, multilingual mixing, Chinese dialect coverage, lyrics recognition, and customization.

Fun-ASR achieves 93% recognition accuracy in high-noise and far-field environments. It can now recognize songs and rap, with improved performance under background-music interference. The model supports free mixing of 31 languages, switching between them automatically without prior language specification. It has optimized handling for East Asian and Southeast Asian languages and accurately processes code-switched sentences. For Chinese, it covers 7 major dialects and 26 regional accents.

For enterprise-level customization, Fun-ASR incorporates a Retrieval Augmented Generation (RAG) mechanism, increasing the custom hotword limit from 1,000 to 10,000 without compromising general recognition accuracy. This feature supports high-recall and high-accuracy recognition of specialized terms in fields like finance, medicine, and education.

Lightweight Speech Recognition Model

In addition to the Fun-ASR upgrades, Tongyi Bailin has launched Fun-ASR-Nano, a lightweight version with a total parameter count of 0.8B. This model offers lower inference costs and has been open-sourced for local deployment and customized fine-tuning. It is accessible via ModelScope, HuggingFace, and GitHub.