Alibaba's AI video model rises to No. 2 in global rankings, as OpenAI's Sora and ByteDance's Seedance fall away
AI Summary
Alibaba Cloud released HappyHorse 1.1, an upgraded AI video generation model claiming the No. 2 global ranking. The model targets enterprise customers with production-ready video synthesis and API access amidst a consolidating AI video market.
Alibaba Cloud on Sunday released HappyHorse 1.1, a major upgrade to its AI video generation model that the company says delivers production-ready video synthesis across core content creation scenarios. The model is now live on Alibaba Cloud Model Studio with full API access for enterprise customers and developers, accompanied by a 40% sitewide launch discount for the first two weeks. The release arrives at a moment of remarkable upheaval in the AI video generation market — and Alibaba appears keenly aware of the timing. OpenAI discontinued Sora after it proved financially unsustainable. ByteDance indefinitely shelved the international rollout of Seedance 2.0 following a barrage of copyright complaints from Hollywood studios. For enterprise procurement teams that had been evaluating or integrating those tools into marketing, advertising, and content production workflows, the competitive landscape has contracted sharply in a matter of months. That contraction creates both an opportunity and a test for Alibaba. HappyHorse 1.1 is not a research demo or a consumer toy — it is an API-first product built for integration into enterprise software stacks, priced for volume, and backed by a $52.7 billion global infrastructure buildout. Whether it can convert technical capability into enterprise adoption, particularly in Western markets navigating intensifying U.S.-China tech tensions, will determine whether Alibaba can establish itself as a serious player in the generative video market that analysts expect to reach tens of billions of dollars by the end of the decade. How HappyHorse climbed from anonymous benchmark entry to top-ranked video model HappyHorse first appeared in early April as an anonymous submission on the Artificial Analysis Video Arena, an independent benchmarking platform where real users compare model outputs in blind, side-by-side evaluations. The model immediately claimed the top position in both text-to-video and image-to-video rankings. Alibaba was subsequently confirmed as the creator, revealing it was built by the company's ATH (Alibaba Token Hub) AI Innovation Unit — a team previously part of the Future Life Lab under the Taobao and Tmall Group before a strategic organizational restructuring. According to Arena.ai, HappyHorse 1.0 now holds the No. 2 position across all three Video Arena leaderboards. The platform noted the model scores 1,444 in both text-to-video and image-to-video categories, leading Google's Veo-3.1 (with audio) by 69 points in text-to-video and xAI's Grok-Imagine-Video by 23 points in image-to-video. In Elo-based ranking systems like Arena's, models gain or lose points based on whether users prefer their outputs in head-to-head comparisons, meaning persistent double-digit leads reflect a consistent quality gap as perceived by human evaluators — not a statistical fluke. The model's architecture helps explain why. According to community-compiled technical documentation, HappyHorse is built around a 15-billion-parameter unified self-attention Transformer that processes text, image, video, and audio tokens within a single token sequence. Unlike many competitors that stitch together separate models for video and audio, HappyHorse operates as a unified system that handles all modalities in a single generation pass, eliminating the need for third-party dubbing or post-processing audio tools. For enterprise buyers evaluating total cost of ownership, that architectural simplicity translates directly into fewer integration points, fewer vendor dependencies, and faster time to production. What the 1.1 upgrade fixes — and why it matters for commercial video production The 1.1 upgrade targets a set of pain points that enterprise video production teams know intimately. Alibaba Cloud described the release as "systematically optimized across core content generation scenarios," and the specific improvements reveal a model that has been tuned for commercial deployment rather than viral social media demos. The most consequential upgrade is multi-image reference capability, which Alibaba calls R2V (Reference-to-Video). The feature allows users to upload multiple character reference images and maintain consistent identity across generated video — directly addressing one of the hardest problems in AI video production, where subjects tend to drift in appearance between frames or shots. For brands producing advertising campaigns, product videos, or serialized marketing content, identity consistency is not a nice-to-have; it is a requirement that has historically forced teams back to traditional production methods. Motion quality receives a significant overhaul, with what Alibaba describes as "strengthened motion modeling" that addresses prior limitations in speed and fluidity. The company also made targeted improvements to visual texture, specifically calling out the elimination of "facial oiliness," "over-sharpening," and "unnatural textures" — artifacts that have plagued commercial AI video sinc