HappyHorse 1.0: Alibaba's Mystery AI Video Model That Topped Every Benchmark

Genra AI

On April 7, 2026, an unnamed model appeared on the Artificial Analysis Video Arena leaderboard with no announcement, no team, and no public weights. Within days it ranked #1 in both Text-to-Video and Image-to-Video. Then Alibaba stepped forward.

The Anonymous Model That Broke the Leaderboard

The AI video space has a leaderboard problem. When a well-known lab submits a model, community voting can be biased by name recognition alone. People vote for the brand as much as the output. It is a dynamic that has plagued LLM benchmarks for years.

On April 7, 2026, someone decided to sidestep that problem entirely. An AI video model appeared on the Artificial Analysis Video Arena leaderboard under a name nobody recognized: HappyHorse. No press release. No company logo. No associated research lab. Just raw outputs submitted for blind human evaluation.

Within 48 hours, HappyHorse climbed to the top of the Text-to-Video leaderboard with an Elo rating of 1389 — a full 115 points ahead of Seedance 2.0, the previous leader. On Image-to-Video, it posted an Elo of 1416, again first place. The gap was not marginal. It was a decisive lead in both categories.

The AI community did what it always does: speculated. Was it Google DeepMind testing something? A startup nobody had heard of? An open-source project that had been quietly training for months?

On April 9-10, 2026, a newly created X (formerly Twitter) account revealed the answer. HappyHorse 1.0 was built by Alibaba's ATH AI Innovation Unit, a new division led by a name that immediately explained the model's quality: Zhang Di, former VP of Kuaishou and the architect behind Kling AI.

The man who built Kling had quietly built its replacement.

The Dramatic Origin Story: From Kling AI to HappyHorse

To understand why HappyHorse matters, you need to understand who built it and why they left their previous company to do so.

Zhang Di: The Most Important Name in Chinese AI Video

Zhang Di served as Vice President of Kuaishou, one of China's largest short-video platforms and the main domestic rival to Douyin, TikTok's Chinese counterpart. At Kuaishou, he led the development of Kling AI, which became one of the most capable AI video generation systems in the world. Kling consistently ranked at or near the top of public benchmarks and was widely regarded as the leading Chinese AI video model through most of 2025.

Then, at the end of 2025, Zhang Di left Kuaishou.

He joined Alibaba Group to lead the Taotian Future Life Lab, an R&D division under Alibaba's e-commerce arm. The move was significant but received limited coverage in Western media at the time. In China's tech circles, however, it was understood as a major talent acquisition. Alibaba was not just hiring an executive; it was bringing in the person who had built what was widely regarded as the leading AI video system in China.

The Anonymous Reveal

The decision to submit HappyHorse anonymously to the Video Arena was deliberate. By stripping out the Alibaba brand, Zhang Di's team ensured that the model's performance would be evaluated purely on output quality. No halo effect. No pre-existing biases for or against Alibaba's AI capabilities.

When the X account @AthAI_Official confirmed the connection on April 9-10, the reveal landed with impact precisely because the results were already on the board. HappyHorse was not announced and then tested. It was tested, dominated, and then claimed.

The strategic messaging was clear: this team can build a model that beats every competitor on blind evaluation, and they did it within roughly four months of the unit's formation.

ATH AI Innovation Unit

The ATH AI Innovation Unit appears to be a relatively new division within Alibaba, distinct from the company's existing Tongyi (Qwen) AI lab. Details about the unit's structure are limited, but the model's capabilities suggest a well-resourced team with deep expertise in video generation architectures. The name "ATH" has not been publicly explained by Alibaba, though it may reference "Alibaba Taotian Holdings," the e-commerce subsidiary under which the Taotian Future Life Lab operates.

Technical Architecture: What Makes HappyHorse Different

HappyHorse 1.0 is not simply a larger version of existing video models. Its architecture represents a meaningful departure from the multi-stage pipelines that most AI video systems use today.

Core Specifications

  • Parameters: 15 billion
  • Architecture: Unified 40-layer self-attention Transformer
  • Design: Single-stream architecture (video + audio generated jointly in one forward pass)
  • Resolution: Native 1080p HD output
  • Generation speed: Approximately 38 seconds for a 1080p clip on a single H100 GPU

Single-Stream Unified Generation

Most existing AI video models that handle both video and audio do so with separate modules. A video generation backbone produces the visual frames, and a separate audio model — often using cross-attention mechanisms — generates corresponding sound. This multi-stage approach introduces latency, synchronization artifacts, and compounding errors between the visual and audio streams.

HappyHorse takes a fundamentally different approach. Its single-stream architecture generates video and audio jointly within the same forward pass through a unified 40-layer self-attention Transformer. There are no cross-attention modules bridging separate visual and audio sub-networks. Instead, both modalities share the same attention layers, allowing the model to learn joint representations of how visual content and sound relate to each other.

The practical result: lip movements, ambient sounds, music, and Foley effects are generated in tight synchronization because they emerge from the same computational process, not from two separate systems trying to stay aligned.
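The structural idea can be illustrated with a toy example. The sketch below is not HappyHorse's implementation, which has not been published. It assumes single-head attention with identity projections and random token vectors, and the names `video_tokens`, `audio_tokens`, and `self_attention` are hypothetical. It shows only that once both modalities sit in one sequence, every audio token attends directly to every video token within the same layer, with no separate cross-attention module.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity query/key/value projections, to keep the sketch short.
    Returns the attended outputs and the attention weight matrix.
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(row[j] * tokens[j][i] for j in range(len(tokens))) for i in range(dim)]
        )
    return outputs, weights


random.seed(0)
DIM = 8
# Hypothetical token embeddings: 4 video tokens and 2 audio tokens.
video_tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(4)]
audio_tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(2)]

# Single stream: both modalities are concatenated into one sequence.
stream = video_tokens + audio_tokens
outputs, weights = self_attention(stream)

# Share of the first audio token's attention that lands on video tokens.
audio_to_video = sum(weights[len(video_tokens)][: len(video_tokens)])
print(f"audio token 0 -> video attention mass: {audio_to_video:.3f}")
```

In a real model, learned query, key, and value projections and many stacked layers would replace the identity projections used here.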

15 Billion Parameters in Context

At 15 billion parameters, HappyHorse is not the largest video model in existence — some competitors exceed 30B parameters — but its performance suggests that architectural efficiency matters more than raw scale. The unified single-stream design likely reduces redundant computation that multi-module systems carry. The 40-layer depth provides sufficient representational capacity for joint audio-video modeling without the overhead of maintaining separate attention pathways.

For reference, the approximately 38-second generation time for a 1080p clip on a single H100 is competitive. Many comparable models require multiple GPUs or significantly longer generation times to produce equivalent-resolution output.
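A rough weight-memory calculation shows why single-GPU inference is plausible at this size. The numeric precision HappyHorse runs at has not been disclosed, so the precisions below are assumptions, and the figures cover model weights only, not activations or other runtime buffers. The 80 GB capacity is that of the common 80 GB H100 configuration.

```python
PARAMS = 15_000_000_000          # 15 billion parameters, as stated by the team
H100_MEMORY_GB = 80              # common 80 GB H100 configuration

# Assumed precisions; the actual inference precision is not public.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_memory_gb(params, bytes_per_param):
    """Memory needed for weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


footprints = {
    precision: weight_memory_gb(PARAMS, size)
    for precision, size in BYTES_PER_PARAM.items()
}

for precision, gigabytes in footprints.items():
    headroom = H100_MEMORY_GB - gigabytes
    print(f"{precision}: {gigabytes:.0f} GB weights, {headroom:.0f} GB left for activations")
```

At 16-bit precision the weights would take about 30 GB, leaving the rest of the card for activations.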

Key Capabilities: What HappyHorse Can Actually Do

Benchmark Elo scores tell you a model wins blind comparisons. They do not tell you what the model is specifically good at. Based on available demonstrations and technical disclosures from the ATH AI team, here is what HappyHorse 1.0 delivers.

Unified Audio-Video Generation

This is HappyHorse's headline feature and the one most likely to matter commercially. In a single generation pass, the model produces:

  • Dialogue with precise lip-sync — Characters speak with mouth movements that match the audio waveform at a phoneme level, not just rough jaw movement
  • Ambient sound — Environmental audio appropriate to the scene (city streets, nature, indoor spaces) generated contextually
  • Music — Background music that matches the mood and pacing of the visual content
  • Foley effects — Sound effects tied to on-screen actions (footsteps, door closes, object interactions) timed to the visual events

All of this happens in one forward pass. No post-processing audio pipeline. No separate TTS system bolted on afterward. The implications for production workflows are significant: what normally requires a video model, a speech synthesis system, a Foley library, and a mixing engineer is collapsed into a single generation step.

Multi-Language Lip-Sync

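HappyHorse supports lip-synchronized dialogue in seven languages: English, Mandarin, Cantonese, Japanese, Korean, German, and French. The team claims "ultra-low word error rate" lip-sync across these languages, though it has published no figures or measurement method. Word error rate normally measures how accurately speech can be transcribed, so the claim most directly concerns the generated dialogue. The implication is that mouth movements are not just generically open-and-close but are modeled to match the specific phonetic patterns of each language.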

This is technically challenging because languages differ in the mouth shapes their common sounds require. Mandarin's vowel and syllable inventory calls for different lip and jaw positions than English's consonant clusters. Japanese's syllabary produces different articulation patterns than French's liaison-heavy flow. A model that handles all of these in a single architecture is a non-trivial achievement.
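The team has not said how it measures word error rate. In speech evaluation the term usually means the word-level edit distance between a reference script and a transcript of the produced audio, divided by the number of reference words. The sketch below implements that standard definition. The function name and the example sentences are illustrative assumptions, not part of any HappyHorse tooling.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref_words)


# Hypothetical script line and a transcript of the generated speech.
script = "the horse runs across the field"
transcript = "the horse run across field"
wer = word_error_rate(script, transcript)
print(f"WER: {wer:.3f}")
```

A lower value means the generated speech is closer to the intended script. It says nothing by itself about whether the mouth shapes match the audio.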

Character Consistency and Environment Preservation

One of the persistent weaknesses of AI video models has been maintaining consistent character appearance across frames and scenes. A character's face might subtly shift, clothing might change color between cuts, or environmental details might drift. HappyHorse appears to handle character consistency at a level that makes practical applications viable:

  • Animating concept art — Provide a static character illustration and generate video of that character in motion while preserving the original art style
  • Portrait animation — Animate a still photograph into a speaking or moving video while maintaining facial identity
  • Product photo animation — Take a static product image and generate video showing the product in use, from different angles, or in contextual environments

Generation Speed

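HappyHorse is reported to generate output in approximately 10 seconds on average, which would make it one of the fastest models at this quality tier. The resolution, clip length, and hardware behind that average have not been specified; the separately quoted figure of about 38 seconds applies to a 1080p clip on a single H100 GPU. For context, some competing models at similar quality levels take 30-90 seconds per generation. Speed matters for iterative creative workflows where users generate multiple variations before selecting a final output.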

Supported Modes

  • Text-to-Video — Generate video from a text description
  • Image-to-Video — Animate a static image into video
  • Audio generation — Dialogue, music, ambient sound, and Foley effects generated jointly with video

Benchmark Performance: The Numbers in Detail

The Artificial Analysis Video Arena uses blind human evaluation to rank AI video models. Users are shown outputs from two anonymous models side by side and choose which they prefer. The results are converted to Elo ratings — the same system used in chess — where higher scores indicate a model that wins more frequently in head-to-head comparisons.

Here is how HappyHorse 1.0 performs as of mid-April 2026.

Text-to-Video (Without Audio)

Rank | Model          | Elo Rating | Gap to #1
1    | HappyHorse 1.0 | 1389       | --
2    | Seedance 2.0   | 1274       | -115
3    | Kling 3.0      | ~1260      | ~-129

A 115-point Elo gap in a blind human evaluation is substantial. Under the standard Elo formula, it corresponds to an expected win rate of roughly 66% in head-to-head comparisons against the second-ranked model, and a higher rate against every model ranked below that.
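The conversion from an Elo gap to an expected head-to-head win rate can be sketched as follows. This assumes the standard logistic Elo formula with a 400-point scale. The arena's exact rating model is not described here and may differ, so the percentages are indicative only. The gaps are taken from the leaderboard figures quoted in this article.

```python
def expected_win_probability(elo_gap):
    """Expected win rate of the higher-rated model under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400.0))


# Elo gaps between HappyHorse 1.0 and its nearest rivals (no-audio categories).
gaps = {
    "T2V vs Seedance 2.0": 1389 - 1274,
    "T2V vs Kling 3.0 (approx.)": 1389 - 1260,
    "I2V vs Seedance 2.0 (approx.)": 1416 - 1300,
    "I2V vs Kling 3.0 (approx.)": 1416 - 1280,
}

for matchup, gap in gaps.items():
    win_rate = expected_win_probability(gap)
    print(f"{matchup}: gap {gap}, expected win rate {win_rate:.1%}")
```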

Image-to-Video (Without Audio)

Rank | Model          | Elo Rating | Gap to #1
1    | HappyHorse 1.0 | 1416       | --
2    | Seedance 2.0   | ~1300      | ~-116
3    | Kling 3.0      | ~1280      | ~-136

The Image-to-Video lead is essentially the same size, at roughly 116 points. An Elo of 1416 is the highest score any model has achieved on this leaderboard to date. Image-to-Video is arguably the more commercially important mode because it lets users animate existing assets (product photos, concept art, storyboards) rather than generating entirely from text.

Text-to-Video (With Audio)

Rank | Model          | Elo Rating
1    | Seedance 2.0   | 1220
2    | HappyHorse 1.0 | 1215

A 5-point difference at these sample sizes is within the margin of error. This is a statistical tie. Both models produce audio-visual output that human evaluators find equally compelling.
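A back-of-the-envelope calculation shows why a gap this small is hard to distinguish from a tie. The sketch assumes the standard Elo formula and a normal approximation to the binomial, and asks how many direct head-to-head votes would be needed before a 95% confidence interval on the win rate excludes 50%. The arena's actual vote counts and statistical method are not given here, so this illustrates scale rather than reproducing its methodology.

```python
import math


def expected_win_probability(elo_gap):
    """Expected win rate of the higher-rated model under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400.0))


def votes_needed(elo_gap, z=1.96):
    """Approximate head-to-head votes needed for a 95% interval to exclude 50%."""
    if elo_gap == 0:
        raise ValueError("a zero gap can never be separated from a tie")
    p = expected_win_probability(abs(elo_gap))
    margin = p - 0.5
    return math.ceil((z * math.sqrt(p * (1 - p)) / margin) ** 2)


for gap in (5, 115):
    p = expected_win_probability(gap)
    print(f"gap {gap}: win rate {p:.2%}, votes needed ~{votes_needed(gap)}")
```

Under these assumptions, a 5-point gap needs tens of thousands of direct comparisons to resolve, while a 115-point gap is visible after a few dozen.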

Image-to-Video (With Audio)

HappyHorse and Seedance 2.0 are within 2 Elo points of each other in this category — another statistical tie. Neither model has a meaningful advantage when audio quality is factored into the evaluation.

What the Benchmarks Tell Us

The pattern is clear: HappyHorse dominates on pure visual quality with decisive leads in both T2V and I2V without audio. When audio is added to the evaluation, Seedance 2.0 closes the gap to a statistical tie, suggesting that Seedance may have a slight edge on audio quality or audio-visual synchronization that offsets HappyHorse's visual advantage.

For users who primarily need visual output (and will add audio separately or do not need it), HappyHorse is the clear leader. For users who need integrated audio-video output, both models are effectively equivalent on current benchmarks.

HappyHorse 1.0 vs. Seedance 2.0 vs. Kling 3.0: Head-to-Head

The irony of this comparison cannot be overstated. Zhang Di built Kling at Kuaishou. He left. He built HappyHorse at Alibaba. And now HappyHorse outperforms the model he originally created. This is the AI equivalent of a head coach leaving a championship team, joining a rival, and immediately winning a bigger championship.

Category                 | HappyHorse 1.0                        | Seedance 2.0              | Kling 3.0
Developer                | Alibaba (ATH AI)                      | ByteDance                 | Kuaishou
T2V Elo (no audio)       | 1389 (#1)                             | 1274 (#2)                 | ~1260 (#3)
I2V Elo (no audio)       | 1416 (#1)                             | ~1300 (#2)                | ~1280 (#3)
T2V Elo (with audio)     | 1215 (#2)                             | 1220 (#1)                 | N/A
I2V Elo (with audio)     | Statistical tie                       | Statistical tie           | N/A
Parameters               | 15B                                   | Not disclosed             | Not disclosed
Architecture             | Unified single-stream Transformer     | Multi-module pipeline     | Diffusion Transformer
Native resolution        | 1080p                                 | 1080p                     | 1080p
Audio generation         | Unified (single pass)                 | Integrated (multi-module) | Separate pipeline
Lip-sync languages       | 7 (EN, ZH, Cantonese, JA, KO, DE, FR) | Limited disclosure        | 2-3 confirmed
Average generation speed | ~10 seconds                           | ~30 seconds               | ~45 seconds
Open source              | Claimed (weights not yet released)    | No                        | No
API availability         | Coming soon (late April 2026)         | Available                 | Available
Pricing                  | Not yet announced                     | Pay-per-generation        | Pay-per-generation

The Zhang Di Factor

The most striking element of this comparison is the talent pipeline. Zhang Di spent years at Kuaishou building Kling into a top-tier AI video system. He understood its architecture intimately, knew its limitations, and presumably had ideas about how to build something better that Kuaishou's organizational structure or strategic priorities may not have supported.

At Alibaba, with fresh resources and a mandate to build something new, he appears to have done exactly that. The unified single-stream architecture that defines HappyHorse is a philosophical departure from Kling's approach, suggesting that Zhang Di's next-generation ideas required a clean-sheet design rather than incremental improvements to the Kling codebase.

This pattern — a key technical leader leaving one AI lab and building a superior system at a competitor — is becoming a defining dynamic of the Chinese AI video industry. It mirrors similar talent flows in Silicon Valley but is happening at a faster pace and with more immediate competitive consequences.

Three Chinese Models at the Top

A fact worth stating plainly: the top three models on the Artificial Analysis Video Arena leaderboard are all from Chinese companies. HappyHorse (Alibaba), Seedance 2.0 (ByteDance), and Kling 3.0 (Kuaishou) occupy the first, second, and third positions respectively. No Western model currently holds a top-three position in either Text-to-Video or Image-to-Video on this benchmark.

This is not to say Western labs are not producing capable video models — Google's Veo 2, OpenAI's Sora, and Runway Gen-4 all have notable capabilities. But in terms of blind human preference rankings, the current leaderboard belongs to Chinese AI labs.

Open Source and Availability: The Gap Between Claims and Reality

HappyHorse 1.0 has been described as an open-source model. However, as of April 20, 2026, the reality does not match the claim.

What Has Been Released

  • Public weights: Not available. No downloadable model checkpoint has been published on any platform (HuggingFace, ModelScope, or direct download).
  • GitHub repository: A repository exists but shows "coming soon" status with no source code or model files.
  • Technical paper: No peer-reviewed paper or detailed technical report has been published. Available technical details come from social media posts and limited disclosures by the ATH AI team.
  • API access: Not yet available for public use.

What Is Coming

  • fal.ai has a dedicated HappyHorse page confirming the model is "coming soon" in late April 2026. fal.ai is a well-known inference platform that provides API access to various AI models, so this is a credible indicator of near-term availability.
  • Atlas Cloud is also reportedly preparing API access for HappyHorse, though no specific launch date has been confirmed.
  • The ATH AI team has indicated that open-source weights will be released, but no timeline has been committed.

The "Open Source" Question

The term "open source" in the AI industry has become increasingly ambiguous. Some models release full weights under permissive licenses (truly open). Others release weights under restrictive commercial licenses (open-weight but not open-source by traditional definitions). Others announce open-source intentions but delay or never follow through.

HappyHorse currently falls into the last category: the intention has been stated, but no weights or code have been released. This is worth monitoring rather than celebrating. If and when the weights are published, the license terms will determine whether HappyHorse is genuinely open-source or merely open-weight with commercial restrictions.

For practical purposes, the most likely near-term path to using HappyHorse will be through hosted API providers like fal.ai and Atlas Cloud. Pricing has not been announced, but given the competitive dynamics in the AI video API market, it is likely to be priced comparably to Seedance 2.0 and Kling 3.0 endpoints.

What This Means for the AI Video Landscape

HappyHorse 1.0's emergence carries implications that extend beyond a single model topping a single leaderboard.

The Acceleration of Chinese AI Video

Two years ago, the AI video conversation was centered on Sora's announcement, Runway's Gen-3, and Pika's rapid iteration. Chinese models existed but were generally seen as competitive rather than dominant. That dynamic has inverted. In April 2026, Chinese models hold the top three positions on the Artificial Analysis Video Arena, and HappyHorse has opened a gap of more than 100 Elo points over the previous leader.

The pace is particularly notable. HappyHorse went from team formation (late 2025) to #1 on the leaderboard (April 2026) in roughly four months. That timeline suggests either extraordinary engineering velocity, significant pre-existing research carried over from Zhang Di's prior work, or both.

Talent as the Critical Variable

The HappyHorse story underscores a reality that the AI industry sometimes underweights: models are built by people, and the movement of key technical leaders can reshape competitive dynamics faster than any amount of compute scaling.

Zhang Di's move from Kuaishou to Alibaba is not an isolated incident. The Chinese AI video space has seen an accelerating flow of talent between major tech companies, startups, and academic labs. Each move carries institutional knowledge, architectural intuitions, and lessons learned from previous failures. The result is a competitive ecosystem where no single company can maintain a durable lead because the people who created that lead might leave and build something better.

For Western AI labs, this dynamic presents a strategic challenge. The Chinese AI video ecosystem is not a single competitor to track — it is a talent market where breakthrough capabilities can emerge from unexpected directions at any time.

Unified Architecture as the New Standard

HappyHorse's single-stream unified architecture for joint audio-video generation may represent the beginning of a broader architectural shift. If the approach proves robust as more users test the model, it could establish a new standard that other labs will need to match. Multi-module pipelines with separate audio and video stages may increasingly look like legacy architectures.

This has practical implications for model efficiency. A single unified model is simpler to deploy, requires less infrastructure overhead, and avoids the synchronization challenges that plague multi-stage systems. For API providers and cloud platforms, a unified model is more cost-effective to serve.

The Speed Factor

HappyHorse's approximately 10-second average generation time is worth emphasizing. Fast generation is not just a convenience — it fundamentally changes how people interact with AI video tools. At 10 seconds per generation, users can iterate rapidly: generate a clip, evaluate it, adjust the prompt, and generate again. At 60-90 seconds per generation, each iteration feels like a commitment, and users are less likely to explore creative variations.
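The effect on iteration is easy to quantify. The sketch below counts how many generate-and-review cycles fit in an hour at the generation times mentioned above. The `review_seconds` parameter is an assumption added for illustration, standing in for the time a user spends evaluating each clip.

```python
def iterations_per_hour(generation_seconds, review_seconds=0):
    """Whole generate-and-review cycles that fit into one hour."""
    cycle = generation_seconds + review_seconds
    if cycle <= 0:
        raise ValueError("cycle time must be positive")
    return 3600 // cycle


for seconds in (10, 60, 90):
    raw = iterations_per_hour(seconds)
    with_review = iterations_per_hour(seconds, review_seconds=20)
    print(f"{seconds}s per generation: {raw} cycles/hour, {with_review} with 20s review")
```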

Speed also matters for commercial applications. Real-time or near-real-time video generation opens use cases in live content production, interactive experiences, and personalized video at scale that are impractical at slower generation speeds.

What We Are Watching at Genra

At Genra, we monitor every major AI video model release because our multi-model pipeline is designed to route each generation request to the best-available model for that specific task. HappyHorse 1.0's performance on visual quality benchmarks is impressive, and we plan to integrate it into our pipeline once API access becomes available through fal.ai or other providers.
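The routing idea can be sketched in a few lines. This is a hypothetical illustration, not Genra's production logic: the `ModelEntry` class, the task names, and the `route` function are invented for the example, and the Elo values are the leaderboard figures quoted in this article (the approximate ones rounded as shown). It picks the highest-rated model for a task and, by default, skips models without API access.

```python
from dataclasses import dataclass, field


@dataclass
class ModelEntry:
    name: str
    api_available: bool
    elo: dict = field(default_factory=dict)  # task name -> Elo rating


# Figures from the article; ~1300, ~1280, and ~1260 are approximate.
MODELS = [
    ModelEntry("HappyHorse 1.0", False, {"t2v": 1389, "i2v": 1416, "t2v_audio": 1215}),
    ModelEntry("Seedance 2.0", True, {"t2v": 1274, "i2v": 1300, "t2v_audio": 1220}),
    ModelEntry("Kling 3.0", True, {"t2v": 1260, "i2v": 1280}),
]


def route(task, models=MODELS, require_api=True):
    """Return the highest-Elo model that supports the task and is reachable."""
    candidates = [
        m for m in models
        if task in m.elo and (m.api_available or not require_api)
    ]
    if not candidates:
        raise LookupError(f"no model available for task {task!r}")
    return max(candidates, key=lambda m: m.elo[task])


print(route("t2v").name)                     # best model with API access today
print(route("t2v", require_api=False).name)  # best model once access is ignored
```

With API access required, the router falls back to the best available alternative. Once HappyHorse has an endpoint, flipping its `api_available` flag would change the result for visual-only tasks.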

The unified audio-video generation capability is particularly interesting for our users who need complete video-with-sound output in a single workflow step. If HappyHorse's audio quality holds up in production usage as well as it does in benchmarks, it could reduce the number of pipeline stages needed for many common video generation tasks.

Key Takeaways

  • HappyHorse 1.0 is the top-ranked AI video model on the Artificial Analysis Video Arena, holding #1 in both Text-to-Video (Elo 1389) and Image-to-Video (Elo 1416) without audio. With audio, it ties Seedance 2.0 in both categories.
  • Built by Alibaba's ATH AI Innovation Unit, led by Zhang Di — the former Kuaishou VP who built Kling AI. The model went from team formation to #1 ranking in roughly four months.
  • 15 billion parameters with a unified single-stream architecture that generates video and audio jointly in one forward pass. No cross-attention modules between separate audio and video sub-networks.
  • Native 1080p output with a reported average generation time of ~10 seconds (about 38 seconds for a 1080p clip on a single H100), making it one of the fastest models at this quality tier. Supports 7-language lip-sync including English, Mandarin, Cantonese, Japanese, Korean, German, and French.
  • Open-source claims remain unverified — no public weights, no downloadable model, no published code. API access expected via fal.ai and Atlas Cloud in late April 2026.
  • Three Chinese models now hold the top three positions on the Artificial Analysis Video Arena: HappyHorse (Alibaba), Seedance 2.0 (ByteDance), and Kling 3.0 (Kuaishou). The talent flow between these companies is accelerating competitive development.
  • The unified audio-video architecture may set a new standard that pushes competitors to move away from multi-stage pipelines toward single-model joint generation.

Frequently Asked Questions

What is HappyHorse 1.0?

HappyHorse 1.0 is an AI video generation model built by Alibaba's ATH AI Innovation Unit. It is a 15-billion-parameter unified Transformer that generates video and audio jointly in a single forward pass. It currently ranks #1 on the Artificial Analysis Video Arena in both Text-to-Video (Elo 1389) and Image-to-Video (Elo 1416) categories.

Who built HappyHorse 1.0?

HappyHorse was developed by the ATH AI Innovation Unit within Alibaba Group. The team is led by Zhang Di, who previously served as Vice President of Kuaishou and was the technical leader behind Kling AI. He joined Alibaba at the end of 2025 to lead the Taotian Future Life Lab.

Is HappyHorse 1.0 open source?

The team has stated an intention to open-source the model, but as of April 20, 2026, no public weights, source code, or downloadable model files have been released. The GitHub repository shows "coming soon" status. The first available access is expected via API providers like fal.ai in late April 2026.

How does HappyHorse compare to Seedance 2.0?

HappyHorse leads Seedance 2.0 by a significant margin in visual-only benchmarks: 115 Elo points ahead in Text-to-Video and approximately 116 points ahead in Image-to-Video. When audio is included in the evaluation, the two models are in a statistical tie (within 2-5 Elo points), suggesting Seedance has competitive or slightly better audio generation.

How fast is HappyHorse 1.0 at generating video?

HappyHorse is reported to generate output in approximately 10 seconds on average, making it one of the fastest models at its quality tier. The quoted figure for a 1080p clip on a single H100 GPU is about 38 seconds, so the 10-second average presumably covers shorter or lower-resolution generations. This speed enables rapid iteration during creative workflows.

What languages does HappyHorse support for lip-sync?

HappyHorse supports lip-synchronized dialogue in seven languages: English, Mandarin Chinese, Cantonese, Japanese, Korean, German, and French. The model generates phoneme-accurate mouth movements for each language rather than generic lip movement approximations.

When will HappyHorse 1.0 be available to use?

API access is expected in late April 2026 through inference platforms like fal.ai and Atlas Cloud. No confirmed pricing has been announced. Open-source weight release has been indicated but has no confirmed timeline.

Why did HappyHorse launch anonymously?

The ATH AI team submitted HappyHorse to the Artificial Analysis Video Arena without identifying Alibaba as the developer. This ensured the model was evaluated purely on output quality in blind human comparisons, without brand bias influencing voter preferences. Alibaba revealed the connection approximately 2-3 days after the initial submission, after the model had already achieved #1 rankings.


About the Author
The Genra AI team builds tools that help creators produce professional video content using AI. Follow @GenraAI for updates, tutorials, and honest takes on the AI video space.