Robin Li Just Said the Model Era Is Over. AI Video Has Been Quietly Proving It.

· Chris Sherman

Baidu's CEO opened Create 2026 in Beijing today by retiring "which model is best?" as the question that matters. For AI video, this just made a four-month-old consensus official.

The Sentence That Reframed the Industry

Robin Li, Baidu's co-founder and CEO, stood on stage at Baidu Create 2026 in Beijing on May 14, 2026, and delivered a line that's going to be quoted for the rest of the year: the AI industry, he said, has moved past "model competition" and entered "the agent era." He paired it with a concrete proposal — that the new industry metric should be Daily Active Agents (DAA), the agent-era equivalent of the mobile internet's DAU, with global DAA projected to eventually exceed 10 billion.

If you've been watching the AI video market for the last four months, none of this is a prediction. It's a description.

Sora 2 collapsed in 84 days under the weight of a model-only strategy. HappyHorse 1.0 took the #1 spot on the Arena leaderboard in 48 hours and instantly compressed the meaningful technical gap between frontier video models to roughly zero. Seedance 2.0, Veo 3.1, and now the leaked Gemini Omni are all converging on the same architectural endpoint. The question of "which model is best?" stopped being interesting somewhere between February and April. Today, Robin Li became the first major-platform CEO to say it out loud.

This piece is about what that means specifically for AI video — what Li said, what Baidu actually shipped today, and why a Beijing keynote on the application layer turns out to be the most accurate description we have of the competitive landscape going into the second half of 2026.

What Li Actually Said

Three things to extract from the keynote, all in his own framing.

1. The "AI Evolution Theory" — a three-layer shift

Li laid out what he called an "AI evolution theory": a simultaneous transformation at three layers. Agents evolve from passive responders into autonomous executors that continuously learn from their environment. Individuals evolve from ordinary users into "super individuals" who coexist with AI to multiply their own output. Enterprises evolve from human-to-human collaboration into mixed human-agent formations that operate as unified super-organizations.

Strip the rhetorical packaging and the core claim is clear: the value migration is moving away from raw model capability and toward the layer that orchestrates capability into outcomes. That's the agent layer. Everything above the model — what gets generated, when, by which agent, for which user, in service of which goal — is where the next decade of value lives.

2. Daily Active Agents (DAA) — a new metric

Li proposed DAA as the agent-era successor to DAU. The argument: tokens measure cost, not value — they're an input metric, not an output metric. Active agents, by contrast, measure how often autonomous software is actually doing useful work on someone's behalf. He projected global DAA could eventually surpass 10 billion.

Whether or not the number is the right one, the framing matters. DAU rewarded engagement (time spent in app). DAA rewards productive autonomy (work completed without user intervention). The two have very different design implications for video creation tools.
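To make the distinction concrete, here is a toy tally over a hypothetical event log. The event shape, actor IDs, and event kinds are all invented for illustration; the only point is that DAU counts who showed up, while DAA counts which agents completed work.

```python
from dataclasses import dataclass

@dataclass
class Event:
    day: str
    actor: str   # a user id or an agent id (hypothetical)
    kind: str    # "session" = user opened the app; "task_done" = agent finished work autonomously

def dau(events, day):
    """DAU: distinct users who opened the app that day (engagement)."""
    return len({e.actor for e in events if e.day == day and e.kind == "session"})

def daa(events, day):
    """DAA: distinct agents that completed work that day (productive autonomy)."""
    return len({e.actor for e in events if e.day == day and e.kind == "task_done"})

log = [
    Event("2026-05-14", "user-1", "session"),
    Event("2026-05-14", "user-1", "session"),    # ten re-renders still count as one DAU
    Event("2026-05-14", "agent-a", "task_done"),
    Event("2026-05-14", "agent-b", "task_done"),
]
print(dau(log, "2026-05-14"), daa(log, "2026-05-14"))  # 1 2
```

Note what the second metric ignores: how long the user lingered. A tool that finishes the job in one interaction scores better on DAA and worse on DAU, which is exactly the design tension the keynote pointed at.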

3. "Disposable software" — applications as throwaway artifacts

The third thread: as code generation costs collapse, software development barriers fall, and one-time or "disposable" applications become viable. Users generate a custom piece of software for a single task and discard it. Li cited Baidu's coding agent Miaoda — which reportedly generates around 90% of its own code — as a working example.

For video, the analog is obvious. The agent that generates a 60-second commercial isn't a feature inside a tool; it's a temporary, task-specific construct that exists for as long as the project does. Pipeline assembled, models routed, output rendered, agent dissolved.
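That lifecycle maps neatly onto a scoped resource in code. The sketch below is purely illustrative, assuming no real agent framework: the agent exists only inside the `with` block, the deliverable outlives it, and the agent's own state does not.

```python
from contextlib import contextmanager

@contextmanager
def disposable_agent(brief, models):
    """Assemble a task-specific agent for one brief, then dissolve it.
    All names here are hypothetical; no real framework is implied."""
    agent = {"brief": brief, "models": list(models), "artifacts": []}
    try:
        yield agent
    finally:
        agent["artifacts"].clear()  # the agent dissolves with the project
        agent["models"].clear()

def render(agent, shot):
    # Stand-in for real generation: route the shot to the first model, record the clip.
    clip = f"{agent['models'][0]}:{shot}"
    agent["artifacts"].append(clip)
    return clip

with disposable_agent("60-second commercial", ["veo-3.1", "seedance-2.0"]) as a:
    final_cut = [render(a, s) for s in ("open", "product", "cta")]

print(final_cut)  # the deliverable survives; the agent does not
```

The design choice worth noticing: nothing about the agent persists between projects, so there is no settings drift and nothing to maintain. The next brief gets a fresh assembly.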

What Baidu Actually Shipped Today

Four product announcements, all positioned as proofs of the thesis rather than standalone launches.

DuMate: general-purpose agent, Baidu's flagship horizontal agent product. Why it matters: a direct shot at OpenAI's Operator/ChatGPT-as-agent positioning.

Miaoda (app + enterprise edition): coding agent generating roughly 90% of its own code. Why it matters: the "disposable software" thesis made concrete.

Baidu YiJing (upgraded): multi-agent digital human platform for livestreaming and real-time video generation. Why it matters: the most directly relevant launch for AI video creators.

Famou Agent 2.0: self-evolving agent platform. Why it matters: continuous-learning autonomy is the long-term DAA play.

The interesting one for our beat is YiJing. It's a multi-agent digital human platform — meaning the system isn't a single video model with a chat interface bolted on. It's an orchestration layer that coordinates multiple specialized agents for livestreaming and real-time video generation: one agent for script, one for delivery and lip-sync, one for camera and shot selection, one for audience response, one for product/promo logic. The video model itself is somewhere underneath, treated as a swappable component.

If you wanted a one-product demonstration of the agent-era thesis applied to video, YiJing is it. The pitch is no longer "we have the best video model." It's "we orchestrate the best agents on top of whatever video model is currently winning."
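A minimal sketch of that shape, with everything hypothetical (Baidu has not published YiJing internals): specialized agents sit in a dictionary of roles, and the video model is a plain field that can be swapped without touching the stack above it.

```python
# Hypothetical orchestrator in the spirit of a YiJing-style platform.
# Roles, model names, and outputs are all illustrative.
class Orchestrator:
    def __init__(self, video_model, agents):
        self.video_model = video_model  # swappable component, not the product
        self.agents = agents            # role -> callable

    def swap_model(self, new_model):
        self.video_model = new_model    # the agent stack above is unchanged

    def produce(self, brief):
        script = self.agents["script"](brief)
        shots = self.agents["camera"](script)
        return [f"{self.video_model}:{shot}" for shot in shots]

orc = Orchestrator(
    video_model="model-v1",
    agents={
        "script": lambda brief: f"script({brief})",
        "camera": lambda script: [f"shot{i}" for i in range(2)],
    },
)
before = orc.produce("spring sale")
orc.swap_model("model-v2")  # whichever model is currently winning
after = orc.produce("spring sale")
print(before, after)
```

The one-line `swap_model` is the whole argument: when the model is a field rather than a foundation, "best model" becomes a configuration detail.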

Why This Lands Now, Not Six Months Ago

This thesis has been circulating in technical circles for a year. What makes Li's May 14 keynote a real inflection point — rather than another conference talk — is the empirical evidence stack that arrived in the four months leading up to it.

  1. Sora 2's economic collapse. OpenAI's flagship consumer video model shut down in 84 days because a $15M/day inference burn against $2.1M revenue is what happens when you bet a model-only strategy at consumer scale. See our post-mortem.
  2. HappyHorse 1.0's instant ascent. Alibaba's unified audio-video model took Arena #1 in 48 hours with a 15B-parameter architecture, demonstrating that the model layer can be matched or surpassed in months by a focused team. See our review.
  3. Architectural convergence. Seedance 2.0, HappyHorse 1.0, and the leaked Gemini Omni all point at the same destination — unified audio-video models with multi-modal inputs. When the architecture commoditizes, differentiation has to live somewhere else.
  4. Pricing compression. Top-tier video API pricing has been collapsing from $0.50/sec (Veo 3.1) toward $0.05/sec (HappyHorse 1.0). Models that cost the same and look the same can't be the basis for a moat.

Li didn't predict the shift. He named it. There's a meaningful difference, and the difference is what makes this keynote quotable for the rest of 2026.

What the Agent Era Actually Means for AI Video

Five concrete reframings to internalize if you're producing video with AI as a serious part of your workflow.

1. The question "which model should I use?" is now obsolete

The correct question is "which agent stack routes my work to the best model for each shot?" Veo 3.1 may be best for high-physics motion. HappyHorse 1.0 may be best for synchronized speech. Seedance 2.0 may be best for multi-shot sequences. Kling 3.0 may be best for stylized aesthetics. The job of the agent is to know which is which, and to route automatically. If you're still picking one model and committing, you're playing a game from 2024.
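A routing table in the spirit of that paragraph might look like the sketch below. The strength-to-model pairings mirror the claims above; the shot-tagging scheme and fallback behavior are hypothetical.

```python
# Toy per-shot model router. Mapping follows the article's claims;
# the tag vocabulary and default are invented for illustration.
ROUTES = {
    "high_physics": "veo-3.1",
    "synced_speech": "happyhorse-1.0",
    "multi_shot": "seedance-2.0",
    "stylized": "kling-3.0",
}

def route(shot_tags, default="veo-3.1"):
    """Return the first tag's specialist model, else the default."""
    for tag in shot_tags:
        if tag in ROUTES:
            return ROUTES[tag]
    return default

storyboard = [
    {"id": 1, "tags": ["high_physics"]},
    {"id": 2, "tags": ["dialogue", "synced_speech"]},
    {"id": 3, "tags": ["b_roll"]},  # no specialist tag: falls back to default
]
plan = {s["id"]: route(s["tags"]) for s in storyboard}
print(plan)  # {1: 'veo-3.1', 2: 'happyhorse-1.0', 3: 'veo-3.1'}
```

Manual model-picking is exactly this loop, run in your head, once per shot. The agent layer's job is to make it a lookup.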

2. Output quality stops being about model capability

It becomes about prompt-translation quality, shot decomposition quality, continuity management across shots, and audio-video sync verification — none of which the model itself does well. These are agent-layer problems. Two teams using the same underlying models will produce wildly different output because their agents are wildly different.
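Those agent-layer jobs can be sketched end to end: decompose a brief into shots, carry continuity state across them, and verify it held. Everything below is a stand-in; the decomposition rule and continuity fields are invented for illustration.

```python
# Sketch of the agent-layer jobs named above: decomposition,
# continuity management, and a verification pass. All hypothetical.
def decompose(brief):
    # Naive decomposition: one shot per sentence of the brief.
    return [s.strip() for s in brief.split(".") if s.strip()]

def generate_sequence(brief, palette="warm", seed=42):
    continuity = {"palette": palette, "seed": seed}  # carried shot to shot
    shots = []
    for i, beat in enumerate(decompose(brief)):
        shots.append({"index": i, "beat": beat, **continuity})
    return shots

def verify_continuity(shots):
    # Stand-in check: every shot kept the same continuity state.
    return len({(s["palette"], s["seed"]) for s in shots}) == 1

seq = generate_sequence("Open on the product. Cut to the tagline.")
print(len(seq), verify_continuity(seq))  # 2 True
```

None of these steps call a video model. That is the point of the paragraph above: two teams with identical models diverge because this layer differs.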

3. The unit of differentiation moves from "model" to "workflow"

If you're a tool, you don't compete on "we use Veo 3.1." You compete on what your agent does on top of Veo 3.1, Seedance 2.0, HappyHorse, Kling, Luma, and Runway combined. This is the central thesis of our mid-2026 recap, and Li's keynote is the public-validation moment for it.

4. DAA reframes the success metric for creator tools

Tools optimized for DAU (engagement) push users toward fiddling — more prompt iterations, more knobs, more re-renders. Tools optimized for DAA push users toward delegation — fewer interactions, higher autonomy, more work completed per session. The two design philosophies are incompatible, and the second is the one Li just blessed. AI video tools that still optimize for time-in-app are being told, on May 14, that they're tracking the wrong number.

5. "Disposable agents" become the unit of creative work

The most novel framing in the keynote. Instead of a permanent tool with persistent settings, each project gets its own custom agent — assembled for the brief, optimized for the constraint, dissolved when the deliverable ships. For commercial video work, this is how YiJing-style multi-agent platforms will scale: not as a single super-tool, but as an infrastructure for spinning up project-specific agent ensembles.

What This Means for You Specifically

Three concrete situations.

If you're an individual creator

Stop benchmarking models. Start benchmarking workflows. The most useful question you can ask in the next 60 days is not "is HappyHorse better than Veo for my work?" — it's "does my current tool route between models intelligently, or am I doing the routing manually?" If you're doing it manually, you're absorbing work that should be absorbed by the layer above the model.

If you're building a video product

Treat your model integrations as configuration, not code. The pace of model releases — Omni next week, whatever Anthropic ships next, whatever ByteDance ships in Q3 — guarantees that hardcoding to a specific model is a six-month time bomb. Build your differentiation in the agent layer, not the model layer. The market is rewarding orchestration depth, not model selection.
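"Configuration, not code" can be as literal as a registry loaded from data. In this sketch the endpoints, fields, and prices are hypothetical placeholders; the point is that onboarding next week's model is a data edit, not a code change.

```python
import json

# Hypothetical model registry loaded from configuration. Endpoint URLs
# and field names are invented; prices echo the figures cited earlier.
REGISTRY_JSON = """
{
  "veo-3.1":        {"endpoint": "https://example.com/veo", "price_per_sec": 0.50},
  "happyhorse-1.0": {"endpoint": "https://example.com/hh",  "price_per_sec": 0.05}
}
"""
registry = json.loads(REGISTRY_JSON)

def cheapest(reg):
    """Routing policy reads the registry; it never names a model."""
    return min(reg, key=lambda name: reg[name]["price_per_sec"])

print(cheapest(registry))  # happyhorse-1.0

# Next week's model is one new entry; cheapest() never changes.
registry["gemini-omni"] = {"endpoint": "https://example.com/omni", "price_per_sec": 0.04}
print(cheapest(registry))  # gemini-omni
```

Hardcoding the model name inverts this: the policy changes every time the market does, which is the six-month time bomb described above.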

If you're running an enterprise creative team

Li's "mixed human-agent formations" is not a slogan. It's a concrete operational target — small human teams supervising large agent fleets, with the human role being judgment, brief-writing, and quality gating. The competitive question for the next 18 months is whether your team can produce 10x output at the same headcount by delegating production execution to agents, while keeping creative direction in human hands. Teams that don't make this shift will be outproduced by ones that do.

Three Signals to Watch From Here

The agent-era thesis has now been said out loud. Three downstream events will determine whether it accelerates or stalls.

Signal 1: Google I/O 2026 (May 19–20)

If Gemini Omni ships as a unified omni-modality model with an agent-native interface (chat-driven editing, in-line remixing, workflow templates), Google is implicitly endorsing the same thesis Li laid out today. If Omni ships as a standalone video model with API access and nothing else, Google is still playing the model-competition game. The framing of Omni's launch will tell you which side of the line Google has chosen.

Signal 2: The Hailuo/MiniMax hearing (May 29)

Disney, Warner Bros., and NBCUniversal vs. MiniMax goes before Judge Blumenfeld on May 29. If the case advances on the merits, the legal infrastructure for "agents that route between video models" gets complicated — agents become liable for what their routed-to models produced. The agent layer's economics depend on which way this goes.

Signal 3: DAA adoption by major platforms

Watch whether OpenAI, Anthropic, Meta, or Google adopt DAA (or some equivalent autonomy-based metric) in their next quarterly disclosures. If they do, Li's framing wins by default. If they keep reporting tokens and DAU, the agent-era narrative is still contested. Q2 2026 earnings calls are the first test.

The Bottom Line

The most useful thing about Robin Li's May 14 keynote isn't that he announced new products — he did, but DuMate and Miaoda and YiJing are Baidu-shaped responses to a pattern that was already there. The most useful thing is that he gave a name and a metric to a shift that's been happening quietly across the AI video market for four months.

The model layer keeps moving. It will keep moving. Gemini Omni next week, Seedance 3 in Q3, whatever Anthropic and Meta ship between now and year-end. None of it is going to settle. That's exactly the point. When the model layer is in permanent motion, the only durable place to build is one level up — at the agent layer, where workflows compound and orchestration improves with use.

For AI video, this isn't speculation. We've been operating on this thesis since the start of 2026, which is why Genra is built as an end-to-end agent on top of Veo + Seedance rather than as a frontend to a single model. The job of the agent is to route to the right model, manage continuity across shots, sync audio and motion, and ship the final cut without making you the routing engine. Li's keynote is the most explicit public endorsement of that architecture choice we've gotten this year.

Five days until Google I/O. Fifteen days until the MiniMax hearing. The next two weeks will tell you how much of the industry agrees with what Li just said in Beijing.

FAQ

What is Baidu Create 2026?

Baidu Create 2026 is Baidu's annual AI developer conference, held in Beijing on May 13–14, 2026. CEO Robin Li used the keynote on May 14 to declare that the AI industry has moved from "model competition" into "the agent era," and to propose Daily Active Agents (DAA) as the new defining industry metric.

What did Robin Li actually announce?

Four products: DuMate (general-purpose agent), Miaoda app + enterprise edition (coding agent generating ~90% of its own code), the upgraded Baidu YiJing multi-agent digital human platform, and Famou Agent 2.0 (self-evolving agent platform). He also proposed the DAA metric and outlined an "AI evolution theory" describing simultaneous transformation at agent, individual, and enterprise layers.

What is Daily Active Agents (DAA)?

DAA is the metric Li proposed as the agent-era equivalent of DAU. It measures how many autonomous agents are actively performing useful work in a given day, on the argument that tokens are an input/cost metric rather than an output/value metric. Li projected global DAA could exceed 10 billion over time.

Why does this matter for AI video specifically?

The AI video market has spent the first four months of 2026 demonstrating the agent-era thesis empirically — Sora 2 collapsed running a model-only strategy, HappyHorse 1.0 took the Arena #1 in 48 hours showing the model layer can be matched quickly, and pricing across the top-tier video APIs has compressed by 10x. Li's keynote names what was already happening and gives it a public-validation moment.

What is "disposable software" and how does it apply to video?

Li's framing for a world where code generation is cheap enough that users assemble single-use software for specific tasks and discard it. Applied to video, the analog is project-specific agents — a custom agent ensemble assembled for one campaign, dissolved on delivery, rather than a permanent tool with persistent settings.

What should I do as a creator?

Stop benchmarking models in isolation. Start benchmarking workflows. The useful question is no longer "is HappyHorse better than Veo" — it's "does my tool route between models intelligently?" If you're picking models manually, you're absorbing work that belongs to the agent layer.


About the Author
Chris Sherman covers AI video technology and creative production workflows. Follow @GenraAI for live coverage of Google I/O 2026 (May 19–20) and the MiniMax hearing (May 29).