I/O 2026 Eve: 5 Real Questions in AI Video (Not 5 New Models)
Chris Sherman
Google I/O 2026 opens in less than 24 hours. The internet is wall-to-wall Veo 4 prediction posts. They're all asking the same question: what specs will the new model have? That's the wrong question. The five questions actually shaping AI video right now have very little to do with which model wins tomorrow.
It's the evening of May 18, 2026. Tomorrow morning, Sundar Pichai walks on stage and announces the next generation of Veo. Every AI video creator, marketer, and analyst is refreshing the same Twitter timelines, waiting for leaked specs.
Here's a counterintuitive take: the announcement tomorrow probably won't change much. Not because it won't be impressive — it likely will. But because the actual unresolved problems in AI video have moved past "which model has the best output." Those problems sit one layer up, in the gap between a clip and a finished video. A better Veo doesn't close that gap. A better agent does.
Below are five questions that matter more than tomorrow's keynote. Read them, then go enjoy the show.
Question 1: Why does cross-clip consistency still break?
Every AI video model in 2026 can produce a beautiful eight-second clip. Run it again with the same prompt and you get a different person, a different product, a different brand color, a different background. The model has no memory between generations.
For a one-off cinematic shot, that's fine. For anything resembling a real video — a product demo with three angles, an ad with a narrator who appears in shots one and four, a course module with a consistent presenter — it's the entire problem.
The model layer's answer is reference-image conditioning: upload three pictures of a character, and the model tries to match them. It works maybe 70% of the time. The remaining 30% is where most production hours actually go.
The agent layer's answer is different: maintain a reference set per entity (character, product, environment) across the full sequence, regenerate failing shots automatically, lock seeds where consistency matters, and version the references so brand assets stay stable across months of content. The model improvement helps. The orchestration is what makes it shippable.
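To make that loop concrete, here is a minimal sketch of the regenerate-until-consistent step. Everything named here is hypothetical: `generate` stands in for whatever model call you use, `score` for whatever consistency check you trust, and the 0.8 threshold is an arbitrary placeholder. It is not Genra's implementation. One design assumption worth flagging: "locking" a seed is treated as fixing a base seed per shot and stepping it on each retry, so reruns are reproducible while retries still produce different output.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass(frozen=True)
class ReferenceSet:
    """Versioned reference images for one entity (character, product, environment)."""
    entity: str
    version: int
    image_paths: tuple[str, ...]


@dataclass
class ShotResult:
    clip: Any
    seed: int
    attempts: int
    accepted: bool


def render_shot(
    prompt: str,
    references: Sequence[ReferenceSet],
    base_seed: int,
    generate: Callable[[str, Sequence[ReferenceSet], int], Any],  # hypothetical model call
    score: Callable[[Any, Sequence[ReferenceSet]], float],        # hypothetical consistency check
    threshold: float = 0.8,   # placeholder value, tune for your own checker
    max_attempts: int = 3,
) -> ShotResult:
    """Regenerate a shot until it matches its references or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    best_clip, best_seed, best_score = None, base_seed, float("-inf")
    for attempt in range(max_attempts):
        # Derive each retry's seed from the locked base seed so runs are reproducible.
        seed = base_seed + attempt
        clip = generate(prompt, references, seed)
        s = score(clip, references)
        if s > best_score:
            best_clip, best_seed, best_score = clip, seed, s
        if s >= threshold:
            return ShotResult(clip, seed, attempt + 1, True)
    # Nothing passed: return the closest attempt, flagged for human review.
    return ShotResult(best_clip, best_seed, max_attempts, False)
```

Because `ReferenceSet` carries a version number, a brand refresh becomes a new version rather than an overwrite, and older videos can still be traced back to the references they were rendered against.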
What tomorrow won't fix: Veo 4 may ship native ID-embedding. It will be better than today. It will not solve consistency for a marketer producing 40 clips a month across 8 product SKUs without thinking about it.
Question 2: Why is "clip" still mistaken for "finished video"?
Watch any model demo and you see the same thing: a single shot, perfectly lit, no cuts, no captions, no music, no platform-specific framing, no call to action. It's a clip. It's not a video anyone would actually publish.
A real video (the kind that goes on a YouTube channel, a TikTok feed, an ad account, a product page) has scripting, scene planning, voiceover, B-roll, captions in the target language, cuts on the beat, a hook in the first three seconds, and an output format matched to its destination platform. The model handles one of those things. The rest are someone's manual problem.
The current default solution is to stitch together five tools: a script writer, a video model, a voice generator, an editor, a captioning tool. Each tool has its own UI, its own pricing, its own failure modes. The result is that "AI video" still takes hours per finished asset for anyone serious about quality.
The agent layer's answer is to own the full pipeline as one system. Brief in plain language, finished video out. Genra runs on Veo and Seedance and handles every step in between. That's not a workflow improvement. It's a different category of product.
What tomorrow won't fix: Veo 4 will produce better clips. The clip-to-finished gap stays exactly where it is.
Question 3: What happens to AI video copyright in 11 days?
On May 29, 2026, the MiniMax copyright case enters its hearing phase. It's the first major AI video copyright case to reach a substantive ruling stage, and the outcome will set precedent everyone in the industry will live with for years.
The questions the court is being asked include: can a model be trained on copyrighted footage without a license? Who is liable when an AI-generated clip looks substantially similar to a copyrighted scene — the model provider, the platform, or the end user? What does "substantially similar" even mean when the model has seen millions of training videos?
This matters more than tomorrow's keynote for one reason: a Veo 4 announcement is a product. A copyright ruling is a constraint that shapes every product. If the ruling lands one way, the safe-harbor assumptions every Western AI video provider currently operates under get reshuffled. If it lands the other way, the moat around training data becomes a real defensible asset.
Smart creators and brand teams aren't waiting for the ruling. They're treating commercial AI video as something that needs a defensible content trail — what models were used, what references were uploaded, what consents were obtained. Genra's pipeline logs this by default, because we expect the regulatory floor to keep moving.
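A content trail like that can start very small. The sketch below records the three things named above (models, references, consents) for each clip and writes them as JSON lines. The field names and the `missing_consent` check are illustrative assumptions, not Genra's actual log format and not a legal standard.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ClipProvenance:
    """One audit-trail entry per generated clip (field names are illustrative)."""
    clip_id: str
    model: str                  # which model generated the clip
    reference_files: list[str]  # what reference material was uploaded
    consents: list[str]         # consent or licence identifiers covering those references
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record: ClipProvenance) -> str:
    """Serialize one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


def missing_consent(records: list[ClipProvenance]) -> list[str]:
    """Return the ids of clips that used references but have no consent on file."""
    return [r.clip_id for r in records if r.reference_files and not r.consents]
```

Running `missing_consent` before publishing turns the audit trail into a gate rather than an archive: a clip with uploaded references and no recorded consent gets flagged before it ships.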
What tomorrow won't fix: Google will not address the MiniMax case at I/O. The legal landscape under everyone's feet keeps shifting regardless of what specs Veo 4 ships with.
Question 4: Where does a finished AI video actually go?
You generated a video. Now what? It needs to land on YouTube as a 16:9, on TikTok as a 9:16, on Instagram Reels with captions burned in for autoplay, on your landing page as an embedded MP4, on a paid ad platform with the first three seconds re-cut as a hook variant, and on your email list as a thumbnail preview linking to a hosted player.
Each destination has its own aspect ratio, duration cap, file size limit, caption format, accessibility requirement, and analytics integration. The model produces one rendered output. The distribution work is a separate, larger, mostly manual project.
This is the part of AI video nobody demos at I/O. It's also the part that determines whether the video makes money or sits in a folder.
The agent layer's answer is to make distribution a first-class output. Same brief, multiple platform-native cuts, generated in parallel, optimized for each surface's actual behavior — TikTok's algorithm doesn't reward the same hook structure as YouTube Shorts, and Instagram Reels favors a different first frame entirely.
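As a rough illustration of the mechanical part of that work, the sketch below plans one center crop per platform from a single master frame. The YouTube and TikTok aspect ratios come from the list above; the 9:16 ratio for Reels, the `PlatformSpec` structure, and the rounding to even pixel dimensions (which many video encoders expect) are assumptions. Duration caps, file size limits, and hook re-cuts are deliberately left out.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class PlatformSpec:
    name: str
    aspect: Fraction        # width / height
    burn_in_captions: bool


# Illustrative specs only; check each platform's current requirements.
SPECS = [
    PlatformSpec("youtube", Fraction(16, 9), False),
    PlatformSpec("tiktok", Fraction(9, 16), False),
    PlatformSpec("instagram_reels", Fraction(9, 16), True),  # ratio is an assumption
]


def center_crop(width: int, height: int, aspect: Fraction) -> tuple[int, int, int, int]:
    """Return (x, y, w, h) of the largest centered crop with the target aspect."""
    if Fraction(width, height) > aspect:
        # Source is wider than the target: keep full height, trim the sides.
        w, h = int(height * aspect), height
    else:
        # Source is as narrow or narrower: keep full width, trim top and bottom.
        w, h = width, int(width / aspect)
    # Round down to even dimensions.
    w -= w % 2
    h -= h % 2
    return ((width - w) // 2, (height - h) // 2, w, h)


def plan_cuts(width: int, height: int) -> dict[str, tuple[int, int, int, int]]:
    """Map each platform name to its crop box for a master frame of the given size."""
    return {spec.name: center_crop(width, height, spec.aspect) for spec in SPECS}
```

For a 3840x2160 master, the YouTube cut keeps the full frame while the vertical cuts keep a 1214x2160 strip from the middle. That is exactly why a naive crop is rarely enough: whatever matters in the shot has to sit in that strip, which is a planning decision made at the brief, not at export.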
What tomorrow won't fix: Better generation doesn't solve distribution. The platforms remain fragmented. The work to fit each one stays the same. The agent layer either owns it or the user does.
Question 5: When does AI video stop being a cost center?
Google made Veo 3.1 free in April. The cost of generating individual clips collapsed for anyone willing to accept a watermark and an 8-second cap. Free models are everywhere. So why are AI video budgets at most companies still growing?
Because the model cost was never the bottleneck. The bottleneck was the labor surrounding it: the prompt engineering, the manual stitching, the consistency babysitting, the platform cutting, the iteration loops with stakeholders, the brand QA. A free model collapses the line item that was already a rounding error and leaves the actual cost structure untouched.
The companies that have moved AI video from "experiment" to "infrastructure" did it by treating the agent layer as the unit of cost, not the model. They measure cost per finished video shipped, not cost per generated clip. Those numbers point to a different conclusion than the free-model narrative suggests.
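The arithmetic behind that metric is simple enough to sketch. All figures in the usage example are made-up illustrations, not data from any real team; the function only shows how the metric is put together.

```python
def cost_per_finished_video(
    model_spend: float,
    labor_hours: float,
    hourly_rate: float,
    videos_shipped: int,
) -> float:
    """Total cost (model spend plus surrounding labor) divided by videos shipped."""
    if videos_shipped <= 0:
        raise ValueError("videos_shipped must be positive")
    return (model_spend + labor_hours * hourly_rate) / videos_shipped


# Hypothetical month: 10 finished videos, 60 hours of labor at 50 per hour.
paid_model = cost_per_finished_video(200.0, 60, 50.0, 10)  # 320.0
free_model = cost_per_finished_video(0.0, 60, 50.0, 10)    # 300.0
savings_share = (paid_model - free_model) / paid_model     # 0.0625
```

With these invented numbers, making the model free trims about 6% off the cost of each finished video. Cutting the labor hours in half would trim far more, which is the point of measuring at the finished-video level.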
For most teams, the path to AI video being a profit center looks like this: own the brief-to-finished pipeline in one tool, eliminate the five-tool stitching tax, measure output per week per operator, and let the model layer commoditize underneath. The cost of the model is going to zero. The cost of the agent layer is what determines unit economics.
What tomorrow won't fix: Even if Veo 4 is free at launch, your AI video budget probably grows next quarter. The line item that's expanding isn't model usage. It's everything around it.
The Larger Point
Tomorrow's keynote will be a great show. Native 4K is coming. Multi-scene narratives are coming. Faster generation is coming. We'll integrate every meaningful improvement Google ships, because better models genuinely make every video on Genra a little better.
But the five questions above don't get answered by a better model. They get answered by a better agent, a maturing legal framework, and an industry that stops mistaking demos for production.
Watch the keynote tomorrow. Then come back and ask whether anything in it actually moved the needle on consistency, on clip-to-finished, on copyright, on distribution, or on real unit economics. Our prediction: a little on the first, almost nothing on the rest.
The model layer is the headline. The agent layer is the work.
Key Takeaways
- Google I/O 2026 will be dominated by Veo 4 predictions and announcements. The model is one layer in a much taller stack.
- Cross-clip consistency is mostly an orchestration problem, not a model problem. Native ID-embedding helps; it doesn't close the gap for someone shipping 40 clips a month.
- A clip is not a finished video. Scripting, voiceover, B-roll, captions, platform cuts, and distribution are all separate problems the model doesn't touch.
- The MiniMax copyright hearing on May 29 will shape AI video regulation more than any I/O announcement. Operators should be logging provenance now, not later.
- Distribution fragmentation across YouTube, TikTok, Instagram, ads, and email is its own production tax. The agent layer either owns it or the user does.
- Free models collapse the cheapest line item in AI video production. Real unit economics are determined by everything around the model — the agent layer.
- Genra runs on Veo and Seedance and handles the full pipeline as one agent. Tomorrow's model improvements will fold into the backend silently. The five real questions stay where they were.
Frequently Asked Questions
What is the agent layer in AI video?
The agent layer is the system that turns a brief into a finished, distributable video. It handles scripting, scene planning, model selection, generation, consistency, voiceover, editing, captioning, and platform-specific output. The model layer generates clips. The agent layer ships videos.
Will Veo 4 solve AI video consistency?
Partially. If Veo 4 ships native ID-embedding as expected, single-shot consistency improves. Multi-clip, multi-shoot, brand-stable consistency across an ongoing content pipeline still requires orchestration — reference management, regeneration logic, seed locking, version control. The model helps. The agent does the work.
What is the MiniMax copyright case and why does it matter?
The MiniMax case is the first major AI video copyright matter to reach a substantive hearing, scheduled for May 29, 2026. The ruling will influence how training data, model output liability, and substantial similarity are interpreted across the industry. The outcome will shape regulation for Western and Asian providers alike.
If Veo 3.1 is free, why isn't AI video free to produce?
Because the model was never the expensive part. The expensive part is the labor around the model — prompt iteration, manual stitching, consistency QA, platform cutting, stakeholder loops. Free models collapse the cheapest line item. Real production cost lives in the agent layer.
What models does Genra use?
Veo and Seedance. The agent chooses which model to use for each shot based on the requirements. Users describe what they want; the agent handles model selection and the rest of the pipeline.
When is Google I/O 2026?
May 19–20, 2026. The opening keynote starts at 1:00 PM ET / 10:00 AM PT on May 19, livestreamed free at io.google. Veo and Gemini announcements typically land in the first 90 minutes.
How should brands prepare for AI video copyright uncertainty?
Log provenance for every video: which models generated each clip, what reference materials were uploaded, what consent or licensing exists for those references. Treat the audit trail as a deliverable, not an afterthought. The legal floor will keep moving for the next two years.
Why does platform distribution still take so much manual work?
Because each platform has different aspect ratios, duration caps, caption formats, hook patterns, and algorithmic preferences. A single rendered output rarely performs well across surfaces. Either the agent generates platform-native variants from the same brief, or someone manually re-cuts.
About the Author
Chris Sherman covers AI video technology, agent architectures, and the business of creative production. Follow @GenraAI for live coverage of Google I/O 2026 (May 19–20) and the MiniMax hearing (May 29).