SAIL Media

Why Video Agent models are next — Ethan He, xAI Grok Imagine

Inside xAI: Building Grok Imagine in 3 Months, Videogen vs World Models, and why Grok Imagine is so underrated. For the first time, we do a deep dive with the guy who led it!

Latent.Space
Jun 04, 2026
This post originally appeared in Latent Space.

“At a certain point, coding models got so good that the only significant next step to improve performance was handling the orchestration of these models.”

We’re announcing AIEWF speakers this week! Take the AI Engineering Survey!


Today’s guest Ethan first joined us for the LS Paper Club as the lead on NVIDIA Cosmos World Model, but then joined xAI and built Grok Imagine in 3 months:

Ethan He @EthanHe_42
Thrilled to share our new Grok Imagine release 🚀 It is the highest quality, fastest, and most cost-effective video generation model yet. Comes with 720P, video editing and better audio! We listened closely to your feedback and moved fast. Just six months ago, we had almost
xAI @xai
Understanding requires imagining. Grok Imagine lets you bring what’s in your brain to life, and now it’s available via the world’s fastest, and most powerful video API: https://t.co/tqQwQVgCEI Try it out and let your Imagination run wild.
5:43 AM · Jan 29, 2026 · 116K Views


He comes back on Latent Space with some nuclear hot takes: that Video Models primarily get their intelligence from LLMs, not from training on video data, and that the next frontier for truly interactive, realtime, long-horizon world models is to work on LLMs (perhaps Interaction Models as well…)

Put it this way: In the near term, the next Sora won’t be a better video model, but a video agent.

Generative media may more closely follow the evolution of AI coding, which went from focusing on one-shot output performance and cost to multi-turn reasoning and planning models for agents and systems that can plan, edit, test, debug, and submit PRs.

At a certain point, coding models got so good that the only significant next step to improve performance was handling the orchestration of these models.

Now, as video models improve significantly in realism, consistency, and prompt adherence while becoming more cost-efficient, the next evolution of video generation may also be systems that can plan, generate, edit, critique, and iterate across an entire creative task.
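To make that loop concrete, here is a minimal Python sketch of a plan, generate, critique, revise cycle. Everything in it is hypothetical and for illustration only: the `Shot` type, the `plan` and `run_agent` functions, and the stub generator, critic, and reviser are assumptions, not xAI's implementation or the Grok Imagine API. A real system would swap the stubs for calls to a video model and an LLM-based critic.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Shot:
    prompt: str
    clip: str = ""      # stand-in for generated video (just a string here)
    score: float = 0.0  # critic score in [0, 1]
    attempts: int = 0


def plan(task: str) -> List[Shot]:
    """Hypothetical planner: one shot per sentence of the task."""
    return [Shot(prompt=s.strip()) for s in task.split(".") if s.strip()]


def run_agent(
    task: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], float],
    revise: Callable[[str, float], str],
    threshold: float = 0.8,
    max_attempts: int = 3,
) -> List[Shot]:
    """Plan the task, then generate, critique, and revise each shot."""
    shots = plan(task)
    for shot in shots:
        while shot.attempts < max_attempts:
            shot.clip = generate(shot.prompt)
            shot.score = critique(shot.prompt, shot.clip)
            shot.attempts += 1
            if shot.score >= threshold:
                break  # good enough, move on to the next shot
            shot.prompt = revise(shot.prompt, shot.score)
    return shots


# Stubs standing in for a video model, a critic, and a prompt rewriter.
def fake_generate(prompt: str) -> str:
    return f"<clip:{prompt}>"


def fake_critique(prompt: str, clip: str) -> float:
    # Pretend that any refined prompt yields an acceptable clip.
    return 0.9 if "[refined]" in prompt else 0.5


def fake_revise(prompt: str, score: float) -> str:
    return f"{prompt} [refined]"


if __name__ == "__main__":
    task = "A sailboat at dawn. A storm rolls in."
    for shot in run_agent(task, fake_generate, fake_critique, fake_revise):
        print(shot.attempts, shot.score, shot.clip)
```

The design point is that the outer loop, not the generator, carries the quality improvement: the same model is called repeatedly while the planner and critic decide what to ask for next, which mirrors the orchestration shift described for coding agents.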

X Freeze @XFreeze
Grok Imagine Agent Mode (Beta) just went live on Grok web. It’s a full creative agent working on one infinite open canvas. Grok Agent plans → generates → edits → iterates everything automatically in the same workspace. Tell it what you want and watch it plan, generate, edit,
5:42 AM · Apr 30, 2026 · 920K Views


In this episode, Ethan joins swyx and Vibhu to unpack what it actually takes to build frontier image and video systems: data, VAEs, diffusion transformers, audio-video alignment, inference speedups, and the hidden cost of storing and moving massive video datasets. From building NVIDIA’s Cosmos world model to joining xAI as Grok Imagine was being built from zero to one, Ethan He has been at the center of some of the most important work in video generation, multimodal models, and real-time world models.
