Mark your calendar. Today marks the birth of the AI Video Director, a Full Self-Driving (FSD) experience for video production.
It’s not a new button. It’s not a new template. It is not “yet another generator.”
It is an end-to-end system that takes a destination you describe, including purpose, audience, tone, and constraints, and handles the heavy lifting of production. It plans the visual flow, proposes shots, maintains continuity, and assembles a coherent first cut you can judge, override, and refine. Crucially, it takes your ingredients (images of characters, products, logos) and blends them seamlessly into the final video.
LLMs are brilliant… and still not a director
Large language models are shockingly intelligent. They can reason, summarize, draft, translate, and improvise. In other words: they are elite generalists.
But a generalist is not the same thing as a specialist.
Directing a video isn’t just “write some text.” It is a pipeline of interlocking decisions:
- Interpretation: What does the user actually mean, and what should the audience see?
- Planning: What scenes exist, in what order, with what pacing, and why?
- Continuity: How do identity, environment, props, and style remain consistent across shots?
- Execution: How do you translate intent into prompts for downstream image/video models?
- Assembly: How do you mix the result into something watchable—rhythm, emphasis, coherence?
A single prompt can’t reliably hold all of that complexity. If you want expert-level results in specialty domains like film planning, shot design, script-to-visual translation, and prompt engineering, you don’t just need a smarter model.
You need a director’s system.
The missing layer: An agentic production pipeline
That is what our AI Director is.
Think of foundation models (LLMs, vision models, image/video generators) as the engine. Powerful, but not enough on their own.
What we built is the driver: a comprehensive agent pipeline that:
- Reasons in stages.
- Produces explicit artifacts (plans, shot lists, prompts, edits).
- Verifies continuity and constraints.
- Keeps the user in control at every step.
This is the difference between “Generate me something” and “Get me to my destination.”
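To make “explicit artifacts” concrete, here is a minimal sketch of what a staged pipeline could look like in code. Every name below (Shot, ProductionPlan, plan, verify) is a hypothetical illustration, not Visla’s actual API; the point is simply that each stage produces something you can read, edit, and verify before the next stage runs.

```python
# A minimal sketch of the "driver" idea: a pipeline that reasons in stages and
# emits explicit, inspectable artifacts at each one. All names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Shot:
    scene_id: str
    description: str                       # what the audience should see
    prompt: str                            # prompt handed to downstream models
    anchor_refs: list[str] = field(default_factory=list)  # reference-image ids

@dataclass
class ProductionPlan:
    intent: str                            # purpose, audience, tone, constraints
    scenes: list[str] = field(default_factory=list)
    shot_list: list[Shot] = field(default_factory=list)

def plan(intent: str) -> ProductionPlan:
    """Stage 1: turn the destination into an explicit, editable plan."""
    scenes = ["hook", "problem", "solution", "call to action"]
    shots = [Shot(s, f"{s} scene", f"cinematic shot of the {s}") for s in scenes]
    return ProductionPlan(intent, scenes, shots)

def verify(draft: ProductionPlan) -> list[str]:
    """Stage 2: flag continuity or constraint problems before any rendering."""
    return [f"shot '{s.scene_id}' has no visual anchors"
            for s in draft.shot_list if not s.anchor_refs]

# Stage 3 (generation) only runs after the user has reviewed both artifacts.
draft = plan("60-second product explainer for developers, upbeat tone")
issues = verify(draft)
print(len(draft.shot_list), "shots planned;", len(issues), "continuity issues to review")
```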
Full Self-Driving for video: You provide the destination, the system drives
The analogy is simple because the user experience is simple. In a Full Self-Driving (FSD) model, the user provides the destination, and the system handles the driving. The user stays in the driver’s seat, supervising, correcting, and overriding.
That’s exactly how our AI Director works. You describe the video you want, and the system:
- Breaks down the intent into a production plan.
- Proposes “casting”: characters, products/logos, and environments.
- Lets you edit and verify these elements before any scene generation starts.
- Generates the shot list and initial imagery for the scenes.
- Animates all scenes once you’ve verified them.
- Mixes and assembles a first cut.
And crucially: the UI never asks you to “trust the magic.”
It gives you checkpoints where you can judge quality, replace assets, edit the structure, correct visual intent, or re-generate a scene if needed. This is not automation that removes the human. It’s automation that removes the busywork.
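If you think of those checkpoints as code, they could look something like the sketch below: each stage runs autonomously, but its output must pass a user review before it feeds the next stage, and a rejected stage simply runs again. The function names and the dictionary-shaped state are assumptions for illustration, not how the product is actually implemented.

```python
# A hypothetical sketch of checkpointed, supervised autonomy. Names are illustrative.
from typing import Callable

Stage = Callable[[dict], dict]
Review = Callable[[dict], bool]

def with_checkpoint(stage: Stage, review: Review, max_retries: int = 3) -> Stage:
    """Wrap a stage so the user can judge, override, or regenerate its output."""
    def run(state: dict) -> dict:
        for _ in range(max_retries):
            result = stage(state)
            if review(result):             # user approves (possibly after editing)
                return result
        raise RuntimeError("Stage rejected too many times; needs manual input.")
    return run

# The stages the post describes, chained with always-approving reviews for brevity.
approve = lambda result: True
pipeline = [
    with_checkpoint(lambda s: {**s, "plan": ["hook", "demo", "cta"]}, approve),
    with_checkpoint(lambda s: {**s, "cast": ["host", "product", "logo"]}, approve),
    with_checkpoint(lambda s: {**s, "shots": [f"shot of {x}" for x in s["plan"]]}, approve),
    with_checkpoint(lambda s: {**s, "clips": [f"clip({p})" for p in s["shots"]]}, approve),
]

state = {"intent": "30-second product promo, upbeat, for developers"}
for stage in pipeline:
    state = stage(state)
print(state["clips"])
```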
Image and video generation models are amazing, until you need consistency
Today’s image models (led by Google’s Nano Banana model) can produce stunning, art-directable frames on demand: photoreal product shots, stylized illustrations, cinematic lighting. The works. And the new wave of video generation models is just as impressive, with Google’s Veo setting the bar: believable motion, camera moves, scene extension, and transitions that feel like real cinematography.
But there’s a catch that anyone building real multi-scene videos runs into fast: each clip is generated separately.
That’s great for creativity, and brutal for continuity.
The same character can subtly (or dramatically) change face, wardrobe, or proportions between clips. Logos morph. Props drift. Locations “reset.” Identity consistency becomes the hard problem, not raw visual quality.
The industry’s answer is visual anchoring: instead of asking a model to remember everything from text alone, you give it anchors it can hold onto, such as ingredient reference images, or explicit first/last frames that guide what must stay stable.
Visla’s AI Video Director takes that idea one level higher. Rather than treating anchoring as a manual trick you apply clip-by-clip, the AI Director orchestrates the entire video:
- Plans story flow and visual flow across scenes.
- Proposes casting and visual anchors (characters, products/logos, environments).
- Supports user-injected casting: characters, product images, and logos.
- Generates and manages the right anchor images for the right moments.
- Uses those anchors to drive downstream image/video models so identity and brand stay coherent.
In other words: modern generation models are powerful engines, but long-form, consistent video requires a driver that plans, anchors, and keeps the whole production on the rails.
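As a rough illustration of the anchoring idea, the sketch below threads the same ingredient references through every clip request and pins each clip’s first frame to the previous clip’s last frame. The generate_clip function and its signature are assumptions made for the example, not the API of Veo, Nano Banana, or Visla.

```python
# A minimal sketch of visual anchoring: every clip request carries the same
# ingredient references (characters, products, logos), and each clip's first
# frame is pinned to the prior clip's last frame so the world doesn't "reset".
from dataclasses import dataclass
from typing import Optional

@dataclass
class Anchors:
    character_refs: list[str]      # ids/paths of character reference images
    brand_refs: list[str]          # product shots and logos
    first_frame: Optional[str]     # opening frame pinned to the prior clip's close

def generate_clip(prompt: str, anchors: Anchors) -> dict:
    """Stand-in for a downstream video-model call that accepts reference images."""
    return {"prompt": prompt, "anchors": anchors, "last_frame": f"last_frame({prompt})"}

def render_scenes(prompts: list[str], characters: list[str], brand: list[str]) -> list[dict]:
    clips, last_frame = [], None
    for prompt in prompts:
        anchors = Anchors(characters, brand, first_frame=last_frame)
        clip = generate_clip(prompt, anchors)
        clips.append(clip)
        last_frame = clip["last_frame"]    # chain frames to keep scenes continuous
    return clips

clips = render_scenes(
    ["hero opens the laptop", "close-up on the product logo", "hero smiles at camera"],
    characters=["hero_ref.png"],
    brand=["product_ref.png", "logo_ref.png"],
)
print(len(clips), "clips generated with shared anchors")
```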
What this unlocks
If you’ve ever produced a video the old way, you know the pain: you spend hours getting to “something” before you can even judge if it works. Continuity breaks constantly. You rebuild from scratch for minor edits because everything is coupled.
Our AI Director is built to invert that:
- Get a coherent first cut fast.
- Make changes locally without redoing everything.
- Keep identity and visual intent stable.
In practice, it means people outside the creative professions can produce with creative leverage, without becoming experts in production tooling.
Today, we mark the beginning
I don’t say this lightly: with this release, we’re not just improving workflow. We’re making a large class of template-based, rigid animation and manual editing workflows start to feel… obsolete.
That’s the uncomfortable part of real platform shifts: years of engineering excellence and hard-won professional craft, mastering timelines, keyframes, presets, render settings, and the thousand tiny tricks of production, cease to be a prerequisite. The craft doesn’t disappear, but it stops being a gate.
The most telling signal is coming from inside our own team.
Our AI engineers and software developers, people who spend their days in models, code, and data, not After Effects, are running the pipeline for testing. But the atmosphere has shifted from debugging to pure creative hype.
In just the last few days, I’ve seen engineers produce cartoons for their children. I’ve seen a complex infographic video explaining the Quantum Delayed Choice Experiment. I’ve seen a cinematic short film that depicts military action, and a spiritual music video. Others simply drop in a URL and watch it transform into a polished product promo video with logos and real products.
When you remove the friction of production, you don’t just get faster work. You unlock latent creativity that was previously trapped behind a learning curve.
Once you experience that, the ability to move from thought to video at the speed of intent, you don’t want to go back to pushing every pixel by hand.
Soon, you’ll be able to take the wheel.
Melinda Xiao-Devins
As the chief AI architect at Visla, Melinda Xiao-Devins has been instrumental in leading the charge towards a new era of video creation. With her team, she’s harnessed the capabilities of the latest LLMs, especially ChatGPT, to transform how videos are created. Melinda’s rich experience includes her role as the senior manager of the NLP team at Zoom, where she innovated and led AI initiatives. While her academic pursuits in physics and computer software engineering at Purdue University laid a strong foundation, it’s her hands-on work in the industry that truly drives her passion: making AI-driven products accessible and empowering every user to visually narrate their unique stories.

