Chase argues that Cling 3.0 represents a genuine leap forward in AI video generation — specifically because it handles multi-shot video, text rendering, and emotional expression better than anything else on the market. The model excels at cutting between multiple shots within a single generation, giving creators unprecedented control over how scenes unfold.
What's Actually Different About Cling 3.0
The core innovation here is what Chase calls "multi-shots." Rather than generating one continuous shot, users can now program three distinct shots, separated by hard cuts, into a single generation. Each shot's duration can be adjusted independently by dragging, up to a maximum of 15 seconds per shot.
This matters because previous models forced creators to generate clips separately and stitch them together afterward. Multi-shots eliminate that extra step entirely.
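To make the duration constraint concrete, here is a minimal planning sketch in Python. The 15-second per-shot cap comes from the video; the shot-list structure, the `validate_shots` helper, and the sample descriptions are hypothetical illustrations, not part of Cling's actual interface (which uses drag-to-adjust durations).

```python
# Hypothetical pre-generation checklist for a Cling 3.0 multi-shot plan.
# MAX_SHOT_SECONDS reflects the stated 15-second per-shot cap; everything
# else (the data structure, helper name, sample shots) is illustrative.
MAX_SHOT_SECONDS = 15

def validate_shots(shots):
    """Raise if any planned shot exceeds the per-shot duration cap."""
    for i, shot in enumerate(shots, start=1):
        if shot["seconds"] > MAX_SHOT_SECONDS:
            raise ValueError(
                f"Shot {i} runs {shot['seconds']}s; "
                f"the cap is {MAX_SHOT_SECONDS}s per shot."
            )

plan = [
    {"seconds": 5, "description": "Wide establishing shot, desert at dawn"},
    {"seconds": 4, "description": "Medium shot, subject walks toward camera"},
    {"seconds": 6, "description": "Close-up on the subject's face"},
]
validate_shots(plan)  # passes: every shot is within the 15-second cap
```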
The second major addition is something called Elements. Think of it as reference images, but for video generation. Users can upload 360-degree views of characters — side angles, front-facing poses, back views — giving the AI a complete picture of how subjects appear from different perspectives. This dramatically improves consistency across multi-shot sequences.
Chase demonstrates the feature by creating an Element: a woman with brown hair, described in simple terms. Once uploaded, this Element can be referenced in prompts using "@" syntax or through the Elements menu.
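As a sketch of how that reference might look in practice, the prompt below uses the "@" syntax from the video. The Element name "@BrownHairWoman" and the scene wording are invented for illustration; only the "@" referencing convention comes from the source.

```python
# Illustrative two-shot prompt referencing an uploaded Element by name.
# The "@" syntax is described in the video; "@BrownHairWoman" and the
# shot descriptions are hypothetical examples, not official samples.
prompt = (
    "Shot 1: @BrownHairWoman stands at a rain-streaked window, medium close-up. "
    "Shot 2: hard cut, @BrownHairWoman turns toward the camera, soft interior light."
)
print(prompt)
```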
The Six Things Every Prompt Needs
The real value Chase provides is a prompting framework he developed for Cling 3.0 users. The model responds best when prompts include exactly six components: camera, scene, subject, action, audio, and style.
This matters because AI video generation defaults to average quality when given vague instructions. Using precise terminology, such as "low angle tracking shot using a 24mm anamorphic lens with slow dolly push-in," produces dramatically better results than casual, conversational descriptions.
The vocabulary matters enormously. These terms are what the model was trained on, and they function like a film director's nomenclature.
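One way to hold a prompt to the framework is to draft each of the six components separately and only then join them. The sketch below does exactly that; the camera line is Chase's own example, while the other five values are invented placeholders.

```python
# Draft the six components the framework calls for, then assemble them.
# The "camera" value is the example quoted above; the rest are placeholders.
components = {
    "camera":  "low angle tracking shot using a 24mm anamorphic lens "
               "with slow dolly push-in",
    "scene":   "neon-lit alley at night, rain pooling on the asphalt",
    "subject": "a woman with brown hair in a worn leather jacket",
    "action":  "she walks toward the camera, glancing over her shoulder",
    "audio":   "distant traffic hum, rain patter, a low synth drone",
    "style":   "moody cinematic look, teal-and-orange grade, 35mm film grain",
}
# Refuse to build the prompt if any of the six components is missing.
assert set(components) == {"camera", "scene", "subject", "action", "audio", "style"}
prompt = " ".join(f"{name.capitalize()}: {value}." for name, value in components.items())
print(prompt)
```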
Chase recommends that users sign up for Shotdeck.com, a free database of cinematic scenes from major films. Users can search specific movies — he demonstrates with Dune 2 — and extract technical details: shot type, lens size, composition, lighting, camera movement, and film stock. These technical specifics can then be fed directly into Cling 3.0 prompts.
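Here is a sketch of that workflow, assuming the six technical fields have been copied off a Shotdeck entry by hand. The field names mirror the list above; the sample values are invented placeholders rather than actual Shotdeck data for any Dune 2 frame.

```python
# Fold hand-copied Shotdeck metadata into the "camera" portion of a prompt.
# Field names mirror Shotdeck's categories; the values are invented examples.
shotdeck_entry = {
    "shot type":       "extreme wide shot",
    "lens size":       "50mm",
    "composition":     "single subject, centered",
    "lighting":        "hard natural daylight",
    "camera movement": "static",
    "film stock":      "large-format digital",
}
camera_clause = ", ".join(shotdeck_entry.values())
print(f"Camera: {camera_clause}")
```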
Where the Model Still Struggles
Two significant limitations deserve attention when using Cling 3.0.
First, Elements technology is still maturing. Overloading prompts with too many Elements alongside multiple shot changes sometimes causes the model to ignore hard cut instructions — collapsing separate shots into one long clip and producing unexpected audio artifacts.
Second, generation speed remains slower than competing models like VO 3.1 Fast. Creators producing longer videos that require iterative refinement should factor this into their workflows.
"If we don't explicitly tell it these things, then it's just going to default to the mean, which is going to give you a mediocre output."
Chase notes that the model's true strength emerges when given minimal constraints — allowing natural generation without starting images or heavy Element references. The resulting videos demonstrate remarkable emotional depth and facial expression quality that competitors haven't matched.
Bottom Line
Chase's core argument holds: Cling 3.0 genuinely represents the current peak of AI video generation, particularly for creators who want cinematic control through multi-shot sequences and precise prompting. His six-element framework provides a practical methodology for achieving those results — and Shotdeck offers a legitimate learning resource for building that vocabulary.
The vulnerability is practical rather than theoretical: the model remains expensive to run at scale, slower than alternatives, and Element-based workflows still require experimentation. Users should start small with prompts before adding complexity.