
Google's Embedding 2 Is RAG on Steroids (But Everyone Is Getting It Wrong)

A new model from Google has quietly upended the entire AI landscape. Embedding 2 is the first natively multimodal embedding model from Google, and it's doing something that most people aren't even close to understanding: letting you embed videos directly into a vector database.

This isn't a small tweak. For years, anyone wanting to analyze video data through RAG systems had to use hacky workarounds—transcribing videos, describing them with text, then pretending they were normal documents. Now the video itself becomes searchable. The business implications are massive. Organizations with proprietary video archives suddenly have a path to actually interrogate that content at scale.

But here's what's causing the confusion: most people assume embedding a video means you can simply ask questions about it and get intelligent answers. That's not how it works. And that's why this article matters.

The Naive RAG System Problem

The standard expectation looks something like this: you find a great video, embed it using Embedding 2, then ask your AI system specific questions about the content. You imagine the LLM will analyze what's inside the video and give you a detailed response.

That's not what happens.

When you take a naive RAG system—one that simply connects Embedding 2 to an existing pipeline—and feed it a video about Playwright browser automation, then ask "How does Playwright test without opening browsers?", the system doesn't respond with a text explanation. It hands you back a two-minute clip of the original video and says, "The answer is in there." That's what most tutorials are showing people to do. And while there's some value in that, it's not where real utility lives.

This is the critical misunderstanding: an embedding model takes data and turns it into a vector—a point in mathematical space. It handles similarity and search. But it's not responsible for generating explanations or analysis. It's like the difference between recognizing a face in a crowd and being able to describe who that person is, what they do, and why they matter.

How RAG Actually Works

To understand where this goes wrong, you need to understand how retrieval-augmented generation flows.

On one side of the equation: document ingestion. A text document about World War II battleships goes through an embedding model, which turns it into 1,526 numbers—a vector in space. The model places that document near other vectors with similar semantic meaning (ships, boats) rather than unrelated concepts (bananas, green apples).

On the other side: retrieval. A user's question becomes its own vector. The LLM searches the database for vectors closest to the question's vector, retrieves those documents, and ingests their content to augment its answer.

This works perfectly with text because when a document is retrieved, it's paired with actual text—the original PDF or Word file—that the system can read and respond to.
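The two sides of that flow can be sketched in a few lines. This is a toy illustration, not real Embedding 2 usage: the `embed` function below is a hand-rolled bag-of-words stand-in (with a made-up vocabulary) so the example runs without an API key, but the ingestion/retrieval shape is the same.

```python
import math

# Hypothetical stand-in for an embedding model. A real pipeline would
# call Embedding 2 here; this tiny bag-of-words sketch only shows how
# ingestion and retrieval fit together.
VOCAB = ["battleship", "naval", "ship", "war", "apple", "banana", "fruit", "green"]

def embed(text: str) -> list[float]:
    words = text.lower().split()
    vec = [float(sum(w.startswith(term) for w in words)) for term in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # unit-length vector

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # both already normalized

# Ingestion side: each document is stored next to its vector.
docs = [
    "World War II battleships and naval engagements",
    "Green apples and bananas: a fruit guide",
]
index = [(embed(d), d) for d in docs]

# Retrieval side: the question becomes its own vector, and the nearest
# document comes back as text the LLM can read and respond to.
def retrieve(question: str) -> str:
    q = embed(question)
    return max(index, key=lambda item: cosine(q, item[0]))[1]

print(retrieve("famous battleships of the war"))  # → the battleship document
```

Notice that what the retrieval step hands back is the stored document itself. With text, that payload is readable; the next section shows why the same move breaks down for video.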

Now apply that same logic to video. You embed a video about World War II battleships. It becomes a vector in space. A user asks a question. The system retrieves that video vector. And what comes back? An MP4 file. Not analysis. Not explanation. Just the raw video data.

The LLM can't ingest what's inside the video because it's not text—it's just a video file. Unless you're using something like Gemini specifically designed to parse video, you have no real answer mechanism.

The Actual Solution

This is solvable. But it requires rethinking where the work happens.

You don't want to augment on the retrieval side—where every question triggers expensive video analysis by the LLM. That's slow and costly. Instead, you want to augment during ingestion: when you first put that video into the system, you also run it through Gemini (or a comparable model) to generate a text description and transcript.

The pipeline becomes: video goes into Embedding 2, but simultaneously gets processed by a language model to produce accompanying text. When a question retrieves that video vector, it's paired not just with an MP4 file, but with written explanation—text the AI can actually ingest and use to generate answers.

This is what makes a multimodal RAG system actually work. The architecture requires adding Gemini (or similar) on the ingestion side of the pipeline, creating a workflow where video content gets augmented once, at entry, rather than on every query.
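As a sketch of that ingestion-side augmentation: the record stores the vector, the raw media path, and a text description side by side. The function names `embed_video` and `describe_video` are hypothetical placeholders for Embedding 2 and Gemini calls; toy lambdas stand in so the example runs without API keys.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VideoRecord:
    vector: list[float]   # from the embedding model (Embedding 2 in the article)
    media_path: str       # the raw MP4, kept for playback
    description: str      # text the LLM can actually ingest at answer time

def ingest_video(
    path: str,
    embed_video: Callable[[str], list[float]],      # hypothetical Embedding 2 call
    describe_video: Callable[[str], str],           # hypothetical Gemini call
) -> VideoRecord:
    # Both steps run once, at entry — not on every query.
    return VideoRecord(
        vector=embed_video(path),
        media_path=path,
        description=describe_video(path),
    )

# Toy stand-ins so the sketch is self-contained.
record = ingest_video(
    "playwright_demo.mp4",
    embed_video=lambda p: [0.1, 0.2, 0.3],
    describe_video=lambda p: "A demo of Playwright running headless browser tests.",
)
print(record.description)
```

At query time, retrieval returns the whole record, and the `description` field is what gets stuffed into the LLM's context instead of (or alongside) the MP4.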

Counterpoints Worth Considering

Critics might note that this approach still has limitations. Embedding 2 only handles videos up to 120 minutes and text up to 8,192 tokens. While workarounds exist for longer content, the model wasn't designed with heavy video workloads in mind—it's an embedding model first, not an analysis engine.
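One common workaround for the text limit is chunking during ingestion. A minimal sketch, with the caveat that it approximates tokens by whitespace-separated words; a real pipeline would count tokens with the model's own tokenizer.

```python
def chunk_text(text: str, max_tokens: int = 8192, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks that fit under a token budget.

    Tokens are approximated as whitespace words here — an assumption,
    not how Embedding 2 actually tokenizes.
    """
    words = text.split()
    step = max_tokens - overlap  # slide forward, keeping some shared context
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), step)]

chunks = chunk_text("word " * 20000, max_tokens=8192, overlap=200)
print(len(chunks))  # 3 chunks for a 20,000-word input
```

Each chunk is embedded separately, so a long document becomes several nearby vectors rather than one oversized (and rejected) input.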

Additionally, some argue that using Gemini during ingestion adds latency and cost to the pipeline. Others suggest that pure multimodal models like Gemini itself might eventually render this intermediate step unnecessary by handling both embedding and explanation natively. The architecture described here may be a transitional solution.

Embedding models take data and turn it into vectors—they handle similarity and search, not explanation or analysis.

Bottom Line

Chase H's core argument is compelling: the ability to embed video directly into vector databases represents a genuine breakthrough in RAG architecture. The biggest misunderstanding isn't whether embedding works—it's that most people assume embedding automatically enables analysis. It doesn't. You need an augmentation pipeline during ingestion, not retrieval. The strongest part of this piece is the clear explanation of why naive implementations fail and what the actual solution looks like. Its vulnerability is that this architecture may become obsolete once multimodal models like Gemini advance further—the intermediate step might eventually disappear entirely.

The AI RAG landscape completely changed a few days ago with the release of Google's brand-new Embedding 2 model. Now, this is huge because this finally allows us to do what we've all been trying to do for a few years now, which is directly embed videos and images into our vector databases. But here's what nobody is telling you. Being able to embed a video into a vector database is not the same as being able to analyze a video inside a vector database.

Confused? Well, that's not surprising. Most people are, which is why none of these videos talking about Embedding 2 get this right. It is not as simple as just hooking up Embedding 2 to an existing RAG structure and being able to say, "Oh, now I can ask questions about this video." That is not the case.

There are actually additional steps required to create a RAG architecture that can get the most out of these embedded videos. And today, I'm going to show you exactly how to do that. On top of that, we are going to have a nuanced but thorough discussion about RAG and embedding, because this is a confusing topic and you do need to understand what's actually happening under the hood here to figure out this whole architecture mess. Lastly, I'll be giving you a GitHub repo with the proper RAG architecture so you can just clone it, copy it, do whatever you want, and essentially have a 90% solution ready to go so you can navigate this minefield right off the bat.

All right, so let's open with the good stuff. This is the GitHub repo I was talking about. Inside of it, you will find essentially the basic RAG architecture that I'm going to be talking about in depth from now on.

You have two options. You can clone this thing and then point Claude Code at it, or I have a markdown file here called claude code blueprint, which you could copy and paste into Claude Code, and it will tell it exactly how it should set itself up. So, you can just take this and go forth and conquer, but I highly, highly suggest you watch the rest of this so you actually understand the thought process behind it. Now, speaking of Claude Code, quick plug.

I just dropped my Claude Code masterclass ...