
Google unveils Gemini

Casey Newton delivers a rare, unvarnished look inside the engine room of artificial intelligence, moving beyond the usual hype to dissect the specific mechanics of Google's new Gemini model. The piece's most striking revelation isn't just that the technology works, but that it represents a fundamental shift in how machines perceive the world—moving from processing text to understanding images, audio, and video simultaneously. For the busy professional, this isn't just a product update; it is a signal that the digital tools we rely on are about to become active partners rather than passive search engines.

The Multimodal Leap

Newton frames the launch not as a simple upgrade, but as a structural transformation in AI capability. The author notes that Gemini is "natively multimodal," meaning it can analyze the contents of a picture and answer questions about it, or create an image out of a text prompt. This distinction is crucial. While previous models struggled to connect the dots between a graph in a PDF and the text describing it, Gemini can reportedly identify a mistake in a student's math homework just by looking at a photo of the page. Newton writes, "The ability of AI systems to take different kinds of data (text, images, video, audio), analyze them using a single tool, and translate them in and out of various formats is the kind of foundational innovation that makes lots of other progress possible."

This framing is effective because it grounds the abstract concept of "intelligence" in a tangible, practical application: error correction and visual reasoning. However, one must be cautious about the benchmarks. The article highlights that the model "exceeds current state-of-the-art results on 30 of the 32 widely-used academic benchmarks," yet Newton acknowledges that competitive coding is "meaningfully different from day-to-day software engineering." Critics might note that excelling in controlled academic tests does not always translate to reliability in the messy, unstructured reality of enterprise workflows. Still, the trajectory is undeniable.

"Multimodal" can read like awkward jargon, but the term comes up constantly in conversation with Google executives.

The Race for the Crown

The commentary then pivots to the competitive landscape, where Newton carefully positions Gemini against its rivals without succumbing to tribalism. The author suggests that the "Pro" version is best seen as the company's answer to OpenAI's GPT-3.5, while the upcoming "Ultra" version is positioned as the direct rival to GPT-4. Newton observes that the Ultra model "won't be available until next year so that Google can complete trust and safety testing," a move that signals a deliberate, albeit risky, attempt to prioritize stability over speed in a market that often rewards the latter.

In an exclusive interview, Google CEO Sundar Pichai and DeepMind CEO Demis Hassabis discuss the future of this technology. When asked whether the model will gain capabilities beyond those of previous iterations, Hassabis states, "We're kind of in beta — to safety-check it, responsibility-check it, but also to see how else it can be fine-tuned." This admission of a "beta" status for a flagship product is significant. It suggests that even the architects of the system view the current release as a foundation rather than a finished product. Newton captures this tension well, noting that while the pace is "striking," the executives are aware of the need for "foresight... about the unintended consequences."

The Human Element and the Future of Work

Perhaps the most provocative section of the piece addresses the impact on the workforce. Newton asks Pichai directly if the company will hire fewer engineers given the model's ability to outperform 85 percent of humans in coding competitions. Pichai's response reframes the narrative from displacement to augmentation: "I really think of this as making programmers far more productive, and over time taking some of the grunt work out of the job." He argues that the bar will change, and "more people will be able to become programmers."

This is a compelling, if optimistic, take. Newton presents this as a shift in the "value proposition" of search and software development, where the technology becomes an "assistant for the user to help you with what you're doing." However, the counterargument is implicit in the speed of the rollout. If the technology can indeed parse scientific papers, interpret graphs, and write code at an expert level, the transition period for the workforce could be more turbulent than the "productivity boost" narrative suggests. As Hassabis puts it, "We're really close now to actual, practical, useful things in the real world, beyond just the games and stuff we used to do so well." The implication is that the era of AI as a novelty is over; the era of AI as infrastructure has begun.

"If I take a step back and say that the breakthroughs here may help us make progress against cancer more easily, I want it to move as fast as possible. Why wouldn't you?"

Bottom Line

Newton's coverage succeeds by stripping away the marketing veneer to reveal the genuine technical leap of Gemini, particularly its native ability to understand the world through multiple senses. The strongest part of the argument is the emphasis on "multimodality" as the key differentiator, not just raw processing power. The biggest vulnerability remains the uncertainty of how these capabilities will be governed once they permeate search, advertising, and enterprise tools. Readers should watch closely how the governance of these models evolves, as the gap between "safety-checking" and market demand will define the next year of the industry.

Sources

Google unveils Gemini

by Casey Newton · Platformer

I.

Google this morning announced the rollout of Gemini, its largest and most capable large language model to date. Starting today, the company’s Bard chatbot will be powered by a version of Gemini, and will be available in English in more than 170 countries and territories. Developers and enterprise customers will get access to Gemini via API next week, with a more advanced version set to become available next year.

How good is Gemini? Google says the performance of its most capable model “exceeds current state-of-the-art results on 30 of the 32 widely-used academic benchmarks used in LLM research and development.” Gemini also scored 90.0% on a test known as “Massive Multitask Language Understanding,” or MMLU, which assesses capabilities across 57 subjects including math, physics, history and medicine. It is the first LLM to perform better than human experts on the test, Google said.

Gemini also appears to be a very good software engineer. Last year, using an older language model, DeepMind introduced an AI system named AlphaCode that outperformed 54 percent of human coders in coding competitions. Using Gemini, Google built a next-generation version named AlphaCode 2. The sequel outperformed an estimated 85 percent of humans, the company said.

Competitive coding is meaningfully different from day-to-day software engineering in some important ways: it can be both more and less difficult than what the typical engineer is asked to do. But still, the rate of progress here is striking.

Gemini is natively multimodal, meaning that it can analyze the contents of a picture and answer questions about it, or create an image out of a text prompt. During a briefing on Tuesday, a Google executive uploaded a photo of some math homework in which the student had shown their calculations leading up to the final answer. Gemini was able to identify at which step in the student’s process they had gone awry, and explained their mistake and how to answer the question correctly.

“Multimodal” can read like awkward jargon, but the term comes up constantly in conversation with Google executives. The ability of AI systems to take different kinds of data (text, images, video, audio), analyze them using a single tool, and translate them in and out of various formats is the kind of foundational innovation that makes lots of other progress possible. (All of which is a long way of saying: sorry for the number of times the word ...