The Two Best AI Models/Enemies Just Got Released Simultaneously
The two large language models that will dominate discussions about AI in the coming months just got released within 26 minutes of each other. That presented me with almost 250 pages of report cards to read (not AI-summarize, by the way, read) and hundreds of tests to run. Here we are then, less than 24 hours later, and this video will have dozens of highlights that you might miss just by reading the headlines.
Some details, by the way, even directly refute the company-written headlines. So, this isn't mainly about OpenAI versus Anthropic or even Sam Altman versus Dario Amodei, the two respective CEOs. It's about your productivity, your job, and the development of the most interesting technology of my lifetime, in my opinion.
"Stop bloody stalling, Phillip, give me the first interesting detail," you might be thinking. So, okay, we're going to turn to page 13 of the 212-page system card from Anthropic about their new model, Claude Opus 4.6. I normally start with the benchmarks, but I find this more interesting because Anthropic wanted to know if Opus could automate its own self-improvement. Could it replace an entry-level, remote-only research or engineering role at Anthropic itself?
The headline result is that no, none of the 16 workers at Anthropic believed it could automate their research. That would put something of a cap on the hype, because we're talking about an entry-level job, albeit at a very competitive company. But then it's only on page 185 of the same report that we learn that three of the respondents at Anthropic said that actually it was likely possible within 3 months: with sufficient scaffolding, an entry-level researcher could be automated.
Two even said that such replacement was already possible. Why the discrepancy? Because those five respondents were contacted directly by Anthropic to clarify their views. Some of them were apparently talking about a different threshold, while others had more pessimistic views upon reflection.
Why is Anthropic relying on surveys? Well, because Opus 4.6 seems to be acing many of their technical benchmarks for AI research. But the obvious follow-on question from me would be: in a company of thousands of employees, why rely on only 16 respondents? Okay, so the new Claude can't automate its own self-improvement just yet, but what about on a more practical level?
Now that they're releasing, for example, Claude in PowerPoint and ...
Watch the full video by AI Explained on YouTube.