Intelligent Technology Partners

How a big shift in training LLMs led to a capability explosion

Johnathan K. Lee, Ollie Eastman
2025-07-01 12:00:00
3 min read

Reinforcement learning, explained with a minimum of math and jargon.

In April 2023, a few weeks after the launch of GPT-4, the Internet went wild for two new software projects with the audacious names BabyAGI and AutoGPT.

"Over the past week, developers around the world have begun building 'autonomous agents' that work with large language models (LLMs) such as OpenAI's GPT-4 to solve complex problems," Mark Sullivan wrote for Fast Company. "Autonomous agents can already perform tasks as varied as conducting web research, writing code, and creating to-do lists."

BabyAGI and AutoGPT repeatedly prompted GPT-4 in an effort to elicit agent-like behavior. The first prompt would give GPT-4 a goal (like "create a 7-day meal plan for me") and ask it to come up with a to-do list (it might generate items like "research healthy meal plans," "plan meals for the week," and "write the recipes for each dinner in diet.txt").

Then these frameworks would have GPT-4 tackle one step at a time. Their creators hoped that invoking GPT-4 in a loop like this would enable it to tackle projects that required many steps.
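For readers who think in code, the loop described above can be sketched in a few lines of Python. This is a simplified illustration, not the actual source of BabyAGI or AutoGPT: the names `run_agent` and `fake_llm` are hypothetical, the prompt wording is invented, and `fake_llm` is a canned stand-in for a real call to a model such as GPT-4.

```python
from typing import Callable, List


def run_agent(goal: str, llm: Callable[[str], str], max_steps: int = 10) -> List[str]:
    """Ask the model for a to-do list, then work through it one task at a time."""
    # First prompt: hand the model the goal and ask for a plan.
    plan_prompt = f"Goal: {goal}\nWrite a to-do list, one task per line."
    tasks = [line.strip() for line in llm(plan_prompt).splitlines() if line.strip()]

    results: List[str] = []
    # Later prompts: one call per task, passing along what has been done so far.
    for task in tasks[:max_steps]:
        done_so_far = "\n".join(results)
        step_prompt = (
            f"Goal: {goal}\n"
            f"Completed so far:\n{done_so_far}\n"
            f"Now do this task: {task}"
        )
        results.append(llm(step_prompt))
    return results


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns canned text."""
    marker = "Now do this task: "
    if marker in prompt:
        # A task prompt: pretend the task was completed.
        return "Done: " + prompt.rsplit(marker, 1)[-1]
    # A planning prompt: return a fixed to-do list.
    return (
        "Research healthy meal plans\n"
        "Plan meals for the week\n"
        "Write the recipes for each dinner in diet.txt"
    )


if __name__ == "__main__":
    for result in run_agent("Create a 7-day meal plan for me", fake_llm):
        print(result)
```

Note that nothing in this loop checks whether a step actually succeeded. Each call simply trusts the output of the previous ones, so an early mistake is carried into every later prompt.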

But after an initial wave of hype, it became clear that GPT-4 wasn't up to the task. Most of the time, GPT-4 could come up with a reasonable list of tasks. And sometimes it was able to complete a few individual tasks. But the model struggled to stay focused.

Sometimes GPT-4 would make a small early mistake, fail to correct it, and then get more and more confused as it went along. One early review complained that BabyAGI "couldn't seem to follow through on its list of tasks and kept changing task number one instead of moving on to task number two."

By the end of 2023, most people had abandoned AutoGPT and BabyAGI. It seemed that LLMs were not yet capable of reliable multi-step reasoning.

But that soon changed. In the second half of 2024, people started to create AI-powered systems that could consistently complete complex, multi-step assignments:

Vibe coding tools like Bolt.new, Lovable, and Replit allow someone with little to no programming experience to create a full-featured app with a single prompt.
Agentic coding tools like Cursor, Claude Code, Jules, and Codex help experienced programmers complete non-trivial programming tasks.
Computer-use tools from Anthropic, OpenAI, and Manus perform tasks on a desktop computer using a virtual keyboard and mouse.
Deep research tools from Google, OpenAI, and Perplexity can research a topic for five to 10 minutes and then generate an in-depth report.

According to Eric Simons, the CEO of the company that made Bolt.new, better models were crucial to its success. In a December podcast interview, Simons said his company, StackBlitz, tried to build a product like Bolt.new in early 2024. However, AI models "just weren't good enough to actually do the code generation where the code was accurate."

A new generation of models changed that in mid-2024. StackBlitz developers tested them and said, "Oh my God, like, OK, we can build a product around this," Simons said.