
Reinforcement learning, explained with a minimum of math and jargon.
"Over the past week, developers around the world have begun building 'autonomous agents' that work with large language models (LLMs) such as OpenAI's GPT-4 to solve complex problems," Mark Sullivan wrote for Fast Company. "Autonomous agents can already perform tasks as varied as conducting web research, writing code, and creating to-do lists."
BabyAGI and AutoGPT repeatedly prompted GPT-4 in an effort to elicit agent-like behavior. The first prompt would give GPT-4 a goal (like "create a 7-day meal plan for me") and ask it to come up with a to-do list (it might generate items like "Research healthy meal plans," "plan meals for the week," and "write the recipes for each dinner in diet.txt").
Then these frameworks would have GPT-4 tackle one step at a time. Their creators hoped that invoking GPT-4 in a loop like this would enable it to tackle projects that required many steps.
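To make the loop concrete, here is a minimal sketch of the plan-then-execute pattern described above. It is not the actual BabyAGI or AutoGPT code. The names `run_agent_loop` and `fake_llm` are hypothetical, the prompt wording is invented for illustration, and `fake_llm` is a canned stand-in for a real model call so the example runs offline.

```python
from typing import Callable, List


def run_agent_loop(goal: str, llm: Callable[[str], str], max_steps: int = 10) -> List[str]:
    """Plan once, then work through the plan one task at a time."""
    # First prompt: ask the model to turn the goal into a to-do list.
    plan_text = llm(f"Goal: {goal}\nWrite a to-do list, one task per line.")
    tasks = [line.strip() for line in plan_text.splitlines() if line.strip()]

    results: List[str] = []
    # Later prompts: one task per call, with earlier results passed along as context.
    for task in tasks[:max_steps]:
        context = "\n".join(results)
        prompt = f"Goal: {goal}\nCompleted so far:\n{context}\nNow do this task: {task}"
        results.append(llm(prompt))
    return results


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (assumption, not a real API)."""
    if "Write a to-do list" in prompt:
        return "Research healthy meal plans\nPlan meals for the week\nWrite the recipes"
    task = prompt.rsplit("Now do this task: ", 1)[-1]
    return f"done: {task}"


if __name__ == "__main__":
    for result in run_agent_loop("create a 7-day meal plan for me", fake_llm):
        print(result)
```

Nothing in this loop checks whether a step actually succeeded before moving on. The output of each call simply becomes context for the next one, so in a loop of this shape an early mistake is carried forward into every later prompt.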
But after an initial wave of hype, it became clear that GPT-4 wasn't up to the task. Most of the time, GPT-4 could come up with a reasonable list of tasks. And sometimes it was able to complete a few individual tasks. But the model struggled to stay focused.
Sometimes GPT-4 would make a small early mistake, fail to correct it, and then get more and more confused as it went along. One early review complained that BabyAGI "couldn't seem to follow through on its list of tasks and kept changing task number one instead of moving on to task number two."
By the end of 2023, most people had abandoned AutoGPT and BabyAGI. It seemed that LLMs were not yet capable of reliable multi-step reasoning.
But that soon changed. In the second half of 2024, people started to create AI-powered systems that could consistently complete complex, multi-step assignments:
Vibe coding tools like Bolt.new, Lovable, and Replit allow someone with little to no programming experience to create a full-featured app with a single prompt.
Agentic coding tools like Cursor, Claude Code, Jules, and Codex help experienced programmers complete non-trivial programming tasks.
Computer-use tools from Anthropic, OpenAI, and Manus perform tasks on a desktop computer using a virtual keyboard and mouse.
Deep research tools from Google, OpenAI, and Perplexity can research a topic for five to 10 minutes and then generate an in-depth report.
According to Eric Simons, the CEO of the company that made Bolt.new, better models were crucial to its success. In a December podcast interview, Simons said his company, StackBlitz, tried to build a product like Bolt.new in early 2024. However, AI models "just weren't good enough to actually do the code generation where the code was accurate."
A new generation of models changed that in mid-2024. StackBlitz developers tested them and said, "Oh my God, like, OK, we can build a product around this," Simons said.