"师傅领进门,修行在个人。"
The Master Leads You to the Door, the Rest is Up to You
For the last few months, I've been working extensively with LLMs in game and reinforcement learning environments. This is a casual blog post recording my current predictions about what's next in the AI landscape.
Historically, reinforcement learning has thrived when there's a clear goal, such as DeepMind's AlphaGo conquering the game of Go. The objectives were straightforward: win the game, maximize the reward. But the current generation of LLMs, like Grok4, has hit a data wall, exhausting much of the available pretraining data. The next big leap isn't going to come from squeezing more juice out of existing datasets; it will emerge from LLMs learning to explore entirely on their own.
Think about it this way. When you first start learning something, imitation works wonders. You copy your teacher, classmates, or heroes, gaining baseline competence. But eventually, to truly excel, you have to find your own path, exploring and learning directly from your own experiences, your own "on-policy" trajectories.
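To make the contrast concrete, here's a toy sketch (in PyTorch) of those two phases: a behavior-cloning step that copies a teacher's demonstrated actions, followed by a REINFORCE-style update on actions the model samples itself. The policy size, the teacher data, and the reward function are all made up for illustration; this is not a real training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_states, n_actions = 8, 4
policy = nn.Sequential(nn.Linear(n_states, 32), nn.Tanh(), nn.Linear(32, n_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def one_hot(state):
    return F.one_hot(torch.tensor(state), n_states).float()

# Phase 1: imitation ("copy your teacher").
# Plain supervised cross-entropy against actions a teacher demonstrated.
teacher_data = [(0, 1), (3, 2), (5, 0)]              # (state, teacher_action) pairs
for state, teacher_action in teacher_data:
    logits = policy(one_hot(state))
    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([teacher_action]))
    opt.zero_grad(); loss.backward(); opt.step()

# Phase 2: on-policy exploration ("find your own path").
# REINFORCE on the model's own sampled actions, scored by external feedback.
def reward_fn(state, action):                        # stand-in for an environment
    return 1.0 if action == state % n_actions else 0.0

for state in range(n_states):
    logits = policy(one_hot(state))
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                           # the model's own trajectory
    loss = -dist.log_prob(action) * reward_fn(state, action.item())
    opt.zero_grad(); loss.backward(); opt.step()
```

The point of the toy isn't the particular losses; it's that in phase one the learning signal comes from someone else's behavior, and in phase two it comes from the consequences of the model's own choices.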
Similarly, I believe the next major breakthrough for LLMs will come from self-directed exploration. Initially, we'll see this happen in structured game environments with clear, measurable objectives. These environments offer well-defined feedback, making it easier for models to refine their strategies. Once models have mastered these structured games, the real excitement begins: LLMs stepping into complex, simulated worlds without explicit goals, guided solely by their internal signals.
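As a rough picture of what "clear, measurable objectives" means in practice, here's a minimal, hypothetical text-game loop: the environment returns an unambiguous reward, and the agent only has to act on that feedback. Both `TextGameEnv` and `llm_propose_action` are inventions for this sketch (the latter is a simple stand-in for an LLM policy), not any real API.

```python
import random

class TextGameEnv:
    """Toy 'guess the number' game: a clear goal and unambiguous feedback."""
    def __init__(self, low=1, high=20, max_turns=8):
        self.low, self.high, self.max_turns = low, high, max_turns

    def reset(self):
        self.secret = random.randint(self.low, self.high)
        self.turn = 0
        return f"Guess a number between {self.low} and {self.high}."

    def step(self, guess: int):
        self.turn += 1
        if guess == self.secret:
            return "Correct!", 1.0, True            # win: reward 1
        done = self.turn >= self.max_turns
        hint = "higher" if guess < self.secret else "lower"
        return f"Wrong, go {hint}.", 0.0, done      # measurable, well-defined feedback

def llm_propose_action(observation: str, history: list) -> int:
    """Placeholder for an LLM policy; here, a binary search over the hints."""
    low, high = 1, 20
    for guess, hint in history:
        if "higher" in hint:
            low = guess + 1
        elif "lower" in hint:
            high = guess - 1
    return (low + high) // 2

env = TextGameEnv()
obs, history, done = env.reset(), [], False
while not done:
    guess = llm_propose_action(obs, history)
    obs, reward, done = env.step(guess)
    history.append((guess, obs))
print("final reward:", reward)
```

The appeal of environments like this is that the reward is verifiable and immediate, so a model can iterate on strategy without any human in the loop.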
One major hurdle today is long-term planning. Currently, LLMs excel at tasks with immediate rewards but struggle with scenarios that require patience or strategic foresight. Still, I'm confident we'll see significant advances here soon. Just as humans learn to plan ahead through trial and error, sometimes painfully, LLMs will develop similar foresight through improved internal reinforcement mechanisms.
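One way to see why delayed payoffs are hard: with a standard discounted return, a reward that arrives many steps in the future reaches the early decisions only as a heavily attenuated signal. The reward sequence and discount factor below are made-up numbers, purely for illustration.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return-to-go G_t = r_t + gamma * G_{t+1} for each step t."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# Ten steps of nothing, then a single payoff at the end (e.g. winning a game).
rewards = [0.0] * 10 + [1.0]
print(discounted_returns(rewards))
# The first step's return is 0.99**10, roughly 0.90: the signal reaches it,
# but weakly, which is why long chains of patient, low-feedback moves are hard to learn.
```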
Speaking of pain, humans have a built-in learning mechanism: pain signals that instinctively teach us what to avoid or pursue. Interestingly, LLMs already experience a rudimentary form of "pain" through training losses and error signals. The next evolution involves refining these internal signals, allowing models to autonomously set their own goals, priorities, and strategies without external instruction.
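For a concrete, deliberately generic picture of what "refining these internal signals" could look like, here's a curiosity-style sketch where the reward is the agent's own prediction error: the agent is drawn toward states its internal model cannot yet predict. This is one common intrinsic-motivation idea, not a claim about how any particular model is trained, and the forward model and dynamics below are toy placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)) * 0.1        # a tiny learned forward model of the world

def intrinsic_reward(state, next_state, lr=0.05):
    """Reward = squared error of the forward model. Learning shrinks it over time,
    so the agent is pushed toward states it cannot yet predict (novelty)."""
    global W
    pred = W @ state
    error = next_state - pred
    W += lr * np.outer(error, state)     # improve the internal model
    return float(np.sum(error ** 2))     # surprise acts as the reward signal

state = rng.normal(size=4)
for step in range(5):
    next_state = np.tanh(state + rng.normal(scale=0.1, size=4))  # made-up dynamics
    bonus = intrinsic_reward(state, next_state)
    print(f"step {step}: curiosity bonus = {bonus:.3f}")
    state = next_state
```

The design choice worth noticing is that nothing external hands the agent a goal; the "goal" emerges from the gap between what the model predicts and what it observes.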
In the next year, I expect significant breakthroughs first in text-based game environments, with those gains quickly generalizing to realistic simulations. This progression will steadily blur the line between virtual and real-world capabilities.
The future of LLMs isn't just about more data; it's about smarter, self-driven exploration. Welcome to the era of truly self-evolving Large Language Models.