What Is Reinforcement Learning? How AI Learns From Rewards

Agents, environments, rewards: teaching AI through trial and error

What reinforcement learning is

Reinforcement learning (RL) teaches an AI to make a sequence of decisions by trial and error. Instead of being shown the correct answer for each input, the system — called an agent — takes actions, observes what happens, and receives a reward signal telling it how good the outcome was. Over many attempts the agent learns a strategy that earns the most reward. It is how AI masters games, controls robots, and tunes complex systems where the right move depends on the situation and its long-term consequences.

The core framework

Every RL problem is described with the same handful of pieces:

  • Agent — the decision-maker being trained.
  • Environment — the world the agent acts in, which responds to its actions.
  • State — a snapshot of the situation the agent currently observes.
  • Action — a choice the agent makes from the options available.
  • Reward — a number signalling how good the immediate result was.

The loop repeats step by step: the agent observes a state and takes an action, the environment returns a new state and a reward, and the cycle continues until the episode ends (for example, when a game is won or lost).
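
The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CorridorEnv`, `run_episode`, and `random_policy` are hypothetical names invented for this example, not part of any RL library, and the "agent" here does no learning yet. It simply shows how state, action, reward, and the end of an episode fit together.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk along a corridor to reach the last cell."""

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps

    def reset(self):
        # Start every episode at the left end of the corridor.
        self.position = 0
        self.steps = 0
        return self.position  # the initial state

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to the corridor.
        self.position = min(max(self.position + action, 0), self.length - 1)
        self.steps += 1
        reached_goal = self.position == self.length - 1
        reward = 1.0 if reached_goal else 0.0
        # The episode ends at the goal or when the step budget runs out.
        done = reached_goal or self.steps >= self.max_steps
        return self.position, reward, done


def run_episode(env, choose_action):
    """Run one episode of the agent-environment loop and return the total reward."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = choose_action(state)           # agent picks an action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward
    return total_reward


def random_policy(state):
    # Placeholder agent: ignores the state and moves at random.
    return random.choice([-1, 1])


if __name__ == "__main__":
    print(run_episode(CorridorEnv(), random_policy))
```

A learning agent would replace `random_policy` with something that improves from the rewards it sees; the loop itself stays the same.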

Policies and value functions

Two concepts capture what the agent learns:

  • Policy — the agent’s strategy, mapping each state to the action it should take. Improving the policy is the ultimate goal of training.
  • Value function — an estimate of how much total future reward the agent can expect from a given state (or state-action pair). Values let the agent prefer actions that pay off later, not just immediately.
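
To make the two ideas concrete, here is a minimal sketch on a made-up four-state chain. Everything in it is an assumption for illustration: the chain, the reward of 1.0 for reaching the goal, and the discount factor `GAMMA` (a common way to weight later rewards less than immediate ones, which this article does not otherwise cover). The policy is a plain lookup table, and the value function is estimated by repeatedly applying a one-step backup, a standard technique usually called iterative policy evaluation.

```python
# Hypothetical 4-state chain: states 0..3, where state 3 is the terminal goal.
N_STATES = 4
GOAL = 3
GAMMA = 0.9  # assumed discount factor, not taken from the article

# Policy: a mapping from each non-terminal state to an action (+1 = right, -1 = left).
policy = {0: +1, 1: +1, 2: +1}


def step(state, action):
    """Deterministic transition: reward 1.0 for arriving at the goal, otherwise 0.0."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def evaluate_policy(policy, sweeps=50):
    """Estimate the value of each state under the given policy."""
    values = {s: 0.0 for s in range(N_STATES)}  # the terminal state keeps value 0
    for _ in range(sweeps):
        for state, action in policy.items():
            next_state, reward = step(state, action)
            # One-step backup: immediate reward plus discounted value of what follows.
            values[state] = reward + GAMMA * values[next_state]
    return values


if __name__ == "__main__":
    print(evaluate_policy(policy))
```

States closer to the goal end up with higher values (roughly 1.0, 0.9, and 0.81 moving away from it), which is the information an agent uses to prefer actions that lead somewhere rewarding.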

Crucially, RL optimises cumulative reward over time rather than the immediate reward alone, so a well-trained agent can learn to accept a small short-term cost when it leads to a much larger payoff down the line.
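
A short calculation shows why a short-term cost can be worth it. The sketch below assumes a discount factor `gamma` of 0.9, which is a common convention rather than something this article specifies, and the two reward sequences are invented for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum a sequence of rewards, weighting each later reward by another factor of gamma."""
    total = 0.0
    # Work backwards so each step is: reward now + gamma * (everything after).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Made-up reward sequences for two strategies over four steps.
greedy = [1.0, 0.0, 0.0, 0.0]     # small reward straight away, nothing after
patient = [-1.0, 0.0, 0.0, 10.0]  # pays a cost first, large reward at the end

if __name__ == "__main__":
    print("greedy: ", discounted_return(greedy))
    print("patient:", discounted_return(patient))
```

With these numbers the patient sequence comes out well ahead (about 6.29 against 1.0), even though its first step is a loss.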

Exploration vs exploitation

A central challenge is the exploration-exploitation tradeoff. The agent can exploit — repeat the action it currently believes is best — or explore — try something new that might turn out better. Pure exploitation gets stuck in mediocre habits; pure exploration never settles. Good algorithms balance the two, often exploring a lot early on and exploiting more as confidence grows.
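
One simple and widely used way to strike this balance is an epsilon-greedy rule: with probability epsilon the agent explores at random, otherwise it exploits its current best guess. The sketch below applies it to a hypothetical slot-machine ("bandit") problem and lowers epsilon over time. The function name, the linear decay schedule, and the Gaussian reward noise are all assumptions chosen for illustration, not a prescribed recipe.

```python
import random


def epsilon_greedy_bandit(true_means, steps=2000, eps_start=1.0, eps_end=0.05, seed=0):
    """Estimate the payoff of each arm while shifting from exploring to exploiting."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms  # running average reward per arm
    counts = [0] * n_arms       # how often each arm has been tried

    for t in range(steps):
        # Linearly decay epsilon from eps_start to eps_end over the run.
        epsilon = eps_start + (eps_end - eps_start) * t / max(steps - 1, 1)

        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick any arm at random
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit: best so far

        # Noisy reward centred on the arm's true mean (unknown to the agent).
        reward = rng.gauss(true_means[arm], 1.0)

        # Incremental update of the running average for the chosen arm.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return estimates, counts


if __name__ == "__main__":
    estimates, counts = epsilon_greedy_bandit([0.0, 1.0, 3.0])
    print("estimated payoffs:", [round(e, 2) for e in estimates])
    print("times each arm was tried:", counts)
```

Early on nearly every choice is random, so every arm gets sampled; as epsilon shrinks, the agent spends more of its pulls on whichever arm its estimates favour.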

Real-world examples

  • AlphaGo learned to play Go at superhuman level, partly by playing millions of games against itself and being rewarded for wins.
  • Robotic control uses RL so robots learn to walk, grasp objects, or balance by being rewarded for stable, successful movements.
  • Recommendation and operations systems use RL to optimise long-term outcomes like user retention or data-centre energy use.
  • RLHF (reinforcement learning from human feedback) fine-tunes large language models by rewarding responses humans prefer.

Reinforcement learning is powerful but tricky: rewards can be sparse or delayed, and training can take enormous amounts of trial and error. When it works, though, it produces agents that discover strategies no human explicitly programmed.
