Ever wondered how you'd train a robot to land a drone without programming every single move? That's exactly what I set out to explore. I spent weeks building a game where a virtual drone has to figure out how to land on a platform: not by following pre-programmed instructions, but by learning from trial and error, much like how you learned to ride a bike.
This is Reinforcement Learning (RL), and it's fundamentally different from other machine learning approaches. Instead of showing the AI thousands of examples of "correct" landings, you give it feedback: "Hey, that was pretty good, but maybe try being more gentle next time?" or "Yikes, you crashed, probably don't do that again." Through countless attempts, the AI figures out what works and what doesn't.
In this post, I'm documenting my journey from RL fundamentals to building a working system that (mostly!) teaches a drone to land. You'll see the successes, the failures, and all the weird behaviors I had to debug along the way.
1. Reinforcement Learning: Overview
Much of the idea can be traced back to Pavlov's dog and Skinner's rat experiments. The idea is that you give the subject a 'reward' when it does something you want it to do (positive reinforcement) and a 'penalty' when it does something bad (negative reinforcement). Through many repeated attempts, your subject learns from this feedback, gradually discovering which actions lead to success, similar to how Skinner's rat learned which lever presses produced food rewards.

In the same fashion, we want a system that can learn to do things (or tasks) such that it maximizes the reward and minimizes the penalty. Note this fact about maximizing reward; it will come up again later.
1.1 Core Concepts
When talking about systems that can be implemented programmatically on computers, the best practice is to write clear definitions for ideas that can be abstracted. In the study of AI (and more specifically, Reinforcement Learning), the core ideas can be boiled down to the following:
- Agent (or Actor): This is our subject from the previous section. It can be the dog, a robot trying to navigate a huge factory, a video game NPC, etc.
- Environment (or the world): This can be a place, a simulation with restrictions, a video game's virtual world, etc. I think of it like this: "A box, real or virtual, where the agent's entire life is confined; it only knows about what happens inside the box. We, as the overlords, can alter this box, while the agent will assume that a god is exacting his will on its world."
- Policy: Just like in governments, companies, and many other similar entities, 'policies' dictate "what actions should be taken in a given situation".
- State: This is what the agent "sees" or "knows" about its current situation. Think of it as the agent's snapshot of reality at any given moment, like how you see the traffic light color, your speed, and the distance to the intersection when driving.
- Action: Now that our agent can 'see' things in its environment, it may want to do something about its state. Maybe it just woke up from a long night's slumber, and now it wants a cup of coffee. In this case, the first thing it must do is get out of bed. That is an action the agent takes to achieve its goal, i.e., GET SOME COFFEE!
- Reward: Every time the actor executes an action (of its own volition), something may change in the world. For example, our agent got out of bed and started walking towards the kitchen, but then, because it is so bad at walking, it tripped and fell. In this situation, the god (us) rewards it with a punishment for being bad at walking (negative reward). But then the agent makes it to the kitchen and gets the coffee, so the god (us) rewards it with a cookie (positive reward).

As you can imagine, most of these key components have to be tailored to the specific task/problem that we want the agent to solve. The sketch below shows how they fit together in a single interaction loop.
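To make these pieces concrete, here is a minimal sketch of the classic agent-environment interaction loop. The env and agent objects are hypothetical stand-ins for illustration; they are not part of the game introduced below.
# Illustrative agent-environment loop (hypothetical `env` and `agent` objects)
state = env.reset()                                 # the environment produces an initial state
done = False
while not done:
    action = agent.act(state)                       # the policy maps the current state to an action
    next_state, reward, done = env.step(action)     # the world reacts and emits a reward
    agent.learn(state, action, reward, next_state)  # the agent updates itself from the feedback
    state = next_state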
2. The Gym
Now that we understand the basics, you might be wondering: how do we actually build one of these systems? Let me show you the game I built.
For this post, I've written a bespoke video game that anyone can access and use to train their own machine learning agent to play the game.
The full code repository can be found on GitHub (please star it). I intend to use this repository for more games and simulation code, along with more advanced techniques that I'll implement in my next installments of posts on RL.
Delivery Drone
The delivery drone is a game where the objective is to fly a drone (presumably carrying deliveries) onto a platform. To win the game, we have to land. To land, we have to meet the following criteria:
- Be within landing proximity of the platform
- Be slow enough
- Be upright (landing upside down is more like crashing than landing)
All information on how to run the game can be found in the GitHub repository.
Here's what the game looks like:

If the drone flies off the screen or touches the ground, it is considered a 'crash' and thus results in a failure.
State description
The drone observes 15 continuous values that completely describe its situation:

Landing Success Criteria: The drone must simultaneously achieve:
- Horizontal alignment: within platform bounds (|dx| < 0.0625)
- Safe approach speed: less than 0.3
- Level orientation: tilt less than 20° (|angle| < 0.111)
- Correct altitude: bottom of the drone touching the platform top
It's like parallel parking: you need the right position, the right angle, and to be moving slowly enough not to crash!
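As a rough illustration, a check for these criteria could look like the sketch below. It is a minimal sketch that assumes the DroneState fields used later in this post; the exact touchdown test is my assumption, not the game's actual logic.
def meets_landing_criteria(state) -> bool:
    """Rough check of the landing criteria (illustrative, not the game's real code)."""
    aligned = abs(state.dx_to_platform) < 0.0625   # horizontally within platform bounds
    slow = state.speed < 0.3                       # safe approach speed
    upright = abs(state.drone_angle) < 0.111       # tilt below ~20 degrees (normalized)
    touching = abs(state.dy_to_platform) < 1e-2    # assumed: drone bottom at platform top
    return aligned and slow and upright and touching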
How can someone design a policy?
There are many ways to design a policy. It can be Bayesian (maintaining probability distributions over beliefs), it can be a simple lookup table for discrete states, a hand-coded rule system ("if distance < 10, then brake"), a decision tree, or, as we'll explore, a neural network that learns the mapping from states to actions through gradient descent.
Effectively, we want something that takes in the aforementioned state, performs some computation using this state, and returns what action should be performed.
Deep Learning to build a policy?
So how do we design a policy that can handle continuous states (like actual drone positions) and learn complex behaviors? This is where neural networks come in.
In the case of neural networks (or deep learning in general), it's usually best to work with action probabilities, i.e., "Which action is likely the best given the current state?". So, we can define a neural network that takes the state as a 'vector' or 'collection of vectors' as input. This vector or collection of vectors has to be constructed from the observed state. For our delivery drone game, the state vector is:
State vector (from our 2D drone game)
The drone observes its absolute position, velocities, orientation, fuel, platform position, and derived metrics. Our continuous state is:
s = [drone_x, drone_y, drone_vx, drone_vy, drone_angle, drone_angular_vel, drone_fuel, platform_x, platform_y, distance_to_platform, dx_to_platform, dy_to_platform, speed, landed, crashed]
Where each component represents:
- drone_x, drone_y, drone_vx, drone_vy: the drone's position and velocity
- drone_angle, drone_angular_vel: the drone's tilt and angular velocity
- drone_fuel: remaining fuel
- platform_x, platform_y: the platform's position
- distance_to_platform, dx_to_platform, dy_to_platform: distance and offsets from the drone to the platform
- speed: the magnitude of the drone's velocity
- landed, crashed: terminal status flags
All components are normalized to roughly [0,1] or [-1,1] ranges for stable neural network training.
Action space (three independent binary thrusters)
Instead of discrete action combinations, we treat each thruster independently:
- Main thruster (upward thrust)
- Left thruster (clockwise rotation)
- Right thruster (counter-clockwise rotation)
Each action is sampled from a Bernoulli distribution, giving us 3 independent binary decisions per timestep.
Neural-network policy (probabilistic with Bernoulli sampling)
Let fθ(s) be the network outputs after the sigmoid activation, so p = fθ(s) is a vector of three probabilities. The policy uses independent Bernoulli distributions:
πθ(a | s) = Π_i p_i^(a_i) · (1 - p_i)^(1 - a_i), with a_i ∈ {0, 1} for each of the three thrusters
Minimal Python sketch (from our implementation)
# build the state vector from a DroneState
s = np.array([
    state.drone_x, state.drone_y,
    state.drone_vx, state.drone_vy,
    state.drone_angle, state.drone_angular_vel,
    state.drone_fuel,
    state.platform_x, state.platform_y,
    state.distance_to_platform,
    state.dx_to_platform, state.dy_to_platform,
    state.speed,
    float(state.landed), float(state.crashed)
])
# network outputs probabilities for each thruster (after sigmoid)
action_probs = policy(torch.tensor(s, dtype=torch.float32))  # shape: (3,)
# sample each thruster independently from a Bernoulli distribution
dist = Bernoulli(probs=action_probs)
action = dist.sample()  # shape: (3,), e.g., [1, 0, 1] means main + right thrusters
This shows how we map the game's physical observations into a 15-dimensional normalized state vector and produce independent binary decisions for each thruster.
Code setup (part 1): Imports and game socket setup
We first need our game's socket listener to start. For this, you can navigate to the delivery_drone directory in my repository and run the following commands:
pip install -r requirements.txt  # run this once to set up the required modules
python socket_server.py --render human --port 5555 --num-games 1  # run this whenever you need to run the game in socket mode
NOTE: You will need PyTorch to run the code. Please make sure that you have set it up beforehand.
import os
import torch
import torch.nn as nn
import math
import numpy as np
from torch.distributions import Bernoulli

# Import the game's socket client
from delivery_drone.game.socket_client import DroneGameClient, DroneState

# set up the client and connect to the server
client = DroneGameClient()
client.connect()
How to design a reward function?
So what makes a good reward function? This is arguably the hardest part of RL (and where I spent a LOT of my debugging time 🫠).
The reward function is the soul of any RL implementation (and trust me, get this wrong and your agent will do the weirdest things). In principle, it should define what 'good' behaviour should be learnt and what 'bad' behaviour should not be learnt. Each action taken by our agent is characterised by the total collected reward for each behaviour trait the action exhibits. For example, if you want the drone to land gently, you might give positive rewards for being close to the platform and moving slowly, while penalizing crashes or running out of fuel; the agent then learns to maximise the sum of all these rewards over time.
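Before we get to the real thing, here is a deliberately tiny sketch of such a shaped reward. The weights are made up for illustration; the actual reward function I used appears later in this post.
def toy_reward(state) -> float:
    """Illustrative shaped reward: closeness + gentleness, minus failure penalties."""
    reward = 0.0
    reward += 1.0 - min(state.distance_to_platform, 1.0)   # closer to the platform is better
    reward -= 0.5 * state.speed                             # moving slowly is better
    if state.crashed:
        reward -= 100.0                                     # crashing is very bad
    if state.landed:
        reward += 100.0 + 10.0 * state.drone_fuel           # landing with fuel to spare is great
    return reward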
Advantage: A better way to measure effective reward
When training our policy, we don't just want to know whether an action rewarded us; we want to know whether it was better than usual. That is the intuition behind the advantage.
The advantage tells us: "Was this action better or worse than what we normally expect?"

In our implementation, we:
- Collect multiple episodes and calculate their returns (total discounted rewards)
- Compute the baseline as the mean return across all episodes
- Calculate advantage = return - baseline for each timestep
- Normalize advantages to have mean=0 and std=1 (for stable training)
Why this helps:
- Actions with positive advantage → better than average → increase their probability
- Actions with negative advantage → worse than average → decrease their probability
- Reduces variance in gradient updates (more stable learning)
This simple baseline already gives us much better training than raw returns! It weighs the full sequence of actions against the outcome (crashed or landed) so that the policy learns to take actions that lead to higher advantage. The toy example below walks through the arithmetic.
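Here is a small worked example of those four steps on made-up returns:
import numpy as np

# made-up returns from six episodes
returns = np.array([120.0, -40.0, 60.0, -200.0, 90.0, 30.0])

baseline = returns.mean()                      # mean return across episodes: 10.0
advantages = returns - baseline                # e.g., 120 -> +110 (better than average)
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)  # normalize to mean 0, std 1

print(baseline)      # 10.0
print(advantages)    # positive for above-average episodes, negative for the rest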
After a lot of trial and error, I designed the following reward function. The key insight was to condition rewards on both proximity AND vertical position: the drone must be above the platform to receive positive rewards, which prevents exploitation strategies like hovering beneath the platform.

A short note on inversely (and non-linearly) scaling rewards
Often, we want to reward behaviors inversely proportional to certain state values. For example, the distance to the platform ranges from 0 to ~1.41 (normalized by window width). We want a high reward when the distance ≈ 0 and a low reward when far away. I use various scaling functions for this:

Examples of other useful scaling functions
Helper functions:
def inverse_quadratic(x, decay=20, scaler=10, shifter=0):
    """Reward decreases quadratically with distance"""
    return scaler / (1 + decay * (x - shifter)**2)

def scaled_shifted_negative_sigmoid(x, scaler=10, shift=0, steepness=10):
    """Negative sigmoid function, scaled and shifted"""
    return scaler / (1 + np.exp(steepness * (x - shift)))

def calc_velocity_alignment(state: DroneState):
    """
    Calculate how well the drone's velocity is aligned with the optimal direction to the platform.
    Returns cosine similarity: 1.0 = perfect alignment, -1.0 = opposite direction
    """
    # Optimal direction: from drone to platform
    optimal_dx = -state.dx_to_platform
    optimal_dy = -state.dy_to_platform
    optimal_norm = math.sqrt(optimal_dx**2 + optimal_dy**2)
    if optimal_norm < 1e-6:  # Already at the platform
        return 1.0
    optimal_dx /= optimal_norm
    optimal_dy /= optimal_norm
    # Current velocity direction
    velocity_norm = state.speed
    if velocity_norm < 1e-6:  # Not moving
        return 0.0
    velocity_dx = state.drone_vx / velocity_norm
    velocity_dy = state.drone_vy / velocity_norm
    # Cosine similarity
    return velocity_dx * optimal_dx + velocity_dy * optimal_dy
Code for the current reward function:
def calc_reward(state: DroneState):
    rewards = {}
    total_reward = 0

    # 1. Time penalty - distance-based (penalize more when far away)
    minimum_time_penalty = 0.3
    maximum_time_penalty = 1.0
    rewards['time_penalty'] = -inverse_quadratic(
        state.distance_to_platform,
        decay=50,
        scaler=maximum_time_penalty - minimum_time_penalty
    ) - minimum_time_penalty
    total_reward += rewards['time_penalty']

    # 2. Distance & velocity alignment - ONLY when above the platform
    velocity_alignment = calc_velocity_alignment(state)
    dist = state.distance_to_platform
    rewards['distance'] = 0
    rewards['velocity_alignment'] = 0
    # Key condition: the drone must be above the platform (dy > 0) to get positive rewards
    if dist > 0.065 and state.dy_to_platform > 0:
        # Reward movement towards the platform when velocity is aligned
        if velocity_alignment > 0:
            rewards['distance'] = state.speed * scaled_shifted_negative_sigmoid(dist, scaler=4.5)
            rewards['velocity_alignment'] = 0.5
    total_reward += rewards['distance']
    total_reward += rewards['velocity_alignment']

    # 3. Angle penalty - distance-based threshold
    abs_angle = abs(state.drone_angle)
    max_angle = 0.20
    max_permissible_angle = ((max_angle - 0.111) * dist) + 0.111
    excess = abs_angle - max_permissible_angle
    rewards['angle'] = -max(excess, 0)
    total_reward += rewards['angle']

    # 4. Speed penalty - penalize excessive speed
    rewards['speed'] = 0
    speed = state.speed
    max_speed = 0.4
    if dist < 1:
        rewards['speed'] = -2 * max(speed - 0.1, 0)
    else:
        rewards['speed'] = -1 * max(speed - max_speed, 0)
    total_reward += rewards['speed']

    # 5. Vertical position penalty - penalize being below the platform
    rewards['vertical_position'] = 0
    if state.dy_to_platform > 0:  # Drone is above the platform (GOOD)
        rewards['vertical_position'] = 0
    else:  # Drone is below the platform (BAD!)
        rewards['vertical_position'] = state.dy_to_platform * 4.0  # Negative penalty
    total_reward += rewards['vertical_position']

    # 6. Terminal rewards
    rewards['terminal'] = 0
    if state.landed:
        rewards['terminal'] = 500.0 + state.drone_fuel * 100.0
    elif state.crashed:
        rewards['terminal'] = -200.0
        # Extra penalty for crashing far from the target
        if state.distance_to_platform > 0.3:
            rewards['terminal'] -= 100.0
    total_reward += rewards['terminal']

    rewards['total'] = total_reward
    return rewards
And yes, those magic numbers like 4.5, 0.065, and 4.0? They came from a lot of trial and error. Welcome to RL, where hyperparameter tuning is half art, half science, and half luck (yes, I know that's three halves).
def compute_returns(rewards, gamma=0.99):
    """
    Compute discounted returns (G_t) for each timestep based on the Bellman equation:
    G_t = r_t + γ*r_{t+1} + γ²*r_{t+2} + ...
    """
    returns = []
    G = 0
    # Compute backwards (more efficient)
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    return returns
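As a quick sanity check of compute_returns with made-up numbers:
print(compute_returns([1.0, 0.0, 2.0], gamma=0.9))
# G_2 = 2.0
# G_1 = 0.0 + 0.9 * 2.0 = 1.8
# G_0 = 1.0 + 0.9 * 1.8 = 2.62
# -> approximately [2.62, 1.8, 2.0]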
The important thing to note is that reward functions are subject to careful trial and error. One mistake or over-reward here, and the agent goes off optimizing behaviour that exploits the errors. This leads us to reward hacking.
Reward hacking
Reward hacking occurs when an agent finds an unintended way to maximize reward without actually solving the task you wanted it to solve. The agent isn't "cheating" on purpose; it's doing exactly what you told it to do, just not what you meant for it to do.
Classic example: If you reward a cleaning robot for "no visible dirt," it might learn to turn off its camera instead of cleaning!
My painful learning experience: I found this out the hard way. In an early version of my drone landing reward function, I gave the drone points for being "stable and slow" anywhere near the platform. Sounds reasonable, right? Wrong! Within 50 training episodes, my drone learned to just hover in place forever, racking up free points. It was technically optimal for my badly-designed reward function, but actually landing? Nope! I watched it hover for 5 minutes straight before I realized what was happening.
Here's the problematic code I wrote:
# DO NOT COPY THIS!
# If the drone is above the platform (|dx| < 0.0625) and close (distance < 0.25):
corridor_reward = inverse_quadratic(distance, decay=20, scaler=15)  # Up to 15 points
if stable and slow:
    corridor_reward += 10  # Extra 10 points!
# Total potential: 25 points per step!
An example of reward hacking in action:


Creating a policy network
As discussed above, we're going to use a neural network as the policy that powers the brain of our agent. Here's a simple implementation that takes in the state vector and computes a probability distribution over 3 independent actions:
- Activate the main thruster
- Activate the left thruster
- Activate the right thruster
def state_to_array(state):
    """Helper function to convert the DroneState dataclass to a tensor"""
    data = np.array([
        state.drone_x,
        state.drone_y,
        state.drone_vx,
        state.drone_vy,
        state.drone_angle,
        state.drone_angular_vel,
        state.drone_fuel,
        state.platform_x,
        state.platform_y,
        state.distance_to_platform,
        state.dx_to_platform,
        state.dy_to_platform,
        state.speed,
        float(state.landed),
        float(state.crashed)
    ])
    return torch.tensor(data, dtype=torch.float32)
class DroneGamerBoi(nn.Module):
    def __init__(self, state_dim=15):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.LayerNorm(128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.LayerNorm(128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.LayerNorm(64),
            nn.ReLU(),
            nn.Linear(64, 3),
            nn.Sigmoid()
        )

    def forward(self, state):
        if isinstance(state, DroneState):
            state = state_to_array(state)
        return self.network(state)
Effectively, instead of the action space being a 2^3 = 8 space, I reduced it to decisions over the 3 independent thrusters using Bernoulli sampling. This reduction makes optimization easier by treating each thruster independently rather than as one big categorical choice (at least that's what I think; I may be wrong, but it worked for me!). The sketch below compares the two options.
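For comparison, here is a hedged sketch of the two parameterizations; the categorical variant is hypothetical and is not used anywhere in this post.
import torch
from torch.distributions import Bernoulli, Categorical

probs = torch.tensor([0.9, 0.2, 0.7])   # per-thruster probabilities from the sigmoid head

# Option A (used here): 3 independent Bernoulli decisions
bern = Bernoulli(probs=probs)
action = bern.sample()                   # e.g., tensor([1., 0., 1.])
log_prob = bern.log_prob(action).sum()   # joint log-prob = sum of the 3 independent log-probs

# Option B (hypothetical alternative): one categorical choice over all 2^3 = 8 combinations
combo_logits = torch.randn(8)            # a softmax head would output 8 logits instead of 3 probabilities
cat = Categorical(logits=combo_logits)
combo = cat.sample()                     # an index 0..7, e.g., 5 -> binary 101 -> main + right thrusters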
Training a policy with policy gradients
Learning Strategies: When Should We Update?
Here's a question that tripped me up early on: should we update the policy after every single action, or wait and see how the whole episode plays out? It turns out this choice matters a lot.
When you try to optimize based purely on the reward received for a single action, you run into a high-variance problem (basically, the training signal is super noisy and the gradients point in random directions!). What I mean by "high variance" is that the optimization algorithm receives extremely mixed signals in the gradient used to update the parameters of our policy network. For the same action, the system may emit one particular gradient direction, but a slightly different state (with the same action) might yield something completely opposite. This leads to slow, and possibly no, training.
There are three ways we could update our policy:
Learning after every action (Per-Step Updates)
The drone fires its thruster once, gets a small reward, and immediately updates its whole strategy. This is like adjusting your basketball form after every single shot: way too reactive! One lucky action that increases the reward doesn't necessarily mean the agent did well, and one unlucky action doesn't mean the agent did badly. The learning signal is just too noisy.
My first attempt: I tried this approach early on. The drone would wiggle around randomly, make one lucky move that got a tiny bit more reward, immediately overfit to that exact move, and then crash repeatedly trying to reproduce it. It was painful to watch, like watching someone learn the wrong lesson from pure chance.
Learning after one full attempt (Per-Episode Updates)
Better! Now we let the drone try to land (or crash), see how the whole attempt went, and then update. This is like finishing an episode and then thinking about what to improve. At least now we see the full consequences of our actions. But here's the problem: what if that one landing was just lucky? Or unlucky? We're still basing our learning on a single data point.
Learning from multiple attempts (Multi-Episode Batch Updates)
This is the sweet spot. We run several (6 in my case) drone landing attempts concurrently, see how they all went, and then update our policy based on the average performance. Some attempts might get lucky, some unlucky, but averaged together, they give a much clearer picture of what actually works. Although this is quite heavy on the computer, if you can run it, it works far better than either of the previous methods. Of course, this method is by no means the best, but it's fairly simple to understand and implement; there are other (and better) methods.
Here's the code to collect multiple episodes in the drone game:
def collect_episodes(client: DroneGameClient, policy: nn.Module, max_steps=300):
    """
    Collect episodes with early stopping

    Args:
        client: The game's socket client
        policy: PyTorch module
        max_steps: Maximum steps per episode (default: 300)
    """
    num_games = client.num_games

    # Initialize storage
    all_episodes = [{'states': [], 'actions': [], 'log_probs': [], 'rewards': [], 'done': False}
                    for _ in range(num_games)]

    # Reset all games
    game_states = [client.reset(game_id) for game_id in range(num_games)]
    step_counts = [0] * num_games  # Track steps per game

    while not all(ep['done'] for ep in all_episodes):
        # Batch active games
        batch_states = []
        active_game_ids = []
        for game_id in range(num_games):
            if not all_episodes[game_id]['done']:
                batch_states.append(state_to_array(game_states[game_id]))
                active_game_ids.append(game_id)
        if len(batch_states) == 0:
            break

        # Batched inference
        batch_states_tensor = torch.stack(batch_states)
        batch_action_probs = policy(batch_states_tensor)
        batch_dist = Bernoulli(probs=batch_action_probs)
        batch_actions = batch_dist.sample()
        batch_log_probs = batch_dist.log_prob(batch_actions).sum(dim=1)

        # Execute actions
        for i, game_id in enumerate(active_game_ids):
            action = batch_actions[i]
            log_prob = batch_log_probs[i]
            next_state, _, done, _ = client.step({
                "main_thrust": int(action[0]),
                "left_thrust": int(action[1]),
                "right_thrust": int(action[2])
            }, game_id)
            reward = calc_reward(next_state)

            # Store data
            all_episodes[game_id]['states'].append(batch_states[i])
            all_episodes[game_id]['actions'].append(action)
            all_episodes[game_id]['log_probs'].append(log_prob)
            all_episodes[game_id]['rewards'].append(reward['total'])

            # Update state and step count
            game_states[game_id] = next_state
            step_counts[game_id] += 1

            # Check done conditions
            if done or step_counts[game_id] >= max_steps:
                # Apply a timeout penalty if we hit max steps without landing
                if step_counts[game_id] >= max_steps and not next_state.landed:
                    all_episodes[game_id]['rewards'][-1] -= 500  # Timeout penalty
                all_episodes[game_id]['done'] = True

    # Return episodes
    return [(ep['states'], ep['actions'], ep['log_probs'], ep['rewards'])
            for ep in all_episodes]
The Maximization-Minimization Puzzle
In typical deep learning (supervised learning), we minimize a loss function:
L(θ) = (1/N) Σ_i loss(fθ(x_i), y_i)
We want to go "downhill" towards lower loss (better predictions).
But in reinforcement learning, we want to maximize total reward! Our goal is:
J(θ) = E[ Σ_t γ^t · r_t ]
The problem: deep learning frameworks are built for minimization, not maximization. How do we turn "maximize reward" into "minimize loss"?
The simple trick: maximizing J(θ) is the same as minimizing -J(θ)
So our loss function becomes:
Loss(θ) = -J(θ)
Now, gradient descent will climb up the reward landscape (it's really gradient ascent), because we're going downhill on the negative reward!
The REINFORCE Algorithm (Policy Gradient)
The policy gradient theorem (Williams, 1992) tells us how to compute the gradient of the expected reward:
∇θ J(θ) = E[ Σ_t ∇θ log πθ(a_t | s_t) · G_t ]
(I know, I know, this looks intimidating. But stick with me, it's actually quite elegant once you see what's happening!)
Where:
- πθ(a_t | s_t) is the probability the policy assigns to taking action a_t in state s_t
- G_t is the discounted return from timestep t onwards (the compute_returns output above)
In plain English (because that formula is dense):
- If action a_t led to a high return G_t, increase its probability
- If action a_t led to a low return G_t, decrease its probability
- The gradient tells us in which direction to adjust the neural network weights
Adding a Baseline (Variance Reduction)
Using raw returns G_t leads to high variance (noisy gradients). We improve on this by subtracting a baseline b(s_t):
∇θ J(θ) = E[ Σ_t ∇θ log πθ(a_t | s_t) · (G_t - b(s_t)) ]
The simplest baseline is the mean return:
b = (1/N) Σ_i G_i
This gives us the advantage: A_t = G_t - b
- Positive advantage → the action was better than average → increase its probability
- Negative advantage → the action was worse than average → decrease its probability
Why this helps: Instead of "this action gave reward 100" (is that good?), we have "this action gave 100 when the average is 50" (that's great!). Relative performance is clearer than absolute performance.
Our Implementation
In our drone landing code, we use REINFORCE with a baseline:
# 1. Collect episodes and compute returns
returns = compute_returns(rewards, gamma=0.99)  # G_t with discounting
# 2. Compute the baseline (mean of all returns)
baseline = returns_tensor.mean()
# 3. Compute advantages
advantages = returns_tensor - baseline
# 4. Normalize advantages (extra variance reduction)
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
# 5. Compute the loss (note the negative sign!)
loss = -(log_probs_tensor * advantages).mean()
# 6. Gradient descent
optimizer.zero_grad()
loss.backward()
optimizer.step()
We repeat the above loop as many times as we want, or until the drone learns to land properly. Check out this notebook for more code! A sketch of how the pieces fit together is shown below.
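To tie the pieces together, here is a minimal sketch of the full training loop, assuming an Adam optimizer and the helpers defined above; the exact hyperparameters and bookkeeping in my notebook may differ.
policy = DroneGamerBoi(state_dim=15)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)  # assumed optimizer and learning rate

for iteration in range(1000):  # assumed number of training iterations
    episodes = collect_episodes(client, policy, max_steps=300)

    # Flatten returns and log-probs across all collected episodes
    all_returns, all_log_probs = [], []
    for states, actions, log_probs, rewards in episodes:
        all_returns.extend(compute_returns(rewards, gamma=0.99))
        all_log_probs.extend(log_probs)

    returns_tensor = torch.tensor(all_returns, dtype=torch.float32)
    log_probs_tensor = torch.stack(all_log_probs)

    # REINFORCE with a mean baseline and normalized advantages
    advantages = returns_tensor - returns_tensor.mean()
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    loss = -(log_probs_tensor * advantages).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()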
Current Results (the reward function is still quite flawed)
After countless hours of tweaking rewards, adjusting hyperparameters, and watching my drone crash in creative new ways, I finally got it working (mostly!). Although my reward function is not perfect, I do think it is able to teach a policy network. Here's a successful landing:

Pretty cool, right? But here's where things get interesting (and frustrating)…
The persistent hovering problem: A fundamental limitation
Even with the improved reward function that conditions rewards on vertical position (dy_to_platform > 0), the trained policy still exhibits a frustrating behavior: when the drone misses the platform, it learns to descend towards it but then hovers beneath the platform rather than attempting to land.
I spent over a week staring at reward plots (and changing reward functions), wondering why my "fixed" reward function was still producing this hovering behavior. When I finally plotted the accumulated rewards, the pattern became crystal clear, and honestly, I couldn't even be mad at the agent for finding this strategy.
What's happening?
By analyzing the accumulated rewards over an episode in which the drone hovers beneath the platform, I discovered something interesting:


The plots reveal that:
- Distance reward (orange): Accumulates to ~+70 early, then plateaus (no more rewards)
- Velocity alignment (green): Accumulates to ~+30 early, then plateaus
- Time penalty (blue): Steadily accumulates to ~-250 (keeps getting worse)
- Vertical position (brown): Steadily accumulates to ~-200 (penalty for being below)
- Total reward: Ends around -400 to -600 (after the timeout)
The key insight: The drone descends from above the platform (accumulating distance and velocity rewards on the way down), passes through the platform height, and then settles into hovering beneath it instead of completing the landing. Once below, it stops getting positive rewards (notice how the distance and velocity lines plateau around steps 50-60) but keeps accumulating time penalties and vertical position penalties. Nevertheless, this strategy is still viable because attempting to land risks an immediate -200 crash penalty, while hovering beneath "only" costs ~-400 to -600 over the full episode.
Why does this happen?
The fundamental issue is that our reward function r(s', a) can only see the current state, not the trajectory. Think about it: at any single timestep, the reward function can't tell the difference between:
- A drone making progress towards landing (approaching from above with a controlled descent)
- A drone exploiting the reward structure (oscillating to farm rewards)
Both might have dy_to_platform > 0 at a given moment, so they receive identical rewards! The agent isn't dumb; it's just optimizing exactly what you told it to optimize.
So what would actually fix this?
To truly solve this problem, I personally think that rewards should depend on state transitions: r(s, a, s') instead of just r(s, a). This would let you reward based on the following (s being the current state and s' the next state):
- Progress: Only reward if distance(s') < distance(s) (actually getting closer!)
- Vertical improvement: Only reward if the drone is consistently moving upward relative to the platform
- Trajectory consistency: Penalize rapid direction changes that indicate oscillation
This is a more principled solution than trying to patch the current reward function with increasingly harsh penalties (which is basically what I tried for a while, and it didn't really work). The oscillation exploit exists because we're fundamentally missing information about the trajectory. A sketch of what such a transition-based term could look like follows below.
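Purely as an illustration (this is not implemented in the repo), a transition-based progress term could look like the sketch below; the weights are made up.
def progress_reward(prev_state: DroneState, next_state: DroneState) -> float:
    """Illustrative r(s, a, s') term: reward only actual progress towards the platform."""
    reward = 0.0
    # Progress: only reward if the drone actually got closer
    if next_state.distance_to_platform < prev_state.distance_to_platform:
        reward += 2.0 * (prev_state.distance_to_platform - next_state.distance_to_platform)
    # Vertical improvement: reward moving upward relative to the platform while below it
    if prev_state.dy_to_platform <= 0 and next_state.dy_to_platform > prev_state.dy_to_platform:
        reward += 0.5
    # Trajectory consistency: penalize sharp horizontal reversals that indicate oscillation
    if prev_state.drone_vx * next_state.drone_vx < 0 and abs(next_state.drone_vx) > 0.1:
        reward -= 0.5
    return reward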
In the next post, I'll explore Actor-Critic methods and techniques that can incorporate temporal information to prevent these exploitation strategies. Stay tuned!
If you find a way to fix this, please reach out to me!
This brings us to the end of this post on "the simplest way to do Deep Reinforcement Learning."
Next on the list
- Actor-Critic methods
- DQL
- PPO & GRPO
- Applying this to systems that require vision 👀
References
Foundational Stuff
- Turing, A. M. (1950). "Computing Machinery and Intelligence."
- Original Turing Test paper
- Williams, R. J. (1992). "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning." Machine Learning.
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
Classical Conditioning & Behavioral Psychology
- Pavlov, I. P. (1927). Conditioned Reflexes: An Investigation of the Physiological Activity of the Cerebral Cortex. Oxford University Press.
- Classical conditioning experiments
- Skinner, B. F. (1938). The Behavior of Organisms: An Experimental Analysis. Appleton-Century-Crofts.
- Operant conditioning and the Skinner Box
Policy Gradient Methods
- Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). "Policy Gradient Methods for Reinforcement Learning with Function Approximation." Advances in Neural Information Processing Systems.
- Theoretical foundations of policy gradients
- Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). "High-Dimensional Continuous Control Using Generalized Advantage Estimation." arXiv preprint arXiv:1506.02438.
Neural Networks & Deep Learning
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
Online Resources
- Karpathy, A. "Deep Reinforcement Learning: Pong from Pixels."
- Spinning Up in Deep RL by OpenAI
Code Repository
- Jumle, V. (2025). "Reinforcement Learning 101: Delivery Drone Landing."
Friend
- Singh, Navroop Kaur. (2025): For providing "Positive Vibes & Attention". Thank you!
All images in this article are either AI-generated (using Gemini), personally made by me, or screenshots & plots that I made.

