# PPO Lunar Lander Agent
Author: Ginni Garg
This repository contains a Reinforcement Learning (RL) model trained to safely land a spacecraft in the LunarLander-v2 environment.
Even if you are completely new to Reinforcement Learning, this README will help you understand:
- What this project does
- What PPO is
- What LunarLander is
- How the model was trained
- How to use the model
- How to reproduce results
## What is This Project?
This project trains an AI agent to land a spacecraft safely on the moon.
The spacecraft must:
- Control its engines
- Avoid crashing
- Land between two flags
- Use fuel efficiently
The agent learns by trial and error, much like a person learning a video game.
## What is Reinforcement Learning?
Reinforcement Learning (RL) is a type of Machine Learning where:
- An agent interacts with an environment
- It takes actions
- It receives rewards or penalties
- It learns to maximize total reward
Think of it like training a dog:
- Good behavior → treat (reward)
- Bad behavior → no treat (penalty)
Over time, the agent learns the best strategy.
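The interaction loop described above can be sketched in plain Python. The `CoinFlipEnv` below is a made-up toy environment (not part of this project) used only to show the agent → action → reward cycle; a real environment like LunarLander follows the same pattern with richer observations and rewards.

```python
import random

class CoinFlipEnv:
    """Toy environment: reward +1 when the agent's guess (0 or 1) matches a hidden coin."""
    def reset(self, seed=None):
        self.rng = random.Random(seed)
        return 0  # dummy observation
    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1 if action == coin else 0
        return 0, reward, False  # observation, reward, done

env = CoinFlipEnv()
obs = env.reset(seed=42)
total_reward = 0
for _ in range(100):
    action = random.choice([0, 1])       # a random (untrained) policy
    obs, reward, done = env.step(action)
    total_reward += reward               # the agent's goal: maximize this
```

An RL algorithm like PPO replaces the random `action = random.choice(...)` line with a learned policy that improves as `total_reward` feedback accumulates.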
## What is LunarLander-v2?
LunarLander is a simulation environment from Gymnasium.
The goal:
- Land a spacecraft safely between two flags.
The agent receives:
- Positive reward for landing successfully
- Negative reward for crashing
- Small penalties for wasting fuel
## Environment Details
### Observation Space (What the Agent Sees)
The agent receives 8 values:
| Index | Meaning |
|---|---|
| 0 | Horizontal position |
| 1 | Vertical position |
| 2 | Horizontal velocity |
| 3 | Vertical velocity |
| 4 | Angle |
| 5 | Angular velocity |
| 6 | Left leg touching ground (0 or 1) |
| 7 | Right leg touching ground (0 or 1) |
These numbers describe the spacecraft's current state.
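A tiny helper (not part of the trained model, just an illustration) can attach the names from the table above to a raw 8-element observation vector, which makes the state easier to inspect:

```python
# Field names follow the observation table above.
OBS_FIELDS = [
    "x_position", "y_position",
    "x_velocity", "y_velocity",
    "angle", "angular_velocity",
    "left_leg_contact", "right_leg_contact",
]

def describe_observation(obs):
    """Map the raw 8-element observation vector to labelled values."""
    assert len(obs) == len(OBS_FIELDS)
    return dict(zip(OBS_FIELDS, obs))

# Example observation values are made up for illustration.
state = describe_observation([0.1, 1.4, -0.05, -0.3, 0.02, 0.0, 0.0, 0.0])
# state["y_position"] -> 1.4
```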
### Action Space (What the Agent Can Do)
There are 4 possible actions:
| Action | Meaning |
|---|---|
| 0 | Do nothing |
| 1 | Fire left engine |
| 2 | Fire main engine |
| 3 | Fire right engine |
The agent must choose one of these actions at each time step.
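To make the action table concrete, here is a hand-written heuristic policy. This is purely illustrative and is *not* the trained PPO policy; the cutoff `-0.5` on vertical velocity is an arbitrary made-up threshold:

```python
# Action codes follow the action table above.
ACTIONS = {
    0: "do nothing",
    1: "fire left engine",
    2: "fire main engine",
    3: "fire right engine",
}

def naive_policy(obs):
    """Illustrative rule, not the trained model:
    fire the main engine when falling fast, otherwise do nothing."""
    y_velocity = obs[3]  # index 3 = vertical velocity
    return 2 if y_velocity < -0.5 else 0
```

PPO replaces such hand-written rules with a neural network that maps all 8 observation values to a probability over the 4 actions.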
## What Algorithm Was Used?
### Proximal Policy Optimization (PPO)
PPO is a popular and stable Reinforcement Learning algorithm.
Why PPO?
- Stable training
- Good performance
- Widely used in industry
- Balances exploration and exploitation
It updates the policy in small safe steps to avoid instability.
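The "small safe steps" come from PPO's clipped surrogate objective: the probability ratio between the new and old policy is clipped to `[1 - eps, 1 + eps]`, which caps how much a single update can change the policy. A minimal single-sample sketch (simplified from the full PPO loss, which averages this over a batch and adds value and entropy terms):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective for one sample:
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).
    `ratio` is pi_new(a|s) / pi_old(a|s); `advantage` estimates
    how much better the action was than average."""
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# With a positive advantage, gains from pushing the ratio past 1 + eps
# are clipped away, so there is no incentive for an oversized update.
```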
## Training Details
### Model Architecture
- Policy: MLP (Multi-Layer Perceptron)
- Framework: Stable-Baselines3
- Algorithm: PPO
### Hyperparameters Used

```python
PPO(
    policy="MlpPolicy",
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    verbose=1,
)
```

### Evaluation Results

- mean_reward = 212.56 +/- 94.26
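To reproduce training, the hyperparameters above can be plugged into Stable-Baselines3. This is a sketch, assuming `gymnasium[box2d]` and `stable-baselines3` are installed; the total timestep budget and the save path are not stated in this README, so the values below (`1_000_000` steps, `"ppo-LunarLander-v2"`) are illustrative choices:

```python
# Hyperparameters as listed above.
HYPERPARAMS = dict(
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
)

def train(total_timesteps=1_000_000, save_path="ppo-LunarLander-v2"):
    """Train a PPO agent on LunarLander-v2 and save it to disk."""
    import gymnasium as gym
    from stable_baselines3 import PPO

    env = gym.make("LunarLander-v2")
    model = PPO("MlpPolicy", env, verbose=1, **HYPERPARAMS)
    model.learn(total_timesteps=total_timesteps)
    model.save(save_path)
    return model

# To run training:
#     model = train()
# To load a saved model later:
#     from stable_baselines3 import PPO
#     model = PPO.load("ppo-LunarLander-v2")
```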