# PPO Lunar Lander Agent
Author: Ginni Garg
This repository contains a Reinforcement Learning (RL) model trained to safely land a spacecraft in the LunarLander-v2 environment.
Even if you are completely new to Reinforcement Learning, this README will help you understand:
- What this project does
- What PPO is
- What LunarLander is
- How the model was trained
- How to use the model
- How to reproduce results
## What is This Project?
This project trains an AI agent to land a spacecraft safely on the moon.
The spacecraft must:
- Control its engines
- Avoid crashing
- Land between two flags
- Use fuel efficiently
The agent learns by trial and error, much like a person learning a video game.
## What is Reinforcement Learning?
Reinforcement Learning (RL) is a type of Machine Learning where:
- An agent interacts with an environment
- It takes actions
- It receives rewards or penalties
- It learns to maximize total reward
Think of it like training a dog:
- Good behavior → treat (reward)
- Bad behavior → no treat (penalty)
Over time, the agent learns the best strategy.
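The interaction loop described above can be sketched in plain Python. The `CoinFlipEnv` below is a made-up toy environment (not part of this project) used only to show the agent → action → reward cycle; a real environment like LunarLander follows the same pattern with richer observations and rewards.

```python
import random

class CoinFlipEnv:
    """Toy environment: reward +1 when the agent's guess (0 or 1) matches a hidden coin."""
    def reset(self, seed=None):
        self.rng = random.Random(seed)
        return 0  # dummy observation
    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1 if action == coin else 0
        return 0, reward, False  # observation, reward, done

env = CoinFlipEnv()
obs = env.reset(seed=42)
total_reward = 0
for _ in range(100):
    action = random.choice([0, 1])       # a random (untrained) policy
    obs, reward, done = env.step(action)
    total_reward += reward               # the agent's goal: maximize this
```

An RL algorithm like PPO replaces the random `action = random.choice(...)` line with a learned policy that improves as `total_reward` feedback accumulates.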
## What is LunarLander-v2?
LunarLander is a simulation environment from Gymnasium.
The goal:
- Land a spacecraft safely between two flags.
The agent receives:
- Positive reward for landing successfully
- Negative reward for crashing
- Small penalties for wasting fuel
## Environment Details
### Observation Space (What the Agent Sees)
The agent receives 8 values:
| Index | Meaning |
|---|---|
| 0 | Horizontal position |
| 1 | Vertical position |
| 2 | Horizontal velocity |
| 3 | Vertical velocity |
| 4 | Angle |
| 5 | Angular velocity |
| 6 | Left leg touching ground (0 or 1) |
| 7 | Right leg touching ground (0 or 1) |
These numbers describe the spacecraft's current state.
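A tiny helper (not part of the trained model, just an illustration) can attach the names from the table above to a raw 8-element observation vector, which makes the state easier to inspect:

```python
# Field names follow the observation table above.
OBS_FIELDS = [
    "x_position", "y_position",
    "x_velocity", "y_velocity",
    "angle", "angular_velocity",
    "left_leg_contact", "right_leg_contact",
]

def describe_observation(obs):
    """Map the raw 8-element observation vector to labelled values."""
    assert len(obs) == len(OBS_FIELDS)
    return dict(zip(OBS_FIELDS, obs))

# Example observation values are made up for illustration.
state = describe_observation([0.1, 1.4, -0.05, -0.3, 0.02, 0.0, 0.0, 0.0])
# state["y_position"] -> 1.4
```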
### Action Space (What the Agent Can Do)
There are 4 possible actions:
| Action | Meaning |
|---|---|
| 0 | Do nothing |
| 1 | Fire left engine |
| 2 | Fire main engine |
| 3 | Fire right engine |
The agent must choose one of these actions at each time step.
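To make the action table concrete, here is a hand-written heuristic policy. This is purely illustrative and is *not* the trained PPO policy; the cutoff `-0.5` on vertical velocity is an arbitrary made-up threshold:

```python
# Action codes follow the action table above.
ACTIONS = {
    0: "do nothing",
    1: "fire left engine",
    2: "fire main engine",
    3: "fire right engine",
}

def naive_policy(obs):
    """Illustrative rule, not the trained model:
    fire the main engine when falling fast, otherwise do nothing."""
    y_velocity = obs[3]  # index 3 = vertical velocity
    return 2 if y_velocity < -0.5 else 0
```

PPO replaces such hand-written rules with a neural network that maps all 8 observation values to a probability over the 4 actions.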
## What Algorithm Was Used?
### Proximal Policy Optimization (PPO)
PPO is a popular and stable Reinforcement Learning algorithm.
Why PPO?
- Stable training
- Good performance
- Widely used in industry
- Balances exploration and exploitation
It updates the policy in small safe steps to avoid instability.
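The "small safe steps" come from PPO's clipped surrogate objective: the probability ratio between the new and old policy is clipped to `[1 - eps, 1 + eps]`, which caps how much a single update can change the policy. A minimal single-sample sketch (simplified from the full PPO loss, which averages this over a batch and adds value and entropy terms):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective for one sample:
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).
    `ratio` is pi_new(a|s) / pi_old(a|s); `advantage` estimates
    how much better the action was than average."""
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# With a positive advantage, gains from pushing the ratio past 1 + eps
# are clipped away, so there is no incentive for an oversized update.
```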
## Training Details
### Model Architecture
- Policy: MLP (Multi-Layer Perceptron)
- Framework: Stable-Baselines3
- Algorithm: PPO
### Hyperparameters Used

```python
PPO(
    policy="MlpPolicy",
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    verbose=1,
)
```

### Evaluation Results

- mean_reward = 212.56 +/- 94.26
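To reproduce training, the hyperparameters above can be plugged into Stable-Baselines3. This is a sketch, assuming `gymnasium[box2d]` and `stable-baselines3` are installed; the total timestep budget and the save path are not stated in this README, so the values below (`1_000_000` steps, `"ppo-LunarLander-v2"`) are illustrative choices:

```python
# Hyperparameters as listed above.
HYPERPARAMS = dict(
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
)

def train(total_timesteps=1_000_000, save_path="ppo-LunarLander-v2"):
    """Train a PPO agent on LunarLander-v2 and save it to disk."""
    import gymnasium as gym
    from stable_baselines3 import PPO

    env = gym.make("LunarLander-v2")
    model = PPO("MlpPolicy", env, verbose=1, **HYPERPARAMS)
    model.learn(total_timesteps=total_timesteps)
    model.save(save_path)
    return model

# To run training:
#     model = train()
# To load a saved model later:
#     from stable_baselines3 import PPO
#     model = PPO.load("ppo-LunarLander-v2")
```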