Land on the Moon with Reinforcement Learning

Train an AI agent to fire rocket thrusters and touch down safely—in simulation.

Last reviewed: March 2026

Overview

In 2019, the Israeli spacecraft Beresheet attempted to land on the Moon and crashed. Landing a spacecraft is extraordinarily difficult—the vehicle must fire thrusters at exactly the right moment to slow down, stay balanced, and touch down gently. Humans write complex control algorithms to do this, but reinforcement learning (RL) offers a different approach: let an agent learn by trying thousands of times in simulation and accumulating rewards for good behavior.

The LunarLander environment is a physics-based simulation where your agent fires a main engine and two side thrusters, choosing among four discrete actions, and must land between two flags on a simulated lunar surface. The agent receives a reward for landing softly, a penalty for crashing, and small penalties for firing its engines, so fuel is not free. Over tens of thousands of training steps, it discovers a landing strategy entirely on its own—no human ever tells it to "fire the main engine when altitude is low."

Stable-Baselines3 makes this accessible by providing polished implementations of state-of-the-art RL algorithms. You will use Proximal Policy Optimization (PPO), the same algorithm OpenAI used to train OpenAI Five, the first system to beat human world champions at Dota 2. By the end of the project, you will have an agent that consistently lands safely and a solid conceptual grasp of how modern autonomous aerospace systems learn to operate.

What You'll Learn

  • Define agent, environment, state, action, reward, and policy in the RL framework.
  • Set up and interact with a Gymnasium environment using the standard reset/step API.
  • Train a PPO agent using Stable-Baselines3 and monitor reward progress during training.
  • Render and record a video of a trained agent to evaluate its landing behavior.
  • Explain why RL is attractive for autonomous aerospace guidance compared to hand-coded controllers.

Step-by-Step Guide

1. Set up the environment

Install the required packages: pip install "gymnasium[box2d]" stable-baselines3 matplotlib (quote the bracketed extra on shells like zsh). The box2d extra provides the physics engine; on some systems you may also need pip install swig first. Test your install by running a random agent: create the environment with env = gymnasium.make("LunarLander-v3", render_mode="human"), call env.reset(), and loop 200 steps calling env.step(env.action_space.sample()). Watch it crash spectacularly—that is the baseline you are about to beat.

2. Understand the observation and action spaces

Print env.observation_space and env.action_space. The observation is an 8-element vector: x position, y position, x velocity, y velocity, angle, angular velocity, and two booleans for leg contact. The action space has four discrete choices: do nothing, fire left thruster, fire main engine, fire right thruster. Sketch how you would land the lander manually—this builds intuition for what the RL agent needs to discover.

3. Train a PPO agent

Import the algorithm with from stable_baselines3 import PPO. Create your training environment and model: env = gymnasium.make("LunarLander-v3"), model = PPO("MlpPolicy", env, verbose=1, tensorboard_log="./lunar_tb/"). Call model.learn(total_timesteps=500_000). Training takes 5–15 minutes on a typical laptop. Watch the mean episode reward in the log—it should climb from around −200 (crashing every time) toward +200 (consistent safe landings).

4. Evaluate the trained agent

After training, evaluate your agent over 10 episodes: create a new environment, call obs, _ = env.reset(), then loop calling action, _ = model.predict(obs, deterministic=True) and obs, reward, terminated, truncated, _ = env.step(action) until terminated or truncated is True. Accumulate the total reward per episode and print the mean. A well-trained PPO agent should average above +200. Save your model with model.save("lunar_lander_ppo").

5. Record a video of successful landings

Use the RecordVideo wrapper: env = gymnasium.wrappers.RecordVideo(gymnasium.make("LunarLander-v3", render_mode="rgb_array"), video_folder="./videos/", episode_trigger=lambda e: True). The episode_trigger makes the wrapper record every episode rather than its default sampling schedule. Run 3 episodes with your trained agent, call env.close() so the final video is written to disk, and watch the saved MP4 files. Annotate a screenshot identifying which thruster fires at each phase: descent, attitude correction, and final touchdown. This makes a compelling visual for a portfolio or science fair presentation.

6. Experiment and reflect

Try reducing training to 100,000 steps and observe how the agent performs—this illustrates the effect of training budget. Change the action space by switching to the continuous version of the environment (LunarLanderContinuous-v3) and see if the agent still converges. Write a one-page reflection connecting RL concepts to real aerospace applications: SpaceX's Grasshopper tests, autonomous drone landing, and Mars entry, descent, and landing are all relevant examples.

Go Further

  • Try the SAC (Soft Actor-Critic) algorithm on the continuous action version of the environment and compare final performance to PPO.
  • Add wind to the environment (the enable_wind and wind_power parameters in newer Gymnasium versions) and retrain—observe how the agent adapts its strategy.
  • Visualize the agent's learned policy by plotting the action it takes across a grid of (x, y) positions at fixed velocity—this is called a policy heatmap.
  • Write a brief research summary comparing your RL approach to the explicit guidance algorithms used in Apollo's powered descent—what are the trade-offs in interpretability and reliability?