RL Autopilot for JSBSim Flight Simulator
Train an agent to fly a real aircraft model — six degrees of freedom included.
Last reviewed: March 2026
Overview
Most reinforcement learning tutorials train agents on stylised toy environments — CartPole, MuJoCo half-cheetah — that bear little resemblance to real engineering systems. JSBSim is a high-fidelity open-source flight dynamics model used by real flight simulators; it includes accurate aerodynamic models, propulsion dynamics, atmospheric effects, and landing gear physics for dozens of aircraft. Wrapping JSBSim in a Gymnasium interface creates an RL environment that is physically meaningful, making this project a rigorous bridge between the tutorial world and real aerospace applications.
You will implement a custom gymnasium.Env class that initialises JSBSim with a chosen aircraft (Cessna 172 or F-16 model), steps the simulation forward, extracts state observations (altitude, airspeed, pitch, roll, heading, angular rates), maps RL actions to control surface deflections and throttle, and computes a shaped reward that incentivises altitude hold, heading hold, and gentle control inputs. This reward shaping is where domain knowledge matters: a poorly shaped reward produces agents that find unintended shortcuts (such as climbing indefinitely to avoid heading error).
With the environment in place, you will train PPO and SAC agents using Stable-Baselines3, analyse learning curves, compare converged policy performance, and visualise the resulting flight trajectories. The project develops simultaneous competence in flight dynamics, RL algorithm implementation, environment design, and experiment analysis — a combination that directly targets roles in autonomous flight system development at companies such as Boeing Phantom Works, Aurora Flight Sciences, DARPA contractor teams, and advanced air mobility start-ups.
What You'll Learn
- ✓ Implement a custom Gymnasium environment wrapping a high-fidelity physics simulator (JSBSim)
- ✓ Design a reward function that captures altitude hold, heading hold, and control smoothness objectives
- ✓ Train PPO and SAC agents using Stable-Baselines3 and interpret learning curves and episode statistics
- ✓ Evaluate trained agents on unseen initial conditions and compare robustness across algorithms
- ✓ Visualise 3D flight trajectories and analyse failure modes where the policy breaks down
Step-by-Step Guide
Install JSBSim and verify the Python bindings
Install the jsbsim Python package and verify it by running the bundled Cessna 172 simulation for 60 seconds and printing altitude and airspeed at each step. Read the JSBSim property tree documentation to understand how to get and set properties (control surface deflections, engine throttle, atmospheric conditions) via the Python API. Select the aircraft model you will use for the project.
Implement the custom Gymnasium environment
Subclass gymnasium.Env and implement __init__, reset, step, and render. In reset, initialise JSBSim with a random starting altitude (2,000–4,000 ft), airspeed (80–120 knots), and heading. In step, apply the RL action (aileron, elevator, rudder, throttle) to the JSBSim property tree, advance the simulation by 0.1 s, and return the observation vector. Define the observation space as a Box with altitude error, airspeed error, pitch, roll, heading error, and the three angular rates as components.
Design and tune the reward function
Implement a shaped reward: a large negative penalty for exceeding structural limits (bank angle > 60°, airspeed outside the envelope), a Gaussian reward centred on zero altitude error, a Gaussian reward centred on zero heading error, and a small negative penalty proportional to control input magnitude to encourage smooth flight. Test the reward manually by running a deterministic control input sequence and verifying the numerical reward values are in the expected range before starting RL training.
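One way to sketch this reward as a standalone function, assuming the 8-component observation layout from the environment step (the sigma values, penalty weights, and airspeed envelope below are illustrative starting points, not tuned values):

```python
import numpy as np

def shaped_reward(obs, action,
                  alt_sigma_ft=100.0, hdg_sigma_rad=np.deg2rad(10),
                  ctrl_penalty=0.01, crash_penalty=-100.0,
                  max_bank_rad=np.deg2rad(60),
                  ias_envelope_kts=(60.0, 160.0)):
    """Shaped reward for obs = [alt_err_ft, ias_err_kts, pitch, roll,
    hdg_err_rad, p, q, r]; all thresholds are illustrative assumptions."""
    alt_err, ias_err, _, roll, hdg_err = obs[:5]
    ias = 100.0 + ias_err   # 100 kt reference airspeed
    # Hard penalty on leaving the structural/airspeed envelope.
    if abs(roll) > max_bank_rad or not (ias_envelope_kts[0] <= ias <= ias_envelope_kts[1]):
        return crash_penalty
    r_alt = np.exp(-0.5 * (alt_err / alt_sigma_ft) ** 2)    # Gaussian on altitude error
    r_hdg = np.exp(-0.5 * (hdg_err / hdg_sigma_rad) ** 2)   # Gaussian on heading error
    r_ctrl = -ctrl_penalty * float(np.sum(np.square(action)))  # smoothness penalty
    return float(r_alt + r_hdg + r_ctrl)
```

The manual sanity check from this step becomes trivial: a perfectly trimmed state with zero control input should return 2.0 (both Gaussians at their peak), and an excessive bank angle should return the crash penalty.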
Train PPO and SAC agents
Use Stable-Baselines3 to train a PPO agent (MlpPolicy, 2×256 hidden layers) for 2 million environment steps. Log episode reward, episode length, and policy entropy using EvalCallback every 50,000 steps. Repeat with SAC (also 2M steps). Save checkpoints at regular intervals so you can evaluate intermediate policies and identify when learning stalls or diverges.
Evaluate and compare agents
Load the best checkpoint for each algorithm and run 50 evaluation episodes with randomised initial conditions. Compute mean and standard deviation of cumulative reward, episode survival time (episodes end on out-of-envelope violation), and RMS altitude and heading error over the episode. Plot learning curves side by side and create a summary table comparing PPO and SAC on all metrics.
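The evaluation loop and metric aggregation can be sketched as follows; the rollout loop assumes the SB3 `model.predict` API and the 8-component observation layout from the environment step (altitude error at index 0, heading error at index 4), while `episode_metrics` is a pure helper you can unit-test directly.

```python
import numpy as np

def episode_metrics(rewards, alt_errs_ft, hdg_errs_deg, dt=0.1):
    """Per-episode summary: return, survival time, RMS tracking errors."""
    return {
        "return": float(np.sum(rewards)),
        "survival_s": len(rewards) * dt,
        "rms_alt_ft": float(np.sqrt(np.mean(np.square(alt_errs_ft)))),
        "rms_hdg_deg": float(np.sqrt(np.mean(np.square(hdg_errs_deg)))),
    }

def evaluate(model, env, n_episodes=50):
    """Deterministic rollouts; returns (mean, std) per metric across episodes."""
    results = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        rewards, alt_errs, hdg_errs = [], [], []
        done = False
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, r, terminated, truncated, _ = env.step(action)
            rewards.append(r)
            alt_errs.append(obs[0])               # altitude error component
            hdg_errs.append(np.rad2deg(obs[4]))   # heading error component
            done = terminated or truncated
        results.append(episode_metrics(rewards, alt_errs, hdg_errs))
    return {k: (float(np.mean([r[k] for r in results])),
                float(np.std([r[k] for r in results])))
            for k in results[0]}
```

Running `evaluate` once per algorithm on the same set of seeds gives you the paired numbers the summary table needs.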
Visualise trajectories and analyse failure modes
Record state trajectories for 10 representative evaluation episodes per agent. Plot 3D flight paths using Matplotlib and overlay altitude and heading error time series. Identify the initial conditions or wind perturbations (add wind-gust noise in the environment's step method, retrain, and repeat the evaluation) that cause the policy to fail. Write a failure mode analysis section explaining which state combinations the agent has not generalised to and how the reward function or training distribution could be improved.
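A minimal plotting helper might look like this sketch. It assumes per-step position arrays logged during evaluation (in JSBSim these would come from the `position/long-gc-deg`, `position/lat-gc-deg`, and `position/h-sl-ft` properties) and the 0.1 s agent step from the environment sketch.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")   # headless backend so figures save without a display
import matplotlib.pyplot as plt

def plot_trajectory(lon_deg, lat_deg, alt_ft, out_path="trajectory.png"):
    """3D flight path alongside the altitude time series for one episode."""
    fig = plt.figure(figsize=(10, 4))
    ax3d = fig.add_subplot(1, 2, 1, projection="3d")
    ax3d.plot(lon_deg, lat_deg, alt_ft)
    ax3d.set_xlabel("lon (deg)")
    ax3d.set_ylabel("lat (deg)")
    ax3d.set_zlabel("alt (ft)")
    ax2d = fig.add_subplot(1, 2, 2)
    t = np.arange(len(alt_ft)) * 0.1   # 0.1 s agent step
    ax2d.plot(t, alt_ft)
    ax2d.set_xlabel("t (s)")
    ax2d.set_ylabel("alt (ft)")
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)
```

Overlaying the ten episode traces on one pair of axes, coloured by algorithm, makes systematic failure modes (e.g. divergence only from high-bank initial conditions) easy to spot.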
Career Connection
See how this project connects to real aerospace careers.
Aerospace Engineer →
Autonomous flight control engineers at UAM companies and defence contractors use RL-in-simulator workflows identical to this project for initial policy development.
Pilot →
Understanding how autopilot and autonomy systems are trained gives pilots insight into the failure modes and limitations of automation they rely on daily.
Drone & UAV Ops →
Fixed-wing UAV autopilot development increasingly uses RL for attitude control and path following; this project provides a directly applicable template.
Flight Dispatcher →
Dispatchers approving autonomous flight operations benefit from understanding how RL autopilots are evaluated and what their performance envelopes look like.
Go Further
- Add turbulence and wind shear to the JSBSim environment and retrain; compare the wind-robust policy against the nominal-only trained policy in windy conditions.
- Implement a curriculum learning schedule that starts with small perturbations and gradually increases initial condition randomisation as the agent improves.
- Extend the task to a full approach-and-landing task: the agent must descend to pattern altitude, align with a runway heading, and achieve a stable ILS-like glidepath.
- Export the trained policy to ONNX and integrate it into a ROS 2 node that commands a simulated aircraft in Gazebo via MAVLink.