Predict Engine Failure with Kaggle Data
Use real NASA sensor data to predict when a jet engine is about to fail.
Last reviewed: March 2026
Overview
Unscheduled engine maintenance is one of the most expensive problems in commercial aviation—a single grounded aircraft can cost an airline tens of thousands of dollars per hour. Predictive maintenance uses sensor data to estimate how much useful life remains in an engine before it needs service, replacing costly "replace-on-schedule" approaches with "replace-when-needed" precision. The dataset you will use in this project was generated by NASA specifically to develop and benchmark these algorithms.
The C-MAPSS (Commercial Modular Aero-Propulsion System Simulation) dataset contains time-series sensor readings from hundreds of simulated turbofan engines, each run until failure. Your job is to explore the data, engineer useful features, and train a regression model that predicts Remaining Useful Life (RUL), the number of operating cycles an engine has left before it fails, for engines the model has never seen. You will work entirely in Kaggle Notebooks, a free, browser-based Python environment that requires no local installation.
This project gives you direct experience with the kind of industrial machine learning that aerospace companies like Pratt & Whitney, GE Aviation, and Rolls-Royce are actively deploying on their fleets. Predictive maintenance is a growing field, and the ability to work with multivariate time-series sensor data is a genuinely marketable skill even at the high school level.
What You'll Learn
- ✓ Load, inspect, and visualize multivariate time-series sensor data with pandas and Matplotlib.
- ✓ Engineer meaningful features from raw sensor readings (rolling means, degradation trends).
- ✓ Build and evaluate a Random Forest regression model using scikit-learn.
- ✓ Interpret RMSE and MAE as model performance metrics in a safety-critical context.
- ✓ Explain what Remaining Useful Life means and why accurate prediction matters for aviation safety.
Step-by-Step Guide
Set up a Kaggle account and download the dataset
Create a free account at kaggle.com. Search for "NASA CMAPSS Jet Engine Simulated Data" and download it, or open it directly in a Kaggle Notebook to avoid any local setup. The dataset contains four subsets (FD001–FD004), each with its own training, test, and RUL files, covering different fault modes and operating conditions. Start with FD001, the simplest subset, which has one operating condition and one fault mode.
Load and explore the data
Load train_FD001.txt into a pandas DataFrame. The file has no header, so assign column names: unit number, cycle, three operational settings, and 21 sensor measurements. Use df.describe() and df.groupby("unit")["cycle"].max() to understand the distribution of engine lifetimes. Plot a few sensors over time for a single engine to see how readings drift as the engine degrades toward failure.
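The loading step can be sketched as follows. The column labels (unit, cycle, op1 to op3, s1 to s21) and the helper name load_cmapss are our own choices, not names defined by the dataset, and the two-row sample only stands in for the real file so the snippet runs anywhere.

```python
import io
import pandas as pd

# Column labels are our own choice; the raw file has no header row.
COLUMNS = (
    ["unit", "cycle"]
    + [f"op{i}" for i in range(1, 4)]   # three operational settings
    + [f"s{i}" for i in range(1, 22)]   # 21 sensor measurements
)

def load_cmapss(source):
    """Read a whitespace-delimited C-MAPSS file (a path or a file-like object)."""
    return pd.read_csv(source, sep=r"\s+", header=None, names=COLUMNS)

# Tiny two-row stand-in so the sketch runs without the real file.
sample = io.StringIO(
    " ".join(["1", "1"] + ["0.5"] * 24) + "\n"
    + " ".join(["1", "2"] + ["0.6"] * 24) + "\n"
)
df = load_cmapss(sample)

# Lifetime of each engine = its last recorded cycle.
lifetimes = df.groupby("unit")["cycle"].max()
print(df.shape)
print(lifetimes)
```

In a Kaggle Notebook, pass the path of train_FD001.txt inside the attached dataset's input folder instead of the sample buffer.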
Create the Remaining Useful Life target column
For each engine unit, RUL at any cycle = (max cycle for that unit) − (current cycle). Compute this with a grouped transform: df["max_cycle"] = df.groupby("unit")["cycle"].transform("max"), then df["RUL"] = df["max_cycle"] - df["cycle"]. Plot the distribution of RUL values and of per-engine lifetimes (max_cycle), and note that most engines fail between 100 and 350 cycles. Many practitioners also cap RUL at 125 for the training labels, on the reasoning that a healthy engine early in its life shows little measurable degradation. Try both approaches.
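A minimal sketch of the labeling step, including the optional cap. The function name add_rul and its cap parameter are hypothetical, and the toy frame stands in for the real training data.

```python
import pandas as pd

def add_rul(df, cap=None):
    """Return a copy of df with max_cycle and RUL columns added.

    If cap is given, RUL values above it are clipped to cap
    (for example cap=125 for the capped-label variant).
    """
    out = df.copy()
    out["max_cycle"] = out.groupby("unit")["cycle"].transform("max")
    out["RUL"] = out["max_cycle"] - out["cycle"]
    if cap is not None:
        out["RUL"] = out["RUL"].clip(upper=cap)
    return out

# Toy data: engine 1 runs 3 cycles, engine 2 runs 2 cycles.
toy = pd.DataFrame({"unit": [1, 1, 1, 2, 2], "cycle": [1, 2, 3, 1, 2]})
labeled = add_rul(toy)          # uncapped labels
capped = add_rul(toy, cap=1)    # tiny cap chosen only to show the effect
print(labeled)
print(capped)
```

The last row of each engine always has RUL 0, which is a quick sanity check on the real data as well.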
Engineer features and split the data
Sensors like s1, s5, s10, and s16 show little variation and can be dropped. For remaining sensors, add rolling-mean features over a window of 10 cycles to smooth noise: df["s2_roll10"] = df.groupby("unit")["s2"].transform(lambda x: x.rolling(10, min_periods=1).mean()). Split by unit number: use units 1–80 for training and 81–100 for validation so that no single engine appears in both sets (a critical point in time-series ML).
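The feature and split steps might look like the sketch below. The helper names add_rolling_means and split_by_unit are hypothetical, and the code assumes rows are already ordered by cycle within each unit, as they are in the raw files.

```python
import pandas as pd

def add_rolling_means(df, sensors, window=10):
    """Add a per-engine rolling mean column for each listed sensor."""
    out = df.copy()
    for s in sensors:
        out[f"{s}_roll{window}"] = out.groupby("unit")[s].transform(
            lambda x: x.rolling(window, min_periods=1).mean()
        )
    return out

def split_by_unit(df, last_train_unit=80):
    """Split so that every engine lands wholly in train or wholly in validation."""
    train = df[df["unit"] <= last_train_unit]
    val = df[df["unit"] > last_train_unit]
    return train, val

# Toy data with a window of 2 so the rolling effect is visible.
toy = pd.DataFrame({
    "unit":  [1, 1, 1, 2, 2],
    "cycle": [1, 2, 3, 1, 2],
    "s2":    [1.0, 3.0, 5.0, 10.0, 20.0],
})
feat = add_rolling_means(toy, ["s2"], window=2)
train, val = split_by_unit(feat, last_train_unit=1)
print(feat)
```

Grouping by unit before rolling keeps one engine's readings from bleeding into the next engine's first cycles.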
Train a Random Forest and evaluate predictions
Drop the unit, cycle, and max_cycle columns from the features (max_cycle combined with cycle would leak the answer), then fit a RandomForestRegressor(n_estimators=100, random_state=42) on your training features and RUL labels. Predict on the validation set and compute RMSE with root_mean_squared_error(y_val, y_pred) (scikit-learn 1.4 or later; on older versions use mean_squared_error(y_val, y_pred, squared=False)) and MAE with mean_absolute_error. Plot predicted vs. actual RUL as a scatter plot; points near the diagonal indicate accurate predictions. Note which engines are hardest to predict accurately.
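A hedged sketch of the training and evaluation step. The data here is synthetic (one made-up sensor that drifts as RUL falls) so the snippet runs on its own, and fit_and_score and NON_FEATURES are hypothetical names. RMSE is taken as the square root of mean_squared_error, which works across scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Identifier and label columns that must never be used as model inputs.
NON_FEATURES = ["unit", "cycle", "max_cycle", "RUL"]

def fit_and_score(train, val):
    """Fit a Random Forest on train and report RMSE and MAE on val."""
    feature_cols = [c for c in train.columns if c not in NON_FEATURES]
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(train[feature_cols], train["RUL"])
    pred = model.predict(val[feature_cols])
    rmse = float(np.sqrt(mean_squared_error(val["RUL"], pred)))
    mae = float(mean_absolute_error(val["RUL"], pred))
    return model, feature_cols, rmse, mae

# Synthetic stand-in: 10 engines, one noisy sensor tied to RUL.
rng = np.random.default_rng(0)
frames = []
for unit in range(1, 11):
    life = int(rng.integers(50, 80))
    cycle = np.arange(1, life + 1)
    rul = life - cycle
    frames.append(pd.DataFrame({
        "unit": unit,
        "cycle": cycle,
        "s2": -rul + rng.normal(0, 2, life),
        "RUL": rul,
    }))
data = pd.concat(frames, ignore_index=True)
train, val = data[data["unit"] <= 8], data[data["unit"] > 8]

model, feature_cols, rmse, mae = fit_and_score(train, val)
print(f"features={feature_cols} RMSE={rmse:.2f} MAE={mae:.2f}")
```

RMSE penalizes large misses more heavily than MAE, so a wide gap between the two suggests a few engines with badly wrong predictions.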
Apply the model to the test set and score it
Load test_FD001.txt and RUL_FD001.txt. The test file contains each engine's sensor history truncated at some point before failure, and the RUL file gives the true remaining life at that last observed cycle, one value per engine. Compute your rolling features over each engine's available history, then keep only the final row per engine for prediction. Predict RUL for each test engine and compare to the true values in the RUL file. Compute your final RMSE and compare it to published baselines in the Kaggle community notebooks; scores below 20 cycles are considered strong.
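The scoring step can be sketched as below, under the assumption that rolling features were already computed on each test engine's rows and that the true RUL values are listed in unit order. The helper names last_cycle_rows and rmse are hypothetical, and the arrays are stand-ins for the RUL file and for model predictions.

```python
import numpy as np
import pandas as pd

def last_cycle_rows(test_df):
    """Keep only the final observed cycle of each engine, ordered by unit."""
    ordered = test_df.sort_values(["unit", "cycle"])
    return ordered.groupby("unit").tail(1).reset_index(drop=True)

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy test frame, deliberately out of order.
toy_test = pd.DataFrame({
    "unit":  [2, 1, 1, 1, 2],
    "cycle": [1, 1, 2, 3, 2],
    "s2_roll10": [10.0, 1.0, 2.0, 3.0, 15.0],
})
last = last_cycle_rows(toy_test)

true_rul = np.array([10.0, 20.0])   # stand-in for the values in the RUL file
pred_rul = np.array([13.0, 16.0])   # stand-in for model.predict(...) on `last`
score = rmse(true_rul, pred_rul)
print(last)
print(round(score, 3))
```

Sorting by unit before scoring matters because the RUL file has no unit column; its rows line up with engines only by position.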
Career Connection
See how this project connects to real aerospace careers.
Aerospace Engineer →
Engine health management systems are a growing sub-discipline; engineers who can design and validate RUL models are in high demand at MRO companies and OEMs.
Aviation Maintenance →
Predictive maintenance tools built on models like this one are changing how maintenance crews prioritize inspections and order parts ahead of need.
Avionics Technician →
Modern FADEC systems log hundreds of sensor parameters; technicians who understand ML-based diagnostics can interpret health monitoring alerts more effectively.
Aerospace Engineer →
Data-driven prognostics is a core component of next-generation aircraft health management, connecting structural, aerodynamic, and propulsion engineering teams.
Go Further
- Train an LSTM (Long Short-Term Memory) neural network using Keras on the full time-series rather than handcrafted features, and compare RMSE to your Random Forest baseline.
- Apply your model to FD002, FD003, and FD004 datasets and analyze why performance changes with multiple operating conditions.
- Build a simple dashboard in Streamlit that takes sensor readings as input and displays a predicted RUL with a warning threshold.
- Read the original NASA technical report "Damage Propagation Modeling for Aircraft Engine Run-to-Failure Simulation" and connect each sensor to its physical meaning in the engine.