Predict Clear-Air Turbulence from Weather Data
Build a classifier that warns pilots about invisible rough air
Last reviewed: March 2026
Overview
Clear-air turbulence (CAT) is one of aviation's most dangerous weather phenomena because it's invisible — it doesn't show up on radar and occurs in cloudless skies. CAT causes hundreds of injuries to airline passengers and crew every year, and climate change research suggests it's getting worse. The ability to predict CAT from atmospheric data is an active area of research at NOAA, NASA, and major airlines.
In this project, you'll combine two real data sources: PIREPs (Pilot Reports of turbulence encounters, submitted by flight crews in real-time) and ERA5 reanalysis data (a global atmospheric dataset from the European Centre for Medium-Range Weather Forecasts). You'll match each turbulence report with the atmospheric conditions at that location and altitude, then train a classifier to predict turbulence severity from meteorological variables like wind shear, temperature gradient, and jet stream proximity.
This is a genuinely useful classification problem with real-world impact. The atmospheric features you'll work with — Richardson number, vertical wind shear, deformation — are the same variables used in operational turbulence forecasting systems like the GTG (Graphical Turbulence Guidance) product that airlines use today.
What You'll Learn
- ✓ Acquire and process real pilot turbulence reports (PIREPs) from NOAA archives
- ✓ Work with gridded atmospheric reanalysis data (ERA5) and extract features at specific locations
- ✓ Engineer physically meaningful features for turbulence prediction (Richardson number, wind shear, deformation)
- ✓ Handle class imbalance in a multi-class classification problem (most reports are "smooth")
- ✓ Evaluate classifier performance with confusion matrices, ROC curves, and domain-appropriate metrics
Step-by-Step Guide
Understand Clear-Air Turbulence Physics
CAT is caused by Kelvin-Helmholtz instability — when wind shear across a temperature inversion becomes strong enough to overcome the stabilizing effect of the temperature gradient. This is quantified by the Richardson number (Ri): when Ri drops below 0.25, the flow becomes dynamically unstable and turbulence breaks out.
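For reference, the Richardson number described above is usually written in the following textbook form. The symbols introduced here are assumptions of this sketch rather than notation from the guide: θ is potential temperature, g is gravitational acceleration, z is height, and u and v are the horizontal wind components.

```latex
\[
\mathrm{Ri} = \frac{N^2}{S^2}, \qquad
N^2 = \frac{g}{\theta}\,\frac{\partial \theta}{\partial z}, \qquad
S^2 = \left(\frac{\partial u}{\partial z}\right)^2
    + \left(\frac{\partial v}{\partial z}\right)^2
\]
```

A large N² (strong stratification) raises Ri and suppresses turbulence, while a large S² (strong vertical shear) lowers it toward the critical value of 0.25.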
Study the atmospheric conditions that produce CAT: jet stream boundaries (especially the poleward, cold side of the jet, which is the north side in the Northern Hemisphere), mountain waves propagating into the stratosphere, and upper-level fronts. Understanding the physics will guide your feature engineering and help you interpret model results.
Collect and Process PIREPs
Download pilot turbulence reports from the NOAA Aviation Weather Center archive or the Iowa Environmental Mesonet PIREP database. Each PIREP contains: location (lat/lon), altitude (flight level), time, turbulence severity (smooth/light/moderate/severe/extreme), and aircraft type.
Clean the data aggressively. Discard reports with missing locations or altitudes. Note that turbulence perception varies by aircraft — a "moderate" report from a Boeing 737 would be "severe" in a Cessna 172. For this project, treat the reported severity at face value, but document this limitation.
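The cleaning rules above could be sketched with pandas roughly as follows. The column names (`lat`, `lon`, `flight_level`, `severity`, `aircraft`) and the sample rows are hypothetical; real PIREP archives use their own field names and will need additional parsing.

```python
import pandas as pd

# Ordinal encoding of the severity scale named in the text.
SEVERITY_ORDER = ["smooth", "light", "moderate", "severe", "extreme"]
SEVERITY_TO_CODE = {name: code for code, name in enumerate(SEVERITY_ORDER)}


def clean_pireps(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop unusable reports and add an integer severity code.

    Column names are assumptions made for this sketch.
    """
    # Discard reports with missing position or altitude.
    df = raw.dropna(subset=["lat", "lon", "flight_level"])
    # Discard physically impossible coordinates.
    valid = df["lat"].between(-90, 90) & df["lon"].between(-180, 180)
    df = df[valid].copy()
    # Normalize severity labels and drop anything outside the known scale.
    df["severity"] = df["severity"].astype(str).str.strip().str.lower()
    df = df[df["severity"].isin(SEVERITY_ORDER)].copy()
    # Severity is taken at face value (no aircraft-type correction).
    df["severity_code"] = df["severity"].map(SEVERITY_TO_CODE)
    return df.reset_index(drop=True)


# Made-up rows, for illustration only.
sample = pd.DataFrame(
    {
        "lat": [40.1, None, 35.0, 95.0],
        "lon": [-105.2, -100.0, -90.5, -80.0],
        "flight_level": [350, 370, None, 330],
        "severity": ["Moderate ", "light", "severe", "smooth"],
        "aircraft": ["B737", "A320", "C172", "B738"],
    }
)
cleaned = clean_pireps(sample)
print(cleaned[["lat", "lon", "flight_level", "severity", "severity_code"]])
```

Keeping the aircraft column in the cleaned table makes it easy to revisit the aircraft-type limitation later.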
Obtain ERA5 Atmospheric Data
Register for a free Copernicus Climate Data Store account and download ERA5 pressure-level data for your PIREP time period. Request the variables u-wind, v-wind, temperature, geopotential, and vertical velocity on the pressure levels that bracket typical cruise altitudes (200, 250, and 300 hPa).
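Use Python's xarray library to work with the NetCDF files. For each PIREP, extract the atmospheric state at the nearest grid point, pressure level, and analysis time (ERA5 pressure-level fields are hourly). This spatial and temporal matching step is critical: decide up front whether you use nearest-neighbor selection or interpolation, apply the same method to every report, and record the choice.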
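As an illustration of the matching step, here is a minimal NumPy sketch of nearest-neighbor selection. It assumes a regional 0.25° grid with longitudes in the 0 to 360 convention and converts flight level to pressure with the International Standard Atmosphere; the function names and the synthetic temperature field are hypothetical. With xarray, the same lookup is typically done with `Dataset.sel(..., method="nearest")`.

```python
import numpy as np

ERA5_LEVELS_HPA = np.array([200.0, 250.0, 300.0])  # levels named in the text


def flight_level_to_hpa(flight_level: float) -> float:
    """Convert a flight level (hundreds of feet) to pressure using the ISA."""
    h = flight_level * 100.0 * 0.3048  # pressure altitude in metres
    if h <= 11000.0:
        # Troposphere: constant lapse rate of 6.5 K/km.
        return 1013.25 * (1.0 - 0.0065 * h / 288.15) ** 5.25588
    # Lower stratosphere: isothermal layer at 216.65 K.
    return 226.32 * float(np.exp(-(h - 11000.0) / 6341.62))


def nearest_index(axis_values: np.ndarray, target: float) -> int:
    """Index of the coordinate value closest to target (no wrap-around)."""
    return int(np.argmin(np.abs(axis_values - target)))


def match_pirep(lat, lon, flight_level, lats, lons, levels=ERA5_LEVELS_HPA):
    """Return (level, lat, lon) indices of the grid cell nearest a report."""
    k = nearest_index(levels, flight_level_to_hpa(flight_level))
    i = nearest_index(lats, lat)
    j = nearest_index(lons, lon % 360.0)  # map -180..180 to 0..360
    return k, i, j


# Synthetic regional grid and temperature field, for illustration only.
lats = np.linspace(60.0, 20.0, 161)    # 0.25 degree spacing, north to south
lons = np.linspace(230.0, 300.0, 281)  # 0.25 degree spacing
temperature = np.full((len(ERA5_LEVELS_HPA), len(lats), len(lons)), 220.0)

pressure = flight_level_to_hpa(350)
k, i, j = match_pirep(40.1, -105.2, 350, lats, lons)
print(f"FL350 is about {pressure:.1f} hPa, matched to {ERA5_LEVELS_HPA[k]:.0f} hPa")
print(f"grid point: lat {lats[i]:.2f}, lon {lons[j]:.2f}, T = {temperature[k, i, j]:.1f} K")
```

The sketch ignores longitude wrap-around at 0°/360° and the time dimension, both of which a global, multi-day dataset would need to handle.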
Engineer Turbulence Prediction Features
Compute physically meaningful features from the raw atmospheric variables. Key features include: vertical wind shear (wind speed change between pressure levels), horizontal temperature gradient (computed via finite differences on the grid), Richardson number (Ri = N²/S² where N is Brunt-Väisälä frequency and S is vertical shear), deformation (stretching and shearing of the flow), and divergence.
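The features listed above might be computed along these lines. This is a sketch under simplifying assumptions: the Richardson number is a bulk value between two pressure levels, heights are in metres (ERA5 geopotential must first be divided by g), and the deformation function assumes a uniform grid spacing in metres, which is only approximately true on a latitude/longitude grid. All function names and input values are hypothetical.

```python
import numpy as np

G = 9.80665     # gravitational acceleration, m/s^2
KAPPA = 0.2857  # R_d / c_p for dry air


def potential_temperature(temp_k, pressure_hpa):
    """Potential temperature referenced to 1000 hPa."""
    return temp_k * (1000.0 / pressure_hpa) ** KAPPA


def shear_and_richardson(lower, upper):
    """Bulk vertical shear (1/s) and Ri = N^2 / S^2 between two levels.

    Each argument is a dict with keys u, v (m/s), t (K), z (m), p (hPa).
    """
    dz = upper["z"] - lower["z"]
    theta_lo = potential_temperature(lower["t"], lower["p"])
    theta_hi = potential_temperature(upper["t"], upper["p"])
    theta_mean = 0.5 * (theta_lo + theta_hi)
    n2 = (G / theta_mean) * (theta_hi - theta_lo) / dz
    s2 = ((upper["u"] - lower["u"]) ** 2 + (upper["v"] - lower["v"]) ** 2) / dz**2
    # Floor S^2 to avoid division by zero in shear-free layers.
    return np.sqrt(s2), n2 / np.maximum(s2, 1e-10)


def deformation_and_divergence(u, v, dx, dy):
    """Total deformation and divergence; arrays are indexed [y, x]."""
    du_dy, du_dx = np.gradient(u, dy, dx)
    dv_dy, dv_dx = np.gradient(v, dy, dx)
    stretching = du_dx - dv_dy
    shearing = dv_dx + du_dy
    return np.hypot(stretching, shearing), du_dx + dv_dy


# Illustrative values only, not real ERA5 data.
level_300 = {"u": 30.0, "v": 0.0, "t": 228.0, "z": 9200.0, "p": 300.0}
level_250 = {"u": 45.0, "v": 5.0, "t": 220.0, "z": 10400.0, "p": 250.0}
shear, ri = shear_and_richardson(level_300, level_250)
print(f"vertical shear = {shear:.4f} 1/s, Ri = {ri:.2f}")

# Pure stretching flow u = a*x, v = -a*y has deformation 2a and zero divergence.
a = 1e-5
x = np.arange(0.0, 10) * 25_000.0
y = np.arange(0.0, 8) * 25_000.0
X, Y = np.meshgrid(x, y)
deform, div = deformation_and_divergence(a * X, -a * Y, dx=25_000.0, dy=25_000.0)
```

Testing on an analytic flow such as the stretching field at the end is a cheap way to catch sign or axis-order mistakes before running on real data.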
Also compute jet stream proximity: distance to the nearest wind speed maximum above 60 knots at 250 hPa. CAT frequency peaks at the edges of the jet stream, not at its core. These derived features encode the physical mechanisms that cause turbulence.
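One possible way to compute the jet proximity feature is sketched below. It treats a "wind speed maximum" as any grid point that exceeds the threshold and is at least as fast as its eight neighbors, then takes the great-circle distance to the closest such point. That definition, the function names, and the synthetic jet are assumptions of this sketch.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0
KNOTS_PER_MS = 1.943844


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in degrees, arrays allowed."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def jet_core_mask(speed_kt, threshold_kt=60.0):
    """True where speed exceeds the threshold and is a 3x3 local maximum."""
    ny, nx = speed_kt.shape
    padded = np.pad(speed_kt, 1, constant_values=-np.inf)
    neighborhood_max = np.max(
        [padded[1 + di:1 + di + ny, 1 + dj:1 + dj + nx]
         for di in (-1, 0, 1) for dj in (-1, 0, 1)],
        axis=0,
    )
    return (speed_kt >= neighborhood_max) & (speed_kt > threshold_kt)


def jet_distance_km(lat, lon, lats, lons, u, v, threshold_kt=60.0):
    """Distance from a report to the nearest jet-core grid point (NaN if none)."""
    speed_kt = np.hypot(u, v) * KNOTS_PER_MS
    mask = jet_core_mask(speed_kt, threshold_kt)
    if not mask.any():
        return float("nan")
    lat_grid, lon_grid = np.meshgrid(lats, lons, indexing="ij")
    return float(haversine_km(lat, lon, lat_grid[mask], lon_grid[mask]).min())


# Synthetic zonal jet centred on 40N at 250 hPa, for illustration only.
lats = np.linspace(20.0, 60.0, 41)
lons = np.linspace(230.0, 300.0, 71)
u = 50.0 * np.exp(-((lats[:, None] - 40.0) / 5.0) ** 2) * np.ones((1, len(lons)))
v = np.zeros_like(u)

distance = jet_distance_km(45.0, 260.0, lats, lons, u, v)
print(f"distance to jet core: {distance:.0f} km")
```

Because the distance is unsigned, it cannot tell which side of the jet a report lies on; a signed variant would need the jet orientation as well.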
Handle Class Imbalance and Train
Turbulence is rare — most of the atmosphere is smooth at any given time. Your dataset will likely be heavily imbalanced: ~70% smooth, ~20% light, ~8% moderate, ~2% severe. If you train naively, the model will just predict "smooth" for everything and get 70% accuracy.
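Address this with resampling or reweighting: SMOTE (synthetic oversampling of minority classes, provided by the imbalanced-learn package rather than scikit-learn itself), class weights (penalizing misclassification of rare events more heavily, available through scikit-learn's class_weight parameter), or a cost-sensitive Random Forest. Apply any oversampling inside each training fold only, so synthetic samples never leak into validation data. Train with stratified cross-validation to ensure each fold contains all severity levels. Try both Random Forest (scikit-learn) and Gradient Boosting (XGBoost, a separate library) classifiers.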
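A minimal sketch of the class-weight approach with stratified cross-validation follows. The data is synthetic, generated to mimic the imbalance quoted above, so the printed scores say nothing about real turbulence prediction. SMOTE and XGBoost are left out here because they live in separate packages.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the feature table: 4 classes with roughly the
# 70/20/8/2 split described in the text (0=smooth ... 3=severe).
X, y = make_classification(
    n_samples=2000,
    n_features=8,
    n_informative=5,
    n_redundant=0,
    n_classes=4,
    n_clusters_per_class=1,
    weights=[0.70, 0.20, 0.08, 0.02],
    random_state=0,
)

# class_weight="balanced" reweights each class inversely to its frequency,
# so errors on rare severe cases cost more during training.
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)

# Stratified folds keep the class proportions the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Macro-averaged recall weights every class equally, unlike accuracy.
scores = cross_val_score(model, X, y, cv=cv, scoring="recall_macro")
print("class counts:", np.bincount(y))
print("macro recall per fold:", np.round(scores, 3))
```

Swapping the scoring string for "accuracy" on the same folds shows how much the headline number can flatter a model on imbalanced data.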
Evaluate with Aviation-Appropriate Metrics
Accuracy alone is misleading for imbalanced problems. Report the confusion matrix, per-class precision and recall, and ROC-AUC for each severity level. In aviation, recall for moderate/severe turbulence is the most important metric — missing a real turbulence event is far worse than issuing a false warning.
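The confusion matrix and per-class scores can be produced with scikit-learn as in the sketch below. The label arrays are hand-made for illustration and are not model results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

LABELS = [0, 1, 2, 3]  # 0=smooth, 1=light, 2=moderate, 3=severe
NAMES = ["smooth", "light", "moderate", "severe"]

# Made-up true and predicted severities for twelve reports.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3])
y_pred = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 2, 1, 3])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=LABELS)

# average=None returns one score per class instead of a single summary.
recall = recall_score(y_true, y_pred, labels=LABELS, average=None)
precision = precision_score(
    y_true, y_pred, labels=LABELS, average=None, zero_division=0
)

print(cm)
for name, p, r in zip(NAMES, precision, recall):
    print(f"{name:>8}: precision={p:.2f} recall={r:.2f}")
```

In this toy example, overall accuracy is 9/12 while recall for the moderate class is only 0.5, which is the kind of gap the per-class view is meant to expose.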
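Calculate the Probability of Detection (POD) and False Alarm Rate (FAR) for a binary "turbulence yes/no" threshold. State which definition of FAR you use: forecast verification distinguishes the false alarm ratio (false alarms divided by all "yes" forecasts) from the false alarm rate (false alarms divided by all observed non-events). Compare your model against a simple baseline using Richardson number alone (Ri < 1.0 = turbulence). This cutoff is deliberately looser than the theoretical 0.25, because Ri computed from coarse gridded data averages over the thin layers where instability actually occurs. A well-engineered ML model should meaningfully outperform this single-variable threshold.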
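The binary scores and the Richardson-number baseline can be sketched from a 2x2 contingency table. Because "FAR" is used for two different quantities in forecast verification, the sketch reports both the false alarm ratio and the false alarm rate. The Ri values and observations are made up for illustration.

```python
import numpy as np


def _ratio(numerator, denominator):
    """Safe division that returns NaN when the denominator is zero."""
    return float(numerator) / float(denominator) if denominator else float("nan")


def contingency_scores(observed, forecast):
    """POD and false-alarm scores for binary yes/no turbulence forecasts."""
    observed = np.asarray(observed, dtype=bool)
    forecast = np.asarray(forecast, dtype=bool)
    hits = np.sum(forecast & observed)
    misses = np.sum(~forecast & observed)
    false_alarms = np.sum(forecast & ~observed)
    correct_negatives = np.sum(~forecast & ~observed)
    return {
        # Fraction of observed events that were forecast.
        "POD": _ratio(hits, hits + misses),
        # Fraction of "yes" forecasts that did not verify.
        "false_alarm_ratio": _ratio(false_alarms, hits + false_alarms),
        # Fraction of observed non-events that were forecast as events.
        "false_alarm_rate": _ratio(false_alarms, false_alarms + correct_negatives),
    }


# Made-up Richardson numbers and observed turbulence flags (1 = turbulence).
ri = np.array([0.2, 0.8, 1.5, 3.0, 0.6, 5.0, 0.9, 2.0, 4.0])
observed = np.array([1, 1, 0, 0, 0, 0, 1, 1, 0])

# Baseline from the text: forecast turbulence wherever Ri < 1.0.
baseline_forecast = ri < 1.0
baseline = contingency_scores(observed, baseline_forecast)
print(baseline)
```

Running the same function on the classifier's binarized predictions gives scores that are directly comparable with the baseline.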
Interpret Results and Document Limitations
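Use feature importances (the built-in feature_importances_ attribute of tree ensembles, or scikit-learn's permutation importance) and SHAP values (from the separate shap package) to understand what the model learned. Do the most important features align with turbulence physics? If vertical wind shear and Richardson number dominate, that's a good sign: the model has picked up relationships consistent with the physics.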
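The importance check could look like the following sketch, which uses scikit-learn's permutation importance on held-out data. The data is synthetic and constructed so that only the first two columns carry signal; the feature names are borrowed from the text purely as labels. SHAP is omitted because it requires a separate package.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
feature_names = ["vertical_shear", "richardson_number", "deformation", "noise"]

# Synthetic standardized features; the label depends only on the first two
# columns (high shear and low Ri), by construction of this toy example.
X = rng.normal(size=(1500, 4))
y = ((X[:, 0] - X[:, 1]) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Permutation importance: drop in test score when one column is shuffled.
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]

for idx in ranking:
    print(f"{feature_names[idx]:>18}: "
          f"permutation={result.importances_mean[idx]:.3f} "
          f"impurity={model.feature_importances_[idx]:.3f}")
```

Measuring permutation importance on held-out data helps avoid crediting features the model has merely memorized, which impurity-based importances can do.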
Document key limitations honestly: PIREPs only exist along flight routes (no reports where planes don't fly), aircraft type affects perceived severity, ERA5 resolution (31 km) is too coarse to resolve small-scale turbulence features, and the temporal matching between PIREPs and ERA5 snapshots introduces noise. These are the same challenges faced by operational turbulence forecasting systems.
Career Connection
See how this project connects to real aerospace careers.
Flight Dispatcher
Dispatchers route flights around forecast turbulence; understanding CAT prediction directly improves flight safety and passenger comfort
Pilot
Pilots read turbulence forecasts and submit PIREPs; knowing how these systems work makes reports more useful
Aerospace Engineer
Aircraft structural loads and fatigue analysis require understanding turbulence statistics and encounter rates
Air Traffic Control
Controllers relay turbulence reports between aircraft and may reroute traffic around severe turbulence areas
Go Further
Advance your atmospheric ML research:
- Add mountain wave features — compute terrain-relative variables to capture orographic turbulence near mountain ranges
- Use deep learning on gridded data — instead of point features, feed the full 2D atmospheric field into a CNN to learn spatial patterns
- Build a real-time prediction map — ingest current GFS forecast data and display predicted turbulence on a web map with flight routes
- Compare against GTG — the NOAA Graphical Turbulence Guidance product is the operational standard; see how your model compares