Predict Clear-Air Turbulence from Weather Data

Build a classifier that warns pilots about invisible rough air

Undergraduate · Weather · 4–6 weeks

Last reviewed: March 2026

Overview

Clear-air turbulence (CAT) is one of aviation's most dangerous weather phenomena because it's invisible — it doesn't show up on radar and occurs in cloudless skies. CAT causes hundreds of injuries to airline passengers and crew every year, and climate change research suggests it's getting worse. The ability to predict CAT from atmospheric data is an active area of research at NOAA, NASA, and major airlines.

In this project, you'll combine two real data sources: PIREPs (Pilot Reports of turbulence encounters, submitted by flight crews in real-time) and ERA5 reanalysis data (a global atmospheric dataset from the European Centre for Medium-Range Weather Forecasts). You'll match each turbulence report with the atmospheric conditions at that location and altitude, then train a classifier to predict turbulence severity from meteorological variables like wind shear, temperature gradient, and jet stream proximity.

This is a genuinely useful classification problem with real-world impact. The atmospheric features you'll work with — Richardson number, vertical wind shear, deformation — are the same variables used in operational turbulence forecasting systems like the GTG (Graphical Turbulence Guidance) product that airlines use today.

What You'll Learn

✓ Acquire and process real pilot turbulence reports (PIREPs) from NOAA archives
✓ Work with gridded atmospheric reanalysis data (ERA5) and extract features at specific locations
✓ Engineer physically meaningful features for turbulence prediction (Richardson number, wind shear, deformation)
✓ Handle class imbalance in a multi-class classification problem (most reports are "smooth")
✓ Evaluate classifier performance with confusion matrices, ROC curves, and domain-appropriate metrics

Step-by-Step Guide

Understand Clear-Air Turbulence Physics

CAT is caused by Kelvin-Helmholtz instability — when wind shear across a temperature inversion becomes strong enough to overcome the stabilizing effect of the temperature gradient. This is quantified by the Richardson number (Ri): when Ri drops below 0.25, the flow becomes dynamically unstable and turbulence breaks out.
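For reference, one common way to write the gradient Richardson number is sketched below. The symbols are notation assumed here, not taken from the project page: \theta is potential temperature, g is gravitational acceleration, z is height, and u and v are the horizontal wind components.

```latex
\[
\mathrm{Ri} = \frac{N^2}{S^2}, \qquad
N^2 = \frac{g}{\theta}\,\frac{\partial \theta}{\partial z}, \qquad
S^2 = \left(\frac{\partial u}{\partial z}\right)^2
    + \left(\frac{\partial v}{\partial z}\right)^2
\]
```

Read this way, strong shear (large S^2) or weak stratification (small N^2) both push Ri toward the 0.25 threshold.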

Study the atmospheric conditions that produce CAT: jet stream boundaries (especially the jet's poleward, cold-air side, which is the north side in the Northern Hemisphere), mountain waves propagating into the stratosphere, and upper-level fronts. Understanding the physics will guide your feature engineering and help you interpret model results.

Collect and Process PIREPs

Download pilot turbulence reports from the NOAA Aviation Weather Center archive or the Iowa Environmental Mesonet PIREP database. Each PIREP contains: location (lat/lon), altitude (flight level), time, turbulence severity (smooth/light/moderate/severe/extreme), and aircraft type.

Clean the data aggressively. Discard reports with missing locations or altitudes. Note that turbulence perception varies by aircraft — a "moderate" report from a Boeing 737 would be "severe" in a Cessna 172. For this project, treat the reported severity at face value, but document this limitation.
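The cleaning rules above could be sketched with pandas roughly as follows. The column names (lat, lon, flight_level, severity, aircraft) and the lowercase severity labels are hypothetical; the actual archive you download will use its own field names and codes, so treat this as a starting template only.

```python
import pandas as pd

# Ordinal encoding for the five severity labels named in the text.
SEVERITY_ORDER = {"smooth": 0, "light": 1, "moderate": 2, "severe": 3, "extreme": 4}


def clean_pireps(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop unusable reports and add an ordinal severity column.

    Column names here are hypothetical placeholders.
    """
    # Discard reports with a missing location, altitude, or severity.
    df = raw.dropna(subset=["lat", "lon", "flight_level", "severity"]).copy()
    # Normalise label text so "Moderate " and "moderate" map to the same class.
    df["severity"] = df["severity"].str.strip().str.lower()
    # Keep only the five expected categories.
    df = df[df["severity"].isin(list(SEVERITY_ORDER))]
    # Keep only physically plausible coordinates.
    df = df[df["lat"].between(-90, 90) & df["lon"].between(-180, 180)].copy()
    df["severity_code"] = df["severity"].map(SEVERITY_ORDER).astype(int)
    return df.reset_index(drop=True)


# Tiny made-up sample: only the first row survives every check.
raw = pd.DataFrame(
    {
        "lat": [39.8, None, 41.2, 95.0, 35.1],
        "lon": [-104.7, -98.0, -87.9, -100.0, -90.0],
        "flight_level": [340, 360, None, 380, 310],
        "severity": ["Moderate ", "light", "severe", "smooth", "unknown"],
        "aircraft": ["B737", "A320", "C172", "B738", "E175"],
    }
)
cleaned = clean_pireps(raw)
```

Keeping the aircraft column in the cleaned table makes it easier to document the aircraft-type limitation later, even if the model never uses it.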

Obtain ERA5 Atmospheric Data

Register for a free Copernicus Climate Data Store account and download ERA5 pressure-level data for your PIREP time period. Request variables on the pressure levels nearest typical cruise altitudes (200, 250, and 300 hPa): u-wind, v-wind, temperature, geopotential, and vertical velocity.

Use Python's xarray library to work with the NetCDF files. For each PIREP, extract the atmospheric state at the nearest grid point, pressure level, and hourly analysis time. This spatial and temporal matching step is critical: decide whether you use nearest-neighbor selection or interpolation, apply it consistently, and document the choice.
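As a rough illustration of the matching logic, the sketch below converts a flight level to pressure with the International Standard Atmosphere (flight levels are pressure altitudes, so this conversion is the natural one) and snaps a report to the nearest grid node. It assumes a regular 0.25° grid with latitude running 90 to -90 and longitude 0 to 359.75; check the coordinates in your own download, since that layout is an assumption. All function names are hypothetical.

```python
import math

ERA5_LEVELS_HPA = (200, 250, 300)   # levels requested in this step
GRID_STEP_DEG = 0.25                # assumed regular download grid


def flight_level_to_hpa(flight_level: float) -> float:
    """Convert a flight level (hundreds of feet) to pressure via the ISA."""
    h = flight_level * 100 * 0.3048  # pressure altitude in metres
    if h <= 11_000:
        return 1013.25 * (1 - 2.25577e-5 * h) ** 5.25588
    # Isothermal layer above the ISA tropopause (11 km, 226.32 hPa).
    return 226.32 * math.exp(-1.57688e-4 * (h - 11_000))


def nearest_level(flight_level: float) -> int:
    """Pick the requested pressure level closest to the report's altitude."""
    p = flight_level_to_hpa(flight_level)
    return min(ERA5_LEVELS_HPA, key=lambda level: abs(level - p))


def nearest_grid_point(lat: float, lon: float) -> tuple[float, float]:
    """Snap to the nearest node of a grid with lat 90..-90 and lon 0..359.75."""
    n_lon = round(360 / GRID_STEP_DEG)
    i_lat = round((90 - lat) / GRID_STEP_DEG)
    # lon % 360 converts -180..180 input; the final modulo wraps across 0 degrees.
    i_lon = round((lon % 360) / GRID_STEP_DEG) % n_lon
    return 90 - i_lat * GRID_STEP_DEG, i_lon * GRID_STEP_DEG


level = nearest_level(340)                      # FL340 is close to 250 hPa
grid_lat, grid_lon = nearest_grid_point(40.1, -105.2)
```

With the data loaded in xarray, `Dataset.sel(..., method="nearest")` performs the same nearest-coordinate lookup directly on the labelled axes, so the hand-rolled indexing is mainly useful for understanding and checking what that call returns.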

Engineer Turbulence Prediction Features

Compute physically meaningful features from the raw atmospheric variables. Key features include: vertical wind shear (wind speed change between pressure levels), horizontal temperature gradient (computed via finite differences on the grid), Richardson number (Ri = N²/S² where N is Brunt-Väisälä frequency and S is vertical shear), deformation (stretching and shearing of the flow), and divergence.
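A minimal sketch of the vertical shear and Richardson number calculation between two pressure levels follows. It is a bulk (layer-averaged) estimate, which is one of several reasonable choices. The dictionary layout and function names are hypothetical, and the sample values are invented for illustration rather than taken from ERA5.

```python
import numpy as np

G = 9.80665        # gravitational acceleration, m/s^2
KAPPA = 0.2857     # R_d / c_p for dry air


def potential_temperature(temp_k, pressure_hpa):
    """Potential temperature relative to 1000 hPa."""
    return temp_k * (1000.0 / pressure_hpa) ** KAPPA


def richardson_number(upper, lower):
    """Bulk Ri and vertical shear for the layer between two pressure levels.

    Each argument is a dict with keys p (hPa), t (K), u, v (m/s) and
    phi (geopotential, m^2/s^2); this layout is hypothetical.
    """
    dz = (upper["phi"] - lower["phi"]) / G          # layer thickness, m
    theta_up = potential_temperature(upper["t"], upper["p"])
    theta_lo = potential_temperature(lower["t"], lower["p"])
    # Squared Brunt-Vaisala frequency from the potential temperature change.
    n2 = G / (0.5 * (theta_up + theta_lo)) * (theta_up - theta_lo) / dz
    du_dz = (upper["u"] - lower["u"]) / dz
    dv_dz = (upper["v"] - lower["v"]) / dz
    s2 = du_dz**2 + dv_dz**2
    # Floor S^2 so a shear-free layer gives a very large Ri, not a division error.
    return n2 / np.maximum(s2, 1e-12), np.sqrt(s2)


# Invented values for a 300-250 hPa layer, roughly 1200 m thick.
lower = {"p": 300.0, "t": 228.0, "u": 30.0, "v": 0.0, "phi": 9200.0 * G}
upper = {"p": 250.0, "t": 220.0, "u": 50.0, "v": 5.0, "phi": 10400.0 * G}
ri, shear = richardson_number(upper, lower)
```

Because the function uses only arithmetic and NumPy calls, the same code works on whole gridded arrays if each dictionary entry is a 2D field instead of a scalar.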

Also compute jet stream proximity: distance to the nearest wind speed maximum above 60 knots at 250 hPa. CAT frequency peaks at the edges of the jet stream, not at its core. These derived features encode the physical mechanisms that cause turbulence.
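One simple way to approximate jet stream proximity is sketched below. Note the simplification: it measures the great-circle distance to the nearest grid cell whose wind speed exceeds the threshold, rather than locating true local wind maxima, so it gives distance to the edge of the jet region. Function names and the synthetic grid are hypothetical.

```python
import numpy as np

KT_TO_MS = 0.514444
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in degrees, arrays allowed."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def jet_distance_km(lat, lon, grid_lats, grid_lons, u, v, threshold_kt=60.0):
    """Distance from (lat, lon) to the nearest cell faster than the threshold.

    u and v are 2D arrays of shape (len(grid_lats), len(grid_lons)) in m/s.
    Returns NaN when no cell exceeds the threshold.
    """
    speed = np.hypot(u, v)
    jet_mask = speed > threshold_kt * KT_TO_MS
    if not jet_mask.any():
        return float("nan")
    lat2d, lon2d = np.meshgrid(grid_lats, grid_lons, indexing="ij")
    return float(haversine_km(lat, lon, lat2d[jet_mask], lon2d[jet_mask]).min())


# Synthetic 250 hPa field: weak flow everywhere except a jet along 45N.
grid_lats = np.arange(50, 29, -1.0)     # 50N down to 30N
grid_lons = np.arange(240, 261, 1.0)
u = np.full((grid_lats.size, grid_lons.size), 10.0)
v = np.zeros_like(u)
u[grid_lats == 45.0, :] = 40.0          # about 78 kt
distance = jet_distance_km(40.0, 250.0, grid_lats, grid_lons, u, v)
```

A signed version of this feature (positive on the poleward side, negative on the equatorward side) may carry more information than the raw distance, but that refinement is left as an option.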

Handle Class Imbalance and Train

Turbulence is rare — most of the atmosphere is smooth at any given time. Your dataset will likely be heavily imbalanced: ~70% smooth, ~20% light, ~8% moderate, ~2% severe. If you train naively, the model will just predict "smooth" for everything and get 70% accuracy.

Address this with one of three techniques: SMOTE (synthetic oversampling of minority classes, provided by the imbalanced-learn package rather than scikit-learn itself), class weights (penalizing misclassification of rare events more heavily), or a cost-sensitive Random Forest. Train with stratified cross-validation to ensure each fold contains all severity levels. Try both scikit-learn's Random Forest and a gradient boosting classifier such as XGBoost (a separate library).
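The class-weighting option can be sketched with scikit-learn alone, as below. The data are synthetic, generated with roughly the class mix quoted above, so the scores say nothing about real turbulence; the point is only to show the majority-class trap next to a class-weighted model under stratified cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the matched PIREP/ERA5 table (70/20/8/2 class mix).
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1,
    weights=[0.70, 0.20, 0.08, 0.02], random_state=0,
)

# Stratified folds keep every severity class present in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Baseline that always predicts the most frequent class ("smooth").
baseline = DummyClassifier(strategy="most_frequent")
baseline_acc = cross_val_score(baseline, X, y, cv=cv, scoring="accuracy")
baseline_recall = cross_val_score(baseline, X, y, cv=cv, scoring="recall_macro")

# class_weight="balanced" reweights samples inversely to class frequency.
forest = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
forest_recall = cross_val_score(forest, X, y, cv=cv, scoring="recall_macro")

print(f"baseline accuracy:     {baseline_acc.mean():.2f}")
print(f"baseline macro recall: {baseline_recall.mean():.2f}")
print(f"forest macro recall:   {forest_recall.mean():.2f}")
```

Macro-averaged recall is used here because it weights all four classes equally, which exposes the baseline's failure on the rare classes that accuracy hides.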

Evaluate with Aviation-Appropriate Metrics

Accuracy alone is misleading for imbalanced problems. Report the confusion matrix, per-class precision and recall, and ROC-AUC for each severity level. In aviation, recall for moderate/severe turbulence is the most important metric — missing a real turbulence event is far worse than issuing a false warning.

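Calculate the Probability of Detection (POD) and False Alarm Rate (FAR) for a binary "turbulence yes/no" threshold. State which FAR definition you use, because verification literature applies the abbreviation both to the false alarm ratio (false alarms divided by all warnings) and to the false alarm rate (false alarms divided by all non-events). Compare your model against a simple baseline using Richardson number alone (Ri < 1.0 = turbulence; this is looser than the theoretical 0.25 because a coarse reanalysis grid smooths out the sharpest shear layers). A well-engineered ML model should meaningfully outperform this single-variable threshold.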
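The contingency-table scores can be computed in a few lines, as sketched here. Because the abbreviation FAR is used for both the false alarm ratio and the false alarm rate (also called probability of false detection, POFD), the sketch reports both under explicit names. The observations and Richardson numbers are invented for illustration.

```python
import numpy as np


def contingency_scores(observed, predicted):
    """POD, false alarm ratio, and POFD from binary event arrays."""
    observed = np.asarray(observed, dtype=bool)
    predicted = np.asarray(predicted, dtype=bool)
    hits = np.sum(observed & predicted)
    misses = np.sum(observed & ~predicted)
    false_alarms = np.sum(~observed & predicted)
    correct_negatives = np.sum(~observed & ~predicted)
    return {
        "POD": hits / (hits + misses),
        "FAR_ratio": false_alarms / (hits + false_alarms),
        "POFD": false_alarms / (false_alarms + correct_negatives),
    }


# Hypothetical matched samples: observed turbulence flags and Richardson numbers.
observed = np.array([1, 1, 1, 0, 0, 0, 0, 1])
ri = np.array([0.4, 2.5, 0.8, 0.9, 3.0, 5.2, 1.7, 0.2])

# Single-variable baseline: warn whenever Ri falls below 1.0.
ri_baseline = ri < 1.0
scores = contingency_scores(observed, ri_baseline)
```

Passing the classifier's binary predictions to the same function gives numbers that are directly comparable with the baseline. The function divides by zero if a sample contains no events or no warnings, so guard for that on small test sets.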

Interpret Results and Document Limitations

Use the model's built-in feature importances (available in scikit-learn) and SHAP values (from the separate shap package) to understand what the model learned. Do the most important features align with turbulence physics? If vertical wind shear and Richardson number dominate, that is a good sign: the model's behavior is consistent with known physical relationships.
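As a minimal sketch of the interpretation step, the example below compares a Random Forest's impurity-based importances with permutation importance on held-out data, both from scikit-learn. The target is synthetic and driven by a single column, so the expected ranking is known in advance; the feature names are hypothetical labels. SHAP values would come from the separate shap package and are not shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
feature_names = ["vertical_shear", "richardson_number", "divergence"]  # hypothetical
X = rng.normal(size=(600, 3))
# Synthetic target driven only by the first column, so the right answer is known.
y = (X[:, 0] > 0.5).astype(int)

# Fit on the first 400 rows, hold out the remaining 200 for permutation tests.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X[:400], y[:400])

# Impurity-based importance: computed from the training process itself.
impurity_top = feature_names[int(np.argmax(model.feature_importances_))]

# Permutation importance: drop in held-out score when one column is shuffled.
perm = permutation_importance(
    model, X[400:], y[400:], n_repeats=10, random_state=0
)
permutation_top = feature_names[int(np.argmax(perm.importances_mean))]
```

On real data the two rankings can disagree, especially when features are correlated (shear and Richardson number are, by construction), so it is worth reporting both rather than trusting a single ranking.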

Document key limitations honestly: PIREPs only exist along flight routes (no reports where planes don't fly), aircraft type affects perceived severity, ERA5 resolution (31 km) is too coarse to resolve small-scale turbulence features, and the temporal matching between PIREPs and ERA5 snapshots introduces noise. These are the same challenges faced by operational turbulence forecasting systems.

Go Further

Advance your atmospheric ML research:

Add mountain wave features — compute terrain-relative variables to capture orographic turbulence near mountain ranges
Use deep learning on gridded data — instead of point features, feed the full 2D atmospheric field into a CNN to learn spatial patterns
Build a real-time prediction map — ingest current GFS forecast data and display predicted turbulence on a web map with flight routes
Compare against GTG — the NOAA Graphical Turbulence Guidance product is the operational standard; see how your model compares

Related Projects

High School: Predict Tomorrow's Wind Speed for Flight Planning. Build a weather model that helps pilots make go/no-go decisions.
High School: Satellite Image Classification. Teach a computer to read the Earth from space.
