Analyze Aviation Incident Patterns with NLP
Apply modern NLP to uncover hidden patterns in safety data
Last reviewed: March 2026
Overview
The NASA Aviation Safety Reporting System (ASRS) contains over 1.8 million voluntary safety reports — a vast, largely untapped resource for understanding human error, system failures, and organizational breakdowns in aviation. Traditional safety analysis involves experts reading reports one at a time. Natural language processing can process the entire database, finding patterns that no human could discover manually.
In this project, you'll build a sophisticated NLP pipeline that goes beyond simple keyword matching. You'll start with TF-IDF features and classical classifiers (SVM, gradient boosting) to establish baselines, then fine-tune a pre-trained transformer (DistilBERT) on the ASRS corpus. The transformer approach captures context and meaning — it knows that "lost separation" and "aircraft converging" describe similar events even though they share no words.
Beyond classification, you'll use topic modeling (LDA) and embedding visualization (UMAP) to discover latent themes in the data: groups of reports that cluster together not because of their assigned category, but because of underlying similarities in what went wrong. This unsupervised analysis can reveal emerging safety issues before they appear in formal statistics.
What You'll Learn
- ✓ Build a complete NLP pipeline from raw text to trained classifier
- ✓ Compare classical (TF-IDF + SVM) and deep learning (transformer) approaches to text classification
- ✓ Fine-tune a pre-trained language model on domain-specific data
- ✓ Apply topic modeling to discover latent themes in a text corpus
- ✓ Visualize high-dimensional text embeddings using dimensionality reduction
Step-by-Step Guide
Build the ASRS Dataset
Use the ASRS database query interface to download a large, balanced dataset: at least 1,000 reports from each of 6–8 incident categories (e.g., Airspace Violation, CFIT, Runway Incursion, Equipment Malfunction, Weather, Human Factors, ATC Issues). Export as CSV, extracting both the narrative text and the structured metadata (phase of flight, aircraft type, reporter function).
Clean the narratives: remove header/footer boilerplate, normalize whitespace, and handle the ASRS convention of replacing specific identifiers with codes (e.g., "ZZZ" for airport names, "Company X" for airlines). Decide whether to keep or remove these anonymization tokens — they carry structural information even without specific identities.
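The cleaning step above can be sketched as a small helper. This is a minimal sketch, not the actual ASRS export schema — the function name and the decision to keep anonymization tokens by default are assumptions for illustration.

```python
import re

def clean_narrative(text: str, keep_anon_tokens: bool = True) -> str:
    """Normalize an ASRS narrative: collapse whitespace and optionally
    strip the anonymization placeholders (e.g. 'ZZZ', 'Company X')."""
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace/newlines
    if not keep_anon_tokens:
        # Drop airport/company placeholders if you decide they add noise
        text = re.sub(r"\bZZZ\d*\b", "", text)
        text = re.sub(r"\bCompany [A-Z]\b", "", text)
        text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_narrative("Tower  cleared us to  land at ZZZ.\n"))
# → "Tower cleared us to land at ZZZ."  (placeholders kept by default)
```

Keeping the tokens by default preserves the structural signal ("an airport was mentioned here") while removing them lets you test whether the classifier leans on them.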
Classical NLP Baseline
Build a TF-IDF + SVM classifier as your baseline. Use scikit-learn's TfidfVectorizer with bigrams enabled, sublinear TF scaling, and a vocabulary cap of 20,000 features. Train a LinearSVC with cross-validation to tune the regularization parameter C.
Also train a gradient-boosted classifier (scikit-learn's HistGradientBoostingClassifier) on the same TF-IDF features — densified first, since HistGradientBoostingClassifier does not accept sparse input. Compare it with the SVM using the macro-averaged F1 score; this metric weights all categories equally, which matters when some incident types are rarer than others. Expect F1 scores in the 0.75–0.85 range.
Topic Modeling with LDA
Apply Latent Dirichlet Allocation (LDA) to discover topics that cut across the formal incident categories. Use scikit-learn's LatentDirichletAllocation with 15–25 topics. Examine the top words in each topic and assign interpretive labels (e.g., "communication breakdown," "fatigue-related," "visual approach errors").
Some topics will align neatly with incident categories; others will reveal cross-cutting themes. For example, fatigue-related language might appear across multiple categories — suggesting that fatigue is an underlying factor that the category system doesn't explicitly capture. This is exactly the kind of insight that makes NLP valuable for safety research.
Fine-Tune a Transformer
Install Hugging Face Transformers and load distilbert-base-uncased — a compact transformer that's fast to fine-tune on a single GPU or even a CPU (though slower). Tokenize the ASRS narratives using the DistilBERT tokenizer, truncating to 512 tokens. Most ASRS narratives fit within this limit.
Fine-tune for 3–5 epochs using the Trainer API with a learning rate of 2e-5 and a batch size of 16. Use a classification head on the [CLS] token output. Monitor validation F1 score per epoch — transformers tend to overfit quickly on small datasets, so early stopping is important. Expect the transformer to outperform the TF-IDF baseline by 3–8% in F1 score.
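A sketch of the fine-tuning loop with the Trainer API, written as functions so the outline reads without a GPU or downloaded weights. It assumes `train_ds`/`val_ds` are tokenized `datasets.Dataset` objects with a `labels` column; argument names follow recent Transformers releases (`eval_strategy` was `evaluation_strategy` before v4.41), so check your installed version.

```python
def tokenize_corpus(texts, max_length=512):
    """Tokenize narratives with the DistilBERT tokenizer, truncating to 512."""
    from transformers import AutoTokenizer  # deferred so the sketch reads standalone
    tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    return tok(texts, truncation=True, max_length=max_length, padding=True)

def finetune_distilbert(train_ds, val_ds, num_labels,
                        lr=2e-5, batch_size=16, epochs=3):
    """Fine-tune DistilBERT for sequence classification with early stopping."""
    from transformers import (AutoModelForSequenceClassification, Trainer,
                              TrainingArguments, EarlyStoppingCallback)

    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=num_labels)

    args = TrainingArguments(
        output_dir="asrs-distilbert",
        learning_rate=lr,
        per_device_train_batch_size=batch_size,
        num_train_epochs=epochs,
        eval_strategy="epoch",        # validate every epoch
        save_strategy="epoch",
        load_best_model_at_end=True,  # required for EarlyStoppingCallback
        metric_for_best_model="eval_loss",  # or "f1" with a compute_metrics fn
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=val_ds,
                      callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])
    trainer.train()
    return trainer
```

To monitor F1 rather than loss, pass a `compute_metrics` function to `Trainer` that returns `{"f1": ...}` and set `metric_for_best_model="f1"`.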
Embedding Visualization
Extract the [CLS] token embeddings from your fine-tuned DistilBERT model for all reports in the test set. These 768-dimensional vectors encode the semantic meaning of each narrative. Use UMAP (Uniform Manifold Approximation and Projection) to reduce them to 2D and create a scatter plot colored by incident category.
Well-separated clusters indicate categories with distinct language. Overlapping clusters suggest categories that are genuinely similar or ambiguous. Zoom in on the boundary regions — the reports that sit between clusters are often the most interesting from a safety perspective because they involve multiple contributing factors.
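The embedding extraction and projection can be sketched as below. Imports are deferred so the outline reads without torch or umap-learn installed; the `.distilbert` attribute is how DistilBERT sequence-classification models expose their base encoder, and the plot styling is an assumption.

```python
def extract_cls_embeddings(model, tokenizer, texts, batch_size=32):
    """Return the 768-d [CLS] hidden states for each narrative."""
    import numpy as np
    import torch

    model.eval()
    chunks = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            enc = tokenizer(texts[i:i + batch_size], truncation=True,
                            max_length=512, padding=True, return_tensors="pt")
            out = model.distilbert(**enc)  # base encoder, skipping the classifier head
            chunks.append(out.last_hidden_state[:, 0, :].numpy())  # [CLS] = token 0
    return np.vstack(chunks)

def plot_umap(embeddings, labels, out_path="asrs_umap.png"):
    """Project embeddings to 2D with UMAP and color points by category."""
    import numpy as np
    import matplotlib.pyplot as plt
    import umap  # pip install umap-learn

    coords = umap.UMAP(n_components=2, random_state=0).fit_transform(embeddings)
    labels = np.asarray(labels)
    for lab in sorted(set(labels)):
        mask = labels == lab
        plt.scatter(coords[mask, 0], coords[mask, 1], s=5, label=lab)
    plt.legend(markerscale=3)
    plt.savefig(out_path, dpi=150)
```

Running UMAP on the test-set embeddings (rather than the training set) avoids plotting points the model has memorized.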
Error Analysis and Confusion Patterns
Generate a confusion matrix for both the TF-IDF and transformer classifiers. Where do they agree and disagree? The transformer typically resolves ambiguities that trip up the TF-IDF model — for example, distinguishing "communication error with ATC" (an ATC issue) from "communication error between crew members" (a human factors issue) based on contextual understanding.
Manually review the 20 most confidently wrong predictions (highest predicted probability but wrong category). Are these true model errors, or are they labeling inconsistencies in the ASRS data? In safety databases, labeling is often imperfect because reporters self-select categories and may not choose the most appropriate one.
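Finding the "confidently wrong" predictions is a small numpy exercise; the toy arrays below are stand-ins for your test-set outputs. Note that LinearSVC has no `predict_proba` — wrap it in `CalibratedClassifierCV` or rank by `decision_function` margins instead.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def most_confident_errors(y_true, y_pred, proba, k=20):
    """Indices of the k misclassified examples with the highest predicted
    probability — prime candidates for manual review."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    conf = np.asarray(proba).max(axis=1)          # confidence = top-class probability
    wrong = np.where(y_pred != y_true)[0]         # misclassified indices
    return wrong[np.argsort(conf[wrong])[::-1]][:k]  # sort wrong ones by confidence

# Toy outputs: six predictions, two of them wrong
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]
proba = [[.9, .1], [.2, .8], [.7, .3], [.6, .4], [.1, .9], [.45, .55]]

cm = confusion_matrix(y_true, y_pred)
print(cm)
print(most_confident_errors(y_true, y_pred, proba, k=2))  # → [2 5]
```

Run the same function on both classifiers and diff the index sets: reports that only the TF-IDF model gets wrong are where contextual understanding is doing the work.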
Temporal Analysis and Reporting
Plot the distribution of incident types over time (by year and month). Are certain categories trending up or down? Are there seasonal patterns? Correlate spikes with known events (new regulations, major accidents that changed reporting behavior, COVID-19 traffic reduction).
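The temporal breakdown is a pandas groupby; the column names and toy dates below are assumptions standing in for the fields of your ASRS export.

```python
import pandas as pd

# Stand-in frame — the real export supplies a report date and category per row
df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-07-02",
                            "2024-01-11", "2024-07-09", "2024-07-30"]),
    "category": ["Weather", "Runway Incursion", "Weather",
                 "Weather", "Runway Incursion", "Weather"],
})

# Reports per category per month; call monthly.plot() for trend lines
monthly = (df.groupby([df["date"].dt.to_period("M"), "category"])
             .size().unstack(fill_value=0))
print(monthly)

# Seasonal view: pool all years by calendar month
seasonal = df.groupby(df["date"].dt.month).size()
print(seasonal)
```

Normalizing monthly counts by total reports that month separates genuine trends from changes in overall reporting volume.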
Write a comprehensive report structured as a short research paper: introduction (why NLP for safety), methods (both approaches), results (performance comparison and topic analysis), discussion (what the model reveals about incident patterns), and future work. This is portfolio-quality work that demonstrates both ML skills and domain understanding.
Career Connection
See how this project connects to real aerospace careers.
Aerospace Engineer →
Safety analysis is a core aerospace engineering function — NLP is increasingly used to process the text-heavy reports that drive design changes and regulatory decisions
Air Traffic Control →
ATC organizations analyze controller-submitted reports using methods like these to identify systemic issues in procedures and communication
Pilot →
Airlines run Flight Operational Quality Assurance (FOQA) programs that combine NLP analysis of reports with flight data — understanding both is valuable for safety-focused pilot careers
Aviation Maintenance →
Maintenance log entries are free-text narratives; NLP tools trained on aviation language can flag recurring issues and predict component failures from text patterns
Go Further
- Named entity recognition — train a custom NER model to extract aircraft types, airport codes, and phases of flight from unstructured narratives
- Cross-database analysis — apply your models to NTSB accident reports or FAA incident data and compare the language and themes with ASRS voluntary reports
- Retrieval-augmented generation — build a system where analysts can ask natural language questions about the ASRS database and get AI-generated summaries with cited reports
- Causal chain extraction — use dependency parsing to extract cause-effect relationships from narratives (e.g., "fatigue led to missed checklist item which caused...")