From Takeoff to Touchdown: Dissecting Data on Air Disasters

Proposal

Data visualization
A Shiny app integration with aircraft crash analysis

Author
Infographic Innovators - Antonio, Bharath, Eshaan, Thanoosha

Affiliation
School of Information, University of Arizona

Installed Packages
# GETTING THE LIBRARIES
if (!require(pacman))
  install.packages("pacman")

pacman::p_load(formattable,
               tidyverse,
               janitor,
               dlookr,
               here)

High Level Goal

Build a Shiny app where users can explore different plots to analyze the aircraft crashes that occurred in the United States from 1980 to 2022.

Dataset

# Reading the data using read_csv
flights_ntsb <- read_csv(here("data", "flight_crash_data_NTSB.csv"))

# Selecting the columns needed for the analysis
flights_ntsb <- flights_ntsb |>
  select(
    EventType, EventDate,
    City, State,
    HasSafetyRec,
    ReportType, HighestInjuryLevel,
    FatalInjuryCount, SeriousInjuryCount,
    MinorInjuryCount, ProbableCause,
    Latitude, Longitude,
    AirCraftCategory, AirportID,
    AirportName, AmateurBuilt,
    NumberOfEngines, AirCraftDamage,
    WeatherCondition
  ) |>
  clean_names()
flights_ntsb

In preparation for our analysis, we loaded the data and carried out a preliminary review of its structure and contents. Using functions such as clean_names, diagnose, describe, and formattable, we obtained an overview of the dataset’s variables and initial statistics. This preliminary step confirmed the dataset’s suitability for answering our research questions, which revolve around identifying temporal patterns and finding relationships between the parameters that affect crashes.

Background

The dataset central to our investigation was procured through a request to the National Transportation Safety Board (NTSB). It encompasses detailed records of aircraft crashes in the U.S. from January 1, 1980, to December 31, 2022. The raw dataset is a tibble with 89,134 rows and 38 columns; after removing the columns we do not need, we are left with a tibble of 89,134 rows and 20 columns. This rich dataset is ideal for our purpose because it supports time-series analysis, geo-spatial analysis, and the study of other factors that affect crashes, all of which we believe are critical for understanding the dynamics and geo-spatial distribution of aircraft accidents over time.

Flights Report Data Diagnosis Code
# Getting basic information about the dataset

flights_ntsb |>
  diagnose() |>
  formattable()
variables types missing_count missing_percent unique_count unique_rate
event_type character 8 0.008975251 3 3.365719e-05
event_date POSIXct 0 0.000000000 80322 9.011376e-01
city character 8 0.008975251 19038 2.135885e-01
state character 155 0.173895483 58 6.507057e-04
has_safety_rec logical 0 0.000000000 2 2.243813e-05
report_type character 1 0.001121906 5 5.609532e-05
highest_injury_level character 126 0.141360199 5 5.609532e-05
fatal_injury_count numeric 0 0.000000000 52 5.833913e-04
serious_injury_count numeric 0 0.000000000 24 2.692575e-04
minor_injury_count numeric 0 0.000000000 49 5.497341e-04
probable_cause character 29568 33.172526757 56119 6.296026e-01
latitude numeric 0 0.000000000 61321 6.879642e-01
longitude numeric 0 0.000000000 62243 6.983082e-01
air_craft_category character 13 0.014584782 38 4.263244e-04
airport_id character 40230 45.134292189 9656 1.083313e-01
airport_name character 32675 36.658289766 25807 2.895304e-01
amateur_built character 0 0.000000000 7 7.853344e-05
number_of_engines character 1546 1.734467207 40 4.487625e-04
air_craft_damage character 21 0.023560033 29 3.253528e-04
weather_condition character 173 0.194089797 7 7.853344e-05
Flights Report Data Describe Code
# Getting basic statistical information about the dataset

flights_ntsb |>
  describe() |>
  formattable()
described_variables n na mean sd se_mean IQR skewness kurtosis p00 p01 p05 p10 p20 p25 p30 p40 p50 p60 p70 p75 p80 p90 p95 p99 p100
fatal_injury_count 89134 0 0.3635538 2.251023e+00 7.539771e-03 0.00000 65.05258 5829.500 0 0.00 0.0000 0.0000 0.00000 0.00 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 1.00000 2.00000 4.00000 2.650000e+02
serious_injury_count 89134 0 0.1950210 7.612223e-01 2.549704e-03 0.00000 37.67354 3044.159 0 0.00 0.0000 0.0000 0.00000 0.00 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 1.00000 1.00000 2.00000 8.100000e+01
minor_injury_count 89134 0 0.3146386 1.399480e+00 4.687541e-03 0.00000 42.29602 3134.592 0 0.00 0.0000 0.0000 0.00000 0.00 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 1.00000 2.00000 3.00000 1.370000e+02
latitude 89134 0 988.6591476 2.011036e+05 6.735939e+02 12.22944 211.29325 44673.978 0 0.00 0.0000 0.0000 21.31861 29.18 31.18077 33.94250 35.97083 38.55095 40.43017 41.40944 42.56583 46.36235 58.41937 64.40744 4.351118e+07
longitude 89134 0 -123.9442394 6.343362e+03 2.124701e+01 34.84990 -149.65240 22435.028 -1004241 -158.58 -147.7193 -122.1001 -116.60010 -111.85 -106.08916 -96.58913 -90.31158 -84.50032 -80.45880 -77.00008 -70.81000 0.00000 0.00000 0.00000 1.741242e+02
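
The describe() output above shows coordinate values that cannot be real locations: the latitude maximum is 4.351118e+07, the longitude minimum is -1004241, and many percentiles of both columns are exactly 0. Before any geo-spatial plot, such rows would probably need to be flagged. The following is a minimal sketch of one way to do that, not the project's actual cleaning step. The tibble crashes_toy and its values are invented for illustration, and treating a (0, 0) pair as "unknown location" is an assumption.

```r
library(dplyr)

# Hypothetical toy rows standing in for flights_ntsb (values invented for illustration)
crashes_toy <- tibble::tibble(
  state     = c("AZ", "AK", "TX", "CA"),
  latitude  = c(32.22, 64.40, 43511180, 0),
  longitude = c(-110.97, -147.72, -96.59, 0)
)

# Flag a coordinate pair as valid only if it lies inside the geographic range
# and is not the (0, 0) placeholder (assumption: 0/0 encodes a missing location)
crashes_geo <- crashes_toy |>
  mutate(
    valid_coord = between(latitude, -90, 90) &
      between(longitude, -180, 180) &
      !(latitude == 0 & longitude == 0)
  )

# Rows that could safely be placed on a map
crashes_mappable <- filter(crashes_geo, valid_coord)
```

Keeping the flag as a column, rather than dropping rows immediately, leaves the invalid records available for the non-spatial plots.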

Value

Our project’s primary goal is to conduct an in-depth analysis of aircraft crash incidents within the United States to uncover temporal trends and geographical patterns. The idea grew out of conversations about places we have traveled to: our personal experiences with air travel range from enthusiasm to apprehension, and some of us are more comfortable with flying than others. We saw the aircraft crash data as a chance to become better informed about aircraft crashes within the United States. By examining the data, we hope to gain a clearer understanding of the factors contributing to these incidents and to alleviate some of our concerns regarding aviation safety. We also wanted to do both time-series and geo-spatial analysis, and this dataset is a great opportunity to do both. We intend to design a Shiny app that displays how crashes change over the years.

Problem Statement

We conducted Exploratory Data Analysis on the dataset and categorized our analysis into three main areas:

  • Examining Aircraft Crashes, with a focus on their locations, timings, and consequences.
  • Investigating the Causes of Crashes.
  • Assessing the Influence of Weather Conditions on Crashes.

Plan of Action

Examining Aircraft Crashes, with a focus on their locations, timings, and consequences

  1. Choropleth Map - An Animated Choropleth Map of Aircraft Crashes in the United States Over Time:

    The animated Choropleth Map provides a dynamic and visually compelling representation of aircraft crashes across the United States. This visualization uses the U.S. state of each crash location and spans multiple years to reveal temporal patterns. The Choropleth Map not only facilitates the identification of regions with higher crash frequencies but also offers insights into the evolving dynamics of air travel safety over time.

  2. Time-series Plot - Analysis of Aircraft Crashes and Fatalities:

    We will visualize the historical trend in aircraft crashes. We plan to plot this trend using the columns fatal_injury_count, serious_injury_count, and minor_injury_count. The dataset classifies injuries as minor, serious, or fatal according to the level of casualties in the crash. We would also like to highlight a few notable airplane crashes in our time-series analysis.

  3. Radial Bar Plot - A Radial Perspective on Aircraft Crashes During Flight Phases:

    We will represent the count of crashes on the x-axis and the phases of flight (takeoff, landing, etc.) on the y-axis, and then transform the bar plot into a radial form that emphasizes the distribution of crashes across these critical flight stages. This visualization technique allows for a quick understanding of the relative frequency of crashes during takeoff and landing, enabling insights into the safety challenges faced during these flight phases.
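
The three plots above mostly differ in how the crash records are aggregated before plotting. The sketch below shows one possible set of aggregations with dplyr, tidyr, and ggplot2; it is an illustration under stated assumptions, not our final implementation. The tibble crashes_toy and all of its values are invented, and phase is a hypothetical column, since the columns selected from the NTSB file do not include a flight phase. The animation of the choropleth (one frame per year) is not shown.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical toy rows standing in for flights_ntsb (all values invented).
# `phase` is an assumed column; it is not among the selected NTSB columns.
crashes_toy <- tibble::tibble(
  event_date = as.Date(c("1980-03-01", "1980-07-15", "1981-01-20", "1981-06-05")),
  state = c("AZ", "AZ", "AZ", "CA"),
  fatal_injury_count = c(2, 0, 1, 0),
  serious_injury_count = c(0, 1, 0, 0),
  minor_injury_count = c(1, 0, 3, 2),
  phase = c("Landing", "Takeoff", "Landing", "Cruise")
) |>
  mutate(year = as.integer(format(event_date, "%Y")))

# 1. Choropleth input: one row per year and state with the number of crashes
state_year_counts <- crashes_toy |>
  filter(!is.na(state)) |>
  count(year, state, name = "crashes")

# 2. Time-series input: yearly totals of each injury column, in long format
yearly_injuries <- crashes_toy |>
  group_by(year) |>
  summarise(across(ends_with("_injury_count"), sum), .groups = "drop") |>
  pivot_longer(-year, names_to = "injury_type", values_to = "total")

injury_trend <- ggplot(yearly_injuries, aes(x = year, y = total, colour = injury_type)) +
  geom_line() +
  geom_point()

# 3. Radial bar plot: an ordinary bar chart drawn in polar coordinates
phase_counts <- count(crashes_toy, phase, name = "crashes")

radial_bars <- ggplot(phase_counts, aes(x = phase, y = crashes)) +
  geom_col() +
  coord_polar(theta = "y")
```

Printing injury_trend or radial_bars renders the corresponding plot. Mapping the count to the angle with coord_polar(theta = "y") is what turns the straight bars into arcs.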

Analysis of Causes of Crashes

  1. Waffle plot - Common Causes of Aircraft Crashes:

    The waffle plot offers an overview of the most common causes of aircraft crashes throughout the United States. Leveraging the probable_cause column, which contains concise descriptions of crash events, the waffle plot employs text summarization techniques to extract and categorize these causes into manageable groups. This visualization provides a straightforward and insightful representation of the key factors contributing to aviation incidents, aiding in the identification of the primary causes that warrant further investigation.

  2. Density plot - Causes and Severity of Aircraft Crashes:

    The density plot delves into the relationship between a specific crash cause ("pilot's failure") and the severity of the injuries incurred. Drawing on the probable_cause column, as well as the severity and event_date columns, this visualization quantifies the number of crashes attributed to the cause "pilot's failure" while distinguishing between varying levels of severity. Visualizing this data as a density plot makes it evident how severe these crashes are.
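
Both plots in this group depend on turning the free-text probable_cause column into a small number of categories. The sketch below uses simple keyword matching with stringr as a stand-in for the text summarization step, which is a simplification we are assuming for illustration only. The toy rows, the keyword lists, and the category names are all invented, and highest_injury_level is used here as the severity measure, which is also an assumption.

```r
library(dplyr)
library(stringr)
library(ggplot2)

# Hypothetical toy rows standing in for flights_ntsb (all values invented)
crashes_toy <- tibble::tibble(
  event_date = as.Date(c("1985-05-01", "1992-08-12", "1999-02-03",
                         "2005-11-30", "2015-06-18")),
  highest_injury_level = c("Fatal", "Minor", "Fatal", "None", "Minor"),
  probable_cause = c(
    "The pilot's failure to maintain directional control during landing.",
    "A total loss of engine power due to fuel exhaustion.",
    "The pilot's failure to maintain adequate airspeed.",
    NA,
    "The pilot's failure to maintain clearance from trees."
  )
)

# Assign each crash to the first category whose keywords match (order matters)
crashes_grouped <- crashes_toy |>
  mutate(
    cause_text = str_to_lower(probable_cause),
    cause_group = case_when(
      is.na(cause_text) ~ "Unknown",
      str_detect(cause_text, "pilot's failure") ~ "Pilot's failure",
      str_detect(cause_text, "engine|fuel") ~ "Engine or fuel",
      TRUE ~ "Other"
    )
  )

# Waffle plot input: number of crashes per cause category
cause_counts <- count(crashes_grouped, cause_group, name = "crashes")

# Density plot: crashes attributed to "pilot's failure" over time, by injury level
pilot_density <- crashes_grouped |>
  filter(cause_group == "Pilot's failure") |>
  ggplot(aes(x = event_date, fill = highest_injury_level)) +
  geom_density(alpha = 0.5)
```

With the real data, each injury level needs at least two crashes for geom_density to draw a curve, and the apostrophe in the pattern has to match the character actually used in the NTSB text.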

Assessing the Influence of Weather Conditions on Crashes

  1. Radar Plot - Analysis of Aircraft Crashes by Month and Weather Conditions:

    Using a radar plot, we would like to display multiple weather conditions on the axes of the plot, while the spokes represent different months. Each point on the radar plot signifies the frequency of crashes occurring in a specific month under a particular weather condition. This visualization enables a quick assessment of the relationship between weather conditions, crash occurrences, and the month in which each crash occurred.
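
The radar plot needs one frequency per month and weather condition. A minimal sketch of that aggregation is shown below; the toy rows are invented, and the weather codes "IMC" and "VMC" are assumed example values rather than a confirmed list of the categories in weather_condition.

```r
library(dplyr)

# Hypothetical toy rows standing in for flights_ntsb (all values invented)
crashes_toy <- tibble::tibble(
  event_date = as.Date(c("1990-01-10", "1995-01-22", "2001-07-04", "2010-07-19")),
  weather_condition = c("IMC", "VMC", "VMC", "VMC")
)

# Count crashes for every observed combination of calendar month and weather condition
month_weather <- crashes_toy |>
  filter(!is.na(weather_condition)) |>
  mutate(
    month = factor(month.abb[as.integer(format(event_date, "%m"))], levels = month.abb)
  ) |>
  count(month, weather_condition, name = "crashes")
```

Storing month as a factor with levels in calendar order keeps the months in the right sequence around the plot instead of sorting them alphabetically.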

Ultimately, we plan to utilize the interactive features of the Shiny application to present these plots in a user-friendly manner.
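
As a rough illustration of how a plot could be wired into Shiny, the sketch below builds a one-input app around a small invented data frame. It is only a skeleton under assumptions: yearly_toy, the input id injury_type, and the output id trend are hypothetical names, and the final app will likely be organized differently.

```r
library(shiny)
library(ggplot2)

# Hypothetical yearly totals (invented numbers) standing in for the aggregated crash data
yearly_toy <- data.frame(
  year = rep(2000:2002, times = 2),
  injury_type = rep(c("fatal", "serious"), each = 3),
  total = c(5, 3, 4, 2, 6, 1)
)

# UI: a drop-down to choose the injury type and a placeholder for the plot
ui <- fluidPage(
  titlePanel("Aircraft crashes (sketch)"),
  selectInput("injury_type", "Injury type", choices = unique(yearly_toy$injury_type)),
  plotOutput("trend")
)

# Server: redraw the line plot whenever the selected injury type changes
server <- function(input, output, session) {
  output$trend <- renderPlot({
    selected <- yearly_toy[yearly_toy$injury_type == input$injury_type, ]
    ggplot(selected, aes(x = year, y = total)) +
      geom_line() +
      geom_point()
  })
}

app <- shinyApp(ui, server)
# To launch the app in an interactive session: runApp(app)
```

Each additional plot would follow the same pattern: an input control and an output placeholder in the UI, plus a matching render function in the server.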

Variables of focus

| Variable | Description |
|---|---|
| event_type | Type of event: accident, incident, or occurrence |
| event_date | Date and time when the event occurred |
| city | The city or place closest to the site of the event |
| state | The state in which the site of the event is located |
| highest_injury_level | Indicates the highest level of injury among all injuries sustained as a result of the event |
| fatal_injury_count | The total number of fatal injuries from an event |
| serious_injury_count | The total number of serious injuries from an event |
| minor_injury_count | The total number of minor injuries from an event |
| probable_cause | The probable cause of the aircraft crash as per the NTSB report |
| latitude | Latitude of the event site in degrees and decimal degrees |
| longitude | Longitude of the event site in degrees and decimal degrees |
| weather_condition | The basic weather conditions at the time of the event |

Implementation

Weekly Plan

| Week | Weekly Tasks | Persons in Charge | Backup |
|---|---|---|---|
| until November 8th | Explore and finalize the data set and the problem statements | Everyone | Everyone |
| - | Complete the proposal and assign some high-level tasks | Everyone | Everyone |
| November 9th to 15th | Getting to know about Shiny application | Antonio | Bharath |
| - | Data cleaning and Data pre-processing | Thanoosha | Eshaan |
| - | Question specific exploration and data categorization | Thanoosha, Bharath | Antonio, Eshaan |
| November 16th to 22nd | Generating plots for Heat-Map and Time-Series | Antonio, Bharath | Eshaan |
| - | Generating plots for Radial bar Plot and Bar Plot | Eshaan, Antonio | Thanoosha |
| - | Generating plots for Stacked Area Chart and Radar Plot | Thanoosha | Bharath |
| - | Exploring on how to integrate our specific visualizations and shiny | Eshaan | Antonio |
| November 23rd to 29th | Generating remaining parts of the plots for all the plots | Everyone | Everyone |
| - | Improving the generated plots | Bharath | Thanoosha |
| - | Start integrating shiny and our visualizations | Eshaan | Antonio |
| November 30th to December 6th | Refining the code for code review with comments | Everyone | Everyone |
| - | Continue with the integration of shiny and our plots | Bharath | Thanoosha |
| December 7th to 13th | Complete the shiny application with multiple user functionality | Antonio | Eshaan |
| - | Review the generated plots and shiny integration | Thanoosha | Bharath |
| - | Write-up and presentation for the project | Everyone | Everyone |

Repo Organization

The following are the folders involved in the Project repository.

  • ‘data/’: Used for storing any necessary data files for the project, such as input files.

  • ‘images/’: Used for storing image files used in the project.

  • ‘presentation_files/’: Used for storing presentation-related files.

  • ‘_extra/’: Used for brainstorming analyses that do not affect the project workflow.

  • ‘_freeze/’: This folder is used to store the generated files during the build process. These files represent the frozen state of the website at a specific point in time.

  • ‘_site/’: Folder used to store the generated static website files after the site generator processes the Quarto document.

  • ‘.github/’: Folder for storing GitHub templates and workflows.

We will create a few folders inside the images/ folder to store the question-specific and presentation-related images generated during the project: images/Q1, images/Q2, and images/Presentation.

Note:

These are our planned approaches for exploring and solving the problem statement we came up with. Parts of our approach might change in the final implementation.