Client Report - Project 2: Late flights and missing data (JSON files)

Course DS 250

Author

Leonardo Alvarino

Elevator pitch

paste your elevator pitch here A SHORT (4-5 SENTENCES) PARAGRAPH THAT DESCRIBES KEY INSIGHTS TAKEN FROM METRICS IN THE PROJECT RESULTS THINK TOP OR MOST IMPORTANT RESULTS.

In this week’s project we cleaned and analized flights data, which included delayed flights and how long these were. We determined that the Chicago O’Hare international airport had the worst delays (longest) while San Diego International Airport had the best. And we also identified September as the best month to fly, with the lowest proportion of delays (16.35%). This information could be really valuable for those who want to travel soon.

The identification of September as the best month to fly, with the lowest proportion of delays (16.35%), offers valuable information for travel planning.

Show the code
df = pd.read_json("flights_missing.json")

Highlight the Questions and Tasks

QUESTION|TASK 1

Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.

Here, in this row, we can see a missing value in the ‘minutes_delayed_carrier’.

Show the code
#fill Null values with 'nan' in year
df['year'] = df['year'].fillna('nan')
#drop all the 'nan'
df = df[df['year'] != 'nan']
#changing year from float to int
df['year'] = df['year'].astype(np.int64)
# extract the null rows from 'minutes_delayed_carrier'  
df_mdc_null = df[df['minutes_delayed_carrier'].isnull()]
# show first row
df_mdc_null.head(1)
airport_code airport_name month year num_of_flights_total num_of_delays_carrier num_of_delays_late_aircraft num_of_delays_nas num_of_delays_security num_of_delays_weather num_of_delays_total minutes_delayed_carrier minutes_delayed_late_aircraft minutes_delayed_nas minutes_delayed_security minutes_delayed_weather minutes_delayed_total
2 IAD January 2005 12381 414 1058 895 4 61 2430 NaN 70919 35660.0 208 4497 134881

QUESTION|TASK 2

Which airport has the worst delays? Discuss the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.

Eventhough ‘SFO’ airport has the worst proportion of number of delays (26%). This doesn’t mean that their delays are the worst or the longest. In this case it will make more sense to look at the hours delayed mean, which represent the average of total hours delayed per month, and the aiport ‘ORD’ has the highest average with 7134 hours.

Show the code
#group by per airport_code and summarize 'num_of_flights_total' and 'num_of_delays_total'
df2 = df.groupby(['airport_code'])[['num_of_flights_total','num_of_delays_total']].agg(['sum'])
#get a new column: proportion in %
df2['proportion'] = round(df2['num_of_delays_total'] /df2['num_of_flights_total'] *100,2)
#add the mean of hours_delayed to the table with group by
df2['hours_delayed_mean'] = (df.groupby(['airport_code'])[['minutes_delayed_total']].agg(['mean']) / 60).round(2)

df2_sorted = df2.sort_values('hours_delayed_mean', ascending=False)
df2_sorted
num_of_flights_total num_of_delays_total proportion hours_delayed_mean
sum sum
airport_code
ORD 3597588 830825 23.09 7115.67
ATL 3648864 749291 20.53 6816.83
SFO 1630945 425604 26.10 3352.33
DEN 2513974 468519 18.64 3178.46
IAD 851571 168467 19.78 1298.42
SLC 1403384 205160 14.62 1278.20
SAN 917862 175132 19.08 1044.98

Findings: - ‘ORD’ airport has the longest average delayed times per month with 7134 while ‘SAN’ airport has the shortest with only 1047. - ‘SFO’ has the highest proportion of delayed flights with 26.09% while ‘SLC’ has the best with only 14.61%

QUESTION|TASK 3

What is the best month to fly if you want to avoid delays of any length? Discuss the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)

From the chart, we can see that the best time to avoid delays of any length would be september with 16.35% of delayed flights from the total. This metric is the best in this case since we only want to know the chances that we have to have a delayed flight, without considering the time of this delays.

Show the code
#drop all the n/a values in month column
df = df[df['month'] != 'n/a']
#summarize num of flights and num of delays per month 
df3 = df.groupby(['month'])[['num_of_flights_total','num_of_delays_total']].agg(['sum'])
#create proportion of delays/total of flights column
df3['proportion'] = round(df3['num_of_delays_total'] /df3['num_of_flights_total'] *100,2)
#get the month column to b able to make a bar chart
df3=df3.reset_index()
#list to sort
sort_order=['January', 'Febuary', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
df3.index = pd.CategoricalIndex(df3['month'], categories = sort_order, ordered = True)
#graph sorted by the index
px.bar(df3.sort_index(), x="month" ,y = "proportion")

Findings: - Best month to travel would be September with 16.35% of having a delayed flight according to the data in this airports. - Worst time to travel would be December (25.68%), followed by June with 24.58%.

QUESTION|TASK 4

According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:

Here we can see the 5 top rows with the new column.

Show the code
#take off the negatives -999 to calculate the mean without them
df['num_of_delays_late_aircraft']=df['num_of_delays_late_aircraft'].replace(-999, np.nan)

#assign the mean to the missing values
df['num_of_delays_late_aircraft'] = df['num_of_delays_late_aircraft'].replace(np.nan,  df['num_of_delays_late_aircraft'].mean())

#create new column with assign
df = round((df.assign(delayed_by_weather_sev_mild = 
           np.where( 
                df['month'].isin(['April,May,June,July,August']),
                #if month is within april and August, execute this
                df['num_of_delays_nas'] * .40 + df['num_of_delays_weather'] + 0.3 * df['num_of_delays_late_aircraft'],
                #else
                df['num_of_delays_nas'] * .65 + df['num_of_delays_weather'] + 0.3 * df['num_of_delays_late_aircraft'] 
           )
    )
),1)
df.head(5)
airport_code airport_name month year num_of_flights_total num_of_delays_carrier num_of_delays_late_aircraft num_of_delays_nas num_of_delays_security num_of_delays_weather num_of_delays_total minutes_delayed_carrier minutes_delayed_late_aircraft minutes_delayed_nas minutes_delayed_security minutes_delayed_weather minutes_delayed_total delayed_by_weather_sev_mild
0 ATL Atlanta, GA: Hartsfield-Jackson Atlanta Intern... January 2005 35048 1500+ 1088.9 4598 10 448 8355 116423.0 104415 207467.0 297 36931 465533 3763.4
1 DEN Denver, CO: Denver International January 2005 12687 1041 928.0 935 11 233 3153 53537.0 70301 36817.0 363 21779 182797 1119.2
2 IAD January 2005 12381 414 1058.0 895 4 61 2430 NaN 70919 35660.0 208 4497 134881 960.2
3 ORD Chicago, IL: Chicago O'Hare International January 2005 28194 1197 2255.0 5415 5 306 9178 88691.0 160811 364382.0 151 24859 638894 4502.2
4 SAN San Diego, CA: San Diego International January 2005 7283 572 680.0 638 7 56 1952 27436.0 38445 21127.0 218 4326 91552 674.7
  1. 100% of delayed flights in the Weather category are due to weather

b)30% of all delayed flights in the Late-Arriving category are due to weather.

c)From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%.

QUESTION|TASK 5

Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Discuss what you learn from this graph.

From this graph we can interpret that ‘SFO’ airport has the highest proportion of delays because of climate issues with 11.04%, this means that 11 out of 100 flights have delays from severe or mild climate. In the other hand, SLC airport has the lowest ratio with only 4.63%.

Show the code
#summarize by airport, getting the sum of number of flights and delayed by weather
df_agg5 = df.groupby(['airport_code'])[['num_of_flights_total','delayed_by_weather_sev_mild']].agg(['sum'])
#calculate the proportion of delayed flights
df_agg5['proportion'] =  df_agg5['delayed_by_weather_sev_mild'] /df_agg5['num_of_flights_total'] * 100
#getting the airport code as a column
df_agg5 = df_agg5.reset_index()
#plot
px.bar(df_agg5, x="airport_code" ,y = "proportion")