Show the code
= pd.read_json("flights_missing.json") df
paste your elevator pitch here A SHORT (4-5 SENTENCES) PARAGRAPH THAT DESCRIBES KEY INSIGHTS
TAKEN FROM METRICS IN THE PROJECT RESULTS THINK TOP OR MOST IMPORTANT RESULTS.
In this week’s project we cleaned and analized flights data, which included delayed flights and how long these were. We determined that the Chicago O’Hare international airport had the worst delays (longest) while San Diego International Airport had the best. And we also identified September as the best month to fly, with the lowest proportion of delays (16.35%). This information could be really valuable for those who want to travel soon.
The identification of September as the best month to fly, with the lowest proportion of delays (16.35%), offers valuable information for travel planning.
= pd.read_json("flights_missing.json") df
Highlight the Questions and Tasks
Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.
Here, in this row, we can see a missing value in the ‘minutes_delayed_carrier’.
#fill Null values with 'nan' in year
'year'] = df['year'].fillna('nan')
df[#drop all the 'nan'
= df[df['year'] != 'nan']
df #changing year from float to int
'year'] = df['year'].astype(np.int64)
df[# extract the null rows from 'minutes_delayed_carrier'
= df[df['minutes_delayed_carrier'].isnull()]
df_mdc_null # show first row
1) df_mdc_null.head(
airport_code | airport_name | month | year | num_of_flights_total | num_of_delays_carrier | num_of_delays_late_aircraft | num_of_delays_nas | num_of_delays_security | num_of_delays_weather | num_of_delays_total | minutes_delayed_carrier | minutes_delayed_late_aircraft | minutes_delayed_nas | minutes_delayed_security | minutes_delayed_weather | minutes_delayed_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | IAD | January | 2005 | 12381 | 414 | 1058 | 895 | 4 | 61 | 2430 | NaN | 70919 | 35660.0 | 208 | 4497 | 134881 |
Which airport has the worst delays? Discuss the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.
Eventhough ‘SFO’ airport has the worst proportion of number of delays (26%). This doesn’t mean that their delays are the worst or the longest. In this case it will make more sense to look at the hours delayed mean, which represent the average of total hours delayed per month, and the aiport ‘ORD’ has the highest average with 7134 hours.
#group by per airport_code and summarize 'num_of_flights_total' and 'num_of_delays_total'
= df.groupby(['airport_code'])[['num_of_flights_total','num_of_delays_total']].agg(['sum'])
df2 #get a new column: proportion in %
'proportion'] = round(df2['num_of_delays_total'] /df2['num_of_flights_total'] *100,2)
df2[#add the mean of hours_delayed to the table with group by
'hours_delayed_mean'] = (df.groupby(['airport_code'])[['minutes_delayed_total']].agg(['mean']) / 60).round(2)
df2[
= df2.sort_values('hours_delayed_mean', ascending=False)
df2_sorted df2_sorted
num_of_flights_total | num_of_delays_total | proportion | hours_delayed_mean | |
---|---|---|---|---|
sum | sum | |||
airport_code | ||||
ORD | 3597588 | 830825 | 23.09 | 7115.67 |
ATL | 3648864 | 749291 | 20.53 | 6816.83 |
SFO | 1630945 | 425604 | 26.10 | 3352.33 |
DEN | 2513974 | 468519 | 18.64 | 3178.46 |
IAD | 851571 | 168467 | 19.78 | 1298.42 |
SLC | 1403384 | 205160 | 14.62 | 1278.20 |
SAN | 917862 | 175132 | 19.08 | 1044.98 |
Findings: - ‘ORD’ airport has the longest average delayed times per month with 7134 while ‘SAN’ airport has the shortest with only 1047. - ‘SFO’ has the highest proportion of delayed flights with 26.09% while ‘SLC’ has the best with only 14.61%
What is the best month to fly if you want to avoid delays of any length? Discuss the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)
From the chart, we can see that the best time to avoid delays of any length would be september with 16.35% of delayed flights from the total. This metric is the best in this case since we only want to know the chances that we have to have a delayed flight, without considering the time of this delays.
#drop all the n/a values in month column
= df[df['month'] != 'n/a']
df #summarize num of flights and num of delays per month
= df.groupby(['month'])[['num_of_flights_total','num_of_delays_total']].agg(['sum'])
df3 #create proportion of delays/total of flights column
'proportion'] = round(df3['num_of_delays_total'] /df3['num_of_flights_total'] *100,2)
df3[#get the month column to b able to make a bar chart
=df3.reset_index()
df3#list to sort
=['January', 'Febuary', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
sort_order= pd.CategoricalIndex(df3['month'], categories = sort_order, ordered = True)
df3.index #graph sorted by the index
="month" ,y = "proportion") px.bar(df3.sort_index(), x
Findings: - Best month to travel would be September with 16.35% of having a delayed flight according to the data in this airports. - Worst time to travel would be December (25.68%), followed by June with 24.58%.
According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:
Here we can see the 5 top rows with the new column.
#take off the negatives -999 to calculate the mean without them
'num_of_delays_late_aircraft']=df['num_of_delays_late_aircraft'].replace(-999, np.nan)
df[
#assign the mean to the missing values
'num_of_delays_late_aircraft'] = df['num_of_delays_late_aircraft'].replace(np.nan, df['num_of_delays_late_aircraft'].mean())
df[
#create new column with assign
= round((df.assign(delayed_by_weather_sev_mild =
df
np.where( 'month'].isin(['April,May,June,July,August']),
df[#if month is within april and August, execute this
'num_of_delays_nas'] * .40 + df['num_of_delays_weather'] + 0.3 * df['num_of_delays_late_aircraft'],
df[#else
'num_of_delays_nas'] * .65 + df['num_of_delays_weather'] + 0.3 * df['num_of_delays_late_aircraft']
df[
)
)1)
),5) df.head(
airport_code | airport_name | month | year | num_of_flights_total | num_of_delays_carrier | num_of_delays_late_aircraft | num_of_delays_nas | num_of_delays_security | num_of_delays_weather | num_of_delays_total | minutes_delayed_carrier | minutes_delayed_late_aircraft | minutes_delayed_nas | minutes_delayed_security | minutes_delayed_weather | minutes_delayed_total | delayed_by_weather_sev_mild | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ATL | Atlanta, GA: Hartsfield-Jackson Atlanta Intern... | January | 2005 | 35048 | 1500+ | 1088.9 | 4598 | 10 | 448 | 8355 | 116423.0 | 104415 | 207467.0 | 297 | 36931 | 465533 | 3763.4 |
1 | DEN | Denver, CO: Denver International | January | 2005 | 12687 | 1041 | 928.0 | 935 | 11 | 233 | 3153 | 53537.0 | 70301 | 36817.0 | 363 | 21779 | 182797 | 1119.2 |
2 | IAD | January | 2005 | 12381 | 414 | 1058.0 | 895 | 4 | 61 | 2430 | NaN | 70919 | 35660.0 | 208 | 4497 | 134881 | 960.2 | |
3 | ORD | Chicago, IL: Chicago O'Hare International | January | 2005 | 28194 | 1197 | 2255.0 | 5415 | 5 | 306 | 9178 | 88691.0 | 160811 | 364382.0 | 151 | 24859 | 638894 | 4502.2 |
4 | SAN | San Diego, CA: San Diego International | January | 2005 | 7283 | 572 | 680.0 | 638 | 7 | 56 | 1952 | 27436.0 | 38445 | 21127.0 | 218 | 4326 | 91552 | 674.7 |
b)30% of all delayed flights in the Late-Arriving category are due to weather.
c)From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%.
Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Discuss what you learn from this graph.
From this graph we can interpret that ‘SFO’ airport has the highest proportion of delays because of climate issues with 11.04%, this means that 11 out of 100 flights have delays from severe or mild climate. In the other hand, SLC airport has the lowest ratio with only 4.63%.
#summarize by airport, getting the sum of number of flights and delayed by weather
= df.groupby(['airport_code'])[['num_of_flights_total','delayed_by_weather_sev_mild']].agg(['sum'])
df_agg5 #calculate the proportion of delayed flights
'proportion'] = df_agg5['delayed_by_weather_sev_mild'] /df_agg5['num_of_flights_total'] * 100
df_agg5[#getting the airport code as a column
= df_agg5.reset_index()
df_agg5 #plot
="airport_code" ,y = "proportion") px.bar(df_agg5, x