Client Report - Project 2: Late flights and missing data (JSON files)

Elevator pitch

paste your elevator pitch here A SHORT (4-5 SENTENCES) PARAGRAPH THAT DESCRIBES KEY INSIGHTS TAKEN FROM METRICS IN THE PROJECT RESULTS THINK TOP OR MOST IMPORTANT RESULTS.

In this week’s project we cleaned and analized flights data, which included delayed flights and how long these were. We determined that the Chicago O’Hare international airport had the worst delays (longest) while San Diego International Airport had the best. And we also identified September as the best month to fly, with the lowest proportion of delays (16.35%). This information could be really valuable for those who want to travel soon.

The identification of September as the best month to fly, with the lowest proportion of delays (16.35%), offers valuable information for travel planning.

Show the code

df = pd.read_json("flights_missing.json")

Highlight the Questions and Tasks

QUESTION|TASK 1

Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.

Here, in this row, we can see a missing value in the ‘minutes_delayed_carrier’.

Show the code

#fill Null values with 'nan' in year
df['year'] = df['year'].fillna('nan')
#drop all the 'nan'
df = df[df['year'] != 'nan']
#changing year from float to int
df['year'] = df['year'].astype(np.int64)
# extract the null rows from 'minutes_delayed_carrier'  
df_mdc_null = df[df['minutes_delayed_carrier'].isnull()]
# show first row
df_mdc_null.head(1)

	airport_code	airport_name	month	year	num_of_flights_total	num_of_delays_carrier	num_of_delays_late_aircraft	num_of_delays_nas	num_of_delays_security	num_of_delays_weather	num_of_delays_total	minutes_delayed_carrier	minutes_delayed_late_aircraft	minutes_delayed_nas	minutes_delayed_security	minutes_delayed_weather	minutes_delayed_total
2	IAD		January	2005	12381	414	1058	895	4	61	2430	NaN	70919	35660.0	208	4497	134881

QUESTION|TASK 2

Which airport has the worst delays? Discuss the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.

Eventhough ‘SFO’ airport has the worst proportion of number of delays (26%). This doesn’t mean that their delays are the worst or the longest. In this case it will make more sense to look at the hours delayed mean, which represent the average of total hours delayed per month, and the aiport ‘ORD’ has the highest average with 7134 hours.

Show the code

#group by per airport_code and summarize 'num_of_flights_total' and 'num_of_delays_total'
df2 = df.groupby(['airport_code'])[['num_of_flights_total','num_of_delays_total']].agg(['sum'])
#get a new column: proportion in %
df2['proportion'] = round(df2['num_of_delays_total'] /df2['num_of_flights_total'] *100,2)
#add the mean of hours_delayed to the table with group by
df2['hours_delayed_mean'] = (df.groupby(['airport_code'])[['minutes_delayed_total']].agg(['mean']) / 60).round(2)

df2_sorted = df2.sort_values('hours_delayed_mean', ascending=False)
df2_sorted

	num_of_flights_total	num_of_delays_total	proportion	hours_delayed_mean
	sum	sum
airport_code
ORD	3597588	830825	23.09	7115.67
ATL	3648864	749291	20.53	6816.83
SFO	1630945	425604	26.10	3352.33
DEN	2513974	468519	18.64	3178.46
IAD	851571	168467	19.78	1298.42
SLC	1403384	205160	14.62	1278.20
SAN	917862	175132	19.08	1044.98

Findings: - ‘ORD’ airport has the longest average delayed times per month with 7134 while ‘SAN’ airport has the shortest with only 1047. - ‘SFO’ has the highest proportion of delayed flights with 26.09% while ‘SLC’ has the best with only 14.61%

QUESTION|TASK 3

What is the best month to fly if you want to avoid delays of any length? Discuss the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)

From the chart, we can see that the best time to avoid delays of any length would be september with 16.35% of delayed flights from the total. This metric is the best in this case since we only want to know the chances that we have to have a delayed flight, without considering the time of this delays.

Show the code

#drop all the n/a values in month column
df = df[df['month'] != 'n/a']
#summarize num of flights and num of delays per month 
df3 = df.groupby(['month'])[['num_of_flights_total','num_of_delays_total']].agg(['sum'])
#create proportion of delays/total of flights column
df3['proportion'] = round(df3['num_of_delays_total'] /df3['num_of_flights_total'] *100,2)
#get the month column to b able to make a bar chart
df3=df3.reset_index()
#list to sort
sort_order=['January', 'Febuary', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
df3.index = pd.CategoricalIndex(df3['month'], categories = sort_order, ordered = True)
#graph sorted by the index
px.bar(df3.sort_index(), x="month" ,y = "proportion")

Findings: - Best month to travel would be September with 16.35% of having a delayed flight according to the data in this airports. - Worst time to travel would be December (25.68%), followed by June with 24.58%.

QUESTION|TASK 4

According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:

Here we can see the 5 top rows with the new column.

Show the code

#take off the negatives -999 to calculate the mean without them
df['num_of_delays_late_aircraft']=df['num_of_delays_late_aircraft'].replace(-999, np.nan)

#assign the mean to the missing values
df['num_of_delays_late_aircraft'] = df['num_of_delays_late_aircraft'].replace(np.nan,  df['num_of_delays_late_aircraft'].mean())

#create new column with assign
df = round((df.assign(delayed_by_weather_sev_mild = 
           np.where( 
                df['month'].isin(['April,May,June,July,August']),
                #if month is within april and August, execute this
                df['num_of_delays_nas'] * .40 + df['num_of_delays_weather'] + 0.3 * df['num_of_delays_late_aircraft'],
                #else
                df['num_of_delays_nas'] * .65 + df['num_of_delays_weather'] + 0.3 * df['num_of_delays_late_aircraft'] 
           )
    )
),1)
df.head(5)

	airport_code	airport_name	month	year	num_of_flights_total	num_of_delays_carrier	num_of_delays_late_aircraft	num_of_delays_nas	num_of_delays_security	num_of_delays_weather	num_of_delays_total	minutes_delayed_carrier	minutes_delayed_late_aircraft	minutes_delayed_nas	minutes_delayed_security	minutes_delayed_weather	minutes_delayed_total	delayed_by_weather_sev_mild
0	ATL	Atlanta, GA: Hartsfield-Jackson Atlanta Intern...	January	2005	35048	1500+	1088.9	4598	10	448	8355	116423.0	104415	207467.0	297	36931	465533	3763.4
1	DEN	Denver, CO: Denver International	January	2005	12687	1041	928.0	935	11	233	3153	53537.0	70301	36817.0	363	21779	182797	1119.2
2	IAD		January	2005	12381	414	1058.0	895	4	61	2430	NaN	70919	35660.0	208	4497	134881	960.2
3	ORD	Chicago, IL: Chicago O'Hare International	January	2005	28194	1197	2255.0	5415	5	306	9178	88691.0	160811	364382.0	151	24859	638894	4502.2
4	SAN	San Diego, CA: San Diego International	January	2005	7283	572	680.0	638	7	56	1952	27436.0	38445	21127.0	218	4326	91552	674.7

100% of delayed flights in the Weather category are due to weather

b)30% of all delayed flights in the Late-Arriving category are due to weather.

c)From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%.

QUESTION|TASK 5

Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Discuss what you learn from this graph.

From this graph we can interpret that ‘SFO’ airport has the highest proportion of delays because of climate issues with 11.04%, this means that 11 out of 100 flights have delays from severe or mild climate. In the other hand, SLC airport has the lowest ratio with only 4.63%.

Show the code

#summarize by airport, getting the sum of number of flights and delayed by weather
df_agg5 = df.groupby(['airport_code'])[['num_of_flights_total','delayed_by_weather_sev_mild']].agg(['sum'])
#calculate the proportion of delayed flights
df_agg5['proportion'] =  df_agg5['delayed_by_weather_sev_mild'] /df_agg5['num_of_flights_total'] * 100
#getting the airport code as a column
df_agg5 = df_agg5.reset_index()
#plot
px.bar(df_agg5, x="airport_code" ,y = "proportion")