In this week’s project, we discovered that players with very few at-bats rank at the top in batting average. The top 5 performers have only 2 or 3 at-bats each. However, as we examined players with more at-bats, we observed a decrease in hits and batting average. Additionally, we explored the variations in average salary between two highly popular baseball players throughout the years, discovering almost a 9x and 7x increase in around 30 years.
q =''' SELECT name FROM sqlite_master WHERE type='table' '''table = pd.read_sql_query(q,con)table
name
0
allstarfull
1
appearances
2
awardsmanagers
3
awardsplayers
4
awardssharemanagers
5
awardsshareplayers
6
batting
7
battingpost
8
collegeplaying
9
divisions
10
fielding
11
fieldingof
12
fieldingofsplit
13
fieldingpost
14
halloffame
15
homegames
16
leagues
17
managers
18
managershalf
19
parks
20
people
21
pitching
22
pitchingpost
23
salaries
24
schools
25
seriespost
26
teams
27
teamsfranchises
28
teamshalf
Highlight the Questions and Tasks
QUESTION|TASK 1
Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.
From the table we can see that only 2 players of the dataset attended BYUI, being Lindsma01 the one that had the highest salary in 2014 with 4,000,000 while stephga01 had the lowest salary in 1997 with 150,000. This is a huge difference, but we have to keep in mind that after almost 20 years the inflation plays a huge role in the real value of those salaries.
Show the code
q =''' SELECT DISTINCT s.playerID, cp.schoolID, s.yearID, cast(s.salary as int) as salary FROM salaries s JOIN collegeplaying cp ON s.playerID = cp.playerID WHERE cp.schoolID ='idbyuid' ORDER BY s.salary desc '''table = pd.read_sql_query(q,con)table
playerID
schoolID
yearID
salary
0
lindsma01
idbyuid
2014
4000000
1
lindsma01
idbyuid
2012
3600000
2
lindsma01
idbyuid
2011
2800000
3
lindsma01
idbyuid
2013
2300000
4
lindsma01
idbyuid
2010
1625000
5
stephga01
idbyuid
2001
1025000
6
stephga01
idbyuid
2002
900000
7
stephga01
idbyuid
2003
800000
8
stephga01
idbyuid
2000
550000
9
lindsma01
idbyuid
2009
410000
10
lindsma01
idbyuid
2008
395000
11
lindsma01
idbyuid
2007
380000
12
stephga01
idbyuid
1999
215000
13
stephga01
idbyuid
1998
185000
14
stephga01
idbyuid
1997
150000
Findings:
- Both players never played in the same year, stephga01 had his last salary recorded in 2003 while lindsma01 had his first salary recorded in 2007.
- BYUI players only have 15 years recorded (15 rows)
- The highest salary was recorder in the latest year while the lowest in the oldest.
QUESTION|TASK 2
__This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)
Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.
We can see in the following table that the Top 5 players by batting average have just a few at bats, 2 or 3, this makes it easier to have a 100%.
Show the code
q ='''SELECT playerID, yearID, sum(H),sum(AB),(sum(H)/sum (AB)) as BA from batting group by playerID, yearID HAVING sum(AB) > 1 ORDER BY BA desc,playerID limit 5 '''table = pd.read_sql_query(q,con)table
playerID
yearID
sum(H)
sum(AB)
BA
0
aybarma01
2001
3
3
1
1
birasst01
1944
2
2
1
2
brideji01
1953
3
3
1
3
brownha01
1951
2
2
1
4
couside01
1923
2
2
1
Findings:
- All the five players have the same batting average.
- Four out of five of the players showed in the table are from the 50's or before.
- Aybarman01 was the only player in the BA top5 withing the 2000's.
- There aren't any players with more than 3 hits.
Use the same query as above, but only include players with at least 10 at bats that year. Print the top 5 results.
However, if we only take into account players with 10 or more at bats, we start seeing lower batting averages.
Show the code
q ='''SELECT playerID, yearID, sum(H), sum(AB), (CAST(sum(H) as float) /cast (sum (AB) as float)) as BA from batting group by playerID, yearID HAVING sum(AB) > 9 ORDER BY BA desc, playerID limit 5 '''table = pd.read_sql_query(q,con)table
playerID
yearID
sum(H)
sum(AB)
BA
0
nymanny01
1974
9
14
0.642857
1
carsoma01
2013
7
11
0.636364
2
altizda01
1910
6
10
0.600000
3
silvech01
1948
8
14
0.571429
4
puccige01
1930
9
16
0.562500
Findings:
- Carsonma01, who got second place, is the only player with a BA within the 2000's.
- Nobody has a 100% batting average, everyone is below 65%.
- Nymanny01 holds the record of highest BA (for players with more 9 AB a year) since 1974.
Now calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats, and print the top 5 results.
The same happens if we filter only players with 100 or more at bats, the top batting average keeps decreasing with only 36.6%. It is hard to mantain a high batting average when you keep hiting.
Show the code
q ='''SELECT playerID, yearID, sum(H), sum(AB), (CAST(sum(H) as float) /cast (sum (AB) as float)) as BA from batting group by playerID HAVING sum(AB) > 99 ORDER BY BA desc,playerID limit 5 '''table = pd.read_sql_query(q,con)table
playerID
yearID
sum(H)
sum(AB)
BA
0
cobbty01
1905
4189
11436
0.366299
1
barnero01
1871
860
2391
0.359682
2
hornsro01
1915
2930
8173
0.358497
3
jacksjo01
1908
1772
4981
0.355752
4
meyerle01
1871
513
1443
0.355509
Findings:
- All the top 5 players hold these records for more than 100 years.
- Nobody has a 100% batting average like in the a) section, everyone is below 37%.
- Cobbty01 holds the record of highest BA (for players with 100 or more AB a year) since 1905, almost 120 years ago
QUESTION|TASK 3
Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc). Write an SQL query to get the data you need, then make a graph using Plotly Express to visualize the comparison. What do you learn?
In the next graph we are comparing the average salary of New York Yankees and Boston Red Sox players from 1985 to 2016. We learn that the salaries were very close until the 2000’s and since then, the New York Yankees have mantained the lead with around $2M more in average. We can also see that the average salary has increased over the years, from less than $1M to over $6M. This is partially explained with inflation but there must be other factors too.
Show the code
q ='''SELECT teamID, yearID, cast(avg(salary) as int) as Avg_Salary FROM salaries where teamID = 'BOS' or teamid = 'NYA' group by teamID, yearID '''table = pd.read_sql_query(q,con)px.line(table,x='yearID',y='Avg_Salary', color ='teamID')
Findings:
- Boston Red Sox always had a lower average salary compared to the Yankees, except in some eyars in the early 90's and 80's.
- New York Yankees had their highest average salary in 2010 with $8.25 M.
- Boston Red Sox had their highest average the last year recorded in the data (2016) with $6.5 M.