Show the code
# %%
= 'lahmansbaseballdb.sqlite'
sqlite_file = sqlite3.connect(sqlite_file) con
In this week’s project, we discovered that players with very few at-bats rank at the top in batting average. The top 5 performers have only 2 or 3 at-bats each. However, as we examined players with more at-bats, we observed a decrease in hits and batting average. Additionally, we explored the variations in average salary between two highly popular baseball players throughout the years, discovering almost a 9x and 7x increase in around 30 years.
# %%
= 'lahmansbaseballdb.sqlite'
sqlite_file = sqlite3.connect(sqlite_file) con
= '''
q SELECT name
FROM sqlite_master
WHERE type='table'
'''
= pd.read_sql_query(q,con)
table table
name | |
---|---|
0 | allstarfull |
1 | appearances |
2 | awardsmanagers |
3 | awardsplayers |
4 | awardssharemanagers |
5 | awardsshareplayers |
6 | batting |
7 | battingpost |
8 | collegeplaying |
9 | divisions |
10 | fielding |
11 | fieldingof |
12 | fieldingofsplit |
13 | fieldingpost |
14 | halloffame |
15 | homegames |
16 | leagues |
17 | managers |
18 | managershalf |
19 | parks |
20 | people |
21 | pitching |
22 | pitchingpost |
23 | salaries |
24 | schools |
25 | seriespost |
26 | teams |
27 | teamsfranchises |
28 | teamshalf |
Highlight the Questions and Tasks
Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.
From the table we can see that only 2 players of the dataset attended BYUI, being Lindsma01 the one that had the highest salary in 2014 with 4,000,000 while stephga01 had the lowest salary in 1997 with 150,000. This is a huge difference, but we have to keep in mind that after almost 20 years the inflation plays a huge role in the real value of those salaries.
= '''
q SELECT DISTINCT s.playerID, cp.schoolID, s.yearID, cast(s.salary as int) as salary
FROM salaries s
JOIN collegeplaying cp
ON s.playerID = cp.playerID
WHERE cp.schoolID ='idbyuid'
ORDER BY s.salary desc
'''
= pd.read_sql_query(q,con)
table table
playerID | schoolID | yearID | salary | |
---|---|---|---|---|
0 | lindsma01 | idbyuid | 2014 | 4000000 |
1 | lindsma01 | idbyuid | 2012 | 3600000 |
2 | lindsma01 | idbyuid | 2011 | 2800000 |
3 | lindsma01 | idbyuid | 2013 | 2300000 |
4 | lindsma01 | idbyuid | 2010 | 1625000 |
5 | stephga01 | idbyuid | 2001 | 1025000 |
6 | stephga01 | idbyuid | 2002 | 900000 |
7 | stephga01 | idbyuid | 2003 | 800000 |
8 | stephga01 | idbyuid | 2000 | 550000 |
9 | lindsma01 | idbyuid | 2009 | 410000 |
10 | lindsma01 | idbyuid | 2008 | 395000 |
11 | lindsma01 | idbyuid | 2007 | 380000 |
12 | stephga01 | idbyuid | 1999 | 215000 |
13 | stephga01 | idbyuid | 1998 | 185000 |
14 | stephga01 | idbyuid | 1997 | 150000 |
Findings: - Both players never played in the same year, stephga01 had his last salary recorded in 2003 while lindsma01 had his first salary recorded in 2007. - BYUI players only have 15 years recorded (15 rows) - The highest salary was recorder in the latest year while the lowest in the oldest.
__This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)
We can see in the following table that the Top 5 players by batting average have just a few at bats, 2 or 3, this makes it easier to have a 100%.
= '''SELECT playerID, yearID, sum(H),sum(AB),(sum(H)/sum (AB)) as BA
q from batting
group by playerID, yearID
HAVING sum(AB) > 1
ORDER BY BA desc,playerID
limit 5
'''
= pd.read_sql_query(q,con)
table table
playerID | yearID | sum(H) | sum(AB) | BA | |
---|---|---|---|---|---|
0 | aybarma01 | 2001 | 3 | 3 | 1 |
1 | birasst01 | 1944 | 2 | 2 | 1 |
2 | brideji01 | 1953 | 3 | 3 | 1 |
3 | brownha01 | 1951 | 2 | 2 | 1 |
4 | couside01 | 1923 | 2 | 2 | 1 |
Findings: - All the five players have the same batting average. - Four out of five of the players showed in the table are from the 50’s or before. - Aybarman01 was the only player in the BA top5 withing the 2000’s. - There aren’t any players with more than 3 hits.
However, if we only take into account players with 10 or more at bats, we start seeing lower batting averages.
= '''SELECT playerID, yearID, sum(H), sum(AB), (CAST(sum(H) as float) /cast (sum (AB) as float)) as BA
q from batting
group by playerID, yearID
HAVING sum(AB) > 9
ORDER BY BA desc, playerID
limit 5
'''
= pd.read_sql_query(q,con)
table table
playerID | yearID | sum(H) | sum(AB) | BA | |
---|---|---|---|---|---|
0 | nymanny01 | 1974 | 9 | 14 | 0.642857 |
1 | carsoma01 | 2013 | 7 | 11 | 0.636364 |
2 | altizda01 | 1910 | 6 | 10 | 0.600000 |
3 | silvech01 | 1948 | 8 | 14 | 0.571429 |
4 | puccige01 | 1930 | 9 | 16 | 0.562500 |
Findings: - Carsonma01, who got second place, is the only player with a BA within the 2000’s. - Nobody has a 100% batting average, everyone is below 65%. - Nymanny01 holds the record of highest BA (for players with more 9 AB a year) since 1974.
The same happens if we filter only players with 100 or more at bats, the top batting average keeps decreasing with only 36.6%. It is hard to mantain a high batting average when you keep hiting.
= '''SELECT playerID, yearID, sum(H), sum(AB), (CAST(sum(H) as float) /cast (sum (AB) as float)) as BA
q from batting
group by playerID
HAVING sum(AB) > 99
ORDER BY BA desc,playerID
limit 5
'''
= pd.read_sql_query(q,con)
table table
playerID | yearID | sum(H) | sum(AB) | BA | |
---|---|---|---|---|---|
0 | cobbty01 | 1905 | 4189 | 11436 | 0.366299 |
1 | barnero01 | 1871 | 860 | 2391 | 0.359682 |
2 | hornsro01 | 1915 | 2930 | 8173 | 0.358497 |
3 | jacksjo01 | 1908 | 1772 | 4981 | 0.355752 |
4 | meyerle01 | 1871 | 513 | 1443 | 0.355509 |
Findings: - All the top 5 players hold these records for more than 100 years. - Nobody has a 100% batting average like in the a) section, everyone is below 37%. - Cobbty01 holds the record of highest BA (for players with 100 or more AB a year) since 1905, almost 120 years ago
Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc). Write an SQL query to get the data you need, then make a graph using Plotly Express to visualize the comparison. What do you learn?
In the next graph we are comparing the average salary of New York Yankees and Boston Red Sox players from 1985 to 2016. We learn that the salaries were very close until the 2000’s and since then, the New York Yankees have mantained the lead with around $2M more in average. We can also see that the average salary has increased over the years, from less than $1M to over $6M. This is partially explained with inflation but there must be other factors too.
= '''SELECT teamID, yearID, cast(avg(salary) as int) as Avg_Salary
q FROM salaries
where teamID = 'BOS' or teamid = 'NYA'
group by teamID, yearID
'''
= pd.read_sql_query(q,con)
table ='yearID',y='Avg_Salary', color = 'teamID') px.line(table,x
Findings: - Boston Red Sox always had a lower average salary compared to the Yankees, except in some eyars in the early 90’s and 80’s. - New York Yankees had their highest average salary in 2010 with $8.25 M. - Boston Red Sox had their highest average the last year recorded in the data (2016) with $6.5 M.