Client Report - Project 3: Finding relationships in baseball

Elevator pitch

In this week’s project, we discovered that players with very few at-bats rank at the top in batting average. The top 5 performers have only 2 or 3 at-bats each. However, as we examined players with more at-bats, we observed a decrease in hits and batting average. Additionally, we explored the variations in average salary between two highly popular baseball players throughout the years, discovering almost a 9x and 7x increase in around 30 years.

Show the code

# %%
url = 'https://byuistats.github.io/DS250-Course/Data/lahmansbaseballdb.sqlite'
response = requests.get(url)

with open('lahmansbaseballdb.sqlite', 'wb') as f:
    f.write(response.content)

con = sqlite3.connect('lahmansbaseballdb.sqlite')

Show the code

q = '''
    SELECT name
    FROM sqlite_master 
    WHERE type='table'
    '''
table = pd.read_sql_query(q,con)
table

	name
0	allstarfull
1	appearances
2	awardsmanagers
3	awardsplayers
4	awardssharemanagers
5	awardsshareplayers
6	batting
7	battingpost
8	collegeplaying
9	divisions
10	fielding
11	fieldingof
12	fieldingofsplit
13	fieldingpost
14	halloffame
15	homegames
16	leagues
17	managers
18	managershalf
19	parks
20	people
21	pitching
22	pitchingpost
23	salaries
24	schools
25	seriespost
26	teams
27	teamsfranchises
28	teamshalf

Highlight the Questions and Tasks

QUESTION|TASK 1

Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.

From the table we can see that only 2 players of the dataset attended BYUI, being Lindsma01 the one that had the highest salary in 2014 with 4,000,000 while stephga01 had the lowest salary in 1997 with 150,000. This is a huge difference, but we have to keep in mind that after almost 20 years the inflation plays a huge role in the real value of those salaries.

Show the code

q = '''
    SELECT DISTINCT s.playerID, cp.schoolID, s.yearID, cast(s.salary as int) as salary
    FROM salaries s
    JOIN collegeplaying cp
    ON s.playerID = cp.playerID
    WHERE cp.schoolID ='idbyuid'
    ORDER BY s.salary desc
    '''
table = pd.read_sql_query(q,con)
table

	playerID	schoolID	yearID	salary
0	lindsma01	idbyuid	2014	4000000
1	lindsma01	idbyuid	2012	3600000
2	lindsma01	idbyuid	2011	2800000
3	lindsma01	idbyuid	2013	2300000
4	lindsma01	idbyuid	2010	1625000
5	stephga01	idbyuid	2001	1025000
6	stephga01	idbyuid	2002	900000
7	stephga01	idbyuid	2003	800000
8	stephga01	idbyuid	2000	550000
9	lindsma01	idbyuid	2009	410000
10	lindsma01	idbyuid	2008	395000
11	lindsma01	idbyuid	2007	380000
12	stephga01	idbyuid	1999	215000
13	stephga01	idbyuid	1998	185000
14	stephga01	idbyuid	1997	150000

Findings:

- Both players never played in the same year, stephga01 had his last salary recorded in 2003 while lindsma01 had his first salary recorded in 2007.

- BYUI players only have 15 years recorded (15 rows)

- The highest salary was recorder in the latest year while the lowest in the oldest.

QUESTION|TASK 2

__This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)

Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.

We can see in the following table that the Top 5 players by batting average have just a few at bats, 2 or 3, this makes it easier to have a 100%.

Show the code

q = '''SELECT playerID, yearID, sum(H),sum(AB),(sum(H)/sum (AB)) as BA
        from batting
        group by playerID, yearID
        HAVING sum(AB) > 1
        ORDER BY BA desc,playerID
        limit 5
    ''' 

table = pd.read_sql_query(q,con)
table

	playerID	yearID	sum(H)	sum(AB)	BA
0	aybarma01	2001	3	3	1
1	birasst01	1944	2	2	1
2	brideji01	1953	3	3	1
3	brownha01	1951	2	2	1
4	couside01	1923	2	2	1

Findings:

- All the five players have the same batting average.

- Four out of five of the players showed in the table are from the 50's or  before.

- Aybarman01 was the only player in the BA top5 withing the 2000's.

- There aren't any players with more than 3 hits.

Use the same query as above, but only include players with at least 10 at bats that year. Print the top 5 results.

However, if we only take into account players with 10 or more at bats, we start seeing lower batting averages.

Show the code

q = '''SELECT playerID, yearID, sum(H), sum(AB), (CAST(sum(H) as float) /cast (sum (AB) as float)) as BA
        from batting
        group by playerID, yearID
        HAVING sum(AB) > 9
        ORDER BY BA desc, playerID
        limit 5
    ''' 

table = pd.read_sql_query(q,con)
table

	playerID	yearID	sum(H)	sum(AB)	BA
0	nymanny01	1974	9	14	0.642857
1	carsoma01	2013	7	11	0.636364
2	altizda01	1910	6	10	0.600000
3	silvech01	1948	8	14	0.571429
4	puccige01	1930	9	16	0.562500

Findings:

- Carsonma01, who got second place, is the only player with a BA within the 2000's.

- Nobody has a 100% batting average, everyone is below 65%.

- Nymanny01 holds the record of highest BA (for players with more 9 AB a year) since 1974.

Now calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats, and print the top 5 results.

The same happens if we filter only players with 100 or more at bats, the top batting average keeps decreasing with only 36.6%. It is hard to mantain a high batting average when you keep hiting.

Show the code

q = '''SELECT playerID, yearID, sum(H), sum(AB), (CAST(sum(H) as float) /cast (sum (AB) as float)) as BA
        from batting
        group by playerID
        HAVING sum(AB) > 99
        ORDER BY BA desc,playerID
        limit 5
    ''' 

table = pd.read_sql_query(q,con)
table

	playerID	yearID	sum(H)	sum(AB)	BA
0	cobbty01	1905	4189	11436	0.366299
1	barnero01	1871	860	2391	0.359682
2	hornsro01	1915	2930	8173	0.358497
3	jacksjo01	1908	1772	4981	0.355752
4	meyerle01	1871	513	1443	0.355509

Findings:

- All the top 5 players hold these records for more than 100 years.

- Nobody has a 100% batting average like in the a) section, everyone is below 37%.

- Cobbty01 holds the record of highest BA (for players with 100 or more AB a year) since 1905, almost 120 years ago

QUESTION|TASK 3

Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc). Write an SQL query to get the data you need, then make a graph using Plotly Express to visualize the comparison. What do you learn?

In the next graph we are comparing the average salary of New York Yankees and Boston Red Sox players from 1985 to 2016. We learn that the salaries were very close until the 2000’s and since then, the New York Yankees have mantained the lead with around $2M more in average. We can also see that the average salary has increased over the years, from less than $1M to over $6M. This is partially explained with inflation but there must be other factors too.

Show the code

q = '''SELECT teamID, yearID, cast(avg(salary) as int) as Avg_Salary
    FROM salaries
    where teamID = 'BOS' or teamid = 'NYA'
    group by teamID, yearID
    ''' 
table = pd.read_sql_query(q,con)
px.line(table,x='yearID',y='Avg_Salary', color = 'teamID')

Findings:

- Boston Red Sox always had a lower average salary compared to the Yankees, except in some eyars in the early 90's and 80's.

- New York Yankees had their highest average salary in 2010 with $8.25 M.

- Boston Red Sox had their highest average the last year recorded in the data (2016) with $6.5 M.