Client Report - Project 3: Finding relationships in baseball

Course DS 250

Author

Leonardo Alvarino

Elevator pitch

In this week’s project, we discovered that players with very few at-bats rank at the top in batting average. The top 5 performers have only 2 or 3 at-bats each. However, as we examined players with more at-bats, we observed a decrease in hits and batting average. Additionally, we explored the variations in average salary between two highly popular baseball players throughout the years, discovering almost a 9x and 7x increase in around 30 years.

Show the code
# %%
sqlite_file = 'lahmansbaseballdb.sqlite'
con = sqlite3.connect(sqlite_file)
Show the code
q = '''
    SELECT name
    FROM sqlite_master 
    WHERE type='table'
    '''
table = pd.read_sql_query(q,con)
table
name
0 allstarfull
1 appearances
2 awardsmanagers
3 awardsplayers
4 awardssharemanagers
5 awardsshareplayers
6 batting
7 battingpost
8 collegeplaying
9 divisions
10 fielding
11 fieldingof
12 fieldingofsplit
13 fieldingpost
14 halloffame
15 homegames
16 leagues
17 managers
18 managershalf
19 parks
20 people
21 pitching
22 pitchingpost
23 salaries
24 schools
25 seriespost
26 teams
27 teamsfranchises
28 teamshalf

Highlight the Questions and Tasks

QUESTION|TASK 1

Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.

From the table we can see that only 2 players of the dataset attended BYUI, being Lindsma01 the one that had the highest salary in 2014 with 4,000,000 while stephga01 had the lowest salary in 1997 with 150,000. This is a huge difference, but we have to keep in mind that after almost 20 years the inflation plays a huge role in the real value of those salaries.

Show the code
q = '''
    SELECT DISTINCT s.playerID, cp.schoolID, s.yearID, cast(s.salary as int) as salary
    FROM salaries s
    JOIN collegeplaying cp
    ON s.playerID = cp.playerID
    WHERE cp.schoolID ='idbyuid'
    ORDER BY s.salary desc
    '''
table = pd.read_sql_query(q,con)
table 
playerID schoolID yearID salary
0 lindsma01 idbyuid 2014 4000000
1 lindsma01 idbyuid 2012 3600000
2 lindsma01 idbyuid 2011 2800000
3 lindsma01 idbyuid 2013 2300000
4 lindsma01 idbyuid 2010 1625000
5 stephga01 idbyuid 2001 1025000
6 stephga01 idbyuid 2002 900000
7 stephga01 idbyuid 2003 800000
8 stephga01 idbyuid 2000 550000
9 lindsma01 idbyuid 2009 410000
10 lindsma01 idbyuid 2008 395000
11 lindsma01 idbyuid 2007 380000
12 stephga01 idbyuid 1999 215000
13 stephga01 idbyuid 1998 185000
14 stephga01 idbyuid 1997 150000

Findings: - Both players never played in the same year, stephga01 had his last salary recorded in 2003 while lindsma01 had his first salary recorded in 2007. - BYUI players only have 15 years recorded (15 rows) - The highest salary was recorder in the latest year while the lowest in the oldest.

QUESTION|TASK 2

__This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)

  1. Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.

We can see in the following table that the Top 5 players by batting average have just a few at bats, 2 or 3, this makes it easier to have a 100%.

Show the code
q = '''SELECT playerID, yearID, sum(H),sum(AB),(sum(H)/sum (AB)) as BA
        from batting
        group by playerID, yearID
        HAVING sum(AB) > 1
        ORDER BY BA desc,playerID
        limit 5
    ''' 

table = pd.read_sql_query(q,con)
table
playerID yearID sum(H) sum(AB) BA
0 aybarma01 2001 3 3 1
1 birasst01 1944 2 2 1
2 brideji01 1953 3 3 1
3 brownha01 1951 2 2 1
4 couside01 1923 2 2 1

Findings: - All the five players have the same batting average. - Four out of five of the players showed in the table are from the 50’s or before. - Aybarman01 was the only player in the BA top5 withing the 2000’s. - There aren’t any players with more than 3 hits.

  1. Use the same query as above, but only include players with at least 10 at bats that year. Print the top 5 results.

However, if we only take into account players with 10 or more at bats, we start seeing lower batting averages.

Show the code
q = '''SELECT playerID, yearID, sum(H), sum(AB), (CAST(sum(H) as float) /cast (sum (AB) as float)) as BA
        from batting
        group by playerID, yearID
        HAVING sum(AB) > 9
        ORDER BY BA desc, playerID
        limit 5
    ''' 

table = pd.read_sql_query(q,con)
table
playerID yearID sum(H) sum(AB) BA
0 nymanny01 1974 9 14 0.642857
1 carsoma01 2013 7 11 0.636364
2 altizda01 1910 6 10 0.600000
3 silvech01 1948 8 14 0.571429
4 puccige01 1930 9 16 0.562500

Findings: - Carsonma01, who got second place, is the only player with a BA within the 2000’s. - Nobody has a 100% batting average, everyone is below 65%. - Nymanny01 holds the record of highest BA (for players with more 9 AB a year) since 1974.

  1. Now calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats, and print the top 5 results.__

The same happens if we filter only players with 100 or more at bats, the top batting average keeps decreasing with only 36.6%. It is hard to mantain a high batting average when you keep hiting.

Show the code
q = '''SELECT playerID, yearID, sum(H), sum(AB), (CAST(sum(H) as float) /cast (sum (AB) as float)) as BA
        from batting
        group by playerID
        HAVING sum(AB) > 99
        ORDER BY BA desc,playerID
        limit 5
    ''' 

table = pd.read_sql_query(q,con)
table
playerID yearID sum(H) sum(AB) BA
0 cobbty01 1905 4189 11436 0.366299
1 barnero01 1871 860 2391 0.359682
2 hornsro01 1915 2930 8173 0.358497
3 jacksjo01 1908 1772 4981 0.355752
4 meyerle01 1871 513 1443 0.355509

Findings: - All the top 5 players hold these records for more than 100 years. - Nobody has a 100% batting average like in the a) section, everyone is below 37%. - Cobbty01 holds the record of highest BA (for players with 100 or more AB a year) since 1905, almost 120 years ago

QUESTION|TASK 3

Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc). Write an SQL query to get the data you need, then make a graph using Plotly Express to visualize the comparison. What do you learn?

In the next graph we are comparing the average salary of New York Yankees and Boston Red Sox players from 1985 to 2016. We learn that the salaries were very close until the 2000’s and since then, the New York Yankees have mantained the lead with around $2M more in average. We can also see that the average salary has increased over the years, from less than $1M to over $6M. This is partially explained with inflation but there must be other factors too.

Show the code
q = '''SELECT teamID, yearID, cast(avg(salary) as int) as Avg_Salary
    FROM salaries
    where teamID = 'BOS' or teamid = 'NYA'
    group by teamID, yearID
    ''' 
table = pd.read_sql_query(q,con)
px.line(table,x='yearID',y='Avg_Salary', color = 'teamID')

Findings: - Boston Red Sox always had a lower average salary compared to the Yankees, except in some eyars in the early 90’s and 80’s. - New York Yankees had their highest average salary in 2010 with $8.25 M. - Boston Red Sox had their highest average the last year recorded in the data (2016) with $6.5 M.