Chapter 2 Dataset
The data consists of 20 CSV files containing team names, matchups, box score statistics, and rankings dating back to 2003. I started with the box scores of each team, since these statistics would be the basis of my models. Each row listed a Winning Team and a Losing Team along with their respective stats. I pivoted this dataset longer so that each game had two rows, one for each team. Each row contained the team's offensive statistics, such as points scored, and its defensive statistics, such as points allowed. I then created new features from the game stats: adding offensive and defensive rebounds to get total rebounds, subtracting points allowed from points scored to get point differential, and estimating possessions from shot attempts. After creating these columns, I rolled the game data up into season data by averaging each statistic over the season for each team. I also created a few other features, including Points per Possession and effective Field Goal Percentage.
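The reshaping, feature creation, and season roll-up described above can be sketched as follows. This is a minimal illustration in plain Python, not the project's actual code. The column names (`WTeam`, `LPts`, and so on) and the sample numbers are hypothetical. The possession formula (FGA - OR + TO + 0.475 * FTA) is one commonly used estimate that I assume here for illustration, and computing Points per Possession and effective Field Goal Percentage from season averages rather than game by game is also an assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical wide-format box scores: one row per game,
# with W* columns for the winner and L* columns for the loser.
games = [
    {"Season": 2018, "WTeam": "A", "LTeam": "B", "WPts": 80, "LPts": 70,
     "WFGM": 30, "LFGM": 25, "WFGM3": 8, "LFGM3": 6, "WFGA": 60, "LFGA": 62,
     "WFTA": 20, "LFTA": 16, "WOR": 10, "LOR": 8, "WDR": 25, "LDR": 22,
     "WTO": 12, "LTO": 14},
    {"Season": 2018, "WTeam": "B", "LTeam": "A", "WPts": 75, "LPts": 65,
     "WFGM": 28, "LFGM": 24, "WFGM3": 7, "LFGM3": 5, "WFGA": 58, "LFGA": 64,
     "WFTA": 18, "LFTA": 12, "WOR": 9, "LOR": 12, "WDR": 24, "LDR": 20,
     "WTO": 11, "LTO": 15},
]

STATS = ["Pts", "FGM", "FGM3", "FGA", "FTA", "OR", "DR", "TO"]
DERIVED = ["PtsAllowed", "TotReb", "PtDiff", "Poss"]


def to_long(game):
    """Turn one wide game row into two rows, one per team."""
    rows = []
    for own, opp in (("W", "L"), ("L", "W")):
        row = {"Season": game["Season"], "Team": game[own + "Team"]}
        for stat in STATS:
            row[stat] = game[own + stat]
        # The opponent's points are this team's points allowed.
        row["PtsAllowed"] = game[opp + "Pts"]
        rows.append(row)
    return rows


def add_features(row):
    """Add per-game derived columns to a long-format row."""
    row["TotReb"] = row["OR"] + row["DR"]
    row["PtDiff"] = row["Pts"] - row["PtsAllowed"]
    # Assumed possession estimate; the source only says "from shot attempts".
    row["Poss"] = row["FGA"] - row["OR"] + row["TO"] + 0.475 * row["FTA"]
    return row


def season_averages(rows):
    """Average every statistic over each (season, team) pair."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["Season"], row["Team"])].append(row)
    result = {}
    for key, team_rows in grouped.items():
        avg = {col: mean(r[col] for r in team_rows) for col in STATS + DERIVED}
        # Season-level efficiency features built from the averages.
        avg["PPP"] = avg["Pts"] / avg["Poss"]
        avg["eFG"] = (avg["FGM"] + 0.5 * avg["FGM3"]) / avg["FGA"]
        result[key] = avg
    return result


long_rows = [add_features(r) for g in games for r in to_long(g)]
season = season_averages(long_rows)
```

Here `long_rows` holds two rows per game, and `season[(2018, "A")]` holds team A's averages for the 2018 season.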
After calculating the season statistics for each team, I joined them with the rankings CSV. This file contained several rankings for each team, including the AP Poll, NET ranking, and Pomeroy ranking. I also created a sort of heat index: the change in a team’s Pomeroy ranking over the last month before the tournament. Finally, I joined the combined season and ranking data for each team to the file of team matchups for the 2003 to 2021 tournaments.
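As a rough sketch of the heat index and the join to the season statistics, consider the example below. The snapshot layout (season, team, day number, rank), the 30-day window standing in for "the last month", and the sign convention (positive means the team climbed) are all assumptions made for illustration. The feature names `PomRank` and `PomHeat` are hypothetical.

```python
# Hypothetical ranking snapshots: (season, team, day number, Pomeroy rank).
snapshots = [
    (2018, "A", 100, 40), (2018, "A", 116, 33), (2018, "A", 130, 25),
    (2018, "B", 100, 10), (2018, "B", 130, 18),
]


def heat_index(snapshots, window=30):
    """Rank change over the last `window` days of each team's ranking history.

    Positive values mean the team climbed (its rank number got smaller).
    """
    by_team = {}
    for season, team, day, rank in snapshots:
        by_team.setdefault((season, team), []).append((day, rank))

    heat = {}
    for key, history in by_team.items():
        history.sort()  # chronological order
        last_day, last_rank = history[-1]
        # Latest snapshot at or before the start of the window;
        # fall back to the earliest snapshot if none is old enough.
        earlier = [rank for day, rank in history if day <= last_day - window]
        start_rank = earlier[-1] if earlier else history[0][1]
        heat[key] = {"PomRank": last_rank, "PomHeat": start_rank - last_rank}
    return heat


# Hypothetical season statistics keyed the same way, by (season, team).
season_stats = {(2018, "A"): {"PPP": 1.01}, (2018, "B"): {"PPP": 0.98}}

heat = heat_index(snapshots)
# Inner join on (season, team): keep teams present in both tables.
team_features = {
    key: {**stats, **heat[key]}
    for key, stats in season_stats.items()
    if key in heat
}
```

In this toy data, team A moves from 40th to 25th over the window and team B slips from 10th to 18th, so A gets a positive heat value and B a negative one.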
Each row now contained Team 1 and Team 2 and their respective statistics for the year. The result was encoded as a binary indicator: 1 if Team 1 won and 0 if Team 2 won. This means that if a model predicts that Team 1 wins with probability p, the probability of Team 2 winning is 1-p. The dataset is also evenly split between games won by Team 1 and games won by Team 2.
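One way to build such rows is sketched below. The text does not say how the even split between Team 1 and Team 2 wins was obtained; randomly choosing which side is listed as Team 1 is an assumption used here because it yields a roughly balanced label on average. The names `make_matchup_row`, `Team1Won`, and the `T1_`/`T2_` prefixes are hypothetical.

```python
import random


def make_matchup_row(winner, loser, features, rng):
    """Build one modeling row, randomly deciding which side is Team 1."""
    if rng.random() < 0.5:
        team1, team2 = winner, loser
    else:
        team1, team2 = loser, winner
    row = {"Team1": team1, "Team2": team2}
    for name, value in features[team1].items():
        row["T1_" + name] = value
    for name, value in features[team2].items():
        row["T2_" + name] = value
    # Binary target: 1 if Team 1 won, 0 if Team 2 won.
    row["Team1Won"] = 1 if team1 == winner else 0
    return row


def team2_win_prob(p_team1):
    """With exactly two outcomes, Team 2's win probability is the complement."""
    return 1.0 - p_team1


# Hypothetical per-team season features and (winner, loser) results.
features = {
    "A": {"PPP": 1.05, "PomRank": 12},
    "B": {"PPP": 0.99, "PomRank": 40},
    "C": {"PPP": 1.02, "PomRank": 25},
    "D": {"PPP": 0.97, "PomRank": 55},
}
results = [("A", "B"), ("C", "D"), ("A", "C"), ("D", "B")]

rng = random.Random(0)  # seeded so the ordering is reproducible
rows = [make_matchup_row(w, l, features, rng) for w, l in results]
```

Whichever side ends up as Team 1, the label stays consistent with the actual result, so a model trained on these rows outputs p for Team 1 and `team2_win_prob(p)` for Team 2.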
Lastly, I filtered the training data to include only the 2008 to 2018 seasons. This gave the model enough data points (around 600) to find patterns without looking back too far, to seasons in which the college basketball landscape was vastly different. My validation and test sets were the 2019 and 2021 seasons and tournaments. (Note: there was no tournament in 2020 due to COVID-19.)
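The season-based split can be sketched as a simple filter on the season column. Treating 2019 as the validation season and 2021 as the test season is my reading of the sentence above and is an assumption in this sketch. The `Season` key and the toy rows are hypothetical.

```python
TRAIN_SEASONS = range(2008, 2019)  # 2008 through 2018 inclusive
VALID_SEASON = 2019                # assumed to be the validation season
TEST_SEASON = 2021                 # assumed to be the test season


def split_by_season(rows):
    """Split matchup rows into train/validation/test sets by season."""
    train = [r for r in rows if r["Season"] in TRAIN_SEASONS]
    valid = [r for r in rows if r["Season"] == VALID_SEASON]
    test = [r for r in rows if r["Season"] == TEST_SEASON]
    return train, valid, test


# Toy data: one row per season from 2003 to 2021, skipping 2020 (no tournament).
rows = [{"Season": s} for s in range(2003, 2022) if s != 2020]
train, valid, test = split_by_season(rows)
```

Splitting by whole seasons rather than by random rows keeps every game from a given tournament in the same set, so the model is always evaluated on seasons it has not seen.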