Chapter 3 Variable Selection
The dataset contained 112 variables (metrics) for each team. Using all of them gave the modeling techniques too much noise: many techniques predicted the training data well with all of the variables but performed poorly on the test and validation sets, a sign of overfitting. Using a combination of variable importance plots, model assessment metrics, and knowledge of college basketball, I chose the following variables:
- Points Per Possession (Offensive Efficiency): Average number of points a team scores per possession
- Opponent Points Per Possession (Defensive Efficiency): Average number of points a team allows per opponent possession
- Point Differential: Margin of victory, the number of points scored minus the number of points allowed
- Pomeroy Ranking: Ranking of teams by legendary college basketball statistician Ken Pomeroy (the lower the number, the better)
- Opponent Three Point Field Goal Percentage: Number of three-point field goals made by opponents divided by the number of three-point field goals attempted by opponents
- Free Throw Percentage: Number of free throws made divided by the number of free throws attempted
- Offensive Rebound Difference: Difference between a team’s number of offensive rebounds and its opponents’ number of offensive rebounds
- Opponent Turnovers: Number of turnovers committed by a team’s opponents
These variables account for a team’s defensive, offensive, rebounding, and overall abilities.
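
As a rough illustration, the definitions above can be written out as simple calculations on a team's season totals. This is a minimal sketch and not the code used for this project: the `SeasonTotals` class, its field names, and the `selected_features` function are hypothetical, and the sketch assumes the point differential and rebound difference are taken over season totals (dividing by games played would give per-game versions). The Pomeroy ranking is passed through unchanged because it comes from an external source rather than from box-score totals.

```python
from dataclasses import dataclass


@dataclass
class SeasonTotals:
    """Hypothetical season totals for one team (field names are illustrative)."""
    points: int
    opp_points: int
    possessions: int
    opp_possessions: int
    opp_three_made: int
    opp_three_attempted: int
    ft_made: int
    ft_attempted: int
    off_rebounds: int
    opp_off_rebounds: int
    opp_turnovers: int
    pomeroy_rank: int  # supplied externally; lower is better


def selected_features(t: SeasonTotals) -> dict:
    """Compute the eight selected variables from one team's season totals."""
    return {
        # Offensive efficiency: points scored per possession
        "points_per_possession": t.points / t.possessions,
        # Defensive efficiency: points allowed per opponent possession
        "opp_points_per_possession": t.opp_points / t.opp_possessions,
        # Points scored minus points allowed (season total, by assumption)
        "point_differential": t.points - t.opp_points,
        "pomeroy_rank": t.pomeroy_rank,
        # Opponent three-pointers made divided by opponent attempts
        "opp_three_pt_pct": t.opp_three_made / t.opp_three_attempted,
        # Free throws made divided by free throws attempted
        "free_throw_pct": t.ft_made / t.ft_attempted,
        # Team offensive rebounds minus opponent offensive rebounds
        "off_rebound_diff": t.off_rebounds - t.opp_off_rebounds,
        "opp_turnovers": t.opp_turnovers,
    }


if __name__ == "__main__":
    # Made-up numbers for illustration only; not taken from the dataset.
    example = SeasonTotals(
        points=2400, opp_points=2100,
        possessions=2000, opp_possessions=2000,
        opp_three_made=200, opp_three_attempted=625,
        ft_made=450, ft_attempted=600,
        off_rebounds=350, opp_off_rebounds=300,
        opp_turnovers=420, pomeroy_rank=12,
    )
    for name, value in selected_features(example).items():
        print(f"{name}: {value}")
```

Keeping each variable as a named entry in one dictionary makes it easy to build a modeling table with one row per team and one column per selected variable.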