Background

In this report I have been tasked with predicting players' final batting averages for the 2018 season based on batting data from March and April 2018. The dataset provided comes from Fangraphs and includes 29 columns of information on 309 players. In the context of this study, a player's final batting average at the end of the season is the response (or dependent) variable, and the batting statistics included in the dataset constitute the independent variables.

Batting average is numeric rather than categorical, so the models will be evaluated on their $R^2$ and Mean Absolute Error (MAE) scores. $R^2$ measures the proportion of variation in the response variable that is explained by the independent variables in the model. An $R^2$ score ranges from 0.0 to 1.0, and the closer the score is to 1.0, the more of the variation in the dependent variable the independent variables explain. The Mean Absolute Error represents the average amount by which each predicted final batting average differs from a player's true final batting average. Unlike $R^2$, the closer this value is to zero, the more accurate the predictions are.
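As a quick illustration of the two metrics, here is a minimal sketch using scikit-learn's metrics functions. The batting averages shown are hypothetical values, not results from this study.

    # Minimal sketch of the two evaluation metrics, using hypothetical values.
    from sklearn.metrics import r2_score, mean_absolute_error

    # Hypothetical true and predicted final batting averages for five players.
    y_true = [0.251, 0.302, 0.275, 0.228, 0.290]
    y_pred = [0.260, 0.295, 0.270, 0.240, 0.281]

    print("R^2:", r2_score(y_true, y_pred))             # closer to 1.0 is better
    print("MAE:", mean_absolute_error(y_true, y_pred))  # closer to 0.0 is better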

Data

To supplement the data provided, I brought in two additional datasets that I believe to be correlated with batting average. The first was Statcast ball tracking data, and the second was batted ball information. These datasets are also downloaded from Fangraphs but are sourced from Statcast. I filtered both datasets to include only information from March and April of the 2018 MLB season. One of the primary motivations behind including these datasets is the theory that batted ball data offers a truer indication of a player's underlying skill set than a relatively small sample of counting statistics. If the models can find a correlation between batted ball data and final batting average, it may be more predictive than factors like BABIP, home runs, and even current batting average, which carry a considerable amount of variance in just a month's worth of at bats.

The first data source I brought in is Statcast ball tracking data. This dataset includes average exit velocity, launch angle, barrel%, hard hit%, and wOBA. These statistics effectively measure how hard a player hits the ball and the quality and consistency of contact. These factors should play a role in batting average because how hard a player hits a ball, and the angle at which the ball leaves the bat, often determine whether a batted ball turns into a hit or an out. Barrel and hard hit rates offer insight into how consistently a player makes quality contact. My expectation is that players who consistently make quality contact tend to produce higher batting averages.

I do, however, harbor several reservations about the effectiveness of this dataset. Statistics like launch angle and exit velocity are extremely predictive of the outcome of an individual batted ball, but averaged over the course of many at bats they may not reveal much. Players like Joey Gallo have high strikeout rates but tend to crush the ball when they make contact; such players likely have high average exit velocities and launch angles but still post poor batting averages. These averages may also punish high-contact players who put the ball in play more often but make weaker contact. Their average exit velocity and launch angle would be suppressed, yet putting the ball in play more often leads to more chances to get a hit.

The second dataset I brought in was Fangraphs batted ball data. This includes pull%, center%, opposite%, soft%, med%, and hard%, which represent where each batter tends to hit the ball and how hard they hit it. With the adoption of the shift and extreme pull-heavy, fly-ball hitting approaches, hitting for contact and spraying the ball around the field is becoming more of a lost art in baseball. It intuitively makes sense that players with a more even distribution of where they hit the ball tend to have a higher batting average. The soft%, med%, and hard% columns measure how hard a player tends to hit the ball. This may overlap with exit velocity and hard hit rate, but it also goes into more detail about the consistency with which a player makes good contact.

Preprocessing and Data Exploration

After merging the two external datasets, I needed to prepare the data for modeling. The data had no missing values for any of the observations. Almost all of the new columns I brought in from the external datasets required formatting because they were stored as percentage strings. I cleaned these columns and reformatted them as decimals.
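A minimal sketch of how this merge and percentage cleanup might look in pandas is shown below. The file names, join key, and column labels (e.g. "Barrel%", "Hard%") are assumptions for illustration; the actual code lives in the Cleaning, EDA, Merging appendix.

    import pandas as pd

    # Hypothetical file names and join key -- the real ones are in the code appendix.
    base = pd.read_csv("fangraphs_march_april_2018.csv")
    statcast = pd.read_csv("statcast_march_april_2018.csv")
    batted = pd.read_csv("batted_ball_march_april_2018.csv")

    # Merge the two external datasets onto the provided data by player.
    df = (base.merge(statcast, on="playerid", how="inner")
              .merge(batted, on="playerid", how="inner"))

    # Convert percentage strings like "42.3%" into decimals like 0.423.
    pct_cols = ["Barrel%", "HardHit%", "Pull%", "Cent%", "Oppo%", "Soft%", "Med%", "Hard%"]
    for col in pct_cols:
        df[col] = df[col].str.rstrip("%").astype(float) / 100.0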

I had several takeaways from my data exploration. As batting averages are calculated from at bats, one of my concerns was the amount of variance that observations with few plate appearances might introduce into the model. The number of plate appearances spanned from a minimum of 22 to a maximum of 125. This is a relatively wide range; my concern was that the model would treat these observations as equal when looking at the relationship between their independent and dependent variables, when in reality certain observations have a much larger sample size for their underlying metrics and batting average to smooth out. Because of this, I identified a linear regression weighted by the number of at bats as one of the models I would like to look at when creating my predictions. The graph below shows the distribution of at bats per player in the dataset.
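A plot like the one referenced above could be produced with a short matplotlib snippet such as the following; the column name "AB" is an assumption about how at bats are labeled in the merged dataframe.

    import matplotlib.pyplot as plt

    # Histogram of at bats per player (column name "AB" assumed).
    plt.hist(df["AB"], bins=20, edgecolor="black")
    plt.xlabel("At bats (March/April 2018)")
    plt.ylabel("Number of players")
    plt.title("Distribution of at bats per player")
    plt.show()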

Another problem I identified with the data was the fact that there were 41 columns and just 307 observations in the dataset. That many independent variables relative to that number of observations makes it extremely likely the model will overfit to the training data. Each new independent variable adds a dimension to the dataset, making it difficult for the model to determine the relationship between each variable and the response variable when there are so many dimensions to which any change can be attributed. This is less of a problem when there are thousands of observations, but with only a few hundred in our dataset it will almost certainly be an issue. Additionally, many of these variables are directly correlated with one another, which will lead to problems with multicollinearity down the road.

Most of these columns are likely to have little to no predictive power on full season batting average, but their mere presence can lead to the model picking up noise from them and mistakenly attributing significance to it. This noise will actually improve the $R^2$ and Mean Absolute Error of a model fit and tested on this specific dataset, but keeping these unimportant variables makes it less likely the model will translate effectively to new data. Because of this, I dropped several columns that I feel confident should not have any predictive signal for a player's batting average: PlayerID, Name, Team, Runs, RBIs, and stolen bases. After removing these, there are 34 independent variables remaining in the model. This is still far too many variables for the size of the dataset. To deal with the problems of overfitting and multicollinearity, I decided I would also test a Lasso regression and other models that introduce a regularization penalty.
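Dropping the identifier and counting columns is a one-line pandas operation. The column labels below, including the target column "FinalAVG", are assumptions about how the merged dataset is named; adjust them to match the actual data.

    # Column labels assumed for illustration; adjust to the merged dataset.
    drop_cols = ["playerid", "Name", "Team", "R", "RBI", "SB"]
    X = df.drop(columns=drop_cols + ["FinalAVG"])  # independent variables
    y = df["FinalAVG"]                             # response: final season batting average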

Modeling

Linear regression is perhaps the simplest and least sexy model in data science, but it is often the most effective in situations where the response variable is continuous and numeric. I tried four different iterations of a linear regression using the scikit-learn library for Python.

  1. Linear Regression on just the default dataset provided - Although I do believe that the datasets I added contain useful information, I wanted to make sure that the default dataset did not lead to a better model. I am concerned about overfitting due to the number of independent variables already in the dataset, and by adding over ten more columns I could have created an even bigger problem.

  2. Linear Regression on the entire dataset I created, including the Statcast and batted ball data.

  3. Linear Regression weighted by number of at bats - The scikit-learn Linear Regression model allows you to weight particular observations more heavily when fitting a model. In our dataset each observation represents one player's statistics, but because the number of at bats varies, each player's statistics are built on a different sample size. Because of this, I believe it could be useful to give extra weight to the players with more at bats behind their statistics, since they offer a larger sample size.

  4. Lasso Linear Regression - Lasso linear regression uses an L1 penalty on model coefficients to shrink the slope of certain variables, and in some cases removes them from the model entirely. The strength of the regularization penalty is determined by the lambda value (called alpha in scikit-learn). To find the lambda value that optimized my Lasso model's score, I used scikit-learn's random search tool, which samples from a list of designated parameter values you want to optimize over and reports back which set of parameters performed best. A sketch of this tuning, along with the weighted fit from the previous item, appears after this list.
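The sketch below shows how models 3 and 4 might be set up, assuming the X, y, and df["AB"] objects from earlier. It is illustrative only: the candidate alpha grid and search settings are assumptions, not the values used for the reported results.

    import numpy as np
    from sklearn.linear_model import LinearRegression, Lasso
    from sklearn.model_selection import RandomizedSearchCV

    # Model 3: ordinary least squares, weighting each player by their at bats.
    weighted_lr = LinearRegression()
    weighted_lr.fit(X, y, sample_weight=df["AB"])

    # Model 4: Lasso, with the penalty strength (alpha) tuned by random search.
    param_dist = {"alpha": np.logspace(-5, 0, 100)}  # candidate regularization strengths
    search = RandomizedSearchCV(
        Lasso(max_iter=10000),
        param_distributions=param_dist,
        n_iter=25,
        scoring="r2",
        cv=10,
        random_state=42,
    )
    search.fit(X, y)
    print("Best alpha:", search.best_params_["alpha"])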

When constructing and testing a model, a dataset is typically split into a training and test set for validation. This attempts to simulate how a model will perform on an unseen dataset after being trained on your initial data. However, since the dataset is small to begin with and I am already concerned about overfitting, I did not want to sacrifice a portion of my training data to create a separate validation set. Instead, I used the K-fold cross validation technique with 10 folds: I divided the data into tenths, fit the model on nine tenths of the data, tested it on the remaining tenth, and recorded the $R^2$ and MAE scores. I then reincorporated the tenth I had just tested on, set aside the second tenth of the data, and trained on the other nine tenths. This process is repeated ten times, once for each subset of the data, leaving ten $R^2$ and MAE scores for each model. I took the average of these scores for each model and used the averages to compare the models to one another.

Using cross validation lets you train on the entire dataset rather than sacrificing a large portion of it for testing. The downside is that it is expensive both computationally and in terms of time. However, linear regressions are cheap on both counts, making cross validation a feasible option for the models listed above. Below is the graph of how each of the four models performed when trained and scored on the entire dataset, and then when scored with K-fold cross validation. The second bar represents how well I believe each model would work when applied to a new dataset.
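The 10-fold scoring described above can be expressed with scikit-learn's cross_validate, which returns both metrics for every fold so they can be averaged. This is a sketch assuming the X and y objects from earlier; swap in any of the four regressions.

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_validate

    # Score a model with 10-fold cross validation, collecting R^2 and MAE per fold.
    cv_results = cross_validate(
        LinearRegression(),            # swap in any of the four models above
        X, y,
        cv=10,
        scoring=("r2", "neg_mean_absolute_error"),
    )
    mean_r2 = cv_results["test_r2"].mean()
    mean_mae = -cv_results["test_neg_mean_absolute_error"].mean()  # flip sign back to a positive error
    print(f"Mean R^2: {mean_r2:.3f}, Mean MAE: {mean_mae:.4f}")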

Unsurprisingly, the augmented dataset with the Statcast and batted ball data performed best on the training data; having more columns almost always leads to better training performance. However, this model had the worst performance on the test data, which is almost surely a sign that it was significantly overfit. As I hypothesized in my initial data exploration, the Lasso regression performed best on the test data by a fairly significant margin, achieving an $R^2$ of .313 on the validation data, significantly better than the second best score of .262. The Lasso model performed worse on the training data, as expected, because the regularization penalty generally accepts an increase in bias in exchange for a decrease in variance. The mean absolute error scores follow the same trend as the $R^2$ scores, but are significantly closer to one another. In the context of this study, an $R^2$ of .313 means that roughly 31% of the variance in a player's final batting average can be explained by the variables in this model.

I tested several additional models to see how they performed on the dataset. They included:

  1. Random Forest Regressor - Random forest regressors construct an ensemble of decision trees during training. In the context of this study, the model looks for players with similar statistical profiles, groups similar players together, and uses the average of each group's final batting average as the prediction.

  2. Support Vector Regressor (SVR) - The support vector regressor applies an L2 regularization penalty to the model coefficients. It defines an acceptable margin of error and attempts to minimize the error only for data points that fall outside that margin. This helps prevent the model from reacting to random noise in the dataset and avoids overfitting. Since its goal is not to minimize overall error on the training data, it generally performs worse on training data than most other models, but translates well to test data when the training data is noisy.

  3. Gradient Boosting Regressor - Gradient boosting functions similarly to the random forest regressor. They differ in that gradient boosted regressors use a collection of less complex trees with high bias and low variance and attempt to reduce the bias of the ensemble. Conversely, random forests build complex trees that are naturally low bias and high variance and attempt to reduce the variance by averaging them.

The graphs below show these models' $R^2$ and mean absolute error performance on validation data, with a simple linear regression and the Lasso model from the last section included as reference points. Unlike the regression models above, the data for these models was split into a training and test set rather than scored with cross validation. The models in this section are significantly more computationally expensive than the regressions previously discussed, and cross validating them would require a significant amount of time. A sketch of this evaluation appears below.
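The following is a minimal sketch of that single hold-out evaluation, assuming the X and y objects from earlier. The split fraction and default hyperparameters are assumptions rather than the tuned settings behind the reported scores, and the feature scaling in front of the SVR is my addition, since SVR is sensitive to feature scale.

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
    from sklearn.svm import SVR
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import r2_score, mean_absolute_error

    # Single hold-out split instead of cross validation for these heavier models.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

    models = {
        "Random Forest": RandomForestRegressor(random_state=42),
        "SVR": make_pipeline(StandardScaler(), SVR()),  # epsilon-insensitive loss + L2 penalty
        "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        print(f"{name}: R^2={r2_score(y_test, preds):.3f}, "
              f"MAE={mean_absolute_error(y_test, preds):.4f}")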

All of the models tested above perform better than a basic linear regression; however, only the Support Vector Regressor outperforms the Lasso regression model. The SVR performed significantly better than any other model, with an $R^2$ score of .345 and an MAE .0014 lower than the regression model. Its $R^2$ outperforms even the best performing Lasso regression from the previous section by a fairly significant margin.

The Support Vector Regressor performs best because it avoids overfitting to the training data. The inclusion of an acceptable margin of error, and the focus on minimizing only the error of points that fall outside that margin, means the model is not as susceptible to noise and randomness in the data. This margin acts as a built-in regularization penalty. The inclusion of so many independent variables in this model leads to a lot of noise; the SVR is able to compensate for this and produce strong test predictions.

Conclusion

The results show a continuation of the trend that the regularized, higher-bias models do best on the test data. From the initial data analysis, my biggest concern was that the number of independent variables relative to the number of total observations would lead to overfitting. This played out in the modeling. Ultimately the two models that performed best - Lasso regression and Support Vector Regression - were the models that included regularization penalties to reduce the influence of noise in the dataset.

If the number of independent variables was causing problems, why did I not simply prune back the number of variables in the dataset I trained on?

The simple answer is that I did try this. I built several iterations of datasets with a significantly reduced number of columns, and while these did help the test scores of some of the simple linear regressions and the random forest regressor, their scores still did not come close to the performance of the Lasso and SVR models. The problem with this approach is that we do not know which variables carry the most predictive signal. Additionally, we do not know whether a variable that appears to have little predictive ability on its own works well in tandem with another variable to boost the overall predictive power of the model. The Lasso and SVR models answer these questions for us, telling us which combination of variables works best and which variables should be excluded.

An important consideration when reflecting on the results of this study is its underlying purpose. If the question had been rephrased to “What factors in the dataset predict batting average?”, the report would change significantly. Rather than focusing on the models that generated the best overall prediction scores, I would focus on model interpretability. This would include verifying that model assumptions were met and examining the significance of the individual factors in the dataset. The results of such a study would focus more on the individual variables that correlate strongly with batting average than on the overall MAE and $R^2$ of the models.

Final Batting Average Predictions

Code Appendix

Cleaning, EDA, Merging

Modeling - Regression

Modeling - Other ML

Misc Data Subsets Used