Predicting Stats in the 2021-22 NBA Season using a Neural Network

I got a little carried away trying to win my fantasy basketball league this year and created this project. Initially, I just tried to build a simple linear model to get a rough estimate of how some players would perform, free of my biases. The model performed terribly, so I decided to step it up a notch and built a Multi-Layer Perceptron (MLP) Regressor neural network to do the task. Despite my painful labours dealing with never-ending data structures, I learnt a lot about the inner workings of the seemingly complicated models that are supposed to take over the world soon, and more importantly I really enjoyed doing it. (I also managed to build a killer fantasy team that dominated until it was decimated by injuries, but that’s irrelevant.) The final cherry on top was that I achieved an R^2 value (coefficient of determination) just shy of 0.8. For reference, the best comparable model I could find online was barely cracking an R^2 of 0.7.

I won’t get into the technical details of the project on this page; the rough details are below, and you can see the full workings of the model on my GitHub. For those of you just here to use the model, I built a simple widget in p5 JavaScript to help. Just enter the name of any of the 607 eligible players whose predicted points per game for the 2021-22 season you want to see and click enter. The model will immediately spit back the value you need. Note that rookies are not included, as there is no professional basketball data on which to base their predictions. Also ensure that spellings are accurate for the widget to work.

The Data

I scraped and collected all of my data from NBAStuffer.com. I initially used 15 different statistics to predict player performance: the traditional box score stats coupled with some relatively mainstream advanced statistics like True Shooting % (a less efficient player will, in most scenarios, be made to shoot less by a good coach) and Usage Rate (to see how often a player ends a possession). I used data for every player from the 2010-11 season through the 2020-21 season, with players repeating every year, and constructed a 52692 × 15 feature matrix (plus a 52692-dimensional vector holding the points values to be predicted) as my dataset, which was understandably a pain to deal with. A randomly selected 67% of this data was used to train the models mentioned below and the remaining 33% was used for testing.
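As a rough sketch of that setup, here is how the split might look, assuming the data lives in a pandas DataFrame (the file name and the target column name are hypothetical, not the actual ones):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# One row per player per season, 15 feature columns, with PPG as the target.
df = pd.read_csv("player_seasons_2010_to_2021.csv")  # hypothetical file name
X = df.drop(columns=["PPG"])   # 52692 x 15 feature matrix
y = df["PPG"]                  # 52692-dimensional target vector

# Random 67/33 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
```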

The Linear Model

I started out with a simple linear regressor. This model essentially just returns a linear function of all the variables fed into it (think of a line, but in 15 dimensions). Unsurprisingly it did not perform too well, but I have to admit I was surprised at just how badly it failed. The model achieved an R^2 of just 0.0079, and even that minuscule number fails to capture the magnitude of the failure. The picture below does it justice.

The Predicted PPG of LeBron James and Stephen Curry by the Linear Model Over the Years

The Y-axis represents average points per game for a given season, while the X-axis represents seasons. The red line is the player’s actual PPG; the blue line is what the model predicted. It’s no surprise that the linear model failed, as the data violated all four conditions needed for linear regression to work: linearity, normality, independence and homoscedasticity. Still, I’m in shock at how it managed to predict a 150 PPG season for LeBron and a -20 PPG season for Steph. These results made it clear that a more powerful approach was necessary.
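For reference, a baseline like this takes only a few lines; a minimal sketch, assuming scikit-learn and reusing the train/test split from the earlier sketch:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Fit an ordinary least squares model on the same 67/33 split as above.
linear = LinearRegression().fit(X_train, y_train)

# Evaluate on the held-out 33%; this is where an R^2 like 0.0079 comes from.
print(r2_score(y_test, linear.predict(X_test)))
```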

The Neural Network

At this point I was convinced that I had to build a better model; I simply couldn’t settle for one that predicted Stephen Curry scoring 10 own baskets a game for a full year. An MLP Regressor is a much more powerful tool that seemed up to the task. It operates in two phases: forward and backward propagation. The network itself is nothing but a big matrix of values (called weights) initialised arbitrarily, and its size was chosen through trial and error to maximise performance.

The Network Architecture

The input data is fed through the network and is multiplied by the weights assigned at each node, as shown above. At each node an activation function (a rectified linear unit, or ReLU, was used here) is applied to the weighted sum.
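To make the forward phase concrete, here is a toy NumPy sketch of one pass through such a network (the layer sizes are made up for illustration; they are not the ones the model actually used):

```python
import numpy as np

def relu(z):
    # Rectified linear unit: max(0, z) element-wise.
    return np.maximum(0, z)

# Hypothetical layer sizes: 15 inputs -> two hidden layers -> 1 output (PPG).
sizes = [15, 64, 32, 1]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x, weights, biases):
    """One forward pass: multiply by the weights at each layer, add the bias,
    and apply ReLU everywhere except the final (linear) output layer."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)
    return a @ weights[-1] + biases[-1]

prediction = forward(rng.standard_normal(15), weights, biases)
```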

The model then iteratively updates this weights matrix using an algorithm called gradient descent. In gradient descent a cost (or error) function is defined and then minimised using some simple calculus until it converges to a minimum. The raw algorithm is very computationally expensive and not feasible here, so an L-BFGS algorithm was used instead, which uses quasi-Newton methods to approximate the Hessian matrix of the cost function (the Jacobian of the gradient vector, so think second derivative, but as a matrix) and find the weights that minimise the cost.
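In practice, this training loop is handled by a library. A minimal sketch of how such a network could be set up, assuming scikit-learn’s MLPRegressor (the hidden layer sizes below are placeholders, since the real ones were found by trial and error):

```python
from sklearn.neural_network import MLPRegressor

mlp = MLPRegressor(
    hidden_layer_sizes=(64, 32),  # placeholder architecture
    activation="relu",            # the ReLU activation described above
    solver="lbfgs",               # the quasi-Newton optimiser described above
    max_iter=2000,
    random_state=42,
)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))  # R^2 on the held-out 33%
```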

The neural network performed far better than the linear one, with an R^2 of 0.7016 (roughly in line with the best publicly available models online). I was particularly relieved that the network kept its predictions within a reasonable range (around 0-30 PPG, which matches virtually all modern NBA seasons); the ReLU activation function can be thanked for that.

Adding Team Statistics

At this point I was convinced that I could improve my model and make it the best one available, so I made some minor tweaks. I decided to experiment with the data as well (and got sucked into another dataframe hellhole), adding and subtracting some variables. Finally it dawned upon me: the biggest difference-maker separating my model from the rest was the use of team statistics. Most papers and blog posts out there rely solely on individual production, but team statistics matter because teammate performance is inherently tied to an individual’s performance. If your team averages more assists, expect the scoring to be fairly well spread out across many key contributors; if your team averages a lot of points per game, expect an individual player to average more points too; and if a team plays at a faster pace, that means more potential possessions on which the player in question can score. Adding these team statistics to the mix caused the R^2 value to skyrocket to 0.7981. The same graphs of Steph and LeBron (and some others) over the years began to look significantly more accurate.

The Predicted PPG of Some Players By the Neural Network Over the Years
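For anyone curious how team statistics slot into the dataset, here is a rough sketch of the idea, reusing the hypothetical DataFrame from earlier (the file and column names are illustrative, not the actual ones):

```python
import pandas as pd

# Hypothetical season-level team statistics (one row per team per season).
team = pd.read_csv("team_seasons_2010_to_2021.csv")

# Each player-season row picks up its team's numbers for that season,
# widening the feature matrix beyond the original 15 columns.
df = df.merge(
    team[["TEAM", "SEASON", "TEAM_PPG", "TEAM_AST", "TEAM_PACE"]],
    on=["TEAM", "SEASON"],
    how="left",
)
```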