Automatic Ratings
Apr 27, 2020 14:38:42 GMT -5
Post by billy on Apr 27, 2020 14:38:42 GMT -5
nicidob.github.io/automatic_bbgm/
We could switch to these in the future. This is similar to what Allan Houston used to do, but automated.
Automatically Generating NBA Rosters for BasketballGM
BasketballGM is a remarkable, free, browser-based game. I decided to automatically generate roster files from real NBA data using some basic data science methods. The results are all on this GitHub repo, and the roster files are available here.
BasketballGM Ratings
As outlined on the player customization page, players in BBGM have 15 ratings which determine their performance.
hgt: height, which factors into pretty much everything
stre: strength, which influences defense, rebounding, and low post scoring
spd: speed, which influences ball handling, fast breaking, and defense
jmp: jumping, which influences finishing at the rim, rebounding, blocking shots, and defense
endu: endurance, which governs how fast a player’s skills degrade as he gets tired
ins: low post scoring
dnk: dunking/layups
ft: free throw shooting
fg: 2 point jump shot ability
tp: 3 point shooting
oiq: offensive IQ
diq: defensive IQ
drb: dribbling
pss: passing
reb: rebounding
This is a sensible decomposition of latent performance variables. Unfortunately, it’s not totally clear how to set these manually. For example, take 3-point shooting. Should we use shooting percentage? Domantas Sabonis shot over 50% on 3P (on 17 attempts). Total makes? Harden made 378 3P but Steph Curry made 354, with 200 fewer 3PA. Even worse are variables like jmp or drb, which I have no idea how to set correctly for large numbers of players.
So instead, we’ll build a model to understand these underlying variables.
Building a model
What type of model? In general, we want a model that takes data we have and generates data we want. It’s clear we want the underlying variables. What data do we have? Box score statistics, like this page for LeBron James. This is your typical stat sheet: rebounds, points, steals, etc. Since we’re using modern NBA data, we also have per-possession and per-minute versions of these statistics.
Our basic model will be a linear regression, which is just a weighted sum. For example, we might get that a player’s jump score can be derived as Jump = 0.3 * Blk_p36 - 0.4 * Height - 0.4 * Age + 0.2 * MP. Stated in plain words, this means that a player who is young, plays a lot of minutes, and gets a lot of blocks for his height would tend to have a high jump rating. We’ll whiten the data first, so all of these are in consistent units: standard deviations above or below the mean. Thus, every stat is measured against league average.
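To make the weighted-sum idea concrete, here is a minimal sketch in Python. It assumes a tiny, made-up table of four players (the names and numbers are hypothetical), standardizes each stat into standard deviations from the mean, and then applies the example Jump weights quoted above. Those weights are the illustrative ones from the text, not fitted values, and the helper names are my own.

```python
from statistics import mean, pstdev

# Hypothetical per-player stats; the numbers are made up for illustration.
players = {
    "A": {"Blk_p36": 2.4, "Height": 83, "Age": 23, "MP": 2400},
    "B": {"Blk_p36": 0.3, "Height": 75, "Age": 31, "MP": 1800},
    "C": {"Blk_p36": 1.1, "Height": 80, "Age": 27, "MP": 900},
    "D": {"Blk_p36": 0.6, "Height": 78, "Age": 35, "MP": 600},
}

# Example weights copied from the text above; they are not fitted values.
JUMP_WEIGHTS = {"Blk_p36": 0.3, "Height": -0.4, "Age": -0.4, "MP": 0.2}


def standardize(rows):
    """Convert each stat to standard deviations above/below the sample mean."""
    names = list(rows)
    stats = list(rows[names[0]])
    out = {name: {} for name in names}
    for stat in stats:
        values = [rows[name][stat] for name in names]
        mu, sigma = mean(values), pstdev(values)
        for name in names:
            # Guard against a stat with zero spread.
            out[name][stat] = (rows[name][stat] - mu) / sigma if sigma else 0.0
    return out


def weighted_sum(z_row, weights):
    """Linear model: sum of weight * standardized stat."""
    return sum(w * z_row[stat] for stat, w in weights.items())


z = standardize(players)
jump = {name: weighted_sum(z[name], JUMP_WEIGHTS) for name in players}

for name, score in sorted(jump.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:+.2f}")
```

Because every input is standardized, the resulting scores are also relative to the sample average: a positive score means above-average jumping under these example weights, a negative score means below.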
This idea is hardly new. This is basically how Box Plus Minus works: it’s a model that maps box score data to something like ridge-regressed APM. There are many versions of this, such as those from Dan Rosenbaum and Scott Sereday. Further, Ben Taylor has built models to measure Shot Creation, Passer Rating and Offensive Load.
Getting NBA data
I decided to use the wonderful Basketball Reference as a data source. I have a small script (save_season.py) which downloads the team-by-team pages for a given season, along with the contract files (which are always “as of today” and don’t provide historical data). I extract all the tables from the pages and save them into Python pickle objects.
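The actual save_season.py is not shown here, so the following is only a rough sketch of the "extract tables and pickle them" step, using the Python standard library. The class and function names are hypothetical, the sample HTML is made up, and the download step is left out entirely. It also assumes flat (non-nested) tables with properly closed cells.

```python
import pickle
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> in a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table
        self._row = None   # row currently being built
        self._cell = None  # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return all tables found in an HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


# Made-up page content standing in for a downloaded team page.
SAMPLE_PAGE = """
<table>
  <tr><th>Player</th><th>BLK</th></tr>
  <tr><td>Player A</td><td>12</td></tr>
</table>
"""

tables = extract_tables(SAMPLE_PAGE)
blob = pickle.dumps(tables)    # bytes that could be written to a .pkl file
restored = pickle.loads(blob)  # round-trip back into Python objects
```

Saving the parsed tables once means later modeling steps can reload them without hitting the site again.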