Monday, March 3, 2008

Technical Post: How will the blog work? How does the simulation work?

How will the blog work?

Each weekday, I'll begin by listing the roster for the two competing teams. During the first round, I'll write a bit about the players on each team, and the way in which these players fit together. I'll make a note of players that were left off, or of any difficult decision between one player and another in the construction of a team.

I'll then run a simulation of 1,000 games between the two teams, using the software that I've written. This happens in about twenty seconds, once the data for the players has been entered. The software will keep track of team and individual statistics for points, assists, rebounds, turnovers, free throws, three pointers, and shooting, three point, and free throw percentages for the simulated games--available on a total and a per game basis. I'll present to you the wins and losses for each side, and team statistics for points per game, rebounds per game, assists per game, and turnovers per game during every post. I'll mention other team statistics and individual statistics as they seem relevant in my discussion of the results.

Using the simulation results as a starting point, I'll pick a tentative winner of the game: the team that won more of the 1,000 simulated games. But I'll then try to figure out whether the results of the simulation are misleading. All simulations are limited; I'll be up front about the limitations of this simulation below. Perhaps obviously, the closer that the 1,000 games are split between the two teams, the more likely it will be that this discussion will counteract the result of the simulation. I'll pick a final winner of the matchup and update the bracket, and leave comments open so that you can discuss the results.
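As a rough guide to how close is "close," here is a small sketch of the sampling noise in a 1,000 game split. It is not part of the simulation software; the function name `win_share_margin` is made up for this illustration, and it assumes the simulated games are independent of one another.

```python
import math


def win_share_margin(wins, games=1000, z=1.96):
    """Return (win share, approximate 95% margin of error) for a simulated series.

    Treats each simulated game as an independent coin flip with a fixed
    win probability, which is an assumption made for illustration.
    """
    share = wins / games
    standard_error = math.sqrt(share * (1 - share) / games)
    return share, z * standard_error


share, margin = win_share_margin(520)
print(f"win share {share:.3f}, plus or minus {margin:.3f}")
```

Under these assumptions, a 520-480 split carries a margin of roughly 0.031 on a win share of 0.520, so a split that close does not clearly separate the two teams on its own.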

How does the simulation work?
The simulation starts with nine different metrics for each player. I'll name the metric, explain it, and present the high number, low number, and median for the 165 players in the sample. Each of these metrics is either available at espn.com or derivable from information presented there. I used players and metrics from last season: the 2006-07 season. Note that you won't see many very good players at the extremes here, because very good players tend to have a broad range of skills. The guys at the extremes are generally either specialists, on the high end, or players on their way out of the league, on the low end.

Two point percentage: The percentage of the player's two point shots that he makes. Players who take many jump shots will have a lower two point percentage; players who shoot almost exclusively layups and dunks will have a higher percentage.
High: Alan Henderson, Indiana, .642
Low: Michael Ruffin, Colorado/South Dakota, .278
Median: .475

Three point percentage: The percentage of the player's three point shots that he makes.
High: Jason Terry, Washington, .438
Low: Several players in the sample made no three point shots in 06-07
Median: .308

Free throw percentage: The percentage of the player's free throws that he makes.
High: Lindsey Hunter, Mississippi, .909
Low: Lorenzen Wright, Tennessee, .287 (!)
Median: .763

Two point rate: The percentage of this player's shots that are two point shots, as opposed to three point shots. Many big men take two point shots at a rate of 90% and up; a two point rate in the 70s or 80s is normal for a guard. Percentages in the 60s or lower indicate that the player is a three point specialist.
High: 100%; several players in the sample have never taken a three point shot in the NBA
Low: Travis Diener, Wisconsin, .375
Median: .814

Free Throw Rate: Free throw attempts divided by (field goal attempts times 2). This number is used to determine how often a player is fouled while shooting. If you're a hulking shooter who takes only lay-ups and dunks, this number will be high; if you take many shots and they're all open threes, it will be lower.
High: Dwayne Jones, West Virginia/Connecticut, .698
Low: Roger Mason, Washington D.C., .045
Median: .158
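The five shooting metrics above can all be derived from ordinary season totals. The sketch below shows one way to do it; the function name and the sample stat line are invented for illustration, and it assumes the usual box score convention that field goal totals include three pointers.

```python
def shooting_metrics(fgm, fga, tpm, tpa, ftm, fta):
    """Derive the five shooting metrics from season totals.

    fgm/fga: field goals made/attempted (three pointers included)
    tpm/tpa: three pointers made/attempted
    ftm/fta: free throws made/attempted
    A player with no attempts of some kind gets 0.0 for that metric.
    """
    two_made = fgm - tpm
    two_att = fga - tpa
    return {
        "two_pct": two_made / two_att if two_att else 0.0,
        "three_pct": tpm / tpa if tpa else 0.0,
        "ft_pct": ftm / fta if fta else 0.0,
        "two_rate": two_att / fga if fga else 0.0,
        # Free Throw Rate as defined above: FTA / (FGA * 2)
        "ft_rate": fta / (fga * 2) if fga else 0.0,
    }


# A made-up stat line, not a real player's season.
for name, value in shooting_metrics(400, 900, 100, 300, 240, 300).items():
    print(f"{name}: {value:.3f}")
```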

**Interlude**
These last four statistics were created by John Hollinger, who now works for espn.com. The stats are defined here (remember that I'm using last year's stats, so the stats that you see there won't match up with what I'm using). They measure rebounding, passing, turnovers, and possession usage in two ways that are important for simulation purposes:
1) NBA statistics are traditionally measured in terms of "x per game": assists per game, points per game, rebounds per game. No one knows that Kobe Bryant and Manu Ginobili are nearly equally valuable players on the court, in part because Kobe plays so many more minutes than Ginobili, so his per game averages are higher. These statistics are minutes independent; they allow me to compare players who played many minutes with those who played few.
2) A team like Golden State runs up and down the floor and takes many quick shots; a team like Houston walks the ball up and down the court. A player on Golden State therefore has many more opportunities to score points, make assists, and gather rebounds than a player on Houston, which clouds the usefulness of per game statistics. These statistics are "pace" neutral; they don't penalize a player who happens to play on a team that plays slowly, or reward players who play on teams that use possessions as quickly as possible.
**End Interlude**
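To make the two points in the interlude concrete, here is a minimal sketch of minutes and pace normalization. It is only the general idea, not Hollinger's actual formulas, which are more involved; the function names and all of the numbers are made up for illustration.

```python
def per_40_minutes(total, minutes_played):
    """Scale a season total to a per-40-minutes rate (minutes independent)."""
    return 40 * total / minutes_played


def per_100_possessions(total, possessions):
    """Scale a season total to a per-100-possessions rate (pace neutral)."""
    return 100 * total / possessions


# Hypothetical starter: 20 points per game in 40 minutes per game, 82 games.
starter = per_40_minutes(total=1640, minutes_played=3280)
# Hypothetical reserve: 14 points per game in 28 minutes per game, 82 games.
reserve = per_40_minutes(total=1148, minutes_played=2296)
print(starter, reserve)

# Hypothetical players on a fast team and a slow team.
fast = per_100_possessions(total=1640, possessions=8200)
slow = per_100_possessions(total=1476, possessions=7380)
print(fast, slow)
```

In both made-up pairs the raw totals differ, but the normalized rates come out the same, which is the sense in which these statistics let players with different minutes or different team paces be compared.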

Rebounding Rate: Hollinger metric: the percentage of missed shots that happen while a player is on the floor that he rebounds.
High: Tyson Chandler, Northern California; David Lee, Missouri; 20.7
Low: Darrick Martin, Colorado/South Dakota, 3.2
Median: 9.3

Assist Ratio: Hollinger metric: the percentage of a player's possessions that end in an assist.
High: Anthony Carter, Wisconsin, 38.4
Low: Alonzo Mourning, Virginia/Delaware, 2.5
Median: 14.6

Turnover Ratio: Hollinger metric: the percentage of a player's possessions that end in a turnover.
High: Joel Przybilla, Minnesota, 28.2
Low: James White, Washington D.C., 4.4
Median: 11.2

Usage Rate: Hollinger metric: the number of possessions the player uses per 40 minutes. Basically, this stat tells us how often the player does something noteworthy with the basketball. This is an important statistic for player evaluation purposes because of the NBA shot clock--because time is so limited, a team needs players who can create shots from nothing. The best players tend to have high usage rates--though the correlation is not ironclad.
High: Tracy McGrady, Florida, 32.9
Low: Michael Ruffin, Colorado/South Dakota, 5.8
Median: 18.5

*****

I'll now briefly sketch how the simulation works. In short, the simulation generates many random numbers and compares them with player metrics to determine outcomes. Each basketball possession must end with someone doing one of two things: either shooting or turning the ball over. The simulation uses Usage Rate to determine who does something, and Turnover Ratio to determine whether that something is a turnover, in which case possession switches to the other team. Otherwise, it's a shot. The simulation uses the Two Point Rate of the player taking the shot to determine whether it is a two point or three point shot, the relevant shooting percentage to determine whether the shot goes in, and the Free Throw Rate to determine whether the shooter was fouled. If the shot is made, the simulation uses the Assist Ratio and Usage Rate of the other players on the team to determine whether the basket was assisted. If the shooter was fouled, his free throw percentage determines how many of the relevant number of free throws he makes. If the shot misses, or if the last free throw misses, the Rebounding Rates of the players on the floor, modified with a penalty for the offensive team to reflect the difficulty of offensive rebounding, determine which team and player get the rebound. That's probably enough detail; I'll make a further technical post somewhere along the line if there is interest.
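For readers who think in code, the possession flow described above might look something like the sketch below. This is not the actual simulation software: the names (`Player`, `pick_index`, `simulate_possession`) are invented, the offensive rebounding penalty of 0.3 is an arbitrary placeholder, and fouls, free throws, and assists are left out to keep it short.

```python
import random
from dataclasses import dataclass


@dataclass
class Player:
    name: str
    two_pct: float         # two point percentage, as a fraction (0-1)
    three_pct: float       # three point percentage, as a fraction (0-1)
    two_rate: float        # share of shots that are two pointers (0-1)
    turnover_ratio: float  # percent of possessions ending in a turnover (0-100)
    rebound_rate: float    # percent of available rebounds grabbed (0-100)
    usage_rate: float      # possessions used per 40 minutes


def pick_index(weights, rng):
    """Pick a position at random, with probability proportional to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def simulate_possession(offense, defense, rng, off_reb_penalty=0.3):
    """Play one possession; return (points scored, whether offense keeps the ball)."""
    # Usage Rate decides who does something with the ball.
    actor = offense[pick_index([p.usage_rate for p in offense], rng)]

    # Turnover Ratio decides whether that something is a turnover.
    if rng.random() < actor.turnover_ratio / 100:
        return 0, False

    # Otherwise it is a shot: Two Point Rate picks the shot type.
    if rng.random() < actor.two_rate:
        points, make_pct = 2, actor.two_pct
    else:
        points, make_pct = 3, actor.three_pct

    if rng.random() < make_pct:
        return points, False  # made basket, so the other team gets the ball

    # Missed shot: all players compete for the rebound, with the offensive
    # players' weights scaled down by the assumed penalty.
    weights = [p.rebound_rate * off_reb_penalty for p in offense]
    weights += [p.rebound_rate for p in defense]
    return 0, pick_index(weights, rng) < len(offense)


# Made-up players, one per side, just to show the call.
rng = random.Random(2008)
guard = Player("Guard", 0.45, 0.38, 0.70, 11.2, 5.0, 22.0)
center = Player("Center", 0.55, 0.00, 1.00, 14.0, 18.0, 15.0)
print(simulate_possession([guard], [center], rng))
```

A full game would repeat this until the clock runs out, handing the ball to the other side whenever the second value comes back false.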

Limitations of the Simulation

There are three major issues:
1) Defense is not considered. As you might know, metrics that measure the value of defense in basketball range from terrible to less terrible. As Hollinger said in his 2003 basketball statistics annual, "It's doubly hard to talk about defense in basketball, however, because the numbers aren't there to support a discussion and the ones that do exist mislead or confuse us...at least as often as they represent an honest portrait of a player's defensive skills." So, I ignored defense entirely when I wrote the simulation software in the spring of 2004. My sense is that the metrics have not appreciably improved in the ensuing four years; I'll continue to ignore defense.

2) Substitutions, fouls, and injuries. Some players commit many more fouls than others; some players get hurt a lot; some players are constantly out of shape. These players hurt their teams by being unable to play. But there is no notion of 'unable to play' in my simulation. The reason for this is that I'm not sure how to implement artificial intelligence to make substitution decisions that don't cloud the picture. This is my limitation, not a limitation with the available data. So, I stick to a simplified five on five game. Since many states can barely field five players, this is a forgivable sin for this project.

3) Team composition. The simulation doesn't know that a team full of Shaqs wouldn't be able to bring the ball up the floor. It doesn't know that a team full of three point specialists wouldn't shoot at their usual percentages, because no one would draw the double teams that leave them open. For these sorts of reasons, teams that are playing many players out of position may behave oddly in the simulation.

I'll mostly be thinking about these three issues when I think about the limitations of the simulation. Your comments are invited; tune in on Tuesday for the first game in the tournament: the play-in game between Washington D.C. and Kansas.
