Kevin Shain

Data Science in the Lab

An Adaptive Bayesian Algorithm for Optimal Qubit Measurement

I did research on quantum computing for a couple of years as an undergrad, mainly focusing on hardware that could prolong the coherence time of a superconducting qubit (link). This project tackles decoherence from a data analysis perspective. The adaptive Bayesian algorithm updates the estimate of the qubit parameter ΔBz after each measurement and uses that knowledge to optimize the information gained from the next measurement.
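To make the update step concrete, here is a minimal sketch of a grid-based Bayesian update, assuming a simplified, noise-free measurement model P(0 | ΔBz, t) = (1 + cos(ΔBz·t))/2 and a placeholder rule for choosing the next evolution time. The real likelihood includes visibility and offset terms, and the real adaptive rule optimizes expected information gain, so treat this purely as an illustration of the structure.

```python
# Sketch of an adaptive Bayesian estimate of dBz on a discrete grid.
# Measurement model and the adaptive-time heuristic are simplifying assumptions.
import numpy as np

# Discretize the possible values of dBz (range and units are illustrative).
dBz_grid = np.linspace(0.0, 2 * np.pi * 100e6, 2001)   # rad/s
posterior = np.ones_like(dBz_grid) / dBz_grid.size      # flat prior

def update(posterior, outcome, t):
    """Bayes rule: multiply the prior by the likelihood of the observed outcome."""
    p0 = 0.5 * (1 + np.cos(dBz_grid * t))     # P(outcome = 0 | dBz, t), noise-free
    likelihood = p0 if outcome == 0 else 1 - p0
    posterior = posterior * likelihood
    return posterior / posterior.sum()

def next_evolution_time(posterior):
    """Placeholder adaptive rule: probe at a time set by the current uncertainty,
    standing in for a full expected-information-gain optimization."""
    mean = np.sum(posterior * dBz_grid)
    sigma = np.sqrt(np.sum(posterior * dBz_grid**2) - mean**2)
    return 1.0 / max(sigma, 1e-12)

# Example: feed in a few simulated single-shot outcomes.
t = 1e-9
for outcome in [0, 1, 1, 0, 1]:
    posterior = update(posterior, outcome, t)
    t = next_evolution_time(posterior)
```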

Qualitative and Quantitative Real-Time PCR analysis

Real-time polymerase chain reaction (PCR) data comes in the form of fluorescence curves, the fluorescence signal at each cycle, which indicates the concentration of the analyte. Each cycle should double the concentration of the analyte until an ingredient in the reaction is used up. My goals are both a qualitative and a quantitative analysis of the fluorescence curves. It is important to industry to have an extremely robust way of determining whether a particular biological molecule is present. This has proved to be tricky, and human graders are still employed to separate positive from negative curves. The root of the problem is often the variable background signal, so I am attempting to build a classifier on top of the maxRatio algorithm, which nicely handles the data baselining. For the quantitative analysis, I am investigating the raw fluorescence curves to find the initial concentration of the analyte. The initial concentration is calculated from the cycle number at which the fluorescence signal exceeds the background, but finding that cycle number is a challenge. Because this research is going towards a publication, I cannot show the analysis yet.
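While I can't share the analysis itself, the ratio idea underlying maxRatio is simple enough to sketch: after a rough baseline correction, the cycle-to-cycle fluorescence ratio peaks sharply for true amplification curves and stays near one for flat negatives. The baseline window and threshold below are illustrative assumptions, not values from the publication-bound work.

```python
# Rough sketch of the cycle-to-cycle ratio idea; not the production analysis.
import numpy as np

def max_ratio_features(fluorescence, baseline_cycles=10):
    """Return (max ratio - 1, cycle of the maximum) for one fluorescence curve."""
    f = np.asarray(fluorescence, dtype=float)
    f = f - f[:baseline_cycles].mean() + 1.0   # crude baseline shift (assumed window)
    ratios = f[1:] / f[:-1]                    # cycle-to-cycle fluorescence ratio
    peak = int(np.argmax(ratios))
    return ratios[peak] - 1.0, peak + 1

def is_positive(fluorescence, mr_threshold=0.1):
    """Toy classification rule: a pronounced ratio peak suggests real amplification."""
    mr, _ = max_ratio_features(fluorescence)
    return mr > mr_threshold
```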

Free Throw Prediction using Play-by-play Data

Seemingly mundane and often frustrating to watch, free throws account for two-thirds of the winning team's points in the last minute and are one of the most statistically interesting acts in sports. At every level of basketball, free throw shooting has remained remarkably constant over time. For the past fifty years, the league-wide free throw percentage has always been within a few percent of its average of 74.9%. Remarkably, the highest league-wide percentage, 77.1%, came all the way back in 1974. With the exception of records like wins by a pitcher in baseball, which reflect a huge change in the way the game is played, most records have been broken in the modern era of advanced training regimens and PEDs. Even sports that don't require exceptional athleticism, like bowling, have seen advances from improving technique. In this respect, free throw shooting is unique.

Estimating the free throw shooting percentage of a player is particularly important for coaches to decide who to play for their team and who to foul on the opposing team. Coaches use the heuristic that players tend to shoot 10% worse in games than in practice due to pressure and fatigue. They also consider past performance in similar situations, but not in a methodical way. To allow for methodical analysis, I have assembled the largest and most complete database of NBA free throws along with situational information. Since 2001, the NBA has included play-by-play information with the box scores to help fans follow the game. Using this information, I was able to scrape all free throws shot since 2001 as well as situational data such as game time, venue, opponent, score, number of the free throw in a given trip to the line, and whether it was a technical free throw. This is all assembled in a PostgreSQL database containing approximately one million shots. In addition to this data, I also assembled season and career statistics on all players since 1950. Finally, I used the play-by-play data to generate a snapshot of each player's statistical history exactly as it would have been available to a coach at the time a free throw was attempted. Throughout the process of making a prediction tool, which will be detailed later, I was especially careful to obey the flow of time by removing future data from the prediction of any given free throw.
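As a sketch of what "obeying the flow of time" looks like in practice, the query below pulls a player's free throw history strictly before a given game date, so no future information leaks into the features. The table and column names (free_throws, player_id, game_date, made) are illustrative, not the actual schema.

```python
# Hedged sketch: time-aware feature lookup against a hypothetical schema.
import psycopg2

QUERY = """
    SELECT AVG(made::int) AS pct_to_date,
           COUNT(*)       AS attempts_to_date
    FROM free_throws
    WHERE player_id = %(player_id)s
      AND game_date < %(game_date)s;   -- only shots that happened before this game
"""

def stats_before(conn, player_id, game_date):
    """Free throw percentage and attempts for a player, as known on game_date."""
    with conn.cursor() as cur:
        cur.execute(QUERY, {"player_id": player_id, "game_date": game_date})
        return cur.fetchone()

# conn = psycopg2.connect(dbname="nba", user="...")   # connection details omitted
```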

Before getting to prediction, the play-by-play data lets us probe some of the heuristics that coaches use to make gametime decisions. First, while fatigue seems to play a role, with free throw percentage decreasing by 1% in the fourth quarter, it spikes to 77.4% in the last minute of the game, presumably because the best shooters are on the floor. Also, there is a consistent decrease of 2-3% over the first few minutes of each quarter. This may indicate that players take some time to warm up, which is further supported by the second free throw of a pair being made 4.6% more often than the first. Seeing such large effects in the league-wide free throw percentages suggests that situational information can be used to better predict a player's likelihood of making a free throw at any point during the game. In addition to being interesting in its own right, this prediction tool is useful for coaches deciding which players to play in a given situation and which opposing players to foul. So far, a tuned random forest classifier with situational data predicts free throws around 3% better than using season and career data alone. While there is still room for better models given the new dataset at our disposal, the current 3% improvement is definitely enough to influence strategic decisions, as differences between players' free throw percentages are often only a few points.
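For reference, here is a minimal sketch of that modeling setup, assuming a feature table with one row per free throw. The feature names and hyperparameters are placeholders rather than the tuned configuration, and the scoring metric shown is one reasonable choice rather than the exact evaluation used.

```python
# Minimal sketch: random forest on features known at the moment of the shot,
# scored with 10-fold cross validation. Names and settings are placeholders.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# df: a pandas DataFrame with one row per free throw, containing the outcome
# and only information available before the shot (no future leakage).
FEATURES = ["game_time", "score_margin", "home_game", "shot_in_trip",
            "season_pct_to_date", "career_pct_to_date"]

def evaluate(df):
    X, y = df[FEATURES], df["made"]
    model = RandomForestClassifier(
        n_estimators=500, min_samples_leaf=50, n_jobs=-1, random_state=0)
    # Log loss rewards well-calibrated probabilities rather than hard calls.
    return cross_val_score(model, X, y, cv=10, scoring="neg_log_loss").mean()
```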

Kobe Bryant's Shooting Percentage Kaggle Competition

In honor of Kobe Bryant retiring from the NBA after playing for 20 years, Kaggle hosted a competition to predict his likelihood of making various shots. The data was already very clean, so the main effort went into feature engineering and evaluating different types of classifiers. Much of the data consisted of qualitative categories such as shot type, action type, or opponent, each of which had many options. I handled this using dummy variables corresponding to each option in each category, which Pandas makes very simple. Some of the classifiers I tried were random forests, gradient boosting, and adaptive boosting, with a carefully tuned random forest performing the best. The classifiers were evaluated using 10-fold cross validation and the log-loss metric. I was trying to predict the probability of making each shot, but in the data, a shot is either a make (1) or a miss (0). Log-loss therefore makes sense as a metric because it penalizes confidently wrong predictions most heavily. You can view my analysis in the IPython notebook. There is nothing wild about the approach, but it shows that sensible feature engineering and cross validation are enough to reach 4th place out of over 1000 on the public leaderboard.
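A condensed sketch of that pipeline: one-hot encode the categorical shot descriptors with Pandas, then score a random forest with 10-fold cross validation on log loss. The hyperparameters below are placeholders rather than the tuned values, and only a handful of the dataset's columns are used for brevity.

```python
# Condensed sketch of the Kaggle pipeline; hyperparameters are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("data.csv")
labeled = df[df["shot_made_flag"].notna()]           # rows with a known outcome

categorical = ["action_type", "combined_shot_type", "shot_zone_area", "opponent"]
X = pd.get_dummies(labeled[categorical])             # one dummy column per option
X["shot_distance"] = labeled["shot_distance"]
y = labeled["shot_made_flag"]

model = RandomForestClassifier(n_estimators=300, max_depth=8, random_state=0)
scores = cross_val_score(model, X, y, cv=10, scoring="neg_log_loss")
print(f"10-fold log loss: {-scores.mean():.4f}")
```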

About Me

I'm currently a PhD student in Physics at Harvard University. I do experimental research on the quantum anomalous Hall effect, quantum spin Hall effect, and graphene in the Yacoby group. As an undergrad, I worked in the Schoelkopf Lab at Yale University and developed a Purcell filter to increase the coherence time of superconducting qubits (paper).