Building Optimal Daily Fantasy Lineups in R
‘Are you ready for some footbaaaalll?!’ It’s that time of the year, the NFL is back! Like many others, part of my enjoyment of football season is through fantasy football, specifically daily fantasy (like DraftKings or Fanduel is you’re unfamiliar). Obviously, I want to apply my analytic skillset to give me an edge, and there’s lot of great packages in R that make that easier to do.
In this post, I’ll be using the great ffanalytics
package to scrape projections from multiple sites, and simulate multiple optimal lineups to get an idea of which players to target in my lineups.
The first step, as always, is getting the right packages loaded
library(data.table)
library(dtplyr)
library(tidyverse)
library(ffanalytics)
library(lpSolve)
library(rPref)
library(kableExtra)
## Specify Week
week <- 3
Scrape Projections
Next, we select the sources we want to pull from. (I’d normally include ESPN in here, but the results have been buggy the last few times I’ve pulled).
sources <- c('CBS', 'Yahoo', 'FantasySharks', 'NumberFire', 'FantasyPros', 'FantasyData', 'FleaFlicker')
scrape <- scrape_data(src = sources,
pos=c('QB', 'RB', 'WR', 'TE', 'DST'),
season = 2019,
week=week)
A nice feature of ffanalytics
is that you can provide custom scoring schemas. Here, I set up the DraftKings scoring schema and apply it to the scraped data, so all projections will be in that format.
scoring <- list(
pass = list(
pass_att = 0, pass_comp = 0, pass_inc = 0, pass_yds = 0.04, pass_tds = 4,
pass_int = -1, pass_40_yds = 0, pass_300_yds = 3, pass_350_yds = 0,
pass_400_yds = 0
),
rush = list(
all_pos = TRUE,
rush_yds = 0.1, rush_att = 0, rush_40_yds = 0, rush_tds = 6,
rush_100_yds = 3, rush_150_yds = 0, rush_200_yds = 0),
rec = list(
all_pos = TRUE,
rec = 1, rec_yds = 0.1, rec_tds = 6, rec_40_yds = 0, rec_100_yds = 3,
rec_150_yds = 0, rec_200_yds = 0
),
misc = list(
all_pos = TRUE,
fumbles_lost = -1, fumbles_total = 0,
sacks = 0, two_pts = 2
),
ret = list(
all_pos = TRUE,
return_tds = 6, return_yds = 0
),
dst = list(
dst_fum_rec = 2, dst_int = 2, dst_safety = 2, dst_sacks = 1, dst_td = 6,
dst_blk = 2, dst_ret_yds = 0, dst_pts_allowed = 0
),
pts_bracket = list(
list(threshold = 0, points = 10),
list(threshold = 1, points = 7),
list(threshold = 7, points = 4),
list(threshold = 14, points = 1),
list(threshold = 21, points = 0),
list(threshold = 28, points = -1),
list(threshold = 35, points = -4)
)
)
proj <- projections_table(scrape, scoring_rules = scoring) %>%
add_player_info()
Now that we have the projections in the proper format, we need to add player salaries. I don’t have a site to programatically pull this from, the way I do it currently is just from exporting from the DraftKings lineup page. There’s some cleaning that has to be done in order to make the names match, (Todd Gurley is an example), and I’m sure there’s a cleaner way to do this, but hey it worked so far.
Another potential issue is sites not updating their injury reports. For example, Trevor Siemian still had a projection for some sites, even though he’s out for the season, so I manually remove him
## Read in DraftKings
injured <- c('Trevor Siemian') ## Remove Injured Players
sal <- read_csv('DKSalaries.csv') %>%
filter(!Name %in% injured)
sal$Name[sal$Name=='Todd Gurley II'] <- 'Todd Gurley'
Build Optimal Lineups
Next I define a function to merge the salary data with the prediction data, and randomly generate a points estimate for a given player, which I’m able to do because ffanalytics
provides standard deviations for their projections, allowing me to pull projections from a probability distribution. I then generate a lineup via lpsolve
. Building lineups is a variant of the classic ‘knapsack’ problem in optimization, I only have so much salary, and I need to fit players in my lineup that will get me the most points while meeting certain constraints (1 QB, 2 RB, 3 WR, 1 TE, 1 DST, 1 Flex, all under $50,000). I repeat this process 10,000 times, giving me 10,000 lineups under 10,000 scoring scenarios.
generate_lineup <- function(n){
pred_sal <- proj %>%
filter(avg_type=='robust') %>%
mutate(Name = ifelse(pos=="DST", last_name, paste(first_name, last_name))) %>%
inner_join(sal, by=c("Name")) %>%
select(Name, team, position, points, Salary, sd_pts) %>%
filter(!is.na(points), !is.na(Salary)) %>%
group_by(Name) %>%
mutate(sal_max=max(Salary)) %>%
filter(Salary==sal_max) %>%
group_by(Name) %>%
mutate(pts_pred = rnorm(1, points, sd_pts),
lineup=n) %>%
select(-sal_max)
obj <- pred_sal$pts_pred
mat <- rbind(t(model.matrix(~ position + 0,pred_sal)), t(model.matrix(~ position + 0,pred_sal)), rep(1, nrow(pred_sal)), pred_sal$Salary)
dir <- c("=","=","<=","<=","<=", "=","=",">=",">=",">=","=","<=")
rhs <- c(1,1,3,2,4,1,1,2,1,3,9,50000)
result <- lp("max", obj, mat, dir, rhs, all.bin = TRUE)
results <- pred_sal[which(result$solution == 1),]
return(results)
}
sim_lu <- map_df(1:10000, generate_lineup) %>%
rename(pts_base=points) %>%
select(lineup, Name, team, position, pts_base, pts_pred, sd_pts, Salary)
Explore Optimal Lineups
Now that we have 10,000 optimal lineups, lets see how those lineups look.
Here’s a sample of three optimal lineups, so you can get a sense of what the data looks like.
sim_lu %>%
filter(lineup<=3) %>%
arrange(lineup, position, desc(pts_pred)) %>%
knitr::kable() %>%
kable_styling() %>%
column_spec(1, bold=TRUE) %>%
collapse_rows(columns = 1, valign = 'top')
lineup | Name | team | position | pts_base | pts_pred | sd_pts | Salary |
---|---|---|---|---|---|---|---|
1 | Dolphins | MIA | DST | 4.97000 | 6.598585 | 0.7086828 | 2000 |
Dak Prescott | DAL | QB | 23.73500 | 24.971105 | 0.8272908 | 6500 | |
Christian McCaffrey | CAR | RB | 24.96164 | 25.569784 | 1.0081680 | 8700 | |
Devonta Freeman | ATL | RB | 12.71842 | 14.707710 | 1.0526460 | 4900 | |
Zach Ertz | PHI | TE | 17.38000 | 18.002119 | 2.5352460 | 5700 | |
Keenan Allen | LAC | WR | 20.24500 | 24.267924 | 2.6612670 | 7000 | |
Sammy Watkins | KCC | WR | 17.73100 | 21.472349 | 2.6345802 | 6800 | |
Emmanuel Sanders | DEN | WR | 14.76000 | 15.308993 | 1.5537648 | 4800 | |
Nelson Agholor | PHI | WR | 13.08025 | 12.421407 | 0.6108312 | 3600 | |
2 | Cardinals | ARI | DST | 7.02700 | 7.956459 | 1.1327064 | 2700 |
Jameis Winston | TBB | QB | 20.35533 | 20.534232 | 0.4922232 | 5400 | |
Christian McCaffrey | CAR | RB | 24.96164 | 24.693976 | 1.0081680 | 8700 | |
Ezekiel Elliott | DAL | RB | 24.45650 | 22.832514 | 2.1742329 | 8900 | |
Zach Ertz | PHI | TE | 17.38000 | 16.542479 | 2.5352460 | 5700 | |
O.J. Howard | TBB | TE | 8.45050 | 11.634965 | 1.4974260 | 3800 | |
DeSean Jackson | PHI | WR | 11.60000 | 22.402629 | 8.5990800 | 5800 | |
Larry Fitzgerald | ARI | WR | 15.60500 | 15.718476 | 0.8006040 | 5100 | |
Nelson Agholor | PHI | WR | 13.08025 | 12.519298 | 0.6108312 | 3600 | |
3 | Steelers | PIT | DST | 6.83000 | 7.748545 | 0.8747340 | 2300 |
Jameis Winston | TBB | QB | 20.35533 | 20.635016 | 0.4922232 | 5400 | |
Saquon Barkley | NYG | RB | 22.66727 | 25.769062 | 2.5975152 | 9100 | |
Christian McCaffrey | CAR | RB | 24.96164 | 24.227040 | 1.0081680 | 8700 | |
C.J. Uzomah | CIN | TE | 4.88000 | 10.489492 | 2.6242020 | 2700 | |
Julio Jones | ATL | WR | 19.08000 | 19.990107 | 1.9251561 | 7300 | |
T.Y. Hilton | IND | WR | 16.04000 | 19.965852 | 1.8021003 | 6400 | |
Larry Fitzgerald | ARI | WR | 15.60500 | 15.695509 | 0.8006040 | 5100 | |
Darius Slayton | NYG | WR | 4.81500 | 10.644966 | 5.4633810 | 3000 |
Now that we have an idea of what the data looks let’s dig into what these optimized lineups are showing.
First, a look at which players are represented in optimal lineups. Here I’ll look at the top 10 by position.
sim_lu %>%
group_by(Name, position) %>%
dplyr::summarize(lu=n_distinct(lineup)) %>%
ungroup() %>%
group_by(position) %>%
top_n(10, lu) %>%
ungroup() %>%
arrange(position, desc(lu)) %>%
mutate(Name=factor(Name),
Name=fct_reorder(Name, lu)) %>%
ggplot(aes(x=Name, y=lu)) +
geom_bar(stat='identity') +
facet_wrap(~position, ncol = 3, scales='free') +
coord_flip() +
scale_y_continuous(labels = scales::comma) +
ggtitle('Top 10 Players Present by Position')
From what we see here, Dak Prescott is a solid play for QB, as he shows in in around 5,500 lineups, or around 55% of all optimal lineups. The next most common is Jameis Winston, coming in at around 2,700 lineups, just over a quarter.
Now let’s look at flex configurations. As described above, in DraftKings and most daily fantasy, you have a flex spot, where you can place an extra RB, WR, or TE. Another thing we can learn from these optimal lineups is how often which configurations show up (e.g. how often an RB is selected for the flex position).
sim_lu %>%
group_by(lineup) %>%
mutate(lineup_pts=sum(pts_pred)) %>%
group_by(lineup, position) %>%
mutate(n=n()) %>%
select(lineup, position, n, lineup_pts) %>%
distinct() %>%
spread(key=position, value=n) %>%
filter(RB>=2, TE>=1, WR>=3) %>%
mutate(flex=case_when(RB==3 ~ 'RB',
TE==2 ~ 'TE',
WR==4 ~ 'WR')) %>%
group_by(flex) %>%
dplyr::summarize(pts=median(lineup_pts),
cases=n()) %>%
knitr::kable() %>%
kable_styling(full_width = FALSE)
flex | pts | cases |
---|---|---|
RB | 160.6642 | 2325 |
TE | 161.8391 | 1900 |
WR | 161.6524 | 5775 |
For this week, a WR has been put in the flex spot for about 57% of lineups, an RB in about 23% of lineups, and a TE in about 19% of lineups. Just looking at the median lineup points, there isn’t a large difference in scoring between flex configurations.
On that now, let’s look at the distribution of lineup points?
sim_lu %>%
group_by(lineup) %>%
dplyr::summarize(points=sum(pts_pred)) %>%
ggplot(aes(x=points)) +
geom_histogram() +
ggtitle('Points Distribution of Optimal Lineups')
Nice, normal looking distribution!
A little too normal looking…
Here it is important to note that the standard deviation of the projection (supplied by ffanalytics) is used to estimate the true projection, not the true performance, that’s why the distribution is so tight and normal looking.
While you CAN enter a lot of lineups in DFS (less so than the old days), we can’t use all these lineups. So how do we narrow down? One way is to look at what’s called that Pareto frontier, lineups where we either maximize total lineup points, or minimize lineup standard deviation. If it sounds a little bit confusing, here’s a visualization to illustrate the point, the bolded points are the pareto lineups.
lu_df <- sim_lu %>%
group_by(lineup) %>%
dplyr::summarize(lineup_pts=sum(pts_pred),
lineup_sd=sum(sd_pts)) %>%
ungroup()
pto <- psel(lu_df, low(lineup_sd) * high(lineup_pts))
ggplot(lu_df, aes(y=lineup_pts, x=lineup_sd)) +
geom_point() +
geom_point(data=pto, size=5) +
ylab('Lineup Points') +
xlab('Lineup Points St Dev') +
ggtitle('Lineup Points vs Uncertainty',
subtitle = 'Pareto Lineups Bolded')
This allows us to maximize the tradeoff between lineup points and lineup uncertainty. As expected, there’s a general trend of higher points lineups being more uncertain. We can leverage that uncertainty for certain games. For example, a lineup that has good points and lower uncertainty will be a good cash game play, where the higher uncertainty lineups are good GPP plays, where you want to focus on a lineup’s ceiling rather than their raw projection.
A csv with all 10,000 lineups is located here
Next Steps
While this analysis can improve our lineup building, there are some important limitations.
First, these projections are at the player level, taking no covariance into account. For example, Dak Prescott and Amari Cooper’s simulated scores are calculated separately, in a real life scenario, a better day from Dak probably means a better day for Amari, and a better model should take that into account.
Second, and on a similar note, optimal lineups don’t account for stacking (selecting, say, a QB and WR from the same team, because points for one means points for the other, allowing you to double dip), and one needs to be cognizant of that to both maximize a lineup’s upside, and limit it’s downside (e.g. don’t take a QB and their opposing defense, as their scores are negatively correlated).
These improvements will be coming in future weeks.