I was in a position recently where I needed to provide a quick, semi-accurate estimate of how long it would take to run a supervised learning model (GBM in this case [maybe not the best choice, in retrospect]) over a ~2GB dataset with a binary target. I had zero clue where to start. I had run GBM over smaller datasets in the past, but had no idea, for example, what the impact of distributing the process across 4 cores would be in terms of time.
In this particular case, interpretability was important (again, in retrospect: CART would have been better for calculating variable importance), so pre-processing with, e.g., PCA was less desirable.
My solution at the time was to run the full GBM model across various subsets of the full dataset with varying parameters in order to gauge run time (I did this by wrapping the full caret model in system.time() and recording the output by hand for each iteration). It became clear, for example, that distributing across 4 cores cut the run time in half (which puzzled me at first, as stochastic gradient boosting itself doesn't parallelize well; my working theory is that caret parallelizes across the cross-validation resamples and tuning-grid candidates rather than within the boosting run, but further investigation is required here).
Yes, it got the job done, but in an extravagantly inefficient manner, and, even worse, it cost me a full night of sleep, which I can afford less and less of in the professional services world. The next day, when I had a minute, I wrote this quick script to run several trials of a GBM model using the caret package, with random samples of various parameters (shrinkage, number of trees, number of dimensions, etc.). The results are written to a simple .csv, which I can run a regression on to predict run times in the future.
One fun function I learned was tryCatch, a syntactically odd but useful function for skipping loop iterations that return an error. My trials didn't end up containing any errors, but the script had to run for about half a day, so wrapping each iteration this way meant I didn't have to check back until it was finished.
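For anyone who hasn't used it, here's a toy version of the pattern (the failing iteration is manufactured just for illustration):

for (i in 1:3) {
  tryCatch({
    if (i == 2) stop("something broke on iteration 2")  # simulate a failure
    cat("iteration", i, "succeeded\n")
  }, error = function(e) { cat("ERROR :", conditionMessage(e), "\n") })
}
# iterations 1 and 3 still run; the error on 2 is printed and skipped

The full script follows.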
# Install (if needed) and load required packages; require() alone doesn't
# load a package it has just installed
if (!require(doParallel)) { install.packages("doParallel"); library(doParallel) }
if (!require(mail)) { install.packages("mail"); library(mail) }  # for the planned email alert (see lessons learned)
if (!require(caret)) { install.packages("caret", dependencies = c("Depends", "Suggests")); library(caret) }
#load("~/gbmFit180Kfin.RData") #full<-startDF
INTERDEPTH <- c(1, 2)       # interaction depths for the tuning grid
NUMTREES   <- (1:10) * 100  # tree counts for the tuning grid: 100, 200, ..., 1000
trials     <- 1:100         # number of timing trials
df.regress <- data.frame()  # accumulates one row of results per trial
for (i in trials) {
  tryCatch({
    # Randomly draw this trial's parameters
    nrows      <- sample(c(1000, 2000, 3000, 4000, 5000, 10000, 20000), 1)  # rows of training data
    dims       <- sample(10:45, 1)  # last predictor column (predictors are columns 6:dims)
    model      <- "gbm"
    nfolds     <- sample(2:10, 1)   # cross-validation folds
    nrepeats   <- sample(2:10, 1)   # cross-validation repeats
    ncores     <- sample(1:detectCores(), 1)
    classprobs <- 1
    preproc    <- "pca"
    shrinkage  <- sample(c(0.1, 0.2, 0.3), 1)
    # (grid bounds ID_min/ID_max/ntrees_min/ntrees_max are recorded after training)
    # Parallel backend; caret farms the resampling iterations out to the workers
    cl <- makeCluster(ncores)
    registerDoParallel(cl)
    # Subset the data; column indices are specific to this dataset
    # (see lessons learned about hardcoding)
    train_set   <- full[1:nrows, ]
    train_class <- factor(train_set$target_eval)
    levels(train_class) <- make.names(levels(train_class))  # caret needs valid R names when classProbs = TRUE
    train_descr <- train_set[, 6:dims]
    set.seed(1)  # same CV splits in every trial
    runTime <- system.time(
      gbmFit <- train(train_descr,
                      train_class,
                      method = model,
                      preProc = preproc,
                      trControl = trainControl(method = "repeatedcv",
                                               number = nfolds,
                                               repeats = nrepeats,
                                               classProbs = TRUE,
                                               summaryFunction = twoClassSummary),
                      metric = "ROC",
                      verbose = TRUE,
                      # note: newer caret versions drop the leading dots and also
                      # require an n.minobsinnode column in the gbm grid
                      tuneGrid = expand.grid(.interaction.depth = INTERDEPTH,
                                             .n.trees = NUMTREES,
                                             .shrinkage = shrinkage))
    )
    stopCluster(cl)  # release the workers before the next trial
    # Record this trial's parameters and results as one row; using data.frame()
    # rather than cbind() keeps numeric columns numeric instead of coercing
    # everything to character
    df <- data.frame(TRIAL      = i,
                     NROWS      = nrows,
                     DIMS       = dims,
                     MODEL      = model,
                     NFOLDS     = nfolds,
                     NREPEATS   = nrepeats,
                     NCORES     = ncores,
                     CLASSPROBS = as.character(classprobs),
                     PREPROC    = preproc,
                     ID_num     = length(INTERDEPTH),
                     ID_min     = min(INTERDEPTH),
                     ID_max     = max(INTERDEPTH),
                     ntrees_min = min(NUMTREES),
                     ntrees_max = max(NUMTREES),
                     SHRINKAGE  = shrinkage,
                     dims.fin   = nrow(varImp(gbmFit)$importance),  # predictors after PCA pre-processing
                     RUNTIME    = as.numeric(runTime["elapsed"]),   # seconds
                     ROC        = max(gbmFit$results$ROC))
    df.regress <- rbind(df.regress, df)
  }, error = function(e) { cat("ERROR :", conditionMessage(e), "\n") })
}
write.csv(df.regress, "Run_Time_Regression_100.csv", row.names = FALSE)  # write.csv ignores append = TRUE
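And to actually use the output, here is a minimal sketch of the regression step, assuming the CSV produced above (the plain linear model form is just my first pass, not something tuned):

# Regress run time on the trial parameters
results <- read.csv("Run_Time_Regression_100.csv")
rt.fit  <- lm(RUNTIME ~ NROWS + DIMS + NFOLDS + NREPEATS + NCORES + SHRINKAGE,
              data = results)
summary(rt.fit)  # coefficients estimate the marginal cost of each parameter, in seconds
# Predict run time for a hypothetical future configuration
predict(rt.fit, newdata = data.frame(NROWS = 50000, DIMS = 40, NFOLDS = 10,
                                     NREPEATS = 3, NCORES = 4, SHRINKAGE = 0.1))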
Lessons learned/other improvements for future experiments:
- Use CART to determine variable importance first, then run bagging/boosting models for predictive accuracy
- Automatically detect and assign variable names rather than hardcoding them (I actually wrote an Excel macro to quickly dump column names into a properly formatted text string, so it wasn't pure hardcoding, but it would be better to have R handle the whole thing; see the sketch after this list)
- Add a "mail to" function to alert me via email when calculations are finished (also sketched below)
- I came across several benchmarking functions while researching but didn't have time to dig in all the way. Look into existing benchmarking capabilities, especially in dplyr, etc.
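A rough sketch of the second and third ideas. The column detection assumes everything except the target is a predictor, which isn't actually true of the dataset above (columns 1:5 are skipped there), and the email bit assumes the mail package's sendmail() interface with a placeholder address:

# Hypothetical: detect predictor columns by name instead of hardcoding 6:dims
predictor_cols <- setdiff(names(full), "target_eval")
train_descr    <- full[1:nrows, predictor_cols]

# Email alert via the mail package loaded at the top (address is a placeholder)
sendmail("me@example.com",
         subject = "GBM timing trials finished",
         message = paste(nrow(df.regress), "trials written to Run_Time_Regression_100.csv"))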