Category Archives: R Notes

beep(4) “Work Complete! [orc]”

Love this:

install.packages("beepr")
library(beepr)

beep(4,expr=rnorm(10000))
update.packages(ask=FALSE);beep("ready")

“We’re ready, master”

“I’m Not ready!”

really brings me back.

Useful, too

Thanks! http://www.sumsar.net/blog/2014/06/beepr-is-on-cran/

Best ODBC-to-R tool for corporate environments: read.odbc.ffdf

ff and ETLUtils are perfect for an analyst in a corporate/enterprise setting with mid-sized-to-large datasets (i.e., can be held on-disk but not in memory) stored on databases such as Oracle or MySQL. The set I’m looking at is ~11MM rows and 132 dimensions, around 12 GB. Let’s look at a primer example, querying the database and saving the result as a data.frame:

library(RODBC)

login <- list()
login$dsn <- "MyDSN"
login$uid <- "MyUID"
login$pwd <- "MyPWD"

#open the ODBC connection that sqlQuery uses below
channel <- odbcConnect(dsn = login$dsn, uid = login$uid, pwd = login$pwd)

y <- sqlQuery(channel, "SELECT * from MY_TABLE WHERE ROWNUM < 100000")

object.size(y)/1048600
>74.0599427808507 bytes #value is in MB; ignore the "bytes" label

class(y)
[1] "data.frame"

So, that’s 74 Mb as a dataframe. Now let’s look at the same query, but saving the result as an “ffdf” using the read.odbc.ffdf function from the ETLUtils package:


require(ETLUtils)
login <- list()
login$dsn <- "MyDSN"
login$uid <- "MyUID"
login$pwd <- "MyPWD"

#same query, but read directly into an on-disk ffdf rather than an in-memory data.frame
x <- read.odbc.ffdf(
 query = "SELECT * from MY_TABLE WHERE ROWNUM < 100000", 
 odbcConnect.args = list(dsn = login$dsn, uid = login$uid, pwd = login$pwd),
 VERBOSE=TRUE)

class(x)
[1] "ffdf"

The comparison:


object.size(x)/object.size(y)
0.154836816876021

Roughly 1/6th of the memory allocation (and much faster). The only drawback: now I have to learn how to manipulate an ffdf instead of a data.frame.
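For future reference, a minimal sketch of poking at an ffdf. Basic accessors work like a data.frame, small row subsets come back into memory as ordinary data.frames, and the filtering step assumes the ffbase package (MY_COLUMN is a placeholder name, not a column in my table):

require(ff)
names(x)      #column names, as with a data.frame
nrow(x)       #number of rows
x[1:5, ]      #first five rows, pulled into memory as a data.frame

require(ffbase)                       #assumed installed; provides subset() for ffdf objects
x.small <- subset(x, MY_COLUMN > 0)   #MY_COLUMN is a placeholder column name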

UPDATE: Ran an overnight full read-in of the 11,206,056 rows and 132 columns. Unfortunately I didn’t capture the exact timing (estimating around 3-4 hours), but the operation was successful and the resultant ffdf was only 616.5 Mb. Saved the workspace image to my C drive, so I can reload it in the future.
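Note to self: since the ff files behind an ffdf live on disk (by default in a temp directory), the ff package’s ffsave()/ffload() pair is probably a safer way to persist the object than save.image() alone. A sketch, with a placeholder path:

#bundle the ffdf and its on-disk ff files into an archive
ffsave(x, file = "C:/ff_archive/my_table")    #placeholder path

#later, in a fresh session
ffload(file = "C:/ff_archive/my_table")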

A Preference for Elegance to Brute Force, AKA O(log(n)) > O(n)

Question: Given m sorted arrays each of size n, what is the fastest algorithm to find the kth smallest element in the union of the m arrays?

A friend sent me this optimization question on Quora. He said he’d solved it in Python, on a plane ride, in O(log(k)) time (Note: the algorithm uses O(log(k)) time, [presumably] not my friend). I thought sure, challenge accepted.

I didn’t have the luxury of a full plane ride to figure it out, but I came up with a solution in R, using linear search. I know, I know, linear search is O(n) time. No excuse, I lost the challenge. But again, no plane ride! C’mon!

I wonder, though. Sorting is a precondition for binary search, and sorting itself can take O(n log n) time (or O(n), O(n+k), or O(nk) with specialized sorts). That means binary searching an unsorted list could cost the sort plus O(log(n)), as opposed to a linear/brute-force search, which tends to O(n) time. Not sure… I should come back to Big O notation later.
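For my own future reference, a quick side sketch (not part of the solution below) of what the O(log(n)) lookup itself looks like over an already-sorted vector:

#iterative binary search over a sorted (ascending) numeric vector
#returns the index of target, or NA if not found
binSearch <- function(v, target){
  lo <- 1; hi <- length(v)
  while(lo <= hi){
    mid <- (lo + hi) %/% 2
    if(v[mid] == target) return(mid)
    if(v[mid] < target) lo <- mid + 1 else hi <- mid - 1
  }
  NA
}

binSearch(c(2, 3, 5, 8, 13, 21), 8)   #returns 4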

As promised, the solution below is also posted to my recently not-so-anemic GitHub page.

#searchArrays takes a list of sorted vectors as an argument
#and finds the smallest common element (the k argument is not used yet)
#note that the input vectors must be sorted ascending

searchArrays <- function(array.list, k=1){
  element.vector <- vector()
  for(e in 1:length(array.list)){
    #all arrays except the current reference array e
    array.list.search <- array.list[!1:length(array.list) %in% e]
    for(p in 1:length(array.list[[e]])){
      x.search <- array.list[[e]][p]
      element.vector <- x.search
      for(l in 1:length(array.list.search)){
        for(n in 1:length(array.list.search[l])){
          element.vector <- append(element.vector, array.list.search[[l]][n])
          if(length(element.vector)==length(array.list) &
             range(element.vector)[1]==range(element.vector)[2]){
            return(element.vector)
          }
        }
      }
    }
  }
  if(length(element.vector)==length(array.list) &
     range(element.vector)[1]==range(element.vector)[2]){
    print(element.vector)
  } else { print("Error: no common element across arrays") }
}

The input to the function searchArrays() is a list of sorted vectors. The goal is to find the smallest common element in all of the arrays.

Using loops, I define the reference array e (each array in turn), then loop through each element in e, which I label x.search (maybe I should have called this e.i). I then separate e from the remaining search arrays, labeling those array.list.search.

At the heart of the nested loops is the “element.vector”, which builds one element at a time, checking for uniformity with each step. To illustrate, if you print it in real time it looks something like:

a
a b
a
a a
a a a
a a a c
a
a a
a a a
a a a a

To test the above:


test.gen <- function(n){ sort(sample(letters, n, replace = TRUE)) }
test <- list()
for(i in 1:5){ test <- append(test, list(test.gen(100))) }
searchArrays(test)

 

Final thought: solving the above required checking whether element.vector was uniform at each step. It was an unexpected challenge which I enjoyed solving. The aha moment came when I realized that if the vector is uniform, the min and max should be the same. Thus you can compare the range:

range(element.vector)[1]==range(element.vector)[2]
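(An equivalent check, jotted down here for later, is to ask whether the vector collapses to a single unique value, which also works for non-numeric vectors:)

length(unique(element.vector)) == 1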

This blog post brought to you by Charles Mingus radio, via Spotify.

An Accidental K-NN

Problem: For a large fitness chain, identify fitness clubs which were over-staffing. The best way to do this would be a simple supply-demand comparison of staffed hours vs. member usage on a day-by-day or even hourly basis, but I didn’t readily have this information.

With very limited data access and pressed for time, I decided to look for clubs with relatively higher staffing levels. Obviously, certain larger clubs will staff more hours, so I needed to normalize by things like club revenue, total member usage in terms of hours, square footage of the club and other parameters.

By the end, I had accidentally, incrementally hand-built a K-NN model from scratch.

First cut (v.1):


##using a new file, saving old file in workspace for reference
#file_init<-file
#df1_input1<-df1

file<-"~./Data/Labor KNN/LaborKNNInput2.csv"

df1<-read.csv(file, stringsAsFactors=F)
#removing value rows with 0's
df2<-df1[-which(df1$"Max.of.2013.Total.Usage"==0),]
df3<-df2[-which(df2$"Max.of.Sq.Ft"==0),]
df4<-df3[-which(df3$"Max.of.2013.Club.Revenue"==0),]
row.diff<-nrow(df1)-nrow(df4)
cat(paste("total of",row.diff,"observations removed of",nrow(df1)))

#stratifying data by role_type for clustering
df4$ROLE_TYPE<-as.factor(df4$ROLE_TYPE)
df.split<-split(df4,df4$ROLE_TYPE)

#===K-NN===#
#system.time({
k<-50 #must be >= 10

df.fin<-data.frame()
for(i in 1:length(df.split)){ #for each partition by role-type
  t1<-as.data.frame(df.split[i])
  t1<-cbind(t1,"group"=i)
  if(nrow(t1)<=11){
    d<-nrow(t1)-1
  } else {d<-k}
  df.int1<-data.frame()
  for(j in 1:nrow(t1)){ #for each row in each partition
    t2<-t1[j,]                            #the club being scored
    t3<-t1[-j,]                           #all other clubs in the partition
    vals<-abs(t3[[7]]-as.numeric(t2[7]))  #distance on the normalizing variable (column 7)
    hours<-t3[order(vals),6]              #staffed hours, ordered nearest-first
    av<-mean(hours[1:d])
    std<-sd(hours[1:d])
    zsc<-(as.numeric(t2[6])-av)/std       #z-score of this club vs. its d nearest neighbors
    df.int<-cbind(t1[j,"group"],t1[j,1],zsc,std,d,nrow(t1))
    df.int1<-rbind(df.int1,df.int)
  }
  df.fin<-rbind(df.fin,df.int1)
}
names(df.fin)<-c("group#",names(df1)[1],"cluster z-score","cluster std","cluster size","group size")

#})#for system.time

mer<-merge(x=df2,y=df.fin,by.x="ROLE_TYPE_LOCATION",by.y="ROLE_TYPE_LOCATION")
mer2<-mer[-which(mer$Standard_Role=="#N/A"),]
mer3<-cbind("AnlysID"=1:nrow(mer2),mer2)

#head(mer3[order(mer3[10],decreasing=T),],n=20)

For future research:

  • Package ‘kknn’ by Klaus Schliep & Klaus Hechenbichler could be used to the same effect (see the sketch below): 

    http://cran.r-project.org/web/packages/kknn/kknn.pdf
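A rough sketch of how that might look (untested; HOURS and REVENUE are placeholder column names standing in for the staffed-hours and normalizing variables used above):

require(kknn)

#for each club, predict staffed hours from the k most similar clubs,
#then flag clubs whose actual hours sit far above the prediction
fit <- kknn(HOURS ~ REVENUE, train = df4, test = df4, k = 50, kernel = "rectangular")
df4$hours.resid <- df4$HOURS - fit$fitted.values
head(df4[order(df4$hours.resid, decreasing = TRUE), ], n = 20)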

Supervised Learning with CARET in R

“Supervised learning” is an easy enough concept to understand, I think. The goal is to design a “model” which looks at several input variables and predicts a binary outcome, like “buys/doesn’t buy” or “clicks through/doesn’t click through” or “lives/dies”, etc. In the case below, I’ve used a Stochastic Gradient Boosting model (GBM), which right now is considered one of the most robust predictive models available in terms of accuracy (the trade-offs being that it is computationally expensive and can be difficult to interpret).

The setup is a dataframe with one (or more) binary target variable(s) and any number of predictor variables. You instruct a computer to look at each row and create a decision tree representing the likelihood of arriving at an outcome given a particular value for each variable (for more on the mechanics see: CART analysis). You then have the computer repeat this procedure many times, sometimes creating thousands of trees, each time fitting to a different sample of the dataset (re-sampling, or “bagging”) and testing a given number of trees against another random re-sampling of the dataset. For any given combination of predictor variables in the re-sampling, all of the decision trees in a particular model “vote” on whether the target should be a 0 or 1 (you can also approximate probability by counting the share of votes one way or the other, i.e., a continuous range between 0 and 1). Because of this re-sampling of the data, you can check the predictive accuracy of different models by comparing predicted vs. actual outcomes.
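To make the building block concrete, here is a single classification tree on rpart’s built-in kyphosis data (just an illustration of one tree, not the boosted model below):

library(rpart)

#one CART-style decision tree predicting a binary outcome (Kyphosis: absent/present)
single.tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
print(single.tree)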

The re-sampling algorithm, at a high level, is laid out in Hastie, Tibshirani and Friedman’s The Elements of Statistical Learning.

The CARET package in R is an impressive wrapper for several supervised learning models, giving the modeler access to multiple algorithms with different parameters. For example, you can create a random forest model simply by changing the “method” parameter from “gbm” to “rf” and switching a few tuning parameters (a list of all available models with parameters).

I found a nice practice data set where the target_eval variable was a column of 0’s and 1’s and the train and test sets were already defined. Note: the multicollinearity tests were just for fun and aren’t necessary for the GBM model.

#input files:
file1 <- "~/data.csv"
full <- as.data.frame(read.csv(file1))

#parse training and test sets
train_set <- subset(full, train == 1)
train_class <- factor(train_set$target_eval)
levels(train_class) <- make.names(levels(train_class))
train_descr <- train_set[,5:length(train_set)]

test_set <- subset(full, train == 0)
test_class <- factor(test_set$target_eval)
test_descr <- test_set[,5:length(test_set)]

#Testing for multicollinearity
descrCorr <- cor(train_descr)
highCorr <- findCorrelation(descrCorr,0.90)
#returns nothing to remove, even at a cutoff of 0.5, so multicollinearity does not appear to be an issue
#we confirm this by looking at a summary of the correlation matrix, noting that the 3rd quartile rarely exceeds .4
summary(cor(train_descr))

#===Stochastic Gradient Boosting Model===#

if(!require(caret)){install.packages("caret", dependencies = c("Depends","Suggests")); library(caret)}
set.seed(1)

gbmFit <- train(train_set[,5:length(train_set)],
     train_class,
     method="gbm", #gradient boosting machine method
     preProc = "pca", #principal component analysis
     trControl = trainControl(method = "repeatedcv", #10-fold cross validation
                             number=10,
                             repeats = 10,
                             classProbs = TRUE,
                             summaryFunction = twoClassSummary), #retain classprobs
     metric = "ROC", #evaluation=ROC
     verbose=TRUE,
     tuneGrid=expand.grid(.interaction.depth = c(2),
                          .n.trees = (10:55)*100,
                          .shrinkage = 0.1)
 )

gbmFit

#testing model
set.seed(1)
gbmRaw <- predict.train(gbmFit, test_descr,type = "raw")
gbmRaw <- gsub("X0","0",gbmRaw)
gbmRaw <- gsub("X1","1",gbmRaw)

gbmProb <- predict.train(gbmFit, newdata = test_set[,5:length(test_set)],type = "prob")
names(gbmProb) <- c("0","1")
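A natural next step (not shown in my original run) is to score the held-out predictions against the true labels, e.g. with caret’s confusionMatrix(), assuming test_class holds the actual 0/1 outcomes as above:

#compare predicted vs. actual classes on the test set
confusionMatrix(data = factor(gbmRaw, levels = levels(test_class)),
                reference = test_class)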

That is one of the most powerful predictive models available, and I was able to build it *somewhat* from scratch in just a few hours with CARET. R is awesome.

fread vs. read.table: Reading in >2GB of data, fast

I had 15 .zip files containing “|”-delimited text files and was having a tough time reading them into R using read.table. I tried all the tricks like setting colClasses, defining nrows, and even learned some new UNIX syntax to set parameters. I was still running into memory limits.

After some research, here are a few R packages which attempt to deal with memory limits:

  • sqldf
  • read.csv.sql
  • bigmemory (only for UNIX environments)
  • “fread()” from the data.table package

Unfortunately, sqldf and read.csv.sql don’t have the fine-tuned read-in arguments that I needed (e.g., the quote="\"" argument in read.table). They ended up not being able to properly read the data, which contained improper quotes, etc. I’ll have to come back to sqldf later when I have fewer format issues.

fread() from the data.table package was an initial win in terms of both memory and speed. I decided to test it against read.table to see which was faster. Memory limits still might be an issue (although the documentation seems to suggest this will be improved) but the increase in speed was promising.

Baseline read.table ran without stringsAsFactors=F. I then tried read.table with stringsAsFactors=F (which, in this case, makes no difference to the resultant dataframe), then compared that to fread() basic (no arguments defined, using all auto-detection), then fread() with “sep” specified, and then fread() with stringsAsFactors set to FALSE.


rt.baseline<-system.time(df1<-read.table(myfile,nrows=10000,header=F,sep="|",quote="\"",fill=T))
rt.SAF<-system.time(df2<-read.table(myfile,nrows=10000,header=F,sep="|",quote="\"",fill=T,stringsAsFactors=F))

fread.base<-system.time(df3<-fread(myfile,nrows=10000))
fread.sep<-system.time(df4<-fread(myfile,nrows=10000,sep="|"))
fread.SAF<-system.time(df5<-fread(myfile,nrows=10000,sep="|",stringsAsFactors=F,header=F))

times<-cbind(rt.baseline,rt.SAF,fread.base,fread.sep,fread.SAF)[3,]
times
rt.baseline rt.SAF fread.base fread.sep fread.SAF
 0.57 0.17 0.34 0.32 0.31

The resultant system times are “elapsed” times.

Conclusions:

  • fread() is considerably faster than read.table with default settings, and offers conveniences like auto-detection of separators
  • read.table with “stringsAsFactors=F” is actually much faster than fread()
  • Interestingly, stringsAsFactors and sep have very little impact on fread()

read.table wins in terms of speed, but fread() might still win in terms of memory allocation.
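A quick way to check the memory side of that claim (not something I ran as part of the timing test above) is to compare the two results directly:

#in-memory footprint of the read.table result vs. the fread result (a data.table)
format(object.size(df2), units = "Mb")
format(object.size(df5), units = "Mb")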

Performance Benchmarking with CARET

I was in a position recently where I needed to provide a quick, semi-accurate estimate of how long it would take to run a supervised learning model (GBM in this case [maybe not the best choice, in retrospect]) over a ~2GB dataset with a binary target. I had zero clue where to start. I had run GBM over smaller datasets in the past, but had no idea, for example, what the impact of distributing the process across 4 cores would be in terms of time.

In this particular case, interpretability was important (again in retrospect: CART would have been better for calculating variable importance), so pre-processing with, e.g., PCA was less desirable.

My solution at the time was to run the full GBM model across various subsets of the full dataset with varying parameters in order to gauge run time (I did this by wrapping the full CARET model in system.time() and recording the output by hand for each iteration). It became clear that, for example, distributing across 4 cores cut the run time in half (which is still puzzling to me, as stochastic gradient boosting doesn’t parallelize well… Further investigation required here).

Yes, it got the job done, but in an extravagantly inefficient manner, and, even worse, it cost me a full night of sleep, which I can afford less and less of in the professional services world. The next day, when I had a minute, I wrote this quick script to run several trials of a GBM model using the CARET package, with random samples of various parameters (shrinkage, number of trees, number of dimensions, etc.). The results are written to a simple .csv, which I can run a regression on to predict run times in the future (a sketch of that regression follows the script below).

One fun function I learned was tryCatch, which is a syntactically odd but useful function for skipping loop iterations that return an error. My trials didn’t end up containing errors, but since the run took about half a day, it meant I didn’t have to check back until it was finished.

if(!require(doParallel)){install.packages("doParallel"); library(doParallel)}
if(!require(mail)){install.packages("mail"); library(mail)}
if(!require(caret)){install.packages("caret", dependencies = c("Depends","Suggests")); library(caret)}

#load("~/gbmFit180Kfin.RData") #full<-startDF

INTERDEPTH<-c(1,2)
NUMTREES<-(1:10)*100

trials<-(1:100)
df.regress<-data.frame()

for(i in trials){
 tryCatch({
 nrows<-as.numeric(sample(c(1000,2000,3000,4000,5000,10000,20000),1,replace=T))
 dims<-as.numeric(sample(10:45,1,replace=T))
 model<-"gbm"
 nfolds <- as.numeric(sample(2:10,1,replace=T))
 nrepeats <- as.numeric(sample(2:10,1,replace=T))
 ncores <- as.numeric(sample(1:detectCores(),1,replace=T))
 classprobs <- 1
 preproc<-"pca"
 ID_num <- length(INTERDEPTH)
 ID_min <- min(INTERDEPTH)
 ID_max <- max(INTERDEPTH)
 ntrees_min <- min(NUMTREES)
 ntree_max <- max(NUMTREES)
 shrinkage <- as.numeric(sample(c(0.1,0.2,0.3),1,replace=T))
 cl<-makeCluster(ncores)

 registerDoParallel(cl,cores=ncores)

 train_set <- full[1:nrows,]
 train_class <- factor(train_set$target_eval)
 levels(train_class) <- make.names(levels(train_class))
 train_descr <- train_set[,6:dims]

 set.seed(1)
 runTime<-system.time(
 gbmFit <- train(train_descr,
 train_class,
 method = model,
 preProc = preproc,
 trControl = trainControl(method = "repeatedcv",
 number= nfolds,
 repeats = nrepeats,
 classProbs = TRUE,
 summaryFunction = twoClassSummary),
 metric = "ROC",
 verbose=TRUE,
 tuneGrid=expand.grid(.interaction.depth = INTERDEPTH,
 .n.trees = NUMTREES,
 .shrinkage = shrinkage)
 )
 )
 stopCluster(cl)    #shut down the worker cluster started for this trial

 TRIAL<-as.numeric(i)
 NROWS<- as.numeric(nrows)
 DIMS<-as.numeric(dims)
 MODEL<-as.character(model)
 NFOLDS <-as.numeric(nfolds)
 NREPEATS <- as.numeric(nrepeats)
 NCORES <- as.numeric(ncores)
 CLASSPROBS <- as.character(classprobs)
 PREPROC<-as.character(preproc)
 ID_num <- as.numeric(length(INTERDEPTH))
 ID_min <- as.numeric(min(INTERDEPTH))
 ID_max <- as.numeric(max(INTERDEPTH))
 ntrees_min <- as.numeric(min(NUMTREES))
 ntrees_max <- as.numeric(max(NUMTREES))
 SHRINKAGE <- as.numeric(shrinkage)
 dims.fin <- as.numeric(nrow(varImp(gbmFit)[[1]][1]))
 ROC <- as.numeric(max(gbmFit$results$ROC))
 RUNTIME <- as.numeric(runTime[3])

 df<- cbind(TRIAL,
 NROWS,
 DIMS,
 MODEL,
 NFOLDS,
 NREPEATS,
 NCORES,
 CLASSPROBS,
 PREPROC,
 ID_num,
 ID_min,
 ID_max,
 ntrees_min,
 ntrees_max,
 SHRINKAGE,
 dims.fin,
 RUNTIME,
 ROC
 )

 df.regress<-rbind(df.regress,df)
 }, error=function(e){cat("ERROR :",conditionMessage(e), "\n")}) }

write.csv(df.regress,"Run_Time_Regression_100.csv",append=T)
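And the follow-up regression on the recorded trials, sketched out (because of the cbind above, the columns come back as character, hence the conversions):

runs <- read.csv("Run_Time_Regression_100.csv", stringsAsFactors = FALSE)
num.cols <- c("NROWS","DIMS","NFOLDS","NREPEATS","NCORES","SHRINKAGE","RUNTIME")
runs[num.cols] <- lapply(runs[num.cols], as.numeric)

#simple linear model to predict run time from the sampled parameters
run.lm <- lm(RUNTIME ~ NROWS + DIMS + NFOLDS + NREPEATS + NCORES + SHRINKAGE, data = runs)
summary(run.lm)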

Lessons learned/other improvements for future experiments:

  • Use CART to determine variable importance first, run bagging/boosting models for predictive accuracy
  • Automatically detect and assign variable names rather than having to hardcode (I actually wrote a macro in Excel to quickly dump column names into a properly formatted text-string, so it wasn’t pure hardcoding but it would be better to have R take care of the whole thing)
  • Add a “mail to” function to alert me when calculations are finished via email
  • Came across several benchmarking functions while researching but didn’t have time to dig in all the way. Look into existing benchmarking capabilities, especially in dplyr, etc.

dplyr

Q1 2014 was a little time-strapped for me. Some really interesting work, though, and, best of all, it identified new gaps in my skill set.

One major gap was basic ease and speed of manipulating larger datasets. Thankfully, the brand new “dplyr” package fills in many of those gaps, and I’ve had some time to become acquainted with it.

Side note, the “Pantheria” dataset listed below is a fantastic practice dataframe (55 columns and ~5K rows).


#dplyr requires R version >3.0.2; check:
#paste0(sessionInfo()$R.version$major,".",sessionInfo()$R.version$minor)

#Testing dplyr on pantheria data
if(!require(dplyr)){install.packages("dplyr", dependencies = c("Depends","Suggests")); library(dplyr)}

URL <- "http://esapubs.org/archive/ecol/e090/184/PanTHERIA_1-0_WR05_Aug2008.txt"

system.time(
pantheria <- read.table(file=URL,header=TRUE,sep="\t",na.strings=c("-999","-999.00"))
)

pantheria$X1.1_ActivityCycle <- factor(pantheria$X1.1_ActivityCycle)
levels(pantheria$X1.1_ActivityCycle) <- c("nocturnal","cathemeral","diurnal")

pantheria$X6.2_TrophicLevel <- factor(pantheria$X6.2_TrophicLevel)
levels(pantheria$X6.2_TrophicLevel) <- c("herbivore","omnivore","carnivore")

#basic manipulations. type ?manip for documentation

mutate(pantheria,yearlyOffspring = X16.1_LittersPerYear)
filter(pantheria, MSW05_Order == "Rodentia")
group_by(pantheria,X1.1_ActivityCycle,X6.2_TrophicLevel)
summarise(pantheria,meanBM = mean(X5.1_AdultBodyMass_g,na.rm=TRUE))

#chaining manipulations together using %.% (!!!)
Activity_Trophic <- pantheria %.%
  mutate(yearlyOffspring = X16.1_LittersPerYear * X16.1_LittersPerYear) %.%
  filter(MSW05_Order == "Rodentia") %.%
  group_by(X1.1_ActivityCycle, X6.2_TrophicLevel) %.%
  summarise(meanBM = mean(X5.1_AdultBodyMass_g, na.rm=TRUE),
            meanYO = mean(yearlyOffspring, na.rm=TRUE))

Other functions in dplyr I’m anxious to explore:

  • Connecting to SQLite, PostgreSQL and MySQL servers
  • Various joins/order functions
  • Benchmarking

documentation: http://cran.r-project.org/web/packages/dplyr/dplyr.pdf

 

Thanks to W. Andrew Barr for the example: http://www.ancienteco.com/2014/01/introduction-to-dplyr-data-manipulation.html

Final note: finally found a reliable way to post code blocks in WordPress. Took me long enough. 3 month goal: post >=1 post/week. Plenty of material, just need time.

Hacker Alphabet Soup

Recursion (noun): See: recursion

This might be part of a larger post I’ll write later on the topic of what I’ll label, for now, as hacker culture.

Anyone who has toured through corporate culture has likely heard some form of the incantation “it’s a bit of an alphabet soup,” referring to the fact that business folk, engineers, basically any community will come up with acronyms as they work. Being able to speak the language indicates that you are one of the tribe.

The culture of programmers can be daunting with its shorthand. Befitting hacker culture, they throw the traditional rules of acronym creation to the wind. One form of acronym rebellion particular to the programming realm is the recursive acronym, a nod to recursive programming, in which a function references/calls itself within its own body, kind of like using a word to define itself.
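(The classic toy example, in R to keep with the rest of these notes:)

#a function that calls itself: factorial via recursion
fact <- function(n){
  if(n <= 1) return(1)
  n * fact(n - 1)
}
fact(5)   #120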

Fish (noun): A collection of fish.

The most recent/funny recursive acronym I’ve come across is YAML, which stands for “YAML Ain’t Markup Language”.