Category Archives: Big O

Installing RecordLinkage from the archive

Working toward database linkage using approximate, or “fuzzy,” matching, I needed to link customer names in one database to possible matches in another. Fuzzy matching one name against a small list of candidate names is well documented and actually quite simple in R with agrep() and adist(). The challenge compounds, though, as the list of potential matches grows: I needed to match 12K names against a list of ~115,000 candidates, which is over a billion possible pairs. Computation was an issue, especially under tight time constraints.
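As a quick sketch of the easy, small-list case: agrep() does approximate matching against an edit-distance threshold, and adist() computes generalized Levenshtein distances. The names below are invented for illustration.

candidates <- c("Jon Smith", "John Smyth", "Jane Smith", "Jon Smithe")

# agrep(): approximate matching; max.distance = 0.2 allows edits up to
# 20% of the pattern length, and value = TRUE returns the matches themselves
agrep("John Smith", candidates, max.distance = 0.2, value = TRUE)

# adist(): generalized Levenshtein edit distance; take the nearest candidate
d <- adist("John Smith", candidates)
candidates[which.min(d)]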

The package RecordLinkage, by Murat Sariyar and Andreas Borg, attempts to solve this scale problem in R by implementing the matching using the ff data classes (among many other useful utilities). For reasons I don't know, RecordLinkage was abandoned as a project and archived. The package still works, and the work is fascinating: http://journal.r-project.org/archive/2010-2/RJournal_2010-2_Sariyar+Borg.pdf

To install RecordLinkage from the CRAN archive, follow the instructions here:
http://stackoverflow.com/questions/24194409/how-do-i-install-a-package-that-has-been-archived-from-cran

On Windows, this requires first installing Rtools, then running the following code:

# Download the source tarball from the CRAN archive
url <- "http://cran.r-project.org/src/contrib/Archive/RecordLinkage/RecordLinkage_0.4-1.tar.gz"
pkgFile <- "RecordLinkage_0.4-1.tar.gz"
download.file(url = url, destfile = pkgFile)

# Install dependencies
install.packages(c("ada", "ipred", "evd"))

# Install package
install.packages(pkgs=pkgFile, type="source", repos=NULL)

# Delete package tarball
unlink(pkgFile)
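Once installed, a minimal linkage run looks roughly like the sketch below. This is from memory rather than a worked example, so treat it as a sketch: customers.a and customers.b stand in for two data frames with identical column layouts, and the classification threshold is arbitrary. See the Sariyar and Borg paper linked above for the real workflow.

library(RecordLinkage)

# compare.linkage() builds comparison patterns for record pairs across
# two data frames; strcmp = TRUE turns on fuzzy string comparison
pairs <- compare.linkage(customers.a, customers.b, strcmp = TRUE)

# weight the pairs (EpiLink-style) and classify them into links/non-links
pairs <- epiWeights(pairs)
result <- epiClassify(pairs, threshold.upper = 0.7)
summary(result)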

A Preference for Elegance over Brute Force, AKA O(log(n)) > O(n)

Question: Given m sorted arrays each of size n, what is the fastest algorithm to find the kth smallest element in the union of the m arrays?

A friend sent me this optimization question on Quora. He said he'd solved it in Python, on a plane ride, in O(log(k)) time (note: the algorithm runs in O(log(k)) time, [presumably] not my friend). I thought: sure, challenge accepted.

I didn’t have the luxury of a full plane ride to figure it out, but I came up with a solution in R, using linear search. I know, I know, linear search is O(n) time. No excuse, I lost the challenge. But again, no plane ride! C’mon!

I wonder, though. Sorted input is a precondition for binary search, and sorting itself takes O(n log(n)) time for comparison sorts, or O(n + k) up to O(nk) for counting- and radix-style sorts. So sorting and then binary searching could cost O(n log(n)) + O(log(n)) in total, as opposed to a linear/brute-force search, which tends to O(n) time. (Here the arrays arrive pre-sorted, of course, so that cost is already paid.) Not sure… I should come back to Big O notation later.
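For the record, here is a sketch of the more elegant route I didn't take: binary search over the range of values rather than over the arrays themselves, counting elements with base R's findInterval(). It assumes integer inputs, and kthSmallest() is my own name for it, not anything from the original question.

# Sketch: kth smallest element in the union of m sorted integer vectors.
# findInterval(v, a) returns how many elements of sorted a are <= v,
# so each probe of the value range costs m binary searches.
kthSmallest <- function(array.list, k) {
  lo <- min(sapply(array.list, min))
  hi <- max(sapply(array.list, max))
  while (lo < hi) {
    mid <- lo + (hi - lo) %/% 2
    count <- sum(sapply(array.list, function(a) findInterval(mid, a)))
    if (count >= k) hi <- mid else lo <- mid + 1  # enough elements <= mid?
  }
  lo
}

kthSmallest(list(c(1L, 3L, 5L), c(2L, 4L, 6L)), 3)  # returns 3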

As promised, please find the solution below; it also lives on my recently not-so-anemic GitHub page.

#searchArrays takes a list of sorted vectors as an argument
#and finds the kth smallest element common to all of the arrays
#note that the input vectors must be sorted ascending

searchArrays <- function(array.list, k = 1) {
  m <- length(array.list)
  matches <- 0
  e <- 1                                # the "search" array: the first one
  array.list.search <- array.list[-e]  # separate e from the arrays to search
  for (p in 1:length(array.list[[e]])) {
    x.search <- array.list[[e]][p]     # candidate element drawn from array e
    element.vector <- x.search
    for (l in 1:length(array.list.search)) {
      for (n in 1:length(array.list.search[[l]])) {
        element.vector <- append(element.vector, array.list.search[[l]][n])
        # uniform so far: the two ends of the range are equal
        if (range(element.vector)[1] == range(element.vector)[2]) break
        element.vector <- element.vector[-length(element.vector)]  # mismatch: discard
      }
      if (length(element.vector) != l + 1) break  # candidate absent from array l
    }
    if (length(element.vector) == m) {  # candidate appears in every array
      matches <- matches + 1
      if (matches == k) {
        print(element.vector)
        return(element.vector)
      }
    }
  }
  print("Error: no common element across arrays")
}

The input to the function searchArrays() is a list of sorted vectors. The goal is to find the kth smallest element common to all of the arrays; by default (k = 1) that is the smallest.

Using loops, I define the “search” array e (the first array), then loop through each element of e, which I label x.search (maybe I should have called it e.i). I then separate e from the arrays to be searched, labeling those array.list.search.

At the heart of the nested loops is element.vector, which builds up one element at a time, checking for uniformity at each step. To illustrate, if you print it in real time it looks something like:

a
a b
a
a a
a a a
a a a c
a
a a
a a a
a a a a

To test the above:


# generate five sorted vectors of 100 random letters, then search them
test.gen <- function(n) sort(sample(letters, n, replace = TRUE))
test <- list()
for (i in 1:5) { test <- append(test, list(test.gen(100))) }
searchArrays(test)


Final thought: solving the above required checking whether element.vector was uniform at each step. It was an unexpected challenge that I enjoyed solving. The aha moment came when I realized that if the vector is uniform, its min and max must be the same. Thus you can compare the two ends of its range:

range(element.vector)[1] == range(element.vector)[2]
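For instance, with character vectors like the letters used in the test above:

range(c("a", "a", "a"))[1] == range(c("a", "a", "a"))[2]  # TRUE: uniform
range(c("a", "a", "b"))[1] == range(c("a", "a", "b"))[2]  # FALSE: mixed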

This blog post brought to you by Charles Mingus radio, via Spotify.