While working on database linkage using approximate, or “fuzzy,” matching, I needed to link customer names in one database to possible matches in another. Fuzzy matching one name against a small list of candidates is well documented and quite simple in R with agrep() and adist(). The challenge compounds, though, as the list of potential matches grows. I needed to match 12,000 names against a list of roughly 115,000 names: over a billion possible pairs (12,000 × 115,000 ≈ 1.38 billion). Computation was an issue, especially under tight time constraints.
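For reference, here is a minimal sketch of the simple small-list case using only base R. The names and the 0.2 tolerance are made up for illustration and are not from my actual data; treat it as a rough picture of what agrep() and adist() do rather than the exact code I ran.

```r
# Hypothetical example data (not from the real customer databases).
target     <- "Jon Smith"
candidates <- c("John Smith", "Jane Smyth", "Acme Corp", "Jon Smith Jr")

# agrep() returns the candidates that approximately contain the pattern.
# max.distance = 0.2 is an arbitrary illustrative tolerance.
close_enough <- agrep(target, candidates, max.distance = 0.2, value = TRUE)

# adist() returns the generalized Levenshtein edit distance from the
# target to every candidate, as a 1 x length(candidates) matrix.
d <- adist(target, candidates)

# Pick the candidate with the smallest edit distance.
best <- candidates[which.min(d)]
print(best)
```

This is cheap for one name and four candidates, but adist() on two full vectors builds the entire distance matrix in memory, which is exactly what stops scaling once both lists get large.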
The RecordLinkage package, by Murat Sariyar and Andreas Borg, attempts to solve this problem in R by implementing the matching with the ff data classes (alongside many other useful utilities). For reasons I don’t know, RecordLinkage was abandoned as a project and archived. The package still works, though, and the work behind it is fascinating: http://journal.r-project.org/archive/2010-2/RJournal_2010-2_Sariyar+Borg.pdf
To install RecordLinkage from the CRAN archive, follow the instructions here:
http://stackoverflow.com/questions/24194409/how-do-i-install-a-package-that-has-been-archived-from-cran
On Windows, this requires installing Rtools first, then running this code:
url <- "http://cran.r-project.org/src/contrib/Archive/RecordLinkage/RecordLinkage_0.4-1.tar.gz"
pkgFile <- "RecordLinkage_0.4-1.tar.gz"
download.file(url = url, destfile = pkgFile)

# Install dependencies
install.packages(c("ada", "ipred", "evd"))

# Install package
install.packages(pkgs = pkgFile, type = "source", repos = NULL)

# Delete package tarball
unlink(pkgFile)