Before choosing to support the purchase of Statistica at my workplace, I came across the ff package as an option for working with really big datasets (with special attention paid to ff data frames, or ffdf). It looked like a good option, allowing data frames with multiple data types and far more rows than I could fit if I loaded the dataset into RAM as is normal with R. The one big problem I had is that every time I tried to use the ffsave function to save my work from one R session to the next, it told me that it could not find an external zip utility on my Windows machine. With so much else going on at the time, I didn’t have the patience to research a solution to this problem.
This weekend I finally found some time to revisit this problem, and managed to find a solution! From what I can tell, R expects, for functions like ffsave, that you have command-line utilities such as a zip utility at the ready and recognizable by R. Although I haven’t tested the ff package on either of my Linux laptops at home, I suspect that R recognizes the utilities that come pre-installed on them. On Windows, however, the solution seems to be to install a supplementary group of command-line programs called Rtools. When you visit the page, be sure to download the version of Rtools that corresponds with your R version.
When you go through the installation process, you will see a screen like below. Be sure that you check the same boxes as in the screenshot below so that R knows where the zip utility lives.
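Once Rtools is installed and those boxes are checked, you can quickly confirm from within R that the zip utility is visible on your PATH. `Sys.which` is base R, so no extra packages are needed:

```r
# Returns the full path to zip.exe if R can find it on the PATH,
# or an empty string "" if it cannot
Sys.which("zip")
```

If this prints an empty string, R still can’t see the utility, and ffsave will keep failing.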
Once you have it installed, that’s when the fun finally begins. Like in the smaller data case, I like reading in CSV files. So, ff provides read.csv.ffdf for importing external data into R. Let’s say that you have a data file named bigdata.csv, here would be a command for loading it up:
bigdata <- read.csv.ffdf(file="c:/fileloc/bigdata.csv", first.rows=5000, colClasses=NA)
The first part of the command, directing R to your file, should look straightforward. The first.rows argument tells it how big the first chunk of data it reads in should be (ff reads your data in chunks to save RAM — correct me if I’m wrong). Finally, and importantly, the colClasses=NA argument tells R not to assume the data types of each of your columns from the first chunk alone.
Now that you’ve loaded your big dataset, you can manipulate it at will. If you look at the ff and ffbase documentation, a lot of the standard R functions for working with and summarizing data have been optimized for use with ff dataframes and vectors. The upshot of this is that working with data stored in ffdf format seems to be a pretty similar experience compared to working with normal data frames. Importantly, when you want to subset your data frame to create a test sample, the ffbase package replaces the subset command so that the resultant subset is also an ffdf, and doesn’t take up more of your RAM.
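As a sketch of that last point — assuming bigdata is the ffdf loaded above, and using a made-up numeric column called age purely for illustration — subsetting with ffbase looks just like base R:

```r
library(ffbase)

# subset() on an ffdf returns another ffdf backed by files on disk,
# so the result does not eat additional RAM.
# (the column name "age" is hypothetical)
adults <- subset(bigdata, age >= 18)

class(adults)  # should report an ffdf, not an in-memory data.frame
```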
I noticed that you can use the glm() and lm() functions on an ffdf, but I think you have to be careful because they are not optimized for use with ffdfs and therefore will take up the usual amount of memory if you save them to your workspace. So if you build models using these functions, be sure to select a sample from your ffdf that isn’t overly big!
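One way to keep that in check — a sketch assuming bigdata is the ffdf from above, with hypothetical columns y and x — is to pull a modest random sample of rows into an ordinary data frame before modeling:

```r
# Row-indexing an ffdf with [i, ] pulls those rows into an ordinary
# in-memory data.frame, so keep the sample size modest
set.seed(42)
idx <- sample(nrow(bigdata), 10000)
samp <- bigdata[idx, ]

# Fit on the sample; the column names y and x are made up for this example
fit <- lm(y ~ x, data = samp)
summary(fit)
```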
Next comes the step of saving your work. The syntax is simple enough:
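Assuming your ffdf is still named bigdata, the call looks like this:

```r
# Saves bigdata.ffData and bigdata.RData to c:/fileloc
# (this is the step that needs the zip utility from Rtools)
ffsave(bigdata, file = "c:/fileloc/bigdata")
```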
This saves a .ffData file and a .RData file to the directory of your choice with “bigdata” as the filenames.
Then, when you want to load up your data in a later R session, you use the simple ffload command:
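Assuming the same path you passed to ffsave above:

```r
# Restores the objects saved with ffsave into the current session
ffload(file = "c:/fileloc/bigdata")
```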
It gives you some warning messages, but as far as I can tell they do not get in the way of accessing your data. That covers the basics of working with big data using the ff package. Have fun analyzing your data using less RAM! :)