At work I’ve started working with the biggest data set I’ve ever seen! First, let me qualify my use of the term “Big Data”. The number of rows in the resultant data set (after much transformation and manipulation in PostgreSQL and, to a lesser extent, in Excel) is only 395,928. That certainly pales in comparison to truly BIG data sets, but for me it’s huge! For now, the data I’m working with has 307 variables (but a big chunk of null values). Do the math: if there were no null values in this data set, the number of data points would be 395,928 * 307 = 121,549,896!! Even if 60% of this table consisted of null values, that would still leave 48,619,958 data points. Wow!
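(If you want to check my arithmetic, the figures above are the only real numbers here; the snippet below just redoes the multiplication.)

```python
# Sanity check on the data-point arithmetic from the paragraph above.
rows = 395_928
cols = 307

total_cells = rows * cols
print(total_cells)  # 121549896 potential data points

# If 60% of the table were nulls, 40% of the cells remain:
non_null = int(total_cells * 0.4)
print(non_null)  # 48619958 non-null data points
```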

Working with so much data (it came from a database dump), I’ve noticed a few things:

1) PostgreSQL is awesome at handling it. I really haven’t had to wait long for my queries (CREATE TABLE queries, mainly) to execute. The longest wait was in fact for the final query I ran today, which joined all of the transformed tables I’d created to a biographical information table. I don’t think even that took much longer than a minute.

In fact, what I found really helpful in PostgreSQL was recoding some of the qualitative variables using CASE … WHEN … THEN clauses. Those are super handy and execute with seemingly little difficulty.
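The kind of recode I mean looks like the sketch below. None of the table, column, or category names are from my actual data — it’s a toy, run here through SQLite from Python so it’s self-contained, but the CASE … WHEN … THEN pattern is the same in PostgreSQL:

```python
import sqlite3

# Toy example of recoding a qualitative variable with CASE ... WHEN ... THEN.
# Table, column, and category names are all hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, degree TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(1, "BA"), (2, "MSc"), (3, "PhD"), (4, None)],
)

# Collapse the raw values into broader categories in a single pass.
rows = conn.execute("""
    SELECT id,
           CASE
               WHEN degree IN ('BA', 'BSc') THEN 'undergrad'
               WHEN degree IN ('MSc', 'MA') THEN 'masters'
               WHEN degree = 'PhD'          THEN 'doctorate'
               ELSE 'unknown'
           END AS degree_level
    FROM people
""").fetchall()

print(rows)  # [(1, 'undergrad'), (2, 'masters'), (3, 'doctorate'), (4, 'unknown')]
```

Note that the NULL row falls through to the ELSE branch, which is handy when a chunk of your table is null values to begin with.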

2) Excel 2010 had LOTS of difficulty letting me manipulate the data set once I loaded it up. Granted, the file was a 176 megabyte CSV… there’s no shame in whining when you have the weight of the world on your shoulders!

One problem I had in Excel was that, at one point, it stopped redrawing the window whenever I switched to a different ribbon. For example, I would click on the “File” ribbon after being on the “Home” ribbon and still see my data set rather than the “File” options. Shrinking the Excel window and dragging it out of view to the bottom of the screen and back seemed to help at times, and waving my mouse cursor over where the relevant buttons should be helped at other times.

Another problem came when I went to recode some variables in Excel into dummy variables: it took SO LONG to finish the job! Granted, I used the standard “Find / Replace” window, which might not be the best option in Excel. I’ve learned my lesson, though: recode in PostgreSQL, or do what I describe below, in R.

3) Although I haven’t done too much with the data set in R yet, it has handled it with pizzazz! R loaded the CSV file in just under 2 minutes (I was honestly expecting longer). Once I had it in, I realized that some variables used NA rather than zero to indicate the absence of something, and I would need to include those zeros somehow. I didn’t actually want to modify the data set itself to include the zeros, as that would add to how much RAM R was using (it was already using 1.5 GB!). So I made a simple function that puts zeros into a vector just for the purpose of graphing or statistical testing.

In other words, if I’m doing a boxplot and need my X variable to contain zeros where there are currently only NA values, I run the boxplot command, passing the X variable through the function so that the zeros get added temporarily rather than permanently (e.g. boxplot(Y ~ zerofy(X))). This should hopefully prevent me from using too much RAM.
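My actual function is in R, but the idea translates directly. Here’s a Python analogue — the name zerofy is the only thing taken from above; the body is my sketch of what such a function would do (replace missing values in a copy, leave the original alone):

```python
def zerofy(values):
    """Return a copy with missing values (None) replaced by 0.

    The original is untouched, so the zero-filled version only
    exists while you're plotting or testing -- nothing is stored
    permanently in the data set.
    """
    return [0 if v is None else v for v in values]

x = [3, None, 7, None, 1]
print(zerofy(x))  # [3, 0, 7, 0, 1]
print(x)          # [3, None, 7, None, 1] -- unchanged
```

In R the same trick would swap NA for 0 inside the function call, which is why it composes nicely with something like boxplot(Y ~ zerofy(X)).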

I haven’t yet run any typical statistical tests on the data set, so I can’t attest to how quickly they run, but that will come very soon. So far, I’m quite impressed with the open source tools I use and how they perform on this data!
