I do data mining and modeling quite often these days. However, the datasets I work with wouldn’t really be considered “Big Data” — they range from about 25,000 to 200,000 rows, with quite a lot of variables. I don’t know if I’ll ever be in a position to work with “Big Data”, but all the hype around it gets me thinking from time to time.
Question: If I’ve got millions upon millions of records to work with, do I really need to submit all of them to my data analysis software (R) for data mining and modeling?
Answer: Not at all. If all I’m doing is looking for trends and building models that predict some desired behaviour, all I’d have to do is draw a handful of random samples, each small enough to fit into my analysis software. Then I could do my data mining on any one of the samples, build a model or models, and test it/them on the other samples. Since each sample is drawn at random, it should be representative of the full dataset (rare events aside). Apparently random sampling is also possible in Hadoop, so whether the samples come from Hadoop or straight from the DBMS, I could just keep using the same kinds of techniques I’ve been using all along.
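For what it’s worth, here’s roughly the workflow I have in mind, sketched in R with simulated data standing in for the DBMS extract. The column names, coefficients, and sample sizes are made up for illustration; in practice the samples would come from something like `SELECT ... ORDER BY random() LIMIT 50000` (exact syntax varies by database), and everything after that point is ordinary R.

```r
## Sketch: sample a manageable subset, model on one sample, test on another.
## Simulated data stands in for the "millions of records" table.
set.seed(42)

## Stand-in for the full table in the DBMS
n_full <- 1e6
full <- data.frame(
  x1 = rnorm(n_full),
  x2 = rnorm(n_full)
)
full$y <- rbinom(n_full, 1, plogis(0.5 * full$x1 - 0.8 * full$x2))

## Draw a couple of random samples small enough to fit in memory
## (for illustration they may overlap slightly; in practice you'd
## draw disjoint samples from the database)
sample_rows <- function(df, n) df[sample(nrow(df), n), ]
train   <- sample_rows(full, 50000)
holdout <- sample_rows(full, 50000)

## Mine/model on one sample...
fit <- glm(y ~ x1 + x2, data = train, family = binomial)

## ...and test on another
pred <- predict(fit, newdata = holdout, type = "response")
mean((pred > 0.5) == holdout$y)  # out-of-sample accuracy
```

The point being: once the sample is small enough to fit in memory, nothing about the modeling step changes just because the source table happens to be huge.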
Am I missing something, or is “Big Data Analytics” more of a marketing term than an actual reality?