I only recently learned how to use the sqldf package to perform SELECT queries on the data frames I have loaded into the R workspace. There’s a lot I already know how to do when selecting, or subsetting data in R, but I don’t know how to do the equivalent of joining two datasets by some ID variable.
Luckily, that kind of operation is easy when you put it into SQL syntax. It basically looks something like this:
newdata = sqldf(‘SELECT dataframe1.*, dataframe2.somevariable FROM dataframe1 LEFT JOIN dataframe2 on dataframe1.ID = dataframe2.ID’)
Things get a bit more complicated when you have dots in your dataframe and variable names. Below is what the call to sqldf has to look like if you are in a situation like that:
newdata = sqldf(‘SELECT “data.frame1”.*, “data.frame2”.some_variable FROM “data.frame1” LEFT JOIN “data.frame2” on “data.frame1”.ID = “data.frame2”.ID’)
You’ll notice that I had to put double quotes around the dataframe references, so that the SQL code didn’t get confused by the presence of dots. The underscore in the variable reference “some_variable” actually translates into “some.variable” when it looks for the referent in the R data frame. It’s a little messy, but totally worth it when I consider I don’t even know how to do this kind of operation in R otherwise!