As I mentioned in a previous post, I compiled the data set that I’m currently working on in PostgreSQL. To get this massive data set, I had to write a query that was massive by dint of the number of LEFT JOINs that I had to write. Today I caught myself wondering If I had remembered to add in DISTINCT to the SELECT clause in my query, as that would prevent any obvious duplicates. I looked and, woops, I didn’t specify it in the SELECT clause. It turns out that I didn’t have too many duplicates (only 135 out of about 395000 rows), but I was still bothered.
I took a look at the duplicates and found out that there were a small bunch of duplicates in the main name and contact table where the ID number of the records was the same, but one record had actual address and contact info, while another one would be pretty bare (in other words, missing information in various contact fields). Although there was only one duplicate for each affected record, the many LEFT JOINs that I did multiplied the number of duplicates to get to about 135.
I really didn’t want to export the data again from the database and redo all the calculations I had made, so I had to find some way of cleaning up all these duplicates from within R. Thankfully, I had turned the many contact fields in the data into dummy variables in view of the data mining that I had to do later on (I love dummy variables!). This then made it possible to distinguish duplicates based on whether they had more or less fields filled out, enabling me to remove them, as shown in the script below:
Well, it worked! Now I have no more problem with duplication. Now, hopefully I'll remember to use DISTINCT in my SQL queries in the future :)