Finding Patterns Amongst Binary Variables with the homals Package

It’s survey analysis season for me at work!  When analyzing survey data, the one kind of analysis I’ve realized I’m not used to doing is finding patterns in binary data.  In other words, if I have a question to which multiple, non-mutually exclusive (checkbox) answers apply, how do I find the patterns in people’s responses to that question?

I tried applying PCA and Factor Analysis in turn, but neither seems well suited to the analysis of data consisting of only binary columns (1s and 0s). In searching for something that works, I came across the homals package.  While its main function is described as a “homogeneity analysis”, the one ability that interests me is called “non-linear PCA”.  This is supposed to be able to reduce the dimensionality of your dataset even when the variables are all binary.

Well, here’s an example using some real survey data (with masked variable names).  First we start off with the purpose of the data and some simple summary stats:

It’s a group of 6 variables (answer choices) showing people’s checkbox responses to a question asking them why they donated to a particular charity.  Following are the numbers of responses to each answer choice:

mapply(whydonate, FUN=sum, 1)
 V1  V2  V3  V4  V5  V6 
201  79 183 117 288 199
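As an aside, here is a minimal sketch of the same tally on made-up data, assuming the responses sit in a data frame of 0/1 columns. The `toy` data frame and its values are hypothetical, not my survey data. `colSums()` gives the counts directly. One thing worth knowing: in a `mapply()` call shaped like the one above, the trailing 1 is handed to `sum()` as an extra value, so each result comes out one higher than the plain column total.

```r
# Toy stand-in for the survey data: three hypothetical checkbox columns (1 = ticked).
toy <- data.frame(
  V1 = c(1, 0, 1, 1),
  V2 = c(0, 0, 1, 0),
  V3 = c(1, 1, 1, 0)
)

# Number of respondents who ticked each answer choice.
counts <- colSums(toy)

# Share of respondents who ticked each choice; for 0/1 data this is the column mean.
shares <- colMeans(toy)

# mapply() recycles the trailing 1 and passes it to sum() alongside each column,
# so this returns each column total plus one.
inflated <- mapply(toy, FUN = sum, 1)

print(counts)
print(shares)
print(inflated)
```

Looking at the shares next to the raw counts makes it easier to judge whether a choice like V2 is too rarely ticked to say much about.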

With the possible exception of answer choice V2, there are some pretty healthy numbers in each answer choice.  Next, let’s load up the homals package and run our non-linear PCA on the data.

library(homals)
fit = homals(whydonate)

fit
Call: homals(data = whydonate)

Loss: 0.0003248596 

Eigenvalues:
    D1     D2 
0.0267 0.0156 

Variable Loadings:
           D1          D2
V1 0.28440348 -0.10010355
V2 0.07512143 -0.10188037
V3 0.09897585  0.32713745
V4 0.20464762  0.21866432
V5 0.26782837 -0.09600215
V6 0.33198532 -0.04843107

As you can see, it extracts 2 dimensions by default (it can be changed using the “ndim” argument in the function), and it gives you what looks very much like a regular PCA loadings table.

Reading it naively, the pattern I see in the first dimension goes something like this: people tended to answer affirmatively to answer choices 1, 4, 5, and 6 as a group (obviously not every time, and not always all four together!), while choices 2 and 3 load only weakly on this dimension (both under 0.1), so they don’t seem to belong to that group.
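To make this kind of reading a little less eyeball-driven, here is a rough sketch that flags the answer choices whose absolute loading clears a cutoff in each dimension. The loadings are copied (rounded) from the homals output above; the 0.2 cutoff is my own arbitrary assumption for illustration, not something homals prescribes.

```r
# Loadings copied from the homals output above, rounded to 4 decimals.
loadings <- matrix(
  c(0.2844, -0.1001,
    0.0751, -0.1019,
    0.0990,  0.3271,
    0.2046,  0.2187,
    0.2678, -0.0960,
    0.3320, -0.0484),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("V", 1:6), c("D1", "D2"))
)

cutoff <- 0.2  # arbitrary illustrative threshold, not a homals default

# For each dimension, list the answer choices whose absolute loading exceeds the cutoff.
strong <- sapply(
  colnames(loadings),
  function(d) rownames(loadings)[abs(loadings[, d]) > cutoff],
  simplify = FALSE
)

print(strong)
```

With this cutoff, D1 picks out V1, V4, V5, and V6, and D2 picks out V3 and V4. A different cutoff would of course draw the groups differently, so treat it as a reading aid rather than a test.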

In the second dimension I see: people tended to answer affirmatively to answer choices 3 and 4 as a group.

Okay, now as a simple check, let’s look at the correlation matrix for these binary variables:

cor(whydonate)

           V1            V2            V3         V4          V5         V6
V1 1.00000000  0.0943477325  0.0205241732 0.16409945 0.254854574 0.45612458
V2 0.09434773  1.0000000000 -0.0008474402 0.01941461 0.038161091 0.08661938
V3 0.02052417 -0.0008474402  1.0000000000 0.21479291 0.007465142 0.11416164
V4 0.16409945  0.0194146144  0.2147929137 1.00000000 0.158325383 0.22777471
V5 0.25485457  0.0381610906  0.0074651417 0.15832538 1.000000000 0.41749064
V6 0.45612458  0.0866193754  0.1141616374 0.22777471 0.417490642 1.00000000
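A quick note on what these numbers mean for 0/1 columns: the Pearson correlation of two binary variables is the phi coefficient, which can be computed straight from the 2x2 table of joint responses. Here is a small sketch on two made-up columns (`x` and `y` are hypothetical, not survey variables) showing that the two routes agree.

```r
# Two made-up checkbox columns (1 = ticked); not the survey data.
x <- c(1, 1, 1, 0, 0, 1, 0, 0)
y <- c(1, 1, 0, 0, 0, 1, 1, 0)

# 2x2 table of joint responses: rows are x, columns are y.
tab <- table(x, y)
n11 <- as.numeric(tab["1", "1"])  # both ticked
n00 <- as.numeric(tab["0", "0"])  # neither ticked
n10 <- as.numeric(tab["1", "0"])  # only x ticked
n01 <- as.numeric(tab["0", "1"])  # only y ticked

# Phi coefficient from the cell counts and the row/column totals.
phi <- (n11 * n00 - n10 * n01) /
  sqrt(prod(as.numeric(rowSums(tab))) * prod(as.numeric(colSums(tab))))

# Pearson correlation on the raw 0/1 vectors gives the same number.
pearson <- cor(x, y)

print(c(phi = phi, pearson = pearson))
```

So each cell in the matrix above is really a summary of how often two boxes were ticked together versus apart.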

The first dimension is easy to spot in the “V1” column above, where V1 correlates most strongly with V6, V5, and V4. We can also see the second dimension in the “V3” column, where V4 is by far the strongest correlate. Both check out! I find that neat and easy. Does anyone use anything else to find patterns in binary data like this? Feel free to tell me in the comments!
