09 May 2006
As promised, here's another entry in my ongoing quest to document all of the stupid stuff that seems like it ought to be really easy but turns out to be kind of tricky. Today, we're going to learn how to load SPSS data into R, and then how to do cross-tabulations on that data. In SPSS, this is insanely easy. It turns out not to be all that bad in R, but the whole thing depends on a particularly poorly-documented aspect of the program. We'll start with the easy part: the data import.

Before we get any further, please note that Blogger has decided to screw up my formatting. There's way more whitespace than I'd wanted, and I can't seem to get rid of it. Grrr. Anyway, let's move forward.

Consider an SPSS file called "cvd.sav". It contains stratified data representing aggregated cardiovascular disease figures for some population. The data set contains five variables:


  1. A dichotomous gender variable ("male"/"female")

  2. A categorical "age group" variable with five levels ("25-34","35-44", etc. up to "65-74")

  3. A dichotomous variable representing blood factor IX levels ("low"/"high")

  4. A dichotomous outcome variable ("CVD"/"No CVD")

  5. A "count" variable

Note that the data are aggregated. If we want to do anything with them in SPSS, we'll have to use the "Weight Cases" function, and weight by "count". We can read this file into R quite easily, using the "read.spss" function:


> library(foreign)> cvd <- read.spss("~/cvd.sav", to.data.frame=TRUE)

Note that if we omit the "to.data.frame" argument, our variable will contain an object, which will make the next step a bit trickier. So, now we have a variable called "cvd". Let's see what it looks like:


> cvd   GENDER AGEGRP IXSTATUS OUTCOME COUNT1  Male    25-34     Low   CVD        62  Male    25-34     Low   No CVD    753  Male    25-34     High  CVD        44  Male    25-34     High  No CVD    205  Male    35-44     Low   CVD       12......................34 Female  55-64     Low   No CVD    3735 Female  55-64     High  CVD       3936 Female  55-64     High  No CVD    5937 Female  65-74     Low   CVD       2738 Female  65-74     Low   No CVD    2439 Female  65-74     High  CVD       5740 Female  65-74     High  No CVD    62>

Now that we've got the data, let's do something useful with it. We'll start with a quick crosstabulation of, say, gender vs cvd. What we'd ideally like to end up with is a 2x2 table looking like this:


  CVD No CVD Total
Male a b n1
Female c d n2
  m1 m2 sum

Like I said earlier, to get this in SPSS, we'd simply use the "Crosstabs" function. In R, we need to use the "xtabs" function. This function is somewhat tricky in that it takes a "formula" as one of its arguments. It took me quite a while to figure out what to do with this, since the documentation isn't really as straightforward as it should be. Eventually, however, I figured it out. To get our table, we enter the following command:


> xtabs(count ~ gender + outcome, cvd)        outcomegender   CVD No CVD  Female 260 524     Male   233 477   >

See what happened there? Our "formula" should look familiar if you've ever entered a linear model in SPSS using the raw syntax format, and I think SAS's format is pretty similar. Essentially, you just put the "count" column of your data frame on the left hand side of the "~", and then put whichever columns you want to crosstabulate by on the right-hand side, separated by "+" signs. The order that they appear definitely affects how the table comes out, so it's worth experimenting a bit to see what you get.

What we're really doing here is creating a "Formula" object, and once you know what to look for in the docs, they're actually pretty helpful (try "help(formula)"). The formula object is also found in the linear and regression modeling equations, so they're handy to learn how to use.

So, that gets us a simple 2x2 table. What if we want to stratify further, and compare the counts by age group? We can simply tack on one more variable to the end of our formula, like so:


> xtabs(count ~ gender + outcome + agegrp, cvd), , agegrp = 25-34        outcomegender   CVD No CVD  Female  29 101     Male    10  95   , , agegrp = 35-44        outcomegender   CVD No CVD  Female  32 121     Male    27 112   , , agegrp = 45-54        outcomegender   CVD No CVD  Female  57 120     Male    53  92   , , agegrp = 55-64        outcomegender   CVD No CVD  Female  58  96     Male    71  93   , , agegrp = 65-74        outcomegender   CVD No CVD  Female  84  86     Male    72  85   >

Note that if we change the order of our formula arguments, we can get something rather different:


> xtabs(count ~ agegrp + gender + outcome, cvd), , outcome = CVD       genderagegrp  Female Male  25-34  29     10   35-44  32     27   45-54  57     53   55-64  58     71   65-74  84     72 , , outcome = No CVD       genderagegrp  Female Male  25-34 101     95   35-44 121    112   45-54 120     92   55-64  96     93   65-74  86     85 >

Possibly useful for other reasons (some sort of Cochran test of trend, perhaps), but not what we were after here. I heartily suggest experimentation: if the xtabs function isn't giving you what you want, it might be a simple matter of re-arranging the terms in your function.

Of course, as with everything else in R, we can save the output of xtabs() to a variable, which we can then use for hypothesis testing or other analysis. The return value from xtabs() is essentially a special type of matrix, so we can do all of the usual fun stuff one can usually do with matrices. Also, many functions (prop.test, chisq.test, etc.) will take a matrix as direct input.

There's lots more you can do with xtabs in R, but this should be enough to get you started. I had to screw around with it for a couple of afternoons before I finally figured out what was going on; hopefully, someday this'll help keep somebody else— or myself, once I forget how to do all of this— from banging their heads against the wall for as long.