R Clinic: xtabs()
09 May 2006
As promised, here's another entry in my ongoing quest to document all of the stupid stuff that seems like it ought to be really easy but turns out to be kind of tricky. Today, we're going to learn how to load SPSS data into R, and then how to do cross-tabulations on that data. In SPSS, this is insanely easy. It turns out not to be all that bad in R, but the whole thing depends on a particularly poorly-documented aspect of the program. We'll start with the easy part: the data import.Before we get any further, please note that Blogger has decided to screw up my formatting. There's way more whitespace than I'd wanted, and I can't seem to get rid of it. Grrr. Anyway, let's move forward.Consider an SPSS file called "cvd.sav". It contains stratified data representing aggregated cardiovascular disease figures for some population. The data set contains five variables:
Like I said earlier, to get this in SPSS, we'd simply use the "Crosstabs" function. In R, we need to use the "xtabs" function. This function is somewhat tricky in that it takes a "formula" as one of its arguments. It took me quite a while to figure out what to do with this, since the documentation isn't really as straightforward as it should be. Eventually, however, I figured it out. To get our table, we enter the following command:
- A dichotomous gender variable ("male"/"female")
- A categorical "age group" variable with five levels ("25-34","35-44", etc. up to "65-74")
- A dichotomous variable representing blood factor IX levels ("low"/"high")
- A dichotomous outcome variable ("CVD"/"No CVD")
- A "count" variable
> library(foreign)> cvd <- read.spss("~/cvd.sav", to.data.frame=TRUE)
Note that if we omit the "to.data.frame" argument, our variable will contain an object, which will make the next step a bit trickier. So, now we have a variable called "cvd". Let's see what it looks like:
> cvd GENDER AGEGRP IXSTATUS OUTCOME COUNT1 Male 25-34 Low CVD 62 Male 25-34 Low No CVD 753 Male 25-34 High CVD 44 Male 25-34 High No CVD 205 Male 35-44 Low CVD 12......................34 Female 55-64 Low No CVD 3735 Female 55-64 High CVD 3936 Female 55-64 High No CVD 5937 Female 65-74 Low CVD 2738 Female 65-74 Low No CVD 2439 Female 65-74 High CVD 5740 Female 65-74 High No CVD 62>
Now that we've got the data, let's do something useful with it. We'll start with a quick crosstabulation of, say, gender vs cvd. What we'd ideally like to end up with is a 2x2 table looking like this:CVD | No CVD | Total | |
Male | a | b | n1 |
Female | c | d | n2 |
m1 | m2 | sum |
> xtabs(count ~ gender + outcome, cvd) outcomegender CVD No CVD Female 260 524 Male 233 477 >
See what happened there? Our "formula" should look familiar if you've ever entered a linear model in SPSS using the raw syntax format, and I think SAS's format is pretty similar. Essentially, you just put the "count" column of your data frame on the left hand side of the "~", and then put whichever columns you want to crosstabulate by on the right-hand side, separated by "+" signs. The order that they appear definitely affects how the table comes out, so it's worth experimenting a bit to see what you get.What we're really doing here is creating a "Formula" object, and once you know what to look for in the docs, they're actually pretty helpful (try "help(formula)"). The formula object is also found in the linear and regression modeling equations, so they're handy to learn how to use.So, that gets us a simple 2x2 table. What if we want to stratify further, and compare the counts by age group? We can simply tack on one more variable to the end of our formula, like so:
> xtabs(count ~ gender + outcome + agegrp, cvd), , agegrp = 25-34 outcomegender CVD No CVD Female 29 101 Male 10 95 , , agegrp = 35-44 outcomegender CVD No CVD Female 32 121 Male 27 112 , , agegrp = 45-54 outcomegender CVD No CVD Female 57 120 Male 53 92 , , agegrp = 55-64 outcomegender CVD No CVD Female 58 96 Male 71 93 , , agegrp = 65-74 outcomegender CVD No CVD Female 84 86 Male 72 85 >
Note that if we change the order of our formula arguments, we can get something rather different:
> xtabs(count ~ agegrp + gender + outcome, cvd), , outcome = CVD genderagegrp Female Male 25-34 29 10 35-44 32 27 45-54 57 53 55-64 58 71 65-74 84 72 , , outcome = No CVD genderagegrp Female Male 25-34 101 95 35-44 121 112 45-54 120 92 55-64 96 93 65-74 86 85 >
Possibly useful for other reasons (some sort of Cochran test of trend, perhaps), but not what we were after here. I heartily suggest experimentation: if the xtabs function isn't giving you what you want, it might be a simple matter of re-arranging the terms in your function.Of course, as with everything else in R, we can save the output of xtabs() to a variable, which we can then use for hypothesis testing or other analysis. The return value from xtabs() is essentially a special type of matrix, so we can do all of the usual fun stuff one can usually do with matrices. Also, many functions (prop.test, chisq.test, etc.) will take a matrix as direct input.There's lots more you can do with xtabs in R, but this should be enough to get you started. I had to screw around with it for a couple of afternoons before I finally figured out what was going on; hopefully, someday this'll help keep somebody else— or myself, once I forget how to do all of this— from banging their heads against the wall for as long.