The survey package is one of R’s best tools for those
working in the social sciences. For many, it saves you from needing to
use commercial software for research that uses survey data. However, it
lacks one function that academic researchers often need to report
in publications: correlations. The
svycor function in
jtools helps to fill that gap.
A note, however, is necessary. The initial motivation to add this
feature comes from a response by Thomas Lumley, the
survey package’s author, to a question about calculating
correlations with the survey package. All that is good about this function
should be attributed to Dr. Lumley; all that is wrong with it should be
attributed to me (Jacob).
With that said, let’s look at an example. First, we need to get a
survey.design object. This one is built into the
survey package.

library(survey)
data(api)
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat, fpc = ~fpc)
The necessary arguments are no different than when using
svyvar. Specify, using an equation, which variables (and
from which design) to include. It doesn’t matter which side of the
equation the variables are on.
svycor(~api00 + api99, design = dstrat)
You can specify with the
digits = argument how many
digits past the decimal point should be printed.
svycor(~api00 + api99, design = dstrat, digits = 4)
Any other arguments that you would normally pass to
svyvar will be used as well, though in some cases they may
not affect the output.
One thing that
survey won’t do for you is give you
p values for the null hypothesis that \(r = 0\). While at first blush finding the
p value might seem like a simple procedure, complex surveys
will almost always violate the important distributional assumptions that
go along with simple hypothesis tests of the correlation coefficient.
There is not a clear consensus on the appropriate way to conduct
hypothesis tests in this context, due in part to the fact that most
analyses of complex surveys occur in the context of multiple regression
rather than simple bivariate cases.
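For contrast, here is the naive approach one might be tempted to use: base R’s cor.test, which treats the data as a simple random sample and ignores the strata and weights entirely. This is a sketch shown only as a point of comparison, not a recommendation; its p value rests on assumptions the design violates.

```r
library(survey)
data(api)

# apistrat is a stratified sample, but cor.test doesn't know that:
# it ignores the strata and the pw weights entirely
naive <- cor.test(apistrat$api00, apistrat$api99)
naive$estimate  # unweighted correlation
naive$p.value   # p value from the simple-random-sample assumption
```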
If sig.stats = TRUE, then
svycor will use the
wtd.cor function from the weights package
to conduct hypothesis tests. The p values are derived from a
bootstrap procedure in which the weights define sampling probability.
The bootn = argument is given to
define the number of simulations to run. This can significantly increase
the running time for large samples and/or large numbers of simulations.
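To make the idea concrete, a weight-defined bootstrap can be sketched in a few lines of base R. This is only an illustration of the general approach, not the exact routine wtd.cor implements, and the object names here are mine:

```r
library(survey)
data(api)

set.seed(1)
bootn <- 2000
n <- nrow(apistrat)
# Resampling probabilities proportional to the survey weights
probs <- apistrat$pw / sum(apistrat$pw)

# Each replicate draws rows with probability defined by the weights,
# then computes an ordinary correlation on the resample
boot_cors <- replicate(bootn, {
  idx <- sample(n, size = n, replace = TRUE, prob = probs)
  cor(apistrat$api00[idx], apistrat$api99[idx])
})

# The bootstrap distribution yields a standard error for r, from
# which a test of the null hypothesis r = 0 can be built
se <- sd(boot_cors)
c(estimate = mean(boot_cors), std.err = se)
```

Increasing bootn tightens the standard-error estimate at the cost of running time, which is the trade-off described above.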
The mean1 argument tells
wtd.cor whether it
should treat your sample size as the number of observations in the
survey design (the number of rows in the data frame) or the sum of the
weights. Usually, the former is desired, so the default value of
mean1 is TRUE.
svycor(~api00 + api99, design = dstrat, digits = 4, sig.stats = TRUE, bootn = 2000, mean1 = TRUE)
Note that when sig.stats = TRUE, the correlation parameter
estimates come from the bootstrap procedure rather than the simpler
method, based on the survey-weighted covariance matrix, that is used when
sig.stats = FALSE.
By saving the output of the function, you can extract non-rounded coefficients, p values, and standard errors.
c <- svycor(~api00 + api99, design = dstrat, digits = 4, sig.stats = TRUE, bootn = 2000, mean1 = TRUE)
c$cors
c$p.values
c$std.err
The heavy lifting behind the scenes is done by
svyvar, which, as you may not realize from its printed output, also calculates covariances.
svyvar(~api00 + api99, design = dstrat)
But if you save the
svyvar object, you can see that
there’s more than meets the eye.
var <- svyvar(~api00 + api99, design = dstrat)
var <- as.matrix(var)
var
Once we know that, it’s just a matter of using R’s
cov2cor function and cleaning up the output.
cor <- cov2cor(var)
cor
Now to get rid of that covariance matrix…
cor <- cor[1:nrow(cor), 1:nrow(cor)]
cor
svycor has its own print method, so you won’t see so
many digits past the decimal point. You can extract the un-rounded
correlations from the saved object.

out <- svycor(~api99 + api00, design = dstrat)
out$cors