From Wikipedia, the free encyclopedia
Clustered standard errors (or Liang-Zeger standard errors)[1] are measurements that estimate the standard error of a regression parameter in settings where observations may be subdivided into smaller-sized groups («clusters») and where the sampling and/or treatment assignment is correlated within each group.[2][3] Clustered standard errors are widely used in a variety of applied econometric settings, including difference-in-differences[4] or experiments.[5]
Analogous to how Huber-White standard errors are consistent in the presence of heteroscedasticity and Newey–West standard errors are consistent in the presence of accurately-modeled autocorrelation, clustered standard errors are consistent in the presence of cluster-based sampling or treatment assignment. Clustered standard errors are often justified by possible correlation in modeling residuals within each cluster; while recent work suggests that this is not the precise justification behind clustering,[6] it may be pedagogically useful.
Intuitive motivation[edit]
Clustered standard errors are often useful when treatment is assigned at the level of a cluster instead of at the individual level. For example, suppose that an educational researcher wants to discover whether a new teaching technique improves student test scores. She therefore assigns teachers in «treated» classrooms to try this new technique, while leaving «control» classrooms unaffected. When analyzing her results, she may want to keep the data at the student level (for example, to control for student-level observable characteristics). However, when estimating the standard error or confidence interval of her statistical model, she realizes that classical or even heteroscedasticity-robust standard errors are inappropriate because student test scores within each class are not independently distributed. Instead, students in classes with better teachers have especially high test scores (regardless of whether they receive the experimental treatment) while students in classes with worse teachers have especially low test scores. The researcher can cluster her standard errors at the level of a classroom to account for this aspect of her experiment.[7]
While this example is very specific, similar issues arise in a wide variety of settings. For example, in many panel data settings (such as difference-in-differences) clustering often offers a simple and effective way to account for non-independence between periods within each unit (sometimes referred to as «autocorrelation in residuals»).[4] Another common and logically distinct justification for clustering arises when a full population cannot be randomly sampled, and so instead clusters are sampled and then units are randomized within cluster. In this case, clustered standard errors account for the uncertainty driven by the fact that the researcher does not observe large parts of the population of interest.[8]
Mathematical motivation[edit]
A useful mathematical illustration comes from the case of one-way clustering in an ordinary least squares (OLS) model. Consider a simple model with N observations that are subdivided in C clusters. Let be an
vector of outcomes,
a
matrix of covariates,
an
vector of unknown parameters, and
an
vector of unexplained residuals:
As is standard with OLS models, we minimize the sum of squared residuals to get an estimate
:
From there, we can derive the classic «sandwich» estimator:
Denoting yields a potentially more familiar form
While one can develop a plug-in estimator by defining and letting
, this completely flexible estimator will not converge to
as
. Given the assumptions that a practitioner deems as reasonable, different types of standard errors solve this problem in different ways. For example, classic homoskedastic standard errors assume that
is diagonal with identical elements
, which simplifies the expression for
. Huber-White standard errors assume
is diagonal but that the diagonal value varies, while other types of standard errors (e.g. Newey–West, Moulton SEs, Conley spatial SEs) make other restrictions on the form of this matrix to reduce the number of parameters that the practitioner needs to estimate.
Clustered standard errors assume that is block-diagonal according to the clusters in the sample, with unrestricted values in each block but zeros elsewhere. In this case, one can define
and
as the within-block analogues of
and
and derive the following mathematical fact:
By constructing plug-in matrices , one can form an estimator for
that is consistent as the number of clusters
becomes large. While no specific number of clusters is statistically proven to be sufficient, practitioners often cite a number in the range of 30-50 and are comfortable using clustered standard errors when the number of clusters exceeds that threshold.
Further reading[edit]
- Alberto Abadie, Susan Athey, Guido W Imbens, and Jeffrey M Wooldridge. 2022. «When Should You Adjust Standard Errors for Clustering?» Quarterly Journal of Economics.
References[edit]
- ^ Liang, Kung-Yee; Zeger, Scott L. (1986-04-01). «Longitudinal data analysis using generalized linear models». Biometrika. 73 (1): 13–22. doi:10.1093/biomet/73.1.13. ISSN 0006-3444.
- ^ Cameron, A. Colin; Miller, Douglas L. (2015-03-31). «A Practitioner’s Guide to Cluster-Robust Inference». Journal of Human Resources. 50 (2): 317–372. doi:10.3368/jhr.50.2.317. ISSN 0022-166X. S2CID 1296789.
- ^ «ARE 212». Fiona Burlig. Retrieved 2020-07-05.
- ^ a b Bertrand, Marianne; Duflo, Esther; Mullainathan, Sendhil (2004-02-01). «How Much Should We Trust Differences-In-Differences Estimates?». The Quarterly Journal of Economics. 119 (1): 249–275. doi:10.1162/003355304772839588. hdl:1721.1/63690. ISSN 0033-5533. S2CID 470667.
- ^ Yixin Tang (2019-09-11). «Analyzing Switchback Experiments by Cluster Robust Standard Error to prevent false positive results». DoorDash Engineering Blog. Retrieved 2020-07-05.
- ^ Abadie, Alberto; Athey, Susan; Imbens, Guido; Wooldridge, Jeffrey (2017-10-24). «When Should You Adjust Standard Errors for Clustering?». arXiv:1710.02926 [math.ST].
- ^ «CLUSTERED STANDARD ERRORS». Economic Theory Blog. 2016. Archived from the original on 2016-11-06. Retrieved 28 September 2021.
- ^ «When should you cluster standard errors? New wisdom from the econometrics oracle». blogs.worldbank.org. Retrieved 2020-07-05.
In many scenarios, data are structured in groups or clusters, e.g. pupils within classes (within schools), survey respondents within countries or, for longitudinal surveys, survey answers per subject. Simply ignoring this structure will likely lead to spuriously low standard errors, i.e. a misleadingly precise estimate of our coefficients. This in turn leads to overly-narrow confidence intervals, overly-low p-values and possibly wrong conclusions.
Clustered standard errors are a common way to deal with this problem. Unlike Stata, R doesn’t have built-in functionality to estimate clustered standard errors. There are several packages though that add this functionality and this article will introduce three of them, explaining how they can be used and what their advantages and disadvantages are. Before that, I will outline the theory behind (clustered) standard errors for linear regression. The last section is used for a performance comparison between the three presented packages. If you’re already familiar with the concept of clustered standard errors, you may skip to the hands-on part right away.
Data
We’ll work with the dataset nlswork that’s included in Stata, so we can easily compare the results with Stata. The data comes from the US National Longitudinal Survey (NLS) and contains information about more than 4,000 young working women. As for this example, we’re interested in the relationship between wage (here as log-scaled GNP-adjusted wage) as dependent variable (DV) ln_wage
and survey participant’s current age
, job tenure
in years and union
membership as independent variables. It’s a longitudinal survey, so subjects were asked repeatedly between 1968 and 1988 and each subject is identified by an unique idcode
.
The example data is used for illustrative purposes only and we skip many things that we’d normally do, such as investigating descriptive statistics and exploratory plots. To keep the data size limited, we’ll only work with a subset of the data (only subjects with IDs 1 to 100) and we also simply dismiss any observations that contain missing values.
library(webuse) library(dplyr) nlswork_orig <- webuse('nlswork') nlswork <- filter(nlswork_orig, idcode <= 100) %>% select(idcode, year, ln_wage, age, tenure, union) %>% filter(complete.cases(.)) %>% mutate(union = as.integer(union), idcode = as.factor(idcode)) str(nlswork) tibble [386 × 6] (S3: tbl_df/tbl/data.frame) $ idcode : Factor w/ 82 levels "1","2","3","4",..: 1 1 1 1 1 1 1 2 2 2 … $ year : num [1:386] 72 77 80 83 85 87 88 71 77 78 … $ ln_wage: num [1:386] 1.59 1.78 2.55 2.42 2.61 … $ age : num [1:386] 20 25 28 31 33 35 37 19 25 26 … $ tenure : num [1:386] 0.917 1.5 1.833 0.667 1.917 … $ union : int [1:386] 1 0 1 1 1 1 1 0 1 1 …
Let’s have a look at the first few observations. They contain data from subject #1, who was surveyed several times between 1972 and 1988, and a few observations from subject #2.
idcode year ln_wage age tenure union 1 72 1.59 20 0.917 1 1 77 1.78 25 1.5 0 1 80 2.55 28 1.83 1 1 83 2.42 31 0.667 1 1 85 2.61 33 1.92 1 1 87 2.54 35 3.92 1 1 88 2.46 37 5.33 1 2 71 1.36 19 0.25 0 2 77 1.73 25 2.67 1 2 78 1.69 26 3.67 1
A summary of all but the idcode variable gives the following:
year ln_wage age tenure union Min. :70.00 Min. :0.4733 Min. :18.0 Min. : 0.000 Min. :0.0000 1st Qu.:73.00 1st Qu.:1.6131 1st Qu.:25.0 1st Qu.: 1.167 1st Qu.:0.0000 Median :80.00 Median :1.9559 Median :31.0 Median : 2.417 Median :0.0000 Mean :79.61 Mean :1.9453 Mean :30.8 Mean : 3.636 Mean :0.2591 3rd Qu.:85.00 3rd Qu.:2.2349 3rd Qu.:36.0 3rd Qu.: 4.958 3rd Qu.:1.0000 Max. :88.00 Max. :3.5791 Max. :45.0 Max. :19.000 Max. :1.0000
We have 82 subjects in our subset and the number of times each subject was surveyed ranges from only once to twelve times:
summary(as.integer(table(nlswork$idcode))) Min. 1st Qu. Median Mean 3rd Qu. Max. 1.000 2.000 4.000 4.707 7.000 12.000
In more than one quarter of the observations, the subject answered to be currently member of a trade union.
The following shows the distribution of the DV in our data. It is roughly normally distributed with mean 1.95 and standard deviation (SD) 0.45.
We can calculate the mean and SD of the DV separately for each subject. A histogram of these subject-specific means reveals more variability:
y_mean_sd_cl <- sapply(levels(nlswork$idcode), function(idcode) { y_cl <- nlswork$ln_wage[nlswork$idcode == idcode] c(mean(y_cl), sd(y_cl)) }) hist(y_mean_sd_cl[1,], breaks = 20, main = 'Histogram of DV means per subject', xlab = NA)
We can compare the SD of the subject-specific means with the mean of the SDs calculated from each subjects’ repeated measures.
c(sd(y_mean_sd_cl[1,]), mean(y_mean_sd_cl[2,], na.rm = TRUE)) [1] 0.4038449 0.2221142
The SD between the subject-specific means is almost twice as large as the mean of the SD from each subjects’ values. This shows that there’s much more variability between each subject than within each subject’s repeated measures regarding the DV.
Fixed-effects model, not adjusting for clustered observations
Our data contains repeated measures for each subject, so we have panel data in which each subject forms a group or cluster. We can use a fixed-effects (FE) model to account for unobserved subject-specific characteristics. We do so by including the subject’s idcode
in our model formula. It’s important to note that idcode
is of type factor (we applied idcode = as.factor(idcode)
when we prepared the data) so that for each factor level (i.e. each subject) an FE coefficient will be estimated that represents the subject-specific mean of our DV. 1
Let’s specify and fit such a model using lm
. We include job tenure
, union
membership and an interaction between both (the latter mainly for illustrative purposes later when we estimate marginal effects). We also control for age
and add idcode
as FE variable.
m1 <- lm(ln_wage ~ age + tenure + union + tenure:union + idcode, data = nlswork) summary(m1) Call: lm(formula = ln_wage ~ age + tenure + union + tenure:union + idcode, data = nlswork) Residuals: Min 1Q Median 3Q Max -0.96463 -0.09405 0.00000 0.11460 1.23525 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.882e+00 1.314e-01 14.325 < 2e-16 *** age 5.631e-03 3.110e-03 1.811 0.071193 . tenure 2.076e-02 6.964e-03 2.980 0.003115 ** union 1.746e-01 6.065e-02 2.879 0.004272 ** idcode2 -6.174e-01 1.285e-01 -4.803 2.47e-06 *** idcode3 -5.625e-01 1.329e-01 -4.234 3.05e-05 *** [...]
We’re not really interested in the subject-specific means (the FE coefficients), so let’s filter them out and only show our coefficients of interest:
m1coeffs_std <- data.frame(summary(m1)$coefficients) coi_indices <- which(!startsWith(row.names(m1coeffs_std), 'idcode')) m1coeffs_std[coi_indices,] Estimate Std..Error t.value Pr...t.. (Intercept) 1.882478232 0.131411504 14.325064 8.022367e-36 age 0.005630809 0.003109803 1.810664 7.119315e-02 tenure 0.020756426 0.006964417 2.980353 3.114742e-03 union 0.174619394 0.060646038 2.879321 4.272027e-03 tenure:union 0.014974113 0.009548509 1.568215 1.178851e-01
Unsurprisingly, job tenure and especially union membership are positively associated with wage. The coefficient of the interaction term shows that with union membership the job tenure effect is even a bit higher, though not significantly.
In the next two sections we’ll see how standard errors for our estimates are usually computed and how this fits into a framework called “sandwich estimators.” Using this framework, we’ll see how the standard error calculations can be adjusted for clustered data.
In ordinary least squares (OLS) regression, we assume that the regression model errors are independent. This is not the case here: Each subject may be surveyed several times so within each subject’s repeated measures, the errors will be correlated. Although that is not a problem for our regression estimates (they are still unbiased [Roberts 2013]), it is a problem for for the precision of our estimates — the precision will typically be overestimated, i.e. the standard errors (SEs) will be lower than they should be [Cameron and Miller 2013]. The intuition behind this regarding our example is that within our clusters we usually have lower variance since the answers come from the same subject and are correlated. This lowers our estimates’ SEs.
We can deal with this using clustered standard errors with subjects representing our clusters. But before we do this, let’s first have a closer look on how “classic” OLS estimates’ SEs are actually computed.
In matrix notation, a linear model has the form
This model has parameters (including the intercept parameter
) expressed as
parameter vector
. The parameters will be estimated from
observations in our data. The DV is
(an
vector), the independent variables form an
matrix
. Finally, the error term
is an
vector that captures everything that influences
but cannot be explained by
. In classic OLS, we assume that
has a mean of zero and a variance of
.
By minimizing an estimation for our parameters,
, can be found as
. The estimated variance
for these parameters is then
(1)
where is the estimated variance of the error or residual variance. This is calculated as
with being the residuals. The numerator is also called the residual sum of squares and the denominator is the degrees of freedom.
You can see for example in Sheather 2009 for how the formulas for and
are derived or watch this video for a clear step-by-step derivation.
We now have all the pieces together to replicate the standard errors from model m1
with our own calculations. To translate these formulae to R, we use model.matrix
to get the design matrix ,
residuals
for the residual vector ,
nobs
for the number of observations ,
ncol(X)
for the number or parameters, solve
to calculate the inverse of and
diag
to extract the diagonal of a square matrix.
X <- model.matrix(m1) u <- residuals(m1) n <- nobs(m1) p <- ncol(X) sigma2 <- sum(u^2) / (n - p) # solve (X^T X) A = I, where I is identity matrix -> A is (X^T X)^-1 crossXinv <- solve(t(X) %*% X, diag(p)) m1se <- sqrt(diag(sigma2 * crossXinv)) m1se [1] 0.131411504 0.003109803 0.006964417 0.060646038 0.128532897 ...
Let’s check if this is equal to the standard errors calculated by lm
(using near
because of minor deviations due to floating point number imprecision):
all(near(m1se, m1coeffs_std$Std..Error)) [1] TRUE
We extracted our parameter estimates’ variance from the diagonal of the (variance-)covariance or vcov matrix and R has the
vcov
function to calculate it from a fitted model. It’s exactly what we computed before using eq. 1:
all(near(sigma2 * crossXinv, vcov(m1))) [1] TRUE
Clustered standard errors
Classic OLS SEs can be generalized so that some assumptions, namely that the regression model errors are independent, can be relaxed. The foundation for this is the sandwich estimator 2
(2)
Let’s first understand how the above equation relates to eq. 1, the classic OLS parameter variance: One assumption of classic OLS is constant variance (or homoscedasticity) in the errors across the full spectrum of our DV. This implicates that is a diagonal matrix with identical
elements, i.e.
. Plugging this into eq. 2 gives eq. 1 which shows that the classic OLS parameter variance is a specialized form of the sandwich estimator. 3
When we want to obtain clustered SEs, we need to consider that in the “meat” part of eq. 2 is not a diagonal matrix with identical
elements anymore, hence this can’t be simplified to eq. 1. Instead, we can assume that
is block-diagonal with the clusters forming the blocks. This means we assume that the variance in the errors is constant only within clusters and so we first calculate
per cluster
and then sum the
. Cameron and Miller 2013 (p. 11) show how
is calculated in detail and also which finite-sample correction factor is applied. From this article we get the equation
(3)
where is the number of clusters. It’s interesting to see how the residuals are added up per cluster and then averaged. As Cameron and Miller 2013 (p. 13) notes, this implicates an important limitation: With a low number of clusters, this averaging is imprecise.
Let’s translate this formula to R. We already have as
u
(the residuals) and the design matrix X
. We can generate a list of matrices , sum them and multiply the correction factor:
omegaj <- lapply(levels(nlswork$idcode), function(idcode) { j <- nlswork$idcode == idcode # drop = FALSE: don't drop dimensions when we have only one obs. X_j <- X[j, , drop = FALSE] # tcrossprod is outer product x * x^T t(X_j) %*% tcrossprod(u[j]) %*% X_j }) n_cl <- length(levels(nlswork$idcode)) # num. clusters # correction factor * sum of omega_j omega <- (n-1) / (n-p) * (n_cl / (n_cl-1)) * Reduce('+', omegaj) # sandwich formula; extract diagonal and take square root to get SEs m1clse <- sqrt(diag(crossXinv %*% omega %*% crossXinv)) m1clse [1] 0.157611390 0.006339777 0.011149190 0.101970509 0.020561516 ...
We will later check that this matches the estimates calculated with R packages that implement clustered SE estimation. For now, let’s compare the classic OLS SEs with the clustered SEs:
m1coeffs_with_clse <- cbind(m1coeffs_std, ClustSE = m1clse) m1coeffs_with_clse[coi_indices, c(1, 2, 5)] Estimate Std..Error ClustSE (Intercept) 1.882478232 0.131411504 0.157611390 age 0.005630809 0.003109803 0.006339777 tenure 0.020756426 0.006964417 0.011149190 union 0.174619394 0.060646038 0.101970509 tenure:union 0.014974113 0.009548509 0.009646023
We can see that, as expected, the clustered SEs are all a bit higher than the classic OLS SEs.
The above calculations were used to show what’s happening “under the hood” and also how the formulas used for these calculations are motivated. However, doing these calculations “by hand” is error-prone and slow. It’s better to use well trusted packages for daily work and so next we’ll have a look at some of these packages and how they can be used. Still, it’s helpful to understand some background and the limitations for this approach. See Cameron and Miller 2013 for a much more thorough guide (though only with examples in Stata) that also considers topics like which variable(s) to use for clustering, what to do when dealing with a low number of clusters or how to implement multi-way clustering.
Option 1: sandwich and lmtest
The sandwich package implements several methods for robust covariance estimators, including clustered SEs. Details are explained in Zeileis et al. 2020. The accompanying lmtest package provides functions for coefficient tests that take into account the calculated robust covariance estimates.
As explained initially, the parameter estimates from our model are fine despite the clustered structure of our data. But the SEs are likely biased downward and need to be corrected. This is why we can resume to work with our initially estimated model m1
from lm
. There’s no need to refit it and sandwich works with lm model objects (and also some other types of models such as some glm models). We only have to adjust how we test our coefficient estimates in the following way:
- We need to use
coeftest
from the lmtest package; - we need to pass it our model and either a function to calculate the covariance matrix or an already estimated covariance matrix to the
vcov
parameter; - we need to specify a cluster variable in the
cluster
parameter.
The sandwich package provides several functions for estimating robust covariance matrices. We need vcovCL
for clustered covariance estimation and will pass this function as vcov
parameter. Furthermore, we cluster by subject ID, so the cluster variable is idcode
.
library(sandwich) library(lmtest) m1coeffs_cl <- coeftest(m1, vcov = vcovCL, cluster = ~idcode) m1coeffs_cl[coi_indices,] Estimate Std. Error t value Pr(>|t|) (Intercept) 1.882478232 0.157611390 11.9437956 3.667970e-27 age 0.005630809 0.006339777 0.8881715 3.751601e-01 tenure 0.020756426 0.011149190 1.8616981 6.362342e-02 union 0.174619394 0.101970509 1.7124500 8.784708e-02 tenure:union 0.014974113 0.009646023 1.5523613 1.216301e-01
The calculated SE values seem familiar and they are indeed equal to what we calculated before as m1clse
“by hand”:
all(near(m1clse, m1coeffs_cl[,2])) [1] TRUE
The lmtest package provides several functions for common post-estimation tasks, for example coefci
to calculate confidence intervals (CIs). If we use these, we need to make sure to specify the same type of covariance estimation, again by passing the appropriate vcov
and cluster
parameters:
(m1cis <- coefci(m1, parm = coi_indices, vcov = vcovCL, cluster = ~idcode)) 2.5 % 97.5 % (Intercept) 1.572314302 2.19264216 age -0.006845258 0.01810688 tenure -0.001184099 0.04269695 union -0.026048678 0.37528746 tenure:union -0.004008324 0.03395655
This is really important, as otherwise the classic (non-clustered) covariance estimation is applied by default. This, due to lower SEs, leads to narrower CIs:
coefci(m1, parm = coi_indices) 2.5 % 97.5 % (Intercept) 1.6238731375 2.14108333 age -0.0004889822 0.01175060 tenure 0.0070511278 0.03446172 union 0.0552738733 0.29396491 tenure:union -0.0038164264 0.03376465
Here, the tenure
and union
CIs suddenly don’t include zero any more!
Instead of passing vcovCL
as function to the vcov
parameter, it’s more convenient and computationally more efficient to calculate the covariance matrix only once using vcovCL
and then passing this matrix to functions like coeftest
and coefci
instead:
cl_vcov_mat <- vcovCL(m1, cluster = ~idcode)
Now we pass this matrix for the vcov
parameter. We don’t need to specify the cluster
parameter anymore, since this information was only needed in the previous step.
m1coeffs_cl2 <- coeftest(m1, vcov = cl_vcov_mat) all(near(m1coeffs_cl[,2], m1coeffs_cl2[,2])) # same SEs? [1] TRUE m1cis2 <- coefci(m1, parm = coi_indices, vcov = cl_vcov_mat) all(near(m1cis, m1cis2)) # same CIs? [1] TRUE
Another example would be to calculate marginal effects, for example with the margins package. Again, to arrive at clustered SEs we will need to pass the proper covariance matrix via the vcov
parameter. We do this for the marginal effect of tenure
at the two levels of union
:
library(margins) margins(m1, vcov = cl_vcov_mat, variables = 'tenure', at = list(union = 0:1)) %>% summary() factor union AME SE z p lower upper tenure 0.0000 0.0208 0.0111 1.8617 0.0626 -0.0011 0.0426 tenure 1.0000 0.0357 0.0083 4.3089 0.0000 0.0195 0.0520
Otherwise classic SEs are estimated, which are smaller:
margins(m1, variables = 'tenure', at = list(union = 0:1)) %>% summary() factor union AME SE z p lower upper tenure 0.0000 0.0208 0.0070 2.9804 0.0029 0.0071 0.0344 tenure 1.0000 0.0357 0.0081 4.3846 0.0000 0.0198 0.0517
As you can see, the combination of lm
and the packages sandwich and lmtest are all you need for estimating clustered SEs and inference. However, you really need to be careful to include the covariance matrix at all steps of your calculations.
Option 2: lm.cluster
from miceadds
There’s also lm.cluster
from the package miceadds 4, which may be a bit more convenient to use. Internally, it basically does the same that we’ve done before by employing sandwich’s vcovCL
(see source code parts here and there for example).
Instead of using lm
, we fit the model with lm.cluster
and specify a cluster variable (this time as string, not as formula). The model summary then contains the clustered SEs:
library(miceadds) m2 <- lm.cluster(ln_wage ~ age + tenure + union + tenure:union + idcode, cluster = 'idcode', data = nlswork) m2coeffs <- data.frame(summary(m2)) m2coeffs[!startsWith(row.names(m2coeffs), 'idcode'),] Estimate Std..Error t.value Pr...t.. (Intercept) 1.882478232 0.157611390 11.9437956 6.995639e-33 age 0.005630809 0.006339777 0.8881715 3.744485e-01 tenure 0.020756426 0.011149190 1.8616981 6.264565e-02 union 0.174619394 0.101970509 1.7124500 8.681378e-02 tenure:union 0.014974113 0.009646023 1.5523613 1.205758e-01
An object m
that is returned from lm.cluster
is a list that contains the lm
object as m$lmres
and the covariance matrix as m$vcov
. Again, these objects need to be “dragged along” if we want to do further computations. For margins
, we also need to pass the data again via data = nlswork
:
margins(m2$lm_res, vcov = m2$vcov, variables = 'tenure', at = list(union = 0:1), data = nlswork) %>% summary() factor union AME SE z p lower upper tenure 0.0000 0.0208 0.0111 1.8617 0.0626 -0.0011 0.0426 tenure 1.0000 0.0357 0.0083 4.3089 0.0000 0.0195 0.0520
The result is consistent with our former computations. The advantage over the “lm + sandwich + lmtest” approach is that you can do clustered SE estimation and inference in one go. For further calculations you still need to be careful to supply the covariance matrix from m$vcov
.
Option 3: lm_robust
from estimatr
Another option is to use lm_robust
from the estimatr package which is part of the DeclareDesign framework [Blair et al. 2019]. Like lm.cluster
, it’s more convenient to use, but it doesn’t rely on sandwich and lmtest in the background and instead comes with an own implementation for model fitting and covariance estimation. This implementation is supposed to be faster than the other approaches and we’ll check that for our example later.
But first, let’s fit a model with clustered SEs using lm_robust
. We use the same formula as with lm
or lm.cluster
, but also specify the clusters
parameter: 5
library(estimatr) m3 <- lm_robust(ln_wage ~ age + tenure + union + tenure:union + idcode, clusters = idcode, data = nlswork) summary(m3) Call: lm_robust(formula = ln_wage ~ age + tenure + union + tenure:union + idcode, data = nlswork, clusters = idcode) Standard error type: CR2 Coefficients: Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF (Intercept) 1.882e+00 0.141424 13.310872 8.596e-14 1.5930761 2.171880 28.642 age 5.631e-03 0.005726 0.983442 3.342e-01 -0.0061215 0.017383 26.790 tenure 2.076e-02 0.010089 2.057345 5.771e-02 -0.0007714 0.042284 14.812 union 1.746e-01 0.093790 1.861817 7.735e-02 -0.0209918 0.370231 20.049 idcode2 -6.174e-01 0.019094 -32.334722 2.810e-07 -0.6656871 -0.569086 5.283 idcode3 -5.625e-01 0.078607 -7.156034 9.687e-07 -0.7273116 -0.397718 18.550
Unlike lm
, lm_robust
allows to specify fixed effects in a separate fixed_effects
formula parameter which, according to the documentation, should speed up computation for many types of SEs. Furthermore, this cleans up the summary output since there are no more FE coefficients:
m3fe <- lm_robust(ln_wage ~ age + tenure + union + tenure:union, clusters = idcode, fixed_effects = ~idcode, data = nlswork) summary(m3fe) Call: lm_robust(formula = ln_wage ~ age + tenure + union + tenure:union, data = nlswork, clusters = idcode, fixed_effects = ~idcode) Standard error type: CR2 Coefficients: Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF age 0.005631 0.005726 0.9834 0.33419 -0.0061215 0.01738 26.790 tenure 0.020756 0.010089 2.0573 0.05771 -0.0007714 0.04228 14.812 union 0.174619 0.093790 1.8618 0.07735 -0.0209918 0.37023 20.049 tenure:union 0.014974 0.009043 1.6558 0.14062 -0.0062982 0.03625 7.187 Multiple R-squared: 0.7554 , Adjusted R-squared: 0.6861 [...]
When we compare the results from lm_robust
with lm
, we can see that the point estimates are the same. The lm_robust
SEs are, as expected, higher than the “classic” SEs from lm
. However, the lm_robust
SEs are also a bit smaller than those calculated from sandwich::vcovCL
:
m3fe_df <- tidy(m3fe) %>% rename(est.lm_robust = estimate, se.lm_robust = std.error) m3fe_df$se.sandwich <- m1coeffs_cl[coi_indices,2][2:5] m3fe_df$est.classic <- m1coeffs_std[coi_indices,1][2:5] m3fe_df$se.classic <- m1coeffs_std[coi_indices,2][2:5] m3fe_df[c('term', 'est.classic', 'est.lm_robust', 'se.classic', 'se.lm_robust', 'se.sandwich')] term est.classic est.lm_robust se.classic se.lm_robust se.sandwich age 0.005630809 0.005630809 0.003109803 0.005725617 0.006339777 tenure 0.020756426 0.020756426 0.006964417 0.010088940 0.011149190 union 0.174619394 0.174619394 0.060646038 0.093789776 0.101970509 tenure:union 0.014974113 0.014974113 0.009548509 0.009043429 0.009646023
This is because lm_robust
by default uses a different cluster-robust variance estimator “to correct hypotheses tests for small samples and work with commonly specified fixed effects and weights” as explained in the Getting started vignette. Details can be found in the Mathematical notes for estimatr.
As with the lm
and lm.cluster
results, we can also estimate marginal effects with a lm_robust
result object. However, this doesn’t seem to work when you specify FEs via fixed_effects
parameter as done for m3fe
:
margins(m3fe, variables = 'tenure', at = list(union = 0:1)) %>% summary() Error in predict.lm_robust(model, newdata = data, type = type, se.fit = TRUE, : Can't set `se.fit` == TRUE with `fixed_effects`
With m3
(where FEs were directly specified in the model formula), marginal effects estimation works and we don’t even need to pass a separate vcov
matrix, since this information already comes with the lm_robust
result object m3
. 6
margins(m3, variables = 'tenure', at = list(union = 0:1)) %>% summary() factor union AME SE z p lower upper tenure 0.0000 0.0208 0.0101 2.0573 0.0397 0.0010 0.0405 tenure 1.0000 0.0357 0.0077 4.6174 0.0000 0.0206 0.0509
As already said, lm_robust
uses a different variance estimator than sandwich’s vcovCL
and Stata. However, by setting se_type
to 'stata'
we can replicate these “Stata Clustered SEs”:
m3stata <- lm_robust(ln_wage ~ age + tenure + union + tenure:union + idcode, clusters = idcode, se_type = 'stata', data = nlswork) m3stata_se <- tidy(m3stata) %>% pull(std.error) all(near(m3stata_se, m1clse)) # same SEs? [1] TRUE
In summary, lm_robust
is as convenient to use as lm
or lm.cluster
, but offers similar flexibility as sandwich
for estimating clustered SEs. A big advantage is that you don’t need to care about supplying the right covariance matrix to further post-estimation functions like margins
. The proper covariance matrix is directy attached to the fitted lm_robust
object (and can be accessed via model$vcov
or vcov(model)
if you need to). Is parameter estimation also faster with lm_robust
?
Performance comparison
We’ll make a rather superficial performance comparison only using the nlswork
dataset and microbenchmark. We will compare the following implementations for estimating model coefficients and clustered SEs:
lm
andvcovCL
from sandwichlm.cluster
lm_robust
with default SEs (se_type = 'CR2'
)lm_robust
with Stata SEs (se_type = 'stata'
)lm_robust
withfixed_effects
parameter and Stata SEs (fixed_effects = idcode, se_type = 'stata'
)
For a fair comparison, we don’t calculate CIs (which lm_robust
by default does). These are the results for 100 test runs:
As expected, lm/sandwich
and lm.cluster
have similar run times. lm_robust
is faster for all three configurations (3. to 5.) and is especially fast when estimating Stata SEs (4. and 5.). With our example data, specifying fixed_effects
(5.) doesn’t seem to speed up the calculations.
Conclusion
We’ve seen that it’s important to account for clusters in data when estimating model parameters, since ignoring this fact will likely result in overestimated precision which in turn can lead to wrong inference. R provides many ways to estimate clustered SEs. The packages sandwich
and lmtest
come with a rich set of tools for this task (and also for other types of robust SEs) and work with lm
and other kinds of models. lm.cluster
from the miceadds package provides a more convenient wrapper around sandwich
and lmtest
. However, users should be careful to not forget passing along the separate cluster robust covariance matrix for post-estimation tasks. This is something users don’t need to care for when using lm_robust
from the estimatr package, since the covariance matrix is not separate from the fitted model object. Another advantage is that lm_robust
seems to be faster than the other options.
The source code for this article can be found on GitHub.
References
Blair, Graeme, Jasper Cooper, Alexander Coppock, and Macartan Humphreys. 2019. “Declaring and Diagnosing Research Designs.” American Political Science Review 113 (3): 838–59.
Cameron, A Colin, and Douglas L Miller. 2013. “A Practitioner’s Guide to Cluster-Robust Inference.”
Roberts, Molly. 2013. “Robust and Clustered Standard Errors.”
Sheather, Simon. 2009. A Modern Approach to Regression with R. Springer Texts in Statistics. New York: Springer-Verlag.
Zeileis, Achim, Susanne Köll, and Nathaniel Graham. 2020. “Various Versatile Variances: An Object-Oriented Implementation of Clustered Covariances in R.” Journal of Statistical Software 95 (1).
-
It is not always necessary to use an FE model and you can very well estimate robust SEs from clustered data also without FEs. ↩
-
This is called sandwich estimator because of the structure of the formula: Between two slices of bread
there is the meat
. ↩
-
Plugging in
in eq. 2 gives
which reduces to eq. 1. because
↩
-
Once again a strange package name. Unlike sandwich this package name is not derived from a formula, but simply stands for additional functionality for imputation with the mice package — and just happens to include clustered SE estimation. ↩
-
Note that this time we have to use the “bare” unquoted variable name — not a formula, not a string. Every package has a different policy! ↩
-
The standard
vcov
function will return the correct (cluster robust) covariance matrix for a fittedlm_robust
model, whereas it will return the “classic” covariance matrix for a fittedlm
model. ↩
In ancient Greek times, important decisions were never made without consulting the high priestess at the Oracle of Delphi. She would deliver wisdom from the gods, although this advice was sometimes vague or confusing, and was often misinterpreted by mortals. Today I bring word that the high priestess and priests (Athey, Abadie, Imbens and Wooldridge) have delivered new wisdom from the god of econometrics on the important decision of when should you cluster standard errors. This is definitely one of life’s most important questions, as any keen player of seminar bingo can surely attest. In case their paper is all greek to you (half of it literally is), I will attempt to summarize their recommendations, so that your standard errors may be heavenly.
The authors argue that there are two reasons for clustering standard errors: a sampling design reason, which arises because you have sampled data from a population using clustered sampling, and want to say something about the broader population; and an experimental design reason, where the assignment mechanism for some causal treatment of interest is clustered. Let me go through each in turn, by way of examples, and end with some of their takeaways.
The Sampling Design reason for clustering
Consider running a simple Mincer earnings regression of the form:
Log(wages) = a + b*years of schooling + c*experience + d*experience^2 + e
You present this model, and are deciding whether to cluster the standard errors. Referee 1 tells you “the wage residual is likely to be correlated within local labor markets, so you should cluster your standard errors by state or village.”. But referee 2 argues “The wage residual is likely to be correlated for people working in the same industry, so you should cluster your standard errors by industry”, and referee 3 argues that “the wage residual is likely to be correlated by age cohort, so you should cluster your standard errors by cohort”. What should you do?
You could try estimating your model with these three different clustering approaches, and see what difference this makes.
Their advice: whether or not clustering makes a difference to the standard errors should not be the basis for deciding whether or not to cluster. They note there is a misconception that if clustering matters, one should cluster.
Instead, under the sampling perspective, what matters for clustering is how the sample was selected and whether there are clusters in the population of interest that are not represented in the sample. So, we can imagine different scenarios here:
- You want to say something about the association between schooling and wages in a particular population, and are using a random sample of workers from this population. Then there is no need to adjust the standard errors for clustering at all, even if clustering would change the standard errors.
- The sample was selected by randomly sampling 100 towns and villages from within the country, and then randomly sampling people in each; and your goal is to say something about the return to education in the overall population. Here you should cluster standard errors by village, since there are villages in the population of interest beyond those seen in the sample.
- This same logic makes it clear why you generally wouldn’t cluster by age cohort (it seems unlikely that we would randomly sample some age cohorts and not others, and then try and say something about all ages); and that we would only want to cluster by industry if the sample was drawn by randomly selecting a sample of industries, and then sampling individuals from within each.
Even in the second case, Abadie et al. note that both the usual robust (Eicker-Huber-White or EHW) standard errors, and the clustered standard errors (which they call Liang-Zeger or LZ standard errors) can both be correct, it is just that they are correct for different estimands. That is, if you are content on just saying something about the particular sample of individuals you have, without trying to generalize to the population, the EHW standard errors are all you need; but if you want to say something about the broader population, the LZ standard errors are necessary.
Special case: even when the sampling is clustered, the EHW and LZ standard errors will be the same if there is no heterogeneity in the treatment effects.
Sidenote 1: this reminds me also of propensity score matching command nnmatch of Abadie (with a different et al.), where you can get the narrower SATE standard errors for the sample, or the wider PATE errors for the population.
Sidenote 2: This reason is hardly ever a rationale for clustering in an impact evaluation. But Rosenzweig and Udry’s paper on external validity does make the point that we only observe treatment effects for specific points in time, and that if we want to say something more general about how our treatment behaves in other points in time, we need wider standard errors than we use for just saying something about our specific sample – which is very related to the point here about being very clear what your estimand is.
The Experimental Design Reason for Clustering
The second reason for clustering is the one we are probably more familiar with, which is when clusters of units, rather than individual units, are assigned to a treatment. Let’s take the same equation as above, but assume that we have a binary treatment that assigns more schooling to people. So now we have:
Log(wages) = a +b*Treatment + e
Then if the treatment is assigned at the individual level, there is no need to cluster (*). There has been much confusion about this, as Chris Blattman explored in two earlier posts about this issue (the fabulously titled clusterjerk and clusterjerk the sequel), and I still occasionally get referees suggesting I try clustering by industry or something similar in an individually-randomized experiment. This Abadie et al. paper is now finally a good reference to explain why this is not necessary.
(*) unless you are using multiple time periods, and then you will want to cluster by individual, since the unit of randomization is individual, and not individual-time period.
What about if your treatment is assigned at the village level. Then cluster by village. This is also why you want to cluster difference-in-differences at the state-level when you have a source of variation that comes from differences across states, and why a “treatment” like being on one side of a border vs the other is problematic (because you have only 2 clusters).
Adding fixed effects
What if we sample at the level of cities, but then add city fixed effects to our Mincer regression. Or we randomize at the city level, but add city fixed effects. Do we still need to cluster at the city level?
The authors note that there is a lot of confusion about using clustering with fixed effects. The general rule is that you still need to cluster if either the sampling or assignment to treatment was clustered. However, the authors show that cluster adjustments will only make an adjustment with fixed effects if there is heterogeneity in treatment effects.
How to cluster?
This is largely a paper about when to cluster, not how to cluster. There is of course a whole other debate about when you can rely on asymptotics, vs bootstrapping, vs randomization inference approaches. They show with asymptotic approximations that the standard Liang-Zeger cluster adjustment is generally conservative, and offer an alternative cluster-adjusted variance estimator that can be used if there is variation in treatment assignment within clusters and you know the fraction of clusters sampled. But since with the sample sizes used in many experiments the concern is now that asymptotic standard errors may not be conservative enough, you should be careful about using such an adjustment with typical sample sizes.
Кластерированные стандартные ошибки являются измерением , которые оценить стандартную ошибку в виде регрессии параметра в условиях , где наблюдения могут быть разделены на более мелкий размере группы ( «кластеры») и , где назначение выборки и / или лечения коррелирует в пределах каждой группы. Кластерные стандартные ошибки широко используются в различных прикладных эконометрических условиях, включая различия в различиях или эксперименты. Аналогично тому, как стандартные ошибки Хубера-Уайта согласованы при наличии гетероскедастичности, а стандартные ошибки Ньюи-Уэста согласованы при наличии точно смоделированной автокорреляции , кластерные (или «Лян-Зегер») стандартные ошибки согласованы при наличии кластерных отбор проб или назначение лечения. Кластерные стандартные ошибки часто оправдываются возможной корреляцией при моделировании остатков внутри каждого кластера; хотя недавняя работа предполагает, что это не точное оправдание кластеризации, это может быть полезно с педагогической точки зрения.
Интуитивная мотивация
Кластерные стандартные ошибки часто полезны, когда лечение назначается на уровне кластера, а не на индивидуальном уровне. Например, предположим, что исследователь в области образования хочет выяснить, улучшает ли новая методика обучения результаты тестов учащихся. Поэтому она поручает учителям в «обработанных» классах опробовать эту новую технику, не затрагивая «контрольные» классы. Анализируя свои результаты, она может захотеть сохранить данные на уровне ученика (например, для контроля наблюдаемых характеристик на уровне ученика). Однако, оценивая стандартную ошибку или доверительный интервал своей статистической модели, она понимает, что классические или даже устойчивые к гетероскедастичности стандартные ошибки неуместны, поскольку результаты тестов учащихся в каждом классе не распределяются независимо. Вместо этого ученики в классах с лучшими учителями имеют особенно высокие результаты тестов (независимо от того, проходят ли они экспериментальное лечение), в то время как ученики в классах с худшими учителями имеют особенно низкие результаты тестов. Исследователь может сгруппировать свои стандартные ошибки на уровне класса, чтобы учесть этот аспект своего эксперимента.
Хотя этот пример очень конкретен, похожие проблемы возникают в самых разных условиях. Например, во многих настройках панельных данных (таких как разница в различиях ) кластеризация часто предлагает простой и эффективный способ учета отсутствия независимости между периодами в каждой единице (иногда называемой «автокорреляцией остатков»). Другое распространенное и логически отличное обоснование для кластеризации возникает, когда невозможно произвести случайную выборку из всей генеральной совокупности, и поэтому вместо этого выбираются кластеры, а затем единицы рандомизируются внутри кластера. В этом случае сгруппированные стандартные ошибки учитывают неопределенность, вызванную тем фактом, что исследователь не наблюдает за большими частями исследуемой совокупности.
Математическая мотивация
Полезную математическую иллюстрацию дает случай односторонней кластеризации в обычной модели наименьших квадратов (МНК). Рассмотрим простую модель с N наблюдениями, которые разделены на C кластеров. Пусть быть вектор результатов, матрица ковариаций, вектор неизвестных параметров, и вектор необъяснимых остатков:
Как это принято в моделях OLS, мы минимизируем сумму квадратов остатков, чтобы получить оценку :
Отсюда мы можем вывести классическую оценку «сэндвича»:
Обозначение дает потенциально более знакомую форму
Хотя можно разработать плагин оценщика, определив и разрешив , этот полностью гибкий оценщик не будет сходиться к as . Принимая во внимание допущения, которые практикующий специалист считает разумными, различные типы стандартных ошибок решают эту проблему по-разному. Например, классические гомоскедастические стандартные ошибки предполагают диагональ с идентичными элементами , что упрощает выражение для . Стандартные ошибки Хубера-Уайта предполагают диагональность, но диагональное значение варьируется, в то время как другие типы стандартных ошибок (например, стандартные ошибки Ньюи – Уэста , Моултона, пространственные SE Конли) накладывают другие ограничения на форму этой матрицы, чтобы уменьшить количество параметров, которые практикующему специалисту необходимо оценить.
Кластерные стандартные ошибки предполагают, что это блок-диагональ в соответствии с кластерами в выборке, с неограниченными значениями в каждом блоке, но с нулями в других местах. В этом случае можно определить и как внутриблочные аналоги и и вывести следующий математический факт:
Путем построения подключаемых матриц можно сформировать оценку для согласованности по мере того, как количество кластеров становится большим. Хотя статистически не доказано, что достаточное количество кластеров является достаточным, практики часто приводят число в диапазоне 30–50 и чувствуют себя комфортно, используя кластерные стандартные ошибки, когда количество кластеров превышает этот порог.
использованная литература
Надежна ли ваша стандартная ошибка?
Перевод
Ссылка на автора
Практическое руководство по выбору правильной спецификации
Управляющее резюме
- Проблема:Стандартные ошибки по умолчанию (SE), сообщаемые Stata, R и Python, являются правильными только при очень ограниченных обстоятельствах. В частности, эти программы предполагают, что ваша ошибка регрессии распределена независимо и одинаково. На самом деле это не так, что приводит к серьезным ошибкам типа 1 и типа 2 в тестах гипотез.
- Лечение 1:если вы используете OLS, вы должны кластеризовать SE по двум параметрам: индивидуально по годам.
- Лечение 2:если вы используете FE (фиксированный эффект), вы должны кластеризовать SE только на 1 измерение: индивидуальное.
- коды: Вот это ссылка на коды Stata, R и SAS для реализации кластеризации SE.
Если вам интересно об этой проблеме, пожалуйста, продолжайте читать. В противном случае, увидимся в следующий раз
План на день
В этом посте я собрал бы 69-страничный документ о здравой стандартной ошибке в чит-лист. Эта бумага Опубликованный профессором Митчеллом Петерсеном в 2009 году, на сегодняшний день собрал более 7 879 ссылок. Это остается библией для выбора правильной здравой стандартной ошибки.
Проблема
Обычная практика
В любом классе Stats 101 ваш профессор мог бы научить вас набирать «reg Y X» в Stata или R:
Вы приступаете к проверке своей гипотезы с указанными точечными оценками и стандартной ошибкой. Но в 99% случаев это было бы неправильно.
Ловушка
Чтобы OLS давал объективные и непротиворечивые оценки, нам нужно, чтобы термин ошибки epsilon был распределен независимо и одинаково:
Независимый означает, что никакие серийные или взаимные корреляции не допускаются:
- Последовательные корреляции:для одного и того же человека остатки в разные периоды времени коррелируют;
- Кросс-корреляция:различные индивидуальные остатки коррелируются внутри и / или между периодами.
Одинаковый означает, что все остатки имеют одну и ту же дисперсию (например, гомоскедастичность).
Визуализация проблемы
Давайте визуализируем i.i.d. предположение в дисперсионно-ковариационной матрице.
- Нет последовательной корреляции:все недиагональные записи в красных пузырьках должны быть 0;
- Нет взаимной корреляции:все диагональные записи должны быть одинаковыми — все записи в зеленых прямоугольниках должны быть 0;
- гомоскедастичность:диагональные записи должны быть одинаковыми константами.
Что не так, если вы используете SE по умолчанию без I.I.D. Ошибки?
Вывод выражения SE:
Стандартная ошибка по умолчанию — последняя строка в (3). Но чтобы получить от 1-й до последней строки, нам нужно сделать дополнительные предположения:
- Нам нужно предположение о независимости, чтобы переместить нас с 1-й строки на 2-ю в (3). Визуально все записи в зеленых прямоугольниках И все недиагональные записи в красных пузырьках должны быть 0.
- Нам нужно одинаково распределенное предположение, чтобы переместить нас со 2-й строки на 3-ю. Визуально все диагональные элементы должны быть точно такими же.
По умолчанию SE прав в ОЧЕНЬ ограниченных обстоятельствах!
Цена неправильного
Мы не знаем, будет ли заявленная SE переоценена или недооценена истинная SE. Таким образом, мы можем получить:
- Статистически значимый результат, когда эффекта нет в реальности. В результате команде разработчиков программного обеспечения и продукта может потребоваться несколько часов работы над каким-либо прототипом, который никак не повлияет на итоговую прибыль компании.
- Статистически незначимый результат, когда в действительности наблюдается значительный эффект. Это могло быть для тебя перерывом. Упущенная возможность Очень плохо
На самом деле ложный позитив более вероятен.Там нет недостатка в новичках машинного обучения студентов, заявляющих, что они нашли какой-то шаблон / сигнал, чтобы побить рынок. Однако после развертывания их модели работают катастрофически. Частично причина в том, что они никогда не думали о последовательной или взаимной корреляции остатков.
Когда это происходит, стандартная ошибка по умолчанию может быть в 11 раз меньше, чем истинная стандартная ошибка, что приводит к серьезной переоценке статистической значимости их сигнала.
Надежная стандартная ошибка на помощь!
Правильно заданная надежная стандартная ошибка избавит от смещения или, по крайней мере, улучшит его. Вооружившись серьезной стандартной ошибкой, вы можете безопасно перейти к этапу вывода.
Есть много надежных стандартных ошибок. Выбор неправильного средства может усугубить проблему!
Какую робастную стандартную ошибку я должен использовать?
Это зависит от дисперсионно-ковариантной структуры. Спросите себя, страдает ли ваш остаток от взаимной корреляции, последовательной корреляции или от того и другого? Напомним, что:
- Кросс-корреляция:в течение одного и того же периода времени разные индивидуальные остатки могут быть коррелированы;
- Последовательная корреляция:для одного и того же лица остатки за разные периоды времени могут быть коррелированы
Случай 1: Термин ошибки имеет отдельный конкретный компонент
Предположим, что это истинное состояние мира:
При условии независимости от отдельных лиц, правильная стандартная ошибка будет:
Сравните это с (3), у нас есть дополнительный член, который обведен красным. Превышение или недооценка сообщаемой стандартной ошибки OLS истинной стандартной ошибки зависит от знака коэффициентов корреляции, который затем увеличивается на число периодов времени T.
Где практическое руководство?
Основываясь на большем количестве теории и результатов моделирования, Петерсен показывает, что:
Вы не должны использовать:
- Стандартные ошибки Fama-MacBeth:он предназначен для работы с последовательной корреляцией, а не с перекрестной корреляцией между отдельными фирмами.
- Стандартные ошибки Ньюи-Уэста:он предназначен для учета последовательной корреляции неизвестной формы в остатках одного временного ряда.
Вы должны использовать:
- Стандартные кластерные ошибки:в частности, вы должны объединить вашу стандартную ошибку по фирмам. Обратитесь к концу поста для кодов.
Случай 2: Термин ошибки имеет компонент, зависящий от времени
Предположим, что это истинное состояние мира:
Правильная стандартная ошибка по существу такая же, как (7), если вы поменяете N и T.
Вы должны использовать:
- Стандартные ошибки Fama-MacBeth:так как это то, что он создан для Обратитесь к концу поста для кода Stata.
Случай 3: Термин «ошибка» имеет как твердое, так и временное влияние
Предположим, что это истинное состояние мира:
Вы должны использовать:
- Кластерная стандартная ошибка:кластеризацию следует проводить по двум параметрам — по годам. Обратите внимание, что это не настоящие стандартные ошибки, они просто создают менее предвзятую стандартную ошибку. Смещение становится более выраженным, когда в одном измерении всего несколько кластеров.
коды
Подробные инструкции и тестовые данные Stata, R и SAS от Petersen можно найти Вот, Для моей собственной записи я собираю список кода Stata здесь:
Случай 1: кластеризация по 1 измерению
Регресс зависимая_вариантная независимая_вариабельная, надежный кластер (cluster_variable)
Случай 2: Фама-Макбет
tsset firm_identifier time_identifier
fm independent_variable independent_variables, byfm (by_variable)
Случай 3: кластеризация по двум измерениям
cluster2 зависимая_ переменная independent_variables, fcluster (cluster_variable_one) tcluster (cluster_variable_two)
Случай 4: фиксированный эффект + кластеризация
xtreg variable_variable independent_variables, надежный кластер (cluster_variable_one)
Наслаждайтесь своим недавно найденным крепким миром!
До следующего раза