

Some notes on grouped or 'cell means' regression


Say you have some data reported as counts and, for whatever reason, want to fit a regression model on the data as-is in its grouped format. For example, you might want to fit a linear probability model to a summary table of binary yes/no count data. In econometrics this is sometimes called the 'cell means regression' or 'grouped data regression'.

When observations are collapsed into groups (cells defined by covariates, each with a count), you can fit a weighted regression to retrieve the correct coefficients. However, the naive standard errors will be wrong unless we adjust for the within-cell variance.

Schwiebert (2015) provides explicit formulas that show how regression statistics computed from individual data can be replicated from grouped (cell-means) data. This blog implements the formulas in R, focusing on the case of binary data.

Data

The Titanic dataset in the datasets package provides information in table format on the fate of passengers on the fatal maiden voyage of the ocean liner ‘Titanic’, summarised according to economic status (class), sex, age and survival. We'll use this to demonstrate.

First, we arrange the data into a data.frame object suitable for analysis. Cells with zero total counts contribute nothing to the model and are dropped.

# Use Titanic dataset as example
df <- as.data.frame(datasets::Titanic)

# Pivot wider - two separate columns of survived / non survived counts
df_wide <- tidyr::pivot_wider(
	data = df,
	names_from = Survived,
	values_from = Freq
)

# Calculate totals and proportions survived
df_wide <- within(df_wide, {
	Total <- Yes + No
	y <- Yes / Total
})

# Drop NA obs
(df_wide <- na.omit(df_wide))

The data now look like this, where 'Yes' and 'No' represent counts of passengers who did and did not survive (respectively), along with the proportion of survivors.

Class  Sex     Age    No   Yes  y           Total
1st    Male    Child  0    5    1.00000000  5
2nd    Male    Child  0    11   1.00000000  11
3rd    Male    Child  35   13   0.27083333  48
1st    Female  Child  0    1    1.00000000  1
2nd    Female  Child  0    13   1.00000000  13
3rd    Female  Child  17   14   0.45161290  31
1st    Male    Adult  118  57   0.32571429  175
2nd    Male    Adult  154  14   0.08333333  168
3rd    Male    Adult  387  75   0.16233766  462
Crew   Male    Adult  670  192  0.22273782  862
1st    Female  Adult  4    140  0.97222222  144
2nd    Female  Adult  13   80   0.86021505  93
3rd    Female  Adult  89   76   0.46060606  165
Crew   Female  Adult  3    20   0.86956522  23

Hand-rolled weighted regression parameter estimates

Next, we set up the design matrix, $\bar{X}$, and start calculating the regression. The totals of each cell form a weights vector, $w$, for the regression, and we calculate the weighted coefficients as $$\hat{\beta}=(\bar{X}^TW\bar{X})^{-1}\bar{X}^TW\bar{y}$$ where $W = \text{diag}(w)$ and $\bar{y}$ is the vector of cell averages. Here, the cell average represents the proportion of survivors.

# Construct the full-rank design matrix
X <- model.matrix(
	y ~ Class + Sex + Age,
	data = df_wide
)

# Get weights in vector and diag matrix formats
w <- df_wide$Total
W <- diag(w)

# Calculate weighted coefficients
wx <- t(X) %*% W
wxp_inv <- solve(wx %*% X)
wx_y <- wx %*% df_wide$y

The weighted regression coefficients are

> (b <- wxp_inv %*% wx_y)
                  [,1]
(Intercept)  0.5836454
Class2nd    -0.1860801
Class3rd    -0.3067344
ClassCrew   -0.1755538
SexFemale    0.4906798
AgeAdult    -0.1812957

This is a linear probability model, so the interpretation in terms of probability is direct - e.g. adults were 18 percentage points less likely to have survived than children, independent of all other factors¹.

There's nothing special happening here so far - the same thing could be done much more simply with lm:

> mod <- lm(y ~ Class + Sex + Age,
    data = df_wide,
    weights = Total)
> round(coef(mod), 4)
(Intercept)    Class2nd    Class3rd   ClassCrew   SexFemale    AgeAdult 
     0.5836     -0.1861     -0.3067     -0.1756      0.4907     -0.1813
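
For reference, the naive standard errors can be pulled straight from the fitted model object with vcov; as the next section explains, these are not correct for grouped data.

# Naive standard errors from the weighted lm fit
# (wrong for grouped data, as explained below)
sqrt(diag(vcov(mod)))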

Correcting the standard errors

If we run a regular OLS regression on the grouped data, the standard errors we get back will be wrong. The general solution provided by Schwiebert (2015) is to calculate a variance estimate to scale the estimated covariance matrix. $$ \hat{\sigma}^2= \frac{ w^T\bar{y^2}-2\hat{\beta}^T\bar{X}^TW\bar{y}+\hat{\beta}^T\bar{X}^TW\bar{X}\hat{\beta}} {N-K} $$ where $K$ is the number of coefficients, and $N$ is the total sample size (i.e. $\sum{w}$). Notice that the denominator uses the total number of individual observations, not the number of groups.

There's also an average squared outcome term here $\bar{y^2}$ - more on this later (it's important for generalising to other types of grouped regression). For now just note that $y^2=y$ in the case of binary data, so $\bar{y^2}=\bar{y}$, meaning we don't actually need a separate average squared outcome estimate.
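
As a quick sanity check of that point, take a hypothetical cell of binary outcomes:

> y_cell <- c(1, 0, 0, 1, 1)
> # For 0/1 outcomes, y^2 = y observation by observation, so the
> # cell mean of y^2 equals the cell mean of y
> mean(y_cell^2) == mean(y_cell)
[1] TRUE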

The estimated covariance matrix of $\hat{\beta}$ can then be found as

$$ \widehat{Cov[\hat{\beta}]}=(\bar{X}^TW\bar{X})^{-1}\hat{\sigma}^2 $$

This is implemented in R like so

# Number of parameters; for binary data the squared outcome equals the outcome
k <- length(b)
df_wide$y2 <- df_wide$y

# An estimate of σ² is computed as
wy2 <- t(w) %*% df_wide$y2
sigma2 <- (wy2 - 2 * t(b) %*% wx_y +
    t(b) %*% wx %*% X %*% b) / (sum(w) - k)

# Now calculate the adjusted SE
adj_vcov <- wxp_inv * as.vector(sigma2)

We can return the corrected standard errors like so

> sqrt(diag(adj_vcov))
(Intercept)    Class2nd    Class3rd   ClassCrew   SexFemale 
 0.04775514  0.03300931  0.02770768  0.02796772  0.02300518 
   AgeAdult 
 0.04096761

These results match those from a regression run on the individual-level data.
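
One way to check this yourself - a minimal sketch, not part of Schwiebert's paper - is to expand each cell back into its constituent 0/1 observations and refit with lm:

# Expand the grouped data back to one row per passenger
idx <- rep(seq_len(nrow(df_wide)), times = df_wide$Total)
df_long <- df_wide[idx, c("Class", "Sex", "Age")]
df_long$y <- unlist(Map(
	function(yes, no) rep(c(1, 0), c(yes, no)),
	df_wide$Yes, df_wide$No
))

# Coefficients and naive SEs from the individual-level fit should
# match the weighted coefficients and corrected SEs above
mod_long <- lm(y ~ Class + Sex + Age, data = df_long)
sqrt(diag(vcov(mod_long)))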

Heteroskedasticity-adjusted standard errors

An even better idea is to get robust standard errors that correct for heteroskedasticity. It turns out to be pretty easy.

To get heteroskedasticity-consistent standard errors at the cell level, we can use the White (HC0) estimator adapted to the weighted design. The robust covariance matrix estimator is

$$ \widehat{Cov[\hat{\beta}]}_{\text{White}}=(\bar{X}^TW\bar{X})^{-1} \, (\bar{X}^TW\text{diag}(\bar{e^2})\bar{X}) \, (\bar{X}^TW\bar{X})^{-1} $$

Where

$$ \bar{e^2}=\bar{y^2}-2y_{\text{pred}}\bar{y} + y_{\text{pred}}^2 $$

Again using R, I do it like this:

# Get the model predictions
y_pred <- fitted(mod)

# Calculate e^2
e2 <- df_wide$y2 - 2*df_wide$y * y_pred + y_pred^2

# Robust variance-covariance
rob_vcov <- wxp_inv %*% (wx %*% diag(e2) %*% X) %*% wxp_inv

Now we retrieve the corrected standard errors like so:

> sqrt(diag(rob_vcov))
(Intercept)    Class2nd    Class3rd   ClassCrew   SexFemale    AgeAdult 
 0.05488994  0.02953254  0.02817123  0.02909888  0.02389658  0.04788446
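
From here, a typical next step (my own sketch, not part of Schwiebert's formulas) is to form Wald t statistics and approximate 95% confidence intervals from the robust covariance matrix:

# Robust t statistics and approximate 95% confidence intervals
se_rob <- sqrt(diag(rob_vcov))
t_stat <- as.vector(b) / se_rob
ci <- cbind(
	lower = as.vector(b) - 1.96 * se_rob,
	upper = as.vector(b) + 1.96 * se_rob
)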

What if my data are not binary?

The same formulas and functions described above apply just as well to any kind of averaged data, but there's a slight complication. You would also need the average squared outcome term $\bar{y^2}$ to properly calculate the within-cell variance and adjust the standard errors accordingly.

In the case of a linear probability model, this doesn't matter - the count/proportion cells give you all the information you need. However, if you were working with a continuous outcome, you either need to calculate the average squared outcome yourself when aggregating the data, or hope that it has been reported alongside the cell means.
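
To illustrate with a made-up continuous example (hypothetical data, not part of the Titanic analysis), retaining $\bar{y^2}$ at aggregation time is just one extra column:

# Hypothetical continuous outcome: keep the cell mean of y^2
# alongside the cell mean of y when collapsing the raw data
set.seed(42)
raw <- data.frame(
	g = sample(c("a", "b", "c", "d"), 200, replace = TRUE),
	y = rnorm(200)
)
raw$y2 <- raw$y^2

# Cell means of y and y^2, plus cell counts as weights
agg <- aggregate(cbind(y, y2) ~ g, data = raw, FUN = mean)
agg$Total <- as.vector(table(raw$g))

# agg$y2 now plays the role of df_wide$y2 in the sigma2 and e2
# calculations above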


¹ Notice how the probability of surviving conditional on being a little girl in first class technically sums to 107% (0.5836 + 0.4907 ≈ 1.07). This model is wrong.