

Some notes on grouped or 'cell means' regression


Say you have some data reported as counts and, for whatever reason, want to fit a regression model on the data as-is in its grouped format. For example, you might want to fit a linear probability model to a summary table of binary yes/no count data. In econometrics this is sometimes called the 'cell means regression' or 'grouped data regression'.

When observations are collapsed into groups (cells defined by covariates, each with a count), you can fit a weighted regression to retrieve the correct coefficients. However, the naive standard errors will be wrong unless we adjust for the within-cell variance.

Schwiebert (2015) provides explicit formulas that show how regression statistics computed from individual data can be replicated from grouped (cell-means) data. This blog implements the formulas in R, focusing on the case of binary data.

Data

The Titanic dataset in the datasets package provides information in table format on the fate of passengers on the fatal maiden voyage of the ocean liner ‘Titanic’, summarised according to economic status (class), sex, age and survival. We'll use this to demonstrate.

First, we arrange the data into a data.frame object suitable for analysis. Cells with zero total counts contribute nothing to the model and are dropped.

# Use Titanic dataset as example
df <- as.data.frame(datasets::Titanic)

# Pivot wider - two separate columns of survived / non survived counts
df_wide <- tidyr::pivot_wider(
	data = df,
	names_from = Survived,
	values_from = Freq
)

# Calculate totals and proportions survived
df_wide <- within(df_wide, {
	Total <- Yes + No
	y <- Yes / Total
})

# Drop NA obs
(df_wide <- na.omit(df_wide))

The data now look like this, where 'Yes' and 'No' represent counts of passengers who did and did not survive (respectively), along with the proportion of survivors.

Class  Sex     Age    No   Yes  y           Total
1st    Male    Child  0    5    1.00000000  5
2nd    Male    Child  0    11   1.00000000  11
3rd    Male    Child  35   13   0.27083333  48
1st    Female  Child  0    1    1.00000000  1
2nd    Female  Child  0    13   1.00000000  13
3rd    Female  Child  17   14   0.45161290  31
1st    Male    Adult  118  57   0.32571429  175
2nd    Male    Adult  154  14   0.08333333  168
3rd    Male    Adult  387  75   0.16233766  462
Crew   Male    Adult  670  192  0.22273782  862
1st    Female  Adult  4    140  0.97222222  144
2nd    Female  Adult  13   80   0.86021505  93
3rd    Female  Adult  89   76   0.46060606  165
Crew   Female  Adult  3    20   0.86956522  23

Hand-rolled weighted regression parameter estimates

Next, we set up the design matrix, $\bar{X}$, and start calculating the regression. The totals of each cell form a weights vector, $w$, for the regression, and we calculate the weighted coefficients as $$\hat{\beta}=(\bar{X}^TW\bar{X})^{-1}\bar{X}^TW\bar{y}$$ where $W = \text{diag}(w)$ and $\bar{y}$ is the vector of cell averages. Here, the cell average represents the proportion of survivors.

# Construct the full-rank design matrix
X <- model.matrix(
	y ~ Class + Sex + Age,
	data = df_wide
)

# Get weights in vector and diag matrix formats
w <- df_wide$Total
W <- diag(w)

# Calculate weighted coefficients
wx <- t(X) %*% W
wxp_inv <- solve(wx %*% X)
wx_y <- wx %*% df_wide$y

The weighted regression coefficients are

> (b <- wxp_inv %*% wx_y)
                  [,1]
(Intercept)  0.5836454
Class2nd    -0.1860801
Class3rd    -0.3067344
ClassCrew   -0.1755538
SexFemale    0.4906798
AgeAdult    -0.1812957

This is a linear probability model, so the interpretation in terms of probability is direct - e.g. adults were 18 percentage points less likely to have survived than children, independent of all other factors¹.

There's nothing special happening here so far - the same thing could be done much more simply with lm:

> mod <- lm(y ~ Class + Sex + Age,
    data = df_wide,
    weights = Total)
> round(coef(mod), 4)
(Intercept)    Class2nd    Class3rd   ClassCrew   SexFemale    AgeAdult 
     0.5836     -0.1861     -0.3067     -0.1756      0.4907     -0.1813
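
For reference, the naive standard errors can be pulled straight from the fitted model object with vcov; as the next section explains, these are not correct for grouped data.

# Naive standard errors from the weighted lm fit
# (wrong for grouped data, as explained below)
sqrt(diag(vcov(mod)))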

Correcting the standard errors

If we run a regular OLS regression on the grouped data, the standard errors we get back will be wrong. The general solution provided by Schwiebert (2015) is to calculate a variance estimate to scale the estimated covariance matrix. $$ \hat{\sigma}^2= \frac{ w^T\bar{y^2}-2\hat{\beta}^T\bar{X}^TW\bar{y}+\hat{\beta}^T\bar{X}^TW\bar{X}\hat{\beta}} {N-K} $$ where $K$ is the number of coefficients, and $N$ is the total sample size (i.e. $\sum{w}$). Notice that the denominator uses the total number of individual observations, not the number of groups.

There's also an average squared outcome term here $\bar{y^2}$ - more on this later (it's important for generalising to other types of grouped regression). For now just note that $y^2=y$ in the case of binary data, so $\bar{y^2}=\bar{y}$, meaning we don't actually need a separate average squared outcome estimate.
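
As a quick sanity check of that point, take a hypothetical cell of binary outcomes:

> y_cell <- c(1, 0, 0, 1, 1)
> # For 0/1 outcomes, y^2 = y observation by observation, so the
> # cell mean of y^2 equals the cell mean of y
> mean(y_cell^2) == mean(y_cell)
[1] TRUE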

The estimated covariance matrix of $\hat{\beta}$ can then be found as

$$ \widehat{Cov[\hat{\beta}]}=(\bar{X}^TW\bar{X})^{-1}\hat{\sigma}^2 $$

This is implemented in R like so

# Number of parameters; for binary data the squared outcome equals the outcome
k <- length(b)
df_wide$y2 <- df_wide$y

# An estimate of σ² is computed as
wy2 <- t(w) %*% df_wide$y2
sigma2 <- (wy2 - 2 * t(b) %*% wx_y +
    t(b) %*% wx %*% X %*% b) / (sum(w) - k)

# Now calculate the adjusted SE
adj_vcov <- wxp_inv * as.vector(sigma2)

We can return the corrected standard errors like so

> sqrt(diag(adj_vcov))
(Intercept)    Class2nd    Class3rd   ClassCrew   SexFemale 
 0.04775514  0.03300931  0.02770768  0.02796772  0.02300518 
   AgeAdult 
 0.04096761

These results match those from a regression run on the individual-level data.
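
One way to check this yourself - a minimal sketch, not part of Schwiebert's paper - is to expand each cell back into its constituent 0/1 observations and refit with lm:

# Expand the grouped data back to one row per passenger
idx <- rep(seq_len(nrow(df_wide)), times = df_wide$Total)
df_long <- df_wide[idx, c("Class", "Sex", "Age")]
df_long$y <- unlist(Map(
	function(yes, no) rep(c(1, 0), c(yes, no)),
	df_wide$Yes, df_wide$No
))

# Coefficients and naive SEs from the individual-level fit should
# match the weighted coefficients and corrected SEs above
mod_long <- lm(y ~ Class + Sex + Age, data = df_long)
sqrt(diag(vcov(mod_long)))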

Heteroskedasticity-adjusted standard errors

An even better idea is to get robust standard errors that correct for heteroskedasticity. It turns out to be pretty easy.

To get heteroskedasticity-consistent standard errors at the cell level, we can use the White (HC0) estimator adapted to the weighted design. The robust covariance matrix estimator is

$$ \widehat{Cov[\hat{\beta}]}_{\text{White}}=(\bar{X}^TW\bar{X})^{-1} \, (\bar{X}^TW\text{diag}(\bar{e^2})\bar{X}) \, (\bar{X}^TW\bar{X})^{-1} $$

Where

$$ \bar{e^2}=\bar{y^2}-2y_{\text{pred}}\bar{y} + y_{\text{pred}}^2 $$

Again using R, I do it like this:

# Get the model predictions
y_pred <- fitted(mod)

# Calculate e^2
e2 <- df_wide$y2 - 2*df_wide$y * y_pred + y_pred^2

# Robust variance-covariance
rob_vcov <- wxp_inv %*% (wx %*% diag(e2) %*% X) %*% wxp_inv

Now we retrieve the corrected standard errors like so:

> sqrt(diag(rob_vcov))
(Intercept)    Class2nd    Class3rd   ClassCrew   SexFemale    AgeAdult 
 0.05488994  0.02953254  0.02817123  0.02909888  0.02389658  0.04788446
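
From here, a typical next step (my own sketch, not part of Schwiebert's formulas) is to form Wald t statistics and approximate 95% confidence intervals from the robust covariance matrix:

# Robust t statistics and approximate 95% confidence intervals
se_rob <- sqrt(diag(rob_vcov))
t_stat <- as.vector(b) / se_rob
ci <- cbind(
	lower = as.vector(b) - 1.96 * se_rob,
	upper = as.vector(b) + 1.96 * se_rob
)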

What if my data are not binary?

The same formulas and functions described above apply just as well to any kind of averaged data, but there's a slight complication. You would also need the average squared outcome term $\bar{y^2}$ to properly calculate the within-cell variance and adjust the standard errors accordingly.

In the case of a linear probability model, this doesn't matter - the count/proportion cells give you all the information you need. However, if you were working with a continuous outcome, you either need to calculate the average squared outcome yourself when aggregating the data, or hope that it has been reported alongside the cell means.
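
To illustrate with a made-up continuous example (hypothetical data, not part of the Titanic analysis), retaining $\bar{y^2}$ at aggregation time is just one extra column:

# Hypothetical continuous outcome: keep the cell mean of y^2
# alongside the cell mean of y when collapsing the raw data
set.seed(42)
raw <- data.frame(
	g = sample(c("a", "b", "c", "d"), 200, replace = TRUE),
	y = rnorm(200)
)
raw$y2 <- raw$y^2

# Cell means of y and y^2, plus cell counts as weights
agg <- aggregate(cbind(y, y2) ~ g, data = raw, FUN = mean)
agg$Total <- as.vector(table(raw$g))

# agg$y2 now plays the role of df_wide$y2 in the sigma2 and e2
# calculations above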


¹ Notice how the probability of surviving conditional on being a little girl in first class technically sums to 107% (0.5836 + 0.4907 ≈ 1.07). This model is wrong.