A Simple Implementation of Doubly Robust Estimation in Logistic Regression With Covariates Missing at Random
2009; Lippincott Williams & Wilkins; Volume: 20; Issue: 3; Language: English
DOI: 10.1097/ede.0b013e3181a0acc7
ISSN: 1531-5487
Topic(s): Statistical Methods and Bayesian Inference
Abstract

Missing covariate data is a feature common to most epidemiologic and clinical studies. It is now widely recognized that the routinely used complete-case analysis (which excludes subjects with any missing variable) can (i) yield highly inefficient estimators in regression analysis, and (ii) be severely biased when the data are not a completely random sample of the full data (thus not missing completely at random). In the simplest and most familiar setting, where the observed data arise from simple random sampling (ie, independent and identically distributed data), weighted estimating equations provide a powerful framework for performing regression analysis while appropriately accounting for covariate data missing at random but not necessarily missing completely at random.

In this issue of Epidemiology, Moore et al1 use 2-stage weighted (or re-weighted) estimating equations to perform logistic regression of Y, the indicator of "trying to lose weight in the last 12 months," on Z, the indicator of high cholesterol, which is missing by happenstance, controlling for a subset of fully observed covariates X collected in the third National Health and Nutrition Examination Survey (NHANES III). NHANES III used stratified, multistage sampling and thus does not follow a simple random sampling design. The authors use all sampled participants with complete observations, but propose to jointly account for individuals' differential propensity both to be selected into the survey sample and to have an observed cholesterol indicator. They note that this can be achieved by multiplying each participant's contribution to the complete-case logistic score equation by the product of the (known) inverse probability of being selected and the (unknown) inverse probability of having an observed cholesterol indicator (see equation (3) of their paper). The missingness mechanism, which stochastically determines which of the sampled participants have complete data, is unknown to the analyst and therefore must be estimated. Fortunately, missingness at random implies that an individual's probability of having an observed cholesterol indicator (R = 1) does not depend on whether or not she has high cholesterol, but may depend on the fully observed variables Y, X, and S. Here, S includes covariates that might not be of primary scientific interest, but are necessary, together with Y and X, to explain any association between R and Z. Under this assumption, one can therefore proceed by substituting the unknown weights with estimated weights,2 constructed with an estimate π̂ = π(X, Y, S; α̂) of Pr(R = 1|X, Y, S), where α̂ denotes the observed-data maximum likelihood estimator of the coefficients of the logistic regression of R on (X, Y, S). Moore et al essentially adapt this approach to their two-stage procedure, but require the stronger assumption that S can be dropped from π.

In this commentary, special consideration is given to issues of statistical efficiency and modeling robustness. First, even in the absence of model mis-specification, inverse-probability weighting can be highly inefficient. This is because, even when, as suggested by Moore et al, one fits a highly parameterized model for π (within the limits of the data), inverse-probability weighting still fails to make optimal use of the data (Y, X, S) observed among all individuals, including those missing Z.
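Before turning to the second issue, the re-weighted complete-case analysis just described can be sketched in a few lines of SAS. The data set mydata and the variables y, x, s, z, r (with r = 1 when the cholesterol indicator z is observed), and sel_wt (a known survey weight equal to the inverse probability of selection, set to 1 under simple random sampling) are hypothetical placeholders; a single auxiliary covariate s stands in for the vector S. This is only an illustration under these simplifying assumptions, not the exact two-stage analysis of Moore et al.

```sas
/* Estimate the missingness model pi(X, Y, S; alpha) = Pr(R = 1 | X, Y, S) */
proc logistic data=mydata;
  model r(event='1') = x y s;
  output out=wtdata p=pihat;
run;

/* omega = estimated inverse probability of having z observed;
   w = overall weight, folding in the (known) inverse selection probability */
data wtdata;
  set wtdata;
  omega = r / pihat;
  w = sel_wt * omega;
run;

/* Re-weighted complete-case logistic regression of y on x and z */
proc logistic data=wtdata;
  where r = 1;
  model y(event='1') = x z;
  weight w;
run;
```

The standard errors printed by the final PROC LOGISTIC step ignore both the estimation of pihat and the survey design; valid inference would instead rely on a sandwich or bootstrap variance.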
Second, an incorrect working regression model for the missingness mechanism π will invariably result in a biased re-weighted estimator of the parameter β = (β0, βx′, βz) indexing the main logistic regression model of interest, logit(Pr(Y = 1|X, Z; β)) = β0 + βx′X + βzZ.

In this short commentary, we briefly address these 2 issues by describing a straightforward implementation of an estimator that (a) partially protects against mis-specification of the model for the weights, by remaining consistent for β when the model for π is incorrect, provided that a working parametric model g(Z|Y, X, S; η), with unknown parameter η, for the conditional density of the covariate Z given (Y, X, S) is correct; (b) remains consistent for β when π is correctly modeled, irrespective of the correctness of the posited model for the conditional density of Z; and (c) is, in large samples and in the absence of model mis-specification, guaranteed to be more efficient than inverse-probability weighting. By virtue of satisfying (a) and (b), this estimator is an example of a so-called doubly robust estimator. Because we rarely know which, if any, of (a) or (b) holds, double robustness is a desirable property: it effectively provides the analyst with 2 separate chances of obtaining a consistent estimator of the logistic regression parameter β, instead of the single chance conferred by inverse-probability weighting and by conventional likelihood or Bayesian solutions to this problem.3–8 When compared with existing implementations of doubly robust estimators,8 the main appeal of the proposed approach is that it is easily performed using Proc NLMIXED in SAS and thus can be used in routine data analyses without much additional effort. The accompanying technical report9 provides an extension of the method to regression analysis with an outcome missing at random, and to the estimation of a marginal structural mean model for a point exposure under the assumption of no unmeasured confounding.

A Simple Doubly Robust Estimator

To simplify the presentation, I depart from the survey sampling framework of the paper; however, as I will later discuss, the described methodology applies equally to the survey setting. Thus, I assume that the observed data consist of n independent and identically distributed copies of O = (R, RZ, X, S, Y) obtained from a common distribution function. Then, under the missing at random assumption, and assuming that an individual's probability of having complete data is bounded away from 0, the estimator β̂dr obtained as follows satisfies (a), (b), and (c).

First, obtain the estimator η̂ that maximizes the complete-case working log-likelihood function $\sum_{i=1}^{n} R_i \log g(Z_i \mid Y_i, X_i, S_i; \eta)$. For Zi categorical, continuous, or a count, conventional models from the exponential family, as well as any other user-supplied parametric model, can be implemented in Proc NLMIXED of SAS. If Zi is binary, go to the next step; otherwise, for each observation i = 1, …, n and a user-supplied integer K, generate pseudo data Zi,k*, k = 1, …, K, from the fitted density ĝ = g(·|Yi, Xi, Si; η̂), and then go to the next step.
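For binary Z, as in the NHANES III example, step 1 reduces to a complete-case logistic regression of z on (y, x, s), and step 2 is skipped. A minimal SAS sketch, continuing the hypothetical data set and variable names introduced above, is:

```sas
/* Step 1 (binary z): complete-case ML fit of the working model
   g(z | y, x, s; eta). Rows with z missing are excluded from the fit
   automatically, but the OUTPUT statement still scores every subject,
   giving ghat = g(1 | y, x, s; etahat) for complete and incomplete cases alike. */
proc logistic data=wtdata;
  model z(event='1') = y x s;
  output out=drdata p=ghat;
run;
```

For a continuous Z one would instead fit, say, a normal linear model for z given (y, x, s) on the complete cases and, in step 2, draw K pseudo-values from the fitted conditional density for each subject.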
For binary Zi, find β̂dr that maximizes the weighted pseudo-log-likelihood function

$$ll(\beta^*) = \sum_{i=1}^{n}\Big[\hat\omega_i \log f(Y_i \mid X_i, Z_i; \beta^*) + (1-\hat\omega_i)\sum_{z=0}^{1} g(z \mid Y_i, X_i, S_i; \hat\eta)\,\log f(Y_i \mid X_i, z; \beta^*)\Big],$$

where $\log f(Y_i \mid X_i, z; \beta^*) = Y_i \log \Pr(Y = 1 \mid X_i, z; \beta^*) + (1-Y_i)\log \Pr(Y = 0 \mid X_i, z; \beta^*)$ is the log-likelihood contribution for logistic regression, and $\hat\omega_i = R_i/\pi(X_i, Y_i, S_i; \hat\alpha)$ is the estimated inverse-probability-of-missingness weight. Categorical Zi can generally be handled in a fashion similar to the binary case, provided the number of categories is not excessive. For other types of variable Zi, let β̂dr = β̂dr(K) be the maximizer of

$$ll_K(\beta^*) = \sum_{i=1}^{n}\Big[\hat\omega_i \log f(Y_i \mid X_i, Z_i; \beta^*) + (1-\hat\omega_i)\,\frac{1}{K}\sum_{k=1}^{K} \log f(Y_i \mid X_i, Z^*_{i,k}; \beta^*)\Big],$$

where Zi,k*, k = 1, …, K, are obtained in step 2. Maximizing either ll(β*) or llK(β*) can be done in SAS Proc NLMIXED with the "model y ~ general(·)" statement, where · stands for an individual's contribution to either ll(β*) or llK(β*), whose functional form can be specified by the user. See the example of programming code provided in the Appendix.

An informal argument for the correctness of the approach comes from noting that the first derivative of, say, ll(β*) gives the estimating function

$$\sum_{i=1}^{n}\Big[\hat\omega_i\, u(Y_i, Z_i, X_i; \beta^*) + (1-\hat\omega_i)\,\hat E\{u(Y_i, Z, X_i; \beta^*) \mid Y_i, X_i, S_i\}\Big],$$

where $u(Y, Z, X; \beta^*) = [1, X', Z]'\{Y - \Pr(Y = 1 \mid X, Z; \beta^*)\}$ is the logistic regression score function evaluated at β*, and $\hat E(\cdot \mid Y_i, X_i, S_i)$ stands for the conditional expectation with respect to ĝ, the estimated density function of Z given Y, X, and S; this estimating function is exactly an empirical version of the estimating function previously derived and used for efficiency purposes2,10 and subsequently shown to be doubly robust.6–8 The first term of the equation corresponds to inverse-probability weighting for logistic regression and only uses complete cases. The role of the second term is to make optimal use of all of the observed data: it serves to recover information by leveraging any existing correlation between Zi and (Yi, Xi, Si) to "impute" expected contributions to the estimating function for both complete and incomplete observations. When both models for π and g are correct, one can show that β̂dr is generally asymptotically more efficient than the inverse-probability weighting estimator2 used by Moore et al,1 which solves $\sum_{i=1}^{n}\hat\omega_i\, u(Y_i, Z_i, X_i; \beta^*) = 0$. A similar result holds for llK(β*), with $\hat E(\cdot \mid Y_i, X_i, S_i)$ replaced by an average over the pseudo-observations Zi,k*, k = 1, …, K. When Z is continuous, this average is in effect evaluating an integral by Monte Carlo simulation, thus avoiding the need for numerical integration. The choice of K can only affect the efficiency of the resulting estimator and not its consistency. When all models are correct, smaller values of K will invariably lead to less efficiency than larger ones; however, values of K between 5 and 15 should probably suffice in most applications.

The estimator β̂dr is asymptotically normal; therefore, formal inferences on β can be based on Wald-type confidence intervals constructed with the nonparametric bootstrap estimator of the variance. It is important to note that ll(β*) (equivalently, llK(β*)) cannot be viewed as a proper log-likelihood function; rather, it serves as a device to obtain an estimator with properties (a), (b), and (c). Finally, the proposed method can be applied to vector-valued Z with continuous, discrete, or mixed-type components, and is thus quite general.
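To make the recipe concrete, here is a minimal Proc NLMIXED sketch of this maximization for binary Z with a single covariate x, continuing the hypothetical variables created above (omega and ghat); it is an illustration of the "model y ~ general(·)" device under these simplifying assumptions, not the Appendix code of the article.

```sas
/* Step 3 (binary z): maximize ll(beta*).  Assumed variables on 'drdata':
   y, x, z (missing when r = 0), r, omega = r/pihat, ghat = g(1 | y, x, s; etahat). */
data drdata2;
  set drdata;
  zfill = z;
  if r = 0 then zfill = 0;  /* placeholder so arithmetic stays nonmissing;
                               the corresponding term receives weight omega = 0 */
run;

proc nlmixed data=drdata2;
  parms b0=0 bx=0 bz=0;
  /* complete-case contribution, evaluated at the observed z */
  p_obs  = 1 / (1 + exp(-(b0 + bx*x + bz*zfill)));
  ll_cc  = y*log(p_obs) + (1 - y)*log(1 - p_obs);
  /* expected contribution under ghat, summing over z = 0 and z = 1 */
  p1     = 1 / (1 + exp(-(b0 + bx*x + bz)));
  p0     = 1 / (1 + exp(-(b0 + bx*x)));
  ll_exp = ghat*(y*log(p1) + (1 - y)*log(1 - p1))
         + (1 - ghat)*(y*log(p0) + (1 - y)*log(1 - p0));
  ll     = omega*ll_cc + (1 - omega)*ll_exp;
  model y ~ general(ll);
run;
```

For the survey setting discussed in the closing section, each subject's ll would additionally be multiplied by the known selection weight (sel_wt above). The standard errors reported by NLMIXED treat omega and ghat as known and ll as a genuine log-likelihood; as noted above, confidence intervals should instead be obtained by the nonparametric bootstrap.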
A Simulation Study

In a small simulation study aimed at illustrating the finite-sample behavior of this estimator compared with inverse-probability weighting, I simulated 1000 random samples of size n = 500 of independent and identically distributed data under a generating mechanism that does the following: it samples S and X from independent standard normal densities; generates dichotomous Z and Y from a joint probability mass function with compatible conditionals, so that $\mathrm{logit}\, f(1 \mid Z, X; \beta) = \beta_0 + \beta_z Z + \beta_{xz} ZX + \beta_x X$ and $\mathrm{logit}\, g(1 \mid Y, X, S; \eta) = \eta_0 + \eta_y Y + \eta_{yx} YX + \eta_x X + \eta_s S$, where $\eta_y = \beta_z$ and $\eta_{yx} = \beta_{xz}$; and finally obtains R from Bernoulli sampling with success probability given by $\mathrm{logit}\, \pi(X, Y, S; \alpha) = \alpha_0 + \alpha_y Y + \alpha_s S + \alpha_x X + \alpha_{yx^2} YX^2 + \alpha_{sy} SY + \alpha_{s^2x} S^2X$, thus imposing MAR. (A sketch of a data-generating step of this kind is given below, after the Survey Sampling discussion.)

Table 1 reports results from fitting β̂dr and β̂ipw (the inverse-probability-weighted estimator) to the observed data (RZ, R, X, Y, S) under 4 different model specifications. First, in estimating β̂dr and β̂ipw, both π and g are correctly specified ("all true" row of the Table); next, π is mis-specified by setting $\alpha_{yx^2} = \alpha_{sy} = \alpha_{s^2x} = 0$ ("wrong π" row); then g is mis-specified by setting $\eta_{yx} = 0$ ("wrong g" row); and finally both g and π are mis-specified as above ("all wrong" row).

TABLE 1: Simulation Results

Under correct specification of both π and g (corresponding to the first 2 rows of results in the Table), both estimators have finite-sample bias of comparable size across parameter estimates. As predicted by theory, β̂dr is more efficient than β̂ipw, with the variance of β̂dr ranging between 87% and 92% of that of β̂ipw across parameters. Under mis-specification of g, both estimators maintain small finite-sample bias, and neither appears to clearly dominate with respect to efficiency across all parameters. Next, in agreement with theory, β̂ipw is severely biased under mis-specification of π, whereas β̂dr again maintains negligible bias. Under mis-specification of both π and g, the bias of both estimators is larger, as expected.

Survey Sampling

I have briefly demonstrated a simple implementation of a doubly robust estimator of the parameters of a logistic regression model with covariates missing at random. Although the method is described under random sampling, it is easily adapted to the survey sampling setting: this is achieved by simply weighting each person's contribution to the pseudo-log-likelihood by the inverse of his or her probability of being selected into the sample. I revisited the NHANES III data used by Moore et al1 and obtained a doubly robust estimate of β using my approach. Remarkably, out of 14 logistic regression coefficients, my estimated effects agreed with theirs for all but 3 parameters: education, 12 years = 0.282 (instead of 0.134 reported by Moore et al) and >12 years = 0.572 (instead of 0.356); routine health care visit <2 years = 0.493 (instead of 0.703); and history of premature heart disease = 0.202 (instead of 0.002). However, I did not obtain confidence intervals for these estimates, as the nonparametric bootstrap in this setting requires taking into account the design of the study, a task beyond the scope of this commentary.
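Returning to the simulation study above, the following is a minimal sketch of the data-generating step for a single sample; the numeric coefficients are illustrative placeholders, not the values used to produce Table 1. One way to generate dichotomous (Y, Z) with the two compatible logistic conditionals described earlier is to draw them jointly from the bivariate binary distribution with mass function proportional to exp(a·y + b·z + c·y·z), whose conditionals are exactly those two regressions when a = β0 + βx X, b = η0 + ηx X + ηs S, and c = βz + βxz X = ηy + ηyx X.

```sas
/* Illustrative data-generating sketch for one simulated sample (n = 500).
   The coefficients below are placeholders, not those used in the article. */
data sim;
  call streaminit(2009);
  do i = 1 to 500;
    s = rand('normal');
    x = rand('normal');
    /* joint pmf of (y, z) given (x, s): p(y, z) proportional to exp(a*y + b*z + c*y*z) */
    a = -0.5 + 0.5*x;            /* beta0 + betax*x                    */
    b = -0.3 + 0.4*x + 0.6*s;    /* eta0  + etax*x + etas*s            */
    c =  1.0 + 0.5*x;            /* betaz + betaxz*x = etay + etayx*x  */
    p00 = 1;       p10 = exp(a);
    p01 = exp(b);  p11 = exp(a + b + c);
    u = rand('uniform') * (p00 + p10 + p01 + p11);
    if      u < p00             then do; y = 0; z = 0; end;
    else if u < p00 + p10       then do; y = 1; z = 0; end;
    else if u < p00 + p10 + p01 then do; y = 0; z = 1; end;
    else                             do; y = 1; z = 1; end;
    /* MAR missingness of z: depends on (y, x, s) but not on z */
    lp = 0.2 + 0.5*y - 0.5*s + 0.5*x + 0.3*y*x*x + 0.3*s*y + 0.3*s*s*x;
    r  = rand('bernoulli', 1/(1 + exp(-lp)));
    if r = 0 then z = .;
    output;
  end;
  keep y x s z r;
run;
```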
With very few changes, the approach proposed herein equally applies to linear and log-linear regression with missing covariates, most commonly used for continuous and count data respectively; details are omitted but are easily deduced from the accompanying technical report.9