Self-consistent estimation of mean response functions and their derivatives
2011; Wiley; Volume 39, Issue 2; Language: English
DOI: 10.1002/cjs.10104
ISSN: 1708-945X
Authors: Richard Charnigo, Cidambi Srinivasan
Topic(s): Advanced Statistical Methods and Models
Abstract: Many methods have been developed for the nonparametric estimation of a mean response function, but most of these methods do not lend themselves to simultaneous estimation of the mean response function and its derivatives. Recovering derivatives is important for analyzing human growth data, studying physical systems described by differential equations, and characterizing nanoparticles from scattering data. In this article the authors propose a new compound estimator that synthesizes information from numerous pointwise estimators indexed by a discrete set. Unlike spline and kernel smooths, the compound estimator is infinitely differentiable; unlike local regression smooths, the compound estimator is self-consistent in that its derivatives estimate the derivatives of the mean response function. The authors show that the compound estimator and its derivatives can attain essentially optimal convergence rates in consistency. The authors also provide a filtration and extrapolation enhancement for finite samples, and the authors assess the empirical performance of the compound estimator and its derivatives via a simulation study and an application to real data. The Canadian Journal of Statistics 39: 280–299; 2011 © 2011 Statistical Society of Canada
Our goal is to simultaneously estimate µ(x) and up to J of its derivatives. To this end, we propose a new estimator, infinitely differentiable in x and not requiring the specification of orthogonal basis functions, that enjoys the following self-consistency property. Property 1. An estimator of µ(x) and companion estimators of its first J derivatives are self-consistent if, for each j between 1 and J, the jth derivative of the estimator of µ(x) exists and equals the jth companion estimator at every point of the domain. For brevity, when the companion estimators are clear from the context, we say simply that the estimator is self-consistent.

Figure 1 caption: Panel (a) shows the local regression estimate of the first derivative for the first girl in the Berkeley Growth Study. We used the locfit.raw function from the locfit package for R. The nearest neighbors smoothing parameter was set to 0.4, but otherwise the default settings were used. Panel (b) shows the local regression estimate of the second derivative for the same girl. The first derivative estimate suggests that the girl's growth spurt peaked at 10.5 years, while the second derivative estimate suggests 10.1 years.

The rest of this article is organized as follows. Section 2 formally defines the compound estimator.
Section 3 establishes convergence rates for the compound estimator and its derivatives. If the pointwise estimators possess asymptotically optimal convergence rates, then the compound estimator and its derivatives achieve essentially those same rates. A finite-sample enhancement is proposed in Section 4 to improve estimation of µ(x) and its derivatives for x close to the boundaries ±1. Section 5 provides guidance on tuning parameter selection. Section 6 includes results from a simulation study and an application to real data. Section 7 contains a discussion that emphasizes the advantages of the compound estimator over existing nonparametric regression methods.

Let I be a compact interval, and let Mn be a nondecreasing sequence of nonnegative integers. Partition [−1, 1] into intervals of equal width, and let In be the set of all interval midpoints that fall inside I. We refer to the elements of In as centering points. For each centering point a in In, let the pointwise polynomial estimator centered at a be defined as in (2). Relation (3) motivates us to define the compound estimator as a weighted average of these pointwise estimators, the weights themselves varying in x with larger weights assigned to the pointwise estimators for which |x − a| is small. For this strategy to succeed, the weights must vary smoothly in x. Also, since the conclusion of (3) does not hold for a fixed neighborhood of a, the number of centering points must increase with the sample size.

The proof of Theorem 1 is provided in the Appendix. The proof is constructive in that it explicitly prescribes βn and Mn. We emphasize that a single choice of βn and Mn yields the conclusion of Theorem 1 for all j up to J. Theorem 1 does not explicitly impose assumptions on the xi and the error terms in model (1); the assumptions implicitly imposed are those required for (7) or (8).

We describe a finite-sample enhancement of the compound estimator that entails filtering out some of the noise in model (1) and extrapolating to overcome boundary issues. Imagine having ideal data, in which each response is the true mean µ(xi), instead of the actual noisy data (xi, Yi).
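The weighted-average construction just described can be sketched in code. The authors' implementation is in R, with weights given by (4); the Python sketch below is a simplified analog with hypothetical choices of our own (Gaussian-shaped weights, a rectangular fitting window, and the names local_poly_fit and compound_estimate), not the authors' method.

```python
import numpy as np

def local_poly_fit(x, y, a, h, J):
    """Least-squares fit of a degree-J polynomial in (x - a) over a
    rectangular window of half-width h; coef[j] estimates mu^(j)(a)/j!."""
    mask = np.abs(x - a) <= h
    X = np.vander(x[mask] - a, J + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
    return coef

def compound_estimate(x0, x, y, centers, h, J, beta):
    """Smoothly weighted average of the pointwise polynomial fits.
    Gaussian-shaped weights in |x0 - a| stand in for the paper's weights (4)."""
    w = np.exp(-beta * (x0 - centers) ** 2)
    w /= w.sum()
    est = 0.0
    for a, wa in zip(centers, w):
        coef = local_poly_fit(x, y, a, h, J)
        # Evaluate the Taylor-type polynomial centered at a.
        est += wa * np.polyval(coef[::-1], x0 - a)
    return est
```

Because the weights and each pointwise polynomial are infinitely differentiable in x0, the combined estimate inherits infinite differentiability, which is the property that spline and kernel smooths lack.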
There would still be some work to do in estimating µ(x) and its derivatives, since a mean response function is not uniquely determined by n data points. However, we would anticipate more success in estimating µ(x) and its derivatives with ideal data than with noisy data. Of course, ideal data are unavailable because the µ(xi) are unknown. Even so, better approximations to the µ(xi) are available than the Yi. A simple option is to use compound estimation itself, with J temporarily set to 0 in the polynomials (2), to provide approximations to the µ(xi). The actual noisy data can then be replaced with synthetic data, the 0 subscript in the corresponding estimator signifying the temporary choice of J. Compound estimation, with J restored to an appropriately large value, can then be applied to the synthetic data instead of to the actual noisy data.

Boundary effects, however, are an issue for finite samples: the pointwise estimators may not perform well at centering points near ±1. This issue might be overcome if we could somehow collect data with xi outside [−1, 1]. Yet, we assumed in model (1) that the xi lie in [−1, 1]. A way out of this dilemma invokes our assumption that µ(x) may be smoothly extended from [−1, 1] to an open set containing [−1, 1], along with the fact that the estimator in (12) obviously has a smooth extension outside [−1, 1].

Compound estimation requires the selection not only of tuning parameters for the pointwise estimators but also of βn and Mn. We now present a strategy by which data analysts may select these tuning parameters. Our strategy, which we employ in the simulation study of Section 6.1, entails using the GCp criterion developed by Charnigo, Hall & Srinivasan (2011). The following description is only a brief sketch, intended to make the present article self-contained.
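The filtration idea can be sketched minimally as follows, using a Gaussian local-constant (J = 0) smoother as a stand-in for the paper's pointwise estimators (the function name presmooth and the kernel choice are ours, for illustration only): the noisy responses are replaced by smoothed values, and because the smoother is defined at any evaluation point, synthetic responses can also be generated on a grid extending slightly beyond [−1, 1].

```python
import numpy as np

def presmooth(x, y, grid, bw):
    """Local-constant (J = 0) Gaussian smooth: a stand-in for the
    filtration step, producing synthetic responses at the grid points."""
    d = grid[:, None] - x[None, :]
    w = np.exp(-0.5 * (d / bw) ** 2)
    return (w * y).sum(axis=1) / w.sum(axis=1)

# Synthetic data can be requested on an enlarged grid, e.g. [-1, 1.3]:
# the smooth extends naturally beyond the observed design points.
```

The main-phase compound estimation is then applied to the synthetic responses rather than to the raw Yi.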
Interested readers may consult Charnigo, Hall & Srinivasan (2011) for details, including theoretical justifications and empirical demonstrations that the GCp criterion outperforms competing approaches to tuning parameter selection across a variety of nonparametric regression methods for derivative estimation (including kernel smoothing, local regression, and compound estimation).

Remark 12. One may perceive intuitively that a large degree of smoothing during pointwise estimation should be accompanied by a modest degree of smoothing during the combination of polynomials (2), or vice versa. Noting that a large βn corresponds to a modest degree of smoothing during the combination of polynomials (2), one may then suspect that a large bandwidth for pointwise estimation ought to be combined with a large βn, or else a small bandwidth ought to be combined with a small βn. Yet, our experience has been that the results of compound estimation are much more sensitive to the bandwidth for pointwise estimation than to βn. In particular, selection of βn does not seem tethered to specification of the bandwidth for pointwise estimation.

Remark 13. If one performs filtration and extrapolation preprocessing, then one must also set tuning parameters during that preprocessing as well as specify κ. In principle all of this can be accomplished using the GCp criterion, but in practice doing so makes the minimization of the GCp criterion unwieldy, since the dimension of Λn is then quite large. Our experience is that the results of compound estimation are not particularly sensitive to the choice of κ and that κ = 0.3 works well. Hence, we suggest deviating from κ = 0.3 only if there is a subject matter consideration warranting a different choice.
Moreover, we suggest assuming that the Mn chosen for the main phase of compound estimation will work well during the filtration and extrapolation preprocessing, that some fixed multiple of the βn chosen for the main phase (say, 10βn) will suffice for the preprocessing, and that some fixed multiple of the pointwise estimator bandwidth chosen for the main phase (say, 1/2 this bandwidth) will suffice for the preprocessing. The reason for using different tuning parameters during the preprocessing than during the main phase is to ensure that the preprocessing does not smooth away valuable information contained in the data about the derivatives of the mean response function.

We used the oscillatory test function to generate 100 data sets of size n = 125 from model (1) with equispaced xi. We then applied three nonparametric regression methods to each data set.

Method 1: Compound estimation. First, perform filtration and extrapolation preprocessing as described in Section 4 with κ = 0.3; use bandwidth h/2 for the local regression pointwise estimators based on local polynomials of degree 2 and rectangular weights, use the specified centering points, and use 10βn instead of βn in (4) during this preprocessing. The result of this step is synthetic data as indicated in the last paragraph of Section 4. Second, apply compound estimation to the synthetic data with J = 7; use bandwidth h for the local regression pointwise estimators based on local polynomials of degree 7 and rectangular weights, use the specified centering points, and use βn in (4). The result of this step is an estimator of the jth derivative for 0 ≤ j ≤ J. Third, calculate the GCp criterion as defined in (14) with k = 5, in accordance with (13); the result of this step is a proxy for estimation error. The GCp-selected estimator is the one whose tuning parameters minimize this criterion over the candidate values.

Method 2: Local regression. Consider a set of candidate bandwidths; use bandwidth h for estimation by local regression based on local polynomials of degree 7 and rectangular weights.
This method is not self-consistent but, in light of Stone's (1980) optimality results, may be regarded as a benchmark. This method is also amenable to tuning parameter selection by GCp. The locfit.raw function from the locfit package for R was employed to perform local regression.

Method 3: Spline smoothing. Consider a set of candidate degrees of freedom (df); use these degrees of freedom for estimation by smoothing splines based on piecewise polynomials of degree 7. These piecewise polynomials were then differentiated to estimate the derivatives of µ(x). This method too is amenable to tuning parameter selection by GCp. The degrees of freedom for spline smoothing were chosen to match the degrees of freedom for local regression, defined as the trace of the hat matrix for local regression. The smooth.Pspline function from the pspline package for R was employed to perform spline smoothing.

We repeated the above with 100 data sets of size n = 500, except that k was then set to 20 in the calculation of (14). Results appear in Tables 1 and 2 for compound estimation, local regression, and spline smoothing with tuning parameters selected by the GCp criterion. Expanded versions of Tables 1 and 2, providing results for all tuning parameters considered (whether or not selected by the GCp criterion), are available in the Online Supplement at www.richardcharnigo.net/SelfConsOnline.pdf. The columns 0 through 6 present the average value over the 100 data sets of (17) for 0 ≤ j ≤ 6, while the column RSD presents the average value of (18).

Taking an 80% improvement as a cutoff for successful estimation, we find that compound estimation with GCp-selected tuning parameters successfully recovers six derivatives when n = 125 or n = 500. Local regression recovers only two derivatives when n = 125 and three derivatives when n = 500, while spline smoothing recovers four derivatives when n = 125 and five derivatives when n = 500.
Moreover, compound estimation fares better with respect to the RSD (18) than either local regression or spline smoothing. To explore the performance of compound estimation for a mean response function (and derivatives) without oscillations, we repeated all of the above with the monotone test function. Results appear in Tables 3 and 4 for compound estimation, local regression, and spline smoothing with GCp-selected tuning parameters. Expanded versions of Tables 3 and 4, providing results for all tuning parameters considered, are available in the Online Supplement at www.richardcharnigo.net/SelfConsOnline.pdf. Compound estimation with GCp-selected tuning parameters successfully recovers one derivative when n = 125, compared to none for local regression and spline smoothing. However, local regression and spline smoothing do recover one derivative when n = 500.

Juxtaposing the results for the oscillatory test function in Tables 1 and 2 against those for the monotone test function in Tables 3 and 4 reveals that the number of derivatives that can be successfully estimated—whether by compound estimation or another nonparametric regression method—may depend greatly on the mean response function itself. Our perceptions, based on the aforementioned results as well as on the results of other simulation studies not discussed herein, are that derivative estimation will be more successful when: (i) the derivatives are large in magnitude relative to the noise; and, (ii) the derivatives do not have their extrema at the boundaries of the design interval. Perception (i) relates to the effective amplification of the error terms by the process of differentiation. Consider, for example, that the naive difference quotient approximation (Yi+1 − Yi)/(xi+1 − xi) has noise variance on the order of n² if the xi are equispaced. Thus, unless the signal to noise ratio is large, the derivatives will be drowned out by the amplification of the error terms. Perception (ii) relates to the well-known difficulties of estimating mean response functions and their derivatives near the boundaries of the design interval.
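The order-n² variance amplification in perception (i) is easy to verify numerically. In this sketch (our construction, for illustration), equispaced design points on [−1, 1] have spacing 2/(n − 1), so the pure-noise part of the difference quotient has variance 2σ²/(2/(n − 1))², which grows like n².

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 500, 0.2
x = np.linspace(-1, 1, n)             # equispaced design, spacing 2/(n-1)
eps = rng.normal(0, sigma, n)         # noise alone: the smooth signal
dq_noise = np.diff(eps) / np.diff(x)  # contributes little to this variance

v_emp = dq_noise.var()
v_theory = 2 * sigma ** 2 / (2 / (n - 1)) ** 2   # = sigma^2 (n-1)^2 / 2
```

With σ = 0.2 and n = 500 this variance is about 5000, so a derivative of modest size is swamped unless substantial smoothing is applied.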
We consider the data from the first 10 girls in the Berkeley Growth Study (Ramsay & Silverman 2002). Model (1) relates the measured height Yi of a girl to her age xi. We employed compound estimation to recover a height function, a velocity function (first derivative), and an acceleration function (second derivative) for each of the 10 girls. Since our code was written for an explanatory variable taking values in [−1, 1], we estimated on a linearly rescaled age variable in [−1, 1] and then transformed the estimates back to the original age scale.

Compound estimation was carried out for each of the 10 girls following the implementation steps in Section 6.1, subject to the following modifications. First, we set k = 1 for calculation of the GCp criterion (14). Second, we fixed Mn = 3. (The inclusion of Mn = 2 in the simulation study of Section 6.1 was for illustrative purposes, not because the choice Mn = 2 was perceived to have merit.) Third, instead of generating synthetic data on [−1.3, 1.3] during filtration and extrapolation preprocessing, we generated synthetic data on [−1, 1.3]. This third modification was made because z = −1.3 corresponds to x = −1.55, which is not biologically meaningful.

Panel (a) of Figure 2 shows estimated height curves and may be compared to Figure 6.1 of Ramsay & Silverman (2002), who modeled the data with smoothing splines and then solved a differential equation to recover µ(x). Panel (b) shows the estimated velocity for the first girl and is analogous to Figure 6.4 of Ramsay & Silverman (2002). Our velocity estimate is generally similar to that of Ramsay & Silverman (2002), although ours is not as large at 1 year of age.

Figure 2 caption: Panel (a) depicts estimated height functions for the first 10 girls in the Berkeley Growth Study. Panel (b) shows the estimated velocity for the first girl, while panel (c) shows the estimated velocities for all 10 girls. Panel (d) depicts estimated accelerations for all 10 girls along with the average of the estimated accelerations (solid heavy line).
The estimates in panels (a) through (d) were obtained from compound estimation; see Section 6.2 of the main text for implementation details. Panels (a), (b), and (d) may be compared to Figures 6.1, 6.4, and 6.7 in Ramsay & Silverman (2002).

Panel (c) of Figure 2 presents the estimated velocities for all 10 girls. One of the estimated velocities clearly dips below zero at 17 years of age. This phenomenon is real, at least to the extent that we believe the raw data: the girl's height measurements exhibited a decreasing pattern over 2 years, falling from 164.4 cm at age 16 to 164.1 cm at age 17 and 163.8 cm at age 18. Panel (d) portrays the estimated accelerations and is analogous to Figure 6.7 of Ramsay & Silverman (2002). Because there is considerable variation among the 10 girls, we included a solid heavy line representing the average of the acceleration estimates. The average of our acceleration estimates is fairly similar to that of Ramsay & Silverman (2002).

The convergence rate from Corollary 1 suggests that almost nothing is lost asymptotically when one switches from a local regression smooth to the compound estimator. However, we agree with Loader (1999) that asymptotic performance should not be the sole criterion for judging a nonparametric regression method. Indeed, two salient points for judging the compound estimator in relation to local regression are that: (i) when using the compound estimator, a data analyst will never be confronted by contradictory results like those in panels (a) and (b) of Figure 1 or by the other problems noted in the examples of Section 1; and, (ii) as shown by our simulation study, the compound estimator can substantially outperform local regression in finite samples.

Unlike local regression, a spline smooth in the form of a degree J piecewise polynomial is self-consistent, its derivative estimates being the direct derivatives of the fitted piecewise polynomial. However, spline smooths may become very unstable as one approaches the maximum number of differentiations that can be performed on the piecewise polynomial.
Panel (a) of Figure 3 shows an order seven spline estimate for the fifth derivative of the oscillatory test function from Section 6.1. Apart from some boundary issues, the spline estimate for the fifth derivative is reasonable. Panel (b) shows the corresponding estimate for the sixth derivative, which is not reasonable. Self-consistency alone does not ensure that successful estimation of µ(x) is accompanied by a good recovery of its derivatives. On the other hand, the combination of self-consistency with infinite differentiability of the estimator—not just infinite differentiability of the target µ(x)—tends to protect against anomalies like that in Figure 3. As such, the self-consistent and infinitely differentiable compound estimator often fares well in recovering the derivatives when µ(x) has been estimated well. Thus, even though a spline smooth is also self-consistent, the compound estimator can be a more viable approach to simultaneous estimation of µ(x) and its derivatives.

Figure 3 caption: Panel (a) depicts the spline smoothing estimate (shown in red) of the fifth derivative (shown in black) for a simulated data set of size n = 500 with equispaced xi and independent normally distributed errors having common variance σ² = 0.04. We used the smooth.Pspline function from the pspline package for R. The norder parameter was set to 4 (corresponding to J = 7 in our notation), and the equivalent degrees of freedom parameter was set to 12. Panel (b) depicts the corresponding estimate of the sixth derivative.

This places more constraints on the kernel function. So, the simplest possible kernel function (20) is a polynomial on [−1, 1] whose degree depends on whether J is odd or even. For example, if J = 7, then the simplest possible kernel function (20) is a polynomial of degree 22 on [−1, 1]. This seems unnecessarily cumbersome, given that the compound estimator constructed from polynomials (2) of degree 7 can achieve virtually the same asymptotic performance.
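The instability illustrated in Figure 3 is easy to reproduce with any finitely differentiable smoother. The sketch below is our own analog using scipy's UnivariateSpline (whose degree is capped at k = 5, rather than the degree-7 smooth.Pspline fit of the figure): a low-order derivative of a smoothed sinusoid is recovered accurately, while the highest available derivatives are piecewise linear or constant and typically far rougher.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 500)
y = np.sin(3 * x) + rng.normal(0, 0.05, x.size)

# Degree-5 smoothing spline; s is the residual-sum-of-squares target,
# set to roughly n * sigma^2.
spl = UnivariateSpline(x, y, k=5, s=x.size * 0.05 ** 2)

d1 = spl.derivative(1)(0.0)  # target: 3 cos(0) = 3; recovered well
d4 = spl.derivative(4)(0.0)  # target: 81 sin(0) = 0; typically much noisier
```

The fourth derivative of a degree-5 spline is piecewise linear, so further differentiation yields a piecewise constant and then nothing, mirroring the breakdown seen in panel (b) of Figure 3.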
Moreover, kernel smooths as constructed above are only finitely differentiable and may therefore become unstable as the maximum number of differentiations is approached, an issue already illustrated for spline smooths in Figure 3.

We conclude by explaining why we did not pursue an orthogonal basis functions approach. If µ(x) can be expanded in orthogonal basis functions such that a truncated expansion is a good approximation to µ(x) for a modest J (Donoho & Johnstone 1994; Gyorfi et al. 2002 and references therein; Charnigo, Sun & Muzic 2006), then a strategy for estimating µ(x) is to obtain coefficient estimates and plug them into the truncated expansion. One may also want to differentiate this estimate j times to estimate the jth derivative of µ(x) for a positive integer j. However, just because the truncated expansion is a good approximation to µ(x) does not mean that its jth derivative is a good approximation to the jth derivative of µ(x). As noted by Lahiri (2001), even a small misspecification of the basis functions may greatly inflate the J required for the truncated expansion to approximate µ(x). The problem becomes more delicate when the derivatives of µ(x) are to be approximated by the derivatives of the truncated expansion. As a simple example, one may choose basis functions, defined differently for odd and even k, such that a truncated expansion with J = 2 and β0 = 0 is a good approximation to µ(x), and yet its first derivative is a poor approximation to the first derivative of µ(x), and its second derivative does not even resemble the second derivative of µ(x). This example makes clear the difficulty in specifying basis functions suitable for the simultaneous estimation of µ(x) and its derivatives, which is why we pursued an entirely different approach in this article.

This material is based upon work supported by the National Science Foundation under grant no. DMS-0706857. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. We thank Professors Grace Wahba and Peter Hall for helpful comments. We also thank the Editor, Associate Editor, and two referees.

Since finitely many terms of the indicated form constitute the quantity in question, result (32) completes the proof in the event that (7) holds.
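The basis-expansion pitfall discussed in the concluding section above can be illustrated with a classical example of our own choosing (not the specific construction in the text): the Fourier sine expansion of f(x) = x on (−π, π). The partial sums converge to f at interior points, yet their term-by-term derivatives fail to converge to f′(x) = 1.

```python
import numpy as np

def partial_sum(x, J):
    """Fourier partial sum of f(x) = x on (-pi, pi):
       S_J(x) = 2 * sum_{k=1}^J (-1)^(k+1) * sin(k x) / k."""
    k = np.arange(1, J + 1)
    return 2 * np.sum((-1.0) ** (k + 1) * np.sin(k * x) / k)

def partial_sum_deriv(x, J):
    """Term-by-term derivative of S_J: a poor estimate of f'(x) = 1."""
    k = np.arange(1, J + 1)
    return 2 * np.sum((-1.0) ** (k + 1) * np.cos(k * x))

# S_J(1.0) approaches f(1.0) = 1.0 as J grows, but the term-by-term
# derivative at 0 alternates between 0 (J even) and 2 (J odd), never
# approaching f'(0) = 1.
```

Good approximation of the function thus gives no guarantee about its derivatives, which is precisely the difficulty the compound estimator is designed to avoid.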