Peer-reviewed article

Estimating spatial models with endogenous variables, a spatial lag and spatially dependent disturbances: Finite sample properties

2008; Elsevier BV; Volume: 87; Issue: 3; Language: English

10.1111/j.1435-5957.2008.00187.x

ISSN

1435-5957

Authors

B. Fingleton, Julie Le Gallo

Topic(s)

Economic and Environmental Valuation

Abstract

Papers in Regional Science, Volume 87, Issue 3, pp. 319-339. Free Access.

Estimating spatial models with endogenous variables, a spatial lag and spatially dependent disturbances: Finite sample properties*

Bernard Fingleton, Department of Economics, Strathclyde University, United Kingdom (e-mail: bf100@cam.ac.uk)
Julie Le Gallo, CRESE, Université de Franche-Comté, 45D, Avenue de l’Observatoire, 25030 Besançon Cedex, France (e-mail: jlegallo@univ-fcomte.fr)

First published: 11 November 2008. https://doi.org/10.1111/j.1435-5957.2008.00187.x. Citations: 142.

* Previous versions of this paper have been presented at the 1st World Conference of the Spatial Econometrics Association, Fitzwilliam College, Cambridge University, Cambridge, 11–14 July 2007, at the 47th Congress of the European Regional Science Association, Paris, 29 August–2 September 2007, and at the 54th North American Meetings of the Regional Science Association International, Savannah, GA, 7–11 November 2007. We would like to thank two anonymous referees, B.H. Baltagi, M. Bosker, H. Kelejian, I. Prucha and the other participants of these meetings for useful comments. The usual disclaimers apply.
Abstract. This paper discusses estimation methods for models including an endogenous spatial lag, additional endogenous variables due to system feedback and an autoregressive or a moving average error process. It extends Kelejian and Prucha's, and Fingleton and Le Gallo's, feasible generalized spatial two-stage least squares estimators and also considers HAC estimation in a spatial framework as suggested by Kelejian and Prucha. An empirical example using real estate data illustrates the different estimators. Finally, the finite sample properties of the estimators are investigated by means of Monte Carlo simulation.
1 Introduction

Recent years have witnessed an explosion in the application of spatial models in the social sciences in general, and in economics in particular. Spatial econometric models have been used to analyse different topics such as regional growth, hedonic housing models, county police expenditures, technology adoption, etc. From a methodological point of view, spatial regression techniques are now becoming part of the toolbox of applied econometrics and the interest is increasingly shifting away from the single-equation cross-sectional setting to more complex frameworks such as panel data models, simultaneous models or qualitative variables models in a spatial context (see Anselin 2006 for a recent literature review). The most widely used models are variants of the one suggested in Cliff and Ord (1981), where spatial autocorrelation is included in the regression model as an endogenous spatial lag variable and/or a spatial error process, either in the form of an autoregressive process or a moving average process. As is well known, the spatial lag variable is endogenous to the model since it implies simultaneous spatial interaction. However, it is surprising to note that the analysis of the effects of other endogenous variables has been rather neglected. [Footnote 1: Some exceptions include Kelejian and Prucha (2004, 2007), Fingleton and Le Gallo (2008), and Anselin and Lozano-Gracia (2008).] Yet in applied econometric work, the presence of endogenous variables on the right hand side of regression models is a common occurrence, as endogeneity may be the result of measurement errors on explanatory variables, of omitted variables correlated with included explanatory variables or of the existence of an unknown set of simultaneous structural equations. Focusing on the latter aspect, we note that there are numerous examples where spatial autocorrelation and endogeneity of some explanatory variables are potentially both present.
This is the case in hedonic housing price models, in which house values are regressed against structural, neighbourhood and locational characteristics. Numerous reasons justify the introduction of spatial autocorrelation in these models: omission of spatially autocorrelated regressors, displaced demand and supply effects, etc. (Anselin 2008). Moreover, some explanatory variables may be endogenous following the simultaneous choice of the house price and of the quantities of attributes. This is notably the case for floor space. Similarly, in the traditional neoclassical conditional convergence model (Barro and Sala-i-Martin 1995) relating output per worker growth to the initial level of output per worker and several conditioning variables, the latter are usually not exogenous but simultaneously determined with the growth rate of output per worker (Temple 1999). Moreover, as an extensive literature shows, spatial autocorrelation is also an almost unavoidable characteristic of these models (Abreu et al. 2005; Fingleton and López-Bazo 2006). In the absence of spatial autocorrelation, different estimation methods for models with endogenous regressors are available. The entire set of structural equations can be estimated using systems methods such as full information maximum likelihood, three stage least squares or limited information instrumental variables. However, the problem is much simpler if the system can be reduced to a single equation. In a spatial setting, two cases must be considered. First, the case of a spatial lag model with additional endogenous variables is straightforward since it can be estimated by two-stage least squares, including the lower orders of the spatial lags of the exogenous variables used to instrument the spatial lag variable (see Anselin and Lozano-Gracia 2008; Dall'erba and Le Gallo 2008 for applications of this procedure).
Second, maximum likelihood estimation of a model with a spatial error process and endogenous variables does not, to our knowledge, feature in the spatial econometrics literature and would be difficult, if not impossible, to implement. In this case, Fingleton and Le Gallo (2008) have extended Kelejian and Prucha's (1998) feasible generalized spatial two-stage least squares (FGS2SLS) estimator to account for endogenous variables due to system feedback, given an autoregressive or a moving average error process. In this context, the aim of this paper is three-fold. First, it presents an estimation method for a model which includes an endogenous spatial lag, other endogenous variables and a spatial error process that can be autoregressive or a moving average. This method, based on instrumental variables and generalized method of moments, extends Kelejian and Prucha's (1998) and Fingleton and Le Gallo's (2008) feasible generalized spatial two-stage least squares estimators. It goes beyond Kelejian and Prucha (1998) by allowing additional endogenous variables and a moving average error process. It goes beyond Fingleton and Le Gallo (2008) because of the presence of the spatial lag. Second, it also analyses the properties of the non-parametric heteroscedasticity and autocorrelation consistent estimator of the variance-covariance matrix in a spatial context (SHAC), as suggested by Kelejian and Prucha (2007), in the context of endogeneity due to system feedback. Third, the choice of appropriate instruments that are independent of the errors and yet correlate sufficiently highly with the endogenous variable is a difficult practical problem and is given particular attention. Indeed, one problem which is frequently encountered is a paucity of instruments, typically leading analysts and applied econometricians to use various ad hoc and somewhat suboptimal instruments.
We therefore consider in this paper the properties of such instruments, notably the quasi-instrument already investigated in Fingleton and Le Gallo (2008): it is defined by analogy with the 3-group method for measurement errors (Kennedy 2003) and has been used in a spatial framework by Fingleton (2003).

The outline of the paper is as follows. Section 2 describes the model with an endogenous spatial lag, additional endogenous variables and spatially dependent (autoregressive or moving average) errors, together with the estimation methods that are used. Section 3 gives an example, using real estate data, of the differing outcomes that are produced using different instrumental variables. Section 4 outlines the Monte Carlo simulation methodology, describing how the system feedback leads to our data generating mechanism, and gives the simulation results based on these data. Section 5 concludes, highlighting some of the factors affecting the RMSE of the distribution of some estimated parameters.

2 Estimation methods of a spatial lag model with endogenous regressors and spatially dependent disturbances

Consider a regression model that contains exogenous explanatory variables X, as well as endogenous and spatially lagged variables:

$$Y = \rho W Y + H\gamma + Xb + u \tag{1}$$

in which Y is the (n × 1) vector of observations on the dependent variable; X is an (n × k) matrix of observations on k exogenous variables with b as the corresponding (k × 1) vector of parameters; H is an (n × c) matrix of observations on c endogenous regressors with γ as the corresponding (c × 1) vector of parameters; W is an (n × n) spatial weights matrix of known constants with zero diagonal elements; ρ is a scalar spatial autoregressive parameter and u is the (n × 1) vector of error terms, which can be generated in two ways, by a specific spatial process or non-parametrically. We first consider the parametric case. Two main spatial processes have been investigated: the autoregressive (AR) process and the moving average (MA) process.
In the first case, u takes the following form:

$$u = \lambda W^{*} u + \xi \tag{2}$$

where λ is the scalar spatial error autoregressive parameter; W* is an (n × n) spatial weights matrix with zero diagonal elements, which may be the same as or different from W; and ξ is an (n × 1) vector of innovations, with ξ ∼ iid(0, σ²Iₙ). In the second case:

$$u = \xi - \theta W^{*} \xi \tag{3}$$

where θ is the scalar spatial error moving average parameter and ξ ∼ iid(0, σ²Iₙ). As pointed out by Anselin (2003), the two processes differ in terms of the diffusion process they imply. The AR process implies that a shock at one location j is transmitted to all other locations of the sample. In contrast, in the case of the MA process, a shock at location j will only affect the directly interacting locations as given by the non-zero elements in W. Shock effects in this case are local rather than global. This specification extends the SARAR model of Kelejian and Prucha (1998), where only exogenous variables and a spatial autoregressive error process are considered, to allow for endogenous right hand side variables and a moving average process for the error term. Our specification also generalizes the case investigated by Fingleton and Le Gallo (2008), who omitted the endogenous spatial lag. Therefore, we focus here both on endogeneity and simultaneous spatial interaction. However, contrary to Kelejian and Prucha (2004), we do not assume knowledge of the entire system leading to simultaneity and endogenous regressors, but focus instead on the estimation method for a single structural equation. We suggest an estimation procedure similar to the procedure in Kelejian and Prucha (1998), comprising three stages. In the first stage, the model is estimated by 2SLS, which takes care of both the endogenous spatial lag and the other endogenous variables.
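The first stage can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function names, the ring-shaped weights matrix, the simulated data and the outside instruments `Q` are all assumptions made for the example, and the instrument set is simply [X, WX, Q], i.e. the exogenous variables, their first spatial lags and whatever outside instruments are available for H.

```python
import numpy as np

def ring_weights(n):
    """Row-standardized contiguity matrix for n units on a ring (illustrative only)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i - 1) % n] = 0.5
        W[i, (i + 1) % n] = 0.5
    return W

def spatial_2sls(y, X, H, W, extra_instruments):
    """2SLS for y = rho*W y + H gamma + X b + u.

    Assumes the first column of X is the constant, so its spatial lag is
    skipped (under row standardization it would duplicate the constant).
    Returns the stacked estimate [rho, gamma, b] and the 2SLS residuals.
    """
    regressors = np.column_stack([W @ y, H, X])
    Z = np.column_stack([X, W @ X[:, 1:], extra_instruments])
    # Project the regressors onto the column space of the instruments.
    fitted = Z @ np.linalg.lstsq(Z, regressors, rcond=None)[0]
    kappa = np.linalg.lstsq(fitted, y, rcond=None)[0]
    residuals = y - regressors @ kappa  # residuals use the actual regressors
    return kappa, residuals

# Simulated example: all parameter values below are hypothetical.
rng = np.random.default_rng(0)
n = 200
W = ring_weights(n)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Q = rng.normal(size=(n, 2))                       # outside instruments for H
u = rng.normal(size=n)
H = (Q @ np.array([1.0, -1.0]) + 0.8 * u + rng.normal(size=n)).reshape(-1, 1)
rho, gamma, b = 0.5, 1.0, np.array([1.0, 2.0, -1.0])
y = np.linalg.solve(np.eye(n) - rho * W, H @ np.array([gamma]) + X @ b + u)

kappa_hat, resid = spatial_2sls(y, X, H, W, Q)
print(kappa_hat)  # ordered as [rho, gamma, b0, b1, b2]
```

The residuals returned here are what the second stage would pass to the GM estimator of the error-process parameters.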
The second stage uses the resulting 2SLS residuals to estimate λ (or the moving average parameter θ) and σ² using a GM procedure, with the moments derived by Kelejian and Prucha for the AR case and the moments derived by Fingleton (2008) for the MA case. In the third stage, the estimated λ (or θ) is used to perform a Cochrane-Orcutt transformation (or its equivalent for θ) to account for the spatial dependence in the residuals.

Rather than modelling the error process, Kelejian and Prucha (2007) have suggested a non-parametric heteroscedasticity and autocorrelation consistent (HAC) estimator of the variance-covariance matrix, namely SHAC, under a set of relatively simple assumptions. Taking Equation (1) as a point of departure, they assume that the (n × 1) disturbance vector u is generated as follows:

$$u = R\xi \tag{4}$$

where R is an (n × n) non-stochastic matrix whose elements are not known. Note that this disturbance process allows for general patterns of correlation and heteroscedasticity. The asymptotic distribution of the corresponding IV estimators implies the following variance-covariance matrix: Ψ = n⁻¹Z′ΣZ, where Σ = (σij) denotes the variance-covariance matrix of u and Z denotes an (n × f) full column rank matrix of instruments with f ≥ (c + k + 1). Kelejian and Prucha (2007) show that the SHAC estimator for its (r, s)th element is:

$$\hat{\psi}_{rs} = n^{-1} \sum_{i=1}^{n} \sum_{j=1}^{n} z_{ir} z_{js} \hat{u}_i \hat{u}_j K(d_{ij}/d_n) \tag{5}$$

where dij is the distance between unit i and unit j; dn is the bandwidth and K(.) is the kernel function with the usual properties. They finally show that small sample inferences concerning κ = [ρ, γ′, b′]′ can be based on the following approximation:

$$\hat{\kappa} \sim N(\kappa, n^{-1}\hat{\Phi}) \tag{6}$$

where $\hat{\Phi} = n^{2}(Y_p'Y_p)^{-1} R'Z (Z'Z)^{-1} \hat{\Psi} (Z'Z)^{-1} Z'R (Y_p'Y_p)^{-1}$, Yp = Z(Z′Z)⁻¹Z′R and R = [Wy H X].

3 Application to real estate data

We motivate our Monte Carlo analysis of estimator properties via a model of average house prices [Footnote 2: In units of £1,000.] in n = 353 small areas [Footnote 3: Unitary authority and local authority districts (UALADs).] of England and Wales in the year 2001.
Rather than simply assume on empirical grounds that a spatial lag is necessary, we argue a priori that its presence in the reduced form is the outcome of spatial economic interactions in the supply and demand functions. A spatial error process is also present to account for unmodelled spatial effects due to omitted regressors. Initially, we develop the model with exogenous regressors, but later, in order to correspond more closely to the Monte Carlo set-up, we also assume that one of the explanatory variables is endogenous (cf. Fingleton 2008). The reduced form is the outcome of demand and supply functions for housing, the demand function being:

(7)

in which the quantity of housing demanded (qi) in location i is a function of income either from local jobs (wlEl) or from jobs within commuting distance (wcEc). In both of these we multiply wage rates by levels of employment, with the wage rates denoted by wl (local) and wc (within commuting distance) and employment levels denoted by El and Ec. The quality of local schooling is given by A, the variable we initially assume is exogenous. Appendix 1 describes the sources of these data. We assume that qi and pi are negatively related, and given that high prices reduce demand, high prices ‘nearby’ will reduce demand ‘nearby’, with the consequence that demand will be displaced from nearby places into i. We refer to this as a displaced demand effect. This means that demand at i will be positively related to the weighted average of prices in ‘surrounding’ areas, which is given by the i-th element of the matrix product $W_D p$, in which $w^D_{ij}$ is an element of $W_D$, a weights matrix appropriate to the demand function. Other unmodelled factors such as demand coming from non-wage earners (retired people, students, etc.) and the effects of criminality, social quality of the neighbourhood, amenity, local taxes, etc., are represented by a stochastic error term.
Turning now to the supply function, we assume that:

(8)

with the level of housing supply qi in area i increasing in the price at i. The rationale for this assumption is that where prices are high, more properties will be offered for sale by property owners wishing to realize the value of their property assets. Where prices are low, property owners are assumed to be more likely to withhold their properties from the market, in anticipation that prices may subsequently rise. A similar picture emerges from the consideration of property developers, who will tend to develop in response to price signals coming from high price areas and shun low price areas. As with the demand side spillover effect, we also assume a similar phenomenon for the supply side, though it will operate in reverse. It is assumed that high prices ‘near’ to area i (that is, a high value of $\sum_j w^S_{ij} p_j$) [Footnote 4: Matrix WS is the weights matrix appropriate to the supply function.] will attract supply from i, and this is the basis of the negative sign attributed to η. We refer to this as a displaced supply effect. Additionally, and controlling for price effects, supply also depends on the size of the existing stock of properties (O) and on other unmodelled variables that are represented by a stochastic error term. The reduced form is the outcome of first normalizing the supply function with respect to p, thus:

(9)

and then substituting for q. While for completeness we have invoked separate W matrices for the displaced supply and demand effects, the model simplifies considerably if we assume that WE = WD = WS, leading to:

$$p_i = d_0 + \rho \sum_j w^E_{ij} p_j + d_1 (w^l E^l)_i + d_2 (w^c E^c)_i + d_3 A_i + d_4 O_i + \xi_i \tag{10}$$

This is the well-known spatial lag model (Ord 1975; Cliff and Ord 1981; Upton and Fingleton 1985; Anselin 1988a). Written in matrix form, the model is:

$$p = \rho W_E\, p + Xb + \xi \tag{11}$$

where p is the (n × 1) vector of prices; X is an (n × k) matrix of k − 1 exogenous variables (wlEl, wcEc, A, O) with the first column being a column of 1s; scalar ρ is the spatial lag coefficient and b is the (k × 1) vector of exogenous variable coefficients.
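To make the substitution step concrete, the following LaTeX sketch works through one linear specification that is consistent with the verbal description. It is an illustration under stated assumptions, not the paper's Equations (7) to (9): the coefficient names a_0, ..., a_5, c_0, c_1, c_2 and the error symbols ω and ζ are hypothetical, a common weights matrix with elements w_ij is used from the outset, and only η, ρ and d_0, ..., d_4 are names taken from the text and tables.

```latex
% Hypothetical linear demand and supply functions (a_1, a_2, c_1, c_2 > 0, \eta < 0).
\begin{align*}
\text{demand:}\quad q_i &= a_0 - a_1 p_i + a_2 \sum_j w_{ij} p_j
      + a_3 (w^l E^l)_i + a_4 (w^c E^c)_i + a_5 A_i + \omega_i \\
\text{supply:}\quad q_i &= c_0 + c_1 p_i + \eta \sum_j w_{ij} p_j + c_2 O_i + \zeta_i
\end{align*}
% Equating demand and supply and collecting the terms in p_i:
\begin{align*}
(a_1 + c_1)\, p_i &= (a_0 - c_0) + (a_2 - \eta) \sum_j w_{ij} p_j
      + a_3 (w^l E^l)_i + a_4 (w^c E^c)_i + a_5 A_i - c_2 O_i + (\omega_i - \zeta_i)
\end{align*}
% Dividing by (a_1 + c_1) gives a spatial lag model with
\begin{align*}
\rho = \frac{a_2 - \eta}{a_1 + c_1}, \qquad
d_3 = \frac{a_5}{a_1 + c_1}, \qquad
d_4 = \frac{-c_2}{a_1 + c_1}, \qquad
\xi_i = \frac{\omega_i - \zeta_i}{a_1 + c_1}
\end{align*}
```

Under these assumed signs, both the displaced demand effect (a_2 > 0) and the displaced supply effect (η < 0) push ρ upwards, and the coefficient on the housing stock O is negative, which agrees with the sign of the estimates reported for O.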
In order to be able to carry out ML estimation, an explicit probability distribution has to be invoked, in which case the vector of residuals ξ is here assumed to be normally distributed with mean zero and constant variance σ². As is common throughout this strand of literature, the weights matrix is assumed to be a non-stochastic, fixed entity, the elements of which are established a priori on the basis of some reasonable assumptions and hypotheses. Hence the results we obtain quite naturally depend partially on the assumptions, or hypotheses, we build into the matrix WE. We believe that the spillover between areas will not simply be a function of spatial propinquity, to the exclusion of other effects, but that it is more realistic to base it on relative ‘economic distance’. Big towns and cities are less remote than their geographical separation would imply, whereas very small locations are often isolated from one another. If prices are high in one large city, some demand tends to be displaced to a similar place (perhaps another remote city) rather than spill over to somewhere on the city periphery. Likewise displaced supply may not be totally constrained by spatial proximity, but be attracted to locations closer in terms of economic distance. Hence economic distance reflects the reduced transaction costs associated with flows between geographically remote cities, which have better communications infrastructure, lower costs of information gathering and uncertainty, and similar economic and employment structures.
In this context, our measure of economic distance is very simple: it is based on a negative exponential function of straight line distance dij (in miles) between areas i and j, and on the size of each area's economy (El) measured in terms of the total employment level in 1999 (in units of 1000), thus:

(12)

In Equation (12), ideally we would like to estimate the coefficient β, which determines the rate of distance decay, but to do this would be a very complex estimation process and we follow the vast majority of the literature by taking this a priori to equal a fixed value, in this case β = 100. The resulting matrix is then row standardized, giving the asymmetric matrix:

$$W_E = (w^E_{ij}), \qquad w^E_{ij} = \frac{\tilde{w}_{ij}}{\sum_{k} \tilde{w}_{ik}} \tag{13}$$

with $\tilde{w}_{ij}$ denoting the unstandardized weight defined in Equation (12).

In Table 1 we provide ML and 2SLS estimates, [Footnote 5: Unreported bootstrap estimates, which are also robust to error non-normality and heteroscedasticity, are similar. To obtain the 2SLS estimates, we regress WEp on the exogenous variables wlEl, wcEc, A, O and their first spatial lags, except for the lag of A, which has been omitted to achieve full column rank in the matrix of instruments. Kelejian and Robinson (1993) and Kelejian and Prucha (1998) also suggest excluding high order spatial lags to avoid linear dependence.] which show clearly that despite the displaced demand and supply effects we have tried to embody in our model, significant residual autocorrelation remains. In order to find this, we have used a contiguity matrix (with cell (j, k) equal to 1 when areas j and k are contiguous, and equal to 0 otherwise) which is then row standardized. On this basis, according to the LM test (Anselin 1988a, 1988b), highly significant residual autocorrelation exists, with p = 0.00000001 and with the Z score indicating positive residual autocorrelation. Using the Anselin and Kelejian (1997) statistic, which is appropriate for 2SLS residuals, Z = 4.740, again pointing to positive residual autocorrelation.

Table 1. ML and 2SLS estimates for the house price data

| Variable | Parameter | ML partial estimate | ML t-ratio | 2SLS partial estimate | 2SLS t-ratio |
|---|---|---|---|---|---|
| Constant | d0 | −704.0501 | −9.25 | −703.6700 | −9.17 |
| WEp | ρ | 0.7233 | 11.61 | 0.7212 | 10.78 |
| wlEl | d1 | 0.7162 | 9.64 | 0.7166 | 9.52 |
| wcEc | d2 | 0.0307 | 7.40 | 0.0308 | 7.15 |
| O | d4 | −0.0005 | −5.47 | −0.0005 | −5.41 |
| A | d3 | 185.0888 | 9.57 | 185.0621 | 9.49 |
| σ | | 35.8172 | | 36.1263 | |
| Df | | 347 | | 347 | |
| LM | | 31.30 | | | |
| Z | | 5.595 | | 4.757 | |
| Log-likelihood | | −1764.8763 | | | |

Note: The LM test is for residual autocorrelation applied to the spatial lag model (Anselin 1988, pp. 106, 194).

Our assumption has been that the spatial lag is principally a net displacement effect dependent on both economic mass and distance, so we are not surprised by the presence of residual spatial autocorrelation. There are numerous omitted variables that surely also play a part in determining house prices, for instance air and neighbourhood quality (Anselin 2003; Anselin and Le Gallo 2006), insurance premiums (particularly with respect to flood plains), varying planning and building regulations, noise (proximity to flight paths) and so on. Many of these omitted regressors will be spatially autocorrelated, and if significant their absence would tend to induce, as a net outcome, a spatially non-random pattern of residuals (Dubin 1988; Brueckner 2003). In order to attempt to capture this, we assume two contrasting error process models. The AR process assumes that a ‘shock’ to area j is transmitted outwards as a chain reaction (with diminishing force) to all other areas. This AR process for the errors is similar to the one we are assuming for the net effect of displaced demand or supply, which also cascades outwards in an autoregressive process. Alternatively, we also assume that the errors may have a much more limited spatial footprint, with no transmission of impact beyond contiguous areas (see Fingleton 2008). We give estimates based on either assumption, both for this example and below in the Monte Carlo simulations.
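The contrast between the two error processes can be illustrated with a short Python sketch that traces a single unit shock through each of them. Everything here is an assumption made for the illustration: the eleven areas arranged on a ring, the parameter values, and in particular the sign convention for the MA process, which is written as u = ξ − θWξ so that a negative θ yields positive contributions from contiguous errors.

```python
def ring_weights(n):
    """Row-standardized contiguity matrix for n areas on a ring (illustrative)."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W[i][(i - 1) % n] = 0.5
        W[i][(i + 1) % n] = 0.5
    return W

def matvec(W, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def ma_response(W, xi, theta):
    """MA errors under the assumed convention u = xi - theta * W xi."""
    W_xi = matvec(W, xi)
    return [e - theta * w for e, w in zip(xi, W_xi)]

def ar_response(W, xi, lam, iterations=500):
    """AR errors u = lam * W u + xi, solved by fixed-point iteration.

    The iteration converges because |lam| < 1 and W is row standardized.
    """
    u = list(xi)
    for _ in range(iterations):
        W_u = matvec(W, u)
        u = [e + lam * w for e, w in zip(xi, W_u)]
    return u

n = 11
W = ring_weights(n)
shock = [1.0] + [0.0] * (n - 1)          # unit shock to area 0 only

u_ar = ar_response(W, shock, lam=0.35)   # hypothetical parameter values
u_ma = ma_response(W, shock, theta=-0.49)

print("AR:", [round(v, 4) for v in u_ar])
print("MA:", [round(v, 4) for v in u_ma])
```

In the AR case every area receives part of the shock, with the effect fading as one moves around the ring, whereas in the MA case only the shocked area and its two contiguous neighbours are affected.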
In practice the different error process assumptions make only a small difference to the results we obtain. More explicitly we represent these unmodelled effects by an AR error process and by a MA error process, with the SARAR specification defined as:

$$p = \rho W_E\, p + Xb + u, \qquad u = \lambda W^{*} u + \xi \tag{14a}$$

and the SARMA specification given by:

$$p = \rho W_E\, p + Xb + u, \qquad u = \xi - \theta W^{*} \xi \tag{14b}$$

where λ is the autoregressive and θ the moving average error parameter. Table 2 gives the results of estimating these models via GMM. The negative moving average parameter signifies positive contributions to the residuals from contiguous errors. The bootstrap distribution for θ based on 99 random samples with replacement from the residuals û gives a bootstrap estimate equal to −0.003907 and bootstrap variance of 0.008523, so that the GMM estimate (−0.4883, Table 2) is an extreme observation with respect to its bootstrap distribution, ranking below any of the 99 reference values, suggesting a significant moving average error process. The SARAR model estimates and bootstrap analysis (unreported) provide very similar interpretations.

Table 2. GMM estimates for the SARAR and SARMA models

| Variable | Parameter | GMM (AR) partial estimate | GMM (AR) t-ratio | GMM (MA) partial estimate | GMM (MA) t-ratio |
|---|---|---|---|---|---|
| Constant | d0 | −628.9821 | −8.27 | −671.4766 | −8.76 |
| WEp | ρ | 0.7535 | 8.36 | 0.7429 | 8.22 |
| wlEl | d1 | 0.5220 | 6.27 | 0.5902 | 6.90 |
| wcEc | d2 | 0.0346 | 6.64 | 0.0320 | 6.02 |
| O | d4 | −0.0004 | −4.34 | −0.0005 | −4.54 |
| A | d3 | 164.2390 | 8.58 | 175.7372 | 9.08 |
| AR/MA | λ, θ | 0.3549 | | −0.4883 | |
| σ | | 33.4963 | | 34.8145 | |
| Df | | 346 | | 346 | |

Table 3 summarizes the model in which schooling (A) is treated as an endogenous variable, [Footnote 6: The Hausman test is described by Greene (2003, p. 82). The p-values resulting from the F-test on the fitted values of A are sufficiently large not to reject the null that the coefficient on the fitted A values is zero, suggesting that A can be treated as exogenous, but we ignore this, basing the assumption of endogeneity on theoretical argument rather than empirical evidence.] using a quasi-instrument with values equal to −1, 0, 1 depending on whether A is in the upper, middle or lower third of values when placed in rank order.
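A three-group quasi-instrument of this kind can be constructed as in the Python sketch below. The function name and the toy data are hypothetical. The sketch assigns +1 to the upper third and −1 to the lower third, which is an assumption about the sign convention; reversing the signs leaves the instrument's column space, and therefore the IV estimates, unchanged.

```python
def three_group_instrument(values):
    """Three-group quasi-instrument: -1, 0 or +1 by rank third.

    Observations are ranked from smallest to largest; the lowest third
    gets -1, the highest third gets +1 and the remainder gets 0. When the
    sample size is not divisible by three, the middle group absorbs the
    extra observations (an arbitrary choice made for this sketch).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices by ascending value
    third = n // 3
    z = [0] * n
    for rank, i in enumerate(order):
        if rank < third:
            z[i] = -1
        elif rank >= n - third:
            z[i] = 1
    return z

# Toy schooling scores for six areas (made-up numbers).
schooling = [5, 1, 9, 3, 7, 2]
print(three_group_instrument(schooling))
```

Because the instrument depends only on the rank of A, it discards most of the variation in A; it remains a function of the endogenous variable itself, which is why it is described as a quasi-instrument rather than a strictly exogenous one.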
In addition, given that England and Wales are partitioned exhaustively into 47 counties and alternatively into 353 local authority areas, 46 dummy variables are included as instruments, with dummy variable k equal to 1 if area j is located within county k and 0 otherwise. The results in Table 3 are very similar to those obtained recognizing the endogeneity of the spatial lag (WEp) but ignoring the error correlation (Table 1) and the endogeneity of A, and similar to the estimates obtained by also modelling the error process (Table 2).

Table 3. GMM estimates for the SAR model with endogenous A and quasi-instrument

| Variable | Parameter | GMM (AR) partial estimate | GMM (AR) t-ratio | GMM (MA) partial estimate | GMM (MA) t-ratio |
|---|---|---|---|---|---|
| Constant | d0 | −589.647 | −6.23 | −602.711 | −6.40 |
| WEp | ρ | 0.729367 | 8.13 | 0.732868 | 8.62 |
| wlEl | d1 | 0.527122 | 6.33 | 0.541299 | 6.51 |
| wcEc | d2 | 0.0358468 | 6.77 | 0.035227 | 6.83 |
| O | d4 | −0.000433553 | −4.41 | −0.000436604 | −4.45 |
| A | d3 | 154.892 | 6.47 | 158.187 | 6.63 |
| AR/MA | λ, θ | 0.35489 | | −0.403094 | |
| σ | | 33.5006 | | 34.6196 | |
| Df | | 346 | | 346 | |
| Correlation, fitted versus actual | | 0.852339 | | 0.840907 | |
| Instruments | | County dummies, 3 groups for A | | County dummies, 3 groups for A | |
| R-squared, regression of instruments on A | | 0.7179 | | 0.7179 | |
| Hausman p-value | | 0.213842 | | 0.213842 | |
| Sargan p-value | | 0.00976858 | | 0.0102467 | |

Table 3 contains estimates assuming both an AR error process and a MA error process and controlling for the endogeneity of A, with very similar outcomes. There is a reasonable level of fit, as indicated by the correlation between observed and fitted values. The R-squared for the regression of the instruments on A is equal to 0.7179, indicating a close association between instruments and the endogenous variable. However, the fact that we are using a quasi-instrument shows up in the failure of the Sargan test; the instruments are evidently not independent of the residuals, as shown by the Sargan test statistic p-values (obtained by referring 71.3136 and 71.0844 to the χ² distribution with 46 degrees of freedom).
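The Sargan p-values can be reproduced from the two reported statistics with a few lines of Python. The sketch below is not the authors' code; it relies on the closed-form upper tail of the χ² distribution that holds when the degrees of freedom are even, which is the case here (46), so no statistical library is needed. The function name is hypothetical.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2m the tail equals exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!,
    i.e. a Poisson cumulative probability with mean x/2.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term = 1.0      # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

# Sargan statistics reported in the text, referred to chi-square with 46 df.
for label, statistic in [("AR errors", 71.3136), ("MA errors", 71.0844)]:
    print(label, chi2_sf_even_df(statistic, 46))
```

Both p-values are close to 0.01, so the null hypothesis that the instruments are independent of the residuals is rejected at the 5% level, as stated above.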
Table 4 summarizes the estimates obtained by replacing the quasi-instrument by strictly exogenous [Footnote 7: The case for treating these three variables as exogenous was made earlier.] instruments based on wlEl, wcEc and O. Since these variables and their spatial lags are already included in the set of instruments for WEp, we create functions of these variables to act as a set of separate instruments, again using the three-group approach outlined above. The outcome of the Sargan test is a p-value greater than 0.05, indicating that the instruments are independent of the residuals. Table 4 shows that with these strictly exogenous instruments, the parameter estimate for A and its significance are very much reduced, although it remains the case that the one-tailed p-values indicate a marginally significant effect of A on house prices. For the AR errors model, the t-ratio of 1.74 is consonant with a p-value of 0.05, whereas the MA errors specification gives a p-value of 0.03. Therefore we conclude that there is a significant effect of schooling on house

Reference(s)