Application of the dynamic spatial ordered probit model: Patterns of land development change in Austin, Texas

Article; Open access; Peer-reviewed


2009; Elsevier BV; Volume: 88; Issue: 2; Language: English

10.1111/j.1435-5957.2009.00249.x

ISSN

1435-5957

Authors

Xiaokun Wang, Kara M. Kockelman

Topic(s)

Housing Market and Economics

Abstract

Papers in Regional Science, Volume 88, Issue 2, p. 345-365. First published: 29 June 2009. https://doi.org/10.1111/j.1435-5957.2009.00249.x

Xiaokun Wang, Bucknell University, Department of Civil Engineering, 701 Moore Avenue, Lewisburg, PA 17837, USA (e-mail: cara.wang@bucknell.edu)

Kara M. Kockelman, University of Texas, 1 University Station, ECJ Suite 6.9, C1761, Austin, TX 78712, USA (e-mail: kkockelm@mail.utexas.edu)

The evolution of land development in urban areas has been of great interest to policy-makers and planners. Due to the complexity of the land development process, no existing studies are considered sophisticated enough. This research uses the dynamic spatial ordered probit (DSOP) model to analyse Austin's land use intensity patterns over a 4-point panel.
The observational units are 300 m × 300 m grid cells derived from satellite images. The sample contains 2,771 such grid cells, spread among 57 zip code regions. The marginal effects of control variables suggest that increases in travel times to the central business district (CBD) substantially reduce land development intensity. More importantly, temporal and spatial autocorrelation effects are significantly positive, showing the superiority of the DSOP model. The derived parameters are used to predict future land development patterns, along with associated uncertainty in each grid cell's prediction.

1 Background

In studies of social behaviours and human activities, many choices or attributes (e.g., religious beliefs, presidential election outcomes and levels of crime) involve discrete responses in a temporal and spatial context. This is especially true for the analysis of dynamics in land development intensity levels under the influence of geology, demographics, transportation conditions and other socio-economic factors: land owners make development decisions based on their knowledge and prediction of neighbouring land development (see, e.g., Candau et al. 2000; Waddell 2002). As a result, land development is often clustered. For example, one can expect that a parcel of land is more likely to be intensely developed if its neighbourhood offers intensely developed land. Wang and Kockelman (2008) developed a dynamic spatial ordered probit (DSOP) model for analysing the temporal and spatial relationships in ordered categorical data. This paper demonstrates how this model can be applied to the analysis of land development change.

The analysis relies on Austin, Texas data sets. Thanks to rapid population growth and economic expansion, the area has experienced some dramatic changes during the last two decades. As will be shown in more detail in the following sections, during this time period, the region's land development has both sprawled over space and escalated in intensity. One direct result of this development is congestion. The Texas Transportation Institute's urban mobility report (Schrank and Lomax 2005) indicates that Austin ranks first among all 30 medium-sized U.S. cities in travel delay and wasted fuel per capita. In this study, land development intensity is defined based on how much land is covered by man-made materials, which are characterized by higher reflectance levels and other visual clues provided via satellite images. These 'intensity levels' are indexed as integers, and their order is key.

This application, in addition to disclosing the spatial and temporal patterns of urban land development change, illustrates the potential broad application of the dynamic spatial ordered probit model. For urban areas, the evolution of land development intensity is a topic of interest to traffic demand modellers, policy-makers, and land developers. Such changes influence regional economies and environmental conditions. For non-urban areas, analysing the dynamics of land development intensity is also important. For example, undeveloped land around the world, including some precious lands like the Amazon rainforest, is being converted for agriculture and other human uses. Such changes can significantly contribute to climate change, desertification, resource depletion and loss of habitats and species.

Many studies have been conducted on land change patterns. However, many studies that recognize spatial effects have tried to either construct and control for a variety of neighbourhood attributes or remove all spatial correlation through strategic sampling (to provide a dispersed sample, with minimal interactions). Some also attempt to recognize temporal dependencies by controlling for variables from previous periods. For example, Nelson and Hellerstein (1997) sampled selectively and created exogenous variables based on neighbouring units' land cover data in order to study the deforestation effects of roadways via a multinomial logit model. Wear and Bolstad (1998) controlled for prior land uses in the neighbourhood of each data cell in their study of southern Appalachian landscapes, which involved binary response data. Munroe et al. (2001) attempted to filter out spatial correlations through sampling and then removed the residual spatial dependence through a 'trend surface' approach (Cliff and Ord 1981). Over the years, statistical evaluations of land use patterns have continued. Examples include works by Landis et al. (1995), Paez and Suzuki (2001) and Paez (2006).

Accompanying such literature is the development of spatial econometric methods for limited dependent variables. Classic studies include work by Case (1992), McMillen (1992) and Dubin (1995). However, as with all other existing models dealing with discrete response data in a temporal and/or spatial context, the applicability of these methods is still limited. One reason is that none of the existing models handles multi-level response data while explicitly recognizing spatial and dynamic effects.

The following sections first introduce the specification and estimation of the DSOP model, then describe the datasets used in this study. The effects of different factors on land development intensity are discussed based on the estimation results. The estimates also are applied to predict land development intensity levels in the study area.

2 Model specification and estimation

Wang (2007) and Wang and Kockelman (2008) discuss the DSOP model's specification and estimation in detail. The dynamic spatial ordered probit model captures the ordinal nature of land development intensity values (denoted as ranked integers). In addition, land development decisions strongly depend on prior and existing conditions, as well as owner/developer expectations of future conditions (such as local and regional congestion, population, and school access). These expectations can be approximated using contemporaneous measures of access and land use intensity, after which some spatial correlation in unobserved factors is likely to remain, as represented by the regional effect θi. The following discussion summarizes the DSOP methodology and highlights key findings. In short, a dynamic ordered probit model with spatial and temporal autocorrelation can be described by extending existing specifications of spatial probit models, static and dynamic, ordered and categorical. The most closely related works are those by Girard and Parent (2001), Smith and LeSage (2004), Wang (2007) and Wang and Kockelman (2008).

The model specification is as follows: (1) where i indexes regions (i = 1, . . . , M), k indexes individuals inside each region, or neighbourhood (i.e., k = 1, . . . , ni), and t indexes time periods. In other words, there are M regions/neighbourhoods, each containing ni observations, so that the total number of individuals is N = Σi ni. Each individual is observed T times, so the total number of observations is NT. In addition, λ is the temporal autocorrelation coefficient to be estimated. Uikt is a latent (unobserved) response variable for individual k from region i at period t. Xikt is a Q × 1 vector of explanatory variables, and β is the set of corresponding parameters. The residual is composed of two parts: θi captures all common yet random components for observations within region i, while the remaining random information is captured by the individual effect εik, which is heteroscedastic with variance υi, or (2) This specification allows the model to reflect spatial autocorrelation across regions while recognizing intra-regional clustering. A spatial autoregressive process can be formulated here, as follows: (3) where weight wij reflects proximity, and can be derived based on contiguity and/or distance between regions. The magnitude of overall neighbourhood influence is reflected by ρ, also called the spatial coefficient. ui aims to capture any regional effects that are not spatially distributed, and is assumed to be iid normally distributed, with zero mean and common variance σ2.
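
The displayed forms of Equations (1) to (3) do not appear in the text above. One reading that is consistent with the surrounding definitions is sketched here. The exact subscripts on the error terms and the normality statement in (2) are assumptions made for illustration, not the authors' typeset equations.

```latex
\begin{align}
U_{ikt} &= \lambda\, U_{ik,t-1} + X_{ikt}'\beta + \theta_i + \varepsilon_{ikt} \tag{1} \\
\varepsilon_{ikt} &\sim N(0, \upsilon_i) \tag{2} \\
\theta_i &= \rho \sum_{j=1}^{M} w_{ij}\,\theta_j + u_i,
  \qquad u_i \overset{iid}{\sim} N(0, \sigma^2) \tag{3}
\end{align}
```

Under this reading, the lagged term carries the temporal dependence through λ, while θi carries the spatial dependence through ρ and the weights wij.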
Stacking all regions, then, the vector of regional effects can be formulated as follows: (4) The vector of regional effects will be a function of the weight matrix W, which has zeros on its diagonal and is composed of purely exogenous elements wij, so that (5) The use of such regional effects to capture certain spatial dependencies also enhances computational efficiency: normally, the number of regions is much lower than the total number of observations, allowing use of a weight matrix W of relatively low rank. Thanks to a lower dimension, the inversion of (I-ρW) and calculation of its eigenvalues are much less memory-intensive. Of course, both of these computations are necessary for parameter estimation. Furthermore, the specifications shown here allow the special case of every individual serving as a separate region, where ni= 1, ∀i∈M (and M=N). In this context, all individuals can be spatially auto-correlated without imposing regional boundaries. While computational burdens will increase, this approach certainly is feasible, assuming a reasonable sample size. Equation (1)'s recursive time-space form implies that current response values depend on previous period values, along with various contemporaneous factors (Anselin 1999). Furthermore, after controlling for all such temporally lagged and contemporaneous variables, the residuals are no longer temporally autocorrelated but remain spatially dependent. The context of land development intensity levels fits this specification: land development depends on past and present conditions, including owner/developer experiences of local and regional congestion and population, as well as nearby development and variables like school access. For the case of an ordered probit specification, the observed response variable, yikt, is as follows: (6) That is, the observed variable is a censored form of the latent variable, and its possible outcomes are integers ranging from 1 to S. 
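
The stacked regional effects and the censoring rule can be illustrated with a minimal Python sketch. It assumes the common spatial autoregressive form θ = ρWθ + u (so θ = (I − ρW)⁻¹u) and the ordered-probit rule y = s when γs−1 < U ≤ γs. The function names, the four-region chain, the covariates, β, the zero starting latent values and the unit individual variance are all hypothetical. The values of λ, ρ, σ² and the thresholds are the posterior means reported in Table 3, used only for illustration. NumPy is assumed to be available.

```python
import numpy as np


def row_standardize(adjacency):
    """Return W with a zero diagonal and rows summing to 1 (for regions with neighbours)."""
    A = np.array(adjacency, dtype=float)
    np.fill_diagonal(A, 0.0)
    sums = A.sum(axis=1, keepdims=True)
    return np.divide(A, sums, out=np.zeros_like(A), where=sums > 0)


def draw_regional_effects(W, rho, sigma2, rng):
    """Draw u ~ N(0, sigma2 * I) and solve (I - rho * W) theta = u for theta."""
    M = W.shape[0]
    u = rng.normal(0.0, np.sqrt(sigma2), size=M)
    theta = np.linalg.solve(np.eye(M) - rho * W, u)
    return theta, u


def censor(U, gamma):
    """Map latent values to categories 1..S using interior thresholds gamma_1..gamma_{S-1}."""
    # searchsorted(..., side="left") counts thresholds strictly below U,
    # so U <= gamma_1 gives 1 and U > gamma_{S-1} gives S.
    return np.searchsorted(gamma, U, side="left") + 1


def simulate_panel(X, beta, region, theta, lam, gamma, rng, upsilon=1.0):
    """Simulate latent U (T x N) and observed y under the assumed recursive form."""
    T, N, _ = X.shape
    U = np.zeros((T, N))
    previous = np.zeros(N)  # assumption: latent values start at zero
    for t in range(T):
        eps = rng.normal(0.0, np.sqrt(upsilon), size=N)  # assumption: common variance
        U[t] = lam * previous + X[t] @ beta + theta[region] + eps
        previous = U[t]
    return U, censor(U, gamma)


rng = np.random.default_rng(0)
# Hypothetical chain of four regions: 0-1-2-3.
W = row_standardize([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
theta, u = draw_regional_effects(W, rho=0.857, sigma2=0.871, rng=rng)

region = np.array([0, 0, 1, 2, 2, 3])   # region index of each of six cells
X = rng.normal(size=(4, 6, 2))          # four periods, six cells, two made-up covariates
beta = np.array([0.5, -0.3])            # made-up coefficients
gamma = np.array([-0.834, 2.235, 4.361])
U, y = simulate_panel(X, beta, region, theta, lam=0.561, gamma=gamma, rng=rng)
print(y)
```

Solving the M × M system with `np.linalg.solve` avoids forming the inverse explicitly, which is one reason working at the regional level (M regions rather than N cells) keeps the computation light.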
The latent variable Uikt is allowed to vary between unknown boundaries γ0 < γ1 < . . . < γS−1 < γS, where γ0 is −∞ and γS is +∞. If constants are to be included in the explanatory variables, γ1 also is normalized to equal zero. The probabilities for these S outcomes are as follows: (7) where Φ(•) is the cumulative distribution function (CDF) for a standard normal distribution. The resulting likelihood function is as follows: (8) where y, U and γ are the vector forms of yikt, Uikt and γs. δ (A) is an indicator function equalling 1 when event A is true (and 0 otherwise). Allowance of all these features should make the model more statistically reliable in mimicking and forecasting the temporal and spatial evolution of ordered response variables, like land use intensity, tree cover, and home safety ratings. In addition, modifications of this model to allow for irregularly spaced panel data sets may be very helpful for practice, along with new methods for allowing non-sparse weight matrices and large sample sizes. Another important opportunity for exploration is prediction out of sample, which is rarely pursued in spatial econometrics research but can be critical in practice. Estimation of the dynamic spatial ordered probit model is achieved in a Bayesian framework where each parameter has prior and posterior distributions. The posterior distributions are consistently derived using Markov chain Monte Carlo (MCMC) methods (Gelfand and Smith 1990), by sampling sequentially from the complete set of conditional distributions. Wang and Kockelman (2008) describe more details about the derivation of these conditional distributions, including the choice of prior distributions and the subsequent calculation. As Wang and Kockelman (2008) describe, most of the parameters follow standard distributions and can be conveniently generated using routines built in commercial mathematical packages (such as Matlab and Gauss). 
The spatial coefficient ρ, however, follows a non-standard posterior distribution and has to be generated using numerical methods. The threshold parameter γ follows a multidimensional truncated normal distribution and the truncations co-vary. Therefore, the marginal distribution of each element in γ also is expected to be non-standard.

The Gibbs sampling starts from arbitrary initial parameter and latent variable values. These parameters or variables of interest are then sampled sequentially based on their conditional distributions. In each iteration, these values are updated and replace values from the previous iteration. The iterative process continues until the desired number of draws is achieved. In some studies, the initial values are used to represent analysts' intuition and expectations. As the number of draws increases, the dataset becomes more influential in parameter estimation. The evolution of a parameter's values is called its 'trace', and the estimation process is considered convergent when the various traces stabilize on a distribution. Parameter values sampled in iterations before convergence is achieved are omitted as 'burn-in' values.

3 Data description

The data used to analyse land development dynamics come from multiple sources, including satellite images, the Census of Population, City of Austin school district and employment data, as well as transportation and geographic data from the Capital Area Council of Governments (CAPCOG). The land cover information serves as the dependent variable, and all others serve as explanatory variables. These include total neighbourhood population, number of workers living in the neighbourhood, average household income and number of schools in the neighbourhood, travel time to the nearest major highway (including U.S. Highway 290, U.S. Highway 79, U.S. Highway 183, State Highway 71, Interstate 35, Loop 1, and Loop 360, which did not change from 1983 to 2000), travel time to the region's central business district (CBD), travel time to major (Austin's 15 biggest) employers, travel time to the nearest airfield, average ground slope, and average elevation (of each 300 m × 300 m grid cell). Travel times were estimated using a series of calculations (see Wang 2007, for more details), recognizing the different travel speeds on different road classifications. For undeveloped locations far away from the existing road network, the process also considered the time that travellers need to spend to first access the network. A set of rather standard routines was used to integrate the various databases, as described in Wang (2007).

Land development intensity levels for each 300 m grid cell were determined based on an average of the intensity scores accruing to all 100 pixels contained in the cell. Lands that are highly reflective in satellite images are largely covered by man-made surfaces, indicating more intense development. In contrast, vegetation qualifies as less developed, and water and barren lands are considered undeveloped. Any process that orders the data into ascending categories could have been used here, with 1 indicating largely undeveloped cells, 4 indicating highly developed cells, and 2 and 3 lying in between. Scores and category thresholds used here are described at length in Wang (2007). Figure 1 shows land development intensity levels for the study area in different model years.

An interesting and important part of the data processing involves definition of 'regions' and selection of cell samples. As discussed above, observations in the same region should share common latent features.
In ecological and environmental studies, regional boundaries may derive from natural spatial partitions, such as rivers and mountain ranges, with observations in the same region sharing vegetation and micro climate. For human activities, boundaries are more likely to be administrative units, across which policies and practices can change, such as zoning and school administration.

Figure 1. Computed land intensity levels across different years

In Austin's urban area, zoning is based on neighbourhood planning areas (NPAs). Changes in zoning constraints often occur across these boundaries. However, information for many interesting variables is often organized based on zip codes. In order to be consistent with existing spatial units, study regions were based on 57 zip codes. Each zip code tends to align with a single NPA (i.e., to define the boundaries of an NPA) or with the union of 2 to 4 NPAs. The 57 regions offer interesting regional variation while keeping computational burdens reasonable (the estimation process converges within eight hours on a standard DELL Precision 360 workstation).

After defining these regions, the next step was to select observations (grid cells) in each region. Of course, one can use all 29,946 300 m grid cells as observations. However, there are good reasons for selecting only a subset of these. First, the 'boundary' of a region may be somewhat ambiguous and the differences between regions may be slight. If all grid cells are used, cells that are located in two different regions yet lie in close proximity may be more similar than grid cells that are far away from each other yet belong to the same region. The second reason is computational: 29,946 grid cells create a very large pool of observations, which makes parameter estimation difficult because of the large-matrix inversion required for spatial covariance components.
In this study, a 10% sampling rate (Σni = 2,995) is likely to return satisfactory estimation results with significantly reduced computation time and so was used here. First, in order to ensure that observations in the same region are more alike than those in other regions, samples were selected around regional (zip code area) centroids. In this way, observations in the same region are spatially clustered (all contiguous), and thus expected to be more similar to each other than to observations in other regions. Second, in order to represent the entire study area, samples should be distributed as evenly as possible across space. If an equal number of 300 m cell observations is selected in each region, smaller regions will get more weight (than they 'deserve') in the sample. In order to spatially balance the selection, the number of observations in each region was set proportional to region area. Finally, 224 sampled cells were removed: some, in the case of very narrow zip codes, extended into neighbouring regions or fell along edges of the study area (so neighbourhood information could not be obtained), while others exhibited unrealistic elevation and slope values (caused by missing values in the CAPCOG dataset). A total of 2,771 observations (per year) resulted from this processing. These observations are distributed across the 57 regions (zip code areas), with the number of grid-cell observations per region ranging from 2 to 333.

Table 1 summarizes definitions of all these variables, and Table 2 summarizes their statistics. Table 2's statistics show trends that are expected: development intensity levels, population, number of workers, and average household income have all increased over time. Average (uncongested) travel times to major facilities and employers have fallen, thanks to road system expansions in peripheral zones.

Table 1. Data description for land development intensity level analysis

| Variable | Description |
| --- | --- |
| INTLV | Development intensity level |
| ELEVTN | Average elevation of the 300 m grid cell (km) |
| SLOPE | Average slope of the 300 m grid cell (%) |
| NSCHOOL | Number of K-12 schools in the neighbourhood |
| POP | Population (thousands) in the neighbourhood |
| WORKER | Number of workers (thousands) living in the neighbourhood |
| INC | Average household income ($ thousands) in the neighbourhood |
| EMPTT | Travel time to nearest major (top 15) employer (hours) |
| CBDTT | Travel time to CBD (hours) |
| AIRTT | Travel time to nearest airfield (hours) |
| RDTT | Travel time to nearest highway (hours) |

Table 2. Summary statistics for land development intensity analysis

| Year | Variable | Minimum | Maximum | Mean | Std. deviation |
| --- | --- | --- | --- | --- | --- |
| Constant through years | ELEVTN | 0.136 | 0.390 | 0.251 | 0.061 |
| Constant through years | SLOPE | 0.034 | 17.328 | 2.699 | 2.196 |
| Constant through years | NSCHOOL | 0.000 | 7.000 | 1.208 | 1.377 |
| 1983 | INTLV | 0.000 | 3.000 | 0.826 | 0.774 |
| 1983 | POP | 0.225 | 37.531 | 4.632 | 7.298 |
| 1983 | WORKER | 0.121 | 19.997 | 2.408 | 3.918 |
| 1983 | INC | 17.330 | 88.941 | 45.368 | 15.109 |
| 1983 | EMPTT | 0.004 | 1.115 | 0.453 | 0.223 |
| 1983 | CBDTT | 0.000 | 0.358 | 0.154 | 0.070 |
| 1983 | AIRTT | 0.005 | 0.784 | 0.345 | 0.157 |
| 1983 | RDTT | 0.002 | 0.498 | 0.111 | 0.093 |
| 1991 | INTLV | 0.000 | 3.000 | 0.948 | 0.874 |
| 1991 | POP | 0.203 | 51.310 | 6.860 | 10.424 |
| 1991 | WORKER | 0.121 | 27.633 | 3.624 | 5.652 |
| 1991 | INC | 20.540 | 105.412 | 53.844 | 17.766 |
| 1991 | EMPTT | 0.004 | 0.733 | 0.298 | 0.149 |
| 1991 | CBDTT | 0.000 | 0.339 | 0.148 | 0.068 |
| 1991 | AIRTT | 0.004 | 0.630 | 0.259 | 0.120 |
| 1991 | RDTT | 0.002 | 0.430 | 0.092 | 0.082 |
| 1997 | INTLV | 0.000 | 3.000 | 1.300 | 0.827 |
| 1997 | POP | 0.389 | 64.873 | 8.007 | 12.615 |
| 1997 | WORKER | 0.211 | 35.220 | 4.240 | 6.900 |
| 1997 | INC | 23.332 | 119.738 | 61.077 | 20.341 |
| 1997 | EMPTT | 0.001 | 0.313 | 0.112 | 0.060 |
| 1997 | CBDTT | 0.000 | 0.308 | 0.142 | 0.065 |
| 1997 | AIRTT | 0.004 | 0.628 | 0.227 | 0.116 |
| 1997 | RDTT | 0.002 | 0.385 | 0.086 | 0.074 |
| 2000 | INTLV | 0.000 | 3.000 | 1.359 | 0.929 |
| 2000 | POP | 0.478 | 64.629 | 9.131 | 13.153 |
| 2000 | WORKER | 0.238 | 36.238 | 4.836 | 7.278 |
| 2000 | INC | 15.869 | 125.094 | 65.024 | 22.635 |
| 2000 | EMPTT | 0.001 | 0.182 | 0.070 | 0.037 |
| 2000 | CBDTT | 0.000 | 0.266 | 0.126 | 0.057 |
| 2000 | AIRTT | 0.005 | 0.437 | 0.154 | 0.070 |
| 2000 | RDTT | 0.002 | 0.251 | 0.054 | 0.044 |

4 Model estimation

This section applies the DSOP model (see Wang and Kockelman 2008, for more discussion on the model specification and estimation techniques) to the analysis of land development intensity levels. As noted in section 2, explanatory variables include temporally lagged latent variables and various contemporaneous variables. The following sections discuss the model estimation and results. First, the number of burn-in samples is determined. Estimate means, standard deviations, posterior distributions, and their marginal effects are then calculated and discussed. The performance of the DSOP model on this dataset is also compared with that of simpler models. Finally, model estimates are used to predict response variables' values under hypothetical scenarios. The predictions can be visualized via a 'most likely' result and an 'uncertainty index'.

Figure 2 shows several typical estimation traces (convergence patterns) for parameters in the development intensity model. These patterns are representative, and the traces of other parameter estimations are all similar to them. Rigorous proof of convergence is a complicated topic, so here 'convergence' is based on the trace of variable estimates. If, after a certain number of iterations, parameter estimates stabilize, the estimation is assumed to have converged. Results of iterations before this turning point are omitted and all inferences are drawn based on the converged iterations.

Figure 2. Convergence patterns of development intensity level estimation

The model begins with diffuse priors and iterates 10,000 times. As observed in Figure 2, different parameters start 'converging' after different numbers of runs. However, after 6,000 runs, all traces appear stable, indicating an overall model convergence. Hence, the first 6,000 runs were omitted (as a 'burn-in' sample), and the model uses the next 4,000 draws to estimate parameter means and standard deviations, as shown in Table 3.

Table 3. Estimation results for model of land development intensity levels

| Variable | Mean | Std. Dev. | t-stat. |
| --- | --- | --- | --- |
| POP | −0.024 | 0.036 | −0.668 |
| WORKER | 0.089 | 0.067 | 1.327 |
| INC | 0.019 | 0.002 | 9.143 |
| EMPTT | −0.232 | 0.130 | −1.778 |
| CBDTT | −4.365 | 0.851 | −5.126 |
| AIRTT | −2.867 | 0.248 | −11.550 |
| RDTT | 2.309 | 0.385 | 6.001 |
| NSCHOOL | 0.039 | 0.017 | 2.305 |
| ELEV | −0.239 | 0.696 | −0.343 |
| SLOPE | −0.034 | 0.010 | −3.394 |
| λ | 0.561 | 0.019 | 30.005 |
| ρ | 0.857 | 0.074 | 11.612 |
| σ² | 0.871 | 0.222 | 3.931 |
| γ1 | −0.834 | 0.011 | −77.231 |
| γ2 | 2.235 | 0.031 | 71.393 |
| γ3 | 4.361 | 0.034 | 130.167 |

According to the results, neighbourhood population and worker counts do not have statistically significant impacts on land development intensity levels. Average household income, by contrast, appears to generally boost such levels. Travel times to major employers, Austin's CBD, and the nearest airfield all have statistically and practically significant effects on land development: the farther the cells lie from these attractions, the less likely they are to develop intensely. Interestingly, travel time to the nearest highway (RDTT) has a positive estimated coefficient (2.309 in Table 3), implying that (in the study area) development is more likely to occur at locations far from major roads. Considering that access to the CBD and major employers already has been controlled for, this result can be interpreted as follows: after access to work and the region's core are determined, developers tend to choose locations some distance away from the highway (and its noise, pollutants and safety issues). The results also suggest that locations with more neighbourhood schools are more likely to be intensely developed. While elevation is not a statistically influential factor, locations with steeper slopes are less attractive to land development.

Unlike slope coefficients in a standard linear model, beta values associated with explanatory variables cannot be interpreted so directly in a model involving latent response.
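
To illustrate why coefficients in an ordered response model cannot be read directly as effects on each outcome, the following Python sketch computes category probabilities for a plain ordered probit at two values of the linear index. It assumes a unit-variance normal error and ignores the regional effect and heteroscedasticity of the DSOP model; the function names and the index values 0.0 and 0.5 are hypothetical, and the thresholds are the posterior means from Table 3, used only for illustration.

```python
from math import erf, inf, sqrt


def std_normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def category_probabilities(index, thresholds):
    """P(y = s) = Phi(gamma_s - index) - Phi(gamma_{s-1} - index), for s = 1..S."""
    cuts = [-inf, *thresholds, inf]  # gamma_0 = -inf, gamma_S = +inf
    return [
        std_normal_cdf(cuts[s] - index) - std_normal_cdf(cuts[s - 1] - index)
        for s in range(1, len(cuts))
    ]


thresholds = [-0.834, 2.235, 4.361]            # gamma_1..gamma_3 from Table 3
low = category_probabilities(0.0, thresholds)  # hypothetical baseline index
high = category_probabilities(0.5, thresholds) # same cell after an index increase

for s, (p0, p1) in enumerate(zip(low, high), start=1):
    print(f"level {s}: {p0:.4f} -> {p1:.4f}")
```

Raising the index always lowers the probability of the lowest level and raises that of the highest level, but the direction of change for the middle levels depends on where the index sits relative to the thresholds.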
Moreover, as Greene (2005) explains, these parameter signs in a model of ordered categorical response only indicate changes in likelihood of the two extreme outcomes (y= 1 and 4), rather than changes in all outcomes. Section 5.1 of this paper quantifies the marginal effects of all control variables. Another important estimation result is the practical and statistical significance of both the temporal autocorrelation coefficient (λ) and the spatial autocorrelation coefficient (ρ). These suggest that prior-period information has a very important influence on the (current) latent variable's value (mean λ= 0.561) and that, even after controlling for various neighbourhood characteristics, residuals remain strongly and positively correlated across space (mean ρ= 0.857). These results support the notion that land development decisions depend heavily on neighbouring conditions, and that spatial relationships should be reflected in model specification. As a further confirmation, the mean values of regional specific error (θi) estimates (and their statistical significance) are shown in Figure 3. A clustering pattern (where similar values tend to co-locate, rather than lie randomly distributed across space) is clearly visible in this figure, so the spatial autocorrelation of these regional-specific error terms was tested using Moran's I (Moran 1950), in ArcMap. It should be noted that the weight
