Integrating Environmental Context into DHS Analysis While Protecting Participant Confidentiality: A New Remote Sensing Method

Article, Open Access, Peer Reviewed


2018; Wiley; Volume: 45; Issue: 1; Language: English

10.1111/padr.12222

ISSN

1728-4457

Authors

Kathryn Grace, Nicholas N. Nagle, Clara R. Burgert, Shelby Rutzick, David Van Riper, Trinadh Dontamsetti, Trevor Croft

Topic(s)

Global Public Health Policies and Epidemiology

Abstract

Population and Development Review, Volume 45, Issue 1, pp. 197-218. Data and Perspectives. Open Access. First published: 19 December 2018. https://doi.org/10.1111/padr.12222

Introduction

Understanding the ways that people live given certain environmental conditions is of central concern to researchers in health, development, population, climate change, and other related fields (see Grace et al. 2014; Balk et al. 2005; de Sherbinin 2011). One major source of data on health and development is the USAID-funded Demographic and Health Surveys (DHS) program.
DHS is a major source of population and health data for the poorest countries in the world and provides high-quality and detailed data on individual health outcomes, particularly outcomes related to maternal and child health. The primary sampling units in the DHS are villages or village “clusters.” Cluster size can vary, but each cluster contains a number of households within a geographic area that participated in the survey. Since many of the data included in DHS are personal and potentially sensitive, the DHS maintains the confidentiality of respondents by shifting the spatial coordinates of the cluster in the published data (Burgert et al. 2013). The spatial coordinates for rural locations are displaced by 0–5 km in any direction. Additionally, a small fraction of coordinates, 1 percent, are randomly shifted up to 10 km. For urban locations, the displacement is up to 2 km only. DHS recommends that researchers average any environmental data over a 5–10 km buffer around each DHS rural cluster, with the specific community falling somewhere within the disc around each point (Perez-Heydrich et al. 2016). This approach to maintaining confidentiality while collecting survey information has been adopted by other international organizations as well (e.g., World Bank's Living Standards Measurement Study). Building on the rapid growth of literature around activity space, the geographic theory of close things being more alike (Tobler's First Law), as well as the understanding that people interact disproportionately with the landscape immediately surrounding a settlement, we propose an alternative method for evaluating environmental and contextual variables (Tobler 1970; Miller 2004). Instead of calculating a 5–10 km buffer around each published point, we propose that the user select a settlement near the published DHS cluster location and measure the environmental conditions around the settlement using a buffer much smaller than 10 km.
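The displacement rules just described can be illustrated with a short simulation. This is a minimal sketch, not the DHS program's actual procedure: the function names are hypothetical, and the uniform draws for distance and bearing and the flat-earth offset are assumptions made only for illustration. The source states the maximum distances (2 km urban, 5 km rural, 10 km for 1 percent of points) but not the sampling distribution.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0


def displace(lat, lon, urban, rng):
    """Return a displaced (lat, lon) in degrees; hypothetical helper."""
    if urban:
        max_km = 2.0                 # urban: up to 2 km
    elif rng.random() < 0.01:
        max_km = 10.0                # 1 percent of points: up to 10 km
    else:
        max_km = 5.0                 # rural: 0-5 km
    # Assumption: distance and bearing are drawn uniformly.
    distance_km = rng.uniform(0.0, max_km)
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    # Small-distance approximation: convert a km offset to degrees.
    dlat = distance_km * math.cos(bearing) / EARTH_RADIUS_KM
    dlon = distance_km * math.sin(bearing) / (
        EARTH_RADIUS_KM * math.cos(math.radians(lat))
    )
    return lat + math.degrees(dlat), lon + math.degrees(dlon)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


if __name__ == "__main__":
    rng = random.Random(0)
    new_lat, new_lon = displace(12.0, -1.5, urban=False, rng=rng)
    print(round(haversine_km(12.0, -1.5, new_lat, new_lon), 2), "km from the true point")
```

Because the published point can lie anywhere within these distances of the true cluster, a buffer drawn around it may be centered on a place where nobody lives.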
We assume that the “true” context is a small, precise buffer around the correct settlement. We hypothesize that a small, precise buffer around an incorrect settlement is a better measure of truth than is an overly large buffer around the published point. Settlements can be identified through interpreting remotely sensed imagery. Corresponding features (for example, types of land-use strategies or adjacency to reservoirs for irrigation) can be more easily identified and evaluated when using a much more precise buffer. While the settlement that is being used to provide this contextual information is likely not the original DHS cluster, it is a neighbor of the cluster and we assume that neighboring settlements are more similar to each other than to the broader environment in which they are situated. We theorize that this approach will introduce less measurement error than the larger 10 km buffer.

To test this theory, we select three countries that are topographically diverse and that represent unique regions of the world: Burkina Faso, Kenya, and Tajikistan. As with most of the poorest countries in the world, these countries are heavily dependent on the landscape to produce food and earn money. However, each of these countries is quite distinct from the others in terms of environmental characteristics (rainfall and topography) and cultural characteristics (the types of crops produced as well as the farming strategies used to produce the crops). We select these countries to develop a thorough understanding of how our methodology will function under different settings. We evaluate a remotely sensed estimate of cultivated area and vegetation features of the DHS clusters using the 5–10 km buffer approach and our proposed neighboring settlement approach. We compare these to the true values through the use of the actual, confidential locations of settlements in the DHS sample.

Background

Scale in Geography

Scale in geography refers to the spatial size of objects or processes.
Geographers often find it helpful to distinguish between cartographic scale, analysis scale, and phenomenon scale (Montello 2001). Cartographic scale is not immediately relevant to this study. Analysis scale is the scale at which data are collected or used. Sometimes, the analysis scale is not controlled by the researcher, such as when economic statistics are produced nationally, or when the resolution of satellite imagery is fixed by the sensor. At other times, however, the scale is controllable by the researcher, such as when the DHS recommends that a 10 km buffer be placed around public geocodes. The only justification for choosing a 10 km buffer is that presumably it includes the true point and overlaps significantly with the 10 km buffer around the true point. But other buffer sizes could be chosen. Phenomenon scale refers to the size at which geographic structures or processes exist. For example, in the case of individuals in the DHS sample, their daily lives occur within a particular geographic scale. The size of the local environment that produces crops for food and cash for a village has a particular scale. The recent explosion of thinking and research on activity spaces is fundamentally about phenomenon scales (see Perchoux et al. 2013 and Zenk et al. 2011 and many others) and has been explored in less developed countries most often in ways that relate to geographic access to services (for some examples see Yao et al. 2013; Buor 2003; Tanser et al. 2006).

Geographic Scale Matters

We imagine sparsely settled rural environments in which the natural landscape is “lumpy.” That is, there are places which constitute the immediate environmental context of a village, and then there are the in-between places that do not strongly constitute the context of any village. If this is so, then it is reasonable to assume that the “lumps” formed by settlements and their environs are more similar to each other than they are to the in-between places.
Survey theory tacitly makes this assumption when it identifies villages as the primary sample unit; i.e., that villages are statistically exchangeable for one another. An alternative survey frame, which maintains that space is homogeneous, would just pick random spatial coordinates uniformly. But we know that geographic space is lumpy, and this motivates us to hypothesize that a buffer around a settlement, any settlement, is a better measure of local context than a buffer around a non-settlement. This, of course, depends on both the scale of processes in rural settlements and on the degree of lumpiness in the natural environment. When the buffers are small, we expect that there may be a large difference between the local context of settled and non-settled places (Figure 1). In contrast, as the buffers get larger, the difference between the context of settled and non-settled places becomes less apparent (Figure 2). Similarly, in places with relatively even rural development or cultivation, we expect to see less difference between the local context of settled and non-settled places.

Figure 1. Schematic of sampled and unsampled villages buffered by 2 km, and a public geocode point buffered by 2 km

Figure 2. The same villages as in Figure 1, except buffered by 10 km. NOTE: If 10 km is the appropriate scale, then the difference between large buffers is not as great as the difference between small buffers.

When using spatial population data like the DHS, these issues of scale obviously vary according to the level of development, infrastructure, and resources, and can have important implications for research design and analysis. For example, in a developing country where walking is the main mode of transit, intensive cultivation is much more likely within a 2 km radius than a 10 km radius.
If so, then we suggest that a 2 km buffer is a more appropriate measure of the relevant environmental context of a settlement than a 10 km buffer. In this study, we will not address the “correct” geographic scale, but we will explore whether the analysis scale matters (it does) and suggest an alternative approach for measuring environmental context.

Applications

Researchers in physical, social, and health sciences have increasingly focused on merging data, methods, and theory from across multiple disciplines. For those interested in any aspect of human-environment interaction, merging large-scale survey data (e.g., Demographic and Health Surveys (DHS) data or World Bank Living Standards Measurement Study (LSMS) data) with climate data or landscape data requires problem solving. One of the ongoing challenges relates to understanding the consequences of the displacement processes used with spatially referenced survey data. Strategies like those explored here suggest how remotely sensed data can be used with survey data to better contextualize and differentiate the communities where people live. This approach can help researchers explicitly incorporate context, culture, and spatial thinking into their analyses while guarding the confidentiality of the respondents (VanWey et al. 2005). While researchers are empirically and conceptually exploring strategies to merge survey data and physical/climate data to investigate a variety of outcomes (see, for example, Noor et al. 2009; Tatem et al. 2007; Tatem et al. 2012; Tanner et al. 2015; Nawrotzki and Bakhtsiyarava 2017; Brown et al. 2014; Shively et al. 2015), most research does not explore strategies for incorporating the spatial displacement (see Dorélien et al. 2013 for a discussion of related urban DHS issues). We aim to produce an alternative approach that accounts for the displacement of geocodes.
Data

DHS data

Our analysis is organized around the most recent spatially referenced DHS data for each of the three countries under study. In each country, we have restricted the sample to rural settlements since our interest here is in (natural) environmental context. Public DHS records contain a geocoded coordinate for each sampled settlement cluster; however, these coordinates are displaced up to 10 km prior to publication and do not identify an actual settlement. For our analyses, we have access to the true settlement location. We will calculate the vegetation measures (in this case, remote-sensing-based estimates of vegetation) for each of the “true” DHS spatial coordinates as well as the displaced (publicly available) coordinates.

Settlement Locations

We use Digital Globe (∼35 cm resolution; https://www.digitalglobe.com/products/digitalglobe-basemap) remotely sensed imagery to identify the location of settlements near the publicly released DHS cluster geocodes. This highly detailed imagery reveals landscape characteristics (namely clusters of dwellings, agricultural plots, and road networks) that signify the presence of human settlement. Recent imagery with low cloud cover, so that villages can be identified clearly, is used in the settlement identification. Specific details on the strategy are described in the Methods section below.

Normalized Difference Vegetation Index (NDVI)

We use Landsat 8 Normalized Difference Vegetation Index (NDVI) (USGS 2017) to estimate community-level food production for each spatially referenced DHS cluster. The Landsat 8 satellite was launched in 2013 and images the Earth's surface every 16 days at 30 m resolution. The range of NDVI is –1 to 1, where values closer to 1 indicate more greenness (or vegetation).
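For readers unfamiliar with the index, NDVI is conventionally computed per pixel as (NIR − Red) / (NIR + Red), which for Landsat 8 corresponds to band 5 (near-infrared) and band 4 (red). The snippet below is a minimal sketch of that standard formula; the function name and the sample reflectance values are illustrative assumptions, not values from the study.

```python
def ndvi(nir, red):
    """NDVI for one pixel from near-infrared and red reflectance.

    Returns NaN when both reflectances are zero (undefined ratio).
    """
    total = nir + red
    if total == 0:
        return float("nan")
    return (nir - red) / total


if __name__ == "__main__":
    # Hypothetical reflectances: vegetation reflects strongly in the NIR.
    print(ndvi(0.5, 0.1))   # dense vegetation, positive value
    print(ndvi(0.2, 0.2))   # equal reflectance gives 0.0
```

Dividing by the sum of the two bands is what bounds the index between –1 and 1 for non-negative reflectances.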
NDVI is a commonly used measure within the applied remote sensing and food security communities and by remote monitoring systems to measure the amount of vegetation growth, agricultural production, or to estimate food production (see Tucker 1979; Husak et al. 2008; Grace et al. 2017; Grace et al. 2014; Brown et al. 2014). It is an ideal variable to test our theory as agriculture reflects landscape characteristics and human response to, and engagement with, the environment. For the countries under study (like most developing countries), no other annual crop data exist, including annual maps of cropped area. NDVI provides just one of multiple variables that could be investigated in this way. The most important consideration is that the variable under study reflects the general way that people who are near to each other would engage with the environment. The spatial scale must also be relatively fine so that actual spatial variation within a specified buffer is present.

Famine Early Warning Systems Network (FEWS NET) Livelihood Zones

We also use the Famine Early Warning Systems Network's (FEWS NET) livelihood zones data. (FEWS NET is funded by USAID and is a leading provider of early warning and analysis on food insecurity; see http://fews.net.) These maps and associated reports are developed based on quantitative and qualitative information related to land use, food production, climate, economics, trade routes, and historical information. They provide insight into the strategies that people within a particular area generally use to procure food or income and are useful for contextualizing the role of rainfall in agricultural production. The maps and reports are freely available and have been used in a wide range of research (see Brown et al. 2014; Grace et al. 2014; Shively et al. 2015; Grace 2017).
Methods

Our primary goal is to determine correct ways to measure the environmental context of rural settlements in the Demographic and Health Surveys while also preserving the confidentiality of the individuals and communities selected. In this research, we are interested in the integration of DHS data with satellite imagery to determine the local environmental context of a community. We restrict our study to the most recent surveys from Burkina Faso (2010), Kenya (2014), and Tajikistan (2012) (ICF 2010; 2012; 2014). These countries were selected based on their representation of a diverse range of livelihood strategies (see FEWS NET livelihood zones), land use patterns, and quality (or lack thereof) of available geospatial census data.

Settlement Sample Selection

In each country, we identify settlements around DHS public geocodes using high-resolution (∼35 cm) satellite imagery from Digital Globe taken within (approximately) 1–3 years of the survey. We use the DHS rural/urban coding to select rural clusters. We then generate a random sample of DHS clusters from each country, draw a 5 km buffer around the public geocode point, and overlay the buffer on the Digital Globe imagery. Figure 3 displays the high-resolution imagery overlaid with rural DHS cluster buffers for Burkina Faso.

Figure 3. 5 km buffers around Burkina Faso 2010 DHS clusters

We then create a “fishnet” within each buffer to facilitate identification of the settlements. Multiple settlements could be identified within a buffer and we aimed for at least one settlement identification per buffer. In Figure 4, the fishnet, the DHS cluster, and the identified settlements are shown. Note that at this scale, the settlements are nearly impossible to identify and require the analyst to zoom in for a closer look (Figure 5).
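The fishnet step can be sketched as a grid of square cells laid over the 5 km buffer, which an interpreter then inspects cell by cell. The sketch below works in local kilometre offsets from the public geocode. The function name, the 1 km cell size, and the rule of keeping cells whose centers fall inside the buffer are all assumptions, since the source does not specify how the fishnet was constructed.

```python
import math


def fishnet_cells(radius_km=5.0, cell_km=1.0):
    """Return square cells (x_min, y_min, x_max, y_max), in km offsets from
    the public geocode, whose centers fall inside the circular buffer.

    Hypothetical helper: cell size and inclusion rule are assumptions.
    """
    cells = []
    n = int(math.ceil(radius_km / cell_km))
    for i in range(-n, n):
        for j in range(-n, n):
            x_min, y_min = i * cell_km, j * cell_km
            center_x = x_min + cell_km / 2.0
            center_y = y_min + cell_km / 2.0
            if math.hypot(center_x, center_y) <= radius_km:
                cells.append((x_min, y_min, x_min + cell_km, y_min + cell_km))
    return cells


if __name__ == "__main__":
    cells = fishnet_cells()
    print(len(cells), "cells to inspect within the 5 km buffer")
```

A finer cell size gives the interpreter smaller tiles to scan at the cost of more tiles per buffer.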
Figure 4. Fishnet over DHS public geocode (triangle) and identified settlements (circles), Burkina Faso

Figure 5. Identified settlement within 5 km of a DHS survey location: Tajikistan (top), Burkina Faso (bottom)

Culture, topography, politics, level of development, community infrastructure, and livelihood/economic systems may influence the ways that settlements form and households cluster. We looked for specific features, namely road/path networks and housing structures that seemed comparatively dense/populated enough to justify a settlement cluster. Figure 5 provides a zoomed-in view of an identified settlement in Burkina Faso and in Tajikistan.

Generation of alternative spatial coordinates

Our prime hypothesis is that it is more accurate to measure spatial context as centered around actual settlements (even the wrong settlements) than it is to measure it as centered on a random (non-settlement) location. To make this comparison, we identify three possible locations for the rural cluster: the true location (T), the current location (0) that is generated by randomly displacing the true location by 0–10 km, and the alternative location (1) that is the set of settlements within 5 km of the current location, as identified by manual interpretation. (The true-location data are accessible through the third, seventh, and eighth authors for use in methodological studies that look at ways to use the DHS data after the displacement confidentiality procedure.) To reduce interpreter burden, up to three locations were identified.

Calculating environmental context

From the Landsat 8 satellite imagery, we calculated the median NDVI around these locations at buffer sizes of 2 km, 5 km (to capture the activity space for the settlement and true locations), and 10 km (for the current location). Ten km is selected because that is the current recommendation given by DHS.
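As an illustration of the buffer calculation, the sketch below takes the median of all NDVI pixels whose centers lie within a given radius of a location. It assumes the NDVI scene is already available as a plain 2D list indexed by row and column and that the location has been converted to a pixel index. The function name and both assumptions are hypothetical. The 0.03 km default pixel size follows from the 30 m Landsat 8 resolution.

```python
import math
import statistics


def buffer_median(grid, center_row, center_col, radius_km, pixel_km=0.03):
    """Median of grid values whose pixel centers lie within radius_km of the
    center pixel. NaN pixels are skipped; returns NaN if nothing is found."""
    reach = int(math.ceil(radius_km / pixel_km))
    values = []
    row_start = max(0, center_row - reach)
    row_stop = min(len(grid), center_row + reach + 1)
    for r in range(row_start, row_stop):
        col_start = max(0, center_col - reach)
        col_stop = min(len(grid[r]), center_col + reach + 1)
        for c in range(col_start, col_stop):
            distance_km = math.hypot(r - center_row, c - center_col) * pixel_km
            value = grid[r][c]
            if distance_km <= radius_km and value == value:  # value == value is False for NaN
                values.append(value)
    return statistics.median(values) if values else float("nan")


if __name__ == "__main__":
    # Toy 5x5 scene with 1 km pixels and one bright outlier at the center.
    scene = [[0.2] * 5 for _ in range(5)]
    scene[2][2] = 0.9
    print(buffer_median(scene, 2, 2, radius_km=1.0, pixel_km=1.0))
```

In the toy scene the single outlying pixel does not move the median, which is the property that makes the median attractive when a few pixels in a buffer are contaminated.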
Two and five km are selected because they more closely represent the activity space of rural residents in these countries; however, this distance is not tested or calibrated here. Landsat 8 images a scene every 16 days. For each DHS cluster, we scanned the Landsat 8 archive to find all scenes acquired within August, September, or October, and with less than 5 percent cloud cover. In some instances, no image could be found, and we eliminated that point. When more than one image could be found, we selected the image with the lowest cloud cover. Apart from this, we did not filter out cloudy pixels from the scene. To moderate the effect of clouds, we used the median, rather than the mean, to characterize NDVI around a settlement. For the settlement sizes considered here, we do not believe that there will be any systematic bias between the cloudiness of settled and non-settled pixels (note that with remotely sensed imagery, smoke from cooking fires, for example, would not create the type of cloudiness that would exclude a pixel (Gao et al. 2006)). We did not search for images to correspond with the time of the DHS survey. Since this is a methodological study meant to look at data approaches, and we are not using the DHS survey information, this is not a problem. We are interested only in the question of whether the environmental context of a settlement in the DHS sample is different from the context around the public geocodes, and for this question it suffices to simply choose images around these points in any year. Problems may arise due to the gap between the Landsat 8 imagery and the Digital Globe imagery used to identify settlements if settlements are constructed or vacated during this gap; however, the chances of this occurring in two or three years are so low as to be ignorable. (We do note that under certain conditions entire villages have relocated due to flooding, for example. In future uses of this method, timing may determine the best source of remotely sensed imagery to use.)

Hypothesis Testing

In application, we aim to detect any difference between the context around the public point and the true point. Specifically, given the importance of scale and context, we compare 5 and 10 km buffers of the public point with the 2 and 5 km buffers of the true point. We aim to determine if the alternative solution we proposed is either statistically different from the value of the true location or from the DHS recommendation. In other words, does our suggested alternative improve upon the DHS recommendation? Five and ten km buffers are selected as they reflect the DHS recommendations (Perez-Heydrich et al. 2016). We test the following two hypotheses:

1. A 5 km buffer around the alternative (1) is closer to the true 5 km buffer (T) than is the DHS-recommended 10 km buffer around the public point (0): $|x_1^{5} - x_T^{5}| < |x_0^{10} - x_T^{5}|$

2. A 2 km buffer around the alternative (1) is closer to the true 2 km buffer (T) than is the finer DHS-recommended 5 km buffer around the public point (0): $|x_1^{2} - x_T^{2}| < |x_0^{5} - x_T^{2}|$

where x indicates the vegetation estimate, the subscript indicates the spatial location of the buffer center, and the superscript indicates the radius of the buffer in km. As a reminder, T indicates the true location, 0 is used to indicate the publicly available DHS geocodes, and 1 refers to our alternative settlements approach. In other words, we are comparing the differences in the vegetation values calculated for the settlements and the true locations to the differences calculated between the DHS public geocodes and the true locations.
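To make this comparison concrete, the sketch below computes the two absolute errors for each cluster and applies a paired permutation test to the median of their ratios, in the spirit of the procedure described in the following paragraphs. It is a minimal illustration under stated assumptions, not the authors' code. The function names are hypothetical; the ratio is taken as alternative error over public error, the test is one-sided toward small ratios, zero denominators are mapped to infinity, and the tail probability uses an add-one correction.

```python
import math
import random
import statistics


def error_ratio(err_alt, err_pub):
    """Ratio of alternative-location error to public-location error."""
    if err_pub > 0:
        return err_alt / err_pub
    # Assumption: a zero denominator counts as an infinitely large ratio,
    # unless both errors are zero, in which case the methods tie.
    return math.inf if err_alt > 0 else 1.0


def paired_permutation_test(x_true, x_alt, x_pub, n_perm=10_000, seed=0):
    """Return (observed median ratio, one-sided permutation p-value)."""
    rng = random.Random(seed)
    err_alt = [abs(a - t) for a, t in zip(x_alt, x_true)]
    err_pub = [abs(p - t) for p, t in zip(x_pub, x_true)]
    observed = statistics.median(
        error_ratio(a, p) for a, p in zip(err_alt, err_pub)
    )
    as_extreme = 0
    for _ in range(n_perm):
        ratios = []
        for a, p in zip(err_alt, err_pub):
            # Swap the two labels within a cluster with probability 1/2,
            # which preserves the pairing within clusters.
            if rng.random() < 0.5:
                a, p = p, a
            ratios.append(error_ratio(a, p))
        if statistics.median(ratios) <= observed:
            as_extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (as_extreme + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Synthetic NDVI values for illustration only; not data from the study.
    gen = random.Random(7)
    truth = [0.2 + 0.05 * gen.random() for _ in range(40)]
    alt = [t + gen.uniform(-0.004, 0.004) for t in truth]
    pub = [t + gen.uniform(-0.02, 0.02) for t in truth]
    ratio, p_value = paired_permutation_test(truth, alt, pub, n_perm=2000)
    print(ratio, p_value)
```

A median ratio below 1 with a small tail probability would indicate that the settlement-based buffer tracks the true value more closely than the public-point buffer.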
The null hypothesis states that these differences would be statistically equivalent, while the alternative hypothesis states that the difference between the settlement location and the true location would be statistically smaller than the difference between the public geocode and the true location. The difference between the two hypotheses is the buffer size. In hypothesis 1, we are using the larger buffer size recommended by DHS (10 km) and comparing it to a 5 km activity space. In hypothesis 2, we are using the smaller buffer size recommended by DHS (5 km) and comparing it to a 2 km activity space. This approach is consistent with our interest in determining if a smaller buffer around an incorrect settlement is a better estimate than the larger “catch-all” buffer. We use permutation tests to compare the error between the two methods. Compared to a t-test or F-test, a permutation test makes fewer assumptions about the distribution and allows more flexibility in the choice of a test statistic. As a test statistic, we choose the median of error ratios, $R = \mathrm{median}\left( |x_1^{5} - x_T^{5}| \,/\, |x_0^{10} - x_T^{5}| \right)$, using hypothesis 1 as an example. We choose the median because it is robust to the presence of small numbers in the denominator. The null hypothesis of no difference between methods is equivalent to the hypothesis that R = 1. To evaluate the significance of our sample ratio R, we permute the labels of $x_1$ and $x_0$ within each cluster (i.e., we preserve the pairing within clusters). Under the null hypothesis that the errors are equal, these labels are not meaningful, and they are therefore randomly assigned. We repeated this permutation 10,000 times, calculating the test statistic, R, under each permutation. Then, using the distribution of these permuted R statistics, we calculated the tail probability of the sample R value.

Results

Burkina Faso

For the first hypothesis, when comparing differences irrespective of livelihood zone, we reject the null hypothesis (p < 0.01).
This finding suggests that the difference in the vegetation values, when comparing the settlements and true locations, is statistically smaller than the difference between the DHS-recommended approach and the true locations. The second hypothesis, which compares estimates using smaller buffers to capture activity space, produced similar statistically significant findings (p = 0.03), again suggesting that the settlement approach produces less of a difference from the true locations than the DHS-recommended approach does. Table 1 provides the mean values, standard deviations, and sample sizes across the country and according to each livelihood zone.

Table 1. Means and standard deviations (in parentheses) of median NDVI calculated for the settlement location, for the true location, and for the public DHS cluster location, Burkina Faso

| Livelihood Zone | Settlements 2 km | Settlements 5 km | Public Points 5 km | Public Points 10 km | True Points 2 km | True Points 5 km | N |
|---|---|---|---|---|---|---|---|
| South tubers and cereals (Z1) | 0.263 (0.034) | 0.269 (0.032) | 0.267 (0.032) | 0.270 (0.030) | 0.269 (0.040) | 0.271 (0.035) | 11 |
| Southwest fruits, cotton, and cereals (Z2) | 0.301 (0.046) | 0.305 (0.034) | 0.304 (0.037) | 0.307 (0.031) | 0.297 (0.043) | 0.303 (0.036) | 19 |
| West cotton and cereals (Z3) | 0.241 (0.018) | 0.249 (0.014) | 0.246 (0.015) | 0.252 (0.018) | 0.241 (0.016) | 0.248 (0.013) | 14 |
| West cereals and remittances (Z4) | 0.232 (0.018) | 0.239 (0.015) | 0.238 (0.015) | 0.238 (0.014) | 0.233 (0.021) | 0.238 (0.016) | 11 |
| Central plateau cereals and market gardening (Z5) | 0.197 (0.021) | 0.203 (0.024) | 0.203 (0.024) | 0.204 (0.022) | 0.198 (0.023) | 0.202 (0.024) | 35 |
| North and east livestock and cereals (Z7) | 0.183 (0.014) | 0.180 (0.017) | 0.181 (0.016) | 0.179 (0.016) | 0.180 (0.016) | 0.179 (0.014) | 13 |
| Southeast cereals, livestock, forestry and fauna (Z9) | 0.213 (0.220) | 0.219 (0.020) | 0.219 (0.020) | 0.221 (0.018) | 0.212 (0.022) | 0.219 (0.018) | 21 |
| Overall | 0.225 (0.048) | 0.230 (0.047) | 0.229 (0.047) | 0.231 (0.047) | 0.225 (0.047) | 0.229 (0.047) | 130 |

Note: The sum over the livelihood zones does not equal the country-level sample size. Only livelihood zones with sample sizes greater than five are represented in the table.

Figure 6 presents the livelihood zone map and DHS locations for Burkina Faso, which guides the next portion of the analysis. We account for differences in landscapes using the livelihood zones, and then compare the values across livelihoods. For livelihood zones 2, 3, and 4, we reject the null hypothesis for hypothesis 1 (p = 0.02, 0.02, and 0.06, respectively) and we fail to reject the null hypothesis for the remaining zones. In the case of hypothesis 2, we reject the null for zones 1, 2, and 4 (p = 0.07, 0.05, and 0.07, respectively). We note that many of the livelihood zones in Burkina Faso have relatively small sample sizes, possibly affecting the statistical power.

Figure 6. Livelihood zones and publicly released DHS clusters used in analysis, Burkina Faso

Tajikistan

In the case of Tajikistan, our results suggest a statistically significant difference between the errors of the DHS-recommended approach and those of our alternative approach for hypothesis 1 but not for hypothesis 2 (p < 0.01 and p = 0.96, respectively). When comparing the values across the different types of locations and buffers while accounting for livelihood zones, a statistically significant difference is revealed for both hypotheses in some zones. The median NDVI values of the settlements using a 5 km buffer are closer to those of the true location than are the values from the public DHS geocodes with a 10 km buffer for some livelihood zones (see Figure 7). This result is not equivalent across all zones, however. In the cases of zones 6, 7, 8, and 12, the differences between the true and estimated NDVI values are statistically smaller (p = 0.02, 0.06, 0.02, and 0.08, respectively) when using settlements than when using the DHS geocodes.
For hypothesis 2, using the smaller buffers, the settlement approach used in zone 12 produces results that are closer to those of the true locations as compared to the DHS approach (p = 0.04). This result highlights the “place”-specific nature of environmental context as measured by NDVI. Possible explanations include different activity spaces under different land-use contexts or given different levels of wealth and development as captured by livelihood zone. Table 2 provides the mean values, standard deviations, and sample sizes overall and according to each livelihood zone.

Figure 7. Livelihood zones and publicly released DHS clusters used in analysis, Tajikistan

Table 2. Means and standard deviations (in parentheses) of NDVI calculated for the settlement location, for the DHS cluster location, and for the true location, Tajikistan

| Livelihood Zone | Settlement 2 km | Settlement 5 km | Public 5 km | Public 10 km | True Points 2 km | True Points 5 km | N |
|---|---|---|---|---|---|---|---|
| Southern Khatlon cotton, vegetables and wheat (Z6) | 0.272 (0.062) | 0.284 (0.068) | 0.287 (0.076) | 0.298 (0.084) | 0.276 (0.133) | 0.284 (0.061) | 26 |
| Khatlon rainfed wheat and livestock (Z7) | 0.256 (0.099) | 0.273 (0.121) | 0.280 (0.124) | 0.272 (0.122) | 0.272 (0.119) | 0.281 (0.122) | 11 |
| Central and eastern agro-industrial (Z8) | 0.130 (0.084) | 0.126 (0.083) | 0.117 (0.080) | 0.121 (0.079) | 0.115 (0.099) | 0.116 (0.081) | 59 |

Eas
