Geographic Information Science Center of Excellence, South Dakota State University, Brookings, SD, USA

Southeastern Cooperative Wildlife Disease Study, University of Georgia, Athens, GA, USA

Warnell School of Forestry and Natural Resources, University of Georgia, Athens, GA, USA

Abstract

Background

Disease maps are used increasingly in the health sciences, with applications ranging from the diagnosis of individual cases to regional and global assessments of public health. However, data on the distributions of emerging infectious diseases are often available from only a limited number of samples. We compared several spatial modelling approaches for predicting the geographic distributions of two tick-borne pathogens:

Results

Incorporating either spatial autocorrelation or spatial heterogeneity resulted in substantial improvements over the standard logistic regression model. For

Conclusion

Spatial autocorrelation can improve the accuracy of predictive disease risk models by incorporating spatial patterns as a proxy for unmeasured environmental variables and spatial processes. Spatial heterogeneity can also improve prediction accuracy by accounting for unique ecological conditions in different regions that affect the relative importance of environmental drivers on disease risk.

Background

Maps of disease risk have a broad spectrum of applications in the health sciences. Disease maps can aid the diagnosis of individual cases by providing information about the likelihood of exposure to specific infectious agents

One solution to this problem is to model disease risk as a function of one or more environmental variables. This approach is based on the assumption that the environment influences development and transmission of pathogens, habitats for disease vectors and hosts, or human exposure to pathogens. To be used in disease mapping, environmental data must be available as complete spatial coverages that allow model calibration at sites where disease data exist, and model-based predictions at other locations where disease data are unavailable. Climate is recognized as a major constraint on the geographic ranges of infectious diseases, and interpolated climate datasets have been used to predict the distributions of tick vectors in the United States

Spatial autocorrelation is an important statistical consideration in the development of predictive models of disease risk. Sites located close to one another tend to have similar disease risk because they share similar environments and are connected via communicable disease spread or vector and host dispersal. Ordinary least squares regression, generalized linear models, and other standard statistical modelling methods assume that any spatial pattern in the response variable can be entirely explained by the set of predictor variables, and that model residuals are independent and identically distributed

Despite these challenges, spatial autocorrelation also presents opportunities for improving model predictions when the association between disease risk and the available environmental data is weak. Put simply, if disease risk exhibits some degree of spatial clustering, a location surrounded by sites with high disease risk would be expected to have a high disease risk, and a location surrounded by sites with low disease risk would be expected to have a low disease risk. Spatial interpolation based on associations with neighbouring sites can be implemented using a variety of statistical techniques. A study of the tick-borne pathogen

Another consideration in developing disease risk models is the phenomenon of spatial heterogeneity

This study compared alternative methods for developing predictive maps of the geographic distributions of two tick-borne pathogens in the southern United States.

Although a variety of methods have been proposed for improving predictive spatial models by incorporating spatial autocorrelation or spatial heterogeneity into environmental models, there have been no comparative assessments of the accuracy that is gained by applying these more complex approaches in disease risk mapping. The main goal of this research was to determine whether incorporating spatial autocorrelation and spatial heterogeneity would improve environmental predictions of the geographic distributions of

Methods

Serology Data

Data on the county level distributions of

**Presence/absence of Ehrlichia chaffeensis and Anaplasma phagocytophilum in the southeastern United States based on serology of white-tailed deer herds**.

A descriptive analysis was carried out to quantify differences in the spatial patterns of these pathogens. Indicator semivariograms

Environmental Data

Environmental variables characterizing climate, land cover, and host populations within each county were obtained from a variety of sources. Climate variables were computed from 1-km Daymet grids, which summarized monthly minimum temperature, maximum temperature, and precipitation over the period 1980–97 … deer/km²; (3) 15–30 deer/km²; (4) 30–45 deer/km²; and (5) > 45 deer/km². The map was digitized, georeferenced and converted to a 1-km grid. Deer density was summarized for each county as the density index that characterized the majority of the county.
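The majority rule used to summarize deer density can be illustrated with a minimal sketch. It assumes that each county is represented by the list of density classes (1 to 5) of its 1-km grid cells; the function name, the tie-breaking rule, and the example values are hypothetical and are not taken from the study.

```python
from collections import Counter


def majority_class(cell_classes):
    """Return the most frequent density class among one county's grid cells.

    Ties are broken in favour of the lowest class index (an assumption;
    the text does not say how ties were handled).
    """
    if not cell_classes:
        raise ValueError("county has no grid cells")
    counts = Counter(cell_classes)
    top = max(counts.values())
    return min(c for c, n in counts.items() if n == top)


# Hypothetical county covered by seven 1-km cells.
county_cells = [3, 3, 4, 2, 3, 4, 3]
county_index = majority_class(county_cells)
```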

A set of predictor variables was previously chosen for each pathogen through a multi-model comparison exercise

**Predictor variables used to develop environmental models of Ehrlichia chaffeensis and Anaplasma phagocytophilum in the southeastern United States.**

Geographic zones were previously identified to characterize spatial heterogeneity in the influences of environmental variables on the distributions of

**Geographic zones of the southeastern United States used in the development of the local environmental models.** The zones were derived in a previous study using

Statistical Models

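We used a hierarchical Bayesian modelling approach to fit statistical models of pathogen presence/absence at the county level. We chose this technique because it allowed us to examine environmental correlates, spatial autocorrelation, and spatial heterogeneity in a consistent statistical framework. The binary response variable, $y_i$, denoting pathogen presence/absence in each county was assumed to follow a Bernoulli distribution

$$y_i \sim \operatorname{Bernoulli}(p_i)$$

where $p_i$ was the probability of pathogen presence in county $i$.

(1) The global environmental model expressed $p_i$ as a function of co-located environmental variables. In the global model, a single parameter was fitted to quantify the influence of each environmental variable across the entire study area.

$$\operatorname{logit}(p_i) = b_0 + \sum_j b_j x_{ij}$$

where $b_0$ was the intercept, $b_j$ were the parameters, and $x_{ij}$ were the environmental variables.

(2) The local environmental model also expressed $p_i$ as a function of co-located environmental variables. Multiple parameters were fitted for each environmental variable to account for spatial heterogeneity in environmental effects across geographic zones (Figure 3).

$$\operatorname{logit}(p_i) = b_{00} + \sum_j b_{j0} x_{ij} + \sum_{k=1}^{s-1} z_k \left( b_{0k} + \sum_j b_{jk} x_{ij} \right)$$

where $s$ was the number of geographic zones, $b_{00}$ was the intercept for the baseline zone, $b_{j0}$ were the parameters for the baseline zone, $z_k$ were indicator variables for the $s-1$ other zones, $b_{0k}$ were the deviations of the intercept in zone $k$ from $b_{00}$, and $b_{jk}$ were the deviations of the parameter for environmental variable $j$ in zone $k$ from $b_{j0}$. Zone 1 was used as the baseline zone in all models (Figure 3).

(3) The spatial autoregressive model expressed $p_i$ as a function of a spatial random effect alone, with no environmental variables.

$$\operatorname{logit}(p_i) = b_0 + \rho_i$$

where $\rho_i$ was a spatial random effect that was modelled as a conditionally autoregressive (CAR) process. These random effects adjusted the endemicity probability up or down depending on the values of $y_i$ in surrounding counties.

(4) The global environmental-autoregressive model combined the global environmental model with the spatial random effect.

$$\operatorname{logit}(p_i) = b_0 + \sum_j b_j x_{ij} + \rho_i$$

(5) The local environmental-autoregressive model combined the local environmental model with the spatial random effect.

$$\operatorname{logit}(p_i) = b_{00} + \sum_j b_{j0} x_{ij} + \sum_{k=1}^{s-1} z_k \left( b_{0k} + \sum_j b_{jk} x_{ij} \right) + \rho_i$$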
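How the model components combine on the logit scale can be sketched in a few lines. This is an illustration under stated assumptions rather than the fitted models: the function names, the zero-based zone index (0 for the baseline zone), and all coefficient values are hypothetical.

```python
import math


def inv_logit(eta):
    """Map a linear predictor on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def global_eta(b0, b, x):
    """Global environmental part: one coefficient per environmental variable."""
    return b0 + sum(bj * xj for bj, xj in zip(b, x))


def local_eta(b00, b_base, dev_intercepts, dev_slopes, x, zone):
    """Local environmental part: baseline coefficients plus zone deviations.

    zone == 0 is the baseline zone; zones 1..s-1 index the deviation lists
    (an indexing convention assumed for this sketch).
    """
    eta = b00 + sum(bj * xj for bj, xj in zip(b_base, x))
    if zone > 0:
        k = zone - 1
        eta += dev_intercepts[k] + sum(
            bjk * xj for bjk, xj in zip(dev_slopes[k], x)
        )
    return eta


# Hypothetical county with two standardized environmental variables.
x_i = [2.0, 1.0]
rho_i = 0.3  # hypothetical spatial random effect

p_global = inv_logit(global_eta(0.5, [1.0, -2.0], x_i))
p_spatial = inv_logit(0.5 + rho_i)  # intercept plus random effect only
p_local_car = inv_logit(
    local_eta(0.5, [1.0, -2.0], [0.25], [[0.5, 0.0]], x_i, zone=1) + rho_i
)
```

Setting the random effect to zero recovers the purely environmental models, and dropping the environmental terms recovers the purely spatial model, which is why the five models can be compared within one framework.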

Models were fitted via Markov chain Monte Carlo (MCMC) simulation using WinBUGS software. Vague priors were assigned to the regression parameters ($b_j \sim b_{jk} \sim N(0, 10^{6})$). The spatial random effect was modelled as a conditional autoregressive (CAR) process in which each spatial effect had a Gaussian distribution centred on the mean of the neighbouring values.

$$\rho_i \mid \rho_{j \neq i} \sim N\left( \frac{\sum_j w_{ij} \rho_j}{\sum_j w_{ij}}, \frac{\sigma^2_\rho}{\sum_j w_{ij}} \right)$$

where $w_{ij}$ were the neighbourhood weights and $\sigma^2_\rho$ was a hyperparameter specifying the prior variance of the spatial random effects. The $w_{ij}$ were specified based on a queen's adjacency rule, in which counties sharing a common boundary were considered neighbours. In spatial Bayesian models, a hyperprior for $1/\sigma^2_\rho$ is commonly specified as a gamma distribution such as $\Gamma(0.001, 0.001)$ or $\Gamma(0.5, 0.0005)$ … $\sigma^2_\rho$, which has been suggested as one alternative to the gamma distribution … $\sigma^2_\rho \sim$ … $\sigma^2_\rho$ was always positive. Sensitivity analyses using alternative specifications of $\sigma^2_\rho \sim$ … $\sigma^2_\rho \sim$ … $\sigma^2_\rho$. Flat priors were used for the intercepts $b_0$ and $b_{00}$.
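The conditional structure of a CAR random effect can be made concrete with a short sketch. It assumes binary adjacency weights and the common intrinsic CAR form, in which the conditional variance is the variance hyperparameter divided by the sum of a county's weights; that scaling, the function name, and the four-county example are assumptions made for illustration.

```python
def car_conditional(rho, weights, sigma2, i):
    """Conditional mean and variance of rho[i] given all other effects.

    weights[i][j] is 1 if counties i and j are neighbours and 0 otherwise.
    """
    w_sum = sum(w for j, w in enumerate(weights[i]) if j != i)
    if w_sum == 0:
        raise ValueError("county has no neighbours")
    mean = sum(w * rho[j] for j, w in enumerate(weights[i]) if j != i) / w_sum
    return mean, sigma2 / w_sum


# Four hypothetical counties arranged in a row: 0 - 1 - 2 - 3.
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
rho = [1.0, 0.0, -1.0, 2.0]

mean_1, var_1 = car_conditional(rho, weights, sigma2=2.0, i=1)
mean_3, var_3 = car_conditional(rho, weights, sigma2=2.0, i=3)
```

Under this form, a county with more neighbours has a smaller conditional variance, so its random effect is pulled more strongly towards the local mean.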

The data used to fit the models and generate predictions included $y_i$ values for the counties with serology data, along with predictor values for all counties. Convergence of the MCMC algorithm was evaluated through visual examination of the trace plots and through Brooks-Gelman-Rubin diagnostic plots … $p_i$ across the 20,000 steps.
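The idea behind this kind of convergence diagnostic can be sketched with the basic variance-based potential scale reduction factor. This is a simplified illustration; the diagnostic plots produced by WinBUGS may be computed differently, and the simulated chains below are hypothetical stand-ins for MCMC output.

```python
import random
import statistics


def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length chains of one parameter."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


rng = random.Random(1)
# Two hypothetical chains sampling the same distribution (converged case).
same = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(2)]
# The same chains with one shifted, mimicking chains stuck in different regions.
shifted = [same[0], [v + 5.0 for v in same[1]]]

r_converged = gelman_rubin(same)
r_not_converged = gelman_rubin(shifted)

# A posterior summary such as the mean of a monitored quantity is simply
# the average over the retained MCMC steps.
posterior_mean = statistics.fmean(same[0] + same[1])
```

Values close to 1 indicate that the chains are sampling the same distribution, whereas values well above 1 suggest that more iterations are needed.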

Model Evaluation

Cross-validation was used to compare model performance at predicting pathogen presence in unsampled counties. The 567 counties with serological data were randomly split into four subsets of approximately equal size, and four WinBUGS runs were carried out for each model. In each of these runs, one of the four subsets was set aside for model evaluation, and the remaining three subsets were used to fit the model.
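The four-fold partition described above can be sketched as follows. The county identifiers, the random seed, and the commented fitting step are hypothetical; only the number of counties (567) and the number of subsets (four) come from the text.

```python
import random


def make_folds(county_ids, n_folds=4, seed=0):
    """Randomly split county identifiers into folds of near-equal size."""
    ids = list(county_ids)
    random.Random(seed).shuffle(ids)
    return [ids[k::n_folds] for k in range(n_folds)]


folds = make_folds(range(567))

splits = []
for k, held_out in enumerate(folds):
    # Counties in the other three folds are used to fit the model.
    training = [c for j, fold in enumerate(folds) if j != k for c in fold]
    splits.append((training, held_out))
    # A model would be fitted to `training` and evaluated on `held_out` here.
```

Each county appears in exactly one held-out subset, so every county with serology data receives one out-of-sample prediction per model.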

The predictive capabilities of the models were evaluated by computing the area under the receiver operating characteristic curve (AUC) for each model. The receiver operating characteristic curve describes the relationship between the true positive rate and the false positive rate using a range of thresholds to classify pathogen presence and absence based on $p_i$. The AUC can be interpreted as the probability that a randomly selected county where the pathogen is present will have a higher $p_i$ value than a randomly selected county where the pathogen is absent. We also selected an optimal classification threshold for each model by computing classification accuracy (percent of counties correctly classified) for a range of thresholds and choosing the threshold with the highest accuracy. Sensitivity (percent of positive counties correctly classified) and specificity (percent of negative counties correctly classified) were also computed using this optimal threshold.
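These evaluation measures can be sketched in plain Python. The AUC is computed here through its pairwise-comparison interpretation, with ties counted as one half; the function names, predicted probabilities, labels, and candidate thresholds are hypothetical.

```python
def auc(probs, labels):
    """Probability that a random positive county outscores a random negative one."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def classify_stats(probs, labels, threshold):
    """Return (accuracy, sensitivity, specificity) at a given threshold."""
    pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for yhat, y in zip(pred, labels) if yhat == 1 and y == 1)
    tn = sum(1 for yhat, y in zip(pred, labels) if yhat == 0 and y == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (tp + tn) / len(labels), tp / n_pos, tn / n_neg


def best_threshold(probs, labels, candidates):
    """Candidate threshold with the highest classification accuracy."""
    return max(candidates, key=lambda t: classify_stats(probs, labels, t)[0])


# Hypothetical held-out predictions and observed presence/absence.
probs = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]

model_auc = auc(probs, labels)
threshold = best_threshold(probs, labels, [0.25, 0.35, 0.5])
accuracy, sensitivity, specificity = classify_stats(probs, labels, threshold)
```

Because the threshold is chosen to maximize overall accuracy, sensitivity and specificity at that threshold can be quite unequal when presence and absence are not equally common.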

Maps of the predicted distributions of each pathogen were generated by plotting the spatial distribution of predicted $p_i$ values for each of the five models. To generate these maps, models were fitted using pathogen data from all 567 counties with serology data to utilize all the available information and generate the best possible spatial predictions. Pathogen presence/absence data from the serology database were overlaid on the predicted endemicity probabilities to visually assess spatial patterns of prediction accuracy for the various models.

Results

Semivariograms were computed for each pathogen using a bin width of 75 km (Figure

**Indicator semivariograms (1 = present, 0 = absent) of the geographic distributions of Ehrlichia chaffeensis and Anaplasma phagocytophilum.**
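An empirical indicator semivariogram of this kind can be computed with a short routine. The sketch assumes that each county is represented by a point (for example a centroid) with projected coordinates in kilometres; the function name, the number of bins, and the example points are hypothetical, and only the 75 km bin width comes from the text.

```python
import math


def indicator_semivariogram(points, bin_width=75.0, n_bins=4):
    """Empirical semivariance per distance bin.

    points: list of (x_km, y_km, indicator), with indicator 1 = present
    and 0 = absent. Bins without any pairs are returned as None.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for a in range(len(points)):
        xa, ya, za = points[a]
        for b in range(a + 1, len(points)):
            xb, yb, zb = points[b]
            h = math.hypot(xa - xb, ya - yb)
            k = int(h // bin_width)
            if k < n_bins:
                sums[k] += (za - zb) ** 2
                counts[k] += 1
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]


# Four hypothetical counties along a transect, 50 km apart.
points = [(0.0, 0.0, 1), (50.0, 0.0, 1), (100.0, 0.0, 0), (150.0, 0.0, 0)]
gamma = indicator_semivariogram(points)
```

Semivariance that rises with distance, as in this toy transect, is the signature of spatial clustering in presence and absence.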

Parameters for exponential models fitted to indicator semivariograms of the distributions of two tick-borne pathogens

| Pathogen | Range | Nugget ($c_0$) | Partial sill ($c_1$) | Normalized sill |
|---|---|---|---|---|
| Ehrlichia chaffeensis | 128.7 km | 0.0767 | 0.140 | 0.646 |
| Anaplasma phagocytophilum | 137.3 km | 0.151 | 0.102 | 0.402 |

Total sill is the maximum semivariance, $c_0 + c_1$.

Normalized sill is the ratio of the partial sill to the total sill, $c_1/(c_0 + c_1)$.
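The quantities in this table can be related to one another with a small sketch. It assumes one common parameterization of the exponential semivariogram, in which semivariance approaches the total sill as exp(-h/range); some software instead uses exp(-3h/range), and the text does not state which convention was applied. The function names are hypothetical, and the numbers are taken from the first row of the table.

```python
import math


def exponential_semivariogram(h, nugget, partial_sill, range_km):
    """Exponential model: nugget plus partial sill approached asymptotically."""
    if h == 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-h / range_km))


def normalized_sill(nugget, partial_sill):
    """Share of the total sill that is spatially structured."""
    return partial_sill / (nugget + partial_sill)


# Parameters from the first row of the table.
nugget, partial_sill, range_km = 0.0767, 0.140, 128.7

ratio = normalized_sill(nugget, partial_sill)
gamma_near = exponential_semivariogram(50.0, nugget, partial_sill, range_km)
gamma_far = exponential_semivariogram(1000.0, nugget, partial_sill, range_km)
```

For the first row this reproduces the tabulated normalized sill of 0.646. Applying the same ratio to the second row gives approximately 0.403 rather than 0.402, a difference that presumably reflects rounding of the published parameters.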

Both pathogens had positive relationships with temperature, humidity, and forest cover and negative relationships with precipitation. In addition,

Parameter Estimates. Mean posterior parameter values from the Bayesian hierarchical models with 2.5% and 97.5% Bayesian credible intervals.

For

Predictive accuracy of five statistical models for the distribution of Ehrlichia chaffeensis

| Model | AUC | Accuracy | Sensitivity | Specificity | Threshold |
|---|---|---|---|---|---|
| Global environmental | 0.745 | 0.776 | 0.905 | 0.497 | 0.555 |
| Local environmental | 0.801 | 0.801 | 0.905 | 0.575 | 0.550 |
| Spatial autoregressive | 0.838 | 0.822 | 0.948 | 0.547 | 0.510 |
| Global environmental-autoregressive | 0.833 | 0.818 | 0.954 | 0.525 | 0.480 |
| Local environmental-autoregressive | 0.829 | 0.824 | 0.961 | 0.525 | 0.417 |

For

Predictive accuracy of five statistical models for the distribution of Anaplasma phagocytophilum

| Model | AUC | Accuracy | Sensitivity | Specificity | Threshold |
|---|---|---|---|---|---|
| Global environmental | 0.700 | 0.658 | 0.592 | 0.721 | 0.504 |
| Local environmental | 0.756 | 0.700 | 0.567 | 0.828 | 0.611 |
| Spatial autoregressive | 0.748 | 0.679 | 0.776 | 0.586 | 0.456 |
| Global environmental-autoregressive | 0.765 | 0.704 | 0.570 | 0.831 | 0.581 |
| Local environmental-autoregressive | 0.777 | 0.713 | 0.621 | 0.800 | 0.564 |

Spatial patterns of predicted endemicity probabilities differed among the models. For

**Predicted endemicity probabilities for Ehrlichia chaffeensis in the southeastern United States obtained from five Bayesian hierarchical models.**

For

**Predicted endemicity probabilities for Anaplasma phagocytophilum in the southeastern United States obtained from five Bayesian hierarchical models.**

Discussion

Predicted endemicity probabilities based on environmental variables reflect the ecology of the tick vectors and mammalian host communities. Development rates of larval, nymph, and adult ticks increase with temperature

In addition to suitable microhabitats, ticks require sufficient populations of mammalian hosts for blood meals. These hosts may also serve as reservoirs for tick-borne pathogens, allowing their transmission to the next generation of uninfected ticks. Because white-tailed deer are hosts for all three life-stages of

The ecological differences between

For both

In contrast to

Besides improving prediction accuracy, spatial heterogeneity can also provide insights into the underlying ecological processes controlling the distributions of zoonotic pathogens. Spatial variability in environmental relationships may reflect genetic variability in pathogens, vectors, or hosts that leads to dominance by different genotypes in different areas

A challenge in developing spatially heterogeneous models such as the ones used in this study is the need to specify geographic zones for the local analysis. One approach is to use an existing ecological stratification such as the ecoregion maps developed by the U.S. Environmental Protection Agency

Conclusion

Predictive modelling of disease risk can be enhanced using spatially explicit methods that account for either spatial autocorrelation (the tendency for pathogen distributions to be clustered in space) or spatial heterogeneity (the potential for environmental influences on pathogens to vary predictably in space). However, the modelling approach that is most effective will depend on the ecology of the underlying zoonotic cycle and the spatial pattern of the resulting pathogen distributions. For pathogens such as

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

MCW designed the study, carried out the statistical analysis, and was the lead writer of the manuscript. ADB was responsible for development and management of the geospatial databases and contributed to the writing of the manuscript. MJY contributed to the development of the study, the interpretation of the statistical results, and the writing of the manuscript.

Acknowledgements

This study was supported by the National Institutes of Health, National Institute of Allergy and Infectious Diseases (grant 1 R03 AI062944-01 to MCW).