Khalil, George Magdy
Maximizing the utility of surveys without adding questions is of utmost importance to surveillance systems. Public health agencies need to keep the ever-decreasing number of participants from breaking off after an interview has started. A common reason participants break off is the length of the survey. It is therefore important that organizations conducting surveillance investigate innovative techniques for combining data from multiple, less extensive surveys. Data fusion is one such technique that has been used to integrate databases to save time and money. Health insurance status is a good topic for validating data fusion because this variable is common to many data sources and has a body of literature documenting factors associated with being insured. Beyond data availability, respondents are thought to be accurate in reporting health insurance status and type (Call et al., 2008a). The goal of this research was to create "statistical twins" based on health insurance status from two data sources. Matched respondents were considered "statistical twins" and were used to test whether data fusion is an effective method of predicting a variable not originally asked in the survey, given the respondent’s profile. Data from the Behavioral Risk Factor Surveillance System (BRFSS) and the National Health Interview Survey (NHIS) were matched by first harmonizing the variables from the two data sources. A propensity score was then calculated and used to perform Mahalanobis and Nearest Neighbor matching across the two surveys. The efficiency of the match was then validated. Of the 297,734 BRFSS respondents, 88.2% reported being covered by health insurance, while 83.0% of the 27,921 NHIS respondents reported currently being insured. Propensity scores were left-modal for both the NHIS and the BRFSS.
Quantile-Quantile (QQ) plots, which plot the quantiles of one data set against those of another, revealed that after the match the empirical distributions were similar in the BRFSS and NHIS groups. Of the matching algorithms, 2-to-1 Nearest Neighbor (NN) produced the imputed estimate closest to that of the original BRFSS respondents (86.2% [86.0, 86.5] versus 88.2% [88.1, 88.3], respectively). This is quite good considering that national estimates differ by a few percentage points from survey to survey. Our imputed estimates are not within the confidence interval of the BRFSS estimate. However, falling within the narrow BRFSS confidence interval may be too rigorous a standard because of the very large sample size of the BRFSS. Sensitivities and specificities reveal that 2-to-1 NN with replacement and Mahalanobis matching were more accurate than Nearest Neighbor methods with caliper, without replacement, and with 1-to-1 matching.
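
The workflow described above (propensity score, nearest neighbor matching, validation by sensitivity and specificity) can be illustrated with a minimal sketch. It is not the code used in this research. It assumes the propensity score is the modeled probability that a record belongs to the recipient survey rather than the donor survey, fits it with a plain gradient-descent logistic regression, and performs only 1-to-1 nearest neighbor matching with replacement. The toy covariates (age in decades, employment), all data values, and all function names are hypothetical.

```python
import math


def predict(w, x):
    """Propensity score: modeled P(record is from the recipient survey | x)."""
    lin = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-lin))


def fit_propensity(X, z, lr=0.1, epochs=2000):
    """Fit a logistic regression by batch gradient descent (illustrative only).

    X: harmonized covariate rows; z: 1 for recipient survey, 0 for donor survey.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)  # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, zi in zip(X, z):
            err = predict(w, xi) - zi
            grad[0] += err
            for j in range(p):
                grad[j + 1] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def match_with_replacement(recipient_scores, donor_scores):
    """For each recipient, return the index of the donor with the closest score.

    A donor may be reused for several recipients (matching with replacement).
    """
    return [
        min(range(len(donor_scores)), key=lambda d: abs(donor_scores[d] - s))
        for s in recipient_scores
    ]


def sensitivity_specificity(observed, imputed):
    """Compare imputed binary values against the observed (held-out) values."""
    pairs = list(zip(observed, imputed))
    tp = sum(1 for o, i in pairs if o == 1 and i == 1)
    fn = sum(1 for o, i in pairs if o == 1 and i == 0)
    tn = sum(1 for o, i in pairs if o == 0 and i == 0)
    fp = sum(1 for o, i in pairs if o == 0 and i == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical harmonized covariates: [age in decades, employed (0/1)]
survey_a = [[2.5, 1], [4.0, 1], [6.5, 0], [3.0, 0]]            # recipients
a_insured = [1, 1, 1, 0]          # observed status, held out for validation
survey_b = [[2.4, 1], [4.2, 1], [6.0, 0], [3.1, 0], [5.0, 1]]  # donors
b_insured = [1, 1, 1, 0, 1]

# Pool both surveys and model membership in the recipient survey
X = survey_a + survey_b
z = [1] * len(survey_a) + [0] * len(survey_b)
w = fit_propensity(X, z)

a_scores = [predict(w, x) for x in survey_a]
b_scores = [predict(w, x) for x in survey_b]

# Each recipient borrows the insurance status of its "statistical twin"
donors = match_with_replacement(a_scores, b_scores)
imputed = [b_insured[d] for d in donors]

sens, spec = sensitivity_specificity(a_insured, imputed)
print("sensitivity:", sens, "specificity:", spec)
```

Extending this sketch to 2-to-1 matching, calipers, matching without replacement, or Mahalanobis distance would require additional choices (for example, how to combine two donors' values) that the abstract does not specify.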