Department of Animal and Dairy Science, The University of Georgia, Athens, GA, USA

Department of Statistics, The University of Georgia, Athens, GA, USA

Institute of Bioinformatics, The University of Georgia, Athens, GA, USA

PCOM, Suwanee, Athens, GA, USA

Abstract

Background

Misclassification has been shown to have a high prevalence among binary responses in both livestock and human populations. Leaving these errors uncorrected before analyses negatively impacts the overall goals of genome-wide association studies (GWAS), including by reducing predictive power. A liability threshold model that accommodates misclassification was developed to assess the effects of misdiagnostic errors on GWAS. Four simulated scenarios of case–control datasets were generated. Each dataset consisted of 2000 individuals and was analyzed with varying odds ratios of the influential SNPs and misclassification rates of 5% and 10%.

Results

Analyses of binary responses subject to misclassification resulted in underestimation of the effects of influential SNPs and failed to recover the true magnitude and direction of those effects. Once the misclassification algorithm was applied, there was a 12% to 29% increase in accuracy and a substantial reduction in bias. The proposed method was able to capture the majority of the most significant SNPs that were not identified in the analysis of the misclassified data. In fact, in one of the simulation scenarios, 33% of the influential SNPs were not identified using the misclassified data, compared with the analysis using the data without misclassification. Using the proposed method, however, only 13% were not identified. Furthermore, the proposed method was able to identify, with high probability, a large portion of the truly misclassified observations.

Conclusions

The proposed model provides a statistical tool to correct, or at least attenuate, the negative effects of misclassified binary responses in GWAS. Across different levels of misclassification probability as well as odds ratios of significant SNPs, the model proved to be robust. In fact, SNP effects and the misclassification probability were accurately estimated, and the truly misclassified observations were identified with higher probabilities than the non-misclassified responses. This study was limited to situations where the misclassification probability was assumed to be the same in cases and controls, which is not always true based on real human disease data. Thus, it is of interest to evaluate the performance of the proposed model in that situation, which is the current focus of our research.

Background

Misclassification of dependent variables is a major issue in many areas of science that can arise when indirect markers are used to classify subjects or when continuous traits are treated as categorical.

Researchers have indicated that most approaches share a common assumption that disorders can be distinguished without error, which is seldom the case.

Unfortunately, finding these errors in clinical data is not trivial. Even in the best-case scenario, when well-founded suspicion exists about a sample, re-testing is often not possible, and the best that can be done is to remove the sample, leading to a reduction in power. Recently, several research groups have proposed methods to detect and account for such errors.

In supervised learning, if individuals are wrongly assigned to subclasses, false positives and erroneous effect estimates will result when these phenotypes are used to identify which markers or genes can distinguish between disease subclasses. Researchers carried out a study of misclassification using gene expression data with application to human breast cancer.

To overcome these issues, it would be advantageous to develop a statistical model that is able to account for misclassification in discrete responses. Several approaches have been proposed for handling misclassification; researchers have suggested Bayesian methods.

In 2001, a Bayesian approach was proposed for dealing with misclassified binary data.

Methods

Detecting discrete phenotype errors

Let **y** = (y_1, y_2, …, y_n)′ be a vector of binary responses observed for **r** = (r_1, r_2, …, r_n)′, where each r_i is the outcome of an independent Bernoulli trial with a success probability p_i specific to each response. Misclassification then occurs when some of the r_i become switched; this error is assumed to happen with probability π,
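To make the error mechanism concrete, the following sketch simulates Bernoulli responses and then switches a random subset with probability π; all numerical values (sample size, probabilities, seed) are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)

n, pi = 2000, 0.05                    # sample size and misclassification probability (illustrative)
p = rng.uniform(0.2, 0.8, size=n)     # success probability specific to each response
r = rng.binomial(1, p)                # true binary responses
alpha = rng.binomial(1, pi, size=n)   # indicators of which responses get switched
y = np.where(alpha == 1, 1 - r, r)    # observed (possibly miscoded) responses

# The observed data disagree with the truth exactly where alpha_i = 1
print(int((y != r).sum()) == int(alpha.sum()))
```

Note that without further modeling, nothing in **y** alone reveals which observations were switched; that is exactly what the indicators introduced below are meant to infer.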

with Var(r_i) = p_i(1 − p_i).

The success probability for each observation, p_i, is then modeled as a function of the unknown vector of parameters **β**, given by p_i = F(η_i(**β**)), where η_i(**β**) is a function of the vector of parameters **β** and F(·) is a cumulative distribution function (the standard normal under the liability model used here).

Let **α** = [α_1, α_2, …, α_n]′, where α_i is an indicator variable for observation i, with α_i = 1 if r_i is switched and α_i = 0 otherwise. Supposing each α_i is a Bernoulli trial with success probability π, the joint distribution of **α** and **r** given **β** and π is:

Furthermore, the true unobserved binary data could be written as a function of the observed contaminated binary responses and the vector **α** as:

Notice that when α_i = 0 (no switching), the formula in (2) reduces to r_i = y_i.
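Assuming the standard switching parameterization r_i = y_i(1 − α_i) + (1 − y_i)α_i for relationship (2) (a plausible form consistent with the reduction above), a quick numerical check covers both cases:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.5, size=20)        # observed (possibly contaminated) responses
alpha = rng.binomial(1, 0.2, size=20)    # misclassification indicators

# Relationship (2), as assumed here: recover r from y and alpha
r = y * (1 - alpha) + (1 - y) * alpha

assert (r[alpha == 0] == y[alpha == 0]).all()      # alpha_i = 0: r_i = y_i
assert (r[alpha == 1] == 1 - y[alpha == 1]).all()  # alpha_i = 1: r_i is y_i switched
print("both cases verified")
```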

Using the relationship in (2), the joint probability distribution of **α** and **y** given **β** and π becomes:

To finalize the Bayesian formulation, the following priors were assumed for the unknown parameters in the model:

where **β**_{min} and **β**_{max} were set to −100 and 100, respectively, thus conveying a very vague bounded prior. With real data, it is often the case that the number of SNPs is much larger than the number of observations. In such a scenario, an informative prior is needed to make the model identifiable, and often a normal prior is assumed.

The resulting joint posterior density of π, **β**, **α** is:

Implementation of the model in (4) could be facilitated greatly by using a data augmentation algorithm, as described by fellow researchers, that introduces a vector of unobserved liabilities **l** = (l_1, l_2, …, l_n)′ relating to the binary responses through the following relationship:

where

The model at the liability scale could be written as:

where x_{ij} is the genotype of individual i for SNP j, β_j is the effect of SNP j, and e_i is the residual term. To make the model in (4) identifiable, two restrictions are needed: it was assumed that the threshold T = 0 and that Var(e_i) = 1.
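The liability parameterization above can be mimicked in a small simulation; the dimensions, allele frequency, and effect sizes below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

n, m = 2000, 50                          # individuals and SNPs (illustrative)
X = rng.binomial(2, 0.3, size=(n, m))    # additive genotype codes 0/1/2
beta = np.zeros(m)
beta[:5] = rng.normal(0.0, 0.3, size=5)  # a handful of influential SNPs (assumed)

e = rng.standard_normal(n)               # residual with Var(e_i) = 1 (identifiability)
l = X @ beta + e                         # liabilities: l_i = sum_j x_ij * beta_j + e_i
r = (l > 0).astype(int)                  # binary response via the threshold T = 0

print(round(r.mean(), 2))                # proportion of cases
```

Fixing T = 0 and Var(e_i) = 1 removes the location and scale invariance of the latent liability, which is why both restrictions are needed for identifiability.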

At the liability scale and using the prior distributions specified in (3), the full conditional distributions needed for a Bayesian implementation of the model via the Gibbs sampler are in closed form, being normal for the location parameters. The full conditional distribution of each misclassification indicator α_i is:

where α_{−i} is the vector **α** without α_i.

For the misclassification probability, its conditional distribution is proportional to

Hence, π follows a Beta distribution, Beta(1 + ∑_i α_i, 1 + n − ∑_i α_i), where ∑_i α_i is the total number of misclassified (switched) observations.

Given

where n = 2000 is the number of data points.
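Given the conjugacy above, updating π within the sampler is a single Beta draw; this sketch assumes a uniform Beta(1, 1) prior on π (the exact prior constants in (3) may differ).

```python
import numpy as np

rng = np.random.default_rng(2)

n = 2000
alpha = rng.binomial(1, 0.05, size=n)  # current state of the misclassification indicators
s = int(alpha.sum())                   # number of switched observations

# Full conditional of pi under a Beta(1, 1) prior: Beta(1 + s, 1 + n - s)
pi_draw = rng.beta(1 + s, 1 + n - s)
print(0.0 < pi_draw < 1.0)
```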

For each element β_j in the vector **β**, the full conditional distribution is:

where x_j is the column vector of genotypes for SNP j, **X**_{−j} is the matrix **X** with the j-th column set to zero, and **β**_{−j} is the vector **β** without the j-th position.

For each element l_i in the liability vector **l**, the full conditional distribution is:

This is a truncated normal (TN) distribution, truncated to the left if r_i = 1 and to the right if r_i = 0 (Sorensen et al., 1995), where **l**_{−i} is the vector **l** without the i-th position.

In all simulation scenarios, the Gibbs sampler was run as a single chain of 50,000 iterations, of which the first 10,000 were discarded as the burn-in period. Convergence of the chain was assessed heuristically by inspecting the trace plots of the sampling process.
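Putting the full conditionals together, the whole sampling scheme can be sketched as below. This is an illustrative implementation, not the authors' code: it assumes a probit liability model, a flat prior on **β** (the bounded uniform in (3) behaves similarly away from ±100), a uniform prior on π, and it updates each α_i with the liabilities integrated out; `gibbs_misclassification` and all settings are hypothetical names and values.

```python
import numpy as np
from statistics import NormalDist

def gibbs_misclassification(y, X, n_iter=300, burn_in=100, seed=0):
    """Sketch of a Gibbs sampler for misclassified binary responses."""
    rng = np.random.default_rng(seed)
    nd = NormalDist()
    n, m = X.shape
    beta, pi = np.zeros(m), 0.05
    xtx_inv = np.linalg.inv(X.T @ X)
    chol = np.linalg.cholesky(xtx_inv)
    keep_beta, keep_pi = [], []

    for it in range(n_iter):
        eta = X @ beta
        p = np.array([nd.cdf(v) for v in eta])      # P(r_i = 1 | beta)

        # 1) indicators alpha_i (liabilities integrated out): Bernoulli posterior odds
        lik1 = np.where(y == 1, 1 - p, p)           # p(y_i | alpha_i = 1)
        lik0 = np.where(y == 1, p, 1 - p)           # p(y_i | alpha_i = 0)
        prob = pi * lik1 / (pi * lik1 + (1 - pi) * lik0)
        alpha = rng.binomial(1, prob)
        r = np.where(alpha == 1, 1 - y, y)          # current "true" responses

        # 2) liabilities: truncated normal, l_i > 0 if r_i = 1 and l_i < 0 if r_i = 0
        u = rng.uniform(size=n)
        lo = np.array([nd.cdf(-v) for v in eta])    # P(l_i < 0 | beta)
        c = np.clip(np.where(r == 1, lo + u * (1 - lo), u * lo), 1e-12, 1 - 1e-12)
        l = eta + np.array([nd.inv_cdf(ci) for ci in c])

        # 3) SNP effects: joint normal draw, residual variance fixed at 1
        beta = xtx_inv @ (X.T @ l) + chol @ rng.standard_normal(m)

        # 4) misclassification probability: Beta full conditional (uniform prior)
        s = int(alpha.sum())
        pi = rng.beta(1 + s, 1 + n - s)

        if it >= burn_in:
            keep_beta.append(beta.copy())
            keep_pi.append(pi)

    return np.mean(keep_beta, axis=0), float(np.mean(keep_pi))
```

In practice one would run far longer chains (the study used 50,000 iterations with a 10,000-iteration burn-in) and, with more SNPs than observations, replace the flat prior on **β** with an informative one so that X′X need not be inverted directly.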

Simulation

The datasets were simulated using the PLINK software.

Based on the OR distribution (moderate and extreme) and the level of misclassification (5 or 10%), four data sets were generated: 5% misclassification rate and moderate OR (D1); 5% misclassification and extreme OR (D2); 10% misclassification rate and moderate OR (D3); and 10% misclassification rate and extreme OR (D4). For each dataset, 10 replicates were generated.

To further test our proposed method, a more diverse and representative dataset was simulated using the basic simulation procedure previously indicated. For this second simulation, a dataset consisting of 1800 individuals (1200 controls and 600 cases) was genotyped for 40,000 linked SNPs assuming an additive model. Five hundred SNPs were assumed to be influential, with OR set equal to 1:4 (80 SNPs), 1:2 (120 SNPs), and 1:1.8 (300 SNPs). Only the 5% misclassification rate scenario was considered.
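A single SNP's contribution to case–control status under a given odds ratio can be simulated along these lines; `simulate_snp`, the baseline odds, and the allele frequency are illustrative assumptions, not PLINK's actual simulation routine.

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate_snp(n, maf, odds_ratio, baseline_odds=0.25):
    """Simulate one biallelic SNP and disease status under additive odds."""
    g = rng.binomial(2, maf, size=n)             # genotype = number of risk alleles
    odds = baseline_odds * odds_ratio ** g       # odds multiply by OR per risk allele
    status = rng.binomial(1, odds / (1 + odds))  # 1 = case, 0 = control
    return g, status

g, status = simulate_snp(5000, 0.3, 2.0)
# the risk allele should be enriched among cases
print(round(g[status == 1].mean(), 2), round(g[status == 0].mean(), 2))
```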

Results and discussion

For all simulation scenarios, the true misclassification probability was slightly underestimated. In fact, the posterior mean (averaged over 10 replicates) was 3% and 6% for D1 and D3, respectively. However, for moderate OR, the true misclassification probability values still lie within their respective HPD95% intervals, indicating the absence of systematic bias (Table 1).

| True | PM^{2} (Moderate^{1}) | HPD95%^{3} (Moderate) | PM (Extreme) | HPD95% (Extreme) |
|---|---|---|---|---|
| 5% | 0.03 | 0.01–0.05 | 0.04 | 0.03–0.06 |
| 10% | 0.06 | 0.04–0.09 | 0.07 | 0.06–0.09 |

^{1}Moderate effects for influential SNPs; ^{2}PM = posterior mean; ^{3}HPD95% = high posterior density 95% interval.

Table 2. Correlations between true^{1} and estimated SNP effects for M2 and M3, by misclassification rate and OR of influential SNPs

| Model | 5% Moderate^{2} | 5% Extreme | 10% Moderate | 10% Extreme |
|---|---|---|---|---|
| M2 | 0.828 | 0.664 | 0.714 | 0.558 |
| M3 | 0.925 | 0.843 | 0.864 | 0.815 |

^{1}True effects were calculated based on the analysis of the true data (M1); ^{2}Moderate effects for influential SNPs.

Using the dataset simulated under a more realistic scenario (imbalance between cases and controls, larger SNP panel), the results were similar in trend and magnitude to those observed using the first four simulations. In fact, the posterior mean of the misclassification probability was 0.04, and the true value (0.05) was well within the HPD95% interval. Furthermore, the correlations between the true and estimated SNP effects were 0.54 using M2 and 0.70 using M3. This 30% increase in accuracy using M3 indicates a substantial improvement when our proposed method is used. This is of special practical importance, as the superiority of the method was maintained with a dataset similar to those currently used in GWAS.

It is clear that across all simulation scenarios our proposed method (M3) showed superior performance. Accounting for misclassification in the model increases the predictive power by eliminating or at least by attenuating the negative effects caused by these errors, allowing for better estimates of the true SNP effects. This is essential in GWA studies for correctly estimating the proportion of variation in cause of disease associated with SNPs. Complex diseases which are under the control of several genes and genetic mechanisms are moderately to highly heritable

To further investigate the consequences of misclassification errors on estimating SNP effects, we observed the changes in magnitude and in the ranking of influential SNPs. As mentioned before, the benefit of GWAS lies in its ability to correctly detect polymorphisms associated with a disease. This is driven by how well the model can estimate SNP effects, so that the polymorphisms with significant associations will have the largest effects. Figures 1 and 2 show the distributions of estimated SNP effects under the 5% and 10% misclassification rates, respectively.


**Distribution of SNP effects for 5% misclassification rate.** The effects are sorted in decreasing order based on estimates using M1 when odds ratios of influential SNPs are moderate **(A)** and extreme **(B)**. M1: misclassification was not present in the data. M2: misclassification was present in the data set but was not addressed. M3: misclassification was addressed using the proposed method.


**Distribution of SNP effects for 10% misclassification rate.** The effects are sorted in decreasing order based on estimates using M1 when odds ratios of influential SNPs are moderate **(A)** and extreme **(B)**. M1: misclassification was not present in the data. M2: misclassification was present in the data set but was not addressed. M3: misclassification was addressed using the proposed method.

In addition to an inaccurate estimation of significant SNPs, M2 tends to report non-zero estimates for truly non-influential SNPs, especially under scenario D2, contrary to M1 and M3. For example, under scenario D1, 3 out of the 15 most influential SNPs (top 10%) were not identified by M2 (Table 3).

Table 3. Number of the 15 most influential SNPs identified by M2 and M3, by misclassification rate and OR of influential SNPs

| Model | 5% Moderate^{1} | 5% Extreme | 10% Moderate | 10% Extreme |
|---|---|---|---|---|
| M2 | 12 | 10 | 10 | 9 |
| M3 | 14 | 13 | 13 | 12 |

^{1}Moderate and extreme OR for influential SNPs.

To further evaluate the effectiveness of our proposed method, we looked at its ability to correctly identify misclassified observations. For that purpose, we calculated the posterior probability of misclassification of each observation in all four scenarios. Figures 3 and 4 show these probabilities for the 5% and 10% misclassification rates, respectively.


**Average posterior misclassification probability for the 113 miscoded observations (a: moderate and c: extreme) and the 1887 correctly coded observations (b: moderate and d: extreme) when the misclassification rate was set to 5%.**


**Average posterior misclassification probability for the 205 miscoded observations (a: moderate and c: extreme) and the 1795 correctly coded observations (b: moderate and d: extreme) when the misclassification rate was set to 10%.**

In a real data application, the miscoded observations will be unknown and a reliable cutoff probability is desired. Table 4 reports the proportions of miscoded and correctly coded individuals flagged as misclassified under two cutoff rules.

| Cutoff^{1} | D1 Misclass^{2} | D1 Correct | D2 Misclass | D2 Correct | D3 Misclass | D3 Correct | D4 Misclass | D4 Correct |
|---|---|---|---|---|---|---|---|---|
| Hard | 0.27 | 0 | 0.95 | 0 | 0.24 | 0 | 0.90 | 0 |
| Soft | 0.94 | 0 | 0.99 | 0 | 0.79 | 0 | 0.97 | 0 |

^{1}Hard: cutoff probability set at 0.5; Soft: cutoff probability equal to the overall mean of the probabilities of being misclassified over the entire dataset plus two standard deviations. ^{2}Misclass: individuals that were misclassified; Correct: correctly coded individuals.
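The two cutoff rules compared above can be expressed directly; the posterior probabilities below are synthetic stand-ins (a clean majority near zero plus a contaminated minority), not output from the actual model.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic posterior misclassification probabilities for 2000 observations
clean = rng.beta(1, 60, size=1900)     # correctly coded: probabilities near zero
miscoded = rng.beta(8, 4, size=100)    # miscoded: elevated probabilities
probs = np.concatenate([clean, miscoded])

hard = probs > 0.5                             # hard rule: fixed cutoff of 0.5
soft = probs > probs.mean() + 2 * probs.std()  # soft rule: mean + 2 SD

print(int(hard.sum()), int(soft.sum()))
```

Because the soft threshold adapts to the overall distribution, it sits below 0.5 whenever most probabilities are small, which is consistent with the soft rule flagging a larger share of the truly miscoded observations in every scenario.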

Conclusions

Misclassification of discrete responses has been shown to occur often in datasets and has proven difficult, and often expensive, to resolve before analyses are run. Ignoring misclassified observations increases the uncertainty of any significant associations that may be found, leading to inaccurate estimates of the effects of relevant genetic variants. The method proposed in this study was capable of identifying miscoded observations; in fact, these individuals were distinguished from the correctly coded set and were detected with higher probabilities over all four simulation scenarios. This is essential, as it shows the capability of our algorithm to maintain its superior performance across different levels of misclassification as well as different odds ratios of the influential SNPs.

More notably, our method was able to estimate SNP effects with higher accuracy compared to estimation using the “noisy” data. Running analyses on data that do not account for potential misclassification of binary responses, as with M2 in this study, will lead to non-replicable results as well as inaccurate estimation of the effects of polymorphisms that may be correlated with the disease of interest. This severely reduces the power of the study. For instance, it was determined that conducting a study on 5000 cases and 5000 controls with 20% of the samples misdiagnosed has power equivalent to that of only 64% of the actual sample size.

Abbreviations

SNP: Single nucleotide polymorphism; OR: Odds ratios; GWAS: Genome-wide association studies; PM: Posterior mean; HPD95%: High posterior density 95% interval.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

The first author (SS) has contributed to all phases of the study including data simulation, analysis, discussion of results and drafting. EHH helped with data analysis and drafting. NF participated in the development of the general idea of the study and drafting. RR has participated and supervised all phases of the project. All authors read and approved the final manuscript.

Acknowledgements

The first author was supported financially by the Graduate School and the Department of Animal and Dairy Science at the University of Georgia.