Department of Statistics, University of Georgia, Athens, GA 30602, USA

Computational Biology Service Unit, Cornell University, Ithaca, NY 14853, USA

Abstract

Background

Data on single-nucleotide polymorphisms (SNPs) have been found to be useful in predicting phenotypes ranging from an individual’s class membership to his/her risk of developing a disease. In multi-class classification scenarios, clinical samples are often limited due to cost constraints, making it necessary to determine the sample size needed to build an accurate classifier based on SNPs. The performance of such classifiers can be assessed using the Area Under the Receiver Operating Characteristic (

Results

For coded SNP data from

Conclusion

For multi-class problems, we have developed a sample size determination methodology and illustrated its usefulness in obtaining a required sample size from the estimated learning curve. For classification scenarios, this methodology will help scientists determine whether the sample at hand is adequate or whether more samples are required to achieve a pre-specified accuracy. A PDF manual for R package “SampleSizeSNP” is given in Additional file

Background

Data on single-nucleotide polymorphisms (SNPs) have been found to be useful in predicting an individual’s class membership or his/her response to a drug, susceptibility to environmental factors such as toxins, and the risk of developing a particular disease, among others

Recently Liu

While Liu

This article revisits the problem of sample size determination in classification scenarios involving coded SNP data, but uses the

R software was used to carry out all the computations. A PDF manual for R package “SampleSizeSNP” is given in Additional file

**Manual of R package “SampleSizeSNP”.**

**R package “SampleSizeSNP” in ZIP file.**

Methods

Assumptions

Suppose there are _{1},…,_{D}, consisting of _{1},…,_{D} subjects, respectively. For each subject, we observe a _{j}=0,1,2, which denotes the number of minor alleles in the genotype “aa”, “Aa” and “AA”, respectively. Some of the SNPs may be highly correlated, in which case we choose one SNP to represent each set of highly correlated ones. For classification and sample size determination, we make the following assumptions:

1. For an

2. For each _{j}) belonging to class

where _{k,j} is the minor allele frequency at locus _{k,j}∈(0.01,0.5). Here, _{k,j}<0.5 because it is the minor allele frequency, and _{k,j}>0.01 ensures that the polymorphism is not a _{k}.

3. There is a percentage
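The genotype coding and the minor allele frequency range described above can be illustrated with a minimal Python sketch. It assumes, for illustration only, that the two alleles at each locus are drawn independently (Hardy-Weinberg proportions) and that loci are independent; the function names `code_genotype` and `simulate_coded_snps` are hypothetical and are not part of the “SampleSizeSNP” package.

```python
import random


def code_genotype(genotype: str) -> int:
    """Return the number of minor alleles ('A') in 'aa', 'Aa' or 'AA'."""
    if len(genotype) != 2 or any(allele not in "Aa" for allele in genotype):
        raise ValueError(f"unexpected genotype: {genotype!r}")
    return genotype.count("A")


def simulate_coded_snps(maf, n_subjects, rng):
    """Simulate coded SNP vectors for one class.

    maf        : minor allele frequencies, one per locus, each in (0.01, 0.5)
    n_subjects : number of subjects to simulate
    rng        : a random.Random instance

    Assumption (illustrative): each coded SNP is the sum of two
    independent Bernoulli(maf_j) allele draws, and loci are independent.
    """
    for p in maf:
        if not 0.01 < p < 0.5:
            raise ValueError("minor allele frequency must lie in (0.01, 0.5)")
    return [
        [int(rng.random() < p) + int(rng.random() < p) for p in maf]
        for _ in range(n_subjects)
    ]


if __name__ == "__main__":
    print([code_genotype(g) for g in ("aa", "Aa", "AA")])
    rng = random.Random(2013)  # fixed seed so the run is reproducible
    for row in simulate_coded_snps([0.05, 0.20, 0.45], n_subjects=4, rng=rng):
        print(row)
```

Each simulated row is one subject's vector of coded SNPs, with every entry in {0, 1, 2}.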

The optimal classifier and its **PCC**

By the assumptions above, the conditional mass function of _{k},

Suppose _{k} given

For any fixed _{k} if

for all ^{′}≠_{k} if

for all ^{′}≠

Then, the

In Additional file

**Appendix 1–5.**

where **ϕ** is the (

In Additional file

A linear classifier and its PCC

Motivated by the form of the optimal Bayes classifier in (2), we consider the following linear classifier that classifies _{k} if

for all ^{′}≠_{j,n}(^{′}) in (5) are determined in the following way: For each ^{′}≠_{j,n}(^{′})=1 if _{j,n}(^{′})=0. In Additional file

In Additional file

Note that

**AUC** and

For any (^{′}), define

Then, for the optimal Bayes classifier in (2) we have from (4) that

and similarly, for the linear classifier in (5), we have from (6) that

for _{2,2} vs. (1−_{1,1}). Then, the

However, when the number of classes

By replacing _{k,k} by

Computation of **VUS**

As is evident from (9), the computation of _{k}=_{k,k}, _{k}’s are positive.

To find as many ^{k}, which also contributes to the integration. We use the

Now, to compute the volume, _{old}−_{new}|<0.001. When this criterion is satisfied, we obtain the value of

Sample size determination using **VUS** or

Given a threshold

For the case

Results

Monte Carlo simulations

Before we illustrate the performance of our sample size determination method based on _{k}=

When _{1,j}∼_{i,j}∼

**D = 3**

| **h** | **m** | **n** | | | | |
|---|---|---|---|---|---|---|
| 0.02 | 50 | 50 | 0.3013 | 0.1772 | 0.1657 | -0.0116 |
| 0.02 | 50 | 100 | 0.3015 | 0.1793 | 0.1742 | -0.0052 |
| 0.02 | 100 | 50 | 0.3662 | 0.1807 | 0.1874 | 0.0067 |
| 0.02 | 100 | 100 | 0.366 | 0.1837 | 0.1974 | 0.0136 |
| 0.05 | 50 | 50 | 0.5469 | 0.2229 | 0.2442 | 0.0213 |
| 0.05 | 50 | 100 | 0.5467 | 0.2517 | 0.2845 | 0.0328 |
| 0.05 | 100 | 50 | 0.6988 | 0.2448 | 0.2912 | 0.0463 |
| 0.05 | 100 | 100 | 0.6987 | 0.2848 | 0.3377 | 0.0529 |
| 0.1 | 50 | 50 | 0.8686 | 0.4179 | 0.4675 | 0.0496 |
| 0.1 | 50 | 100 | 0.8687 | 0.4958 | 0.55 | 0.0542 |
| 0.1 | 100 | 50 | 0.9667 | 0.4776 | 0.5342 | 0.0566 |
| 0.1 | 100 | 100 | 0.9667 | 0.5692 | 0.6341 | 0.0649 |

Here, _{1,j}∼_{i,j}∼

**D = 4**

| **h** | **m** | **n** | | | | |
|---|---|---|---|---|---|---|
| 0.02 | 50 | 50 | 0.1319 | 0.048 | 0.0462 | -0.0018 |
| 0.02 | 50 | 100 | 0.1318 | 0.05 | 0.0512 | 0.0013 |
| 0.02 | 100 | 50 | 0.1892 | 0.0503 | 0.057 | 0.0068 |
| 0.02 | 100 | 100 | 0.189 | 0.0531 | 0.0614 | 0.0082 |
| 0.05 | 50 | 50 | 0.3891 | 0.0893 | 0.0923 | 0.003 |
| 0.05 | 50 | 100 | 0.3893 | 0.1175 | 0.1144 | -0.0032 |
| 0.05 | 100 | 50 | 0.5832 | 0.1092 | 0.1127 | 0.0034 |
| 0.05 | 100 | 100 | 0.5831 | 0.1458 | 0.1285 | -0.0174 |
| 0.1 | 50 | 50 | 0.8376 | 0.2933 | 0.2705 | -0.0228 |
| 0.1 | 50 | 100 | 0.8378 | 0.4059 | 0.3517 | -0.0542 |
| 0.1 | 100 | 50 | 0.9623 | 0.3653 | 0.3119 | -0.0534 |
| 0.1 | 100 | 100 | 0.9626 | 0.4962 | 0.4085 | -0.0877 |

Next, we determine the smallest _{S} and _{L} such that _{S})>0 and _{L})<0, and set _{M}=[(_{S}+_{L})/2]. The algorithm begins by selecting a small _{S} and a large _{L}; (ii) If _{M})_{S})<0, then reset _{L}=_{M}; or else, reset _{S}=_{M}. In either case, return to step (i), unless _{L}−_{S}≤1, in which case, the smallest sample _{L}; (iii) Use the smallest (total) sample of size _{L}, with _{L} from each class, _{1},…,_{D}. We implemented this algorithm for each value of
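The bisection search described above can be sketched in Python as follows. This is a minimal illustration, not the “SampleSizeSNP” implementation: the function `gap` is a hypothetical stand-in for the quantity whose sign changes between the small and the large sample size (assumed here to be positive when the sample is too small, negative when it is large enough, and monotone in between), and `toy_gap` is an invented learning curve used only to exercise the search.

```python
def smallest_sample_size(gap, n_small, n_large):
    """Integer bisection for the smallest n with gap(n) < 0.

    gap     : callable taking an integer sample size (hypothetical here)
    n_small : starting size with gap(n_small) > 0 (sample too small)
    n_large : starting size with gap(n_large) < 0 (sample large enough)
    """
    if not (gap(n_small) > 0 and gap(n_large) < 0):
        raise ValueError("need gap(n_small) > 0 and gap(n_large) < 0")
    # Stop once the bracket has width at most 1; n_large is then the answer.
    while n_large - n_small > 1:
        n_mid = (n_small + n_large) // 2  # integer part of the midpoint
        # Because gap(n_small) > 0 is maintained, testing gap(n_mid) < 0
        # matches the sign test on the product gap(n_mid) * gap(n_small).
        if gap(n_mid) < 0:
            n_large = n_mid
        else:
            n_small = n_mid
    return n_large


def toy_gap(n):
    """Invented example: threshold 0.80 minus a curve rising toward 0.95."""
    return 0.80 - (0.95 - 3.1 / n)


if __name__ == "__main__":
    print(smallest_sample_size(toy_gap, 2, 5000))
```

Each iteration halves the bracket, so the number of evaluations of `gap` grows only logarithmically with the width of the initial bracket, which matters when every evaluation is itself an expensive computation.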

| **D** | **h** | **n (m = 30)** | **n (m = 50)** | **n (m = 100)** | **n (m = 200)** |
|---|---|---|---|---|---|
| 3 | 0.05 | 1957 | 2040 | 2091 | 2040 |
| 3 | 0.1 | 489 | 475 | 412 | 288 |
| 3 | 0.15 | 189 | 161 | 105 | 69 |
| 4 | 0.05 | 1923 | 2051 | 2137 | 2122 |
| 4 | 0.1 | 490 | 476 | 417 | 297 |

Here, _{1,j}∼_{i,j}∼

Application to the HapMap data

The aim of the International HapMap Project is to develop a haplotype map of the human genome, which will describe the common patterns of human DNA sequence variation.

The HapMap data (Phase III) consist of eleven populations with about ^{6} SNPs. Here, we consider the following nine populations in order to illustrate our sample size determination algorithm: ASW (African ancestry in Southwest USA, 87 subjects); CEU (Utah residents with Northern and Western European ancestry from the CEPH collection, 167 subjects); CHB (Han Chinese in Beijing, 137 subjects); CHD (Chinese in Metropolitan Denver, Colorado, 109 subjects); GIH (Gujarati Indians in Houston, Texas, 101 subjects); JPT (Japanese in Tokyo, 113 subjects); MEX (Mexican ancestry in Los Angeles, California, 86 subjects); TSI (Toscani in Italia, 102 subjects); and YRI (Yoruba in Ibadan, Nigeria, 203 subjects). With these, we created four sample size determination studies, of which the first three involve three populations (

Based on all the available subjects, we extracted pair-wise independent SNPs using the following steps. Suppose

Next, we set

**Total sample sizes needed for classification to **** HapMap populations CEU, GIH, and MEX.** For the linear classifier based on the SNP data from the three populations, the estimated learning curve gives the required total sample size for different values of the threshold,

**Total sample sizes needed for classification to **** HapMap populations ASW, TSI, and YRI.** For the linear classifier based on the SNP data from the three populations, the estimated learning curve gives the required total sample size for different values of the threshold,

**Total sample sizes needed for classification to **** HapMap populations CHB, JPT, and CHD.** For the linear classifier based on the SNP data from the three populations, the estimated learning curve gives the required total sample size for different values of the threshold,

**Total sample sizes needed for classification to majority **** HapMap populations CHB, JPT, CHD and GIH.** For the linear classifier based on the SNP data from the four populations, the estimated learning curve gives the required total sample size for different values of the threshold,

For example, if we set

The results from the four HapMap studies suggest that the

It is well known in the classification literature that the performance of a classifier depends on how well separated the classes are. Similarly, the studies above involving the HapMap data show that the performance of our sample size determination methodology also depends on the extent of separation between populations. While our methodology provides a formal way of determining an approximate total sample size for each specified value of

Discussion

We have built an optimal Bayes classifier and a linear classifier based on coded SNP data from two or more classes. For these classifiers, we have considered the two commonly used scalar performance measures, the Area Under the

The fact that the

Conclusion

In summary, for multiple classes, we have developed an asymptotic methodology based on

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

XL developed and implemented the proposed model, performed the simulations and the application, and drafted the manuscript. TNS participated in model development and helped with manuscript preparation. YW participated in the HapMap data analysis. All authors read and approved the final manuscript.

Acknowledgements

We would like to thank the Editor and the two reviewers for their careful reading and insightful suggestions, which greatly improved the content and the presentation of the article. T.N.S. was supported by a grant from the National Security Agency [H98230-11-1-0188] and the National Science Foundation [#1309665].