Department of Genetics, University of Georgia, Athens, GA 30602-7223, USA

Bioinformatics Research Center, North Carolina State University, Campus Box 7566, Raleigh, NC 27695-7566, USA

Abstract

Background

Studies on the distribution of indel sizes have consistently found that they obey a power law. This finding has led several scientists to propose that logarithmic gap costs,

Results

A set of simulated sequence pairs was globally aligned using affine, logarithmic, and log-affine gap costs. Alignment accuracy was calculated by comparing the resulting alignments to the true alignments of the sequence pairs. Gap costs were then compared based on average alignment accuracy. Log-affine gap costs had the best accuracy, followed closely by affine gap costs, while logarithmic gap costs performed poorly. Subsequently, a model was developed to explain the results.

Conclusion

In contrast to initial expectations, logarithmic gap costs produce poor alignments and are actually not implied by the power-law behavior of gap sizes, given typical match and mismatch costs. Furthermore, affine gap costs not only produce accurate alignments but are also good approximations to biologically realistic gap costs. This work provides added confidence for the biological relevance of existing alignment algorithms.

Background

Sequence alignments are essential to the study of molecular biology and systematics because they purport to reveal regions in sequences that are homologous. Because sequences gain and lose residues as they evolve, alignments are necessary for revealing such gaps in sequence data. Therefore, researchers usually need to align sequences before they can be studied. For example, most algorithms that construct phylogenetic trees from sequences require a sequence alignment (e.g.

There are two main types of alignment algorithms: local and global. Local alignment algorithms like FASTA

Most global alignment algorithms fall into two categories: finite state automata (FSA) or hidden Markov models (HMM)

An important observation is that alignment accuracy depends on the assumptions used in picking parameters. Costs (the parameters in our approach) that are based on abiological assumptions are likely to produce bad alignments. For example, if the costs of gaps are less than the cost of a match, then the best alignment for a pair of sequences will say that all residues align with gaps, i.e. the sequence pair is unaligned. Only in a limited number of applications will this be a biologically plausible result. A more prudent concern is how to pick the nature of gap costs because using an abiological model of gap costs can render any heuristic optimization of gap costs worthless. Gap costs are typically based on the affine model, where the cost of a gap of length

However, some researchers have raised questions about the biological justification for the affine gap model. Studies on the distribution of indel lengths have revealed that the size of an indel is linearly related to its frequency on a log-log scale, and therefore gap-sizes obey a power law ^{-z}/

where _{c }is the number of columns in category


**Example alignment pair**. Numbers identify the residues in the sequences. Category 1 columns (A5B-, A6B-, A7B5, and A8B6) are found in only the left alignment. Category 2 columns (A7B-, A8B-, A5B5, and A6B6) are found in only the right alignment. Category 3 columns (A1B1, A2B2, A3B3, and A4B4) are found in both alignments. Alignment identity is 2n_{3}/(2n_{3} + n_{1} + n_{2}) = (2 × 4)/(2 × 4 + 4 + 4) = 1/2, where n_{c} is the number of columns in category c.
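The worked example in this caption can be reproduced with a short script. The following is a minimal sketch, not the author's implementation: the function names and the toy sequences are hypothetical, and each alignment is assumed to be given as two gapped rows of equal length with `-` as the gap character. Identity is computed as twice the number of shared columns divided by the sum of that quantity and the columns unique to each alignment.

```python
def alignment_columns(row_a, row_b):
    """Return the set of columns in a pairwise alignment.

    Each column is a pair (i, j) of 1-based residue indices; None marks a gap.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    columns = set()
    i = j = 0
    for a, b in zip(row_a, row_b):
        index_a = index_b = None
        if a != "-":
            i += 1
            index_a = i
        if b != "-":
            j += 1
            index_b = j
        columns.add((index_a, index_b))
    return columns


def alignment_identity(hypothesized, true):
    """Identity between two alignments of the same sequence pair."""
    hyp = alignment_columns(*hypothesized)
    ref = alignment_columns(*true)
    shared = len(hyp & ref)    # columns found in both alignments
    only_hyp = len(hyp - ref)  # columns found only in the first alignment
    only_ref = len(ref - hyp)  # columns found only in the second alignment
    return 2 * shared / (2 * shared + only_hyp + only_ref)


# Toy sequences (hypothetical) arranged like the example figure:
# A has 8 residues, B has 6, and the gap sits in a different place.
left = ("ACGTACAC", "ACGT--AC")   # A5B-, A6B-, A7B5, A8B6
right = ("ACGTACAC", "ACGTAC--")  # A5B5, A6B6, A7B-, A8B-
print(alignment_identity(left, right))  # 0.5
```

Representing columns by residue indices rather than by characters keeps the comparison independent of the residues themselves, which matters when a sequence contains repeats.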

when the first alignment is taken as the hypothesized alignment and the second alignment is taken as the "true" alignment. Unlike the alignment fidelity of Holmes and Durbin

Not all sequence pairs are equally easy to align, and the accuracy of a hypothesized alignment is expected to decrease as sequence pairs become more distantly related due to substitution saturation and indel accumulation. Therefore, an appropriate measure of expected alignment accuracy for a specific gap cost needs to average across multiple branch lengths and multiple sequence pairs. Branch lengths are often measured in "substitution time", where a unit branch length is equal to 1 substitution, on average, per nucleotide. According to coalescent theory and neutrality, the number of generations separating any pair of sequences in the same diploid population depends on the effective population size, _{e}, and has approximately an exponential distribution with mean 4_{e }_{e}

Results

For the set of sequence pairs, the minimum branch length for any pair was 1.83 × 10^{-5} mean substitutions per nucleotide, and the maximum branch length was 1.76. Furthermore, the distribution of observed gap sizes, plotted on a log-log scale, is shown in Figure


**Gap Sizes Obey a Power Law**. Log-log plot of the distribution of gap sizes measured from the 5000 true alignments. The line is the maximum likelihood fit of a power-law distribution: ln
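A maximum likelihood fit of a power law to gap-size counts can be sketched as follows. This is an illustration only and not necessarily the procedure used for the figure: it assumes gap sizes follow P(k) ∝ k^{-z} on the finite range 1..`max_size` (the truncation is an assumption made here so that the normalizing sum is finite), and the function names, search bounds, and synthetic counts are all hypothetical. Because the log-likelihood is concave in z, a simple ternary search locates the maximum.

```python
import math


def power_law_log_likelihood(z, counts, max_size):
    """Log-likelihood of exponent z for P(k) = k**-z / sum_j j**-z, k = 1..max_size.

    counts maps each gap size k to the number of gaps observed with that size.
    """
    normalizer = sum(j ** -z for j in range(1, max_size + 1))
    total = sum(counts.values())
    weighted_logs = sum(c * math.log(k) for k, c in counts.items())
    return -z * weighted_logs - total * math.log(normalizer)


def fit_power_law(counts, max_size, lo=0.0, hi=10.0, iterations=200):
    """Estimate z by ternary search on the concave log-likelihood."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if (power_law_log_likelihood(m1, counts, max_size)
                < power_law_log_likelihood(m2, counts, max_size)):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


# Synthetic counts (hypothetical) that follow k**-1.5 exactly.
example_counts = {k: 1e6 * k ** -1.5 for k in range(1, 101)}
z_hat = fit_power_law(example_counts, max_size=100)
print(round(z_hat, 3))
```

On a log-log plot the fitted distribution is a straight line with slope -z, which is why a power law appears linear in a figure of this kind.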

The best gap costs were identified by having the highest average alignment accuracy, i.e. they produced alignments that had the highest average identity to the "true" alignments. The best costs for aligning sequences under the log-affine, affine, and logarithmic schemes were identified respectively as _{A }(_{L }(


**The curves of the best gap costs**. A) The entire range of the curves and B) a magnification of the beginning of the curves. The best gap costs were decided for each scheme based on highest average alignment identity. Log-Affine: _{A} (x)_{L} (k)


**Accuracy distribution of best gap costs**. Best log-affine (solid), best affine (dashed), and best logarithmic (dotted). Accuracy is measured via alignment identity. See Figure 3 for details on the exact gap costs.

Absolute accuracy properties of the best gap costs

| Absolute Identities | Log-Affine | Affine | Logarithmic |
|---|---|---|---|
| Minimum | 0.383 | 0.324 | 0.183 |
| 1st Quartile | 0.926 | 0.904 | 0.512 |
| Mean | 0.941 | 0.925 | 0.687 |
| Median | 0.976 | 0.970 | 0.717 |
| 3rd Quartile | 0.994 | 0.992 | 0.874 |
| Maximum | 1.0 | 1.0 | 1.0 |

Accuracy is measured via alignment identity. Log-Affine: _{A} (x)_{L} (k)

Relative accuracy properties of the best gap costs

| Relative Identities | Log-Affine | Affine | Logarithmic |
|---|---|---|---|
| Minimum | 0.710 | 0.501 | 0.193 |
| 1st Quartile | 0.993 | 0.971 | 0.549 |
| Mean | 0.992 | 0.973 | 0.717 |
| Median | 1.0 | 0.993 | 0.745 |
| 3rd Quartile | 1.0 | 1.0 | 0.892 |
| Maximum | 1.0 | 1.0 | 1.0 |

Relative accuracy was calculated as the alignment identity produced by a gap cost for each sequence pair divided by the largest alignment identity produced by any gap cost for the same sequence pair. See notes of Table 1.
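The relative accuracy described in this note can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name, the dictionary layout, and the identity values are hypothetical, and the input is assumed to hold one identity per sequence pair for every gap cost that was evaluated (not only the three best ones), since the denominator is the largest identity produced by any gap cost.

```python
def relative_identities(identities):
    """Divide each identity by the best identity any gap cost achieved for that pair.

    identities maps a gap-cost label to a list of alignment identities,
    one per sequence pair, in the same pair order for every label.
    """
    labels = list(identities)
    n_pairs = len(identities[labels[0]])
    if any(len(identities[label]) != n_pairs for label in labels):
        raise ValueError("every gap cost needs one identity per sequence pair")
    # Largest identity for each sequence pair across all gap costs.
    best = [max(identities[label][i] for label in labels) for i in range(n_pairs)]
    return {
        label: [
            identities[label][i] / best[i] if best[i] > 0 else 0.0
            for i in range(n_pairs)
        ]
        for label in labels
    }


# Hypothetical identities for two sequence pairs under three gap costs.
toy = {
    "log-affine": [0.9, 0.5],
    "affine": [0.8, 0.5],
    "logarithmic": [0.45, 0.25],
}
for label, values in relative_identities(toy).items():
    print(label, values)
```

A relative identity of 1.0 means that no evaluated gap cost produced a better alignment for that sequence pair, which is how the Median and 3rd Quartile rows of the table should be read.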

Figure


**Accuracies of best costs plotted by divergence**. _{A}, and _{L }are the alignment identities produced by the best log-affine, affine, and logarithmic gap penalties, respectively. See Figure 3 for more information. a-c) Alignment identities plotted by the branch length of the alignments. Divergence time is plotted on a uniform scale,

To compare the best gap costs on a per sequence pair basis, Figure


**Accuracies of best costs compared per sequence**. Ratio of identities produced by a) best affine gap cost and b) best logarithmic gap cost to the identities produced by the log-affine gap cost plotted for each sequence pair by divergence time. See Figure 5 for more information.

Instead of trying to find gap costs that have the highest average accuracy, we can find the gap costs that have the highest accuracy for each sequence pair. Therefore, an alternative approach to comparing schemes is to look at the maximum identity produced by each scheme for each sequence pair. Similar to Figure


**Maximum accuracies plotted by divergence**. _{A}, and _{L }are the maximum alignment identity produced for each sequence pair by log-affine, affine, and logarithmic gap costs respectively. The subfigures are the same as in Figure 5.


**Maximum accuracies compared per sequence**. Ratio of maximum identities produced by a) affine gap costs and b) logarithmic gap costs to the maximum identities produced by log-affine gap costs plotted for each sequence pair by divergence time. See Figures 5-7 for more information.

Discussion

The first issue that we will consider is whether the parameter space was properly sampled. For log-affine and affine schemes, the best values were found inside the sampled parameter space, representing local maxima and perhaps global maxima. However, for logarithmic gap penalties, the best penalty was found on the edge of the parameter space. Subsequent expansion of the parameter space confirmed that _{L }(

In the simulations, branch lengths were randomly drawn based on _{e}^{-9}, then the effective population size would be 50 million. This is high for most populations, but it does produce many branch lengths that can represent species-species divergence times. When calculating the best gap costs, it is possible to use importance sampling to weight the identities in a way that reflects another distribution of branch lengths. Similar results (not shown) were obtained when weighting to produce a

An interesting feature of the data is that alignment identity improves at the longest branch lengths. According to Figure

Clearly from the results, logarithmic gap costs are a poor choice for aligning sequences even though biological results would seem to suggest them. Logarithmic gap costs perform poorly because they increase slowly (Figure

It is surprising that logarithmic gap costs do so poorly compared to affine and log-affine gap costs, given that initially there seems to be little biological justification for having a linear component in the gap cost. However, as we show in Appendices A and B, converting a maximum likelihood search into a minimum cost search through shifting and scaling introduces a linear component into the gap cost, which can dominate the logarithmic component. In other words, the power law does not imply that gap costs should be logarithmic; instead, it implies that gap costs should be log-affine.
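To make the relationship between the three schemes concrete, the sketch below writes them as functions of gap length k, assuming the common parameterization a + bk (affine), a + c ln k (logarithmic), and a + bk + c ln k (log-affine). The parameter values are hypothetical and chosen only for illustration; they are not the costs found in the simulations.

```python
import math


def affine_cost(k, a, b):
    """Affine gap cost: opening cost a plus extension cost b per gapped residue."""
    return a + b * k


def logarithmic_cost(k, a, c):
    """Logarithmic gap cost: grows with the natural log of the gap length."""
    return a + c * math.log(k)


def log_affine_cost(k, a, b, c):
    """Log-affine gap cost: affine and logarithmic components combined."""
    return a + b * k + c * math.log(k)


def linear_share(k, b, c):
    """Fraction of the length-dependent cost contributed by the linear term."""
    linear = b * k
    logarithmic = c * math.log(k)
    return linear / (linear + logarithmic)


# Hypothetical parameters, for illustration only.
a, b, c = 4.0, 1.0, 2.0
for k in (1, 10, 100):
    print(k,
          affine_cost(k, a, b),
          round(logarithmic_cost(k, a, c), 2),
          round(log_affine_cost(k, a, b, c), 2),
          round(linear_share(k, b, c), 3))
```

With these illustrative values, extending a gap from 1 to 100 residues adds 99 to the affine cost but less than 10 to the logarithmic cost, and the linear term accounts for more than 90% of the length-dependent part of the log-affine cost at k = 100.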

In Appendix A we use techniques from the field of statistical alignment to develop a probability model for our alignments. The model is similar to the model of Knudsen and Miyamoto

From Appendix A, the log-likelihood of a pairwise, global alignment given observed sequences

where _{1 }. . . _{N}). Furthermore, in Appendix B we convert this log-likelihood into minimum cost search, producing the following gap cost derived from Equation 2:

Since in the simulations

and Equation 3 reduces to

This gap cost is very close to the top gap cost found in the simulations:

Furthermore, based on unweighted least squares, the following affine cost best fits Equation 5:

Because the linear component of Equation 5 dominates the logarithmic component, logarithmic gap costs will clearly provide worse fits than affine gap costs. Therefore, one can surmise that the linear component of the gap cost function derives from the conversion of a maximum likelihood search into a minimum cost search via shifting and scaling to fit specific substitution costs. This linear component dominates the gap cost, allowing the log component to be removed and the gap opening cost re-weighted. These results also open the possibility that the gap extension cost can be moved into the substitution matrix and eliminated from the gap cost entirely, potentially speeding up alignment algorithms.
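An unweighted least-squares fit of an affine cost to a log-affine cost can be sketched as ordinary linear regression of the cost on gap length. The code below is an illustration under assumptions: the function name, the range of gap lengths used for the fit (`max_len`), and the log-affine parameters are hypothetical and are not the values of Equation 5.

```python
import math


def fit_affine(cost, max_len):
    """Fit a + b*k to cost(k) for k = 1..max_len by unweighted least squares.

    Returns (a, b), the fitted gap opening and gap extension costs.
    Requires max_len >= 2 so that the slope is defined.
    """
    lengths = list(range(1, max_len + 1))
    values = [cost(k) for k in lengths]
    n = len(lengths)
    mean_k = sum(lengths) / n
    mean_v = sum(values) / n
    s_kk = sum((k - mean_k) ** 2 for k in lengths)
    s_kv = sum((k - mean_k) * (v - mean_v) for k, v in zip(lengths, values))
    b = s_kv / s_kk
    a = mean_v - b * mean_k
    return a, b


def example_log_affine(k):
    """Hypothetical log-affine cost used only to exercise the fit."""
    return 4.0 + 1.0 * k + 2.0 * math.log(k)


opening, extension = fit_affine(example_log_affine, max_len=100)
print(round(opening, 3), round(extension, 3))
```

Because ln k increases with k, the fitted extension cost comes out slightly larger than the linear coefficient of the log-affine cost, and the remaining effect of the logarithmic term is absorbed into the fitted opening cost. The result depends on the range of gap lengths chosen for the fit.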

The linear component of the affine approximation is derived solely from the shifting and scaling introduced by fixing the substitution costs. Because the extension cost is not influenced by the distribution of gap lengths, the Zipf power-law distribution of gap sizes appears to be approximated by a discrete uniform distribution. Although this result is rather unexpected, it makes sense in two ways. First, Zipf distributions have fat tails, and sections of the tail can be well approximated by a uniform distribution. Second, the numbers of matches, mismatches, and gapped positions are not independent of one another (Appendix B); therefore, matches and mismatches carry information about gap lengths. The uniform approximation for a Zipf distribution may prove to be more useful than geometric

Conclusion

From these results I propose that, if a researcher knows or is willing to assume

This research has demonstrated that logarithmic gap costs, although suggested by biological data on the surface, are not a good solution for aligning pairs of sequences through a finite state automaton. In fact, despite previous suggestions, e.g.

Methods

Five thousand sequence pairs were generated on unrooted trees using the sequence simulation program, Dawg

Pairwise, global alignments were done with Ngila

The statistical software, R

Appendices

A. Alignment log-likelihood

In this appendix we will develop a statistical model for alignment similar to

To calculate Equation 6 completely for two sequences related by a common ancestor, one would have to consider all sequences that could be the most recent common ancestor of A and B and all possible branch lengths between this ancestor and

It is possible to derive Equation 7 from an evolutionary process. Specifically the probability that

where _{b}. If _{a},

The probability that an indel occurs at any position is 1 - e^{-λt}, and, if we ignore the issue of overlapping indels, there are _{b }-

where _{g}, and

For simplicity we will not integrate Equation 7 to find

Upon removing factors that are constant for sequences

The alignment is quantified by the number of matches (_{1 }. . . _{N}). The likelihood of an alignment depends on three evolutionary parameters: the rate of indel formation per unit branch length (

B. Gap costs

As developed by Smith et al. _{i}, and the penalties of gaps of length _{k}, can be used to calculate the alignment with maximum log-likelihood:

where _{i }is the number of residue matches of type _{k }is the number of gaps of length

To begin constructing the minimum cost analog, let _{i }= (_{i})/

The lengths of the sequences being aligned, _{i }+ ∑_{k}. Using this relationship, Equation 12 can be expressed as

From this it can be clearly seen that _{i}_{i }+ ∑_{k}} maximizes the likelihood of the alignment, where _{k})/

Acknowledgements

This work was supported by an NSF Predoctoral Fellowship and N.I.H. grant GM070806. The author would like to thank Wyatt Anderson, Jim Hamrick, Jessica Kissinger, Ron Pulliam, Ben Redelings, Paul Schliekelman, Jeff Thorne, John Wares, and three anonymous reviewers for their helpful suggestions.