Mathematical properties of the r2 measure of linkage disequilibrium (2024)

Abstract

Statistics for linkage disequilibrium (LD), the non-random association of alleles at two loci, depend on the frequencies of the alleles at the loci under consideration. Here, we examine the r2 measure of LD and its mathematical relationship to allele frequencies, quantifying the constraints on its maximum value. Assuming independent uniform distributions for the allele frequencies of two biallelic loci, we find that the mean maximum value of r2 is ~0.43051, and that r2 can exceed a threshold of 4/5 in only ~14.232% of the allele frequency space. If one locus is assumed to have known allele frequencies – the situation in an association study in which LD between a known marker locus and an unknown trait locus is of interest – we find that the mean maximum value of r2 is greatest when the known locus has a minor allele frequency of ~0.30131. We find that in 1/4 of the space of allowed values of minor allele frequencies and haplotype frequencies at a pair of loci, the unconstrained maximum r2 allowing for the possibility of recombination between the loci exceeds the constrained maximum assuming that no recombination has occurred. Finally, we use rmax2 to examine the connection between r2 and the D′ measure of linkage disequilibrium, finding that r2/rmax2 = D′2 for ~72.683% of the space of allowed values of (pa, pb, pab). Our results concerning the properties of r2 have the potential to inform the interpretation of unusual LD behavior and to assist in the design of LD-based association-mapping studies.

1 Introduction

Linkage disequilibrium (LD) refers to a non-random association in the occurrence of alleles at two loci (Hudson, 2001; Slatkin, 2008). LD finds applications in diverse contexts, including the inference of demographic events in human evolutionary history (Tishkoff et al., 1996), the fine-mapping of disease genes after localization via linkage analysis, and the modeling, selection, and evaluation of sets of informative single-nucleotide polymorphisms for use in detecting disease-susceptibility alleles in genome-wide association studies (Kruglyak, 1999; Carlson et al., 2004; Eberle et al., 2007). Measurements of LD are typically based on comparisons of the observed frequencies of haplotypes to the frequencies expected based on the frequencies of the alleles that comprise the various haplotypes. Statistically estimated haplotype frequencies are used in place of observed frequencies when observed frequencies are unavailable.

One of the challenges inherent in measuring LD is that the ranges of LD measures can depend on the frequencies of alleles at the loci under consideration. Hedrick (1987) showed that for several LD statistics, holding the allele frequencies of one of the two loci in a pair constant, the maximal values of the statistics could occur only when the allele frequencies of the second locus were equal to those of the first locus; further, in some cases, the maximum itself was frequency-dependent. The only statistic considered by Hedrick (1987) whose range was frequency-independent was D′ (Lewontin, 1964), which ranges from −1 to 1 for any set of allele frequencies for a pair of polymorphic biallelic markers. However, even D′ is not independent of allele frequencies in most senses of the concept of “independence” (Lewontin, 1988).

For biallelic markers, one of the most commonly used measures for LD is r2, the square of the correlation coefficient between two indicator variables – one representing the presence or absence of a particular allele at the first locus and the other representing the presence or absence of a particular allele at the second locus. In a disease association context, the r2 statistic is often used in calculations of power to detect disease-susceptibility loci. Under some conditions, the power to detect disease association with a marker locus when using a case-control sample of size N is approximately equal to the power to detect disease association with the true causal locus when using a sample of size N r2, where r2 here denotes the value of the r2 statistic for the marker locus and the causal locus. The r2 statistic also underlies popular methods for identifying informative markers for use in LD-based association studies (Carlson et al., 2004; de Bakker et al., 2005).

Like most LD statistics, r2 has a frequency-dependent range. The maximum value of r2 as a function of the allele frequencies of two loci under consideration drops sharply with the extent of the minor allele frequency difference between the loci (Wray, 2005; Eberle et al., 2006; Amos, 2007). Thus, in some settings, matching loci by allele frequencies prior to measurement of LD can provide a way to circumvent the frequency dependence of r2. Using genotypes from 71 unrelated individuals of European, African-American, and Chinese descent, Eberle et al. (2006) found that by restricting their calculations to matched loci with similar allele frequencies, their ability to identify high LD values using r2 increased considerably, revealing excess LD in genic regions.

Although the frequency dependence of r2 has often been noted, relatively little is known about the mathematical properties of this dependence. Wray (2005) showed that if two loci have a value of r2 above a specified cutoff and one of the loci has known allele frequencies, then the frequencies at the other locus must lie in a narrow range. Eberle et al. (2006) studied the properties of r2 in a genealogical context, examining the predictions made by a coalescent model about the expected value of r2 conditional on the allele frequencies at a pair of loci in the absence of recombination. In this paper, we consider the mathematical relationship between r2 and allele frequencies in detail. We investigate the maximum possible value of r2 for a given set of allele frequencies, compute the mean value of rmax2 when frequencies at one of the loci are assumed to be known, and determine the range of possible allele frequencies for one locus when r2 and the frequencies for the other locus are known. We also use two possible genealogical histories (a scenario similar to that of Eberle et al. (2006)) to investigate the effect of recombination on the value of r2. Finally, we determine the relationship between r2 and D′ using a connection to rmax2, the maximum value of r2 possible given the allele frequencies at a pair of loci.

2 Theory

Consider two biallelic loci, locus 1 with alleles a and A and locus 2 with alleles b and B. Suppose the frequencies for alleles a and A are respectively pa and 1−pa, and the frequencies for alleles b and B are pb and 1 − pb. Because pa and pb each range from 0 to 1, the pair (pa, pb) ranges over the (open) unit square. The set of combinations of allele frequencies (pa, pb) can be split into eight components, which we label S1, S2, …, S8 for convenience (Figure 1). Each of the other seven components, S2, …, S8, corresponds to a transformation of S1 in which alleles are swapped at locus 1, alleles are swapped at locus 2, loci 1 and 2 are swapped, or two or more of these exchanges are performed. We will use this symmetry to simplify some of our calculations.

Figure 1.

The r2 measure of linkage disequilibrium is defined as

$$r^2(p_a, p_b, p_{ab}) = \frac{(p_{ab} - p_a p_b)^2}{p_a (1 - p_a)\, p_b (1 - p_b)}, \tag{1}$$

where pab is the frequency of haplotypes having allele a at locus 1 and allele b at locus 2. As the square of a correlation coefficient, r2(pa, pb, pab) can range from 0 to 1 as pa, pb and pab vary.
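As a minimal sketch of equation 1, r2 can be computed directly from the two allele frequencies and the haplotype frequency. The function name `r_squared` and its input check are illustrative assumptions, not part of the original analysis.

```python
def r_squared(pa, pb, pab):
    """Equation 1: squared correlation between allele a and allele b.

    pa, pb : allele frequencies at loci 1 and 2 (strictly between 0 and 1)
    pab    : frequency of the haplotype carrying both a and b
    """
    if not (0.0 < pa < 1.0 and 0.0 < pb < 1.0):
        raise ValueError("both loci must be polymorphic (0 < p < 1)")
    d = pab - pa * pb  # disequilibrium coefficient D
    return d * d / (pa * (1.0 - pa) * pb * (1.0 - pb))


# Complete association between equally frequent alleles gives r2 = 1,
# while pab = pa * pb (no association) gives r2 = 0.
print(r_squared(0.3, 0.3, 0.3))
print(r_squared(0.3, 0.4, 0.12))
```

The check on pa and pb mirrors the open unit square used in the text; monomorphic loci would make the denominator zero.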

2.1 rmax2 (pa, pb)

Our first computation is of rmax2(pa,pb), the maximum value of r2 for given values of pa and pb, considering all possible values of pab. Given (pa, pb), the denominator of r2 is fixed. Therefore, to maximize r2, it suffices to choose the value for pab that maximizes the numerator. The possible values of pab are constrained by the fact that the frequency of a haplotype can be no more than the frequency of the least frequent allele that it contains and no less than 0 or the minimum overlap that can occur between two alleles based on their frequencies. It is at one of these extremes – the highest or lowest possible haplotype frequency – that the numerator is maximized. Thus, the maximum value of r2 occurs either at pab = min(pa, pb) or at pab = max(0, pa + pb − 1), depending on the component, Si, in which the given (pa, pb) is located. For S1 and S4, the maximum occurs at pab = pa + pb − 1, so

$$r^2_{\max}(p_a, p_b) = \frac{(1 - p_a)(1 - p_b)}{p_a p_b}. \tag{2}$$

For S2 and S7, the maximum occurs at pab = pa:

$$r^2_{\max}(p_a, p_b) = \frac{p_a (1 - p_b)}{(1 - p_a)\, p_b}. \tag{3}$$

For S3 and S6, the maximum occurs at pab = pb:

$$r^2_{\max}(p_a, p_b) = \frac{(1 - p_a)\, p_b}{p_a (1 - p_b)}. \tag{4}$$

Finally, for S5 and S8, the maximum occurs at pab = 0:

$$r^2_{\max}(p_a, p_b) = \frac{p_a p_b}{(1 - p_a)(1 - p_b)}. \tag{5}$$

Table 1 summarizes these results. Note that rmax2 is continuous on the boundaries between components. An important consequence of equations 2–5 is that rmax2(pa, pb) = 1 if and only if pb = pa or pb = 1 − pa.

Table 1.

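$r^2_{\max}$, $D^2_{\max}$, and $D^2_{r^2_{\max}}$ in the eight components of the space of possible allele frequencies.

| Component | $p_a < \tfrac{1}{2}$ | $p_b < \tfrac{1}{2}$ | $p_a < p_b$ | $p_a + p_b < 1$ | $p_{ab}$ that produces $r^2_{\max}$ | $r^2_{\max}$ | $D^2_{\max}$ ($D < 0$) | $D^2_{\max}$ ($D > 0$) | $D^2_{r^2_{\max}}$ |
|---|---|---|---|---|---|---|---|---|---|
| S1 | yes | no | yes | no | $p_a + p_b - 1$ | $\frac{(1-p_a)(1-p_b)}{p_a p_b}$ | $[(1-p_a)(1-p_b)]^2$ | $[p_a(1-p_b)]^2$ | $[(1-p_a)(1-p_b)]^2$ |
| S2 | no | no | yes | no | $p_a$ | $\frac{p_a(1-p_b)}{(1-p_a)p_b}$ | $[(1-p_a)(1-p_b)]^2$ | $[p_a(1-p_b)]^2$ | $[p_a(1-p_b)]^2$ |
| S3 | no | no | no | no | $p_b$ | $\frac{(1-p_a)p_b}{p_a(1-p_b)}$ | $[(1-p_a)(1-p_b)]^2$ | $[(1-p_a)p_b]^2$ | $[(1-p_a)p_b]^2$ |
| S4 | no | yes | no | no | $p_a + p_b - 1$ | $\frac{(1-p_a)(1-p_b)}{p_a p_b}$ | $[(1-p_a)(1-p_b)]^2$ | $[(1-p_a)p_b]^2$ | $[(1-p_a)(1-p_b)]^2$ |
| S5 | no | yes | no | yes | $0$ | $\frac{p_a p_b}{(1-p_a)(1-p_b)}$ | $(p_a p_b)^2$ | $[(1-p_a)p_b]^2$ | $(p_a p_b)^2$ |
| S6 | yes | yes | no | yes | $p_b$ | $\frac{(1-p_a)p_b}{p_a(1-p_b)}$ | $(p_a p_b)^2$ | $[(1-p_a)p_b]^2$ | $[(1-p_a)p_b]^2$ |
| S7 | yes | yes | yes | yes | $p_a$ | $\frac{p_a(1-p_b)}{(1-p_a)p_b}$ | $(p_a p_b)^2$ | $[p_a(1-p_b)]^2$ | $[p_a(1-p_b)]^2$ |
| S8 | yes | no | yes | yes | $0$ | $\frac{p_a p_b}{(1-p_a)(1-p_b)}$ | $(p_a p_b)^2$ | $[p_a(1-p_b)]^2$ | $(p_a p_b)^2$ |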

Combining equations 2–5, Figure 2 shows a 3-dimensional plot of rmax2 for all combinations (pa, pb). The X-shape of the figure illustrates the symmetries of r2 as a function of pa and pb, as well as the property that r2 can only equal 1 if the two loci have the same minor allele frequency. Additionally, the graph shows a very steep decay of rmax2 moving away from the diagonals, indicating that even small differences in allele frequency between the two loci, especially if the frequencies are not near 1/2, can reduce the range of possible values for r2 considerably.
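The component-by-component results can be cross-checked numerically. The sketch below is one possible implementation, with hypothetical function names: `r2_max` evaluates r2 at the two extreme admissible values of pab, and `r2_max_from_equations` assembles the same quantity from equations 2–5. The `max(min(...), min(...))` shortcut is an observation derived here from those equations rather than a formula stated in the text.

```python
def r2_max(pa, pb):
    """Maximum of r2 over the admissible haplotype frequencies pab."""
    denom = pa * (1.0 - pa) * pb * (1.0 - pb)
    # (pab - pa*pb)^2 is convex in pab, so over the admissible interval
    # [max(0, pa + pb - 1), min(pa, pb)] it peaks at one of the endpoints.
    d_low = max(0.0, pa + pb - 1.0) - pa * pb
    d_high = min(pa, pb) - pa * pb
    return max(d_low * d_low, d_high * d_high) / denom


def r2_max_from_equations(pa, pb):
    """The same quantity assembled from equations 2-5."""
    qa, qb = 1.0 - pa, 1.0 - pb
    eq2 = qa * qb / (pa * pb)
    eq3 = pa * qb / (qa * pb)
    eq4 = qa * pb / (pa * qb)
    eq5 = pa * pb / (qa * qb)
    # Equations 2 and 5 are reciprocals, as are 3 and 4. The member of each
    # pair that is <= 1 is the value of r2 at one endpoint of the pab
    # interval, and the larger of those two endpoint values is the maximum.
    return max(min(eq2, eq5), min(eq3, eq4))


# The maximum equals 1 only on the two diagonals pb = pa and pb = 1 - pa.
for pa, pb in [(0.3, 0.3), (0.3, 0.7), (0.3, 0.4), (0.1, 0.2)]:
    print(pa, pb, round(r2_max(pa, pb), 5), round(r2_max_from_equations(pa, pb), 5))
```

Evaluating the endpoints directly avoids having to decide which of the eight components a point belongs to, which is convenient near component boundaries.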

Figure 2.

We can quantify the effect of differences in minor allele frequency observed in Figure 2 by calculating the average rmax2 value assuming independent Uniform(0,1) distributions for pa and pb. This computation amounts to evaluating the volume below rmax2 over the unit square. Using symmetry, the total volume can be calculated by finding the volume over one of the eight components in Figure 1 and multiplying by eight. Denoting the volume of rmax2 over component S1 by V1, we have

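$$V_1 = \int_0^{1/2} \int_{1-p_a}^{1} \frac{(1-p_a)(1-p_b)}{p_a p_b}\, dp_b\, dp_a = -\int_0^{1/2} \frac{1-p_a}{p_a} \left[ \ln(1-p_a) + p_a \right] dp_a = \frac{1}{12}\pi^2 - \frac{1}{2}(\ln 2)^2 + \frac{1}{2}\ln 2 - \frac{7}{8} \approx 0.05381.$$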

The last step uses the dilogarithm function $\mathrm{Li}_2(z) = \int_z^0 \frac{\ln(1-t)}{t}\, dt$, where Li2(0) = 0 and Li2(1/2) = π2/12 − (ln 2)2/2 (Weisstein, 2003). Consequently, the mean rmax2 given pa ~ Uniform(0, 1), pb ~ Uniform(0, 1), and assuming pa and pb are independent, is 8V1 = 2π2/3 − 4(ln 2)2 + 4(ln 2) − 7 ≈ 0.43051.
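The closed form for the mean can be compared against a direct numerical average over the unit square. The following is a rough sketch under stated assumptions: the midpoint rule and the grid size are arbitrary choices made here, and `r2_max` is the hypothetical endpoint-based helper, not a routine from the original study.

```python
import math


def r2_max(pa, pb):
    """Maximum of r2 over admissible pab, evaluated at the two extremes."""
    denom = pa * (1.0 - pa) * pb * (1.0 - pb)
    d_low = max(0.0, pa + pb - 1.0) - pa * pb
    d_high = min(pa, pb) - pa * pb
    return max(d_low * d_low, d_high * d_high) / denom


def mean_r2_max_grid(n=400):
    """Midpoint-rule estimate of the mean of r2_max over the unit square."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        pa = (i + 0.5) * h
        for j in range(n):
            total += r2_max(pa, (j + 0.5) * h)
    return total * h * h


# 8 * V1 from the text
closed_form = 2 * math.pi ** 2 / 3 - 4 * math.log(2) ** 2 + 4 * math.log(2) - 7

print(f"closed form : {closed_form:.5f}")
print(f"grid (n=400): {mean_r2_max_grid():.5f}")
```

Because rmax2 has ridges along both diagonals, a coarse grid is only expected to agree with the closed form to a few decimal places.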

This result and the shape of Figure 2 suggest that it is only possible to achieve high values of r2 over relatively small portions of the space of possible values of pa and pb. For a constant c, 0 ≤ c ≤ 1, we can calculate the proportion of the allele frequency space where it is possible for r2 to exceed c, p(c). Again using symmetry, we can restrict our attention to S6. Using equation 4, the portion of S6 in which rmax2(pa,pb) ≥ c, whose area we denote by A6, satisfies

$$p_b \ge \frac{c\, p_a}{1 - p_a + c\, p_a}.$$

Considering the complement of the area of interest in S6, we have

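$$\frac{1}{8} - A_6 = \int_0^{1/2} \int_0^{\frac{c\, p_a}{1 - p_a + c\, p_a}} 1\, dp_b\, dp_a = -\frac{c}{2(1-c)} - \frac{c \ln\left(\frac{1}{2} + \frac{1}{2}c\right)}{(1-c)^2}.$$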

Thus, the proportion of the allele frequency space where it is possible for r2 to exceed c is 8A6, or

$$p(c) = 1 + \frac{4c}{1-c} + \frac{8c \ln\left(\frac{1}{2} + \frac{1}{2}c\right)}{(1-c)^2}. \tag{6}$$

Figure 3 shows that the proportion of the allele frequency space where it is possible for r2 to exceed c declines faster than linearly. For example, only over ~0.39709 of the allele frequency space is it possible for r2 to exceed 1/2 and only over ~0.14232 of the space is it possible for r2 to exceed 4/5.
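Equation 6 is straightforward to evaluate. The sketch below, with the hypothetical name `proportion_exceeding`, reproduces the two values quoted above; the restriction to c < 1 is an implementation choice made here because the formula divides by 1 − c.

```python
import math


def proportion_exceeding(c):
    """Equation 6: share of the (pa, pb) unit square where r2_max >= c."""
    if not 0.0 <= c < 1.0:
        raise ValueError("this sketch evaluates equation 6 for 0 <= c < 1")
    log_term = math.log(0.5 + 0.5 * c)
    return 1.0 + 4.0 * c / (1.0 - c) + 8.0 * c * log_term / (1.0 - c) ** 2


for c in (0.0, 0.5, 0.8):
    print(f"c = {c:.1f}: p(c) = {proportion_exceeding(c):.5f}")
```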

Figure 3.

2.2 rmax2(pa,pb) with pa fixed

In contrast to the previous computations, in which we performed integrations over possible values of pa and pb, we now consider the case in which pa is fixed. This computation enables us to identify the allele frequencies for a locus that is able to have high r2 values across the broadest range of allele frequencies for a second locus. Assuming pb has a Uniform(0,1) distribution, we can calculate m(pa) = E[rmax2(pa, pb) | pb ~ Uniform(0,1)], the mean maximum r2 value as a function of pa. We can assume that pa ≤ 1/2 and then consider pa > 1/2 by observing that m(pa) = m(1−pa). For pa ≤ 1/2 we perform piecewise integration across components S6, S7, S8, and S1 using equations 2–5:

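$$\begin{aligned} m(p_a) &= \int_0^{p_a} \frac{(1-p_a)\, p_b}{p_a (1-p_b)}\, dp_b + \int_{p_a}^{1/2} \frac{p_a (1-p_b)}{(1-p_a)\, p_b}\, dp_b + \int_{1/2}^{1-p_a} \frac{p_a p_b}{(1-p_a)(1-p_b)}\, dp_b + \int_{1-p_a}^{1} \frac{(1-p_a)(1-p_b)}{p_a p_b}\, dp_b \\ &= -\frac{2(1-p_a)}{p_a} \left[ p_a + \ln(1-p_a) \right] + \frac{2 p_a}{1-p_a} \left[ \ln\left(\tfrac{1}{2}\right) - \tfrac{1}{2} - \ln p_a + p_a \right]. \end{aligned}$$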

Figure 4 shows that the mean of rmax2(pa,pb), averaging over values of pb, has an m-shape as a function of pa. The maximum of this mean occurs at pa ≈ 0.30131 and pa ≈ 0.69869 and equals ~0.53091. Notice that the largest values of m(pa) occur for intermediate minor allele frequencies rather than for minor allele frequencies close to 1/2. This finding can be explained by examining the contour plot of Figure 2, which suggests that slices through the graph made at intermediate frequencies for pa contain more space with higher values of rmax2 than do other slices.
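The location of the maximum can be recovered from the closed form for m(pa) with a simple grid search. This is a minimal sketch: the function name `mean_r2_max_given_pa`, the grid spacing of 1e-5, and the use of a grid rather than a root-finder are all choices made here for illustration.

```python
import math


def mean_r2_max_given_pa(pa):
    """Closed form for m(pa), extended to pa > 1/2 via m(pa) = m(1 - pa)."""
    p = min(pa, 1.0 - pa)  # fold onto pa <= 1/2 using the symmetry
    q = 1.0 - p
    first = -2.0 * q / p * (p + math.log(q))
    second = 2.0 * p / q * (math.log(0.5) - 0.5 - math.log(p) + p)
    return first + second


# Grid search for the maximizer on (0, 1/2]
grid = [k / 100000 for k in range(1, 50001)]
best_pa = max(grid, key=mean_r2_max_given_pa)

print(f"argmax pa ~ {best_pa:.5f}, m(pa) ~ {mean_r2_max_given_pa(best_pa):.5f}")
print(f"m(1/2)    ~ {mean_r2_max_given_pa(0.5):.5f}")
```

Comparing the value at the maximizer with m(1/2) makes the m-shape concrete: the mean at pa = 1/2 is lower than at the intermediate optimum.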

Figure 4.

2.3 rmax2(pa,pb) with pa and pa − pb fixed

We now consider the situation in which pa and the difference between allele frequencies |pa − pb| are known. This situation is similar to the scenario considered by Wray (2005) in which r2 was assumed to exceed some known threshold, pa was assumed to be known, and pb = pa + v was investigated.

Let pa and pb be minor allele frequencies (≤ 1/2) with pa ≥ pb, so that we are considering component S6. Define d = pa − pb ≥ 0. Treating rmax2 as a function of pa and d, we can rewrite equation 4:

$$r^2_{\max}(p_a, d) = 1 - \frac{d}{p_a (1 + d - p_a)}. \tag{7}$$

Figure 5 shows a 3-dimensional plot of rmax2 as a function of the larger minor allele frequency (pa) and the difference between minor allele frequencies (pa − pb). The twisted surface illustrates that for pa fixed, rmax2 decreases faster with the difference in minor allele frequency when pa has smaller values. This observation corresponds to the steeper decline from the diagonals further from the center in Figure 2.

Figure 5.

Holding d constant and nonnegative in equation 7, the maximum of rmax2 for pa ≤ 1/2 occurs at pa = 1/2:

$$r^2 \le 1 - \frac{d}{\frac{1}{2}\left(1 + d - \frac{1}{2}\right)}. \tag{8}$$

By rearranging this equation, we can calculate the maximum value of |pa − pb| possible given a known value of r2,

$$d \le \frac{1 - r^2}{2(1 + r^2)}. \tag{9}$$

As equation 9 is based on the maximum of rmax2 over all possible minor allele frequency values for pa, it represents the broadest range possible for the difference in allele frequencies. For d to achieve the maximum value of (1 − r2)/[2(1 + r2)], pa must equal 1/2.

The computation above assumes a known r2 and determines the maximum for d. However, if we know pa in addition to r2, then we can solve exactly for the set of allowable values of pb. Assuming again that pa and pb are minor allele frequencies (that is, at most 1/2), we must consider two cases: pa ≥ pb and pa ≤ pb. For the first case, (pa, pb) is in S6. Rearranging equation 4,

$$p_b \ge \frac{r^2 p_a}{1 + r^2 p_a - p_a}, \tag{10}$$

so that r2 pa/(1 + r2 pa − pa) ≤ pb ≤ pa. In the second case, (pa, pb) is in S7 so we can rearrange equation 3 to obtain

$$p_b \le \frac{p_a}{r^2 - r^2 p_a + p_a}. \tag{11}$$

Recalling our assumption that pb ≤ 1/2 and combining equations 10 and 11, we find

$$\frac{r^2 p_a}{1 + r^2 p_a - p_a} \le p_b \le \min\left(\frac{1}{2},\ \frac{p_a}{r^2 - r^2 p_a + p_a}\right).$$

This result accords with the values that appear in Table 2 of Wray (2005).
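The bounds in equations 9–11 translate directly into code. The sketch below uses the hypothetical names `max_frequency_difference` and `pb_range`. It assumes pa is a minor allele frequency and r2 is in (0, 1], and it is not intended to reproduce the table of Wray (2005).

```python
def max_frequency_difference(r2):
    """Equation 9: largest |pa - pb| compatible with a given r2."""
    return (1.0 - r2) / (2.0 * (1.0 + r2))


def pb_range(pa, r2):
    """Allowed minor allele frequencies pb given pa and r2 (equations 10-11)."""
    if not (0.0 < pa <= 0.5 and 0.0 < r2 <= 1.0):
        raise ValueError("expected 0 < pa <= 1/2 and 0 < r2 <= 1")
    lower = r2 * pa / (1.0 + r2 * pa - pa)        # equation 10 (pb <= pa, S6)
    upper = min(0.5, pa / (r2 - r2 * pa + pa))    # equation 11 (pb >= pa, S7)
    return lower, upper


low, high = pb_range(0.3, 0.8)
print(f"pa = 0.3, r2 = 0.8: {low:.4f} <= pb <= {high:.4f}")
print(f"max |pa - pb| for r2 = 0.8: {max_frequency_difference(0.8):.4f}")
```

At pa = 1/2 the lower bound sits exactly `max_frequency_difference(r2)` below pa, which is the statement that equation 9 is attained only when pa = 1/2.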

2.4 rmax2(pa,pb) and recombination

We have previously been examining r2 with the assumption that it is possible for pab to take any value within its allowable range. This amounts to an assumption that we are not constraining the recombination history of the two loci under consideration. In this section, we consider a different situation: how does recombination affect r2 for two loci that have not previously experienced recombination? This depends on the genealogical history of the loci.

Consider two possible genealogies (Figure 6). In Genealogy 1, a mutation at locus 1 arises later, but on the same side of the tree, as a mutation at locus 2. In Genealogy 2, the mutations at loci 1 and 2 arise on different sides of the tree so that no haplotypes carry both mutations. Assuming pa ≤ pb ≤ 1/2 so that the minor alleles are derived rather than ancestral, then without recombination, pab = pa for Genealogy 1, and r2(pa, pb) = pa(1 − pb)/[(1 − pa)pb] (Eberle et al., 2006). For Genealogy 2, without recombination pab = 0, so r2(pa, pb) = pa pb/[(1 − pa)(1 − pb)] (Eberle et al., 2006).

Figure 6.

In typical settings, recombination reduces linkage disequilibrium, as recombination separates new alleles from the haplotypic background on which they arose. For Genealogy 1 in Figure 6, the unconstrained maximum r2 allowing pab to take on any possible value is precisely rmax2(pa,pb) (equation 3), the value taken when pab = pa and no recombination has occurred. Thus, any recombination events that reduce pab below pa will lead to a decrease in r2. However, with Genealogy 2 we can see that situations do exist in which recombination can lead to an increase in LD. Consider Genealogy 2 and suppose recombination occurs such that the frequency of the recombinant haplotype (ab) becomes pab > 0. This haplotype can arise through recombination events between Ab haplotypes and aB haplotypes. Is it possible for r2, with recombination events allowed, to be greater than r2 in the absence of recombination? Solving the inequality

$$\frac{(p_{ab} - p_a p_b)^2}{p_a (1 - p_a)\, p_b (1 - p_b)} > \frac{p_a p_b}{(1 - p_a)(1 - p_b)},$$

we obtain

$$p_{ab} > 2 p_a p_b.$$

For each pa ≤ 1/2 and pb ≤ 1/2 it is possible to choose a value of pab that satisfies pab ≥ 2 pa pb. Recall our assumption that pa ≤ pb, which restricts pab ≤ pa. Thus, if the fraction of recombinant haplotypes satisfies

$$2 p_a p_b < p_{ab} \le p_a, \tag{12}$$

then the occurrence of recombination produces an increase in r2 compared to the maximum possible value had no recombination occurred on the genealogy. Figure 7 shows an example of the variation in r2 as a function of pab for pa = 0.3 and pb = 0.4. Once pab exceeds 2papb = 0.24, the value of r2 between loci increases above the initial value in the absence of recombination.
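The example with pa = 0.3 and pb = 0.4 can be traced numerically. This is an illustrative sketch only: the function names are hypothetical, and the chosen pab values are arbitrary points on either side of the threshold 2papb = 0.24.

```python
def r_squared(pa, pb, pab):
    """Equation 1."""
    d = pab - pa * pb
    return d * d / (pa * (1.0 - pa) * pb * (1.0 - pb))


def recombination_raises_r2(pa, pb, pab):
    """True if r2 at pab exceeds the no-recombination value (pab = 0).

    Assumes Genealogy 2 with pa <= pb <= 1/2, so that 0 <= pab <= pa.
    """
    return r_squared(pa, pb, pab) > r_squared(pa, pb, 0.0)


pa, pb = 0.3, 0.4
for pab in (0.0, 0.1, 0.2, 0.25, 0.3):
    print(f"pab = {pab:.2f}  r2 = {r_squared(pa, pb, pab):.4f}  "
          f"raised = {recombination_raises_r2(pa, pb, pab)}")
```

Values of pab between 0 and 0.24 first lower r2 toward zero (at pab = papb = 0.12) before it rises again, so only the upper part of the admissible range satisfies inequality 12.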

Figure 7.

Using inequality 12, we can determine the fraction of the space of allowed values for (pa, pb, pab) in which the unconstrained maximum r2 permitting recombination (pab not necessarily equal to 0) exceeds the maximum under the assumption that no recombination occurs (pab = 0). The volume of the region where recombination inflates r2 is

$$\int_0^{1/2} \int_0^{p_b} \int_{2 p_a p_b}^{p_a} 1\, dp_{ab}\, dp_a\, dp_b = \frac{1}{192}.$$

The volume of the region of allowed values for (pa, pb, pab), assuming pab ≤ pa ≤ pb ≤ 1/2, is

$$\int_0^{1/2} \int_0^{p_b} \int_0^{p_a} 1\, dp_{ab}\, dp_a\, dp_b = \frac{1}{48}.$$

Taking the quotient of these results, the fraction of the space of possible values in which recombination inflates r2 is 1/4. Thus, averaging over possible values for (pa, pb) with pa ≤ pb ≤ 1/2, on average 1/4 of possible values for pab lead to r2(pa, pb, pab) > r2(pa, pb, 0).
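For readers who want the intermediate steps, the two volumes can be worked out by integrating from the inside out. The following derivation is a sketch of one route to the stated values, not necessarily the route used originally.

```latex
\begin{aligned}
\int_0^{1/2}\!\int_0^{p_b}\!\int_{2p_ap_b}^{p_a} 1\, dp_{ab}\, dp_a\, dp_b
  &= \int_0^{1/2}\!\int_0^{p_b} p_a(1 - 2p_b)\, dp_a\, dp_b
   = \int_0^{1/2} \frac{p_b^2 (1 - 2p_b)}{2}\, dp_b \\
  &= \left[ \frac{p_b^3}{6} - \frac{p_b^4}{4} \right]_0^{1/2}
   = \frac{1}{48} - \frac{1}{64} = \frac{1}{192}, \\[6pt]
\int_0^{1/2}\!\int_0^{p_b}\!\int_0^{p_a} 1\, dp_{ab}\, dp_a\, dp_b
  &= \int_0^{1/2}\!\int_0^{p_b} p_a\, dp_a\, dp_b
   = \int_0^{1/2} \frac{p_b^2}{2}\, dp_b = \frac{1}{48}, \\[6pt]
\frac{1/192}{1/48} &= \frac{1}{4}.
\end{aligned}
```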

2.5 The relationship between r2(pa, pb, pab) and D′(pa, pb, pab)

So far, we have focused on the r2 measure of LD and on various properties of its maximum value. A second LD statistic, namely D′, is defined based on maxima and minima. Our computations with rmax2 provide a basis for examining the connection between r2 and D′.

D′ is defined as

$$D' = \frac{D}{D_{\max}}, \tag{13}$$

where D = pab − pa pb, Dmax = min(pa pb, (1−pa)(1−pb)) if D < 0, and Dmax = min(pa(1−pb), (1−pa)pb) if D > 0 (Lewontin, 1964). Given any values for pa and pb, D′ can take on any value from −1 to 1, thus differing from r2 in that its range is not frequency-dependent (Hedrick, 1987; Lewontin, 1988).

D′ is equal to D normalized by its maximum given the allele frequencies; r2 can similarly be normalized by its maximum to obtain r2/rmax2. This quotient is the squared correlation coefficient between allelic indicator variables at two loci, standardized by the maximum squared correlation possible given the frequencies of the alleles at the two loci. As D′2 and r2/rmax2 both have numerator D2, it is natural to compare their different normalization procedures to determine if they represent the same quantity. We can rewrite r2/rmax2 as

$$\frac{r^2}{r^2_{\max}} = \frac{D^2}{D^2_{r^2_{\max}}}. \tag{14}$$

Here, $D_{r^2_{\max}}$ is defined as pab − pa pb evaluated at the value of pab that produces the maximum of D2 as a function of pa and pb. This quantity differs across components of the allele frequency space, as described in Section 2.1. Comparing equation 14 to equation 13, we find that D′2 = r2/rmax2 if $D^2_{\max} = D^2_{r^2_{\max}}$.

The sign of D determines how Dmax is computed. Thus, whether Dmax = $D_{r^2_{\max}}$, and consequently whether D′2 = r2/rmax2, depends on the sign of D. For example, consider S1, in which pa < 1/2, pb > 1/2, and pa + pb > 1. In this component, Dmax equals $D_{r^2_{\max}}$ = (1 − pa)(1 − pb) only when D is less than 0 (pab < pa pb). In each of the eight components, $D^2_{\max} = D^2_{r^2_{\max}}$ either when D < 0 or when D > 0, but not in both cases (Table 1). Thus, the region in which D′2 = r2/rmax2 includes some but not all of the space of possible values of pa, pb, and pab. When D′2 ≠ r2/rmax2, D′2 is always greater than r2/rmax2.
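The dependence on the sign of D can be illustrated at a single point in S1. The sketch below uses a hypothetical helper, `d_prime_sq_and_normalized_r2`, and an arbitrary choice of pa = 0.3 and pb = 0.8. It computes D′2 from the definition of Dmax given in the text and r2/rmax2 from the endpoint values of pab.

```python
def d_prime_sq_and_normalized_r2(pa, pb, pab):
    """Return (D'^2, r2 / r2_max) for one point (pa, pb, pab)."""
    d = pab - pa * pb
    if d == 0.0:
        return 0.0, 0.0
    # Dmax as defined in the text (Lewontin, 1964), chosen by the sign of D
    if d < 0:
        d_max = min(pa * pb, (1.0 - pa) * (1.0 - pb))
    else:
        d_max = min(pa * (1.0 - pb), (1.0 - pa) * pb)
    # D at the two extreme admissible values of pab; the larger square
    # corresponds to the squared D that yields r2_max
    d_low = max(0.0, pa + pb - 1.0) - pa * pb
    d_high = min(pa, pb) - pa * pb
    d_r2max_sq = max(d_low * d_low, d_high * d_high)
    return (d / d_max) ** 2, d * d / d_r2max_sq


# S1: pa < 1/2, pb > 1/2, pa + pb > 1; admissible pab lies in [0.1, 0.3]
print(d_prime_sq_and_normalized_r2(0.3, 0.8, 0.15))  # D < 0: the two agree
print(d_prime_sq_and_normalized_r2(0.3, 0.8, 0.28))  # D > 0: D'^2 is larger
```

Note that r2/rmax2 reduces to a ratio of squared D values because the allele-frequency denominator of equation 1 cancels.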

As D′2 and r2/rmax2 are functions of pa, pb, and pab, we can fix one of these three variables and examine the relationship between D′2 and r2/rmax2 as a function of the other two variables. If we fix pab, then the domain for (pa, pb) is a triangle, as pa ≥ pab, pb ≥ pab, and pa + pb − 1 ≤ pab. Inside this triangle, Figure 8 shows the values of (pa, pb) where D′2 = r2/rmax2 for pab = 0.1, 0.4, and 0.7. The three graphs represent the three qualitatively different patterns observed for such graphs as pab varies from 0 to 1. For pab = 0.1, the domain spans all eight components, S1 to S8. For pab = 0.4, the domain spans all eight components, but in two of these components there is no region in which D′2 = r2/rmax2 and in two other components there is no region in which D′2 ≠ r2/rmax2. Finally, for pab = 0.7, the domain spans only two components, S2 and S3. The transition points between the three cases occur at pab = 1/4, where the boundary line pab = pa pb crosses the point (1/2, 1/2), and at pab = 1/2, where the space of allowable (pa, pb) becomes restricted to the upper right quadrant.

Figure 8.

As a function of pab, we can calculate the fraction of the space of possibilities where D′2 = r2/rmax2. For a given pab, the space of possible values of (pa, pb) is bounded by pa = pab, pb = pab, and pab = pa + pb − 1, producing a triangle of area (1 − pab)2/2. For 0 ≤ pab ≤ 1/4, we calculate the area where D′2 = r2/rmax2 by subtracting the area where the two quantities are not equal from the total area possible, yielding

$$\frac{1}{2}(1 - p_{ab})^2 - \left[ 2\int_{1/2}^{1} \int_{p_{ab}}^{p_{ab}/p_a} 1\, dp_b\, dp_a + \frac{1}{2} p_{ab}^2 + \int_{2 p_{ab}}^{1/2} \int_{p_{ab}/p_a}^{1/2} 1\, dp_b\, dp_a \right]. \tag{15}$$

For 1/4 ≤ pab ≤ 1/2, we calculate the area where D′2 = r2/rmax2 by summing areas in each quadrant and noting that the upper left and lower right quadrants have the same area. This area is

$$2\int_{p_{ab}}^{1/2} \int_{p_{ab}/p_a}^{1 + p_{ab} - p_a} 1\, dp_b\, dp_a + \int_{1/2}^{2 p_{ab}} \int_{1/2}^{p_{ab}/p_a} 1\, dp_b\, dp_a + \left(\frac{1}{2} - p_{ab}\right)^2. \tag{16}$$

For 1/2 ≤ pab ≤ 1, the calculation of the area where D′2 = r2/rmax2 is simplified due to the restriction of the space of possible values of (pa, pb) to the upper right quadrant. This area is

$$\int_{p_{ab}}^{1} \int_{p_{ab}}^{p_{ab}/p_a} 1\, dp_b\, dp_a. \tag{17}$$

Computing the integrals in equations 15–17 and then dividing by the area of the space of possible values of (pa, pb), we find that the fraction of the space where D′2 = r2/rmax2 is

$$\begin{cases} \dfrac{\frac{1}{4} + p_{ab}(1 - 4\ln 2) - p_{ab}\ln p_{ab}}{\frac{1}{2}(1 - p_{ab})^2}, & 0 < p_{ab} \le \frac{1}{4} \\[2ex] \dfrac{\frac{5}{4} + p_{ab}(4\ln 2 - 3) + 3 p_{ab}\ln p_{ab}}{\frac{1}{2}(1 - p_{ab})^2}, & \frac{1}{4} < p_{ab} \le \frac{1}{2} \\[2ex] \dfrac{p_{ab}(p_{ab} - 1 - \ln p_{ab})}{\frac{1}{2}(1 - p_{ab})^2}, & \frac{1}{2} < p_{ab} < 1. \end{cases} \tag{18}$$

Figure 9 shows a plot of this function. The minimum fraction of the space where D′2 = r2/rmax2 is ~0.31357, which occurs at pab ≈ 0.37162. The fraction is generally large for large pab; when pab is large the probability is quite high that D > 0. In S2 and S3, positive D leads to D′2 = r2/rmax2.

Figure 9.

By integrating the function in equation 18 from 0 to 1, we can obtain the fraction of the space of all three variables – pa, pb, and pab – in which D′2 = r2/rmax2. Again using the dilogarithm, we obtain

$$\frac{1}{3}\pi^2 - 4(\ln 2)^2 + \frac{3}{2} - 8\,\mathrm{Li}_2\!\left(\frac{1}{4}\right)$$

for the probability that a set of values of pa, pb and pab chosen from the space of possible values leads to r2/rmax2 = D′2. Numerically, this probability is ~0.72683.
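Equation 18, its minimum, and its integral can all be checked numerically. The following is a minimal sketch with hypothetical helper names. The dilogarithm is evaluated from its power series, the integral uses a midpoint rule with an arbitrary number of steps, and the minimum is located by a coarse grid search.

```python
import math


def fraction_equal(pab):
    """Equation 18: share of admissible (pa, pb) with D'^2 = r2 / r2_max."""
    if not 0.0 < pab < 1.0:
        raise ValueError("expected 0 < pab < 1")
    area = 0.5 * (1.0 - pab) ** 2
    log_p = math.log(pab)
    if pab <= 0.25:
        num = 0.25 + pab * (1.0 - 4.0 * math.log(2.0)) - pab * log_p
    elif pab <= 0.5:
        num = 1.25 + pab * (4.0 * math.log(2.0) - 3.0) + 3.0 * pab * log_p
    else:
        num = pab * (pab - 1.0 - log_p)
    return num / area


def dilog(z, terms=200):
    """Power series Li2(z) = sum_{k>=1} z^k / k^2, adequate for |z| <= 1/2."""
    return sum(z ** k / k ** 2 for k in range(1, terms + 1))


def integrate_midpoint(f, n=20000):
    """Midpoint rule on (0, 1); n divisible by 4 keeps 1/4 and 1/2 on cell edges."""
    h = 1.0 / n
    return h * sum(f((k + 0.5) * h) for k in range(n))


closed_form = math.pi ** 2 / 3 - 4 * math.log(2) ** 2 + 1.5 - 8 * dilog(0.25)
grid = [k / 100000 for k in range(1, 100000)]
pab_min = min(grid, key=fraction_equal)

print(f"closed form    : {closed_form:.5f}")
print(f"midpoint rule  : {integrate_midpoint(fraction_equal):.5f}")
print(f"minimum {fraction_equal(pab_min):.5f} at pab ~ {pab_min:.5f}")
```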

3 Discussion

In this paper, we have examined the mathematical relationship between r2 and allele frequencies, producing a variety of results concerning the frequency dependence of r2. By evaluating the volume below rmax2(pa,pb) we found that the mean rmax2 over the space of possible allele frequencies is only ~0.43051. This number is rather low, implying that for much of the allele frequency space, the value of r2 is severely restricted. We also calculated the formula for the proportion of the allele frequency space where it is possible for r2 to exceed some constant c (equation 6). Using the cutoff of c = 4/5 commonly employed for examining the genomic coverage of a set of “tag SNPs” in association studies, we find that it is possible for r2 to be greater than or equal to this value in only ~0.14232 of the allele frequency space.

An additional scenario that we considered is the case in which one of the allele frequencies was set to a fixed known value. This is the situation, for example, in an association study in which a marker locus with fixed known allele frequencies is used to detect a trait locus of unknown allele frequencies. By assuming a uniform distribution for the frequency of an allele at the other locus, we found that the marker minor allele frequency able to detect high LD with the largest range of values for the minor allele frequency of the trait locus was ~0.30131, not 1/2 as might have been expected from an assumption that the most polymorphic markers have the greatest potential for LD detection. Although the specific location of the optimum may change with the distribution of allele frequencies in an actual population, this result has the implication that algorithms that choose informative markers for detecting LD might produce improved performance if they ensure that a considerable fraction of markers near the optimum frequency are selected. The sharp allele frequency dependence of r2 may also mean that it is desirable to choose a range of allele frequencies among “tag SNP” markers in order to increase the probability of capturing LD with unknown trait loci.

Another perhaps surprising result, obtained by considering the effect of recombination on the value of r2 for different genealogical histories, is that in certain contexts recombination can increase rather than decrease the value of r2. This is somewhat counterintuitive; a typical scenario of loss of LD with recombination involves a decoupling of derived mutations that have occurred sequentially on the same lineage, such as in recombination events between haplotypes ab and AB of Genealogy 1 in Figure 6. In our scenarios where recombination can increase LD, in Genealogy 2 of Figure 6, the LD is produced by recombination that produces sufficient coupling between derived mutations that have occurred in parallel on separate lineages. This type of scenario is likely to be a rather unusual outcome under common assumptions about evolutionary processes; however, we did observe that such scenarios accounted for a nontrivial proportion of the space of possible values for (pa, pb, pab).

Finally, we considered the relationship between r2 and another commonly used measure of LD, D′. We found that a close connection exists between r2 and D′, in that D′2 is often equal to r2/rmax2. For any haplotype frequency pab, this equality occurs over at least ~31.357% of the space of possible allele frequencies (pa, pb), and when r2/rmax2 and D′2 are not equal, r2/rmax2 is always less than D′2. Because of its connections to both r2 and D′, there may exist some potential for r2/rmax2, which we term r2′, to serve as a useful LD measure. Although many measures of LD have situations in which they are particularly applicable (Hedrick, 1987; Hudson, 2001; Morton et al., 2001), r2′ – the squared correlation coefficient between allelic indicator variables at two loci standardized by the maximum squared correlation possible given the frequencies of the alleles at the two loci – is one of relatively few that can be used when a measure with allele-frequency-independent range is desired.

Note that in various computations we have considered the entire unit square as the domain for pa and pb. Some treatments of LD reorient alleles and loci so that only S6 or S7 is examined (e.g. Amos, 2007), or otherwise use a reorientation that spans more than one of the eight components in Figure 1 (e.g. Morton et al., 2001). Consideration of only a single component in some cases will yield results that are identical on the allowed domain to those presented (e.g. Figure 2). Particularly in the comparison between r2′ and D′2, however, restriction of the space of allele frequencies may lead to somewhat different results. Within a component, rmax2 is achieved when the haplotype with the major alleles at both loci has as high a frequency as possible, so that the normalization in the computation of r2′ depends only on the allele frequencies pa and pb. However, the normalization in the computation of D′ additionally takes into account which alleles are coupled, so that it depends on whether or not pab exceeds pa pb. Thus, reorienting alleles so that pa ≤ pb, pa ≤ 1/2, and D > 0, as is done by Morton et al. (2001), leads to a domain for pa and pb that cannot be obtained by dividing the plots in Figure 8 along one of their lines of symmetry. Consequently, given pab, the reorientation of Morton et al. (2001) will produce a different result for the probability that r2′ is equal to D′2 over the allowed domain.

We have additionally assumed Uniform(0,1) distributions of allele frequencies in many computations. This assumption can be viewed as a basis for assessing functions of allele frequencies across their entire ranges, rather than as an assumption that these distributions apply in any particular population. Our primary interest has been to provide details on the theoretical properties of r2; future work may have the potential to exploit the properties that we have uncovered, such as in interpreting unusual LD behavior, or in improving the design of disease-mapping studies that rely on patterns of LD.

Acknowledgments

We thank two reviewers for their comments. This work was supported by NIH grants R01 GM081441 and T32 HG00040 and by grants from the Alfred P. Sloan Foundation and the Burroughs Wellcome Fund.

Contributor Information

Jenna M. VanLiere, Center for Computational Medicine and Biology, University of Michigan, Ann Arbor, MI 48109-2218

Noah A. Rosenberg, Department of Human Genetics, Center for Computational Medicine and Biology and the Life Sciences Institute, University of Michigan, Ann Arbor, MI 48109-2218

References

  1. Amos CI. Successful design and conduct of genome-wide association studies. Human Molecular Genetics. 2007;16:R220–R225. doi: 10.1093/hmg/ddm161.
  2. Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. American Journal of Human Genetics. 2004;74:106–120. doi: 10.1086/381000.
  3. de Bakker PIW, Yelensky R, Pe’er I, Gabriel SB, Daly MJ, Altshuler D. Efficiency and power in genetic association studies. Nature Genetics. 2005;37:1217–1223. doi: 10.1038/ng1669.
  4. Devlin B, Risch N. A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics. 1995;29:311–322. doi: 10.1006/geno.1995.9003.
  5. Eberle MA, Ng PC, Kuhn K, Zhou L, Peiffer DA, Galver L, Viaud-Martinez KA, Taylor Lawley C, Gunderson KL, Shen R, Murray SS. Power to detect risk alleles using genome-wide tag SNP panels. PLoS Genetics. 2007;3:1827–1837. doi: 10.1371/journal.pgen.0030170.
  6. Eberle MA, Rieder MJ, Kruglyak L, Nickerson DA. Allele frequency matching between SNPs reveals an excess of linkage disequilibrium in genic regions of the human genome. PLoS Genetics. 2006;2:1319–1327. doi: 10.1371/journal.pgen.0020142.
  7. Hedrick PW. Gametic disequilibrium measures: proceed with caution. Genetics. 1987;117:331–341. doi: 10.1093/genetics/117.2.331.
  8. Hill WG, Robertson A. Linkage disequilibrium in finite populations. Theoretical and Applied Genetics. 1968;38:226–231. doi: 10.1007/BF01245622.
  9. Hudson RR. Linkage disequilibrium and recombination. In: Balding DJ, Bishop M, Cannings C, editors. Handbook of Statistical Genetics. Chichester, UK: Wiley; 2001. pp. 309–324. chapter 11.
  10. Jorgenson E, Witte JS. Coverage and power in genomewide association studies. American Journal of Human Genetics. 2006;78:884–888. doi: 10.1086/503751.
  11. Kruglyak L. Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nature Genetics. 1999;22:139–144. doi: 10.1038/9642.
  12. Lewontin RC. The interaction of selection and linkage. I. General considerations; heterotic models. Genetics. 1964;49:49–67. doi: 10.1093/genetics/49.1.49.
  13. Lewontin RC. On measures of gametic disequilibrium. Genetics. 1988;120:849–852. doi: 10.1093/genetics/120.3.849.
  14. Morton NE, Zhang W, Taillon-Miller P, Ennis S, Kwok P-Y, Collins A. The optimal measure of allelic association. Proceedings of the National Academy of Sciences USA. 2001;98:5217–5221. doi: 10.1073/pnas.091062198.
  15. Plagnol V, Wall JD. Possible ancestral structure in human populations. PLoS Genetics. 2006;2:972–979. doi: 10.1371/journal.pgen.0020105.
  16. Pritchard JK, Przeworski M. Linkage disequilibrium in humans: models and data. American Journal of Human Genetics. 2001;69:1–14. doi: 10.1086/321275.
  17. Slatkin M. Linkage disequilibrium – understanding the evolutionary past and mapping the medical future. Nature Reviews Genetics. 2008;9. doi: 10.1038/nrg2361.
  18. Terwilliger JD, Hiekkalinna T. An utter refutation of the ‘Fundamental Theorem of the HapMap’. European Journal of Human Genetics. 2006;14:426–437. doi: 10.1038/sj.ejhg.5201583.
  19. Tishkoff SA, Dietzsch E, Speed W, Pakstis AJ, Kidd JR, Cheung K, Bonné-Tamir B, Santachiara-Benerecetti AS, Moral P, Krings M, Pääbo S, Watson E, Risch N, Jenkins T, Kidd KK. Global patterns of linkage disequilibrium at the CD4 locus and modern human origins. Science. 1996;271:1380–1387. doi: 10.1126/science.271.5254.1380.
  20. Weisstein EW. CRC Concise Encyclopedia of Mathematics. 2nd edition. Boca Raton: Chapman & Hall/CRC; 2003.
  21. Wray NR. Allele frequencies and the r2 measure of linkage disequilibrium: impact on design and interpretation of association studies. Twin Research and Human Genetics. 2005;8:87–94. doi: 10.1375/1832427053738827.
  22. Zondervan KT, Cardon LR. The complex interplay among factors that influence allelic association. Nature Reviews Genetics. 2004;5:89–100. doi: 10.1038/nrg1270.