Genetics of Complex Traits - Genetic Studies
Population genetics is the study of allele and genotype frequencies within groups and of how these frequencies change over succeeding generations. The genetic information carried by interbreeding individuals within a group constitutes the gene pool, which is the result of evolution modified by processes such as mutation, selection, recombination and genetic drift.
The Hardy-Weinberg Equilibrium (HWE)
In 1908, Hardy and Weinberg showed that sexual reproduction on its own does not change the frequencies of alleles from one generation to another if:
The HWE has three important properties:
The Hardy-Weinberg law can be used to calculate allele frequencies and the frequency of heterozygote carriers in a population where the frequency of the trait is already known. The estimation of allele frequencies is different for X-linked genes, where males carry one copy of the gene and females carry two. For an X-linked recessive disorder with disease allele frequency q, the frequency of affected males (q) is therefore much higher than that of affected females (q²), while for X-linked dominant traits the frequency of affected males is only about half that of affected females. When testing for equilibrium in a population, the expected and observed genotype frequencies are compared using the chi-square statistic.
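The calculations described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: one autosomal locus with two alleles, a chi-square test with 1 degree of freedom, and function names that are invented here and do not come from any particular package.

```python
import math

def allele_freqs(n_AA, n_Aa, n_aa):
    """Return (p, q) allele frequencies from observed genotype counts."""
    total = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * total)
    return p, 1.0 - p

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square statistic and p-value (1 df) for departure from HWE."""
    total = n_AA + n_Aa + n_aa
    p, q = allele_freqs(n_AA, n_Aa, n_aa)
    expected = (p * p * total, 2 * p * q * total, q * q * total)
    observed = (n_AA, n_Aa, n_aa)
    # Classes with an expected count of zero are skipped (monomorphic locus).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of the chi-square distribution with 1 df, written with erfc.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

def carrier_frequency(recessive_trait_freq):
    """Heterozygote frequency 2pq when the trait frequency equals q squared."""
    q = math.sqrt(recessive_trait_freq)
    return 2.0 * (1.0 - q) * q

def x_linked_recessive_affected(q):
    """Expected frequency of affected (males, females) for an X-linked recessive allele."""
    return q, q * q
```

For example, an autosomal recessive trait seen in 1 in 10,000 individuals gives q = 0.01 and carrier_frequency(0.0001) of about 0.0198, roughly 1 carrier in 50.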
Deviations from HWE might suggest that mutation, selection, migration or drift is acting on the population, although deviation may also be due to population stratification (Econs & Speer, 1996). Deviations with excess heterozygotes may indicate the presence of overdominant selection or outbreeding, while excess homozygotes might indicate inbreeding. Marker loci showing a homozygote excess relative to HWE were observed to be in linkage disequilibrium (LD) with mutated alleles in groups of affected individuals. HWE was also found to be useful for fine mapping of dominant disease alleles, with markers in LD showing excess heterozygotes (Botstein & Risch, 2003). Although deviation from HWE might indicate technical problems (Salanti et al., 2005), studies showed that HWE testing is not sensitive enough to detect these errors reliably (Cox & Kraft, 2006).
One of the greatest challenges for geneticists is the identification of genes that are responsible for complex traits. Unlike classical Mendelian disorders, these diseases do not show Mendelian patterns of inheritance and involve the interactions of both environmental and genetic factors. Also confounding factors such as heterogeneity, phenocopies, genetic imprinting and penetrance further complicate the identification of susceptibility genes.
When performing a genetic study, correlation between phenotype and genotype is sought. In complex traits, this correlation might be very low, either because of incomplete penetrance, where not all individuals carrying the same susceptibility allele are affected, or because some affected individuals do not carry a susceptibility allele (phenocopies). These factors lead to a wide range of disease severity even within a single family. Furthermore, late onset diseases such as cardiovascular disease and osteoporosis complicate the definition of the disease phenotype, as unaffected individuals tested today might become affected in the near future. Late onset diseases are more sensitive to environmental factors and are observed to have a higher level of genetic variation due to weak selective pressures on variants that are usually neutral in early life (Wright et al., 2003). Besides testing for a qualitative trait, where individuals are classified as either affected or normal, one can also use a quantitative or continuous measurement such as bone mineral density (BMD). When using such a variable one must be very cautious, as it might not completely correlate with the disease and can also depend on a number of non-genetic factors.
Complex disorders are most often polygenic where multiple genes contribute to the given phenotype. Complex patterns of inheritance might be due to allelic heterogeneity where different mutations in the same gene are responsible for the disease or else due to the effects of other distinct loci (locus heterogeneity). When studying complex disorders therefore one is looking for susceptibility alleles at multiple loci that together increase the individual’s risk of the disease. In polygenic traits, penetrance is determined by the genotypes of other loci and therefore it is likely to be low and varies between individuals. To increase the chance of successful gene mapping one has to identify families from probands with extreme phenotypes, earlier age at onset or else study families from an isolated population with a very high incidence of disease. Wright and colleagues (2003) suggested that it is important to identify genes with the largest contribution to the extremes of the trait and avoid quantitative trait loci (QTL) that have minimal effects on the individual or disease mechanism. Using single extended families from populations that are homogeneous and consanguineous has proved to be successful for the localisation of genes and novel mutations for a disease such as type 2 diabetes (Kambouris, 2005). Using one extended family, Kambouris reported similar results to those obtained from previous genome-wide scans using hundreds of individuals (Hanson et al., 1998). This shows that costs and time can be significantly reduced to identify novel genes responsible for complex disorders, when using extended and consanguineous families coming from homogeneous populations.
In linkage analysis, the non-independent co-segregation of marker and disease locus is tested in families with multiple individuals affected by the disease. Linked alleles (a marker allele and a disease-causing allele) on the same chromosome segregate together more often than expected by chance, in violation of Mendel’s law of independent assortment. Gene mapping of a trait identifies chromosomal loci that are shared among affected individuals and that differ between affected and non-affected family members. Family members from pedigrees with affected and unaffected individuals are genotyped for a set of polymorphic markers either across the whole genome or at specific chromosomal loci. Genetic linkage is measured by the recombination fraction, the probability that a parent will produce a recombinant offspring. This probability depends upon the distance between the loci, so the genetic (map) distance between them can be calculated from it (Haines & Pericak-Vance, 1998).
The more distant two markers are from each other, the higher the chance that a recombination event occurs between them during meiosis. The recombination fraction theta (θ) ranges from 0 for completely linked markers to 0.5 for unlinked loci. Genetic distance is measured in centiMorgans (cM), where 1cM represents 1% recombination (θ = 0.01) and corresponds on average to roughly 1 million base pairs. These measurements are less accurate over longer chromosomal distances, where multiple crossovers might occur between the loci during a single meiosis and where one crossover can reduce the chance of another occurring nearby, a phenomenon known as interference. Two mapping functions that convert the recombination fraction into map distance are Haldane’s, which does not assume interference, and Kosambi’s, which assumes interference as 1 - 2θ (Curtis, 2000).
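The two mapping functions named above can be written out as a short Python sketch. The formulas are the standard textbook forms of Haldane's and Kosambi's functions; the function names are chosen here for illustration only.

```python
import math

def _check_theta(theta):
    # Both functions are defined only for 0 <= theta < 0.5.
    if not 0.0 <= theta < 0.5:
        raise ValueError("theta must satisfy 0 <= theta < 0.5")

def haldane_cM(theta):
    """Map distance in cM under Haldane's function (no interference)."""
    _check_theta(theta)
    return -50.0 * math.log(1.0 - 2.0 * theta)

def kosambi_cM(theta):
    """Map distance in cM under Kosambi's function (interference of 1 - 2*theta)."""
    _check_theta(theta)
    return 25.0 * math.log((1.0 + 2.0 * theta) / (1.0 - 2.0 * theta))

# Compare the two functions over a range of recombination fractions.
for theta in (0.01, 0.10, 0.30):
    print(theta, round(haldane_cM(theta), 2), round(kosambi_cM(theta), 2))
```

For small θ both functions give close to 100 × θ cM, which is why 1% recombination is treated as 1cM. The two diverge as θ grows, with Haldane's function giving the larger distance.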
Parametric Linkage Analysis
Parametric linkage analysis is a statistical approach that uses the logarithm of the odds (LOD score), based on a likelihood ratio, to assess the strength of linkage. It is also known as model based linkage analysis, since the mode of inheritance, the allele frequencies at the disease and marker loci and the penetrance must all be specified. The statistic compares the likelihood (or probability) that the disease and marker loci in a family are linked at a given recombination fraction (θ in the range 0 to 0.5) with the likelihood that they are not inherited together (θ = 0.5). The LOD score is the base ten logarithm of this likelihood ratio and is calculated for each value of θ. A two point LOD (z) score (between disease locus and marker) is calculated using the following equation:
z(x) = log10 [L(θ=x) ÷ L(θ=0.5)]
Where x is a value of recombination fraction and L is the likelihood
This formula translates to:
LOD (Z) = log10 [(1 - θ)^(n - R) × θ^R ÷ 0.5^n]
Where n is the total number of informative meioses (scored offspring), R is the number of recombinants and n - R is the number of non-recombinants
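As an illustration, the two-point LOD formula above can be evaluated over a grid of recombination fractions to find the value of θ that maximises it. The sketch below assumes the simplest case, where recombinants and non-recombinants can be counted directly (phase known). The function names and the grid size are arbitrary choices made for this example.

```python
import math

def two_point_lod(theta, n, r):
    """LOD score for r recombinants among n scored meioses at a given theta."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    if theta == 0.0 and r > 0:
        # Any recombinant is impossible under complete linkage.
        return float("-inf")
    log_l_theta = (n - r) * math.log10(1.0 - theta)
    if r > 0:
        log_l_theta += r * math.log10(theta)
    log_l_null = n * math.log10(0.5)
    return log_l_theta - log_l_null

def max_lod(n, r, steps=500):
    """Return (maximum LOD, theta at the maximum) over an evenly spaced grid."""
    grid = [0.5 * i / steps for i in range(steps + 1)]
    return max((two_point_lod(t, n, r), t) for t in grid)

# Ten meioses with no recombinants reach a LOD of about 3.01 at theta = 0.
print(max_lod(10, 0))
```

With one recombinant in ten meioses the maximum shifts to θ = 0.1 (the observed proportion R ÷ n) and the LOD drops to roughly 1.6, which shows how strongly a single recombinant affects a small family.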
Significant evidence of linkage is taken at a LOD score of 3.0 or higher, and linkage is conventionally excluded at a LOD score of -2.0 or lower. A LOD score of 3.0 corresponds to odds of 1000:1, meaning that the alternate hypothesis in favour of linkage is 1000 times more likely than the null hypothesis of no linkage, while a LOD score of 3.5 is equivalent to odds of 3162:1. The LOD score can be converted to a chi-square statistic by multiplying by 4.6 (that is, 2 × ln 10), and the p-value is then calculated at 1 degree of freedom (df) for ordinary LOD scores and at 2 df for heterogeneity LOD scores (HLOD), under the null hypothesis (Ott, 1991). The p-values obtained are always divided by 2 for one-sided tests except when calculating p-values for multi-point LOD (MLOD). Using these calculations a LOD score of 3.0 is equivalent to a p-value of 0.0001 while a LOD score of 3.6 is equivalent to 0.00002. Lander and Kruglyak (1995) suggested that highly significant linkage should be defined as statistical evidence expected to occur by chance 0.001 times in a whole-genome scan, and noted that there is only a 5% chance of randomly finding a region with a p-value of 0.00002 in such a scan. For whole-genome analysis in complex traits the threshold usually used is a LOD score of 3.3. The authors also suggested that linkage should be confirmed by independent investigators, for which a nominal p-value of 0.01 should be required, and they advised caution when reporting LOD scores below 3.0, which are only suggestive of linkage. In cases of suggestive linkage, additional family data will be required before conclusions are drawn (Lander & Kruglyak, 1995).
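The conversions quoted above (odds, chi-square and one-sided p-value) can be checked with a short Python sketch. It covers only the 1 df case for ordinary LOD scores and uses the exact factor 2 × ln 10 rather than the rounded 4.6; the function names are illustrative.

```python
import math

def lod_to_odds(lod):
    """Odds in favour of linkage implied by a LOD score."""
    return 10.0 ** lod

def lod_to_chi2(lod):
    """Chi-square statistic equivalent to a LOD score (factor 2 * ln 10, about 4.6)."""
    return 2.0 * math.log(10.0) * lod

def lod_to_p(lod, one_sided=True):
    """P-value at 1 df for a non-negative LOD score, halved for a one-sided test."""
    if lod < 0:
        raise ValueError("this conversion is meant for non-negative LOD scores")
    chi2 = lod_to_chi2(lod)
    # Upper tail of the chi-square distribution with 1 df, written with erfc.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return p / 2.0 if one_sided else p

# Print odds and one-sided p-values for the thresholds discussed in the text.
for lod in (3.0, 3.3, 3.6):
    print(lod, round(lod_to_odds(lod)), lod_to_p(lod))
```

Running this gives a one-sided p-value of about 0.0001 for a LOD score of 3.0 and about 0.00002 for 3.6, in line with the figures in the text.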
LOD scores can be influenced by a number of factors including the phase or whether parental genotypes are known, misspecification of disease and marker allele frequencies, penetrance, heterogeneity and mostly by phenotypic misclassification. Also for more accurate linkage information and to better localize the disease gene, multi-point linkage analysis is preferred over two-point analysis. Statistical analyses in complex pedigrees are carried out using software such as MLINK and GENEHUNTER where the LOD score can also be adjusted for locus heterogeneity (HLOD) (Kruglyak et al., 1996).
Another kind of analysis which is thought to be useful when analysing linkage data for complex traits is the MOD-score. The reason is because in complex traits both the genetic model and disease allele frequency are very difficult to be correctly specified. Wrong assumption of the genetic model can significantly affect the analysis and can result in a false negative result. The MOD score is calculated by maximising the LOD score over a number of replicates using different penetrances and disease allele frequencies, to obtain a maximum LOD score using the best genetic model (Strauch et al., 2003). To control type I errors, it was found that a MOD-score of 3.0 should be adjusted by a value ranging from 0.3 – 1.0 where it was proposed that a MOD-score of 2.5 is indicative of suggestive linkage (Berger et al., 2005). MOD-score analysis can be used to determine the best genetic model for those regions indicated by an initial genome scan using ordinary LODs and it can also be calculated assuming paternal or maternal imprinting. When assuming imprinting a heterozygote paternal penetrance is also used with the other three penetrances with a total of four penetrance values. If a low heterozygote frequency is calculated for paternal imprinting it means that maternal genes are preferentially expressed at that locus (Strauch et al., 2005; Berger et al., 2005).
Non-Parametric Linkage Analysis
Since the mode of inheritance is uncertain for complex disorders, evidence of linkage might be missed by using the LOD score method described above. A more appropriate approach is that described by Kruglyak et al. (1996), known as non-parametric linkage (NPL) or model free analysis. The NPL statistic measures allele sharing among affected relative pairs (ARP) and/or affected sib-pairs (ASP) within a pedigree. By chance, siblings are expected to share zero, one or two marker alleles identical by descent (IBD) with probabilities of 0.25, 0.50 and 0.25 respectively. If disease and marker alleles are linked, affected siblings will share these alleles more frequently than expected by chance, regardless of the mode of inheritance. The expected and observed allele sharing between ASPs are then compared using the chi-square statistic. Highly heterozygous markers, multipoint linkage and the genotyping of non-affected siblings when parents are not available all help to increase the sharing information. One great advantage of the NPL statistic is that data from markers on a chromosome can also be evaluated in a multipoint approach using software such as GENEHUNTER, which uses the Lander-Green algorithm to calculate the IBD distribution (Kruglyak et al., 1996).
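A simplified version of the affected sib-pair comparison can be sketched as follows. It assumes that the IBD state of every pair is known without ambiguity, which is rarely true in practice, and it is not the NPL statistic computed by GENEHUNTER. The function names are invented for this example.

```python
import math

# Expected proportions of sib pairs sharing 0, 1 or 2 alleles IBD by chance.
EXPECTED_IBD = (0.25, 0.50, 0.25)

def asp_ibd_chi_square(n0, n1, n2):
    """Chi-square statistic and p-value (2 df) for observed IBD sharing counts."""
    observed = (n0, n1, n2)
    total = sum(observed)
    if total == 0:
        raise ValueError("at least one sib pair is required")
    chi2 = sum((o - f * total) ** 2 / (f * total)
               for o, f in zip(observed, EXPECTED_IBD))
    # The upper tail of the chi-square distribution with 2 df is exp(-x / 2).
    return chi2, math.exp(-chi2 / 2.0)

def mean_ibd_sharing(n0, n1, n2):
    """Mean proportion of alleles shared IBD (0.5 expected under no linkage)."""
    total = n0 + n1 + n2
    return (n1 + 2 * n2) / (2 * total)
```

With 100 hypothetical pairs split 10, 50 and 40 across the three sharing classes, the mean sharing is 0.65 and the chi-square value is 18, an excess of sharing that would point towards linkage.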
Association analysis tests for the relationship between alleles and phenotype in a population. There are two types of association studies, the case-control and family based or transmission disequilibrium test (TDT). In case-control analysis, marker alleles (usually single nucleotide polymorphisms) are compared between a group of unrelated affected individuals and a control group that are age and sex matched. If the marker allele is found more frequently in the affected group when compared to the control group, a positive association is found. The frequencies between these groups are analysed by the chi-square statistic. In the candidate gene approach, background knowledge of physiology and disease is used to select single nucleotide polymorphisms (SNPs) in known genes that are most likely to be involved in the disease. Biologically, a disease susceptibility allele positively associated with disease might be either directly involved in the disease or else it might be in linkage disequilibrium (LD) with a disease susceptibility locus. In this case the closer the marker locus is to the disease locus, the stronger is the association. The detection of LD is useful for fine mapping as the distance between loci is generally very small, usually much less than 1cM.
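The case-control comparison described above reduces, in its simplest allelic form, to a chi-square test on a 2 × 2 table of allele counts. The sketch below assumes a biallelic marker and uses no continuity correction; the function name and the counts in the example are invented for illustration.

```python
import math

def allelic_association(case_risk, case_other, control_risk, control_other):
    """Return (odds ratio, chi-square, p-value at 1 df) for a 2 x 2 allele table."""
    a, b, c, d = case_risk, case_other, control_risk, control_other
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive in this simple sketch")
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    # Pearson chi-square for a 2 x 2 table, without continuity correction.
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, chi2, p_value

# Hypothetical counts: 60 risk and 40 other alleles in cases, 40 and 60 in controls.
print(allelic_association(60, 40, 40, 60))
```

For these made-up counts the odds ratio is 2.25 and the chi-square value is 8, giving a p-value a little below 0.005.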
Problems with association studies especially with replication of results arise most often due to sample size. Different suggestions have been made about sample size that depends upon both the relative risk and frequencies of susceptibility alleles involved. It was also suggested that power will be increased by increasing the number of individuals rather than by increasing the number of SNPs (Long & Langley, 1999; Pharoah et al., 2004). On the other hand, large scale association studies might still not give valid answers and conclusions when genetic variables are very heterogenous (Ioannidis et al., 2003).
Population stratification is another reason leading to inconsistencies of results. Freedman and co-workers analysed this problem in a number of case-control studies and concluded that stratification occurred even in very well designed studies (Freedman et al., 2004). Population stratification refers to the differences in allele frequencies between cases and controls that are not due to the association of genes with the disease but most likely due to inappropriate selection of these groups. This problem is common in heterogenous populations with recent admixture due to intercontinental migrations (Freedman et al., 2004). Inconsistency in results might also be due to allelic heterogeneity together with the influence of interactions between environmental and genetic factors (SNPs) affecting the phenotype (Colhoun et al., 2003).
Linkage Disequilibrium (LD)
Linkage disequilibrium refers to the non-random association between alleles at loci that lie so close together that they are rarely separated by recombination and so are inherited together more often than expected by chance. LD plays a very important role in fine mapping of complex disease genes and also gives information about population history and origins. When a mutation occurs in a gene, every marker allele on that chromosome is associated with the disease allele, but as this chromosome is passed to future generations the alleles are rearranged by recombination. The closer a marker allele is to the disease allele, the smaller the chance that a recombination event separates them. Alleles close to the disease allele therefore continue to be inherited together with it for more generations before recombination eventually occurs between them. With time only very close loci continue to be inherited together, and so the extent of LD deteriorates over a number of generations, a phenomenon known as LD decay.
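LD decay can be illustrated with the standard population genetics result that the disequilibrium coefficient D shrinks by a factor of (1 - θ) each generation. The sketch below assumes random mating in a large population with no selection or drift; the function names are chosen for this example.

```python
import math

def ld_coefficient(p_ab, p_a, p_b):
    """D = haplotype frequency minus the product of the two allele frequencies."""
    return p_ab - p_a * p_b

def ld_after(d0, theta, generations):
    """Expected D after a number of generations: D_t = D_0 * (1 - theta)^t."""
    return d0 * (1.0 - theta) ** generations

def ld_half_life(theta):
    """Number of generations needed for D to fall to half its starting value."""
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must satisfy 0 < theta <= 0.5")
    return math.log(0.5) / math.log(1.0 - theta)

# Loci 1 cM apart (theta = 0.01) versus loci 0.1 cM apart (theta = 0.001).
print(ld_half_life(0.01), ld_half_life(0.001))
```

Under these assumptions LD between loci 1cM apart halves in about 69 generations, whereas at 0.1cM it takes close to 700 generations, which is why only very closely spaced markers stay associated with an old mutation.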
It was shown that in most instances LD does not extend beyond 1cM and Kruglyak (1999) concluded that LD is unlikely to extend beyond 3 kb. The genetic distance over which LD extends is thought to vary between different populations, and LD was observed to be highly structured into discrete blocks separated by recombination hot spots (Goldstein, 2001). LD can be affected by both genomic and population factors such as selection, mutation, genetic drift, recombination rates, population size and admixture (Pritchard & Przeworski, 2001). Gene conversion and major population bottlenecks might also be the reasons for the difference in LD between African and European populations; in the former LD decays faster, and therefore more polymorphic markers will be needed to perform a disease association study (Frisse et al., 2001). Isolated populations were observed to exhibit LD and haplotype sharing over longer genetic distances, especially where a major founder bottleneck occurred only about 20 generations ago (De La Chapelle & Wright, 1998; Service et al., 2006).
Classic linkage analysis proved successful for the identification of genes having a strong effect on the phenotype and showing a Mendelian pattern of inheritance. LD is a promising approach to identify genes involved in most complex disorders, where penetrance is low and the effect of a gene is influenced by other genes and environmental factors. Unlike linkage analysis, which is performed in families, LD mapping is carried out in populations where to detect LD between marker and disease one needs to compare cases and controls. To be associated with disease, marker loci (haplotypes) inherited from a common ancestor are observed to occur more frequently than expected by chance in the affected group. In order to perform high resolution mapping one needs to genotype a very large number of polymorphic markers across the whole genome or at interesting loci already indicated by a previous linkage study. Biallelic single nucleotide polymorphisms (SNPs) are usually used since they are found approximately one every one thousand bases. Kruglyak (1999) suggested that for a genome-wide LD mapping approximately 500,000 SNPs will be required to be genotyped assuming that LD does not extend beyond 3 kb, that is now possible to perform with the introduction of the 500K micro-array. Since LD seems to be structured forming haplotype blocks it was proposed that by knowing the association between SNPs one can reduce dramatically the amount of SNP genotyping to perform a genome-wide LD mapping with a very little loss of information (International HapMap Consortium, 2003; Clark, 2003). The only drawback of this approach is when the disease is caused by very rare alleles (<1%), which makes LD mapping not so effective. In these instances using large pedigrees might be much more appropriate to detect association by co-segregation of flanking markers or by using TDT (Clark, 2003).
As discussed earlier, one of the main problems frequently encountered with genetic association studies is the lack of replication of results. To resolve this problem, results from different studies are being analysed together using an approach known as meta-analysis. Meta-analysis offers one way to perform statistical analysis of very large samples, usually made up of thousands of individuals, which are very difficult to analyse in a single laboratory. There are mixed opinions about the usefulness of this approach but still it is thought that, if carefully used, this analysis can convincingly identify variants having a modest effect on disease (Lohmueller et al., 2003). Data collected from previous published studies must be assessed carefully by applying strict inclusion and exclusion criteria. Studies with genotypes that deviated from Hardy-Weinberg equilibrium should be excluded and proper assessment for heterogeneity to verify whether this is due to ethnicity, phenotype or gender, should be done (Munafo & Flint, 2004). Major pitfalls of meta-analysis include the risk of false positive results due to publication bias (due to publications in different languages and bias towards positive results) and other problems such as population stratification, environmental interactions and variability of LD in the genome. For these reasons it is believed that the best approach for the detection of alleles having a modest effect on disease is to use very large and well designed primary studies using thousands of individuals followed by meta-analysis (Salanti et al., 2005; Munafo et al., 2005; Uitterlinden et al., 2006).
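One common way of combining association results is a fixed-effect, inverse-variance meta-analysis of log odds ratios, with Cochran's Q as a check for heterogeneity between studies. The sketch below is a generic illustration of that method and not the procedure used in any of the studies cited above. Each study is given as a 2 × 2 table of counts, and the function names are invented for this example.

```python
import math

def log_odds_ratio(a, b, c, d):
    """Log odds ratio and its variance (Woolf's method) for one 2 x 2 table.

    a, b = risk and other allele counts in cases; c, d = the same in controls.
    """
    return math.log((a * d) / (b * c)), 1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d

def fixed_effect_meta(studies):
    """Return (pooled odds ratio, standard error of pooled log OR, Cochran's Q)."""
    estimates = [log_odds_ratio(*study) for study in studies]
    weights = [1.0 / variance for _, variance in estimates]
    total_weight = sum(weights)
    pooled = sum(w * est for w, (est, _) in zip(weights, estimates)) / total_weight
    # Q is compared with a chi-square distribution on (number of studies - 1) df.
    q = sum(w * (est - pooled) ** 2 for w, (est, _) in zip(weights, estimates))
    return math.exp(pooled), math.sqrt(1.0 / total_weight), q
```

A large Q relative to its degrees of freedom suggests that the studies are not estimating the same effect, in which case ethnicity, phenotype definition or gender should be examined before the pooled estimate is interpreted.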
Meta-analysis was also applied for linkage studies where data from different genome-wide scans was analysed together. This type of analysis suffers from major problems like those encountered in meta-analysis of association studies (Etzel & Guerra, 2002). Best results can be obtained by pooling the raw data from different studies and performing linkage analysis again and not by just comparing results.
Defining the Genotypes
As already mentioned, to perform a successful gene mapping study a number of polymorphic markers have to be typed in affected and non-affected individuals in order to be able to identify genes that increase the risk of disease. Different types of genotyping markers were used in recent years and new techniques for typing are constantly being developed to increase efficiency, accuracy and throughput while reducing costs.
Short tandem repeats (STRs) or microsatellites are widely distributed in the genome and so are useful tools for genome-wide scans. These tandem repeats can be dinucleotide, trinucleotide or tetranucleotide repeats, where polymorphism is generated by gain or loss of repeats, usually as a result of both replication slippage and point mutation (Ellegren, 2002). Microsatellites have several advantages for typing, the most important of which is that they are highly polymorphic with a very high heterozygosity (>70%), making them ideal for linkage studies. Another advantage is that they can be very easily typed using PCR techniques in which fluorescently labelled primers flanking the polymorphic region are designed. The variable number of repeats creates amplicons of different sizes which can be typed using automatic sequencers such as those by Applied Biosystems (ABI) (PE Applied Biosystems Division, Foster City, CA). In 1994, Reed and co-workers described chromosome-specific sets of markers, fluorescently labelled with different dyes such as FAM, TET and HEX, covering all autosomes and the X chromosome with an average spacing of 13cM (Reed et al., 1994). Today, different sets of markers across the whole genome are electronically available from databases such as those of Marshfield Institute of Genetics (http://research.marshfieldclinic.org/genetics/) and the Cooperative Human Linkage Centre (http://gai.nci.nih.gov/CHLC/). Markers can be selected from these databases either across the whole genome or at candidate loci, usually with an average spacing of 10cM, or 5cM for higher resolution. To increase throughput and reduce costs, the amplified fragments are carefully pooled in sets in such a way that the allele size ranges do not overlap within a set, with different dyes used for different sets.
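The pooling rule described above, with no overlapping allele size ranges within a set, can be sketched as a simple greedy assignment. This is only one possible way of building such sets, and the marker names and size ranges in the example are hypothetical.

```python
def pool_markers(markers, gap=0):
    """Group markers into sets whose allele size ranges do not overlap.

    markers: list of (name, min_size_bp, max_size_bp) tuples.
    gap: extra spacing in base pairs required between neighbouring ranges.
    Each returned set would then be labelled with its own dye.
    """
    sets = []
    # Process markers from the smallest allele size range upwards.
    for marker in sorted(markers, key=lambda m: m[1]):
        _, low, _ = marker
        for current in sets:
            # The marker fits if it starts above the last range placed in this set.
            if current[-1][2] + gap < low:
                current.append(marker)
                break
        else:
            sets.append([marker])
    return sets

# Hypothetical markers with allele size ranges in base pairs.
example = [("M1", 100, 120), ("M2", 110, 130), ("M3", 125, 140), ("M4", 150, 170)]
for dye_set in pool_markers(example):
    print([name for name, _, _ in dye_set])
```

In this example three of the four markers fit into one set and the remaining marker, whose range overlaps its neighbours, goes into a second set that would be labelled with a different dye.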
Single Nucleotide Polymorphisms
Single nucleotide polymorphisms (SNPs) are found more frequently in the genome and are used for high resolution gene mapping. Since they are biallelic, a disadvantage of SNPs is their low heterozygosity. On the other hand their stability, due to very low mutation rates, makes them more useful for LD mapping. In order to reach this aim, a very dense map of SNPs will be needed to perform a genome-wide LD mapping (Kruglyak, 1999). More than 1.5 million SNPs were found in public databases (in 2002) such as dbSNP (http://www.ncbi.nlm.nih.gov/SNP/), most of which were observed to occur in clusters of 10 – 15kb regions, randomly distributed across the genome (Carlston & Newman, 2002). The number of reported SNPs in the human genome increased to approximately 9.5 million in 2007 (as checked on NCBI in September 2007).
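The difference in heterozygosity between biallelic SNPs and multi-allelic microsatellites follows from the usual formula for expected heterozygosity, one minus the sum of squared allele frequencies. The short sketch below assumes Hardy-Weinberg proportions; the function name and the allele frequencies used are illustrative.

```python
def expected_heterozygosity(allele_freqs):
    """Expected heterozygosity H = 1 - sum of squared allele frequencies."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(f * f for f in allele_freqs)

# A biallelic SNP cannot exceed 0.5, reached when both alleles are equally common.
print(expected_heterozygosity([0.5, 0.5]))
# A hypothetical microsatellite with ten equally frequent alleles reaches 0.9.
print(expected_heterozygosity([0.1] * 10))
```

This is why many more SNPs than microsatellites are needed to extract the same amount of inheritance information from a pedigree.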
SNPs in association studies can either be tested directly for their functional effects, especially if found in coding regions of genes, or be used as markers for LD. Changes in non-coding sequences and synonymous changes in exons are more common than those that result in amino acid substitutions, due to a greater selective pressure on the latter (Gray et al., 2000). The most common changes in the genome are transitions, with CpG dinucleotides showing the highest mutation rates due to deamination. It was also estimated that approximately 20% of non-synonymous SNPs actually have a deleterious effect on the structure and/or function of the protein. These SNPs are the most likely to affect the phenotype, and their number in the genome was estimated to reach 10³ (Sunyaev et al., 2001). The current approach used in complex disease is to select SNPs found in candidate genes for the disease phenotype based on the available knowledge of physiology. This approach is unsatisfactory, as knowledge of the pathophysiology and molecular biology of complex diseases remains very limited.
Single nucleotide polymorphisms can be analysed using various techniques such as restriction fragment length polymorphism (RFLP) analysis, which is quite laborious and time consuming. To be able to analyse thousands of SNPs in genome-wide LD mapping, novel techniques are being developed that are more robust, cheap and fully automated, and in which different SNPs can be multiplexed and analysed together. These include various hybridization techniques using real-time polymerase chain reaction (Real Time-PCR) and microarrays (Carlston & Newman, 2002; Kirk et al., 2002). Matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry is a promising technique being developed for large scale SNP genotyping, and is very efficient and cheap (Griffin & Smith, 2002). DNA pooling of affected individuals and controls is also useful for performing an initial screen for genetic association, although with this technique one can only estimate allele frequencies and cannot perform haplotyping (Bansal et al., 2002).
Concepts of Genetics, 5th Edition, (1997) Prentice Hall Inc, New Jersey, USA
Human Genetics, 3rd Edition, (1997) Springer-Verlag, Berlin Heidelberg, New York
Human Molecular Genetics, (2004) Garland Publishing, New York, USA