Gao T. Wang, Dajiang J. Liu and Suzanne M. Leal
Liu DJ, Leal SM. A novel adaptive method for the analysis of next-generation sequencing data to detect complex trait associations with rare variants due to gene main effects and interactions. PLoS Genet. 2010 6:e1001156. PubMed PMID: 20976247; PubMed Central PMCID: PMC2954824.
There is solid evidence that rare variants contribute to complex disease etiology. Next-generation sequencing technologies make it possible to uncover rare variants within candidate genes, exomes, and genomes. Working in a novel framework, the kernel-based adaptive cluster (KBAC) was developed to perform powerful gene/locus based rare variant association testing. The KBAC combines variant classification and association testing in a coherent framework. Covariates can also be incorporated in the analysis to control for potential confounders including age, sex, and population substructure. To evaluate the power of KBAC: 1) variant data was simulated using rigorous population genetic models for both Europeans and Africans, with parameters estimated from sequence data, and 2) phenotypes were generated using models motivated by complex diseases including breast cancer and Hirschsprung's disease. It is demonstrated that the KBAC has superior power compared to other rare variant analysis methods, such as the combined multivariate and collapsing and weight sum statistic. In the presence of variant misclassification and gene interaction, association testing using KBAC is particularly advantageous. The KBAC method was also applied to test for associations, using sequence data from the Dallas Heart Study, between energy metabolism traits and rare variants in ANGPTL 3,4,5 and 6 genes. A number of novel associations were identified, including the associations of high density lipoprotein and very low density lipoprotein with ANGPTL4.
Li B, Wang G, Leal SM. PhenoMan: phenotypic data exploration, selection, management and quality control for association studies of rare and common variants. Bioinformatics. 2014 30:442-4. PubMed PMID: 24336645; PubMed Central PMCID: PMC3904519.
Motivation: Next-generation sequencing and other high-throughput technology advances have promoted great interest in detecting associations between complex traits and genetic variants. Phenotype selection, quality control (QC) and control of confounders are crucial and can have a great impact on the ability to detect associations. Although there are programs to perform association analyses, e.g. PLINK and GenABEL, they cannot be used for comprehensive management and QC of phenotype data. To address this need PhenoMan was developed: to select individuals based on multiple phenotype criteria or population membership; control for missing covariate data; remove related individuals, duplicate samples and individuals with incorrect sex specification; recode primary traits and covariates; transform data; remove or winsorize outliers; select covariates for analysis; and create residuals. To ensure consistency and harmonization between analyses, a report is generated for every dataset. Summary statistics are also provided in graphical or text format. PhenoMan can be used for selection and manipulation of quantitative, disease and control data.
Summary: PhenoMan is freeware that provides approaches for efficient exploration and management of phenotype data. Proper QC of phenotypes before proceeding to the association analysis is critical to ensure control of type I and II errors, reliable effect estimates and consistent results between studies. PhenoMan is highly beneficial for the preparation of qualitative and quantitative trait data for association studies using new datasets as well as those obtained from public repositories.
Biao Li, Gao T. Wang and Suzanne M. Leal
Gao T. Wang, Di Zhang and Suzanne M. Leal
SEQLinkage implements a collapsed haplotype pattern (CHP) method to generate markers from sequence data for linkage analysis. The core concept is that instead of treating each variant a separate marker, we create regional markers for variants in specified genetic regions (e.g. genes) based on haplotype patterns within families, and perform linkage analysis on markers thus generated. SEQLinkage takes sequence data in VCF format and perform two-point linkage analysis. It reports both LOD and HLOD scores for linkage analysis of multiple families.
Wang GT, Li B, Santos-Cortez RP, Peng B, Leal SM. Power analysis and sample size estimation for sequence-based association studies. Bioinformatics. 2014 30:2377-8. PubMed PMID: 24778108; PubMed Central PMCID: PMC4133582.
Motivation: Statistical methods have been developed to test for complex trait rare variant (RV) associations, in which variants are aggregated across a region, which is typically a gene. Power analysis and sample size estimation for sequence-based RV association studies are challenging because of the necessity to realistically model the underlying allelic architecture of complex diseases within a suitable analytical framework to assess the performance of a variety of RV association methods in an unbiased manner.
Summary: We developed SEQPower, a software package to perform statistical power analysis for sequence-based association data under a variety of genetic variant and disease phenotype models. It aids epidemiologists in determining the best study design, sample size and statistical tests for sequence-based association studies. It also provides biostatisticians with a platform to fairly compare RV association methods and to validate and assess novel association tests.
Availability and Implementation: The SEQPower program, source code, multi-platform executables, documentation, list of association tests, examples and tutorials are available at http://bioinformatics.org/spower.
Leal SM, Yan K, Müller-Myhsok B. SimPed: a simulation program to generate haplotype and genotype data for pedigree structures. Hum Hered. 2005;60:119-22. PubMed PMID: 16224189; PubMed Central PMCID: PMC2909095.
With the widespread availability of SNP genotype data, there is great interest in analyzing pedigree haplotype data. Intermarker linkage disequilibrium for microsatellite markers is usually low due to their physical distance; however, for dense maps of SNP markers, there can be strong linkage disequilibrium between marker loci. Linkage analysis (parametric and nonparametric) and family-based association studies are currently being carried out using dense maps of SNP marker loci. Monte Carlo methods are often used for both linkage and association studies; however, to date there are no programs available which can generate haplotype and/or genotype data consisting of a large number of loci for pedigree structures. SimPed is a program that quickly generates haplotype and/or genotype data for pedigrees of virtually any size and complexity. Marker data either in linkage disequilibrium or equilibrium can be generated for greater than 20,000 diallelic or multiallelic marker loci. Haplotypes and/or genotypes are generated for pedigree structures using specified genetic map distances and haplotype and/or allele frequencies. The simulated data generated by SimPed is useful for a variety of purposes, including evaluating methods that estimate haplotype frequencies for pedigree data, evaluating type I error due to intermarker linkage disequilibrium and estimating empirical p values for linkage and family-based association studies.
Li B, Wang G, Leal SM. SimRare: a program to generate and analyze sequence-based data for association studies of quantitative and qualitative traits. Bioinformatics. 2012 28:2703-4. PubMed PMID: 22914216; PubMed Central PMCID: PMC3467746.
Motivation: Currently, there is great interest in detecting complex trait rare variant associations using next-generation sequence data. On a monthly basis, new rare variant association methods are published. It is difficult to evaluate these methods because there is no standard to generate data and often comparisons are biased. In order to fairly compare rare variant association methods, it is necessary to generate data using realistic population demographic and phenotypic models.
Result: SimRare is an interactive program that integrates generation of rare variant genotype/phenotype data and evaluation of association methods using a unified platform. Variant data are generated for gene regions using forward-time simulation that incorporates realistic population demographic and evolutionary scenarios. Phenotype data can be obtained for both case-control and quantitative traits. SimRare has a user-friendly interface that allows for easy entry of genetic and phenotypic parameters. Novel rare variant association methods implemented in R can also be imported into SimRare, to evaluate their performance and compare results, e.g. power and Type I error, with other currently available methods both numerically and graphically.
Zong-Xiao He and Suzanne M. Leal
He Z, O'Roak BJ, Smith JD, Wang G, Hooker S, Santos-Cortez RL, Li B, Kan M, Krumm N, Nickerson DA, Shendure J, Eichler EE, Leal SM. Rare-variant extensions of the transmission disequilibrium test: application to autism exome sequence data. Am J Hum Genet. 2014 94:33-46. PubMed PMID: 24360806; PubMed Central PMCID: PMC3882934.
Many population-based rare-variant (RV) association tests, which aggregate variants across a region, have been developed to analyze sequence data. A drawback of analyzing population-based data is that it is difficult to adequately control for population substructure and admixture, and spurious associations can occur. For RVs, this problem can be substantial, because the spectrum of rare variation can differ greatly between populations. A solution is to analyze parent-child trio data, by using the transmission disequilibrium test (TDT), which is robust to population substructure and admixture. We extended the TDT to test for RV associations using four commonly used methods. We demonstrate that for all RV-TDT methods, using proper analysis strategies, type I error is well-controlled even when there are high levels of population substructure or admixture. For trio data, unlike for population-based data, RV allele-counting association methods will lead to inflated type I errors. However type I errors can be properly controlled by obtaining p values empirically through haplotype permutation. The power of the RV-TDT methods was evaluated and compared to the analysis of case-control data with a number of genetic and disease models. The RV-TDT was also used to analyze exome data from 199 Simons Simplex Collection autism trios and an association was observed with variants in ABCA7. Given the problem of adequately controlling for population substructure and admixture in RV association studies and the growing number of sequence-based trio studies, the RV-TDT is extremely beneficial to elucidate the involvement of RVs in the etiology of complex traits.
Variant Association Tools
Wang GT, Peng B, Leal SM. Variant association tools for quality control and analysis of large-scale sequence and genotyping array data. Am J Hum Genet. 2014 94:770-83. PubMed PMID: 24791902; PMCID: PMC4067555.
Currently there is great interest in detecting associations between complex traits and rare variants. In this report, we describe Variant Association Tools (VAT) and the VAT pipeline, which implements best practices for rare-variant association studies. Highlights of VAT include variant-site and call-level quality control (QC), summary statistics, phenotype- and genotype-based sample selection, variant annotation, selection of variants for association analysis, and a collection of rare-variant association methods for analyzing qualitative and quantitative traits. The association testing framework for VAT is regression based, which readily allows for flexible construction of association models with multiple covariates and weighting themes based on allele frequencies or predicted functionality. Additionally, pathway analyses, conditional analyses, and analyses of gene-gene and gene-environment interactions can be performed. VAT is capable of rapidly scanning through data by using multi-process computation, adaptive permutation, and simultaneously conducting association analysis via multiple methods. Results are available in text or graphic file formats and additionally can be output to relational databases for further annotation and filtering. An interface to R language also facilitates user implementation of novel association methods. The VAT's data QC and association-analysis pipeline can be applied to sequence, imputed, and genotyping array, e.g., "exome chip," data, providing a reliable and reproducible computational environment in which to analyze small- to large-scale studies with data from the latest genotyping and sequencing technologies. Application of the VAT pipeline is demonstrated through analysis of data from the 1000 Genomes project.