The current version is 0.9, released on Jan 19, 2011.
You are encouraged, but not required, to send Dr. Guan an e-mail for registration. A registration e-mail should have the title "bimbam registration" and simply provide your name, institution, and the OS you are using on a single line, delimited by "," or ";". For example: John Belmont, Baylor College of Medicine, Linux.
Source code available upon request.
- View the pdf version.
- Below is the online version with links to the contents.
- What piMASS Can Do?
- Input File Formats
- Running piMASS
- Output Files
- Log file: prefix.log
- SNP information file: prefix.snp.txt
- Sampling path file: prefix.path.txt
- Sampled model file: prefix.gamma.txt
- Sampled model file: prefix.mcmc.txt
- Perform multi-SNP association analysis either genome-wide, or on a small region.
- Estimate proportion of phenotypic variance explained by the genotypes.
- Estimate the posterior of the number of SNPs that have effects on phenotype.
- Report the marginal posterior inclusion probabilities of each SNP, a measure of the strength of the marginal association.
rs1, A, T, 0.02, 0.80, 1.50
rs2, G, C, 0.98, 0.04, 1.00
Example phenotype file with 3 individuals:
If the phenotypes are binary (e.g. in a case-control study) then the format is the same, but each entry should be 0, 1 or NA. It does not matter which group is denoted 0 and which denoted 1.
This quantile transformation does not fully solve the problem (it ensures that the phenotype is normal overall, but not necessarily normal within each genotype class). However, with the small effect sizes typical in genetic association studies it appears to be a simple sensible way to guard against strong departures from modeling assumptions. If you have other covariates that may be important predictors of phenotype (e.g. Age, Sex) we suggest first regressing the phenotype on these covariates using standard multiple linear regression software, and then running piMASS on the residuals from this regression (after applying a normal quantile transformation to these residuals).
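The normal quantile transformation mentioned above can be sketched as follows. This is a minimal illustration, not the procedure piMASS itself applies: the function name `quantile_normalize`, the use of `None` for missing phenotypes, and the `(rank - 0.5) / n` plotting position are all assumptions made for the example.

```python
from statistics import NormalDist

def quantile_normalize(values):
    """Map non-missing phenotypes to standard normal quantiles by rank.

    Missing entries (None) are passed through unchanged. Tied values all
    receive the quantile of their average rank. The (rank - 0.5) / n
    offset is one common convention, assumed here for illustration.
    """
    observed = sorted((v, i) for i, v in enumerate(values) if v is not None)
    n = len(observed)
    result = [None] * len(values)
    std_normal = NormalDist()
    start = 0
    while start < n:
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < n and observed[end + 1][0] == observed[start][0]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # 1-based average rank of the tie group
        quantile = std_normal.inv_cdf((avg_rank - 0.5) / n)
        for _, index in observed[start:end + 1]:
            result[index] = quantile
        start = end + 1
    return result

# Hypothetical phenotypes; None stands in for an NA entry.
phenotypes = [2.7, None, 0.3, 15.0, 1.1]
transformed = quantile_normalize(phenotypes)
```

The transformation keeps the ordering of the individuals and pulls in outliers such as 15.0. When covariates are regressed out first, the same function would be applied to the residuals rather than to the raw phenotypes.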
The file contains three columns: the first column is the SNP ID, the second column is its physical location, and the third column is its chromosome number. The rows need not be ordered by position, but the file must contain all the SNPs in the genotype files. If the genotype files contain SNPs from different chromosomes, piMASS will sort the SNPs by chromosome and position.
rs1, 1200, 1
rs2, 4000, 1
rs3, 3320, 1
Note: This file is strictly needed only if the order of the SNPs in the genotype file differs from the order of their physical locations along the chromosome, or if multiple genotype and phenotype files are used (see below). Aligning SNPs in the correct order is important for the sampling procedure, because the sampler proposes exchanging nearby SNPs to facilitate mixing among correlated SNPs.
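Before running piMASS it can be useful to check a position file by hand. The sketch below parses rows in the three-column format shown above, orders them by chromosome and position, and lists genotype SNPs that have no position entry. The helper names are hypothetical, and the ordering rule (numeric chromosomes first, then other labels alphabetically) is an assumption; the source does not say how piMASS breaks such ties.

```python
def parse_position_rows(lines):
    """Parse 'snp_id, position, chromosome' rows into tuples."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        snp_id, position, chromosome = [field.strip() for field in line.split(",")]
        rows.append((snp_id, int(position), chromosome))
    return rows

def location_key(row):
    """Sort key: numeric chromosomes numerically, other labels after them."""
    _, position, chromosome = row
    if chromosome.isdigit():
        return (0, int(chromosome), "", position)
    return (1, 0, chromosome, position)

def order_by_location(rows):
    return sorted(rows, key=location_key)

def missing_positions(genotype_snp_ids, rows):
    """Return genotype SNP IDs that do not appear in the position rows."""
    known = {snp_id for snp_id, _, _ in rows}
    return [snp_id for snp_id in genotype_snp_ids if snp_id not in known]

position_lines = ["rs1, 1200, 1", "rs2, 4000, 1", "rs3, 3320, 1"]
rows = parse_position_rows(position_lines)
ordered_ids = [snp_id for snp_id, _, _ in order_by_location(rows)]
# "rs4" is a made-up SNP ID used to show the missing-position check.
missing = missing_positions(["rs1", "rs2", "rs4"], rows)
```

For the example file, the physical order is rs1, rs3, rs2, which differs from the file order; this is the situation in which the position file matters.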
In some cases it may be convenient to provide genotypes (and corresponding phenotypes) in multiple files. For example, in a genome-wide study, it may be helpful to have one genotype file containing the case data and a second genotype file containing the control data. Alternatively, one genotype file may contain the individuals from the first stage of the study and another the individuals from the second stage.
When using multiple genotype files, piMASS does not require that the same SNPs be present in all files (although if the same SNP is present in more than one file, its SNP identifier should be the same in each, to convey this information). However, a SNP that is missing from one of the files will be excluded from the study because of the missingness.
When using multiple genotype files, the user must also provide multiple phenotype files, with each phenotype file corresponding to the individuals in a genotype file. The exception to this is that, for a case/control study, the case phenotypes can be specified by -p 1 and control phenotypes can be specified by -p z.
When merging genotypes from different studies, there arises the issue of whether or not the genotypes for a SNP were obtained on the same strand. In some cases this can be checked easily: for example, if a SNP in one study is A/G, and in the other is T/C, we infer that the two studies used different strands, and we can flip one of the SNPs to correct this. piMASS performs these kinds of flips automatically. However, if a SNP is A/T or C/G, one cannot tell whether the strandedness is the same or different across studies without external information. In this situation, piMASS assumes that genotypes for a single SNP in multiple input files refer to the same strand.
Note: if genotypes at a SNP are not compatible with the SNP being bi-allelic, even after strand flips (as might happen when multiple genotype files are used), then the SNP is considered to be "bad" and piMASS will exclude the SNP from the study.
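The strand logic described above can be illustrated with a small sketch. It is not the piMASS implementation; the function name `compare_strands` and its four return labels are hypothetical, chosen only to mirror the cases in the text.

```python
# Watson-Crick complements, used to model a strand flip.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def compare_strands(alleles_a, alleles_b):
    """Classify how the allele pair from study B relates to that from study A.

    Returns one of:
      "same"         - identical allele sets, no flip needed
      "flip"         - sets match only after complementing study B
      "ambiguous"    - A/T or C/G: a flip cannot be detected from alleles alone
      "incompatible" - no match even after a flip (the "bad" SNP case)
    """
    a = frozenset(alleles_a)
    b = frozenset(alleles_b)
    b_flipped = frozenset(COMPLEMENT[allele] for allele in b)
    if a == b and a == b_flipped:
        return "ambiguous"
    if a == b:
        return "same"
    if a == b_flipped:
        return "flip"
    return "incompatible"

flip_case = compare_strands(("A", "G"), ("T", "C"))
ambiguous_case = compare_strands(("A", "T"), ("A", "T"))
bad_case = compare_strands(("A", "G"), ("A", "C"))
```

In the "ambiguous" case, the text above states that piMASS treats the genotypes as being on the same strand, so checking such SNPs against external information before merging is worthwhile.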
First some general comments:
- piMASS is a command-line program. The command should be typed in a terminal window, in the directory in which the piMASS executable resides.
- The command should be entered all on one line: any line break in the examples is only because the line is too long to fit on the page.
- Unless otherwise stated, the "options" (-g -p -pos -o, etc.) are all case-sensitive.
- A single genotype file and a single phenotype file
./pimass -g cohort.txt -p pheno.txt -w 10000 -s 100000 -o pref -num 10
This command line runs MCMC with 10000 burn-in steps and 100000 sampling steps, recording one sample every 10 steps. The output file names will begin with pref.
- Multiple genotype files and multiple phenotype files
./pimass -g cohort1.txt -p pheno1.txt -g cohort2.txt -p pheno2.txt -w 100000 -s 1000000 -o pref -pos pos.txt -num 10
This command line takes two genotype files and two phenotype files, merges them based on SNP ID, and sorts the SNPs according to their locations in the position file. In this example piMASS will run 100k warm-up steps followed by 1M sampling steps, recording the sampled state every 10 steps.
- Binary phenotypes
./pimass -g case_mgt.txt -p 1 -g ctrl_mgt.txt -p z -pos pos.txt -o pref -w 10000 -s 1000000 -num 100 -cc
This command line asks piMASS to take two genotype files. The -p 1 option assigns phenotype 1 to all individuals in the matching genotype file (case_mgt.txt in the example), and -p z assigns phenotype 0 to all individuals in the matching genotype file (ctrl_mgt.txt in the example). piMASS will run 10k warm-up steps followed by 1M sampling steps, recording the sampled state every 100 steps. The -cc option tells piMASS that the data have binary phenotypes.
- Setup other parameters
./pimass -g cohort.txt -p pheno.txt -w 10000 -s 100000 -o pref -num 10 -hmin 0.01 -hmax 0.5 -pmin 1 -pmax 1000 -smin 1 -smax 200
This command line is identical to the first command line in the list except that it specifies ranges for the hyper-parameters h and p and restricts the number of SNPs in the model. Specifically, it sets the minimum and maximum of h to 0.01 and 0.5 respectively, and the minimum and maximum of p to 1 and 1000 out of the total number of SNPs, respectively. In addition, it restricts the minimum and maximum number of SNPs in a model to 1 and 200.
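When piMASS is launched from a script, the command lines above can be assembled programmatically. The following sketch builds the argument list for the first example and estimates how many samples it records. The helper names are hypothetical; only the flags (-g, -p, -pos, -w, -s, -num, -o, -cc) come from this manual. The sample count assumes that one state is recorded every -num sampling steps and that burn-in steps are not recorded.

```python
import shlex

def build_pimass_command(pairs, warmup, steps, thin, prefix,
                         pos_file=None, case_control=False):
    """Assemble a pimass argument list from genotype/phenotype pairs."""
    args = ["./pimass"]
    for genotype, phenotype in pairs:
        # Each -g must be paired with a -p (a file name, or 1 / z).
        args += ["-g", genotype, "-p", phenotype]
    if pos_file is not None:
        args += ["-pos", pos_file]
    args += ["-w", str(warmup), "-s", str(steps), "-num", str(thin), "-o", prefix]
    if case_control:
        args.append("-cc")
    return args

def recorded_samples(steps, thin):
    """Approximate number of recorded states (assumption: steps // thin)."""
    return steps // thin

command = build_pimass_command([("cohort.txt", "pheno.txt")],
                               warmup=10000, steps=100000, thin=10, prefix="pref")
command_line = shlex.join(command)
sample_count = recorded_samples(100000, 10)
```

The list form can be passed to `subprocess.run(command)` directly, which avoids shell quoting problems with unusual file names.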
piMASS will create output files in a directory named output/. (If this directory does not exist then it will be created.) Output files will be produced, each with a name beginning with the "prefix" that was specified by the -o option. We now describe the contents of these output files.
For quantitative phenotypes, it contains SNP ID, chromosome, position, estimates of the posterior inclusion probabilities based on simple counting, estimates of the posterior inclusion probabilities based on Rao-Blackwellisation, naive estimates of the posterior effect size, and Rao-Blackwellised estimates of the posterior effect size.
For binary phenotypes (when -cc is used), the output file differs in that the two columns of Rao-Blackwellised estimates are omitted.
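Once the SNP information file for a quantitative phenotype has been read into memory, SNPs can be ranked by their Rao-Blackwellised posterior inclusion probability. The sketch below assumes each record is a tuple in the column order listed above; it deliberately does not parse the file, because the delimiter and header layout are not specified here. The function name, the 0.5 threshold, and the example records are all made up for illustration and are not piMASS output.

```python
# Column order taken from the description above:
# (snp_id, chromosome, position, pip_count, pip_rb, beta_naive, beta_rb)
PIP_RB = 4  # index of the Rao-Blackwellised inclusion probability

def top_snps(records, threshold=0.5):
    """Return records with Rao-Blackwellised PIP >= threshold, highest first."""
    selected = [record for record in records if record[PIP_RB] >= threshold]
    return sorted(selected, key=lambda record: record[PIP_RB], reverse=True)

# Hypothetical records with invented values, for illustration only.
records = [
    ("rs1", "1", 1200, 0.10, 0.12, 0.01, 0.01),
    ("rs2", "1", 4000, 0.85, 0.90, 0.30, 0.31),
    ("rs3", "1", 3320, 0.55, 0.60, 0.12, 0.11),
]
ranked_ids = [record[0] for record in top_snps(records)]
```

For binary phenotypes the Rao-Blackwellised columns are absent, so the counting-based estimate would have to be used as the ranking key instead.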
FILE I/O RELATED OPTIONS:
- -g arg can be used multiple times; each -g must be paired with a -p.
- -p arg can be used multiple times; each -p must be paired with a -g. arg can be a file name, or z or 1, which indicates that the individuals in the paired genotype file have phenotype 0 or 1, respectively.
- -pos arg can be used multiple times. arg is a file name.
- -o arg arg will be the prefix of all output files; the random seed is used as the prefix by default.
- -cc calculate Bayes factors by logistic regression on a binary phenotype.
- -w(warm) num specify number of burn-in steps for MCMC.
- -s(step) num specify number of sampling steps for MCMC.
- -num num specify thinning: record one state every num steps.
- -r num specify random seed, system time by default.
- -hmin num specify minimum value for h.
- -hmax num specify maximum value for h.
- -pmin num specify minimum value for p.
- -pmax num specify maximum value for p.
- -smin num specify minimum value for number of SNPs in the model.
- -smax num specify maximum value for number of SNPs in the model.
- -nstart num specify number of SNPs (with top marginal signal) in the model to begin with.
- -v(ver) print version and citation
- -h(help) print this help
- -exclude-maf num exclude SNPs whose maf < num, default 0.01.
- -exclude-nopos num exclude SNPs that have no position information, 1 = yes (default), 0 = no
- -silence no terminal output.
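To see what a filter like -exclude-maf does, the sketch below computes a minor allele frequency from per-individual genotype values and applies the default 0.01 threshold. It assumes each value is an allele count (or expected allele count) between 0 and 2, as in the example genotype rows earlier in this manual; the function names are hypothetical, and the exact way piMASS computes the maf is not stated here.

```python
def minor_allele_frequency(dosages):
    """Estimate the MAF from allele counts in [0, 2]; None marks missing values."""
    observed = [d for d in dosages if d is not None]
    if not observed:
        return None  # no data, frequency undefined
    frequency = sum(observed) / (2 * len(observed))  # frequency of the counted allele
    return min(frequency, 1 - frequency)

def passes_maf_filter(dosages, threshold=0.01):
    """True if the SNP would be kept under an 'exclude maf < threshold' rule."""
    maf = minor_allele_frequency(dosages)
    return maf is not None and maf >= threshold

# Genotype values from the example row for rs1 above.
rs1_dosages = [0.02, 0.80, 1.50]
rs1_maf = minor_allele_frequency(rs1_dosages)
rs1_kept = passes_maf_filter(rs1_dosages)
```

With only three individuals the estimate is very rough; in a real cohort the same calculation runs over all individuals in the merged genotype files.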
The translation was initiated by Yongtao Guan on 2011-01-19