Accelerating Data Driven Discovery
The Multi-Omics Data Analysis Core assists with the analysis of data generated by the major technology platforms of the CPRIT Core Facility, including mass spectrometry (MS) metabolomics, MS-based proteomics, and reverse phase protein array (RPPA) proteomics. In addition, we accept and process high-throughput sequencing data such as transcriptomics (via RNA-Seq), cistromics (via ChIP-Seq, Reduced Representation Bisulfite Sequencing, and Whole Genome Bisulfite Sequencing), and genomics (Whole Genome Sequencing or Whole Exome Sequencing). Using a consistent data workflow, we assist with Primary/Tier 1 data analysis in collaboration with each technology platform group of the Core Facility. In addition, we provide, as independent support services, Integrative/Tier 2 analysis across multiple omics data types to yield systems biology-level insight and to generate robust, testable hypotheses. Integrative analysis is performed using datasets from national and international projects such as The Cancer Genome Atlas (TCGA), ENCODE, the NIH Epigenomic Roadmap, the International Human Epigenome Consortium (IHEC), and The Metabolomics Workbench, as well as datasets from scientific community repositories such as the NIH Gene Expression Omnibus (GEO) and the NIH Sequence Read Archive (SRA). Major scientific objectives supported by the core include, but are not limited to: 1) discovery of novel cancer drivers using analysis techniques such as machine learning; 2) discovery of novel biomarkers by evaluation of refined signatures on clinically annotated patient cohorts (e.g., TCGA, BCM Biobank); and 3) discovery of novel therapeutic modalities via interrogation of drug molecular profiles. The Multi-Omics Data Analysis Core will assist with visualization for data exploration or generation of publication-quality figures, in collaboration with the Research IT core.
The Data Sharing Core aims to address the data management needs of BCM investigators in order to empower discoveries, high-impact publications, and competitive grant applications.
The Multi-Omics Data Analysis and Data Sharing cores act as fee-for-service arms under the umbrella of the Biostatistics and Informatics Shared Resource (BISR) of the Dan L Duncan Comprehensive Cancer Center and function under the academic leadership of Dr. Susan Hilsenbeck. This arrangement provides several important benefits: access to the Duncan Cancer Center cluster computing resources; access to licensed statistical analysis software; and access, as needed, to the extensive technical and statistical expertise of other BISR personnel.
The Multi-Omics Data Analysis Core provides consultation on multiple topics before, during, and after analysis:
1) Consultation on experimental design
2) Consultation on integration of data from CPRIT and other core facilities
3) Consultation on integration of publicly available data
4) Review of results with the principal investigator and assistance with their interpretation, after completion of the analysis and as needed during the analysis
2. Primary Analysis of Multi-Omics Data Generated by BCM Core Facilities
Mass Spec Metabolomics
Both targeted and unbiased mass spectrometry metabolomics data will be generated and normalized by the CPRIT Metabolomics core at BCM, directed by Dr. Nagireddy Putluri. Primary/Tier 1 analysis will detect differentially expressed compounds across experimental groups using parametric and non-parametric methods; false discovery rate (FDR) methods will be used to correct for multiple hypothesis testing. We will employ supervised learning to obtain parsimonious models of association with experimental groups, using methods such as k-nearest neighbors, linear discriminant analysis, support vector machines, and random forests. Integrative/Tier 2 analysis enables combination with other data types via commonly enriched pathways and processes, such as those compiled by the Gene Ontology (GO) or MSigDB. Significant metabolites are converted to KEGG enzyme/gene IDs using both in-house and public databases such as BridgeDb or HMDB. We use over-representation analysis (ORA) of pathways with the hypergeometric distribution, as well as pathway network analysis. For data visualization, we will use principal component analysis (PCA) and hierarchical clustering of samples and/or metabolites.
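As an illustration of the multiple hypothesis testing correction mentioned above, the following is a minimal sketch of the Benjamini-Hochberg procedure, one common FDR method. The text does not state which FDR method the core uses, so the choice of Benjamini-Hochberg and the function name `benjamini_hochberg` are assumptions made for this example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order.

    Illustrative sketch only; the FDR method is an assumption, not the
    core's documented procedure.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical per-compound p-values
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"p={p:.3f}  adjusted={q:.3f}")
```

Compounds whose adjusted value falls below the chosen threshold (for example 0.05) would then be reported as significantly changed.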
RPPA Proteomics Analysis
Investigators will obtain the RPPA data from the BCM core led by Dr. Shixia Huang. Normalization will be performed by the core staff. In Primary Analysis (Tier 2a in the terminology of the RPPA core), significantly changed proteins among experimental groups will be determined using non-parametric tests (Wilcoxon rank-sum test, adjusted p-value<0.05). In Integrative Analysis (Tier 2b in the terminology of the RPPA core), we will integrate these data with other datasets by determining enriched pathways (using the hypergeometric distribution, p<0.05). For cancer projects, the core will evaluate the clinical significance of RPPA signatures using the RPPA proteomics data collected by The Cancer Genome Atlas (TCGA).
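The pathway enrichment step based on the hypergeometric distribution can be sketched as follows. This is a minimal example using only the Python standard library; the function name `hypergeom_enrichment_p` and the counts in the usage example are hypothetical and are not taken from the core's pipeline.

```python
from math import comb


def hypergeom_enrichment_p(population, pathway_size, selected, overlap):
    """Upper-tail hypergeometric p-value for pathway over-representation.

    population   : number of measured proteins/genes (the background)
    pathway_size : how many background members belong to the pathway
    selected     : number of significantly changed proteins/genes
    overlap      : how many of the selected ones belong to the pathway
    Returns P(X >= overlap) when drawing `selected` items without replacement.
    """
    max_overlap = min(pathway_size, selected)
    total = comb(population, selected)
    # Sum the probability of every outcome at least as extreme as observed.
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, selected - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 200 measured proteins, 20 in the pathway,
    # 30 significantly changed, 8 of which fall in the pathway.
    print(hypergeom_enrichment_p(200, 20, 30, 8))
```

A pathway would be called enriched when this p-value falls below the stated threshold (p<0.05 in the text above).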
Investigators will obtain the MS Proteomics data from the BCM core led by Dr. Anna Malovannaya. Gene annotation and normalization will be performed by the core staff. In Primary Analysis, significantly changed proteins among experimental groups will be determined using parametric (t-test, adjusted p-value<0.05) or non-parametric tests (Wilcoxon rank-sum test, adjusted p-value<0.05). In Integrative Analysis, we will integrate these data with other datasets by determining enriched pathways (using the hypergeometric distribution, p<0.05). Because in many cases only a single replicate is generated per experimental group, Gene Set Enrichment Analysis (GSEA) is an effective analysis method: it relies on permutation testing of the entire set of detected proteins, rather than on enrichment within a statistically significant protein/gene subset. Using proteomic profiles, we will perform integration with other publicly available datasets both at the protein level (using data deposited in repositories such as ProteomeXchange) and at the transcriptomic level, using TCGA or Gene Expression Omnibus (GEO) datasets.
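To make the idea behind GSEA concrete, the sketch below computes a simplified, unweighted running-sum enrichment score over a ranked list and estimates significance by permuting gene-set membership. It is a teaching example under stated assumptions, not the GSEA software itself: the published method uses a weighted statistic and its own permutation schemes, and the function names and gene labels here are hypothetical.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (Kolmogorov-Smirnov style).

    ranked_genes: list of unique identifiers, ordered from most up to most down.
    """
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        # Step up on a set member, step down otherwise.
        running += 1.0 / n_hit if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Empirical p-value from random gene sets of the same size."""
    rng = random.Random(seed)
    observed = abs(enrichment_score(ranked_genes, gene_set))
    size = len(set(gene_set) & set(ranked_genes))
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(ranked_genes, size)
        if abs(enrichment_score(ranked_genes, random_set)) >= observed:
            hits += 1
    # Add-one correction avoids reporting a p-value of exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    ranked = [f"P{i}" for i in range(1, 41)]   # hypothetical ranked proteins
    pathway = ["P1", "P2", "P3", "P5", "P8"]   # hypothetical pathway members
    print(enrichment_score(ranked, pathway))
    print(permutation_p_value(ranked, pathway, n_perm=500))
```

Because the score is computed over the full ranked list, no significance cutoff on individual proteins is needed, which is why this style of analysis remains usable when replicates are scarce.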
Transcriptomics, Genomics, and Epigenomics Data
Beyond the data generated by the CPRIT cores, further insight can often be gained by integration with sequencing data, such as transcriptomic, genomic, or epigenomic data. BCM investigators can generate sequencing data using the BCM RNA and Genomic Profiling Sequencing Core (GARP) (https://www.bcm.edu/garp/) led by Dr. Lisa White, or can access public repositories such as TCGA or GEO. For Primary/Tier 1 analysis, sequencing data quality will be assessed using the FastQC software. For transcriptomic profiling via RNA-Seq, data will be mapped using TopHat2 onto the corresponding genome build and gene expression will be assessed using Cufflinks2. Significantly changed genes will be determined using the R packages limma, DESeq2, or edgeR. Genomic data will be mapped using BWA or Bowtie2 to the respective genome; variants will be inferred using the GATK software, annotated using the ANNOVAR package, and then filtered according to the specific project needs. For epigenomic data, after mapping to the respective genome as above, the MACS2 algorithm will be used to identify enriched regions (peaks), and enriched motifs will be inferred using the HOMER and MEME-ChIP tools. Bisulfite sequencing data will be mapped to the respective genome using Bismark; methylation changes will be detected using packages such as DMRcate. As part of Integrative/Tier 2 analysis, we will infer enriched pathways using the Gene Set Enrichment Analysis (GSEA) method and the gene set collections from the Molecular Signatures Database (MSigDB). We will visualize genome-wide maps using the Integrative Genomics Viewer (IGV) or the UCSC Genome Browser.
3. Integrative Multi-Omics Data Analysis
The Multi-Omics Data Analysis Core evaluates quantitative associations between molecular signatures, obtained via integration of metabolomics/proteomics/transcriptomics data, and clinical variables. Associations with variables such as disease status, tumor grade, or stage are evaluated using ANOVA (implemented in the R statistical system); associations with survival are evaluated using the log-rank test or Cox proportional hazards models (using the R package survival). In addition to mining established patient cohorts such as TCGA, we are actively pursuing the addition of metabolomics/proteomics components to clinical trials, as a cost-effective way to enrich the lessons learned from these endeavors and to enable further hypothesis generation.
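For readers unfamiliar with the log-rank test, the following is a minimal two-group sketch. The core performs this analysis with the R package survival, as stated above; the Python code here is only an illustration of the underlying calculation, and the function name `logrank_test` and the example follow-up times are hypothetical.

```python
from math import erfc, sqrt


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi_square, p_value).

    times_*  : follow-up times for each patient in the group
    events_* : 1 if the event (e.g. death) was observed, 0 if censored
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    obs_minus_exp, variance = 0.0, 0.0
    for t in event_times:
        at_risk = [row for row in data if row[0] >= t]
        n = len(at_risk)                                   # at risk overall
        n_a = sum(1 for row in at_risk if row[2] == 0)     # at risk in group A
        d = sum(1 for row in at_risk if row[0] == t and row[1])
        d_a = sum(1 for row in at_risk
                  if row[0] == t and row[1] and row[2] == 0)
        # Observed minus expected events in group A at this time point.
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi_square = obs_minus_exp ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi_square / 2.0))
    return chi_square, p_value


if __name__ == "__main__":
    # Hypothetical cohorts split by high versus low signature score.
    high = ([5, 8, 12, 14, 20], [1, 1, 1, 0, 1])
    low = ([15, 22, 30, 34, 40], [1, 0, 1, 0, 0])
    print(logrank_test(high[0], high[1], low[0], low[1]))
```

In practice, patients are typically split into groups by signature score (for example at the median) before the test is applied; the split rule is a project-specific choice that this sketch does not prescribe.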
4. Data Sharing
To ensure compliance with the NIH Genomic Data Sharing (GDS) Policy, BCM datasets will be deposited in an appropriate NIH-designated repository, in a consistent and unified manner. These activities will be led by Dr. McKenna, who has established collaborations with leading gene-centric (GeneCards) and small molecule-centric (PubChem, ChEBI, DrugBank) knowledge bases and with dataset metadata indexes (Thomson Reuters Web of Knowledge) to increase the discoverability and accessibility of omics datasets. Exposing essential information about datasets on these websites, which are used by thousands of scientists every day, increases the likelihood that the data will be discovered and new research collaborations formed. The Core will also offer distribution of metadata to these resources to increase awareness of the datasets among their users, and will develop similar services for depositing other omics datasets (proteomics and metabolomics) according to emerging NIH policies and guidelines. Additional services will include Journal Article Linking, assistance with acquiring Digital Object Identifiers (DOIs) and Open Researcher and Contributor Identifiers (ORCID), and drafting of Data Sharing Plans for NIH grants. Journal Article Linking will connect omics datasets associated with articles in journals published by partners to data visualization and analysis interfaces, which will greatly increase the visibility and utility of these datasets to researchers. DOIs are universal, persistent, unique identifiers for scholarly digital objects, best known in the scientific arena as identifiers for journal articles. Branding of datasets with a unique BCM DOI will unambiguously identify these datasets as products of BCM research, facilitating compliance reporting to NIH. ORCID minting unambiguously identifies the dataset as a research product of the investigators involved.
Data sharing plans for NIH grants will include project budgets that may be needed to support a proposed genomic or other omics data sharing plan.
The bioinformatics analysis will be performed using the infrastructure of the Dan L Duncan Comprehensive Cancer Center Computing Facilities. In addition to up-to-date desktop computers for all faculty and staff, which include both 32-bit and 64-bit personal computers for most bioinformatically oriented members, we have two major computing facilities: one in the Breast SPORE facilities on the main BCM campus and one at the Energy Transfer Data Center approximately one mile away. Both sit inside the BCM firewall, have nightly offsite backups, are protected by non-aqueous fire suppression, have redundant power and, for high- and moderate-capability machines, are accessible via a 10 Gigabit Ethernet switched local area network (LAN). Having two physically separated facilities improves availability by allowing more rapid recovery from a disaster, such as a fire or flood, that incapacitates one facility. We have a 35-node high-performance compute cluster (two to eight cores per node, with newer nodes carrying 96 or 128 GB RAM) representing a total of 375 CPUs, with 34 terabytes of fast-access SAN storage for high-performance compute needs. We are in the process of expanding and substantially upgrading this capacity for 2013 with an extensible NetApp storage appliance. For archival storage, tens to hundreds of TB can be readily leased at very low cost from BCM Information Technology. Four cluster nodes are set aside for interactive jobs; the remaining nodes are available for batch jobs. Queues are managed by Sun Grid Engine, and the system itself is administered by an expert system architect with >10 years of experience in HPC. Access to these resources is supported by partial chargeback, commensurate with level of use.
Because the cluster is self-contained (i.e., housed at a single site, without yet having an identical sister cluster offsite), the nodes themselves do not enjoy the full benefit of disaster recovery from both sites, whereas the archive storage does. Outside of HPC availability, there are three Sun Sunfire X4170 Virtualization Servers at SUDC and two Cisco UCS C210 M2 Virtualization Servers at the Breast SPORE facilities. Servers at both sites use VMware to create virtual servers that can run any operating system with varying system requirements. Each location's virtualization servers have attached 37 TB NetApp storage, with vMotion in place to manage failover of the virtual machines from one site to the other should a disaster arise. In addition, there are two HP servers with direct-attached 96 TB storage for Oracle 11g (backed up off-site nightly) and four Sun physical servers with 37 TB of storage running the Solaris operating system, among others.
References (Core Supported Publications)
1. Wangler MF, Chao YH, Bayat V, Giagtzoglou N, Shinde AB, Putluri N, Coarfa C, Donti T, Graham BH, Faust JE, McNew JA, Moser A, Sardiello M, Baes M, Bellen HJ. Peroxisomal biogenesis is genetically and biochemically linked to carbohydrate metabolism in Drosophila and mouse. PLoS Genet. 2017 Jun 22. PMID: 28640802
2. Roberts JM, Martin RS, Piyarathna DB, MacKrell JG, Rocha GV, Dodge JA, Coarfa C, Krishnan V, Rowley DR, Weigel NL. Vitamin D receptor activation reduces VCaP xenograft tumor growth and counteracts ERG activity despite induction of TMPRSS2:ERG. Oncotarget. 2017 Jul 4. PMID: 28591703
3. Shafi AA, Putluri V, Arnold JM, Tsouko E, Maity S, Roberts JM, Coarfa C, Frigo DE, Putluri N, Sreekumar A, Weigel NL. (2015). Differential Regulation of Metabolic Pathways by Androgen Receptor (AR) and its Constitutively Active Splice Variant, AR-V7. Oncotarget 6(31):31997-32012. PMID: 26378018.
4. Dasgupta S, Putluri N, Long W, Zhang B, Wang J, Kaushik AK, Arnold JM, Bhowmik SK, Stashi E, Brennan CA, Rajapakshe K, Coarfa C, Mitsiades N, Ittmann MM, Chinnaiyan AM, Sreekumar A, O'Malley BW. (2015). Coactivator SRC-2-dependent metabolic reprogramming mediates prostate cancer survival and metastasis. J Clin Invest. 125(3):1174-1188. PMID: 25664849. PMCID: PMC4362260.
5. Kettner NM, Voicu H, Finegold MJ, Coarfa C, Sreekumar A, Putluri N, Katchy CA, Lee C, Moore DD, Fu L. Circadian Homeostasis of Liver Metabolism Suppresses Hepatocarcinogenesis. Cancer Cell. 2016 Dec 12;30(6):909-924. PMID: 27889186.
6. Rundstedt FV, Kimal R, Ma J, Arnold J, Gohlke J, Putluri V, Krishnapuram R, Piyarathna DB, Lotan Y, Gödde D, Roth S, Störkel S, Levitt JM, Michailidis G, Lerner SP, Sreekumar A, Coarfa C, Putluri N. Integrative pathway analysis of metabolic signature in bladder cancer - a linkage to the Cancer Genome Atlas Project and prediction of survival. J Urol. 2016 Jan 20. PMID: 26802582. PMCID: PMC4693629
7. Park JH, Vithayathil S, Kumar S, Sung PL, Dobrolecki LE, Putluri V, Bhat VB, Bhowmik SK, Gupta V, Arora K, Wu D, Tsouko E, Zhang Y, Maity S, Donti TR, Graham BH, Frigo DE, Coarfa C, Yotnda P, Putluri N, Sreekumar A, Lewis MT, Creighton CJ, Wong LJ, Kaipparettu BA. Fatty Acid Oxidation-Driven Src Links Mitochondrial Energy Reprogramming and Oncogenic Properties in Triple-Negative Breast Cancer. Cell Rep. 2016 Mar 8;14(9):2154-65. PMID: 26923594. PMCID: PMC4809061
8. Chang C, Zhang M, Rajapakshe K, Coarfa C, Edwards D, Huang S, Rosen JM. (2015). Mammary Stem Cells and Tumor-Initiating Cells are More Resistant to Apoptosis and Exhibit Increased DNA Repair Activity in Response to DNA Damage. Stem Cell Reports, 5:378-391. PMID: 26300228.
9. Holdman XB, Rajapakshe K, Coarfa C, Mo Q, Huang S, Hilsenbeck SG, Edwards D, Rosen JM. (2015). Stroma Remodeling by EGFR Signaling Promotes FGFR1-Driven Breast Tumor Recurrence. Breast Cancer Research 17(1):141. PMID: 26581390.
10. Fleet T, Zhang B, Lin F, Zhu B, Dasgupta S, Stashi E, Tackett B, Thevananther S, Rajapakshe KI, Gonzales N, Dean A, Mao J, Timchenko N, Malovannaya A, Qin J, Coarfa C, DeMayo F, Dacso CC, Foulds CE, O'Malley BW, York B. (2015). SRC-2 orchestrates polygenic inputs for fine-tuning glucose homeostasis. Proc. Natl. Acad. Sci. U. S. A 112:E6068-E6077. PMID: 26487680. PMCID: PMC4640775.
11. Sun B, Fiskus W, Qian Y, Rajapakshe K, Raina K, Coleman KG, Crew AP, Shen A, Saenz DT, Mill CP, Nowak AJ, Jain N, Zhang L, Wang M, Khoury JD, Coarfa C, Crews CM, Bhalla KN. BET protein proteolysis targeting chimera (PROTAC) exerts potent lethal activity against mantle cell lymphoma cells. Leukemia. 2017 Jun 30. PMID: 28663582
12. Eedunuri VK, Rajapakshe K, Fiskus W, Geng C, Chew SA, Foley C, Shah SS, Shou J, Mohamed JS, Coarfa C, O'Malley BW, Mitsiades N. miR-137 Targets p160 Steroid Receptor Coactivators SRC1, SRC2, and SRC3 and Inhibits Cell Proliferation. Mol Endocrinology 2015. Aug;29(8):1170-83. PMID: 26066330
13. Coarfa C, Fiskus W, Eedunuri VK, Rajapakshe K, Foley C, Chew SA, Shah SS, Geng C, Shou J, Mohamed JS, O'Malley BW, Mitsiades N. Comprehensive proteomic profiling identifies the androgen receptor axis and other signaling pathways as targets of microRNAs suppressed in metastatic prostate cancer. Oncogene. 2015 Sep 14. PMID: 26364608
14. White MA, Lin C, Rajapakshe K, Dong J, Shi Y, Tsouko E, Mukhopadhyay R, Jasso D, Dawood W, Coarfa C, Frigo DE. Glutamine Transporters are Targets of Multiple Oncogenic Signaling Pathways in Prostate Cancer. Mol Cancer Res. 2017 May 15. PMID: 28507054
15. Geng C, Kaochar S, Li M, Rajapakshe K, Fiskus W, Dong J, Foley C, Dong B, Zhang L, Kwon OJ, Shah SS, Bolaki M, Xin L, Ittmann M, O'Malley BW, Coarfa C, Mitsiades N. SPOP regulates prostate epithelial cell proliferation and promotes ubiquitination and turnover of c-MYC oncoprotein. Oncogene. 2017 Apr 17. PMID: 28414305
16. Bhardwaj A, Singh H, Rajapakshe K, Tachibana K, Ganesan N, Pan Y, Gunaratne PH, Coarfa C, Bedrosian I. Regulation of miRNA-29c and its downstream pathways in preneoplastic progression of triple-negative breast cancer. Oncotarget. 2017 Jan 30. PMID: 28160548
17. Saenz DT, Fiskus W, Qian Y, Manshouri T, Rajapakshe K, Raina K, Coleman KG, Crew AP, Shen A, Mill CP, Sun B, Qiu P, Kadia TM, Pemmaraju N, DiNardo C, Kim MS, Nowak AJ, Coarfa C, Crews CM, Verstovsek S, Bhalla KN. Novel BET protein proteolysis-targeting chimera exerts superior lethal activity than bromodomain inhibitor (BETi) against post-myeloproliferative neoplasm secondary (s) AML cells. Leukemia. 2017 Jan 31. PMID: 28042144
18. Marisetty AL, Singh SK, Nguyen TN, Coarfa C, Liu B, Majumder S. REST represses miR-124 and miR-203 to regulate distinct oncogenic properties of glioblastoma stem cells. Neuro Oncol. 2016 Dec 31. PMID: 28040710
19. Blessing AM, Rajapakshe K, Reddy Bollu L, Shi Y, White MA, Pham AH, Lin C, Jonsson P, Cortes CJ, Cheung E, La Spada AR, Bast RC Jr, Merchant FA, Coarfa C, Frigo DE. Transcriptional regulation of core autophagy and lysosomal genes by the androgen receptor promotes prostate cancer progression. Autophagy. 2016 Dec 15:1-16. PMID: 27977328
20. Saenz DT, Fiskus W, Manshouri T, Rajapakshe K, Krieger S, Sun B, Mill CP, DiNardo C, Pemmaraju N, Kadia T, Parmar S, Sharma S, Coarfa C, Qiu P, Verstovsek S, Bhalla KN. BET protein bromodomain inhibitor-based combinations are highly active against post-myeloproliferative neoplasm secondary AML cells. Leukemia. 2016 Oct 25. PMID: 27677740.
21. Ware MJ, Keshishian V, Law JJ, Ho JC, Favela CA, Rees P, Smith B, Mohammad S, Hwang RF, Rajapakshe K, Coarfa C, Huang S, Edwards DP, Corr SJ, Godin B, Curley SA. Generation of an in vitro 3D PDAC stroma rich spheroid model. Biomaterials. 2016 Nov;108:129-42. PMID: 27627810. PMCID: PMC5082237
22. Chitsazzadeh V*, Coarfa C*, Drummond JA, Nguyen T, Joseph A, Chilukuri S, Charpiot E, Adelmann CH, Ching G, Nguyen TN, Nicholas C, Thomas VD, Migden M, MacFarlane D, Thompson E, Shen J, Takata Y, McNiece K, Polansky MA, Abbas HA, Rajapakshe K, Gower A, Spira A, Covington KR, Xiao W, Gunaratne P, Pickering C, Frederick M, Myers JN, Shen L, Yao H, Su X, Rapini RP, Wheeler DA, Hawk ET, Flores ER, Tsai KY. Cross-species identification of genomic drivers of squamous cell carcinoma development across preneoplastic intermediates. Nat Commun. 2016 Aug 30. PMID: 27574101. PMCID: PMC5013636
23. Wang Q, Trevino LS, Wong RL, Medvedovic M, Chen J, Ho SM, Shen J, Foulds CE, Coarfa C, O'Malley BW, Shilatifard A, Walker CL. Reprogramming of the Epigenome by MLL1 Links Early-Life Environmental Exposures to Prostate Cancer Risk. Mol Endocrinol. 2016 Aug;30(8):856-71. PMID: 27219490. PMCID: PMC4965842.