The Cancer resource - methods summary

Summary

The cancer resource consists of two parts: 1) information on the association between genome-wide RNA expression levels and survival of cancer patients (for nearly 8000 cancer patients representing 17 major types of cancer), and 2) examples of protein expression patterns in cancer tissues (for 216 tumors representing the 20 most common forms of human cancer).

Key publication

Uhlen M et al. (2017) “A pathology atlas of the human cancer transcriptome.” Science 357 (6352): aan2507

What can you learn from the Cancer resource?

Learn about:
Data overview
How has the data been generated?

Cancer tissues used for protein expression analysis were obtained from the Department of Pathology, Uppsala University Hospital, Uppsala, Sweden as part of the sample collection governed by the Uppsala Biobank (http://www.uppsalabiobank.uu.se/en/). Cases were selected after microscopic examination of representative hematoxylin and eosin (HE) stained sections. Cores with 1 mm diameter were subsequently obtained from the corresponding tissue blocks and transferred into cancer tissue microarrays. All human tissue samples used in the present study were anonymized in accordance with approval and advisory report from the Uppsala Ethical Review Board.

Cancer patient samples used for mRNA expression and survival analysis were collected from The Cancer Genome Atlas (TCGA) project from the initial release of Genomic Data Commons (GDC) on June 6, 2016, and sex, age and other clinical information can be found here. Only samples with both clinical information and transcriptomic data available at that time point were used in this study.

Mass spectrometry-based quantitative proteomic data is sourced from CPTAC and the Proteomic Data Commons (PDC), part of the National Cancer Institute’s Cancer Research Data Commons (CRDC). This section displays relative protein quantities obtained from TMT11-quantified proteomic datasets from 11 cancer types. Alongside the proteomic data, clinical and demographic information, such as age, sex, tumor stage, and recurrence status, is available for download from the PDC. The data from the PDC is used under the CC-BY 4.0 license, which allows sharing and adaptation with proper attribution and indication of any changes.

How has the data been analyzed?

For protein expression analysis, sections from cancer tissue microarrays were immunohistochemically stained and the corresponding slides were scanned to generate digital images.
All images were then analyzed by pathologists and annotated with respect to staining intensity and fraction of positive cancer cells for all approved antibodies. The result of the immunohistochemistry-based protein expression analysis was then summarized as high, medium, low or not detected.

For RNA expression analysis, quantified raw sequencing data were downloaded from this site in the available format (FPKM tables). Each of the 20,090 genes with mapped RNA-seq data was classified into one of six categories based on its FPKM levels in the 17 cancer types:
(1) Not detected: FPKM <1 in all cancers;
(2) Enriched: at least a 5-fold higher FPKM level in one cancer than in all other cancers;
(3) Group enriched: a 5-fold higher average FPKM value in a group of 2-7 cancers than in all other cancers;
(4) Expressed in all: detected in all 17 cancers with FPKM >1;
(5) Enhanced: at least a 5-fold higher FPKM level in one cancer than the average value of all 17 cancers; and
(6) Mixed: the remaining genes detected in 1-16 cancers with FPKM >1 that did not fit the above categories.

Based on the FPKM value of each gene, we classified the patients into two groups (high and low expression) and examined their prognoses. Genes with low expression, i.e., those with a median expression among samples below FPKM 1, were excluded from the analysis. The prognosis of each group of patients was examined with Kaplan-Meier survival estimators, and the survival outcomes of the two groups were compared with log-rank tests. To choose the FPKM cut-off that separated the patients most significantly, every FPKM value between the 20th and 80th percentiles was used in turn to group the patients, the difference in survival outcomes between the groups was examined, and the value yielding the lowest log-rank P value was selected. Genes with log-rank P values less than 0.001 were defined as prognostic genes.
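The cut-off scan described above can be illustrated with a minimal sketch. This is not the pipeline used for the resource: the function names (`logrank`, `percentile`, `best_cutoff`), the linear-interpolation percentile, the rule that "high" means strictly above the cut-off, and the toy patient data are all assumptions made for illustration. The log-rank test is the standard two-group version, with the P value taken from a chi-square distribution with one degree of freedom.

```python
import math

def logrank(times, events, high):
    """Two-group log-rank test.

    times: follow-up time per patient; events: 1 if the event (death) was
    observed, 0 if censored; high: True for the high-expression group.
    Returns (p_value, observed_high, expected_high).
    """
    observed = expected = variance = 0.0
    for t in sorted({tt for tt, e in zip(times, events) if e}):
        at_risk = [h for tt, h in zip(times, high) if tt >= t]
        n, n_high = len(at_risk), sum(at_risk)
        d = sum(1 for tt, e in zip(times, events) if tt == t and e)
        d_high = sum(1 for tt, e, h in zip(times, events, high)
                     if tt == t and e and h)
        observed += d_high
        expected += d * n_high / n
        if n > 1:  # hypergeometric variance of d_high at this event time
            frac = n_high / n
            variance += d * frac * (1 - frac) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0, observed, expected
    chi2 = (observed - expected) ** 2 / variance
    return math.erfc(math.sqrt(chi2 / 2)), observed, expected  # chi-square, 1 d.o.f.

def percentile(sorted_values, q):
    """Percentile (q in 0..100) of a pre-sorted list, linear interpolation (assumed)."""
    pos = (len(sorted_values) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)

def best_cutoff(fpkm, times, events):
    """Try every FPKM value between the 20th and 80th percentiles as a cut-off
    and return (cutoff, p_value, observed_high, expected_high) for the lowest P."""
    ordered = sorted(fpkm)
    low_bound, high_bound = percentile(ordered, 20), percentile(ordered, 80)
    best = None
    for cutoff in sorted({v for v in fpkm if low_bound <= v <= high_bound}):
        high = [v > cutoff for v in fpkm]  # assumption: "high" is strictly above
        if not any(high) or all(high):
            continue
        p, obs, exp = logrank(times, events, high)
        if best is None or p < best[1]:
            best = (cutoff, p, obs, exp)
    return best

# Toy data (invented): ten patients, high expression paired with early events.
fpkm = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
times = [50, 48, 46, 44, 42, 10, 9, 8, 7, 6]
events = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

result = best_cutoff(fpkm, times, events)
cutoff, p_value, observed_high, expected_high = result
is_prognostic = p_value < 0.001  # threshold stated in the text
print(cutoff, p_value, observed_high, expected_high, is_prognostic)
```

Besides the P value, the sketch returns the observed and expected event counts for the high-expression group, so the same call yields everything needed to compare the two groups.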
In addition, if the group of patients with high expression of a selected prognostic gene has more observed events than expected events, the gene is an unfavorable prognostic gene; otherwise, it is a favorable prognostic gene.

For the mass spectrometry datasets, CPTAC harmonized the proteomic datasets using consistent workflows across all cancer types. Briefly, protein abundance was quantified using Tandem Mass Tag (TMT11) labeling, with pre-calculated protein assemblies and peptide ratios used for downstream analysis. Normal samples were included for certain cancers to enable comparison with healthy tissue. Protein expression data was analyzed using log-transformed ratios obtained from the protein builds of each respective dataset. Samples of poor quality and samples annotated as excluded by CPTAC were removed. Differential expression between cancerous and matched normal tissues was calculated as fold changes and evaluated with a statistical test (Wilcoxon rank-sum), with p-values adjusted for multiple testing (Bonferroni). Proteins passing these criteria were identified as significant, revealing cancer-specific dysregulation.

What is presented in the resource?

Kaplan-Meier survival plots showing the prognostic association between the RNA expression of each protein-coding gene and patient survival were generated for each of the 17 cancer types. A summary of significant prognostic results is provided in the gene summary page. In addition, the Kaplan-Meier survival plots, as well as a scatter plot showing the correlation between RNA expression of the gene and patient survival in a specific cancer type, are shown in a cancer type specific gene summary page. The page is interactive: users can select a subgroup of patients based on, for example, tumor stage, and immediately generate plots for the selected subgroup on the website. The user can also use any specific expression cutoff (FPKM value) to produce different Kaplan-Meier and scatter plots.
An example of the cancer-specific Kaplan-Meier and scatter plots for a gene is shown below.
The RNA expression levels were summarized across 17 cancer types for all protein-coding genes. The results are presented as shown in the figure for an example gene enriched in liver cancer.
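The six-category RNA classification described in the analysis section can be sketched as a short function. This is an illustrative reading of the rules, not the code used for the resource. Several details are assumptions: the categories are tested in the order they are listed, "detected" is taken as FPKM of at least 1, "Group enriched" is checked on the top 2 to 7 cancers against the highest remaining cancer, and the names `classify_gene` and `profile` as well as all FPKM values are invented.

```python
FOLD = 5               # fold-change requirement stated in the text
DETECTION_LIMIT = 1.0  # FPKM detection threshold stated in the text

def classify_gene(fpkm_by_cancer):
    """Assign one gene to a category from its FPKM per cancer type.

    fpkm_by_cancer: dict mapping cancer type name to FPKM (at least 2 entries).
    Categories are tested in the order they are listed (an assumption).
    """
    values = sorted(fpkm_by_cancer.values(), reverse=True)
    n = len(values)
    if values[0] < DETECTION_LIMIT:
        return "Not detected"
    # One cancer at least 5-fold above every other cancer.
    if values[0] >= FOLD * values[1]:
        return "Enriched"
    # Average of the top 2-7 cancers at least 5-fold above the best of the
    # remaining cancers (one interpretation of "than in all other cancers").
    for k in range(2, min(7, n - 1) + 1):
        if sum(values[:k]) / k >= FOLD * values[k]:
            return "Group enriched"
    if values[-1] >= DETECTION_LIMIT:
        return "Expressed in all"
    # One cancer at least 5-fold above the average of all cancers.
    if values[0] >= FOLD * (sum(values) / n):
        return "Enhanced"
    return "Mixed"

def profile(values):
    """Hypothetical helper: label a list of FPKM values as cancer_1..cancer_N."""
    return {f"cancer_{i + 1}": v for i, v in enumerate(values)}

# Invented 17-cancer profiles, one per category.
examples = {
    "silent": profile([0.2] * 17),
    "one_peak": profile([100] + [2] * 16),
    "two_peaks": profile([50, 50] + [1] * 15),
    "housekeeping": profile([10] * 17),
    "tapered_peak": profile([100, 30, 20, 15, 12, 10, 8, 6] + [0.5] * 9),
    "patchy": profile([10, 8, 6, 5, 4, 3, 2, 1.5] + [0.5] * 9),
}
categories = {name: classify_gene(p) for name, p in examples.items()}
for name, category in categories.items():
    print(name, "->", category)
```

Comparing with multiplication rather than division keeps the checks well defined when some cancers have an FPKM of zero.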
Similarly, the protein levels were determined across 20 cancer types for all protein-coding genes, and the results are presented as shown in the figure for the same gene as above.
Moreover, to exemplify protein expression patterns both within one cancer type and between different types of cancer, a multitude of IHC images for more than 17,000 protein-coding genes in 20 human cancer types are also provided in this resource. An example IHC image for a gene from a selected cancer patient is shown below.
Results from the CPTAC pan-cancer proteomic analyses are visualized on the Human Protein Atlas website as boxplots and as interactive volcano plots that compare protein expression between cancers and normal tissues. Volcano plots (example below: colon cancer) illustrate the comparison of protein expression between cancer and normal tissues using tandem mass tag (TMT) mass spectrometry data from the CPTAC dataset. The x-axis represents the log2 fold change of reporter intensity between cancer and normal tissue, with positive values indicating proteins upregulated in cancer tissue and negative values indicating those downregulated. The y-axis represents the -log10 adjusted p-value from a Wilcoxon rank-sum test, where higher values indicate stronger statistical significance after multiple testing correction. Proteins highlighted in red are significantly upregulated in cancer tissue, while those in blue are significantly downregulated compared to normal tissue. Examples from the colon cancer dataset include a set of upregulated proteins (GMPS, HSPA14, TAOK1, TBCE, and FEN1) and a set of downregulated proteins (CLEC4M, CLEC4G, ITGA9, and TNS2). Gray points represent proteins that are not significant according to the log2 fold change and adjusted p-value thresholds.
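How a point ends up red, blue or gray in such a volcano plot can be sketched as follows. The text does not state the thresholds, so `LOG2FC_THRESHOLD` and `ALPHA` below are assumed values chosen only for illustration. The function names, the placeholder protein names and the numbers are invented. The raw p-values are taken as given (for example from a Wilcoxon rank-sum test computed elsewhere) and adjusted with the Bonferroni correction named in the text.

```python
import math

LOG2FC_THRESHOLD = 1.0  # assumption: minimum |log2 fold change| to colour a point
ALPHA = 0.05            # assumption: adjusted p-value significance level

def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def volcano_points(proteins):
    """Turn (name, log2_fold_change, raw_p_value) tuples into plot points.

    x is the log2 fold change (cancer vs normal), y is -log10 of the
    Bonferroni-adjusted p-value, and colour follows the red/blue/gray scheme.
    """
    adjusted = bonferroni([p for _, _, p in proteins])
    points = []
    for (name, log2fc, _), p_adj in zip(proteins, adjusted):
        significant = p_adj < ALPHA
        if significant and log2fc >= LOG2FC_THRESHOLD:
            colour = "red"    # significantly upregulated in cancer
        elif significant and log2fc <= -LOG2FC_THRESHOLD:
            colour = "blue"   # significantly downregulated in cancer
        else:
            colour = "gray"   # fails the fold-change or the p-value threshold
        y = -math.log10(max(p_adj, 1e-300))  # guard against log10(0)
        points.append({"name": name, "x": log2fc, "y": y, "colour": colour})
    return points

# Invented values with placeholder protein names, not results from the resource.
toy_proteins = [
    ("PROT_A", 2.0, 1e-6),   # large increase, small p-value
    ("PROT_B", -1.5, 1e-4),  # large decrease, small p-value
    ("PROT_C", 0.2, 1e-6),   # small p-value but negligible fold change
    ("PROT_D", 3.0, 0.04),   # large increase, not significant after adjustment
]
points = volcano_points(toy_proteins)
for point in points:
    print(point)
```

PROT_D shows why the correction matters in this sketch: its raw p-value is below 0.05, but after multiplying by the number of tested proteins it is no longer significant, so the point stays gray.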