Cell Line - Methods summary
Summary
The Cell line resource contains transcriptomics and proteomics data for human cell lines. The transcriptomics analysis covers 1206 human cell lines, corresponding to 28 cancer types, one non-cancerous group and one uncategorised group of cell lines, and includes classification based on specificity, distribution and expression clusters. The proteomics part is based on the Pan-Cancer Atlas project and presents MS data for the 542 cell lines that overlap between the datasets. Based on the transcriptomics profiles, cell lines were evaluated for their consistency with the corresponding TCGA (The Cancer Genome Atlas) disease cohort to help researchers select the best cell lines as in vitro models for cancer research. In addition, based on biological data mining, the relative activity of 14 cancer-related pathways and 43 cytokines was inferred for each cell line and is presented to characterize the phenotype of the cell line. The addition of protein expression data from the Pan-Cancer Atlas contributes to a better understanding of the cell line models. By enabling a relative comparison between RNA and protein expression levels, the data can aid in selecting an appropriate cell line model for the target and application of interest.
Key publication
Jin H et al. (2023) "Systematic transcriptional analysis of human cell lines for gene expression landscape and tumor representation" Nat Commun 14, 5417.
What can you learn from the Cell line resource?
Data overview
How has the data been generated?
RNA-seq
A genome-wide expression analysis of 1206 human cell lines, including 1132 cancer cell lines, was performed using RNA-seq with early-split samples as duplicates. Here, RNA-seq profiles of cell lines generated by the HPA (n = 69) and the Cancer Cell Line Encyclopedia (CCLE 2019; n = 1019) were integrated, with gene expression averaged for the 33 cell lines common to both.
MS Proteomics
A mass spectrometry (MS) based proteomics dataset, obtained by SCIEX 6600 TripleTOF mass spectrometry running in DIA mode and containing data for 949 cell lines from over 40 cancer types, was retrieved from the Pan-Cancer Atlas (Gonçalves E et al. (2022)). The data correspond to three replicates for each cell line analysed on two different mass spectrometers, resulting in six different injections, and the raw data is available here. From the 6,981 MS runs, a spectral library was then created using the in-silico spectral library from DIA-NN, containing 12,387 proteins and 144,578 precursors. The MS data was then searched and normalized using RT-dependent normalization. Only precursors from proteotypic peptides were retained in the output results and quantified with maxLFQ. The data was then log2 transformed, and 117 files were removed following experiment and sample quality control.
How has the data been analyzed?
RNA-seq
The transcript abundance of each protein-coding gene was estimated using the average TPM value of the individual samples for each cell line. The transcriptomics data was then used to
MS Proteomics
In the analysis, data from the 542 cell lines in the Pan-Cancer dataset that could be matched to cell lines in the RNA-seq dataset were used. Normalized relative protein expression (nRPX) was estimated from the intensities by Z-score normalization of the protein profile within each cell line. The normalization was performed on the quantified values, regardless of missing values. Normalization within each cell line takes into account the different levels of protein detection within cell lines in different injections and enables relative comparison of the expression of a protein across the different cell lines.
What is presented in the resource?
The RNA expression levels were determined for all protein-coding genes (n = 20162) across the 1206 human cell lines, and the results are presented on the gene summary page of the Cell line resource as exemplified in the figure below. In this figure, the MS protein expression data is represented by a circle for each cell line group, with size corresponding to expression level and color intensity representing the percentage of cell lines in the group having MS data. This includes non-detectable proteins, shown in white, and the sizes of the dots can be compared relative to each other by nRPX values. The stroke of the dot for each cell line group also indicates the percentage of data availability for that group.
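The within-cell-line Z-score normalization described above for nRPX can be sketched as follows. This is a minimal illustration, not the resource's actual pipeline: the function name, the proteins-by-cell-lines matrix layout, the use of the sample standard deviation, and the choice to simply skip missing values (NaN) when computing the mean and standard deviation are all assumptions made for the example.

```python
import numpy as np

def zscore_within_cell_line(log2_intensity):
    """Z-score each column (one cell line) of a proteins x cell-lines matrix.

    Missing values (NaN) are ignored when computing the per-cell-line mean
    and standard deviation, and stay NaN in the output (an assumption).
    """
    x = np.asarray(log2_intensity, dtype=float)
    mean = np.nanmean(x, axis=0, keepdims=True)
    std = np.nanstd(x, axis=0, ddof=1, keepdims=True)  # sample std (assumption)
    return (x - mean) / std

# Toy log2 intensities: 4 proteins (rows) x 2 cell lines (columns).
toy = np.array([
    [10.0, 12.0],
    [11.0, np.nan],  # protein not quantified in the second cell line
    [12.0, 14.0],
    [13.0, 16.0],
])
nrpx = zscore_within_cell_line(toy)
```

Because each cell line is scaled by its own mean and spread, the resulting values are comparable across cell lines for the same protein, which is the property the normalization is meant to provide.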
The cell line category-specific pages, which are accessed by clicking on the pie chart or the colored boxes on the Cell line resource page, display plots of cancer-related pathway (PROGENy) and cytokine (CytoSig) activity, with the average across all analyzed cell lines as the baseline. For 26 TCGA disease cohorts, a ranked list of the cell lines based on gene expression similarity to the corresponding disease cohort is shown. This can serve as a reference for cell line selection for in vitro experiments when studying a specific cancer type.
How has the classification of all protein-coding genes been done?
A genome-wide classification of the protein-coding genes with regard to cell line distribution across all cancer cell lines, as well as specificity across 28 cancer types, has been performed using between-sample normalized data (nTPM). The results can serve as a reference for researchers interested in expression profiles of human cell lines at both the disease level and the cell line level. The genes were classified according to specificity into (i) cancer enriched genes, with at least four-fold higher expression levels in one cell line cancer type compared with any other analyzed cell line cancer type; (ii) group enriched genes, with enriched expression in a small number of cell line cancer types (2 to 10); and (iii) cancer enhanced genes, with only moderately elevated expression. In addition, all genes were classified according to distribution, in which each gene is scored according to its presence (expression levels higher than a cut-off) in the cell lines. The cell line cancer enriched and group enriched genes are displayed in the interactive plot below, in which clicking on the red and orange circles results in gene lists for the corresponding enriched and group enriched genes, respectively.
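As a rough illustration of the specificity and distribution criteria described above, the sketch below tests the "cancer enriched" rule (at least four-fold higher nTPM in one cancer type than in any other) and counts the cell lines in which a gene is present. The function names, the input layout, and the cut-off value of 1.0 nTPM are assumptions for the example only; the text does not state the cut-off, and the group enriched and cancer enhanced rules are not specified in enough detail to be reproduced here.

```python
def is_cancer_enriched(ntpm_by_cancer_type, fold=4.0):
    """True if the top cancer type is at least `fold` times the runner-up.

    `ntpm_by_cancer_type` maps a cancer type name to the gene's nTPM in
    that type (hypothetical input layout).
    """
    values = sorted(ntpm_by_cancer_type.values(), reverse=True)
    if len(values) < 2:
        return False
    top, runner_up = values[0], values[1]
    return top > 0 and top >= fold * runner_up

def count_detected(ntpm_per_cell_line, cutoff=1.0):
    """Number of cell lines in which the gene is above the cut-off.

    The cut-off of 1.0 is a placeholder, not a value taken from the text.
    """
    return sum(1 for value in ntpm_per_cell_line if value > cutoff)

# Toy example: 80 nTPM in one type versus at most 10 nTPM elsewhere.
example = {"liver": 80.0, "lung": 10.0, "skin": 5.0}
enriched = is_cancer_enriched(example)
detected = count_detected([0.5, 2.0, 10.0])
```

Comparing the highest value against the second highest is enough for the four-fold rule, since any other cancer type has an expression level at or below the runner-up.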
Finally, a new classification has been introduced in which genes are clustered based on similarity in expression across the cell lines. The results are presented as an interactive UMAP plot in which mouse-over displays general information for the clusters, and clicking on a cluster displays more information and plots for that specific cluster, as well as a clickable list of all clusters.
How was the similarity of the cell lines to the corresponding TCGA cancer cohorts analysed?
The 1132 cancer cell lines were analyzed for how well they represent the corresponding TCGA disease cohorts. Because tumor samples also harbor immune and other cells, TCGA samples with a low tumor purity score (< 0.7, https://www.nature.com/articles/ncomms3612) were excluded from the analysis. The similarity between cell lines and the corresponding TCGA cohort was estimated by two different approaches:
How has the pathway and cytokine analysis been done?
PROGENy
For all 1206 analyzed cell lines, the activity of 14 cancer-related pathways was inferred using PROGENy, a package that relies on biological data mining of publicly available data to obtain cancer-related pathway-responsive genes for human and mouse (Schubert M et al. (2018)). For this, read counts for HPA and CCLE cell lines quantified by Kallisto were re-analyzed without filtering out non-protein-coding genes, to broaden the coverage of cancer pathway-responsive genes. The read counts of the 1206 cell lines were normalized by DESeq2 with respect to the size factor of each cell line and were further transformed by variance stabilizing transformation into log2 space. To calculate the relative pathway activities across all cell lines, the normalized values were centered by subtracting the mean value per gene. Then, the R package decoupleR was used to calculate the relative pathway activities based on the top 100 signature genes per pathway obtained from the R package progeny (Schubert M et al. (2018)). By default, decoupleR was executed using the top-performing benchmarked methods (i.e., mlm for multivariate linear model, ulm for univariate linear model, and wsum for weighted sum), and the results were integrated into a consensus z-score representing the pathway activity. Here, a consensus z-score above 1 or below -1 was considered significant.
CytoSig
The activity of 43 CytoSig cytokines was inferred from the gene expression profiles of the 1206 cell lines using the package CytoSig (Jiang P et al. (2021)). Gene expression data were processed in the same way as for the PROGENy analysis. The DESeq2 normalized expression values were also centered per gene, as suggested. The CytoSig program was executed with 10,000 permutations, and the results were presented as z-scores representing the relative cytokine activities, with a p-value < 0.05 considered significant.
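The per-gene centering step and the significance thresholds on the consensus z-score described above can be sketched as follows. This is an illustrative example under stated assumptions, not the decoupleR, progeny or CytoSig implementation: all function names are hypothetical, the expression matrix is assumed to be genes by cell lines, and the plain weighted sum shown here is only loosely analogous to the wsum method and does not reproduce the consensus scoring.

```python
import numpy as np

def center_per_gene(expr):
    """Subtract each gene's mean across cell lines (rows = genes)."""
    expr = np.asarray(expr, dtype=float)
    return expr - expr.mean(axis=1, keepdims=True)

def weighted_sum_scores(centered, weights):
    """Plain weighted sum of signature genes per cell line (illustration only)."""
    return np.asarray(weights, dtype=float) @ np.asarray(centered, dtype=float)

def flag_significant(z_scores, threshold=1.0):
    """Label scores above +threshold as 'up', below -threshold as 'down'."""
    z = np.asarray(z_scores, dtype=float)
    return np.where(z > threshold, "up", np.where(z < -threshold, "down", "ns"))

# Toy normalized expression: 3 signature genes x 3 cell lines.
expr = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 4.0, 4.0],
    [2.0, 0.0, 4.0],
])
weights = [1.0, 0.5, -1.0]  # hypothetical signature weights
centered = center_per_gene(expr)
scores = weighted_sum_scores(centered, weights)
labels = flag_significant([-1.5, 0.2, 1.0, 2.3]).tolist()
```

Centering per gene means every score is expressed relative to the average of all analyzed cell lines, so a positive value indicates higher inferred activity than the average cell line rather than an absolute activity level.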