Cell Line - Methods summary

Summary

The Cell line resource contains transcriptomics and proteomics data for human cell lines. The transcriptomics analysis covers 1206 human cell lines, corresponding to 28 cancer types, one non-cancerous group and one uncategorised group of cell lines, and includes classification based on specificity, distribution and expression clusters. The proteomics part is based on the Pan-Cancer Atlas project and presents MS data for the 542 cell lines that overlap between the two datasets.

Based on the transcriptomics profiles, cell lines were evaluated for their consistency with the corresponding TCGA (The Cancer Genome Atlas) disease cohort, to help researchers select the best cell lines as in vitro models for cancer research. In addition, based on biological data mining, the relative activity of 14 cancer-related pathways and 43 cytokines was inferred for each cell line and is presented to characterize the phenotype of the cell line.

The addition of protein expression data from the Pan-Cancer Atlas contributes to a better understanding of the cell line models, and by enabling a relative comparison between RNA and protein expression levels the data can aid in selecting an appropriate cell line model for the target and application of interest.

Key publication

Jin H et al. (2023) "Systematic transcriptional analysis of human cell lines for gene expression landscape and tumor representation" Nat Commun 14, 5417.

What can you learn from the Cell line resource?

  • if a gene is enriched in cell lines from a particular cancer type (specificity)
  • which genes have a similar RNA expression profile across the cell lines (expression cluster)
  • the catalogue of genes elevated in each of the cell lines
  • which cell line has the expression profile most consistent with its corresponding TCGA disease cohort (i.e., the best cell lines for cancer study)
  • the relative protein abundance in the cell line groups
  • cancer-related pathway and cytokine activity of each cell line from RNA-seq

Data overview

Data type | Count | Data | Coverage (nr genes)
RNA expression | 1206 | RNA expression in 1206 cell lines | 20162
RNA expression | 28 | RNA expression and classification based on 28 cancer cell line groups | 20162
RNA expression | 1132 | Comparison between 1132 cancer cell lines and TCGA cancers using ranking and correlation | 19973
RNA cell line analysis | 1206 | PROGENy and CytoSig analysis on 1206 cell lines |
RNA gene clustering | 1206 | Gene expression cluster analysis based on 1206 cell lines | 19508

How has the data been generated?

RNA-seq

A genome-wide expression analysis of 1206 human cell lines, including 1132 cancer cell lines, was performed using RNA-seq with early-split samples as duplicates. Here, RNA-seq profiles of cell lines generated by the HPA (n = 69) and the Cancer Cell Line Encyclopedia (CCLE 2019; n = 1019) were integrated, and gene expression was averaged for the 33 cell lines common to both sources.
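The integration step can be illustrated with a minimal sketch. It assumes each source is a nested mapping of cell line to gene to expression value, and that cell lines present in both sources are combined by a plain arithmetic mean per gene. The function name, the toy cell lines and the values are hypothetical and are not taken from the HPA pipeline.

```python
from statistics import mean

def merge_expression(hpa, ccle):
    """Merge two {cell_line: {gene: value}} tables.

    Cell lines found in both tables are averaged gene by gene;
    cell lines found in only one table are copied unchanged.
    """
    merged = {}
    for cell_line in sorted(set(hpa) | set(ccle)):
        sources = [t[cell_line] for t in (hpa, ccle) if cell_line in t]
        # Only genes quantified in every available source are kept (assumption).
        genes = set.intersection(*(set(s) for s in sources))
        merged[cell_line] = {g: mean(s[g] for s in sources) for g in sorted(genes)}
    return merged

# Toy input: one shared cell line and one unique cell line per source.
hpa = {"A-549": {"TP53": 10.0, "EGFR": 40.0}, "HEK293": {"TP53": 20.0, "EGFR": 5.0}}
ccle = {"A-549": {"TP53": 14.0, "EGFR": 60.0}, "MCF-7": {"TP53": 8.0, "EGFR": 2.0}}
merged = merge_expression(hpa, ccle)
print(merged["A-549"])
```

In this toy case the shared cell line receives the mean of its two profiles, while the other two cell lines keep their original values.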

MS Proteomics

A mass-spectrometry (MS) based proteomics dataset, obtained by SCIEX6600 TripleTOF mass spectrometry running in DIA mode and containing data for 949 cell lines from over 40 cancer types, was retrieved from the Pan-Cancer Atlas (Gonçalves E et al. (2022)). The data corresponds to three replicates for each cell line analysed on two different mass spectrometers, resulting in six different injections, and the raw data is available here. From the 6,981 MS runs, the spectral library was then created using the in-silico spectral library from DIA-NN, containing 12,387 proteins and 144,578 precursors. The MS data was then searched and normalized using RT-dependent normalization. Only precursors from proteotypic peptides were retained in the output results and quantified with maxLFQ. The data was then log2 transformed, and 117 files were removed following experiment and sample quality control.

How has the data been analyzed?

RNA-seq

The transcript abundance of each protein-coding gene was estimated using the average TPM value of the individual samples for each cell line. The transcriptomics data was then used to:

  • (i) classify the gene expression specificity in different cancer types and the distribution across all cell lines
  • (ii) evaluate the consistency between the cell lines and the corresponding TCGA disease cohort
  • (iii) estimate the cancer-related pathway (PROGENy) and cytokine (CytoSig) activity (with non-protein-coding genes included for calculation)
  • (iv) find the most highly correlated genes and classify all genes according to their cell line-specific expression

MS Proteomics

The analysis used data from the 542 cell lines in the Pan-Cancer dataset that could be matched to cell lines in the RNA-seq dataset. Normalized relative protein expression (nRPX) was estimated from the intensities by Z-score normalization of the protein profile within each cell line. The normalization was performed on the quantified values, irrespective of missing values. Normalization within each cell line takes into account the different levels of protein detection within cell lines in different injections and enables relative comparison of the expression of a protein across the different cell lines.
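A minimal sketch of within-cell-line Z-score normalization is shown here, under stated assumptions: the input is one cell line's log2 protein intensities, missing values are represented as None and are skipped when computing the mean and standard deviation, and the population standard deviation is used. The source does not state which standard deviation estimator was applied, and the function name and protein identifiers are hypothetical.

```python
from statistics import mean, pstdev

def zscore_within_cell_line(intensities):
    """Z-score one cell line's log2 intensities, leaving missing values as None."""
    observed = [v for v in intensities.values() if v is not None]
    mu, sigma = mean(observed), pstdev(observed)
    if sigma == 0:
        raise ValueError("all observed intensities are identical")
    return {p: None if v is None else (v - mu) / sigma
            for p, v in intensities.items()}

# Toy profile for a single cell line; P4 was not detected.
profile = {"P1": 10.0, "P2": 12.0, "P3": 14.0, "P4": None}
z = zscore_within_cell_line(profile)
print(z)
```

Because each cell line is scaled against its own distribution, the resulting values are relative: a protein's score in one cell line can be compared with its score in another even when the overall detection depth differs between injections.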

What is presented in the resource?

The RNA expression levels were determined for all protein-coding genes (n = 20162) across the 1206 human cell lines and the results are presented on the gene summary page of the Cell line resource as exemplified in the figure below.

In this figure, the MS protein expression data is represented by a circle for each cell line group, with size corresponding to expression level and color intensity representing the percentage of cell lines in the group that have MS data. Non-detected proteins are shown in white, and the sizes of the dots can be compared relative to each other through their nRPX values. The stroke of the dot for each cell line group also indicates the percentage of cell lines in the group for which data is available.


On the cell line category specific pages, which are accessed by clicking on the pie chart or the colored boxes on the Cell line resource page, plots show the cancer-related pathway (PROGENy) and cytokine (CytoSig) activity relative to a baseline defined as the average expression of all analyzed cell lines.

For 26 TCGA disease cohorts, a list of the cell lines ranked by gene expression similarity to the corresponding disease cohort is shown. This can serve as a reference for cell line selection for in vitro experiments when studying a specific cancer type.

How has the classification of all protein-coding genes been done?

A genome-wide classification of the protein-coding genes with regard to cell line distribution across all cancer cell lines as well as specificity across 28 cancer types has been performed using between-sample normalized data (nTPM). The results can serve as a reference for researchers interested in expression profiles of human cell lines at both the disease level and cell line level. The genes were classified according to specificity into (i) cancer enriched genes with at least four-fold higher expression levels in one cell line cancer type as compared with any other analyzed cell line cancer type; (ii) group enriched genes with enriched expression in a small number of cell line cancer types (2 to 10); and (iii) cancer enhanced genes with only moderately elevated expression. In addition, all genes were classified according to distribution, in which each gene is scored according to its presence (expression level higher than a cut-off) in the cell lines. The cell line cancer enriched and group enriched genes are displayed in the interactive plot below, in which clicking on the red and orange circles results in gene lists for the corresponding enriched and group enriched genes, respectively.
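The four-fold rule for cancer enriched genes can be sketched as follows. This is an illustration only: it assumes one nTPM value per cell line cancer type for a single gene, and it covers rule (i) alone, because the text above does not give the numeric thresholds for group enriched or cancer enhanced genes. The function name and the example values are hypothetical.

```python
def cancer_enriched_type(ntpm_by_type, fold=4.0):
    """Return the cancer type in which a gene is 'cancer enriched', or None.

    A gene qualifies when its expression in one cancer type is at least
    `fold` times its expression in every other cancer type.
    """
    ranked = sorted(ntpm_by_type.items(), key=lambda kv: kv[1], reverse=True)
    (top_type, top), (_, second) = ranked[0], ranked[1]
    # Comparing against the runner-up is enough: all other types are lower still.
    if top > 0 and top >= fold * second:
        return top_type
    return None

enriched = cancer_enriched_type(
    {"Breast cancer": 80.0, "Lung cancer": 10.0, "Leukemia": 5.0})
not_enriched = cancer_enriched_type(
    {"Breast cancer": 30.0, "Lung cancer": 10.0, "Leukemia": 5.0})
print(enriched, not_enriched)
```

In the first toy gene the top cancer type is eight-fold above the runner-up and is reported; in the second the ratio is only three-fold, so no cancer type is returned.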

Finally, a new classification has been introduced in which genes are clustered based on similarity in expression across the cell lines. The results are presented as an interactive UMAP plot in which mouse-over displays general information for the clusters, and clicking on a cluster displays more information and plots regarding that specific cluster, as well as a clickable list of all clusters.

[Interactive UMAP plot of the gene expression clusters; axes: UMAP1, UMAP2]

How was the similarity of the cell lines to the corresponding TCGA cancer cohorts analysed?

The 1132 cancer cell lines were analyzed for how well they represent the corresponding TCGA disease cohorts. Because tumor samples also harbor immune and other cells, TCGA samples with a low tumor purity score (< 0.7, https://www.nature.com/articles/ncomms3612) were excluded from the analysis. The similarity between cell lines and the corresponding TCGA cohort was estimated by two different approaches:

  • (i) Spearman’s correlation coefficient (ρ) between every cancer cell line and its corresponding TCGA cohort was estimated at the gene level. For this, for each gene in a TCGA cohort, the nTPM values were averaged per cohort. Then, for each TCGA cohort, Spearman’s ρ was calculated between the averaged nTPM values and the nTPM values of the disease-matched cell lines based on the common 20,053 protein-coding genes.
  • (ii) The enrichment of the TCGA cohort elevated genes (i.e., the union of enriched, group enriched, and enhanced genes in the TCGA cohort) in cell lines was evaluated by gene set enrichment analysis (GSEA). The concept is that genes that have an elevated expression in a TCGA cohort can be considered as the cohort signature, and their high expression should be reflected by cell line models. To test this, gene expression was averaged per disease, resulting in the mean expression for each of the 28 cell line cancer types. Then, these per-disease averages were further averaged across diseases to give the disease baseline expression. After that, for every cell line, we calculated the fold change of every gene relative to the disease baseline expression, followed by the log2 transformation of the fold change. Finally, for each cell line, gene log2 fold changes were sorted from high to low, followed by the GSEA of the TCGA cohort elevated genes against the sorted gene list. It is expected that cell lines showing high concordance to the matched TCGA cancer type should present high log2 fold changes of the elevated genes of that TCGA cohort relative to the disease baseline expression. The results were represented as the normalized enrichment score (NES), with a positive value showing high consistency between a cell line and a disease-matched TCGA cohort.

The cell lines were then ranked from high to low based on Spearman’s ρ and on NES, respectively. Finally, the two ranking lists were combined, and cell lines were reordered according to their average rank.
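The correlation and rank-combination steps can be sketched as below. This is a simplified illustration, not the published pipeline: Spearman's ρ is computed as the Pearson correlation of average ranks, the GSEA itself is not implemented, and the NES values are supplied as hypothetical inputs. All function names, cell line names and numbers are made up for the example.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def combined_ranking(rho, nes):
    """Rank cell lines by rho and by NES (high to low), then sort by average rank."""
    names = list(rho)
    rho_rank = average_ranks([-rho[n] for n in names])
    nes_rank = average_ranks([-nes[n] for n in names])
    avg = {n: (a + b) / 2 for n, a, b in zip(names, rho_rank, nes_rank)}
    return sorted(names, key=lambda n: avg[n])

# Toy data: cohort-averaged nTPM for four genes and three disease-matched cell lines.
cohort_mean = [5.0, 1.0, 8.0, 3.0]
cell_lines = {
    "line_A": [4.0, 2.0, 9.0, 3.5],
    "line_B": [1.0, 6.0, 2.0, 7.0],
    "line_C": [5.0, 0.5, 3.0, 4.0],
}
rho = {name: spearman_rho(cohort_mean, v) for name, v in cell_lines.items()}
nes = {"line_A": 2.1, "line_B": -0.8, "line_C": 1.9}  # hypothetical GSEA output
ranking = combined_ranking(rho, nes)
print(ranking)
```

Averaging the two ranks, rather than the raw scores, keeps ρ and NES on a comparable footing even though they are measured on different scales.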

How has the pathway and cytokine analysis been done?

PROGENy

For all 1206 analyzed cell lines, the activity of a total of 14 cancer-related pathways was inferred using PROGENy, a package that relies on biological data mining of publicly available data to obtain cancer-related pathway responsive genes for human and mouse (Schubert M et al. (2018)). For this, read counts for HPA and CCLE cell lines quantified by Kallisto were re-analyzed without filtering out the non-protein-coding genes to ensure a broadened coverage of cancer pathway responsive genes. The read counts of the 1206 cell lines were normalized by DESeq2 with respect to the size factor of each cell line and were further transformed by variance stabilizing transformation into log2 space. To calculate the relative pathway activities across all cell lines, the normalized values were centered by subtracting the mean value per gene. Then, the R package decoupleR was used to calculate the relative pathway activities based on the top 100 signature genes per pathway obtained from the R package progeny (Schubert M et al. (2018)). By default, decoupleR was executed using the top-performing methods in its benchmark (i.e., mlm for multivariate linear model, ulm for univariate linear model, and wsum for weighted sum), and the results were integrated to obtain a consensus z-score representing the pathway activity. Here, a consensus z-score above 1 or below -1 was considered significant.
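The per-gene centering and the significance cut-off can be illustrated with a small sketch. It is a simplified stand-in and does not reproduce decoupleR: only a plain weighted sum over signature genes is computed, without the mlm and ulm methods or the consensus step. The gene names, weights and values are hypothetical, and the input is assumed to be normalized, log2-scale expression per gene and cell line.

```python
from statistics import mean

def center_per_gene(expr):
    """Subtract each gene's mean across cell lines. expr: {gene: {cell_line: value}}."""
    centered = {}
    for gene, by_line in expr.items():
        mu = mean(by_line.values())
        centered[gene] = {cl: v - mu for cl, v in by_line.items()}
    return centered

def weighted_sum_activity(centered, weights):
    """Weighted sum of centered expression over the signature genes, per cell line."""
    cell_lines = next(iter(centered.values())).keys()
    return {cl: sum(w * centered[g][cl] for g, w in weights.items() if g in centered)
            for cl in cell_lines}

def is_significant(consensus_z, threshold=1.0):
    """Apply the cut-off from the text: z above 1 or below -1 is significant."""
    return abs(consensus_z) > threshold

# Toy expression matrix and a hypothetical two-gene pathway signature.
expr = {
    "GENE_A": {"line_1": 8.0, "line_2": 6.0, "line_3": 4.0},
    "GENE_B": {"line_1": 2.0, "line_2": 5.0, "line_3": 8.0},
}
weights = {"GENE_A": 1.0, "GENE_B": -0.5}
activity = weighted_sum_activity(center_per_gene(expr), weights)
print(activity)
```

Centering per gene makes every score relative to the average cell line, which is why the resulting activities are interpreted as relative rather than absolute pathway activity.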

CytoSig

The activity of 43 CytoSig cytokines was inferred from the gene expression profiles of the 1206 cell lines using the package CytoSig (Jiang P et al. (2021)). Gene expression data were processed in the same way as for the PROGENy analysis. In addition, DESeq2-normalized expression values were centered per gene as suggested. The CytoSig program was executed with 10,000 permutations, and the results were presented as z-scores representing the relative cytokine activities, with a p-value < 0.05 considered significant.