RNAbiology
Taylor & Francis Taylor & Francis Group
Advancing Pan-cancer Gene Expression Survial Analysis by Inclusion of Non-coding RNA
Bo Ye, Jianxin Shi, Huining Kang, Olufunmilola Oyebamiji, Deirdre Hill, Hui Yu, Scott Ness, Fei Ye, Jie Ping, Jiapeng He, Jeremy Edwards, Ying-Yong Zhao & Yan Guo
To cite this article: Bo Ye, Jianxin Shi, Huining Kang, Olufunmilola Oyebamiji, Deirdre Hill, Hui Yu, Scott Ness, Fei Ye, Jie Ping, Jiapeng He, Jeremy Edwards, Ying-Yong Zhao & Yan Guo (2020) Advancing Pan-cancer Gene Expression Survial Analysis by Inclusion of Non-coding RNA, RNA Biology, 17:11, 1666-1673, DOI: 10.1080/15476286.2019.1679585
To link to this article: https://doi.org/10.1080/15476286.2019.1679585
+
View supplementary material
☒
Published online: 18 Oct 2019.
Submit your article to this journal ☒
Article views: 4803
View related articles
☒
View Crossmark data ☒ CrossMark
Citing articles: 4 View citing articles
☒
Taylor & Francis Taylor & Francis Group
RESEARCH PAPER
W Check for updates
Advancing Pan-cancer Gene Expression Survial Analysis by Inclusion of Non-coding RNA
Bo Yeª*, Jianxin Shia*, Huining Kang b, Olufunmilola Oyebamijib, Deirdre Hill b, Hui Yu b, Scott Ness @Db, Fei Ye D”, Jie Ping , Jiapeng Heb, Jeremy Edwards @Db, Ying-Yong Zhao @Dd, and Yan Guob
ªDepartment of Thoracic Surgery, Shanghai Chest Hospital, Jiaotong University, Shanghai, China; bComprehensive Cancer Center, University of New Mexico, Albuquerque, NM, USA; Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA; dKey Laboratory of Resource Biology and Biotechnology in Western China, School of Life Sciences, Northwest University, Xi’an, Shaanxi, China
ABSTRACT
Non-coding RNAs occupy a significant fraction of the human genome. Their biological significance is backed up by a plethora of emerging evidence. One of the most robust approaches to demonstrate non-coding RNA’s biological relevance is through their prognostic value. Using the rich gene expression data from The Cancer Genome Altas (TCGA), we designed Advanced Expression Survival Analysis (AESA), a web tool which provides several novel survival analysis approaches not offered by previous tools. In addition to the common single-gene approach, AESA computes the gene expression composite score of a set of genes for survival analysis and utilizes permutation test or cross-validation to assess the significance of log-rank statistic and the degree of over-fitting. AESA offers survival feature selection with post-selection inference and utilizes expanded TCGA clinical data including overall, disease-specific, disease-free, and progression-free survival information. Users can analyse either protein-coding or non- coding regions of the transcriptome. We demonstrated the effectiveness of AESA using several empirical examples. Our analyses showed that non-coding RNAs perform as well as messenger RNAs in predicting survival of cancer patients. These results reinforce the potential prognostic value of non-coding RNAs. AESA is developed as a module in the freely accessible analysis suite MutEx.
Abbreviation: ACC: Adrenocortical Carcinoma (n = 92); BLCA: Bladder Urothelial Carcinoma (n = 412); BRCA: Breast Invasive Carcinoma (n = 1098); CESC: Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma (n = 307); CHOL: Cholangiocarcinoma (n = 51); COAD: Colon Adenocarcinoma (n = 461); DLBC: Lymphoid Neoplasm Diffuse Large B-cell Lymphoma (n = 58); ESCA: Oesophageal Carcinoma (n = 185); GBM: Glioblastoma Multiforme (n = 617); HNSC: Head and Neck Squamous Cell Carcinoma (n = 528); KICH: Kidney Chromophobe (n = 113); KIRC: Kidney Renal Clear Cell Carcinoma (n = 537); KIRP: Kidney Renal Papillary Cell Carcinoma (n = 291); LAML: Acute Myeloid Leukaemia (n = 200); LGG: Brain Lower Grade Glioma (n = 516); LIHC: Liver Hepatocellular Carcinoma (n = 377); LUAD: Lung Adenocarcinoma (n = 585); LUSC: Lung Squamous Cell Carcinoma (n = 504); MESO: Mesothelioma (n = 87); OV: Ovarian Serous Cystadenocarcinoma (n = 608) PAAD: Pancreatic Adenocarcinoma (n = 185); PCPG: Pheochromocytoma and Paraganglioma (n = 179); PRAD: Prostate Adenocarcinoma (n = 500); READ: Rectum Adenocarcinoma (n = 172); SARC: Sarcoma (n = 261); SKCM: Skin Cutaneous Melanoma (n = 470); STAD: Stomach Adenocarcinoma (n = 443); TGCT: Testicular Germ Cell Tumours (n = 150); THCA: Thyroid Carcinoma (n = 507) THYM: Thymoma (n = 124); UCEC: Uterine Corpus Endometrial Carcinoma (n = 560); UCS: Uterine Carcinosarcoma (n = 57); UVM: Uveal Melanoma (n = 80)
ARTICLE HISTORY
Received 19 August 2019 Revised 10 September 2019 Accepted 8 October 2019
KEYWORDS
Cancer survival analysis; non-coding RNA; lincRNA; pseudogene
Introduction
Non-coding RNA is composed of all types of RNAs that do not encode protein. In recent years, substantial efforts have been invested in the discovery of the potential functional effects of non-coding RNAs [1]. Non-coding RNAs are usually categor- ized by their length into long and short RNAs. Short RNAs encompass the well-known species such as microRNA (miRNA) and transfer RNA (tRNA); long intergenic non- coding RNAs (lincRNA) is the most frequently studied subject among all non-coding RNA types. Non-coding RNAs have been the focus of many cancer research studies. For example, lncRNA MALAT-1 was found to associate with poor survival in oral
squamous cell carcinoma [2] and non-small cell lung cancer [3,4]. Another well-established example is lncRNA HOTAIR, which has been shown to promote metastasis in multiple cancers [5-9]. In the Lnc2Cancer database [10], nearly 5000 associations between lncRNA and cancer have been curated. The association between MicroRNA and cancer has been firmly established [11]. Cancer-related studies on other types of non-coding RNA such as pseudogenes and tRNA are less common.
One of the most robust approaches to affirm the functional effects of non-coding RNA is through the observance of prog- nostic value. Of all non-coding RNAs, miRNAs were first found to be potential cancer biomarkers [12]. Recently,
*These authors contributed equally to this work
Supplemental data for this article can be accessed here.
LncRNAs have also been found to inform survival. The latest examples include MVIH in hepatocellular carcinoma [13], BLACAT1 in small cell lung cancer [14], and NET1 in clear cell renal cell carcinoma [15]. A large portion of lncRNA survival studies were conducted with the The Cancer Genome Altas (TCGA) data, such as the panel of 12-lncRNA signature which predicts survival of pancreatic adenocarcinoma [16], the panel of three-lincRNA which predicts the survival of head and neck squamous cell carcinoma [17], the 11-lncRNA signature in endometrial carcinoma [18], etc. In addition to the direct evidence of non-coding RNA association with survival, indirect indications of non-coding RNAs’ functional impact exist as well, e.g., single-nucleotide polymorphism (SNP), expression quantitative trait locus (eQTL) and RNA editing. Remarkably, more than 96% of the 40,525 unique SNPs curated in Genome- Wide Association Study (GWAS) catalogue [19] are located in the genomic non-coding region [20]. One theory attempts to explain the disease liability of non-coding SNPs through long- range regulation of gene expression through eQTLs. RNA editing [21] is the prost-transcription process to modify an RNA sequence in divergence to the genetic code. Recent studies [22,23] revealed the potential functional effect of RNA editing in cancers. Notably, more than 80% of all RNA editing events documented in TCGA and Genotype-Tissue Expression pro- jects [24] are located in non-coding regions [25].
With immense richness at both genomic level and clinical level, TCGA data become an important cancer genomic data- set, underpinning influential cancer analysis web portals such as OncoLnc [26], TANRIC [27], and cBioPortal [28]. However, TCGA data does bear several noticeable weak- nesses. One such flaw is immortal time bias [29], an implicit span of time during which the outcome under study was impossible to occur. Immortal time bias was first documented in 1970s [30]. Since then, multiple manuscripts [31,32] have been published regarding this issue. Yet, it remains one of the least recognized flaws in modern cancer survival analysis. Aiming at alleviating immortal time bias in the disease-free context and maximizing the potential of TCGA survival data, we developed Advanced Expression Survival Analysis (AESA), a web-based tool which offers a series of unprecedented analysis options missing in existing TCGA survival analysis tools. AESA is developed as a module in the MutEx integra- tive cancer genomic analysis suite [33].
Methods
TCGA RNA expression data (FPKM) and overall survival data were downloaded from Genomic Data Commons Data Portal with R package TCGAbiolinks (version 2.8.4). To correct for disease-free immortal time bias, we obtained disease-specific survival, disease-free interval, progression-free interval data from publication by Liu et al. [29]. Because the exact surgery dates of TCGA patients are unknown, a three-month time- span was assumed for patients to sustain from initial diagnosis to the surgery date and such a three-month-period was sub- tracted from each patient’s disease-free survival [29]. This operation was intended to alleviate immortal time bias in the disease-free context.
AESA offers three modes of cancer survival analysis: (1) single gene; (2) multi-gene with composite gene expression score (CGES); (3) feature selection with selective inference. In the single-gene approach, each gene is assessed for survival discrimination potential. Subjects of a cancer are dichoto- mized based on the user-selected gene expression threshold, and log-rank test between the high-expression group and the low-expression group is performed.
In the multi-gene approach, the CGES metric is computed as a surrogate expression value for the whole gene set. CGES is manifested as a risk score ranging from 0 to 1, with a large value indicative of high risk of adverse outcome (death, relapse, disease progression, etc.). A pair of user-defined thresholds are used to divide the subject into low- and high- risk groups, and the log-rank test is used to examine the difference in survival between the low- and high-risk groups with the p-value serving as the statistical evidence for the association between the set of genes and survival outcome. Technically, CGES is obtained as the inverse-logit transforma- tion of a linear combination of individual gene expression values, i.e., CGES = 1 +exp (- ( 2+ Bx -m)] 1 where xi denotes
the standardized log-transformed expression value of one gene (the ith out of k ones) and the coefficient ;, the score test statistic for the same gene from a univariate Cox-PH regression. The normalization factor m captures the median of the linear combination term (Zi-1 Bixi) across all patients of a cancer type.
Because the same gene expression data were repeatedly used to estimate the parameters ß; and calculate the CGES, the p-value of log-rank test based on the x2 distribution is no longer valid which usually overestimates the significance of the associa- tion. We used leave-one-out cross validation (LOOCV) to cor- rect for overfitting. The entire procedure was repeated until each case had been omitted once and had been assigned into a risk group. In each iteration of LOOCV, one sample was left out and the Cox model was fitted by the rest of the samples using exactly the same process. To assess the significance of log-rank statistic and the degree of overfitting, we performed an outcome (or phenotype) permutation test 1000 times by shuffling among the cases and the entire cross-validation process described above was repeated. For each random shuffling, the predicted high/low-risk groups were generated as described above and the log-rank statistics was calculated between two risk groups. Permutation test provided a null-distribution of log-rank test, so the probability that the log-rank statistics in the null distribu- tion is larger than the log-rank statistic between the risk groups predicted by LOOCV without case shuffling is the probability that we obtain this log-rank statistic by random chance. Through this, we calculate the permutation p-value as # (Xperm Xobs) #Perm
Pperm = “(perm ” Nobs), where # (Xperm > Xobs) is the number of times that the log-rank test statistic values (X2 ) calculated from perm permutated data are greater than or equal to that (Xabs) calculated from the real data, and Perm is the total number of permutations.
In the LOOCV method, a leave-one-out loop is exerted to estimate the risk score (CGES) of each subject using the
parameters ({{}}) inferred from all the remaining subjects. Specifically, to determine the risk of a sample, we first take that sample out and use the rest of the samples as a training data set to estimate the parameters for CGES model and then apply the estimated parameters to calculate CGES of the left-out sample. Such CGES scores out of a leave-one-out loop are used to group the subjects, upon which a conventional log-rank test is conducted. The dif- ference in survival outcome between the LOOCV- determined risk groups serves as an evidence for the prog- nosis power of the given gene set. The hazard ratio (HR) and p-value of the log-rank test were calculated using Cox- regression.
As a third option of advanced survival analysis, AESA offers selective inference wrapping around Lasso [34] or Elastic Net [35] regression approaches. The Lasso method follows L1 regularization and penalizes the absolute size of the regression coefficients, while Elastic Net aims at a balance between L1 regularization and L2 regularization. Both meth- ods help to select a subset of genes from the initial set (of k genes, as denoted in the CGES formula above) that best explain survival outcome in parsimony. Post-selection infer- ence is performed using the R package selectiveInference (ver- sion 1.2.4).
AESA also allows the screening of gene sets’ survival pre- dictability by pathway or Gene Ontology. From Pathway Commons [36], the combination from three major pathway sources, namely PID [37], PANTHER [38], and INOH [39], was collected. Identically named pathways from distinct sources were integrated into a singleton pathway by adopting the largest-sized gene set and adding additional gene members from a secondary source if that source contributed more than 70% shared gene members. Additionally, we examined gene overlapping between every pathway pair and ensured each pair of pathways have no more than 70% shared genes. In the end, we established 369 major pathways screenable by AESA. Gene Ontology data were retrieved from BioMart databases using R package biomaRt [40]. A total of 13,486 gene sets were collected, comprised of 9,184 Biological Process sets, 1,441 Cellular Component sets, 2,861 Molecular Function sets.
Results
AESA was developed through mixture support by PHP, Javascript, MySQL, and R. Taking full advantage of the rich gene expression and clinical data from TCGA, AESA enables three modes of survival analysis on four types of survival data. The three modes of survival analysis are single, multi-gene, and feature selection. The four types of survival data are overall survival, disease-specific survival, disease-free survival, and progression-free survival [29]. In addition, AESA allows users to limit survival analysis to a particular gene type - pseudogenes, lncRNAs, or protein-coding genes. The overall schema of AESA is depicted in Fig. 1A.
TCGA curates valuable clinical data for the enrolled cancer patients. Nevertheless, these clinical data are subject to immortal time bias as alluded to in Fig. 1B. The initial diagnosis date of TCGA patients can be as early as the 1980s. Yet, the enrolment
commencement for TCGA is around 2010, which means the survival time is biased towards survival. Because the exact enrol- ment date is unknown, it is difficult to adjust for this bias. Similar immortal bias impairs disease-free interval data. We followed Liu et al.’s proposal [29] to subtract 3 months from the initial diagnose date, which roughly accommodate the surgery wait time. At the molecular level, TCGA gene expression data provide a good representation of the entire human transcriptome. It contains over 56,000 RNAs with 19,632 protein-coding RNAs, and the rest being non-coding RNAs. In total, 42 types of non-coding RNAs are represented in the data, the most abundant ones being processed pseudogene (10,113) followed by lincRNA (7,176) (Fig. 1C, supplementary Table 1). The number of outcome events, for combinations of each cancer and each survival data type, is depicted in Fig. 1D.
For a case study of the single-gene mode of survival ana- lysis, we analysed 80 lincRNAs which have been indicated for survival association from 11 studies (Fig. 2A, Supplementary Table 2). Overall survival was selected to ensure comparability with previous studies. Of the 11 studies (Table 1), seven were conducted on TCGA data, four on non-TCGA data. Of the 80 lincRNAs tested, 33 were found significantly (p < 0.05) asso- ciated with survival in the reported cancer type. There is a striking difference in verification of statistical significance between distinct data sources. Compared with lincRNAs that originated from the study of TCGA data, lincRNAs identified from non-TCGA data are less significantly associated with TCGA survival (Fig. 2B). In precise numbers, 37 of the 63 lincRNAs originating from TCGA-supported studies were found to be significant.
Most of those 11 studies argued that the lincRNAs they identified can form a biomarker panel for survival prediction. To verify their allegations, we performed an analysis of the 11 separate lincRNA panels using AESA’s multi-gene CGES approach. Our results indicated that 7 of the 11 lincRNA signa- tures were significantly associated with survival in their respec- tive cancer types (Table 1), including one signature originated from non-TCGA data sources and seven from TCGA-grounded studies.
To demonstrate the feature selection utility of AESA, we screened for prognosis genes with respect to overall survival (Fig. 3A) and disease-specific survival (Fig. 3B) analyses on all cancers with ten-fold cross-validated Lasso method. Due to sample size and survival event limitations, of the 33 cancer tested, 24 cancers produced results for overall survival analysis, and 20 cancers produced results for disease-specific survival analysis. There are noticeable differences between Lasso results for overall survival and disease-specific survival. Five cancers, GBM, LAML, OV, TGCT, and THCA, produced results for overall survival analysis but not disease-specific survival analysis. STAD produced a result for disease-specific survival analysis but none for overall survival analysis.
Of the 19 cancer types that generated results for both types of survival analyses, a significant difference in gene type selection was observed. Our results show that of all genes selected in overall survival analysis, lincRNA was the most abundant type account- ing for 25%, followed by protein-coding RNA for 24% and then other types of non-coding RNAs. For disease-specific survival analysis, lincRNA and protein-coding RNA each represented
A
D
GENE EXPRESSION
CLINICAL OUTCOME
Protein Coding
OS/DSS/DFI/PFI
ACC
☒ ☒ 28
26
0
☒ 11
0 ☒
41
2
non-coding
overall survival disease-specific survival
BLCA
122
☒
43
☒
31
☒
26
174
55
disease-free interval progression-free interval
☒ 180
☒
BRCA
☒
152
☒
83
☒ 49
☒ 84
☒
40
145
59
· Single gene
CESC
☒ 71
☒
54
☒
13
☒ 26
☒
5 ☒
71
☒
16
· Gene expression composite score
· Selective inference with Cox model
CHOL
☒
18
☒
16
1
☒
10
2
20
2
Advanced Expression Survival Analysis
COAD
☒ ☒ 102
64
☒
22
☒
23
☒
8 ☒
122
28
DLBC
☒
9
☒
4
☒
5
☒
4
3
12
4
High
LOW
ESCA
☒ 64
☒
45
☒ 18
☒ 21
☒
9
78
☒ 19
Expression
GBM
☒
122
☒ 107
☒
3
1
0 ☒ 121
12
Differential Survival
Kaplan-Meier curve Log-rank test
HNSC
☒ ☒ 218
129
☒
63
☒
28
☒
10
193
72
Cox-PH model
KICH
☒ 10
☒ 8
2
3
0
12
1
Gene selection
Overfitting correction
KIRC
☒ 173 ☒
108
☒
54
☒ 15
☒
5
☒ 160
62
B
KIRP
☒
44
☒
28
☒
12 ☒
28
☒
7
☒
57
15
Approximate enrollment commencement
LAML
☒
97
0
0
0
0
0
0
1500
Overall Immortal Bias
LGG
126 ☒ ☒
113
4
☒ 20
1
192
9
(Early) Diagnosis
Enrollment 2010
LIHC
130 ☒ ☒ 79
☒ 43
☒ 143
☒
39
180
45
Disease-free Immortal Bias
LUAD
☒
187 ☒ 115
☒
36
☒
89
☒
29
209
54
Diagnosis
(Surgery date unknown)
Frequency
1000
Disease free
LUSC
217 ☒ ☒
89
☒
75
☒
61
☒
51
147
108
Pseudo Correction: 3 months
MESO
☒ 73
☒
43
☒
9
☒ 7
☒
6
61
18
OV
☒ 230
☒
199
☒
6
☒ 126
☒
4
272
21
500
PAAD
☒
92 ☒
72 ☒
14
☒
23
☒
11
103
17
PCPG
☒ ☒
6
4
2
☒ 4
1
21
2
PRAD
☒
10
☒
5
☒
3
☒ 30
☒
3
93
4
0
READ
☒
26
☒
14
☒
5
☒
6
1 ☒
37
9
1980
1985
1990
1995
2000
2005
2010
SARC
98 ☒ ☒
81
☒
11
☒
☒ 138
15
Year of Initial Diagnosis
☒
66
7
SKCM
☒
29
☒
21
☒ 8
0
0
☒ 37
7
C
1448 (3%)
940 (2%)
STAD
☒
149 ☒ 91
☒
36
☒
37
☒
27
☒
124
55
1872 (3%)
1039 (2%)
4389 (8%)
TGCT
☒
2207 (4%)
4
3
1
☒ 27
1
35
1
4
SnRNA miRNA
TEC
5 snoRNA
2506 (4%)
unprocessed pseudogene
THCA
☒
16
☒ 7
☒
3
☒
26
☒
4
☒ 52
☒ 9
5394 (10%)
misc RNA
others
THYM
☒
9
☒
4
☒
5
0
0
20
3
antisense
UCEC
☒ 91
☒ 60
☒
29 ☒
57
☒ 23
124
26
7176 (13%)
lincRNA
protein coding
UCS
☒ ☒ 34
30
2
☒
9
2
36
4
processed pseudogene
19632 (35%)
UVM
☒
23
☒
21
☒
2
0
0
☒
30
4
10113 (18%)
OS
DSS
DSS.cr
DFI
DFI.cr
PFI
PFI.cr
22% of all genes selected. Among the other types of non-coding RNAs, antisense RNA and processed pseudogenes are the two
most abundantly selected types of non-coding RNA types behind lincRNA.
A
TCGA
No
Yes
4 -
-log10(p-value)
2
I
0
V
ESCA
29891973
OV
27074572
LUAD
30659574
28109476
HNSC
29304762
UCEC
COAD
30510449
PAAD
31031865
BRCA
26549855
OV
28389671
COAD
30396175
26183581
LUAD
B
TCGA
No
Yes
C
TCGA
No
Yes
4-
4-
3-
-log10(p-value)
· ·
-log10(p-value)
2
2
1-
0-
0-
Yes
NO
ESCA
OV
LUAD
UCEC
HNSC
PAAD
COAD
LUAD
BRCA
OV
COAD
Separately for each cancer, we conducted conventional multi- variable Cox regression analysis using the selected feature genes. R-square was used to compare the model fitness between Lasso all genes and Lasso non-coding RNAs for each cancer. For both overall survival (Fig. 3C) and disease-specific survival (Fig. 3D), non-coding RNAs achieved comparable fitness to all genes. The large variation of R-square across cancer types indicates different cancer may have different survival sensitivity towards gene expression. For certain cancers, such as ACC and KIRP in overall survival context, and LGG and MESO in disease-specific survival context, the model with non-coding RNA performed substantially better than the model with all genes. The performance of the models between all genes and non-coding RNAs was also com- pared using the concordance index (C-index) [41]. Both overall survival (Fig. 3E) and disease-specific survival (Fig. 3F) produced models with high correlations. These results suggest that the survival prediction power conferred by non-coding RNAs is as effective as by protein-coding RNAs.
Discussion
We presented AESA to take full advantage of the TCGA gene expression and survival data. In addition to the conventional approaches offered by the previous TCGA analysis tools, AESA is equipped with multi-gene CGES and feature selec- tion methods. Both of these new analysis options are aimed for non-single gene approach, and to provide users with more freedom when exploring the TCGA data. AESA is also the first publically available tool to offer disease-specific, disease- free, and progression-free survival analysis for TCGA data.
As we have shown, TCGA clinical data are subject to immortal time bias. Such bias can be addressed by subtracting 3 months from the disease-free survival time, which roughly accounts for the surgery wait period after diagnosis. To cor- rect for immortal time bias for overall and disease-specific survival time, ideally, we should calibrate the followup initia- tion to the enrolment date, rather than the initial diagnose
| Study PMID1 | Cancer | Number of genes | P-value2 | TCGA3 |
|---|---|---|---|---|
| 28389671 [42] | Ovarian cancer | 6 | 0.0534 | No |
| 26549855 [43] | Breast cancer | 9 | 0.5622 | No |
| 30396175 [44] | Colorectal cancer | 6 | 0.0005 | No |
| 26183581 [45] | Non-small cell lung cancer | 8 | 0.7016 | No |
| 27074572 [46] | Ovarian cancer | 8 | 0.023 | Yes |
| 29304762 [47] | Endometrial carcinoma | 11 | 0.0017 | Yes |
| 28109476 [17] | Head and neck squamous cell carcinoma | 5 | 0.0004 | Yes |
| 31031865 [16] | Pancreatic adenocarcinoma | 12 | 0.0001 | Yes |
| 30510449 [48] | Colorectal cancer | 15 | 0.0001 | Yes |
| 30659574 [49] | Non-small cell lung cancer | 8 | 0.0057 | Yes |
| 29891973 [50] | Oesophageal squamous cell carcinoma | 7 | 0.9708 | Yes |
1Pubmed ID.
2P-value computed from leave one out cross-validation.
3Whether the lincRNA signature was derived from TCGA data.
date. Unfortunately, after carefully examining the TCGA clin- ical data from several different sources, the exact enrolment dates remain unknown. Thus, at this point, we are unable to correct for immortal time bias for overall and disease-specific survival.
As for model validation, the LOOCV, a special case of K-fold cross-validation where K = N (number of observa- tions), is approximately an unbiased estimator. However, it is known that LOOCV has a high variance (the results may vary a lot when a different random sample is drawn from the target population). For moderate-to-large study samples, repeated 10-fold cross-validation has been shown to
outperform LOOCV. In smaller samples, bootstrapping is usually the preferred approach for internal validation. Bootstrap methods generally have the lowest mean squared error of accuracy estimates. One of our future research direc- tions is to apply and compare these internal validation meth- ods, and provide overfitting corrected model fitting estimations.
To demonstrate AESA’s effectiveness, we performed sev- eral empirical and proof-of-concept examples. One case study checked survival predictability of previously reported lincRNAs. The results validated 33 out of 80 lincRNA that were reported in 11 published studies. Some of these
A
C
E
KIRP
Overall Survival
LGG
KIRC
0.8
Overall Survival
SARC
LUAD
1.0
Spearman r-0.98
cancers
MESO
PAAD
RSquare
0.6
&
O ACC
8 LƯỢC
UCEC
&
BACA 5 MESO
LUSC
OV
ACC
antisense
0,9
+
COAD
LAML
IG C pseudogene
0.4 -
Non-coding Genes
X
DLBC
MAAD
COAD
PRAD
IG D gene
0
OBU
PRAD
0,8
V
HNSC
0.2
SARC
THYM
IG J gene
OV
2
KOCH
TOCT
DLBC
IG V gene
*
KRC
. THCA
THCA
IG V pseudogene
0.0
0.7
¢ KRP
O THYM
HNSC
BRCA
lincRNA
THYM -
e LAML
UCEC
TGCT
DLBC
ACC
LIHC
miRNA
A
JVM
misc RNA
ME
0.6
2
LUAD E
UVM-
KICH
processed
0.6
0.7
0.8
0.9
1.0
GBM
10
20
30
pseudogene
All Genes
0
Number of genes
processed transcript
B
protein coding
All Genes
Non-coding Genes
LUAD
scaRNA
D
SARC
sense intronic
Disease-Specific Survival
F
KIRC
COAD
sense overlapping
Disease-Specific Survival
cancers
KIRP
snoRNA
1.0
Spearman r-0 93
0
ACC
BRCA
snRNA
0.6-
4
BACA
PAAD
TEC
RSquare
+
COAD
X
DLBC
STAD
TR J gene
0.4-
0.9
O
HNSC
PRAD
Non-coding Genes
ACC
TR V gene
A
V
KICH
KORG
LGG
transcribed
&
米
LOG
UCEC
processed pseudogene
0.2 -
0.8
UVM
transcribed
ICK
+
LUAD
@
LƯỢC
THYM
unitary pseudogene
MESO
KICH
transcribed
EB
PAAD
8
HNSC
unprocessed pseudogene
0.0
0.7
SARC
LUSC
unprocessed pseudogene
LUSC -
HNSC -
UCEC -
BRCA -
MESO -
STAD -
THYM -
KICH -
KIRC -
®
LGG -
LUAD -
COAD -
DLBC -
SARC -
PAAD -
KIRP -
UVM -
ACC -
B
STAD
THYM
LIHC
0
UCEC
DLBC -
0.7
0.8
0.9
1.0
WVM
MESO
All Genes
0
20
40
60
Number of genes
pronounced lincRNAs cannot be validated in AESA, possibly because some studies were based on early versions of TCGA data. Over the last decade, TCGA has been through several iterations of data reprocessing and was augmented by addi- tional sample accumulation. For example, there are as many as four versions of gene expression data for some cancers in TCGA. Some early studies have been based solely on micro- array gene expression data. In contrast, AESA is entirely based on the latest version of RNA-seq data. When processing each lincRNA signature with our multi-gene CGES approach, we found 7 out of 11 lincRNA signatures were significantly associated with survival. This result highlights the advantage of the multi-gene approach. Our other important results show that non-coding RNAs have similar prediction power for cancer survival compared to protein-coding RNAs. The strong survival prediction power of non-coding RNA could result from either direct functional effect or indirect con- founding effect between non-coding RNA and protein- coding RNAs such as co-expression. In summary, all results presented herewith indicate lincRNAs as a vastly untapped resource for predicting cancer survival.
Secondary data analysis from publically available datasets is one of the most efficient and cost-effective approaches for inception of novel hypotheses or validation of presumable bio- medical hypothesis. The rise of interest in non-coding RNA and its functional effect is undoubtedly associated with the accumu- lation of public genomic data. While being analysed countless times, certain aspect of the TCGA data often eludes us, prevent- ing us from truly maximizing the potential of a dataset. Based on previous experience and several innovative ideas, we devel- oped AESA and demonstrated its functionality with empirical examples. Our goal is to fully maximize the potential of TCGA data and offers unprecedented analysis options to users. AESA by no means covers all of the possibilities with TCGA survival analysis, but it does offer several novel approaches not covered by existing TCGA analysis tools. AESA will undergo constant improvement based on user feedbacks and methodology inno- vation. In the near future, we hope to integrate addition datasets into AESA, especially the data from International Cancer Genome Consortium.
Acknowledgments
YG, OO, HY, HJ, SN, HK, JE were supported by the cancer centre support grant P30CA118100. This study was supported by the Comprehensive Cancer Centre at the University of New Mexico, the Bioinformatics Shared Resources and the Biostatistics Shared Resources at The Comprehensive Cancer Centre. This work was supported by the pro- gramme in Western Medicine approved by Shanghai Science and Technology Commission (16411966000) (BY, JS), the interdisciplinary Programme of Shanghai Jiaotong University (YG2014QN22) (BY, JS), Science and Technology Commission Shanghai Municipality (STC5M15DZ2270400), Natural Science Foundation of China (81572245, 81402571) (BY, JS) and Collaborative Research Project of Transformational Medicine Collaborative Innovation Centre (TM201822) (BY, JS).
Availability
AESA is freely available at http://innovebioinfo.com/Databases/TCGA_ Mutation_ExpressionDB/Mutationdb_Search5.php.
Disclosure of potential conflicts of interest
No potential conflicts of interest were disclosed.
Funding
This work was supported by the National Cancer Institute [P30CA118100].
ORCID
Huining Kang ® http://orcid.org/0000-0002-4415-3573 Deirdre Hill @D http://orcid.org/0000-0002-8495-659X Hui Yu @ http://orcid.org/0000-0002-4491-8855 Scott Ness [ http://orcid.org/0000-0001-6965-8909 Fei Ye @ http://orcid.org/0000-0001-6472-5076 Jie Ping @ http://orcid.org/0000-0001-9907-5819 Jeremy Edwards @ http://orcid.org/0000-0003-3694-3716 Ying-Yong Zhao ® http://orcid.org/0000-0002-0239-7342
References
[1] Wang J, Wu X, Gao W, et al. Current Research on Non-Coding Ribonucleic Acid (RNA). Genes (Basel). 2017;8(12):366.
[2] Zhou X, Liu Y, Feng X, et al. Long non coding RNA MALAT1 promotes tumor growth and metastasis by inducing epithelial-mesenchymal transition in oral squamous cell carcinoma. Sci Rep. 2015;5:15972.
[3] Schmidt LH, Spieker T, Koschmieder S, et al. The long noncoding MALAT-1 RNA indicates a poor prognosis in non-small cell lung cancer and induces migration and tumor growth. J Thorac Oncol. 2011;6(12):1984-1992.
[4] Gutschner T, Hämmerle M, Eissmann M, et al. The noncoding RNA MALAT1 is a critical regulator of the metastasis phenotype of lung cancer cells. Cancer Res. 2013;73(3):1180-1189.
[5] Gupta RA, Shah N, Wang KC, et al. Long non-coding RNA HOTAIR reprograms chromatin state to promote cancer metastasis. Nature. 2010;464(7291):1071-U148.
[6] Bhan A, Hussain I, Ansari KI, et al. Bisphenol-A and diethylstil- bestrol exposure induces the expression of breast cancer asso- ciated long noncoding RNA HOTAIR in vitro and in vivo. J Steroid Biochem Mol Biol. 2014;141:160-170.
[7] Bhan A, Hussain I, Ansari KI, et al. Antisense transcript long noncoding RNA (lncRNA) HOTAIR is transcriptionally induced by estradiol. J Mol Biol. 2013;425(19):3707-3722.
[8] Hajjari M, Salavaty A. HOTAIR: an oncogenic long non-coding RNA in different cancers. Cancer Biol Med. 2015;12(1):1-9.
[9] Zhong DN, Luo YH, Mo WJ, et al. High expression of long noncod- ing HOTAIR correlated with hepatocarcinogenesis and metastasis. Mol Med Rep. 2017;17(1):1148-1156.
[10] Ning S, Zhang J, Wang P, et al. Lnc2Cancer: a manually curated database of experimentally supported lncRNAs associated with various human cancers. Nucleic Acids Res. 2016;44(D1):D980-5.
[11] Peng Y, Croce CM. The role of MicroRNAs in human cancer. Signal Transduct Target Ther. 2016;1:15004.
[12] Schoof CR, Botelho ELDS, Izzotti A, et al. MicroRNAs in cancer treatment and prognosis. Am J Cancer Res. 2012;2(4):414-433.
[13] Zhang Y, Lin S, Yang X, et al. Prognostic and clinicopathological significance of lncRNA MVIH in cancer patients. J Cancer. 2019;10(6):1503-1510.
[14] Su J, Nikolic A, Kentish SJ, et al. Long noncoding RNA BLACAT1 indicates a poor prognosis of colorectal cancer and affects cell proliferation by epigenetically silencing of p15. Cell Death Dis. 2017;8(3):e2665.
[15] Ning L, Li Z, Wei D, et al. LncRNA, NEAT1 is a prognosis biomarker and regulates cancer progression via epithelial-mesenchymal transi- tion in clear cell renal cell carcinoma. Cancer Biomark. 2017;19 (1):75-83.
[16] Zhang H, Zhu M, Du Y, et al. A panel of 12-lncRNA signature predicts survival of pancreatic adenocarcinoma. J Cancer. 2019;10 (6):1550-1559.
[17] Cao W, Liu J-N, Liu Z, et al. A three-lncRNA signature derived from the Atlas of ncRNA in cancer (TANRIC) database predicts the survival of patients with head and neck squamous cell carcinoma. Oral Oncol. 2017;65:94-101.
[18] Le Page C, Rahimi K, Köbel M, et al. A novel lncRNA-focus expression signature for survival prediction in endometrial carcinoma. BMC Cancer. 2018;18(1):39.
[19] Hindorff LA, Sethupathy P, Junkins HA, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A. 2009;106 (23):9362-9367.
[20] Sheng Q, Samuels DC, Yu H, et al. Cancer-specific expression quantitative loci are affected by expression dysregulation. Brief Bioinform. 2018.
[21] Guo Y, Yu H, Samuels DC, et al. Single-nucleotide variants in human RNA: RNA editing and beyond. Brief Funct Genomics. 2019;18(1):30-39.
[22] Han L, Diao L, Yu S, et al. The genomic landscape and clinical relevance of A-to-I RNA editing in human cancers. Cancer Cell. 2015;28(4):515-528.
[23] Qian M, Spada C, Wang X. Detection and application of RNA Editing in Cancer. Adv Exp Med Biol. 2018;1068:159-170.
[24] Lonsdale J, Thomas J, Salvatore M, et al. The Genotype-Tissue Expression (GTEx) project. Nat Genet. 2013;45(6):580-585.
[25] Chigaev M, Yu H, Samuels DC, et al. Genomic positional dissec- tion of RNA editomes in tumor and normal samples. Front Genet. 2019;10:211.
[26] Anaya J. OncoLnc: linking TCGA survival data to mRNAs, miRNAs, and lncRNAs. PeerJ Comput Sci. 2016;2:e67.
[27] Li J, Han L, Roebuck P, et al. TANRIC: an interactive open platform to explore the function of lncRNAs in cancer. Cancer Res. 2015;75(18):3728-3737.
[28] Cerami E, Gao J, Dogrusoz U, et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data (vol 2, pg 401, 2012). Cancer Discov. 2012;2 (10):960.
[29] Liu JF, Lichtenberg T, Hoadley KA, et al. An integrated TCGA pan-cancer clinical data resource to drive high-quality survival outcome analytics. Cell. 2018;173(2):400-+.
[30] Gail MH. Does cardiac transplantation prolong life? A reassessment. Ann Intern Med. 1972;76(5):815-817.
[31] Levesque LE, Deneux-Tharaux C, Brocklehurst P, et al. Problem of immortal time bias in cohort studies: example using statins for preventing progression of diabetes. Br Med J. 2010;340:b5087.
[32] Park HS, Gross CP, Makarov DV, et al. Immortal time bias: a frequently unrecognized threat to validity in the evaluation of post- operative radiotherapy. Int J Radiat Oncol Biol Phys. 2012;83 (5):1365-1373.
[33] Ping J, Oyebamiji O, Yu H. Mutex: a multifaceted gateway for exploring integrative pan-cancer genomic data. Brief Bioinform.
2019. https://academic.oup.com/bib/advance-article/doi/10.1093/ bib/bbz084/5581642
[34] Wang HS, Li GD, Tsai CL. Regression coefficient and autoregres- sive order shrinkage and selection via the lasso. J R Stat Soc Ser B Stat Methodol. 2007;69:63-78.
[35] Zou HAH T. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol. 1995;571:289-300. Key: citeulike:1042553, 2005. .
[36] Cerami EG, Gross BE, Demir E, et al. Pathway Commons, a web resource for biological pathway data. Nucleic Acids Res. 2011;39 (Database issue):D685-690. [cited 2019 May 20]. Available from: https://www.pathwaycommons.org.
[37] Schaefer CF, Anthony K, Krupa S, et al. PID: the pathway interaction database. Nucleic Acids Res. 2009;37(Database issue):D674-9.
[38] Mi H, Poudel S, Muruganujan A, et al. PANTHER version 10: expanded protein families and functions, and analysis tools. Nucleic Acids Res. 2016;44(D1):D336-42.
[39] Yamamoto S, Humphrey JC, Conley AB, et al. INOH: ontology-based highly structured database of signal transduction pathways. Database (Oxford). 2011;2011:bar052.
[40] Durinck S, Moreau Y, Kasprzyk A, et al. BioMart and bioconduc- tor: a powerful link between biological databases and microarray data analysis. Bioinformatics. 2005;21(16):3439-3440.
[41] Hosmer DW, Lemeshow S, Sturdivant RX. Applied logistic regres- sion. In: Applied logistic regression. 3rd ed. Hoboken (NJ): Wiley; 2013; p. 1-500.
[42] Yang K, Hou Y, Li A, et al. Identification of a six-IncRNA signature associated with recurrence of ovarian cancer. Sci Rep. 2017;7(1):752.
[43] Sun J, Yuan Y, Chen J, et al. A potential prognostic long non-coding RNA signature to predict metastasis-free survival of breast cancer patients. Sci Rep. 2015;5:16553.
[44] Zhao J, Xu J, Shang A-Q, et al. A Six-LncRNA expression signa- ture associated with prognosis of colorectal cancer patients. Cell Physiol Biochem. 2018;50(5):1882-1890.
[45] Zhou M, Presnell S, Domico K, et al. A potential signature of eight long non-coding RNAs predicts survival in patients with non-small cell lung cancer. J Transl Med. 2015;13:231.
[46] Zhou M, Sun Y, Sun Y, et al. Comprehensive analysis of IncRNA expression profiles reveals a novel IncRNA signature to discrimi- nate nonequivalent outcomes in patients with ovarian cancer. Oncotarget. 2016;7(22):32433-32448.
[47] ] Zhou M, Rahimi K, Köbel M, et al. A novel IncRNA-focus expres- sion signature for survival prediction in endometrial carcinoma. BMC Cancer. 2018;18(1):39.
[48] Wang XN, Zhou J, Xu M, et al. A 15-lncRNA signature predicts survival and functions as a ceRNA in patients with colorectal cancer. Cancer Manag Res. 2018;10:5799-5806.
[49] Miao RC, Ge C, Zhang X, et al. Combined eight-long noncoding RNA signature: a new risk score predicting prognosis in elderly non-small cell lung cancer patients. Aging-Us. 2019;11(2):467-479.
[50] Mao Y, Fu Z, Zhang Y, et al. A seven-IncRNA signature predicts overall survival in esophageal squamous cell carcinoma. Sci Rep. 2018;8.