RNAbiology

Taylor & Francis Taylor & Francis Group

Advancing Pan-cancer Gene Expression Survial Analysis by Inclusion of Non-coding RNA

Bo Ye, Jianxin Shi, Huining Kang, Olufunmilola Oyebamiji, Deirdre Hill, Hui Yu, Scott Ness, Fei Ye, Jie Ping, Jiapeng He, Jeremy Edwards, Ying-Yong Zhao & Yan Guo

To cite this article: Bo Ye, Jianxin Shi, Huining Kang, Olufunmilola Oyebamiji, Deirdre Hill, Hui Yu, Scott Ness, Fei Ye, Jie Ping, Jiapeng He, Jeremy Edwards, Ying-Yong Zhao & Yan Guo (2020) Advancing Pan-cancer Gene Expression Survial Analysis by Inclusion of Non-coding RNA, RNA Biology, 17:11, 1666-1673, DOI: 10.1080/15476286.2019.1679585

To link to this article: https://doi.org/10.1080/15476286.2019.1679585

+

View supplementary material

Published online: 18 Oct 2019.

Submit your article to this journal ☒

Article views: 4803

View related articles

View Crossmark data ☒ CrossMark

Citing articles: 4 View citing articles

Taylor & Francis Taylor & Francis Group

RESEARCH PAPER

W Check for updates

Advancing Pan-cancer Gene Expression Survial Analysis by Inclusion of Non-coding RNA

Bo Yeª*, Jianxin Shia*, Huining Kang b, Olufunmilola Oyebamijib, Deirdre Hill b, Hui Yu b, Scott Ness @Db, Fei Ye D”, Jie Ping , Jiapeng Heb, Jeremy Edwards @Db, Ying-Yong Zhao @Dd, and Yan Guob

ªDepartment of Thoracic Surgery, Shanghai Chest Hospital, Jiaotong University, Shanghai, China; bComprehensive Cancer Center, University of New Mexico, Albuquerque, NM, USA; Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA; dKey Laboratory of Resource Biology and Biotechnology in Western China, School of Life Sciences, Northwest University, Xi’an, Shaanxi, China

ABSTRACT

Non-coding RNAs occupy a significant fraction of the human genome. Their biological significance is backed up by a plethora of emerging evidence. One of the most robust approaches to demonstrate non-coding RNA’s biological relevance is through their prognostic value. Using the rich gene expression data from The Cancer Genome Altas (TCGA), we designed Advanced Expression Survival Analysis (AESA), a web tool which provides several novel survival analysis approaches not offered by previous tools. In addition to the common single-gene approach, AESA computes the gene expression composite score of a set of genes for survival analysis and utilizes permutation test or cross-validation to assess the significance of log-rank statistic and the degree of over-fitting. AESA offers survival feature selection with post-selection inference and utilizes expanded TCGA clinical data including overall, disease-specific, disease-free, and progression-free survival information. Users can analyse either protein-coding or non- coding regions of the transcriptome. We demonstrated the effectiveness of AESA using several empirical examples. Our analyses showed that non-coding RNAs perform as well as messenger RNAs in predicting survival of cancer patients. These results reinforce the potential prognostic value of non-coding RNAs. AESA is developed as a module in the freely accessible analysis suite MutEx.

Abbreviation: ACC: Adrenocortical Carcinoma (n = 92); BLCA: Bladder Urothelial Carcinoma (n = 412); BRCA: Breast Invasive Carcinoma (n = 1098); CESC: Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma (n = 307); CHOL: Cholangiocarcinoma (n = 51); COAD: Colon Adenocarcinoma (n = 461); DLBC: Lymphoid Neoplasm Diffuse Large B-cell Lymphoma (n = 58); ESCA: Oesophageal Carcinoma (n = 185); GBM: Glioblastoma Multiforme (n = 617); HNSC: Head and Neck Squamous Cell Carcinoma (n = 528); KICH: Kidney Chromophobe (n = 113); KIRC: Kidney Renal Clear Cell Carcinoma (n = 537); KIRP: Kidney Renal Papillary Cell Carcinoma (n = 291); LAML: Acute Myeloid Leukaemia (n = 200); LGG: Brain Lower Grade Glioma (n = 516); LIHC: Liver Hepatocellular Carcinoma (n = 377); LUAD: Lung Adenocarcinoma (n = 585); LUSC: Lung Squamous Cell Carcinoma (n = 504); MESO: Mesothelioma (n = 87); OV: Ovarian Serous Cystadenocarcinoma (n = 608) PAAD: Pancreatic Adenocarcinoma (n = 185); PCPG: Pheochromocytoma and Paraganglioma (n = 179); PRAD: Prostate Adenocarcinoma (n = 500); READ: Rectum Adenocarcinoma (n = 172); SARC: Sarcoma (n = 261); SKCM: Skin Cutaneous Melanoma (n = 470); STAD: Stomach Adenocarcinoma (n = 443); TGCT: Testicular Germ Cell Tumours (n = 150); THCA: Thyroid Carcinoma (n = 507) THYM: Thymoma (n = 124); UCEC: Uterine Corpus Endometrial Carcinoma (n = 560); UCS: Uterine Carcinosarcoma (n = 57); UVM: Uveal Melanoma (n = 80)

ARTICLE HISTORY

Received 19 August 2019 Revised 10 September 2019 Accepted 8 October 2019

KEYWORDS

Cancer survival analysis; non-coding RNA; lincRNA; pseudogene

Introduction

Non-coding RNA is composed of all types of RNAs that do not encode protein. In recent years, substantial efforts have been invested in the discovery of the potential functional effects of non-coding RNAs [1]. Non-coding RNAs are usually categor- ized by their length into long and short RNAs. Short RNAs encompass the well-known species such as microRNA (miRNA) and transfer RNA (tRNA); long intergenic non- coding RNAs (lincRNA) is the most frequently studied subject among all non-coding RNA types. Non-coding RNAs have been the focus of many cancer research studies. For example, lncRNA MALAT-1 was found to associate with poor survival in oral

squamous cell carcinoma [2] and non-small cell lung cancer [3,4]. Another well-established example is lncRNA HOTAIR, which has been shown to promote metastasis in multiple cancers [5-9]. In the Lnc2Cancer database [10], nearly 5000 associations between lncRNA and cancer have been curated. The association between MicroRNA and cancer has been firmly established [11]. Cancer-related studies on other types of non-coding RNA such as pseudogenes and tRNA are less common.

One of the most robust approaches to affirm the functional effects of non-coding RNA is through the observance of prog- nostic value. Of all non-coding RNAs, miRNAs were first found to be potential cancer biomarkers [12]. Recently,

*These authors contributed equally to this work

Supplemental data for this article can be accessed here.

LncRNAs have also been found to inform survival. The latest examples include MVIH in hepatocellular carcinoma [13], BLACAT1 in small cell lung cancer [14], and NET1 in clear cell renal cell carcinoma [15]. A large portion of lncRNA survival studies were conducted with the The Cancer Genome Altas (TCGA) data, such as the panel of 12-lncRNA signature which predicts survival of pancreatic adenocarcinoma [16], the panel of three-lincRNA which predicts the survival of head and neck squamous cell carcinoma [17], the 11-lncRNA signature in endometrial carcinoma [18], etc. In addition to the direct evidence of non-coding RNA association with survival, indirect indications of non-coding RNAs’ functional impact exist as well, e.g., single-nucleotide polymorphism (SNP), expression quantitative trait locus (eQTL) and RNA editing. Remarkably, more than 96% of the 40,525 unique SNPs curated in Genome- Wide Association Study (GWAS) catalogue [19] are located in the genomic non-coding region [20]. One theory attempts to explain the disease liability of non-coding SNPs through long- range regulation of gene expression through eQTLs. RNA editing [21] is the prost-transcription process to modify an RNA sequence in divergence to the genetic code. Recent studies [22,23] revealed the potential functional effect of RNA editing in cancers. Notably, more than 80% of all RNA editing events documented in TCGA and Genotype-Tissue Expression pro- jects [24] are located in non-coding regions [25].

With immense richness at both genomic level and clinical level, TCGA data become an important cancer genomic data- set, underpinning influential cancer analysis web portals such as OncoLnc [26], TANRIC [27], and cBioPortal [28]. However, TCGA data does bear several noticeable weak- nesses. One such flaw is immortal time bias [29], an implicit span of time during which the outcome under study was impossible to occur. Immortal time bias was first documented in 1970s [30]. Since then, multiple manuscripts [31,32] have been published regarding this issue. Yet, it remains one of the least recognized flaws in modern cancer survival analysis. Aiming at alleviating immortal time bias in the disease-free context and maximizing the potential of TCGA survival data, we developed Advanced Expression Survival Analysis (AESA), a web-based tool which offers a series of unprecedented analysis options missing in existing TCGA survival analysis tools. AESA is developed as a module in the MutEx integra- tive cancer genomic analysis suite [33].

Methods

TCGA RNA expression data (FPKM) and overall survival data were downloaded from Genomic Data Commons Data Portal with R package TCGAbiolinks (version 2.8.4). To correct for disease-free immortal time bias, we obtained disease-specific survival, disease-free interval, progression-free interval data from publication by Liu et al. [29]. Because the exact surgery dates of TCGA patients are unknown, a three-month time- span was assumed for patients to sustain from initial diagnosis to the surgery date and such a three-month-period was sub- tracted from each patient’s disease-free survival [29]. This operation was intended to alleviate immortal time bias in the disease-free context.

AESA offers three modes of cancer survival analysis: (1) single gene; (2) multi-gene with composite gene expression score (CGES); (3) feature selection with selective inference. In the single-gene approach, each gene is assessed for survival discrimination potential. Subjects of a cancer are dichoto- mized based on the user-selected gene expression threshold, and log-rank test between the high-expression group and the low-expression group is performed.

In the multi-gene approach, the CGES metric is computed as a surrogate expression value for the whole gene set. CGES is manifested as a risk score ranging from 0 to 1, with a large value indicative of high risk of adverse outcome (death, relapse, disease progression, etc.). A pair of user-defined thresholds are used to divide the subject into low- and high- risk groups, and the log-rank test is used to examine the difference in survival between the low- and high-risk groups with the p-value serving as the statistical evidence for the association between the set of genes and survival outcome. Technically, CGES is obtained as the inverse-logit transforma- tion of a linear combination of individual gene expression values, i.e., CGES = 1 +exp (- ( 2+ Bx -m)] 1 where xi denotes

the standardized log-transformed expression value of one gene (the ith out of k ones) and the coefficient ;, the score test statistic for the same gene from a univariate Cox-PH regression. The normalization factor m captures the median of the linear combination term (Zi-1 Bixi) across all patients of a cancer type.

Because the same gene expression data were repeatedly used to estimate the parameters ß; and calculate the CGES, the p-value of log-rank test based on the x2 distribution is no longer valid which usually overestimates the significance of the associa- tion. We used leave-one-out cross validation (LOOCV) to cor- rect for overfitting. The entire procedure was repeated until each case had been omitted once and had been assigned into a risk group. In each iteration of LOOCV, one sample was left out and the Cox model was fitted by the rest of the samples using exactly the same process. To assess the significance of log-rank statistic and the degree of overfitting, we performed an outcome (or phenotype) permutation test 1000 times by shuffling among the cases and the entire cross-validation process described above was repeated. For each random shuffling, the predicted high/low-risk groups were generated as described above and the log-rank statistics was calculated between two risk groups. Permutation test provided a null-distribution of log-rank test, so the probability that the log-rank statistics in the null distribu- tion is larger than the log-rank statistic between the risk groups predicted by LOOCV without case shuffling is the probability that we obtain this log-rank statistic by random chance. Through this, we calculate the permutation p-value as # (Xperm Xobs) #Perm

Pperm = “(perm ” Nobs), where # (Xperm > Xobs) is the number of times that the log-rank test statistic values (X2 ) calculated from perm permutated data are greater than or equal to that (Xabs) calculated from the real data, and Perm is the total number of permutations.

In the LOOCV method, a leave-one-out loop is exerted to estimate the risk score (CGES) of each subject using the

parameters ({{}}) inferred from all the remaining subjects. Specifically, to determine the risk of a sample, we first take that sample out and use the rest of the samples as a training data set to estimate the parameters for CGES model and then apply the estimated parameters to calculate CGES of the left-out sample. Such CGES scores out of a leave-one-out loop are used to group the subjects, upon which a conventional log-rank test is conducted. The dif- ference in survival outcome between the LOOCV- determined risk groups serves as an evidence for the prog- nosis power of the given gene set. The hazard ratio (HR) and p-value of the log-rank test were calculated using Cox- regression.

As a third option of advanced survival analysis, AESA offers selective inference wrapping around Lasso [34] or Elastic Net [35] regression approaches. The Lasso method follows L1 regularization and penalizes the absolute size of the regression coefficients, while Elastic Net aims at a balance between L1 regularization and L2 regularization. Both meth- ods help to select a subset of genes from the initial set (of k genes, as denoted in the CGES formula above) that best explain survival outcome in parsimony. Post-selection infer- ence is performed using the R package selectiveInference (ver- sion 1.2.4).

AESA also allows the screening of gene sets’ survival pre- dictability by pathway or Gene Ontology. From Pathway Commons [36], the combination from three major pathway sources, namely PID [37], PANTHER [38], and INOH [39], was collected. Identically named pathways from distinct sources were integrated into a singleton pathway by adopting the largest-sized gene set and adding additional gene members from a secondary source if that source contributed more than 70% shared gene members. Additionally, we examined gene overlapping between every pathway pair and ensured each pair of pathways have no more than 70% shared genes. In the end, we established 369 major pathways screenable by AESA. Gene Ontology data were retrieved from BioMart databases using R package biomaRt [40]. A total of 13,486 gene sets were collected, comprised of 9,184 Biological Process sets, 1,441 Cellular Component sets, 2,861 Molecular Function sets.

Results

AESA was developed through mixture support by PHP, Javascript, MySQL, and R. Taking full advantage of the rich gene expression and clinical data from TCGA, AESA enables three modes of survival analysis on four types of survival data. The three modes of survival analysis are single, multi-gene, and feature selection. The four types of survival data are overall survival, disease-specific survival, disease-free survival, and progression-free survival [29]. In addition, AESA allows users to limit survival analysis to a particular gene type - pseudogenes, lncRNAs, or protein-coding genes. The overall schema of AESA is depicted in Fig. 1A.

TCGA curates valuable clinical data for the enrolled cancer patients. Nevertheless, these clinical data are subject to immortal time bias as alluded to in Fig. 1B. The initial diagnosis date of TCGA patients can be as early as the 1980s. Yet, the enrolment

commencement for TCGA is around 2010, which means the survival time is biased towards survival. Because the exact enrol- ment date is unknown, it is difficult to adjust for this bias. Similar immortal bias impairs disease-free interval data. We followed Liu et al.’s proposal [29] to subtract 3 months from the initial diagnose date, which roughly accommodate the surgery wait time. At the molecular level, TCGA gene expression data provide a good representation of the entire human transcriptome. It contains over 56,000 RNAs with 19,632 protein-coding RNAs, and the rest being non-coding RNAs. In total, 42 types of non-coding RNAs are represented in the data, the most abundant ones being processed pseudogene (10,113) followed by lincRNA (7,176) (Fig. 1C, supplementary Table 1). The number of outcome events, for combinations of each cancer and each survival data type, is depicted in Fig. 1D.

For a case study of the single-gene mode of survival ana- lysis, we analysed 80 lincRNAs which have been indicated for survival association from 11 studies (Fig. 2A, Supplementary Table 2). Overall survival was selected to ensure comparability with previous studies. Of the 11 studies (Table 1), seven were conducted on TCGA data, four on non-TCGA data. Of the 80 lincRNAs tested, 33 were found significantly (p < 0.05) asso- ciated with survival in the reported cancer type. There is a striking difference in verification of statistical significance between distinct data sources. Compared with lincRNAs that originated from the study of TCGA data, lincRNAs identified from non-TCGA data are less significantly associated with TCGA survival (Fig. 2B). In precise numbers, 37 of the 63 lincRNAs originating from TCGA-supported studies were found to be significant.

Most of those 11 studies argued that the lincRNAs they identified can form a biomarker panel for survival prediction. To verify their allegations, we performed an analysis of the 11 separate lincRNA panels using AESA’s multi-gene CGES approach. Our results indicated that 7 of the 11 lincRNA signa- tures were significantly associated with survival in their respec- tive cancer types (Table 1), including one signature originated from non-TCGA data sources and seven from TCGA-grounded studies.

To demonstrate the feature selection utility of AESA, we screened for prognosis genes with respect to overall survival (Fig. 3A) and disease-specific survival (Fig. 3B) analyses on all cancers with ten-fold cross-validated Lasso method. Due to sample size and survival event limitations, of the 33 cancer tested, 24 cancers produced results for overall survival analysis, and 20 cancers produced results for disease-specific survival analysis. There are noticeable differences between Lasso results for overall survival and disease-specific survival. Five cancers, GBM, LAML, OV, TGCT, and THCA, produced results for overall survival analysis but not disease-specific survival analysis. STAD produced a result for disease-specific survival analysis but none for overall survival analysis.

Of the 19 cancer types that generated results for both types of survival analyses, a significant difference in gene type selection was observed. Our results show that of all genes selected in overall survival analysis, lincRNA was the most abundant type account- ing for 25%, followed by protein-coding RNA for 24% and then other types of non-coding RNAs. For disease-specific survival analysis, lincRNA and protein-coding RNA each represented

Figure 1. An overview of AESA. (A). The basic workflow of AESA. AESA is constructed from TCGA gene expression data and survival data. It offers unique analysis abilities not implemented by other TCGA analysis tools. (B). Barplot demonstrating the immortal time bias from TCGA. Initial diagnose dates of TCGA patients can be as early as the 1980s, yet the rough TCGA enrolment date is around 2010, clearly, bias towards survival if initial diagnose dates were used as the starting dates of survival. Disease-free immortal bias can be pseudo addressed by subtracting 3 months from total survive time in order to cover the surgery duration. 3. TCGA gene express data is rich with non-coding RNAs, more than 50% of the reported RNAs are non-coding RNAs. (D). Total number of events for overall survival, disease- specific, disease-free and progression-free in each cancer.

A

D

GENE EXPRESSION

CLINICAL OUTCOME

Protein Coding

OS/DSS/DFI/PFI

ACC

☒ ☒ 28

26

0

☒ 11

0 ☒

41

2

non-coding

overall survival disease-specific survival

BLCA

122

43

31

26

174

55

disease-free interval progression-free interval

☒ 180

BRCA

152

83

☒ 49

☒ 84

40

145

59

· Single gene

CESC

☒ 71

54

13

☒ 26

5 ☒

71

16

· Gene expression composite score

· Selective inference with Cox model

CHOL

18

16

1

10

2

20

2

Advanced Expression Survival Analysis

COAD

☒ ☒ 102

64

22

23

8 ☒

122

28

DLBC

9

4

5

4

3

12

4

High

LOW

ESCA

☒ 64

45

☒ 18

☒ 21

9

78

☒ 19

Expression

GBM

122

☒ 107

3

1

0 ☒ 121

12

Differential Survival

Kaplan-Meier curve Log-rank test

HNSC

☒ ☒ 218

129

63

28

10

193

72

Cox-PH model

KICH

☒ 10

☒ 8

2

3

0

12

1

Gene selection

Overfitting correction

KIRC

☒ 173 ☒

108

54

☒ 15

5

☒ 160

62

B

KIRP

44

28

12 ☒

28

7

57

15

Approximate enrollment commencement

LAML

97

0

0

0

0

0

0

1500

Overall Immortal Bias

LGG

126 ☒ ☒

113

4

☒ 20

1

192

9

(Early) Diagnosis

Enrollment 2010

LIHC

130 ☒ ☒ 79

☒ 43

☒ 143

39

180

45

Disease-free Immortal Bias

LUAD

187 ☒ 115

36

89

29

209

54

Diagnosis

(Surgery date unknown)

Frequency

1000

Disease free

LUSC

217 ☒ ☒

89

75

61

51

147

108

Pseudo Correction: 3 months

MESO

☒ 73

43

9

☒ 7

6

61

18

OV

☒ 230

199

6

☒ 126

4

272

21

500

PAAD

92 ☒

72 ☒

14

23

11

103

17

PCPG

☒ ☒

6

4

2

☒ 4

1

21

2

PRAD

10

5

3

☒ 30

3

93

4

0

READ

26

14

5

6

1 ☒

37

9

1980

1985

1990

1995

2000

2005

2010

SARC

98 ☒ ☒

81

11

☒ 138

15

Year of Initial Diagnosis

66

7

SKCM

29

21

☒ 8

0

0

☒ 37

7

C

1448 (3%)

940 (2%)

STAD

149 ☒ 91

36

37

27

124

55

1872 (3%)

1039 (2%)

4389 (8%)

TGCT

2207 (4%)

4

3

1

☒ 27

1

35

1

4

SnRNA miRNA

TEC

5 snoRNA

2506 (4%)

unprocessed pseudogene

THCA

16

☒ 7

3

26

4

☒ 52

☒ 9

5394 (10%)

misc RNA

others

THYM

9

4

5

0

0

20

3

antisense

UCEC

☒ 91

☒ 60

29 ☒

57

☒ 23

124

26

7176 (13%)

lincRNA

protein coding

UCS

☒ ☒ 34

30

2

9

2

36

4

processed pseudogene

19632 (35%)

UVM

23

21

2

0

0

30

4

10113 (18%)

OS

DSS

DSS.cr

DFI

DFI.cr

PFI

PFI.cr

22% of all genes selected. Among the other types of non-coding RNAs, antisense RNA and processed pseudogenes are the two

most abundantly selected types of non-coding RNA types behind lincRNA.

Figure 2. Eleven lincRNA signatures of 80 lincRNAs were tested by AESA at single-gene and multi-gene level. (A). Result of single-gene survival model for all 80 lincRNAs. Each bar represents a single lincRNA, Y-axis is - log10 (p-value), higher means more significant. LincRNAs are grouped by study, study pubmed IDs are labelled in the x-axis. (B). Boxplot of - log10 (p-value) for TCGA-based studies and non-TCGA-based studies. Dashed line indicates the significance threshold. LincRNAs identified based on TCGA data are more significant than lincRNAs identified based on non-TCGA data. (C). Results of survival analysis using multi-gene CGES approach for the 11 lincRNA signatures. Dashed line indicates the significance threshold.

A

TCGA

No

Yes

4 -

-log10(p-value)

2

I

0

V

ESCA

29891973

OV

27074572

LUAD

30659574

28109476

HNSC

29304762

UCEC

COAD

30510449

PAAD

31031865

BRCA

26549855

OV

28389671

COAD

30396175

26183581

LUAD

B

TCGA

No

Yes

C

TCGA

No

Yes

4-

4-

3-

-log10(p-value)

· ·

-log10(p-value)

2

2

1-

0-

0-

Yes

NO

ESCA

OV

LUAD

UCEC

HNSC

PAAD

COAD

LUAD

BRCA

OV

COAD

Separately for each cancer, we conducted conventional multi- variable Cox regression analysis using the selected feature genes. R-square was used to compare the model fitness between Lasso all genes and Lasso non-coding RNAs for each cancer. For both overall survival (Fig. 3C) and disease-specific survival (Fig. 3D), non-coding RNAs achieved comparable fitness to all genes. The large variation of R-square across cancer types indicates different cancer may have different survival sensitivity towards gene expression. For certain cancers, such as ACC and KIRP in overall survival context, and LGG and MESO in disease-specific survival context, the model with non-coding RNA performed substantially better than the model with all genes. The performance of the models between all genes and non-coding RNAs was also com- pared using the concordance index (C-index) [41]. Both overall survival (Fig. 3E) and disease-specific survival (Fig. 3F) produced models with high correlations. These results suggest that the survival prediction power conferred by non-coding RNAs is as effective as by protein-coding RNAs.

Discussion

We presented AESA to take full advantage of the TCGA gene expression and survival data. In addition to the conventional approaches offered by the previous TCGA analysis tools, AESA is equipped with multi-gene CGES and feature selec- tion methods. Both of these new analysis options are aimed for non-single gene approach, and to provide users with more freedom when exploring the TCGA data. AESA is also the first publically available tool to offer disease-specific, disease- free, and progression-free survival analysis for TCGA data.

As we have shown, TCGA clinical data are subject to immortal time bias. Such bias can be addressed by subtracting 3 months from the disease-free survival time, which roughly accounts for the surgery wait period after diagnosis. To cor- rect for immortal time bias for overall and disease-specific survival time, ideally, we should calibrate the followup initia- tion to the enrolment date, rather than the initial diagnose

Table 1. CGES analysis results on 11 lincRNA signatures reported previously.
Study PMID1CancerNumber of genesP-value2TCGA3
28389671 [42]Ovarian cancer60.0534No
26549855 [43]Breast cancer90.5622No
30396175 [44]Colorectal cancer60.0005No
26183581 [45]Non-small cell lung cancer80.7016No
27074572 [46]Ovarian cancer80.023Yes
29304762 [47]Endometrial carcinoma110.0017Yes
28109476 [17]Head and neck squamous cell carcinoma50.0004Yes
31031865 [16]Pancreatic adenocarcinoma120.0001Yes
30510449 [48]Colorectal cancer150.0001Yes
30659574 [49]Non-small cell lung cancer80.0057Yes
29891973 [50]Oesophageal squamous cell carcinoma70.9708Yes

1Pubmed ID.

2P-value computed from leave one out cross-validation.

3Whether the lincRNA signature was derived from TCGA data.

date. Unfortunately, after carefully examining the TCGA clin- ical data from several different sources, the exact enrolment dates remain unknown. Thus, at this point, we are unable to correct for immortal time bias for overall and disease-specific survival.

As for model validation, the LOOCV, a special case of K-fold cross-validation where K = N (number of observa- tions), is approximately an unbiased estimator. However, it is known that LOOCV has a high variance (the results may vary a lot when a different random sample is drawn from the target population). For moderate-to-large study samples, repeated 10-fold cross-validation has been shown to

outperform LOOCV. In smaller samples, bootstrapping is usually the preferred approach for internal validation. Bootstrap methods generally have the lowest mean squared error of accuracy estimates. One of our future research direc- tions is to apply and compare these internal validation meth- ods, and provide overfitting corrected model fitting estimations.

To demonstrate AESA’s effectiveness, we performed sev- eral empirical and proof-of-concept examples. One case study checked survival predictability of previously reported lincRNAs. The results validated 33 out of 80 lincRNA that were reported in 11 published studies. Some of these

Figure 3. Overall and disease-specific survival analysis results on all cancer types. (A). Twenty-four cancer types successfully completed overall survival lasso feature selection. The selected gene types are distinguished by colour. More non-coding genes were selected than protein-coding genes. (B). Twenty cancer types successfully completed the disease-specific survival lasso feature selection. Similar to overall survival, more non-coding RNAs were selected than protein-coding genes. (C). R-Square comparison between overall survival models constructed using all genes and non-coding genes. (D). R-Square comparison between disease- specific survival models constructed using all genes and non-coding genes. (E). Scatter plot of C-index between overall survival models constructed using all genes and non-coding genes. (F). Scatter plot of C-index between disease-specific survival models constructed using all genes and non-coding genes. The results from C, D, E and F indicate non-coding RNAs can predict survival as well as protein-coding genes.

A

C

E

KIRP

Overall Survival

LGG

KIRC

0.8

Overall Survival

SARC

LUAD

1.0

Spearman r-0.98

cancers

MESO

PAAD

RSquare

0.6

&

O ACC

8 LƯỢC

UCEC

&

BACA 5 MESO

LUSC

OV

ACC

antisense

0,9

+

COAD

LAML

IG C pseudogene

0.4 -

Non-coding Genes

X

DLBC

MAAD

COAD

PRAD

IG D gene

0

OBU

PRAD

0,8

V

HNSC

0.2

SARC

THYM

IG J gene

OV

2

KOCH

TOCT

DLBC

IG V gene

*

KRC

. THCA

THCA

IG V pseudogene

0.0

0.7

¢ KRP

O THYM

HNSC

BRCA

lincRNA

THYM -

e LAML

UCEC

TGCT

DLBC

ACC

LIHC

miRNA

A

JVM

misc RNA

ME

0.6

2

LUAD E

UVM-

KICH

processed

0.6

0.7

0.8

0.9

1.0

GBM

10

20

30

pseudogene

All Genes

0

Number of genes

processed transcript

B

protein coding

All Genes

Non-coding Genes

LUAD

scaRNA

D

SARC

sense intronic

Disease-Specific Survival

F

KIRC

COAD

sense overlapping

Disease-Specific Survival

cancers

KIRP

snoRNA

1.0

Spearman r-0 93

0

ACC

BRCA

snRNA

0.6-

4

BACA

PAAD

TEC

RSquare

+

COAD

X

DLBC

STAD

TR J gene

0.4-

0.9

O

HNSC

PRAD

Non-coding Genes

ACC

TR V gene

A

V

KICH

KORG

LGG

transcribed

&

LOG

UCEC

processed pseudogene

0.2 -

0.8

UVM

transcribed

ICK

+

LUAD

@

LƯỢC

THYM

unitary pseudogene

MESO

KICH

transcribed

EB

PAAD

8

HNSC

unprocessed pseudogene

0.0

0.7

SARC

LUSC

unprocessed pseudogene

LUSC -

HNSC -

UCEC -

BRCA -

MESO -

STAD -

THYM -

KICH -

KIRC -

®

LGG -

LUAD -

COAD -

DLBC -

SARC -

PAAD -

KIRP -

UVM -

ACC -

B

STAD

THYM

LIHC

0

UCEC

DLBC -

0.7

0.8

0.9

1.0

WVM

MESO

All Genes

0

20

40

60

Number of genes

pronounced lincRNAs cannot be validated in AESA, possibly because some studies were based on early versions of TCGA data. Over the last decade, TCGA has been through several iterations of data reprocessing and was augmented by addi- tional sample accumulation. For example, there are as many as four versions of gene expression data for some cancers in TCGA. Some early studies have been based solely on micro- array gene expression data. In contrast, AESA is entirely based on the latest version of RNA-seq data. When processing each lincRNA signature with our multi-gene CGES approach, we found 7 out of 11 lincRNA signatures were significantly associated with survival. This result highlights the advantage of the multi-gene approach. Our other important results show that non-coding RNAs have similar prediction power for cancer survival compared to protein-coding RNAs. The strong survival prediction power of non-coding RNA could result from either direct functional effect or indirect con- founding effect between non-coding RNA and protein- coding RNAs such as co-expression. In summary, all results presented herewith indicate lincRNAs as a vastly untapped resource for predicting cancer survival.

Secondary data analysis from publically available datasets is one of the most efficient and cost-effective approaches for inception of novel hypotheses or validation of presumable bio- medical hypothesis. The rise of interest in non-coding RNA and its functional effect is undoubtedly associated with the accumu- lation of public genomic data. While being analysed countless times, certain aspect of the TCGA data often eludes us, prevent- ing us from truly maximizing the potential of a dataset. Based on previous experience and several innovative ideas, we devel- oped AESA and demonstrated its functionality with empirical examples. Our goal is to fully maximize the potential of TCGA data and offers unprecedented analysis options to users. AESA by no means covers all of the possibilities with TCGA survival analysis, but it does offer several novel approaches not covered by existing TCGA analysis tools. AESA will undergo constant improvement based on user feedbacks and methodology inno- vation. In the near future, we hope to integrate addition datasets into AESA, especially the data from International Cancer Genome Consortium.

Acknowledgments

YG, OO, HY, HJ, SN, HK, JE were supported by the cancer centre support grant P30CA118100. This study was supported by the Comprehensive Cancer Centre at the University of New Mexico, the Bioinformatics Shared Resources and the Biostatistics Shared Resources at The Comprehensive Cancer Centre. This work was supported by the pro- gramme in Western Medicine approved by Shanghai Science and Technology Commission (16411966000) (BY, JS), the interdisciplinary Programme of Shanghai Jiaotong University (YG2014QN22) (BY, JS), Science and Technology Commission Shanghai Municipality (STC5M15DZ2270400), Natural Science Foundation of China (81572245, 81402571) (BY, JS) and Collaborative Research Project of Transformational Medicine Collaborative Innovation Centre (TM201822) (BY, JS).

Availability

AESA is freely available at http://innovebioinfo.com/Databases/TCGA_ Mutation_ExpressionDB/Mutationdb_Search5.php.

Disclosure of potential conflicts of interest

No potential conflicts of interest were disclosed.

Funding

This work was supported by the National Cancer Institute [P30CA118100].

ORCID

Huining Kang ® http://orcid.org/0000-0002-4415-3573 Deirdre Hill @D http://orcid.org/0000-0002-8495-659X Hui Yu @ http://orcid.org/0000-0002-4491-8855 Scott Ness [ http://orcid.org/0000-0001-6965-8909 Fei Ye @ http://orcid.org/0000-0001-6472-5076 Jie Ping @ http://orcid.org/0000-0001-9907-5819 Jeremy Edwards @ http://orcid.org/0000-0003-3694-3716 Ying-Yong Zhao ® http://orcid.org/0000-0002-0239-7342

References

[1] Wang J, Wu X, Gao W, et al. Current Research on Non-Coding Ribonucleic Acid (RNA). Genes (Basel). 2017;8(12):366.

[2] Zhou X, Liu Y, Feng X, et al. Long non coding RNA MALAT1 promotes tumor growth and metastasis by inducing epithelial-mesenchymal transition in oral squamous cell carcinoma. Sci Rep. 2015;5:15972.

[3] Schmidt LH, Spieker T, Koschmieder S, et al. The long noncoding MALAT-1 RNA indicates a poor prognosis in non-small cell lung cancer and induces migration and tumor growth. J Thorac Oncol. 2011;6(12):1984-1992.

[4] Gutschner T, Hämmerle M, Eissmann M, et al. The noncoding RNA MALAT1 is a critical regulator of the metastasis phenotype of lung cancer cells. Cancer Res. 2013;73(3):1180-1189.

[5] Gupta RA, Shah N, Wang KC, et al. Long non-coding RNA HOTAIR reprograms chromatin state to promote cancer metastasis. Nature. 2010;464(7291):1071-U148.

[6] Bhan A, Hussain I, Ansari KI, et al. Bisphenol-A and diethylstil- bestrol exposure induces the expression of breast cancer asso- ciated long noncoding RNA HOTAIR in vitro and in vivo. J Steroid Biochem Mol Biol. 2014;141:160-170.

[7] Bhan A, Hussain I, Ansari KI, et al. Antisense transcript long noncoding RNA (lncRNA) HOTAIR is transcriptionally induced by estradiol. J Mol Biol. 2013;425(19):3707-3722.

[8] Hajjari M, Salavaty A. HOTAIR: an oncogenic long non-coding RNA in different cancers. Cancer Biol Med. 2015;12(1):1-9.

[9] Zhong DN, Luo YH, Mo WJ, et al. High expression of long noncod- ing HOTAIR correlated with hepatocarcinogenesis and metastasis. Mol Med Rep. 2017;17(1):1148-1156.

[10] Ning S, Zhang J, Wang P, et al. Lnc2Cancer: a manually curated database of experimentally supported lncRNAs associated with various human cancers. Nucleic Acids Res. 2016;44(D1):D980-5.

[11] Peng Y, Croce CM. The role of MicroRNAs in human cancer. Signal Transduct Target Ther. 2016;1:15004.

[12] Schoof CR, Botelho ELDS, Izzotti A, et al. MicroRNAs in cancer treatment and prognosis. Am J Cancer Res. 2012;2(4):414-433.

[13] Zhang Y, Lin S, Yang X, et al. Prognostic and clinicopathological significance of lncRNA MVIH in cancer patients. J Cancer. 2019;10(6):1503-1510.

[14] Su J, Nikolic A, Kentish SJ, et al. Long noncoding RNA BLACAT1 indicates a poor prognosis of colorectal cancer and affects cell proliferation by epigenetically silencing of p15. Cell Death Dis. 2017;8(3):e2665.

[15] Ning L, Li Z, Wei D, et al. LncRNA, NEAT1 is a prognosis biomarker and regulates cancer progression via epithelial-mesenchymal transi- tion in clear cell renal cell carcinoma. Cancer Biomark. 2017;19 (1):75-83.

[16] Zhang H, Zhu M, Du Y, et al. A panel of 12-lncRNA signature predicts survival of pancreatic adenocarcinoma. J Cancer. 2019;10 (6):1550-1559.

[17] Cao W, Liu J-N, Liu Z, et al. A three-lncRNA signature derived from the Atlas of ncRNA in cancer (TANRIC) database predicts the survival of patients with head and neck squamous cell carcinoma. Oral Oncol. 2017;65:94-101.

[18] Le Page C, Rahimi K, Köbel M, et al. A novel lncRNA-focus expression signature for survival prediction in endometrial carcinoma. BMC Cancer. 2018;18(1):39.

[19] Hindorff LA, Sethupathy P, Junkins HA, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A. 2009;106 (23):9362-9367.

[20] Sheng Q, Samuels DC, Yu H, et al. Cancer-specific expression quantitative loci are affected by expression dysregulation. Brief Bioinform. 2018.

[21] Guo Y, Yu H, Samuels DC, et al. Single-nucleotide variants in human RNA: RNA editing and beyond. Brief Funct Genomics. 2019;18(1):30-39.

[22] Han L, Diao L, Yu S, et al. The genomic landscape and clinical relevance of A-to-I RNA editing in human cancers. Cancer Cell. 2015;28(4):515-528.

[23] Qian M, Spada C, Wang X. Detection and application of RNA Editing in Cancer. Adv Exp Med Biol. 2018;1068:159-170.

[24] Lonsdale J, Thomas J, Salvatore M, et al. The Genotype-Tissue Expression (GTEx) project. Nat Genet. 2013;45(6):580-585.

[25] Chigaev M, Yu H, Samuels DC, et al. Genomic positional dissec- tion of RNA editomes in tumor and normal samples. Front Genet. 2019;10:211.

[26] Anaya J. OncoLnc: linking TCGA survival data to mRNAs, miRNAs, and lncRNAs. PeerJ Comput Sci. 2016;2:e67.

[27] Li J, Han L, Roebuck P, et al. TANRIC: an interactive open platform to explore the function of lncRNAs in cancer. Cancer Res. 2015;75(18):3728-3737.

[28] Cerami E, Gao J, Dogrusoz U, et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data (vol 2, pg 401, 2012). Cancer Discov. 2012;2 (10):960.

[29] Liu JF, Lichtenberg T, Hoadley KA, et al. An integrated TCGA pan-cancer clinical data resource to drive high-quality survival outcome analytics. Cell. 2018;173(2):400-+.

[30] Gail MH. Does cardiac transplantation prolong life? A reassessment. Ann Intern Med. 1972;76(5):815-817.

[31] Levesque LE, Deneux-Tharaux C, Brocklehurst P, et al. Problem of immortal time bias in cohort studies: example using statins for preventing progression of diabetes. Br Med J. 2010;340:b5087.

[32] Park HS, Gross CP, Makarov DV, et al. Immortal time bias: a frequently unrecognized threat to validity in the evaluation of post- operative radiotherapy. Int J Radiat Oncol Biol Phys. 2012;83 (5):1365-1373.

[33] Ping J, Oyebamiji O, Yu H. Mutex: a multifaceted gateway for exploring integrative pan-cancer genomic data. Brief Bioinform.

2019. https://academic.oup.com/bib/advance-article/doi/10.1093/ bib/bbz084/5581642

[34] Wang HS, Li GD, Tsai CL. Regression coefficient and autoregres- sive order shrinkage and selection via the lasso. J R Stat Soc Ser B Stat Methodol. 2007;69:63-78.

[35] Zou HAH T. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol. 1995;571:289-300. Key: citeulike:1042553, 2005. .

[36] Cerami EG, Gross BE, Demir E, et al. Pathway Commons, a web resource for biological pathway data. Nucleic Acids Res. 2011;39 (Database issue):D685-690. [cited 2019 May 20]. Available from: https://www.pathwaycommons.org.

[37] Schaefer CF, Anthony K, Krupa S, et al. PID: the pathway interaction database. Nucleic Acids Res. 2009;37(Database issue):D674-9.

[38] Mi H, Poudel S, Muruganujan A, et al. PANTHER version 10: expanded protein families and functions, and analysis tools. Nucleic Acids Res. 2016;44(D1):D336-42.

[39] Yamamoto S, Humphrey JC, Conley AB, et al. INOH: ontology-based highly structured database of signal transduction pathways. Database (Oxford). 2011;2011:bar052.

[40] Durinck S, Moreau Y, Kasprzyk A, et al. BioMart and bioconduc- tor: a powerful link between biological databases and microarray data analysis. Bioinformatics. 2005;21(16):3439-3440.

[41] Hosmer DW, Lemeshow S, Sturdivant RX. Applied logistic regres- sion. In: Applied logistic regression. 3rd ed. Hoboken (NJ): Wiley; 2013; p. 1-500.

[42] Yang K, Hou Y, Li A, et al. Identification of a six-IncRNA signature associated with recurrence of ovarian cancer. Sci Rep. 2017;7(1):752.

[43] Sun J, Yuan Y, Chen J, et al. A potential prognostic long non-coding RNA signature to predict metastasis-free survival of breast cancer patients. Sci Rep. 2015;5:16553.

[44] Zhao J, Xu J, Shang A-Q, et al. A Six-LncRNA expression signa- ture associated with prognosis of colorectal cancer patients. Cell Physiol Biochem. 2018;50(5):1882-1890.

[45] Zhou M, Presnell S, Domico K, et al. A potential signature of eight long non-coding RNAs predicts survival in patients with non-small cell lung cancer. J Transl Med. 2015;13:231.

[46] Zhou M, Sun Y, Sun Y, et al. Comprehensive analysis of IncRNA expression profiles reveals a novel IncRNA signature to discrimi- nate nonequivalent outcomes in patients with ovarian cancer. Oncotarget. 2016;7(22):32433-32448.

[47] ] Zhou M, Rahimi K, Köbel M, et al. A novel IncRNA-focus expres- sion signature for survival prediction in endometrial carcinoma. BMC Cancer. 2018;18(1):39.

[48] Wang XN, Zhou J, Xu M, et al. A 15-lncRNA signature predicts survival and functions as a ceRNA in patients with colorectal cancer. Cancer Manag Res. 2018;10:5799-5806.

[49] Miao RC, Ge C, Zhang X, et al. Combined eight-long noncoding RNA signature: a new risk score predicting prognosis in elderly non-small cell lung cancer patients. Aging-Us. 2019;11(2):467-479.

[50] Mao Y, Fu Z, Zhang Y, et al. A seven-IncRNA signature predicts overall survival in esophageal squamous cell carcinoma. Sci Rep. 2018;8.