Prediction of origin of cancer by deep-learning analysis using pan-cancer transcriptome

We introduce DEARGEN’s research results on a model that predicts the primary site of a cancer, including metastatic cancer, from its gene expression profile (GEP).


Cancer metastasis is the spread of primary tumor cells to other organs and is regarded as the final stage of cancer. Until about 10 years ago, cancer research focused on the mechanisms behind the incidence and progression of primary cancers, which produced remarkable advances in treating patients with early-stage cancer. However, comparatively little research has addressed metastasis, which accounts for more than 90% of cancer patient deaths. Because treatment prognosis and length of survival vary greatly depending on whether metastasis is present, metastasis research is key to cancer research.

Even within the same type of cancer, cancer cells differ according to the individual’s genomic information. These differences strongly affect whether the cancer metastasizes and how well a drug works. Therefore, if metastasis can be diagnosed accurately by identifying the genes associated with it, personalized therapy based on that genetic information becomes possible and the effect of treatment can be maximized.

In this article, we introduce a deep learning model, developed by DEARGEN to address these issues, that uses GEP data to predict the primary site of both primary and metastatic cancers.


Cancers that arise at the same site share a common gene expression profile (GEP). In clinical practice, however, it is difficult to find the site where a cancer first arose from its GEP, because there is no effective algorithm for handling cancer transcriptome databases. DEARGEN therefore developed a deep neural network model, which enables more accurate predictions than existing linear regression models. We used data from TCGA (The Cancer Genome Atlas) and GTEx (The Genotype-Tissue Expression project).


GEP data for 35 cancer types (N = 9,840) were obtained from Broad GDAC Firehose, and GEP data for normal tissue (N = 10,376) were obtained from GTEx. All samples were grouped into 18 classes according to the similarity of their sites of origin. As an external validation set, we used Gene Expression Omnibus (GEO) data (GSE28702, N = 83).
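To make the grouping step concrete, here is a minimal sketch of how cohort labels might be collapsed into site classes and encoded as integers. The mapping below is hypothetical: the article does not list its 18 classes, so the four groups shown (built from standard TCGA cohort codes) and the helper name `encode_sites` are illustrative assumptions only.

```python
# Sample counts as reported in the article.
N_CANCER = 9840
N_NORMAL = 10376
N_TOTAL = N_CANCER + N_NORMAL  # 20216 samples in total

# HYPOTHETICAL grouping of TCGA cohort codes into site classes.
# The real model uses 18 classes that are not enumerated in the article.
SITE_OF_COHORT = {
    "LUAD": "lung", "LUSC": "lung",
    "COAD": "colorectal", "READ": "colorectal",
    "KIRC": "kidney", "KIRP": "kidney", "KICH": "kidney",
    "THCA": "thyroid",
}

def encode_sites(cohorts):
    """Map cohort codes to site classes, then to integer class labels."""
    sites = [SITE_OF_COHORT[c] for c in cohorts]
    class_names = sorted(set(sites))
    index = {name: i for i, name in enumerate(class_names)}
    return [index[s] for s in sites], class_names

cohorts = ["LUAD", "LUSC", "COAD", "KIRC", "READ", "THCA"]
labels, class_names = encode_sites(cohorts)
print(class_names)  # ['colorectal', 'kidney', 'lung', 'thyroid']
print(labels)       # [2, 2, 0, 1, 0, 3]
```

Grouping by site rather than by cohort means that, for example, two lung cohorts share one label, which is what lets the classifier answer "where did this arise?" rather than "which study is this from?".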

DEARGEN developed a model consisting of five neural network layers. In short, a simple multi-layer perceptron (MLP) uses nonlinear computation to capture the GEP pattern of a sample, and batch normalization and dropout layers counter overfitting during training. We trained and validated the model on 18,221 samples (approximately 90% of the data) and evaluated its performance on the remaining 1,995 samples (approximately 10%). Using five-fold cross-validation, we selected the model with the highest accuracy on the validation data. We trained with the Adam optimizer at a learning rate of 0.01, using multi-task learning with a loss function based on binary cross-entropy (log loss).
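The building blocks named above can be sketched in a few lines. What follows is a minimal pure-Python forward pass, not DEARGEN's implementation: it assumes a single hidden layer, a ReLU activation, a dropout rate of 0.5, one 18-way softmax output, and toy dimensions (20 genes, 8 hidden units), none of which are stated in this article. Training with Adam and the multi-task loss are omitted.

```python
import math
import random

def linear(x, w, b):
    """Affine layer: x is a batch of rows, w has shape (n_out, n_in)."""
    return [[sum(wi * xi for wi, xi in zip(row_w, row_x)) + bj
             for row_w, bj in zip(w, b)] for row_x in x]

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch (no learned scale/shift,
    no running statistics; batch statistics are always used here)."""
    n = len(x)
    cols = list(zip(*x))
    means = [sum(c) / n for c in cols]
    variances = [sum((v - m) ** 2 for v in c) / n for c, m in zip(cols, means)]
    return [[(v - m) / math.sqrt(s + eps)
             for v, m, s in zip(row, means, variances)] for row in x]

def relu(x):
    return [[max(0.0, v) for v in row] for row in x]

def dropout(x, p, rng):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    keep = 1.0 - p
    return [[v / keep if rng.random() < keep else 0.0 for v in row] for row in x]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, labels):
    """Mean negative log-probability of the true class."""
    return -sum(math.log(softmax(row)[y])
                for row, y in zip(logits, labels)) / len(labels)

def init_layer(n_in, n_out, rng):
    scale = 1.0 / math.sqrt(n_in)
    w = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def forward(x, params, rng, p_drop=0.5, training=True):
    (w1, b1), (w2, b2) = params
    h = relu(batch_norm(linear(x, w1, b1)))
    if training:  # dropout is applied only during training
        h = dropout(h, p_drop, rng)
    return linear(h, w2, b2)  # class logits

# Toy dimensions (assumptions); only the 18 classes come from the article.
N_GENES, N_HIDDEN, N_CLASSES, BATCH = 20, 8, 18, 4
rng = random.Random(0)
params = [init_layer(N_GENES, N_HIDDEN, rng), init_layer(N_HIDDEN, N_CLASSES, rng)]
x = [[rng.gauss(0.0, 1.0) for _ in range(N_GENES)] for _ in range(BATCH)]
labels = [rng.randrange(N_CLASSES) for _ in range(BATCH)]

logits = forward(x, params, rng)
loss = cross_entropy(logits, labels)
print(len(logits), len(logits[0]), loss > 0)
```

In practice a deep learning framework would supply these layers, together with gradient computation, the optimizer, and running batch statistics for inference; the sketch only shows how normalization and dropout sit between the linear layers.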


On the test set of primary cancers, the model predicted the site of origin with an overall accuracy of 99.15% (1,978 correct predictions out of 1,995 samples). Notably, accuracy was 100% for cancers arising in the lung, colon, kidney, head and neck, bladder, pancreas, skin (melanocytes), and cervix. For TCGA metastatic cancer samples, the primary site was predicted with 85.31% accuracy (337 out of 397 samples). In particular, all metastatic thyroid cancer samples (N = 8) were predicted correctly.

In addition, the model, which was developed from RNA sequencing data tailored to the microarray platform, identified the origin of colorectal cancers that had metastasized to various sites with an accuracy of 95.18% (79 of 83 samples; GSE28702).
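The accuracy figures are simple ratios of correct predictions to samples. As a small sketch, the helper below reproduces the reported percentages for the primary test set and the GEO set from their counts, and shows how per-site accuracy could be tallied. The function names and the toy label lists are illustrative assumptions, not part of DEARGEN's code or data.

```python
from collections import Counter

def accuracy_percent(correct, total):
    """Overall accuracy as a percentage, rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * correct / total, 2)

def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of correct predictions for each true class."""
    totals, hits = Counter(), Counter()
    for t, p in zip(true_labels, predicted_labels):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return {c: hits[c] / totals[c] for c in totals}

# Counts as reported in the article.
print(accuracy_percent(1978, 1995))  # 99.15 (primary cancer test set)
print(accuracy_percent(79, 83))      # 95.18 (GSE28702)

# Toy labels (hypothetical) to illustrate the per-site breakdown.
true_sites = ["lung", "lung", "colon", "thyroid"]
pred_sites = ["lung", "colon", "colon", "thyroid"]
print(per_class_accuracy(true_sites, pred_sites))
# {'lung': 0.5, 'colon': 1.0, 'thyroid': 1.0}
```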


DEARGEN’s deep learning model predicted the site of origin of primary cancers with nearly 100% accuracy (99.15% on the test set) by analyzing GEP data from cancer and normal tissue. The model therefore has the potential to serve as a clinical tool for locating the origin of metastatic cancers when current pathological examination cannot determine where the cancer arose. It may also help identify commonalities and differences between sites of origin in research on the etiology of cancer.