Cancer Data Mining Service

Why Perform Cancer Data Mining?

Cancer data mining lets researchers draw on the large volume of public cancer data to extract scientific insights and improve research efficiency. By analyzing these datasets in depth, researchers can uncover potential biological mechanisms and nominate new therapeutic targets or biomarkers without first having to generate costly data of their own. Public databases provide a substantial foundation for this work, spanning multiple layers such as genomics, epigenomics, transcriptomics, and proteomics. This breadth supports progress in fields like precision medicine, drug development, and clinical prognosis analysis.

What Can We Offer?

We offer comprehensive cancer data mining services, encompassing data acquisition, preprocessing, differential expression analysis, mutation analysis, survival analysis, and functional annotation. Through in-depth analysis of cancer genomic data, we help clients uncover critical cancer-related biological information, such as oncogenes, tumor suppressor genes, and mutation hotspots. We also provide customized data mining solutions tailored to each client's research needs, enabling in-depth statistical and functional analysis for particular cancer types or biomarkers. This approach saves clients both research time and cost.

Workflow for Cancer Data Mining

Cancer Data Download

Public cancer data come primarily from several international databases, each offering different types of cancer-related data. The Cancer Genome Atlas (TCGA) is one of the best-known cancer databases, providing genomic, transcriptomic, and epigenetic data from thousands of cancer samples. The Catalogue of Somatic Mutations in Cancer (COSMIC) focuses on somatic mutation information. cBioPortal lets users interactively explore cancer genomic data, including gene mutations and copy number alterations (CNA). The Clinical Proteomic Tumor Analysis Consortium (CPTAC) specializes in cancer proteomics data. The Cancer Imaging Archive (TCIA) provides an extensive collection of radiological and histopathological image datasets. Together these resources cover multiple biological layers of cancer, and researchers can select the database that fits their specific needs for data download.

Cancer Data Mining Content

Differential Expression Analysis: DESeq2; edgeR; limma
Variant Association Analysis: Variant Calling (e.g., GATK, MuTect2, VarScan); Mutation-Expression Correlation; Somatic Mutation Analysis (e.g., MuSiC, MutSigCV)
Single-Cell Analysis: scRNA-seq data analysis (e.g., Seurat, Scanpy); scATAC-seq data analysis
Methylation Analysis: Differential Methylation Regions (e.g., RnBeads, ChAMP); Integration with Gene Expression; Methylation Quantitative Trait Loci (meQTL) Analysis
Integration Analysis: Multi-Omics Data Integration (e.g., iCluster, MINT); Cross-Platform Correlation Analysis
Pathway Enrichment and Functional Annotation: Pathway Enrichment (e.g., KEGG, Reactome, GO); Gene Set Enrichment Analysis (GSEA); Protein-Protein Interaction Network Analysis
Gene Network Analysis: Network Construction (e.g., WGCNA, Cytoscape); Network Module Identification
Non-coding RNA Analysis: Cancer Regulatory Networks (e.g., miRDeep, miRanda); ncRNA Biomarker Discovery
Survival Analysis and Prognostic Model Building: Survival Curve Estimation (e.g., Kaplan-Meier Plot); Cox Proportional Hazards Model
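
Differential expression tools such as DESeq2, edgeR, and limma typically report p-values adjusted for multiple testing, most commonly with the Benjamini-Hochberg (BH) procedure. The sketch below shows that adjustment in plain Python so the idea is visible without any of those packages. The gene names and p-values are hypothetical, and the 0.05 cutoff is an assumed convention, not a setting taken from this service.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five genes.
raw = {"GENE_A": 0.001, "GENE_B": 0.008, "GENE_C": 0.039,
       "GENE_D": 0.041, "GENE_E": 0.6}
adj = dict(zip(raw, benjamini_hochberg(list(raw.values()))))

# Genes that pass an assumed 5% false discovery rate cutoff.
significant = [gene for gene, q in adj.items() if q < 0.05]
print(significant)
```

Note that GENE_C and GENE_D have raw p-values below 0.05 but no longer pass after adjustment, which is the usual reason adjusted rather than raw p-values are used to call differentially expressed genes.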

What Are the Advantages of Our Service?

Proficient Expertise in Sequencing and Omics Data Processing

Our extensive background in cancer data sequencing and multi-omics integration allows us to adeptly handle complex genomic, transcriptomic, and proteomic datasets. This proficiency ensures the extraction of meaningful insights required to address multifaceted biological challenges in oncological research.

Comprehensive Talent Pool and Domain Expertise

Our team is composed of seasoned professionals, including data scientists, statistical analysts, and industry-specific experts. This rich reservoir of talent facilitates in-depth analysis and personalized client support, empowering research initiatives with innovative and scientifically sound solutions.

Innovative Technology and Analytical Tools

By employing state-of-the-art data mining and machine learning technologies, we deliver precise, efficient analyses across various cancer-related datasets. Our workflows incorporate computational models that identify genomic biomarkers and therapeutic targets, giving clients a competitive advantage in cancer research.

Robust Data Security

Understanding the critical nature of cancer data, we deploy comprehensive data security measures and adhere to strict compliance policies. This assures clients of the utmost confidentiality, integrity, and privacy, ensuring their data is well-protected throughout the analysis process.

Integrated Multi-Omics Approach

Our service excels in integrating GWAS, transcriptomics, proteomics, and metabolomics data, providing a holistic view of biological systems. This multidimensional strategy uncovers complex molecular interactions and pathways, advancing understanding of cancer mechanisms and therapeutic targets.

What Does Cancer Data Mining Reveal?

Cancer Copy Number Variation (CNV) Analysis Utilizing the TCGA Database

This study focuses on 26 RNA modification "writers" in colorectal cancer (CRC), including 3 A-I modification "writers", 7 m6A modification "writers", 4 m1A modification "writers", and 12 APA modification "writers". Among the 404 TCGA-COAD samples, 119 (29.46%) exhibit mutations in the RNA modification "writers".

Figure 1. Genetic alterations of RNA modification "writers" in CRC. The mutation rates of 26 RNA modification "writers" in 404 TCGA CRC patients. Each column is a patient. The top graph indicates Tumor Mutation Burden (TMB), while the right graph displays variant type proportions. (Chen, 2021)
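
The cohort-level mutation rate quoted above is the fraction of samples carrying a mutation in at least one gene of the set. A minimal sketch of that calculation follows. The sample IDs and the toy cohort are hypothetical; METTL3 and ADAR are used only as familiar examples of writer genes and are not presented as the study's gene list.

```python
def alteration_frequency(sample_mutations, gene_set):
    """Fraction of samples with a mutation in at least one gene of gene_set."""
    altered = sum(1 for genes in sample_mutations.values() if genes & gene_set)
    return altered / len(sample_mutations)


# Toy cohort: sample ID -> set of mutated genes (all values hypothetical).
cohort = {
    "S1": {"METTL3", "TP53"},
    "S2": {"KRAS"},
    "S3": {"ADAR"},
    "S4": set(),
}
writers = {"METTL3", "ADAR"}  # illustrative writer genes only

freq = alteration_frequency(cohort, writers)  # S1 and S3 are altered

# Same arithmetic as the figure quoted in the text: 119 of 404 samples.
reported = round(119 / 404 * 100, 2)
print(freq, reported)
```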

Cluster Analysis of Transcriptome Expression within the GEO Database

By analyzing data from the GEO database, the transcript expression levels of these 26 writers in over a thousand samples can be divided into two distinct patterns through unsupervised clustering (727 in cluster_1 and 968 in cluster_2).

Figure 2. The heatmap illustrates the correlation between RNA modification "writers" in CRC. Positive correlations are highlighted in red, while negative correlations are depicted in blue. *p < 0.05, **p < 0.01, and ***p < 0.001, as determined by the Spearman correlation analysis. (Chen, 2021)
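
Spearman correlation, the statistic behind the heatmap above, is the Pearson correlation computed on ranks rather than on raw expression values, which makes it robust to skewed expression data. The sketch below is a plain-Python illustration under that definition. The expression vectors are hypothetical, and p-value calculation is left out.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical expression of three genes across five samples.
expr_a = [2.1, 3.4, 5.0, 8.2, 9.9]
expr_b = [1.0, 1.5, 4.0, 30.0, 31.0]  # rises with expr_a, but not linearly
expr_c = [9.0, 7.5, 7.0, 3.0, 1.0]    # falls as expr_a rises

rho_ab = spearman(expr_a, expr_b)
rho_ac = spearman(expr_a, expr_c)
print(rho_ab, rho_ac)
```

Because only the ordering matters, expr_a and expr_b are perfectly correlated in the Spearman sense even though their relationship is far from linear.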

Enrichment Analysis of Differentially Expressed Genes

The enrichment analysis of differentially expressed genes between Cluster 1 and Cluster 2 reveals that the former is enriched in stromal and carcinogenic activation, TGF-β signaling pathway, and cell adhesion pathways, whereas Cluster 2 is enriched in proliferation and apoptosis, as well as mismatch repair pathways.

Figure 3. A heatmap from the GSVA enrichment analysis shows biological pathway activation in various RNA modification patterns. Red for activated pathways, blue for inhibited. CRC cohort names serve as sample annotations. (Chen, 2021)
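
The figure above uses GSVA, which scores pathways per sample. A simpler and widely used way to ask whether a list of differentially expressed genes is enriched for a pathway is an over-representation test based on the hypergeometric distribution. The sketch below shows that simpler test, not GSVA itself, and all counts in it are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe, n_pathway, n_selected, n_overlap):
    """One-sided hypergeometric p-value for pathway over-representation.

    n_universe : number of genes tested in total
    n_pathway  : number of those genes annotated to the pathway
    n_selected : number of differentially expressed genes
    n_overlap  : differentially expressed genes that are in the pathway
    Returns P(overlap >= n_overlap) under random selection.
    """
    upper = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 genes tested, 5 in the pathway,
# 4 differentially expressed, 3 of which fall in the pathway.
p = enrichment_p_value(20, 5, 4, 3)
print(round(p, 4))
```

In practice this test is run once per pathway, so the resulting p-values also need multiple-testing correction before they are interpreted.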

Survival Analysis Reveals Survival Probability in Different Patterns

The two clusters exhibit significant differences in survival time in CRC, with patients in Cluster 1 having a lower survival rate. This preliminarily suggests that changes in writers can lead to alterations in certain tumor-related gene sets, which indeed result in significant differences in patient survival rates.

Figure 4. Kaplan-Meier curves compare overall survival between RNA modification patterns, Cluster_1 (red) and Cluster_2 (blue), in GSE39582. CRC sample groupings are shown below, with p < 0.05 in the log-rank test indicating statistical significance. (Chen, 2021)
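
A Kaplan-Meier curve like the one above is built by multiplying, at each observed event time, the fraction of at-risk patients who survive that time. The sketch below implements that product-limit estimate in plain Python. The follow-up times and event flags are hypothetical, and group comparison (the log-rank test) is not covered here.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  : follow-up time for each patient
    events : 1 if the event (death) was observed, 0 if the patient was censored
    """
    survival = 1.0
    curve = []
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical cohort of six patients (time in months).
times = [5, 8, 12, 12, 20, 25]
events = [1, 0, 1, 1, 0, 1]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored patients (those with an event flag of 0) never cause a drop in the curve, but they do count toward the at-risk denominator up to their last follow-up time.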

Title: Single Cell Analysis Unveils B Cell-Dominated Immune Subtypes in HNSCC for Enhanced Prognostic and Therapeutic Stratification

Publication: International Journal of Oral Science

Main Methods: scRNA-seq data mining, Gene Ontology (GO) enrichment analysis, cell-cell communication analysis

Abstract: This study explored the role of tumor-infiltrating B cells (TIL-Bs) in head and neck squamous cell carcinoma (HNSCC) using single-cell RNA sequencing data. The analysis identified two B cell-dominated immune subtypes in HNSCC, defined as B cell activation and B cell inhibition groups. These groups were associated with different clinical outcomes and immune responses, including the response to immune checkpoint blockade (ICB) therapy. A four-gene prognostic model (JCHAIN, GZMB, IGHA1, and PRDX4) was developed to predict overall survival, showing significant potential for therapeutic stratification. The study highlights the importance of B cell-driven immune responses in shaping HNSCC prognosis and response to immunotherapy.

Research Results:

Identification of B Cell-Dominated Immune Subtypes in HNSCC scRNA-seq Dataset

This study selected 26 HNSCC samples from GSE139324, and after quality control, obtained an overall immune cell atlas. Clustering was performed using Seurat, identifying major immune cell types such as B cells, NK cells, myeloid cells, and T cells. Subsequently, B cells were further subdivided, identifying 15 subclusters. Following this, ConsensusClusterPlus was utilized for new unsupervised clustering of the HNSCC samples, conducting feature selection and reclustering the samples. Principal component analysis further divided the samples into two groups based on gene sets, referred to as the B cell activation group and the B cell inhibition group.

Figure 5. Single-cell analysis reveals immune subtypes derived from TIL-Bs. (a) UMAP of tumor-infiltrating immune cells from 26 HNSCC scRNA-seq samples, grouped by clusters. (b) UMAP showing classical marker gene expression in immune clusters. (c) Heatmap of signature genes per immune cluster, with five key genes per cluster. (d) Bar chart of immune cell type proportions in HNSCC samples. (e) UMAP of 15 B cell clusters. (f) Two subgroups identified via ConsensusClusterPlus using marker genes. (g) t-SNE of the HNSCC cohort, grouped by sample. (Li, 2024)

Characterization of B Cell Activation and Inhibition Groups

Differential gene analysis identified 43 differentially expressed genes between the B cell activation group and the B cell inhibition group. Among these, 21 upregulated genes were defined as B cell activation gene signatures (BCAGS) and used for subsequent validation analyses. Most upregulated genes were associated with B cell activation, whereas downregulated genes were related to B cell inhibition. According to GO enrichment analysis, the upregulated genes were highly enriched in B cell activation, B cell receptor signaling pathways, and the regulation of B cell activation pathways. To systematically study cell-cell communication between the B cell activation and inhibition groups, the authors conducted an unbiased analysis of the overall and differential number and strength of "ligand-receptor" interactions using CellChat. The results indicated an increased overall strength of immune cell interactions in the B cell activation group.

Figure 6. Characterization of immunological subtypes based on TIL-Bs. (a) Volcano plot of DEGs between B cell activation and inhibition groups, highlighting downregulated, upregulated, and non-significant genes. (b) Violin plot showing IGHG1, IGHA1, CD24, and IGHM expression in both groups. (c) GO cluster plot with a chord dendrogram of upregulated gene clusters based on B-cell signatures. (d) Circle plots displaying interaction number and strength differences in cell-cell communication between groups. (Li, 2024)
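
A volcano plot assigns each gene to an upregulated, downregulated, or non-significant class from two numbers: its log2 fold change and its adjusted p-value. The sketch below shows one common way to make that call. The thresholds (absolute log2 fold change of at least 1 and adjusted p-value below 0.05) are assumptions, since the cutoffs used in the study are not stated here, and the gene names and values are hypothetical.

```python
def classify_deg(log2_fc, adj_p, fc_cutoff=1.0, p_cutoff=0.05):
    """Label one gene for a volcano plot using assumed cutoffs."""
    if adj_p < p_cutoff and log2_fc >= fc_cutoff:
        return "Up"
    if adj_p < p_cutoff and log2_fc <= -fc_cutoff:
        return "Down"
    return "NoSignifi"


# Hypothetical results: gene -> (log2 fold change, adjusted p-value).
results = {
    "GENE_1": (2.3, 0.001),    # large increase, significant
    "GENE_2": (-1.8, 0.010),   # large decrease, significant
    "GENE_3": (0.4, 0.002),    # significant but small effect
    "GENE_4": (3.0, 0.200),    # large effect but not significant
}

labels = {gene: classify_deg(fc, p) for gene, (fc, p) in results.items()}
upregulated = [gene for gene, label in labels.items() if label == "Up"]
print(labels)
```

Requiring both conditions is what keeps genes such as GENE_3 and GENE_4 out of the signature: a small but significant change and a large but noisy change are both treated as non-significant.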

Validation of B Cell Subtype Classification in Another scRNA-seq Dataset

To further demonstrate the advantages of using the identified B-cell signature genes to classify HNSCC patients, the authors collected another HNSCC single-cell dataset (GSE164690) as a validation cohort. After feature selection and re-clustering, patients in this cohort were again consistently divided into two distinct groups. The volcano plot illustrated the transcriptomic differences between the B-cell activation group and the B-cell inhibition group, revealing 29 upregulated and 101 downregulated genes. GO enrichment analysis showed that the upregulated DEGs were significantly enriched in the circulating immunoglobulin complex, immunoglobulin receptor binding, and regulation of B-cell activation terms. Consistent with the previous cohort, the upregulated genes were predominantly related to B-cell activation.

Figure 7. Validation of classification method based on TIL-Bs in another dataset. (a) Two subgroups were identified using the ConsensusClusterPlus R package based on B-cell signature genes. (b) The t-SNE plot displays the HNSCC scRNA-seq cohort (n = 15), with samples color-coded by group. (c) The volcano plot highlights differentially expressed genes between the B-cell activation and inhibition groups, showing "Down," "NoSignifi," and "Up" regulated DEGs. (d) A GO cluster plot, featuring a chord dendrogram, illustrates the clustering of significantly upregulated genes in B cells based on B-cell signature classification. (e) A violin plot demonstrates the expression of IGHG1, IGHA1, CD24, and IGHM in the B-cell activation vs. inhibition groups. (Li, 2024)

Application of B Cell Classification in TCGA HNSCC Cohort

To evaluate whether this B-cell signature gene classification approach is applicable to TCGA HNSCC, the authors performed unsupervised clustering of the RNA-seq expression profiles from 501 patient samples. This resulted in two clusters based on BCAGS, and the t-SNE plot confirmed that the samples separated into two groups. Differential gene analysis identified 388 upregulated and 42 downregulated genes between the two TCGA HNSCC patient groups. GO functional enrichment analysis of DEGs in the B-cell activation group revealed significant enrichment in lymphocyte activation and immune activation pathways. Subsequently, the authors linked these groupings to clinical information to explore the prognostic value of this tumor classification. Kaplan-Meier curves indicated significantly better survival outcomes for patients in the B-cell activation group. ssGSEA was used to estimate the relative abundance of 28 infiltrating immune cell populations, showing higher immune cell infiltration in the B-cell activation group.

Figure 8. Application of classification method based on TIL-Bs in TCGA cohort. (a) Consensus matrix for the TCGA HNSCC cohort (n = 501). (b) t-SNE plot showing TCGA samples divided into two clusters. (c) Volcano plot of differentially expressed genes between cluster 1 and 2. (d) GO cluster plot with a chord dendrogram of significantly upregulated genes in the B-cell activation group. (e) Kaplan-Meier survival curves for TCGA HNSCC patients in B-cell activation vs. inhibition groups, stratified by B-cell signature genes. (f) Enrichment levels of 28 immune-related cells in B-cell activation and inhibition groups via ssGSEA. (Li, 2024)

Development of a Four-Gene Prognostic Model in HNSCC

The authors used LASSO regression analysis to refine BCAGS, yielding a four-gene set (JCHAIN, GZMB, IGHA1, and PRDX4) from which they constructed a prognostic risk model. Among these, JCHAIN, GZMB, and IGHA1 were significantly upregulated in the low-risk group, while PRDX4 was significantly upregulated in the high-risk group. Subsequently, using univariate Cox regression analysis in the TCGA cohort, the authors found that the risk score could independently predict patient prognosis, unaffected by traditional clinical factors.

Figure 9. Unveiling a four-gene risk model for prognostication in HNSCC patients. (a) Boxplot showing modeling gene expression levels in two groups. (b) Forest plot of OS-related clinical factors from univariate Cox regression. (Li, 2024)
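
Gene-signature risk models of this kind usually score each patient as a weighted sum of the signature genes' expression values and then split the cohort into high-risk and low-risk groups. The sketch below illustrates that idea only. The coefficients are hypothetical, since the fitted values are not reported here; their signs simply follow the direction described in the text (three genes higher in the low-risk group, one higher in the high-risk group). The patient expression values and the median split are likewise assumptions.

```python
from statistics import median

# Hypothetical coefficients: negative = protective, positive = risk-increasing.
coefficients = {"JCHAIN": -0.10, "GZMB": -0.15, "IGHA1": -0.05, "PRDX4": 0.20}


def risk_score(expression, coefs):
    """Weighted sum of signature gene expression for one patient."""
    return sum(coefs[gene] * expression[gene] for gene in coefs)


# Hypothetical normalized expression values for four patients.
patients = {
    "P1": {"JCHAIN": 8.0, "GZMB": 6.0, "IGHA1": 9.0, "PRDX4": 3.0},
    "P2": {"JCHAIN": 2.0, "GZMB": 1.0, "IGHA1": 3.0, "PRDX4": 7.0},
    "P3": {"JCHAIN": 5.0, "GZMB": 4.0, "IGHA1": 6.0, "PRDX4": 5.0},
    "P4": {"JCHAIN": 1.0, "GZMB": 2.0, "IGHA1": 2.0, "PRDX4": 8.0},
}

scores = {pid: risk_score(expr, coefficients) for pid, expr in patients.items()}

# Assumed stratification rule: split the cohort at the median score.
cutoff = median(scores.values())
groups = {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}
print(groups)
```

The resulting groups are what a Kaplan-Meier comparison or a Cox regression would then be run on to check whether the score carries prognostic information.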

Conclusion

This study identified two distinct B cell-driven immune subtypes in HNSCC, showing that the B cell activation group is associated with better survival, stronger immune infiltration, and enhanced immunotherapy response. A four-gene model (JCHAIN, GZMB, IGHA1, PRDX4) was developed to predict patient outcomes and stratify risk. The findings underscore the critical role of B cells in shaping tumor immunity and offer potential biomarkers for improving prognosis in HNSCC.

1. Why Is Cancer Data Mining Important for Research?

Cancer data mining is essential for understanding the complex biology of cancer. By analyzing large datasets, researchers can identify patterns that lead to the discovery of new biomarkers and improve prognostic models.

2. What Are the Main Applications of Cancer Data Mining?

Cancer Data Mining is applied in various areas such as identifying genetic mutations linked to cancer, predicting treatment responses, discovering novel therapeutic targets, and improving early detection strategies. It plays a critical role in precision oncology, enabling personalized treatment plans based on individual patient profiles.

3. What Types of Data Are Utilized in Cancer Data Mining?

Cancer Data Mining employs a diverse array of data types, including genomic data (DNA/RNA sequencing), clinical data (patient records and treatment histories), imaging data (MRI, CT scans), pathology data (histopathological assessments), and environmental data (risk factors). Each data type contributes valuable insights for comprehensive analysis.

4. How Does Machine Learning Enhance Cancer Data Mining?

Machine learning enhances Cancer Data Mining by enabling the analysis of complex datasets, identifying hidden patterns, and making predictions about treatment outcomes. Algorithms can classify tumors, predict patient responses based on genetic profiles, and support the development of personalized treatment strategies.

5. What Are the Challenges of Cancer Data Mining?

The challenges of Cancer Data Mining include data quality and consistency issues, privacy and security concerns, the complexity of integrating diverse data sources, and the need for standardized data generation practices. Addressing these challenges is crucial for enhancing the reliability and applicability of research findings in clinical settings.

References

  1. Chen, H., et al. Cross-talk of four types of RNA modification writers defines tumor microenvironment and pharmacogenomic landscape in colorectal cancer. Molecular Cancer. 2021, 20(1), 29.
  2. Li, K., et al. Single cell analysis unveils B cell-dominated immune subtypes in HNSCC for enhanced prognostic and therapeutic stratification. International Journal of Oral Science. 2024, 16(1), 29.
* For Research Use Only. Not for use in diagnostic procedures.