Gene Set Enrichment Analysis (GSEA) for Head and Neck Cancer

1. Setting Up the Environment

First, we set up our working environment and import the necessary libraries:

import pandas as pd
from importlib import reload
import src.Engines.analysis_engine as analysis_engine
import src.Connectors.gcp_bigquery_utils as gcp_bigquery_utils
reload(analysis_engine)
reload(gcp_bigquery_utils)

2. Downloading Dataset from BigQuery

We download the dataset for a specific primary diagnosis and primary site:

project_id = 'rnaseqml'
dataset_id = 'rnaseqexpression'
table_id = 'expr_clustered_08082024'
bq_queries = gcp_bigquery_utils.BigQueryQueries(project_id=project_id, 
                                              dataset_id=dataset_id,
                                              table_id=table_id)

pr_site = 'Head and Neck'
pr_diag = 'Squamous cell carcinoma, NOS'
data_from_bq = bq_queries.get_df_for_pydeseq(primary_site=pr_site, primary_diagnosis=pr_diag)

Note: The dataset contains 407 rows and 7 columns, including case IDs, primary site, sample type, and gene expression data.

3. Data Preprocessing for PyDeSeq and GSEA

We preprocess the data for differential expression analysis:

analysis_eng = analysis_engine.AnalysisEngine(data_from_bq, analysis_type='DE')
if not analysis_eng.check_tumor_normal_counts():
    raise ValueError("Tumor and Normal counts should be at least 10 each")
gene_ids_or_gene_cols_df = pd.read_csv('/Users/abhilashdhal/Projects/personal_docs/data/Transcriptomics/data/gene_annotation/gene_id_to_gene_name_mapping.csv')
gene_ids_or_gene_cols = list(gene_ids_or_gene_cols_df['gene_id'].to_numpy())

exp_df = analysis_eng.expand_data_from_bq(data_from_bq, gene_ids_or_gene_cols=gene_ids_or_gene_cols, analysis_type='DE')
metadata = analysis_eng.metadata_for_pydeseq(exp_df=exp_df)
counts_for_de = analysis_eng.counts_from_bq_df(exp_df, gene_ids_or_gene_cols)

4. Running DESeq2 Analysis

We perform differential expression analysis using DESeq2:

res_pydeseq = analysis_eng.run_pydeseq(metadata=metadata, counts=counts_for_de)

Note: The DESeq2 analysis compares tumor samples against normal samples, providing log2 fold changes and adjusted p-values for each gene.

5. Preparing Data for GSEA

We merge the DESeq2 results with gene names for GSEA:

res_pydeseq_with_gene_names = pd.merge(res_pydeseq, gene_ids_or_gene_cols_df, left_on='index', right_on='gene_id')

6. Running GSEA

Finally, we perform Gene Set Enrichment Analysis using the KEGG pathway database:

from gseapy.plot import gseaplot
import gseapy as gp
from gseapy import dotplot

gene_set = 'KEGG_2021_Human'
result, plot, pre_res = analysis_eng.run_gsea(res_pydeseq_with_gene_names, gene_set)

Conclusion

In this analysis, we've performed a comprehensive gene expression analysis of Head and Neck Squamous Cell Carcinoma. We started by retrieving data from BigQuery, preprocessed it for differential expression analysis, ran DESeq2 to identify differentially expressed genes, and finally performed GSEA to identify enriched pathways in our dataset. The GSEA results provide insights into the biological processes that may be altered in this cancer type, potentially revealing new therapeutic targets or biomarkers.