UMAP and HDBSCAN Clustering for Kidney Cancer Gene Expression Data

Introduction

In this analysis, we'll explore gene expression data from kidney cancer samples using UMAP for dimensionality reduction and HDBSCAN for clustering. We'll go through the process of data preprocessing, visualization, and clustering to uncover patterns in the data.

1. Data Preprocessing

1.1 Loading and Merging Data

We start by loading gene expression data and clinical information:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import OmicsUtils.DimRedMappers.clusterer
import OmicsUtils.DimRedMappers.umap_embedders

fpkm_unstr_df_with_labels = pd.read_csv('./Transcriptomics/data/processed_data/fpkm_unstr_data_with_labels.csv')
exposure_tsv = pd.read_csv('./Transcriptomics/data/clinical.cart.2023-10-29/exposure.tsv', sep='\t')

ge_kidney_cancer_data_with_tgca_labels = pd.merge(fpkm_unstr_df_with_labels,
                                                  exposure_tsv[['case_submitter_id', 'project_id']],
                                                  left_on='Case_ID',
                                                  right_on='case_submitter_id')
    

1.2 Data Transformation

We apply log transformation to gene expression values and visualize the distribution:

description_df = fpkm_unstr_df_with_labels.iloc[:, :60660].describe().reset_index()
showing_value_range = description_df[description_df['index'].isin(['min', 'mean', 'max'])].iloc[:,1:].values.flatten()

ax = sns.distplot(showing_value_range)
ax.set_title("Distribution of min, mean, max of all gene expression values")
    
Distribution of gene expression values

2. Data Splitting

We split our data into training, validation, and test sets:

from sklearn.model_selection import train_test_split

# Split data into training+validation and test sets
train_val_ids, test_ids = train_test_split(ml_df_without_overlap.index, test_size=0.2, random_state=42, stratify=ml_df_without_overlap['project_id'])

# Further split training+validation into training and validation sets
train_ids, val_ids = train_test_split(train_val_ids, test_size=0.2, random_state=42, stratify=ml_df_without_overlap.loc[train_val_ids, 'project_id'])

# Create a DataFrame to store the split information
train_val_test_ids_df = pd.DataFrame({'Case_ID': ml_df_without_overlap.index, 'split': 'train'})
train_val_test_ids_df.loc[val_ids, 'split'] = 'val'
train_val_test_ids_df.loc[test_ids, 'split'] = 'test'

# Merge with project_id information
train_val_test_ids_df = pd.merge(train_val_test_ids_df,
                                 ml_df_without_overlap[['project_id']],
                                 left_on='Case_ID',
                                 right_on='Case_ID')

# Visualize the split
plt.figure(figsize=(10,5))
ax = sns.countplot(data=train_val_test_ids_df, x='project_id', hue='split')
ax.set_title('Samples count in KC tumor types training/testing/validation')
plt.show()
    

3. UMAP Dimensionality Reduction

We use UMAP to reduce the dimensionality of our gene expression data:

umap_embedder = OmicsUtils.DimRedMappers.umap_embedders.UMAPEmbedder(n_neighbors=15, min_dist=0.1, n_components=2, metric='cosine')
umap_embedding = umap_embedder.fit_transform(ml_df_without_overlap[gene_cols])

umap_embedding_df = pd.DataFrame(umap_embedding, columns=['embed_x', 'embed_y'])
umap_embedding_df['tumor_type_labels'] = ml_df_without_overlap['project_id']
umap_embedding_df['Sample_Type'] = ml_df_without_overlap['Sample_Type']

ax1 = sns.scatterplot(data=umap_embedding_df, x="embed_x", y="embed_y", hue="tumor_type_labels")
ax1.set_title("UMAP embedding of gene expression data")
plt.show()
    

4. HDBSCAN Clustering

We apply HDBSCAN clustering to our UMAP-reduced data:

clusterer = OmicsUtils.DimRedMappers.clusterer.HDBSCANClusterer(min_cluster_size=5, min_samples=5)
cluster_labels = clusterer.fit_predict(umap_embedding)

umap_embedding_df['cluster_labels'] = cluster_labels

ax3 = sns.scatterplot(data=umap_embedding_df, x="embed_x", y="embed_y", hue="cluster_labels")
ax3.set_title("HDBSCAN Clustering with bayesian optimized clusters")
plt.show()
    

5. Visualization of Results

Finally, we create a combined plot to compare the UMAP embeddings colored by tumor type, sample type, and cluster labels:

fig, axes = plt.subplots(3, 1, figsize=(10, 20))

sns.scatterplot(ax=axes[0], data=umap_embedding_df, x='embed_x', y='embed_y', hue='tumor_type_labels')
sns.scatterplot(ax=axes[1], data=umap_embedding_df, x='embed_x', y='embed_y', hue='Sample_Type')
sns.scatterplot(ax=axes[2], data=umap_embedding_df, x='embed_x', y='embed_y', hue='cluster_labels')
fig.suptitle('UMAP(Cosine) + HDBSCAN on Kidney Cancer Gene Expression Data', fontsize=20)
fig.tight_layout()
fig.subplots_adjust(top=0.95)
plt.show()
    

Conclusion

In this analysis, we've demonstrated how to use UMAP for dimensionality reduction and HDBSCAN for clustering on kidney cancer gene expression data. The visualizations reveal interesting patterns and potential subgroups within the data, which could have implications for understanding different types or stages of kidney cancer. Further investigation into the biological significance of these clusters could provide valuable insights for cancer research and treatment strategies.