Predicting DNA Methylation in AML: An EpiBench Case Study
This case study demonstrates how EpiBench can be applied to predict DNA methylation patterns in Acute Myeloid Leukemia (AML) samples. Last updated: 2024-10-15
Research Question
Can a CNN model trained on CD34+ cell data accurately predict DNA methylation patterns in AML samples based on histone modification data?
This question addresses a key challenge in epigenetic research: the transferability of predictive models across different cell types and disease states. If successful, such an approach could reduce the need for costly whole-genome bisulfite sequencing (WGBS) in clinical samples.
Dataset Description
Training Data: CD34+ Hematopoietic Progenitor Cells
The model was trained using data from healthy CD34+ hematopoietic progenitor cells:
- DNA Methylation: Whole-genome bisulfite sequencing data (>25x coverage)
- Histone Modifications: ChIP-seq data for H3K4me3, H3K27ac, H3K27me3, and H3K9me3
- Total Sample Size: 5 donors, with ~1 million CpG sites analyzed
Test Data: AML Patient Samples
The trained model was evaluated using data from AML patient samples:
- DNA Methylation: WGBS data (ground truth for evaluation)
- Histone Modifications: ChIP-seq data for the same histone marks as the training data
- Total Sample Size: 10 AML patients with different genetic backgrounds (4 with FLT3-ITD, 3 with NPM1 mutations, 3 with complex karyotypes)
Analysis Workflow
1. Data Preparation
First, we prepared the data for both the training and testing phases:
# Load data for CD34+ cells (training)
cd34_data = epibench.data.load_multi_omics_dataset(
methylation_file="cd34_methylation.bed",
histone_files={
"H3K4me3": "cd34_h3k4me3.bw",
"H3K27ac": "cd34_h3k27ac.bw",
"H3K27me3": "cd34_h3k27me3.bw",
"H3K9me3": "cd34_h3k9me3.bw"
},
reference_genome="hg38.fa"
)
# Load data for AML samples (testing)
aml_data = epibench.data.load_multi_omics_dataset(
methylation_file="aml_methylation.bed",
histone_files={
"H3K4me3": "aml_h3k4me3.bw",
"H3K27ac": "aml_h3k27ac.bw",
"H3K27me3": "aml_h3k27me3.bw",
"H3K9me3": "aml_h3k9me3.bw"
},
reference_genome="hg38.fa"
)
# Split CD34+ data into training and validation sets
train_data, val_data = cd34_data.train_val_split(val_ratio=0.2)
2. Model Architecture
We designed a multi-branch CNN specifically for this task:
# Configure model architecture
model = epibench.models.MultibranchCNN(
sequence_length=1000,
n_histone_marks=4,
kernel_sizes=[5, 15, 25], # Capture patterns at different scales
filters=[64, 128, 64],
dropout_rate=0.3,
task="regression" # Predicting methylation levels (0-1)
)
# Configure training parameters
model.configure_training(
optimizer="adam",
learning_rate=0.001,
batch_size=128,
n_epochs=100,
early_stopping=True,
patience=10
)
3. Training
We trained the model on CD34+ data with careful monitoring for overfitting:
# Train the model
history = model.train(
train_data=train_data,
val_data=val_data,
save_best_model=True,
model_path="cd34_to_aml_methylation_model.pt"
)
# Visualize training progress
epibench.visualization.plot_training_history(history)
4. Cross-Sample Evaluation
The critical test was applying the CD34+-trained model to AML samples:
# Evaluate model on AML data
aml_performance = model.evaluate(aml_data)
print(f"Mean Squared Error on AML samples: {aml_performance['mse']:.4f}")
print(f"Pearson correlation: {aml_performance['pearson']:.4f}")
print(f"Spearman correlation: {aml_performance['spearman']:.4f}")
# Generate predictions
aml_predictions = model.predict(aml_data)
# Compare predictions to actual values
epibench.visualization.create_scatter_plot(
actual=aml_data.methylation_values,
predicted=aml_predictions,
title="Predicted vs. Actual DNA Methylation in AML Samples"
)
Results
Performance Metrics
The model achieved promising results when transferring from CD34+ to AML samples:
Metric | CD34+ Validation | AML Test |
---|---|---|
Mean Squared Error | 0.042 | 0.076 |
Pearson Correlation | 0.87 | 0.79 |
Spearman Correlation | 0.82 | 0.74 |
Patient-Specific Variations
We observed interesting variations in prediction accuracy across different genetic backgrounds:
AML Subtype | Pearson Correlation | Notable Observations |
---|---|---|
FLT3-ITD | 0.75 | Reduced accuracy in enhancer regions |
NPM1 Mutant | 0.81 | Strong performance in promoter regions |
Complex Karyotype | 0.63 | Generally lower accuracy across all regions |
Genomic Context Analysis
The model’s performance varied by genomic context:
- Promoters: Highest accuracy (r = 0.85)
- CpG Islands: Very good performance (r = 0.82)
- Enhancers: Moderate performance (r = 0.73)
- Gene Bodies: Good performance (r = 0.78)
- Intergenic Regions: Lowest accuracy (r = 0.68)
Model Interpretability
Histone Feature Importance
We used integrated gradients to understand which histone marks were most influential for predictions:
Histone Mark | Global Importance | Context-Specific Observations |
---|---|---|
H3K4me3 | 0.35 | Most important at promoters |
H3K27ac | 0.28 | Critical for enhancer regions |
H3K27me3 | 0.26 | Important for repressed regions |
H3K9me3 | 0.11 | Lesser overall impact |
Sequential Pattern Analysis
When examining sequence-specific patterns:
- CpG density showed strong positive correlation with prediction accuracy
- Specific transcription factor binding motifs (including RUNX1 and GATA2) were frequently identified in regions of high attribution
Clinical Implications
Our findings suggest several potential clinical applications:
- Cost-Effective Methylation Profiling: Predicting genome-wide methylation from histone ChIP-seq could reduce costs
- AML Heterogeneity Understanding: Different subtypes show varying epigenetic predictability
- Therapeutic Target Identification: Regions where methylation predictions diverge from actual values may represent disease-specific alterations
Limitations and Future Directions
While promising, our approach has several limitations:
- Sample Size: Limited to 10 AML samples; more diverse samples needed
- Technical Variability: ChIP-seq quality impacts prediction accuracy
- Biological Complexity: Some methylation patterns may be influenced by factors beyond histones
Future work will focus on:
- Including more histone marks and chromatin accessibility data
- Developing AML subtype-specific models
- Incorporating longitudinal samples to track epigenetic changes during treatment
Conclusions
This case study demonstrates that:
- EpiBench can successfully transfer learning from normal to malignant cells
- Histone modification patterns retain strong predictive power for DNA methylation across cell states
- The approach reveals context-specific insights into epigenetic regulation in AML
All code is available in our GitHub repository, allowing researchers to replicate this analysis with their own samples.
This note explores one application of EpiBench. To learn more about the platform itself, visit the Introduction to EpiBench page. For technical details on running similar analyses, see the Analysis Process Guide.