Predicting DNA Methylation in AML: An EpiBench Case Study

AML Methylation Visualization

This case study demonstrates how EpiBench can be applied to predict DNA methylation patterns in Acute Myeloid Leukemia (AML) samples. Last updated: 2024-10-15

Research Question

Can a CNN model trained on CD34+ cell data accurately predict DNA methylation patterns in AML samples based on histone modification data?

This question addresses a key challenge in epigenetic research: the transferability of predictive models across different cell types and disease states. If successful, such an approach could reduce the need for costly whole-genome bisulfite sequencing (WGBS) in clinical samples.

Dataset Description

Training Data: CD34+ Hematopoietic Progenitor Cells

The model was trained using data from healthy CD34+ hematopoietic progenitor cells:

  • DNA Methylation: Whole-genome bisulfite sequencing data (>25x coverage)
  • Histone Modifications: ChIP-seq data for H3K4me3, H3K27ac, H3K27me3, and H3K9me3
  • Total Sample Size: 5 donors, with ~1 million CpG sites analyzed

Test Data: AML Patient Samples

The trained model was evaluated using data from AML patient samples:

  • DNA Methylation: WGBS data (ground truth for evaluation)
  • Histone Modifications: ChIP-seq data for the same histone marks as the training data
  • Total Sample Size: 10 AML patients with different genetic backgrounds (4 with FLT3-ITD, 3 with NPM1 mutations, 3 with complex karyotypes)

Analysis Workflow

1. Data Preparation

First, we prepared the data for both the training and testing phases:

# Load data for CD34+ cells (training)
cd34_data = epibench.data.load_multi_omics_dataset(
    methylation_file="cd34_methylation.bed",
    histone_files={
        "H3K4me3": "cd34_h3k4me3.bw",
        "H3K27ac": "cd34_h3k27ac.bw",
        "H3K27me3": "cd34_h3k27me3.bw",
        "H3K9me3": "cd34_h3k9me3.bw"
    },
    reference_genome="hg38.fa"
)
 
# Load data for AML samples (testing)
aml_data = epibench.data.load_multi_omics_dataset(
    methylation_file="aml_methylation.bed",
    histone_files={
        "H3K4me3": "aml_h3k4me3.bw",
        "H3K27ac": "aml_h3k27ac.bw",
        "H3K27me3": "aml_h3k27me3.bw",
        "H3K9me3": "aml_h3k9me3.bw"
    },
    reference_genome="hg38.fa"
)
 
# Split CD34+ data into training and validation sets
train_data, val_data = cd34_data.train_val_split(val_ratio=0.2)

2. Model Architecture

We designed a multi-branch CNN specifically for this task:

# Configure model architecture
model = epibench.models.MultibranchCNN(
    sequence_length=1000,
    n_histone_marks=4,
    kernel_sizes=[5, 15, 25],  # Capture patterns at different scales
    filters=[64, 128, 64],
    dropout_rate=0.3,
    task="regression"  # Predicting methylation levels (0-1)
)
 
# Configure training parameters
model.configure_training(
    optimizer="adam",
    learning_rate=0.001,
    batch_size=128,
    n_epochs=100,
    early_stopping=True,
    patience=10
)

3. Training

We trained the model on CD34+ data with careful monitoring for overfitting:

# Train the model
history = model.train(
    train_data=train_data,
    val_data=val_data,
    save_best_model=True,
    model_path="cd34_to_aml_methylation_model.pt"
)
 
# Visualize training progress
epibench.visualization.plot_training_history(history)

4. Cross-Sample Evaluation

The critical test was applying the CD34+-trained model to AML samples:

# Evaluate model on AML data
aml_performance = model.evaluate(aml_data)
 
print(f"Mean Squared Error on AML samples: {aml_performance['mse']:.4f}")
print(f"Pearson correlation: {aml_performance['pearson']:.4f}")
print(f"Spearman correlation: {aml_performance['spearman']:.4f}")
 
# Generate predictions
aml_predictions = model.predict(aml_data)
 
# Compare predictions to actual values
epibench.visualization.create_scatter_plot(
    actual=aml_data.methylation_values,
    predicted=aml_predictions,
    title="Predicted vs. Actual DNA Methylation in AML Samples"
)

Results

Performance Metrics

The model achieved promising results when transferring from CD34+ to AML samples:

MetricCD34+ ValidationAML Test
Mean Squared Error0.0420.076
Pearson Correlation0.870.79
Spearman Correlation0.820.74

Patient-Specific Variations

We observed interesting variations in prediction accuracy across different genetic backgrounds:

AML SubtypePearson CorrelationNotable Observations
FLT3-ITD0.75Reduced accuracy in enhancer regions
NPM1 Mutant0.81Strong performance in promoter regions
Complex Karyotype0.63Generally lower accuracy across all regions

Genomic Context Analysis

The model’s performance varied by genomic context:

Performance by Genomic Context

  • Promoters: Highest accuracy (r = 0.85)
  • CpG Islands: Very good performance (r = 0.82)
  • Enhancers: Moderate performance (r = 0.73)
  • Gene Bodies: Good performance (r = 0.78)
  • Intergenic Regions: Lowest accuracy (r = 0.68)

Model Interpretability

Histone Feature Importance

We used integrated gradients to understand which histone marks were most influential for predictions:

Histone MarkGlobal ImportanceContext-Specific Observations
H3K4me30.35Most important at promoters
H3K27ac0.28Critical for enhancer regions
H3K27me30.26Important for repressed regions
H3K9me30.11Lesser overall impact

Sequential Pattern Analysis

When examining sequence-specific patterns:

  • CpG density showed strong positive correlation with prediction accuracy
  • Specific transcription factor binding motifs (including RUNX1 and GATA2) were frequently identified in regions of high attribution

Clinical Implications

Our findings suggest several potential clinical applications:

  1. Cost-Effective Methylation Profiling: Predicting genome-wide methylation from histone ChIP-seq could reduce costs
  2. AML Heterogeneity Understanding: Different subtypes show varying epigenetic predictability
  3. Therapeutic Target Identification: Regions where methylation predictions diverge from actual values may represent disease-specific alterations

Limitations and Future Directions

While promising, our approach has several limitations:

  • Sample Size: Limited to 10 AML samples; more diverse samples needed
  • Technical Variability: ChIP-seq quality impacts prediction accuracy
  • Biological Complexity: Some methylation patterns may be influenced by factors beyond histones

Future work will focus on:

  1. Including more histone marks and chromatin accessibility data
  2. Developing AML subtype-specific models
  3. Incorporating longitudinal samples to track epigenetic changes during treatment

Conclusions

This case study demonstrates that:

  1. EpiBench can successfully transfer learning from normal to malignant cells
  2. Histone modification patterns retain strong predictive power for DNA methylation across cell states
  3. The approach reveals context-specific insights into epigenetic regulation in AML

All code is available in our GitHub repository, allowing researchers to replicate this analysis with their own samples.


This note explores one application of EpiBench. To learn more about the platform itself, visit the Introduction to EpiBench page. For technical details on running similar analyses, see the Analysis Process Guide.