Checkpoint System

Overview

The checkpoint system provides the ability to save and restore simulation state at regular intervals. This feature is critical for:

  • Crash recovery: Resume long-running simulations after unexpected failures

  • Debugging: Start debugging from a specific year without re-running the entire simulation

  • Performance: Avoid re-computing early years when testing changes that only affect later years

  • Analysis: Save state at key points for detailed analysis

Key Features

Automatic Checkpointing

By default, checkpoints are automatically saved every 5 years during simulation. This happens in the finalise_iteration handler after each year completes:

# Automatic checkpoint every 5 years
if env.year % 5 == 0:
    checkpoint_system.save_checkpoint(env.year, env, uow)

Manual Checkpointing

You can also trigger checkpoints manually by raising a SaveCheckpoint event:

from steelo.service_layer.checkpoint import SaveCheckpoint
from steelo.domain import Year

# Save checkpoint for current year
bus.handle(SaveCheckpoint(year=Year(2030)))

What Gets Saved

Each checkpoint includes:

  1. Environment State

    • Current year

    • Cost curves for products

    • Average bill of materials (BOMs)

    • Dynamic feedstocks data

    • Capacity additions and limits

    • Cost of capital by region

    • Regional capex data

    • Carbon costs

  2. Repository State

    • All plants with their furnace groups

    • Demand centers

    • Suppliers

    • Trade tariffs

    • Plant groups

  3. Metadata

    • Checkpoint year

    • Timestamp

    • Simulation ID

    • Checkpoint version

File Structure

Checkpoints are saved in the checkpoints/ directory (configurable) with the following naming convention:

checkpoints/
├── checkpoint_year_2025_sim_20240115_143022.pkl
├── checkpoint_year_2025_sim_20240115_143022_metadata.json
├── checkpoint_year_2030_sim_20240115_143022.pkl
├── checkpoint_year_2030_sim_20240115_143022_metadata.json
└── ...
  • .pkl files contain the serialized simulation state

  • _metadata.json files contain human-readable checkpoint information

Usage Examples

Basic Checkpoint Operations

from steelo.service_layer.checkpoint import SimulationCheckpoint

# Create checkpoint system
checkpoint_system = SimulationCheckpoint("checkpoints")

# Save checkpoint
checkpoint_system.save_checkpoint(Year(2025), env, uow)

# Load checkpoint
checkpoint_data = checkpoint_system.load_checkpoint(Year(2025))
if checkpoint_data:
    # Restore state (implementation depends on your needs)
    restored_year = checkpoint_data['year']
    restored_env_state = checkpoint_data['environment_state']
    restored_repo_state = checkpoint_data['repository_state']

# List available checkpoints
checkpoints = checkpoint_system.list_checkpoints()
for checkpoint in checkpoints:
    print(f"Year: {checkpoint.year}, Saved at: {checkpoint.timestamp}")

# Clean old checkpoints (keep last 10)
checkpoint_system.clean_old_checkpoints(keep_last_n=10)

Resuming from Checkpoint

To resume a simulation from a checkpoint, you would typically:

  1. Load the checkpoint for the desired year

  2. Restore the environment and repository state

  3. Continue simulation from the next year

# Example: Resume from year 2030
checkpoint_data = checkpoint_system.load_checkpoint(Year(2030))
if checkpoint_data:
    # Restore environment state
    env.year = checkpoint_data['environment_state']['year']
    env.cost_curve = checkpoint_data['environment_state']['cost_curve']
    # ... restore other environment attributes
    
    # Restore repository state
    # This would require deserialization logic for domain objects
    
    # Continue simulation from year 2031
    start_year = env.year + 1

Configuration

Checkpoint Frequency

The default checkpoint frequency (every 5 years) can be modified in handlers.py:

# Save checkpoint every N years
if env.year % N == 0:
    checkpoint_system.save_checkpoint(env.year, env, uow)

Checkpoint Directory

The checkpoint directory defaults to "checkpoints" but can be configured when bootstrapping the application:

from steelo.bootstrap import bootstrap

# Configure with custom checkpoint directory
bus = bootstrap(checkpoint_dir="/path/to/custom/checkpoint/dir")

# Use default checkpoint directory ("checkpoints/")
bus = bootstrap()

The checkpoint system is now injected as a dependency into handlers that need it, following the dependency injection pattern used throughout the application.

Checkpoint Retention

Old checkpoints can be automatically cleaned up to save disk space:

# Keep only the last 20 checkpoints
checkpoint_system.clean_old_checkpoints(keep_last_n=20)

Error Handling

The checkpoint system includes robust error handling:

  • Save failures: Logged as warnings but don’t interrupt simulation

  • Load failures: Return None if checkpoint can’t be loaded

  • Missing checkpoints: Handled gracefully with appropriate warnings

Performance Considerations

  • Disk space: Each checkpoint can be several MB to GB depending on simulation size

  • Save time: Checkpoint saves are typically fast (<1 second) but may increase with simulation complexity

  • Frequency trade-off: More frequent checkpoints provide better recovery granularity but use more disk space

Future Enhancements

Potential improvements to the checkpoint system:

  1. Incremental checkpoints: Save only changes since last checkpoint

  2. Compression: Compress checkpoint files to save disk space

  3. Cloud storage: Support for saving checkpoints to S3/cloud storage

  4. Parallel checkpointing: Save checkpoints in background thread

  5. Checkpoint validation: Verify checkpoint integrity on save/load

  6. Selective state: Configure which parts of state to include in checkpoints

Technical Details

Serialization

The checkpoint system uses Python’s pickle module for serialization with the highest protocol version for best performance. Metadata is saved as JSON for human readability.

Compatibility

Checkpoints include a version number to handle future changes to the checkpoint format. Currently version “1.0”.

Thread Safety

The current implementation is not thread-safe. If running multiple simulations in parallel, each should use a separate checkpoint directory or include a unique simulation ID.