Checkpoint System¶

Overview¶

The checkpoint system provides the ability to save and restore simulation state at regular intervals. This feature is critical for:

Crash recovery: Resume long-running simulations after unexpected failures
Debugging: Start debugging from a specific year without re-running the entire simulation
Performance: Avoid re-computing early years when testing changes that only affect later years
Analysis: Save state at key points for detailed analysis

Key Features¶

Automatic Checkpointing¶

By default, checkpoints are automatically saved every 5 years during simulation. This happens in the finalise_iteration handler after each year completes:

# Automatic checkpoint every 5 years
if env.year % 5 == 0:
    checkpoint_system.save_checkpoint(env.year, env, uow)

Manual Checkpointing¶

You can also trigger checkpoints manually by raising a SaveCheckpoint event:

from steelo.service_layer.checkpoint import SaveCheckpoint
from steelo.domain import Year

# Save checkpoint for current year
bus.handle(SaveCheckpoint(year=Year(2030)))

What Gets Saved¶

Each checkpoint includes:

Environment State
- Current year
- Cost curves for products
- Average bill of materials (BOMs)
- Dynamic feedstocks data
- Capacity additions and limits
- Cost of capital by region
- Regional capex data
- Carbon costs
Repository State
- All plants with their furnace groups
- Demand centers
- Suppliers
- Trade tariffs
- Plant groups
Metadata
- Checkpoint year
- Timestamp
- Simulation ID
- Checkpoint version

File Structure¶

Checkpoints are saved in the checkpoints/ directory (configurable) with the following naming convention:

checkpoints/
├── checkpoint_year_2025_sim_20240115_143022.pkl
├── checkpoint_year_2025_sim_20240115_143022_metadata.json
├── checkpoint_year_2030_sim_20240115_143022.pkl
├── checkpoint_year_2030_sim_20240115_143022_metadata.json
└── ...

.pkl files contain the serialized simulation state
_metadata.json files contain human-readable checkpoint information

Usage Examples¶

Basic Checkpoint Operations¶

from steelo.service_layer.checkpoint import SimulationCheckpoint

# Create checkpoint system
checkpoint_system = SimulationCheckpoint("checkpoints")

# Save checkpoint
checkpoint_system.save_checkpoint(Year(2025), env, uow)

# Load checkpoint
checkpoint_data = checkpoint_system.load_checkpoint(Year(2025))
if checkpoint_data:
    # Restore state (implementation depends on your needs)
    restored_year = checkpoint_data['year']
    restored_env_state = checkpoint_data['environment_state']
    restored_repo_state = checkpoint_data['repository_state']

# List available checkpoints
checkpoints = checkpoint_system.list_checkpoints()
for checkpoint in checkpoints:
    print(f"Year: {checkpoint.year}, Saved at: {checkpoint.timestamp}")

# Clean old checkpoints (keep last 10)
checkpoint_system.clean_old_checkpoints(keep_last_n=10)

Resuming from Checkpoint¶

To resume a simulation from a checkpoint, you would typically:

Load the checkpoint for the desired year
Restore the environment and repository state
Continue simulation from the next year

# Example: Resume from year 2030
checkpoint_data = checkpoint_system.load_checkpoint(Year(2030))
if checkpoint_data:
    # Restore environment state
    env.year = checkpoint_data['environment_state']['year']
    env.cost_curve = checkpoint_data['environment_state']['cost_curve']
    # ... restore other environment attributes
    
    # Restore repository state
    # This would require deserialization logic for domain objects
    
    # Continue simulation from year 2031
    start_year = env.year + 1

Configuration¶

Checkpoint Frequency¶

The default checkpoint frequency (every 5 years) can be modified in handlers.py:

# Save checkpoint every N years
if env.year % N == 0:
    checkpoint_system.save_checkpoint(env.year, env, uow)

Checkpoint Directory¶

The checkpoint directory defaults to "checkpoints" but can be configured when bootstrapping the application:

from steelo.bootstrap import bootstrap

# Configure with custom checkpoint directory
bus = bootstrap(checkpoint_dir="/path/to/custom/checkpoint/dir")

# Use default checkpoint directory ("checkpoints/")
bus = bootstrap()

The checkpoint system is now injected as a dependency into handlers that need it, following the dependency injection pattern used throughout the application.

Checkpoint Retention¶

Old checkpoints can be automatically cleaned up to save disk space:

# Keep only the last 20 checkpoints
checkpoint_system.clean_old_checkpoints(keep_last_n=20)

Error Handling¶

The checkpoint system includes robust error handling:

Save failures: Logged as warnings but don’t interrupt simulation
Load failures: Return None if checkpoint can’t be loaded
Missing checkpoints: Handled gracefully with appropriate warnings

Performance Considerations¶

Disk space: Each checkpoint can be several MB to GB depending on simulation size
Save time: Checkpoint saves are typically fast (<1 second) but may increase with simulation complexity
Frequency trade-off: More frequent checkpoints provide better recovery granularity but use more disk space

Future Enhancements¶

Potential improvements to the checkpoint system:

Incremental checkpoints: Save only changes since last checkpoint
Compression: Compress checkpoint files to save disk space
Cloud storage: Support for saving checkpoints to S3/cloud storage
Parallel checkpointing: Save checkpoints in background thread
Checkpoint validation: Verify checkpoint integrity on save/load
Selective state: Configure which parts of state to include in checkpoints

Technical Details¶

Serialization¶

The checkpoint system uses Python’s pickle module for serialization with the highest protocol version for best performance. Metadata is saved as JSON for human readability.

Compatibility¶

Checkpoints include a version number to handle future changes to the checkpoint format. Currently version “1.0”.

Thread Safety¶

The current implementation is not thread-safe. If running multiple simulations in parallel, each should use a separate checkpoint directory or include a unique simulation ID.