ADR-004: Lifecycle Hooks for Data Enrichment
Status
Accepted
Context
Standard ETL often requires more than just simple mapping or validation. Users need to enrich data from external sources (APIs, Caches), perform complex transformations, or manage state across chunks.
Decision
We implement a lifecycle-based hook system that allows Python code to execute at specific stages of the pipeline.
Stages
pre_validation_hooks: Execute before Pydantic validation. Useful for sanitizing raw data or early filtering.execute(Main Hook): The primary transformation logic.post_validation_hooks: Execute after validation. Useful for enrichment that depends on valid data structures.setup/teardown: Framework-level hooks for initializing and cleaning up resources (e.g., database connections, API clients).
Batch Processing
Hooks receive batches of data, allowing for efficient operations like bulk API calls or vectorized transformations within the hook itself.
Consequences
- Benefit: High flexibility for complex "Chaotic ETL" workloads.
- Benefit: Separation of concerns between data validation (Schema) and business logic (Hooks).
- Benefit: Batching within hooks allows for performance optimizations.
- Drawback: Heavy logic in hooks will slow down the pipeline as it runs in the Python interpreter.