ADR-002: Batch-based Pipeline Processing

Status

Accepted

Context

Processing data row-by-row in Python is slow due to function call overhead and the GIL. Conversely, processing an entire multi-gigabyte file in memory at once is not feasible for many environments.

Decision

ZooPipe uses a batch-based (chunked) processing model. Data is read from the input source in batches (defaulting to 1000-5000 records), processed as a block throughout the pipeline, and written to the output in batches.

Implementation Details

The Rust core fetches a chunk of raw data.
This chunk is passed to the Python layer for validation (Pydantic) and transformation (Hooks) as a list of dictionaries.
The results are returned to Rust for final output writing.
Memory management is optimized by reusing buffers where possible and minimizing copies between Python and Rust.

Consequences

Benefit: Amortizes the cost of crossing the Python-Rust boundary across multiple records.
Benefit: Predictable and low memory usage regardless of total dataset size.
Benefit: Enables efficient batch inserts for databases (SQL) and batch writes for files (Parquet/Arrow).
Drawback: Latency per record might be slightly higher than a pure streaming approach, though overall throughput is much higher.
Drawback: Error handling must manage batch failures and record-level granularity within a batch.