Skip to content

ADR-002: Batch-based Pipeline Processing

Status

Accepted

Context

Processing data row-by-row in Python is slow due to function call overhead and the GIL. Conversely, processing an entire multi-gigabyte file in memory at once is not feasible for many environments.

Decision

ZooPipe uses a batch-based (chunked) processing model. Data is read from the input source in batches (defaulting to 1000-5000 records), processed as a block throughout the pipeline, and written to the output in batches.

Implementation Details

  • The Rust core fetches a chunk of raw data.
  • This chunk is passed to the Python layer for validation (Pydantic) and transformation (Hooks) as a list of dictionaries.
  • The results are returned to Rust for final output writing.
  • Memory management is optimized by reusing buffers where possible and minimizing copies between Python and Rust.

Consequences

  • Benefit: Amortizes the cost of crossing the Python-Rust boundary across multiple records.
  • Benefit: Predictable and low memory usage regardless of total dataset size.
  • Benefit: Enables efficient batch inserts for databases (SQL) and batch writes for files (Parquet/Arrow).
  • Drawback: Latency per record might be slightly higher than a pure streaming approach, though overall throughput is much higher.
  • Drawback: Error handling must manage batch failures and record-level granularity within a batch.