Arrow backend architecture, vectorization strategy, and performance expectations
T’s DataFrame operations are backed by Apache Arrow, a columnar memory format designed for efficient analytical processing. The Arrow integration provides:
- `group_by()` operations use Arrow's hash-based grouping when a native handle is present

Every operation in T follows a dual-path pattern:

- When a `native_handle` is present (e.g., from `read_csv()`), operations delegate to Arrow Compute kernels via FFI for zero-copy, vectorized execution.
- When no native handle is available (e.g., after `mutate()` materializes a table), operations use pure OCaml implementations that work on typed columnar arrays.

This ensures correctness regardless of backing storage, while maximizing performance when native Arrow buffers are available.
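The dual-path pattern can be sketched as follows. The record fields and the kernel/fallback functions here are illustrative names, not T's actual API; in T the native branch would cross the FFI into Arrow Compute rather than run OCaml code.

```ocaml
(* Sketch of the dual-path dispatch, assuming hypothetical names. *)
type column = float array

type frame = {
  native_handle : int option;  (* stand-in for an Arrow table handle *)
  columns : (string * column) list;
}

(* Stand-in for a native kernel; in T this would be an FFI call into
   an Arrow Compute aggregation kernel. *)
let native_sum _handle col = Array.fold_left ( +. ) 0.0 col

(* Pure OCaml fallback: same semantics, no native dependency. *)
let fallback_sum col = Array.fold_left ( +. ) 0.0 col

let sum frame name =
  let col = List.assoc name frame.columns in
  match frame.native_handle with
  | Some h -> native_sum h col  (* vectorized native path *)
  | None -> fallback_sum col    (* portable fallback path *)

let () =
  let f = { native_handle = None; columns = [ ("x", [| 1.0; 2.0; 3.0 |]) ] } in
  assert (sum f "x" = 6.0)
```

Because both branches implement identical semantics, callers never need to know which path executed.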
The `Arrow_column` module provides `column_view` and `numeric_view` types that reference the backing Arrow table without copying data:
```
┌─────────────────────────────────────────────┐
│         Arrow Table (native memory)         │
│ ┌──────────┐  ┌──────────┐  ┌──────────┐    │
│ │ Column A │  │ Column B │  │ Column C │    │
│ │ (Float64)│  │ (Int64)  │  │ (String) │    │
│ └──────────┘  └──────────┘  └──────────┘    │
└─────────────────────────────────────────────┘
       ↑               ↑
  FloatView ba    IntView ba
  (Bigarray)      (Bigarray)
  (zero-copy)     (zero-copy)
```
For numeric columns (`Float64`, `Int64`), `zero_copy_view` returns a Bigarray that shares memory with the Arrow buffer — no allocation or copying occurs. The GC finalizer on the backing table ensures the Arrow memory remains valid.
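The shape of such a view can be illustrated with a plain `Bigarray.Array1`. Here the "Arrow buffer" is simulated with an OCaml-allocated Bigarray; in T the view would wrap native Arrow memory obtained over the FFI, but the sharing behavior is the same.

```ocaml
(* A float64 Bigarray standing in for a native Arrow buffer. *)
let buffer =
  Bigarray.Array1.of_array Bigarray.float64 Bigarray.c_layout
    [| 1.0; 2.0; 3.0; 4.0 |]

(* Bigarray.Array1.sub shares storage with its parent: no copy is made. *)
let view = Bigarray.Array1.sub buffer 1 2

let () =
  Bigarray.Array1.set view 0 42.0;
  (* The write through the view is visible in the backing buffer. *)
  assert (Bigarray.Array1.get buffer 1 = 42.0)
```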
| Operation | Native Arrow Path | Pure OCaml Fallback |
|---|---|---|
| `select()` (project) | Zero-copy column selection | List-based column lookup |
| `filter()` | Arrow filter kernel with bool mask | Element-wise mask application |
| `arrange()` (sort) | Arrow sort kernel | Index-based reordering |
| `add_scalar`, `multiply_scalar`, etc. | Arrow Compute arithmetic kernels | Element-wise loop |
| `sqrt`, `abs`, `log`, `exp`, `pow` | Arrow Compute unary kernels | `Array.map` with stdlib math |
| `sum`, `mean`, `min`, `max` | Arrow Compute aggregation kernels | `Array.fold_left` |
| compare (`eq`, `lt`, `gt`, `le`, `ge`) | Arrow Compute comparison kernel | Element-wise comparison |
| `group_by` | Arrow hash-based grouping | Hashtable-based grouping |
| `group_aggregate` (sum, mean, count) | Arrow group aggregation kernels | Per-group fold |
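As a concrete illustration of the fallback column in the table, here is what an `Array.fold_left`-style aggregation over a nullable column looks like, assuming the `float option array` representation described in the limitations notes. The function names are illustrative, not T's API.

```ocaml
(* Fallback sum over a nullable Float64 column: nulls are skipped. *)
let sum_opt (col : float option array) =
  Array.fold_left
    (fun acc v -> match v with Some x -> acc +. x | None -> acc)
    0.0 col

(* Fallback mean: None when every value is null. *)
let mean_opt col =
  let n =
    Array.fold_left (fun c v -> if v = None then c else c + 1) 0 col
  in
  if n = 0 then None else Some (sum_opt col /. float_of_int n)

let () =
  let col = [| Some 1.0; None; Some 3.0 |] in
  assert (sum_opt col = 4.0);
  assert (mean_opt col = Some 2.0)
```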
When does the fallback trigger?

- `mutate()` adds or replaces a column: the table is materialized as pure OCaml (the native handle is dropped).
- `add_column()` modifies the table structure.

The following expectations assume standard hardware (modern x86-64, 8+ GB RAM) and typical datasets with 10–20 columns:
| Operation | 10k rows | 100k rows | 1M rows |
|---|---|---|---|
| Column selection (`select`) | <10ms | <50ms | <500ms |
| Row filtering (`filter`) | <10ms | <100ms | <1s |
| Arithmetic operations | <20ms | <200ms | <2s |
| Aggregation (`sum`, `mean`) | <5ms | <50ms | <500ms |
| Grouping + summarization | <50ms | <500ms | <5s |
| Window functions | <30ms | <300ms | <3s |
| CSV reading | <50ms | <200ms | <2s |
Performance scales approximately linearly with row count for columnar operations. Actual timings depend on hardware, dataset characteristics (column count, string lengths, group cardinality), and whether the native Arrow path is active.
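A minimal way to sanity-check these expectations on your own hardware is to time an operation directly. The workload below is a stand-in element-wise multiply over a plain `float array`, not T's actual kernels, so treat the printed numbers only as a rough baseline for the fallback path.

```ocaml
(* Time a thunk with Sys.time and report milliseconds of CPU time. *)
let time_it label f =
  let t0 = Sys.time () in
  let r = f () in
  Printf.printf "%s: %.1f ms\n" label ((Sys.time () -. t0) *. 1000.0);
  r

let () =
  let n = 1_000_000 in
  let col = Array.init n float_of_int in
  let _scaled =
    time_it "multiply_scalar (1M rows)"
      (fun () -> Array.map (fun x -> x *. 2.0) col)
  in
  ()
```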
- **Materialization after `mutate`:** Adding or replacing columns materializes the entire table as pure OCaml arrays, dropping the native Arrow handle. Subsequent operations on the modified table use the fallback path. This can be significant for pipelines that alternate `mutate()` and compute-heavy operations.
- **Single-threaded execution:** All operations run on a single thread. Arrow's multi-threaded capabilities (Rayon-based parallelism) are not yet exposed through the FFI layer.
- **String columns:** Zero-copy views are only available for numeric columns (`Float64`, `Int64`). String column operations always copy data into OCaml heap memory.
- **Large group counts:** Group-by with very high cardinality (>10,000 unique groups) uses O(n × g) operations in the OCaml fallback path, where n is row count and g is group count.
- **Memory usage:** The pure OCaml fallback stores data as `option array` (boxed), using more memory than Arrow's compact nullable representation. For 1M-row datasets, expect ~2× memory overhead compared to native Arrow storage.
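For reference, the hashtable-based grouping named in the operation table can be sketched as a single pass that folds each row's value into its group's accumulator; with very high group cardinality, the per-group bookkeeping is where the fallback's cost concentrates. Names are illustrative, not T's API.

```ocaml
(* Fallback-style grouped sum: bucket by string key via Hashtbl. *)
let group_sum (keys : string array) (values : float array) =
  let groups = Hashtbl.create 64 in
  Array.iteri
    (fun i k ->
      let acc = try Hashtbl.find groups k with Not_found -> 0.0 in
      Hashtbl.replace groups k (acc +. values.(i)))
    keys;
  groups

let () =
  let g = group_sum [| "a"; "b"; "a" |] [| 1.0; 2.0; 3.0 |] in
  assert (Hashtbl.find g "a" = 4.0);
  assert (Hashtbl.find g "b" = 2.0)
```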
The following optimizations are planned for future versions: