Generated by `scripts/profile_performance.sh`; run the script to populate with actual measurements.

| Operation | Time (s) |
|---|---|
| Project 2 columns | — |
| Filter rows | — |
| Sum column | — |
| Group-by | — |
| Group aggregate (mean) | — |

| Operation | Time (s) |
|---|---|
| Project 2 columns | — |
| Sum column | — |
| Group-by | — |
| Group aggregate (sum) | — |
| sqrt (vectorized) | — |
| abs (vectorized) | — |
| compare scalar | — |

| Operation | Time (s) |
|---|---|
| Project 2 columns | — |
| Sum column | — |
| Mean column | — |
| Group-by | — |
| Group aggregate (mean) | — |
Operations should scale approximately linearly with row count:

- 10x rows → ~10x time for columnar operations
- Group-by scaling depends on group cardinality
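The linear-scaling expectation can be sanity-checked outside the profiling script with a minimal stdlib-only sketch; a plain Python sum stands in for the columnar kernels here, and the `time_sum` helper is made up for illustration (`scripts/profile_performance.sh` does the real runs):

```python
# Sketch: time the same operation at 1x and 10x the row count and compare.
# A plain Python sum stands in for the columnar kernels.
import time

def time_sum(n_rows: int) -> float:
    """Wall-clock time to sum a column of n_rows floats (build excluded)."""
    column = [float(i) for i in range(n_rows)]
    start = time.perf_counter()
    sum(column)
    return time.perf_counter() - start

base = time_sum(100_000)
scaled = time_sum(1_000_000)  # 10x rows
print(f"10x rows -> {scaled / base:.1f}x time (expect roughly 10x)")
```

Group-by is the exception to this check: its cost depends on group cardinality as well as row count, so the ratio there is not expected to be a clean 10x.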
The most time-critical operations for large datasets are:

1. Group-by + aggregation: dominates pipeline execution time for grouped summarizations
2. Filter: boolean mask construction + row extraction
3. CSV reading: I/O bound for large files; the Arrow native reader provides a significant speedup
After each `mutate()`, the native Arrow handle is dropped; lazy evaluation could defer materialization. See docs/performance.md for detailed performance expectations and an overview of the Arrow backend architecture.
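To illustrate the deferral idea only: instead of materializing after every `mutate()`, a lazy wrapper could record each transform and apply them all at a single `collect()` point. The `LazyFrame` class and its methods below are a made-up sketch, not this library's API:

```python
# Illustrative sketch of deferred materialization: mutate() records an
# operation; collect() is the single point where results are computed.
from dataclasses import dataclass, field

@dataclass
class LazyFrame:
    data: dict                               # column name -> list of values
    ops: list = field(default_factory=list)  # deferred (name, fn) transforms

    def mutate(self, name, fn):
        # Record the transform instead of materializing a new table now.
        return LazyFrame(self.data, self.ops + [(name, fn)])

    def collect(self):
        # Single materialization point: apply all deferred ops in one pass.
        out = dict(self.data)
        for name, fn in self.ops:
            out[name] = fn(out)
        return out

lf = (LazyFrame({"x": [1, 2, 3]})
      .mutate("y", lambda t: [v * 2 for v in t["x"]])
      .mutate("z", lambda t: [v + 1 for v in t["y"]]))
result = lf.collect()
print(result)  # {'x': [1, 2, 3], 'y': [2, 4, 6], 'z': [3, 5, 7]}
```

In a real Arrow-backed implementation, the payoff would be keeping one native handle alive across the whole chain and fusing the recorded operations before executing them.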