How T ensures perfect reproducibility for data analysis through Nix integration and deterministic execution.
Traditional data science faces a reproducibility crisis: package versions drift, environments differ between machines, and undocumented assumptions are forgotten. The result is “works on my machine” syndrome — analyses that can’t be reproduced months or years later.
T provides perfect reproducibility: The same T code with the same data produces identical results, always.
Nix is a declarative package manager that ensures:
- Bit-for-bit identical builds across machines
- Isolated environments (no global state)
- Versioned dependencies (pinned to exact commits)
Every T project is a Nix flake:
flake.nix:
{
  description = "My T analysis project";

  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-23.11";
    tlang.url = "github:b-rodrigues/tlang/v0.1.0";
  };

  outputs = { self, nixpkgs, tlang }:
    let
      system = "x86_64-linux";  # adjust for your platform
      pkgs = nixpkgs.legacyPackages.${system};
    in {
      devShells.${system}.default = pkgs.mkShell {
        buildInputs = [
          tlang.packages.${system}.default
          # Any other dependencies
        ];
      };
    };
}

Lock file (flake.lock): Pins exact versions
{
  "nodes": {
    "tlang": {
      "locked": {
        "narHash": "sha256-...",
        "rev": "abc123...",
        "type": "github"
      }
    }
  }
}

Scenario: You run an analysis today. A colleague tries to run it in 2026.
Without Nix:
- Package versions have changed
- APIs have breaking changes
- Results differ or the code errors out
With Nix:
# 2026: Same exact environment as 2024
nix develop github:myuser/my-analysis?rev=abc123
dune exec src/repl.exe < analysis.t
# Identical output, guaranteed

The flake.lock file ensures every dependency (OCaml, Arrow, system libraries) is pinned to the exact version used originally.
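The "identical output" claim can be checked mechanically: hash the re-run's output and compare it with the original digest. A minimal sketch in Python (the file names are illustrative, not part of T):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Simulate the original run and a re-run producing the same bytes
for name in ("report_2024.csv", "report_2026.csv"):
    with open(name, "w") as f:
        f.write("region,total_revenue\nEU,1200.0\n")

# Bit-for-bit identical files hash identically
assert sha256_of("report_2024.csv") == sha256_of("report_2026.csv")
```

Committing the original digest alongside the code turns reproducibility verification into a single comparison.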
T has no built-in randomness:
- No random number generators (RNG)
- No random seeds
- No sampling without an explicit seed
To add randomness (future feature):
-- Explicit, reproducible randomness
rng = random_seed(12345)
sample = random_sample(data, 100, rng)
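The same discipline — randomness only through an explicitly passed, seeded generator — can be imitated in Python today with `random.Random` (a sketch of the idea; `random_sample` here is a stand-in, not a T function):

```python
import random

def random_sample(data, n, rng):
    """Sample n items from data using an explicitly passed RNG (no global state)."""
    return rng.sample(data, n)

data = list(range(1000))

# Same seed, same sample — reproducible by construction
sample_a = random_sample(data, 100, random.Random(12345))
sample_b = random_sample(data, 100, random.Random(12345))
assert sample_a == sample_b
```

Because the generator is an argument rather than hidden global state, every draw is traceable to a seed recorded in the script.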
Pipelines execute in topological order (determined by dependencies):
p = pipeline {
  z = x + y
  x = 10
  y = 20
}
-- Always executes: x, y, then z
-- Order of declaration doesn't matter
Same inputs → Same outputs, always.
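The scheduling rule can be sketched with the Python standard library's `graphlib`: build a dependency graph from the pipeline's bindings, then evaluate nodes in topological order, so declaration order is irrelevant (an illustration of the idea, not T's implementation):

```python
from graphlib import TopologicalSorter

# Each binding maps to (dependencies, compute function), in declaration order
nodes = {
    "z": ({"x", "y"}, lambda env: env["x"] + env["y"]),
    "x": (set(),      lambda env: 10),
    "y": (set(),      lambda env: 20),
}

# static_order() yields a node only after all of its dependencies
order = list(TopologicalSorter({k: deps for k, (deps, _) in nodes.items()}).static_order())

env = {}
for name in order:
    env[name] = nodes[name][1](env)

assert order.index("z") > order.index("x") and order.index("z") > order.index("y")
assert env["z"] == 30
```

Since evaluation order is derived from the graph alone, reordering the declarations cannot change the result.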
Intent blocks make implicit knowledge explicit:
intent {
  description: "Customer churn prediction for Q4 2023",
  data_source: "customers.csv from Salesforce export 2023-12-31",
  assumptions: [
    "Churn defined as no purchase in 90 days",
    "Test accounts excluded (account_type != 'test')",
    "Incomplete records removed (age, email must be present)"
  ],
  constraints: [
    "Age between 18 and 100",
    "Email format validated",
    "Purchase amounts > 0"
  ],
  preprocessing: [
    "Column names cleaned (snake_case)",
    "NA values in 'income' imputed with median",
    "Outliers beyond 3σ removed"
  ],
  environment: {
    t_version: "0.1.0",
    nix_revision: "abc123",
    run_date: "2024-01-15"
  }
}
-- Analysis code follows...
Benefits:
- Future readers understand context
- LLMs can regenerate code correctly
- Auditors can verify assumptions
- Changes to assumptions are versioned (Git)
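Stated constraints are also mechanically checkable. A hedged Python sketch of validating rows against the constraints listed above (`check_constraints` is a hypothetical helper, not part of T):

```python
def check_constraints(rows):
    """Return (row_index, reason) pairs for rows violating the stated constraints."""
    violations = []
    for i, row in enumerate(rows):
        if not (18 <= row["age"] <= 100):
            violations.append((i, "age out of range"))
        if "@" not in row["email"]:
            violations.append((i, "invalid email format"))
        if row["amount"] <= 0:
            violations.append((i, "non-positive purchase amount"))
    return violations

rows = [
    {"age": 34, "email": "a@example.com", "amount": 19.99},
    {"age": 17, "email": "b@example.com", "amount": 5.00},  # violates age constraint
]
assert check_constraints(rows) == [(1, "age out of range")]
```

Running such checks at load time turns the intent block from documentation into an executable contract.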
my-analysis/
├── flake.nix # Nix configuration
├── flake.lock # Locked dependencies
├── data/
│ └── sales.csv # Input data (or fetch script)
├── scripts/
│ └── analysis.t # T analysis script
├── outputs/
│ └── report.csv # Generated results
└── README.md # Documentation
flake.nix:
{
  description = "Q4 2023 Sales Analysis";

  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-23.11";
    tlang.url = "github:b-rodrigues/tlang/v0.1.0";
  };

  outputs = { self, nixpkgs, tlang }: {
    # ... configuration ...
  };
}

scripts/analysis.t:
intent {
  description: "Q4 2023 revenue analysis by region",
  data_source: "data/sales.csv (exported 2023-12-31)",
  assumptions: "All sales in USD, no refunds included",
  created: "2024-01-15",
  author: "Data Team"
}
sales = read_csv("data/sales.csv", clean_colnames = true)
analysis = pipeline {
  cleaned = sales
    |> filter(\(row) row.amount > 0)
    |> filter(\(row) not is_na(row.region))
  by_region = cleaned
    |> group_by("region")
    |> summarize("total_revenue", \(g) sum(g.amount))
    |> arrange("total_revenue", "desc")
}
write_csv(analysis.by_region, "outputs/report.csv")
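For readers more familiar with Python, the same pipeline semantics look roughly like this in plain Python (an illustrative translation with made-up rows, not output of T):

```python
from collections import defaultdict

# Made-up rows standing in for data/sales.csv
sales = [
    {"region": "EU", "amount": 100.0},
    {"region": "US", "amount": 250.0},
    {"region": "EU", "amount": -5.0},   # non-positive amount: filtered out
    {"region": None, "amount": 40.0},   # missing region: filtered out
]

# filter steps: amount > 0 and region present
cleaned = [r for r in sales if r["amount"] > 0 and r["region"] is not None]

# group_by("region") + summarize with sum
totals = defaultdict(float)
for r in cleaned:
    totals[r["region"]] += r["amount"]

# arrange("total_revenue", "desc")
by_region = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
assert by_region == [("US", 250.0), ("EU", 100.0)]
```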
# Enter reproducible environment
nix develop
# Run analysis
dune exec src/repl.exe < scripts/analysis.t
# Output: outputs/report.csv

git init
git add flake.nix flake.lock scripts/ data/ README.md
git commit -m "Initial Q4 2023 analysis"
git tag v1.0.0
git push

On any machine, any time:
# Clone project
git clone https://github.com/myorg/q4-analysis.git
cd q4-analysis
# Enter exact same environment
nix develop
# Run analysis
dune exec src/repl.exe < scripts/analysis.t
# Verify identical output
diff outputs/report.csv expected_report.csv
# (No differences)

Nix + Docker for maximum portability:
FROM nixos/nix
WORKDIR /analysis
COPY . .
RUN nix develop --command dune build
CMD ["nix", "develop", "--command", "sh", "-c", "dune exec src/repl.exe < analysis.t"]

Verify reproducibility on every commit:
.github/workflows/reproduce.yml:
name: Verify Reproducibility
on: [push]
jobs:
  reproduce:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: cachix/install-nix-action@v15
      - run: nix develop --command dune exec src/repl.exe < analysis.t
      - run: diff outputs/result.csv expected_result.csv

Use DVC or Git-LFS for data versioning:
# Track data files
dvc add data/sales.csv
git add data/sales.csv.dvc
# Data is versioned alongside code

Trade-off: Perfect reproducibility sometimes costs performance. Strategies include caching expensive pipeline nodes and documenting expected runtime on reference hardware in the intent block. Example:
intent {
  performance: "Optimized for 8-core CPU, 16GB RAM",
  runtime_expected: "5 minutes on reference hardware",
  caching: "Pipeline nodes cached in .cache/"
}
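Content-addressed caching is safe for deterministic pipelines precisely because the same inputs always yield the same outputs, so a cached result never goes stale. A minimal sketch in Python (the `.cache/` layout and the `cached` helper are illustrative, not T's implementation):

```python
import hashlib
import json
import os

CACHE_DIR = ".cache"

def cached(node_name, inputs, compute):
    """Recompute a pipeline node only when its inputs change; reuse the cache otherwise."""
    key = hashlib.sha256(json.dumps([node_name, inputs], sort_keys=True).encode()).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)   # cache hit: skip recomputation
    result = compute(inputs)
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "w") as f:
        json.dump(result, f)
    return result

calls = []
def expensive_total(xs):
    calls.append(1)               # count how often the real work runs
    return sum(xs)

first = cached("total", [1, 2, 3], expensive_total)
second = cached("total", [1, 2, 3], expensive_total)  # served from .cache/
assert first == second == 6
assert len(calls) == 1            # the expensive computation ran exactly once
```

The cache key hashes both the node name and its inputs, so changing either triggers a recomputation automatically.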
See Also:
- Installation Guide — Setting up Nix
- Pipeline Tutorial — Building reproducible workflows
- LLM Collaboration — Intent blocks for AI