How T enables structured, auditable collaboration between humans and Large Language Models for data analysis.
Current LLM-assisted coding is largely unstructured: analytical goals live in chat transcripts, assumptions stay implicit, and generated scripts are monolithic. The result: LLM-generated code is useful for prototyping but hard to maintain, audit, or regenerate.
T provides a structured way to onboard LLMs to new projects via the
`t init` command. When a project is initialized, T generates
two essential files that should be provided to the AI agent at the start
of any conversation:
**AGENTS.md**: This file tells the LLM exactly how the current project is structured and what the coding conventions are (e.g., “Nix is mandatory”, “Use Arrow for data transfer”). It serves as the project’s “rules of engagement” for AI assistants.
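A hypothetical excerpt of such a file (the actual contents are generated per project) might look like:

```
# AGENTS.md (hypothetical excerpt)

## Coding conventions
- Nix is mandatory: run all commands inside the project's Nix shell.
- Use Arrow for data transfer between processes.
- Begin every analysis script with an `intent { ... }` block.
```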
**T-LANGUAGE-REFERENCE.md**: To handle different LLM context windows and project needs, T allows you to select a “Context Level” for this language reference during initialization:
| Level | Description | Use Case |
|---|---|---|
| small | Core syntax and top 20 functions | Simple scripts, low-context models |
| medium | Exhaustive standard library index (Default) | General analysis and pipeline development |
| full | Comprehensive manual with detailed examples | Complex logic and package development |
| huge | Concatenated documentation of the entire ecosystem | Deep debugging and system-level tasks |
By providing these files, you ensure the LLM has the exact technical context needed to generate valid, idiomatic T code without trial-and-error.
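In practice, the onboarding flow might look like the following (a hypothetical session; the actual prompts and options are whatever `t init` presents):

```
$ t init
# Choose a context level for T-LANGUAGE-REFERENCE.md:
#   small | medium (default) | full | huge
# Generated: AGENTS.md, T-LANGUAGE-REFERENCE.md
```

Both files are then provided to the AI agent at the start of each new conversation.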
T treats LLMs as first-class collaborators with structured boundaries:
Intent blocks are machine-readable metadata that capture analytical goals.
```
intent {
  description: "Analyze customer churn patterns by age group",
  goal: "Identify age ranges with highest churn risk"
}

-- Analysis code follows...
```
```
intent {
  -- High-level description
  description: "Customer lifetime value segmentation",
  goal: "Segment customers into value tiers for targeted marketing",

  -- Data specifications
  data_source: "customers.csv from CRM export 2023-12-31",
  required_columns: ["customer_id", "total_spend", "purchase_count", "signup_date"],

  -- Analytical assumptions
  assumptions: [
    "Churn defined as no purchase in 90 days",
    "LTV calculated as total_spend / months_active",
    "Test accounts excluded"
  ],

  -- Business constraints
  constraints: [
    "Minimum 50 customers per segment",
    "Segment labels: high_value, medium_value, low_value",
    "Thresholds: high > $1000, medium > $500"
  ],

  -- Data quality rules
  validation: [
    "total_spend > 0",
    "purchase_count > 0",
    "signup_date between 2020-01-01 and 2023-12-31"
  ],

  -- Expected outputs
  outputs: [
    "customer_segments.csv with columns: customer_id, segment, ltv",
    "segment_summary.csv with counts and average LTV per segment"
  ],

  -- Metadata
  created: "2024-01-15",
  author: "Marketing Team",
  llm_assistant: "GPT-4",
  version: "1.0"
}
```
```
-- LLM generates implementation based on intent
analysis = pipeline {
  raw = read_csv("customers.csv", clean_colnames = true)

  validated = raw
    |> filter($total_spend > 0)
    |> filter($purchase_count > 0)

  with_ltv = validated
    |> mutate($ltv, \(row) row.total_spend / months_since(row.signup_date))

  segmented = with_ltv
    |> mutate($segment, \(row)
         if (row.ltv > 1000) "high_value"
         else if (row.ltv > 500) "medium_value"
         else "low_value"
       )

  summary = segmented
    |> group_by($segment)
    |> summarize($count = nrow($segment), $avg_ltv = mean($ltv))
}

write_csv(analysis.segmented, "customer_segments.csv")
write_csv(analysis.summary, "segment_summary.csv")
```
Benefits:
- LLM understands exact requirements
- Human can verify the LLM understood correctly
- Future LLMs can regenerate code from intent
- Intent serves as documentation
- Changes to intent are versioned (Git)
Instead of generating entire scripts, LLMs generate pipeline nodes.
Human prompt: “Analyze sales by region”
LLM generates (entire script):
```python
import pandas as pd

df = pd.read_csv("sales.csv")
df = df[df['amount'] > 0]
df_grouped = df.groupby('region')['amount'].sum()
df_grouped = df_grouped.sort_values(ascending=False)
print(df_grouped)
```

Problems:
- If requirements change, the LLM rewrites everything
- No separation between data loading, cleaning, and analysis
- Hard to modify one step without breaking others
Human writes intent:
```
intent {
  description: "Sales analysis by region",
  steps: {
    load: "Load sales.csv",
    clean: "Remove zero/negative amounts",
    analyze: "Sum revenue by region, sort descending"
  }
}
```
LLM generates pipeline nodes:
```
analysis = pipeline {
  -- Node 1: Load (stable)
  raw = read_csv("sales.csv")

  -- Node 2: Clean (can regenerate independently)
  cleaned = raw |> filter($amount > 0)

  -- Node 3: Analyze (can regenerate independently)
  by_region = cleaned
    |> group_by($region)
    |> summarize($total = sum($amount))
    |> arrange($total, "desc")
}
```
Benefits:
- Change request “Also filter by date”: the LLM only regenerates the cleaned node; raw and by_region are unchanged
- Local reasoning: each node is independently understandable
- Cacheable: unchanged nodes don’t re-execute
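For example, the “Also filter by date” request might be satisfied by regenerating only the cleaned node (a sketch; the date column name and cutoff are assumptions, not part of the original example):

```
-- Regenerated Node 2: still removes bad amounts, now also filters by date
cleaned = raw
  |> filter($amount > 0)
  |> filter($date >= "2023-01-01")
```

The raw and by_region nodes are untouched, so their cached results remain valid.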
Step 1: Human writes intent
```
intent {
  description: "Customer cohort analysis",
  cohort_definition: "First purchase month",
  metric: "Average order value by cohort",
  timeframe: "2023-01-01 to 2023-12-31"
}
```
Step 2: LLM generates implementation
```
cohort_analysis = pipeline {
  orders = read_csv("orders.csv")
  -- LLM fills in details based on intent
}
```
Step 3: Human reviews, provides feedback
"Include only completed orders"
Step 4: LLM updates (localized change)
```
cleaned = orders |> filter($status == "completed")
```
Step 1: Human provides data sample
```
sample = read_csv("data.csv")
explain(sample)
-- DataFrame(100 rows x 5 cols: [date, product, region, quantity, price])
```
Step 2: Human requests analysis
"Calculate total revenue by product, show top 10"
Step 3: LLM generates with intent
```
intent {
  description: "Top 10 products by revenue",
  data: "data.csv with date, product, region, quantity, price",
  computation: "revenue = quantity * price, group by product, sort descending, top 10"
}

top_products = sample
  |> mutate($revenue = $quantity * $price)
  |> group_by($product)
  |> summarize($total_revenue = sum($revenue))
  |> arrange($total_revenue, "desc")
  |> head(10)
```
Iteration 1: Basic implementation
```
intent { description: "Average sales by month" }

monthly = sales |> group_by($month) |> summarize($avg = mean($amount))
```
Iteration 2: Add NA handling
```
intent {
  description: "Average sales by month",
  requirements: "Handle missing amounts"
}

monthly = sales |> group_by($month) |> summarize($avg = mean($amount, na_rm = true))
```
Iteration 3: Add validation
```
intent {
  description: "Average sales by month",
  requirements: "Handle missing amounts, exclude zero sales"
}

monthly = sales
  |> filter($amount > 0)
  |> group_by($month)
  |> summarize($avg = mean($amount, na_rm = true))
```
Each iteration: Intent updated, LLM regenerates, human verifies.
T provides introspection functions for LLM context:
```
df = read_csv("customers.csv")

explain(df)
-- "DataFrame(1000 rows x 5 cols: [id, name, age, city, ltv])"

-- JSON format for LLM consumption
explain_json(df)
-- {"type": "DataFrame", "rows": 1000, "columns": [...], "sample": [...]}
```
```
i = intent { description: "Analysis", goal: "Insights" }

intent_fields(i)
-- {description: "Analysis", goal: "Insights"}

intent_get(i, "description")
-- "Analysis"
```
```
p = pipeline {
  x = 10
  y = x * 2
  z = y + 5
}

pipeline_nodes(p)
-- ["x", "y", "z"]

pipeline_deps(p, "z")
-- ["y"]

-- LLM can understand dependency graph
```
Human: “I need to analyze employee salaries by department and identify outliers”
LLM Response:
```
intent {
  description: "Employee salary analysis with outlier detection",
  data_source: "employees.csv",
  required_columns: ["employee_id", "department", "salary"],
  analysis_steps: [
    "Calculate statistics by department (mean, SD)",
    "Identify outliers (>2 SD from department mean)",
    "Generate summary report"
  ],
  outlier_definition: "Salary > department_mean + 2*department_sd",
  outputs: [
    "department_stats.csv: mean, sd, count per department",
    "outliers.csv: employees with salaries > 2 SD from mean"
  ],
  created: "2024-01-15",
  llm_assistant: "Claude"
}
```
```
salary_analysis = pipeline {
  -- Load data
  employees = read_csv("employees.csv", clean_colnames = true)

  -- Department statistics
  dept_stats = employees
    |> group_by($department)
    |> summarize($mean_salary = mean($salary), $sd_salary = sd($salary), $count = nrow($department))

  -- Identify outliers
  -- (Join would be manual or use a join function if available)
  -- For now, calculate globally
  overall_mean = mean(employees.salary, na_rm = true)
  overall_sd = sd(employees.salary, na_rm = true)
  threshold = overall_mean + 2 * overall_sd

  outliers = employees
    |> filter($salary > threshold)
    |> select($employee_id, $department, $salary)
}

-- Export results
write_csv(salary_analysis.dept_stats, "department_stats.csv")
write_csv(salary_analysis.outliers, "outliers.csv")

print("Analysis complete:")
print("Department stats: department_stats.csv")
print("Outliers: outliers.csv")
```
Human Review: “Good! But I want outliers per department, not overall”
LLM Update (regenerates the outliers node):
```
-- Update: Per-department outliers
with_dept_stats = employees
  |> group_by($department)
  |> mutate($dept_mean, \(g) mean(g.salary, na_rm = true))
  |> mutate($dept_sd, \(g) sd(g.salary, na_rm = true))

outliers = with_dept_stats
  |> filter($salary > $dept_mean + 2 * $dept_sd)
  |> select($employee_id, $department, $salary, $dept_mean, $dept_sd)
```
Note: only the outlier computation changed (the outliers node was regenerated, with a new helper node with_dept_stats); the dept_stats and employees nodes are unchanged.
Intent blocks + version control = complete audit trail:
```shell
git log --oneline intent_blocks/
abc123 Update: Exclude test accounts from churn analysis
def456 Add validation: minimum transaction amount $1
789ghi Initial: Customer churn analysis

git show abc123:src/pipeline.t
# Shows exactly what assumptions changed and why
```

See Also:
- Reproducibility: Nix for reproducible environments
- Examples: Intent-driven analysis examples
- Pipeline Tutorial: Pipeline structure