Practical examples of data wrangling with T’s core verbs
T provides a powerful suite of data manipulation functions in the
colcraft package, inspired by dplyr and
tidyr. These functions take a DataFrame as their first
argument and return a new DataFrame, making them highly composable with
the pipe operator.
df = read_csv("employees.csv")
-- DataFrame(100 rows x 5 cols: [name, age, score, dept, salary])
nrow(df) -- 100
ncol(df) -- 5
colnames(df) -- ["name", "age", "score", "dept", "salary"]
Use glimpse() for a quick, tidy overview of a
DataFrame’s structure, dimensions, and contents:
df |> glimpse()
-- # A DataFrame: 100 × 5
-- name (String): "Alice", "Bob", "Charlie", ...
-- age (Int): 30, 25, 42, ...
-- score (Float): 88.5, 92.0, 75.2, ...
-- dept (String): "eng", "eng", "sales", ...
-- salary (Float): 85000.0, 92000.0, 65000.0, ...
Unlike print(), which shows a formatted table,
glimpse() lists columns vertically with their types and a
sample of values, making it ideal for wide datasets or quick
exploration.
-- Custom separator (e.g., semicolon-delimited files)
df = read_csv("data.csv", separator = ";")
-- Skip comment lines at the top of a file
df = read_csv("data_with_header.csv", skip_lines = 2)
-- No header row (columns named V1, V2, ...)
df = read_csv("raw_data.csv", skip_header = true)
-- Normalize column names to safe snake_case identifiers
df = read_csv("messy_headers.csv", clean_colnames = true)
The clean_colnames option (or standalone function)
normalizes column names:
-- As a read_csv option
df = read_csv("data.csv", clean_colnames = true)
-- "Growth%" → "growth_percent"
-- "MILLION€" → "million_euro"
-- "café" → "cafe"
-- "A.1" → "a_1"
-- As a standalone function on a DataFrame
df2 = clean_colnames(df)
-- On a list of strings
clean_colnames(["A.1", "A-1"]) -- ["a_1", "a_1_2"]
Duplicate names after cleaning are disambiguated: the first
occurrence stays unchanged; subsequent duplicates get a _2,
_3, etc. suffix.
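For instance, a repeated name illustrates the rule (hypothetical input):
clean_colnames(["Total", "Total", "Total"]) -- ["total", "total_2", "total_3"]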
You can create DataFrames manually from a list of rows using the
dataframe() function. Each row can be a Dict or a named
List.
-- From a list of named Lists (idiomatic row constructor)
df = dataframe([
[name: "Alice", age: 30, score: 88.5],
[name: "Bob", age: 25, score: 92.0]
])
-- DataFrame(2 rows x 3 cols: [name, age, score])
This is particularly useful for:
- Creating lookup tables
- Writing test cases
- Entering small datasets manually
-- Save a DataFrame to CSV
write_csv(df, "output.csv")
-- Save with a custom separator
write_csv(df, "output.tsv", separator = "\t")
select() keeps only the specified columns:
df |> select($name, $age)
-- DataFrame(100 rows x 2 cols: [name, age])
df |> select($name)
-- DataFrame(100 rows x 1 cols: [name])
Error handling:
df |> select($nonexistent)
-- Error(KeyError: "Column(s) not found: nonexistent")
df |> select(42)
-- Error(TypeError: "select() expects $column syntax")
filter() keeps rows where a predicate returns true:
df |> filter($age > 30)
df |> filter($dept == "eng")
-- Combine with pipe
df |> filter($dept == "eng") |> nrow
-- 42 (number of engineers)
The NSE syntax auto-transforms $age > 30 into
\(row) row.age > 30.
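In other words, the following two calls are equivalent; the second form assumes filter() also accepts an explicit predicate lambda, as mutate() does:
df |> filter($age > 30)
df |> filter(\(row) row.age > 30)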
mutate() adds a new column or replaces an existing one:
-- Named-arg NSE syntax: $col = NSE expression
df |> mutate($age_plus_10 = $age + 10)
df |> mutate($bonus = $salary * 0.1)
-- Positional NSE for column name with explicit lambda
df |> mutate($age_plus_10, \(row) row.age + 10)
Error handling:
mutate(42, $x, \(r) r)
-- Error(TypeError: "mutate() expects a DataFrame as first argument")
df |> mutate(42, \(r) r)
-- Error(TypeError: "mutate() expects $column = expr syntax")
arrange() sorts a DataFrame by a column:
df |> arrange($age)
df |> arrange($age, "desc")
Error handling:
df |> arrange($nonexistent)
-- Error(KeyError: "Column 'nonexistent' not found in DataFrame")
df |> arrange($age, "up")
-- Error(ValueError: "arrange() direction must be 'asc' or 'desc'")
group_by() creates a grouped DataFrame for subsequent summarization:
df |> group_by($dept)
-- DataFrame(100 rows x 5 cols: [...]) grouped by [dept]
Grouping is a marker — it doesn’t change the data, but tells
summarize() and mutate() how to split the
computation.
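Because grouping leaves the rows untouched, row counts are unchanged; using the 100-row employees DataFrame from above:
df |> group_by($dept) |> nrow -- 100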
Error handling:
df |> group_by($nonexistent)
-- Error(KeyError: "Column(s) not found: nonexistent")
df |> group_by(42)
-- Error(TypeError: "group_by() expects $column syntax")
summarize() computes aggregate statistics:
-- Named-arg NSE syntax: $col = NSE aggregation
df |> summarize($total_score = sum($score))
-- Positional NSE with lambda
df |> summarize($total_rows, \(d) nrow(d))
-- DataFrame(1 rows x 1 cols: [total_rows])
-- Named-arg NSE syntax
df |> group_by($dept)
|> summarize($count = nrow($dept), $avg_score = mean($score))
-- Positional NSE with lambda
df |> group_by($dept)
|> summarize($count, \(g) nrow(g))
-- DataFrame(N rows x 2 cols: [dept, count])
-- One row per group
T includes a robust implementation of tidying functions for reshaping your data.
pivot_longer() pivots data from wide to long format by consolidating multiple columns into name-value pairs.
-- Pivot age and score into a single "measure" column
df |> pivot_longer($age, $score, names_to = "measure", values_to = "val")
-- Resulting columns: name, dept, salary, measure, val
pivot_wider() pivots data from long to wide format by spreading unique values across multiple columns.
-- Widen by department names, using salary as values
df |> select($name, $dept, $salary)
|> pivot_wider(names_from = $dept, values_from = $salary)
separate() splits a single character column into multiple columns based on a delimiter.
-- Split a "date" column into "year", "month", "day"
df |> separate($date, into = ["year", "month", "day"], sep = "-")
unite() combines multiple columns into a single character column.
-- Combine year, month, day into a single "full_date" column
df |> unite("full_date", $year, $month, $day, sep = "-")
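Since separate() and unite() are inverses, chaining them round-trips a column; this sketch assumes a dash-delimited date column as in the examples above:
-- Split, then reassemble, a "date" column
df |> separate($date, into = ["year", "month", "day"], sep = "-")
|> unite("full_date", $year, $month, $day, sep = "-")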
nest() groups data and packs selected columns into a single nested list-column containing DataFrames.
-- Nest columns mpg and hp into a column named "data"
-- Other columns (cyl, etc.) remain as grouping keys
nt = df |> nest($mpg, $hp)
-- Use selection helpers to nest all columns starting with 's'
nt = df |> nest(data = starts_with("s"))
-- Automatic grouping behavior:
-- If already grouped, nest() without arguments nests all non-grouping columns.
nt = df |> group_by($cyl) |> nest()
-- Resulting schema: [cyl, data] where data contains all other columns
unnest() expands a nested list-column back into regular rows and columns.
-- Unnest the "data" column back into individual columns
df_flat = nt |> unnest($data)
drop_na() removes rows containing missing values in the specified columns.
-- Remove rows where Ozone or Solar.R are NA
df |> drop_na($Ozone, $`Solar.R`) -- Dots in names require backticks
replace_na() replaces missing values with a specific constant for each column.
-- Replace NAs in Ozone with 0
df |> replace_na([Ozone: 0, `Solar.R`: 0])
fill() propagates non-missing values in a given direction.
-- Fill missing categories downwards
df |> fill($category, .direction = "down")
complete() ensures all combinations of the specified columns exist in the data.
-- Ensure every item has a row for every group
df |> complete($group, $item_id, fill = [value: 0])
expand() generates all unique combinations of variables found in a dataset.
-- Every combination of type and size, regardless of whether it exists in df
df |> expand($type, $size)
crossing() creates a new DataFrame from the cross-product of inputs.
-- Independent of any existing DataFrame
crossing(x = [1, 2, 3], y = ["a", "b"])
nesting() is used inside expand() or complete() to
include only combinations already present in the data.
-- Only combinations of type and size that actually appear in the data
df |> expand(nesting($type, $size))
-- In complete(): use nesting() to restrict some columns to existing combinations
df |> complete($group, nesting($item_id, $item_name))
Factors represent categorical data with defined levels. They respect their level order when sorting, rather than alphabetical order.
factor() creates a factor from a vector or column, optionally specifying levels.
-- Create a factor from a vector
let f = factor(["low", "high", "medium"], levels = ["low", "medium", "high"])
-- Create a factor column using mutate
df |> mutate($size_fct = factor($size, levels = ["small", "medium", "large"]))
as_factor() coerces a column to a factor (categorical) value, deriving levels from the unique values present.
df |> mutate($category = as_factor($category))
fct_infreq() reorders factor levels in descending order of frequency.
df |> mutate($category = fct_infreq($category))
When mutate() is applied to a grouped DataFrame, the
function receives the group sub-DataFrame instead of a single row. The
result is broadcast to every row in that group:
-- Add group size to each row
df |> group_by($dept)
|> mutate($dept_size, \(g) nrow(g))
-- Compute group mean and broadcast
df |> group_by($dept)
|> mutate($mean_score, \(g) mean(g.score))
-- Chain grouped mutate with filter
df |> group_by($dept)
|> mutate($dept_size, \(g) nrow(g))
|> filter($dept_size > 2)
-- Keep only rows from large departments
Note: Grouped mutate preserves the group keys on the resulting DataFrame.
The real power is in composing verbs with the pipe operator:
df |> filter($age > 25)
|> select($name, $score)
|> nrow
df |> mutate($senior, \(row) row.age >= 30)
|> filter($senior == true)
|> nrow
df |> filter($age > 25)
|> select($name, $score)
|> arrange($score, "desc")
|> nrow
result = df
|> group_by($dept)
|> summarize($count, \(g) nrow(g))
result.dept -- Vector of department names
result.count -- Vector of counts per department
When working with data pipelines, operations can fail (missing files,
invalid data, etc.). The maybe-pipe ?|> forwards errors
to a recovery function instead of short-circuiting:
-- Normal pipe short-circuits on error:
error("missing") |> \(x) x + 1 -- Error (never calls function)
-- Maybe-pipe forwards the error:
with_default = \(x) if (is_error(x)) 0 else x
error("missing") ?|> with_default -- 0
-- Provide fallback values for errors
with_default = \(result) if (is_error(result)) 0 else result
value1 = 5 ?|> with_default -- 5
value2 = error("missing data") ?|> with_default -- 0
recovery = \(x) if (is_error(x)) 0 else x
increment = \(x) x + 1
-- ?|> forwards error to recovery, then |> continues normally
error("fail") ?|> recovery |> increment -- 1
Math and stats functions work on DataFrame columns:
-- Column statistics
mean(df.salary) -- mean salary
sd(df.salary) -- salary standard deviation
quantile(df.score, 0.5) -- median score
-- Column transformations
sqrt(df.score) -- Vector of sqrt values
abs(df.balance) -- Vector of absolute values
By default, aggregation functions error when
encountering NA values, forcing explicit handling. Use
na_rm = true to skip NA values:
-- Default: errors on NA (forces explicit handling)
mean(df.salary) -- Error if any salary is NA
-- Skip NA values with na_rm = true
mean(df.salary, na_rm = true) -- mean of non-NA salaries
sum(df.amount, na_rm = true) -- sum of non-NA amounts
sd(df.score, na_rm = true) -- sd of non-NA scores
quantile(df.age, 0.5, na_rm = true) -- median of non-NA ages
cor(df.x, df.y, na_rm = true) -- correlation with pairwise deletion
model = lm(data = df, formula = salary ~ age)
model.slope -- coefficient
model.intercept -- intercept
model.r_squared -- R² goodness of fit
model.n -- number of observations
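As a sketch, these fields can be combined for a manual prediction; this assumes the fitted model is simply y = intercept + slope * x (no built-in predict function is shown here):
-- Hypothetical manual prediction: salary for a 40-year-old
model.intercept + model.slope * 40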
For complex analyses, wrap your workflow in a pipeline for structure and caching:
p = pipeline {
raw = read_csv("sales.csv")
-- Clean
clean = raw |> filter($amount > 0)
-- Analyze by region
by_region = clean
|> group_by($region)
|> summarize($total, \(g) sum(g.amount))
-- Sort results
ranked = by_region |> arrange($total, "desc")
}
p.ranked -- regions ranked by total sales
Use explain() to get structured metadata about any
value:
e = explain(df)
e.kind -- "dataframe"
e.nrow -- number of rows
e.ncol -- number of columns
e.schema -- column type information
e.na_stats -- NA counts per column
e.example_rows -- sample rows
Window functions operate on vectors without collapsing them, useful for ranking, shifting, and cumulative calculations.
-- Rank values (ties broken by position)
row_number([10, 30, 20]) -- Vector[1, 3, 2]
-- Rank with ties (minimum rank, with gaps)
min_rank([1, 1, 2, 2, 2]) -- Vector[1, 1, 3, 3, 3]
-- Dense rank (no gaps)
dense_rank([1, 1, 2, 2]) -- Vector[1, 1, 2, 2]
-- Divide into n groups
ntile([1, 2, 3, 4, 5, 6], 3) -- Vector[1, 1, 2, 2, 3, 3]
-- Previous value (lagged by 1)
lag([10, 20, 30, 40]) -- Vector[NA, 10, 20, 30]
-- Next value (lead by 1)
lead([10, 20, 30, 40]) -- Vector[20, 30, 40, NA]
-- Custom offset
lag([10, 20, 30, 40], 2) -- Vector[NA, NA, 10, 20]
cumsum([1, 2, 3, 4]) -- Vector[1, 3, 6, 10]
cummin([5, 3, 4, 1]) -- Vector[5, 3, 3, 1]
cummax([1, 3, 2, 5]) -- Vector[1, 3, 3, 5]
cummean([2, 4, 6]) -- Vector[2.0, 3.0, 4.0]
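Window functions pair naturally with mutate(); this sketch assumes they can appear in mutate()'s NSE expressions:
-- Rank rows by score (hypothetical)
df |> mutate($score_rank = min_rank($score))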
All window functions handle NA values consistently:
-- Ranking with NA
row_number([3, NA, 1]) -- Vector[2, NA, 1]
min_rank([3, NA, 1, 3]) -- Vector[2, NA, 1, 2]
-- Cumulative with NA (propagation)
cumsum([1, NA, 3]) -- Vector[1, NA, NA]
cummax([1, NA, 5]) -- Vector[1, NA, NA]
Now that you’ve mastered the core data manipulation verbs, explore specialized data types and advanced modeling in T.