Practical examples of data wrangling with T’s core verbs
T provides six core data manipulation functions in the colcraft
package: select(), filter(), mutate(), arrange(), group_by(), and
summarize(). Each function takes a DataFrame as its first argument
and returns a new DataFrame, which makes the verbs composable with
the pipe operator (|>).
df = read_csv("employees.csv")
-- DataFrame(100 rows x 5 cols: [name, age, score, dept, salary])
nrow(df) -- 100
ncol(df) -- 5
colnames(df) -- ["name", "age", "score", "dept", "salary"]
Select keeps only the specified columns:
-- Select two columns
df |> select("name", "age")
-- DataFrame(100 rows x 2 cols: [name, age])
-- Select a single column
df |> select("name")
-- DataFrame(100 rows x 1 cols: [name])
Error handling:
df |> select("nonexistent")
-- Error(KeyError: "Column(s) not found: nonexistent")
df |> select(42)
-- Error(TypeError: "select() expects string column names")
Filter keeps rows where a predicate returns true:
-- Keep employees older than 30
df |> filter(\(row) row.age > 30)
-- Keep engineering department
df |> filter(\(row) row.dept == "eng")
-- Combine with pipe
df |> filter(\(row) row.dept == "eng") |> nrow
-- 42 (number of engineers)
The predicate receives each row as a record with dot-access to columns.
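Because filter() returns a new DataFrame, conditions can be combined by chaining calls. The sketch below uses only constructs shown above. Whether T also offers a logical-and operator inside a predicate is not covered here, so chaining is used as the conservative option.

```t
-- Engineers older than 30: each filter() narrows the previous result
df |> filter(\(row) row.dept == "eng")
  |> filter(\(row) row.age > 30)
  |> nrow
```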
Mutate adds a new column or replaces an existing one:
-- Add a new column
df |> mutate("age_plus_10", \(row) row.age + 10)
-- DataFrame with new column 'age_plus_10'
-- Replace an existing column
df |> mutate("age", \(row) row.age + 1)
-- DataFrame with updated 'age' column (same column count)
Error handling:
mutate(42, "x", \(r) r)
-- Error(TypeError: "mutate() expects a DataFrame as first argument")
df |> mutate(42, \(r) r)
-- Error(TypeError: "mutate() expects a string column name as second argument")
Arrange sorts a DataFrame by a column:
-- Sort ascending (default)
df |> arrange("age")
-- Sort descending
df |> arrange("age", "desc")
-- Verify sort order
df |> arrange("age") |> select("name") |> \(d) d.name
-- Vector of names, ordered by age
Error handling:
df |> arrange("nonexistent")
-- Error(KeyError: "Column 'nonexistent' not found in DataFrame")
df |> arrange("age", "up")
-- Error(ValueError: "arrange() direction must be "asc" or "desc"")
group_by() creates a grouped DataFrame for subsequent summarization:
df |> group_by("dept")
-- DataFrame(100 rows x 5 cols: [...]) grouped by [dept]
Grouping is only a marker: it doesn’t change the data, but it tells
summarize() how to split the computation.
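One way to see that grouping leaves the data untouched is to compare shapes before and after. This is a sketch that assumes nrow() and colnames() accept a grouped DataFrame the same way they accept a plain one, which this page does not state explicitly.

```t
grouped = df |> group_by("dept")
-- Assumption: nrow() and colnames() also work on a grouped DataFrame
nrow(grouped)     -- should match nrow(df) if grouping is only a marker
colnames(grouped) -- should match colnames(df)
```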
Error handling:
df |> group_by("nonexistent")
-- Error(KeyError: "Column(s) not found: nonexistent")
df |> group_by(42)
-- Error(TypeError: "group_by() expects string column names")
Summarize computes aggregate statistics:
-- Ungrouped: one summary row for the whole DataFrame
df |> summarize("total_rows", \(d) nrow(d))
-- DataFrame(1 rows x 1 cols: [total_rows])
-- Grouped: summarize each dept separately
df |> group_by("dept")
  |> summarize("count", \(g) nrow(g))
-- DataFrame(N rows x 2 cols: [dept, count])
-- One row per group
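The summary function can compute any aggregate of a group's columns, not only a row count. A minimal sketch, assuming mean() accepts a group's column (g.salary) the same way it accepts a column of the full DataFrame:

```t
-- Average salary per department
df |> group_by("dept")
  |> summarize("avg_salary", \(g) mean(g.salary))
-- By analogy with the count example: one row per dept, columns [dept, avg_salary]
```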
The real power is in composing verbs with the pipe operator:
-- Filter, then select, then count rows
df |> filter(\(row) row.age > 25)
  |> select("name", "score")
  |> nrow
-- Add a derived column, then filter on it
df |> mutate("senior", \(row) row.age >= 30)
  |> filter(\(row) row.senior == true)
  |> nrow
-- Filter, select, and sort by score (highest first)
df |> filter(\(row) row.age > 25)
  |> select("name", "score")
  |> arrange("score", "desc")
  |> nrow
-- Group and summarize, then read the result columns
result = df
  |> group_by("dept")
  |> summarize("count", \(g) nrow(g))
result.dept -- Vector of department names
result.count -- Vector of counts per department
Math and stats functions work on DataFrame columns:
-- Column statistics
mean(df.salary) -- mean salary
sd(df.salary) -- salary standard deviation
quantile(df.score, 0.5) -- median score
-- Column transformations
sqrt(df.score) -- Vector of sqrt values
abs(df.balance) -- Vector of absolute values (needs a "balance" column; the employees data above has none)
lm() fits a linear model and exposes its results as fields:
model = lm(df, "salary", "age")
model.slope -- coefficient
model.intercept -- intercept
model.r_squared -- R² goodness of fit
model.n -- number of observations
For complex analyses, wrap your workflow in a pipeline for structure and caching:
p = pipeline {
  raw = read_csv("sales.csv")
  -- Clean
  clean = raw |> filter(\(row) row.amount > 0)
  -- Analyze by region
  by_region = clean
    |> group_by("region")
    |> summarize("total", \(g) sum(g.amount))
  -- Sort results
  ranked = by_region |> arrange("total", "desc")
}
p.ranked -- regions ranked by total sales
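Each step inside the pipeline has a name, which suggests that intermediate results can be inspected like the final one. The sketch below assumes every named step is reachable through dot access in the same way as p.ranked. Only p.ranked is confirmed above.

```t
-- Assumption: intermediate steps are exposed like p.ranked
nrow(p.clean)     -- rows left after dropping non-positive amounts
nrow(p.by_region) -- number of regions
```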
Use explain() to get structured metadata about any value:
e = explain(df)
e.kind -- "dataframe"
e.nrow -- number of rows
e.ncol -- number of columns
e.schema -- column type information
e.na_stats -- NA counts per column
e.example_rows -- sample rows