Statistical Models & Tidy Output

[!IMPORTANT] Native Support Note: T provides native implementations for Linear Models (lm) and PMML-imported Decision Trees and Random Forests. For other advanced modeling (GLMs, Mixed Models, Machine Learning), T uses a polyglot approach where models are trained in R, Python, or Julia nodes and exchanged through PMML or ONNX.

T treats models as first-class objects that can be summarized and evaluated regardless of which runtime created them.

Fitting Models

`lm()` — Linear Regression (Native)

Fits an Ordinary Least Squares (OLS) model. Following T’s “Data First” philosophy, the dataset is the first argument.

-- Positional: data, then formula
model = lm(mtcars, mpg ~ wt + hp)

-- Named args also supported
model = lm(data = mtcars, formula = mpg ~ wt + hp)
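For readers who want to see what an OLS fit computes, the following Python sketch solves the normal equations X'X b = X'y on toy data. It is not T's implementation: the helper names `solve` and `ols` are hypothetical, the data is made up, and a production solver would normally prefer a QR decomposition for numerical stability.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    # Build the augmented matrix [a | b] so row operations apply to both sides.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Return OLS coefficients (intercept first) for predictor rows and response y."""
    x = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(p)]
    return solve(xtx, xty)


# Toy data generated exactly by y = 1 + 2*a + 3*b, so OLS should recover (1, 2, 3).
predictors = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
response = [1 + 2 * a + 3 * b for a, b in predictors]
beta = ols(predictors, response)
print([round(v, 6) for v in beta])
```

Because the toy response has no noise, the recovered coefficients equal the generating ones up to floating-point error.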

Advanced Modeling (Polyglot)

To fit models beyond simple OLS, use R, Python, or Julia nodes within a T pipeline. These nodes can serialize model artifacts through PMML or ONNX depending on the scoring path you want.

Example: Logistic Regression in R

p = pipeline {
    model_node = rn(
        command = <{
            glm(Survived ~ Pclass + Sex + Age, data = titanic, family = binomial())
        }>,
        serializer = ^pmml
    )
}
build_pipeline(p)
model = read_node("model_node")
summary(model) -- Fully supported in T!

Example: Logistic Regression in Python via ONNX

p = pipeline {
    model_node = pyn(
        command = <{
            from sklearn.linear_model import LogisticRegression
            clf = LogisticRegression().fit(X, y)
            clf
        }>,
        serializer = ^onnx
    )
}
build_pipeline(p)
model = read_node("model_node")

Example: Neural Network in Julia (Flux)

Julia nodes are particularly powerful for deep learning. T handles Julia’s “World Age” issues automatically, allowing you to train complex Flux models and export them as ONNX artifacts.

p = pipeline {
    model_node = jln(
        command = <{
            using Flux, ONNX
            model = Chain(Dense(2, 10, relu), Dense(10, 1, sigmoid))
            # ... training logic ...
            model
        }>,
        serializer = ^onnx
    )
}
build_pipeline(p)
model = read_node("model_node")
predict(new_data, model) -- Native ONNX scoring in T

Model Inspection & Diagnostics

[!TIP] Native Convenience: All model inspection and diagnostic functions are implemented natively in T. This means that even if a model was originally trained in R or Python (and imported via PMML), you can perform summaries, calculate residuals, and run hypothesis tests without needing an active R or Python environment. This approach provides significant speed advantages and simplifies high-performance pipelines.

T adopts the broom philosophy: model outputs should be DataFrames or Tidy Dictionaries.

`summary(model)`

Returns a Dict whose _tidy_df field holds the tidy coefficient table: one row per term, with the columns term, estimate, std_error, statistic, and p_value.

s = summary(model)
s._tidy_df
-- # A DataFrame: 3 × 5
--   term         estimate  std_error  statistic  p_value
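As a rough illustration of where the estimate, std_error, and statistic columns come from, here is a Python sketch for a one-predictor regression. The function name `tidy_simple_lm` and the data are hypothetical, and this is not T's code. The p_value column is left out because it needs the Student t distribution, which the Python standard library does not provide.

```python
import math


def tidy_simple_lm(x, y):
    """Return one dict per term with estimate, std_error, and statistic for y ~ x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)  # residual variance with n - 2 degrees of freedom
    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1 / n + x_bar ** 2 / sxx))
    return [
        {"term": "(Intercept)", "estimate": intercept,
         "std_error": se_intercept, "statistic": intercept / se_intercept},
        {"term": "x", "estimate": slope,
         "std_error": se_slope, "statistic": slope / se_slope},
    ]


rows = tidy_simple_lm([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for row in rows:
    print(row)
```

Each statistic is simply the estimate divided by its standard error, which is the quantity a t-test compares against its reference distribution.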

`coef(model)`

A convenience function that returns a two-column DataFrame with just term and estimate.

`fit_stats(model)`

Returns a single-row DataFrame of model-level statistics (R-squared, AIC, BIC, etc.). For PMML decision trees and random forests, this includes tree metadata such as n_trees, n_features, model_type, and mining_function.

stats = fit_stats(model)
-- # A DataFrame: 1 × 15
--   r_squared  adj_r_squared  aic    bic    nobs

`conf_int(model, level = 0.95)`

Computes confidence intervals for model coefficients.

ci = conf_int(model, level = 0.99)
-- # A DataFrame: 3 × 3
--   term         lower     upper
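The interval for each coefficient has the form estimate ± critical value × std_error. The Python sketch below shows that arithmetic under one loud assumption: it uses a normal quantile, which is only a large-sample stand-in for the Student t quantile that exact OLS intervals use. The function name `conf_int_normal` and the input numbers are hypothetical.

```python
from statistics import NormalDist


def conf_int_normal(estimate, std_error, level=0.95):
    """Return (lower, upper) using a normal quantile in place of Student's t."""
    if not 0 < level < 1:
        raise ValueError("level must be strictly between 0 and 1")
    # Two-sided critical value: leave (1 - level) / 2 in each tail.
    q = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - q * std_error, estimate + q * std_error


lo95, hi95 = conf_int_normal(0.6, 0.2828, level=0.95)
lo99, hi99 = conf_int_normal(0.6, 0.2828, level=0.99)
print((lo95, hi95))
print((lo99, hi99))
```

Raising the level widens the interval, because a larger critical value multiplies the same standard error.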

`compare(model1, model2, ...)`

Aligns multiple model coefficient tables into a single wide DataFrame for side-by-side comparison.

comp = compare(m1, m2)
-- Returns DataFrame with columns: estimate_1, std_error_1, ..., estimate_2, ...
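The alignment step can be pictured as a join on the term column with a numeric suffix per model. The Python sketch below is an illustration only: `compare_wide` is a hypothetical helper, the coefficient values are made up, and filling a term that is missing from one model with None is an assumption rather than documented T behavior.

```python
def compare_wide(*tables, columns=("estimate", "std_error")):
    """Align per-model coefficient tables on `term` into one wide table."""
    # Collect terms in first-seen order so the output is deterministic.
    terms = []
    for table in tables:
        for row in table:
            if row["term"] not in terms:
                terms.append(row["term"])
    wide = []
    for term in terms:
        out = {"term": term}
        for k, table in enumerate(tables, start=1):
            match = next((r for r in table if r["term"] == term), None)
            for col in columns:
                # Assumption: a term absent from model k is filled with None.
                out[f"{col}_{k}"] = match[col] if match else None
        wide.append(out)
    return wide


# Made-up coefficient tables for two nested models.
m1 = [{"term": "(Intercept)", "estimate": 1.0, "std_error": 0.1},
      {"term": "wt", "estimate": -2.0, "std_error": 0.2}]
m2 = m1 + [{"term": "hp", "estimate": 0.5, "std_error": 0.05}]
comp = compare_wide(m1, m2)
for row in comp:
    print(row)
```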

`add_diagnostics(data, model)`

Augments the original data with model-based diagnostic columns. The core columns are fitted, resid, and std_resid; the full set also covers leverage and influence measures (hat, sigma, cooksd).

diag = add_diagnostics(mtcars, model)
-- Adds columns: fitted, resid, hat, sigma, cooksd, std_resid
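To make the diagnostic columns concrete, the Python sketch below computes fitted, resid, hat, std_resid, and cooksd for a one-predictor regression using the textbook formulas. It assumes T's columns follow those usual definitions, which the text above does not spell out. The function name `simple_lm_diagnostics` and the data are hypothetical, and the sigma column (a leave-one-out residual scale in broom) is omitted.

```python
import math


def simple_lm_diagnostics(x, y):
    """Return per-row diagnostics for the regression y ~ x."""
    n, p = len(x), 2  # p counts the intercept and the slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    s = math.sqrt(sum(e ** 2 for e in resid) / (n - p))  # residual std. error
    rows = []
    for xi, fi, ei in zip(x, fitted, resid):
        hat = 1 / n + (xi - x_bar) ** 2 / sxx          # leverage
        std_resid = ei / (s * math.sqrt(1 - hat))      # internally studentized
        cooksd = std_resid ** 2 * hat / ((1 - hat) * p)  # Cook's distance
        rows.append({"fitted": fi, "resid": ei, "hat": hat,
                     "std_resid": std_resid, "cooksd": cooksd})
    return rows


diag_rows = simple_lm_diagnostics([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for row in diag_rows:
    print(row)
```

A quick sanity check on any implementation is that the hat values sum to the number of coefficients, here 2.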

`residuals(data, model, type = "response")`

Returns a DataFrame containing the actual response, the fitted values, and the calculated resid (residuals).

res = residuals(mtcars, model, type = "pearson")
-- # A DataFrame: 32 × 3 [actual, fitted, resid]
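The difference between residual types only shows up when the variance depends on the mean, as in a GLM. The Python sketch below contrasts response and Pearson residuals for a binomial model, where the variance function is mu(1 - mu). It is an assumption-laden illustration: `binomial_residuals` and its `kind` parameter are hypothetical names, and the set of types T accepts beyond the two shown above is not documented here.

```python
import math


def binomial_residuals(actual, fitted, kind="response"):
    """Return residuals for 0/1 outcomes and fitted probabilities."""
    if kind == "response":
        # Raw difference between the observed outcome and the fitted mean.
        return [y - mu for y, mu in zip(actual, fitted)]
    if kind == "pearson":
        # Scale by the binomial standard deviation sqrt(mu * (1 - mu)).
        return [(y - mu) / math.sqrt(mu * (1 - mu)) for y, mu in zip(actual, fitted)]
    raise ValueError(f"unsupported residual type: {kind}")


actual = [1, 0]
fitted = [0.5, 0.2]
print(binomial_residuals(actual, fitted, "response"))
print(binomial_residuals(actual, fitted, "pearson"))
```

For an ordinary linear model the variance is constant, so Pearson residuals are the same as response residuals.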

Hypothesis Testing & Diagnostics

`anova(model1, model2, ...)`

Performs Analysis of Variance (ANOVA) comparing two or more nested models.

m1 = lm(mtcars, mpg ~ wt)
m2 = lm(mtcars, mpg ~ wt + hp + qsec)
av = anova(m1, m2)
-- Returns an ANOVA table with Statistics and P-values
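For two nested linear models, the comparison reduces to an F statistic built from each model's residual sum of squares (RSS) and residual degrees of freedom. The Python sketch below shows that formula only. The function name and the RSS values are made up for illustration, and turning the statistic into a p-value requires the F distribution, which the standard library does not include.

```python
def nested_f_statistic(rss_small, df_small, rss_large, df_large):
    """F statistic for a smaller model nested inside a larger one.

    df_small and df_large are residual degrees of freedom, so the smaller
    model (fewer coefficients) has the larger residual df.
    """
    if df_small <= df_large:
        raise ValueError("the smaller model must have more residual df")
    # Drop in RSS per extra coefficient, relative to the larger model's noise level.
    numerator = (rss_small - rss_large) / (df_small - df_large)
    denominator = rss_large / df_large
    return numerator / denominator


# With 32 rows, a 2-coefficient model has 30 residual df and a
# 4-coefficient model has 28. The RSS values here are hypothetical.
f_stat = nested_f_statistic(rss_small=100.0, df_small=30, rss_large=60.0, df_large=28)
print(f_stat)
```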

`wald_test(model, terms, value = 0.0)`

Performs a joint Wald test on a subset of model coefficients.

-- Test if both 'hp' and 'qsec' are jointly equal to zero
w = wald_test(model, terms = ["hp", "qsec"])
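A joint Wald statistic is the quadratic form (b - v)' V⁻¹ (b - v), where b holds the tested coefficients, v the hypothesized value, and V their covariance block. The Python sketch below works this out for exactly two terms, using the closed-form 2×2 inverse. It reports the chi-square form of the p-value, which for 2 degrees of freedom is exp(-W/2). Whether T reports a chi-square or an F version is not stated above, so treat that choice, the function name, and the numbers as assumptions.

```python
import math


def wald_two_terms(estimates, vcov, value=0.0):
    """Return (W, p_value) for a joint test that two coefficients equal `value`."""
    d0, d1 = estimates[0] - value, estimates[1] - value
    (a, b), (c, d) = vcov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance block must be positive definite")
    # Quadratic form with the 2x2 inverse [[d, -b], [-c, a]] / det.
    w = (d * d0 * d0 - (b + c) * d0 * d1 + a * d1 * d1) / det
    p_value = math.exp(-w / 2)  # chi-square survival function with 2 df
    return w, p_value


w_stat, p_val = wald_two_terms([2.0, 1.0], [[1.0, 0.0], [0.0, 0.25]])
print(w_stat, p_val)
```

With an uncorrelated covariance block, W is just the sum of the squared z-scores of the two coefficients: 2² + (1 / 0.5)² = 8.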

`vcov(model)`

Returns the Variance-Covariance matrix of the coefficients as a square DataFrame.

v = vcov(model)

Prediction

The predict(data, model) function performs vectorized predictions natively in T for native, PMML-backed, and ONNX-backed models.

-- Fast, native evaluation in T
-- Even if the model was trained in R or Python (and imported via PMML)
preds = predict(new_data, model)

T supports various link functions for GLMs (imported via PMML), including Logit, Probit, Log, Inverse, and Cloglog.
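Scoring a GLM means computing the linear predictor and then applying the inverse of the link function. The Python sketch below lists the standard inverse links for the five families named above. It illustrates the math only, not T's evaluator: `predict_glm`, the coefficient names, and the values are hypothetical.

```python
import math
from statistics import NormalDist

# Inverse link functions: map the linear predictor eta to the mean response.
INVERSE_LINKS = {
    "logit": lambda eta: 1 / (1 + math.exp(-eta)),
    "probit": lambda eta: NormalDist().cdf(eta),
    "log": lambda eta: math.exp(eta),
    "inverse": lambda eta: 1 / eta,
    "cloglog": lambda eta: 1 - math.exp(-math.exp(eta)),
}


def predict_glm(coefs, row, link):
    """Score one row: linear predictor first, then the inverse link."""
    eta = coefs["(Intercept)"] + sum(coefs[name] * row[name] for name in row)
    return INVERSE_LINKS[link](eta)


# Hypothetical coefficients chosen so the linear predictor is exactly 0.
coefs = {"(Intercept)": -1.0, "x": 0.5}
for link in ("logit", "probit", "log", "cloglog"):
    print(link, predict_glm(coefs, {"x": 2.0}, link))
```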

PMML Trees & Boosting: T can now evaluate PMML-imported Decision Trees, Random Forests, and XGBoost (GBTree) models natively (no external runtime). This includes PMML exports from scikit-learn via sklearn2pmml. Use t_read_pmml() to load the model and predict(df, model) to score new data.
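Evaluating a tree model without an external runtime amounts to walking from the root to a leaf and, for a forest, combining the leaf scores. The Python sketch below shows that idea on a deliberately simplified structure: real PMML TreeModel documents use predicate elements on each node rather than the `field`/`threshold` dictionaries assumed here, and every name and number is hypothetical.

```python
def predict_tree(node, row):
    """Walk a binary tree of nested dicts until a leaf with a score is reached."""
    while "score" not in node:
        # Assumed split rule: go left when the feature is at or below the threshold.
        branch = "left" if row[node["field"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["score"]


def predict_forest(trees, row):
    """Average the per-tree scores, as a regression random forest does."""
    scores = [predict_tree(tree, row) for tree in trees]
    return sum(scores) / len(scores)


# Two toy regression trees with made-up splits and leaf scores.
tree_a = {"field": "wt", "threshold": 3.0,
          "left": {"score": 25.0}, "right": {"score": 15.0}}
tree_b = {"field": "hp", "threshold": 120.0,
          "left": {"score": 27.0},
          "right": {"field": "wt", "threshold": 3.5,
                    "left": {"score": 19.0}, "right": {"score": 13.0}}}

row = {"wt": 2.5, "hp": 150.0}
print(predict_tree(tree_a, row), predict_tree(tree_b, row))
print(predict_forest([tree_a, tree_b], row))
```

Boosted ensembles combine trees by summing their scores instead of averaging them, so only the final aggregation step would change.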

ONNX Native Inference: T can run ONNX models natively through ONNX Runtime using t_read_onnx() plus predict(df, model). The current implementation supports single-input/single-output models and expects the selected numeric feature columns to match the model input width.

Model Interchange: PMML and ONNX

Why PMML?

The Predictive Model Markup Language (PMML) is the bridge between T and other runtimes when you want T-native scoring. It allows:

1. R Integration: Using any R model that has a PMML exporter (e.g. stats::glm, survival::coxph).
2. Python Integration: Using scikit-learn or statsmodels.
3. Reproducibility: Models persist independently of the original runtime code.

Why ONNX?

ONNX is the preferred interchange format when you want broad ML model coverage or faster native inference through ONNX Runtime. It allows:

1. Python ML Export: scikit-learn models via skl2onnx.
2. Native T Loading: Reading models with t_read_onnx(path) and scoring them with predict(data, model).
3. R/Python/Julia Runtime Loading: Reading models via the onnx R package, Python onnxruntime, or Julia ONNXRunTime.
4. Broader Coverage: Neural-network and non-PMML model families that PMML cannot represent well.

Use ^pmml when you want T’s hand-written classical-model evaluator. Use ^onnx when you want a portable model artifact with native ONNX Runtime inference in T or cross-runtime execution in Python, R, or Julia. T ensures that ONNX models trained in Python (scikit-learn) or Julia (Flux) produce consistent results when evaluated natively in T.

[!NOTE] Julia World Age: When using Julia libraries that generate code at runtime (like Flux or Zygote), T automatically wraps execution in Base.invokelatest to prevent “World Age” errors. This makes Julia nodes as robust as Python or R nodes for complex modeling tasks.

Cross-Runtime Consistency

T’s statistical evaluator is verified against R’s reference implementation. Results match R’s broom::tidy() and stats::predict() exactly.

Comparison with R’s broom Package

R (broom) / stats	T equivalent
`broom::tidy(fit)`	`summary(model)`
`broom::glance(fit)`	`fit_stats(model)`
`broom::augment(fit, data)`	`add_diagnostics(df, model)`
`stats::residuals(fit)`	`residuals(df, model)`
`stats::coef(fit)`	`coef(model)`
`stats::vcov(fit)`	`vcov(model)`
`stats::anova(m1, m2)`	`anova(m1, m2)`
`survey::regTermTest`	`wald_test(model, terms)`

Next Steps

Now that you can fit and inspect statistical models, explore how to build reproducible data pipelines and manage projects in T:

Pipeline Tutorial — Learn how to build reproducible, DAG-based data analysis workflows.
PMML Tutorial — Learn the supported PMML workflows for moving models between R, Python, and T.
Project Development — Master T’s project structure and dependency management.
Package Development — Create reusable T libraries.
API Reference — Complete function reference by package.
