ADR 0002 — Scaling Terms (Heteroskedastic CTMs)#

Date: 2026-05-21
Status: Accepted
Deciders: RMKruse
Issue: #69 Scaling Terms: ADR for scaled-baseline API and monotonicity strategy
Parent: #28 Scaling Terms (epic), #33 Architectural extensions


Context#

The shift CTM h(y|x) = h_0(y) + x_d · β is homoskedastic — the conditional transformation differs from the baseline only by a covariate-dependent translation. The heteroskedastic generalisation, mirroring R tram::*(scale=...), multiplies the baseline by a strictly positive covariate-dependent factor:

h(y | x_d, x_s) = h_0(y) · exp(x_s · γ) + x_d · β.

exp(x_s · γ) > 0 keeps the transformation monotone in y whenever h_0 is monotone, but it couples γ to θ_b non-linearly — the gradient ∂h / ∂γ = h_0(y) · x_s · exp(x_s · γ) depends on the current θ_b and γ, so the Jacobian rebuilds every iteration (unlike the constant [B | X] Jacobian of the shift path).

This ADR fixes the six open questions enumerated in #69 before any likelihood / optimiser code is written, so the downstream tracer-bullet slice (#70) and the censoring / predict / vcov / tram-class slices (#71–#77) can be built against a single agreed-on API.


Decision 1 — Public Surface#

Chosen: A single new kwarg scaling= on ConditionalTransformationModel, MLT, and every _TramModel subclass. The kwarg accepts an ndarray of pre-built scaling-design columns, supplied at construction. No new model class is introduced; existing classes route to the scaled-likelihood path whenever self.scaling is not None.

Type signature:

class ConditionalTransformationModel:
    def __init__(
        self,
        basis: BernsteinBasis | InteractionBasis,
        ...,
        scaling: NDArray[np.float64] | None = None,
    ) -> None: ...

The same kwarg is threaded through MLT.__init__ and all _TramModel subclasses (BoxCox, Coxph, Colr, Lm, Survreg). Polr does not accept scaling= in the v0.4 release — ordinal scaling would require a separate ADR (sign of γ, identifiability against the cutpoint block).

Why ndarray of pre-built columns, not a basis-like spec:

  • exp(x_s · γ) is linear-in-γ inside the exponential — there is no basis evaluation step that produces design columns from a raw scaling covariate. Any feature engineering the user wants (polynomial, spline, interaction, dummy expansion) can be done externally and passed as raw columns, identical to how the shift design X is supplied today.

  • The interacting= kwarg from ADR 0001 accepts a basis because the tensor-product structure a(y) b(x) literally requires a basis evaluation on x — that argument does not apply here.

  • An ndarray fixes q_s = scaling.shape[1] at __init__ time, which lets __init__ validate the parameter-vector layout without having to defer to fit() (mirrors how BernsteinBasis(order, support) fixes p up front).

Convention for predict / new data: at predict(...) and any other post-fit method that takes new X_new, a parallel keyword X_scale_new: NDArray | None = None is supplied; its column count must equal self.scaling.shape[1]. Shape mismatch raises ValueError with the same message family as the existing X_new checks.

Convention at fit: y.shape[0] (or CensoredData.n) must equal self.scaling.shape[0]. Mismatch is a ValueError.

Centring: users are responsible for column-centring their scaling design if they want γ = 0 to correspond to the marginal-mean variance. No automatic centring is performed — mirroring how mltpy does not auto-centre X today.

Rejected alternatives:

  • Basis-like spec (parallel to interacting=): rejected for the reasons above; an evaluate(x) interface adds nothing because there is no expansion step.

  • Boolean flag (scaling: bool = False): rejected because the issue requires the scaling is not None detection pattern, and a False default is truthy w.r.t. that idiom. Also forces the column count to be discovered lazily, which complicates __init__ validation.

  • Integer column count (scaling: int | None = None): pragmatic but loses the “ndarray of pre-built columns” framing requested in #69 and would still require X_scale at fit, adding a second kwarg. Pre-built ndarray subsumes this — scaling.shape[1] is the count.


Decision 2 — Parameter-Vector Layout#

Chosen: theta_ = [theta_b | β | γ], length p + q_d + q_s, where

  • theta_b is the Bernstein coefficient vector (length p = basis.order + 1),

  • β is the shift block (length q_d = X.shape[1], 0 if no X),

  • γ is the scaling block (length q_s = scaling.shape[1], absent if scaling is None).

The order is fixed so that the existing theta_[:p] and theta_[p:p+q_d] slices used throughout likelihood.py and model.py remain valid byte-for-byte when scaling is None; the new γ block is strictly appended.

coef_, feature_names_in_, n_free_params_#

  • coef_ returns the same vector as today (the β block). A new property gamma_coef_ returns the γ block; None when scaling is inactive. theta_b is exposed as basis_coef_ (already informally accessible via theta_[:p]).

  • feature_names_in_ continues to name the β columns. A second attribute scaling_feature_names_in_ is populated symmetrically from the scaling ndarray’s column metadata (DataFrame columns) or ["S1", "S2", ...].

  • n_free_params_ = p + q_d + q_s. No change to the formula; the new block adds q_s to the count.

_split_theta generalisation#

The current

def _split_theta(theta, p, X) -> (theta_b, beta): ...

is generalised to

def _split_theta(
    theta: NDArray[np.float64],
    p: int,
    q_d: int,
    q_s: int,
) -> tuple[NDArray[np.float64], NDArray[np.float64] | None, NDArray[np.float64] | None]:
    theta_b = theta[:p]
    beta    = theta[p : p + q_d] if q_d > 0 else None
    gamma   = theta[p + q_d : p + q_d + q_s] if q_s > 0 else None
    return theta_b, beta, gamma

Existing call sites that pass X is None / not None are mechanically rewritten to pass q_d = 0 if X is None else X.shape[1] and q_s = 0 (i.e. unchanged). The shift-only test surface is byte-identical.

A thin wrapper _split_theta_shift(theta, p, X) matching the old two-return signature is not kept — every call site is touched once in the tracer-bullet slice (#70). See Decision 6 for backward-compatibility expectations.

Interaction × Scaling#

Scaling stacks cleanly on top of shift models. Combining scaling= with an InteractionBasis is out of scope for v0.4 and raises ValueError("scaling is not supported with InteractionBasis") at __init__. Both extensions touch the parameter-vector layout non-trivially and need their own integration ADR before being mixed.


Decision 3 — Monotonicity Strategy#

Chosen: The existing D @ theta_b 0 constraint suffices for every base distribution except "exponential"; γ is unconstrained. No new rows are added to constraints.py for the scaled path.

Why γ is unconstrained:

∂h(y | x) / ∂y = h_0'(y) · exp(x_s · γ).

The scaling factor exp(x_s · γ) is strictly positive for every finite γ and every finite x_s. If h_0'(y) 0 (the standard D @ theta_b 0 constraint), then ∂h/∂y 0. Monotonicity in y is therefore inherited from h_0; the scaling block contributes a positive multiplicative factor that cannot flip the sign.

Boundedness: the auglag / SLSQP / trust-constr solvers do not require bounds on γ because the negative log-likelihood is coercive in γ for every base distribution we ship (exponential tails of exp(x_s · γ) in the likelihood penalise drift to ±∞). In practice the solver may take large interior steps early; the AugLag perturb-and-project restart loop already handles infeasibility for θ_b, and γ only affects the likelihood, not the linear constraint surface.

The "exponential" base-distribution corner case:

The exponential link has support [0, ∞), enforced today by adding the row theta_b[0] + X_i · β 0 for each training observation (nonneg_lower=True branch of build_constraints). With scaling, the corresponding constraint becomes

theta_b[0] · exp(x_s_i · γ) + X_d_i · β ≥ 0

non-linear in γ, so the existing linear-constraint scaffolding does not extend directly. Three options were considered:

  1. Add a non-linear constraint to the auglag PHR loop. Possible but doubles the active-set complexity for a rarely-used link.

  2. Restrict theta_b[0] 0 and X_d_i · β 0 for every i. A sufficient (not necessary) condition; the latter is hard to enforce without bounding β.

  3. Forbid base_distribution="exponential" with scaling != None for v0.4. Raise ValueError at MLT.__init__.

Chosen: option (3). The exponential link is the rarest of the seven base distributions and the only one that requires support-feasibility rows; combining it with scaling has no R tram parity to chase (R does not expose this combination either). Lifting the restriction would require a follow-up ADR and a non-linear-constraint extension to _auglag.py.

build_constraints itself does not change in this release. The docstring is updated to record the new precondition (scaling=None when nonneg_lower=True).


Decision 4 — Likelihood-Path Generalisation#

Chosen: Reuse ADR 0001’s Option (a). The private _ll_* and _grad_* censoring-dispatch functions accept a precomputed (h, h_prime, J, J_prime) quadruple rather than re-deriving the design matrix internally. A thin builder function assembles the quadruple from (theta, basis, y, X, scaling) once per likelihood evaluation; the censoring branch then consumes the quadruple uniformly across the shift, interaction, and scaled paths.

Notation. Let

f_i  := exp(0.5 · x_s_i · γ)      (n,)     positive scaling factor
b_i  := B_basis(y_i)              (n, p)   y-basis row
db_i := dB_basis(y_i) / dy        (n, p)   y-basis derivative row.

The factor of 0.5 in the exponent matches mlt’s internal convention (mlt:::tmlt evaluates sterm <- exp(0.5 * <scaling_predict>)), so γ is sign- and magnitude-aligned with R tram’s scaling coefficient. This was discovered while implementing the tracer-bullet slice (#70) — without the 0.5 factor, mltpy’s γ converges to half R’s γ even though the fitted likelihood matches. See mltpy/likelihood.py::_ll_none for the implementation.

Then

h_i        = (b_i · θ_b) · f_i + X_d_i · β              (+ offset_i)
h_prime_i  = (db_i · θ_b) · f_i.

The full Jacobian rows are:

J        = [ B · diag(f) | X_d | h_0(y) · diag(f) · X_s ]   (n, p + q_d + q_s)
J_prime  = [ dB · diag(f) |  0  | h_0'(y) · diag(f) · X_s ]  (n, p + q_d + q_s)

where h_0(y_i) = b_i · θ_b and h_0'(y_i) = db_i · θ_b — both depend on the current θ_b, so J and J_prime rebuild every iteration in the scaled path (unlike the shift path, where [B | X_d] is fixed).

Score / gradient. With s_i = ∂ℓ_i / ∂h_i (already computed by the censoring-specific _ll_* for every link) and q_i = ∂ℓ_i / ∂(log h'_i) = 1 for exact rows / 0 for censored, the full gradient is

∂ℓ / ∂θ  =  J.T · s  +  (J_prime / h_prime).T · q   (weighted analogously).

Both terms already exist in the shift and interaction paths; the only new responsibility is the builder that assembles J / J_prime with the diag(f) factor and the h_0(y) · diag(f) · X_s columns.

Rejected alternative — parallel scale= argument. Adding a scale=... kwarg to each of the four _ll_* and _ll_and_grad_* families would double their surface area for one additional code path. Option (a) — quadruple-in, scalar-out — already absorbs the interaction path and is the natural place for scaling to plug in.

Censoring coverage. Every censoring branch (NONE, RIGHT, LEFT, INTERVAL) receives the same (h, h_prime, J, J_prime) quadruple from the builder; the only change is that interval-censored rows need h_a, h_b and their J-rows at both endpoints. The builder returns (h_a, h_b, J_a, J_b) instead of (h, J) for those rows — mirroring the existing _ll_interval structure.


Decision 5 — Sign Convention vs. R tram#

mltpy parameterises h + X_d · β; R tram::* parameterises h X_d · β (documented for Polr and inherited by the rest of the _TramModel subclasses). mltpy’s β is therefore the negative of R’s β.

For the scaling block, R tram::Lm(..., scale = ~ z) parameterises

h_R(y | x_d, x_s) = h_0(y) · exp(0.5 · x_s · γ_R) − x_d · β_R,

i.e. the sign on β flips but the scaling block enters with the same sign and magnitude as ours. So mltpy’s γ is sign-aligned with R tram’s γno flip required when checking parity. (Confirmed against tram::Lm’s vignette §2 and against mlt:::tmlt in mlt/R/methods.R, which evaluates sterm <- exp(0.5 * predict( model, terms = "bscaling")).)

Documentation requirement for parity fixtures (#70, #77): the fixture-generating R script (reference/generate_reference.R) writes -coef(fit)[shift_block] for the β block and the raw coef(fit)[scale_block] for the γ block. The Python parity test compares element-wise without any further sign manipulation.

A unit test in #70 ratchets this convention: it fits the same toy data under R and mltpy and asserts np.allclose(mltpy.coef_, -r.beta) AND np.allclose(mltpy.gamma_coef_, r.gamma). If the test fails, the convention is wrong, not the implementation.


Decision 6 — Backward Compatibility#

All non-scaling models must be byte-identical to v0.3.x. Concretely:

  • Every existing test in tests/ passes without modification.

  • The shift-path code path (self.scaling is None) does not see any of the new _split_theta machinery: the dispatch branch tests if self.scaling is not None: and routes to the new path; the else branch is the existing implementation, verbatim.

  • theta_, coef_, feature_names_in_, n_free_params_, hessian_, vcov(), estfun(), predict(), residuals(), simulate(), score() all retain their current behaviour for scaling=None.

  • build_constraints, build_constraint_matrices, the auglag / SLSQP / trust-constr dispatch in optimizer.py accept the same arguments; no new required parameters anywhere.

  • The _split_theta rewrite is mechanically equivalent for q_s = 0theta_b, beta, gamma = _split_theta(theta, p, q_d, 0) returns (theta_b, beta, None), and every call site is updated to ignore the trailing None.

The scaling code path is dead code for every existing test until #70 flips the first row.


Stub files#

Alongside this ADR, a scaling= kwarg is added to MLT.__init__ (and to ConditionalTransformationModel.__init__) with a NotImplementedError on any non-None value. This lets downstream slices import the kwarg name without a follow-up rename, and lets test fixtures fail loudly (“scaled likelihood is not yet implemented”) rather than silently returning a shift-only fit.

No corresponding stub is added to _TramModel subclasses yet — those land in #73–#77 alongside the per-class convergence tuning.


Consequences#

  • Slice #70 (tracer bullet) implements _build_scaled_h_J, the exact

    • normal scaled likelihood, and the auglag plumbing for the new parameter block. R parity test ratchets sign convention.

  • Slice #71 adds censoring (RIGHT, LEFT, INTERVAL) and the remaining base distributions to the scaled path.

  • Slice #72 extends predict() (distribution, density, hazard, quantile) — quantile inversion needs the per-row bracket [theta_b[0] · f_i, theta_b[-1] · f_i] + X_d_i · β derived from f_i.

  • Slices #73–#76 wire scaling= through BoxCox, Coxph, Colr, Lm, Survreg.

  • Slice #77 adds vcov / sandwich SE / Wald-test coverage for the γ block.

  • Slice #78 is the vignette covering heteroskedastic regression (Lm + scale) and non-PH survival (Coxph + scale).

Out of scope for v0.4:

  • scaling= combined with InteractionBasis (ADR follow-up).

  • scaling= combined with base_distribution="exponential" (requires a non-linear constraint extension).

  • scaling= on Polr (ordinal sign-convention ADR follow-up).

  • Basis-like spec for the scaling design (deferred until a user-driven need emerges; the ndarray surface is forward-compatible — a basis-like argument could be added as a union without breaking existing callers).