# ADR 0002 — Scaling Terms (Heteroskedastic CTMs) **Date:** 2026-05-21 **Status:** Accepted **Deciders:** RMKruse **Issue:** [#69 Scaling Terms: ADR for scaled-baseline API and monotonicity strategy](https://github.com/RMKruse/mltpy/issues/69) **Parent:** [#28 Scaling Terms (epic)](https://github.com/RMKruse/mltpy/issues/28), [#33 Architectural extensions](https://github.com/RMKruse/mltpy/issues/33) --- ## Context The shift CTM `h(y|x) = h_0(y) + x_d · β` is homoskedastic — the conditional transformation differs from the baseline only by a covariate-dependent translation. The heteroskedastic generalisation, mirroring R `tram::*(scale=...)`, multiplies the baseline by a strictly positive covariate-dependent factor: h(y | x_d, x_s) = h_0(y) · exp(x_s · γ) + x_d · β. `exp(x_s · γ) > 0` keeps the transformation monotone in `y` whenever `h_0` is monotone, but it couples `γ` to `θ_b` non-linearly — the gradient `∂h / ∂γ = h_0(y) · x_s · exp(x_s · γ)` depends on the current `θ_b` *and* `γ`, so the Jacobian rebuilds every iteration (unlike the constant `[B | X]` Jacobian of the shift path). This ADR fixes the six open questions enumerated in #69 *before* any likelihood / optimiser code is written, so the downstream tracer-bullet slice (#70) and the censoring / predict / vcov / tram-class slices (#71–#77) can be built against a single agreed-on API. --- ## Decision 1 — Public Surface **Chosen:** A single new kwarg `scaling=` on `ConditionalTransformationModel`, `MLT`, and every `_TramModel` subclass. The kwarg accepts an **ndarray of pre-built scaling-design columns**, supplied at construction. No new model class is introduced; existing classes route to the scaled-likelihood path whenever `self.scaling is not None`. **Type signature:** ```python class ConditionalTransformationModel: def __init__( self, basis: BernsteinBasis | InteractionBasis, ..., scaling: NDArray[np.float64] | None = None, ) -> None: ... ``` The same kwarg is threaded through `MLT.__init__` and all `_TramModel` subclasses (`BoxCox`, `Coxph`, `Colr`, `Lm`, `Survreg`). `Polr` does **not** accept `scaling=` in the v0.4 release — ordinal scaling would require a separate ADR (sign of `γ`, identifiability against the cutpoint block). **Why ndarray of pre-built columns, not a basis-like spec:** - `exp(x_s · γ)` is linear-in-`γ` *inside* the exponential — there is no basis evaluation step that produces design columns from a raw scaling covariate. Any feature engineering the user wants (polynomial, spline, interaction, dummy expansion) can be done externally and passed as raw columns, identical to how the shift design `X` is supplied today. - The `interacting=` kwarg from ADR 0001 accepts a basis because the tensor-product structure `a(y) ⊗ b(x)` literally requires a basis evaluation on `x` — that argument does not apply here. - An ndarray fixes `q_s = scaling.shape[1]` at `__init__` time, which lets `__init__` validate the parameter-vector layout without having to defer to `fit()` (mirrors how `BernsteinBasis(order, support)` fixes `p` up front). **Convention for predict / new data:** at `predict(...)` and any other post-fit method that takes new `X_new`, a parallel keyword `X_scale_new: NDArray | None = None` is supplied; its column count must equal `self.scaling.shape[1]`. Shape mismatch raises `ValueError` with the same message family as the existing `X_new` checks. **Convention at fit:** `y.shape[0]` (or `CensoredData.n`) must equal `self.scaling.shape[0]`. Mismatch is a `ValueError`. **Centring:** users are responsible for column-centring their scaling design if they want `γ = 0` to correspond to the marginal-mean variance. No automatic centring is performed — mirroring how mltpy does not auto-centre `X` today. **Rejected alternatives:** - *Basis-like spec* (parallel to `interacting=`): rejected for the reasons above; an `evaluate(x)` interface adds nothing because there is no expansion step. - *Boolean flag* (`scaling: bool = False`): rejected because the issue requires the `scaling is not None` detection pattern, and a `False` default is truthy w.r.t. that idiom. Also forces the column count to be discovered lazily, which complicates `__init__` validation. - *Integer column count* (`scaling: int | None = None`): pragmatic but loses the "ndarray of pre-built columns" framing requested in #69 and would still require `X_scale` at fit, adding a second kwarg. Pre-built ndarray subsumes this — `scaling.shape[1]` is the count. --- ## Decision 2 — Parameter-Vector Layout **Chosen:** `theta_ = [theta_b | β | γ]`, length `p + q_d + q_s`, where - `theta_b` is the Bernstein coefficient vector (length `p = basis.order + 1`), - `β` is the shift block (length `q_d = X.shape[1]`, 0 if no `X`), - `γ` is the scaling block (length `q_s = scaling.shape[1]`, absent if `scaling is None`). The order is fixed so that the existing `theta_[:p]` and `theta_[p:p+q_d]` slices used throughout `likelihood.py` and `model.py` remain valid *byte-for-byte* when `scaling is None`; the new `γ` block is strictly appended. ### `coef_`, `feature_names_in_`, `n_free_params_` - `coef_` returns the same vector as today (the `β` block). A new property `gamma_coef_` returns the `γ` block; `None` when scaling is inactive. `theta_b` is exposed as `basis_coef_` (already informally accessible via `theta_[:p]`). - `feature_names_in_` continues to name the `β` columns. A second attribute `scaling_feature_names_in_` is populated symmetrically from the scaling ndarray's column metadata (DataFrame columns) or `["S1", "S2", ...]`. - `n_free_params_` = `p + q_d + q_s`. No change to the formula; the new block adds `q_s` to the count. ### `_split_theta` generalisation The current ```python def _split_theta(theta, p, X) -> (theta_b, beta): ... ``` is generalised to ```python def _split_theta( theta: NDArray[np.float64], p: int, q_d: int, q_s: int, ) -> tuple[NDArray[np.float64], NDArray[np.float64] | None, NDArray[np.float64] | None]: theta_b = theta[:p] beta = theta[p : p + q_d] if q_d > 0 else None gamma = theta[p + q_d : p + q_d + q_s] if q_s > 0 else None return theta_b, beta, gamma ``` Existing call sites that pass `X is None / not None` are mechanically rewritten to pass `q_d = 0 if X is None else X.shape[1]` and `q_s = 0` (i.e. unchanged). The shift-only test surface is byte-identical. A thin wrapper `_split_theta_shift(theta, p, X)` matching the old two-return signature is **not** kept — every call site is touched once in the tracer-bullet slice (#70). See Decision 6 for backward-compatibility expectations. ### Interaction × Scaling Scaling stacks cleanly on top of *shift* models. Combining `scaling=` with an `InteractionBasis` is **out of scope for v0.4** and raises `ValueError("scaling is not supported with InteractionBasis")` at `__init__`. Both extensions touch the parameter-vector layout non-trivially and need their own integration ADR before being mixed. --- ## Decision 3 — Monotonicity Strategy **Chosen:** The existing `D @ theta_b ≥ 0` constraint suffices for every base distribution except `"exponential"`; `γ` is unconstrained. No new rows are added to `constraints.py` for the scaled path. **Why `γ` is unconstrained:** ∂h(y | x) / ∂y = h_0'(y) · exp(x_s · γ). The scaling factor `exp(x_s · γ)` is strictly positive for every finite `γ` and every finite `x_s`. If `h_0'(y) ≥ 0` (the standard `D @ theta_b ≥ 0` constraint), then `∂h/∂y ≥ 0`. Monotonicity in `y` is therefore inherited from `h_0`; the scaling block contributes a positive multiplicative factor that cannot flip the sign. **Boundedness:** the auglag / SLSQP / trust-constr solvers do not require bounds on `γ` because the negative log-likelihood is coercive in `γ` for every base distribution we ship (exponential tails of `exp(x_s · γ)` in the likelihood penalise drift to ±∞). In practice the solver may take large interior steps early; the AugLag perturb-and-project restart loop already handles infeasibility for `θ_b`, and `γ` only affects the likelihood, not the linear constraint surface. **The `"exponential"` base-distribution corner case:** The exponential link has support `[0, ∞)`, enforced today by adding the row `theta_b[0] + X_i · β ≥ 0` for each training observation (`nonneg_lower=True` branch of `build_constraints`). With scaling, the corresponding constraint becomes theta_b[0] · exp(x_s_i · γ) + X_d_i · β ≥ 0 — **non-linear in `γ`**, so the existing linear-constraint scaffolding does not extend directly. Three options were considered: 1. Add a non-linear constraint to the auglag PHR loop. Possible but doubles the active-set complexity for a rarely-used link. 2. Restrict `theta_b[0] ≥ 0` *and* `X_d_i · β ≥ 0` for every `i`. A sufficient (not necessary) condition; the latter is hard to enforce without bounding `β`. 3. **Forbid `base_distribution="exponential"` with `scaling != None`** for v0.4. Raise `ValueError` at `MLT.__init__`. **Chosen:** option (3). The exponential link is the rarest of the seven base distributions and the only one that requires support-feasibility rows; combining it with scaling has no R `tram` parity to chase (R does not expose this combination either). Lifting the restriction would require a follow-up ADR and a non-linear-constraint extension to `_auglag.py`. `build_constraints` itself does **not** change in this release. The docstring is updated to record the new precondition (`scaling=None` when `nonneg_lower=True`). --- ## Decision 4 — Likelihood-Path Generalisation **Chosen:** Reuse ADR 0001's Option (a). The private `_ll_*` and `_grad_*` censoring-dispatch functions accept a precomputed `(h, h_prime, J, J_prime)` quadruple rather than re-deriving the design matrix internally. A thin builder function assembles the quadruple from `(theta, basis, y, X, scaling)` once per likelihood evaluation; the censoring branch then consumes the quadruple uniformly across the shift, interaction, and scaled paths. **Notation.** Let f_i := exp(0.5 · x_s_i · γ) (n,) positive scaling factor b_i := B_basis(y_i) (n, p) y-basis row db_i := dB_basis(y_i) / dy (n, p) y-basis derivative row. The factor of **0.5 in the exponent** matches mlt's internal convention (`mlt:::tmlt` evaluates `sterm <- exp(0.5 * )`), so γ is sign- *and* magnitude-aligned with R `tram`'s scaling coefficient. This was discovered while implementing the tracer-bullet slice (#70) — without the 0.5 factor, mltpy's γ converges to half R's γ even though the fitted likelihood matches. See `mltpy/likelihood.py::_ll_none` for the implementation. Then h_i = (b_i · θ_b) · f_i + X_d_i · β (+ offset_i) h_prime_i = (db_i · θ_b) · f_i. The full Jacobian rows are: J = [ B · diag(f) | X_d | h_0(y) · diag(f) · X_s ] (n, p + q_d + q_s) J_prime = [ dB · diag(f) | 0 | h_0'(y) · diag(f) · X_s ] (n, p + q_d + q_s) where `h_0(y_i) = b_i · θ_b` and `h_0'(y_i) = db_i · θ_b` — both depend on the current `θ_b`, so `J` and `J_prime` rebuild every iteration in the scaled path (unlike the shift path, where `[B | X_d]` is fixed). **Score / gradient.** With `s_i = ∂ℓ_i / ∂h_i` (already computed by the censoring-specific `_ll_*` for every link) and `q_i = ∂ℓ_i / ∂(log h'_i) = 1` for exact rows / `0` for censored, the full gradient is ∂ℓ / ∂θ = J.T · s + (J_prime / h_prime).T · q (weighted analogously). Both terms already exist in the shift and interaction paths; the only new responsibility is the **builder** that assembles `J / J_prime` with the `diag(f)` factor and the `h_0(y) · diag(f) · X_s` columns. **Rejected alternative — parallel `scale=` argument.** Adding a `scale=...` kwarg to each of the four `_ll_*` and `_ll_and_grad_*` families would double their surface area for one additional code path. Option (a) — quadruple-in, scalar-out — already absorbs the interaction path and is the natural place for scaling to plug in. **Censoring coverage.** Every censoring branch (`NONE`, `RIGHT`, `LEFT`, `INTERVAL`) receives the same `(h, h_prime, J, J_prime)` quadruple from the builder; the only change is that interval-censored rows need `h_a`, `h_b` *and their J-rows at both endpoints*. The builder returns `(h_a, h_b, J_a, J_b)` instead of `(h, J)` for those rows — mirroring the existing `_ll_interval` structure. --- ## Decision 5 — Sign Convention vs. R `tram` mltpy parameterises `h + X_d · β`; R `tram::*` parameterises `h − X_d · β` (documented for `Polr` and inherited by the rest of the `_TramModel` subclasses). mltpy's `β` is therefore the **negative** of R's `β`. For the scaling block, R `tram::Lm(..., scale = ~ z)` parameterises h_R(y | x_d, x_s) = h_0(y) · exp(0.5 · x_s · γ_R) − x_d · β_R, i.e. the sign on `β` flips but the scaling block enters with the **same sign *and* magnitude** as ours. So mltpy's `γ` is sign-aligned with R `tram`'s `γ` — *no flip required* when checking parity. (Confirmed against `tram::Lm`'s vignette §2 and against `mlt:::tmlt` in `mlt/R/methods.R`, which evaluates `sterm <- exp(0.5 * predict( model, terms = "bscaling"))`.) **Documentation requirement for parity fixtures (#70, #77):** the fixture-generating R script (`reference/generate_reference.R`) writes `-coef(fit)[shift_block]` for the `β` block and the raw `coef(fit)[scale_block]` for the `γ` block. The Python parity test compares element-wise without any further sign manipulation. A unit test in #70 ratchets this convention: it fits the same toy data under R and mltpy and asserts `np.allclose(mltpy.coef_, -r.beta)` AND `np.allclose(mltpy.gamma_coef_, r.gamma)`. If the test fails, the convention is wrong, not the implementation. --- ## Decision 6 — Backward Compatibility **All non-scaling models must be byte-identical to v0.3.x.** Concretely: - Every existing test in `tests/` passes without modification. - The shift-path code path (`self.scaling is None`) does **not** see any of the new `_split_theta` machinery: the dispatch branch tests `if self.scaling is not None:` and routes to the new path; the `else` branch is the existing implementation, verbatim. - `theta_`, `coef_`, `feature_names_in_`, `n_free_params_`, `hessian_`, `vcov()`, `estfun()`, `predict()`, `residuals()`, `simulate()`, `score()` all retain their current behaviour for `scaling=None`. - `build_constraints`, `build_constraint_matrices`, the auglag / SLSQP / trust-constr dispatch in `optimizer.py` accept the same arguments; no new required parameters anywhere. - The `_split_theta` rewrite is mechanically equivalent for `q_s = 0` — `theta_b, beta, gamma = _split_theta(theta, p, q_d, 0)` returns `(theta_b, beta, None)`, and every call site is updated to ignore the trailing `None`. The scaling code path is dead code for every existing test until #70 flips the first row. --- ## Stub files Alongside this ADR, a `scaling=` kwarg is added to `MLT.__init__` (and to `ConditionalTransformationModel.__init__`) with a `NotImplementedError` on any non-`None` value. This lets downstream slices import the kwarg name without a follow-up rename, and lets test fixtures fail loudly ("scaled likelihood is not yet implemented") rather than silently returning a shift-only fit. No corresponding stub is added to `_TramModel` subclasses yet — those land in #73–#77 alongside the per-class convergence tuning. --- ## Consequences - **Slice #70** (tracer bullet) implements `_build_scaled_h_J`, the exact + normal scaled likelihood, and the auglag plumbing for the new parameter block. R parity test ratchets sign convention. - **Slice #71** adds censoring (`RIGHT`, `LEFT`, `INTERVAL`) and the remaining base distributions to the scaled path. - **Slice #72** extends `predict()` (`distribution`, `density`, `hazard`, `quantile`) — quantile inversion needs the per-row bracket `[theta_b[0] · f_i, theta_b[-1] · f_i] + X_d_i · β` derived from `f_i`. - **Slices #73–#76** wire `scaling=` through `BoxCox`, `Coxph`, `Colr`, `Lm`, `Survreg`. - **Slice #77** adds `vcov` / sandwich SE / Wald-test coverage for the `γ` block. - **Slice #78** is the vignette covering heteroskedastic regression (`Lm + scale`) and non-PH survival (`Coxph + scale`). **Out of scope for v0.4:** - `scaling=` combined with `InteractionBasis` (ADR follow-up). - `scaling=` combined with `base_distribution="exponential"` (requires a non-linear constraint extension). - `scaling=` on `Polr` (ordinal sign-convention ADR follow-up). - Basis-like spec for the scaling design (deferred until a user-driven need emerges; the ndarray surface is forward-compatible — a basis-like argument could be added as a union without breaking existing callers).