ADR 0002 — Scaling Terms (Heteroskedastic CTMs)#
Date: 2026-05-21
Status: Accepted
Deciders: RMKruse
Issue: #69 Scaling Terms: ADR for scaled-baseline API and monotonicity strategy
Parent: #28 Scaling Terms (epic), #33 Architectural extensions
Context#
The shift CTM h(y|x) = h_0(y) + x_d · β is homoskedastic — the conditional
transformation differs from the baseline only by a covariate-dependent
translation. The heteroskedastic generalisation, mirroring R tram::*(scale=...),
multiplies the baseline by a strictly positive covariate-dependent factor:
h(y | x_d, x_s) = h_0(y) · exp(x_s · γ) + x_d · β.
exp(x_s · γ) > 0 keeps the transformation monotone in y whenever h_0 is
monotone, but it couples γ to θ_b non-linearly — the gradient
∂h / ∂γ = h_0(y) · x_s · exp(x_s · γ) depends on the current θ_b and γ,
so the Jacobian rebuilds every iteration (unlike the constant [B | X] Jacobian
of the shift path).
This ADR fixes the six open questions enumerated in #69 before any likelihood / optimiser code is written, so the downstream tracer-bullet slice (#70) and the censoring / predict / vcov / tram-class slices (#71–#77) can be built against a single agreed-on API.
Decision 1 — Public Surface#
Chosen: A single new kwarg scaling= on ConditionalTransformationModel,
MLT, and every _TramModel subclass. The kwarg accepts an ndarray of
pre-built scaling-design columns, supplied at construction. No new model
class is introduced; existing classes route to the scaled-likelihood path
whenever self.scaling is not None.
Type signature:
class ConditionalTransformationModel:
def __init__(
self,
basis: BernsteinBasis | InteractionBasis,
...,
scaling: NDArray[np.float64] | None = None,
) -> None: ...
The same kwarg is threaded through MLT.__init__ and all _TramModel
subclasses (BoxCox, Coxph, Colr, Lm, Survreg). Polr does not
accept scaling= in the v0.4 release — ordinal scaling would require a
separate ADR (sign of γ, identifiability against the cutpoint block).
Why ndarray of pre-built columns, not a basis-like spec:
exp(x_s · γ)is linear-in-γinside the exponential — there is no basis evaluation step that produces design columns from a raw scaling covariate. Any feature engineering the user wants (polynomial, spline, interaction, dummy expansion) can be done externally and passed as raw columns, identical to how the shift designXis supplied today.The
interacting=kwarg from ADR 0001 accepts a basis because the tensor-product structurea(y) ⊗ b(x)literally requires a basis evaluation onx— that argument does not apply here.An ndarray fixes
q_s = scaling.shape[1]at__init__time, which lets__init__validate the parameter-vector layout without having to defer tofit()(mirrors howBernsteinBasis(order, support)fixespup front).
Convention for predict / new data: at predict(...) and any other
post-fit method that takes new X_new, a parallel keyword
X_scale_new: NDArray | None = None is supplied; its column count must
equal self.scaling.shape[1]. Shape mismatch raises ValueError with the
same message family as the existing X_new checks.
Convention at fit: y.shape[0] (or CensoredData.n) must equal
self.scaling.shape[0]. Mismatch is a ValueError.
Centring: users are responsible for column-centring their scaling
design if they want γ = 0 to correspond to the marginal-mean variance.
No automatic centring is performed — mirroring how mltpy does not
auto-centre X today.
Rejected alternatives:
Basis-like spec (parallel to
interacting=): rejected for the reasons above; anevaluate(x)interface adds nothing because there is no expansion step.Boolean flag (
scaling: bool = False): rejected because the issue requires thescaling is not Nonedetection pattern, and aFalsedefault is truthy w.r.t. that idiom. Also forces the column count to be discovered lazily, which complicates__init__validation.Integer column count (
scaling: int | None = None): pragmatic but loses the “ndarray of pre-built columns” framing requested in #69 and would still requireX_scaleat fit, adding a second kwarg. Pre-built ndarray subsumes this —scaling.shape[1]is the count.
Decision 2 — Parameter-Vector Layout#
Chosen: theta_ = [theta_b | β | γ], length p + q_d + q_s, where
theta_bis the Bernstein coefficient vector (lengthp = basis.order + 1),βis the shift block (lengthq_d = X.shape[1], 0 if noX),γis the scaling block (lengthq_s = scaling.shape[1], absent ifscaling is None).
The order is fixed so that the existing theta_[:p] and theta_[p:p+q_d]
slices used throughout likelihood.py and model.py remain valid byte-for-byte
when scaling is None; the new γ block is strictly appended.
coef_, feature_names_in_, n_free_params_#
coef_returns the same vector as today (theβblock). A new propertygamma_coef_returns theγblock;Nonewhen scaling is inactive.theta_bis exposed asbasis_coef_(already informally accessible viatheta_[:p]).feature_names_in_continues to name theβcolumns. A second attributescaling_feature_names_in_is populated symmetrically from the scaling ndarray’s column metadata (DataFrame columns) or["S1", "S2", ...].n_free_params_=p + q_d + q_s. No change to the formula; the new block addsq_sto the count.
_split_theta generalisation#
The current
def _split_theta(theta, p, X) -> (theta_b, beta): ...
is generalised to
def _split_theta(
theta: NDArray[np.float64],
p: int,
q_d: int,
q_s: int,
) -> tuple[NDArray[np.float64], NDArray[np.float64] | None, NDArray[np.float64] | None]:
theta_b = theta[:p]
beta = theta[p : p + q_d] if q_d > 0 else None
gamma = theta[p + q_d : p + q_d + q_s] if q_s > 0 else None
return theta_b, beta, gamma
Existing call sites that pass X is None / not None are mechanically
rewritten to pass q_d = 0 if X is None else X.shape[1] and
q_s = 0 (i.e. unchanged). The shift-only test surface is byte-identical.
A thin wrapper _split_theta_shift(theta, p, X) matching the old
two-return signature is not kept — every call site is touched once
in the tracer-bullet slice (#70). See Decision 6 for backward-compatibility
expectations.
Interaction × Scaling#
Scaling stacks cleanly on top of shift models. Combining scaling=
with an InteractionBasis is out of scope for v0.4 and raises
ValueError("scaling is not supported with InteractionBasis") at
__init__. Both extensions touch the parameter-vector layout
non-trivially and need their own integration ADR before being mixed.
Decision 3 — Monotonicity Strategy#
Chosen: The existing D @ theta_b ≥ 0 constraint suffices for every
base distribution except "exponential"; γ is unconstrained. No new
rows are added to constraints.py for the scaled path.
Why γ is unconstrained:
∂h(y | x) / ∂y = h_0'(y) · exp(x_s · γ).
The scaling factor exp(x_s · γ) is strictly positive for every finite
γ and every finite x_s. If h_0'(y) ≥ 0 (the standard
D @ theta_b ≥ 0 constraint), then ∂h/∂y ≥ 0. Monotonicity in y
is therefore inherited from h_0; the scaling block contributes a
positive multiplicative factor that cannot flip the sign.
Boundedness: the auglag / SLSQP / trust-constr solvers do not require
bounds on γ because the negative log-likelihood is coercive in γ
for every base distribution we ship (exponential tails of exp(x_s · γ)
in the likelihood penalise drift to ±∞). In practice the solver may
take large interior steps early; the AugLag perturb-and-project restart
loop already handles infeasibility for θ_b, and γ only affects the
likelihood, not the linear constraint surface.
The "exponential" base-distribution corner case:
The exponential link has support [0, ∞), enforced today by adding the
row theta_b[0] + X_i · β ≥ 0 for each training observation
(nonneg_lower=True branch of build_constraints). With scaling,
the corresponding constraint becomes
theta_b[0] · exp(x_s_i · γ) + X_d_i · β ≥ 0
— non-linear in γ, so the existing linear-constraint scaffolding
does not extend directly. Three options were considered:
Add a non-linear constraint to the auglag PHR loop. Possible but doubles the active-set complexity for a rarely-used link.
Restrict
theta_b[0] ≥ 0andX_d_i · β ≥ 0for everyi. A sufficient (not necessary) condition; the latter is hard to enforce without boundingβ.Forbid
base_distribution="exponential"withscaling != Nonefor v0.4. RaiseValueErroratMLT.__init__.
Chosen: option (3). The exponential link is the rarest of the seven
base distributions and the only one that requires support-feasibility
rows; combining it with scaling has no R tram parity to chase (R does
not expose this combination either). Lifting the restriction would
require a follow-up ADR and a non-linear-constraint extension to
_auglag.py.
build_constraints itself does not change in this release. The
docstring is updated to record the new precondition (scaling=None
when nonneg_lower=True).
Decision 4 — Likelihood-Path Generalisation#
Chosen: Reuse ADR 0001’s Option (a). The private _ll_* and
_grad_* censoring-dispatch functions accept a precomputed
(h, h_prime, J, J_prime) quadruple rather than re-deriving the design
matrix internally. A thin builder function assembles the quadruple
from (theta, basis, y, X, scaling) once per likelihood evaluation;
the censoring branch then consumes the quadruple uniformly across the
shift, interaction, and scaled paths.
Notation. Let
f_i := exp(0.5 · x_s_i · γ) (n,) positive scaling factor
b_i := B_basis(y_i) (n, p) y-basis row
db_i := dB_basis(y_i) / dy (n, p) y-basis derivative row.
The factor of 0.5 in the exponent matches mlt’s internal convention
(mlt:::tmlt evaluates sterm <- exp(0.5 * <scaling_predict>)), so γ is
sign- and magnitude-aligned with R tram’s scaling coefficient. This
was discovered while implementing the tracer-bullet slice (#70) — without
the 0.5 factor, mltpy’s γ converges to half R’s γ even though the fitted
likelihood matches. See mltpy/likelihood.py::_ll_none for the
implementation.
Then
h_i = (b_i · θ_b) · f_i + X_d_i · β (+ offset_i)
h_prime_i = (db_i · θ_b) · f_i.
The full Jacobian rows are:
J = [ B · diag(f) | X_d | h_0(y) · diag(f) · X_s ] (n, p + q_d + q_s)
J_prime = [ dB · diag(f) | 0 | h_0'(y) · diag(f) · X_s ] (n, p + q_d + q_s)
where h_0(y_i) = b_i · θ_b and h_0'(y_i) = db_i · θ_b — both depend
on the current θ_b, so J and J_prime rebuild every iteration in
the scaled path (unlike the shift path, where [B | X_d] is fixed).
Score / gradient. With s_i = ∂ℓ_i / ∂h_i (already computed by the
censoring-specific _ll_* for every link) and
q_i = ∂ℓ_i / ∂(log h'_i) = 1 for exact rows / 0 for censored, the
full gradient is
∂ℓ / ∂θ = J.T · s + (J_prime / h_prime).T · q (weighted analogously).
Both terms already exist in the shift and interaction paths; the only
new responsibility is the builder that assembles J / J_prime with
the diag(f) factor and the h_0(y) · diag(f) · X_s columns.
Rejected alternative — parallel scale= argument. Adding a
scale=... kwarg to each of the four _ll_* and _ll_and_grad_*
families would double their surface area for one additional code path.
Option (a) — quadruple-in, scalar-out — already absorbs the interaction
path and is the natural place for scaling to plug in.
Censoring coverage. Every censoring branch (NONE, RIGHT, LEFT,
INTERVAL) receives the same (h, h_prime, J, J_prime) quadruple from
the builder; the only change is that interval-censored rows need h_a,
h_b and their J-rows at both endpoints. The builder returns
(h_a, h_b, J_a, J_b) instead of (h, J) for those rows — mirroring
the existing _ll_interval structure.
Decision 5 — Sign Convention vs. R tram#
mltpy parameterises h + X_d · β; R tram::* parameterises
h − X_d · β (documented for Polr and inherited by the rest of the
_TramModel subclasses). mltpy’s β is therefore the negative of
R’s β.
For the scaling block, R tram::Lm(..., scale = ~ z) parameterises
h_R(y | x_d, x_s) = h_0(y) · exp(0.5 · x_s · γ_R) − x_d · β_R,
i.e. the sign on β flips but the scaling block enters with the
same sign and magnitude as ours. So mltpy’s γ is sign-aligned
with R tram’s γ — no flip required when checking parity.
(Confirmed against tram::Lm’s vignette §2 and against mlt:::tmlt
in mlt/R/methods.R, which evaluates sterm <- exp(0.5 * predict( model, terms = "bscaling")).)
Documentation requirement for parity fixtures (#70, #77): the
fixture-generating R script (reference/generate_reference.R) writes
-coef(fit)[shift_block] for the β block and the raw coef(fit)[scale_block]
for the γ block. The Python parity test compares element-wise without
any further sign manipulation.
A unit test in #70 ratchets this convention: it fits the same toy data
under R and mltpy and asserts np.allclose(mltpy.coef_, -r.beta) AND
np.allclose(mltpy.gamma_coef_, r.gamma). If the test fails, the
convention is wrong, not the implementation.
Decision 6 — Backward Compatibility#
All non-scaling models must be byte-identical to v0.3.x. Concretely:
Every existing test in
tests/passes without modification.The shift-path code path (
self.scaling is None) does not see any of the new_split_thetamachinery: the dispatch branch testsif self.scaling is not None:and routes to the new path; theelsebranch is the existing implementation, verbatim.theta_,coef_,feature_names_in_,n_free_params_,hessian_,vcov(),estfun(),predict(),residuals(),simulate(),score()all retain their current behaviour forscaling=None.build_constraints,build_constraint_matrices, the auglag / SLSQP / trust-constr dispatch inoptimizer.pyaccept the same arguments; no new required parameters anywhere.The
_split_thetarewrite is mechanically equivalent forq_s = 0—theta_b, beta, gamma = _split_theta(theta, p, q_d, 0)returns(theta_b, beta, None), and every call site is updated to ignore the trailingNone.
The scaling code path is dead code for every existing test until #70 flips the first row.
Stub files#
Alongside this ADR, a scaling= kwarg is added to MLT.__init__ (and to
ConditionalTransformationModel.__init__) with a NotImplementedError
on any non-None value. This lets downstream slices import the kwarg
name without a follow-up rename, and lets test fixtures fail loudly
(“scaled likelihood is not yet implemented”) rather than silently
returning a shift-only fit.
No corresponding stub is added to _TramModel subclasses yet — those
land in #73–#77 alongside the per-class convergence tuning.
Consequences#
Slice #70 (tracer bullet) implements
_build_scaled_h_J, the exactnormal scaled likelihood, and the auglag plumbing for the new parameter block. R parity test ratchets sign convention.
Slice #71 adds censoring (
RIGHT,LEFT,INTERVAL) and the remaining base distributions to the scaled path.Slice #72 extends
predict()(distribution,density,hazard,quantile) — quantile inversion needs the per-row bracket[theta_b[0] · f_i, theta_b[-1] · f_i] + X_d_i · βderived fromf_i.Slices #73–#76 wire
scaling=throughBoxCox,Coxph,Colr,Lm,Survreg.Slice #77 adds
vcov/ sandwich SE / Wald-test coverage for theγblock.Slice #78 is the vignette covering heteroskedastic regression (
Lm + scale) and non-PH survival (Coxph + scale).
Out of scope for v0.4:
scaling=combined withInteractionBasis(ADR follow-up).scaling=combined withbase_distribution="exponential"(requires a non-linear constraint extension).scaling=onPolr(ordinal sign-convention ADR follow-up).Basis-like spec for the scaling design (deferred until a user-driven need emerges; the ndarray surface is forward-compatible — a basis-like argument could be added as a union without breaking existing callers).