{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Conditional transformation model with covariates\n", "\n", "When a conditioning covariate matrix $X$ is supplied, mltpy fits the\n", "conditional distribution\n", "$$F(y\\mid x) = \\Phi\\bigl(B_k(y)\\,\\theta + x^{\\top}\\beta\\bigr),$$\n", "where the Bernstein component $B_k(y)\\,\\theta$ absorbs the *shape* of\n", "the response distribution and $x^{\\top}\\beta$ shifts it on the latent\n", "normal scale.\n", "\n", "Unlike ordinary linear regression, nothing here is restricted to the\n", "mean: the model describes the *entire* conditional distribution, so we\n", "can extract the CDF, density, or any quantile at any covariate value.\n", "This is the conditional transformation model of\n", "Hothorn, Kneib & Bühlmann (2014).\n", "\n", "In this vignette we:\n", "1. Simulate a heteroscedastic two-covariate regression problem.\n", "2. Fit `mltpy.MLT` with `X` passed to `fit`.\n", "3. Compare conditional densities at three representative covariate\n", " values against the Gaussian densities implied by OLS." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from scipy import stats\n", "\n", "import mltpy\n", "\n", "rng = np.random.default_rng(0)\n", "plt.rcParams[\"figure.dpi\"] = 110" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Heteroscedastic DGP\n", "\n", "$X_1, X_2 \\sim \\mathcal{N}(0, 1)$ independently, and\n", "$$Y \\mid X = x = \\bigl(1 + 0.8 x_1 - 0.5 x_2\\bigr) + \\bigl(0.3 + 0.2 |x_1|\\bigr)\\,\\varepsilon,\\quad \\varepsilon\\sim\\mathcal N(0,1).$$\n", "\n", "The residual variance grows with $|X_1|$, so the *conditional* distribution is\n", "not a location shift of a single shape — exactly the regime in which the\n", "naive Gaussian regression fails." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n = 300\n", "X = rng.standard_normal((n, 2))\n", "eta = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1]\n", "sd = 0.3 + 0.2 * np.abs(X[:, 0])\n", "y = eta + sd * rng.standard_normal(n)\n", "print(f\"y range: [{y.min():.2f}, {y.max():.2f}]\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fit the conditional transformation model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "support = (float(y.min()) - 1.0, float(y.max()) + 1.0)\n", "model = mltpy.MLT(order=5, support=support)\n", "model.fit(y, X=X)\n", "\n", "p = 5 + 1\n", "beta = model.theta_[p:]\n", "print(f\"converged: {model.result_.converged}\")\n", "print(f\"log-likelihood: {model.result_.log_likelihood:.2f}\")\n", "print(f\"beta = {beta} (true shifts: [-0.8/sd, +0.5/sd] on latent scale)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conditional CDF and density at three representative $x$ values\n", "\n", "We pick $x_1$ at the 10th, 50th, and 90th empirical percentile with $x_2$\n", "held at 0 to isolate the heteroscedastic axis. For each value we compute\n", "the CTM conditional distribution and the Gaussian distribution implied\n", "by OLS with constant residual variance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_design = np.column_stack([np.ones(n), X])\n", "coefs_ols, *_ = np.linalg.lstsq(X_design, y, rcond=None)\n", "resid = y - X_design @ coefs_ols\n", "sigma_ols = resid.std(ddof=X_design.shape[1])\n", "print(f\"OLS coefficients (intercept, x1, x2): {coefs_ols}\")\n", "print(f\"OLS residual SD (assumed constant): {sigma_ols:.3f}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x1_values = np.quantile(X[:, 0], [0.10, 0.50, 0.90])\n", "colors = [\"C0\", \"C2\", \"C3\"]\n", "labels = [\"x₁ @ 10th pct\", \"x₁ @ 50th pct\", \"x₁ @ 90th pct\"]\n", "\n", "y_grid = np.linspace(support[0] + 1e-3, support[1] - 1e-3, 400)\n", "\n", "fig, (ax_cdf, ax_pdf) = plt.subplots(1, 2, figsize=(11, 4))\n", "\n", "for x1, col, lbl in zip(x1_values, colors, labels):\n", " x_fixed = np.array([x1, 0.0])\n", " X_rep = np.tile(x_fixed, (len(y_grid), 1))\n", " cdf = model.predict(y_grid, X_new=X_rep, what=\"distribution\")\n", " pdf = model.predict(y_grid, X_new=X_rep, what=\"density\")\n", " ax_cdf.plot(y_grid, cdf, color=col, lw=2, label=lbl)\n", " ax_pdf.plot(y_grid, pdf, color=col, lw=2, label=lbl)\n", "\n", " # OLS-implied Gaussian conditional density at the same x\n", " mu_ols = coefs_ols[0] + coefs_ols[1] * x1 + coefs_ols[2] * 0.0\n", " ax_pdf.plot(\n", " y_grid,\n", " stats.norm.pdf(y_grid, mu_ols, sigma_ols),\n", " \"--\",\n", " color=col,\n", " alpha=0.6,\n", " )\n", "\n", "ax_cdf.set_xlabel(\"y\")\n", "ax_cdf.set_ylabel(\"F(y | x)\")\n", "ax_cdf.set_title(\"Conditional CDF\")\n", "ax_cdf.legend(loc=\"lower right\")\n", "\n", "ax_pdf.set_xlabel(\"y\")\n", "ax_pdf.set_ylabel(\"f(y | x)\")\n", "ax_pdf.set_title(\"Conditional density (solid = CTM, dashed = OLS Gaussian)\")\n", "ax_pdf.legend(loc=\"upper right\")\n", "\n", "fig.tight_layout();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Takeaways\n", "\n", "- At moderate $x_1$ the CTM and OLS densities roughly coincide: when the\n", " DGP's conditional noise is close to its mean level, a homoscedastic\n", " Gaussian is not grossly wrong.\n", "- At extreme $x_1$ the CTM density is visibly narrower or wider than\n", " the fixed-SD OLS Gaussian — the model recovers the heteroscedasticity\n", " directly from the likelihood, without any explicit variance model.\n", "- The same fitted object answers conditional-quantile queries: use\n", " `model.predict(np.array([0.1, 0.5, 0.9]), X_new=..., what=\"quantile\")`\n", " to obtain conditional prediction bands.\n", "- For censored responses the same covariate path is available through\n", " `mltpy.Coxph(...).fit(CensoredData, X=X)`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 5 }