{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Flexible normal-distribution regression with Box-Cox\n", "\n", "The classical Box-Cox power transform fixes a one-parameter family of\n", "monotone maps that pull a skewed response closer to a normal distribution.\n", "That parametric choice is often too rigid: the resulting likelihood can\n", "still be badly mis-specified when the true transformation has shape that\n", "no single power captures.\n", "\n", "`mltpy.BoxCox` replaces the power family with a flexible *monotone*\n", "Bernstein polynomial. The model fits a transformation $h(y)$ by maximum\n", "likelihood under the constraint that $h$ is non-decreasing, and assumes\n", "$h(Y)\\sim\\mathcal{N}(0,1)$. This reproduces Figure 2 of Hothorn (2020),\n", "*Most Likely Transformations: The mlt Package*.\n", "\n", "In this vignette we:\n", "1. Generate a strongly skewed log-normal response.\n", "2. Fit `mltpy.BoxCox` and visualise the estimated $h(y)$.\n", "3. Compare the resulting density and log-likelihood with a Gaussian fit." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from scipy import stats\n", "\n", "import mltpy\n", "\n", "rng = np.random.default_rng(0)\n", "plt.rcParams[\"figure.dpi\"] = 110" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Synthetic log-normal data\n", "\n", "200 observations from $Y = \\min(X, 200)$ with $X\\sim\\text{LogNormal}(3.5, 0.8)$.\n", "The upper clip keeps the support bounded — a requirement of the Bernstein basis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y = rng.lognormal(mean=3.5, sigma=0.8, size=200).clip(0, 200)\n", "print(f\"n = {y.size}, min = {y.min():.2f}, max = {y.max():.2f}, skew = {stats.skew(y):.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fit the Box-Cox transformation model\n", "\n", "The `support` argument defines the compact interval on which the Bernstein\n", "basis lives. A small padding on both sides avoids numerical issues at the\n", "boundary. `order=6` is a reasonable default for $n \\approx 200$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "support = (float(y.min()) - 0.1, float(y.max()) + 0.1)\n", "model = mltpy.BoxCox(support=support, order=6)\n", "model.fit(y)\n", "print(model.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The estimated transformation $h(y)$\n", "\n", "`fitted_transformation` evaluates $h(y) = B_k(y)\\,\\hat{\\theta}$. Under the\n", "Gaussian assumption for the raw response, $h(y)$ would be an affine\n", "$z$-score line $(y - \\mu)/\\sigma$; the fitted curve shows how far the true\n", "transformation departs from linearity." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_sorted = np.sort(y)\n", "h_fit = model.fitted_transformation(y_sorted)\n", "\n", "mu, sigma = y.mean(), y.std(ddof=1)\n", "h_gauss = (y_sorted - mu) / sigma\n", "\n", "fig, ax = plt.subplots(figsize=(6, 4))\n", "ax.plot(y_sorted, h_fit, label=\"mltpy BoxCox $\\\\hat h(y)$\", lw=2)\n", "ax.plot(y_sorted, h_gauss, \"--\", label=\"Gaussian $z$-score\", color=\"C3\")\n", "ax.axhline(0, color=\"0.7\", lw=0.8)\n", "ax.set_xlabel(\"y\")\n", "ax.set_ylabel(\"h(y)\")\n", "ax.set_title(\"Fitted transformation vs. Gaussian assumption\")\n", "ax.legend()\n", "fig.tight_layout();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Density fit and comparison\n", "\n", "The CTM density on the observed scale is $f(y) = \\varphi(h(y))\\,h'(y)$.\n", "Overlaying the Gaussian MLE of the same data makes the mis-specification\n", "of the naive normal assumption visible." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grid = np.linspace(support[0] + 1e-3, support[1] - 1e-3, 400)\n", "pdf_ctm = model.predict(grid, what=\"density\")\n", "pdf_gauss = stats.norm.pdf(grid, loc=mu, scale=sigma)\n", "\n", "fig, ax = plt.subplots(figsize=(6, 4))\n", "ax.hist(y, bins=25, density=True, alpha=0.4, color=\"0.6\", label=\"data\")\n", "ax.plot(grid, pdf_ctm, label=\"mltpy BoxCox\", lw=2)\n", "ax.plot(grid, pdf_gauss, \"--\", label=\"Gaussian MLE\", color=\"C3\")\n", "ax.set_xlabel(\"y\")\n", "ax.set_ylabel(\"density\")\n", "ax.set_title(\"Density estimate on the observed scale\")\n", "ax.legend()\n", "fig.tight_layout();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ll_ctm = model.result_.log_likelihood\n", "ll_gauss = stats.norm.logpdf(y, loc=mu, scale=sigma).sum()\n", "print(f\"log-likelihood mltpy BoxCox : {ll_ctm:9.2f}\")\n", "print(f\"log-likelihood Gaussian MLE : {ll_gauss:9.2f}\")\n", "print(f\"improvement : {ll_ctm - ll_gauss:9.2f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Takeaways\n", "\n", "- The fitted $\\hat h(y)$ is visibly non-linear, concentrating resolution\n", " where data are dense and flattening in the heavy right tail.\n", "- The Gaussian fit systematically under-predicts the mode and overstates\n", " the left tail; the CTM density matches the histogram tightly.\n", "- The likelihood gain over the Gaussian MLE is large — the model\n", " recovers a far better distributional description from the same data\n", " while using only $k+1 = 7$ shape parameters.\n", "- The *same fitted object* yields CDF, quantile, density, and hazard\n", " predictions through `model.predict(..., what=...)`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 5 }