{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Flexible normal-distribution regression with Box-Cox\n",
    "\n",
    "The classical Box-Cox power transform fixes a one-parameter family of\n",
    "monotone maps that pull a skewed response closer to a normal distribution.\n",
    "That parametric choice is often too rigid: the resulting likelihood can\n",
    "still be badly mis-specified when the true transformation has shape that\n",
    "no single power captures.\n",
    "\n",
    "`mltpy.BoxCox` replaces the power family with a flexible *monotone*\n",
    "Bernstein polynomial. The model fits a transformation $h(y)$ by maximum\n",
    "likelihood under the constraint that $h$ is non-decreasing, and assumes\n",
    "$h(Y)\\sim\\mathcal{N}(0,1)$. This reproduces Figure 2 of Hothorn (2020),\n",
    "*Most Likely Transformations: The mlt Package*.\n",
    "\n",
    "In this vignette we:\n",
    "1. Generate a strongly skewed log-normal response.\n",
    "2. Fit `mltpy.BoxCox` and visualise the estimated $h(y)$.\n",
    "3. Compare the resulting density and log-likelihood with a Gaussian fit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from scipy import stats\n",
    "\n",
    "import mltpy\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "plt.rcParams[\"figure.dpi\"] = 110"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Synthetic log-normal data\n",
    "\n",
    "200 observations from $Y = \\min(X, 200)$ with $X\\sim\\text{LogNormal}(3.5, 0.8)$.\n",
    "The upper clip keeps the support bounded — a requirement of the Bernstein basis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = rng.lognormal(mean=3.5, sigma=0.8, size=200).clip(0, 200)\n",
    "print(f\"n = {y.size},  min = {y.min():.2f},  max = {y.max():.2f},  skew = {stats.skew(y):.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fit the Box-Cox transformation model\n",
    "\n",
    "The `support` argument defines the compact interval on which the Bernstein\n",
    "basis lives. A small padding on both sides avoids numerical issues at the\n",
    "boundary. `order=6` is a reasonable default for $n \\approx 200$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "support = (float(y.min()) - 0.1, float(y.max()) + 0.1)\n",
    "model = mltpy.BoxCox(support=support, order=6)\n",
    "model.fit(y)\n",
    "print(model.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The estimated transformation $h(y)$\n",
    "\n",
    "`fitted_transformation` evaluates $h(y) = B_k(y)\\,\\hat{\\theta}$. Under the\n",
    "Gaussian assumption for the raw response, $h(y)$ would be an affine\n",
    "$z$-score line $(y - \\mu)/\\sigma$; the fitted curve shows how far the true\n",
    "transformation departs from linearity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_sorted = np.sort(y)\n",
    "h_fit = model.fitted_transformation(y_sorted)\n",
    "\n",
    "mu, sigma = y.mean(), y.std(ddof=1)\n",
    "h_gauss = (y_sorted - mu) / sigma\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(6, 4))\n",
    "ax.plot(y_sorted, h_fit, label=\"mltpy BoxCox $\\\\hat h(y)$\", lw=2)\n",
    "ax.plot(y_sorted, h_gauss, \"--\", label=\"Gaussian $z$-score\", color=\"C3\")\n",
    "ax.axhline(0, color=\"0.7\", lw=0.8)\n",
    "ax.set_xlabel(\"y\")\n",
    "ax.set_ylabel(\"h(y)\")\n",
    "ax.set_title(\"Fitted transformation vs. Gaussian assumption\")\n",
    "ax.legend()\n",
    "fig.tight_layout();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Density fit and comparison\n",
    "\n",
    "The CTM density on the observed scale is $f(y) = \\varphi(h(y))\\,h'(y)$.\n",
    "Overlaying the Gaussian MLE of the same data makes the mis-specification\n",
    "of the naive normal assumption visible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grid = np.linspace(support[0] + 1e-3, support[1] - 1e-3, 400)\n",
    "pdf_ctm = model.predict(grid, what=\"density\")\n",
    "pdf_gauss = stats.norm.pdf(grid, loc=mu, scale=sigma)\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(6, 4))\n",
    "ax.hist(y, bins=25, density=True, alpha=0.4, color=\"0.6\", label=\"data\")\n",
    "ax.plot(grid, pdf_ctm, label=\"mltpy BoxCox\", lw=2)\n",
    "ax.plot(grid, pdf_gauss, \"--\", label=\"Gaussian MLE\", color=\"C3\")\n",
    "ax.set_xlabel(\"y\")\n",
    "ax.set_ylabel(\"density\")\n",
    "ax.set_title(\"Density estimate on the observed scale\")\n",
    "ax.legend()\n",
    "fig.tight_layout();"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ll_ctm = model.result_.log_likelihood\n",
    "ll_gauss = stats.norm.logpdf(y, loc=mu, scale=sigma).sum()\n",
    "print(f\"log-likelihood  mltpy BoxCox : {ll_ctm:9.2f}\")\n",
    "print(f\"log-likelihood  Gaussian MLE : {ll_gauss:9.2f}\")\n",
    "print(f\"improvement                  : {ll_ctm - ll_gauss:9.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Takeaways\n",
    "\n",
    "- The fitted $\\hat h(y)$ is visibly non-linear, concentrating resolution\n",
    "  where data are dense and flattening in the heavy right tail.\n",
    "- The Gaussian fit systematically under-predicts the mode and overstates\n",
    "  the left tail; the CTM density matches the histogram tightly.\n",
    "- The likelihood gain over the Gaussian MLE is large — the model\n",
    "  recovers a far better distributional description from the same data\n",
    "  while using only $k+1 = 7$ shape parameters.\n",
    "- The *same fitted object* yields CDF, quantile, density, and hazard\n",
    "  predictions through `model.predict(..., what=...)`."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}