{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Survival analysis with Coxph\n", "\n", "`mltpy.Coxph` fits a proportional-hazards model in which the baseline\n", "log-cumulative-hazard is modelled as a flexible, monotone Bernstein\n", "polynomial and the link function is the minimum-extreme-value distribution.\n", "With covariates entering linearly, this is equivalent to the classical\n", "Cox proportional-hazards model, but the baseline distribution is estimated\n", "smoothly rather than left fully non-parametric.\n", "\n", "This vignette reproduces the Cox-PH illustration from Hothorn (2020) on\n", "synthetic right-censored data and compares the estimated survival curve\n", "to the Kaplan-Meier estimate from [lifelines](https://lifelines.readthedocs.io/).\n", "\n", "> *lifelines is an optional dependency:* install with\n", "> `pip install \"mltpy[examples]\"`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from lifelines import KaplanMeierFitter\n", "\n", "import mltpy\n", "\n", "rng = np.random.default_rng(0)\n", "plt.rcParams[\"figure.dpi\"] = 110" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data-generating process\n", "\n", "250 event times from $T\\sim\\text{Exp}(\\text{scale}=2)$, observed under\n", "independent uniform right-censoring. The censoring window is tuned to a\n", "~30 % censoring rate, a realistic setting for observational follow-up\n", "studies.\n", "\n", "mltpy expects a `CensoredData` container with the convention\n", "`censored=True` when the event is *not* observed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n = 250\n", "t_event = rng.exponential(scale=2.0, size=n)\n", "c = rng.uniform(0.8, 5.0, size=n)\n", "y_obs = np.minimum(t_event, c)\n", "event = t_event <= c # True -> exact observation\n", "censored = ~event # mltpy: True -> right-censored\n", "print(f\"censoring rate = {censored.mean():.0%}\")\n", "print(f\"observed time range: [{y_obs.min():.3f}, {y_obs.max():.3f}]\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cd = mltpy.CensoredData.right_censored(y_obs, censored)\n", "\n", "support = (float(y_obs.min()) * 0.5, float(y_obs.max()) + 0.5)\n", "model = mltpy.Coxph(support=support, order=6)\n", "model.fit(cd)\n", "print(model.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Kaplan-Meier reference\n", "\n", "The non-parametric Kaplan-Meier estimator is the standard baseline for\n", "right-censored survival data. It serves as the reference against which\n", "the smooth Coxph estimate is compared. lifelines expects the *event*\n", "indicator — the complement of `censored`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kmf = KaplanMeierFitter()\n", "kmf.fit(durations=y_obs, event_observed=event.astype(int), label=\"Kaplan-Meier\")\n", "km_median = kmf.median_survival_time_\n", "ctm_median = model.predict(np.array([0.5]), what=\"quantile\")[0]\n", "true_median = 2.0 * np.log(2)\n", "print(f\"true median : {true_median:.3f}\")\n", "print(f\"Kaplan-Meier : {km_median:.3f}\")\n", "print(f\"mltpy Coxph : {ctm_median:.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Survival and hazard curves" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t_grid = np.linspace(support[0] + 1e-3, support[1] - 1e-3, 400)\n", "S_ctm = model.survival(t_grid)\n", "h_ctm = model.hazard(t_grid)\n", "# Truth: Exp(scale=2) has S(t) = exp(-t/2), hazard 0.5.\n", "S_true = np.exp(-t_grid / 2.0)\n", "h_true = np.full_like(t_grid, 0.5)\n", "\n", "fig, (ax_s, ax_h) = plt.subplots(1, 2, figsize=(11, 4))\n", "\n", "ax_s.plot(t_grid, S_true, \"--\", color=\"0.3\", label=\"true S(t) = exp(-t/2)\")\n", "kmf.plot_survival_function(ax=ax_s, color=\"C3\", ci_show=True)\n", "ax_s.plot(t_grid, S_ctm, label=\"mltpy Coxph\", lw=2, color=\"C0\")\n", "ax_s.set_xlabel(\"t\")\n", "ax_s.set_ylabel(\"S(t)\")\n", "ax_s.set_title(\"Survival function\")\n", "ax_s.legend()\n", "\n", "ax_h.plot(t_grid, h_true, \"--\", color=\"0.3\", label=\"true hazard = 0.5\")\n", "ax_h.plot(t_grid, h_ctm, label=\"mltpy Coxph\", lw=2, color=\"C0\")\n", "ax_h.set_xlabel(\"t\")\n", "ax_h.set_ylabel(\"h(t)\")\n", "ax_h.set_title(\"Hazard rate\")\n", "ax_h.set_ylim(0, 1.5)\n", "ax_h.legend()\n", "fig.tight_layout();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Takeaways\n", "\n", "- The Coxph estimate tracks the Kaplan-Meier survival curve closely, but\n", " is smooth — it does not have the staircase pattern of KM and returns\n", " finite values everywhere inside the support.\n", "- Because the fit is smooth, the *hazard rate* is directly available via\n", " `model.hazard(t)`. KM provides no hazard estimate without further\n", " kernel smoothing.\n", "- The median survival time recovered by mltpy sits between the truth\n", " $2\\ln 2 \\approx 1.386$ and the KM estimate — both are consistent at\n", " $n=250$ with ~30 % censoring.\n", "- `Coxph` accepts covariates through `fit(cd, X=X)` exactly like\n", " `MLT`; the hazard ratios appear in the trailing entries of `model.theta_`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 5 }