--- title: "Getting started with clusteredMSM" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting started with clusteredMSM} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4 ) ``` ## Overview `clusteredMSM` provides nonparametric analysis of clustered multistate process data, based on Bakoyannis (2021) and Bakoyannis & Bandyopadhyay (2022). It exposes one user-facing function, `patp()`, that: * Estimates the population-averaged transition probability P(X(t) = j | X(s) = h) using the working-independence Aalen-Johansen estimator. * Returns cluster-bootstrap pointwise confidence intervals and simultaneous bands. * Conducts a two-sample Kolmogorov-Smirnov-type test when the formula has a grouping variable on the right-hand side. * Supports a landmark variant for non-Markov processes. * Natively handles non-monotone (recovery) multistate models. The package depends only on `survival`. ## Input data format `clusteredMSM` expects data in **interval format**: one row per mutually-exclusive time interval per subject, with columns for interval start time (`Tstart`), end time (`Tstop`), starting state (`Sstart`), and ending state (`Sstop`). The column names are flexible — they are passed by name through the formula. Censoring is encoded as `Sstart == Sstop` on the **final row** of a subject's record. Within each subject, intervals must be temporally contiguous (`Tstop[k] == Tstart[k+1]`) and state contiguous (`Sstop[k] == Sstart[k+1]`). ## A progressive illness-death example ```{r setup} library(clusteredMSM) # 1=Healthy, 2=Ill, 3=Dead. Allowed: 1->2, 1->3, 2->3. tmat <- trans_mat( list(c(2, 3), 3, integer(0)), names = c("Healthy", "Ill", "Dead") ) tmat ``` A small example dataset in interval format: ```{r data} mydata <- data.frame( pid = c(1, 1, 2, 3, 4, 4), site = c(1, 1, 1, 2, 2, 2), treatment = c("A", "A", "A", "B", "B", "B"), t0 = c(0.0, 1.5, 0.0, 0.0, 0.0, 1.0), t1 = c(1.5, 3.0, 2.0, 1.0, 1.0, 2.5), s0 = c(1, 2, 1, 1, 1, 2), s1 = c(2, 3, 1, 3, 2, 3) ) mydata ``` Subject 1: H -> I -> D. Subject 2: censored healthy at t = 2.0 (`s0 == s1`). Subject 3: died at t = 1.0. Subject 4: H -> I -> D. ## One-sample estimation ```{r estimate, eval = FALSE} fit <- patp( msm(t0, t1, s0, s1) ~ 1, data = mydata, tmat = tmat, id = "pid", cluster = "site", h = 1, j = 2, s = 0, B = 1000, cband = TRUE, seed = 1 ) fit ``` `fit$curves` contains the time-indexed point estimate of \begin{eqnarray*} P(X(t) = j \,|\, X(s) = h) &=& P(X(t) = 2 \,|\, X(0) = 1) \\ &=& P(X(t) = 2), \end{eqnarray*} (second equality due to the fact that all start at state = 1 in the classical illness-death model), standard error, pointwise confidence interval, and (with `cband = TRUE`) simultaneous band limits. ## Two-sample comparison (estimation + test in one call) If the formula's right-hand side names a binary grouping variable, `patp()` automatically estimates both curves and conducts a Kolmogorov-Smirnov-type test: ```{r test, eval = FALSE} tt <- patp( msm(t0, t1, s0, s1) ~ treatment, data = mydata, tmat = tmat, id = "pid", cluster = "site", h = 1, j = 2, s = 0, B = 1000, seed = 1 ) tt ``` The `$curves` slot now has one row block per group. The `$test` slot contains the observed K-S statistic and the bootstrap p-value. ## Recovery models (non-monotone processes) For processes with cyclic transitions (e.g., illness-death with recovery), use the long event format directly. The same `patp()` call works: ```{r recovery, eval = FALSE} tmat_rec <- trans_mat( list(c(2, 3), c(1, 3), integer(0)), names = c("Healthy", "Ill", "Dead") ) # Subject who went Healthy -> Ill -> Healthy -> censored: recovery_data <- data.frame( pid = c(1, 1, 1), t0 = c(0.0, 1.0, 2.0), t1 = c(1.0, 2.0, 3.5), s0 = c(1, 2, 1), s1 = c(2, 1, 1) # last row: censored healthy ) patp(msm(t0, t1, s0, s1) ~ 1, data = recovery_data, tmat = tmat_rec, id = "pid", h = 1, j = 2, s = 0, B = 500, seed = 1) ``` ## Landmark variant When `s > 0` and the Markov assumption is questionable, use `LMAJ = TRUE`: ```{r lmaj, eval = FALSE} patp(msm(t0, t1, s0, s1) ~ 1, data = mydata, tmat = tmat, id = "pid", cluster = "site", h = 1, j = 2, s = 1.0, LMAJ = TRUE, B = 1000, seed = 1) ``` ## Inverse-cluster-size weighting When cluster size is potentially informative (e.g., sicker centers contribute more patients), use the weighted estimator: ```{r weighted, eval = FALSE} patp(msm(t0, t1, s0, s1) ~ 1, data = mydata, tmat = tmat, id = "pid", cluster = "site", h = 1, j = 2, s = 0, weighted = TRUE, B = 1000, seed = 1) ``` `weighted = TRUE` requires the `cluster` argument. ## What `B = 0` does Setting `B = 0` returns the point estimate without any bootstrap inference. This is permitted only for one-sample formulas; two-sample formulas always require `B > 0` (since the entire point of a two-sample analysis is the test). Use `B = 0` for fast exploratory work, then switch to `B = 1000` for inference. ## References Bakoyannis G (2021). Nonparametric analysis of nonhomogeneous multistate processes with clustered observations. *Biometrics* 77(2):533-546. [doi:10.1111/biom.13327](https://doi.org/10.1111/biom.13327) Bakoyannis G, Bandyopadhyay D (2022). Nonparametric tests for multistate processes with clustered data. *Annals of the Institute of Statistical Mathematics* 74(5):837-867. [doi:10.1007/s10463-021-00819-x](https://doi.org/10.1007/s10463-021-00819-x) Putter H, Spitoni C (2018). Non-parametric estimation of transition probabilities in non-Markov multi-state models: the landmark Aalen-Johansen estimator. *Statistical Methods in Medical Research* 27(7):2081-2092. [doi:10.1177/0962280216674497](https://doi.org/10.1177/0962280216674497)