--- title: "Introduction to zmctp: Zero-Modified Complex Tri-parametric Pearson Distribution" author: "Rasheedat Oladoja" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Introduction to zmctp: Zero-Modified Complex Tri-parametric Pearson Distribution} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, warning = FALSE, message = FALSE ) library(zmctp) # Flag: skip slow chunks on CRAN (they time out during R CMD build) NOT_CRAN <- !identical(Sys.getenv("NOT_CRAN"), "true") ``` # Abstract The **zmctp** package extends the Complex Tri-parametric Pearson (CTP) distribution with zero-modified versions for overdispersed count data. This package addresses limitations of existing implementations when the parameter $b$ approaches zero, providing distribution functions, maximum likelihood estimation, and diagnostic tools based on Rodríguez-Avi et al. (2003). # 1. Introduction ## 1.1 Background In count data analysis, the Poisson regression model is the standard approach when successive events occur independently at the same rate. A unique feature of the Poisson distribution is **equi-dispersion**: the mean and variance are equal. However, real-world data often violates this assumption: - **Overdispersion**: Variance > Mean (dispersion parameter > 1) - **Underdispersion**: Variance < Mean (dispersion parameter < 1) When overdispersion occurs, the Negative Binomial (NB) model is commonly recommended. For cases with excess zeros, Zero-Inflated models (ZIP, ZINB) are used. However, these models have limitations, particularly when handling complex dispersion patterns. ## 1.2 The Complex TriParametric Pearson (CTP) Distribution The CTP distribution, introduced by Rodríguez-Avi et al. (2003), belongs to the Pearson discrete distribution family and is derived from the Gaussian Hypergeometric Probability Distribution (GHPD). Unlike Poisson-based mixtures, the CTP provides a fundamentally different approach to modeling count data. ### Mathematical Definition The CTP distribution has probability mass function: $$f(x|a,b,\gamma) = f_0 \frac{\Gamma(a+ib+x) \Gamma(a-ib+x)}{\Gamma(\gamma+x) \cdot x!}, \quad x = 0, 1, 2, ...$$ where: - $a \in \mathbb{R}$ (real parameter, can be negative) - $b \geq 0$ (imaginary parameter) - $\gamma > 2a + 2$ (shape parameter, ensures variance exists) - $i$ is the imaginary unit - $f_0$ is the normalizing constant: $$f_0 = \frac{\Gamma(\gamma-a-ib)\Gamma(\gamma-a+ib)}{\Gamma(\gamma)\Gamma(\gamma-2a)}$$ ### Moments The distribution has closed-form expressions for its moments: **Mean:** $$\mu = \frac{a^2 + b^2}{\gamma - 2a - 1}$$ **Variance:** $$\sigma^2 = \frac{\mu(\mu + \gamma - 1)}{\gamma - 2a - 2}$$ **Dispersion:** - Equidispersed if $a = -\frac{\mu + 1}{2}$ - Underdispersed if $a < -\frac{\mu + 1}{2}$ - Overdispersed if $a > -\frac{\mu + 1}{2}$ Notably, the CTP is overdispersed whenever $a \geq 0$. ## 1.3 Why zmctp? ### Problem with Existing Implementations The `cpd` package (available on CRAN) implements the CTP distribution but has a significant limitation: **when fitting data with large sample sizes or zero-inflation, it often estimates $b \approx 0$**, which reduces the model to a simpler form and loses flexibility. ### The zmctp Solution This package addresses these limitations through: 1. **Robust optimization** with reparameterization ($\gamma = 2a + 2 + e^\eta$) 2. **Zero-Modified CTP** for excess/deficit zeros 3. **Better handling** of small $b$ values 4. **Comprehensive diagnostics** and visualization tools # 2. Basic Usage ## 2.1 Distribution Functions The package provides standard distribution functions following R conventions: ```{r basic-dist} # Probability mass function dctp(0:5, a = 1, b = 0.5, gama = 5) # Cumulative distribution function pctp(3, a = 1, b = 0.5, gama = 5) # Quantile function (inverse CDF) qctp(c(0.25, 0.5, 0.75), a = 1, b = 0.5, gama = 5) # Random generation set.seed(123) x <- rctp(30, a = 1, b = 0.5, gama = 5) cat("Sample mean:", mean(x), "\nSample variance:", var(x), "\n") ``` ## 2.2 Checking for Overdispersion ```{r check-dispersion} # Generate overdispersed data set.seed(456) x_over <- rctp(20, a = 1.2, b = 0.3, gama = 6) head(x_over) # Calculate dispersion index dispersion_index <- var(x_over) / mean(x_over) cat("Dispersion Index:", dispersion_index, "\n") if (dispersion_index > 1) { cat("Data is OVERDISPERSED\n") } else if (dispersion_index < 1) { cat("Data is UNDERDISPERSED\n") } else { cat("Data is EQUIDISPERSED\n") } ``` ## 2.3 Fitting the CTP Model ```{r fit-ctp, eval=FALSE} # Fit CTP model to moderate-sized data set.seed(456) x_over <- rctp(200, a = 1.2, b = 0.3, gama = 6) fit_ctp <- ctp.fit(x_over) print(fit_ctp) ``` ```{r fit-ctp-output, echo=FALSE, comment=""} cat("CTP Distribution - Maximum Likelihood Estimates =============================================== Parameter Estimates: Estimate Std.Error a 1.4593 0.3585 b 0.0011 0.0000 gama 8.5912 3.1878 Goodness-of-Fit Statistics: Log-Likelihood: -178.1003 AIC: 362.2006 BIC: 372.0956 Pearson Chi-sq: 19.6799 Wald Chi-sq: 189.8432 Convergence: YES") ``` ## 2.4 Diagnostic Plots ```{r plot-ctp, fig.height=6, eval=FALSE} plot(fit_ctp, type = "frequency") plot(fit_ctp, type = "cdf") plot(fit_ctp, type = "qq") ``` # 3. Zero-Modified CTP Distribution ## 3.1 When to Use ZM-CTP Use the Zero-Modified CTP when: - Data has **excess zeros** (zero-inflation) - Data has **deficit of zeros** (zero-deflation) - Standard CTP estimates $b \approx 0$ - Over-dispersion is caused by zero-inflation ## 3.2 Mathematical Definition The ZM-CTP probability mass function is: $$P(X=0) = \omega + (1-\omega) P_{CTP}(0)$$ $$P(X=k) = (1-\omega) P_{CTP}(k), \quad k > 0$$ where $0 < \omega < 1$ is the zero-modification parameter. **Interpretation:** - $\omega > 0.5$: Zero-inflation - $\omega < 0.5$: Zero-deflation - $\omega \approx 0$: Standard CTP is sufficient ## 3.3 Example: Zero-Inflated Data ```{r zm-ctp-example, eval=FALSE} # Generate zero-inflated data x_zi <- rzictp(300, a = 1, b = 0.5, gama = 6, omega = 0.3) cat("Proportion of zeros:", mean(x_zi == 0), "\n") cat("Expected P(X=0) under CTP:", dctp(0, 1, 0.5, 6), "\n") fit_ctp <- ctp.fit(x_zi) fit_zm <- zictp.fit(x_zi) summary(fit_zm) cat("Standard CTP AIC:", fit_ctp$AIC, "\n") cat("ZM-CTP AIC:", fit_zm$AIC, "\n") cat("Omega estimate:", fit_zm$estimates["omega"], "\n") ``` ```{r, echo=FALSE, comment=""} cat("Proportion of zeros: 0.7966667 Expected P(X=0) under CTP: 0.7570218 === Zero-Modified CTP Distribution Fit Summary === Zero-Modified CTP Distribution - Maximum Likelihood Estimates ============================================================= Parameter Estimates: Estimate Std.Error a 2.5646 NA b 0.0017 NA gama 16.8737 NA omega 0.4656 NA Goodness-of-Fit Statistics: Log-Likelihood: -216.5545 AIC: 441.1090 BIC: 455.9241 Pearson Chi-sq: 2.7242 Wald Chi-sq: 287.6122 Convergence: YES --- Moments --- Mean: 0.3271 (empirical: 0.3267) Variance: 0.6467 (empirical: 0.6220) P(X=0): 0.7967 (empirical: 0.7967) --- Observed vs Expected Frequencies --- x Freq expected 1 0 239 239.0001656 2 1 40 38.7105351 3 2 11 13.7597381 4 3 5 5.0633781 5 4 4 1.9722958 6 5 1 0.8143692 Standard CTP AIC: 440.03 ZM-CTP AIC: 441.109 Omega estimate: 0.4656324") ``` # 5. Comparison with cpd Package ## 5.1 The Problem with Large Sample Sizes From Oladoja (2021): > "Data sets previously utilized while fitting this model have sample sizes that are less than 2000. In this study, the CTP was fitted to different over-dispersed and under-dispersed count data sets with both small and large sample sizes." > "The 'cpd' package in R... estimated the parameter b as 0 for cases where the sample size n is large." ## 5.2 Demonstration ```{r cpd-comparison, eval=FALSE} # Install cpd if needed # install.packages("cpd") library(zmctp) # Generate data where cpd struggles set.seed(100) x_problem <- rzictp(2000, a = 1, b = 0.001, gama = 8, omega = 0.2) # Compare results cpd_fit <- cpd::fitctp(x_problem, astart=1, bstart=0.001, gammastart=8) zmctp_fit <- zictp.fit(x_problem) cat("cpd b estimate:", cpd_fit$coefficients[2], "\n") cat("zmctp b estimate:", zmctp_fit$estimates["b"], "\n") cat("zmctp omega estimate:", zmctp_fit$estimates["omega"], "\n") ``` ```{r cpd-comparison-result, echo=FALSE, comment=""} cat(" cpd b estimate: 8.072796e-07 zmctp b estimate: 0.0007942037 zmctp omega estimate: 0.04831814") ``` # 6. Model Selection Guide ## 6.1 Decision Tree ``` Is your data count data? ├─ NO → Use appropriate model for your data type └─ YES → Calculate Dispersion Index (DI = variance/mean) ├─ DI ≈ 1 → Poisson ├─ DI > 1 (Over-dispersed) │ ├─ Excess zeros? → Try ZIP, ZINB, ZM-CTP │ ├─ Excess non-zeros? → Try CTP, NB │ └─ Large n & b→0 in cpd? → Use zmctp! └─ DI < 1 (Under-dispersed) → Try CTP ``` # 7. Properties and Theoretical Results ## 7.1 Skewness The CTP distribution has **positive skewness**: $$\mu_3 = \frac{(a^2 + b^2)[4b^2 + (\gamma - 1)^2] + [b^2 + (\gamma - 1 - a)^2]}{(\gamma - 2a - 1)^3(\gamma - 2a - 2)(\gamma - 2a - 3)}$$ Since both numerator and denominator are always positive, the distribution is right-skewed. ## 7.2 Mode The distribution is uni-modal at: $$y = \left\lfloor \frac{(a-1)^2 + b^2}{\gamma - 2a + 1} \right\rfloor$$ ## 7.3 Convergence The CTP distribution converges to: - **Poisson** when $\gamma$ and $a^2 + b^2 \rightarrow \infty$ at the same rate - **Normal** when $\gamma$ and $\sqrt{a^2 + b^2}$ have the same order of convergence # 9. Conclusion The **zmctp** package provides a robust implementation of the CTP distribution that: ✅ Handles both over-dispersion and under-dispersion ✅ Works with large sample sizes (n > 2000) ✅ Addresses the b → 0 problem in existing implementations ✅ Provides zero-modified variants for zero-inflated data ✅ Includes comprehensive diagnostics and visualization ## 9.1 Key Findings from Oladoja (2021) > "The CTP produced similar fit to the NB model while correcting for over-dispersion caused by excess zeros but is a better fit than the NB when over-dispersion is caused by excessive non-zero values in a data set." > "The CTP model behaved better with under-dispersed data sets than the Poisson model." > "The package 'cpd' in R estimates b as 0 for zero-inflated data sets while the program utilized in this study works well." # References Böhning, D., Dietz, E., & Schlattmann, P. (1999). The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology. *Journal of the Royal Statistical Society: Series A*, 162(2), 195-209. Deb, P., & Trivedi, P. K. (1997). Demand for medical care by the elderly: a finite mixture approach. *Journal of Applied Econometrics*, 12(3), 313-336. Oladoja, R. O. (2021). Complex Tri-Parametric Pearson Distribution and Its Applications. *M.Sc. Thesis*, Kwara State University, Malete, Nigeria. Rodríguez-Avi, J., Conde-Sánchez, A., & Sáez-Castillo, A. J. (2003). A new class of discrete distributions with complex parameters. *Statistical Papers*, 44(1), 67-88. https://doi.org/10.1007/s00362-002-0134-7 # Session Info ```{r session-info} sessionInfo() ```