MixMatch

git repo

Estimate an approximating Gaussian mixture for observed data that maximizes the expected log pointwise predictive density.

Why?

Given data $X$ drawn from an arbitrary distribution $\mathcal D = p(X|\theta)$, we are often interested in the inverse problem of determining the parameters of $\mathcal D$. If we already know a priori what the underlying distribution should be, it is relatively straightforward to apply techniques like maximum likelihood estimation to determine the best parameters of $\mathcal D$. However, occasionally we are unsure what the data-generating distribution is. In this case it is necessary to make some assumptions about what the candidate models could be and then evaluate the quality of these models using some information criterion.

In this package, we use a Gaussian mixture model (that is, a weighted sum of Gaussians) to estimate the posterior predictive distribution $\widetilde{\mathcal D}(x | X) = \int_\theta d\theta \, p(x|\theta) p(\theta | X)$. The optimal number of components is estimated using leave-one-out cross-validation.
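For concreteness, the two quantities named above can be written out as follows. The symbols $K$ (number of components), $w_k$, $\mu_k$, $\sigma_k$ (weight, mean, and standard deviation of component $k$), and $X_{-i}$ (the data with the $i$-th point removed) are notation introduced here for illustration; the leave-one-out expression is the standard estimate of the expected log pointwise predictive density, not necessarily the exact form computed by the package.

$$
p(x \mid \theta) = \sum_{k=1}^{K} w_k \, \mathcal N(x \mid \mu_k, \sigma_k), \qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1
$$

$$
\widehat{\mathrm{elpd}}_\text{loo} = \sum_{i=1}^{n} \log p(x_i \mid X_{-i}), \qquad p(x_i \mid X_{-i}) = \int_\theta d\theta \, p(x_i \mid \theta) \, p(\theta \mid X_{-i})
$$

A larger value indicates that the model predicts held-out points better, so the number of components $K$ is chosen to maximize it.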

This work is implemented in PyMC. Thus, rather than rely on the expectation-maximization algorithm, we estimate $\widetilde{\mathcal D}(x | X)$ through a fully Bayesian, MCMC-based approach.

Examples

Data-generating distribution is a Gaussian mixture

Suppose we have data generated from the Gaussian mixture $\mathcal D = \frac 12 \mathcal N(-5, 1) + \frac 12 \mathcal N(5, 1)$.

import numpy as np
data = np.hstack([np.random.normal(-5, 1, 100),  np.random.normal(5, 1, 100)])
[Figure: gmm data]

To estimate the posterior predictive distribution, we need only specify the minimum and maximum number of components in our Gaussian mixture. The library will fit a mixture for each of $N_\text{min}, \dots, N_\text{max}$ components before returning the model that maximizes the expected log pointwise predictive density.
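The selection loop described here can be illustrated with a deliberately simplified sketch. It is not the package's method: it swaps the Bayesian MCMC fit for plug-in expectation-maximization point estimates and computes the leave-one-out score by brute-force refitting. All function names (`fit_mixture_em`, `loo_log_score`) are hypothetical and exist only for this example.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, weights, means, sigmas):
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas))


def fit_mixture_em(xs, k, n_iter=50):
    """Point-estimate a k-component mixture by EM (stand-in for an MCMC fit)."""
    n = len(xs)
    xs_sorted = sorted(xs)
    # Deterministic start: means at evenly spaced quantiles, equal weights.
    means = [xs_sorted[int((j + 0.5) * n / k)] for j in range(k)]
    mean_all = sum(xs) / n
    spread = math.sqrt(sum((x - mean_all) ** 2 for x in xs) / n)
    sigmas = [spread / k] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: responsibility-weighted updates of the parameters.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsed component
    return weights, means, sigmas


def loo_log_score(xs, k):
    """Sum of log densities of each point under a fit to all the other points."""
    total = 0.0
    for i, x in enumerate(xs):
        rest = xs[:i] + xs[i + 1:]
        total += math.log(mixture_pdf(x, *fit_mixture_em(rest, k)))
    return total


rng = random.Random(0)
xs = [rng.gauss(-5.0, 1.0) for _ in range(30)] + [rng.gauss(5.0, 1.0) for _ in range(30)]

scores = {k: loo_log_score(xs, k) for k in (1, 2)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Brute-force refitting is only practical for small data sets and cheap fits, which is why it is used here purely as an illustration of the selection criterion.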

from mixmatch import MixMatch

mix = MixMatch(data=data, min_components=1, max_components=4)
print(mix)

# Returns: 
# Best approximation [2 components]:
#  0.50 [-4.90(0.96)] + 0.50 [4.8(1.0)]

This result agrees with our expectations.

Finally, we can also compare the performance of the different models, as well as the highest-weight model against the data.

[Figures: gmm comparison; gmm ppc]

We note that although there appears to be some discrepancy between the observed data and the posterior predictive distribution, this apparent discrepancy is a consequence of using kernel density estimation to plot the data; when plotting the true data-generating distribution instead, the agreement is visually much better.
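The smoothing effect of a kernel density estimate can be checked numerically. The sketch below assumes a Gaussian kernel with Scott's rule for the bandwidth; the plotting routine used for the figures may use a different kernel or bandwidth, so this only illustrates the general mechanism. Because the bandwidth is set by the overall spread of a well-separated bimodal sample, the estimate flattens each peak relative to the true density.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def true_density(x):
    # The data-generating mixture from this example.
    return 0.5 * normal_pdf(x, -5.0, 1.0) + 0.5 * normal_pdf(x, 5.0, 1.0)


def gaussian_kde(samples, x, bandwidth):
    # Average of Gaussian kernels centered on each sample.
    return sum(normal_pdf(x, s, bandwidth) for s in samples) / len(samples)


rng = random.Random(1)
samples = [rng.gauss(-5.0, 1.0) for _ in range(100)]
samples += [rng.gauss(5.0, 1.0) for _ in range(100)]

n = len(samples)
mean = sum(samples) / n
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / (n - 1))
bandwidth = std * n ** (-1.0 / 5.0)  # Scott's rule (an assumption of this sketch)

kde_at_peak = gaussian_kde(samples, 5.0, bandwidth)
true_at_peak = true_density(5.0)
print(f"bandwidth={bandwidth:.2f}  KDE at x=5: {kde_at_peak:.3f}  true: {true_at_peak:.3f}")
```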

Data-generating distribution is a uniform distribution

Next we consider data drawn from the uniform distribution $\mathcal U(0, 1)$. Unlike the previous example, the data-generating and posterior predictive distributions have different supports. However, as we add components, we should generally expect our mixture model to improve.

import numpy as np
data = np.random.uniform(0, 1, 1000)
[Figure: gmm data]

Considering up to 6 components, we find

from mixmatch import MixMatch

mix = MixMatch(data=data, min_components=1, max_components=6)
print(mix)

# Returns: 
# Best approximation [6 components]:
#   0.08 [0.053(0.034)] + 0.18 [0.189(0.079)] + 0.33 [0.42(0.13)] + 0.25 [0.68(0.11)] + 0.11 [0.868(0.055)] + 0.05 [0.968(0.022)]

[Figures: gmm comparison; gmm ppc]

As expected, adding more components causes the model to improve (albeit with diminishing returns).
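The difference in supports can be made quantitative. Each Gaussian component has unbounded support, so the fitted mixture places some probability outside $[0, 1]$. The sketch below uses the rounded values from the printed output and assumes that each bracket reads as mean(standard deviation) for one component; that reading of the output format is an assumption, not something stated by the package.

```python
import math


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


# (weight, mean, standard deviation), copied from the printed 6-component result.
components = [
    (0.08, 0.053, 0.034),
    (0.18, 0.189, 0.079),
    (0.33, 0.42, 0.13),
    (0.25, 0.68, 0.11),
    (0.11, 0.868, 0.055),
    (0.05, 0.968, 0.022),
]

# Probability the mixture assigns to x < 0 or x > 1.
mass_outside = sum(
    w * (normal_cdf(0.0, m, s) + 1.0 - normal_cdf(1.0, m, s))
    for w, m, s in components
)
print(f"mass outside [0, 1]: {mass_outside:.4f}")
```

With these rounded values the leaked mass is small, on the order of one percent, which is consistent with the narrow components the fit places near the two edges.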
