Part of Advances in Neural Information Processing Systems 19 (NIPS 2006)
Kenichi Kurihara, Max Welling, Nikos Vlassis
Dirichlet Process (DP) mixture models are promising candidates for clustering applications where the number of clusters is unknown a priori. Due to computational considerations these models are unfortunately unsuitable for large-scale data-mining applications. We propose a class of deterministic accelerated DP mixture models that can routinely handle millions of data-cases. The speedup is achieved by incorporating kd-trees into a variational Bayesian algorithm for DP mixtures in the stick-breaking representation, similar to that of Blei and Jordan (2005). Our algorithm differs in the use of kd-trees and in the way we handle truncation: we only assume that the variational distributions are fixed at their priors after a certain level. Experiments show that speedups relative to the standard variational algorithm can be significant.
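As a brief aside (not part of the original abstract), a minimal sketch of the stick-breaking representation referred to above, written in standard notation rather than the paper's own symbols: the mixing weights of a DP mixture are generated by repeatedly breaking off pieces of a unit-length stick,

    v_k \sim \mathrm{Beta}(1, \alpha), \qquad
    \pi_k = v_k \prod_{j=1}^{k-1} (1 - v_j), \qquad
    \eta_k \sim G_0, \qquad
    x_n \mid z_n = k \sim p(x_n \mid \eta_k),

with concentration parameter \alpha and base measure G_0. Under the truncation described in the abstract, the variational factors q(v_k) and q(\eta_k) beyond some level are simply assumed to equal these priors, rather than being clamped (e.g. setting q(v_T = 1) = 1) as in hard-truncation schemes.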