Bayesian Nonparametric Clustering in SAS®

Hend Aljobaily
University of Northern Colorado


Abstract

Traditional parametric models use a fixed, finite number of parameters, which makes them a poor fit for many data mining and machine learning problems: when the complexity assumed by the model does not match the complexity of the data, the model may overfit or underfit. The Bayesian nonparametric (BNP) approach is an alternative to the traditional parametric approach. BNP models are well suited to data mining and machine learning because their complexity is data-driven rather than fixed in advance. One of the most popular BNP models is the Dirichlet Process model. For clustering, a Dirichlet Process Gaussian Mixture Model (GMM) can be fit with the gmm action, called from the CAS procedure in SAS®, to find the best number of clusters within the data. The model can add new clusters and remove existing clusters during the clustering process, so it finds the best number of clusters adaptively. In the gmm action, the Dirichlet Process serves as the prior for the mixing proportions of the Gaussian mixture. In this study, a real-world example is used to demonstrate the Dirichlet Process Gaussian Mixture Model for nonparametric clustering of a large dataset in SAS®.