High Dimensional Data Clustering by means of Distributed Dirichlet Process Mixture Models

Khadidja Meguelati 1, Bénédicte Fontez 2, Nadine Hilgert 2, Florent Masseglia 1
1 ZENITH - Scientific Data Management
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract : Clustering is a data mining technique used intensively in data analytics, with applications in marketing, security, text/document analysis, and sciences such as biology and astronomy. The Dirichlet Process Mixture (DPM) is a model for multivariate clustering, with the advantages of discovering the number of clusters automatically and offering favorable characteristics. However, applying DPM to high dimensional data is a significant challenge, with both numerical and theoretical pitfalls. The advantages of DPM also come at the price of prohibitive running times, which impair its adoption and make centralized DPM approaches inefficient, especially with high dimensional data. We propose HD4C (High Dimensional Data Distributed Dirichlet Clustering), a parallel clustering solution that addresses the curse of dimensionality by two means. First, it scales gracefully to massive datasets through distributed computing while remaining DPM-compliant. Second, it clusters high dimensional data such as time series (as a function of time) and hyperspectral data (as a function of wavelength). Our experiments on both synthetic and real-world data illustrate the high performance of our approach.
Document type :
Conference papers

https://hal-lirmm.ccsd.cnrs.fr/lirmm-02364411
Contributor : Florent Masseglia
Submitted on : Saturday, November 16, 2019 - 10:58:26 PM
Last modification on : Tuesday, November 26, 2019 - 2:41:14 PM

File

IEEE_BigData_2019__HAL_.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : lirmm-02364411, version 1

Citation

Khadidja Meguelati, Bénédicte Fontez, Nadine Hilgert, Florent Masseglia. High Dimensional Data Clustering by means of Distributed Dirichlet Process Mixture Models. IEEE International Conference on Big Data (IEEE BigData), Dec 2019, Los Angeles, United States. ⟨lirmm-02364411⟩
