Scalable text clustering based on word embeddings and noise analysis

Dmytro Shutiak; Gleb Podkolzin; Oleksandr Pokhylenko

doi:10.20535/SRIT.2308-8893.2026.2.10

Authors

Dmytro Shutiak, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv, Ukraine, https://orcid.org/0009-0008-6480-3706
Gleb Podkolzin, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv, Ukraine, https://orcid.org/0000-0002-7120-2772
Oleksandr Pokhylenko, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv, Ukraine, https://orcid.org/0000-0002-1562-2051

DOI:

https://doi.org/10.20535/SRIT.2308-8893.2026.2.10

Keywords:

text clustering, word embedding, large language models, machine learning, Python

Abstract

Text clustering is a key component in the analysis of unstructured text messages. Before a clustering algorithm can be applied, the text must be converted into vector representations, i.e., embeddings. This paper presents a modification of the HDBSCAN* clustering algorithm that uses custom distance metrics from the Minkowski family (L1, L2, L∞) and parameters tailored to clustering unstructured text data. A major contribution is a novel evaluation metric based on the relative point density of the identified clusters and the surrounding noise formations (“clouds”). Beyond assessing overall clustering quality, this metric highlights problematic dense accumulations within the noise that require additional manual analysis. Experimental evaluation on the “20 Newsgroups” dataset showed that clustering quality is independent of the α parameter but highly sensitive to the distance metric, with L∞ yielding the best results. The nomic-embedding-v1 model significantly outperformed gte-v1.5 on both the silhouette score and the proposed relative density metric.
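
To make the three Minkowski-family metrics named in the abstract concrete, the following minimal Python sketch computes the L1, L2, and L∞ distances between two vectors. It is an illustration only, not the authors' implementation: the helper name minkowski_distance and the toy three-dimensional vectors are assumptions introduced here, and real embedding vectors have hundreds of dimensions.

```python
import math
from typing import Sequence


def minkowski_distance(u: Sequence[float], v: Sequence[float], p: float) -> float:
    """Minkowski distance of order p; p = math.inf gives the L∞ (Chebyshev) distance.

    Hypothetical helper for illustration, not taken from the paper.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    if p < 1:
        raise ValueError("p must be >= 1 for a valid metric")
    # Per-coordinate absolute differences shared by all three metrics.
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if math.isinf(p):
        # L∞: only the largest coordinate difference matters.
        return max(diffs, default=0.0)
    # General case: p = 1 is Manhattan (L1), p = 2 is Euclidean (L2).
    return sum(d ** p for d in diffs) ** (1.0 / p)


# Toy "embeddings" (made-up values chosen so the arithmetic is easy to check).
doc_a = [0.0, 3.0, 1.0]
doc_b = [4.0, 0.0, 1.0]

l1 = minkowski_distance(doc_a, doc_b, 1)           # 4 + 3 + 0
l2 = minkowski_distance(doc_a, doc_b, 2)           # sqrt(16 + 9 + 0)
linf = minkowski_distance(doc_a, doc_b, math.inf)  # max(4, 3, 0)
```

For the same pair of points the three values always satisfy L∞ ≤ L2 ≤ L1, so the choice of metric changes how far apart two documents appear. This is one plausible reason a density-based method could be sensitive to the metric, as the abstract reports.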

References

A. Petukhova, J.P. Matos-Carvalho, N. Fachada, “Text Clustering with LLM Embeddings,” arXiv preprint, 2024. doi: https://doi.org/10.48550/arXiv.2403.15112

N. Muennighoff, N. Tazi, L. Magne, N. Reimers, “MTEB: Massive Text Embedding Benchmark,” arXiv preprint, 2022. doi: https://doi.org/10.48550/arXiv.2210.07316

Z. Nussbaum, J.X. Morris, B. Duderstadt, A. Mulyar, “Nomic Embed: Training a Reproducible Long Context Text Embedder,” arXiv preprint, 2024. doi: https://doi.org/10.48550/arXiv.2402.01613

Z. Li, X. Zhang, Y. Zhang, D. Long, P. Xie, M. Zhang, “Towards General Text Embeddings with Multi-stage Contrastive Learning,” arXiv preprint, 2023. doi: https://doi.org/10.48550/arXiv.2308.03281

D. Zhang, J. Li, Z. Zeng, F. Wang, “Jasper and Stella: distillation of SOTA embedding models,” arXiv preprint, 2025. doi: https://doi.org/10.48550/arXiv.2412.19048

C. Malzer, M. Baum, “A Hybrid Approach to Hierarchical Density-based Cluster Selection,” 2020 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), Karlsruhe, Germany, 2020, pp. 223–228. doi: https://doi.org/10.1109/MFI49285.2020.9235263

Y. Feng, Z. Chen, Y. Zhang, W. Huang, X. Zhang, S. He, “BERTopic_Teen: A multi-module optimization approach for short text topic modeling in adolescent health,” Frontiers in Public Health, vol. 13, p. 1608241, Aug. 2025. doi: https://doi.org/10.3389/fpubh.2025.1608241

T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, 3rd ed. MIT Press, 2009, pp. 631–638.

R.J.G.B. Campello, P. Kröger, J. Sander, A. Zimek, “Density-based clustering,” WIREs Data Mining and Knowledge Discovery, vol. 10, no. 2, p. e1343, 2020. doi: https://doi.org/10.1002/widm.1343

T. Mitchell, “Twenty Newsgroups,” UCI Machine Learning Repository. doi: https://doi.org/10.24432/C5C323

L. McInnes, J. Healy, J. Melville, “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction,” arXiv preprint, 2020. doi: https://doi.org/10.48550/arXiv.1802.03426

R. Saha, “Influence of various text embeddings on clustering performance in NLP,” arXiv preprint, 2023. doi: https://doi.org/10.48550/arXiv.2305.03144
