Scalable text clustering based on word embeddings and noise analysis
DOI:
https://doi.org/10.20535/SRIT.2308-8893.2026.2.10
Keywords:
text clustering, word embedding, large language models, machine learning, Python
Abstract
Text clustering is a key component of analyzing unstructured text messages. Before clustering methods can be applied, the text must be converted into vector representations, i.e., embeddings. This paper presents a modification of the HDBSCAN* clustering algorithm that uses custom distance metrics from the Minkowski family (L1, L2, L∞) and parameters tailored to clustering unstructured text data. A major contribution is a novel evaluation metric based on the relative point density of the identified clusters and the surrounding noise formations (“clouds”). Beyond assessing overall clustering quality, this metric highlights problematic dense accumulations within the noise that require additional manual analysis. Experimental evaluation on the “20 Newsgroups” dataset showed that clustering quality is independent of the α parameter but highly sensitive to the distance metric, with L∞ yielding the best results. The nomic-embedding-v1 model significantly outperformed gte-v1.5 on both the silhouette score and the proposed relative density metric.
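To make the three Minkowski-family metrics named in the abstract concrete, the following is a minimal Python sketch of how L1, L2, and L∞ distances between two embedding vectors can be computed. It is an illustration only, not the authors' implementation: the function name `minkowski_distance` and the toy vectors are assumptions introduced here, and real embeddings would have hundreds of dimensions.

```python
import math


def minkowski_distance(u, v, p):
    """Minkowski distance of order p between two equal-length vectors.

    p=1 gives L1 (Manhattan), p=2 gives L2 (Euclidean),
    p=math.inf gives L-infinity (Chebyshev).
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if p == math.inf:
        # L-infinity is the largest coordinate-wise difference.
        return max(diffs, default=0.0)
    if p < 1:
        raise ValueError("p must be >= 1 for a valid metric")
    return sum(d ** p for d in diffs) ** (1.0 / p)


# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
a = [0.0, 3.0, 1.0]
b = [4.0, 0.0, 1.0]

l1 = minkowski_distance(a, b, 1)
l2 = minkowski_distance(a, b, 2)
linf = minkowski_distance(a, b, math.inf)
print(l1, l2, linf)
```

For any pair of vectors the three values are ordered L∞ ≤ L2 ≤ L1, so the same pair of points lies at a different distance under each metric. That difference changes the distances a density-based algorithm such as HDBSCAN* works from, which is one reason the choice of metric can affect the resulting clusters.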