Comparative analysis of modified semi-supervised learning algorithms on a small amount of labeled data




center of mass, clustering, distance function, medoids, nearest neighbor, semi-supervised learning


The paper is devoted to improving semi-supervised clustering methods and comparing their accuracy and robustness. The proposed approach extends a clustering algorithm to use an available set of labels by replacing its distance function. The new distance function takes into account not only spatial data but also the available labels. Moreover, the proposed distance function can be adapted to work with ordinal variables as labels. An extended approach is also considered, which combines the unsupervised k-medoids method, modified to use only labeled data during the medoid calculation step, the supervised k-nearest-neighbor method, and unsupervised k-means. The learning algorithm uses information about the nearest points and the classes’ centers of mass. The experimental results demonstrate that semi-supervised learning can be used even with a small amount of labeled data, and that the proposed modifications improve the algorithms’ accuracy and performance.
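
To make the idea of a label-aware distance concrete, the following is a minimal sketch, not the formula from the paper. It assumes a hypothetical additive form (Euclidean distance plus a weighted penalty on the difference between integer-coded labels, which also covers ordinal labels) and picks one medoid per class from the labeled points only. The names `label_aware_distance`, `labeled_medoids`, `assign`, and the `weight` parameter are illustrative assumptions, and the k-nearest-neighbor and k-means steps of the extended approach are not shown.

```python
import math


def label_aware_distance(x, y, label_x=None, label_y=None, weight=1.0):
    """Euclidean distance plus a penalty for differing labels.

    Hypothetical form, not the paper's formula: when both points are
    labeled, add weight * |label_x - label_y|. With integer-coded
    ordinal labels, classes that are further apart are penalized more.
    """
    spatial = math.dist(x, y)
    if label_x is None or label_y is None:
        return spatial  # at least one point is unlabeled: spatial term only
    return spatial + weight * abs(label_x - label_y)


def labeled_medoids(points, labels):
    """Pick one medoid per class, using labeled points only."""
    medoids = {}
    for cls in sorted({l for l in labels if l is not None}):
        members = [p for p, l in zip(points, labels) if l == cls]
        # The medoid minimizes the total distance to its class members.
        medoids[cls] = min(
            members, key=lambda c: sum(math.dist(c, m) for m in members)
        )
    return medoids


def assign(points, labels, medoids, weight=1.0):
    """Assign every point to the class of its nearest medoid."""
    result = []
    for p, l in zip(points, labels):
        nearest = min(
            medoids,
            key=lambda cls: label_aware_distance(p, medoids[cls], l, cls, weight),
        )
        result.append(nearest)
    return result


# Toy data (made up for illustration): None marks an unlabeled point.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (2.0, 2.0)]
labels = [0, 0, None, 1, 1, None, None]

medoids = labeled_medoids(points, labels)
predicted = assign(points, labels, medoids)
```

In this sketch the label term only affects points that already carry a label, so unlabeled points are assigned purely by spatial proximity to the medoids. Other ways of combining the spatial and label terms are equally plausible.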

Author Biographies

Leonid Lyubchyk, National Technical University “Kharkiv Polytechnic Institute”, Kharkiv

Professor, Doctor of Technical Sciences, a lecturer at the Department of Computer Mathematics and Data Analysis of the National Technical University “Kharkiv Polytechnic Institute”, Kharkiv, Ukraine.

Klym Yamkovyi, National Technical University “Kharkiv Polytechnic Institute”, Kharkiv

An assistant at the Department of Computer Mathematics and Data Analysis of the National Technical University “Kharkiv Polytechnic Institute”, Kharkiv, Ukraine.








Theoretical and applied problems of intelligent systems for decision making support