Efficiency comparison of missing data imputation methods in predictive model creation

Authors

  • Andrii Popov, Educational and Research Institute for Applied System Analysis of the National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", Kyiv, Ukraine, https://orcid.org/0009-0001-4783-7401

DOI:

https://doi.org/10.20535/SRIT.2308-8893.2025.1.03

Keywords:

missing data, imputation methods, forecasting models, machine learning

Abstract

Missing data is a common problem in data analysis and machine learning. This article analyzes how the choice of missing data imputation method at the data preprocessing stage affects the quality of forecasting models. The methods compared are listwise deletion, mean imputation, and two implementations of multiple imputation, one in Python and one in R. The classifiers are Logistic Regression, Random Forest (RF), Support Vector Machine, and Light Gradient Boosting Machine (LGBM). Model performance is evaluated using accuracy, precision, and recall. Two datasets were used, each treated as a binary classification problem, with different target metrics. The highest performance was achieved when the R implementation of multiple imputation was combined with the RF and LGBM classifiers.
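As a rough illustration of the two simplest methods named in the abstract and of the three evaluation metrics, the following minimal sketch may help. It assumes a toy feature table in which None marks a missing value; the data and the function names (listwise_deletion, mean_imputation, binary_metrics) are hypothetical and are not taken from the article. Multiple imputation and the classifiers are not shown, since the abstract does not name the specific Python and R implementations that were used.

```python
from statistics import mean

# Toy feature rows; None marks a missing value (illustrative data, not from the article).
ROWS = [
    [1.0, 2.0],
    [None, 4.0],
    [3.0, None],
    [5.0, 6.0],
]


def listwise_deletion(rows):
    """Drop every row that contains at least one missing value."""
    return [row for row in rows if all(v is not None for v in row)]


def mean_imputation(rows):
    """Replace each missing value with the mean of the observed values in its column.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    col_means = [
        mean(row[j] for row in rows if row[j] is not None) for j in range(n_cols)
    ]
    return [
        [col_means[j] if v is None else v for j, v in enumerate(row)] for row in rows
    ]


def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for 0/1 labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


if __name__ == "__main__":
    print(listwise_deletion(ROWS))
    print(mean_imputation(ROWS))
    print(binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```

The sketch makes the trade-off visible: listwise deletion keeps only the two complete rows, while mean imputation keeps all four rows at the cost of replacing each missing entry with a column average.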

Author Biography

Andrii Popov, Educational and Research Institute for Applied System Analysis of the National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", Kyiv

Ph.D. student at the Department of Mathematical Methods of System Analysis of the Educational and Research Institute for Applied System Analysis of the National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", Kyiv, Ukraine.

Published

2025-03-28

Section

Mathematical methods, models, problems and technologies for complex systems research