Algorithms of statistical anomalies clearing for data science applications

DOI:

https://doi.org/10.20535/SRIT.2308-8893.2023.1.06

Keywords:

anomaly removal, anomaly detection, noise removal, statistical techniques, data analysis, big data, data cleaning

Abstract

The paper considers the nature of the input data used by Data Science algorithms in modern application domains. It then proposes three algorithms designed to remove statistical anomalies from datasets as part of the Data Science pipeline. The main advantages of the proposed algorithms are their relative simplicity and small number of configurable parameters, which are determined by machine learning with respect to the properties of the input data. The algorithms are flexible and do not depend strictly on the nature or origin of the data. The efficiency of the proposed approaches is verified in a modeling experiment using Python implementations of the algorithms. The results are illustrated with plots of the raw and processed datasets. The application of the algorithms is analyzed, and their results are compared.
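The abstract does not describe the three proposed algorithms, so the sketch below is not a reconstruction of them. It only illustrates the general idea of removing statistical anomalies with a small number of configurable parameters, using a well-known Hampel-style sliding-window filter. The function name `hampel_clean` and the parameters `half_window` and `n_sigmas` are assumptions made for this example, and here they are set by hand rather than learned from the data.

```python
import numpy as np


def hampel_clean(values, half_window=5, n_sigmas=3.0):
    """Replace points that deviate strongly from the local median.

    Illustrative Hampel-style filter; NOT one of the paper's algorithms.
    Returns the cleaned series and a boolean mask of flagged points.
    """
    x = np.asarray(values, dtype=float)
    cleaned = x.copy()
    flagged = np.zeros(x.size, dtype=bool)
    # Scales the median absolute deviation (MAD) so that it is comparable
    # to a standard deviation for normally distributed data.
    scale = 1.4826
    for i in range(x.size):
        lo = max(0, i - half_window)
        hi = min(x.size, i + half_window + 1)
        window = x[lo:hi]
        median = np.median(window)
        mad = scale * np.median(np.abs(window - median))
        if abs(x[i] - median) > n_sigmas * mad:
            cleaned[i] = median  # replace the anomaly with the local median
            flagged[i] = True
    return cleaned, flagged


# Synthetic example (assumed data): a sine wave with two injected spikes.
t = np.linspace(0.0, 2.0 * np.pi, 100)
clean_signal = np.sin(t)
signal = clean_signal.copy()
signal[20] += 5.0
signal[70] -= 5.0

cleaned, flagged = hampel_clean(signal, half_window=5, n_sigmas=3.0)
```

The median and the MAD are used instead of the mean and the standard deviation because they are barely affected by the anomalies being searched for. A wider window tolerates longer bursts of bad samples but follows fast changes in the data less closely.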

Author Biographies

Oleksii Pysarchuk, National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", Kyiv

Doctor of Technical Sciences, professor at the Department of Computer Engineering of the Faculty of Informatics and Computer Engineering of the National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", Kyiv, Ukraine.

Areas of interest: system analysis, modeling, optimization, information technologies, information security.

Danylo Baran, Codeimpact B.V., Kyiv

Senior software engineer at Codeimpact B.V., Kyiv, Ukraine.

Areas of interest: statistical analysis, big data, machine learning, information security.

Yurii Mironov, National Aviation University, Kyiv

Ph.D. student at the Software Engineering Department of the Faculty of Cybersecurity, Computer and Software Engineering of the National Aviation University, Kyiv, Ukraine.

Areas of interest: computer vision, software modeling, multi-criteria decision making systems, decision support systems.

Illya Pysarchuk, National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", Kyiv

Student at the Department of Computer Engineering of the Faculty of Informatics and Computer Engineering of the National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", Kyiv, Ukraine.

Areas of interest: system analysis.

Published

2023-03-30

Section

Problem- and function-oriented computer systems and networks