Design and evaluation of a quality assurance method for commit messages in version control systems using Word2Vec, FastText, and GloVe embeddings
DOI:
https://doi.org/10.20535/SRIT.2308-8893.2026.2.01
Keywords:
AdamW algorithm, commit message, GitHub REST API, GRU layer, MLP, RNN, source code, software, change description, repository, harmonic mean, version control system
Abstract
This paper motivates the problem of ensuring the quality of commit messages, that is, the descriptions of changes to source code files in version control systems. Commit messages are filtered with machine learning methods, including neural networks of several architectures. Neural networks are used because the task requires identifying descriptions that accurately reflect the intent of the changes. Three word embedding methods (Word2Vec, FastText, and GloVe) are compared, together with their use in binary classifiers (MLP and RNN) that filter change descriptions. The models were trained on a dataset of change descriptions collected via the GitHub REST API, and their performance was evaluated with the Accuracy and F1-score metrics. The study also confirmed that the Google Colab environment is effective for prototyping machine learning models.
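As an illustration of the two evaluation metrics named above, the following minimal sketch computes Accuracy and the F1-score (the harmonic mean of precision and recall) for a binary classifier. The function name `accuracy_and_f1`, the label convention (1 for an acceptable commit message, 0 for a rejected one), and the sample labels are assumptions made for this example only; they are not taken from the paper and do not reproduce its results.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary labels, with 1 as the positive class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    # Count the four cells of the confusion matrix.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)

    # Precision and recall fall back to 0.0 when their denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


# Hypothetical labels: 1 = acceptable commit message, 0 = rejected.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, f1={f1:.2f}")
```

Reporting F1 alongside Accuracy is useful when the two classes are imbalanced, since Accuracy alone can look high for a classifier that rarely predicts the minority class.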