Topic modeling of researchers based on their interests from Google Scholar




topic modeling, categorization, Google Scholar, Dimensions, ANZSRC, researcher’s profile, research interests, Czekanowski metric, Jaccard index


The article proposes an algorithm for topic modeling of researchers based on the interests listed in their Google Scholar profiles. The algorithm uses the set of fields of research from the ANZSRC research classification system. The information resource for topic modeling is the corpus of categorized publications in Dimensions. Interests from a researcher’s profile are submitted as search queries to Dimensions, which returns distributions of documents over categories. To reduce information noise, these distributions are passed through several processing stages. The article also compares the results of topic modeling based on interests from Google Scholar profiles with those based on a categorized list of publications from Dimensions. The comparison uses a modified Czekanowski metric that takes the similarity between categories into account. The topic modeling outputs obtained from the two information sources show a good match.
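The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the averaging step, and the `threshold` value are assumptions made for illustration, since the abstract does not specify the noise-reduction stages. The Czekanowski metric is shown in its standard, unmodified form, because the category-similarity modification is not described here. The Jaccard index is included only because it appears in the keywords.

```python
from collections import defaultdict


def aggregate_interests(distributions, threshold=0.05):
    """Combine per-interest category counts into one researcher profile.

    `distributions` holds one dict per interest (category -> document count).
    Assumed noise reduction: normalize each distribution, average them,
    drop categories whose share is below `threshold`, then renormalize.
    """
    totals = defaultdict(float)
    used = 0
    for dist in distributions:
        size = sum(dist.values())
        if size == 0:
            continue  # a query that returned nothing carries no signal
        used += 1
        for category, count in dist.items():
            totals[category] += count / size
    if used == 0:
        return {}
    kept = {c: v / used for c, v in totals.items() if v / used >= threshold}
    mass = sum(kept.values())
    return {c: v / mass for c, v in kept.items()} if mass else {}


def czekanowski(p, q):
    """Standard Czekanowski similarity: 2 * sum(min(p_i, q_i)) / (sum(p) + sum(q))."""
    categories = set(p) | set(q)
    overlap = sum(min(p.get(c, 0.0), q.get(c, 0.0)) for c in categories)
    denominator = sum(p.values()) + sum(q.values())
    return 2.0 * overlap / denominator if denominator else 0.0


def jaccard(a, b):
    """Jaccard index of two sets: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Hypothetical category labels and counts, not data from the article.
    per_interest = [{"A": 80, "B": 20}, {"A": 50, "C": 50}]
    from_interests = aggregate_interests(per_interest, threshold=0.15)
    from_publications = {"A": 0.7, "C": 0.3}
    print(from_interests)
    print(czekanowski(from_interests, from_publications))
```

Normalizing each interest before averaging keeps a broad interest with many matching documents from dominating narrower ones. Other weighting schemes would be equally plausible under the abstract's description.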

Author Biographies

Serhiy Shtovba, Vasyl Stus Donetsk National University, Vinnytsia

Serhiy D. Shtovba, Doctor of Technical Sciences, professor at the Department of Information Technology of Vasyl Stus Donetsk National University, Vinnytsia, Ukraine.

Mykola Petrychko, Vinnytsia National Technical University, Vinnytsia

Mykola V. Petrychko, Ph.D. student at the Department of Computer Control Systems of Vinnytsia National Technical University, Vinnytsia, Ukraine.







Problem- and function-oriented computer systems and networks