The model of derived uncorrelated semantic fields for text data analysis
Abstract
This paper considers a model of derived uncorrelated semantic fields, obtained by applying principal component analysis and singular value decomposition to the matrix of semantic field frequencies. The model defines a new semantic space with an orthonormal basis in which text documents are represented. Because interrelated components are replaced by uncorrelated semantic characteristics, the dimension of the space of derived semantic fields is significantly lower than that of the space of initial semantic fields. Analysis of a test sample of text documents showed that it is possible to retain only those components of the derived semantic fields that correspond to the first singular values. Such a low-dimensional orthonormal basis of derived semantic fields can be effective in text classification and clustering problems.
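The construction summarized above can be illustrated with a minimal sketch. It is not the paper's implementation: the matrix layout (rows are semantic fields, columns are documents), the function name derived_semantic_fields, the number of retained components k, and the synthetic toy data are all assumptions made for illustration. The sketch centers the frequency matrix, takes its singular value decomposition, and projects the documents onto the first k left singular vectors, which form an orthonormal basis of derived fields.

```python
import numpy as np


def derived_semantic_fields(freq, k):
    """Project documents onto the first k derived semantic fields.

    freq: array of shape (n_fields, n_docs); freq[i, j] is the frequency
          of semantic field i in document j (layout is an assumption).
    k:    number of derived fields to keep (hypothetical parameter).
    """
    freq = np.asarray(freq, dtype=float)
    # Center each semantic field across documents (the PCA step).
    centered = freq - freq.mean(axis=1, keepdims=True)
    # Thin SVD: centered = U @ diag(s) @ Vt, with orthonormal columns in U
    # and singular values s sorted in descending order.
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    basis = U[:, :k]                 # orthonormal basis, shape (n_fields, k)
    coords = basis.T @ centered      # document coordinates, shape (k, n_docs)
    explained = s**2 / np.sum(s**2)  # share of variance per component
    return basis, coords, s, explained


# Synthetic toy data (not from the paper): 6 correlated semantic fields
# driven by 2 latent factors, observed over 20 documents.
rng = np.random.default_rng(0)
latent = rng.random((2, 20))
mixing = rng.random((6, 2))
freq = mixing @ latent + 0.01 * rng.random((6, 20))

basis, coords, s, explained = derived_semantic_fields(freq, k=2)
print("singular values:", np.round(s, 4))
print("variance share of first 2 components:", explained[:2].sum())
```

In this sketch the rows of coords are mutually uncorrelated across documents, since coords @ coords.T is diagonal with the squared singular values on the diagonal. The coordinates could then be passed to any classification or clustering routine in place of the original field frequencies.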