Today I submitted a project proposal to the Swedish Research Council (Vetenskapsrådet) with the title “VESPTEC – Vector space representations of textual content”. Collaborating with me on this proposal are Magnus Rosell and Viggo Kann at KTH CSC as well as Jussi Karlgren at SICS and Hercules Dalianis here at DSV.
Since the 1960s, vector space models have been used extensively to represent semantics, especially in information-retrieval systems such as Google. These vector spaces are usually high-dimensional, and terms and documents are represented by very large matrices. Little regard is paid to context: how a term occurs in a document is almost completely disregarded, so texts are viewed as mere bags of words. Much of the research so far has focused either on applying these representations to specific tasks, or on making that application more efficient by reducing the dimensionality of the original space in some way. This project proposes to study vector space representations of textual content in a more systematic manner.
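To make the bag-of-words idea concrete, here is a minimal sketch of a term-document representation with cosine similarity. The corpus, vocabulary construction, and helper names are invented for illustration; real IR systems add weighting schemes (e.g. tf-idf) and sparse storage on top of this basic picture.

```python
import math
from collections import Counter

# A toy corpus; any word-order information is discarded below,
# which is exactly the bag-of-words limitation described above.
docs = [
    "vector space models represent documents",
    "documents are bags of words in vector space",
    "search engines rank documents by similarity",
]

# Build a vocabulary and one raw term-frequency vector per document.
vocab = sorted({w for d in docs for w in d.split()})

def vectorize(text):
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

vectors = [vectorize(d) for d in docs]

def cosine(u, v):
    # The spatial metaphor: documents are "alike" when the angle
    # between their vectors is small.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

sim_01 = cosine(vectors[0], vectors[1])  # share several terms
sim_02 = cosine(vectors[0], vectors[2])  # share only "documents"
print(sim_01 > sim_02)  # prints True
```

Documents sharing more terms score higher regardless of how those terms are used, which is why richer representations than plain co-occurrence counts are worth investigating.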
We have identified two main tasks. One is to explore the notion of intrinsic dimensionality and the spatial metaphor often used to describe “likeness” between documents. The other, and perhaps more intriguing, task is to move from a bag-of-words representation to a more informed document space, modeling more than just the co-occurrence of lexical items within documents. These models will be systematically validated on a diverse array of text-processing tasks and well-established test sets with built-in success criteria. A better representation of textual content is interesting in itself, but it will also lead to better underlying models that improve applications such as search engines and text summarization.