Project proposal: Vector space representations of textual content

Today I handed in a project proposal to the Swedish Research Council (Vetenskapsrådet). Collaborating with me on this proposal are Magnus Rosell and Viggo Kann at KTH CSC as well as Magnus Sahlgren, Jussi Karlgren and Oscar Täckström at SICS. The title of the project is Vector space representations of textual content.

Abstract: Since the 1960s, vector space models have been used extensively to represent semantics, especially in information-retrieval systems such as Google. These vector spaces are usually high-dimensional, and the terms and documents are represented by very large matrices. Context receives little attention: for instance, how a term occurs in other documents is almost completely disregarded. Texts are thus viewed as mere bags of words.
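To make the bag-of-words view concrete, here is a minimal sketch in Python. It is not part of the proposal; the example sentences and the helper names bag_of_words and cosine are made up for illustration, and cosine similarity is assumed here as the measure of "likeness". It builds a small term-document matrix and shows that two sentences with opposite meanings become identical once word order is discarded.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a text to its term counts, discarding word order."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy documents, invented for this illustration.
docs = [
    "the dog bit the man",
    "the man bit the dog",
    "vector spaces represent documents",
]

vectors = [bag_of_words(d) for d in docs]
vocabulary = sorted(set().union(*vectors))

# Term-document matrix: one row per term, one column per document.
matrix = [[v[t] for v in vectors] for t in vocabulary]

for term, row in zip(vocabulary, matrix):
    print(f"{term:<10} {row}")

print(cosine(vectors[0], vectors[1]))  # same words, different meaning
print(cosine(vectors[0], vectors[2]))  # no words in common
```

The first two documents get the same vector even though they describe different events, which is the kind of lost context the abstract refers to. With a realistic vocabulary the matrix has one row for every distinct term, which is why it becomes very large and very sparse.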

Much of the research so far has focused either on applying these representations to specific tasks, or on making that application more efficient by, for example, reducing the dimensionality of the original space in some way. This project proposes to study vector space representations of textual content in a more systematic manner.

We have identified two main tasks. One is to explore the notion of intrinsic dimensionality and the spatial metaphor often used in describing “likeness” between terms and documents. The other is to move from a bag-of-words representation to a more informed document space, modeling more than just the co-occurrence of lexical items within documents. These models will be systematically validated on a diverse array of tasks and well-established test sets with built-in success criteria.

A better representation of textual content is interesting in itself, but it will also lead to better underlying models that improve useful applications such as search engines and text summarization.