I will defend my licentiate thesis “Swedish Health Data – Information
Access and Representation” on October 6th, at the Department of Computer and Systems Sciences, Stockholm University.
Swedish Health Data – Information Access and Representation
Time: Tuesday, October 6th 2009 at 10.00
Room: Sal C, floor 4, Department of Computer and Systems Sciences
Respondent: Sumithra Velupillai
Opponent: Dr. Dimitrios Kokkinakis, Gothenburg University
Examiner: Prof. Louise Yngström, DSV, Stockholm University
Supervisors: Assoc. Prof. Hercules Dalianis and Dr. Martin Hassel, DSV, Stockholm University
Health-related research is an interdisciplinary, broad and growing research area. With the growth of digitalised systems that simplify and make work processes more efficient in many companies and organisations, the amount of available data is now immense. The information contained in health-related digital data sets could be used for further research and also, in the long run, for improving health care, health care processes, and public health.
Much of the information contained in these data sets is in unstructured free text. Health-related texts can comprise various types of text, such as scientific articles, questionnaire answers, (electronic) health records, information on web sites, and e-mail. What these texts all have in common is, above all, the use of a domain-specific vocabulary.
Information access methods applied to textual data require a language model. Many human language technology tools have been developed in order to improve and simplify representation models, primarily for English, and predominantly for general language use. For Swedish, several human language technology tools have been developed. How these tools work on domain-specific data, such as health data, is still a relatively uncharted research area. We have investigated what properties the language use in Swedish electronic health records has compared to a large, general-purpose Swedish corpus, in order to identify if and where adaptation is necessary. We have also created a representation model based on phrases instead of words for Swedish scientific medical text.
Health-related texts also contain a potentially large amount of previously unknown information, which could be valuable to exploit in further research. We have developed an iterative and interactive method for exploring large text sets, based on document clustering, where both structured and unstructured information is used for generating hypotheses from epidemiological questionnaire data and electronic health records.
One of the most important factors that influence the possibilities of performing research on health-related data sets is availability. Although digital information is easy to store and obtain automatically, this type of data often contains sensitive and private information that makes it impossible to distribute for further research, unless identifiable information is deleted or replaced. We have initiated work on automatic de-identification for Swedish and created a manually annotated gold standard, which could be used both for evaluating de-identification systems and for training new systems.