New Project Proposal to Vinnova – IMAIL-Intelligent e-mail answering service for eGovernment
We, Martin Hassel, Eriks Sneiders, Tessy Ceratto, Ola Knutsson (CSC), Viggo Kann (CSC) and Magnus Rosell (CSC), are preparing an application to Vinnova (deadline September 2, 2008). Title: IMAIL-Intelligent mail answering service for eGovernment. Other partners: Försäkringskassan (Swedish Social Insurance Agency) and Euroling AB.
Abstract
The project vision is to design and develop eGovernment services that facilitate efficient communication between government agencies and citizens and companies, which will lead to a transformed and improved government.
The overall goal of the demonstrator is to show how further development of today's tools and technologies can improve communication between large organizations and people. The demonstrator will run at Försäkringskassan and help automate communication between such organizations and the public by processing text-based inquiries, primarily e-mail queries.
Our tools and technologies will
1. automate answering of a large part of the incoming e-mail flow,
2. improve the timeliness of answers to inquiries submitted through electronic devices.
Two-year project, 4.6 million SEK
Talk: “A new lemmatizer that handles morphological changes in pre- in- and suffixes alike” by Bart Jongejan
Talk by Bart Jongejan, CST, University of Copenhagen, Tuesday, May 6, 2008, 13.00-14.45, meeting room 7501, Forum, DSV, Kista.
In some Indo-European languages like English and the North Germanic languages, most words can be lemmatized by removing or replacing a suffix. In languages like German and Dutch, on the other hand, lemmatization often proceeds regularly by removing, adding or replacing other types of affixes and even by combinations of such string operations.
The rules for the new lemmatizer are created by automatic training on a large sample set of full form – lemma pairs. An attempt was made to allow a rule-based attribution of a word to more than one lemma (when appropriate), but this had to be given up. The current implementation produces one lemma per word when the lemmatization rules are applied and relies on an optional built-in dictionary to produce additional correct lemmas of known words only.
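The idea of training rules on full form – lemma pairs can be illustrated with a small Python sketch. It is a deliberately simplified assumption, not the algorithm presented in the talk: each pair is split around its longest shared substring, the leftover prefix and suffix changes become one rewrite rule, and the most specific matching rule is applied to unseen words. The function names and the toy training pairs are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def derive_rule(full_form, lemma):
    """Split both strings around their longest shared substring (the 'stem')."""
    match = SequenceMatcher(None, full_form, lemma, autojunk=False).find_longest_match(
        0, len(full_form), 0, len(lemma)
    )
    prefix_from, suffix_from = full_form[:match.a], full_form[match.a + match.size:]
    prefix_to, suffix_to = lemma[:match.b], lemma[match.b + match.size:]
    return prefix_from, prefix_to, suffix_from, suffix_to


def train(pairs):
    """Count how often each affix-rewrite rule is seen in the training pairs."""
    return Counter(derive_rule(full_form, lemma) for full_form, lemma in pairs)


def lemmatize(word, rules):
    """Apply the most specific matching rule; return the word unchanged if none fits."""
    best = None
    for (pf, pt, sf, st), count in rules.items():
        if len(word) > len(pf) + len(sf) and word.startswith(pf) and word.endswith(sf):
            # Prefer longer affix context, then the rule seen more often in training.
            key = (len(pf) + len(sf), count)
            if best is None or key > best[0]:
                best = (key, pt + word[len(pf):len(word) - len(sf)] + st)
    return word if best is None else best[1]


# Toy training data (illustrative only): German ge-...-t participles, English -ed forms.
rules = train([("gesagt", "sagen"), ("gemacht", "machen"),
               ("walked", "walk"), ("talked", "talk")])
print(lemmatize("gekauft", rules))  # kaufen
print(lemmatize("jumped", rules))   # jump
```

Like the implementation described above, this sketch returns exactly one lemma per word. It has no dictionary fallback and no handling of infix changes.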
The first results of tests show that the new lemmatizer probably has a higher accuracy than the former CSTlemma software, even with languages that have mainly suffix morphology, but that the errors it makes sometimes may be “more wrong” than the errors made by the old CSTlemma software.
AVID-Deidentifying Swedish Medical Records for Better Health Care submitted April 15, 2008 to Vetenskapsrådet
Hospital care has seen an explosion in the production of medical record data. A large share of this data is unstructured free text that is almost never reused. Our research group will soon have access to more than one million medical records from the Stockholm County Council; we already have access to 5 000 medical records within rheumatology. Unfortunately, the free text of medical records very often contains misspellings, syntactic errors and plenty of unknown abbreviations, and is therefore difficult for computers to process. To use the free-text corpus for research purposes it is also necessary to deidentify the texts, since they typically contain information that can identify the individual patient. In this project we will therefore normalise and deidentify the medical records, and we expect to reach 99 percent deidentification. Once this is done, we and the research community will be able to use human language technology tools such as text mining and text extraction methods to find previously uncharted relations between diseases, medical treatment, age, occupation, social situation, etc. One primary goal of this project is thus to make it possible for researchers in medicine to use the abundant digital textual information available in medical records. Such research has never previously been carried out in Sweden, and it is unique in the kind and amount of textual data used.
Popular scientific description (originally in Swedish):
Deidentification of patient records for better health care
Within health care, physicians and nurses produce a very large number of digital patient records. The records contain information about the patient's general condition, symptoms, diagnosis and treatment. Together, these patient records contain valuable information, particularly in the free-text parts, that is not used at all in medical research. We have previously carried out experiments on 5 000 deidentified patient records in rheumatology and found two problems:
One problem is that the records, although they have been deidentified so that they can be used in research, still contain information that can make patients identifiable, since they refer to, among other things, the patient's occupation (CEO position at Alfa Laval), or to family members and telephone numbers (the patient's husband Bengt-Åke can be reached on telephone number 08-123 4567). The other problem is that the record texts contain many misspellings and grammatical errors as well as ambiguous abbreviations, which make them difficult for computer programs to process.
In this research project we therefore intend both to correct the misspellings in these patient records, giving terms a uniform spelling, and to deidentify the text. Both spelling correction and deidentification will be carried out with fully automatic language technology methods. We will start from just over one million patient records that we will soon get access to through Stockholms läns landsting (Stockholm County Council).
These patient record texts are the material on which we will train our systems so that they learn to recognise new terms. The automatic methods for named entity recognition, and thereby deidentification, can be built with either rule-based or statistical methods. With these methods, person names, occupations, locations, organisations, etc. can then be recognised automatically. When this is done we will have a large number of patient records with perhaps up to 99 percent fully deidentified content, which enables research on unique material. We hope to make our cleaned patient record corpus and the language technology tools we develop available to Svensk Nationell Datatjänst (SND) for wider dissemination.
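A rule-based approach of the kind mentioned here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's system: the phone-number pattern only covers numbers written like the example above, the name list stands in for a real gazetteer or trained recogniser, and the placeholder tags `[PHONE]` and `[NAME]` are invented for the example.

```python
import re

# Assumption: phone numbers look like "08-123 4567" (area code, hyphen, seven digits).
PHONE = re.compile(r"\b0\d{1,3}-\d{3} ?\d{2} ?\d{2}\b")


def deidentify(text, known_names):
    """Replace phone numbers and listed person names with placeholder tags."""
    text = PHONE.sub("[PHONE]", text)
    if known_names:
        # Longest names first so a longer name is not partially replaced by a shorter one.
        names = sorted(known_names, key=len, reverse=True)
        pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, names)) + r")\b")
        text = pattern.sub("[NAME]", text)
    return text


sample = "Patientens man Bengt-Åke nås på telefonnummer 08-123 4567."
print(deidentify(sample, {"Bengt-Åke"}))
# Patientens man [NAME] nås på telefonnummer [PHONE].
```

A statistical recogniser would replace the fixed name list with a model trained on annotated records; the replacement step would stay the same.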
The automatic spelling correction system is based on rules for how misspelled words in a text can be corrected. The system uses both lexicons and abbreviation lists and will correct the misspelled words in the patient records, but we will also use special medical word lists such as FASS lists of drug names. The patient record texts, with over a million records, can also be used to produce new domain-specific word lists; one can then let the most common spellings of a word "win over" the rarer spellings.
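Letting the most common spelling win can be sketched as follows in Python. This is an assumption about one possible realisation, not the project's spelling system: spellings are visited from most to least frequent, and each rarer spelling is mapped to the first more frequent one that is similar enough according to `difflib`. The threshold `min_ratio` and the toy tokens are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def build_normaliser(tokens, min_ratio=0.85):
    """Map every spelling to the most frequent sufficiently similar spelling."""
    counts = Counter(tokens)
    accepted = []    # canonical spellings, most frequent first
    canonical = {}
    for word, _ in counts.most_common():
        target = next(
            (c for c in accepted if SequenceMatcher(None, word, c).ratio() >= min_ratio),
            None,
        )
        if target is None:
            # No similar, more frequent spelling exists: this word becomes canonical.
            accepted.append(word)
            target = word
        canonical[word] = target
    return canonical


# Toy token list (illustrative only).
tokens = ["reumatoid"] * 5 + ["reumatiod"] + ["artrit"] * 3 + ["artritt"]
mapping = build_normaliser(tokens)
print(mapping["reumatiod"])  # reumatoid
print(mapping["artritt"])    # artrit
```

On a real corpus the pairwise comparison would need indexing to scale, and lexicons such as drug-name lists could be used to protect rare but correct terms from being overwritten.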
The research that can be done on these patient records includes traditional search, both within a single individual's collected record text and across several individuals. Most important of all, there will be a large body of material that gathers valuable information about a large number of patients and that can be used to extract new information and knowledge.
The project has two goals: to create a large deidentified patient record corpus in Swedish for research purposes, and to give the research community access to the language technology tools developed in the project for deidentification and for work with similar text collections. This will make it easy in the future to create new deidentified text collections and to work with large, information-dense text collections.
Our project is unique in that it is the first time anyone will carry out deidentification and cleaning of just over one million patient record texts (in Swedish). Earlier work has mostly involved at most a few thousand patient records in English. This research is highly relevant because it will help health care make use of all the accumulated knowledge written in free text, together with "harder" measurement values, and thereby find new knowledge for better health care.
DSV to China in Roadshow to recruit master students 18-29 Oct 2007
DSV is represented by Hercules Dalianis in the Road Show delegation of 60 professors from Stockholm University and 14 other Swedish universities, travelling to Beijing and Shanghai in China to recruit master's students and to present Swedish research in many areas.
We have visited education fairs, the Chinese Academy of Social Sciences, the Ministry of Education, Peking University, Renmin University and Beihang University, all in Beijing. The students are enthusiastic and eager to start master's studies and even PhD studies. We have now arrived in Shanghai and will continue with meetings at Jiao Tong University, Fudan University and Tongji University.
Read more in Swedish: RoadshowKina18-29okt2007_Hercules.pdf
Nodalida May 25-26, 2007, Tartu, Estonia.
Konstantinos Charitakis presented his EMIS master thesis in the form of a scientific paper with the title “Using Parallel Corpora to Create a Greek-English Dictionary with UPLUG” at Nodalida 2007, the 16th Nordic Conference of Computational Linguistics, at the University of Tartu, Estonia. Sumithra and Hercules joined him. Martin Hassel, CSC-KTH, who will soon join our department, also presented two papers: “Widening the HolSum Search Scope” (co-author Jonas Sjöbergh) and “Linguistically Fuelled Text Similarity” (co-author Björn Andrist). See also the photos.
Paper by EMIS-student accepted to Nodalida 2007, Tartu, Estonia!
Konstantinos Charitakis, who has just finished his EMIS master's degree, rewrote his master thesis as a paper with the title “Using Parallel Corpora to Create a Greek-English Dictionary with UPLUG”, submitted it to Nodalida 2007 and got it accepted. We congratulate Konstantinos!
Abstract: This paper presents the construction of a Greek-English bilingual dictionary from parallel corpora that were created manually from documents collected from the Internet. The parallel corpora were processed with the Uplug word alignment system without the use of language-specific information. A sample was extracted from the population of suggested translations and included in questionnaires that were sent out to Greek-English speakers, who evaluated the sample based on the quality of the translation pairs. For the suggested translation pairs in the stratum with the higher frequency of occurrence, 67.11% were judged correct. With 50.63% correct translations in the sample overall, the results were promising considering the minimal optimisation of the corpus and the differences between the two languages.
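To give a feel for how translation pairs can be suggested from sentence-aligned text without language-specific information, here is a small Python sketch based on the Dice coefficient over word co-occurrence. It is a stand-in technique chosen for illustration and does not reproduce Uplug's alignment methods; the function names and the three-sentence toy corpus are hypothetical.

```python
from collections import Counter
from itertools import product


def dice_scores(sentence_pairs):
    """Score word pairs by 2 * co-occurrences / (source count + target count)."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        # Count each word once per sentence.
        src_words, tgt_words = set(src.lower().split()), set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }


def best_translation(scores, source_word):
    """Return the target word with the highest Dice score, or None if unseen."""
    candidates = {t: v for (s, t), v in scores.items() if s == source_word}
    return max(candidates, key=candidates.get) if candidates else None


# Toy sentence-aligned corpus (illustrative only).
corpus = [("το σπίτι", "the house"), ("το βιβλίο", "the book"), ("ένα σπίτι", "a house")]
scores = dice_scores(corpus)
print(best_translation(scores, "σπίτι"))   # house
print(best_translation(scores, "βιβλίο"))  # book
```

Suggested pairs from such a procedure would still need human judgement, as in the questionnaire-based evaluation described in the abstract.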
Structuring unstructured data using automatic text processing
Digital information is becoming more and more abundant in all areas; for example, news is available today in many different languages from various sources. Another example is business systems, which are now consolidated and integrated and therefore produce a lot of unstructured data, such as medical patient records. How can we process this information so that we can easily navigate through it? Can we present it in a way that is easily understandable? Can we also extract gold nuggets of information that are not visible to the naked eye?
These are some of the research questions we address in our research in human language technology, using automatic text summarization, named entity recognition, cross-language information retrieval, text clustering, and text and data mining.
Please do not hesitate to contact Dr. Hercules Dalianis if you have any questions about this area.
Try our Automatic text summarizer, SweSum.
Read more about our research project Knowledge Extraction Agent.