Eva Söderström at the University of Skövde and I are the editors of a new book, Information Systems Engineering: From Data Analysis to Process Networks, published by IGI Global. The book presents current research on existing and emerging trends in conceptual modeling and information systems engineering. It bridges the gap between research and practice by providing a much-needed reference point on the design of software systems that evolve seamlessly to adapt to rapidly changing business and organizational practices. The chapters, written by acknowledged experts in the field, cover issues ranging from the analysis of data models, through methods for participative modeling, to the design of process and value networks. The book is dedicated to Benkt Wangler, on his retirement from the University of Skövde, who has been a leading figure in the information systems area in Sweden for decades.
Paper accepted to UCCTS 2008, Hangzhou, China
The paper Improving Dictionary Construction by Automatic Identification of Parallel Text Pairs, which I have written together with Martin Hassel and Hercules Dalianis, has been accepted to the international symposium on Using Corpora in Contrastive and Translation Studies (UCCTS), which will be held in Hangzhou, China, 25th–27th September 2008. The experiments described in the paper have been part of the TvärSök project.
Abstract:
When creating dictionaries for use in e.g. cross-language search engines, parallel or comparable text pairs are needed. For instance, Hallå Norden, a website containing information regarding mobility between the Nordic countries, has information in five languages: Swedish, Danish, Norwegian, Icelandic and Finnish. Working with these texts we discovered two main problems: the parallel corpus was very sparse, containing on average less than 80 000 words per language pair, and it was difficult to automatically detect parallel text pairs. Creating dictionaries with the word aligner Uplug gave on average 213 new dictionary entries. Combinations with Finnish, which belongs to a different language family, had a higher error rate, 33%, whereas combinations of the Scandinavian languages yielded on average only 9% errors. Despite the corpus sparseness, the results were surprisingly good compared to other experiments with larger corpora.
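As a rough illustration of how dictionary entries can be extracted from sentence-aligned text, the sketch below scores word pairs with the Dice coefficient on sentence-level co-occurrence. It is a naive stand-in, not the actual Uplug alignment, and the sample sentence pairs and threshold are invented for the example.

```python
from collections import defaultdict

def dice_dictionary(sentence_pairs, threshold=0.7):
    """Extract candidate dictionary entries from sentence-aligned text
    by scoring word pairs with the Dice coefficient on co-occurrence."""
    src_count = defaultdict(int)   # sentences containing each source word
    trg_count = defaultdict(int)   # sentences containing each target word
    pair_count = defaultdict(int)  # aligned sentence pairs containing both words

    for src_sent, trg_sent in sentence_pairs:
        src_words = set(src_sent.lower().split())
        trg_words = set(trg_sent.lower().split())
        for s in src_words:
            src_count[s] += 1
        for t in trg_words:
            trg_count[t] += 1
        for s in src_words:
            for t in trg_words:
                pair_count[(s, t)] += 1

    entries = []
    for (s, t), c in pair_count.items():
        dice = 2 * c / (src_count[s] + trg_count[t])
        if dice >= threshold:
            entries.append((s, t, dice))
    # On a sparse corpus many spurious pairs also score high,
    # which is exactly the sparseness problem described above.
    return sorted(entries, key=lambda e: -e[2])

# Toy Swedish-Danish example (invented data):
pairs = [("hunden sover", "hunden sover"),
         ("katten sover", "katten sover"),
         ("hunden springer", "hunden løber")]
for src, trg, score in dice_dictionary(pairs):
    print(f"{src} -> {trg}  ({score:.2f})")
```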
Following this work, we made two sets of experiments on automatic identification of parallel text pairs. The first experiment utilized the frequency distribution of word-initial letters to map a text in one language to the corresponding text in another language in the JRC-Acquis corpus (European Council legal texts). Using English and Swedish as the language pair, and running a ten-fold random pairing, the algorithm made 87% correct matches (random baseline 50%). Attempting to identify the correct text among nine randomly chosen false matches and one true match yielded a success rate of 68%. In the second experiment, features such as word, sentence and paragraph frequencies were extracted from a subset of the JRC-Acquis corpus and used with memory-based learning on Swedish-Danish, Swedish-Finnish and Finnish-Danish, respectively, achieving a pair-wise success rate of 93%. We believe such methods will improve automatic bilingual dictionary construction from unstructured corpora, and our experiments will be developed and evaluated further.
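A simple sketch of the first idea, purely illustrative and not the implementation evaluated in the paper, is to represent each text by the relative frequencies of its word-initial letters and to pair texts across languages by the smallest distance between these profiles:

```python
from collections import Counter

def initial_letter_profile(text):
    """Relative frequency distribution of word-initial letters."""
    initials = [w[0].lower() for w in text.split() if w[0].isalpha()]
    total = len(initials) or 1
    return {letter: n / total for letter, n in Counter(initials).items()}

def profile_distance(p, q):
    """L1 distance between two letter-frequency profiles."""
    letters = set(p) | set(q)
    return sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in letters)

def best_match(source_text, candidate_texts):
    """Pick the candidate whose initial-letter profile is closest to the source's."""
    src = initial_letter_profile(source_text)
    distances = [(profile_distance(src, initial_letter_profile(c)), i)
                 for i, c in enumerate(candidate_texts)]
    return min(distances)[1]  # index of the closest candidate
```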
The full paper will be completed during this summer.
Talk: “A new lemmatizer that handles morphological changes in pre-, in- and suffixes alike” by Bart Jongejan
Talk by Bart Jongejan, CST, University of Copenhagen, Tuesday, May 6, 2008, 13.00–14.45, meeting room 7501, Forum, DSV, Kista.
In some Indo-European languages, such as English and the North Germanic languages, most words can be lemmatized by removing or replacing a suffix. In languages like German and Dutch, on the other hand, lemmatization often proceeds by removing, adding or replacing other types of affixes, and even by combinations of such string operations.
The rules for the new lemmatizer are created by automatic training on a large sample set of full form–lemma pairs. An attempt was made to allow rule-based attribution of a word to more than one lemma (when appropriate), but this had to be abandoned. The current implementation produces one lemma per word when the lemmatization rules are applied, and relies on an optional built-in dictionary to produce additional correct lemmas for known words only.
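To make the idea concrete, here is a toy Python sketch of affix-replacement rules of this kind. The rules and the example word are invented for illustration and are not the trained CSTlemma rule set.

```python
# Toy affix-replacement lemmatization rules (invented, not the trained rules):
# each rule is (pattern, replacement, position).
RULES = [
    ("ge", "",   "prefix"),   # e.g. strip German participle prefix: "gekauft" -> "kauft"
    ("ö",  "o",  "infix"),    # hypothetical vowel change inside the stem
    ("te", "en", "suffix"),   # hypothetical: "kaufte" -> "kaufen"
]

def lemmatize(word):
    """Apply the first matching rule for each affix position; return one lemma."""
    lemma = word
    for pattern, replacement, position in RULES:
        if position == "prefix" and lemma.startswith(pattern):
            lemma = replacement + lemma[len(pattern):]
        elif position == "suffix" and lemma.endswith(pattern):
            lemma = lemma[:-len(pattern)] + replacement
        elif position == "infix" and pattern in lemma[1:-1]:
            lemma = lemma.replace(pattern, replacement, 1)
    return lemma

print(lemmatize("gekaufte"))  # -> "kaufen" with the toy rules above
```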
The first test results show that the new lemmatizer probably has a higher accuracy than the former CSTlemma software, even for languages with mainly suffix morphology, but that the errors it makes may sometimes be “more wrong” than the errors made by the old CSTlemma software.
Visit by Dr. Guido Governatori from the University of Queensland
Last Thursday, 25th of April 2008, Dr. Guido Governatori from University of Queensland (UQ) in Brisbane, Australia, visited DSV. Guido gave a presentation on Compliance Checking between Business Process and Business Contracts.
Abstract: It is a typical scenario that many organisations specify their business processes independently of their business contracts. This is due to the lack of guidelines and tools that facilitate the derivation of processes from contracts, but also to the traditional mindset of treating contracts separately from business processes. This talk provides a solution to one specific problem that arises from this situation, namely the lack of mechanisms to check whether business processes are compliant with business contracts. The central part of the talk focuses on a logic-based formalism for describing both the semantics of contracts and the semantics of compliance checking procedures.
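Guido's approach rests on a logic-based formalism; as a very loose, hypothetical illustration of the underlying idea of checking a process execution against contract obligations (not his formalism), one might write:

```python
# Hypothetical toy check: does every obligation triggered during a process
# execution get fulfilled before its deadline task? The obligations and
# the trace are invented for the example.
obligations = [
    # (trigger task, required task, deadline task)
    ("ship_goods", "send_invoice", "close_order"),
    ("receive_payment", "send_receipt", "close_order"),
]

def is_compliant(trace, obligations):
    for trigger, required, deadline in obligations:
        if trigger in trace:
            t = trace.index(trigger)
            d = trace.index(deadline) if deadline in trace else len(trace)
            if required not in trace[t:d]:
                return False  # obligation triggered but not fulfilled in time
    return True

trace = ["ship_goods", "send_invoice", "receive_payment", "close_order"]
print(is_compliant(trace, obligations))  # False: no receipt sent before close_order
```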
Thanks Guido for your visit and for an interesting presentation.
Research Project Application: A Universal Repository of Process Models, submitted to Vetenskapsrådet, April 15, 2008
The rapid development of the Internet during the last decade has supported enterprises in building novel infrastructures, setting up virtual organisations, and operating in larger geographical spaces. To manage this new environment, enterprises need to align their IT infrastructures with their business processes. Therefore, interest in business process management using Process Aware Information Systems (PAIS) has increased rapidly. Solutions implemented in PAISs are often complex and time-consuming to develop. One way to address this problem is to utilize repositories of reusable process models. However, while repositories have proved successful within object-oriented and component-based development, similar success has not yet been achieved in the area of PAIS. This is because we still lack a critical mass of process models within a single repository, and we lack transparency between different repositories. The main goal of this research is, therefore, to design the architecture of a universal process repository, i.e. a repository that is independent of process modelling languages, comprises a large number of existing process repositories, and is open for change and growth by any potential user. The long-term goal of the research is to lay the foundations for a Business Process Management Wikipedia, a universal knowledge resource on process models that can be used by researchers for empirical investigations in the business process management area.
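Purely as an illustration of what a notation-independent repository entry could look like (my own assumption, not the design proposed in the application), a minimal entry might store a language-neutral task graph alongside a pointer to the original model:

```python
from dataclasses import dataclass, field

@dataclass
class ProcessModel:
    """A language-neutral repository entry: tasks and control-flow edges,
    plus a pointer to the original model in its native notation."""
    name: str
    notation: str                 # e.g. "BPMN", "EPC", "YAWL"
    source_repository: str        # which existing repository it was imported from
    tasks: list = field(default_factory=list)   # task labels
    edges: list = field(default_factory=list)   # (from_task, to_task) pairs
    original_uri: str = ""        # link to the untranslated source model

repo = [ProcessModel(
    name="Order handling",
    notation="BPMN",
    source_repository="example-local-repo",     # hypothetical source
    tasks=["receive order", "check stock", "ship goods"],
    edges=[("receive order", "check stock"), ("check stock", "ship goods")],
    original_uri="http://example.org/models/order-handling",
)]

# Simple cross-notation search over the neutral representation:
hits = [m for m in repo if any("order" in t for t in m.tasks)]
print([m.name for m in hits])
```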
Paper accepted to LREC 2008, Marrakech, Morocco
The paper Revealing Relations between Open and Closed Answers in Questionnaires through Text Clustering Evaluation, which I have written together with Magnus Rosell (PhD student at CSC, KTH), has been accepted to the 6th edition of the Language Resources and Evaluation Conference (LREC), which will be held in Marrakech, Morocco, 26th May – 1st June 2008. The conference is one of the major events on Language Resources (LRs) and Evaluation for Human Language Technologies (HLT). The experiments described in the paper will be developed further and applied to other data sets as part of my PhD studies.
Abstract:
Open answers in questionnaires contain valuable information that is very time-consuming to analyze manually. We present a method for hypothesis generation from questionnaires based on text clustering. Text clustering is used interactively on the open answers, and the user can explore the cluster contents. The exploration is guided by automatic evaluation of the clusters against a closed answer regarded as a categorization. This simplifies the process of selecting interesting clusters. The user formulates a hypothesis from the relation between the cluster content and the closed answer categorization. We have applied our method to an open answer regarding occupation compared to a closed answer on smoking habits. With no prior knowledge of smoking habits in different occupation groups, we generated the hypothesis that farmers smoke less than average. The hypothesis is supported by several separate surveys. Closed answers are easy to analyze automatically but are restricted and may miss valuable aspects. Open answers, on the other hand, fully capture the dynamics and diversity of possible outcomes. With our method the process of analyzing open answers becomes feasible.
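A rough sketch of the pipeline, here using scikit-learn as an assumed stand-in rather than the system from the paper: cluster the open answers, then compare the closed-answer distribution in each cluster with the overall distribution to spot clusters worth inspecting.

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_and_compare(open_answers, closed_answers, k=5):
    """Cluster free-text answers and report the closed-answer distribution
    per cluster, so that deviating clusters can be inspected for hypotheses."""
    X = TfidfVectorizer().fit_transform(open_answers)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

    overall = Counter(closed_answers)
    for c in range(k):
        in_cluster = [closed_answers[i] for i, l in enumerate(labels) if l == c]
        print(f"cluster {c}: {len(in_cluster)} answers, "
              f"distribution {Counter(in_cluster)} vs overall {overall}")
    return labels
```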
The full paper can be found here:
http://people.dsv.su.se/~sumithra/publications/LREC08/rosellvelupillai08.pdf
AVID: Deidentifying Swedish Medical Records for Better Health Care, submitted to Vetenskapsrådet, April 15, 2008
Within hospital care there has been an explosion in the production of medical record data. A large amount of this data is unstructured free text that is almost never reused. Our research group will soon have access to more than one million medical records from the Stockholm City Council. Currently, we already have access to 5 000 medical records within rheumatology. Unfortunately, the free text of the medical records very often contains misspellings and syntactical errors, as well as many unknown abbreviations, and is therefore difficult for computers to process. In order to use the free-text corpus for research purposes it is also necessary to deidentify the texts, since they typically contain information that can identify the individual patient. In this project we will therefore normalise and deidentify the medical records, and we expect to reach 99 percent deidentification. Once this is done, we and the research community will be able to use human language technology tools, such as text mining and text extraction methods, to find previously uncharted relations between diseases, medical treatment, age, occupation, social situation, etc. One primary goal of this project is thus to make it possible for researchers in medicine to use the abundant digital textual information available in medical records. Such research has never previously been carried out in Sweden, and it is unique due to the kind and the large amount of textual data being used.
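As a small, hypothetical illustration of one deidentification step, the sketch below applies rule-based patterns; the patterns are invented for the example, and the project will combine such rules with statistical methods.

```python
import re

# Invented rule-based patterns for a first deidentification pass;
# a real system would combine such rules with statistical NER.
PATTERNS = [
    (re.compile(r"\b\d{2,4}-\d{2,3}\s?\d{2}\s?\d{2,3}\b"), "<PHONE>"),
    (re.compile(r"\b\d{6}[-+]\d{4}\b"), "<PERSONAL_ID>"),          # Swedish personnummer
    (re.compile(r"\b(?:herr|fru|dr\.?)\s+[A-ZÅÄÖ][a-zåäö]+\b"), "<NAME>"),
]

def deidentify(text):
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(deidentify("Patientens man Bengt-Åke nås på telefonnummer 08-123 4567."))
# -> the phone number is replaced by <PHONE>
```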
Popular scientific description (translated from Swedish):
Deidentification of patient records for better health care
Within health care, physicians and nurses produce a very large number of digital patient records. The records contain information about the patient's general condition, symptoms, diagnosis and treatment. Together, these patient records hold valuable information, in particular free-text sections that are not used at all in medical research. We have previously carried out experiments on 5 000 deidentified patient records within rheumatology and identified two problems:
One problem is that the records, even though they have been deidentified so that they can be used in research, still contain information that can make it possible to identify the patients, for instance references to the patient's occupation (CEO position at Alfa Laval) or to family members and telephone numbers (the patient's husband Bengt-Åke can be reached on telephone number 08-123 4567). The other problem is that the record texts contain many misspellings and grammatical errors, as well as ambiguous abbreviations, which makes them difficult for computer programs to process.
In this research project we therefore intend both to correct the misspellings in the patient records and give concepts a uniform spelling, and to deidentify the text. Both spelling correction and deidentification of the texts will be carried out with fully automatic language technology methods. We will start from the more than one million patient records that we will soon have access to through Stockholm County Council.
These patient record texts are the material on which our systems will be trained, so that they learn to recognise new concepts. The automatic methods for named entity recognition, and thereby deidentification, can be built using either rule-based or statistical approaches. With these methods, person names, occupations, places, organisations, etc. can then be recognised automatically. When this has been done we will have a large number of patient records with perhaps up to 99 percent of the content fully deidentified, enabling research on a unique material. We hope to be able to make our cleaned patient record corpus and the language technology tools developed in the project available through the Swedish National Data Service (SND) for wider dissemination.
The automatic spelling correction system is based on rules for how misspelled words in a text can be corrected. The system uses both lexicons and abbreviation lists and will correct the misspelled words in the patient records, but we will also use special medical word lists, such as FASS lists of drug names. The corpus of more than one million patient record texts can also be used to build new domain-specific word lists, letting the most frequent spellings of words “win over” the less common spellings.
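A toy sketch of the “most frequent spelling wins” idea (the spelling variants below are invented; a real system would also use lexicons, abbreviation lists and FASS drug lists):

```python
from collections import Counter
from difflib import SequenceMatcher

def build_normalizer(corpus_tokens, min_similarity=0.85):
    """Map rare spelling variants to the most frequent similar spelling."""
    freq = Counter(corpus_tokens)
    common = [w for w, _ in freq.most_common()]
    mapping = {}
    for word in freq:
        for candidate in common:
            if freq[candidate] > freq[word] and \
               SequenceMatcher(None, word, candidate).ratio() >= min_similarity:
                mapping[word] = candidate   # rare spelling loses to the common one
                break
    return mapping

tokens = ["diabetes"] * 50 + ["diabets"] * 2 + ["insulin"] * 30 + ["insullin"]
print(build_normalizer(tokens))  # {'diabets': 'diabetes', 'insullin': 'insulin'}
```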
The research that can be carried out on these patient records includes both traditional search within an individual's collected record text and search across several individuals. Most important of all, there will be a large material collecting valuable information about a large number of patients, which can be used to extract new information and knowledge.
The project has two goals: to create a large deidentified patient record corpus in Swedish for research purposes, and to give the research community access to the language technology tools developed in the project for deidentification and for working with similar text collections. With these in place it will be easy in the future to create new deidentified text collections and to work with large, information-dense text collections.
Our project is unique in that it is the first time anyone will carry out deidentification and cleaning of more than one million patient record texts (in Swedish). Previous work has mostly involved at most a few thousand patient records in English. This research is highly relevant because it will enable health care to make use of all the accumulated knowledge written in free text, together with more “hard” measurement values, and thereby find new knowledge for better health care.
Project Proposal concerning “A methodology for assessing the risk exposure to support IT outsourcing decisions”
On April 15, 2008, I submitted, together with Georg Hodosi and Harald Kjellin, a project proposal to the Swedish Research Council titled “A methodology for assessing the risk exposure to support IT outsourcing decisions”.
Abstract
The decision whether or not to outsource information technology (IT) is a substantial business change, and in many cases the competence to make such decisions is limited, even though they have major consequences for the company's future performance. In the project we therefore propose the development of a methodology for assessing the risk exposure in support of IT outsourcing (ITO) decisions. The work will be carried out through:
1. A detailed analysis and a comparative study of different approaches and methods that use transaction cost theory for assessing the risk exposure in IT outsourcing decisions.
2. The development of a new algorithm to evaluate the risk exposure in support of IT outsourcing decisions.
3. Empirical validation of the methodology for evaluating the risk exposure in IT outsourcing decisions through a case study research approach.
As preliminary work on the methodology, research was carried out in 2007 and a software decision tool based on Transaction Cost Theory (TCT) was developed in order to give us a quantitative measure of IT outsourcing risks.
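As a hypothetical illustration of what a quantitative risk-exposure measure could look like (the risk factors and figures below are invented; the actual tool derives its parameters from TCT):

```python
# Hypothetical ITO risk factors: (name, probability, average loss in kSEK).
# The figures are invented for the example.
risks = [
    ("vendor lock-in",              0.30,  800),
    ("hidden transition costs",     0.50,  400),
    ("loss of in-house competence", 0.20, 1200),
]

def risk_exposure(risks):
    """Expected loss = sum of probability * average loss over all risk factors."""
    return sum(p * loss for _, p, loss in risks)

print(f"Estimated risk exposure: {risk_exposure(risks):.0f} kSEK")  # 680 kSEK
```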
We believe the results of this research will bring many benefits, such as a better understanding of the decision parameters involved and their impact on ITO decisions, to those working in ITO research as well as to IT executives and IT consultants. The new methodology will be based on using TCT for assessing the risk exposure (including the average losses) together with a set of best practices. Moreover, the methodology will improve how risk exposure is evaluated in an area where losses are difficult to predict. Furthermore, the implementation of the methodology in a software decision tool could see wide use in industry, and we believe this “methodology platform” will enable further development and optimization of the ITO decision process.