Data and Text Mining

Thomas Vakili – Congrats on the scholarship to Universidad de Chile!

February 7, 2023 Hercules Health Informatics, SYSLAB, Visit

Congratulations on the PhD stipend from the PhD Visiting Program 2023 of the Center for Mathematical Modeling (CMM) at the University of Chile (Universidad de Chile) in Santiago, Chile. This will make it possible for Thomas Vakili to visit the center for three months in the fall of 2023 and work on privacy-preserving methods for Chilean patient records jointly with Dr. Jocelyn Dunstan, who invited Thomas.

Early sepsis detection – Best paper award – ICTAI 2022

November 21, 2022 anastasios Award, Research Paper, SYSLAB

The 34th IEEE International Conference on Tools with Artificial Intelligence was held virtually from the 31st of October to the 2nd of November. With Aron Henriksson, and in collaboration with Karolinska Institutet, we presented our paper “Improving the Timeliness of Early Prediction Models for Sepsis through Utility Optimization”, and we are very happy to announce that we received the best paper award. In the paper, which will be published in the conference proceedings, we explore using custom objective functions to develop a machine learning model that predicts sepsis over time in a way that is useful for practitioners, helping them intervene in a timely manner and initiate treatment early, which is key to survival.

Master thesis presentation at SHI 2022, Tromsø

August 29, 2022 Hercules Health Informatics, Presentation, Publication, Research Paper, SYSLAB

Alexander Dolk presented his and Hjalmar Davidsen's master thesis in the form of a scientific paper titled Evaluation of LIME and SHAP in Explaining Automatic ICD-10 Classifications of Swedish Gastrointestinal Discharge Summaries at the 18th Scandinavian Conference on Health Informatics, SHI 2022, 22-23 August 2022 in Tromsø, Norway. Supervisor Thomas Vakili and I also contributed to the paper.

The research work was part of the ClinCode project in Tromsø. Another paper from the ClinCode project was also presented at the conference, with the title The Influence of NegEx on ICD-10 Code Prediction in Swedish: How is the Performance of BERT and SVM Models Affected by Negations? by Andrius Budrionis, Taridzo Chomutare, Therese Olsen Svenning and Hercules Dalianis.

A conference report from SHI 2022 is available on request from Hercules.

LREC 2022 in Marseille, France

July 5, 2022 thomasvakili Health Informatics, Presentation, Research Paper, SYSLAB

The 13th Language Resources and Evaluation Conference (LREC 2022) was held in Marseille, France with over 1000 participants. Four of us from DSV were there to present our recent findings and learn about the state of the NLP field. Anastasios Lamproudis, Aron Henriksson, Hercules Dalianis and I (Thomas Vakili) had a total of four papers for the conference and its workshops.

All four of us presented a paper about continued pre-training of BERT models using automatically de-identified clinical data. We showed that pre-training with safer de-identified clinical data works just as well as using sensitive data. During the conference, we also received ethical approval to share one of the models with academic researchers.

I also presented two workshop papers co-written with researchers from Linköping University, Linköping University Hospital and RISE. The first paper was about using a clinical BERT model to conduct terminology extraction to find terms associated with medical implants in electronic health records. The other paper investigated how well the de-identification system developed at DSV using the Health Bank performs on data from clinics not present in our datasets.

Anastasios, Aron and Hercules presented a paper in which they evaluated various strategies for creating clinical BERT models. They compared initializing the model from a general-domain model versus pre-training from scratch, and whether adapting the general-domain vocabulary to the clinical domain helps or not. They found that all strategies lead to improvements on clinical tasks, but that all strategies ultimately lead to similarly performing models. However, initializing from a general-domain model decreased the amount of training needed.

We had many fruitful discussions and returned home full of ideas to try out. If you are interested in seeing our posters, then you can find them here and here.

Paper at ACL 2022 workshop: BioNLP

June 3, 2022 thomasvakili Health Informatics, Information Systems, Publication, Research Paper, SYSLAB

I had the pleasure of presenting a poster of a paper by Hercules Dalianis and me: Utility Preservation of Clinical Text After De-Identification. The paper investigates how automatic de-identification, a necessarily imperfect process, impacts the quality of the resulting texts. When a de-identification system incorrectly classifies a word as sensitive, the data will be slightly corrupted. Many researchers have been worried that this would make the data less useful, and we investigate this issue.

The impact of automatic de-identification on quality is evaluated using both qualitative and quantitative (machine learning) methods. We find no losses in utility for clinical NLP on three downstream clinical tasks. In fact, the machine learning models trained using automatic de-identification seem to work just as well as those trained using sensitive data. We also find that the experts in our study think the de-identification works well.

Participating in the 60th ACL conference was a great experience. I learned a lot from our global NLP community and met many researchers interested in our work at DSV. You can find the paper here, and the poster I presented here.

AAAI Fall Symposium and EMNLP – November 2021

November 23, 2021 thomasvakili Health Informatics, Presentation, Publication, Research Paper, SYSLAB, Visit

Professor Hercules Dalianis and I got a paper about the privacy-preserving qualities of BERT accepted to the AAAI Fall Symposium on Human Partnership with Medical Artificial Intelligence! The paper is titled Are Clinical BERT Models Privacy Preserving? The Difficulty of Extracting Patient-Condition Associations. Our results strongly suggest that BERT’s poor generative capabilities make it resistant to training data extraction attacks. Other models, such as GPT-2, have been shown to be susceptible to these attacks. From a privacy perspective, being a poor generator may be a feature!

Later in the same week, I flew from Stockholm to Punta Cana in the Dominican Republic to participate in EMNLP 2021. Almost 500 participants were there on site, with the total number of participants exceeding 4,000. There were many interesting presentations regarding NLP in general, but also some that were specifically about the privacy aspects of NLP. It was a great experience to learn where the field is headed and also to get to know many talented researchers. I have written a summary of some of the interesting papers – reach out if you are interested in it.

DSV at the First ClinCode Conference in Tromsø, Norway

October 6, 2021 thomasvakili Health Informatics, Presentation, SYSLAB, Visit

Professor Hercules Dalianis, Sonja Remmer and I represented DSV at the First ClinCode Conference. The conference gathered experts in medicine and computer science from across the Nordics and took place at the University Hospital of North Norway (UNN) in Tromsø.

The conference was chaired by Hercules, who is also a guest professor at the Norwegian Centre for E-health Research. Sonja shared her work on automatic ICD-10 classification using BERT and I spoke about the difficulty of extracting training data from clinical BERT models.

Several participants had an industry or medical background. This provided valuable insights into how our research at DSV may be used in practice and what challenges are most important. It also highlighted the great potential that can be unlocked by continuing to investigate ICD-10 classification and other medical NLP problems.

Many excellent ideas were hatched in the discussions, and it was lovely to visit the beautiful polar city of Tromsø. Personally, I really look forward to future iterations of the conference!

Press release

2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018)

November 13, 2018 Mahbub Ul Alam Health Informatics

Hello everyone. I hope you are all fine and enjoying this beautiful winter. 🙂

I recently attended the EMNLP 2018 conference. It was held in the city center of Brussels, Belgium, from 31 October to 4 November.

Just a few statistics about the conference:

EMNLP 2018 had 14 workshops, six tutorials, three invited speakers, 351 long paper presentations, 198 short paper presentations, 10 TACL paper presentations, and 29 demos. It received 2,231 valid submissions, a 48% increase over EMNLP 2017.

I attended “LOUHI 2018”, the Ninth International Workshop on Health Text Mining and Information Analysis, and the workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP. LOUHI 2018 provided an excellent overview of the current trends in health text mining and information analysis. The total number of attendees was close to a hundred. I observed that most of the work presented there is based on deep neural networks, and that particular importance is placed on interpretability. The BlackboxNLP workshop focused on this interpretability issue, and I learned about some recent trends there. I have written a short report about some of the papers from the conference, which can be sent on request.

Brussels is a magnificent city. During lunchtime, I tried to see as much of the city center as possible. I tried Belgian chocolate, french fries, and the famous waffles. They were mouth-wateringly delicious.

Overall, it was an excellent experience for me. I met a lot of new people who are working in the same direction as I am. I learned some new concepts. Most importantly, it gave me immense strength and assurance that I am not alone in this journey. There are a lot of researchers in this area, and I can always learn from them.