The 13th Language Resources and Evaluation Conference (LREC 2022) was held in Marseille, France with over 1000 participants. Four of us from DSV were there to present our recent findings and learn about the state of the NLP field. Anastasios Lamproudis, Aron Henriksson, Hercules Dalianis and I (Thomas Vakili) had a total of four papers for the conference and its workshops.
All four of us presented a paper about continued pre-training BERT models using automatically de-identified clinical data. We showed that pre-training with safer de-identified clinical data works just as well as using sensitive data. During the conference, we also received ethical approval to share one of the models with academic researchers.
I also presented two workshop papers co-written with researchers from Linköping University, Linköping University Hospital and RISE. The first paper was about using a clinical BERT model to conduct terminology extraction to find terms associated with medical implants in electronic health records. The other paper investigated how well the de-identification system developed at DSV using the Health Bank performs on data from clinics not present in our datasets.
Anastasios, Aron and Hercules presented a paper in which they evaluated various strategies for creating clinical BERT models. They compared initializing the model from a general-domain model versus pre-training from scratch, and whether adapting the general-domain vocabulary to the clinical domain helps or not. They found that all strategies lead to improvements on clinical tasks, but that all strategies ultimately lead to similarly performing models. However, initializing from a general-domain model decreased the amount of training needed.
We had many fruitful discussions and returned home full of ideas to try out. If you are interested in seeing our posters, then you can find them here and here.