I had the pleasure of presenting a poster of a paper by Hercules Dalianis and me: Utility Preservation of Clinical Text After De-Identification. The paper investigates how automatic de-identification, a necessarily imperfect process, affects the quality of the resulting texts. When a de-identification system incorrectly classifies a word as sensitive, the data is slightly corrupted. Many researchers have worried that this would make the data less useful, and we investigate this issue.
The impact of automatic de-identification on quality is evaluated using both qualitative and quantitative (machine learning) methods. We find no loss in utility for clinical NLP on three downstream clinical tasks. In fact, machine learning models trained on automatically de-identified data seem to perform just as well as those trained on sensitive data. We also find that the experts in our study consider the de-identification to work well.
Participating in the 60th ACL conference was a great experience. I learned a lot from our global NLP community and met many researchers interested in our work at DSV. You can find the paper here, and the poster I presented here.