Learning from Clinical Text: From Weak Supervision to Generative Models

Electronic health records contain a large amount of clinically relevant information in the form of unstructured text documents, such as discharge letters and medical referrals. Extracting structured information from these sources is challenging because annotated data are scarce and clinical language is complex and heterogeneous.
In the first part of this talk, I will present an overview of my doctoral research, which focused on developing machine learning methods to extract information from clinical documents in settings characterized by label scarcity. In particular, I will briefly discuss a clustering pipeline for short clinical texts and a weakly supervised classification framework built upon it, and I will illustrate their use in addressing real-world healthcare problems.
In the second part of the talk, I will present my current research, which builds upon and extends my doctoral thesis in several directions. These include the development of methods to improve the interpretability of language models through latent-variable approaches based on variational autoencoders, the use of large language models to generate synthetic clinical documents for data augmentation, and the application of Transformer architectures to model longitudinal sequences of patient events in electronic health records.
Together, these research directions aim to develop scalable and interpretable machine learning methods that enable the effective use of clinical text and patient records in healthcare research and decision support.
Contact:
andrea1.manzoni@polimi.it