

Poster in Affinity Workshop: New In ML

Evaluating Multimodal Approaches to Medical Report Generation and Introducing a Better Metric for Generated Medical Reports


Abstract:

As a result of recent advancements in foundation models, including large vision-language models, several researchers have explored methods of combining multiple modalities of data as inputs for visual question answering. One key application of visual question answering in the healthcare domain is automated medical report generation, where X-ray images and text-based symptom data for a patient are provided as inputs and a relevant medical report is generated as the output. However, very few studies analyze the performance of these models alongside unimodal fine-tuned LLMs, and even fewer compare the performance of multimodal models depending on whether symptom information is provided as an input. In this paper, we first compare the performance of a variety of approaches to medical report generation on a dataset of chest X-ray reports, including a unimodal fine-tuned medical LLM, a multimodal model without symptom data, and a multimodal model with symptom data. Second, we introduce a new metric for evaluating the similarity between generated and reference medical reports, which we call "sentence pairs". Our results show that multimodal approaches to medical report generation far outperform unimodal approaches, and that providing symptom data slightly improves the accuracy of generated medical reports. We also find that our "sentence pairs" evaluation metric captures the similarity between generated and reference medical reports more closely than standard techniques.
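The abstract does not define how the "sentence pairs" metric is computed, so the sketch below is not the paper's method. It only illustrates the general idea of sentence-level report comparison: split each report into sentences, pair every generated sentence with its most similar reference sentence, and average the pair scores. The sentence splitter, the token-overlap F1 similarity, and the greedy pairing are all assumptions made for this illustration.

```python
# Illustrative sentence-level comparison of a generated and a reference report.
# NOT the paper's "sentence pairs" metric; the pairing rule and similarity
# function here are assumptions for demonstration only.
import re


def split_sentences(report: str) -> list[str]:
    """Naively split a report into sentences on ., !, ? boundaries."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", report.strip()) if s.strip()]


def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two sentences (bag of words, case-insensitive)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    overlap = len(ta & tb)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(ta), overlap / len(tb)
    return 2 * precision * recall / (precision + recall)


def sentence_pair_score(generated: str, reference: str) -> float:
    """Average, over generated sentences, of the best-matching reference sentence score."""
    gen_sents, ref_sents = split_sentences(generated), split_sentences(reference)
    if not gen_sents or not ref_sents:
        return 0.0
    return sum(max(token_f1(g, r) for r in ref_sents) for g in gen_sents) / len(gen_sents)


if __name__ == "__main__":
    gen = "The heart size is normal. No pleural effusion is seen."
    ref = "Heart size is within normal limits. There is no pleural effusion or pneumothorax."
    print(f"sentence-level similarity: {sentence_pair_score(gen, ref):.3f}")
```

Compared with a single report-level n-gram score, scoring at the sentence level rewards reports that state each finding correctly even when sentence order differs, which is one plausible motivation for a sentence-pairing metric.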
