Evaluating Fairness in Clinical Large Language Models through Vignette Generation

Researcher(s)

  • Benjamin Azevedo, Computer Engineering, University of Delaware

Faculty Mentor(s)

  • Rahmatollah Beheshti, Computer & Information Sciences, University of Delaware

Abstract

Large Language Models have proven useful across many disciplines, and their application in the clinical field promises substantial benefits. However, precautions must be taken for their safe deployment, such as ensuring that a model responds fairly and without bias toward different sensitive groups (e.g., gender, ethnicity, age). Model biases can exacerbate existing health disparities. To address these concerns, benchmarks have been developed to check whether large language models appropriately account for demographics across various clinical tasks and settings. The problem is that these datasets are manually constructed and often pertain to specific circumstances, making it difficult to establish a universal evaluation process. Our project aims to develop an automated evaluation method that generates clinical vignettes, or medical scenarios, that can serve as benchmarks for testing clinical large language models. By drawing on external resources such as PubMed, a medical knowledge base, we can retrieve information relevant to existing biases in healthcare and generate vignettes that take this information into account. Our project uses a PubMed knowledge graph, which represents PubMed data so that the relationships between biomedical entities are explicit, allowing for comprehensive searches. Using the article abstracts contained in this knowledge graph, I explored how they could be used to enhance our evaluation pipeline. In addition, I assisted in developing metrics to evaluate the vignettes our system generates so that we can provide high-quality benchmarks.
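
The sketch below illustrates the general shape of the pipeline the abstract describes: retrieve knowledge-graph abstracts tied to a clinical condition and a sensitive attribute, build a vignette-generation prompt from them, and score the model under test with a simple consistency metric. All names, data, and the metric here are illustrative stand-ins, not the project's actual code or the PubMed knowledge graph's API.

```python
# Minimal, hypothetical sketch of retrieval-grounded vignette generation
# and evaluation. The "knowledge graph" is a tiny in-memory stand-in.
from dataclasses import dataclass


@dataclass
class Abstract:
    pmid: str
    text: str
    entities: set[str]  # biomedical entities linked in the knowledge graph


# Stand-in for the PubMed knowledge graph (hypothetical records).
KG = [
    Abstract("000001", "Study of pain management disparities by ethnicity ...",
             {"pain management", "ethnicity"}),
    Abstract("000002", "Cardiac care outcomes across gender groups ...",
             {"cardiac care", "gender"}),
]


def retrieve_bias_context(condition: str, attribute: str) -> list[Abstract]:
    """Return abstracts whose linked entities mention both the clinical
    condition and the sensitive attribute of interest."""
    return [a for a in KG if condition in a.entities and attribute in a.entities]


def build_vignette_prompt(condition: str, attribute: str,
                          context: list[Abstract]) -> str:
    """Assemble an LLM prompt that grounds the vignette in retrieved evidence."""
    evidence = "\n".join(f"- PMID {a.pmid}: {a.text}" for a in context)
    return (
        f"Using the evidence below, write a clinical vignette about {condition} "
        f"in which the patient's {attribute} is stated, so that a model's "
        f"recommendations can be compared across {attribute} groups.\n\n"
        f"Evidence:\n{evidence}"
    )


def demographic_consistency(responses: dict[str, str]) -> float:
    """Toy metric: fraction of demographic variants whose response matches
    the most common response (1.0 = identical treatment across groups)."""
    counts: dict[str, int] = {}
    for r in responses.values():
        counts[r] = counts.get(r, 0) + 1
    return max(counts.values()) / len(responses)


if __name__ == "__main__":
    ctx = retrieve_bias_context("pain management", "ethnicity")
    print(build_vignette_prompt("pain management", "ethnicity", ctx))
    # Hypothetical responses from the model under test for two variants
    # of the same vignette that differ only in the sensitive attribute.
    print(demographic_consistency({"Group A": "opioid", "Group B": "NSAID"}))
```

In a real system, the prompt would be sent to the vignette-generating model and the resulting vignettes would be scored with richer quality and fairness metrics than this single toy measure.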