Evaluation of large language models within GenAI in qualitative research

  • Rush University
  • University of Illinois at Chicago
  • Nyanza Reproductive Health Society

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Large language models (LLMs) perform tasks such as summarizing information and analyzing sentiment to generate meaningful and natural responses. The application of GenAI incorporating LLMs offers potential utility for conducting qualitative research. Using a qualitative study that assessed the impact of the COVID-19 pandemic on the sexual and reproductive health of adolescent girls and young women (AGYW) in rural western Kenya, our objective was to compare thematic analyses conducted by GenAI using an LLM with qualitative analysis conducted by humans, with regard to major themes identified, selection of supportive quotes, and quality of quotes; and secondarily to explore quantitative and qualitative sentiment analysis conducted by the GenAI. We interfaced with GPT-4o through Google Colaboratory. After inputting the transcripts and pre-processing, we constructed a standardized task prompt. Two investigators independently reviewed the GenAI product using a rubric based on qualitative research standards. When compared to human-derived themes, we did not find disagreement with the sub-themes raised by GenAI, but did not consider some to rise to the level of a theme. Performance was low and variable with regard to selection of quotes that were consistent with and strongly supportive of thematic and sentiment analysis. Hallucinations ranged from a single word or phrase change to truncation or combinations of text that led to modified meaning. GenAI identified numerous and relevant biases, primarily related to the underlying training data and its lack of cultural understanding. Few prior studies have directly compared LLM-driven thematic coding with human coding in qualitative analysis, and our study, grounded in qualitative study rigor, allowed for a thorough evaluation. GenAI implemented in GPT-4o was unable to provide a thematic analysis that is indistinguishable from a human analysis.
We recommend that it currently be used as an aid in identifying themes, keywords, and basic narrative, and potentially as a check for human error or bias. However, until it can eliminate hallucinations, provide better contextual understanding of quotes, and undertake a deeper scrutiny of data, it is not reliable or sophisticated enough to undertake a rigorous thematic analysis equal in quality to that of experienced qualitative researchers.
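
The quote hallucinations described above (single-word changes, truncation, and stitched-together text) can in principle be screened for automatically before human review. The following is a minimal sketch of one such check, not the procedure used in the study: the function names, the bigram-overlap heuristic, and the 0.5 threshold are all assumptions made for illustration, and the example transcript sentence is invented rather than taken from the study data.

```python
import re


def tokens(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def classify_quote(quote: str, transcript: str, partial_threshold: float = 0.5) -> str:
    """Classify a model-selected quote against its source transcript.

    Returns 'verbatim' if the quote's words appear contiguously in the
    transcript, 'altered' if enough of its word pairs appear there to suggest
    a modified or stitched quote, and 'not_found' otherwise.
    The threshold is a hypothetical default, not a validated cut-off.
    """
    q, t = tokens(quote), tokens(transcript)
    if not q:
        return "not_found"

    # Pad with spaces so that matches only occur on whole-word boundaries.
    quote_text = " " + " ".join(q) + " "
    transcript_text = " " + " ".join(t) + " "
    if quote_text in transcript_text:
        return "verbatim"

    # Fall back to the share of adjacent word pairs that also occur in the transcript.
    quote_bigrams = list(zip(q, q[1:]))
    if not quote_bigrams:
        return "not_found"
    transcript_bigrams = set(zip(t, t[1:]))
    shared = sum(1 for pair in quote_bigrams if pair in transcript_bigrams)
    return "altered" if shared / len(quote_bigrams) >= partial_threshold else "not_found"


# Invented example sentence, for illustration only (not study data).
transcript = "We stayed at home because the schools were closed for many months."
print(classify_quote("the schools were closed", transcript))
print(classify_quote("The schools were shut for many months", transcript))
print(classify_quote("We stayed at home for many months", transcript))
```

A check of this kind only flags textual departures from the transcript. It cannot judge whether a verbatim quote actually supports the theme it is attached to, which is the part of the evaluation that relied on the investigators' rubric.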

Original language: English
Article number: 34993
Journal: Scientific Reports
Volume: 15
Issue number: 1
Publication status: Published - 7 Oct 2025

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being
