
AI tools struggle with layperson medical symptom queries

Tue, 5th Nov 2024

ConfidenceClub has released the findings of a study examining the accuracy of artificial intelligence (AI) tools in diagnosing medical symptoms, highlighting significant limitations for everyday users.

The study by the health and wellness brand tested the capabilities of five AI language models that are increasingly used in place of traditional online symptom searches, commonly referred to as "Doctor Google". The AI tools evaluated were ChatGPT 4 from OpenAI, DxGPT by Foundation 29, Copilot from Microsoft, Gemini from Google, and Grok from X, the platform previously known as Twitter.

Each tool was asked 40 questions sourced from a medical practice exam. The first 20 questions were presented verbatim, while the remaining 20 were rewritten in layperson language to simulate how individuals without medical expertise describe their symptoms.

The study scored the AI models on two criteria: accuracy in answering the questions and whether they advised consulting a medical professional. Overall, the models struggled to interpret layperson prompts, averaging below 50% accuracy. By contrast, they performed strongly on technical prompts, with an average accuracy of 89%.

The results, detailed in a table provided by ConfidenceClub, revealed specific gaps in each AI tool's capabilities. ChatGPT 4 achieved 100% accuracy on technical prompts and referred users to medical professionals 70% of the time, but answered layperson prompts correctly only 45% of the time. DxGPT fared worse on referrals, never recommending a professional for either prompt type, though it edged out ChatGPT 4 on layperson accuracy at 55%.

Grok from X, meanwhile, excelled at professional referrals, consistently directing users to a professional across all prompts, but it shared the same difficulty with layperson comprehension, scoring 45% accuracy.

The study emphasised a troubling trend: users who phrased queries in technical language received fewer referrals to healthcare professionals, which could foster overconfidence in self-diagnosis amongst those familiar with medical terminology.

Garron Lipschitz, Co-founder at ConfidenceClub, expressed concerns over the study outcomes while acknowledging the potential of AI in aiding medical professionals. "As a business committed to helping people take control of their health and well-being, we were eager to see how reliable AI could be in supporting those efforts. Our study found that while AI tools excel at processing complex medical terminology, they struggle to communicate effectively with everyday users. This gap is concerning, especially as more people turn to AI for symptom checks," he stated.

ConfidenceClub urges caution in relying on AI tools for self-diagnosis and warns of the risks they present to uninformed users. "What stood out even more was that AI tools were less likely to recommend professional help when presented with technically correct prompts, which could lead to overconfidence in self-diagnosis if someone has a grasp of the terminology. We hope this study highlights an important point: while AI has incredible potential, it's no substitute for professional medical advice - especially for people who aren't familiar with medical jargon," Lipschitz commented.

As AI technology progresses, ConfidenceClub advises developers to make these diagnostic tools more accessible and reliable, emphasising their role as a supplemental aid rather than a replacement for professional medical advice.
