The Potential and Pitfalls of using an LLM as a Clinical Assistant
Recent studies have shown promising performance of ChatGPT and GPT-4 in various medical tasks. However, their utility on large-scale, real-world electronic health record databases and in providing comprehensive diagnostic assistance has not been fully assessed. Although our analysis using ChatGPT and GPT-4 revealed high F1 scores (up to 96%) for disease classification tasks and accurate diagnoses (75%), the models also produced factually incorrect statements, overlooked critical medical findings, and recommended unnecessary investigations and overtreatment. These issues, coupled with privacy concerns, limit the real-world clinical use of such models. Nonetheless, their potential for scalability in healthcare applications is notable, as they require less data and prompt engineering than conventional machine learning workflows.