In this study, medical guidelines20 and agents based on GPT-4 were used to answer questions related to TBI rehabilitation. The system automatically evaluates the correctness of each answer while providing the relevant content from the medical guidelines to enhance explainability. The evaluation revealed that the responses generated by the guideline-based GPT-agents outperformed those obtained by directly querying GPT-4 in terms of accuracy, explainability, and empathy.
Brain rehabilitation is a comprehensive and lengthy treatment process involving a variety of aspects, including physical therapy, speech therapy, cognitive training, and psychological support21,22. LLMs acquire knowledge from a wide range of professional disciplines during training, making them highly suitable for assisting with brain rehabilitation.
LLMs have demonstrated potential in the medical field23,24 owing to their powerful natural language processing and generation capabilities25,26. However, their direct use is still limited by certain challenges, such as inaccurate responses and the generation of hallucinations. LLM-based agents have shown significant advantages in complex task processing; for example, they can autonomously program or automate real-world tasks that humans typically perform using computers or smartphones. Agents can also be employed in medical tasks such as dermatological patient–doctor conversations and assessments27. The GPT-agents constructed in this study involved multiple API calls, which yielded lengthier answers but also increased the response time. Overall, the GPT-agents took longer to respond than GPT-4 but could still provide answers within an average of 1–2 min, generating outputs of 300–700 words (in Chinese). This speed is acceptable for clinical counseling, as it is much shorter than real-world waiting times for treatment in hospitals.
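To illustrate why a chained agent pipeline is slower than a single query, the following is a minimal sketch, not the authors' implementation: each agent stage issues a separate chat-completion API call, and total latency is the sum of the calls. The model name, prompt wording, and helper names are illustrative assumptions.

```python
import time
from openai import OpenAI  # assumes the OpenAI Python SDK (v1+) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_agent(system_prompt: str, user_content: str) -> str:
    """One agent stage = one chat-completion API call."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
    )
    return resp.choices[0].message.content

def answer_with_agents(question: str, guideline_text: str) -> str:
    start = time.perf_counter()
    # Stage 1: answer the question with guideline excerpts in the prompt.
    answer = call_agent(
        "You are a TBI rehabilitation assistant. Base your answer on the "
        "guideline excerpts provided by the user.",
        f"Guideline excerpts:\n{guideline_text}\n\nQuestion: {question}",
    )
    # Stage 2: a second call evaluates the answer against the guidelines;
    # this extra round trip is the main source of the added response time.
    verdict = call_agent(
        "Judge whether the answer is consistent with the guideline excerpts. "
        "Reply 'correct' or 'incorrect' with a brief justification.",
        f"Guideline excerpts:\n{guideline_text}\n\nAnswer:\n{answer}",
    )
    elapsed = time.perf_counter() - start
    print(f"Two chained API calls took {elapsed:.1f} s")
    return f"{answer}\n\nEvaluation: {verdict}"
```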
Traditional direct question-answering systems such as ChatGPT have been found to be limited by potential issues related to accuracy28,29 and the generation of hallucinatory responses to medical queries30,31. Medical guidelines and expert consensus, by contrast, serve as the cornerstone of clinical practice. GPT-4 has powerful summarization capabilities29, making it a potential tool for guideline classification. In the present study, we observed that after guideline information was input into GPT-4, its medical role was strongly activated, leading to improved response accuracy. We further found that the inclusion of guidelines did not directly restrict the agents’ responses: our GPT-agents could still offer suggestions during result evaluation, providing an alternative when no guideline-based answer was available.
Several studies have previously attempted to improve the accuracy and completeness of LLMs through prompt engineering, fine-tuning, and retraining29,32. Considering the high cost of fine-tuning and retraining, this study focused instead on prompt engineering. By using guideline-based agents to process the guidelines and supply them as prompts to GPT-4, the accuracy of the agents’ responses improved significantly. This improvement can be attributed to the use of medical guidelines in the prompt, which better set the context and cultural positioning of the model. Guidelines are commonly modified to suit the specific healthcare environment of a particular region, so different healthcare environments and conditions may take slightly different approaches to the same medical issue; for example, Traditional Chinese Medicine is often incorporated into medical guidelines and consensus in China20. This study followed a logical chain of thought, incorporated knowledge from medical guidelines, and employed multiple evaluative agents to assess the questions and answers. We believe that providing professional medical guidelines and utilizing evaluative agents are effective strategies for enhancing response quality.
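A minimal sketch of this prompt-engineering approach, under the assumption that guideline passages are simply prepended to the question as numbered context: the model's medical role and regional practice setting are fixed by the prompt rather than by fine-tuning. The template wording is illustrative, not the study's exact prompt.

```python
def build_guideline_prompt(question: str, passages: list[str]) -> list[dict]:
    """Embed guideline passages in the prompt instead of fine-tuning."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    system = (
        "You are a clinical assistant for traumatic brain injury (TBI) "
        "rehabilitation. Answer strictly on the basis of the numbered "
        "guideline excerpts below, and cite the excerpt numbers you used.\n\n"
        + context
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

Because the guidance lives entirely in the prompt, swapping in a different region's guidelines changes the agent's behavior without retraining, which is the cost advantage the paragraph above describes.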
Completeness reflects the accumulation of experience from long-term clinical work, involving insights into and reflections on the multiple dimensions of illness. In the present study, we found that both the GPT-agents and GPT-4 were lacking in terms of completeness, indicating that their ability to answer medical questions is still at an early stage of development. Further research should explore whether incorporating fine-tuning techniques can improve completeness.
Explainability is an important criterion when evaluating the current use of AI in medicine14,33. Because of their large number of parameters, LLMs are inherently difficult to explain. In the present study, the explainability of the results was supported by referencing the original text of the guidelines: after an answer was evaluated as “correct” or “incorrect”, the final agent output the original text of the referenced guideline content. This significantly increased the explanatory power of the results.
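This final output step can be sketched as follows, assuming a simple record that pairs the verdict with the guideline excerpts it referenced; the data shapes and field names are illustrative assumptions, not the study's code.

```python
from dataclasses import dataclass

@dataclass
class EvaluatedAnswer:
    answer: str
    verdict: str                    # "correct" or "incorrect"
    guideline_excerpts: list[str]   # original guideline text cited as evidence

def format_explained_response(result: EvaluatedAnswer) -> str:
    """Return the answer, the verdict, and the guideline text supporting it,
    so a reader can trace the judgment back to the source guidelines."""
    evidence = "\n".join(f"- {e}" for e in result.guideline_excerpts)
    return (
        f"Answer:\n{result.answer}\n\n"
        f"Verdict: {result.verdict}\n\n"
        f"Referenced guideline content:\n{evidence}"
    )
```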
Patients with brain injury often require a lengthy recovery period and rely on their families for reintegration into society. Empathy can help family members understand and motivate patients, thus boosting their confidence in treatment. GPT-4 itself appears to have an advantage over clinical doctors in terms of empathy34,35. In the present study, we found that the GPT-agents showed significantly enhanced empathy compared with the base GPT-4. This may be attributed to the inclusion of more medical information, which gave the model more precise positioning and allowed it to generate more empathetic language.
Although this study found that GPT-agents based on medical guidelines could significantly improve medical responses, some limitations should be considered. First, the use of GPT-agents increases the time cost: we found an average increase of 1 min in response time for the GPT-agents, although this may vary across regions and Internet environments. Second, there is the issue of incomplete answers. Clinical practice is complex and involves multiple disciplines, and no single guideline can adequately address these complex clinical issues. Guidelines are also constantly evolving and may not always align with the most advanced treatment approaches; as such, they must be critically evaluated. Incorporating a broad, non-duplicative summary of guidelines can help overcome this problem. Third, this study did not employ randomized double-blinding: because guideline references were included in the GPT-agents’ responses, the assessors could not be blinded, which could have introduced subjectivity into the results. Finally, actual medical environments in hospitals are complex and variable, involving individual patient situations, medical histories, and symptoms, and ethical and medical regulations differ across regions. ChatGPT may not fully consider these factors when answering questions, limiting the applicability of its responses. When using GPT, healthcare professionals and clinical teams must therefore maintain professional judgment, integrate GPT responses with specific patient contexts, and develop the best diagnosis and treatment plans accordingly.
In future research, optimization could continue through several approaches. First, it will be necessary to further refine the foundational large models, particularly by upgrading them to multimodal models. This is crucial, as many patients with clinical brain injury may be unable to complete typing or speaking tasks, and supporting various input modes (such as voice and images) can broaden accessibility. Second, further studies should explore whether agents based on medical guidelines exhibit common patterns in other conditions, such as rare diseases or critical illnesses, and whether guideline-based agents can likewise enhance the responses of LLMs. Finally, because various diseases and medical guidelines intersect, research on recommendation algorithms will be necessary. Such algorithms should accurately assess and rank diverse search content and discern patients’ true intentions, as different diseases involve different guidelines and a single condition may have multiple treatment guidelines.
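One simple baseline for such a recommendation algorithm, offered only as a sketch, is to rank candidate guidelines by embedding similarity to the patient's query. The embedding model name below is an assumption, and any real clinical ranking system would need far more careful validation.

```python
import math
from openai import OpenAI

client = OpenAI()

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_guidelines(query: str, guideline_titles: list[str]) -> list[tuple[str, float]]:
    """Rank candidate guidelines by semantic similarity to the patient's query."""
    vectors = client.embeddings.create(
        model="text-embedding-3-small",  # assumed model choice
        input=[query] + guideline_titles,
    ).data
    q_vec = vectors[0].embedding
    scored = [
        (title, cosine(q_vec, v.embedding))
        for title, v in zip(guideline_titles, vectors[1:])
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```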
Despite these limitations, our research showed that GPT-agents that rely on medical guidelines hold significant promise for various medical applications. By integrating evidence-based guidelines, these agents can utilize the wealth of knowledge and expertise accumulated through extensive clinical practice and research. This integration not only improves the reliability of the generated responses, but also ensures their alignment with established medical standards and best practices.
Overall, the results of this study showed that GPT-agents enhanced the accuracy and empathy of responses to TBI rehabilitation questions, while the provision of guideline references improved clinical explainability. Compared with the direct use of GPT-4, GPT-agents based on medical guidelines showed improved performance despite a slight increase in response time; with advances in technology, this delay is expected to diminish. Further validation through multicenter trials in clinical settings is nevertheless necessary. This study offers practical insights and establishes theoretical groundwork for the potential integration of LLM-agents in the field of medicine.