Publication:
System-Based Comparison of the Knowledge Level of Popular AI Chatbots on Human Anatomy: A Multiple-Choice Exam Analysis of GPT-4.1, Deepseek, Co-Pilot, and Gemini Models

dc.authorscopusid: 57196096636
dc.authorscopusid: 59309256100
dc.authorscopusid: 7103170409
dc.authorwosid: Nahir, Mert/E-8120-2019
dc.contributor.author: Nahir, Mert
dc.contributor.author: Kasap, Abdulkerim
dc.contributor.author: Sahin, Bunyamin
dc.date.accessioned: 2025-12-11T00:39:11Z
dc.date.issued: 2025
dc.department: Ondokuz Mayıs Üniversitesi (en_US)
dc.department-temp: [Nahir, Mert; Kasap, Abdulkerim] Tokat Gaziosmanpasa Univ, Fac Med, Dept Anat, Tokat, Turkiye; [Sahin, Bunyamin] Ondokuz Mayis Univ, Fac Med, Dept Anat, Samsun, Turkiye (en_US)
dc.description.abstract: Purpose: This study aimed to comparatively assess the knowledge level of AI-based chatbots on human anatomy systems using multiple-choice questions and to analyze their potential contribution to medical education. Methods: Seventy multiple-choice questions covering seven major anatomical systems (musculoskeletal, respiratory, circulatory, digestive, urinary, genital, and nervous) were translated in accordance with Terminologia Anatomica and presented to GPT-4.1, DeepSeek, Co-Pilot, and Gemini. The questions were selected from first- and second-year medical student exams and distributed according to the item difficulty index (Pj). All bots were tested under the same conditions to minimize bias. Success rates and statistical differences were evaluated using the Kruskal-Wallis and Cochran's Q tests, and the relationship with item difficulty was assessed using the point-biserial correlation. Results: GPT-4.1 showed the highest accuracy (95.7%), followed by Co-Pilot (94.3%), DeepSeek (92.9%), and Gemini (91.4%). System-based results showed that Co-Pilot reached 100% on musculoskeletal questions and GPT-4.1 reached 90% on nervous system questions. All bots scored 100% on the respiratory and circulatory systems; in the remaining systems, success rates ranged from 80% to 100%. No significant correlation was found between item difficulty and chatbot accuracy. Conclusion: The chatbots achieved high accuracy on anatomy questions, but there were notable differences across anatomical systems. While their supportive role in medical education is growing, expert supervision is still recommended. These results show that AI-based systems can serve as complementary educational tools, but further improvement is needed for full reliability. (en_US)
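The Methods section of the abstract leans on a few standard statistics: the item difficulty index Pj (the proportion of examinees who answered item j correctly on the original exams), Cochran's Q test over the four bots' matched binary scores, Kruskal-Wallis as an omnibus comparison, and the point-biserial correlation between each bot's 0/1 outcomes and Pj. The Python sketch below shows how such an analysis could be assembled with scipy and statsmodels; it is not the authors' code, and the 70x4 response matrix, the Pj values, and the random seed are invented placeholders, not the study's data.

```python
# Minimal sketch of the abstract's scoring analysis (not the authors' code).
# All data below are invented placeholders: rows = 70 questions, columns =
# the four chatbots, 1 = correct answer. Requires numpy, scipy, statsmodels.
import numpy as np
from scipy.stats import kruskal, pointbiserialr
from statsmodels.stats.contingency_tables import cochrans_q

rng = np.random.default_rng(0)
bots = ["GPT-4.1", "DeepSeek", "Co-Pilot", "Gemini"]

# Hypothetical 70x4 binary response matrix, drawn so the column means sit
# near the accuracies reported in the abstract.
scores = rng.binomial(1, [0.957, 0.929, 0.943, 0.914], size=(70, 4))

# Item difficulty index Pj: fraction of students who answered item j
# correctly on the original exams (placeholder values here).
p_j = rng.uniform(0.2, 0.9, size=70)

# Per-bot accuracy.
for name, col in zip(bots, scores.T):
    print(f"{name}: {col.mean():.1%}")

# Cochran's Q: do the four bots differ on the same 70 matched binary items?
res = cochrans_q(scores)
print(f"Cochran's Q = {res.statistic:.2f}, p = {res.pvalue:.3f}")

# Kruskal-Wallis as an alternative omnibus comparison of the four columns.
h, p = kruskal(*scores.T)
print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.3f}")

# Point-biserial correlation: does a bot succeed more often on items that
# students also found easy (high Pj)?
for name, col in zip(bots, scores.T):
    if col.min() == col.max():
        continue  # all-correct column: correlation undefined
    r, p = pointbiserialr(col, p_j)
    print(f"{name}: r_pb = {r:.2f}, p = {p:.3f}")
```

Of the two omnibus tests, Cochran's Q is the natural fit for this design, since every bot answers the same 70 items (matched binary data), whereas Kruskal-Wallis treats the four score columns as independent samples.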
dc.description.woscitationindex: Science Citation Index Expanded
dc.identifier.doi: 10.1007/s00276-025-03769-8
dc.identifier.issn: 0930-1038
dc.identifier.issn: 1279-8517
dc.identifier.issue: 1 (en_US)
dc.identifier.pmid: 41233619
dc.identifier.scopus: 2-s2.0-105021544685
dc.identifier.scopusquality: Q3
dc.identifier.uri: https://doi.org/10.1007/s00276-025-03769-8
dc.identifier.uri: https://hdl.handle.net/20.500.12712/38262
dc.identifier.volume: 48 (en_US)
dc.identifier.wos: WOS:001615961700002
dc.identifier.wosquality: Q3
dc.language.iso: en (en_US)
dc.publisher: Springer France (en_US)
dc.relation.ispartof: Surgical and Radiologic Anatomy (en_US)
dc.relation.publicationcategory: Makale - Uluslararası Hakemli Dergi - Kurum Öğretim Elemanı [Article - International Peer-Reviewed Journal - Institutional Faculty Member] (en_US)
dc.rights: info:eu-repo/semantics/closedAccess (en_US)
dc.subject: Anatomy (en_US)
dc.subject: Artificial Intelligence (en_US)
dc.subject: Chatbot (en_US)
dc.subject: Exam Evaluation (en_US)
dc.subject: Medical Education (en_US)
dc.title: System-Based Comparison of the Knowledge Level of Popular AI Chatbots on Human Anatomy: A Multiple-Choice Exam Analysis of GPT-4.1, Deepseek, Co-Pilot, and Gemini Models (en_US)
dc.type: Article (en_US)
dspace.entity.type: Publication
