Publication:
System-Based Comparison of the Knowledge Level of Popular AI Chatbots on Human Anatomy: A Multiple-Choice Exam Analysis of GPT-4.1, Deepseek, Co-Pilot, and Gemini Models

dc.authorscopusid: 57196096636
dc.authorscopusid: 59309256100
dc.authorscopusid: 7103170409
dc.authorwosid: Nahir, Mert/E-8120-2019
dc.contributor.author: Nahir, Mert
dc.contributor.author: Kasap, Abdulkerim
dc.contributor.author: Sahin, Bunyamin
dc.date.accessioned: 2025-12-11T00:39:11Z
dc.date.issued: 2025
dc.department: Ondokuz Mayıs Üniversitesi (en_US)
dc.department-temp: [Nahir, Mert; Kasap, Abdulkerim] Tokat Gaziosmanpasa Univ, Fac Med, Dept Anat, Tokat, Turkiye; [Sahin, Bunyamin] Ondokuz Mayis Univ, Fac Med, Dept Anat, Samsun, Turkiye (en_US)
dc.description.abstract: Purpose: This study aimed to comparatively assess the knowledge level of AI-based chatbots on human anatomy systems using multiple-choice questions and to analyze their potential contribution to medical education. Methods: Seventy multiple-choice questions covering seven major anatomical systems (musculoskeletal, respiratory, circulatory, digestive, urinary, genital, and nervous) were translated in accordance with Terminologia Anatomica and presented to GPT-4.1, DeepSeek, Co-Pilot, and Gemini. The questions were selected from first- and second-year medical student exams and distributed according to the item difficulty index (Pj). All bots were tested under the same conditions to minimize bias. Success rates and statistical differences were evaluated using the Kruskal-Wallis and Cochran's Q tests, and the relationship with item difficulty was assessed using the point-biserial correlation. Results: GPT-4.1 showed the highest accuracy (95.7%), followed by Co-Pilot (94.3%), DeepSeek (92.9%), and Gemini (91.4%). System-based results showed that Co-Pilot reached 100% on musculoskeletal questions and GPT-4.1 reached 90% on nervous system questions. All bots scored 100% on the respiratory and circulatory systems; in the remaining systems, success rates ranged from 80% to 100%. No significant correlation was found between item difficulty and chatbot accuracy. Conclusion: The chatbots achieved high accuracy on anatomy questions, but there were notable differences across anatomical systems. While their supportive role in medical education is growing, expert supervision is still recommended. These results show that AI-based systems can serve as complementary educational tools, but further improvement is needed for full reliability. (en_US)
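The Methods section of the abstract leans on a few standard statistics: the item difficulty index Pj (the proportion of examinees who answered item j correctly on the original exams), Cochran's Q test over the four bots' matched binary scores, Kruskal-Wallis as an omnibus comparison, and the point-biserial correlation between each bot's 0/1 outcomes and Pj. The Python sketch below shows how such an analysis could be assembled with scipy and statsmodels; it is not the authors' code, and the 70x4 response matrix, the Pj values, and the random seed are invented placeholders, not the study's data.

```python
# Minimal sketch of the abstract's scoring analysis (not the authors' code).
# All data below are invented placeholders: rows = 70 questions, columns =
# the four chatbots, 1 = correct answer. Requires numpy, scipy, statsmodels.
import numpy as np
from scipy.stats import kruskal, pointbiserialr
from statsmodels.stats.contingency_tables import cochrans_q

rng = np.random.default_rng(0)
bots = ["GPT-4.1", "DeepSeek", "Co-Pilot", "Gemini"]

# Hypothetical 70x4 binary response matrix, drawn so the column means sit
# near the accuracies reported in the abstract.
scores = rng.binomial(1, [0.957, 0.929, 0.943, 0.914], size=(70, 4))

# Item difficulty index Pj: fraction of students who answered item j
# correctly on the original exams (placeholder values here).
p_j = rng.uniform(0.2, 0.9, size=70)

# Per-bot accuracy.
for name, col in zip(bots, scores.T):
    print(f"{name}: {col.mean():.1%}")

# Cochran's Q: do the four bots differ on the same 70 matched binary items?
res = cochrans_q(scores)
print(f"Cochran's Q = {res.statistic:.2f}, p = {res.pvalue:.3f}")

# Kruskal-Wallis as an alternative omnibus comparison of the four columns.
h, p = kruskal(*scores.T)
print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.3f}")

# Point-biserial correlation: does a bot succeed more often on items that
# students also found easy (high Pj)?
for name, col in zip(bots, scores.T):
    if col.min() == col.max():
        continue  # all-correct column: correlation undefined
    r, p = pointbiserialr(col, p_j)
    print(f"{name}: r_pb = {r:.2f}, p = {p:.3f}")
```

Of the two omnibus tests, Cochran's Q is the natural fit for this design, since every bot answers the same 70 items (matched binary data), whereas Kruskal-Wallis treats the four score columns as independent samples.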
dc.description.woscitationindex: Science Citation Index Expanded
dc.identifier.doi: 10.1007/s00276-025-03769-8
dc.identifier.issn: 0930-1038
dc.identifier.issn: 1279-8517
dc.identifier.issue: 1 (en_US)
dc.identifier.pmid: 41233619
dc.identifier.scopus: 2-s2.0-105021544685
dc.identifier.scopusquality: Q3
dc.identifier.uri: https://doi.org/10.1007/s00276-025-03769-8
dc.identifier.uri: https://hdl.handle.net/20.500.12712/38262
dc.identifier.volume: 48 (en_US)
dc.identifier.wos: WOS:001615961700002
dc.identifier.wosquality: Q3
dc.language.iso: en (en_US)
dc.publisher: Springer France (en_US)
dc.relation.ispartof: Surgical and Radiologic Anatomy (en_US)
dc.relation.publicationcategory: Makale - Uluslararası Hakemli Dergi - Kurum Öğretim Elemanı [Article - International Peer-Reviewed Journal - Institutional Faculty Member] (en_US)
dc.rights: info:eu-repo/semantics/closedAccess (en_US)
dc.subject: Anatomy (en_US)
dc.subject: Artificial Intelligence (en_US)
dc.subject: Chatbot (en_US)
dc.subject: Exam Evaluation (en_US)
dc.subject: Medical Education (en_US)
dc.title: System-Based Comparison of the Knowledge Level of Popular AI Chatbots on Human Anatomy: A Multiple-Choice Exam Analysis of GPT-4.1, Deepseek, Co-Pilot, and Gemini Models (en_US)
dc.type: Article (en_US)
dspace.entity.type: Publication
