Performance evaluation and comparison of domestic large language models in multi-hop reasoning tasks for burn injury diagnosis and treatment assistance
Graphical Abstract
Abstract
Background Battlefield burn care demands rapid integration of multidimensional clinical information to support accurate decision-making, and multi-hop reasoning plays a key role in this process. Objective To evaluate the performance differences of four domestic large language models—DeepSeek R1, DeepSeek V3, DouBao, and KiMi—in multi-hop reasoning tasks for burn-assisted diagnosis and treatment, and provide theoretical reference for model selection and optimization in clinical and field emergency environments.Methods Thirty burn cases discharged from a tertiary hospital from January 2023 to February 2025 were randomly selected to determine the disease diagnosis. Three burn-care experts performed a blind evaluation using a 5-point Likert scale to assess the accuracy of the diagnostic results. Overall comparisons were analyzed using randomized block ANOVA, subgroup analyses (question word count, burn site, area, severity) employed the Mann–Whitney U test, and mixed-effects models were used to assess the interaction between major language models and subgroup factors. Results The experts' consensus score Cronbach's Alpha reached 0.809. DeepSeek R1 achieved a mean score of (4.2 ± 0.62), significantly outperforming DeepSeek V3 (2.4 ± 1.06), Doubao (3.2±1.31) and KiMi (1.6 ± 0.86) (P < 0.001). Subgroup analysis revealed DeepSeek-R1 consistently demonstrated superior performance metrics across all defined subpopulations: cases with word counts ≤2000 versus ≥2000, single-site versus multi-site burn injuries, total body surface area (TBSA) involvement <10% versus ≥10%, and burn severity below deep partial-thickness versus deep partial-thickness or greater. Mixed-effects modeling revealed significant interactions between model score and prompt length (P=0.006), number of burn sites (P=0.007), and burn area (P=0.001).Conclusion Significant performance differences exist among domestic large language models on multi-hop reasoning tasks for burn-care diagnostic support, with DeepSeek R1 demonstrating superior capability. These findings underscore the promise of multi-hop reasoning techniques for integrating complex clinical data and facilitating rapid decision-making, and they offer important guidance for optimizing large models in battlefield and emergency burn-care settings.
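The statistical workflow summarized in the Methods (Cronbach's alpha for inter-rater reliability, randomized block ANOVA for the overall comparison, Mann-Whitney U tests for subgroups, and a mixed-effects interaction model) could be reproduced along the following lines. This is a minimal sketch only, assuming a long-format data file and column names (ratings.csv, case_id, model, expert, score, word_count) that are hypothetical and not taken from the study.

```python
# Sketch of the described analyses; file name and column names are assumptions.
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("ratings.csv")  # columns: case_id, model, expert, score, word_count

# 1) Inter-rater reliability: Cronbach's alpha across the three experts.
wide = df.pivot_table(index=["case_id", "model"], columns="expert", values="score")
k = wide.shape[1]
alpha = k / (k - 1) * (1 - wide.var(ddof=1).sum() / wide.sum(axis=1).var(ddof=1))
print(f"Cronbach's alpha = {alpha:.3f}")

# 2) Overall comparison: randomized block ANOVA on expert-averaged scores,
#    with model as the treatment factor and case as the block.
mean_scores = df.groupby(["case_id", "model"], as_index=False)["score"].mean()
block_fit = smf.ols("score ~ C(model) + C(case_id)", data=mean_scores).fit()
print(anova_lm(block_fit, typ=2))

# 3) Subgroup analysis: Mann-Whitney U test, e.g. prompt length within one model.
lengths = df[["case_id", "word_count"]].drop_duplicates()
scored = mean_scores.merge(lengths, on="case_id")
r1 = scored[scored.model == "DeepSeek R1"]
short = r1.loc[r1.word_count < 2000, "score"]
long_ = r1.loc[r1.word_count >= 2000, "score"]
print(stats.mannwhitneyu(short, long_, alternative="two-sided"))

# 4) Interaction between model and prompt length: mixed-effects model
#    with a random intercept per case.
scored["long_prompt"] = (scored.word_count >= 2000).astype(int)
mixed = smf.mixedlm("score ~ C(model) * long_prompt",
                    scored, groups=scored["case_id"]).fit()
print(mixed.summary())
```

Analogous Mann-Whitney and interaction terms would be substituted for the other subgroup factors (burn site count, TBSA, severity).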