Performance evaluation and comparison of domestic large language models in multi-hop reasoning tasks for burn injury diagnosis and treatment assistance

Abstract: Background Battlefield burn care demands rapid integration of multidimensional clinical information to support accurate decision-making, and multi-hop reasoning plays a key role in this process. Objective To evaluate the performance differences of four domestic large language models (DeepSeek R1, DeepSeek V3, Doubao, and Kimi) in multi-hop reasoning tasks for burn injury diagnosis and treatment assistance, and to provide a theoretical basis for model selection and optimization in clinical and field emergency settings. Methods Thirty burn cases discharged from a hospital between January 2023 and February 2025 were randomly selected. For each case, the patient's basic information, chief complaint, history of present illness, past medical history, personal history, and physical and auxiliary examination findings were entered into the four models in a standardized format, and each model was asked to give a diagnosis. Three burn-care experts rated the accuracy of each diagnosis in a blinded manner on a 5-point Likert scale. Overall comparisons were made with randomized block ANOVA; subgroup analyses (prompt character count, burn site, burn area, and burn severity) used the Mann-Whitney U test; and mixed-effects models were used to assess interactions between the models and the subgroup factors. Results Cronbach's alpha for inter-rater consistency was 0.809. Overall, DeepSeek R1 scored 4.2 ± 0.62, significantly higher than DeepSeek V3 (2.4 ± 1.06), Doubao (3.2 ± 1.31), and Kimi (1.6 ± 0.86) (P < 0.001). Subgroup analyses showed that DeepSeek R1 performed best in every subgroup: prompts of fewer than 2,000 characters versus 2,000 characters or more, single-site versus multi-site burns, total body surface area (TBSA) below 10% versus 10% or more, and burn severity below deep partial-thickness versus deep partial-thickness or greater. Mixed-effects modeling showed significant interactions between model score and prompt length (P = 0.006), number of burn sites (P = 0.007), and burn area (P = 0.001). Conclusion Domestic large language models differ significantly in performance on multi-hop reasoning tasks for burn injury diagnosis and treatment assistance, with DeepSeek R1 performing best. These findings highlight the potential of multi-hop reasoning for integrating complex clinical information and supporting rapid decision-making, and provide an important reference for optimizing large language models in field and clinical emergency care.
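To make the statistical workflow described in Methods concrete, the following is a minimal Python sketch (pandas, SciPy, statsmodels) of the stated analyses: Cronbach's alpha for inter-rater consistency, a randomized block ANOVA across models, a subgroup Mann-Whitney U comparison, and a mixed-effects interaction model. The file name scores.csv, the column names, and the example cut-offs are illustrative assumptions, not the study's actual data or code.

# Assumed long-format table `scores.csv`, one row per (case, model, rater), with
# columns: case_id, model, rater, score, char_count, n_sites, tbsa_pct.
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("scores.csv")

# 1) Inter-rater consistency: Cronbach's alpha over the three raters' scores.
wide = df.pivot_table(index=["case_id", "model"], columns="rater", values="score")
k = wide.shape[1]
alpha = k / (k - 1) * (1 - wide.var(axis=0, ddof=1).sum() / wide.sum(axis=1).var(ddof=1))
print(f"Cronbach's alpha = {alpha:.3f}")

# Per-case mean score for each model (average over raters), joined with case-level factors.
case_vars = df[["case_id", "char_count", "n_sites", "tbsa_pct"]].drop_duplicates()
per_case = (df.groupby(["case_id", "model"], as_index=False)["score"].mean()
              .merge(case_vars, on="case_id"))

# 2) Overall comparison: randomized block ANOVA (cases as blocks, model as treatment).
block_anova = smf.ols("score ~ C(model) + C(case_id)", data=per_case).fit()
print(sm.stats.anova_lm(block_anova, typ=2))

# 3) Subgroup comparison, e.g. prompts <2,000 vs >=2,000 characters for one model.
g = per_case[per_case["model"] == "DeepSeek R1"]
u, p = stats.mannwhitneyu(g.loc[g.char_count < 2000, "score"],
                          g.loc[g.char_count >= 2000, "score"],
                          alternative="two-sided")
print(f"Mann-Whitney U = {u:.1f}, P = {p:.3f}")

# 4) Mixed-effects model: model x case-factor interactions, random intercept per case.
mixed = smf.mixedlm("score ~ C(model) * (char_count + n_sites + tbsa_pct)",
                    data=per_case, groups=per_case["case_id"]).fit()
print(mixed.summary())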

     
