Construction and evaluation of an image-text multimodal intelligent classification model for common skin tumors

Abstract:

Background: Skin tumors show substantial overlap in clinical phenotype, so skin images or clinical text alone may be insufficient for precise computer-aided diagnosis. Existing multimodal studies often lack standardized performance reporting and adequate validation of generalizability.

Objective: To develop a multimodal classification model that integrates skin images with electronic medical record (EMR) text, systematically evaluate its performance in classifying 12 common skin tumor categories, and clarify the complementary value of the two modalities.

Methods: This retrospective study included 15 925 patients with pathologically confirmed diagnoses across 12 disease categories, seen between 2019 and 2024 at the First Medical Center of PLA General Hospital, the Ninth Medical Center of PLA General Hospital, and Beijing Hospital of Traditional Chinese Medicine. Cases were split at the patient level, with stratification, into a training set and an independent test set at a ratio of 4:1. The training set included multi-view images of the same case, whereas the independent test set contained only one image per patient. The image branch used an ImageNet-pretrained DenseNet-201 to extract visual features. The text branch used Qwen3-embedding-8B to generate semantic embeddings, which were then classified with eXtreme Gradient Boosting (XGBoost). In the multimodal stage, the two branches were combined by decision-level cascaded fusion, and the final class was chosen by the maximum rule. The primary evaluation metric was the weighted-average F1-score; secondary metrics were precision, recall, and area under the curve (AUC). Modality complementarity, quantified by the text-compensating-image value, and misclassification patterns were also analyzed.

Results: The multimodal model achieved weighted-average precision, recall, and F1-score of 0.821, 0.812, and 0.816, respectively, outperforming both the image-only model (0.811, 0.796, and 0.799) and the text-only model (0.745, 0.742, and 0.732). In each of the 12 categories, the F1-score of the multimodal model was not lower than that of at least one unimodal model. Recall for melanoma rose from 0.521 in the text-only model to 0.739 in the multimodal model, and recall for xanthogranuloma rose from 0.230 to 0.769. For Bowen's disease, the multimodal F1-score (0.470) was higher than that of the image-only model (0.375) but lower than that of the text-only model (0.551). Modality complementarity analysis gave text-compensating-image values of 0.145, 0.114, and 0.058 for basal cell carcinoma, nevus, and hemangioma, respectively, indicating that text information can supply complementary cues in some cases misclassified from the image.

Conclusion: Image-text multimodal fusion can improve the stability and overall discriminative ability of skin tumor classification, in particular by reducing missed detection of high-risk lesions. Limitations related to class imbalance and generalizability remain.
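The maximum-rule fusion and the weighted-average F1-score named in the abstract can be illustrated with a minimal sketch. The abstract does not describe how the cascade is arranged or how branch outputs are calibrated, so this sketch assumes only that each branch emits a probability vector over the same classes. The function names `max_rule_fusion` and `weighted_f1` and the toy scores are hypothetical and are not taken from the study.

```python
from collections import Counter


def max_rule_fusion(image_probs, text_probs):
    """Fuse two per-class probability vectors with the maximum rule.

    Assumption: both branches score the same classes in the same order.
    For each class the larger branch probability is kept, and the class
    with the highest fused score is returned as the prediction.
    """
    if len(image_probs) != len(text_probs):
        raise ValueError("both branches must score the same classes")
    fused = [max(p_img, p_txt) for p_img, p_txt in zip(image_probs, text_probs)]
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return predicted, fused


def weighted_f1(y_true, y_pred):
    """Weighted-average F1: per-class F1 weighted by class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_cls - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_cls / total) * f1
    return score


# Toy scores for three hypothetical classes (not data from the study).
image_scores = [[0.6, 0.3, 0.1], [0.4, 0.35, 0.25], [0.2, 0.5, 0.3]]
text_scores = [[0.5, 0.4, 0.1], [0.1, 0.8, 0.1], [0.1, 0.2, 0.7]]
labels = [0, 1, 2]

fused_preds = [max_rule_fusion(i, t)[0] for i, t in zip(image_scores, text_scores)]
image_preds = [max(range(3), key=row.__getitem__) for row in image_scores]

print("image-only weighted F1:", round(weighted_f1(labels, image_preds), 3))
print("fused weighted F1:", round(weighted_f1(labels, fused_preds), 3))
```

In this toy case the text branch is confident on the two samples the image branch gets wrong, so the fused prediction recovers them. Because the maximum rule favors whichever branch is more confident, it can also propagate a confident error, which is one possible reading of the Bowen's disease result, although the abstract does not state the cause.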

     
