Abstract:
Background: Skin tumors exhibit substantial overlap in clinical phenotype, and skin images or clinical text alone may be insufficient for precise computer-aided diagnosis. Existing multimodal studies often lack standardized performance reporting and adequate validation of generalizability.
Objective: To develop a multimodal classification model that integrates skin images and electronic medical record (EMR) text, to systematically evaluate its performance in classifying 12 common skin tumor categories, and to clarify the complementary value of the two modalities.
Methods: This retrospective study included 15 925 patients with pathologically confirmed diagnoses across 12 disease categories, seen at the First Medical Center of PLA General Hospital, the Ninth Medical Center of PLA General Hospital, and Beijing Hospital of Traditional Chinese Medicine from 2019 to 2024. Cases were stratified and split at the patient level into a training set and an independent test set at a ratio of 4:1. The training set included multi-view images of the same lesion, whereas the independent test set contained only one image per patient. The image branch used an ImageNet-pretrained DenseNet-201 to extract visual features. The text branch used Qwen3-embedding-8B to generate semantic embeddings, which were then classified with eXtreme Gradient Boosting (XGBoost). In the multimodal stage, decision-level cascaded fusion was performed, and the final class was determined using the maximum rule. The primary evaluation metric was the weighted-average F1-score; secondary metrics were precision, recall, and area under the curve (AUC). Modality complementarity, quantified by the text-compensating-image value, and misclassification patterns were also analyzed.
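The maximum rule named in the Methods can be illustrated with a minimal sketch. It assumes that each branch outputs one probability per class for every case and that the fused score for a class is the larger of the two branch probabilities; the cascade step is not detailed in the abstract and is not reproduced here. The function name `max_rule_fusion` and the three-class toy inputs are hypothetical.

```python
from typing import List, Sequence, Tuple


def max_rule_fusion(
    image_probs: Sequence[Sequence[float]],
    text_probs: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Fuse two sets of per-class probabilities with the maximum rule.

    Assumption: both inputs have shape (n_cases, n_classes) and use the
    same class order. Returns the predicted class index per case and the
    fused per-class scores.
    """
    if len(image_probs) != len(text_probs):
        raise ValueError("both branches must score the same number of cases")

    labels: List[int] = []
    fused: List[List[float]] = []
    for img_row, txt_row in zip(image_probs, text_probs):
        if len(img_row) != len(txt_row):
            raise ValueError("both branches must score the same classes")
        # Per class, keep the more confident of the two branches.
        row = [max(i, t) for i, t in zip(img_row, txt_row)]
        fused.append(row)
        # The predicted class is the one with the highest fused score.
        labels.append(max(range(len(row)), key=row.__getitem__))
    return labels, fused


# Toy example with three classes and two cases.
image_probs = [[0.6, 0.3, 0.1], [0.4, 0.35, 0.25]]
text_probs = [[0.2, 0.7, 0.1], [0.2, 0.3, 0.5]]
labels, fused = max_rule_fusion(image_probs, text_probs)
```

Under this rule a confident prediction from either branch can override the other, which is one way text could correct an image-based error and vice versa.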
Results: The multimodal model achieved weighted-average precision, recall, and F1-score of 0.821, 0.812, and 0.816, respectively, outperforming both the image-only model (0.811, 0.796, and 0.799) and the text-only model (0.745, 0.742, and 0.732). In each of the 12 categories, the F1-score of the multimodal model was at least as high as that of at least one unimodal model. Recall for melanoma increased from 0.521 in the text-only model to 0.739 in the multimodal model, and recall for xanthogranuloma increased from 0.230 to 0.769. For Bowen's disease, the F1-score of the multimodal model (0.470) was higher than that of the image-only model (0.375) but lower than that of the text-only model (0.551). Modality complementarity analysis showed text-compensating-image values of 0.145, 0.114, and 0.058 for basal cell carcinoma, nevus, and hemangioma, respectively, indicating that text information can provide complementary cues in some cases misclassified from the image.
Conclusion: Image-text multimodal fusion can improve the stability and overall discriminative ability of skin tumor classification, particularly by reducing missed detections of high-risk lesions. However, limitations related to class imbalance and generalizability remain.
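For readers less familiar with weighted averaging, the sketch below shows how weighted-average precision, recall, and F1-score are conventionally computed: each class's metric is weighted by its share of true cases. It follows the standard definition and is not taken from the authors' code; the function name `weighted_metrics` and the toy labels are hypothetical. Because F1 is averaged per class, the weighted F1-score need not lie between weighted precision and weighted recall, as in the text-only results above (0.745, 0.742, and 0.732).

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def weighted_metrics(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> Tuple[float, float, float]:
    """Return support-weighted precision, recall, and F1-score."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    support = Counter(y_true)    # true cases per class
    predicted = Counter(y_pred)  # predicted cases per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    n = len(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        # Per-class metrics; precision is 0 when the class is never predicted.
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        # Weight each class by its share of the true cases.
        weight = count / n
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1


# Toy example: three cases of class "a" and one of class "b".
precision, recall, f1 = weighted_metrics(
    ["a", "a", "a", "b"], ["a", "a", "b", "b"]
)
```

Note that weighted-average recall computed this way equals overall accuracy, so majority classes dominate the summary figures when classes are imbalanced.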