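Development and validation of a prediction model for spread through air spaces in stage IA lung cancer presenting as subsolid nodules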

李洪海; 张泽瑾; 南昊宁; 孙亚菲; 赵明; 陈思禹; 仇永辉; 王钰琦

doi:10.12435/j.issn.2095-5227.25110401

Development and validation of a prediction model for STAS in stage IA lung cancer presenting as subsolid nodules

Abstract

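Abstract: Background　For stage IA lung cancer presenting as a subsolid nodule (SSN) on CT, accurate preoperative assessment of the risk of spread through air spaces (STAS) could provide strong support for optimizing the surgical plan and improving patient prognosis. Objective　To develop and validate a machine learning-based clinical prediction model, using CT data and laboratory test results, for identifying STAS risk in patients with stage IA lung cancer presenting as subsolid nodules. Methods　Data were obtained from 2,047 patients with stage IA lung cancer presenting as subsolid nodules on CT at the First and Fourth Medical Centers of the PLA General Hospital between May 2021 and September 2025. The 1,600 patients from the First Medical Center were divided by the random number method at a 7:3 ratio into a training set (n=1,120) and an internal validation set (n=480); the 447 patients from the Fourth Medical Center formed the external validation set. Features were selected using univariable logistic regression, least absolute shrinkage and selection operator (LASSO) regression, and multivariable logistic regression, after which seven machine learning models were developed: naive Bayes (NB), logistic regression (LR), k-nearest neighbors (KNN), random forest (RF), single-layer neural network (SLNN), extreme gradient boosting (XGBoost), and light gradient boosting machine (LightGBM). Model performance was evaluated using receiver operating characteristic (ROC) curves, calibration curves, and decision curve analysis (DCA). Shapley additive explanations (SHAP) were used for model visualization, and a prediction nomogram was built from the SHAP feature importance ranking and deployed as a web page. Results　Of the seven machine learning models, the RF model showed the best predictive performance, with an area under the curve (AUC) of 0.934 (95% CI: 0.902 - 0.966) in the training cohort, 0.929 (95% CI: 0.900 - 0.958) in the internal validation cohort, and 0.873 (95% CI: 0.837 - 0.909) in the external validation cohort. Calibration curves showed good agreement between predicted and observed outcomes. DCA showed that the model had a high clinical net benefit. SHAP analysis identified the most critical factors for predicting STAS in stage IA lung cancer presenting as subsolid nodules on CT: consolidation-to-tumor ratio (0.189), maximum tumor diameter (0.079), spiculation (0.037), and tumor-lung interface (0.035). Conclusion　The RF model based on preoperative semantic imaging features and clinical indicators can accurately predict the risk of STAS in stage IA lung cancer presenting as subsolid nodules, and it showed good generalizability in external validation. By identifying predictors of STAS in stage IA lung cancer before surgery, the model is expected to offer some value as a clinical decision aid in guiding treatment decisions and improving patient prognosis.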

Abstract: Background　Accurate preoperative assessment of the risk of spread through air spaces (STAS) in stage IA lung cancer manifesting as subsolid nodules (SSNs) on CT can support the optimization of surgical strategies and the improvement of patient outcomes. Objective　To develop and validate a machine learning-based clinical prediction model, using CT data and laboratory test results, for identifying the risk of STAS in patients with stage IA lung cancer presenting as subsolid nodules. Methods　A retrospective analysis was performed on 2,047 patients with stage IA lung cancer showing subsolid nodules on CT imaging, who were treated at the First and Fourth Medical Centers of PLA General Hospital from May 2021 to September 2025. The 1,600 cases from the First Medical Center were randomly divided into a training set (n=1,120) and an internal validation set (n=480) at a ratio of 7:3 using the random number method. The 447 cases from the Fourth Medical Center served as the external validation set. Feature selection was conducted using univariable logistic regression, least absolute shrinkage and selection operator (LASSO) regression, and multivariable logistic regression. Subsequently, seven machine learning models were developed: Naive Bayes (NB), Logistic Regression (LR), K-Nearest Neighbors (KNN), Random Forest (RF), Single-Layer Neural Network (SLNN), Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM). Receiver operating characteristic (ROC) curves, calibration curves, and decision curve analysis (DCA) were used to assess discrimination, calibration, and clinical utility, respectively. Model interpretation was achieved using Shapley Additive exPlanations (SHAP). Based on the SHAP-derived feature importance ranking, a prediction nomogram was constructed and deployed as a web-based application. Results　Among the seven machine learning models, the RF model demonstrated the best predictive performance, with area under the curve (AUC) values of 0.934 (95% CI: 0.902 - 0.966) in the training cohort, 0.929 (95% CI: 0.900 - 0.958) in the internal validation cohort, and 0.873 (95% CI: 0.837 - 0.909) in the external validation cohort. The calibration curves indicated good agreement between the model's predictions and observed outcomes. DCA showed that the model provided a high clinical net benefit. SHAP analysis identified the following key factors for predicting STAS in stage IA lung cancer presenting as subsolid nodules on CT: consolidation-to-tumor ratio (0.189), maximum tumor diameter (0.079), spiculation (0.037), and tumor-lung interface (0.035). Conclusion　The RF model based on preoperative semantic imaging features and clinical indicators can accurately predict the risk of STAS in stage IA lung cancer presenting as subsolid nodules, and it generalized well in external validation. By identifying predictors of STAS preoperatively, the model may aid therapeutic decision-making and contribute to improved prognosis.
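The evaluation steps named in the abstract (a 7:3 random split, AUC with a 95% CI, and decision-curve net benefit) can be illustrated with a minimal standard-library sketch. Everything here is an assumption made for illustration: the data are synthetic toy scores rather than patient data, the function names are hypothetical, and the percentile bootstrap is one common way to obtain an AUC confidence interval; the abstract does not state which interval method the authors used.

```python
import random

def split_7_3(items, seed=0):
    """Shuffle a copy of items and split it 70/30 (training / internal validation)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (assumed method, not necessarily the authors')."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

def net_benefit(labels, scores, threshold):
    """Decision-curve net benefit at one threshold probability: TP/n - FP/n * pt/(1 - pt)."""
    n = len(labels)
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)

# Synthetic toy cohort: (STAS label, predicted probability). Not real patient data.
rng = random.Random(42)
cases = []
for _ in range(400):
    y = 1 if rng.random() < 0.25 else 0
    score = min(1.0, max(0.0, rng.gauss(0.65 if y else 0.35, 0.15)))
    cases.append((y, score))

train, valid = split_7_3(cases, seed=1)
labels = [y for y, _ in valid]
scores = [s for _, s in valid]
point = auc(labels, scores)
lo, hi = bootstrap_auc_ci(labels, scores)
print(f"validation AUC = {point:.3f} (95% CI: {lo:.3f} - {hi:.3f})")
print(f"net benefit at threshold 0.30 = {net_benefit(labels, scores, 0.30):.3f}")
```

In a decision curve, `net_benefit` would be evaluated over a range of thresholds and compared with the "treat all" and "treat none" strategies; the sketch computes a single point only.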
