Abstract:
Background Accurate preoperative assessment of the risk of spread through air spaces (STAS) in stage IA lung cancer manifesting as subsolid nodules (SSNs) on CT provides crucial support for optimizing surgical strategies and improving patient outcomes. Objective To develop and validate a machine learning-based clinical prediction model that uses CT data and clinical test results to identify the risk of STAS in patients with stage IA lung cancer presenting as SSNs. Methods A retrospective analysis was performed on 2 047 patients with stage IA lung cancer presenting as SSNs on CT who were treated at the First and Fourth Medical Centers of PLA General Hospital from May 2021 to September 2025. The 1 600 cases from the First Medical Center were randomly divided into a training set (n=1 120) and an internal validation set (n=480) at a ratio of 7:3 using a random number method. The 447 cases from the Fourth Medical Center served as the external validation set. Feature selection was conducted using univariable logistic regression, least absolute shrinkage and selection operator (LASSO) regression, and multivariable logistic regression. Subsequently, seven machine learning models were developed: Naive Bayes (NB), Logistic Regression (LR), K-Nearest Neighbors (KNN), Random Forest (RF), Single-Layer Neural Network (SLNN), Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM). Discrimination, calibration, and clinical utility were evaluated using receiver operating characteristic (ROC) curves, calibration curves, and decision curve analysis (DCA), respectively. The models were interpreted using Shapley Additive exPlanations (SHAP). Based on the SHAP-derived feature importance ranking, a prediction nomogram was constructed and deployed as a web-based application.
Results Among the seven machine learning models, the RF model demonstrated the best predictive performance, with area under the curve (AUC) values of 0.934 (95% CI: 0.902-0.966) in the training cohort, 0.929 (95% CI: 0.900-0.958) in the internal validation cohort, and 0.873 (95% CI: 0.837-0.909) in the external validation cohort. The calibration curve indicated good agreement between predicted and observed outcomes. DCA showed that the model provided high clinical net benefit. SHAP analysis identified the following key factors for predicting STAS in stage IA lung cancer presenting as SSNs on CT: consolidation-to-tumor ratio (0.189), maximum tumor diameter (0.079), spiculation (0.037), and tumor-lung interface (0.035). Conclusion The RF model based on preoperative imaging semantic features and clinical indicators can accurately predict the risk of STAS in stage IA lung cancer presenting as SSNs. Its clinical utility lies in preoperatively identifying patients at high risk of STAS, aiding therapeutic decision-making, and contributing to improved prognosis.
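For readers unfamiliar with the evaluation steps named in the Methods, the following is a minimal sketch of three of them: the 7:3 random split, the AUC used to assess discrimination, and the net benefit that underlies DCA. It is an illustration only and not the study's code: the abstract does not state the software used, the function names (`split_7_3`, `auc`, `net_benefit`) are hypothetical, and the toy labels and scores are invented purely to exercise the functions.

```python
import random


def split_7_3(case_ids, seed=0):
    """Shuffle case IDs and split them 7:3 (training : internal validation).

    The seed is an assumption added for reproducibility.
    """
    shuffled = list(case_ids)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def auc(labels, scores):
    """AUC as the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def net_benefit(labels, scores, threshold):
    """Decision-curve net benefit at one threshold probability pt:
    TP/N - (FP/N) * pt / (1 - pt). Requires 0 < threshold < 1."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


if __name__ == "__main__":
    # 1 600 placeholder case IDs reproduce the 1 120 / 480 split sizes.
    train_ids, val_ids = split_7_3(range(1600), seed=42)
    print(len(train_ids), len(val_ids))

    # Invented toy data: 1 = STAS-positive, 0 = STAS-negative.
    toy_labels = [0, 0, 1, 1]
    toy_scores = [0.10, 0.40, 0.35, 0.80]
    print(auc(toy_labels, toy_scores))
    print(net_benefit(toy_labels, toy_scores, threshold=0.5))
```

A decision curve is obtained by evaluating `net_benefit` over a range of threshold probabilities and comparing the result with the treat-all and treat-none strategies. The 95% confidence intervals reported for the AUC would additionally require a resampling or analytic method that the abstract does not specify.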