Abstract:
Background Prostate cancer is one of the most common malignant tumors of the male genitourinary system. Rapidly differentiating prostate cancer from prostatic hyperplasia remains a clinical problem that requires reliable methods for classification and prediction.
Objective To construct a classification model for prostate cancer and prostatic hyperplasia based on the XGBoost algorithm, identify the risk indicators of prostate cancer, and evaluate their value in clinical application.
Methods Clinical data on patients with prostate cancer or prostatic hyperplasia were obtained from the “Prostate Cancer Dataset” (provided by the National Clinical Medical Science Data Center of Chinese PLA General Hospital and National Population and Health Science Data Sharing Platform in 2019). After preprocessing, the dataset was divided into a training set and a test set in a 7:3 ratio. The XGBoost algorithm was used to construct a classification model of prostate cancer and prostatic hyperplasia on the training set, and the effectiveness of the model was verified on the test set. Finally, the features of the model were interpreted using the SHapley Additive exPlanations (SHAP) method.
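To illustrate the idea behind SHAP, the following is a minimal sketch of the textbook Shapley attribution computed exactly for a tiny model. It is not the SHAP library's tree algorithm and not the study's model: `toy_score`, the feature names, the total-PSA threshold of 10, the score increments, and the single reference patient in `baseline` are all hypothetical and chosen only so that the arithmetic is easy to follow.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to one baseline point.

    Features outside a coalition are set to their baseline value.
    The cost grows exponentially with the number of features.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        point = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return predict(point)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


def toy_score(p):
    """Hypothetical rule-based risk score (NOT the model from the study)."""
    score = 0.2
    if p["fpsa_ratio"] <= 0.132:
        score += 0.3
    if p["phosphorus"] >= 1.09:
        score += 0.1
    # Hypothetical interaction term, to show how credit is shared
    if p["fpsa_ratio"] <= 0.132 and p["tpsa"] > 10:
        score += 0.2
    return score


# Hypothetical patient and reference point
x = {"fpsa_ratio": 0.10, "tpsa": 15.0, "phosphorus": 1.20}
baseline = {"fpsa_ratio": 0.25, "tpsa": 4.0, "phosphorus": 1.00}

phi = shapley_values(toy_score, x, baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.3f}")

# The contributions add up to toy_score(x) - toy_score(baseline)
print(round(sum(phi.values()), 3))
```

In this toy case the interaction bonus is split equally between the two features that jointly trigger it. Plotting such per-patient contributions against the feature value is what makes a threshold visible as the point where the contribution changes sign.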
Results A total of 1 224 patients with prostate cancer (average age of 65.86 years) and 1 255 patients with prostatic hyperplasia (average age of 67.70 years) were included. Twenty-three features, including age, BMI, prostate-specific antigen (PSA) indicators and other biochemical test indicators, were selected to construct the classification model. The AUC, accuracy, recall, precision and F1 score of the model were 0.81, 0.74, 0.70, 0.72 and 0.74, respectively. Free-PSA/total-PSA, total PSA, inorganic phosphorus and free PSA were the four most important factors in the early diagnosis of prostate cancer. SHAP analysis showed that free-PSA/total-PSA ≤ 0.132 and inorganic phosphorus ≥ 1.09 mmol/L were the cut-off values that needed attention in the diagnosis of prostate cancer.
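As a reminder of how the reported evaluation metrics are defined, the sketch below computes them for the binary case from labels and predicted scores. The eight labels and scores are invented for illustration, and the 0.5 decision threshold is an assumption. The abstract does not state whether its figures are per-class or averaged across classes, so this example is not expected to reproduce them.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive case outranks a random
    negative case, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Invented example: 1 = prostate cancer, 0 = prostatic hyperplasia
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics)
print(auc)
```

Note that AUC is computed from the continuous scores and does not depend on the decision threshold, whereas the other four metrics do.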
Conclusion The XGBoost algorithm can be used to construct an effective model for classifying patients with prostate cancer and patients with prostatic hyperplasia, and the cut-off values of important risk indicators derived from SHAP analysis provide a reference for the early diagnosis of prostate cancer.