Abstract:
Background Diabetic retinopathy (DR) is one of the main complications of diabetes. Progressive DR can lead to visual impairment and even blindness. Identifying the clinical factors that affect the progression of DR is therefore important for its prevention, control and management in diabetic patients.
Objective To explore the risk factors of DR in patients with type 2 diabetes mellitus using machine learning algorithms and SHAP (SHapley Additive exPlanations) analysis.
Methods A retrospective analysis was performed on the clinical data of 3000 patients with type 2 diabetes mellitus from the early warning data set of diabetes complications of the Chinese PLA General Hospital, published by ‘The national population and health science data sharing platform’. Baseline analysis and difference tests were carried out for 58 observed variables between the non-diabetic retinopathy (NDR) group and the DR group. Three machine learning algorithms, XGBoost, random forest and logistic regression, were evaluated. Recursive feature elimination (RFE) and XGBoost were employed to rank the feature weights of the optimal variables. The risk factors identified by the model were explained and analyzed with SHAP.
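The elimination loop behind RFE can be illustrated with a minimal, dependency-free sketch. It is not the study's pipeline: the abstract does not describe how RFE and XGBoost were configured, so a small hand-written logistic regression stands in for the model, the data are synthetic, and the feature names (`hba1c`, `blood_urea`, `noise_a`, `noise_b`) are hypothetical labels chosen only for readability.

```python
import math
import random


def fit_logistic(X, y, steps=300, lr=0.5):
    """Fit logistic regression weights and bias by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, target in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            err = 1.0 / (1.0 + math.exp(-z)) - target
            grad_b += err
            for j in range(d):
                grad_w[j] += err * row[j]
        b -= lr * grad_b / n
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
    return w, b


def recursive_feature_elimination(X, y, names, n_keep):
    """Refit repeatedly, dropping the feature with the smallest |weight| each round."""
    active = list(range(len(names)))
    eliminated = []
    while len(active) > n_keep:
        X_sub = [[row[j] for j in active] for row in X]
        w, _ = fit_logistic(X_sub, y)
        weakest = min(range(len(active)), key=lambda k: abs(w[k]))
        eliminated.append(names[active.pop(weakest)])
    return [names[j] for j in active], eliminated


# Synthetic, standardized data: only the first two columns drive the label.
rng = random.Random(0)
names = ["hba1c", "blood_urea", "noise_a", "noise_b"]  # hypothetical names
X, y = [], []
for _ in range(300):
    row = [rng.gauss(0.0, 1.0) for _ in names]
    X.append(row)
    y.append(1 if row[0] + row[1] > 0 else 0)

kept, dropped = recursive_feature_elimination(X, y, names, n_keep=2)
print("kept:", kept)
print("eliminated (in order):", dropped)
```

Ranking by absolute weight is only meaningful here because the synthetic features share one scale. A tree-based model such as XGBoost would supply its own importance scores in place of the weights.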
Results The incidence or level of hypertension (systolic/diastolic blood pressure), glycosylated hemoglobin (HbA1c), blood lipids (total cholesterol, low density lipoprotein), stroke, kidney disease (blood urea, serum creatinine, serum uric acid), renal failure and lower extremity artery disease was higher in the DR group than in the NDR group (all P < 0.05), while the average age and the incidences of coronary heart disease, myocardial infarction, hyperlipidemia and atherosclerosis were lower than in the NDR group (all P < 0.05). The top ten distinguishing features of the XGBoost model were kidney disease, coronary heart disease, lower extremity artery disease, height, other tumors, HbA1c, blood urea, serum albumin, renal failure and hyperlipidemia. The XGBoost model performed better than the other models. The importance of variables in the XGBoost model was explained with the SHAP summary scatter plot: the SHAP values were > 0 and the mean absolute SHAP values were highest for HbA1c (0.59), kidney disease (0.44), blood urea (0.32) and lower extremity artery disease (0.25), and their SHAP value distributions showed clear separation, suggesting that these were significant risk factors of DR. HbA1c, kidney disease and blood urea showed potential interactions in the development of DR, and the risk of DR was significantly increased when blood urea was > 5 mmol/L.
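To make the quantities above concrete, the following sketch computes exact Shapley values for a toy three-feature risk model and ranks the features by mean absolute value, which is how a SHAP importance ranking is read. Everything here is an assumption for illustration: the `model` function, its coefficients and the five-row `background` cohort are invented and do not come from the study, and the brute-force enumeration replaces the tree-specific algorithm that the SHAP library would use for XGBoost.

```python
import math
from itertools import combinations


def model(x):
    """Hypothetical DR risk score; the coefficients are made up for illustration."""
    hba1c, urea, kidney = x
    z = -9.0 + 0.8 * hba1c + 0.3 * urea + 1.0 * kidney + 0.1 * urea * kidney
    return 1.0 / (1.0 + math.exp(-z))


def shapley_values(f, x, background):
    """Exact Shapley values of f at x, averaging absent features over background rows."""
    d = len(x)

    def value(subset):
        # Features in `subset` take the values of x; the rest come from each background row.
        total = 0.0
        for b in background:
            total += f([x[j] if j in subset else b[j] for j in range(d)])
        return total / len(background)

    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            weight = (math.factorial(size) * math.factorial(d - size - 1)
                      / math.factorial(d))
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Invented cohort: [HbA1c in %, blood urea in mmol/L, kidney disease as 0/1].
feature_names = ["HbA1c", "blood urea", "kidney disease"]
background = [
    [6.0, 4.0, 0],
    [7.5, 5.5, 0],
    [9.0, 7.0, 1],
    [8.0, 6.0, 1],
    [6.5, 4.5, 0],
]

all_phi = [shapley_values(model, row, background) for row in background]
mean_abs = [sum(abs(phi[j]) for phi in all_phi) / len(all_phi)
            for j in range(len(feature_names))]

# Global ranking: a larger mean |Shapley value| means more influence on predictions.
for name, score in sorted(zip(feature_names, mean_abs), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```

For any single patient, a positive value pushes the predicted risk above the cohort average and a negative value pulls it below. The values for one patient always sum to that patient's prediction minus the average prediction over the background rows.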
Conclusion The XGBoost algorithm combined with SHAP analysis performs well in identifying the risk factors of DR in patients with diabetes and in explaining the interactions between feature variables. The results suggest that HbA1c, kidney disease and blood urea level are predictive indicators of DR.