Video-based differential diagnosis of Parkinson's disease and essential tremor using RTMPose and PatchTST

Abstract: Background Parkinson's disease (PD) and essential tremor (ET) have overlapping clinical manifestations. Current diagnosis relies on neurologists' subjective rating-scale assessments, which are time-consuming and show limited inter-rater consistency. Objective To develop an intelligent classification model based on video analysis that uses deep learning to differentiate PD from ET efficiently and automatically, offering a new approach to non-invasive diagnosis. Methods Fourteen patients with PD and 63 patients with ET seen at the outpatient department of Chinese PLA General Hospital between 2021 and 2024 were enrolled. A total of 1,136 video clips were recorded while the patients performed three standardized upper-limb motor tasks: finger-to-nose, hand pronation-supination, and fist opening-closing. The RTMPose model from the MMPose framework was used to extract coordinate sequences of wrist and finger keypoints, from which kinematic features such as displacement, velocity, and acceleration were computed to build a dataset of spatiotemporal trajectories and statistical features. A Transformer-based PatchTST model was built, in which the input feature sequences are split into patches along the time axis and processed with a global attention mechanism. Its performance was compared with logistic regression, XGBoost, random forest, support vector machine, Informer, and long short-term memory (LSTM) networks. Results The PatchTST model achieved the best average performance when keypoint coordinates were combined with kinematic features. Its best result was on the finger-to-nose task (AUC = 0.957), and its average AUC across the three motor tasks was 0.897. Among all 21 model-feature combinations, the LSTM model using only kinematic features performed worst, with an average AUC of 0.691 across the three tasks. Conclusion The video-based method for differentiating PD from ET builds on human pose estimation and deep learning and enables contactless remote diagnosis with high accuracy and efficiency, providing a reference for the early diagnosis and management of movement disorders.
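The two preprocessing steps named in the Methods (deriving kinematic features from a keypoint trajectory, then splitting a feature sequence into patches for a PatchTST-style model) can be illustrated with a minimal sketch. The abstract does not report the frame rate, patch length, stride, or the exact feature definitions, so the function names, the finite-difference formulas, and all parameter values below are assumptions chosen for illustration and are not the authors' implementation. Plain Python is used to keep the sketch dependency-free; a real pipeline would more likely operate on NumPy or PyTorch arrays.

```python
import math


def kinematic_features(track, fps):
    """Finite-difference kinematics for one keypoint.

    track: list of (x, y) pixel coordinates, one per video frame
           (e.g. a wrist keypoint returned by a pose estimator).
    fps:   frame rate of the video (assumed; not given in the abstract).
    Returns per-frame displacement, speed, and acceleration lists.
    """
    # Displacement magnitude between consecutive frames (pixels).
    disp = [
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(track, track[1:])
    ]
    # Speed: displacement per second (pixels/s).
    speed = [d * fps for d in disp]
    # Acceleration: change in speed per second (pixels/s^2).
    accel = [(v1 - v0) * fps for v0, v1 in zip(speed, speed[1:])]
    return disp, speed, accel


def make_patches(series, patch_len, stride):
    """Split a 1-D feature sequence into (possibly overlapping) patches.

    Each patch would become one input token of a PatchTST-style model.
    patch_len and stride are hypothetical hyperparameters.
    """
    last_start = len(series) - patch_len
    return [series[i:i + patch_len] for i in range(0, last_start + 1, stride)]


if __name__ == "__main__":
    # Toy wrist trajectory: two equal steps, then the hand stops.
    wrist = [(0, 0), (3, 4), (6, 8), (6, 8)]
    disp, speed, accel = kinematic_features(wrist, fps=10)
    print(disp, speed, accel)

    # Patch a longer toy sequence with overlap (stride < patch_len).
    print(make_patches(list(range(10)), patch_len=4, stride=2))
```

Using a stride smaller than the patch length makes neighbouring patches overlap, so each attention token still carries some local temporal context from its neighbours; whether the study used overlapping patches is not stated in the abstract.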

     
