Rapid human motion pose tracking has extensive applications in motion capture, intelligent monitoring, sports training, and physical health management, where it can provide accurate data support, enhance safety monitoring, optimize training outcomes, and promote physical health. Traditional human pose tracking methods rely predominantly on either sensors or images alone, which often leads to low tracking accuracy and slow tracking speed. To address these problems, this paper proposes a rapid human motion pose tracking method based on improved deep reinforcement learning and multimodal fusion. First, an overall architecture for rapid human motion pose tracking is designed, and monocular vision is combined with sensors to collect human motion data. Second, a complementary filter-based multimodal data fusion method is constructed to merge the multimodal data and extract the fused features. Finally, a multi-level attention network is used to enhance the deep reinforcement learning network, which is trained with the fused features as input to achieve rapid human motion pose tracking. The results show that the proposed method achieves efficient and stable human motion pose tracking in complex scenes, with a tracking accuracy of up to 85% and a minimum tracking time of 72 ms, indicating practical application value.
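
To make the complementary-filter fusion step concrete, the following is a minimal sketch of the textbook single-axis complementary filter, not the formulation used in this paper, which the abstract does not specify. It assumes one joint angle measured by vision (absolute but noisy) and one angular rate measured by an inertial sensor (smooth but drifting). The function name `complementary_filter`, its parameters, and the default weight `alpha=0.98` are hypothetical choices made for illustration.

```python
def complementary_filter(vision_angles, gyro_rates, dt, alpha=0.98):
    """Fuse an absolute angle stream with an angular-rate stream.

    Assumption (not from the paper): both streams are sampled at the
    same instants, `dt` seconds apart, and describe one joint angle.
    """
    if len(vision_angles) != len(gyro_rates):
        raise ValueError("streams must have the same length")
    if not vision_angles:
        return []

    # Initialise from the absolute (vision) measurement.
    fused = [vision_angles[0]]
    for vision_angle, rate in zip(vision_angles[1:], gyro_rates[1:]):
        # Short-term prediction: integrate the inertial rate over one step.
        predicted = fused[-1] + rate * dt
        # Blend: trust the prediction for fast changes and the vision
        # measurement for long-term drift correction.
        fused.append(alpha * predicted + (1.0 - alpha) * vision_angle)
    return fused
```

A larger `alpha` weights the integrated sensor rate more heavily and smooths out vision noise, while a smaller `alpha` corrects sensor drift faster. In a full system the fused values would be computed per joint and passed on as features.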