3D human pose estimation is a major focus area in computer vision and plays an important role in many practical applications. This article summarizes the frameworks and research progress in 3D human pose estimation from monocular RGB images and videos, providing an overall perspective on methods integrated with deep learning. Image-based and video-based inputs are adopted as the analysis framework, and the common problems of each are discussed from this viewpoint. The diversity of human postures frequently leads to occlusion and ambiguity, and the scarcity of training datasets often results in poor model generalization; regression methods are crucial for solving such problems. For image-based input, the multi-view method is commonly used to resolve occlusion, and it is analyzed comprehensively here. For video-based input, prior knowledge of restricted human motion is used to predict postures, and structural constraints are widely applied as additional prior knowledge. Furthermore, weakly supervised learning methods are discussed for both types of input as a means of improving model generalization; the problem of insufficient training data must also be considered, especially because 3D datasets are usually biased and limited. Finally, emerging and popular datasets and evaluation metrics are discussed, and the characteristics of the datasets and the relationships among the metrics are explained and highlighted. By providing an overview of 3D human pose estimation, this article sorts and refines recent studies, describes the kernel problems and commonly used methods, and discusses the scope for further research; it is intended to be useful and instructive for researchers new to this field.
Human pose estimation is a research hotspot in the field of computer vision, similar to face recognition [
In particular, 3D pose estimation has witnessed rapid development. It involves estimating the 3D joint positions of the human body from a single view. Compared with 2D detection, 3D detection recovers depth information, which is used to calculate the 3D coordinates of the human joint positions. Therefore, 3D human pose estimation conveys richer spatial information than 2D human pose estimation [
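As a simple illustration of how depth links the 2D and 3D coordinates, a pixel $(u, v)$ with depth $Z$ and known camera intrinsics $K$ can be back-projected into camera space; this is the standard pinhole-camera relation, not a formula from any specific surveyed paper:

$$
\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = Z\,K^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix},
\qquad
K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}.
$$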
Currently, many breakthroughs have been achieved in studies on 2D human pose estimation [
As deep neural networks have good feature extraction capabilities, many methods [
Existing studies on 3D pose estimation with image-based input are mainly conducted from two aspects. One is to solve the problem of human occlusion, and the other is to improve the generalization ability of the model [
Detection-based models locate the key points of the human body directly via 2D joint detection, and the performance of this approach has been well verified. In 3D space, however, the output space is large and highly nonlinear, which makes direct detection of human joints challenging. Regression-based models are therefore currently popular: the regression task estimates the position of each joint relative to the root joint, which makes the relative positions of parent and child joints easier to obtain. Such methods usually employ a multi-task training framework that combines detection and regression. For example, in reference [
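A minimal sketch of root-relative joint regression as described above (function and tensor names are illustrative assumptions, not taken from any specific paper):

```python
import torch

def root_relative_loss(pred_joints, gt_joints, root_idx=0):
    """L1 loss on root-relative 3D joint positions.

    pred_joints, gt_joints: (batch, num_joints, 3) tensors of 3D coordinates.
    Subtracting the root joint (e.g., the pelvis) removes global translation,
    so the network only has to regress the pose relative to the root.
    """
    pred_rel = pred_joints - pred_joints[:, root_idx:root_idx + 1, :]
    gt_rel = gt_joints - gt_joints[:, root_idx:root_idx + 1, :]
    return torch.mean(torch.abs(pred_rel - gt_rel))
```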
The occlusion problem is essentially one of missing human key points. Because a monocular image provides incomplete pose information, an estimation algorithm can usually recover only the relative coordinates of the human key points, not their absolute coordinates. To address this problem, related studies have adopted a multi-view method to obtain the absolute 3D pose of the human body. The multi-view method uses monocular cameras to capture the same subject from different viewpoints at the same time: a 2D detector locates the 2D joints in each view, and robust triangulation of the 2D detections across views then yields the 3D joint positions. Some researchers have adopted multi-view modeling to overcome the limitation of training the network on single-view RGB images [
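A minimal sketch of the triangulation step, using the standard direct linear transform (DLT); the projection matrices and per-view 2D detections are assumed given:

```python
import numpy as np

def triangulate_joint(proj_mats, points_2d):
    """Triangulate one 3D joint from its 2D detections in several views.

    proj_mats: list of (3, 4) camera projection matrices, one per view.
    points_2d: list of (u, v) pixel detections of the same joint, one per view.
    Returns the 3D position as a length-3 array (standard DLT solution).
    """
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on the homogeneous 3D point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```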
In contrast to the traditional multi-view method, it has been assumed in reference [
Other excellent methods [
In the case of severe occlusion, researchers have proposed an innovative multi-view method that constructs a voxel-based representation of the scene (including the people in it), as shown in
Supervised learning models are strongly dependent on labeled samples and suffer from dataset bias. To improve their generalization ability, some recent studies [
In general, the benchmark datasets used in machine vision are approximately uniformly distributed, whereas real-world training data are usually long-tailed; this mismatch in distributions results in data imbalance. Previous studies have adopted rebalancing strategies to adjust network training, bringing the training distribution closer to the test distribution by resampling within mini-batches or by reweighting the per-sample losses. However, such strategies affect the learning of deep features to a certain extent, for example by increasing the risk of over-fitting or under-fitting [
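A minimal sketch of the loss-reweighting idea mentioned above, using inverse class frequency (a common choice; the exact weighting scheme varies across papers, and the names here are illustrative):

```python
import torch
import torch.nn.functional as F

def inverse_frequency_weights(labels, num_classes):
    """Weight each class inversely to its frequency in the training set."""
    counts = torch.bincount(labels, minlength=num_classes).float()
    weights = counts.sum() / (num_classes * counts.clamp(min=1))
    return weights

# Usage: pass the weights to a standard loss so rare classes count more.
labels = torch.randint(0, 5, (1000,))          # stand-in for real labels
weights = inverse_frequency_weights(labels, 5)
logits = torch.randn(1000, 5)
loss = F.cross_entropy(logits, labels, weight=weights)
```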
In existing methods [
In reference [
Although 3D human pose estimation has achieved considerable success, 3D annotation of RGB images is a labor-intensive, time-consuming, and expensive process. Existing datasets are biased, most of them are indoor datasets, and the included actions cover only a few selected daily behaviors. In reference [
One of the problems in the process of 3D human pose estimation is that different 3D poses may exhibit similar 2D projections. Human motion in the real world follows the laws of kinematics, including static/dynamic structures. Many studies [
Initially, perspective projection is used to correct the noisy 2D input (represented as red dots) and the 2D joints. Then, the joint motion and the human topology are explicitly decomposed. Finally, unreliable 3D poses (represented by red crosses) are eliminated to complete the entire task. These three steps are seamlessly integrated into a deep neural model, forming a deep kinematics analysis pipeline that simultaneously considers the static/dynamic structures of the 2D input and 3D output. This work pioneered the use of perspective projection to refine 2D joints.
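A minimal sketch of the final filtering step, rejecting unreliable 3D poses by their reprojection error under a pinhole camera (the threshold and function names are illustrative assumptions, not values from the cited work):

```python
import numpy as np

def reprojection_error(pose_3d, pose_2d, K):
    """Mean pixel error between projected 3D joints and detected 2D joints.

    pose_3d: (num_joints, 3) camera-space coordinates; pose_2d: (num_joints, 2).
    K: (3, 3) camera intrinsic matrix.
    """
    proj = (K @ pose_3d.T).T                  # perspective projection
    proj = proj[:, :2] / proj[:, 2:3]         # divide by depth
    return np.linalg.norm(proj - pose_2d, axis=1).mean()

def filter_poses(poses_3d, poses_2d, K, thresh_px=10.0):
    """Keep only 3D poses whose reprojection error is below a threshold."""
    return [p3 for p3, p2 in zip(poses_3d, poses_2d)
            if reprojection_error(p3, p2, K) < thresh_px]
```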
As shown in
Protocol 1: MPJPE | Dir. | Disc. | Eat | Greet | Phone | Photo | Pose | Purch | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Martinez et al. | 51.8 | 56.2 | 58.1 | 59.0 | 69.5 | 78.4 | 55.2 | 58.1 | 74.0 | 94.6 | 63.4 | 59.1 | 65.1 | 49.5 | 52.4 | 62.9 |
Luvizon et al. CVPR’18 [ | 49.2 | 51.6 | 47.6 | 50.5 | 51.8 | 60.3 | 48.5 | 51.7 | 61.5 | 70.9 | 53.7 | 48.9 | 57.9 | 44.4 | 48.9 | 53.2 |
Hossain et al. ECCV’18 (T = 5) [ | 48.4 | 50.7 | 57.2 | 55.2 | 63.1 | 72.6 | 53.0 | 51.7 | 66.1 | 80.9 | 59.0 | 57.3 | 62.4 | 46.6 | 49.6 | 58.3 |
Lee et al. ECCV’18 (T = 5) [ | 40.2 | 49.2 | 47.8 | 52.6 | 50.1 | 75.0 | 50.2 | 43.0 | 55.8 | 73.9 | 54.1 | 55.6 | 58.2 | 43.3 | 43.3 | 52.8 |
Pavllo et al. CVPR’19 (T = 1) [ | 47.1 | 50.6 | 49.0 | 51.8 | 53.6 | 61.4 | 49.4 | 47.4 | 59.3 | 67.4 | 52.4 | 49.5 | 55.3 | 39.5 | 42.7 | 51.8 |
Pavllo et al. CVPR’19 (T = 9) [ | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 49.8 |
Protocol 2: PA-MPJPE | Dir. | Disc. | Eat | Greet | Phone | Photo | Pose | Purch | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sun et al. ICCV’17 (T = 1) [ | 42.1 | 44.3 | 45.0 | 45.4 | 51.5 | 53.0 | 43.2 | 41.3 | 59.3 | 73.3 | 51.0 | 44.0 | 48.0 | 38.3 | 44.8 | 48.3 |
Fang et al. AAAI’18 (T = 1) [ | 38.2 | 41.7 | 43.7 | 44.9 | 48.5 | 55.3 | 40.2 | 38.2 | 54.5 | 64.4 | 47.2 | 44.3 | 47.3 | 36.7 | 41.7 | 45.7 |
Pavlakos et al. CVPR’18 (T = 1) [ | 34.7 | 39.8 | 41.8 | 38.6 | 42.5 | 47.5 | 38.0 | 36.6 | 50.7 | 56.8 | 42.6 | 39.6 | 43.9 | 32.1 | 36.5 | 41.8 |
Hossain et al. ECCV’18 (T = 5) [ | 35.7 | 39.3 | 44.6 | 43.0 | 47.2 | 54.0 | 38.3 | 37.5 | 51.6 | 61.3 | 46.5 | 41.4 | 47.3 | 34.2 | 39.4 | 44.1 |
Pavllo et al. CVPR’19 (T = 1) [ | 36.0 | 38.7 | 38.0 | 41.7 | 40.1 | 45.9 | 37.1 | 35.4 | 46.8 | 53.4 | 41.4 | 36.9 | 43.1 | 30.3 | 34.8 | 40.0 |
The latest study [
A graph convolutional network (GCN) is a deep-learning-based method that performs convolution operations on graphs. Compared with the traditional CNN, GCNs have a unique convolution operator for irregular data structures. GCNs can be categorized into two types: Spectral-based GCN [
Previous studies have used only the first-order neighborhood of each node. This limits the receptive field to a single hop, which is not conducive to learning global features. In reference [
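A minimal sketch of one graph convolution layer of the kind applied to skeleton data, following the common normalized-adjacency formulation $H' = \sigma(\hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2}HW)$ (a generic layer for illustration, not the exact architecture of the cited works):

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One graph convolution layer over a fixed skeleton graph."""

    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        # adjacency: (num_joints, num_joints) float 0/1 matrix of bone connections.
        a_hat = adjacency + torch.eye(adjacency.shape[0])   # add self-loops
        d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))
        self.register_buffer("norm_adj", d_inv_sqrt @ a_hat @ d_inv_sqrt)
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        # x: (batch, num_joints, in_dim); mix features along the skeleton edges.
        return torch.relu(self.linear(self.norm_adj @ x))
```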
Protocol 1: MPJPE | Direct | Discuss | Eating | Greet | Phone | Photo | Pose | Purch. | Sitting | SittingD | Smoke | Wait | WalkD | Walk | WalkT | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Ionescu et al. [ | 132.7 | 183.6 | 132.3 | 164.4 | 162.1 | 205.9 | 150.6 | 171.3 | 151.6 | 243.0 | 162.1 | 170.7 | 177.1 | 96.6 | 127.9 | 162.1 |
Tekin et al. CVPR’16 [ | 102.4 | 147.2 | 88.8 | 125.3 | 118.0 | 182.7 | 112.4 | 129.2 | 138.9 | 224.9 | 118.4 | 138.8 | 126.3 | 55.1 | 65.8 | 125.0 |
Zhou et al. CVPR’16 [ | 87.4 | 109.3 | 87.1 | 103.2 | 116.2 | 143.3 | 106.9 | 99.8 | 124.5 | 199.2 | 107.4 | 118.1 | 114.2 | 79.4 | 97.7 | 113.0 |
Du et al. ECCV’16 [ | 85.1 | 112.7 | 104.9 | 122.1 | 139.1 | 135.9 | 105.9 | 166.2 | 117.5 | 226.9 | 120.0 | 117.7 | 137.4 | 99.3 | 106.5 | 126.5 |
Chen et al. CVPR’17 [ | 89.9 | 97.6 | 89.9 | 107.9 | 107.3 | 139.2 | 93.6 | 136.0 | 133.1 | 240.1 | 106.6 | 106.2 | 87.0 | 114.0 | 90.5 | 114.1 |
Pavlakos et al. [ | 67.4 | 71.9 | 66.7 | 69.1 | 72.0 | 77.0 | 65.0 | 68.3 | 83.7 | 96.5 | 71.7 | 65.8 | 74.9 | 59.1 | 63.2 | 71.9 |
Mehta et al. [ | 52.6 | 64.1 | 55.2 | 62.2 | 71.6 | 79.5 | 52.8 | 68.6 | 91.8 | 118.4 | 65.7 | 63.5 | 49.4 | 76.4 | 53.5 | 68.6 |
Zhou et al. [ | 54.8 | 60.7 | 58.2 | 71.4 | 62.0 | 65.5 | 53.8 | 55.6 | 75.2 | 111.6 | 64.1 | 66.0 | 51.4 | 63.2 | 55.3 | 64.9 |
Martinez et al. ICCV’17 [ | 51.8 | 56.2 | 58.1 | 59.0 | 69.5 | 78.4 | 55.2 | 58.1 | 74.0 | 94.6 | 62.3 | 59.1 | 65.1 | 49.5 | 52.4 | 62.9 |
Sun et al. [ | 52.8 | 54.8 | 54.2 | 61.8 | 53.1 | 53.6 | 71.7 | 86.7 | 61.5 | 67.2 | 47.1 | 61.6 | 53.4 | 59.1 | | |
Fang et al. AAAI’18 [ | 50.1 | 54.3 | 57.0 | 57.1 | 66.6 | 73.3 | 53.4 | 55.7 | 72.8 | 88.6 | 60.3 | 57.7 | 62.7 | 47.5 | 50.6 | 60.4 |
Yang et al. CVPR’18 [ | 51.5 | 58.9 | 57.0 | 62.1 | 65.4 | 49.8 | 52.7 | 69.2 | 85.2 | 58.4 | 43.6 | 60.1 | 47.7 | 58.6 | | |
Hossain et al. [ | 48.4 | 57.2 | 55.2 | 63.1 | 72.6 | 53.0 | 80.9 | 59.0 | 57.3 | 62.4 | 49.6 | 58.3 | | | | |
SemGCN (HG) | 48.2 | 60.8 | 51.8 | 64.0 | 64.6 | 53.6 | 51.1 | 67.4 | 88.7 | 57.7 | 73.2 | 65.6 | 48.9 | 64.8 | 51.9 | 60.8 |
SemGCN (RN w/FP) | 60.7 | 51.4 | 60.5 | 68.1 | 86.2 | 67.8 | 61.0 | 60.6 | | | | | | | | |
SemGCN (GT) | 37.8 | 49.4 | 37.6 | 40.9 | 45.1 | 41.4 | 40.1 | 48.3 | 50.1 | 42.2 | 53.5 | 44.3 | 40.5 | 47.3 | 39.0 | 43.8 |
In contrast to the study of the unified GCN [
A study [
Owing to the small amount of available human posture data compared with other vision tasks, as well as the complexity of data annotation, the data richness available for training is insufficient. By contrast, unsupervised learning derives the properties of the data directly from the data itself and summarizes them, enabling researchers to make data-driven decisions from these properties. Therefore, unsupervised-learning-based human pose estimation methods have been studied extensively [
In reference [
The 3D pose estimation of monocular video has attracted considerable attention in recent decades. It focuses on exploring the temporal information in the video to generate more stable predictions and reduce sensitivity to noise; in particular, it involves estimating the trajectories of human key points in 3D space. Owing to motion blur and self-occlusion in video sequences, 2D detections are usually noisy and unreliable. The essence of video-based 3D pose estimation is therefore how to use spatio-temporal information. One method is to use an RNN or LSTM [
Recently, temporal information in monocular video has attracted increasing attention [
Existing methods take advantage of simple structural constraints. They employ symmetrical bone length [
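A minimal sketch of a bone-length symmetry constraint of the kind mentioned above (the joint index pairs and the loss form are illustrative assumptions, not taken from the cited works):

```python
import torch

# Hypothetical (parent, child) joint index pairs for matching left/right bones,
# e.g., (left hip, left knee) paired with (right hip, right knee).
LEFT_BONES = [(4, 5), (5, 6), (11, 12), (12, 13)]
RIGHT_BONES = [(1, 2), (2, 3), (14, 15), (15, 16)]

def symmetry_loss(joints_3d):
    """Penalize differences between corresponding left and right bone lengths.

    joints_3d: (batch, num_joints, 3) predicted 3D joint positions.
    """
    loss = 0.0
    for (la, lb), (ra, rb) in zip(LEFT_BONES, RIGHT_BONES):
        left_len = torch.norm(joints_3d[:, la] - joints_3d[:, lb], dim=-1)
        right_len = torch.norm(joints_3d[:, ra] - joints_3d[:, rb], dim=-1)
        loss = loss + torch.mean(torch.abs(left_len - right_len))
    return loss / len(LEFT_BONES)
```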
The 3D pose can be effectively predicted in videos using a fully convolutional model that applies dilated temporal convolutions over the 2D joint positions. In reference [
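A minimal sketch of a dilated temporal convolution block over 2D keypoint sequences, in the spirit of such fully convolutional models (the layer sizes and names are illustrative, not the published architecture):

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """Dilated 1D convolutions over a sequence of 2D poses."""

    def __init__(self, num_joints=17, channels=256):
        super().__init__()
        self.net = nn.Sequential(
            # Input: (batch, num_joints * 2, frames) flattened 2D keypoints.
            nn.Conv1d(num_joints * 2, channels, kernel_size=3, dilation=1),
            nn.ReLU(),
            # Dilation widens the temporal receptive field without pooling.
            nn.Conv1d(channels, channels, kernel_size=3, dilation=3),
            nn.ReLU(),
            # Output one 3D pose per remaining time step.
            nn.Conv1d(channels, num_joints * 3, kernel_size=1),
        )

    def forward(self, poses_2d):
        # poses_2d: (batch, frames, num_joints, 2)
        b, t, j, _ = poses_2d.shape
        x = poses_2d.reshape(b, t, j * 2).transpose(1, 2)  # to (b, j*2, t)
        out = self.net(x)                                  # (b, j*3, t - 8)
        return out.transpose(1, 2).reshape(b, -1, j, 3)
```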
In reference [
MPJPE/GLE (in mm) and 3DPCK/AUC (in %) on S1

Method | GLE↓ | 3DPCK↑ | AUC↑ | MPJPE↓ |
---|---|---|---|---|
VNect | - | 66.06 | 28.02 | 77.19 |
HMR | - | 82.39 | 43.61 | 72.61 |
HMMR | - | 87.48 | 45.33 | 72.40 |
LiveCap | 317.01 | 71.13 | 37.90 | 92.84 |
MVBL | 76.03 | 99.17 | 57.79 | 45.44 |
Currently, the Human3.6M dataset is the most widely used dataset in 3D pose estimation. It includes 3.6 million images captured by four cameras from different perspectives (50 fps video). It covers 15 actions: Directions, discussion, eating, greeting, phoning, posing, purchases, sitting, sitting down, smoking, taking photos, waiting, walking, walking dogs, and walking together. Further, it contains 11 individuals, 7 of whom have 3D labels; therefore, S1, S5, S6, S7, and S8 are generally used as training sets, while S9 and S11 are used as test sets. The dataset is available at
MPI-INF-3DHP was developed by the Max Planck Institute for Informatics. It is a 3D human pose estimation dataset composed of constrained indoor and complex outdoor scenes. It covers 8 actors performing 8 activities recorded from 14 camera views, including walking and sitting postures as well as complex sports postures and dynamic actions. It contains 1.3 million frames and covers more posture categories than Human3.6M. The dataset was captured with multiple markerless cameras in a green-screen studio. By computing masks for different areas and independently compositing different textures onto the background, chair, upper-body, and lower-body areas, the captured images can be augmented. The actors wear everyday clothes rather than motion capture suits, and each actor has two outfits: one of casual daily clothing and one of plain-colored clothing. In contrast to existing datasets, this dataset allows automatic segmentation and augmentation, and it provides true 3D annotations as well as a universal skeleton compatible with Human3.6M. Compared with the markerless recordings proposed by Joo et al. [
Owing to the high cost of collecting a large-scale multi-person 3D pose estimation dataset, the MuCo-3DHP dataset was synthesized from MPI-INF-3DHP with data augmentation, including background augmentation and shadow-aware augmentation of human contours. This method composites the single-person images of real people from MPI-INF-3DHP into numerous multi-person interaction images under user control, together with 3D pose annotations. A newly recorded (non-synthesized) multi-person test set was also proposed, comprising 20 general real-world scenes with ground-truth 3D poses for up to three subjects, obtained using a multi-view markerless motion capture system; in addition, occlusion labels are provided for each joint. The scenes include 5 indoor and 15 outdoor settings, with backgrounds containing trees, office buildings, roads, people, vehicles, and other fixed and moving entities. This test set is called MuPoTS-3D [
CMU Panoptic: This point cloud dataset is captured by 10 synchronized Kinect devices (Kinoptic Studio) installed in Panoptic Studio, which also houses more than 500 RGB cameras. All sensors share temporal synchronization and a common 3D world coordinate system, so the point cloud output can be used together with the output of the RGB cameras (such as RGB video and 3D skeletons). The dataset contains 10 synchronized RGB-D videos, 3D point clouds from the 10 RGB-D videos, 31 synchronized HD videos of the same scene from other viewpoints, calibration parameters for the 10 RGB-D cameras and 31 HD cameras, and synchronization tables for all the RGB-D and HD videos [
AMASS: This is a large and diverse human motion database that standardizes and parameterizes 15 different optical-marker-based motion capture (mocap) datasets within a common framework. It uses a new method, MoSh++, to convert the mocap data into realistic 3D human body meshes, mapping the large amount of labeled data to common SMPL pose, shape, and soft-tissue parameters. Each sequence is represented by a rigged human model, providing a standard skeleton representation and a fully rigged surface mesh. The consistent representation of AMASS makes it very useful for animation, visualization, and generating training data for deep learning. Compared with other human motion datasets, it is much richer, with more than 40 h of motion recordings covering 344 subjects and 11,265 actions [
Leeds Sports Pose (LSP) dataset: This is a single-person body key point detection dataset with 14 key points and 2000 annotated samples, and it is the second-most commonly used dataset in current research. It covers many sports postures, including athletics, badminton, baseball, gymnastics, parkour, football, volleyball, and tennis. The images are all of athletes, collected from Flickr; each is a three-channel color image, scaled such that the most prominent person is roughly 150 pixels in length. Each image is annotated with 14 joint positions, and the left and right joints are consistently labeled from a person-centric viewpoint [
MPII Human Pose dataset: This is a single-/multi-person body key point detection dataset in which the full body is annotated with 16 joints. The MPII Human Pose dataset is a benchmark for human pose estimation. It includes 25,000 annotated images of more than 40,000 people, extracted from YouTube videos. The test set also includes annotations of body part occlusion as well as 3D torso and head orientation. The dataset is available at
The PoseTrack dataset includes videos built around key frames from the MPII Human Pose dataset. These videos contain multiple individuals and non-static scenes; clips of 41–298 adjacent frames are selected from MPII, favoring crowded scenes. The dataset uses an unconstrained evaluation protocol (without any prior assumptions about the size, location, or number of people, which are arbitrary). The scenes contain multiple people who are articulated with each other and participate in a variety of dynamic activities. The videos contain many body movements and postures, appearance changes, and high mutual occlusion and truncation; targets may partially or fully disappear and reappear. The head boundary of each person in the video is annotated, and each person appearing in the video is assigned a unique tracking ID until that person leaves the camera's field of view. For each tracked person, 15 parts are annotated in the video, including the head, nose, neck, shoulders, elbows, wrists, hips, knees, and ankles. Finally, the VATIC tool is used to accelerate the annotation process, with annotations between adjacent frames completed by interpolation. The dataset contains 550 videos and 66,374 frames, divided into 292 training videos, 50 validation videos, and 208 test videos. For each video sequence, the middle 30 frames are annotated, yielding a total of 23,000 annotated frames and 153,615 annotated poses. In addition, in the validation and test sets, dense annotations are made every 4 frames to test the ability and stability of long-term tracking of body joints. The dataset is available at
The Martial Arts, Dancing and Sports (MADS) dataset is provided by the City University of Hong Kong. It contains five categories, namely tai chi, karate, jazz, hip hop, and sports, with a total of 53,000 frames. The motions are performed by two martial arts masters, two dancers, and one athlete and recorded using multiple cameras or stereo cameras. The motions in the MADS dataset are more complex and challenging than ordinary motions: first, they have a larger range of motion, and some postures do not appear in normal actions; second, there are more self-occlusions and interactions between limbs; third, the motions are relatively fast. A Gaussian Mixture Model (GMM) [
The CrowdPose dataset was constructed by a team at Shanghai Jiao Tong University. It is used for multi-person joint position recognition in crowded scenes, with 14 joint positions per person. According to the crowd index of MSCOCO (person subset), MPII, and AI Challenger, the images are divided into 20 groups spanning [0, 1] with a step size of 0.05, and 30,000 images are evenly sampled from these groups. From these, 20,000 high-quality images are selected, each person in the image is cropped, and the interfering key points in each bounding box are marked. The dataset thus consists of 20,000 images containing approximately 80,000 people; the training, validation, and test subsets are divided in a ratio of 5:1:4, and the crowd index follows a uniform distribution on [0, 1]. The dataset is designed not only to improve performance in crowded situations but also to extend models to different scenarios [
PedX is a large multi-modal pedestrian dataset collected at complex urban intersections. It consists of more than 5,000 pairs of high-resolution (12 MP) stereo images and lidar data, and it provides 2D image labels and 3D pedestrian labels in the global coordinate system. The data were captured at three four-way-stop intersections with considerable interaction between pedestrians and vehicles. The authors also proposed a 3D model-fitting algorithm that labels automatically using cross-modal constraints together with novel shape and temporal priors. All annotated 3D pedestrians are in real-world metric space, and the generated 3D models were validated using a motion capture system configured in a controlled outdoor environment to simulate pedestrians at urban intersections. The manual 2D image labels can also be replaced by advanced automatic labeling methods, which facilitates the automatic generation of large-scale datasets [
The pose structure score (PSS) is proposed to measure structural similarity. Traditional distance-based evaluation metrics (MPJPE and PCK) treat each joint position independently and therefore cannot evaluate the structural accuracy of the posture as a whole. The PSS metric is designed to measure structural similarity instead. Calculating PSS requires the pose distribution of the ground truth as a reference: the ground-truth poses are grouped into k clusters (e.g., via k-means), and a predicted pose is counted as structurally correct if it is assigned to the same cluster as its ground truth.
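Under this definition, PSS over $n$ test poses can be written as follows (a standard formulation; here $C(\cdot)$ denotes the cluster-assignment function and $\mathbb{1}[\cdot]$ the indicator function):

$$
\mathrm{PSS}(\mathbf{p}, \hat{\mathbf{p}}) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\!\left[\, C(\mathbf{p}_i) = C(\hat{\mathbf{p}}_i) \,\right]
$$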
MPJPE: the mean per joint position error, i.e., the average Euclidean distance between the joint position coordinates output by the network and the ground truth (usually computed in camera coordinates).
P-MPJPE: the network output is first rigidly aligned (translation, rotation, and scaling) with the ground truth, and then the MPJPE is calculated.
PCK: a detected joint is considered correct when its distance to the ground truth is within a certain threshold.
PCP: a limb is considered detected if the distances between its two predicted joint positions and the ground truth are both less than half of the limb length.
3DPCK: if a joint is located within a 15 cm sphere centered at the ground-truth joint position, the prediction is regarded as correct, and the common minimum set of 14 marked joints is evaluated. 3DPCK is more robust than MPJPE, and it [
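A minimal sketch of MPJPE and P-MPJPE computation, implementing the definitions above directly (the Procrustes step uses the standard closed-form similarity alignment):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error: pred, gt are (num_joints, 3), e.g., in mm."""
    return np.linalg.norm(pred - gt, axis=1).mean()

def p_mpjpe(pred, gt):
    """MPJPE after rigid alignment (translation, rotation, scale) of pred to gt."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    # Closed-form similarity transform (orthogonal Procrustes with scaling).
    u, s, vt = np.linalg.svd(p.T @ g)
    rot = u @ vt
    # Correct for a possible reflection in the recovered rotation.
    if np.linalg.det(rot) < 0:
        u[:, -1] *= -1
        s[-1] *= -1
        rot = u @ vt
    scale = s.sum() / (p ** 2).sum()
    aligned = scale * (p @ rot) + mu_g
    return mpjpe(aligned, gt)
```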
MPJPE and P-MPJPE are the most commonly used evaluation metrics; both report the error of the results. PSS focuses on the structural accuracy of the overall result rather than the average error of each joint position. PCK and 3DPCK report the percentage of correctly detected key points. Because PCP uses half the limb length as its threshold, it penalizes shorter limbs more heavily than PCK, which alleviates this problem with a fixed threshold.
Research on 3D human pose estimation is attracting increasing attention. This article systematically introduced recent advances in 3D human pose estimation based on monocular cameras. Different data input formats lead to different research focus areas, namely image-based methods and video-based methods, and the solutions to the problems faced by the two are similar. For landmark networks, this article compared the performance of representative algorithms to demonstrate their effectiveness.
Image-based input mainly focuses on estimation using regression algorithms. The well-known existing pipeline is divided into two steps: the first detects the 2D key points, and the second lifts the 2D key points to 3D. To address occlusion, researchers often use the multi-view method to improve the estimation results. To address insufficient training data, some methods [
Directions for future research on 3D human pose estimation based on a monocular camera are as follows: (1) Owing to its limitations, the current 3D HPE method cannot be effectively extended to different fields; therefore, how to compress model parameters to ensure real-time performance must be investigated. (2) The interaction between humans and 3D scenes must be explored. (3) Visual tracking and analysis can be improved using physical constraints. (4) The problem of inaccurate estimation from low-resolution input must be solved. (5) Noise has a significant impact on HPE performance; therefore, how to improve the robustness of HPE networks is a topic for future research.
The authors would like to thank TopEdit (