Computers, Materials & Continua DOI:10.32604/cmc.2021.016871
Article
Convolutional Bi-LSTM Based Human Gait Recognition Using Video Sequences
1University of Wah, Wah Cantt, 47040, Pakistan
2National University of Technology (NUTECH), Islamabad, 44000, Pakistan
3COMSATS University Islamabad, Wah Campus, Wah Cantt, Pakistan
4Faculty of Applied Computing and Technology, Noroff University College, Kristiansand, Norway
5Department of Computer Science and Engineering, Soonchunhyang University, Asan, 31538, Korea
6Department of Mathematics, University of Leicester, Leicester, UK
*Corresponding Author: Yunyoung Nam. Email: ynam@sch.ac.kr
Received: 12 January 2021; Accepted: 14 February 2021
Abstract: Recognition of human gait is a difficult task, particularly for unobtrusive video surveillance and human identification from a large distance. Therefore, a method is proposed for the classification and recognition of different types of human gait. The proposed approach consists of two phases. In phase I, a new model named convolutional bidirectional long short-term memory (Conv-BiLSTM) is proposed to classify the video frames of human gait. In this model, features are derived through a convolutional neural network (CNN) named ResNet-18 and supplied as input to the LSTM model, which provides more distinguishable temporal information. In phase II, the YOLOv2-squeezeNet model is designed, where deep features are extracted using the fireconcat-02 layer and passed to the tinyYOLOv2 model to recognize and localize the human gaits with predicted scores. The proposed method achieved up to 90% correct prediction scores on the CASIA-A, CASIA-B, and CASIA-C benchmark datasets, and it achieved improved prediction scores compared with recent existing works.
Keywords: Bi-LSTM; YOLOv2; open neural network; resNet-18; gait; squeezeNet
Gait biometrics represent a person's walking style and are more powerful than other biometrics [1], i.e., iris, palmprint, face, and fingerprint [2]. Therefore, gait can be utilized for person identification from a long distance [3]. Human gait with different styles is illustrated in Fig. 1. Gait recognition methodologies have attracted more attention over the last two decades in real-time applications such as forensic identification, video surveillance, and crime investigation [4]. In the literature, some research works proposed improved feature vectors to discriminate gait patterns based on motion [5–7]. The recognition of human body parts in motion is receiving more attention from researchers [8]. However, accurately tracking each part of the human body is a challenging task [9]. Appearance-based gait recognition methodologies commonly utilize human silhouettes as input.
Figure 1: Human gait with different actions at 90
These approaches might obtain maximum recognition scores when there is little variation between consecutive frames. When the variation between consecutive frames increases, the performance of these algorithms decreases in real-time applications [10].
Gait features change drastically under different variations, i.e., illumination, view, clothing, and carrying [11]. Model-based features are utilized to track human body parts and movement [12–14]. The main contribution of the presented approach is based on feature vectors that are extracted from the LSTM and ResNet-18 models. The extracted feature vectors contain more prominent discriminative information to classify the different types of human gaits based on fully connected and softmax layers. Furthermore, in phase II, the classified images are recognized using a proposed modified YOLOv2-ONNX model, which consists of 20 layers that are configured by applying the Open Neural Network Exchange (ONNX) model and the SqueezeNet architecture as the base network of the tinyYOLOv2 model. The best recognition results are achieved by extracting deep features from the fireconcat-02 layer of the squeezeNet architecture and feeding them as input to the YOLOv2 model. The proposed method accurately recognizes the different kinds of human gaits.
Several machine learning approaches are used in the literature for human gait recognition (HGR) [15]. For HGR, features play a vital role in extracting discriminant information. Modified Local Optimal Oriented Pattern (MLOOP) features are extracted for HGR, and the best features are selected from the MLOOP feature vector [16]. The histogram of oriented gradients (HOG) and Haralick features are combined for HGR and tested on the CASIA (A–B) datasets [17]. Gabor wavelet features are extracted from the input images in different orientations [18] for HGR, and the method performance is computed on the CASIA (A and B) datasets [19]. Multi-scale LBP and Gabor features are extracted, and the best features are selected by the spectra discriminant analysis-based regression method [20–22]. Principal component analysis (PCA) along with gait energy image (GEI) feature vectors is utilized for human identification [23]. However, it is difficult to recognize the variations in frames such as clothing, angle, and view [24]. To improve the recognition results, the fusion of the structural gait profile and the energy shifted image is performed [25]. Deep features are extracted [26] using pre-trained AlexNet and VGG-19 and fused using skewness and entropy. The informative features are selected by the FEcS method for HGR, and the method is evaluated on the CASIA A, B, and C datasets [27]. Gait flow image and Gaussian image features are extracted to create a feature vector that is fed to the extended neural network classifier for HGR [28,29]. The stacked progressive work autoencoders (SPAE) model is employed for gait recognition at different angles and views, in which some temporal information is missing [30]. GaitSet is applied for the extraction of invariant features for action recognition. Component-based frequency features are extracted for the identification of human actions [31]. The temporal features among the frames might obtain improved results compared with the GEI [32].
However, classifying the cross-clothing and cross-carrying conditions is still difficult due to changes in human shape and appearance [33]. Feng et al. [34] extracted heat maps from the joints of the human body in an RGB input image instead of utilizing a binary silhouette. The extracted heat maps are then supplied to the LSTM model for temporal feature extraction. Recently, the skeleton and joints of the body have also been utilized for person identification [35]. It is observed that gait recognition with higher accuracy is still a challenging task [36].
The proposed model contains two phases. Robust feature extraction and classification are challenging tasks for human gait recognition. Therefore, in phase I, the Conv-BiLSTM model is developed, in which deep features are extracted from the localized images using ResNet-18 and supplied to the LSTM network to classify the different types of human gaits. In phase II, input images are passed to the proposed YOLOv2-Squeeze model, which extracts deep features from the fireconcat-02 layer of the squeezeNet model and supplies them as input to the tinyYOLOv2 model for localization and recognition of the different types of human gaits. The steps of the proposed model are displayed in Fig. 2.
Figure 2: General proposed approach steps
3.1 Proposed Conv-BiLSTM Model for Classification of the Localized Images
The video frames are classified using the proposed Conv-BiLSTM model, in which deep features are extracted from the input frames by a CNN model, namely ResNet-18. Next, the sequence structure is restored and the output is reshaped into sequence vectors using the sequence unfolding layer. After that, the resultant vector sequences are processed by the BiLSTM and output layers. Finally, both networks are assembled into a single network.
3.2 Convolutional Neural Network
The convolutional layers extract the feature vectors from the localized images. These feature vectors are obtained from the activations of the last pooling layer of the ResNet-18 model, as shown in Fig. 3. In the training phase, long sequences of frames cause the model to add a large amount of padding, which has a negative impact on the accuracy of the gait classification.
Figure 3: Feature vectors extraction using Resnet-18 model
To overcome this problem, the sequences with more than 600 time steps are removed together with their class labels, which improves the classification results. The bar lengths of the histogram in Fig. 4 represent the selected sequences.
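The length filter described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' MATLAB code: the function name `drop_long_sequences`, the `max_steps` parameter, and the toy data are assumptions introduced here, and only the 600-time-step threshold comes from the text.

```python
from typing import List, Sequence, Tuple


def drop_long_sequences(
    sequences: Sequence[Sequence[object]],
    labels: Sequence[str],
    max_steps: int = 600,
) -> Tuple[List[Sequence[object]], List[str]]:
    """Keep only (sequence, label) pairs with at most max_steps time steps."""
    if len(sequences) != len(labels):
        raise ValueError("sequences and labels must have the same length")
    kept_sequences: List[Sequence[object]] = []
    kept_labels: List[str] = []
    for seq, label in zip(sequences, labels):
        if len(seq) <= max_steps:
            kept_sequences.append(seq)
            kept_labels.append(label)
    return kept_sequences, kept_labels


# Toy data (hypothetical): only the sequence lengths matter here,
# so each frame is a placeholder value.
toy_sequences = [[0] * 120, [0] * 601, [0] * 600]
toy_labels = ["female", "male", "female"]
kept, kept_labels = drop_long_sequences(toy_sequences, toy_labels)
print([len(s) for s in kept], kept_labels)  # [120, 600] ['female', 'female']
```

Removing the sequence and its label together keeps the two lists aligned, which is what the classifier needs at training time.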
Figure 4: Visualization of the training data sequences
3.3 Bidirectional Long Short-Term Memory (BiLSTM) Model
The modified BiLSTM model is used for the classification of human gaits, in which LSTM layers are used for more efficient temporal feature learning. The hyperparameters for model training are selected after extensive experimentation, as given in Tab. 1.
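To illustrate what "bidirectional" means here, the following sketch runs one recurrent update over a sequence in both directions and pairs the two states at each time step. It is a toy under stated assumptions: the helper names are hypothetical, and a running sum stands in for the LSTM cell so that the data flow stays easy to follow.

```python
from typing import Callable, List, Sequence, Tuple

Step = Callable[[float, float], float]


def run_recurrent(step: Step, inputs: Sequence[float]) -> List[float]:
    """Apply state = step(state, x) along the sequence and return every state."""
    state = 0.0
    states: List[float] = []
    for x in inputs:
        state = step(state, x)
        states.append(state)
    return states


def bidirectional(step: Step, inputs: Sequence[float]) -> List[Tuple[float, float]]:
    """Pair the forward and backward states for each time step."""
    forward = run_recurrent(step, inputs)
    backward = run_recurrent(step, list(reversed(inputs)))
    backward.reverse()  # realign so backward[t] belongs to time step t
    return list(zip(forward, backward))


def running_sum(state: float, x: float) -> float:
    """Stand-in for an LSTM cell (assumption): accumulate the inputs."""
    return state + x


pairs = bidirectional(running_sum, [1.0, 2.0, 3.0])
print(pairs)  # [(1.0, 6.0), (3.0, 5.0), (6.0, 3.0)]
```

Each output pair summarizes the frames before and after a time step, which is the extra temporal context a bidirectional layer gives over a one-directional one.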
Table 1: Experiment for parameter selection for model training
Tab. 1 shows the parameter selection experiment. A configuration of 2000 hidden units (HU) and a batch size of 16 is used for the further experiments because the obtained accuracy decreases when the number of HU is increased or decreased. The hyperparameters of the BiLSTM model are stated in Tab. 2.
Table 2: Selected hyperparameters of BiLSTM
The model specification is as follows: sequence input (1024 dimensions), LSTM layers (2000 hidden units (HU)), 50% dropout, fully connected layers, softmax, and a classification layer. The activation functions of the proposed BiLSTM model are mentioned in Tab. 3.
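As a rough size check on this specification, the number of learnable parameters in the recurrent part can be estimated. The sketch below assumes a single bidirectional layer with the standard LSTM parameterization (four gate blocks per direction, each with input weights, recurrent weights, and a bias); the function name is hypothetical, and the real layer count in the authors' network may differ.

```python
def lstm_param_count(input_dim: int, hidden_units: int, bidirectional: bool = True) -> int:
    """Count learnable parameters of one standard LSTM layer.

    Each of the 4 gate blocks holds hidden_units x input_dim input weights,
    hidden_units x hidden_units recurrent weights, and hidden_units biases.
    """
    gate_blocks = 4
    per_direction = gate_blocks * hidden_units * (input_dim + hidden_units + 1)
    return per_direction * (2 if bidirectional else 1)


# Values from the text: 1024-dimensional sequence input, 2000 hidden units.
bilstm_params = lstm_param_count(input_dim=1024, hidden_units=2000)
print(bilstm_params)  # 48400000
```

Under these assumptions the layer holds about 48.4 million parameters, most of them in the recurrent weights because the hidden size exceeds the input size.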
Table 3: BiLSTM layers with corresponding activations
The LSTM [37] cell has four components, i.e., the input gate, forget gate, output gate, and cell candidate. In the LSTM block, three sets of weights are learnable, i.e., the input weights f, the recurrent weights RW, and the bias b. The matrices of the learnable weights are expressed mathematically as:
The cell state
where
In the LSTM model, feature vectors are computed at each time step through the LSTM layers and supplied to the next block. The output of the nth block is used for the class label prediction, in which the HU are followed by the fully connected, softmax, and output layers.
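For reference, the standard LSTM cell update that matches this description (three gates plus a cell candidate) can be written as below. The symbols are assumptions made for this sketch and may differ from the authors' notation: W denotes input weights, R recurrent weights, and b biases, with subscripts i, f, g, and o for the input gate, forget gate, cell candidate, and output gate.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i x_t + R_i h_{t-1} + b_i\right), \\
f_t &= \sigma\left(W_f x_t + R_f h_{t-1} + b_f\right), \\
g_t &= \tanh\left(W_g x_t + R_g h_{t-1} + b_g\right), \\
o_t &= \sigma\left(W_o x_t + R_o h_{t-1} + b_o\right), \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t, \\
h_t &= o_t \odot \tanh\left(c_t\right).
\end{aligned}
```

Here x_t is the input feature vector at time step t, c_t is the cell state, h_t is the hidden state passed to the next block, σ is the logistic sigmoid, and ⊙ is the element-wise product.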
3.4 Concatenation of CNN and LSTM Models
In the proposed model, LSTM layers are concatenated with CNN layers, in which frames are transformed into a sequence of vectors to classify the human gaits. Fig. 5 shows the steps of the assembled network.
Figure 5: Proposed Conv-BiLSTM model
In Fig. 5, input sequences are passed to the convolutional layers, where features are extracted by convolutional operators. The convolutional layers follow the sequence folding layer. The sequence unfolding layer is followed by the flatten layer in which the structure of the sequences is restored and output is reshaped into a vector. The gait classification is performed using the output of BiLSTM followed by fully connected and softmax layers.
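The folding and unfolding steps can be pictured as pure shape bookkeeping: folding flattens a batch of frame sequences so that a per-image feature extractor can process every frame independently, and unfolding regroups the per-frame features into sequences. The sketch below is an illustration with hypothetical names, and `toy_cnn` is a stand-in that returns two summary numbers per frame rather than ResNet-18 features.

```python
from typing import List, Tuple

Frame = List[float]  # placeholder for an image; a real frame would be a pixel array


def fold(batch: List[List[Frame]]) -> Tuple[List[Frame], int]:
    """Flatten a batch of equal-length sequences into one list of frames."""
    if not batch:
        raise ValueError("batch must not be empty")
    seq_len = len(batch[0])
    if any(len(seq) != seq_len for seq in batch):
        raise ValueError("all sequences must have the same length")
    return [frame for seq in batch for frame in seq], seq_len


def unfold(features: List[List[float]], seq_len: int) -> List[List[List[float]]]:
    """Regroup per-frame feature vectors into sequences of length seq_len."""
    if seq_len <= 0 or len(features) % seq_len != 0:
        raise ValueError("feature count must be a positive multiple of seq_len")
    return [features[i:i + seq_len] for i in range(0, len(features), seq_len)]


def toy_cnn(frame: Frame) -> List[float]:
    """Stand-in feature extractor (assumption): sum and maximum of the frame."""
    return [sum(frame), max(frame)]


batch = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
frames, seq_len = fold(batch)
features = [toy_cnn(frame) for frame in frames]
sequences = unfold(features, seq_len)
print(sequences)  # [[[3.0, 2.0], [7.0, 4.0]], [[11.0, 6.0], [15.0, 8.0]]]
```

The regrouped feature sequences are what a recurrent layer such as the BiLSTM would consume, one feature vector per time step.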
3.5 Localization of Human Gait Using YOLOv2-SqueezeNet Model
YOLOv2 is fast and effective compared with the region-based convolutional neural network (R-CNN) and single shot detector (SSD) detectors. Therefore, in this research, the YOLOv2-SqueezeNet model is suggested for the localization of different types of human gait such as female, male, fast walk, slow walk, walk with the bag, normal, and wearing, as shown in Fig. 6.
Figure 6: YOLOv2-SqueezeNet model for localization
Fig. 6 shows the proposed YOLOv2-SqueezeNet model, where features are extracted from the fireconcat-02 layer of the SqueezeNet model and passed as input to the pre-trained YOLOv2 detector. The proposed model localizes the required regions with class labels more accurately.
The model consists of 20 layers: 01 input image, 04 convolutional, 04 ReLU, 01 depth concatenation, and 01 max-pooling layer from the squeezeNet model, and 02 YOLOv2 convolutional, 02 YOLOv2 batch-normalization, 02 YOLOv2 ReLU, 01 YOLOv2 transform, and 01 YOLOv2 output layer from the YOLOv2 model. The activation functions of the YOLOv2-SqueezeNet model are shown in Tab. 4. Tab. 5 presents the training hyperparameters.
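The role of the transform step in a YOLOv2-style detector can be illustrated with the standard YOLOv2 box parameterization, in which raw network outputs are converted to a box center and size relative to a grid cell and an anchor box. The sketch below assumes that parameterization; the function and argument names are hypothetical, and it is not taken from the authors' implementation.

```python
import math
from typing import Tuple


def sigmoid(x: float) -> float:
    """Logistic function, used to keep the center offset inside the grid cell."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(
    t_x: float, t_y: float, t_w: float, t_h: float,
    cell_x: int, cell_y: int,
    anchor_w: float, anchor_h: float,
) -> Tuple[float, float, float, float]:
    """Convert raw predictions to (center_x, center_y, width, height) in grid units."""
    b_x = cell_x + sigmoid(t_x)    # center stays within the responsible cell
    b_y = cell_y + sigmoid(t_y)
    b_w = anchor_w * math.exp(t_w)  # width and height rescale the anchor box
    b_h = anchor_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h


# Zero raw outputs place the box at the cell center with the anchor's size.
box = decode_box(0.0, 0.0, 0.0, 0.0, cell_x=3, cell_y=4, anchor_w=2.0, anchor_h=5.0)
print(box)  # (3.5, 4.5, 2.0, 5.0)
```

Multiplying the decoded values by the grid stride would give pixel coordinates; that step is omitted because the stride of the proposed network is not stated in the text.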
Table 4: Layer wise activations of YOLOv2-SqueezeNet model
Table 5: Hyperparameters of YOLOv2-SqueezeNet model
Tab. 5 presents the hyperparameters that are selected to configure the proposed model for human gait classification, in which a mini-batch size of 14 is selected and 1000 epochs are used for model training because the model results are consistent at 1000 or more epochs.
Gait recognition is a great challenge due to complex recognition patterns, and it has been utilized in different fields such as machine learning, robotics, studying, biomedicine, visual surveillance, and forensics. Therefore, the Intelligent Recognition and Digital Security Group designed the CASIA (A, B & C) datasets at the National Laboratory of Pattern Recognition [38–44].
The presented study is implemented in MATLAB R2020a on a Core-i7 desktop computer with a 740 K Nvidia graphics card. A 0.5 hold-out validation is used for model training. The numbers of training and testing images are given in Tab. 6.
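A 0.5 hold-out validation can be sketched as a single random split of the samples into two halves. The helper below is illustrative only: the name `holdout_split`, the fixed seed, and the placeholder items are assumptions, and the authors' actual split procedure in MATLAB is not described beyond the 0.5 ratio.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def holdout_split(
    items: Sequence[T], train_fraction: float = 0.5, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the items once and split them into training and testing parts."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Placeholder sample identifiers (hypothetical), not real dataset files.
train_items, test_items = holdout_split(range(10))
print(len(train_items), len(test_items))  # 5 5
```

Every sample lands in exactly one of the two parts, so the testing images are never seen during training.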
Table 6: Description of training and testing number images in the corresponding datasets
In the developed framework, two experiments are implemented to analyze the performance of the proposed approach. The first experiment computes the performance of the YOLOv2-ONNX model, and the second experiment evaluates the classification results.
In this experiment, the feature vectors extracted using the Conv-BiLSTM model are passed to the softmax layer for the classification of different types of human gaits: the female/male classes of CASIA-A, the bag, wearing, and normal classes of CASIA-B, and the fast walk, slow walk, and normal walk classes of CASIA-C. Fig. 7 presents the performance of the proposed approach.
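The final softmax step turns the fully connected layer's raw class scores into probabilities and picks the most probable class. The sketch below shows this under stated assumptions: the function names are hypothetical, and the three scores are made-up toy values, not outputs of the proposed model.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities that sum to 1."""
    if not scores:
        raise ValueError("scores must not be empty")
    peak = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores: Sequence[float], class_names: Sequence[str]) -> Tuple[str, float]:
    """Return the most probable class name and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


# Toy scores (hypothetical) for the three CASIA-B classes named in the text.
label, prob = predict([2.0, 0.5, 0.1], ["bag", "wearing", "normal"])
print(label)  # bag
```

The probability returned alongside the label plays the same role as the predicted scores reported with each gait class.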
Figure 7: Training/testing results with respective loss rate (a) CASIA-A (b) CASIA-B (c) CASIA-C (blue line shows training, red shows loss rate, and dotted black line represents validation accuracy)
In Fig. 7, the proposed model achieved 1.00 validation accuracy (VA) on CASIA-A and CASIA-C datasets, while it achieved 0.96 VA on the CASIA-B dataset.
The classification outcomes are stated in Tabs. 7–9.
Tab. 7 shows the experimental results on the CASIA-A dataset, where the proposed method achieves 1.00 CPR on the two classes, female and male.
Table 7: Proposed method results for human gait recognition on the CASIA-A dataset
In Tab. 8, the CASIA-B dataset is considered for performance evaluation, where three classes, bag, wearing, and normal, are involved. The method achieved 0.92 CPR on the bag class, 1.00 CPR on the wearing class, and 0.88 CPR on the normal class.
Table 8: Proposed method results for human gait recognition on the CASIA-B dataset
The evaluation results in Tab. 9 show that the proposed method achieved 1.00 CPR on the classes of the CASIA-C dataset. The outcomes in Tabs. 7–9 show that the proposed model obtained a 1.00 correct recognition rate (CPR) on the CASIA-A and CASIA-C datasets, whereas on the CASIA-B dataset it obtained 0.92 CPR on the bag class, 1.00 CPR on the wearing class, and 0.88 CPR on the normal class. The predicted labels of human gait recognition are shown in Fig. 8. The comparison of the proposed approach with existing work is given in Tab. 10.
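The text does not define how the per-class CPR is computed. One plausible reading, assumed here, is the fraction of samples of each class that receive the correct label. The sketch below implements that reading with a hypothetical helper and made-up toy labels, not the paper's data.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_rate(
    true_labels: Sequence[str], predicted_labels: Sequence[str]
) -> Dict[str, float]:
    """Fraction of correctly predicted samples for each true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {cls: correct[cls] / totals[cls] for cls in totals}


# Toy labels (hypothetical): one of the two "bag" samples is misclassified.
toy_true = ["bag", "bag", "normal", "normal", "wearing"]
toy_pred = ["bag", "normal", "normal", "normal", "wearing"]
rates = per_class_rate(toy_true, toy_pred)
print(rates)  # {'bag': 0.5, 'normal': 1.0, 'wearing': 1.0}
```

Reporting the rate per class, as the tables do, exposes classes that are harder to recognize even when the overall accuracy looks high.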
Table 9: Proposed method results for human gait recognition on the CASIA-C dataset
Figure 8: Predicted labels on benchmark datasets
Table 10: Proposed approach results compared with recent approaches
Six recent state-of-the-art approaches are considered for performance evaluation on the benchmark datasets. In the comparison, the experimental setup of the existing work is also considered alongside the proposed work. Wang et al. [45] used an ensemble learning method for human gait classification on the CASIA-A and CASIA-B datasets and achieved 0.95 and 0.92 CPR, respectively. Wang et al. [46] utilized the LSTM model to learn the sequential patterns of the input images and achieved 0.95 CPR on the CASIA-B dataset. The results in Tab. 10 are compared with the latest methodologies and show that the performance of the proposed approach is superior. The proposed model results are better because the Conv-BiLSTM model provides strong feature vectors for the classification of different types of human gaits with maximum CPR, and it also provides good results on a limited range of input videos.
Table 11: Localization results
Figure 9: Localization results in terms of mAP and IoU (a) CASIA-C (b) CASIA-B (c) CASIA-A (d) IoU
Figure 10: Gait localization (a, d and g) original gait images (b, e and h) gait labels (c, f and i) prediction scores
Figure 11: Gait localization (a, d) original gait images (b, e) gait labels (c, f) prediction scores
The proposed YOLOv2-ONNX model is validated on CASIA-A, CASIA-B, and CASIA-C in terms of mean average precision (mAP), as reported in Tab. 11. The localization outcomes for the respective class labels are graphically depicted in Fig. 9. Tab. 11 shows that the proposed approach obtained mAP of 1.00, 0.91, and 1.00 on the bag, wearing, and normal classes of the CASIA-B dataset, respectively.
On the fast walk, slow walk, and normal walk classes of the CASIA-C dataset, the achieved mAP is 1.00, 0.70, and 0.95, respectively, whereas on the CASIA-A dataset the attained mAP is 1.00 and 0.822 on the female and male classes, respectively. The proposed method localizes the different types of human gaits more precisely, as illustrated in Figs. 10–12.
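Localization quality measures such as mAP rest on the intersection over union (IoU) between a predicted box and a ground-truth box. A minimal sketch of the IoU computation is given below; the corner-based box format (x1, y1, x2, y2) and the function name are assumptions made for this illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 <= x2, y1 <= y2


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extent along each axis; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
overlap = iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
print(round(overlap, 4))  # 0.1429
```

A detection is usually counted as correct when its IoU with the ground truth exceeds a chosen threshold; the threshold used in this study is not stated in the text.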
Figure 12: Gait localization (a, d) original gait images (b, e) gait labels (c, f) prediction scores
Fig. 10 shows maximum predicted scores of 0.979 on the bag class, 0.986 on the normal class, and 0.928 on the wearing class.
Figs. 10–12 reveal that the suggested approach obtained high predicted scores of 0.948 on the fast walk class, 0.972 on the female class, 0.955 on the slow walk class, and 0.978 on the male class.
Due to differences across multiple viewpoints of human gaits, HGR is a difficult task. Therefore, in this study, a tinyYOLOv2-SqueezeNet model is developed that localizes the different types of human gaits more accurately. The proposed method achieved mAP of 1.00, 0.91, and 1.00 on the bag, wearing, and normal classes of the CASIA-B dataset, respectively; 1.00, 0.70, and 0.95 mAP on the fast walk, slow walk, and normal walk classes of the CASIA-C dataset, respectively; and 1.00 and 0.82 mAP on the female and male classes of the CASIA-A dataset, respectively. Furthermore, this research investigates a feature extraction model based on Conv-BiLSTM that classifies human gaits more accurately. The experimentation is performed on the CASIA-A, B, and C datasets. The model achieves 1.00 CPR on the coat-wearing class, 0.92 CPR on the human with bag class, and 0.87 CPR on the normal class. The overall CPR across the three classes (wearing, bag, and normal) is 0.91. A CPR of 1.00 is achieved on the CASIA-A and CASIA-C datasets on all classes, such as female, male, human with a slow walk, human with a fast walk, and human with the bag. The computed results show that a combination of CNN and BiLSTM provides a higher recognition rate than the individual CNN or LSTM models. The performance of the proposed method depends on the selected number of features; however, some useful features may be ignored. Moreover, video sequences with low-quality resolution affect recognition accuracy.
Funding Statement: This research was supported by the Korea Institute for Advancement of Technology (KIAT) Grant funded by the Korea Government (MOTIE) (P0012724, The Competency, Development Program for Industry Specialist) and the Soonchunhyang University Research Fund.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.