Human activity detection and recognition is a challenging task. Video surveillance can benefit greatly from advances in the Internet of Things (IoT) and cloud computing. Devices based on the artificial intelligence of things (AIoT) form the basis of a smart city. This research presents intelligent dynamic gesture recognition (IDGR), which uses a convolutional neural network (CNN) empowered by edit distance for video recognition. The proposed system has been evaluated on AIoT-enabled devices for static and dynamic gestures of Pakistani sign language (PSL); however, the methodology can work efficiently for any type of video. The research concludes that deep learning and convolutional neural networks offer a highly suitable solution for retaining the discriminative and dynamic information of the input action. The proposed approach recognizes dynamic gestures by applying CNN-based image recognition to keyframes extracted from the human activity, and then uses edit distance to determine the word label to which a given set of frames belongs. The simulation results show that with 400 videos per human action, 100 epochs, and an image size of 234 × 234, the system achieves an accuracy of 90.79%, which is reasonable for a relatively small dataset compared to previously published techniques.
Video content (a sequence of 2D frames) is growing exponentially worldwide every year. As a result, much effort has gone into the image and video recognition domain. Video classification and video captioning are currently two major active research areas: video classification recognizes videos using their content, while video captioning produces a short description of them. Video classification is performed in the spatial domain as well as in the temporal domain, either separately or jointly. Convolutional neural networks (CNN) have given promising performance for analyzing image content, image recognition, detection, and retrieval. These networks can process millions of parameters and handle huge labelled datasets for learning. This has led to the testing of CNN in large-scale video classification, on static images as well as on the complex temporal evolution of videos.
Processing raw video sequences is not efficient, as they have very high dimensionality depending on image dimensions and video duration. A video is a sequence of images, and most images in a video do not contain any new information. These images keep repeating; usually, after approximately 10–15 frames, a new chunk of data appears in the video that is vital for action recognition [
CNN can perform well if the system works with reliable datasets and GPUs. Still, many issues remain in making the system robust and practical. Some of these problems are:
All recognition systems depend on an extensive collection of videos. In many situations, a large video training set may not be available, which places limitations on the use of CNN for recognition systems. We need to work with networks that can give good results on reasonably sized training data.
The recognition systems must be invariant to translation, rotation, and scaling; when dealing with video, invariance in 3D is needed.
The networks should be robust to low resolution, blurring, pose variations, illumination, and occlusion [
Decisions such as the number of layers, fully connected layers, dropouts, and max-pooling operations can affect the efficiency of the CNN [
To determine the performance of the network, we divide our video dataset into training, validation, and testing sets.
Early fusion methods combine input features from various modalities. The fusion is done immediately at the lowest possible level, the pixel level, and the network learns the correlations and interactions between modalities. Early fusion performs multimodal learning and usually requires the features from different modalities to be semantically aligned. It uses a single model for prediction, which implies that the model must suit all the modalities. The early, direct connectivity to pixel data allows the network to detect local motion speed and direction [
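To make this concrete, below is a minimal sketch of early fusion in Python with PyTorch (the paper does not name a framework, and the temporal window length and layer sizes here are illustrative assumptions): consecutive grayscale frames are stacked along the channel axis, so the very first convolution operates directly on pixel data from the whole temporal window.

```python
import torch
import torch.nn as nn

# Early fusion: T consecutive grayscale frames are stacked as input
# channels, so the first convolution sees raw pixel data from the whole
# temporal window and can pick up local motion speed and direction.
T = 10  # temporal window length (illustrative assumption)

early_fusion = nn.Sequential(
    nn.Conv2d(in_channels=T, out_channels=32, kernel_size=5, stride=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

clip = torch.randn(1, T, 234, 234)  # one fused clip of T frames
features = early_fusion(clip)       # motion-aware feature maps
print(features.shape)               # torch.Size([1, 32, 115, 115])
```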
The problem requires the use of low learning rates with gradient descent. On a slow computer, each step of this process takes a long time; a faster GPU can overcome this delay. Another way to handle this problem is to add more hidden layers, which help the network learn more complex arbitrary functions and predict future outcomes.
The paper is laid out as follows: Section 2 reviews previous work in this domain, Section 3 elaborates on the experimental work based on the proposed algorithm, Section 4 discusses the outcomes of the experiment, Section 6 compares the proposed system with existing techniques, and Section 7 concludes and suggests future work.
Kanehira et al. [
Cahuina et al. [
Panda et al. [
Singha et al. [
Zhu et al. [
The proposed technique is based on the convolutional neural network (CNN).
The layers in a CNN use the features learned by the preceding layers to recognize larger patterns. The classification layer combines them to group the images; its output size equals the number of classes in the target data. The sign language used in this research is Pakistan sign language (PSL).
The classification is done using the softmax function, $\sigma(z)_i = e^{z_i} / \sum_j e^{z_j}$, whose output assigns each input to its corresponding class. Accuracy is the proportion of true labels in the test data. Using the training data, the CNN learns the object's specific features and associates them with the corresponding category. Each layer receives data from the previous layer, processes it, and passes it on; the network learns image features on its own. The entire cycle starts with capturing the input video, dividing it into frames of size 1280 × 720, and selecting the keyframes. The input is a human action in the form of a video, which is converted into sequential frames.
The class to which these keyframes belong forms an output string, which is fed into the edit distance algorithm to find the closest matching word. The input layer is where we specify the size of the images extracted as keyframes from the input video, which in this case is 234, with a channel size of 1 as the images are in grayscale. The convolutional layer specifies the filter size, i.e., the height and width of the filters moved along the keyframe images during the training phase; the height and width of the filter may differ. Another parameter is the number of filters, which specifies the number of neurons connecting to the same output region, and this convolutional layer determines the number of feature maps. The stride is taken as 1 for the convolutional layer, and its learning rate is kept relatively low.
The constraint on the output size of a convolutional layer is $O = (W - F + 2P)/S + 1$, where $W$ is the input size, $F$ the filter size, $P$ the zero-padding, and $S$ the stride.
The inputs are zero-padded at the edges to help the filters fit near the edges. The number of zeros used in padding is another hyperparameter that can improve efficiency.
The number of channels in the filters must match the number of channels in the input. The convolutional layer outputs go into a nonlinear layer/stage, which acts like an activation function. The detector layer normally uses the sigmoid, hyperbolic tangent (tanh), or ReLU function to induce nonlinearity in the model. A CNN block consists of one convolutional layer, an activation function such as ReLU, and a pooling layer, combined to form a network layer. The output of these blocks is flattened and sent to a fully connected output layer.
The algorithm is as under:

Create a datastore of images, Store
Trainingpercentage ← 81
Testingandvalidationpercentage ← 19
ImageInputLayer ← 1
MaxPooling2dLayer ← 1
ClassificationLayer ← 1
Filtersize ← f
Number of filters ← n
Epochs ← defined number of epochs
Learningrate ← 0.00001
TrainTheSystem()
Accuracy ← trainingimagesmatched ÷ totalimages
Missrate ← trainingimagesmismatched ÷ totalimages
The input layer uses only grayscale pixel values and is of the sizes shown in
A filter of size 5 × 5 with no padding and stride s = 1 is used, followed by a maxpooling layer with a pooling stride of 2. The experiment is repeated with filters of size 3 × 3, 5 × 5, and 7 × 7. As the images are in grayscale, the channel size is one. The number of filters is varied from 20 to 30 in the experiment. A fully connected layer follows.
The ReLU function is used to introduce nonlinearity into the model.
A maxpooling layer is introduced with stride = 2.
At the end of the network, a softmax layer and a classification layer are used to compute the cross-entropy loss for the proposed solution.
The learning rate is kept as low as 0.00001.
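Under these settings, the network can be sketched as follows in Python with PyTorch. This is an illustrative reconstruction rather than the authors' code: the pool size of 2, the number of filters (20), and the number of classes are assumptions.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 20  # assumed: one class per image label
NUM_FILTERS = 20  # the experiment varies this from 20 to 30

class GestureCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # 234x234 grayscale input; 5x5 filters, no padding, stride 1 -> 230x230
        self.conv = nn.Conv2d(1, NUM_FILTERS, kernel_size=5, stride=1)
        self.relu = nn.ReLU()
        # 2x2 max pooling with stride 2 -> 115x115 (pool size assumed)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        # flattened feature maps go to a fully connected classification layer;
        # softmax is applied inside CrossEntropyLoss below
        self.fc = nn.Linear(NUM_FILTERS * 115 * 115, NUM_CLASSES)

    def forward(self, x):
        x = self.pool(self.relu(self.conv(x)))
        return self.fc(x.flatten(start_dim=1))

model = GestureCNN()
criterion = nn.CrossEntropyLoss()  # softmax layer + cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.00001)  # low learning rate

# one illustrative training step on a dummy batch
images = torch.randn(8, 1, 234, 234)
labels = torch.randint(0, NUM_CLASSES, (8,))
loss = criterion(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```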
Resolution | Dataset size | Epoch | Accuracy% | Miss rate% |
---|---|---|---|---|
72 × 72 | 200 | 15 | 45.91 | 54.09 |
72 × 72 | 200 | 50 | 55.69 | 44.31 |
72 × 72 | 400 | 50 | 87.06 | 12.94 |
72 × 72 | 400 | 100 | 89.22 | 10.78 |
90 × 90 | 200 | 100 | 85.4 | 14.6 |
90 × 90 | 300 | 100 | 87.4 | 12.6 |
90 × 90 | 400 | 100 | 90.4 | 9.6 |
100 × 100 | 200 | 100 | 86.35 | 13.65 |
100 × 100 | 300 | 100 | 87.6 | 12.4 |
100 × 100 | 400 | 100 | 89.15 | 10.85 |
120 × 120 | 200 | 15 | 70.33 | 29.67 |
120 × 120 | 200 | 50 | 86.34 | 13.66 |
120 × 120 | 300 | 100 | 88.38 | 11.62 |
120 × 120 | 400 | 50 | 89.59 | 10.41 |
120 × 120 | 400 | 100 | 90.53 | 9.47 |
234 × 234 | 300 | 100 | 88.69 | 11.31 |
234 × 234 | 400 | 15 | 80.69 | 19.31 |
Several video datasets are publicly available; however, for this research, the video dataset was formed by selecting 20 words from Pakistan sign language. Fifteen signers participated in preparing the videos, and approximately 400 videos were collected for every word gesture. The videos are preprocessed and passed through the video summarization process, and the resulting images are stored in subfolders under the corresponding word gesture folder. The images are in grayscale and of size 234 × 234. The dataset comprises 8,000 video clips across 20 categories, prepared for 20 words by 15 different signers, with a total duration of approximately 11.2 hours. We use these 400 × 20 videos by 15 signers to train, validate, and test the network.
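One possible way to organize and split such a dataset, sketched in Python with torchvision; the folder name psl_keyframes is hypothetical, and the 81/19 split follows the training algorithm in Section 3.

```python
import torch
from torchvision import datasets, transforms

# Each word gesture has its own folder of 234x234 grayscale keyframes,
# so the folder name doubles as the class label.
tfm = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((234, 234)),
    transforms.ToTensor(),
])
data = datasets.ImageFolder("psl_keyframes", transform=tfm)  # hypothetical path

# 81% training, 19% testing and validation, as in the training algorithm
n_train = int(0.81 * len(data))
train_set, heldout_set = torch.utils.data.random_split(
    data, [n_train, len(data) - n_train])
```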
The video recognition process has two main components: video summarization and image recognition. The process of video summarization consists of keyframe selection, meaningful clip selection, and output generation. The technique proposed here uses the mean and then the median of entropy. The mean is a very important measure in digital image processing; it is used in spatial filtering and is helpful for noise reduction. The mean of $k$ frames is defined as:

$$\bar{F}(x, y) = \frac{1}{k} \sum_{i=1}^{k} F_i(x, y)$$

Here, $F_i(x, y)$ denotes the pixel intensity of the $i$-th frame at location $(x, y)$.
A video summary is generated as under:
The input video is the video to be summarized; it may be in any standard format.
Frames are extracted from the video as a finite number of still images.
The feature extraction process can be based on features such as color, edge, or motion. Some algorithms use other low-level features such as color histograms, frame correlation, and edge histograms.
The video is summarized into keyframes using the Median of Entropy of Mean Frames method [
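The sketch below, in Python with NumPy, implements one possible reading of this selection rule: the video is split into fixed-length chunks, the mean frame of each chunk is computed, and the mean frames whose entropy reaches the median entropy are kept as keyframes. The chunk size and the 256-bin histogram used for the entropy estimate are assumptions, not details taken from the cited method.

```python
import numpy as np

def entropy(image):
    """Shannon entropy H = -sum(p * log2(p)) over an 8-bit intensity histogram."""
    hist, _ = np.histogram(image, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def select_keyframes(frames, chunk_size=10):
    """Split the frame list into chunks, average each chunk into a mean
    frame, and keep the mean frames whose entropy reaches the median
    entropy (one reading of the median-of-entropy selection rule)."""
    chunks = [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]
    mean_frames = [np.mean(chunk, axis=0) for chunk in chunks]
    entropies = np.array([entropy(m) for m in mean_frames])
    median_h = np.median(entropies)
    return [m for m, h in zip(mean_frames, entropies) if h >= median_h]
```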
The Levenshtein distance, or edit distance, is named after its inventor, Vladimir I. Levenshtein. It is the number of edits needed to convert a sequence A into another sequence B, where an edit is a substitution, insertion, or deletion. The Damerau–Levenshtein variation adds transposition of adjacent elements as an extra operation in the dynamic programming recurrence. It computes a matrix of distances between all prefixes of the two sequences.
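For reference, a standard dynamic-programming implementation of the edit distance in Python; converting "kitten" into "sitting", for example, takes three edits (two substitutions and one insertion).

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b: d[i][j] holds the
    distance between the prefixes a[:i] and b[:j]."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # i deletions turn a[:i] into the empty sequence
    for j in range(n + 1):
        d[0][j] = j  # j insertions build b[:j] from the empty sequence
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```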
PSL is a very rich sign language consisting of thousands of words. Video classification works precisely like image classification, as explained in Section 3.1. The video is summarized, and the individual frames are stored under the corresponding video category and frame number; at the same time, the labels of the summarized frames are stored in the repository DS. The CNN model is trained using the frames in the dataset. The process is repeated for all the words in the dictionary, and the summarized images are combined into folders containing similar images. In the test phase, every frame from the video summary is used to predict the folder label to which it belongs. The output string consisting of the folder labels is compared with the strings in DS, and the string with the minimum edit distance is chosen as the output. The algorithm for recognizing a dynamic gesture is given below.
Input: the video converted to a sequence of summarized frames
The datastore DS[m]: a dictionary of m words, each containing at most size images
The datastore dgestures containing L folders
Output: wordrecognized
∀ summarized frames: predict the folder label with the trained CNN, append it to the output string V, and return the word in DS with the minimum edit distance to V
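A minimal Python sketch of this loop, assuming a trained classifier predict_label (a hypothetical name) that returns a folder label for each keyframe, a datastore DS mapping each word to its stored label string, and the edit_distance function sketched above:

```python
def recognize_gesture(keyframes, DS, predict_label):
    """Classify every summarized keyframe, concatenate the predicted
    folder labels into a string V, and return the dictionary word whose
    stored label string has the minimum edit distance to V."""
    v = "".join(predict_label(frame) for frame in keyframes)
    return min(DS, key=lambda word: edit_distance(v, DS[word]))
```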
The dynamic sign recognition starts with image recognition. The labels of the recognized images help identify the dynamic gesture class, i.e., the complete word, using the edit distance algorithm. The video recognition is analyzed as follows:
Every recognition system faces four major problems: shadow, rotation, scaling, and mirror images, and must handle each of them. However, if we train the system on images and use a dataset of appropriate size, all these problems are automatically taken care of by the convolutional neural network.
As defined above, the edit distance is the number of edits needed to convert a sequence A into another sequence B. The output V from the algorithm "Recognize the Dynamic Gesture" in Section 4 is compared with all the words in the datastore, and the string with the minimum edit distance is chosen:
$$w^{*} = \arg\min_{w \in DS} ED(V, w)$$

where $ED(A, B)$ denotes the edit distance between sequences $A$ and $B$ and $w^{*}$ is the recognized word, and where the correct classification rate (CCR) and misclassification rate (MCR) reported below are defined as $CCR = \frac{\text{correctly recognized words}}{\text{total words}} \times 100$ and $MCR = 100 - CCR$.
Word | Total time taken by all words in DS (ms) | Time to find minimum (ms) | CCR % | MCR % |
---|---|---|---|---|
3 | 800 | 650 | 100 | 0 |
4 | 1000 | 814 | 99.63 | 0.37 |
5 | 1120 | 917 | 99.53 | 0.47 |
6 | 1300 | 1105 | 99.88 | 0.12 |
7 | 1410 | 1228 | 99.67 | 0.33 |
Average | | | 99.74 | 0.26 |
In PSL, gestures are usually 2–5 seconds long. The videos used to form DS are converted to images after passing through the video summarization process, and the image labels are stored along with the video label in the dataset DS. The video recognition process starts with the input of a gesture by the signer. The CNN recognizes the summarized images; as the images are quite complicated, the learning rate is kept very low, and the returned labels are stored in the form of a string sequence
Epochs | Mini-batch accuracy (%) | Miss rate (%) |
---|---|---|
1 | 35.16 | 64.84 |
17 | 67.19 | 32.81 |
34 | 82.81 | 17.19 |
50 | 82.03 | 17.97 |
67 | 92.97 | 7.03 |
84 | 94.53 | 5.47 |
100 | 92.19 | 7.81 |
150 | 93.75 | 6.25 |
200 | 96.09 | 3.91 |
The results can be improved by changing many factors, including the number of images per label, the image resolution, the learning rate, the filter size, and the number of filters. As an example, the following words were chosen to test the input video: cap, skirt, scarf, and gloves. The video title and the summarized image labels are stored in DS, and this is done for all the selected videos. For this research, almost 1000 words are selected; more words can be added to the datastore DS at an increased cost in training time and with little impact on testing time. As some of the gestures are repeated, the total number of classes, i.e., image labels, does not exceed a limit. Let us now test an input video, V, summarized to frames
The computational complexity of deep neural networks is determined by matrix multiplication, nonlinear transformation, and weight sharing; dropout helps keep the computational complexity in the polynomial-time domain. The training part of the proposed solution is the most time-consuming: it takes hours to obtain training results, whereas testing is of the order of a second. The system gives an accuracy of 90.03% on training data. The edit distance algorithm gives an accuracy of 99.99%; for the subset of words selected, it was found to be 99.74%, so the proposed system gives an overall accuracy of 90.79% on training data. This accuracy can be increased well above 91% by increasing the number of images per class in the dataset, the image resolution, and the number of epochs.
The results of the proposed solution were also compared with other existing techniques. The proposed technique achieves accuracy comparable to that provided by [
Technique name | Accuracy % |
---|---|
Two-Stream Convolutional Networks for Action Recognition in Videos [ | 79.34% |
Long-term Temporal Convolutions for Action Recognition [ | 80.5% |
Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Videos [ | 86.02% |
Convolutional Two-Stream Network Fusion for Video Action Recognition [ | 84.85% |
Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks [ | 71.31% |
The Proposed Dynamic Gesture Recognition Using CNN and Edit Distance | 90.79% |
This research is an effort to facilitate the deaf community and to provide an efficient touch-free interface to users of smart devices. The proposed technique has the advantage of giving good accuracy in a constraint-free environment, and the proposed methodology provides a framework for sign language recognition that can be materialized for any sign language. A larger dataset can also give better video recognition accuracy. A better algorithm for matching the combined output string of the image recognition stage, improving on edit distance, is left as future work, as is a detailed complexity analysis of the system.
Thanks to our families and colleagues, who supported us morally.