With the advancement of communication and computing technologies, multimedia applications involving video and images have become an integral part of the information society and are inextricably linked to people's daily productivity and lives. Interest in super-resolution (SR) video reconstruction techniques is growing accordingly. At present, the design of digital twins for video computing and video reconstruction faces a number of difficult issues. Although several SR reconstruction techniques are available in the literature, most do not consider the spatio-temporal relationship between video frames. Motivated by this, this paper presents VDCNN-SS, a novel very deep convolutional neural network (VDCNN) with spatiotemporal similarity (SS) model for video reconstruction in digital twins. The proposed VDCNN-SS technique maps the relationship between interconnected low-resolution (LR) and high-resolution (HR) image blocks. It also exploits the spatiotemporal non-local complementary and repetitive data among nearby low-resolution video frames. Furthermore, the VDCNN model is used to learn the LR–HR correlation mapping. A series of simulations was run to examine the performance of the VDCNN-SS model, and the experimental results demonstrated its superiority over recent techniques.
Over the last few decades, multimedia has become an increasingly important part of people's daily lives, driven by the rapid growth in internet users, which, according to International Telecommunication Union (ITU) figures, reached approximately 3.2 billion in 2015 [
One intriguing approach is to use today's massive multimedia data sources for the reconstruction step [
To create modules, the video frames are first summarized using a video summarization algorithm. The key idea proposed for summarizing the videos is to use discriminant Principal Component Analysis (d-PCA). Recently, the d-PCA concept [
Recently, deep learning (DL) techniques have produced efficient learning methods across a wide spectrum of AI, particularly for video and image analytics. These methods can extract knowledge from massive amounts of unstructured data and deliver data-driven solutions. They have made significant progress in a wide range of research applications and domains, including pattern recognition, audio-visual signal processing, and computer vision. Furthermore, DL and its improved methodologies are expected to be incorporated into future imaging and sensor schemes. These methods are commonly used in computer vision and, more recently, video analysis. Indeed, researchers in both industry and academia have developed DL methods that offer efficient answers to numerous video- and image-related difficulties. The primary goal of developing DL is to achieve greater detection accuracy than prior algorithms. With the rapid development of DL approaches and models such as long short-term memory, generative adversarial networks, deep neural networks, and recurrent neural networks, as well as the increased demand for visual signal processing efficiency, new possibilities are emerging in DL-based video processing, sensing, and imaging.
This research introduces VDCNN-SS, a novel VDCNN with spatiotemporal similarity (SS) model for video reconstruction in digital twins. The proposed VDCNN-SS technique models the relationship between interconnected low-resolution (LR) and high-resolution (HR) image blocks. It exploits the non-local complementary and repetitive data that are spatially and temporally distributed across nearby low-resolution video frames. The VDCNN model is utilized to learn the LR–HR correlation mapping, improving reconstruction speed while maintaining SR quality, so that HR video frames are obtained effectively and quickly. A thorough simulation analysis is performed to evaluate the SR video reconstruction performance of the VDCNN-SS technique, and the findings are examined in terms of several evaluation factors.
Mur et al. [
Sankaralingam et al. [
Kong et al. [
This study developed a new VDCNN-SS technique for digital twins that employs correlation mapping between the outer correlative blocks and the non-local paired and repetitive data in surrounding LR video frames to obtain higher-quality reconstruction results. During learning, the VDCNN-SS technique employs the VDCNN model to obtain the reconstruction variables between the LR and HR image blocks, which increases the SR speed. In addition, the curvelet transform (CLT) and structural similarity (SSIM) are used to provide spatiotemporal fuzzy registration and fusion across neighbouring frames at the subpixel level. As a result, the VDCNN-SS approach is highly responsive to complex motion processes and produces robust results. The complete working process is represented in
Correlation mapping learning is a method for learning the relationships between HR and LR video frames. Sparse coding avoids storing every patch extracted from the training set, reducing the burden of storage and computation. The approach comprises multiple parts, summarized below: patch extraction, sparse representation, correlation mapping, and reconstruction.
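The pipeline just outlined can be sketched as follows. This is a minimal illustration, not the paper's implementation: the patch size, stride, and the linear map `M` are hypothetical stand-ins for the learned VDCNN correlation mapping.

```python
import numpy as np

# Sketch of patch-based LR->HR correlation mapping (illustrative only;
# PATCH, STRIDE, and the linear map M are hypothetical stand-ins for
# the learned mapping described in the text).
PATCH, STRIDE = 4, 4

def extract_patches(frame, patch=PATCH, stride=STRIDE):
    """Slice a 2-D frame into flattened non-overlapping patches."""
    h, w = frame.shape
    patches = []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            patches.append(frame[y:y + patch, x:x + patch].ravel())
    return np.stack(patches)

rng = np.random.default_rng(0)
lr_frame = rng.random((16, 16))          # toy low-resolution frame
P = extract_patches(lr_frame)            # 16 patches of 16 pixels each

# A learned correlation mapping would replace this random matrix M:
# each LR patch vector is mapped to an HR patch estimate of equal size.
M = rng.standard_normal((PATCH * PATCH, PATCH * PATCH)) * 0.1
hr_patches = P @ M                       # mapped "HR" patch estimates
```

In the full method, the mapping is learned by the VDCNN rather than being a fixed matrix, and the reconstruction step reassembles the mapped patches into HR frames.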
These stages, namely (i) patch extraction and sparse representation, (ii) correlation mapping, and (iii) reconstruction, are each parameterized by their filter and patch sizes and formalized in the corresponding equations.
The VDCNN is an adaptive architecture, originally designed for text classification, that supports different depth levels (9, 17, 29, and 49 layers). The network begins with a lookup table, which generates embeddings for the input text and stores them in a two-dimensional tensor of size (f0, s). The number of input characters s is set to 1,024 and the embedding dimension f0 is set to 16. The following layer (3, Temp Convolution, 64) employs 64 temporal convolutions of kernel size 3, producing an output tensor of size 64 × s. Its key role is to match the lookup-table output with the input of the adaptive network segment assembled from convolution blocks. Each of the subsequent blocks is a series of two temporal convolution layers, each followed by a temporal batch normalization layer [
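The shape arithmetic of that first block can be checked with a small numpy sketch. This is only a shape demonstration under the stated sizes (f0 = 16, s = 1,024, 64 filters of kernel size 3, assuming "same" padding); the weights are random placeholders, not trained parameters.

```python
import numpy as np

# Shape sketch of the first VDCNN block: embeddings of size f0=16 over
# s=1024 characters pass through 64 temporal convolutions of kernel 3.
# Weights are random placeholders; only the tensor shapes matter here.
f0, s, n_filters, k = 16, 1024, 64, 3

rng = np.random.default_rng(1)
embedded = rng.standard_normal((f0, s))           # lookup-table output
weights = rng.standard_normal((n_filters, f0, k))

padded = np.pad(embedded, ((0, 0), (1, 1)))       # "same" padding for k=3
windows = np.stack([padded[:, t:t + k] for t in range(s)], axis=-1)
out = np.einsum('nfk,fks->ns', weights, windows)  # (64, s) output tensor
```

The (64, s) output then feeds the stacked convolution blocks with batch normalization described above.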
Another VDCNN is used to learn these parameters because the pair of biases and filters is critical in the reconstruction procedure. Smaller filter sizes, deeper layers, and extra filters could improve DL efficiency. The mean squared error (MSE), given in Eq. (2), is the cost function used at this stage. To minimize the cost function, the standard backpropagation (BP) technique combined with stochastic gradient descent is used to obtain the optimum variables {W, B}.
The variable pairs {W, B} were initialized with the help of a Gaussian function using the distribution
where i and l denote the iteration index and the layer, respectively. Since the variable pairs {W, B} obtained during the training procedure can substantially enhance the reconstruction performance and speed, this work selects this technique to learn the mapping relations between LR and HR frames and to create an intermediate estimated frame.
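The training rule described above, Gaussian initialization of {W, B} followed by gradient steps on the MSE cost, can be illustrated in miniature. This sketch uses a toy linear stand-in for the LR-to-HR mapping (all data and hyperparameters here are hypothetical), showing only that one gradient step on the MSE cost lowers the loss.

```python
import numpy as np

# Minimal sketch: W and B are initialized from a Gaussian distribution,
# then one gradient-descent step reduces the MSE cost on a toy linear
# stand-in for the LR->HR mapping (hypothetical data, not the paper's).
rng = np.random.default_rng(2)
X = rng.random((32, 8))                 # toy LR patch features
y = X @ rng.random(8)                   # toy HR targets

W = rng.normal(0.0, 0.01, size=8)       # Gaussian initialization of W
B = rng.normal(0.0, 0.01)               # Gaussian initialization of B
lr = 0.1                                # assumed learning rate

def mse(W, B):
    err = X @ W + B - y
    return float(np.mean(err ** 2))

loss_before = mse(W, B)
err = X @ W + B - y
W -= lr * 2 * X.T @ err / len(y)        # gradient step for the filters
B -= lr * 2 * float(np.mean(err))       # gradient step for the bias
loss_after = mse(W, B)
```

In the actual method, the gradients flow through all VDCNN layers via backpropagation rather than a single linear map.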
The intermediate video frames obtained from the LR–HR relation mapping technique consider only the relationship between the LR and HR image blocks within a single frame, and do not use the full spatio-temporal relational data between neighbouring video frames. This data, however, could help maintain the video's temporal consistency. Conventional fuzzy registration is based on the relationship between pixels in the neighbouring and target frames, typically defined as the weighted average of each adjacent pixel [
SR videos frequently contain objects with a variety of characteristics, whose edges may be continuous or discontinuous. These edge discontinuities can be examined and tracked using the CLT. In this method, distinct objects with their associated edge data are labelled as curvelets, which can be viewed through a multiscale directional transform. The CLT is implemented in both continuous and discrete domains. The translations and rotations of the polar wedge filter U characterize the continuous CLT, determined at scale 2^−j in
The curvelet is determined in
Depending on orientation and scale, the curvelet coefficients are grouped into different subbands, and the coefficients are calculated for all subbands. After calculating the curvelet coefficients, the normalized directional energy Ei (the energy of the ith subband) is calculated for every curvelet subband using the L1 norm, as displayed in
where ns denotes the total number of curvelet subbands.
where Ei denotes the energy of the ith subband coefficients and ci indicates the curvelet coefficients of subband i with dimension m × n.
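The normalized directional energy just defined can be computed directly. In this sketch the subband coefficient matrices are random placeholders standing in for actual curvelet coefficients; only the L1-norm energy and its normalization follow the definition above.

```python
import numpy as np

# Normalized directional energy: for each curvelet subband c_i (toy
# random matrices here, standing in for real curvelet coefficients),
# E_i is the L1 norm of its coefficients, normalized over all subbands.
rng = np.random.default_rng(3)
ns = 4                                              # number of subbands
subbands = [rng.standard_normal((8, 8)) for _ in range(ns)]

raw_energy = np.array([np.abs(c).sum() for c in subbands])  # L1 norms
E = raw_energy / raw_energy.sum()                   # normalized energies
```

By construction the normalized energies sum to one, so each Ei measures the ith subband's share of the total directional energy.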
For SSIM, assume two regions centered at pixels (i, j) and (k, l), denoted Rij and Rkl, respectively; their means μ, standard deviations σ, and the covariance between the two regions, cov(i, j, k, l), are calculated. Based on these values, the SSIM is obtained as displayed in
For each search region centered at (k, l) in the reconstructed frame, denoted Rkl, the neighbouring pixel points (i, j) in a local window R of predefined size are traversed and the SSIM between the two regions is calculated. Using a predetermined threshold, regions that are not sufficiently similar to Rkl are filtered out. Therefore, the CLT
where (i, j) represents the searched pixels in a non-local search region and C(k, l) represents the normalization constant determined by
where Rsr(k, l) represents the search region. After calculating the similarity among the video frames, the weight of the subregions can be obtained based on
where x(k, l) represents the previous energy value at pixel (k, l) and
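The region-wise SSIM and the threshold filtering described above can be sketched as follows. The stabilizing constants C1 and C2 take their conventional values, and the threshold of 0.5 is an assumed placeholder, not a value from the text.

```python
import numpy as np

# Region-wise SSIM for spatiotemporal registration: mean, standard
# deviation, and covariance of two regions are combined in the standard
# SSIM form. C1, C2 use conventional values (an assumption here).
C1, C2 = (0.01 * 255) ** 2, (0.03 * 255) ** 2

def ssim(r1, r2):
    mu1, mu2 = r1.mean(), r2.mean()
    var1, var2 = r1.var(), r2.var()
    cov = ((r1 - mu1) * (r2 - mu2)).mean()
    return ((2 * mu1 * mu2 + C1) * (2 * cov + C2)) / \
           ((mu1 ** 2 + mu2 ** 2 + C1) * (var1 + var2 + C2))

rng = np.random.default_rng(4)
Rkl = rng.random((8, 8)) * 255          # target region in the frame
Rij = Rkl + rng.normal(0, 5, (8, 8))    # a similar neighbouring region

# Regions whose SSIM falls below a predetermined threshold are filtered
# out of the non-local fusion (0.5 is a hypothetical threshold).
THRESHOLD = 0.5
keep = ssim(Rij, Rkl) >= THRESHOLD
```

Identical regions give an SSIM of 1, and the threshold test decides whether a candidate region participates in the weighted fusion.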
Grieves introduced the Digital Twin model and concept, whose conceptual module underpins Product Lifecycle Management [
Innovative technologies are paving the way for smart cities, in which every physical object has communication capabilities and embedded computing, allowing objects to sense the environment and interact with one another to provide services. Machine-to-Machine communication/IoT is another term for this intelligent interoperability and linkage [
The concept of a "digital twin" is the creation of a module of a physical asset for the purpose of forecasting maintenance. Using real-world sensory data, this method is typically used to anticipate the future state of the relevant physical resources in the operating environment, and it can discover and monitor potential threats posed by its physical counterpart. It is divided into three sections: (i) virtual products in virtual space, (ii) physical items in real space, and (iii) the connection between virtual and real products. As a result, collecting and evaluating a massive amount of manufacturing data in order to uncover the data and relationships becomes critical for smart manufacturing. General Electric has begun its digital transformation path, centred on the Digital Twin, by building critical jet engine modules that forecast the business results associated with the residual life of these modules. In this study, a new video reconstruction technique for digital twins is devised, which aids real-time performance.
The idea behind the extensive reference approach is to broaden and hand over the conceptual model while conveying the scientific fundamentals of video reconstruction standards to the domain of digital twins. The proposed concept focuses on "twinning" between the physical and virtual spheres. As a result, a digital twin model may be created using an abstract technique that includes all of the traits of, and completely explains, the physical twin at a conceptual level. Simulations are then performed based on the abstract model, allowing the physical twin to be captured and understood at an abstract level.
Based on the aforementioned procedures, the projected method is given below. The technique is separated into two phases: the training and reconstruction processes.
Training procedure. Input: the training sets
The proposed VDCNN-SS technique's SR video reconstruction performance is examined in terms of several factors.
Methods | Woman PSNR (dB) | Woman SSIM (%) | Sign PSNR (dB) | Sign SSIM (%) | Bird PSNR (dB) | Bird SSIM (%) | Beach PSNR (dB) | Beach SSIM (%)
---|---|---|---|---|---|---|---|---
Bicubic | 32.75 | 88.00 | 35.28 | 96.00 | 39.85 | 96.00 | 31.56 | 87.00
SelfExSR | 34.21 | 90.00 | 41.90 | 99.00 | 41.38 | 97.00 | 33.17 | 90.00
SRCNN | 34.09 | 90.00 | 40.29 | 98.00 | 41.37 | 97.00 | 32.92 | 89.00
VSRnet | 34.67 | 91.00 | 41.59 | 98.00 | 41.63 | 97.00 | 33.34 | 90.00
SPMC | 33.73 | 91.00 | 34.69 | 97.00 | 40.26 | 97.00 | 32.08 | 89.00
DF-ESR | 34.39 | 91.00 | 42.30 | 98.00 | 41.63 | 97.00 | 33.46 | 90.00
VDCNN-SS | 36.28 | 95.00 | 46.91 | 99.45 | 44.21 | 98.00 | 38.72 | 96.00
A series of simulations is run on a benchmark video dataset to further verify the improved performance of the proposed technique.
No. of frames | SPMC PSNR (dB) | DF-ESR PSNR (dB) | VDCNN-SS PSNR (dB)
---|---|---|---|
1 | 32.23 | 33.66 | 35.98 |
10 | 31.02 | 32.48 | 34.94 |
20 | 32.00 | 34.26 | 35.84 |
30 | 30.50 | 32.71 | 35.12 |
40 | 36.20 | 38.30 | 39.42 |
50 | 36.40 | 38.99 | 40.14 |
60 | 36.00 | 37.25 | 38.47 |
70 | 30.42 | 32.96 | 34.15 |
80 | 32.60 | 34.87 | 37.26 |
90 | 33.01 | 34.55 | 36.14 |
100 | 33.20 | 35.40 | 37.00 |
110 | 33.12 | 34.57 | 36.10 |
120 | 32.70 | 34.21 | 36.56 |
130 | 31.62 | 33.79 | 35.23 |
140 | 31.24 | 32.60 | 33.84 |
150 | 31.80 | 33.06 | 35.53 |
Simultaneously, under 70 frames, the VDCNN-SS methodology performed better with a PSNR of 34.15 dB, while the SPMC and DF-ESR methods fared worse with PSNRs of 30.42 and 32.96 dB, respectively. Concurrently, under 100 frames, the VDCNN-SS approach yielded a better result with a PSNR of 37.00 dB, whilst the SPMC and DF-ESR algorithms yielded lower results with PSNRs of 33.20 and 35.40 dB, respectively. Under 130 frames, the VDCNN-SS method yielded a greater result with a PSNR of 35.23 dB, while the SPMC and DF-ESR procedures yielded lower results with PSNRs of 31.62 and 33.79 dB, respectively. Finally, under 150 frames, the VDCNN-SS technique achieved the best result with a PSNR of 35.53 dB, while the SPMC and DF-ESR algorithms achieved the worst results with PSNRs of 31.80 and 33.06 dB, respectively.
A further series of simulations on a benchmark video dataset is performed to verify the improved performance of the proposed technique.
No. of frames | SPMC PSNR (dB) | DF-ESR PSNR (dB) | VDCNN-SS PSNR (dB)
---|---|---|---|
1 | 38.02 | 41.33 | 43.61 |
100 | 36.13 | 39.27 | 42.76 |
200 | 35.62 | 38.95 | 42.28 |
300 | 37.12 | 40.62 | 43.73 |
400 | 37.02 | 40.32 | 43.61 |
500 | 34.97 | 37.33 | 38.64 |
600 | 34.99 | 36.45 | 39.88 |
700 | 39.56 | 41.09 | 42.30 |
800 | 39.23 | 42.40 | 45.51 |
900 | 40.68 | 41.83 | 45.16 |
1000 | 41.76 | 44.22 | 47.55 |
1100 | 42.51 | 43.86 | 45.16 |
1200 | 44.02 | 47.50 | 48.84 |
1300 | 45.65 | 47.05 | 48.42 |
1400 | 45.95 | 47.24 | 50.84 |
1500 | 42.55 | 45.69 | 48.87 |
At the same time, under 700 frames, the VDCNN-SS strategy performed better with a PSNR of 42.30 dB, whilst the SPMC and DF-ESR strategies performed worse with PSNRs of 39.56 and 41.09 dB, respectively. Simultaneously, under 1000 frames, the VDCNN-SS approach yielded a better result with a PSNR of 47.55 dB, whilst the SPMC and DF-ESR procedures yielded lower results with PSNRs of 41.76 and 44.22 dB, respectively. Following that, under 1300 frames, the VDCNN-SS technique performed better with a PSNR of 48.42 dB, whilst the SPMC and DF-ESR approaches performed worse with PSNRs of 45.65 and 47.05 dB, respectively. Finally, under 1500 frames, the VDCNN-SS method fared better with a PSNR of 48.87 dB, while the SPMC and DF-ESR methods performed worse with PSNRs of 42.55 and 45.69 dB, respectively.
This research presented a novel VDCNN-SS technique for effective SR video reconstruction. The VDCNN-SS technique primarily employs the VDCNN model to acquire the reconstruction variables between the LR and HR image blocks, which increases the SR speed. Additionally, the CLT and SSIM are used. While the intermediate video frames obtained from the LR–HR relation mapping consider only the relationship between the LR and HR image blocks within a single frame, the spatiotemporal similarity stage exploits the relational data between neighbouring video frames. A comprehensive simulation analysis was performed to examine the SR video reconstruction performance of the VDCNN-SS technique, and the findings were analysed in terms of several evaluation factors. The testing results demonstrated the superiority of the VDCNN-SS technique over recent techniques. In the future, the pretrained reconstruction coefficients can be used to speed up the SR video reconstruction process. Furthermore, the reconstruction results can be improved by employing an optimization technique with SS.