Hong’an Li1, Min Zhang1,*, Dufeng Chen2, Jing Zhang1, Meng Yang3, Zhanli Li1
1 College of Computer Science and Technology, Xi’an University of Science and Technology, Xi’an, 710054, China
2 Beijing Geotechnical and Investigation Engineering Institute, Beijing, 100080, China
3 Xi’an Institute of Applied Optics, Xi’an, 710065, China
* Corresponding Author: Min Zhang. Email:
(This article belongs to this Special Issue: Advances in Edge Intelligence for Internet of Things)
Computer Modeling in Engineering & Sciences 2023, 135(1), 779-794. https://doi.org/10.32604/cmes.2022.022369
Received 07 March 2022; Accepted 27 May 2022; Issue published 29 September 2022
The Internet of Things (IoT) is growing in popularity and, owing to its diverse applications, has attracted strong interest from both practitioners and academics in the age of wireless communication. The significant increase in the number of individuals with chronic ailments has created an urgent need for innovative healthcare system models [1–3]. Consequently, IoT systems based on deep neural network models have been applied to medical treatment in recent years. The Internet of Medical Things (IoMT), based on computer-aided diagnosis, is developing rapidly and becoming more efficient [4–7]. Image color rendering based on deep learning can render a grayscale image into a color image, highlighting deep information and helping readers quickly grasp the information the image contains [8–11]. With the rapid development of computer graphics, computer vision, and related software and hardware technologies, methods that render existing images and enhance their details using high-quality image data have attracted extensive research attention and have gradually been applied to medical image processing and analysis in the IoMT [12–14].
At present, traditional color rendering methods require manual intervention and place high demands on the reference image. Color rendering methods based on deep learning instead train a neural network model on a corresponding high-quality image dataset, after which images can be rendered automatically by the model without being affected by human or other factors.
Image color rendering based on deep learning is generally divided into color rendering based on convolutional neural networks and color rendering based on generative adversarial networks [15,16]. Iizuka et al. used a fusion layer in a convolutional neural network to combine low-dimensional features and global features of an image to generate the image color. Zhang et al. designed an appropriate loss function to handle the multimodal uncertainty in color rendering and maintain the diversity of colors. Sangkloy et al. combined a scribble-based method with deep learning and trained a neural network on images with color strokes. He et al. selectively transferred colors from a reference image to the parts of a target image that matched it in semantic structure and content, and otherwise learned the colors from large-scale data. Xiao et al. proposed a deep example-based image rendering method using a dense encoding pyramid [22–25].
However, when extracting grayscale image features, the aforementioned methods adopt upsampling to make the image sizes consistent, which results in a loss of image information. The network structures thus constructed cannot effectively extract and recognize complex image features, so the rendering effect is relatively limited. Goodfellow et al. proposed the generative adversarial network (GAN). A GAN model based on unsupervised learning can approximate an arbitrary distribution, so it has a wide range of applications in image generation. However, it is subject to the well-known vanishing and exploding gradient problems due to the instability of the model, which degrades image rendering performance. On this basis, Mirza et al. proposed the conditional generative adversarial network (CGAN), which conditions on additional information, such as class labels or data from other modalities, fed into both the discriminator and the generator as an additional input layer. Isola et al. improved the CGAN to translate between image styles, for example, from grayscale to color images, from day to night images, and from line drawings to shaded images. Their pix2pix model has a powerful image translation capability and can learn a mapping between grayscale and color images to achieve color rendering.
Although GAN-based image color rendering methods can render images automatically, they suffer from problems such as blurred boundaries and unclear details. Moreover, unstable GAN models lead to low rendering quality. To address this, Arjovsky et al. replaced the original JS divergence with the Wasserstein distance to stabilize GAN training, and Gulrajani et al. added penalty terms to make GAN training more stable. Mao et al. used the least squares loss function to address vanishing gradients and enhance the stability of the model. However, the stability of these GAN models and the effectiveness of color rendering methods still need to be improved. Therefore, this paper proposes a hinge-cross-entropy GAN (HCEGAN) for automatic image rendering. First, a new hinge-cross-entropy loss function is proposed to stabilize GAN training. Second, an improved self-attention module is added to the GAN model to improve color rendering quality more quickly and effectively. Finally, a U-Net-based network structure with skip connections serves as the generator of the GAN model, and the PatchGAN from image translation serves as the discriminator. Experimental results show that the HCEGAN achieves a noticeably better rendering effect than the pix2pix model.
The hinge loss keeps the distance between incorrectly classified samples and correctly classified samples sufficiently large. If the score of the correct category exceeds the score of an incorrect category by at least a threshold, the error for that category is taken to be 0; otherwise, the shortfall is accumulated as loss. The loss function therefore does not simply require the correct category to have the highest score, but requires it to be higher by a certain margin. Even a correctly classified sample may incur a loss, because the score of the correct category may fail to exceed the others by the threshold.
Suppose we classify some input , whose label is , and predict through a function . Threshold generally takes a value of 1. The output is . Then, we predict the input as the th type. Therefore, the output is . Let represent the loss of each class, the loss of a sample is the sum of the losses of all classes. Then the calculation of the loss function of the output is shown in Fig. 1.
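For reference, the multi-class hinge loss described above is commonly written as follows. The notation is an assumption made here for illustration: $s_j$ is the score that the prediction function $f$ assigns to class $j$ for input $x_i$, $y_i$ is the true label, and $\Delta$ is the threshold.

```latex
L_i = \sum_{j \neq y_i} \max\left(0,\; s_j - s_{y_i} + \Delta\right),
\qquad s = f(x_i)
```

Each incorrect class contributes nothing once the correct class score leads it by at least $\Delta$, which is the "certain amount" the text refers to.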
Hinge loss is often used in binary classification problems. For output , threshold takes the value of 1. Then, we predict the hinge loss of as
When or , the classification result can be determined by the classifier, the loss value at this time is 0. When the predicted value , the classifier is uncertain as to the classification result and the loss value . Obviously, the loss is greatest when .
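As a minimal sketch of the binary case, the snippet below uses the standard textbook form of the hinge loss, max(0, margin - label * score), with labels in {-1, +1}. The function name `hinge_loss` and its parameter names are illustrative assumptions, not notation taken from this paper.

```python
def hinge_loss(label: int, score: float, margin: float = 1.0) -> float:
    """Binary hinge loss max(0, margin - label * score), label in {-1, +1}."""
    if label not in (-1, 1):
        raise ValueError("label must be -1 or +1")
    return max(0.0, margin - label * score)


# Confident, correct prediction: the margin is satisfied, so the loss is zero.
print(hinge_loss(+1, 2.0))   # 0.0
# Correct side of the decision boundary, but inside the margin.
print(hinge_loss(+1, 0.5))   # 0.5
# Wrong side of the decision boundary: the loss grows linearly.
print(hinge_loss(+1, -1.5))  # 2.5
```

Scores whose magnitude reaches the margin on the correct side incur no loss, while scores inside the margin are penalized even when their sign is correct.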
Cross-entropy is used in information theory to measure the difference between two probability distributions. The cross-entropy loss function is not only simple and effective, but also avoids the slowdown in learning that the mean square error loss function causes during gradient descent with the sigmoid function. The amount of information I carried by an event depends on the probability of that event and is expressed as
Entropy measures the uncertainty of a random variable. In the discrete case, information entropy can be expressed as
In the discrete case, if there are two probability distributions Q and P for the same random variable, we can use the KL divergence to measure the difference between these two distributions.
When the two distributions are identical, the KL divergence is zero. In general, the cross-entropy is expressed as the sum of the entropy and the KL divergence.
Then, in the case of binary classification, the cross-entropy loss of batch samples is
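The quantities named in this section have the following standard textbook definitions, collected here for reference. The symbols are assumptions chosen for illustration: $p(x)$ is the probability of an event, $P$ and $Q$ are discrete distributions over values $x_i$, $y_n \in \{0, 1\}$ is the label of sample $n$, $\hat{y}_n$ is its predicted probability, and $N$ is the batch size.

```latex
\begin{align}
I(x) &= -\log p(x) \\
H(P) &= -\sum_{i} P(x_i)\,\log P(x_i) \\
D_{\mathrm{KL}}(P \,\|\, Q) &= \sum_{i} P(x_i)\,\log\frac{P(x_i)}{Q(x_i)} \\
H(P, Q) &= H(P) + D_{\mathrm{KL}}(P \,\|\, Q) = -\sum_{i} P(x_i)\,\log Q(x_i) \\
L_{\mathrm{BCE}} &= -\frac{1}{N}\sum_{n=1}^{N}
  \left[\, y_n \log \hat{y}_n + (1 - y_n)\log\left(1 - \hat{y}_n\right) \right]
\end{align}
```

Because $H(P)$ does not depend on the model distribution $Q$, minimizing the cross-entropy $H(P, Q)$ is equivalent to minimizing the KL divergence.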
At present, the cross-entropy loss is mainly used for classification problems, combined with the sigmoid or softmax activation function. Its advantage is that learning is fast when the model performs poorly and slow when the model performs well. However, because the cross-entropy loss combined with the sigmoid or softmax activation function adopts an inter-class competition mechanism, the learned features are scattered and multiple features cannot be distinguished. Therefore, the only way to strengthen the model further is to keep optimizing the activation function.
The loss function in deep learning is mainly used to evaluate how much the predicted sample differs from the real sample. In general, the better the loss function, the better the model performance. The original GAN model scales the output of the discriminator network to a probability in [0, 1] with the sigmoid function and measures the cross-entropy loss of that probability. The minimax game of the GAN model thereby minimizes the Jensen-Shannon (JS) divergence between the model distribution and the target distribution. Many scholars use regularizers or alternative loss functions to minimize the difference between the model distribution and the target distribution and to ensure the stability of the GAN model. As the analysis of the two loss functions in Section 2 shows, the hinge loss function keeps incorrectly classified samples and correctly classified samples separated by more than a certain threshold, and the loss of samples that satisfy this margin remains 0. With the cross-entropy loss function, the closer the predicted output is to the real sample, the smaller the loss, and the closer the predicted probability is to 1. This trend is fully in line with actual needs. Conversely, the greater the difference between the predicted output and the real sample, the greater the loss value, that is, the greater the penalty on the current model, and the penalty grows non-linearly in a manner similar to exponential growth.
This behavior is mainly determined by the log function, which drives the model to make the predicted output closer to the real sample. Further, the hinge loss stops optimizing a sample once it satisfies the margin requirement, whereas the cross-entropy loss continues to be optimized. Hence, in general, the cross-entropy loss is better than the hinge loss. However, the cross-entropy loss mainly learns inter-class information. Because it adopts an inter-class competition mechanism, the model tries to learn features that separate the classes and considers only the accuracy of the predicted probability for the correct samples, ignoring the differences among the other, incorrect samples. The learned features are therefore scattered, and the output of the generator is suboptimal. Based on the advantages and disadvantages of the hinge and cross-entropy losses, we propose a hinge-cross-entropy loss function.
First, we define the loss function based on the pix2pix network model as follows:
where x is the input image, y is the expected output, G is the generator, and D is the discriminator. In addition to the adversarial loss of the CGAN model, the pix2pix model adds an L1 regularization loss multiplied by a weighting parameter λ to improve the training of the model.
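For reference, the pix2pix objective is commonly written as below, following the formulation of Isola et al. with the generator's noise input omitted for brevity. The symbols are the usual pix2pix notation and are assumed here: $x$ is the input image, $y$ the expected output, $G$ the generator, $D$ the discriminator, and $\lambda$ the weight of the L1 term.

```latex
\begin{align}
\mathcal{L}_{\mathrm{cGAN}}(G, D) &= \mathbb{E}_{x,y}\left[\log D(x, y)\right]
  + \mathbb{E}_{x}\left[\log\left(1 - D(x, G(x))\right)\right] \\
\mathcal{L}_{L1}(G) &= \mathbb{E}_{x,y}\left[\lVert y - G(x) \rVert_{1}\right] \\
G^{*} &= \arg\min_{G}\max_{D}\; \mathcal{L}_{\mathrm{cGAN}}(G, D) + \lambda\,\mathcal{L}_{L1}(G)
\end{align}
```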
Meanwhile, the hinge version of the adversarial loss in the GAN model is defined. Given the generated image, the loss functions of the generator and discriminator are, respectively:
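The hinge adversarial loss is commonly written in the following form in the GAN literature. The conditional notation is an assumption made here to match the pix2pix setting: $x$ is the input image, $y$ the expected output, and $\hat{y} = G(x)$ the generated image.

```latex
\begin{align}
\mathcal{L}_{D} &= \mathbb{E}_{x,y}\left[\max\left(0,\; 1 - D(x, y)\right)\right]
  + \mathbb{E}_{x}\left[\max\left(0,\; 1 + D(x, \hat{y})\right)\right] \\
\mathcal{L}_{G} &= -\,\mathbb{E}_{x}\left[D(x, \hat{y})\right]
\end{align}
```

The discriminator receives no further gradient from real samples scored above $+1$ or generated samples scored below $-1$, which is the margin behavior described for the hinge loss.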
On this basis, the hinge-cross-entropy loss function is defined as follows:
The loss function of the generator is the cross-entropy loss plus the L1 loss, the same as the generator loss of the pix2pix model. The loss function of the discriminator takes the form of the cross-entropy loss applied after the real or generated image has been processed by the hinge loss function. In this paper, the binary cross-entropy loss (BCE loss) is used, which is the special case of the cross-entropy loss for binary classification problems; its definition is otherwise the same. Because this is a binary classification task, there are only two possible outcomes, positive or negative, whose probabilities sum to 1, so only one probability needs to be predicted. In practice, a sigmoid function should be applied before the BCE loss to normalize the data so that the binary cross-entropy can be calculated. The hinge-cross-entropy loss function also does not require tuning the activation function to mitigate problems such as heavy computation and poor feature discrimination that arise with large classification datasets.
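The remark about applying a sigmoid before the BCE loss can be illustrated with a small dependency-free sketch. It computes the binary cross-entropy directly from raw discriminator scores (logits) in a numerically stable way. The function names are illustrative assumptions; this shows only the sigmoid-plus-BCE step, not the full hinge-cross-entropy loss.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_with_logits(logits, targets) -> float:
    """Mean binary cross-entropy of sigmoid(logit) against targets in {0, 1}.

    Uses the identity
        BCE(sigmoid(z), y) = max(z, 0) - z * y + log(1 + exp(-|z|)),
    which never evaluates log(0).
    """
    if len(logits) != len(targets) or not logits:
        raise ValueError("logits and targets must be non-empty and equal in length")
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# A logit of 0 means probability 0.5, so the loss is log(2) for either label.
print(bce_with_logits([0.0, 0.0], [1, 0]))
# The fused form agrees with applying the sigmoid first and then the BCE formula.
print(bce_with_logits([2.0], [1]), -math.log(sigmoid(2.0)))
```

In PyTorch, `torch.nn.BCEWithLogitsLoss` fuses the same two steps, while `torch.nn.BCELoss` expects probabilities that have already passed through a sigmoid.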
The function of the attention mechanism in computer vision is to enable a neural network model to learn to ignore irrelevant information and focus on important information. Usually, relevant features are used to learn a weight distribution, and the learned weights are then applied to the features to extract relevant knowledge further. The weights can be applied to the original image or along the spatial scale, the channel scale, and so on. The self-attention mechanism is a concept borrowed from natural language processing (NLP), and it has also achieved good results in computer vision tasks.
The basic structure of the self-attention module is shown in Fig. 2. The module is divided into three branches by a convolution: f, g, and h. First, the dot product is used to calculate the similarity between f and g to obtain the weights. Second, a softmax function is used to normalize these weights. Finally, the weighted sum of the weights and the corresponding h is calculated to obtain the final attention value. However, the weight parameter that scales this attention value needs to be initialized to 0, so that the model first relies on the original local features, then gradually increases the non-local weight, and finally applies self-attention in the generator and discriminator. If the initial weight is set to 0 and increased only gradually through model optimization, the generator learns slowly, which leads to slow iteration. Therefore, we need to adjust the learning method, especially the initialization matrix, to speed up the learning of the attention mechanism.
In this paper, we set the initial weight to a diagonal matrix whose diagonal values follow the standard normal distribution with a mean of 0 and a variance of 1, such as the following diagonal matrix A:
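A minimal sketch of such an initialization is given below. The matrix size `n`, the `seed` argument, and the function name are hypothetical choices for illustration; how the matrix is wired into the attention module is not shown here.

```python
import random


def diagonal_normal_matrix(n: int, seed=None):
    """Return an n x n matrix: N(0, 1) samples on the diagonal, zeros elsewhere."""
    if n <= 0:
        raise ValueError("n must be positive")
    rng = random.Random(seed)  # seeded generator, so the result is reproducible
    return [
        [rng.gauss(0.0, 1.0) if i == j else 0.0 for j in range(n)]
        for i in range(n)
    ]


A = diagonal_normal_matrix(4, seed=0)
for row in A:
    print(["{:+.3f}".format(v) for v in row])
```

Unlike an all-zero initialization, the non-zero diagonal entries let the attention branch contribute from the first training step.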
The calculation steps of self-attention mechanism are as follows. First, we aim to transform the feature map of the previous hidden layer into the feature space , and calculate attention, where , . Each element in the attention map represents how much the model pays attention to the th pixel when synthesizing the th pixel, where .
Then, we use as the weight of attention to calculate the output , where, , . , , and are the weight matrices learned after a convolution. Here, , , , .
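The two steps above match the widely used self-attention formulation for GANs, which is commonly written as follows. The notation is assumed here for illustration: $x_i$ is the feature vector at position $i$ of $N$ positions, $W_f$, $W_g$, $W_h$, and $W_v$ are the learned weight matrices, $\beta_{j,i}$ is the attention map, and $o_j$ is the output at position $j$.

```latex
\begin{align}
f(x) &= W_f x, \qquad g(x) = W_g x, \qquad h(x) = W_h x, \qquad v(x) = W_v x \\
s_{ij} &= f(x_i)^{\top} g(x_j) \\
\beta_{j,i} &= \frac{\exp(s_{ij})}{\sum_{i=1}^{N} \exp(s_{ij})} \\
o_j &= v\!\left(\sum_{i=1}^{N} \beta_{j,i}\, h(x_i)\right)
\end{align}
```

For each position $j$, the weights $\beta_{j,i}$ sum to 1 over $i$, so $o_j$ is a transformed weighted average of the features at all positions.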
Finally, we multiply the output of the attention layer by a scale parameter and add it back to the input feature map.
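The calculation steps can be sketched end to end in plain Python. This is a simplified illustration rather than the paper's implementation: features are stored as a list of per-pixel vectors, the four weight matrices are passed in explicitly, and the scale parameter is a single scalar `gamma`. All names are assumptions.

```python
import math


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(xs, w_f, w_g, w_h, w_v, gamma):
    """xs: list of N feature vectors, one per pixel position."""
    f = [matvec(w_f, x) for x in xs]
    g = [matvec(w_g, x) for x in xs]
    h = [matvec(w_h, x) for x in xs]
    out = []
    for j, x_j in enumerate(xs):
        # Similarity of every position i to position j (dot product of f and g).
        scores = [sum(a * b for a, b in zip(f_i, g[j])) for f_i in f]
        beta = softmax(scores)  # attention weights over i; they sum to 1
        # Weighted sum of the h branch over all positions.
        mixed = [sum(b * h_i[k] for b, h_i in zip(beta, h)) for k in range(len(h[0]))]
        o_j = matvec(w_v, mixed)
        # Scale the attention output and add it back to the input feature.
        out.append([gamma * o + x for o, x in zip(o_j, x_j)])
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 2.0], [3.0, 4.0]]
# With gamma = 0 the block is an identity mapping: only local features pass through.
print(self_attention(features, identity, identity, identity, identity, 0.0))
```

Setting `gamma` to 0 reproduces the input exactly, which is why a zero initialization makes the attention branch start with no influence.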
In this paper, we construct the generator of the hinge-cross-entropy GAN based on the U-Net structure, the encoder-decoder structure, and skip connections, as shown in Fig. 3. The generator contains 8 convolution layers and 8 deconvolution layers. In the figure, the orange modules are the improved self-attention modules, the blue modules are the convolution layers, and the lines represent skip connections. The U-Net structure is a symmetric U-shaped structure consisting of a compression path and an expansion path. Adding skip connections strengthens parameter transfer and error feedback in the deep neural network. In addition, to address long-range dependence and strengthen long-distance information, an improved self-attention module is added in the encoder, in front of each convolutional layer, to select feature information effectively. The input of the neural network consists of many vectors of different sizes, and there are certain relationships among them. However, these relationships cannot be fully exploited during actual training, which leads to poor results. The self-attention mechanism can establish correlations between multiple related inputs, allowing the network model to notice correlations between different parts of the entire input.
The discriminator of the hinge-cross-entropy GAN uses a 70 × 70 PatchGAN, which contains 4 convolution layers, each in the unit form of convolution, normalization, and LeakyReLU activation. Instead of scoring the whole image with a single value, PatchGAN uses an N × N matrix to evaluate the image, so that more regions can be attended to. The stride is 2, the padding is 1, the activation function is LeakyReLU, and BatchNorm is used. The whole GAN model plays a minimax game: the generator produces color images as close as possible to the expected image in order to deceive the discriminator. The grayscale image and the generated image are input into the discriminator at the same time, and the discriminator tries to distinguish the generated image from the expected image. During training, the generator and discriminator are trained alternately. After training is complete, the generator is used to generate the desired color image.
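The "70 × 70" in the PatchGAN name refers to the receptive field of each entry in the discriminator's N × N output. The sketch below computes it for the layer configuration of the reference pix2pix discriminator (kernel size 4 throughout; strides 2, 2, 2, 1, 1, counting the final output convolution). That configuration is an assumption taken from pix2pix and is not stated in this section.

```python
def receptive_field(layers):
    """Receptive field of one output unit.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf = 1
    # Walk backwards from the output: each layer widens the field it sees.
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


# Assumed pix2pix-style configuration: three stride-2 layers, then two stride-1 layers.
PATCHGAN_70 = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
print(receptive_field(PATCHGAN_70))  # 70
```

Each value in the output matrix therefore judges one 70 × 70 patch of the input, and overlapping patches together cover the whole image.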
The DIV2K dataset, a benchmark for single-image super-resolution, covers a wide range of content, including people, handmade objects, cities, and villages. It was used in the NTIRE 2017 Challenge, which drove the state of the art in single-image super-resolution. The DIV2K dataset contains 800 training images and 100 test images. The COCO2017 dataset is a later version of the Microsoft COCO dataset, which Microsoft funded in 2014. The COCO competition was one of the most closely followed and authoritative competitions in computer vision at that time. The COCO data used here cover six categories: fish, ladybug, orange, lion, dog, and bird, with 500 training images and 100 test images per category.
When the DIV2K dataset was used to verify the validity of the model, the experimental environment was a notebook computer running 64-bit Windows 10 with an Intel(R) Core(TM) i7-9750H CPU @ 2.5 GHz, Python 3.7, PyTorch 1.2, and CUDA 10.0. When the COCO dataset was used, the experimental environment was a desktop computer running 64-bit Windows 10 with an Intel(R) Core(TM) i9-10900X CPU @ 3.70 GHz, Python 3.7, PyTorch 1.7, and CUDA 10.2. In the experiments, the same parameters were used for all models under the PyTorch framework: the number of iterations was 200, the optimizer was Adam, and the learning rate was 0.0002. To reflect the color rendering quality of the different models, the experiments adopt the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) to evaluate the rendered images.
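As a minimal sketch of the first metric, PSNR can be computed from the mean squared error between a reference image and a rendered image. The flat-list image representation, the 8-bit peak value of 255, and the function name are assumptions for illustration; SSIM is more involved and is not sketched here.

```python
import math


def psnr(reference, rendered, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("images must be non-empty and have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [52, 55, 61, 66, 70, 61, 64, 73]
rendered = [54, 55, 60, 67, 69, 62, 64, 70]
print("PSNR: {:.2f} dB".format(psnr(reference, rendered)))
```

A higher PSNR means the rendered image is closer to the reference.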
Effect of self-attention module. To verify the effectiveness of the self-attention module, the model with the added module was compared with the original pix2pix model; the results on the DIV2K dataset are shown in Figs. 4b and 4c. Comparing the original color image with the rendering results of the pix2pix model and of the model with the self-attention module shows that the pix2pix model makes large rendering errors due to the instability of the GAN model, producing color pollution in the upper left corner of Fig. 4b. In contrast, the model with the self-attention module restores the real appearance of the image in both structure and color, and the overall tone is more harmonious. With the self-attention module added, the model can learn important features and suppress unimportant ones, which allows it to learn image information quickly. The quantitative results are shown in Table 1. After adding the self-attention module, SSIM and PSNR on the DIV2K dataset increased by 0.35% and 1.2 dB, respectively. On the COCO dataset, SSIM and PSNR improved in almost all categories. The self-attention module thus improves the model’s attention to important information by using the inherent information of the features for attention interaction, which improves the rendering quality. Adding the self-attention module to the model is therefore effective.
Effect of hinge-cross-entropy loss. To verify the effectiveness of the hinge-cross-entropy loss function, we added the hinge-cross-entropy loss to the pix2pix model and compared it with the pix2pix model with an added mean square error (MSE) loss. The results on the DIV2K dataset are shown in Figs. 4d and 4e. Comparing the original color image with the rendering results obtained with the hinge-cross-entropy loss function and with the MSE loss function shows that the hinge-cross-entropy loss function renders image details such as steps and roofs better and produces no color pollution. This is because the hinge-cross-entropy loss function inherits the advantages of both the hinge loss function and the cross-entropy loss function. It keeps misclassified samples and correctly classified samples separated by more than a certain threshold, and where the hinge loss function stops optimizing once samples satisfy the margin, the cross-entropy loss continues to be optimized.
Similarly, we used PSNR and SSIM to evaluate the rendered images. The experimental results are shown in Table 2, where Hinge stands for the hinge loss function, BCE for the cross-entropy loss function, and MSE for the mean square error loss function. After adding the hinge-cross-entropy loss function, PSNR and SSIM on the DIV2K dataset improved by 1.61 dB and 1.57%, respectively. In the six categories of the COCO dataset, PSNR increased by 0.51, 0.64, 0.33, 0.69, 0.56, and 0.24 dB, and SSIM increased by 2.24%, 2.74%, 2.04%, 2.00%, 3.7%, and 1.63%, respectively. This is because the partial derivative of the MSE loss function becomes very small when the output probability is close to 0 or 1, so the gradient may almost vanish at the beginning of training. As a result, the model learns very slowly at first, whereas using cross-entropy as the loss function does not lead to this situation. Therefore, compared with the models without the hinge loss function and with the added MSE loss function, the model using the hinge-cross-entropy loss function shows a clear improvement in rendering quality.
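The gradient argument can be checked numerically for a single sigmoid output unit. In this sketch, `z` is the pre-activation, `y` the target, the MSE loss is taken as 0.5 * (sigmoid(z) - y)^2, and the cross-entropy loss as the binary cross-entropy of sigmoid(z); the function names and test values are assumptions for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def grad_mse(z: float, y: float) -> float:
    """d/dz of 0.5 * (sigmoid(z) - y)^2; carries the factor a * (1 - a)."""
    a = sigmoid(z)
    return (a - y) * a * (1.0 - a)


def grad_bce(z: float, y: float) -> float:
    """d/dz of the binary cross-entropy of sigmoid(z); the factor a * (1 - a) cancels."""
    return sigmoid(z) - y


# Target is 1; the more negative z is, the more confidently wrong the unit is.
for z in (-6.0, -2.0, 0.0):
    print("z={:+.1f}  MSE grad={:+.5f}  BCE grad={:+.5f}".format(
        z, grad_mse(z, 1.0), grad_bce(z, 1.0)))
```

When the output saturates near 0 while the target is 1, the MSE gradient shrinks toward zero, whereas the cross-entropy gradient stays close to -1, so the cross-entropy loss keeps a strong learning signal exactly where the prediction is worst.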
Effect of improved self-attention module. To verify the effectiveness of the improved self-attention module, we added the self-attention module and the improved module to the pix2pix model and compared them with the original pix2pix model. The experimental results are shown in Fig. 5, where SA’ denotes the module whose initial weight is a diagonal matrix with diagonal values following the standard normal distribution (mean 0, variance 1), and SA” denotes the module whose initial weight is a diagonal matrix with diagonal values drawn as uniform random numbers between 0 and 1. Comparing the results before and after adding the hinge loss shows that, relative to the other models, the images rendered with the improved self-attention module (pix2pix+SA’ and SA’+Hinge+BCE) are closer to the original color image, with a better rendering effect, clearer details, and fewer rendering errors. This is because the improved self-attention mechanism reduces the dependence on external information and uses the inherent information of the features as much as possible for attention interaction. In addition, the self-attention mechanism can effectively capture long-distance feature dependence and extract the important information of the global context. In comparison, our proposed approach, SA’+Hinge+BCE, performs best: not only is the improved attention mechanism added, but the hinge-cross-entropy loss function also makes the gap between positive and negative samples large enough to produce an image that more closely resembles the real color image.
The experimental results are shown in Table 3. The rendering effect is better when the initial weight is the diagonal matrix whose diagonal values follow the standard normal distribution with a mean of 0 and a variance of 1, namely SA’. Compared with the original pix2pix model, the proposed method SA’+Hinge+BCE improved SSIM by 1.82% and PSNR by 1.76 dB on the DIV2K dataset. In the six categories of the COCO dataset, SSIM increased by 2.11%, 2.47%, 2.18%, 1.54%, 3.73%, and 0.43%, and PSNR increased by 0.41, 0.67, 0.27, 0.64, 0.44, and 0.29 dB, respectively. Therefore, the improved self-attention module and the hinge-cross-entropy loss function proposed in this paper are effective: they enhance the stability of the model to different degrees and improve the rendering effect of existing color rendering algorithms based on the GAN model.
Fig. 6 shows the results of our HCEGAN compared with the original pix2pix model on the COCO dataset. To improve performance, current color rendering algorithms based on deep learning inevitably stack modules, which makes the neural network deep and highly complex. This paper realizes image color rendering based on a GAN. Although a self-attention module is added to the model, the complexity of the system does not change much, and the new loss function further strengthens the stability of the model. Nevertheless, making the model lightweight while maintaining high performance remains a limitation at present, and future work will be carried out in this direction.
At present, image color rendering based on generative adversarial networks has helped the medical industry highlight and speed up the diagnosis of many diseases in the Internet of Medical Things. Color images are beneficial because they better highlight deep information in an image. To improve existing GAN-based color models for rendering grayscale images, this paper introduces a new hinge-cross-entropy loss function and an improved self-attention module, and proposes a new hinge-cross-entropy GAN. We use the DIV2K and COCO datasets to verify the effectiveness of the proposed method and its superiority to prior approaches. The experimental results show that our hinge-cross-entropy GAN model substantially improves rendering quality and effect, and that the stability of the model is also greatly improved. However, current GAN models have high complexity and are difficult to pre-train, and it is difficult to make a model lightweight while ensuring algorithm efficiency. In the future, we will focus on achieving high performance with GAN models while reducing model complexity. We also plan to extend the method to other tasks, such as style transfer and single-image super-resolution.
Acknowledgement: We extend our gratitude to the peer reviewers for their helpful comments on an earlier version of the paper.
Funding Statement: This study was funded by the National Natural Science Foundation of China (No. 61902311). The project was also supported in part by the Natural Science Foundation of Shaanxi Province in China under Grants 2022JM-508, 2022JM-317 and 2019JM-162.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.