Open Access

ARTICLE


PP-GAN: Style Transfer from Korean Portraits to ID Photos Using Landmark Extractor with GAN

Jongwook Si1, Sungyoung Kim2,*

1 Department of Computer AI Convergence Engineering, Kumoh National Institute of Technology, Gumi, 39177, Korea
2 Department of Computer Engineering, Kumoh National Institute of Technology, Gumi, 39177, Korea

* Corresponding Author: Sungyoung Kim. Email: email

(This article belongs to the Special Issue: Advanced Artificial Intelligence and Machine Learning Frameworks for Signal and Image Processing Applications)

Computers, Materials & Continua 2023, 77(3), 3119-3138. https://doi.org/10.32604/cmc.2023.043797

Abstract

The objective of style transfer is to maintain the content of an image while transferring the style of another image. However, conventional methods face challenges in preserving facial features, especially in Korean portraits, where elements such as the “Gat” (a traditional Korean hat) are prevalent. This paper proposes a deep learning network designed to perform style transfer that includes the “Gat” while preserving the identity of the face. Unlike traditional style transfer techniques, the proposed method aims to preserve the texture, attire, and the “Gat” of the style image by employing image sharpening and facial landmarks with a GAN. Color, texture, and intensity were extracted differently based on the characteristics of each block and layer of the pre-trained VGG-16, and only the elements needed during training were preserved using a facial landmark mask. The head area was defined using the eyebrow region so that the “Gat” could be transferred. Furthermore, the identity of the face was retained, and style correlation was considered based on the Gram matrix. To evaluate performance, we introduce a metric based on PSNR and SSIM that emphasizes median values through new weightings for style transfer in Korean portraits. Additionally, we conducted a survey evaluating the content, style, and naturalness of the transferred results; based on this assessment, our method preserves the integrity of the content better than previous research. Our approach, enriched by landmark preservation and diverse loss functions, including those related to the “Gat”, outperformed previous research in facial identity preservation.

Keywords


1  Introduction

With the advent of modern technologies such as photography, capturing the appearance of people has become effortless. Before such technologies existed, however, artists painted portraits of individuals. Because of the invention of photography, modern portraiture has become a new field of art, yet famous figures from the past have been handed down to us only through such paintings. The main purpose of portraits was to depict politically prominent figures [1], but in modern times this purpose has expanded to the general public. Although the characteristics of portraits vary greatly by period and country, most differ considerably from the actual appearance of their subjects unless they are hyperrealistic works. Korean portraits also differ considerably depending on time and region. Fig. 1a shows a representative portrait from the Goryeo Dynasty; it depicts Ahn Hyang, a Neo-Confucian scholar of the mid-Goryeo period. Fig. 1b is a portrait from the late Joseon Dynasty, and the two works show large differences in preservation condition and drawing technique. In particular, in Fig. 1b, the “Gat” on the head is clearly visible [2].


Figure 1: The left photo (a) is a portrait of Ahn Hyang (1243~1306) in the mid-Goryeo dynasty and the right photo (b) is a portrait of Lee Chae (1411~1493) in the late Joseon Dynasty

Prior to the Three Kingdoms Period, Korean portrait records were absent, and only a limited quantity of portraits were preserved during the Goryeo Dynasty [3]. In contrast, the Joseon Dynasty produced numerous portraits with different types delineated according to their social status. Furthermore, works from the Joseon era exhibit a superior level of painting, in which facial features are rendered in greater detail than in earlier periods.

A portrait exhibits slight variations in a person’s physical appearance, yet it uniquely distinguishes individuals, much like a montage. Modern identification photographs serve a similar purpose and are used on identification documents such as driver’s licenses and resident registration cards. Old portraits may pique interest in how one would appear in such artwork, and style transfer technology can be used for this. Korean portraits can provide the style for ID photos; however, the custom of wearing the “Gat” headgear makes transferring the style from Korean portraits to ID photos challenging for previous techniques. While earlier studies have transferred global or partial styles onto content images, for Korean portraits the distinct styles of texture, attire, and the “Gat” must be considered simultaneously. By independently extracting several styles from the style image, transferring the age, hairstyle, and costume of the person in a portrait onto an ID photo becomes possible. Fig. 2 shows results from the method presented in previous research [4]. The figure clearly highlights the significant challenges encountered when attempting style transfer with multiple styles using CycleGAN [5]. In this study, we introduce a method for high-quality style transfer of Korean portraits, which overcomes the limitations of previous research by accurately preserving facial landmarks and producing realistic results.


Figure 2: Results of style transfer from Korean portraits to ID photos using CycleGAN

Style transfer techniques, such as GAN, are commonly used based on facial datasets, but maintaining the identity of the person is crucial for achieving high-quality results. Existing face-based style transfer studies only consider facial components, such as eyes, nose, mouth, and hair, when transferring styles onto content images. In contrast, this study aims to transfer multiple styles, including Gats and costumes, simultaneously.

To accomplish this, we propose an enhanced GAN-based network for style transfer that generates a mask using landmarks and defines a new loss function to perform style transfer based on facial data. We refer to the proposed method, “Style Transfer from Korean Portraits to ID Photos Using Landmark Extractor with GAN,” as PP-GAN. The primary contribution of this study is the development of a novel approach to style transfer that considers multiple styles and maintains the identity of a person.

•   The possibility of independent and arbitrary style transfer to a network trained with a small dataset has been demonstrated.

•   This study is the first attempt at arbitrary style transfer in Korean portraits, which was achieved by introducing a new combination of loss functions.

•   The generated landmark mask improved the performance of identity preservation and outperformed previous methods [4].

•   New data on upper-body Korean portraits and ID photos were collected for this study.

In Section 2, previous studies related to style transfer are reviewed. In Section 3, the foundational techniques for the method proposed in this paper are explained. Section 4 delves into the architecture, learning strategy, and loss functions of the proposed method in detail. In Section 5, the results of the proposed method and experimental and analytical results through performance metrics are presented. Lastly, Section 6 discusses the conclusions of this research and directions for future studies.

2  Related Work

Research on style transfer can be categorized into two main groups: those based on Convolutional Neural Networks (CNN) and those based on Generative Adversarial Networks (GAN).

2.1 CNN-Based Previous Works

AdaIN [6] suggested a method of transferring style at high speed using statistics of the feature maps of content and style images; it is one of the earlier studies on style transfer. Huang et al. [7] used the correlation between the content feature map and the scaling information of the style feature map to fuse content and style. In addition, their order-statistics method, called “Style Projection”, demonstrated the advantage of fast training. Zhu et al. [8] avoided structural distortion and preserved content by presenting a style transfer network that preserves details. In addition, by presenting a refined network that modified VGG-16 [10], the style pattern was preserved via spatial matching of hierarchical structures. Elad et al. [9] proposed a new style transfer algorithm that extended texture synthesis work. It aimed to create images of similar quality and emphasized a consistent way of creating rich styles while keeping the content intact in the selected area. In addition, it was fast and flexible enough to process any pair of content and style images. Li et al. [11] suggested a style transfer method using low-level features to express content images in a CNN. Low-level features dominate the detailed structure of new images. A Laplacian matrix was used to detect edges and contours, yielding better stylized images that preserve the details of the content image and remove artifacts. Chen et al. [12] proposed a stepwise method based on a deep neural network for synthesizing facial sketches. It showed better performance by proposing a pyramid column feature that enriches local parts with texture and shading. Faster Art-CNN [13] is a structure for fast style transfer in feedforward mode while minimizing deterioration in image quality. It can be used in real-time environments as a method for training deconvolutional neural networks to apply a specific style to content images. Liu et al. [14] proposed an architecture that incorporates geometric elements into style transfer. This architecture can transfer textures into distorted images. In addition, because content/texture-style/geometry-style triples can be selected as input, it provides much greater versatility in the output. Kaur et al. [15] proposed a framework that realistically transfers the texture of a face from a style image to a content image without changing the identity of the original content image. Changes around the landmarks are gently suppressed to preserve the facial structure so that the texture can be transferred without changing the identity of the face. Ghiasi et al. [16] presented a study on neural style transfer capable of real-time inference by combining it with a fast style transfer network. It utilized a learning approach that predicts conditional instance normalization parameters for style images, enabling the generation of results for arbitrary content and style images.

2.2 GAN-Based Previous Works

APDrawingGAN [17] improved performance by combining global and local networks. High-quality results were generated by measuring the similarity between the distance transform and artist drawings. Xu et al. [18] used a generator and discriminator as conditional networks; a mask module for style adjustment and AdaIN [6] for style transfer then performed better than existing GANs. S3-GAN [19] introduced a style separation method in the latent vector space to separate style and content. A style-transferred vector space was created using a combination of the separated latent vectors. CycleGAN [5] proposed a method for transferring a style to an image without paired domains. While training the generator that maps X→Y, a reverse mapping Y→X is also trained. In addition, the cycle consistency loss was designed so that an input image and its reconstruction are identical once the transferred style is removed through the reverse mapping. Yi et al. [20] proposed a new asymmetric cycle mapping that forces the reconstruction information to be included only in selected facial areas. Portrait images were generated along with a localized discriminator for landmarks and a style classifier. Considering the style vector, portraits were generated in several styles using a single network. They attempted to transfer portrait styles, which is similar to the purpose of our study. However, in this study, not only the portrait painting style but also the Gat and costume are transferred together.

Some attempts have been made to maintain facial landmarks in style transfer studies aimed at makeup, aging, or other facial changes. In SLGAN [21], a style-invariant decoder was created by the generator to preserve the identity of the content image, and a new perceptual makeup loss was introduced, resulting in high-quality conversion. BeautyGAN [22] defined instance and perceptual losses to change the makeup style while maintaining the identity of the face, thereby generating high-quality images while preserving identity. Paired-CycleGAN [23] trained two generators simultaneously to convert the makeup styles of other people from portrait photos. The output of Stage 1, obtained through image analogy, was used as a paired input to Stage 2, which showed excellent results by enforcing identity preservation and style consistency against the Stage 1 output. Landmark-CycleGAN [24] addressed the incorrect results caused by distortion of the geometric structure when converting a face image into a cartoon image; to solve this problem, local discriminators using landmarks were proposed to improve performance. Palsson et al. [25] suggested Group-GAN, which consists of several CycleGAN [5] models, to integrate pre-trained age prediction models and solve the face aging problem. Wang et al. [26] proposed a method for interconverting edge maps with a CycleGAN-based E2E-CycleGAN network for aging. The aged face was generated using the identity feature map and the result of converting the edge map with the E2F-pixelHD network. Face-Dancer [27] proposed a model that transfers features from the face of a source image to a target image with the aim of face swapping. This involves transferring the identity of the source image while preserving the expressions and poses of the target image, which is significantly different from the method proposed in this paper. Although it claims to maintain identity, the results can differ greatly from the identity of the source image because they are influenced by the facial components of the target image. The key difference in our paper is that we propose a method that guarantees the identity of the source image itself while transferring the style of the target image.

3  Background

3.1 VGG-16

The VGG-16 [10] network is a prominent computer vision model that attained 92.7% Top-5 accuracy in the ImageNet Challenge. It receives a 224 × 224 RGB image as input and contains 16 layers in a configuration of 13 convolution layers and three FC layers. The convolution filters measure 3 × 3 pixels with stride and padding fixed at 1. The activation function is ReLU, and the pooling layers use 2 × 2 max pooling with a fixed stride of 2. The closer a layer is to the input, the more low-level information, such as color and texture, its feature maps contain; the closer it is to the output, the more high-level information, such as shape, they provide. The pre-trained VGG-16 [10] was used in this study to preserve facial and upper-body content and to transfer the style efficiently.

3.2 Gram Matrix

The Gram matrix is a valuable tool for representing the color distribution of an image. It enables the computation of the overall color and texture correlation between two images. Gatys et al. [28] demonstrated that style transfer performance can be improved using Gram matrices of feature maps from various layers. Fig. 3 illustrates the process of calculating the Gram matrix: each channel of the image is flattened into a 1D vector, and the resulting C × (H × W) matrix is multiplied by its transpose. The Gram matrix is therefore a square matrix whose dimension equals the number of channels. As the corresponding values in the Gram matrices of two images become more similar, the color distributions of the images also become more similar.


Figure 3: The process of calculating into a Gram matrix for Korean portraits
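As a concrete illustration of Fig. 3, the following minimal NumPy sketch computes a Gram matrix from a single (H, W, C) feature map; the function name and the 1/(H·W) normalization are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def gram_matrix(feature_map: np.ndarray) -> np.ndarray:
    """Gram matrix of a single (H, W, C) feature map, as illustrated in Fig. 3.

    Each channel is flattened into a 1-D vector of length H*W; the resulting
    (C, H*W) matrix is multiplied by its transpose, giving a (C, C) matrix of
    channel-wise correlations.
    """
    h, w, c = feature_map.shape
    flat = feature_map.reshape(h * w, c).T        # (C, H*W): one row per channel
    gram = flat @ flat.T                          # (C, C) correlation matrix
    return gram / (h * w)                         # normalization is an assumption

# Usage: two images whose Gram matrices are close have similar color/texture statistics.
# g_style, g_result = gram_matrix(style_features), gram_matrix(result_features)
```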

3.3 Face Landmark

Facial landmarks, such as the eyes, nose, and mouth, play a vital role in identifying and analyzing facial structures. To detect the landmarks, this study employed the 68-point shape predictor [29], which generates 68 (x, y) coordinates of the crucial facial components, including the jaw, eyes, eyebrows, nose, and mouth, and also provides the location of the face. The coordinates obtained from the predictor were then used to create masks for the eyes, nose, and mouth, as shown in Fig. 4.


Figure 4: Masks for eyes, nose, and mouth created by shape predictor 68 face landmarks
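A sketch of how such masks could be built with dlib's 68-point shape predictor and OpenCV is given below. The landmark index ranges follow the standard 68-point annotation (nose 27–35, eyes 36–47, mouth 48–67); the model-file path, function names, and the convex-hull filling are assumptions for illustration, not the paper's exact code.

```python
import cv2
import dlib
import numpy as np

# Standard 68-point annotation ranges: nose 27-35, eyes 36-47, mouth 48-67.
REGIONS = {"eye": list(range(36, 48)), "nose": list(range(27, 36)), "lip": list(range(48, 68))}

detector = dlib.get_frontal_face_detector()
# Path to the publicly distributed 68-landmark model (assumed location).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmark_masks(image_bgr: np.ndarray) -> dict:
    """Return one binary mask (0/255) per facial region: eyes, nose, lips."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    masks = {name: np.zeros(gray.shape, dtype=np.uint8) for name in REGIONS}
    faces = detector(gray, 1)
    if not faces:
        return masks                               # no face found: empty masks
    shape = predictor(gray, faces[0])
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)], dtype=np.int32)
    for name, idx in REGIONS.items():
        hull = cv2.convexHull(pts[idx])            # convex hull around the region's points
        cv2.fillConvexPoly(masks[name], hull, 255)
    return masks
```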

3.4 Image Sharpening

Image sharpening is a high-frequency emphasis filtering technique employed to enhance image details. High frequency is characterized by local changes in brightness or color, which are useful for identifying facial landmarks. Image sharpening can be achieved using high-boost filtering: the input image is multiplied by a constant A and a low-pass image is subtracted from it, producing a high-frequency-emphasized image, as shown in Eq. (1).

$$ g(x,y) = A\,f(x,y) - f_L(x,y) \tag{1} $$

Mean filtering is a low-pass filtering technique, and the coefficients of the high-boost filter can be determined using Eq. (2). The sharpening strength is controlled by the value of α, where α = 9A − 1. A high α value decreases the sharpening effect because the original image dominates the output. Conversely, a small α value reduces contrast because many low-frequency components are removed.

To have similar structures between portrait images and ID photos, portrait images are cropped around the faces, as the face occupies a relatively small area. In contrast, ID photos are resized so that they have the same size, both horizontally and vertically, instead of being cropped. However, this resizing can make extracting facial landmarks difficult. Therefore, image sharpening is performed in the present study. This process is necessary to ensure that facial landmarks are extracted well from ID photos, as shown in Fig. 5, where the difference in facial landmark extraction with and without image sharpening is illustrated.


Figure 5: Result of landmark mask generation according to the use of high boost filtering (The first and third columns are the original and the high boost filtered image, respectively, and the second and fourth columns show the masks with the detected landmark for each corresponding image)

$$ A\,f(x,y) - f_L(x,y) = A\begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} - \frac{1}{9}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} = \frac{1}{9}\begin{bmatrix} -1 & -1 & -1 \\ -1 & 9A-1 & -1 \\ -1 & -1 & -1 \end{bmatrix} \tag{2} $$

Substituting α = 9A − 1, the resulting (unnormalized) kernel is

$$ \begin{bmatrix} -1 & -1 & -1 \\ -1 & \alpha & -1 \\ -1 & -1 & -1 \end{bmatrix} $$
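A minimal OpenCV sketch of the high-boost filtering in Eq. (2) is shown below; the default value of A and the function name are assumptions, since the paper does not state the constant it used.

```python
import cv2
import numpy as np

def high_boost_sharpen(image: np.ndarray, A: float = 2.0) -> np.ndarray:
    """Apply the 3x3 high-boost kernel of Eq. (2); alpha = 9A - 1 controls sharpening."""
    alpha = 9.0 * A - 1.0
    kernel = (1.0 / 9.0) * np.array([[-1, -1,    -1],
                                     [-1, alpha, -1],
                                     [-1, -1,    -1]], dtype=np.float32)
    return cv2.filter2D(image, -1, kernel)         # same depth as the input image

# Landmarks are then extracted from the sharpened ID photo, as in Fig. 5.
```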

4  Proposed Method

4.1 Network

The primary objective of the proposed method is to transfer the style of Korean portraits to ID photos. Let X and Y denote the domains of ID photos (three-channel color images) and Korean portraits, respectively. These domains satisfy $X \subset \mathbb{R}^{H\times W\times C}$ and $Y \subset \mathbb{R}^{H\times W\times C}$, with $x \in X$ and $y \in Y$.

The CycleGAN [5] network is limited in performing this style transfer because it trains over the entire domain. Therefore, the proposed method adopts the Dual I/O generator from BeautyGAN [22], which has a stable discriminator that enables mapping training between two domains and style transfer. Additionally, the proposed method incorporates VGG-16, a Gram matrix, and a landmark extractor to improve performance. Fig. 6 depicts the overall structure of the proposed method.


Figure 6: Overall structure of the system proposed in this study

4.1.1 Generator

The generator is trained to perform the (X, Y)→(Y, X) mapping, producing fake images $G(x, y) = (x_y, y_x)$, where $x_y$ has the content of x and the style of y. Contents y and x are likewise used to generate the other fake image $y_x$. Although the network can generate results in both directions, this study focuses only on the $x_y$ results. The image recovered by the Dual I/O generator after style transfer must be identical to the input image. With an input size of (256, 256, 32), x and y each pass through three convolution layers, resulting in a size of (64, 64, 128). The x and y results are concatenated to produce a size of (64, 64, 256), which is restored to the original size through deconvolution layers, with the style transfer performed by nine residual blocks. This output is the style-transferred fake image and the result of the proposed method. The generator therefore learns to deceive the discriminator by generating fake images that appear real, which yields more natural, higher-quality results.

4.1.2 Discriminator

The network includes two discriminators that are trained to classify the styles of the real images and the fake images generated by the generator. Each discriminator consists of five convolution layers and aims to distinguish styles. The input image size is (256, 256, 3), and the output size is (30, 30, 1). The first four convolution layers (all but the last) apply spectral normalization to improve performance and maintain a stable distribution of the discriminator in a high-dimensional space. The discriminators are defined as follows: $D_x$ classifies $x_y$ as fake and y as real, whereas $D_y$ classifies $y_x$ as fake and x as real. Finally, PatchGAN [30] is applied so that the discriminator output is a patch-wise judgment of the input image.
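For illustration, a PyTorch sketch of a five-layer PatchGAN-style discriminator with spectral normalization on the first four convolutions is shown below (the paper itself was implemented in TensorFlow 1.x). The filter counts, kernel sizes, and strides are assumptions chosen so that a 256 × 256 × 3 input yields a 30 × 30 × 1 patch map, as described above.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class PatchDiscriminator(nn.Module):
    """Five-convolution PatchGAN-style discriminator; spectral normalization on the
    first four layers. Filter counts, kernel sizes, and strides are assumptions
    chosen so that a (3, 256, 256) input produces a (1, 30, 30) patch map."""

    def __init__(self, in_channels: int = 3, base: int = 64):
        super().__init__()
        def sn_conv(cin, cout, stride):
            return spectral_norm(nn.Conv2d(cin, cout, kernel_size=4, stride=stride, padding=1))
        self.model = nn.Sequential(
            sn_conv(in_channels, base, 2),   nn.LeakyReLU(0.2, inplace=True),  # 256 -> 128
            sn_conv(base, base * 2, 2),      nn.LeakyReLU(0.2, inplace=True),  # 128 -> 64
            sn_conv(base * 2, base * 4, 2),  nn.LeakyReLU(0.2, inplace=True),  # 64 -> 32
            sn_conv(base * 4, base * 8, 1),  nn.LeakyReLU(0.2, inplace=True),  # 32 -> 31
            nn.Conv2d(base * 8, 1, kernel_size=4, stride=1, padding=1),        # 31 -> 30
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.model(x)                       # each output element judges one patch
```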

4.2 Loss Function

In this study, we propose a loss formulation for transferring arbitrary Korean portrait styles to ID photos. Six loss functions, including newly proposed ones, are used to generate good results.

CycleGAN introduced the concept of feeding the result back into the generator through a cycle structure, which should theoretically reproduce the original image. Therefore, in this study, we define a cycle loss on the recovered result, designed to reduce the difference between the input and recovered images. In particular, the recovered images can be written as $G(G(x,y)) = G(x_y, y_x) = (x_x, y_y)$, where $x_x$ and $y_y$ should be identical to $x$ and $y$, respectively. This is expressed in Eq. (3).

$$ L_{cy} = \mathbb{E}_{x\sim P(X)}\left\| x_x - x \right\| + \mathbb{E}_{y\sim P(Y)}\left\| y_y - y \right\| \tag{3} $$

The existing style transfer method distorts the shape of the face geometrically, leading to difficulties in recognizing the face shape. To maintain the identity of the character, a new condition is required. Hence, this study defines land loss based on a face landmark mask, which helps in preserving the eyes, nose, and mouth while enhancing the performance of style transfer. Land loss is defined by mathematical expression Eq. (4) in this study.

$$ L_l = L_{l\_eye} + L_{l\_nose} + L_{l\_lip} \tag{4} $$

Land loss aims to maintain the landmark features between the input images and the outputs generated by the generator. The image pairs ($x_y$, x) and ($y_x$, y) contain the same content with different styles, so their landmark shapes are identical. The masks $M_f^X$ and $M_f^Y$, generated for the eye, nose, and mouth areas as discussed in Section 3, are used to isolate these regions. Using a pixel-wise operation, each eye, nose, and mouth area is masked with the corresponding face landmark mask, and the loss function is defined to minimize the difference in pixel values within these regions, as expressed in Eq. (5). The difference for each landmark is based on the L1 loss.

$$ L_f = \mathbb{E}_{x\sim P(X)}\left\| x_y \odot M_f^X - x \odot M_f^X \right\|_1 + \mathbb{E}_{y\sim P(Y)}\left\| y_x \odot M_f^Y - y \odot M_f^Y \right\|_1, \quad f \in \{l\_eye,\ l\_nose,\ l\_lip\} \tag{5} $$

The method proposed in this study differs greatly from previous style transfer research in that it requires some content of the style image rather than ignoring it and considering only the color relationship. In particular, for Korean portraits, the style of the Gat and clothes must be considered in addition to image quality, background, and overall color. However, the form of the Gat varies widely and is difficult to detect because of differences in wearing position, and the hair in Korean portraits and ID photos has completely different shapes. To address this, a head loss is proposed to minimize the difference between the head areas of the result and style images, with the head area divided into the Gat and hair areas, represented by the masks $M_{ht}$ and $M_{hr}$. Head loss uses the fact that the Gat does not cover the eyebrows; therefore, the feature points located at the top of the eyebrow coordinates are used to define the head area, which is then used to transfer the corresponding style to the resulting image. This is expressed in Eq. (6).

$$ L_h = \mathbb{E}_{x\sim P(X)}\left\| x_y \odot M_{ht} - y \odot M_{ht} \right\|_1 + \mathbb{E}_{y\sim P(Y)}\left\| y_x \odot M_{hr} - x \odot M_{hr} \right\|_1 \tag{6} $$
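The masked L1 terms of Eqs. (5) and (6) can be sketched as follows. This NumPy version handles a single image pair and omits the batch expectation; the helper names and mask dictionary keys are illustrative assumptions.

```python
import numpy as np

def masked_l1(a: np.ndarray, b: np.ndarray, mask: np.ndarray) -> float:
    """Mean absolute difference between two images restricted to a binary mask."""
    m = (mask > 0).astype(np.float32)
    if a.ndim == 3:                                # broadcast a 2-D mask over color channels
        m = m[..., None]
    return float(np.abs(a * m - b * m).mean())

def land_loss(x_y: np.ndarray, x: np.ndarray, masks_x: dict) -> float:
    """Eq. (5) for one image pair: sum of masked L1 terms over eye, nose, and lip masks."""
    return sum(masked_l1(x_y, x, masks_x[k]) for k in ("eye", "nose", "lip"))

def head_loss(x_y: np.ndarray, y: np.ndarray, gat_mask: np.ndarray) -> float:
    """Eq. (6), Gat-side term for one image pair: masked L1 between result and style image."""
    return masked_l1(x_y, y, gat_mask)
```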

To preserve the overall shape of the character and enhance the performance of style transfer, content loss and style loss are defined using specific layers of VGG-16. The pre-trained network contains both low- and high-level information, such as colors and shapes, depending on the layer location: low-level information is related to style, whereas high-level information is related to content and the overall image characteristics. Therefore, the content and style losses are configured according to these layer characteristics. Style loss is defined using a Gram matrix, which is obtained by computing the inner product of the feature maps. The best set of layers found through experiments is used to define the style loss, as shown in Eq. (7), where N and M denote the feature-map size and the number of channels of each layer, respectively, and g denotes the Gram matrix of the feature map. By training to minimize the difference in the Gram matrices of the feature maps in both directions ($x_y$ and $y_x$), the style of y can be transferred to x.

$$ L_s = \frac{1}{4N^2M^2}\sum_i \left[ \left(g_i(x_y) - g_i(y)\right)^2 + \left(g_i(y_x) - g_i(x)\right)^2 \right] \tag{7} $$

Content loss is defined to minimize the difference between feature maps at the pixel level. Because style transfer aims to maintain the content of an image while transferring the style, correlations need not be considered here. The content loss is given in Eq. (8). It is a critical factor in preserving the identity of a person; however, if its weight is too large, the style transfer results deteriorate. Therefore, appropriate hyperparameters must be selected to achieve the desired outcome.

$$ L_c = \mathbb{E}_{x\sim P(X)}\left[ l_i(x_y) - l_i(x) \right]^2 + \mathbb{E}_{y\sim P(Y)}\left[ l_i(y_x) - l_i(y) \right]^2 \tag{8} $$

The discriminator loss consists solely of the adversarial loss, following the GAN structure. The output of the discriminator is a 30 × 30 × 1 map, and each element is judged authentic or fake for its corresponding image patch, following PatchGAN [30]. The loss function used to train the discriminator is given by Eq. (9); it is reduced when the patches of $x_y$ and $y_x$ are classified as fake and the patches of x and y are classified as real. The adversarial term used during the training of the generator is defined in Eq. (10). It is the opposite of Eq. (9) and is designed as a metric for the generator to deceive the discriminator into classifying fake patches as real.

$$ L_D = \mathbb{E}_{x\sim P(X)}\left[ (D_x(y) - 1)^2 + (D_x(x_y))^2 \right] + \mathbb{E}_{y\sim P(Y)}\left[ (D_y(x) - 1)^2 + (D_y(y_x))^2 \right] \tag{9} $$

$$ L_{DG} = \mathbb{E}_{x\sim P(X)}\left[ (D_x(x_y) - 1)^2 \right] + \mathbb{E}_{y\sim P(Y)}\left[ (D_y(y_x) - 1)^2 \right] \tag{10} $$

The generator loss is composed of the cycle, land, head, style, content, and adversarial (generator-side) losses, as expressed in Eq. (11). Each loss is multiplied by a different hyperparameter, and the sum of the resulting terms is used as the loss function of the generator.

$$ L_G = \lambda_{cy}L_{cy} + \lambda_l L_l + \lambda_h L_h + \lambda_s L_s + \lambda_c L_c + \lambda_{DG}L_{DG} \tag{11} $$
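Eq. (11) amounts to a weighted sum of the individual terms. A minimal sketch is shown below, where the loss values are placeholders standing in for Eqs. (3)–(8) and (10), and the default weights are the hyperparameters reported in Section 4.3.

```python
def generator_loss(l_cy, l_l, l_h, l_s, l_c, l_dg,
                   lam_cy=50.0, lam_l=0.2, lam_h=0.5, lam_s=1.0, lam_c=0.1, lam_dg=1.0):
    """Weighted sum of the generator's loss terms, as in Eq. (11).
    The default weights are the hyperparameters reported in Section 4.3."""
    return (lam_cy * l_cy + lam_l * l_l + lam_h * l_h
            + lam_s * l_s + lam_c * l_c + lam_dg * l_dg)
```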

The total loss employed in this study is expressed by Eq. (12) and is composed of the generator and discriminator losses. The generator seeks to minimize the generator loss to generate style transfer outcomes, whereas the discriminator aims to minimize the discriminator loss to enhance its discriminative capability. A trade-off between the generator and discriminator performances is observed, where if one is improved, the other is diminished. Consequently, the total loss is optimized by forming a competitive relationship between the generator and discriminator, which led to superior outcomes.

$$ L_{Total} = \min_G \min_D \left( L_G, L_D \right) \tag{12} $$

4.3 Training

The experiments in this study were conducted on a multi-GPU system with GeForce RTX 3090 cards running Ubuntu 18.04 LTS. Because stock TensorFlow 1.x does not support the CUDA version required by this GPU, the experiments were carried out using NVIDIA-TensorFlow 1.15.4. Datasets of ID photos and Korean portraits were collected through web crawling using the Google and Bing search engines. To improve training performance, preprocessing was conducted to separate the face area from the whole body of the Korean portraits, which typically depict the entire body. Data augmentation techniques, such as horizontal flipping, blur, and noise, were applied to increase the limited number of samples. Gat preprocessing was also performed, as shown in Fig. 7, to facilitate feature mapping.


Figure 7: Examples of datasets preprocessing

Table 1 shows the resulting dataset, consisting of 1,054 ID photos and 1,736 Korean portraits divided into 96% training and 4% test sets. Owing to the limited number of portraits, a higher ratio of training data was used, and no data augmentation was applied to the test set. As the number of combinations that could be generated from the test data was substantial ($X_{Test} \times Y_{Test}$), the evaluation was not problematic. Previous research has emphasized the importance of data preprocessing, and the results of this study further support its impact on training performance.

[Table 1]

The proposed network was trained for 200 epochs using the Adam optimizer. The initial learning rate was set to 0.0001 and linearly reduced to zero after 50% of the training epochs for stable learning. To balance the loss terms, $\lambda_{cy}$ was set to 50, because the cycle loss takes relatively small values compared with the other losses. To increase the effect of style transfer, $\lambda_s$ and $\lambda_{DG}$ were set to 1, and $\lambda_h$ was set to 0.5, which helped the network concentrate on the head area during style transfer. Finally, training proceeded with $\lambda_c$ = 0.1 and $\lambda_l$ = 0.2. The entire training process took approximately 6 h 30 min. The results are presented in Fig. 8, which visually confirms that the proposed method improves on previous research [4]. While previous methods focused only on style transfer, this study successfully maintains the identity of the person while transferring the style. The results show the style being transferred while the shape of the character in the content image is preserved. Additionally, the identity of the person is preserved, and the Gat is transferred naturally.
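The learning-rate schedule described above (constant for the first half of the 200 epochs, then linearly decayed to zero) can be sketched as a plain Python function; the function name and the handling of the final epoch are assumptions.

```python
def learning_rate(epoch: int, total_epochs: int = 200, base_lr: float = 1e-4) -> float:
    """Constant learning rate for the first half of training, then linear decay to zero."""
    half = total_epochs // 2
    if epoch < half:
        return base_lr
    return base_lr * (total_epochs - epoch) / float(total_epochs - half)

# Example: learning_rate(0) == 1e-4, learning_rate(150) == 5e-5, learning_rate(200) == 0.0
```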


Figure 8: The result of the method proposed in this paper

5  Experiments

5.1 Feature Map

For the style loss, this study adopted Conv2_2 and Conv3_2 of VGG-16, whereas Conv4_1 was used for the content loss. Although the early convolution layers contain low-level information, they are sensitive to change and difficult to train because of the large differences in pattern and color between the style and content images. To overcome this problem, this study uses feature maps located in the middle of the network to extract low-level information for style transfer. The results of using feature maps not adopted for the style loss are presented in Fig. 9, with smoothing set to 0.8 in TensorBoard. Loss graphs are shown for only ten epochs because training failed when layers not used for the style loss were employed in the experiment.
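A sketch of extracting the named layers with torchvision is given below for illustration (the paper used a pre-trained VGG-16 in TensorFlow). The indices assume torchvision's standard VGG-16 layout, where conv2_2, conv3_2, and conv4_1 correspond to features[7], features[12], and features[17].

```python
import torch
import torchvision.models as models

# Indices into torchvision's vgg16().features (standard layout assumed):
# conv2_2 -> 7, conv3_2 -> 12, conv4_1 -> 17.
STYLE_LAYERS = {"conv2_2": 7, "conv3_2": 12}
CONTENT_LAYERS = {"conv4_1": 17}

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)                        # VGG is frozen; used only to compute losses

def extract_features(x: torch.Tensor, wanted: dict) -> dict:
    """Run x through VGG-16 and collect feature maps at the requested layer indices."""
    names = {idx: name for name, idx in wanted.items()}
    feats, out = {}, x
    for i, layer in enumerate(vgg):
        out = layer(out)
        if i in names:
            feats[names[i]] = out
        if i >= max(names):                        # stop once the deepest wanted layer is reached
            break
    return feats
```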


Figure 9: Result of a specific feature map experiment of VGG-16 for style loss

•   The Conv2_1 layer exhibits a large loss value and unstable behavior during training, indicating that training may not be effective for this layer.

•   Conv1_2 is close to zero for most of the losses, but it cannot be said that training proceeds, because the maximum and minimum differ by more than $10^4$ times owing to the very large loss on some data.

•   Conv1_1 exhibits a high loss deviation and instability during training, similar to Conv2_1 and Conv1_2. Moreover, owing to its sensitivity to color, this layer presents challenges for training.

If Conv4_1 is used as the style loss layer, it transfers the style of the image content. However, because this feature map scarcely includes style-related information, the generator may produce images lacking style. Nevertheless, transferring the style of the background is feasible, as the background corresponds to the overall style of the image, while the clothing style is treated as content because it is not a prominent feature at this depth. Therefore, high-level layers such as Conv4_1 contain only the background style of the character content. The result of using the feature map employed in the content loss for the style loss is shown in Fig. 10. In general, most of the content image is preserved, whereas the style is only marginally transferred. Hence, we proceed with the trainable layers, which results in stable training and enables us to transfer styles while conserving content.


Figure 10: Results when the feature map used for Style Loss is used with the same layer as Content Loss (Column 1: input image; Column 2 and Column 3: output image using Conv4_1)

5.2 Ablation Study

An ablation study was conducted on four loss functions, excluding the cycle loss, to demonstrate the effectiveness of the loss functions proposed for the generator loss. The results are presented in Fig. 11, where each row uses the same content and style images. If $L_c$ is excluded, the character shape is not preserved, leading to poor results because training concentrates on style; consequently, only the facial components are transferred, driven by $L_l$. When $L_c$ and $L_l$ are excluded simultaneously, the style transfer outcome lacks facial components. Similarly, when $L_s$ is excluded, the style transfer result is of poor quality, with the character remaining almost unchanged. The use of $L_{cy}$ allows the background style to be transferred without a separate style loss; however, because training then focuses mainly on the character shape, style transfer barely occurs, and the Gat is created only when $L_h$ is used. Excluding $L_l$ leads to unclear, blurred facial components, and the face color becomes bright; therefore, $L_l$ plays a crucial role in preserving the character identity by making the facial components more apparent. If $L_h$ is excluded, the head area becomes blurred or is not created, leading to unsatisfactory style transfer results. Unlike the overall style transfer, the Gat must be newly generated, so $L_s$ serves a different purpose here; the head area must be set separately, and $L_s$ can be used together with $L_h$ to achieve this. Consequently, using all the loss functions proposed in this study yields the best performance, generating natural images without bias in any direction.


Figure 11: Results highlighting the importance of the various loss functions. Columns 1 to 4: results excluding the loss functions $L_c$, $L_s$, $L_l$, and $L_h$, respectively; Column 5: results generated using all loss functions

5.3 Performance Evaluation

This study conducts a performance comparison with previous research [4] on the same subject, as well as an ablation study. Although there are diverse existing studies on style transfer, the subject of this paper differs significantly from them, so comparison is made only with a single previous study [4].

Because the previous method is based on CycleGAN [5] and cannot perform arbitrary style transfer, a paired evaluation could not be performed. Thus, an evaluation survey was conducted with 59 students of different grades from the Department of Computer Engineering at Kumoh National Institute of Technology to evaluate the performance of the proposed method in terms of three items: transfer of style ($S_{st}$), preservation of content ($S_{cn}$), and generation of natural images ($S_{nt}$). The survey was conducted online for 10 days using Google Forms, and the respondents received and evaluated a combination of 10 results from the previous study [4] and 10 results from the proposed method. The survey results are presented in Table 2. They show that the proposed method outperforms the previous method [4] in all three aspects, with the largest difference in the scores for preserving character content. The previous method [4] failed to preserve the shape of the character during style transfer, leading to blurring or disappearance of facial landmarks, which was not natural. In contrast, the proposed method successfully preserved the character content and effectively transferred the style, producing relatively natural results. Thus, the proposed method showed better overall performance than the previous method [4].

[Table 2]

Peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) are commonly employed to measure performance. However, when conducting style transfer while preserving content, it is crucial to ensure a natural outcome without significant bias towards either the content or the style. Consequently, to compare performance, we propose new performance indicators based on a weighted arithmetic mean in which the median values of the PSNR and SSIM receive greater weight. The final result is obtained by combining the content and style results using the proposed indicator.

The PSNR is commonly used to assess the quality of images after compression by measuring the ratio of the maximum value to the noise, as calculated in Eq. (13). The denominator inside the logarithm is the average of the squared differences between the original and compressed images; a lower denominator indicates a higher PSNR and better preservation of the original image. By contrast, the SSIM evaluates distortions in similarity between a pair of images by comparing their structural, luminance, and contrast features. Eq. (14) is used to calculate the SSIM, which involves probability-related quantities such as the mean, standard deviation, and covariance.

$$ PSNR(A,B) = 10\log_{10}\!\left( \frac{MAX^2}{\overline{(A-B)^2}} \right) \tag{13} $$

$$ SSIM(A,B) = \frac{(2\mu_A\mu_B + C_1)(2\sigma_{AB} + C_2)}{(\mu_A^2 + \mu_B^2 + C_1)(\sigma_A^2 + \sigma_B^2 + C_2)} \tag{14} $$
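The two metrics can be computed with scikit-image as sketched below; the function and parameter names are scikit-image's (channel_axis assumes a recent version), and the pairing of a result against both its content and its style image follows Table 3. The helper name is an illustrative assumption.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def content_style_scores(result: np.ndarray, content: np.ndarray, style: np.ndarray):
    """PSNR/SSIM of a transferred result against its content and style images
    (P_Content, P_Style, S_Content, S_Style in Table 3); images are uint8 RGB arrays."""
    p_content = peak_signal_noise_ratio(content, result, data_range=255)
    p_style = peak_signal_noise_ratio(style, result, data_range=255)
    s_content = structural_similarity(content, result, channel_axis=-1, data_range=255)
    s_style = structural_similarity(style, result, channel_axis=-1, data_range=255)
    return p_content, p_style, s_content, s_style
```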

To evaluate performance, the indicator values are sorted in ascending order, giving a sequence $[x_1, x_2, x_3, x_4, x_5]$, where $x_3$ is the median. The five values correspond to the results obtained when each specific loss is excluded and when all losses are used. As our goal is to find an optimal combination, it is crucial to focus on the median value. The weight vector w is designed to emphasize the significance of the median: it gives the median twice the weight of its adjacent values and five times the weight of the values two steps away. This accentuates the importance of the central point while gradually diminishing the significance of the surrounding data points, which is beneficial in analyses that prioritize the center of the data. Accordingly, a weight vector of w = [10, 25, 50, 25, 10] is used, and the weighted arithmetic mean is calculated using Eq. (15). Performance is evaluated using Eq. (16), which sums the squared differences between the weighted arithmetic mean and the PSNR or SSIM values; smaller values indicate better performance, whereas results biased to one side yield large values. Finally, the sum of the squared errors for content and style ($E_{PSNR}$, $E_{SSIM}$) is used as the final indicator for performance evaluation.

$$ w_{avg} = \frac{\sum_{i=1}^{5} x_i w_i}{\sum_{i=1}^{5} w_i} \tag{15} $$

$$ E_d = \sum_i \left( w_{avg} - x_i \right)^2, \quad d \in \{\text{content},\ \text{style}\} \tag{16} $$
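A minimal NumPy sketch of Eqs. (15) and (16) is given below, assuming the five indicator values are supplied as a list; the summation over the sorted values follows the description that the squared deviations from the weighted mean are added.

```python
import numpy as np

WEIGHTS = np.array([10, 25, 50, 25, 10], dtype=float)   # median-emphasizing weights

def median_weighted_error(values) -> float:
    """Eqs. (15)-(16): weighted mean of the five sorted values, then the summed
    squared deviation of the values from that mean (smaller is better)."""
    x = np.sort(np.asarray(values, dtype=float))          # ascending; x_3 is the median
    w_avg = float(np.sum(x * WEIGHTS) / np.sum(WEIGHTS))  # Eq. (15)
    return float(np.sum((w_avg - x) ** 2))                # Eq. (16)
```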

Table 3 shows the results of the proposed performance indicators based on PSNR and SSIM, evaluated over 1,452 generated results. The PSNR values against the content and style images are denoted $P_{Content}$ and $P_{Style}$, respectively, and $S_{Content}$ and $S_{Style}$ denote the corresponding SSIM values. $E_{Content} + E_{Style}$, the sum of the squared errors for content and style, is used as the final metric; here the “+” symbol refers to the combined value of the content-based and style-based terms used as the performance metric in this paper. The preservation of content was highest when $L_h$ was not used, while omitting $L_c$ or $L_s$ resulted in a loss of content and style. Omitting $L_l$ showed no significant difference in terms of content, whereas the style score was relatively high. Accordingly, $E_{PSNR}$ and $E_{SSIM}$ indicate the best balance when all loss functions are used. The distribution of the generated results, plotting content retention performance against style transfer performance, is shown in Fig. 12. For PSNR (Fig. 12a), the distribution of results without $L_h$ differs from the others; the distribution without $L_c$ lies in the first half and that without $L_s$ in the second half. The distribution for $L_{Total}$ is relatively close to the center with a small deviation, making it the most appropriate result. For SSIM (Fig. 12b), the distribution shape is similar to that of PSNR, but several distributions are shifted in parallel. The smaller the $E_{SSIM}$, the more central the distribution, indicating better performance. Therefore, $L_{Total}$ shows better performance than the $L_l$ case, and with only a small difference, both have similarly centered distributions. The other results are relatively poor because they are located away from the center.

[Table 3]


Figure 12: Scatter plots of the two metrics (PSNR and SSIM) in terms of content and style for the test datasets and their transfer results, based on the combinations of loss functions used

The performance of the style transfer of Korean portraits to ID photos presented in this paper is highly satisfactory. However, there are three issues. First, the dissimilarity in the texture of the images on both sides leads to unsatisfactory results in the reverse direction. While this paper mainly focuses on transferring the Korean portrait style to ID photos, the generator structure allows the transfer of the ID photos style to Korean portraits. Nevertheless, maintaining the shape of the Korean portraits, which are paintings, can pose a challenge. Furthermore, owing to mapping difficulties, only the face may be preserved during style transfer. Second, if the feature map of the Korean portrait dataset does not map well with the ID photos dataset, unsatisfactory results are obtained. This can be attributed to either dataset limitations or faulty preprocessing. For instance, ID photos are front-facing, but Korean portraits may depict characters from other angles. Improper cropping during preprocessing can also lead to different feature maps and poor results. Finally, Korean portrait data are scarce, and using existing data leads to limited and inadequate style representation. Although data augmentation increases the number of data points, the training style remains unchanged, thereby limiting the results.

6  Conclusions

The objective of this study was to propose a generative adversarial network that uses facial feature points and loss functions to achieve arbitrary style transfer while maintaining the original face shape and transferring the Gat. To preserve the characteristics of the face, two loss functions, land loss and head loss, were defined using landmark masks to minimize the differences and speed up the learning process. Style loss, based on a Gram matrix, together with content loss enables style transfer while preserving the character’s shape. However, if the input images differ greatly and their feature maps have significant discrepancies, the results are not satisfactory, and there are color differences in some instances. Additionally, when hair is prominently displayed in the ID photos, the chance of a ghosting effect increases. To overcome these limitations, future studies should define a loss function that considers color differences and align the feature maps through facial landmark alignment.

Acknowledgement: The authors thank the undergraduate students, alumni of Kumoh National Institute of Technology, and members of the IIA Lab.

Funding Statement: This work was supported by Metaverse Lab Program funded by the Ministry of Science and ICT (MSIT), and the Korea Radio Promotion Association (RAPA).

Author Contributions: Study conception and design: J. Si, S. Kim; data collection: J. Si; analysis and interpretation of results: J. Si, S. Kim; draft manuscript preparation: J. Si, S. Kim.

Availability of Data and Materials: Data supporting this study cannot be made available due to ethical restrictions.

Ethics Approval: The authors utilized ID photos of several individuals, all of which were obtained with their explicit consent for providing their photos.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.

References

1. Encyclopedia of Korean Culture. http://encykorea.aks.ac.kr/Contents/Item/E0057016/ (accessed on 28/12/2022)

2. Cultural Heritage Administration. http://www.heritage.go.kr/heri/cul/culSelectDetail.do?pageNo=1_1_1_1&ccbaCpno=1113701110000/ (accessed on 17/05/2023)

3. Cultural Heritage Administration. https://www.heritage.go.kr/heri/cul/culSelectDetail.do?pageNo=1_1_1_1&ccbaCpno=1121114830000/ (accessed on 17/05/2023)

4. J. Si, J. Jeong, G. Kim and S. Kim, “Style interconversion of Korean portrait and ID photo using CycleGAN,” in Proc. of Korean Institute of Information Technology (KIIT), Cheongju, Korea, pp. 147–149, 2020.

5. J. Zhu, T. Park, P. Isola and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. of the IEEE Int. Conf. on Computer Vision (ICCV), Venice, Italy, pp. 2223–2232, 2017.

6. X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in Proc. of the IEEE Int. Conf. on Computer Vision (ICCV), Venice, Italy, pp. 1501–1510, 2017.

7. S. Huang, H. Xiong, T. Wang, Q. Wang, Z. Chen et al., “Parameter-free style projection for arbitrary style transfer,” arXiv preprint arXiv:2003.07694, 2020.

8. T. Zhu and S. Liu, “Detail-preserving arbitrary style transfer,” in Proc. of IEEE Int. Conf. on Multimedia and Expo (ICME), London, UK, pp. 1–6, 2020.

9. M. Elad and P. Milanfar, “Style transfer via texture synthesis,” IEEE Transactions on Image Processing, vol. 26, no. 5, pp. 2338–2351, 2017.

10. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. of Int. Conf. on Learning Representations (ICLR), San Diego, CA, USA, pp. 1–14, 2015.

11. S. Li, X. Xu, L. Nie and T. Chua, “Laplacian-steered neural style transfer,” in Proc. of ACM Int. Conf. on Multimedia, New York, NY, USA, pp. 1716–1724, 2017.

12. C. Chen, X. Tan and K. Y. K. Wong, “Face sketch synthesis with style transfer using pyramid column feature,” in Proc. of IEEE Winter Conf. on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, pp. 485–493, 2018.

13. B. Blakeslee, R. Ptucha and A. Savakis, “Faster Art-CNN: An extremely fast style transfer network,” in Proc. of IEEE Western New York Image and Signal Processing Workshop (WNYISPW), Rochester, NY, USA, pp. 1–5, 2018.

14. X. Liu, X. Li, M. Cheng and P. Hall, “Geometric style transfer,” arXiv preprint arXiv:2007.05471, 2020.

15. P. Kaur, H. Zhang and K. Dana, “Photo-realistic facial texture transfer,” in Proc. of IEEE Winter Conf. on Applications of Computer Vision (WACV), Waikoloa, HI, USA, pp. 2097–2105, 2019.

16. G. Ghiasi, H. Lee, M. Kudlur, V. Dumoulin and J. Shlens, “Exploring the structure of a real-time, arbitrary neural artistic stylization network,” arXiv preprint arXiv:1705.06830, 2017.

17. R. Yi, Y. Liu, Y. Lai and P. Rosin, “APDrawingGAN: Generating artistic portrait drawings from face photos with hierarchical GANs,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pp. 10743–10752, 2019.

18. Z. Xu, M. Wilber, C. Fang, A. Hertzmann and H. Jin, “Learning from multi-domain artistic images for arbitrary style transfer,” in Proc. of the ACM/Eurographics Expressive Symp. on Computational Aesthetics and Sketch Based Interfaces and Modeling and Non-Photorealistic Animation and Rendering (Expressive ’19), Goslar, Germany, pp. 21–31, 2019.

19. R. Zhang, S. Tang, Y. Li, J. Guo, Y. Zhang et al., “Style separation and synthesis via generative adversarial networks,” in Proc. of the ACM Int. Conf. on Multimedia, New York, NY, USA, pp. 183–191, 2018.

20. R. Yi, Y. J. Liu, Y. K. Lai and P. L. Rosin, “Unpaired portrait drawing generation via asymmetric cycle mapping,” in Proc. of the IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), Virtual, pp. 8217–8225, 2020.

21. D. Horita and K. Aizawa, “SLGAN: Style- and latent-guided generative adversarial network for desirable makeup transfer and removal,” in Proc. of the ACM Int. Conf. on Multimedia in Asia, Tokyo, Japan, pp. 1–8, 2022.

22. T. Li, R. Qian, C. Dong, S. Liu, Q. Yan et al., “BeautyGAN: Instance-level facial makeup transfer with deep generative adversarial network,” in Proc. of the ACM Int. Conf. on Multimedia, New York, NY, USA, pp. 645–653, 2018.

23. H. Chang, J. Lu, F. Yu and A. Finkelstein, “PairedCycleGAN: Asymmetric style transfer for applying and removing makeup,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, pp. 40–48, 2018.

24. R. Wu, X. Gu, X. Tao, X. Shen, Y. W. Tai et al., “Landmark assisted CycleGAN for cartoon face generation,” arXiv preprint arXiv:1907.01424, 2019.

25. S. Palsson, E. Agustsson, R. Timofte and L. V. Gool, “Generative adversarial style transfer networks for face aging,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, pp. 2084–2092, 2018.

26. Z. Wang, Z. Liu, J. Huang, S. Lian and Y. Lin, “How old are you? Face age translation with identity preservation using GANs,” arXiv preprint arXiv:1909.04988, 2019.

27. F. Rosberg, E. E. Aksoy, F. Alonso-Fernandez and C. Englund, “FaceDancer: Pose- and occlusion-aware high fidelity face swapping,” in Proc. of the IEEE/CVF Winter Conf. on Applications of Computer Vision (WACV), Waikoloa, HI, USA, pp. 3454–3463, 2023.

28. L. A. Gatys, A. S. Ecker and M. Bethge, “Image style transfer using convolutional neural networks,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, pp. 2414–2423, 2016.

29. V. Kazemi and J. Sullivan, “One millisecond face alignment with an ensemble of regression trees,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, pp. 1867–1874, 2014.

30. C. Li and M. Wand, “Precomputed real-time texture synthesis with Markovian generative adversarial networks,” in Proc. of European Conf. on Computer Vision (ECCV), Amsterdam, Netherlands, pp. 702–716, 2016.

Appendix A. Comparison of generated results with similar related works

In this section, we will perform comparative analysis alongside the results of two similar categories: Neural Style Transfer and Face Swap.

In the case of Neural Style Transfer, it involves transferring the style of the entire Style image to the Content image. Therefore, it is unable to produce results such as transferring the Gat or changing the texture of the clothes. For Face Swap, it can extract only the facial features of the ID photos and generate an image on the Korean portrait. However, one of its limitations is that it superimposes the features from the ID photos onto the facial landmarks of the Korean portrait itself, which leads to a loss of original identity. The aim of this study is quite different as it seeks to preserve all aspects of the upper body shape and facial area found in the ID photos, while also transferring the texture of the clothes, the overall color of the picture, and the Gat from the Korean portrait.

Fig. 13 illustrates the visualized results of the methods proposed by Ghiasi et al. [16], Face-Dancer [27], and this study. The results of Ghiasi et al. [16] show a transfer from the ID photos based on the overall style distribution of the Korean portrait images. For Face-Dancer [27], it maintains all content outside the face region of the Korean portrait, and the internal facial features of the ID photos are spatially transformed and transferred onto the face of the Korean portrait. The generated result is not a transfer of the Korean portrait style to the ID photos, but rather a projection of the ID photos onto the Korean portrait itself, hence the preservation of the identity is not maintained.


Figure 13: Comparison of results between neural style transfer, face swap methods and ours (Column 2: Ours, Column 3: G. Ghiasi et al. [16], Column 4: Face-Dancer [27]). Adapted with permission from reference [16], Copyright © 2017, Arxiv., reference [27], Copyright © 2023, IEEE




This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.