Samples from the pix2pix paper
In this report we study the possibility of building a neural model of human faces using cGAN.
In my last experiment, Generate Photo-realistic Avatars with DCGAN, I showed that it is possible to use DCGAN (Deep Convolutional Generative Adversarial Networks), the non-conditional variation of GAN, to synthesize photo-realistic animated facial expressions using a model trained from a limited number of images or videos of a specific person.
This report is a follow-up on the general idea, but this time we want to use the cGAN as described in the paper Image-to-Image Translation with Conditional Adversarial Networks (referred to as the pix2pix paper below), and apply it for the purpose of synthesizing photo-realistic images from black-and-white sketch images (either Photoshopped or hand-drawn) of a specific person.
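As a refresher on the method, pix2pix trains its generator with a conditional adversarial loss plus an L1 reconstruction term weighted by a constant λ (set to 100 in the paper). Below is a minimal numpy sketch of the generator's objective; the function name and exact formulation are illustrative, not taken from the pix2pix code:

```python
import numpy as np

def generator_loss(d_fake, fake, target, lam=100.0):
    """pix2pix-style generator objective: fool the discriminator
    while staying close to the paired ground truth in an L1 sense."""
    # Adversarial term: the generator wants D(G(x)) to approach 1
    gan_term = -np.mean(np.log(d_fake + 1e-12))
    # L1 term ties the output pixels to the ground truth image
    l1_term = np.mean(np.abs(fake - target))
    return gan_term + lam * l1_term
```

With a perfect reconstruction and a fully fooled discriminator the loss goes to zero; in practice the heavily weighted L1 term accounts for much of the close match to the ground truth seen in the experiments below.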
Overall this report is an empirical study on cGAN, with an eye towards finding practical applications for the technology (see the Motivation section below).
Our long-term goal is to build a crowd-contributed repository of 3D models that represent the objects in our physical world. As opposed to scanning and representing physical objects using traditional mathematical 3D model representations, we want to explore the idea of using a representation based on Artificial Neural Networks (ANN) for its ability to learn, infer, associate, and encode rich probability distributions of visual details. We call such an ANN-based representation the neural model of a physical object.
The intuition behind studying cGAN here is that if cGAN is capable of generating realistic visual details when given only scanty information, then perhaps it in fact constitutes an adequate representation for many visual aspects of a complex physical object.
The reasons for choosing human faces in this study are:
As a first step towards the long-term goal stated above, we choose to use cGAN for building a neural model of the face of a specific person. This differs from typical GAN applications, which tend to be applied to a wide variety of images. If successful, we will proceed with using cGAN or its extensions on other types of physical objects.
Goal of Experiments
Using human faces as the subject matter for a series of experiments, we seek to answer the following questions:
The setup for our experiments is as follows:
Baseline Experiment: building the AJ Model
Here we attempt to build a neural model of American actress, filmmaker, and humanitarian Angelina Jolie (referred to as the AJ Model below) using a relatively small training dataset.
The ground truth images (or target images) are color photos of Angelina Jolie, manually scraped from the Internet. The input images (for either training or testing) are manually processed by an artist using Photoshop, by converting them to black-and-white with filter effects. All input images have one particular style of effect applied, which we shall call the Style A effect. The center image in Figure 1a shows typical output from the testing phase, sampled from the trained model using the input image at left.
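For reference, the pix2pix training script consumes each sample as a single image with the input and target placed side by side. Assembling the processed black-and-white images and their ground truth photos into that layout can be sketched as follows (a numpy sketch under that assumption; the actual tooling used for this project may differ):

```python
import numpy as np

def combine_pair(input_img, target_img):
    """Place the black-and-white input and the color ground truth
    side by side, the paired layout pix2pix training data uses."""
    if input_img.shape != target_img.shape:
        raise ValueError("input and target must share height/width/channels")
    return np.concatenate([input_img, target_img], axis=1)
```

A 256x256 input paired with a 256x256 target yields one 256x512 training image.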
Despite the small size of the training dataset, overall cGAN does a good job converting the black-and-white input images to color, with a great deal of convincing shading and colors that very closely match the ground truth photos. Some observations:
Experiment #1: the case of too many effects

Figure 1a. Images shown from left are: input, output, ground truth. The input image shows the Style A effect, which is applied to all training and test images in the baseline experiment.

Figure 1b. The input image at left shows the Style B effect applied, which creates strong contrast. This effect is not applied to any of the training samples. The output image in the center shows the problem of the woodblock printing look, where there is little gradient.

Figure 1c. This is the same test sample as in Figure 1b, which shows that the woodblock printing look in the output image has disappeared after including relevant training samples, and it now appears photo-realistic.
While the baseline experiment above shows good results, the input images are in fact entirely uniform, with only one type of effect applied to reduce them to black-and-white. Here we want to find out what happens when the input images include effects in several varieties.
The AJ Model from the baseline experiment is used, but the test samples contain some input images with Style B effect applied (see left image in Figure 1b).
Figure 1d. Tests with two additional effects (applied to the two black-and-white images) show the same problem: each individual effect needs to be included in the training samples in order to achieve a satisfactory result; otherwise the sampled output image tends to appear washed out.
Figure 1c shows the result from this test, where the same test image now appears photo-realistic without the woodblock printing effect.
Further tests with additional effects (see left images in Figure 1d) show a general pattern: test samples with a new effect (i.e., one the model was never trained on) tend to show poor results, and including such samples in training resolves the problem.
While the result above is not entirely surprising, we do wish to find ways to make the model more tolerant to a wider variety of effects, so that we don't have to retrain cGAN on every new effect.
Experiment #2: the case of mutilated faces

Figure 2a. This demonstrates the problem of missing features, where the input images at left are missing part of the nose (top-left) or an eye (bottom-left). No sign of recovery is observed when sampling from the AJ Model.

Figure 2b. After including such samples in the training, sampling from the new model shows that it is capable of recovering the features to some degree.

Figure 2c. One view of the training process, showing an out-of-place eye in the intermediate output image.

Figure 2d. Another view of the training process, showing duplicated and out-of-place eyes in the intermediate output image.
In this experiment we want to find out whether it is possible to recover missing facial features in the input images. This is of interest here because as a neural model we would want it to be able to infer missing information from partial or altered observations.
We created a set of new test input images with Style A or Style B effects applied, then manually modified them to have certain facial features erased. These test images are then used to sample the AJ Model from Experiment #1 (which had been trained with Style A & B effects). The result is shown in Figure 2a, which demonstrates that the model is unable to recover the facial features omitted from the input images.
The two samples shown in Figure 2a, which were used only as test samples earlier, are now included here for training.
Figure 2b shows the result after 4000 epochs of training. Note that in the top row of Figure 2b, the output image (at center) has been repaired by cGAN with a somewhat acceptable nose, though smaller than in the ground truth photo. The output image (at center) in the bottom row has been repaired with an eye that seems to be a copy from the ground truth photo, but it is larger and not quite in the right place.
A curious effect is observed (using pix2pix's Display UI tool) during the training phase of this experiment, where successive snapshots show the missing part moving around the face and changing size, with no clear sign of convergence. Figures 2c and 2d give a glimpse of the phenomenon.
The problem was eventually resolved by turning off the random jitter operation, which is applied by this cGAN implementation by default. The random jitter operation essentially adds some small randomness when cropping and resizing the images, which seems to work well for other types of subject matter. Our conjecture is that such an operation does not work well in this particular experiment in part because we are extremely sensitive to the precise relative positioning of facial parts, so while we tolerate it in other subject matter (e.g., street scenes, building facades, etc.), it becomes much more noticeable with faces.
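For concreteness, the jitter in question upscales each training image slightly and then takes a random crop back to the original size, so facial parts shift by a few pixels every epoch. A simplified numpy sketch (nearest-neighbor resize for brevity; the real implementation uses proper image resizing and also random flips):

```python
import numpy as np

def resize_nn(img, size):
    """Nearest-neighbor resize of a square image to size x size."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

def random_jitter(img, load_size=286, crop_size=256, rng=None):
    """Upscale then random-crop, in the style of pix2pix's default
    preprocessing. Disabling jitter amounts to resizing straight
    to crop_size, so every epoch sees identically aligned faces."""
    rng = rng or np.random.default_rng()
    big = resize_nn(img, load_size)
    y = int(rng.integers(0, load_size - crop_size + 1))
    x = int(rng.integers(0, load_size - crop_size + 1))
    return big[y:y + crop_size, x:x + crop_size]
```

The 286/256 sizes match the defaults used by the pix2pix code; the offsets `x` and `y` are what make eyes and noses drift between snapshots.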
With the random jitter removed it can then be observed during training that missing parts are being repaired to near perfection. This of course does not mean much, unless it can also do so with new test images. This is further investigated below.
Figure 2e. An input image with a different defect (i.e., missing the right eye) is used to sample against the model trained in Test 2.B, and the output image (at center) shows no repair made to the right eye at all. Curiously, the model chooses to repair the good left eye, replacing it with a larger version.

Figure 2f. The top row shows that a missing left eye is repaired well during training. When a new test input image (bottom row, left image) is given, the trained model repairs it with an eye that looks off.

The model trained in Test 2.B (referred to as model-2B below) is observed to repair the missing nose and eye satisfactorily during training. The next question is whether such repair is transferable in the following sense:
Figure 2g. This example tests cGAN's ability to repair the same defect across different identities. When the input image with the missing left eye is based on a different photo than the one the model was trained to repair, the output (at center) shows only a partial repair, with a very faint and mismatched eye.
Experiment #3: from art to photo-realism

Figure 3a. The image at the far left is a hand-drawn sketch (courtesy: Michelle Chen) based on the ground truth photo at far right.

Figure 3b. The hand-drawn black-and-white sketch at left was intentionally made to have a somewhat different expression from the ground truth photo. The output image is blurry and feature-wise closer to the ground truth photo that the model was trained on than to the input image.
All of the black-and-white input images used in the experiments above were processed by an artist using Photoshop. This means that such an input image is a precise reduction of a ground truth photo; it thus retains a great deal of precision regarding the position and arrangement of many visual features in relation to its ground truth counterpart.
In this experiment we seek to find out whether an input image that is entirely hand-drawn, with all the imprecision of a human hand, can be converted to a photo-realistic image like the Photoshop-processed samples we have seen before. This is somewhat similar to the hand-drawn handbag outline example in the pix2pix paper, but here we get to test it with human faces.
For this experiment we asked an artist to find a photo of Angelina Jolie and draw several black-and-white sketches by hand based on the photo; such image pairs (the original photo and a sketch) are then used for additional experiments. The sketches were made using graphite pencil on paper, then scanned and converted to 400x400 JPEG files, with manual retouching in Photoshop as needed.
Here we use the photo-sketch pairs as new test samples against the model from Experiment #2, which was trained on Style A and Style B effects in the input images, but never on imprecise hand-sketched samples (let's call this hand-drawn effect Style C). Figure 3a shows the initial result. The poor quality is not unexpected, since the model has never been trained on this style.
Here we include some hand-drawn samples in the training phase to derive a new model (referred to as model-3B below). When the input image in Figure 3b is sampled against model-3B, the result (center image in Figure 3b) shows much improvement over Figure 3a, though it is still somewhat blurry, possibly due to insufficient training. The output image is judged to be too similar to the ground truth photo used to train model-3B, so this experiment should be repeated with more samples.
Experiment #4: the case of mistaken identity

Figure 4a. This is a test where a male image is used to sample against the AJ Model. The black-and-white image at left is the input, far right is the ground truth photo, and the center image is the output, which appears to have picked up some features of Angelina Jolie.

Figure 4b. When sampling the AJ Model using a Beyoncé image, the output image picks up AJ's skin tone.
Given that we have built a neural model of Angelina Jolie (referred to as the AJ Model), how useful is it when applied to other people? Since a neural model trained exclusively on one person represents the probability distribution of that person's facial features, it is expected that applying the AJ Model to another person's photos will give somewhat reasonable results, but with some limits.
Figure 4a shows the result of sampling American actor and producer Brad Pitt against the AJ Model. As expected, the result (center image) shows somewhat reasonable colors and shading, but it also picks up softer feminine lines, a less stubbly beard, and Jolie's brown hair color.
Similarly, in Figure 4b, sampling against the AJ Model using an input image of the American singer, songwriter, and actress Beyoncé results in an output image (at center) that picks up the lighter skin tone of Angelina Jolie.
From the perspective of building neural models for human faces, it would seem appropriate to have a separate model for each individual of interest. It would be interesting to see the creation of a hierarchy of such models, where the top one represents a model for all human faces, the leaf models at the bottom represent specific individuals, and the models in between represent groups of people (such as by race, by distinct features, etc.). With a well-designed mechanism we might be able to derive much efficiency in training and storage from such a hierarchical structure of many models.
Experiment #5: the case of decomposing faces
In this experiment we want to study how to decompose a face into parts, so that each part can be manipulated individually.
Why is this important? Because if a neural model is composed of parts that can be learned without supervision, and such parts can be treated as shared features across sample instances, then it is possible to achieve a kind of one-shot learning.
For example, assuming that cGAN is able to learn facial parts (e.g., eyes, noses, etc.) during its training process (just as a typical deep CNN can), and that the noses in two photos activate the same neuron in cGAN, then we can say that this neuron now represents an anonymous concept of a nose.
If we now attach a text label 'nose' to the image of a nose in photo A, then the system would know right away that the 'nose' label is likely also applicable to all those other noses in other photos. So here we have achieved a sort of one-shot supervised learning through the common nose neuron mentioned above.
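The one-shot idea above can be made concrete with a toy sketch: treat each photo's activations around the shared "nose" neuron as a feature vector, and propagate the single attached label to every photo whose activations are close enough. Everything here is hypothetical; in a real system the vectors would come from a trained network, and the threshold would need tuning:

```python
import numpy as np

def propagate_label(labeled_vec, label, candidate_vecs, threshold=0.9):
    """Attach `label` to every candidate whose activation vector is
    close (by cosine similarity) to the one labeled example."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return [label if cosine(labeled_vec, v) >= threshold else None
            for v in candidate_vecs]
```

One labeled nose thus labels every photo that lights up the same feature, which is the essence of the one-shot learning described above.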
If we use cGAN as the basis for implementing the neural model in question, then it would mean the following:
This is a topic which will be explored further in a separate post.
Experiment #6: from photo to imitated artwork

Figure 6a. A trained cGAN model is used to convert an unseen test color photo into the black-and-white style effect (center). The result is quite similar to the one manually created by an artist (right).
In this experiment we seek to apply cGAN in the other direction, by mapping from color photos to sketches of a certain style.
This turns out to be very easy, at least for those manually applied Photoshop effects that we have used in the previous experiments.
We use a training dataset of 48 pairs of images, where all the black-and-white images are manually created by applying the same Photoshop effect to the color photos. cGAN is then trained to map from color photos to black-and-white. After training for one hour we use the trained model on a separate set of color photos for testing. Figure 6a shows a typical test input image (left, in color), which is converted to a black-and-white output image (center) by the trained cGAN model. The result is deemed very good when the output image is compared with another image (right) converted manually by an artist using the same Photoshop effect used to create the training dataset.
So with this it is then possible to use cGAN to bootstrap our own experiments as follows, in order to reduce the amount of manual work:
This technique should be applicable to many types of datasets that involve some sort of straightforward information reduction.
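The bootstrap idea can be sketched as a simple loop: once a photo-to-sketch model is trained, it can mass-produce input images to pair with their source photos, replacing the manual Photoshop step. Here `photo_to_sketch` is a stand-in for the trained model, not an actual API:

```python
def bootstrap_pairs(photos, photo_to_sketch):
    """Generate (input, target) training pairs for the sketch->photo
    direction using a model trained in the photo->sketch direction."""
    return [(photo_to_sketch(photo), photo) for photo in photos]
```

Each returned pair feeds straight back into training the forward (sketch-to-photo) model, so only the initial 48 pairs require manual effort.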
It would be interesting to see how far we can push it for generating more artistic and less faithful effects, such as caricatures, etc.
In this report we have conducted a series of empirical studies on the possibility of using cGAN as the basis for building a neural representation of human faces, with an eye towards applying the same technique to other types of physical objects in the future.
This particular flavor of the Conditional GAN allows us to map from an input image to another image, which gives us a handle to use cGAN in many ways.
Following is a summary of observations made from this study.
The experiments described above were conducted with a very limited number of data samples, as well as limited model training time. The observations and suggestions made above are quite preliminary, and further study is warranted.
There are several possible applications of the cGAN technology (or its extension) that we want to explore in separate posts:
Figure 7. Training cGAN to convert a normal photo (at left) to a depth map (at right).
I also want to express my gratitude to Fonchin Chen and Michelle Chen for offering to do the hand-drawn sketches, as well as for helping with the unending process of collecting and processing the images needed for this project.