TLDR: Artistic Style Transfer


  1. A Neural Algorithm of Artistic Style
  2. Perceptual Losses for Real-Time Style Transfer and Super-Resolution
  3. Universal Style Transfer via Feature Transforms (Coming soon)

1. A Neural Algorithm of Artistic Style

Convolutional layer

A convolutional layer works as follows:

conv_layer(in_arr: float [IX, IY, IZ],
           filters: float [NUM_FILTERS, FLT_SZ, FLT_SZ, IZ],
           biases: float [NUM_FILTERS]) -> float [NUM_FILTERS, IX, IY]
    let out_arr: float[NUM_FILTERS, IX, IY];
    for f in 0..NUM_FILTERS { // For each filter:
        let flat_filter: float[FLT_SZ * FLT_SZ * IZ] = flatten(filters[f]);
        for x in 0..IX {
            for y in 0..IY { // For each XY position in the input data:
                // Get neighborhood centered at (x, y) and spanning the size of the filter
                let neighborhood: float[FLT_SZ, FLT_SZ, IZ] = get_filter_neighborhood(in_arr, x, y);
                let flat_in_arr: float[FLT_SZ * FLT_SZ * IZ] = flatten(neighborhood);
                let raw = dot_product(flat_in_arr, flat_filter) + biases[f]; // W*x + b
                out_arr[f, x, y] = max(0, raw); // ReLU activation function
            }
        }
    }
    return out_arr;
Input: 3D grid of (IX * IY * IZ) real numbers.
Filters: Each filter is a 3D grid of (FLT_SZ * FLT_SZ * IZ) real numbers. There are NUM_FILTERS such filters. FLT_SZ is the filter size and is 3 in our neural network.
Output: 3D grid of (NUM_FILTERS * IX * IY) real numbers.

While training the network, the input and desired output are known. Filters and the bias are learned using backpropagation. During inference, input and filters are known. The output is calculated.

* CNNs don't perform convolutions in the strictest sense of the word; the filter is not flipped.
* On boundary elements, the neighborhood overflows. We plug in padding values for convenience to keep input and output arrays equally sized in 2D.
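The pseudocode above can be written out as a minimal NumPy sketch. Names like `conv_layer` and the per-filter `biases` array follow the pseudocode; this is a naive reference implementation, not how real frameworks compute convolutions.

```python
import numpy as np

def conv_layer(in_arr, filters, biases):
    """Naive convolutional layer matching the pseudocode above.

    in_arr:  (IX, IY, IZ) input grid
    filters: (NUM_FILTERS, FLT_SZ, FLT_SZ, IZ) filter bank
    biases:  (NUM_FILTERS,) one bias per filter
    Returns: (NUM_FILTERS, IX, IY) activations after ReLU.
    """
    ix, iy, iz = in_arr.shape
    num_filters, flt_sz, _, _ = filters.shape
    pad = flt_sz // 2
    # Zero-pad so the output keeps the input's 2D size.
    padded = np.pad(in_arr, ((pad, pad), (pad, pad), (0, 0)))
    out = np.empty((num_filters, ix, iy))
    for f in range(num_filters):
        flat_filter = filters[f].ravel()
        for x in range(ix):
            for y in range(iy):
                # Neighborhood centered at (x, y) in the original array.
                neighborhood = padded[x:x + flt_sz, y:y + flt_sz, :]
                raw = neighborhood.ravel() @ flat_filter + biases[f]  # W*x + b
                out[f, x, y] = max(0.0, raw)  # ReLU
    return out
```

The triple loop makes the data flow obvious but is slow; production implementations express the same computation as one large matrix multiply.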

Max-pooling layer

A max-pooling layer produces an output that is half the width, half the height and the same depth as the input. While converting a 2x2 grid of real numbers to a single real number, it outputs the maximum value among the 4 numbers it samples. Silliest downsampling ever.
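As a sketch, the whole layer is a reshape and a max, assuming even input width and height:

```python
import numpy as np

def max_pool_2x2(in_arr):
    """Halve width and height by taking the max of each 2x2 patch.

    in_arr:  (IX, IY, IZ) with even IX and IY
    Returns: (IX // 2, IY // 2, IZ)
    """
    ix, iy, iz = in_arr.shape
    # Group the grid into non-overlapping 2x2 patches, then reduce each.
    patches = in_arr.reshape(ix // 2, 2, iy // 2, 2, iz)
    return patches.max(axis=(1, 3))
```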

The VGG-19 network

The convolutional layers of VGG19.
Max-pooling layers are not drawn, but the reduction in size comes from them.

VGG-19 is a convolutional network designed for image classification and trained on the ImageNet dataset. We're interested only in its convolutional and max-pooling layers. Since it's a pre-trained network, filters and biases are known. Each filter is 3x3.

The core idea of the paper

The content of an image can be reverse-engineered from the activations it produces in convolutional layers, and its style can be reverse-engineered from the statistical correlations of those activations. We can thus combine the content of an image with the style of another.


  1. Pick a content image and a style image.

  2. Plug in the content image as the input to the network, and observe the output $\ell_{feat}$ produced at any convolutional layer. For any input image that produces output $\ell$ at the same convolutional layer, the content loss $\mathcal{L}_{feat}$ is the Euclidean distance between flattened $\ell$ and $\ell_{feat}$.

  3. Plug in the style image, and at each of five convolutional layers compute the Gram matrix of the activations (the matrix of inner products between every pair of flattened filter responses), giving the five Gram matrices $G_{style}$. For any input image, we get its Gram matrices $G$ the same way. Then the style loss $\mathcal{L}_{style}$ of that image is the average of the Euclidean distances between each $G$ and its corresponding $G_{style}$.

  4. Plug in a white noise image, and minimize its loss $\mathcal{L} = \lambda_{feat} \mathcal{L}_{feat} + \lambda_{style} \mathcal{L}_{style}$, where $\lambda_{feat}$ and $\lambda_{style}$ are scalar constants controlling respective contributions to the final result.
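The two losses in the steps above reduce to a few lines of NumPy. This sketch assumes activations shaped (NUM_FILTERS, IX, IY), uses squared Euclidean distances, and omits the normalization constants the paper applies:

```python
import numpy as np

def gram_matrix(act):
    """Correlations between filter activations at one layer.

    act: (NUM_FILTERS, IX, IY) activations
    Returns the (NUM_FILTERS, NUM_FILTERS) Gram matrix.
    """
    f = act.reshape(act.shape[0], -1)  # one row per flattened filter response
    return f @ f.T

def content_loss(act, act_target):
    # Squared Euclidean distance between flattened activations.
    return np.sum((act - act_target) ** 2)

def style_loss(acts, acts_target):
    """Average of the per-layer Gram matrix distances."""
    return np.mean([np.sum((gram_matrix(a) - gram_matrix(t)) ** 2)
                    for a, t in zip(acts, acts_target)])
```

Note that the Gram matrix discards all spatial information: it only records which filters fire together, which is exactly why it captures style rather than content.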

Pros & cons

+ Works on arbitrary style images.
- Is effectively a training problem: every output image requires many iterations of backpropagation, which is much slower than a single inference pass.


* The VGG network works for this task despite being trained for image classification, not style transfer.
* A covariance matrix works just as well as a Gram matrix for capturing style.
* The key takeaway from the paper is not the specific technique or loss function, but the idea that correlations of activations in a pretrained convolutional network are a viable way of identifying artistic style.

Further reading

[1] Original paper by Leon A. Gatys, Alexander S. Ecker, Matthias Bethge. 2015
[2] Lecture on Convolutional Neural Networks by Andrej Karpathy. 2016
[3] Paper introducing VGG by Karen Simonyan and Andrew Zisserman. 2014

2. Perceptual Losses for Real-Time Style Transfer and Super-Resolution

The two appended networks. Sourced from original paper [1] with permission.

The core idea of the paper

This paper prepends a custom-made neural network to the pretrained VGG network of paper 1. The prepended network is called the image transformation network, and the VGG is called the loss network.

Paper 1 was slow because it was posed as a training problem. This paper makes it a fast inference problem by pre-training the prepended network for a specific style image.


The loss network is a VGG-16 network, instead of the VGG-19 from the previous paper.

The image transform network has two downsampling layers, then a bunch of residual blocks, and then two upsampling layers. The residual blocks are simple but outside the scope of this article. Instead of using max-pooling, downsampling is done using a convolution stride of 2. Similarly, upsampling is done with a fractional stride of 1/2 (a transposed convolution). This makes the convolutional layers themselves responsible for changing resolution.
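To see how a stride replaces max-pooling, here is a minimal sketch of stride-2 downsampling: the filters are simply evaluated at every other pixel, so the convolution itself halves the resolution. The function name and argument shapes are illustrative, mirroring the conv layer from paper 1:

```python
import numpy as np

def strided_conv_downsample(in_arr, filters, biases, stride=2):
    """Convolution that downsamples by evaluating filters every
    `stride` pixels, instead of max-pooling a full-resolution output.

    in_arr:  (IX, IY, IZ); filters: (NUM_FILTERS, FLT_SZ, FLT_SZ, IZ)
    Returns: (NUM_FILTERS, IX // stride, IY // stride)
    """
    ix, iy, _ = in_arr.shape
    num_filters, flt_sz, _, _ = filters.shape
    pad = flt_sz // 2
    padded = np.pad(in_arr, ((pad, pad), (pad, pad), (0, 0)))
    out = np.empty((num_filters, ix // stride, iy // stride))
    for f in range(num_filters):
        for ox, x in enumerate(range(0, ix, stride)):
            for oy, y in enumerate(range(0, iy, stride)):
                patch = padded[x:x + flt_sz, y:y + flt_sz, :]
                out[f, ox, oy] = max(0.0, np.sum(patch * filters[f]) + biases[f])
    return out
```

Unlike max-pooling, the downsampling here is learned: the filters decide what information survives the resolution change.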

  1. A style image is chosen, and its Gram matrices $G_{style}$ are obtained from the loss network, just like in the previous paper.

  2. We iterate over the COCO dataset, training the image transformation network to minimize the loss $\mathcal{L}$, which, just like in paper 1, is a weighted sum of $\mathcal{L}_{feat}$ and $\mathcal{L}_{style}$, but with an extra regularization component. We thus train the image transform network to combine the style of one image with the content of any image.

  3. Once the image transform network has been trained for a given style image, any content image needs only a single pass through it to transfer the style. This is why this paper is faster than paper 1.
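The extra regularization component mentioned in step 2 is a total variation penalty, which discourages high-frequency noise in the output. A sketch of how the three terms combine, with hypothetical weight values (the paper tunes these per style image):

```python
import numpy as np

def total_variation(img):
    """Penalize differences between neighboring pixels,
    encouraging spatially smooth output images.

    img: (H, W, C) image
    """
    dx = img[1:, :, :] - img[:-1, :, :]  # vertical neighbor differences
    dy = img[:, 1:, :] - img[:, :-1, :]  # horizontal neighbor differences
    return np.sum(dx ** 2) + np.sum(dy ** 2)

def total_loss(l_feat, l_style, tv, w_feat=1.0, w_style=5.0, w_tv=1e-4):
    # Weighted sum of content, style, and regularization terms.
    # The default weights here are illustrative, not from the paper.
    return w_feat * l_feat + w_style * l_style + w_tv * tv
```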

Example output. Sourced from original paper [1] with permission.

Pros & cons

+ Is an inference problem, and hence is fast.
+ Same technique also works for super-resolution.
- Has a big up-front cost per style image. Thus, in production, arbitrary style images cannot be selected.


* The most interesting part of this paper is its incremental improvement over paper 1. This paper shows that neural networks can learn to mimic gradient descent in some circumstances.

Further reading

[1] Original paper by Justin Johnson, Alexandre Alahi, Fei-Fei Li. 2016

3. Universal Style Transfer via Feature Transforms

(Coming soon)