ControlNet Explained

ControlNet is a new neural network architecture built to control existing large image diffusion models - like Stable Diffusion - by enabling them to support additional input conditions. Users of vanilla Stable Diffusion models will understand why this is immensely valuable. While SD models produce fantastical images, getting them to generate specific details that match your expectations, such as a human in a specific pose, is an extremely time-consuming process that takes dozens of seed/prompt-tweaking → generation cycles. This process essentially requires interpreting raw inputs into an object-level or scene-level understanding. With ControlNet, it is now possible to use a wide array of input conditions, such as outlines, human poses, depth maps, edge maps, segmentation maps, keypoints, etc., to control diffusion models, making them highly malleable.
 
ControlNet is a clever way to fine-tune a network to enable additional conditioning: it creates a trainable copy of an existing network, interposed with zero-convolution layers. It is important to note that the original model stays safely frozen during training, while ControlNet itself does all the learning on the side. ControlNet itself is not a diffusion model - it is simply a neural network that operates on the outputs of intermediate layers.
ControlNet maintains two versions of a large diffusion model's weights: a "trainable copy" and a "locked copy". The locked copy retains the capability the network learned from analyzing billions of images, while the trainable copy is used to learn conditional control on task-specific datasets. These two neural network blocks are linked using a special convolution layer called "zero convolution", whose weights grow from zeros to optimized parameters in a learned manner. This technique keeps the training process robust across datasets of different scales, since the production-ready weights are preserved. Additionally, because the zero convolution introduces no new noise to deep features, training is faster than starting with new layers from scratch.
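To make the wiring concrete, here is a minimal PyTorch sketch of one ControlNet-style block, assuming the block's input, output, and condition feature maps share the same channel count; `ControlNetBlock` and `zero_conv` are illustrative names, not the paper's released code:

```python
import copy

import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution whose weights and biases start at exactly zero."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlNetBlock(nn.Module):
    """One pretrained block: the frozen original plus a trainable copy
    bracketed by two zero convolutions."""

    def __init__(self, pretrained_block: nn.Module, channels: int):
        super().__init__()
        self.locked = pretrained_block  # locked copy: weights frozen
        for p in self.locked.parameters():
            p.requires_grad_(False)
        self.trainable = copy.deepcopy(pretrained_block)  # trainable copy
        self.zero_in = zero_conv(channels)   # injects the condition features
        self.zero_out = zero_conv(channels)  # projects the copy's output

    def forward(self, x, condition):
        # At initialization both zero convs output zeros, so the control
        # branch vanishes and the block behaves exactly like the frozen original.
        control = self.zero_out(self.trainable(x + self.zero_in(condition)))
        return self.locked(x) + control
```

Because the added branch starts at exactly zero, the first training steps cannot perturb the locked model's behavior, which is precisely why fine-tuning stays robust even on small datasets.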

What is the Zero Convolution Layer? How does it grow from zero?

The zero-conv layer is simply a convolution layer with both weights and biases initialized to zero. To understand why the zero-conv layer grows, consider a basic layer $y = wx + b$.

The derivatives are

$$\frac{\partial y}{\partial w} = x, \qquad \frac{\partial y}{\partial x} = w, \qquad \frac{\partial y}{\partial b} = 1$$

If $w = 0$ and $x \neq 0$, then

$$\frac{\partial y}{\partial w} = x \neq 0, \qquad \frac{\partial y}{\partial x} = w = 0, \qquad \frac{\partial y}{\partial b} = 1 \neq 0$$

As long as $x$ is non-zero in the first step, gradient descent will set $w$ to a non-zero value in subsequent steps, and the zero-conv layers will progressively be trained with non-zero weights.
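You can verify this argument in a few lines of PyTorch; all names here are purely illustrative:

```python
import torch

# y = w * x + b with w = b = 0, mirroring the example above
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
x = torch.tensor([2.0])  # any non-zero input

y = w * x + b
y.backward()

print(w.grad)  # tensor([2.]) -> dy/dw = x is non-zero even though w = 0
print(b.grad)  # tensor([1.]) -> dy/db = 1

# One gradient-descent step moves w away from zero:
with torch.no_grad():
    w -= 0.1 * w.grad
print(w)  # tensor([-0.2000], requires_grad=True) -> the "zero" weight has grown
```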
By learning task-specific conditions in an end-to-end manner, ControlNet can robustly handle small training datasets (< 50k samples). Training a ControlNet is also fast - comparable to fine-tuning a diffusion model - and can be performed on personal devices. By augmenting large diffusion models such as Stable Diffusion with ControlNets, conditional inputs such as edge maps, segmentation maps, keypoints, etc., can be added, enabling more effective control and expanding the potential applications of large diffusion models.

ControlNet + Stable Diffusion Architecture

Stable Diffusion operates in the latent space - rather than the pixel space - converting 512×512 images into smaller 64×64 feature maps or "latent images" via an encoder. This means ControlNet's image-space conditions must be brought down to the same size: they are pre-processed by a small network of four convolution layers with 4×4 kernels and 2×2 strides (activated by ReLU; channels are 16, 32, 64, 128; initialized with Gaussian weights and trained jointly with the full model) that encodes image-space conditions into 64×64 feature maps.
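Here is a sketch of that condition encoder as described; the padding choice and the Gaussian-init standard deviation are my assumptions, and the released implementation may differ in detail:

```python
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Four 4x4 convolutions with stride 2 and ReLU, channels 16-32-64-128.
    Each layer halves the spatial resolution of the condition image."""

    def __init__(self, in_channels: int = 3):
        super().__init__()
        layers, prev = [], in_channels
        for ch in (16, 32, 64, 128):
            conv = nn.Conv2d(prev, ch, kernel_size=4, stride=2, padding=1)
            nn.init.normal_(conv.weight, std=0.02)  # Gaussian init (std is an assumption)
            layers += [conv, nn.ReLU()]
            prev = ch
        self.net = nn.Sequential(*layers)

    def forward(self, cond):
        # (B, in_channels, H, W) -> (B, 128, H/16, W/16)
        return self.net(cond)
```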
ControlNet hooks into Stable Diffusion's U-Net architecture, creating a trainable copy of each of the 12 blocks at different hierarchies of the resolution pyramid. ControlNet's outputs are added to the 12 skip-connections and the middle block of the U-Net backbone. A notable feature of this architecture is that it is not specialized to Stable Diffusion and can be reused in other diffusion models.
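Schematically, the hookup looks like the following sketch; this is illustrative rather than the released code, and every attribute name (`encoder_blocks`, `middle_block`, `decoder_blocks`, `out`) is hypothetical:

```python
import torch

def controlled_unet_forward(unet, controlnet, x, t, text_emb, condition):
    """Illustrative forward pass: ControlNet returns one residual per encoder
    block plus one for the middle block, and each residual is added to the
    corresponding skip connection of the frozen U-Net."""
    skip_residuals, mid_residual = controlnet(x, t, text_emb, condition)

    skips, h = [], x
    for block, res in zip(unet.encoder_blocks, skip_residuals):
        h = block(h, t, text_emb)
        skips.append(h + res)  # control signal added to each skip connection
    h = unet.middle_block(h, t, text_emb) + mid_residual
    for block in unet.decoder_blocks:
        h = block(torch.cat([h, skips.pop()], dim=1), t, text_emb)
    return unet.out(h)
```

Note that only the skip connections and middle block are modified: the frozen model's own weights never change, yet its decoder consumes the control signal at every resolution.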