Russian Scientists Say New AI Architecture Produces State-of-the-Art Text-to-Image Synthesis

by @autoencoder

Too Long; Didn't Read

Researchers have developed a text-to-image generation model called Kandinsky that uses a novel latent diffusion model to produce images that appear natural.

Authors:

(1) Anton Razzhigaev, AIRI and Skoltech;

(2) Arseniy Shakhmatov, Sber AI;

(3) Anastasia Maltseva, Sber AI;

(4) Vladimir Arkhipkin, Sber AI;

(5) Igor Pavlov, Sber AI;

(6) Ilya Ryabov, Sber AI;

(7) Angelina Kuts, Sber AI;

(8) Alexander Panchenko, AIRI and Skoltech;

(9) Andrey Kuznetsov, AIRI and Sber AI;

(10) Denis Dimitrov, AIRI and Sber AI.

Editor's Note: This is Part 4 of 8 of a study detailing the development of Kandinsky, the first text-to-image architecture designed using a combination of image prior and latent diffusion.

4 Kandinsky Architecture

Our goal was to deliver state-of-the-art text-to-image synthesis. In the initial stages of our research, we experimented with multilingual text encoders, such as mT5 (Xue et al., 2021), XLMR (Conneau et al., 2020), and XLMR-CLIP[7], to facilitate robust multilingual text-to-image generation. However, we discovered that conditioning on CLIP-image embeddings instead of relying on standalone text encoders improved image quality. As a result, we adopted an image prior approach, which uses diffusion and linear mappings between the text and image embedding spaces of CLIP, while keeping additional conditioning on XLMR text embeddings. That is why Kandinsky uses two text encoders: CLIP-text with image prior mapping, and XLMR. Both encoders are kept frozen during training.


A significant factor that influenced our design choice was the efficiency of training latent diffusion models compared to pixel-level diffusion models (Rombach et al., 2022). This led us to focus on the latent diffusion architecture. Our model comprises three stages: text encoding, embedding mapping (image prior), and latent diffusion.
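
The data flow through the three stages can be sketched as a simple composition of callables. This is only an illustration of the ordering described in this section, not the authors' implementation: the function `generate`, its parameter names, and the toy stand-in models in the usage lines are all hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def generate(
    prompt: str,
    clip_text_encoder: Callable[[str], Vector],
    xlmr_text_encoder: Callable[[str], Vector],
    image_prior: Callable[[Vector], Vector],
    latent_diffusion: Callable[[Vector, Vector, Vector], Vector],
    decoder: Callable[[Vector], Vector],
) -> Vector:
    """Hypothetical wiring of the three stages; every callable is a placeholder."""
    # Stage 1: text encoding with the two frozen text encoders.
    clip_text = clip_text_encoder(prompt)
    xlmr_text = xlmr_text_encoder(prompt)
    # Stage 2: embedding mapping (image prior), CLIP-text -> CLIP-image space.
    clip_image = image_prior(clip_text)
    # Stage 3: latent diffusion conditioned on all three embeddings,
    # followed by decoding the latent with the autoencoder.
    latent = latent_diffusion(clip_image, clip_text, xlmr_text)
    return decoder(latent)


# Toy stand-ins (assumptions for illustration only) so the sketch runs end to end.
toy_output = generate(
    "hi there",
    clip_text_encoder=lambda p: [float(len(p))],
    xlmr_text_encoder=lambda p: [float(len(p.split()))],
    image_prior=lambda e: [2.0 * x for x in e],
    latent_diffusion=lambda img, txt, xl: img + txt + xl,
    decoder=lambda z: z,
)
```

With these stand-ins, `toy_output` simply concatenates the three conditioning vectors, which makes it easy to see which embedding reaches which stage.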


Figure 3: Kandinsky web interface for “a corgi gliding on the wave”: generation (left) and in/outpainting (right).


Table 1: Proposed architecture comparison by FID on the COCO-30K validation set at 256×256 resolution. * For the IF model, we report our reproduced results on COCO-30K; the authors report an FID of 7.19.


At the embedding mapping step, which we also refer to as the image prior, we use a transformer-encoder model. This model was trained from scratch with a diffusion process on text and image embeddings provided by the CLIP-ViT-L14 model. A noteworthy feature of our training process is the elementwise normalization of visual embeddings. This normalization is based on full-dataset statistics and leads to faster convergence of the diffusion process. At inference, we apply the inverse normalization to return to the original CLIP-image embedding space.
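
The normalization and its inverse can be illustrated with a minimal sketch. It assumes per-dimension standardization (subtract the dataset mean, divide by the dataset standard deviation); the section does not state the exact statistics used, so the helper names, the `eps` term, and the toy three-dimensional embeddings below are assumptions for illustration.

```python
from statistics import fmean, pstdev


def fit_stats(embeddings, eps=1e-6):
    """Per-dimension mean and std computed over the full dataset (assumed form)."""
    columns = list(zip(*embeddings))
    mean = [fmean(col) for col in columns]
    std = [pstdev(col) + eps for col in columns]  # eps guards against zero variance
    return mean, std


def normalize(vec, mean, std):
    """Elementwise normalization applied before training the prior."""
    return [(v - m) / s for v, m, s in zip(vec, mean, std)]


def denormalize(vec, mean, std):
    """Inverse normalization applied at inference to return to the original space."""
    return [v * s + m for v, m, s in zip(vec, mean, std)]


# Toy "image embeddings" whose dimensions have very different scales.
embeddings = [[0.2, 10.0, -3.0], [0.4, 14.0, -1.0], [0.6, 12.0, -2.0]]
mean, std = fit_stats(embeddings)
normed = [normalize(e, mean, std) for e in embeddings]
restored = [denormalize(n, mean, std) for n in normed]
```

After normalization every dimension has zero mean and comparable scale, and `restored` matches `embeddings` up to floating-point error, which is the round trip the inference-time inverse relies on.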


The image prior model is trained on text and image embeddings provided by the CLIP models. We conducted a series of experiments and ablation studies on the specific architecture design of the image prior model (Table 3, Figure 6). The model with the best human evaluation score is based on 1D diffusion and a standard transformer-encoder with the following parameters: num_layers=20, num_heads=32, and hidden_size=2048.
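
To get a feel for what these hyperparameters imply, the sketch below derives the per-head dimension and a rough weight count. The `PriorConfig` class is hypothetical, and the estimate assumes a feed-forward width of 4 × hidden_size and ignores biases, layer norms, and input/output projections, none of which are specified in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriorConfig:
    """Hypothetical container for the reported image prior hyperparameters."""
    num_layers: int = 20
    num_heads: int = 32
    hidden_size: int = 2048
    ffn_multiplier: int = 4  # assumption: not stated in the text

    @property
    def head_dim(self) -> int:
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads

    def approx_params(self) -> int:
        """Rough weight count: attention (Q, K, V, output) plus a two-layer FFN."""
        d = self.hidden_size
        attention = 4 * d * d
        ffn = 2 * d * (self.ffn_multiplier * d)
        return self.num_layers * (attention + ffn)


cfg = PriorConfig()
head_dim = cfg.head_dim          # 2048 / 32 = 64
approx = cfg.approx_params()     # 20 * 12 * 2048**2 = 1,006,632,960
```

Under these assumptions the encoder stack alone holds roughly one billion weights, so the prior is a substantial share of the overall model.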


The latent diffusion part employs a UNet model along with a custom pre-trained autoencoder. Our diffusion model uses a combination of multiple condition signals: CLIP-image embeddings, CLIP-text embeddings, and XLMR-CLIP text embeddings. CLIP-image and XLMR-CLIP embeddings are merged and used as an input to the latent diffusion process. We also conditioned the diffusion process on these embeddings by adding all of them to the time embedding. Notably, we did not skip the quantization step of the autoencoder during diffusion inference, as it leads to an increase in the diversity and quality of generated images (cf. Figure 4). In total, our model comprises 3.3B parameters (Table 2).
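
Adding condition embeddings to the time embedding can be sketched as follows. This is a simplified illustration, not the released code: it assumes a standard sinusoidal timestep embedding and assumes each condition vector has already been projected to the time-embedding width. The function names and toy values are hypothetical.

```python
import math


def timestep_embedding(t, dim):
    """Standard sinusoidal embedding of a diffusion timestep (dim must be even)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def add_conditions(time_emb, *conditions):
    """Sum condition vectors into the time embedding, elementwise."""
    for cond in conditions:
        if len(cond) != len(time_emb):
            raise ValueError("each condition must match the time-embedding width")
    return [sum(values) for values in zip(time_emb, *conditions)]


dim = 8
time_emb = timestep_embedding(10, dim)
image_cond = [0.1] * dim  # stands in for a projected CLIP-image embedding
text_cond = [0.2] * dim   # stands in for a projected text embedding
conditioned = add_conditions(time_emb, image_cond, text_cond)
```

Because the sum has the same shape as the time embedding, the UNet blocks that already consume the time embedding receive the conditioning without any architectural change.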


Table 2: Kandinsky model parameters.


We observed that image decoding was our main bottleneck in terms of generated image quality; hence, we developed Sber-MoVQGAN, our custom implementation of MoVQGAN (Zheng et al., 2022) with minor modifications. We trained this autoencoder on the LAION HighRes dataset (Schuhmann et al., 2022), obtaining SotA results in image reconstruction. We released the weights and code for these models under an open source licence[11]. The comparison of our autoencoder with competitors can be found in Table 4.
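
For readers unfamiliar with the quantization step of a VQ-style autoencoder, the sketch below shows generic nearest-neighbour vector quantization: each latent vector is replaced by its closest codebook entry. This is a textbook illustration under our own assumptions, not the Sber-MoVQGAN code; the function name, codebook, and latent values are made up.

```python
def quantize(latent, codebook):
    """Return (index, entry) of the codebook vector closest to `latent`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)),
                key=lambda i: squared_distance(latent, codebook[i]))
    return index, codebook[index]


# Toy 2-D codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
index, snapped = quantize([0.9, 1.2], codebook)
```

Keeping this step at inference means the decoder only sees latents drawn from the codebook it was trained on, which is one plausible reading of why skipping it would hurt output quality.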


Table 3: Ablation study: FID on COCO-30K validation set on 256 × 256 resolution.


This paper is available on arXiv under the CC BY 4.0 DEED license.


[7] https://github.com/FreddeFrallan/Multilingual-CLIP


[8] https://github.com/Stability-AI/stablediffusion


[9] https://github.com/ai-forever/ru-dalle


[10] https://github.com/gligen/GLIGEN


[11] https://github.com/ai-forever/MoVQGAN