
Accelerating Diffusion Models with TheStage AI: A Case Study of Recraft's 20B and Red Panda models

by TheStage AI | 7m read | November 26th, 2024

Too Long; Didn't Read

This article explores the acceleration of Recraft's text-to-image diffusion models using TheStage AI's Python tools. We examine the challenges of optimizing proprietary models and developing efficient pipelines for PyTorch inference optimization.


Recraft AI is a design tool powered by proprietary diffusion models. Their new model, Red Panda, outperforms all existing text-to-image models, including Midjourney, DALL-E 3, and FLUX. Recraft combines a unique user experience for designers with cutting-edge AI tools. To support millions of users, diffusion models require robust inference infrastructure that merges powerful hardware with efficient software. In this article, we'll explore how TheStage AI acceleration tools helped Recraft’s AI engineers and researchers achieve up to 2x performance on Nvidia GPUs through an intuitive Python interface!

Introduction

Diffusion models have shown extraordinary results in recent years for content generation, including images, music, videos, and 3D meshes. These models spend inference-time compute to iteratively improve generation results, slightly updating the output at each inference step. We can now see an analogy in LLMs, which use reasoning through multistep inference to provide high-quality answers.


At TheStage AI, we are building a general mathematical framework for arbitrary PyTorch models that handles complicated model acceleration flows fully automatically. Our system automatically detects the optimizations available on your hardware (quantization, sparsification) and selects the proper algorithm for each layer, either to achieve the best quality under desired model size and latency constraints or to find the best acceleration under quality constraints. It's a hard mathematical problem that we can solve in a highly efficient way! This article explores how we apply these tools through our partnership with Recraft AI.


When designing our tools, we decided to respect the following principles:


  • Hardware customization. High-quality AI products already have their preferred infrastructure
  • Quality preserving. High-quality AI products cannot accept quality degradation
  • Privacy. High-quality AI products want to keep their technologies confidential and work with tools on their own infrastructure
  • Arbitrary DNNs. High-quality AI products may use in-house architectures, and public acceleration tools built for open-source models often can't handle such complex DNN architectures and still produce correct outputs.
  • PyTorch. The most popular and convenient framework for many AI engineers.


Given these initial conditions, we aimed to create tools with the following features:


  • Controllable acceleration. We frame inference optimization as a business optimization problem, allowing customers to specify their desired model size, latency, or quality for their data.
  • Simple compilation. Compiling the produced models for efficient hardware execution requires just a single line of code. We also provide a simple interface to handle graph breaks.
  • Fast cold start. To achieve the fastest cold start possible, we enable saving of compiled models. This is why we don't use JIT compilers.
  • Simple deployment. Deploying the optimized model should be as straightforward as deploying the original one.


Text-to-Image Diffusion Models

A simple visualization of the denoising diffusion process.


In each iteration of the diffusion process, a neural network denoises the image in the latent space of a Variational AutoEncoder. The newly obtained image is then mixed with noise again, but with progressively less weight. During the initial iterations, the diffusion model sketches the main scene, leveraging the significant noise weight to make substantial updates. In later iterations, it refines high-frequency details. This observation allows us to design specific acceleration pipelines by strategically allocating network capacity across layers from iteration to iteration, preserving quality. However, such allocation requires specialized tools that combine mathematical insights with sound engineering, and this is where TheStage AI can significantly help!
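To make the loop described above concrete, here is a minimal toy sketch in plain Python. It is not Recraft's sampler: `toy_denoiser`, the linear noise schedule, and the fixed `target` vector are all assumptions made for illustration, standing in for the real network, scheduler, and prompt-conditioned latent.

```python
import random

def toy_denoiser(latent, target):
    """Stand-in for the neural network: nudges the latent halfway toward a
    fixed target. A real model predicts the clean latent from the prompt."""
    return [0.5 * x + 0.5 * t for x, t in zip(latent, target)]

def sample(target, num_steps=10, seed=0):
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    noise_weights = []
    for step in range(num_steps):
        denoised = toy_denoiser(latent, target)
        # Linear schedule (an assumption): the weight falls to 0 at the last step.
        w = 1.0 - (step + 1) / num_steps
        noise_weights.append(w)
        # Mix the denoised latent with fresh noise, using less noise each time.
        latent = [(1.0 - w) * d + w * rng.gauss(0.0, 1.0) for d in denoised]
    return latent, noise_weights

target = [1.0, -2.0, 0.5, 3.0]
final_latent, weights = sample(target)
```

Early steps carry a large noise weight, so the latent can still change substantially; by the final step the weight is zero and only small refinements remain.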


When we gaze at clouds, focusing on specific shapes, our minds can discern random forms that resemble our thoughts. This phenomenon illustrates how our brains identify patterns in noise, finding elements that align with our mental context. Similarly, diffusion models employ this concept during their initial iteration, selecting patterns from noise to create a preliminary sketch of the desired image.


Diffusion Models Acceleration and Compression

Accelerating diffusion models can be viewed as accelerating arbitrary DNNs, but we need to account for specific challenges. For instance, static quantization, which typically provides significant acceleration, introduces a challenge in diffusion models as activation distributions change from iteration to iteration. To address this, we either need to properly estimate optimal values for all iterations or use different quantization setups for each iteration.
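The effect of shifting activation distributions can be sketched with simple symmetric 8-bit quantization. The activation values below are synthetic and hypothetical (a wide range for an early iteration, a narrow one for a late iteration), and the absmax calibration is just one common choice, not necessarily the one used in our pipeline.

```python
import random

def absmax_scale(values, num_bits=8):
    """Symmetric scale so the largest magnitude maps to the top integer level."""
    qmax = 2 ** (num_bits - 1) - 1
    return max(abs(v) for v in values) / qmax

def fake_quantize(values, scale, num_bits=8):
    """Quantize to signed integers, clamp, and map back to floats."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mean_sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
# Hypothetical activations of one layer at two different iterations.
activations = {
    "early": [rng.gauss(0.0, 8.0) for _ in range(2000)],
    "late": [rng.gauss(0.0, 0.5) for _ in range(2000)],
}

# Option A: one static scale calibrated on all iterations pooled together.
static_scale = absmax_scale(activations["early"] + activations["late"])
# Option B: a separate scale for each iteration.
per_step_scales = {name: absmax_scale(vals) for name, vals in activations.items()}

late = activations["late"]
err_static_late = mean_sq_error(late, fake_quantize(late, static_scale))
err_per_step_late = mean_sq_error(late, fake_quantize(late, per_step_scales["late"]))
```

In this toy setup the pooled scale is dominated by the wide early-iteration range, so the narrow late-iteration activations are quantized far more coarsely than they need to be; the per-iteration scale avoids that at the cost of storing one setup per step.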


Diffusion models are challenging to train to high performance. Nevertheless, the results demonstrated by the Recraft team outperform all modern text-to-image models. Validating the degradation of such models can be difficult, making it crucial to use acceleration techniques that preserve the original feature semantics. Quantization algorithms can be a good choice if they can handle the challenge of varying activation distributions. Let’s take a look at our automatic pipelines, which we describe in the following sections.

Profiling

Profiling a given model on specific data allows us to:


  • Determine the size of each parameter
  • Identify applicable quantization, sparsification, pruning algorithms for each basic block
  • Estimate latency for individual blocks with different memory layouts
  • Compile all collected information for ANNA (Automated NNs Accelerator)
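As a rough illustration of the kind of information listed above, the sketch below computes parameter counts and storage sizes at several precisions and times a callable. The block names, tensor shapes, and precision table are hypothetical, and `median_latency` only stands in for timing a real block on the target GPU; this is not the TheStage AI profiler.

```python
import math
import time

# Hypothetical blocks: name -> shapes of their weight tensors.
blocks = {
    "attention": [(1024, 1024)] * 4,  # q, k, v, and output projections
    "mlp": [(1024, 4096), (4096, 1024)],
}
BITS = {"fp16": 16, "int8": 8, "int4": 4}

def num_params(shapes):
    return sum(math.prod(shape) for shape in shapes)

def size_mib(shapes, bits):
    """Storage needed for the weights at the given bit width, in MiB."""
    return num_params(shapes) * bits / 8 / 2**20

def median_latency(fn, repeats=5):
    """Median wall-clock time of fn() over several runs, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return sorted(timings)[len(timings) // 2]

profile = {
    name: {
        "params": num_params(shapes),
        "size_mib": {fmt: size_mib(shapes, bits) for fmt, bits in BITS.items()},
    }
    for name, shapes in blocks.items()
}
```

A table like `profile`, extended with measured latencies per candidate algorithm and memory layout, is the sort of input an automated optimizer needs in order to trade size and speed against quality.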

Automatic Compression & Acceleration

After the profiler has collected all necessary data, we can start our ANNA board and move the slider to produce different optimized model versions. Our users can then select the best candidates based on the quality vs. inference cost trade-off. Our tools handle these subjective quality decisions in a simple way.


TheStage AI ANNA. Move the slider to adjust model size or latency with minimum quality degradation!
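One way to picture what moving the slider does is as a constrained selection problem: pick one configuration per layer so that total quality loss is minimal while total size stays under a budget. The sketch below solves a tiny instance by exhaustive search. The layer names, sizes, and quality-loss numbers are invented for illustration, and brute force is used only because the example is small; it is not how ANNA works internally.

```python
from itertools import product

# Hypothetical profiler output: per layer, (config name, size in MiB, quality loss).
candidates = {
    "layer0": [("fp16", 8.0, 0.00), ("int8", 4.0, 0.01), ("int4", 2.0, 0.08)],
    "layer1": [("fp16", 16.0, 0.00), ("int8", 8.0, 0.02), ("int4", 4.0, 0.05)],
    "layer2": [("fp16", 8.0, 0.00), ("int8", 4.0, 0.30), ("int4", 2.0, 0.90)],
}

def best_assignment(candidates, size_budget_mib):
    """Pick one config per layer, minimizing total quality loss under the budget.

    Returns (assignment, total_loss, total_size), or None if nothing fits.
    """
    layers = list(candidates)
    best = None
    for combo in product(*(candidates[layer] for layer in layers)):
        size = sum(cfg[1] for cfg in combo)
        loss = sum(cfg[2] for cfg in combo)
        if size <= size_budget_mib and (best is None or loss < best[1]):
            assignment = {layer: cfg[0] for layer, cfg in zip(layers, combo)}
            best = (assignment, loss, size)
    return best

# Each slider position corresponds to a different size budget.
for budget in (32.0, 20.0, 12.0):
    print(budget, best_assignment(candidates, budget))
```

Note how the sensitive `layer2` stays in fp16 at the 20 MiB budget while the other two layers drop to int8: the selection is made per layer, not uniformly across the model.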

Operations Fusion and Compilation

As mentioned earlier, we don't use JIT compilation because it increases the cold start time of a new node. We also don't use off-the-shelf compilers. Instead, we compile our own complex accelerated configuration that can mix different algorithms. To achieve this, we've developed our own internal protocol to define DNN-accelerated layers in a hardware-agnostic way. One of the key benefits of TheStage AI acceleration framework is that a single checkpoint can be compiled for a wide range of hardware, solving cross-platform compatibility for AI software. This feature will be particularly important for edge device deployment in application development.


A DNN compiler has the following goals:


  • Graph Simplification. Simplify the execution graph through mathematical operation fusion, reducing inference time
  • Memory Management. Calculate the memory required for each operation and manage allocation scheduling with efficient memory reuse
  • Optimal Implementation. Profile the optimal implementation for each basic operation—a challenging task, as the best implementation may require specific memory layouts, leading to analysis of interlayer connections
  • Operations Scheduling. Create an operations schedule for the optimized execution graph
  • Serialization. Save all this information to avoid recompiling the model in subsequent runs
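The memory management goal above can be illustrated with a small greedy planner that reuses a buffer once the tensor stored in it is no longer needed. The tensor names, sizes, and lifetimes are hypothetical, and real compilers use more elaborate allocation strategies; this is only a sketch of the idea.

```python
def plan_buffers(tensors):
    """Assign tensors to reusable buffers.

    tensors: list of (name, size, first_step, last_step), sorted by first_step.
    A buffer can be reused once its previous tensor's last_step has passed.
    Returns (name -> buffer index, total bytes allocated).
    """
    buffers = []  # each entry: {"size": int, "free_after": int}
    assignment = {}
    for name, size, first_step, last_step in tensors:
        chosen = None
        for index, buf in enumerate(buffers):
            if buf["free_after"] < first_step and buf["size"] >= size:
                chosen = index
                break
        if chosen is None:
            buffers.append({"size": size, "free_after": last_step})
            chosen = len(buffers) - 1
        else:
            buffers[chosen]["free_after"] = last_step
        assignment[name] = chosen
    return assignment, sum(buf["size"] for buf in buffers)

# A chain of four operations: each output is produced at one step
# and read only by the next step.
chain = [("a", 4, 0, 1), ("b", 4, 1, 2), ("c", 4, 2, 3), ("d", 4, 3, 4)]
assignment, total_size = plan_buffers(chain)
```

For this chain the planner alternates between two buffers, so the peak allocation is 8 units instead of the 16 that four separate buffers would need.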


Sequential operations can be combined into a fused kernel. Instead of copying the first operation's output to global memory, the fused kernel evaluates the second operation directly in registers or local memory. This significantly speeds up inference because memory transfers often take longer than the actual computations. However, not all operation sequences can be fused, and some are incompatible with fusion entirely. For element-wise operations, fused kernels can be generated automatically. Nvidia's NVFuser tool, for example, can generate kernels for any sequence of element-wise operations.
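The difference between unfused and fused element-wise operations can be shown with a plain Python analogy. This is not GPU code: the intermediate lists play the role of tensors written to global memory, local variables play the role of registers, and the scale, bias, and ReLU sequence is just an example chosen for illustration.

```python
def unfused(xs, scale, bias):
    # Each step materializes a full intermediate list, like separate kernels
    # that write their outputs to global memory and read them back.
    scaled = [x * scale for x in xs]
    shifted = [s + bias for s in scaled]
    return [max(0.0, v) for v in shifted]

def fused(xs, scale, bias):
    # One pass over the data: intermediates live only in temporaries,
    # the analogue of keeping them in registers.
    return [max(0.0, x * scale + bias) for x in xs]

xs = [-2.0, -0.5, 0.0, 1.5]
result_unfused = unfused(xs, 2.0, 1.0)
result_fused = fused(xs, 2.0, 1.0)
```

Both versions compute the same values; the fused one simply avoids two full-size intermediate writes and reads, which is where the speedup on real hardware comes from.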

Deployment and Serving

Inference servers and auto-scaling pipelines play an important role in cost-effective and efficient processing of incoming requests. They can also include specific request grouping and statistics collection to set up predictive scaling for auto-scalers. In our future articles, we will discuss efficient inference servers in more detail!

Results

Applying the full pipeline, we achieve performance that is better than the PyTorch compiler (torch.compile) and, of course, significantly better than float16 eager PyTorch execution. Moreover, because the PyTorch compiler uses a JIT compilation approach, each model initialization requires recompilation for many input sizes, which makes the cold start too long for practical applications where latency is highly important.


Business Benefits

Here are the key business benefits of the resulting acceleration for Recraft’s product:


  • Lower infrastructure costs by serving up to twice as many users with the same hardware
  • Improved user experience with faster image generation response times
  • Ability to serve more concurrent users during peak loads
  • Competitive advantage through faster service delivery


"TheStage AI optimisation tools allow us to speed up our text-to-image models without quality degradation, creating a better user experience for our customers."

Anna Veronika Dorogush, CEO of Recraft

Acknowledgements

These results provide excellent validation of our tools and research on high-scale workload products. TheStage AI team continues to work toward delivering even greater performance. To achieve this, we're collaborating with outstanding partners! We are deeply grateful to:


  • Recraft CEO Anna Veronika for the fruitful cooperation. We're thrilled to be even a small part of their great journey in delivering the best design tools.
  • Recraft Head of AI Pavel Ostyakov for his expertise in DNNs, strong feedback on tools, and for setting challenging goals for our cooperation project.
  • The Recraft AI team for building this great product. Images in this article were generated with Recraft!
  • The Nebius team for their consistent support with excellent GPU infrastructure for our research.

Contacts / Resources

Feel free to contact us with any questions! We can help you reduce inference infrastructure costs!

Our email: hello@thestage.ai

TheStage AI main page: thestage.ai

TheStage AI inference optimization platform: app.thestage.ai