New Story

Backbone, Diffusion Prior, & Submodules Explained

by Image Recognition4mApril 9th, 2025

Too Long; Didn't Read

The study was conducted using fMRI-to-Image Reconstruction. The results were published in the journal Human Experiments. The study was published on November 13, 2013.

Coin Mentioned

featured image - Backbone, Diffusion Prior, & Submodules Explained

‘a sketch of a spine’ Image created by HackerNoon AI Image Generator

Table of Links

Abstract and 1 Introduction

2 MindEye2 and 2.1 Shared-Subject Functional Alignment

2.2 Backbone, Diffusion Prior, & Submodules

2.3 Image Captioning and 2.4 Fine-tuning Stable Diffusion XL for unCLIP

2.5 Model Inference

3 Results and 3.1 fMRI-to-Image Reconstruction

3.2 Image Captioning

3.3 Image/Brain Retrieval and 3.4 Brain Correlation

3.5 Ablations

4 Related Work

5 Conclusion

6 Acknowledgements and References

A Appendix

A.1 Author Contributions

A.2 Additional Dataset Information

A.3 MindEye2 (not pretrained) vs. MindEye1

A.4 Reconstruction Evaluations Across Varying Amounts of Training Data

A.5 Single-Subject Evaluations

A.6 UnCLIP Evaluation

A.7 OpenCLIP BigG to CLIP L Conversion

A.8 COCO Retrieval

A.9 Reconstruction Evaluations: Additional Information

A.10 Pretraining with Less Subjects

A.11 UMAP Dimensionality Reduction

A.12 ROI-Optimized Stimuli

A.13 Human Preference Experiments

2.2 Backbone, Diffusion Prior, & Submodules

Flattened spatial patterns of brain activity are first linearly mapped to the shared-subject space using an output dimensionality of 4096. Then, these latents are fed through an MLP backbone with 4 residual blocks, followed by a linear mapping that goes from 4096-dim to 256×1664 dimensionality of OpenCLIP ViT-bigG/14 image token embeddings. These backbone embeddings are then simultaneously fed through a diffusion prior (Ramesh et al., 2022) and two MLP projectors (retrieval and low-level submodules). Differences from MindEye1 include linear mapping to a shared-subject space, mapping to OpenCLIP ViT-bigG/14 rather than CLIP ViT-L/14, and adding a low-level MLP submodule.

MindEye2 has three losses that are summed, stemming from the diffusion prior, retrieval submodule, and low-level submodule. The end-to-end loss, with α1 = .033 and α2 = .016, is defined as:

2.2.1 DIFFUSION PRIOR

Using a diffusion prior to align outputs from a contrastive learning model was inspired by DALL-E 2 (Ramesh et al., 2022), where a “diffusion prior” maps CLIP text embeddings to CLIP image space before using an unCLIP decoder to reconstruct images. Here we trained our own diffusion prior from scratch to map fMRI latents to the OpenCLIP ViTbigG/14 image space, which was kept frozen as done with locked-image text tuning (LiT) (Zhai et al., 2022). We used the same prior loss as Ramesh et al. (2022), implemented with the same code as MindEye1 which used modified code from the DALLE2-pytorch repository.

2.2.2 RETRIEVAL SUBMODULE

MindEye1 observed a tradeoff if using contrastive loss and MSE loss on the outputs of the diffusion prior directly, such that the model could not effectively learn a single embedding to satisfy both objectives. Instead, applying MSE loss on the diffusion prior and applying contrastive loss on the outputs from an MLP projector attached to the MLP backbone effectively mitigated this tradeoff because the objectives no longer shared identical embeddings. We adopted the same approach here, with the retrieval submodule contrastively trained to maximize cosine similarity for positive pairs while minimizing similarity for negative pairs. We used the same BiMixCo and SoftCLIP losses used in MindEye1 (Scotti et al., 2023), which involved the first third of training iterations using bidirectional MixCo data augmentation (Kim et al., 2020) with hard labels and the last two-thirds of training iterations using soft labels (generated from the dot product of CLIP image embeddings in a batch with themselves) without data augmentation.

2.2.3 LOW-LEVEL SUBMODULE

MindEye1 used an independent low-level pipeline to map voxels to the latent space of Stable Diffusion’s variational autoencoder (VAE) such that blurry reconstructions were returned that lacked semantic information but performed well on low-level metrics. Here, we reimplement this pipeline as a submodule, similar to the retrieval submodule, such that it need not be trained independently. The MLP projector feeds to a CNN upsampler that upsamples to the (64, 64, 4) dimensionality of SD VAE latents with L1 loss as well as an additional MLP to the embeddings of a teacher linear segmentation model VICRegL (Bardes et al., 2022) ConvNext-XXL (α = 0.75) for an auxilliary SoftCLIP loss (soft labels from VICRegL model).