
2 MindEye2 and 2.1 Shared-Subject Functional Alignment
2.2 Backbone, Diffusion Prior, & Submodules
2.3 Image Captioning and 2.4 Fine-tuning Stable Diffusion XL for unCLIP
3 Results and 3.1 fMRI-to-Image Reconstruction
3.3 Image/Brain Retrieval and 3.4 Brain Correlation
6 Acknowledgements and References
A Appendix
A.2 Additional Dataset Information
A.3 MindEye2 (not pretrained) vs. MindEye1
A.4 Reconstruction Evaluations Across Varying Amounts of Training Data
A.5 Single-Subject Evaluations
A.7 OpenCLIP BigG to CLIP L Conversion
A.9 Reconstruction Evaluations: Additional Information
A.10 Pretraining with Less Subjects
A.11 UMAP Dimensionality Reduction
A.13 Human Preference Experiments
Flattened spatial patterns of brain activity are first linearly mapped to the shared-subject space using an output dimensionality of 4096. Then, these latents are fed through an MLP backbone with 4 residual blocks, followed by a linear mapping that goes from 4096-dim to 256×1664 dimensionality of OpenCLIP ViT-bigG/14 image token embeddings. These backbone embeddings are then simultaneously fed through a diffusion prior (Ramesh et al., 2022) and two MLP projectors (retrieval and low-level submodules). Differences from MindEye1 include linear mapping to a shared-subject space, mapping to OpenCLIP ViT-bigG/14 rather than CLIP ViT-L/14, and adding a low-level MLP submodule.
MindEye2 has three losses that are summed, stemming from the diffusion prior, retrieval submodule, and low-level submodule. The end-to-end loss, with α1 = .033 and α2 = .016, is defined as:
2.2.1 DIFFUSION PRIOR
Using a diffusion prior to align outputs from a contrastive learning model was inspired by DALL-E 2 (Ramesh et al., 2022), where a “diffusion prior” maps CLIP text embeddings to CLIP image space before using an unCLIP decoder to reconstruct images. Here we trained our own diffusion prior from scratch to map fMRI latents to the OpenCLIP ViTbigG/14 image space, which was kept frozen as done with locked-image text tuning (LiT) (Zhai et al., 2022). We used the same prior loss as Ramesh et al. (2022), implemented with the same code as MindEye1 which used modified code from the DALLE2-pytorch repository.
2.2.2 RETRIEVAL SUBMODULE
MindEye1 observed a tradeoff if using contrastive loss and MSE loss on the outputs of the diffusion prior directly, such that the model could not effectively learn a single embedding to satisfy both objectives. Instead, applying MSE loss on the diffusion prior and applying contrastive loss on the outputs from an MLP projector attached to the MLP backbone effectively mitigated this tradeoff because the objectives no longer shared identical embeddings. We adopted the same approach here, with the retrieval submodule contrastively trained to maximize cosine similarity for positive pairs while minimizing similarity for negative pairs. We used the same BiMixCo and SoftCLIP losses used in MindEye1 (Scotti et al., 2023), which involved the first third of training iterations using bidirectional MixCo data augmentation (Kim et al., 2020) with hard labels and the last two-thirds of training iterations using soft labels (generated from the dot product of CLIP image embeddings in a batch with themselves) without data augmentation.
2.2.3 LOW-LEVEL SUBMODULE
MindEye1 used an independent low-level pipeline to map voxels to the latent space of Stable Diffusion’s variational autoencoder (VAE) such that blurry reconstructions were returned that lacked semantic information but performed well on low-level metrics. Here, we reimplement this pipeline as a submodule, similar to the retrieval submodule, such that it need not be trained independently. The MLP projector feeds to a CNN upsampler that upsamples to the (64, 64, 4) dimensionality of SD VAE latents with L1 loss as well as an additional MLP to the embeddings of a teacher linear segmentation model VICRegL (Bardes et al., 2022) ConvNext-XXL (α = 0.75) for an auxilliary SoftCLIP loss (soft labels from VICRegL model).
This paper is available on arxiv under CC BY 4.0 DEED license.
Authors:
(1) Paul S. Scotti, Stability AI and Medical AI Research Center (MedARC);
(2) Mihir Tripathy, Medical AI Research Center (MedARC) and a Core contribution;
(3) Cesar Kadir Torrico Villanueva, Medical AI Research Center (MedARC) and a Core contribution;
(4) Reese Kneeland, University of Minnesota and a Core contribution;
(5) Tong Chen, The University of Sydney and Medical AI Research Center (MedARC);
(6) Ashutosh Narang, Medical AI Research Center (MedARC);
(7) Charan Santhirasegaran, Medical AI Research Center (MedARC);
(8) Jonathan Xu, University of Waterloo and Medical AI Research Center (MedARC);
(9) Thomas Naselaris, University of Minnesota;
(10) Kenneth A. Norman, Princeton Neuroscience Institute;
(11) Tanishq Mathew Abraham, Stability AI and Medical AI Research Center (MedARC).