
Table of Links

2 MindEye2 and 2.1 Shared-Subject Functional Alignment
2.2 Backbone, Diffusion Prior, & Submodules
2.3 Image Captioning and 2.4 Fine-tuning Stable Diffusion XL for unCLIP
3 Results and 3.1 fMRI-to-Image Reconstruction
3.3 Image/Brain Retrieval and 3.4 Brain Correlation
6 Acknowledgements and References
A Appendix
A.2 Additional Dataset Information
A.3 MindEye2 (not pretrained) vs. MindEye1
A.4 Reconstruction Evaluations Across Varying Amounts of Training Data
A.5 Single-Subject Evaluations
A.7 OpenCLIP BigG to CLIP L Conversion
A.9 Reconstruction Evaluations: Additional Information
A.10 Pretraining with Less Subjects
A.11 UMAP Dimensionality Reduction
A.13 Human Preference Experiments
MindEye2 involves pretraining and then fine-tuning a single model where brain activity is mapped to the embedding space of pretrained deep learning models. During inference, these embeddings predicted from the brain are fed into frozen image generative models that translate from model space to pixel space. Our strategy to reconstruct seen images from brain activity using minimal training data is to first pretrain the model using data from 7 subjects (30-40 hours of scanning data each) and then to fine-tune the model using data from a held-out 8th subject. The full MindEye2 pipeline is depicted in Figure 2.
Single-subject models were trained/fine-tuned on a single 8xA100 80GB GPU node for 150 epochs with a batch size of 24. Multi-subject pretraining used a batch size of 63 (9 samples from each of the 7 subjects). Models were trained with Hugging Face Accelerate (Gugger et al., 2022) and DeepSpeed (Rajbhandari et al., 2020) Stage 2 with CPU offloading.
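For concreteness, below is a minimal sketch of a training setup along these lines, using Hugging Face Accelerate with a DeepSpeed ZeRO Stage 2 plugin and CPU offloading of optimizer states. The model, data, and optimizer here are stand-in placeholders rather than the actual MindEye2 code, and the launch configuration (e.g., `accelerate launch` across 8 GPUs) is assumed.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator, DeepSpeedPlugin

# DeepSpeed ZeRO Stage 2 with optimizer-state CPU offloading, as described above.
deepspeed_plugin = DeepSpeedPlugin(zero_stage=2, offload_optimizer_device="cpu")
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)

# Placeholder model and data: flattened voxels -> 4096-dim latent (stand-ins only).
model = torch.nn.Linear(15000, 4096)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
dataset = TensorDataset(torch.randn(240, 15000), torch.randn(240, 4096))
loader = DataLoader(dataset, batch_size=24, shuffle=True)  # batch size 24, as in the paper

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for epoch in range(150):  # 150 epochs, as in the paper
    for voxels, target in loader:
        loss = F.mse_loss(model(voxels), target)
        accelerator.backward(loss)  # let Accelerate/DeepSpeed handle the backward pass
        optimizer.step()
        optimizer.zero_grad()
```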
Every subject has a uniquely shaped brain with different functional organization, meaning that there needs to be an initial alignment step to ensure the model can handle inputs from different brains. Unlike anatomical alignment where every subject’s brain is mapped to the same brain template (Talairach and Tournoux, 1990; Mazziotta et al., 2001), we remain in subjects’ native brain space and functionally align flattened spatial patterns of fMRI activity to a shared-subject latent space using subject-specific ridge regression. That is, each subject has a separate linear layer with weight decay to map the input fMRI voxels (13,000 to 18,000 voxels depending on the subject) to a 4096-dim latent.
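As an illustration, the sketch below (not the authors' released code) shows how such subject-specific ridge-regression layers can be implemented: one linear layer per subject, each mapping that subject's flattened voxels to a shared 4096-dim latent, with weight decay supplying the ridge (L2) penalty. Subject names and voxel counts are illustrative.

```python
import torch
import torch.nn as nn

class SubjectRidgeAlignment(nn.Module):
    """One linear map per subject from native-space voxels to a shared latent space."""

    def __init__(self, voxels_per_subject: dict[str, int], latent_dim: int = 4096):
        super().__init__()
        # Input sizes differ per subject (roughly 13,000-18,000 voxels).
        self.align = nn.ModuleDict({
            subj: nn.Linear(n_voxels, latent_dim)
            for subj, n_voxels in voxels_per_subject.items()
        })

    def forward(self, flat_voxels: torch.Tensor, subject: str) -> torch.Tensor:
        return self.align[subject](flat_voxels)

# Example with two subjects of different voxel counts mapped to the same latent space.
model = SubjectRidgeAlignment({"subj01": 15724, "subj02": 14278})
# Weight decay on the alignment weights acts as the ridge (L2) regularizer.
optimizer = torch.optim.AdamW(model.parameters(), weight_decay=1e-2)
latent = model(torch.randn(4, 15724), subject="subj01")  # -> shape (4, 4096)
```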
Following this initial linear layer, the rest of the model pipeline is shared across subjects without any subject-specific mappings. The whole pipeline is trained end-to-end, with each pretraining batch containing brain inputs from all subjects. That is, alignment to the shared-subject space is not trained independently and we do not pretrain models separately for each subject; rather, we pretrain a single model, sampling equally across all subjects except the held-out subject used for fine-tuning.
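The sketch below illustrates this multi-subject pretraining scheme under assumed module names: each batch draws an equal number of samples from every pretraining subject, routes each subject's voxels through its own alignment layer, and passes the concatenated latents through shared layers with no further subject-specific parameters.

```python
import torch
import torch.nn as nn

subjects = [f"subj{i:02d}" for i in range(1, 8)]            # 7 pretraining subjects
voxel_counts = {s: 15000 for s in subjects}                 # illustrative voxel counts
align = nn.ModuleDict({s: nn.Linear(voxel_counts[s], 4096) for s in subjects})
backbone = nn.Sequential(nn.Linear(4096, 4096), nn.GELU())  # placeholder shared backbone

def pretraining_batch(samples_per_subject: int = 9) -> torch.Tensor:
    latents = []
    for s in subjects:
        voxels = torch.randn(samples_per_subject, voxel_counts[s])  # stand-in fMRI data
        latents.append(align[s](voxels))     # subject-specific layer -> shared space
    shared = torch.cat(latents, dim=0)       # combined batch of 7 * 9 = 63 samples
    return backbone(shared)                  # shared layers, no subject-specific params

out = pretraining_batch()
print(out.shape)  # torch.Size([63, 4096])
```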
Two strengths of this novel functional alignment procedure are its simplicity and flexibility. Using a simple linear mapping for alignment can provide robust, generalizable performance in low-sample, high-noise settings because simple mappings are less likely to overfit to noise. Also, unlike typical functional alignment approaches that require subjects to process a shared set of images (Haxby et al., 2011), our approach has the flexibility to work even when subjects view entirely unique images in the training data. This is critical for the Natural Scenes Dataset, where 90% of the seen images are unique to each subject and the 10% seen across subjects are relegated to the test set. Further, this approach holds advantages for subsequent data collection from a new subject, since such data collection need not be restricted to a predefined set of images.
This paper is available on arxiv under CC BY 4.0 DEED license.
Authors:
(1) Paul S. Scotti, Stability AI and Medical AI Research Center (MedARC);
(2) Mihir Tripathy, Medical AI Research Center (MedARC) and a Core contribution;
(3) Cesar Kadir Torrico Villanueva, Medical AI Research Center (MedARC) and a Core contribution;
(4) Reese Kneeland, University of Minnesota and a Core contribution;
(5) Tong Chen, The University of Sydney and Medical AI Research Center (MedARC);
(6) Ashutosh Narang, Medical AI Research Center (MedARC);
(7) Charan Santhirasegaran, Medical AI Research Center (MedARC);
(8) Jonathan Xu, University of Waterloo and Medical AI Research Center (MedARC);
(9) Thomas Naselaris, University of Minnesota;
(10) Kenneth A. Norman, Princeton Neuroscience Institute;
(11) Tanishq Mathew Abraham, Stability AI and Medical AI Research Center (MedARC).