MindEye2 and Shared-Subject Functional Alignment

by Image Recognition, April 9th, 2025

Table of Links

Abstract and 1 Introduction

2 MindEye2 and 2.1 Shared-Subject Functional Alignment

2.2 Backbone, Diffusion Prior, & Submodules

2.3 Image Captioning and 2.4 Fine-tuning Stable Diffusion XL for unCLIP

2.5 Model Inference

3 Results and 3.1 fMRI-to-Image Reconstruction

3.2 Image Captioning

3.3 Image/Brain Retrieval and 3.4 Brain Correlation

3.5 Ablations

4 Related Work

5 Conclusion

6 Acknowledgements and References

A Appendix

A.1 Author Contributions

A.2 Additional Dataset Information

A.3 MindEye2 (not pretrained) vs. MindEye1

A.4 Reconstruction Evaluations Across Varying Amounts of Training Data

A.5 Single-Subject Evaluations

A.6 UnCLIP Evaluation

A.7 OpenCLIP BigG to CLIP L Conversion

A.8 COCO Retrieval

A.9 Reconstruction Evaluations: Additional Information

A.10 Pretraining with Less Subjects

A.11 UMAP Dimensionality Reduction

A.12 ROI-Optimized Stimuli

A.13 Human Preference Experiments

2 MindEye2

MindEye2 involves pretraining and then fine-tuning a single model where brain activity is mapped to the embedding space of pretrained deep learning models. During inference, these embeddings predicted from the brain are fed into frozen image generative models that translate from model space to pixel space. Our strategy to reconstruct seen images from brain activity using minimal training data is to first pretrain the model using data from 7 subjects (30-40 hours of scanning data each) and then to fine-tune the model using data from a held-out 8th subject. The full MindEye2 pipeline is depicted in Figure 2.

Single-subject models were trained/fine-tuned on a single 8×A100 80GB GPU node for 150 epochs with a batch size of 24. Multi-subject pretraining was done with a batch size of 63 (9 samples from each of the 7 subjects). Models were trained with Hugging Face Accelerate (Gugger et al., 2022) and DeepSpeed (Rajbhandari et al., 2020) Stage 2 with CPU offloading.

2.1 Shared-Subject Functional Alignment

Every subject has a uniquely shaped brain with different functional organization, meaning that there needs to be an initial alignment step to ensure the model can handle inputs from different brains. Unlike anatomical alignment where every subject’s brain is mapped to the same brain template (Talairach and Tournoux, 1990; Mazziotta et al., 2001), we remain in subjects’ native brain space and functionally align flattened spatial patterns of fMRI activity to a shared-subject latent space using subject-specific ridge regression. That is, each subject has a separate linear layer with weight decay to map the input fMRI voxels (13,000 to 18,000 voxels depending on the subject) to a 4096-dim latent.
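The subject-specific mapping described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses NumPy rather than their training framework, and the class name `SubjectAligner`, its method names, and the initialization scheme are assumptions made for the example. Only the 4096-dim latent size and the idea of one linear layer per subject come from the text.

```python
import numpy as np

LATENT_DIM = 4096  # shared-subject latent size stated in the text


class SubjectAligner:
    """Hypothetical sketch: one linear map per subject into a shared latent."""

    def __init__(self, voxels_per_subject, latent_dim=LATENT_DIM, seed=0):
        rng = np.random.default_rng(seed)
        self.latent_dim = latent_dim
        self.weights = {}
        self.biases = {}
        for subject, n_voxels in voxels_per_subject.items():
            # Each subject has its own input width (its native voxel count).
            scale = 1.0 / np.sqrt(n_voxels)  # assumed initialization
            self.weights[subject] = rng.normal(
                0.0, scale, size=(n_voxels, latent_dim)
            ).astype(np.float32)
            self.biases[subject] = np.zeros(latent_dim, dtype=np.float32)

    def forward(self, subject, voxels):
        """Map a (batch, n_voxels) array to (batch, latent_dim)."""
        return voxels @ self.weights[subject] + self.biases[subject]

    def l2_penalty(self, weight_decay):
        """Ridge-style L2 term on the weights (biases are left unpenalized)."""
        total = sum(float(np.sum(w ** 2)) for w in self.weights.values())
        return weight_decay * total
```

With small, made-up sizes, `SubjectAligner({"subj01": 50, "subj02": 70}, latent_dim=8)` maps a `(3, 50)` input for `subj01` and a `(5, 70)` input for `subj02` to outputs with the same trailing dimension of 8. That shared output shape is what lets the rest of the pipeline ignore which brain the input came from.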

Following this initial linear layer, the rest of the model pipeline is shared across subjects without any subject-specific mappings. The whole pipeline is trained end-to-end, and during pretraining each batch contains brain inputs from all subjects. That is, alignment to shared-subject space is not trained independently and we do not pretrain models separately for each subject; rather, we pretrain a single model, sampling equally across all the subjects except the held-out subject used for fine-tuning.
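The equal-sampling scheme can be illustrated with a small batch builder. This is a sketch under stated assumptions: the function name `make_pretraining_batch`, its arguments, and the subject labels are hypothetical, and the authors' actual data loader may differ. With 9 samples from each of 7 pretraining subjects, it yields the batch size of 63 given earlier.

```python
import random


def make_pretraining_batch(samples_by_subject, held_out, per_subject, rng):
    """Draw the same number of samples from every subject except `held_out`.

    samples_by_subject: dict mapping a subject label to a list of sample ids.
    Returns a list of (subject, sample_id) pairs.
    """
    batch = []
    for subject, sample_ids in samples_by_subject.items():
        if subject == held_out:
            continue  # the held-out subject is reserved for fine-tuning
        # Sample without replacement within each subject.
        for sample_id in rng.sample(sample_ids, per_subject):
            batch.append((subject, sample_id))
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    # Eight hypothetical subjects with 100 sample ids each.
    data = {f"subj{i:02d}": list(range(100)) for i in range(1, 9)}
    batch = make_pretraining_batch(data, held_out="subj08", per_subject=9, rng=rng)
    print(len(batch))  # 7 subjects * 9 samples = 63
```

Each pair carries its subject label so that a sample can be routed through that subject's own input layer before entering the shared part of the model.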

Two strengths of this novel functional alignment procedure are its simplicity and flexibility. Using a simple linear mapping for alignment can provide robust, generalizable performance in low-sample, high-noise settings because simple mappings are less likely to overfit to noise. Also, unlike typical functional alignment approaches that require subjects to process a shared set of images (Haxby et al., 2011), our approach has the flexibility to work even when subjects are viewing entirely unique images in the training data. This is critical for the Natural Scenes Dataset, where 90% of the seen images are unique to the subject and the 10% that were seen across subjects are relegated to the test set. Further, this approach holds advantages for subsequent data collection from a new subject, because that data collection does not need to be restricted to a predefined set of images.