REVEAL: Retinal-risk Vision-Language Early Alzheimer’s Learning
Model Description
REVEAL is a multimodal vision-language model designed to align retinal fundus imaging with individualized clinical risk factors for early prediction of Alzheimer’s disease (AD) and dementia. The model learns joint representations from retinal morphology and structured health data transformed into clinical narratives.
REVEAL leverages pretrained medical foundation models and introduces a group-aware contrastive learning (GACL) strategy to capture clinically meaningful multimodal relationships. The model is designed to support early disease risk stratification and multimodal biomarker discovery.
Model Architecture
REVEAL is composed of the following components (a minimal sketch follows the list):
- Image Encoder: RETFound, a foundation model for retinal imaging
- Text Encoder: GatorTron, a clinical language model
- Projection Layers: Trainable modules mapping image and text embeddings into a shared latent space
- Contrastive Learning Module: Group-aware contrastive learning for multimodal alignment
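The sketch below illustrates this two-tower layout in PyTorch. It is a minimal, illustrative reconstruction: the class names, encoder interfaces, and dimension arguments are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Trainable layer mapping an encoder's output into the shared latent space."""
    def __init__(self, in_dim: int, out_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so cosine similarity reduces to a plain dot product
        return F.normalize(self.proj(x), dim=-1)

class REVEALModel(nn.Module):
    """Two-tower layout: image and text encoders feed trainable projections."""
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 image_dim: int, text_dim: int, proj_dim: int = 1024):
        super().__init__()
        self.image_encoder = image_encoder  # e.g., a RETFound backbone
        self.text_encoder = text_encoder    # e.g., a GatorTron backbone
        self.image_proj = ProjectionHead(image_dim, proj_dim)
        self.text_proj = ProjectionHead(text_dim, proj_dim)

    def forward(self, images, texts):
        img_emb = self.image_proj(self.image_encoder(images))
        txt_emb = self.text_proj(self.text_encoder(texts))
        return img_emb, txt_emb
```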
The framework operates in two stages:
- Multimodal representation learning using contrastive vision-language alignment
- Downstream risk prediction using multimodal embeddings
Training Data
Dataset Source
The model was trained using multimodal data derived from the UK Biobank (https://www.ukbiobank.ac.uk/), a large population-scale biomedical dataset containing retinal imaging and clinical health variables.
Cohort Composition
The dataset includes color fundus photographs and clinical risk factor data from 39,242 participants:
- Training set: 30,462 participants
- Validation set: 3,384 participants
- Test set: 5,396 participants
Training and validation sets contained only cognitively normal participants at baseline. Individuals who developed incident AD or dementia were reserved for downstream evaluation.
Imaging Data
- Imaging modality: Color fundus photography
- Initial dataset: 136,994 retinal images
- Quality-controlled dataset: 66,251 images
Retinal morphometric features were extracted using the AutoMorph pipeline and include:
- Optic nerve head measurements (cup-to-disc ratios)
- Vascular morphology metrics
- Vessel tortuosity and fractal dimension measurements
Clinical Risk Factors
Risk factors include:
Demographic
- Age
- Sex
- Socioeconomic status
- Ethnicity
- Employment status
General Health
- BMI
- HbA1c
- Blood pressure
- Cognitive test scores
Behavioral and Psychiatric
- Depression
- Sleep deprivation
- Smoking history
- Alcohol use
- Cannabis use
Lifestyle and Social
- Physical activity
- Social engagement
- Leisure activity
Diet
- Food intake patterns
- Beverage consumption
- Nutritional indicators
Synthetic Clinical Text Generation
Structured clinical variables were converted into standardized clinical narratives using a large language model. Each participant's risk factors were mapped into a predefined clinical template so that the structured data could be paired with retinal images for vision-language training.
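As a rough illustration of the structured-to-text mapping (the actual narratives are produced by a large language model; the template wording and field names below are hypothetical):

```python
# Hypothetical template; the real pipeline uses an LLM and a richer schema.
TEMPLATE = (
    "The participant is a {age}-year-old {sex} with a BMI of {bmi:.1f} "
    "and an HbA1c of {hba1c:.1f}%. Smoking history: {smoking}. "
    "Depression: {depression}. Physical activity: {activity}."
)

def to_narrative(record: dict) -> str:
    """Render one participant's structured risk factors as clinical text."""
    return TEMPLATE.format(**record)

print(to_narrative({
    "age": 67, "sex": "female", "bmi": 27.4, "hba1c": 5.9,
    "smoking": "former smoker", "depression": "none reported",
    "activity": "moderate",
}))
```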
Training Procedure
Multimodal Representation Learning
REVEAL aligns fundus images and clinical narratives using contrastive vision-language learning. Both modalities are encoded and projected into a shared latent embedding space.
Group-Aware Contrastive Learning (GACL)
REVEAL introduces a group-aware pairing strategy that:
- Identifies subjects with similar retinal morphology
- Identifies subjects with similar clinical risk profiles
- Forms positive training pairs across similar individuals
This enables the model to learn clinically meaningful multimodal relationships rather than relying solely on one-to-one subject-level pairings; a simplified sketch of the pairing rule follows.
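In the sketch below, the similarity metric, the OR-style combination of the two criteria, and the threshold value are all illustrative assumptions; the released method may differ.

```python
import numpy as np

def group_positive_mask(retinal_feats: np.ndarray,
                        risk_feats: np.ndarray,
                        threshold: float = 0.9) -> np.ndarray:
    """Mark (i, j) as a positive pair when subjects i and j have similar
    retinal morphometry or similar clinical risk profiles."""
    def cosine_sim(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        return x @ x.T

    # Take the more similar of the two views (an assumed OR-style rule)
    sim = np.maximum(cosine_sim(retinal_feats), cosine_sim(risk_feats))
    mask = sim >= threshold
    np.fill_diagonal(mask, True)  # a subject is always its own positive
    return mask
```

A mask like this can feed directly into the multi-positive loss described next.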
Loss Function
REVEAL uses a modified contrastive loss that admits multiple positive pairs per anchor. Similarity between image and text embeddings is measured with cosine similarity.
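Below is a sketch of a cosine-similarity contrastive loss with multiple positives per anchor, in the spirit of supervised contrastive learning; the exact formulation in the paper may differ.

```python
import torch

def multi_positive_contrastive_loss(img_emb: torch.Tensor,
                                    txt_emb: torch.Tensor,
                                    pos_mask: torch.Tensor,
                                    tau: float = 0.07) -> torch.Tensor:
    """Image-to-text contrastive loss with several positives per anchor.
    img_emb, txt_emb: (N, D) L2-normalized embeddings.
    pos_mask: (N, N) boolean; True where image i and text j are positives."""
    logits = img_emb @ txt_emb.t() / tau  # cosine similarities (inputs normalized)
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos = pos_mask.float()
    # Average log-probability over each anchor's positives, then over anchors
    per_anchor = -(pos * log_prob).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return per_anchor.mean()
```

In practice a symmetric text-to-image term is typically added; the temperature here matches the reported value of 0.07.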
Hyperparameters
- Projection dimension: 1024
- Batch size: 128
- Learning rate: 2.42e-4
- Weight decay: 0.0232
- Temperature parameter: 0.07
Hyperparameters were optimized using Optuna (https://optuna.org/).
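A minimal Optuna sketch is shown below; the search spaces, trial count, and the `train_and_validate` stub are placeholders, not the actual search configuration.

```python
import optuna

def train_and_validate(lr: float, weight_decay: float) -> float:
    # Placeholder for one REVEAL training run returning a validation metric;
    # this dummy peaks near the reported values purely for illustration.
    return 1.0 / (1.0 + abs(lr - 2.42e-4) + abs(weight_decay - 0.0232))

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)
    weight_decay = trial.suggest_float("weight_decay", 1e-3, 1e-1, log=True)
    return train_and_validate(lr, weight_decay)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```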
Intended Use
Primary Use Cases
REVEAL is intended for research applications, including:
- Early risk stratification for Alzheimer’s disease and dementia
- Multimodal biomarker discovery
- Development of non-invasive screening strategies
- Population-level disease risk modeling
- Multimodal clinical representation learning
Appropriate Use
The model should be used:
- For research or exploratory clinical modeling
- With appropriate ethical and institutional review
- With external validation before use in new populations
Out-of-Scope Use
The model is not intended for:
- Direct clinical diagnosis
- Medical decision-making without clinician oversight
- Deployment as a medical device
- Use in unvalidated populations
Evaluation
REVEAL embeddings were evaluated by training downstream support vector machine (SVM) classifiers on the learned multimodal representations.
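A minimal sketch of this evaluation protocol with scikit-learn follows; the synthetic arrays stand in for REVEAL embeddings and incident-AD labels, and the SVM settings are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, balanced_accuracy_score

# Stand-in data: replace with REVEAL embeddings and incident AD/dementia labels
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 1024)), rng.integers(0, 2, size=200)
X_test, y_test = rng.normal(size=(50, 1024)), rng.integers(0, 2, size=50)

clf = SVC(kernel="rbf", class_weight="balanced", probability=True)
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]
print("AUROC:", roc_auc_score(y_test, probs))
print("Balanced accuracy:", balanced_accuracy_score(y_test, clf.predict(X_test)))
```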
Incident Alzheimer’s Disease Prediction
- AUROC: 0.658
- Balanced Accuracy: 0.610
Incident Dementia Prediction
- AUROC: 0.659
- Balanced Accuracy: 0.605
Performance reflects average results across multiple random seeds.
Limitations
- Training data are limited to the UK Biobank cohort
- Performance is sensitive to similarity threshold selection
- Incident AD and dementia cases remain relatively limited
- Synthetic clinical narrative generation may introduce bias
- Generalizability to other populations requires external validation
Ethical Considerations
- Retinal images and clinical variables contain sensitive health data
- Predictions may influence disease risk interpretation
- Model outputs should not replace clinical judgment
- Use requires adherence to privacy, regulatory, and ethical guidelines
Citation
If you use this model, please cite:
```bibtex
@article{leem2026reveal,
  title   = {REVEAL: Multimodal Vision-Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction},
  author  = {Leem, Seowung and Gu, Lin and You, Chenyu and Gong, Kuang and Fang, Ruogu},
  journal = {MIDL 2026 (Under Review)},
  year    = {2026}
}
```