COLIPRI
COLIPRI is a 3D vision–language transformer model trained to encode chest CT scans and reports.
Model description
COLIPRI was trained using tens of thousands of chest CT scans and reports, without any annotations, using multiple objectives to learn strong joint representations of 3D images and text. The procedure is described in detail in our manuscript, Comprehensive language-image pre-training for 3D medical image understanding (Wald et al. 2026).
The weights shared here correspond to our best-performing model, COLIPRI-CRM.
- Developed by: Microsoft Health Futures
- Model type: 3D vision–language encoder
- License: MIT
Uses
COLIPRI is shared for research purposes only; it is not intended for clinical use.
The encoders can be plugged into other models, or used independently or jointly for many downstream tasks, such as:
- Image classification with text prompts
- Image clustering
- Text clustering
- Text-to-image retrieval
- Image-to-image retrieval
- Image-to-text retrieval
- Text-to-text retrieval
- Image classification with a classifier
- Text classification with a classifier
- Image segmentation with a decoder
- Report generation with a language decoder
Fine-tuning COLIPRI is typically not necessary to obtain good performance on downstream tasks.
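As an illustration of the retrieval tasks above, the sketch below ranks candidate scans by cosine similarity to a query scan using only pooled image embeddings. It relies on the API demonstrated under Usage examples below; `candidate_scans` is a hypothetical list of CT images that you would provide.
>>> import torch
>>> import torch.nn.functional as F
>>> from colipri import get_model, get_processor, load_sample_ct
>>> model = get_model().cuda()
>>> processor = get_processor()
>>> def embed(ct):  # pooled, projected embedding of shape [1, 768]
...     batch = processor.to_images_batch(processor.process_images(ct))
...     with torch.no_grad():
...         return model.encode_image(batch, pool=True, project=True)
>>> query_embedding = embed(load_sample_ct())
>>> # `candidate_scans` is a placeholder for your own collection of CT images.
>>> candidate_embeddings = torch.cat([embed(ct) for ct in candidate_scans])
>>> similarities = F.cosine_similarity(query_embedding, candidate_embeddings)
>>> ranking = similarities.argsort(descending=True)  # most similar scans first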
Getting started
Installation
git clone https://huggingface.co/microsoft/colipri
pip install --quiet ./colipri
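A quick way to check the installation, mirroring the imports used in the snippets below:
>>> import colipri  # should succeed without errors after installation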
Usage examples
Below we share some usage snippets to get started with COLIPRI. A more complete Jupyter notebook is also available.
First, let's get a 3D chest CT we can use for demonstration. The plotted slices intersect a lung nodule near the heart.
>>> from colipri import load_sample_ct
>>> image = load_sample_ct()
>>> image
ScalarImage(shape: (1, 512, 512, 139); spacing: (0.76, 0.76, 2.50); orientation: LPS+; dtype: torch.IntTensor; memory: 139.0 MiB)
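The returned object appears to be a TorchIO ScalarImage (TorchIO is listed under Software below). To run the examples on your own scan, you could load it in the same form, assuming the processor accepts arbitrary ScalarImage instances; the path below is a placeholder.
>>> import torchio as tio
>>> my_image = tio.ScalarImage("my_chest_ct.nii.gz")  # hypothetical path to a NIfTI chest CT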
Now, let's instantiate the model and processor.
>>> from colipri import get_model
>>> from colipri import get_processor
>>> model = get_model().cuda()
>>> processor = get_processor()
Zero-shot classification
>>> from colipri import ZeroShotImageClassificationPipeline
>>> pipeline = ZeroShotImageClassificationPipeline(model, processor)
>>> pipeline(image, ["No lung nodules", "Lung nodules"])
[
{'score': 0.005, 'label': 'No lung nodules'},
{'score': 0.995, 'label': 'Lung nodules'}
]
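Assuming the pipeline accepts any list of text prompts and returns one score per prompt, as its signature suggests, multi-label probing works the same way; the prompts below are illustrative only.
>>> results = pipeline(image, ["No acute findings", "Lung nodules", "Pleural effusion", "Consolidation"])
>>> max(results, key=lambda result: result["score"])["label"]  # top-scoring prompt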
Feature extraction
>>> import torch
>>> preprocessed_images = processor.process_images(image)
>>> preprocessed_images[0]
ScalarImage(shape: (1, 192, 192, 192); spacing: (2.00, 2.00, 2.00); orientation: SAR+; dtype: torch.FloatTensor; memory: 27.0 MiB)
>>> images_batch = processor.to_images_batch(preprocessed_images)
>>> images_batch.shape
torch.Size([1, 1, 192, 192, 192])
>>> with torch.no_grad():
... patch_embeddings = model.encode_image(images_batch)
>>> patch_embeddings.shape
torch.Size([1, 768, 24, 24, 24])
>>> with torch.no_grad():
... pooled_embeddings = model.encode_image(images_batch, pool=True, project=True)
>>> pooled_embeddings.shape
torch.Size([1, 768])
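The pooled embeddings can also serve as fixed features for the classifier-based tasks listed under Uses. Below is a minimal linear-probe sketch; scikit-learn, the feature matrix X, and the labels y are assumptions, not part of the COLIPRI package.
>>> from sklearn.linear_model import LogisticRegression
>>> # X: pooled embeddings for a labelled set of scans, stacked into shape [N, 768];
>>> # y: the corresponding labels. Both are placeholders for your own data.
>>> classifier = LogisticRegression(max_iter=1000).fit(X, y)
>>> classifier.predict(pooled_embeddings.cpu().numpy())  # prediction for the sample CT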
Biases, risks, and limitations
COLIPRI was trained only with data from Turkey and the USA, so it might be biased towards the populations represented in the training data. Underlying biases of the training datasets may not be well characterized.
Environmental impact
- Hardware type: NVIDIA A100 GPUs
- Hours used: 72 hours × 4 GPUs = 288 GPU-hours
- Cloud provider: Azure
- Compute region: West US 2
- Carbon emitted: 21.6 kg CO₂ eq.
Compute infrastructure
COLIPRI was trained on Azure Machine Learning.
Hardware
| Stage | Node type | Num. nodes | GPU type | GPUs per node |
|---|---|---|---|---|
| Pre-training | Standard_NC96ads_A100_v4 | 1 | NVIDIA A100 (80 GB) | 4 |
| Evaluation | Standard_NC24ads_A100_v4 | 1 | NVIDIA A100 (80 GB) | 1 |
Software
The main software libraries used in this work were nnSSL for training, TorchIO for preprocessing and augmentation, nifti-zarr-py for data loading, and nnU-Net for segmentation evaluation.
Citation
BibTeX
@misc{wald2026_colipri,
title={Comprehensive language-image pre-training for 3D medical image understanding},
author={Tassilo Wald and Ibrahim Ethem Hamamci and Yuan Gao and Sam Bond-Taylor and Harshita Sharma and Maximilian Ilse and Cynthia Lo and Olesya Melnichenko and Anton Schwaighofer and Noel C. F. Codella and Maria Teodora Wetscherek and Klaus H. Maier-Hein and Panagiotis Korfiatis and Valentina Salvatelli and Javier Alvarez-Valle and Fernando P{\'e}rez-Garc{\'i}a},
year={2026},
eprint={2510.15042},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.15042},
}
APA
Wald, T., Hamamci, I. E., Gao, Y., Bond-Taylor, S., Sharma, H., Ilse, M., Lo, C., Melnichenko, O., Schwaighofer, A., Codella, N. C. F., Wetscherek, M. T., Maier-Hein, K. H., Korfiatis, P., Salvatelli, V., Alvarez-Valle, J., & Pérez-García, F. (2026). Comprehensive language-image pre-training for 3D medical image understanding. arXiv. https://doi.org/10.48550/ARXIV.2510.15042
Model card contact
Fernando Pérez-García (fperezgarcia@microsoft.com).
