Oncology FM Evaluation Framework by kaiko.ai
With this first release, eva supports performance evaluation of vision Foundation Models ("FMs") and supervised machine learning models on whole-slide image (WSI) patch-level classification tasks. Support for radiology (CT-scan) segmentation tasks will be added soon.
With eva we provide the open-source community with an easy-to-use framework that follows industry best practices to deliver a robust, reproducible and fair evaluation benchmark across FMs of different sizes and architectures.
Support for additional modalities and tasks will be added in future releases.
Use cases
1. Evaluate your own FMs on public benchmark datasets
With a specified FM as input, you can run eva on several publicly available datasets & tasks. One evaluation run downloads and preprocesses the relevant data, computes embeddings, fits and evaluates a downstream head, and reports the mean and standard deviation of the relevant performance metrics.
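As a rough illustration of this reporting convention, the sketch below aggregates per-run scores into the reported mean and standard deviation; `evaluate_once` is a hypothetical placeholder for one download → preprocess → embed → fit → evaluate cycle, not eva's actual API.

```python
import random
import statistics

def evaluate_once(seed: int) -> float:
    # Hypothetical placeholder for one "download -> preprocess -> embed ->
    # fit head -> evaluate" cycle; a simulated score keeps the sketch runnable.
    random.seed(seed)
    return 0.90 + random.uniform(-0.01, 0.01)

scores = [evaluate_once(seed) for seed in range(5)]
print(f"balanced accuracy: {statistics.mean(scores):.3f} "
      f"(±{statistics.stdev(scores):.3f})")
```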
Supported datasets & tasks include:
WSI patch-level pathology datasets
- Patch Camelyon: binary breast cancer classification
- BACH: multiclass breast cancer classification
- CRC: multiclass colorectal cancer classification
- MHIST: binary colorectal polyp cancer classification
Radiology datasets
- TotalSegmentator: CT-scan segmentation of anatomical structures (support coming soon)
To evaluate FMs, eva supports different model formats, including models trained with PyTorch, models available on HuggingFace, and ONNX models. For other formats, custom wrappers can be implemented.
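In essence, such a wrapper only needs to expose the backbone as a callable that maps an image batch to an embedding batch. Below is a minimal sketch for an ONNX model; the class name and interface are illustrative assumptions, not eva's actual wrapper API.

```python
import onnxruntime as ort
import torch

class OnnxEmbeddingModel:
    """Illustrative wrapper exposing an ONNX backbone as `images -> embeddings`."""

    def __init__(self, model_path: str):
        self.session = ort.InferenceSession(model_path)
        self.input_name = self.session.get_inputs()[0].name

    def __call__(self, images: torch.Tensor) -> torch.Tensor:
        # ONNX Runtime consumes numpy arrays; convert back to torch afterwards.
        outputs = self.session.run(None, {self.input_name: images.numpy()})
        return torch.from_numpy(outputs[0])
```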
2. Evaluate ML models on your own dataset & task
If you have your own labeled dataset, all that is needed is a dataset class tailored to your source data. Start from one of our out-of-the-box dataset classes, adapt it to your data, and run eva to see how different FMs perform on your task.
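A minimal sketch of such a dataset class, assuming a plain PyTorch `Dataset` with a CSV manifest of image paths and integer labels; the file layout and column names are placeholders for your own data, and eva's actual base class may differ.

```python
import csv
import os

from PIL import Image
from torch.utils.data import Dataset

class MyPatchDataset(Dataset):
    """Yields (patch image, label) pairs from a CSV manifest."""

    def __init__(self, root: str, manifest: str = "labels.csv", transform=None):
        self.root = root
        self.transform = transform
        with open(os.path.join(root, manifest)) as f:
            # Expected columns: relative image path, integer class label.
            self.samples = [(path, int(label)) for path, label in csv.reader(f)]

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, index: int):
        path, label = self.samples[index]
        image = Image.open(os.path.join(self.root, path)).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, label
```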
Evaluation results
We evaluated the following FMs on the four supported WSI patch-level image classification tasks. In the table below we report balanced accuracy for the binary and multiclass tasks, showing the average performance and standard deviation over 5 runs.
FM backbone | Pretraining | BACH | CRC | MHIST | PCam/val | PCam/test |
---|---|---|---|---|---|---|
DINO ViT-S16 | N/A | 0.410 (±0.009) | 0.617 (±0.008) | 0.501 (±0.004) | 0.753 (±0.002) | 0.728 (±0.003) |
DINO ViT-S16 | ImageNet | 0.695 (±0.004) | 0.935 (±0.003) | 0.831 (±0.002) | 0.864 (±0.007) | 0.849 (±0.007) |
DINO ViT-B8 | ImageNet | 0.710 (±0.007) | 0.939 (±0.001) | 0.814 (±0.003) | 0.870 (±0.003) | 0.856 (±0.004) |
Lunit - ViT-S16 | TCGA | 0.801 (±0.005) | 0.934 (±0.001) | 0.768 (±0.004) | 0.889 (±0.002) | 0.895 (±0.006) |
Owkin - iBOT ViT-B16 | TCGA | 0.725 (±0.004) | 0.935 (±0.001) | 0.777 (±0.005) | 0.912 (±0.002) | 0.915 (±0.003) |
kaiko.ai - DINO ViT-S16 | TCGA | 0.797 (±0.003) | 0.943 (±0.001) | 0.828 (±0.003) | 0.903 (±0.001) | 0.893 (±0.005) |
kaiko.ai - DINO ViT-S8 | TCGA | 0.834 (±0.012) | 0.946 (±0.002) | 0.832 (±0.006) | 0.897 (±0.001) | 0.887 (±0.002) |
kaiko.ai - DINO ViT-B16 | TCGA | 0.810 (±0.008) | 0.960 (±0.001) | 0.826 (±0.003) | 0.900 (±0.002) | 0.898 (±0.003) |
kaiko.ai - DINO ViT-B8 | TCGA | 0.865 (±0.019) | 0.956 (±0.001) | 0.809 (±0.021) | 0.913 (±0.001) | 0.921 (±0.002) |
kaiko.ai - DINOv2 ViT-L14 | TCGA | 0.870 (±0.005) | 0.930 (±0.001) | 0.809 (±0.001) | 0.908 (±0.001) | 0.898 (±0.002) |
The runs use the default setup described in the section below.
eva trains the decoder on the "train" split and uses the "validation" split for monitoring, early stopping and checkpoint selection. Evaluation results are reported on the "validation" split and, if available, on the "test" split.
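Since eva builds on PyTorch Lightning, this checkpoint selection and early stopping correspond roughly to the callbacks sketched below; the monitored metric name and the patience value are illustrative assumptions, not eva's actual configuration.

```python
from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint

# Illustrative sketch: the "val/loss" metric name is an assumption, and the
# patience is meant to approximate 5% of the maximal number of epochs.
callbacks = [
    EarlyStopping(monitor="val/loss", mode="min", patience=10),
    ModelCheckpoint(monitor="val/loss", mode="min", save_top_k=1),
]
```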
For more details on the FM-backbones and instructions to replicate the results, check out Replicate evaluations.
Evaluation setup
Note that the current version of eva implements a fixed, task- and model-independent default setup, following the standard evaluation protocol proposed by [1] and described in the table below. We selected this approach to prioritize a reliable, robust and fair FM evaluation that is in line with the common literature. In future versions we plan to support cross-validation and hyperparameter tuning to find the optimal setup and achieve the best possible performance on the implemented downstream tasks.
With a provided FM, eva computes embeddings for all input images (WSI patches), which are then used to train, for each benchmark dataset, a downstream head consisting of a single linear layer in a supervised setup. We use early stopping with a patience of 5% of the maximal number of epochs.
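Concretely, this head is nothing more than a single linear layer over the frozen backbone's embeddings, trained with the SGD configuration listed in the table below; a minimal sketch with illustrative dimensions:

```python
import torch
from torch import nn

embedding_dim, num_classes = 384, 4           # e.g. a ViT-S embedding, BACH classes
head = nn.Linear(embedding_dim, num_classes)  # the entire downstream model

optimizer = torch.optim.SGD(
    head.parameters(),
    lr=0.01,            # scaled per the table: base LR * batch / base batch
    momentum=0.9,
    nesterov=True,
    weight_decay=0.0,
)

# One illustrative step on a batch of precomputed embeddings.
embeddings = torch.randn(256, embedding_dim)
labels = torch.randint(0, num_classes, (256,))
loss = nn.functional.cross_entropy(head(embeddings), labels)
loss.backward()
optimizer.step()
```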
Parameter | Value |
---|---|
Backbone | frozen |
Hidden layers | none |
Dropout | 0.0 |
Activation function | none |
Number of steps | 12,500 |
Base batch size | 4,096 |
Batch size | dataset-specific* |
Base learning rate | 0.01 |
Learning rate | [Base learning rate] * [Batch size] / [Base batch size] |
Max epochs | [Number of steps] * [Batch size] / [Number of samples] |
Early stopping | 5% * [Max epochs] |
Optimizer | SGD |
Momentum | 0.9 |
Weight decay | 0.0 |
Nesterov momentum | true |
LR schedule | Cosine without warmup |
* For smaller datasets (e.g. BACH with 400 samples) we reduce the batch size to 256 and scale the learning rate accordingly.
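As a quick worked example of these scaling rules (a sketch; the 262,144-sample figure is the size of the PCam training split):

```python
base_lr, base_batch = 0.01, 4_096
batch, steps, num_samples = 4_096, 12_500, 262_144  # PCam-sized training set

lr = base_lr * batch / base_batch          # 0.01 (unchanged at the base batch size)
max_epochs = steps * batch / num_samples   # ~195 epochs to reach 12,500 steps
patience = 0.05 * max_epochs               # early stopping after ~10 stale epochs

# For a small dataset like BACH, the batch size drops to 256 and the LR scales:
bach_lr = base_lr * 256 / base_batch       # 0.000625
```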
License
eva is distributed under the terms of the Apache-2.0 license.
Next steps
Check out the User Guide to get started with eva.
