Skip to content

Leaderboards

We evaluated the following FMs on the 6 supported WSI-classification tasks. We report Balanced Accuracy for binary & multiclass tasks and generalized Dice score (no background) for segmentation tasks. The score shows the average performance over 5 runs.


Screenshot


Screenshot


The runs use the default setup described in the section below.

eva trains the decoder on the "train" split and uses the "validation" split for monitoring, early stopping and checkpoint selection. Evaluation results are reported on the "test" split if available and otherwise on the "validation" split.

For details on the FM-backbones and instructions to replicate the results, check out Replicate evaluations. For information on the tasks, check out Datasets.

Evaluation protocol

eva uses a task- & model-independent and fixed default set up which closely follows the standard evaluation protocol proposed by [1] (with adjustments for slide-level tasks to ensure convergence and computational efficiency).

We selected this approach to prioritize reliable, robust and fair FM-evaluation while being in line with common literature.

WSI patch-level classification tasks WSI slide-level classification tasks WSI patch-level segmentation tasks
Backbone frozen frozen frozen
Head single layer MLP ABMIL Mult-stage convolutional
Dropout 0.0 0.0 0.0
Hidden activation function n/a ReLU n/a
Output activation function none none none
Number of steps 12,500 12,500 (2) 2,000
Base batch size 4,096 (1) 32 64
Base learning rate 0.01 (1) 0.001 0.0001
Early stopping 5% * [Max epochs] 10% * [Max epochs] (3) 10% * [Max epochs] (3)
Optimizer SGD AdamW AdamW
Momentum 0.9 n/a n/a
Weight Decay 0.0 n/a n/a
betas n/a [0.9, 0.999] [0.9, 0.999]
LR Schedule Cosine without warmup Cosine without warmup PolynomialLR
Loss Cross entropy Cross entropy Dice
number of patches per slide 1 dataset specific (4) dataset specific (4)

(1) For smaller datasets (e.g. BACH with 400 samples) we reduce the batch size to 256 and scale the learning rate accordingly.

(2) Upper cap at a maximum of 100 epochs.

(3) Lower cap at a minimum of 8 epochs.

(4) Number of patches per slide depends on task and slide size. E.g. for PANDA and Camelyon16 we use a max of 1,000 and 10,000 random patches per slide respectively.