
PANDA (Prostate cANcer graDe Assessment)

The PANDA dataset consists of 10,616 whole-slide images (WSIs) of digitized H&E-stained prostate tissue biopsies originating from two medical centers. Each slide was assigned Gleason patterns (3, 4 or 5) based on the architectural growth patterns of the tumor, and these patterns are then converted into an ISUP grade on a 0-5 scale.

The Gleason grading system is the most important prognostic marker for prostate cancer and the ISUP grade has a crucial role when deciding how a patient should be treated. However, the system suffers from significant inter-observer variability between pathologists, leading to imperfect and noisy labels.
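The conversion from Gleason patterns to an ISUP grade can be sketched as a simple lookup. This is a minimal illustration of the standard ISUP grade groups, not code taken from eva. The function name `gleason_to_isup` and the string encoding of the scores (`"3+4"`, `"negative"`, `"0+0"`) are assumptions made for the example.

```python
# Standard ISUP grade groups, keyed by "primary+secondary" Gleason pattern.
# The string format of the keys is an assumption for this sketch.
GLEASON_TO_ISUP = {
    "negative": 0,  # no cancer found
    "0+0": 0,       # alternative encoding of a benign biopsy
    "3+3": 1,
    "3+4": 2,
    "4+3": 3,
    "4+4": 4,
    "3+5": 4,
    "5+3": 4,
    "4+5": 5,
    "5+4": 5,
    "5+5": 5,
}


def gleason_to_isup(gleason_score: str) -> int:
    """Map a Gleason score string to its ISUP grade (0-5)."""
    key = gleason_score.strip().lower()
    if key not in GLEASON_TO_ISUP:
        raise ValueError(f"Unknown Gleason score: {gleason_score!r}")
    return GLEASON_TO_ISUP[key]


print(gleason_to_isup("3+4"), gleason_to_isup("4+3"))
```

Note that the order of the patterns matters: 3+4 and 4+3 have the same sum but map to different ISUP grades, because the primary (dominant) pattern comes first.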


Raw data

Key stats

Modality                 Vision (WSI)
Task                     Multiclass classification (6 classes)
Cancer type              Prostate
Data size                347 GB
Image dimension          ~20k x 20k x 3
Magnification (μm/px)    20x (0.5) - Level 0
File format              .tiff
Number of images         10,616 (9,555 after removing noisy labels)


The data from Kaggle is organized as follows:

├── train_images
│   ├── 0005f7aaab2800f6170c399693a96917.tiff
│   └── ...
├── train_label_masks (not used in eva)
│   ├── 0005f7aaab2800f6170c399693a96917_mask.tiff
│   └── ...
├── train.csv (contains Gleason & ISUP labels)
├── test.csv
├── sample_submission.csv

Download and preprocessing

The PANDA dataset class doesn't download the data at runtime; the data must be downloaded manually from Kaggle.

As done in other studies¹, we exclude ~10% of the samples with noisy labels, following Kaggle's 6th place solution, resulting in a total dataset size of 9,555 WSIs.
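The exclusion step can be sketched as a filter over the rows of train.csv. This is only an illustration: the column names (`image_id`, `isup_grade`), the helper `filter_noisy`, and the example IDs are hypothetical, and the actual list of noisy samples has to come from the 6th place solution referenced above.

```python
import csv
import io


def filter_noisy(rows, noisy_ids):
    """Keep only the rows whose image_id is not flagged as noisy."""
    return [row for row in rows if row["image_id"] not in noisy_ids]


# Tiny in-memory stand-in for train.csv (column names are assumed).
sample_csv = io.StringIO(
    "image_id,isup_grade\n"
    "aaa,0\n"
    "bbb,3\n"
    "ccc,5\n"
)
rows = list(csv.DictReader(sample_csv))

# Hypothetical set of IDs flagged as having noisy labels.
clean_rows = filter_noisy(rows, noisy_ids={"bbb"})
print([row["image_id"] for row in clean_rows])
```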

We then generate random stratified train / validation / test splits using a 0.7 / 0.15 / 0.15 ratio:

Splits      Train         Validation    Test
#Samples    6686 (70%)    1430 (15%)    1439 (15%)
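A stratified split of this kind can be sketched with the standard library alone. This is not the code eva uses to generate its splits. The function `stratified_split`, the seed, and the synthetic labels are assumptions for illustration, and because the rounding happens per class, the resulting counts on the real data may differ slightly from those in the table above.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.7, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists that preserve class proportions."""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        # The test split takes whatever remains of this class.
        test += indices[n_train + n_val:]
    return train, val, test


# Synthetic example: 100 samples for each of the 6 ISUP grades.
labels = [grade for grade in range(6) for _ in range(100)]
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Splitting each class separately keeps the ISUP grade distribution roughly the same across the three splits, which matters when some grades are much rarer than others.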




1 : A General-Purpose Self-Supervised Model for Computational Pathology