PANDA (Prostate cANcer graDe Assessment)
The PANDA dataset consists of 10,616 whole-slide images of digitized H&E-stained prostate tissue biopsies originating from two medical centers. After the biopsy, the slides were classified into Gleason patterns (3, 4 or 5) based on the architectural growth patterns of the tumor, which were then converted into an ISUP grade on a 0-5 scale.
The Gleason grading system is the most important prognostic marker for prostate cancer and the ISUP grade has a crucial role when deciding how a patient should be treated. However, the system suffers from significant inter-observer variability between pathologists, leading to imperfect and noisy labels.
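The conversion from Gleason patterns to an ISUP grade can be sketched as follows. This is a minimal illustration of the standard ISUP grade-group rule, not code taken from the dataset or from eva. The function name is hypothetical, and the string format (`"3+4"`, with `"negative"` or `"0+0"` for benign tissue) is an assumption about how the Gleason score is encoded.

```python
def gleason_to_isup(gleason_score: str) -> int:
    """Map a Gleason score string such as "3+4" to an ISUP grade (0-5)."""
    score = gleason_score.strip().lower()
    # Benign biopsies: assumed to be encoded as "negative" or "0+0".
    if score in {"negative", "0+0"}:
        return 0

    # Primary (most common) and secondary (second most common) pattern.
    primary, secondary = (int(part) for part in score.split("+"))
    if primary not in (3, 4, 5) or secondary not in (3, 4, 5):
        raise ValueError(f"Unexpected Gleason score: {gleason_score!r}")

    total = primary + secondary
    if total == 6:
        return 1
    if total == 7:
        # 3+4 and 4+3 share the same sum but differ in grade.
        return 2 if primary == 3 else 3
    if total == 8:
        return 4
    return 5  # sums of 9 or 10
```

Note that the order of the two patterns matters only for a sum of 7, where a dominant pattern 4 yields the higher grade.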
Source: https://www.kaggle.com/competitions/prostate-cancer-grade-assessment
Raw data
Key stats
Modality | Vision (WSI) |
Task | Multiclass classification (6 classes) |
Cancer type | Prostate |
Data size | 347 GB |
Image dimension | ~20k x 20k x 3 |
Magnification (μm/px) | 20x (0.5) - Level 0 |
File format | .tiff |
Number of images | 10,616 (9,555 after removing noisy labels) |
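To put the image dimension and resolution above into perspective, the following back-of-the-envelope sketch estimates the uncompressed size of a single slide and the resolution at a downsampled pyramid level. The helper names are hypothetical, 8-bit RGB pixels are assumed, and the downsample factor of 4 is only an illustrative value.

```python
def uncompressed_size_gb(width: int, height: int, channels: int = 3,
                         bytes_per_channel: int = 1) -> float:
    """Size in GB (10^9 bytes) of a fully decoded image."""
    return width * height * channels * bytes_per_channel / 1e9


def mpp_at_downsample(base_mpp: float, downsample: float) -> float:
    """Microns per pixel after downsampling level 0 by the given factor."""
    return base_mpp * downsample


# A ~20k x 20k x 3 slide decoded at level 0 (assuming 8 bits per channel).
size_gb = uncompressed_size_gb(20_000, 20_000)

# Resolution when level 0 (0.5 um/px) is downsampled by an assumed factor of 4.
mpp = mpp_at_downsample(0.5, 4)
```

At roughly 1.2 GB per decoded slide, loading whole images into memory quickly becomes impractical, which is why whole-slide images are usually read region by region or at a lower resolution level.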
Organization
The data in prostate-cancer-grade-assessment.zip from Kaggle is organized as follows:
prostate-cancer-grade-assessment
├── train_images
│ ├── 0005f7aaab2800f6170c399693a96917.tiff
│ └── ...
├── train_label_masks (not used in eva)
│ ├── 0005f7aaab2800f6170c399693a96917_mask.tiff
│ └── ...
├── train.csv (contains Gleason & ISUP labels)
├── test.csv
└── sample_submission.csv
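Given the layout above, pairing each slide with its label could look like the sketch below. It is not eva's implementation: the function name is hypothetical, and the column names `image_id` and `isup_grade` are assumptions about the header of train.csv.

```python
import csv
from pathlib import Path


def load_panda_labels(root):
    """Return (image_path, isup_grade) pairs for the rows of train.csv."""
    root = Path(root)
    samples = []
    with open(root / "train.csv", newline="") as handle:
        for row in csv.DictReader(handle):
            # Column names "image_id" and "isup_grade" are assumed here.
            image_path = root / "train_images" / f"{row['image_id']}.tiff"
            samples.append((image_path, int(row["isup_grade"])))
    return samples
```

The function only builds paths and does not open the .tiff files, so it is cheap to run even on the full dataset.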
Download and preprocessing
The PANDA dataset class does not download the data at runtime, so the data must be downloaded manually from Kaggle.
As done in other studies (see reference 1), we exclude ~10% of the samples with noisy labels, following Kaggle's 6th place solution, which results in a total dataset size of 9,555 WSIs.
We then generate random stratified train, validation and test splits using a 0.7 / 0.15 / 0.15 ratio:
Splits | Train | Validation | Test |
---|---|---|---|
#Samples | 6686 (70%) | 1430 (15%) | 1439 (15%) |
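The filtering and splitting steps described above can be sketched with the standard library alone. This is an illustrative approximation rather than the procedure actually used: the function names, the seed, and the idea of passing the excluded slides as a set of IDs are all assumptions.

```python
import random
from collections import defaultdict


def drop_noisy(image_ids, noisy_ids):
    """Keep only the IDs that are not flagged as having noisy labels."""
    noisy = set(noisy_ids)
    return [image_id for image_id in image_ids if image_id not in noisy]


def stratified_split(labels, ratios=(0.7, 0.15, 0.15), seed=42):
    """Split sample indices into train/val/test, stratified by label."""
    rng = random.Random(seed)

    # Group sample indices by class so each class is split separately.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to test
    return train, val, test
```

Because the rounding is applied per class, the resulting split sizes can deviate by a few samples from the exact ratios, and this sketch is not expected to reproduce the exact counts in the table.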
Relevant links
License
References
1: A General-Purpose Self-Supervised Model for Computational Pathology