UniToPatho

UniToPatho is an annotated dataset of 9536 hematoxylin and eosin (H&E) stained patches extracted from 292 whole-slide images, meant for training deep neural networks for colorectal polyp classification and adenoma grading. The slides were acquired with a Hamamatsu Nanozoomer S210 scanner at 20x magnification (0.4415 μm/px). Each slide belongs to a different patient and is annotated by expert pathologists using the following six classes:

  • NORM - Normal tissue;
  • HP - Hyperplastic Polyp;
  • TA.HG - Tubular Adenoma, High-Grade dysplasia;
  • TA.LG - Tubular Adenoma, Low-Grade dysplasia;
  • TVA.HG - Tubulo-Villous Adenoma, High-Grade dysplasia;
  • TVA.LG - Tubulo-Villous Adenoma, Low-Grade dysplasia.
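
For reference, the six labels above can be written as a small lookup table. This is a minimal sketch: the names `UNITOPATHO_CLASSES` and `CLASS_TO_INDEX` are hypothetical, and the alphabetical index order is an assumption for illustration, not necessarily the order used by any dataset loader.

```python
# Class codes and descriptions, copied from the list above.
# The dict names and the index order are illustrative assumptions.
UNITOPATHO_CLASSES = {
    "NORM": "Normal tissue",
    "HP": "Hyperplastic Polyp",
    "TA.HG": "Tubular Adenoma, High-Grade dysplasia",
    "TA.LG": "Tubular Adenoma, Low-Grade dysplasia",
    "TVA.HG": "Tubulo-Villous Adenoma, High-Grade dysplasia",
    "TVA.LG": "Tubulo-Villous Adenoma, Low-Grade dysplasia",
}

# Assign integer targets by sorting the codes alphabetically (an assumption;
# a real loader may define a different order).
CLASS_TO_INDEX = {code: index for index, code in enumerate(sorted(UNITOPATHO_CLASSES))}
```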

For this benchmark we used only the 800 subset, which contains 8669 images of resolution 1812x1812 (the 7000 subset contains much larger images and would therefore be difficult to handle as a patch classification task).

Raw data

Key stats

Modality                              Vision (WSI patches)
Task                                  Multiclass classification (6 classes)
Cancer type                           Colorectal
Data size                             48.37 GB
Image dimension                       1812 x 1812
Magnification (μm/px)                 20x (0.4415)
Magnification after resize (μm/px)    2.47x (3.57)
File format                           png
Number of images                      8669
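
The pixel size after resizing follows from simple arithmetic on the numbers above. The sketch below assumes the patches are resized to 224 x 224 pixels; that target size is not stated on this page and is inferred only because it reproduces the 3.57 μm/px figure.

```python
ORIGINAL_SIDE_PX = 1812   # patch side length, from the key stats
ORIGINAL_MPP = 0.4415     # μm/px at 20x, from the key stats
TARGET_SIDE_PX = 224      # assumed resize target, not stated in the source

# Physical width covered by one patch, in micrometres.
field_of_view_um = ORIGINAL_SIDE_PX * ORIGINAL_MPP

# Each resized pixel covers a larger physical area.
resized_mpp = field_of_view_um / TARGET_SIDE_PX

# Equivalent magnification scales inversely with pixel size.
equivalent_magnification = 20 * ORIGINAL_MPP / resized_mpp

print(f"{field_of_view_um:.1f} um, {resized_mpp:.2f} um/px, {equivalent_magnification:.2f}x")
# prints: 800.0 um, 3.57 um/px, 2.47x
```

Under this assumption each patch spans about 800 μm, which matches the name of the 800 subset.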

Splits

The data source provides train/validation splits:

Splits      Train            Validation
#Samples    6270 (72.33%)    2399 (27.67%)

The dataset authors only provide two splits, which is why we don't report performance on a third test split.
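
As a quick sanity check, the split percentages can be recomputed from the sample counts. The variable names are illustrative only.

```python
train_samples = 6270
validation_samples = 2399
total_samples = train_samples + validation_samples  # 8669, the size of the 800 subset

train_pct = 100 * train_samples / total_samples
validation_pct = 100 * validation_samples / total_samples

print(f"train: {train_pct:.2f}%, validation: {validation_pct:.2f}%")
# prints: train: 72.33%, validation: 27.67%
```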

Organization

The UniToPatho data is organized as follows (note that we are using only the 800 subset):

unitopatho
├── 800
│   ├── test.csv
│   ├── train.csv
│   ├── HP                    # 1 folder per class 
│   ├── NORM
│   ├── TA.HG
│   ├── ...
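
Given this layout, the number of images per class can be tallied by walking the class folders. This is a sketch under stated assumptions: the function name `count_images_per_class` is hypothetical, and every sub-directory of `800` is treated as a class folder because the full folder list is elided above.

```python
from pathlib import Path


def count_images_per_class(root):
    """Count .png files in each class folder under <root>/800."""
    subset_dir = Path(root) / "800"
    return {
        class_dir.name: sum(1 for _ in class_dir.rglob("*.png"))
        for class_dir in sorted(subset_dir.iterdir())
        if class_dir.is_dir()  # skips train.csv and test.csv
    }
```

`rglob` is used rather than `glob` so that images are still counted if they sit in nested sub-folders.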

Download and preprocessing

The UniToPatho dataset class doesn't download the data at runtime; the data must be downloaded manually from the official source.
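
Because nothing is downloaded automatically, it can help to verify the local copy before training. The helper below is a hypothetical sketch, not part of the dataset class; it only checks for the two split files of the 800 subset described in the Organization section.

```python
from pathlib import Path


def check_unitopatho_root(root):
    """Return the 800-subset directory, or raise if the split files are missing."""
    subset_dir = Path(root) / "800"
    missing = [
        str(subset_dir / name)
        for name in ("train.csv", "test.csv")
        if not (subset_dir / name).is_file()
    ]
    if missing:
        raise FileNotFoundError(
            "UniToPatho data not found; download it manually from the official "
            "source. Missing: " + ", ".join(missing)
        )
    return subset_dir
```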

License

CC BY 4.0