BACH

The BACH dataset consists of microscopy and WSI images, of which we use only the microscopy images. These are 408 labeled images from 4 classes ("Normal", "Benign", "Invasive", "InSitu"). This dataset was used for the "BACH Grand Challenge on Breast Cancer Histology images".

Raw data

Key stats

| | |
|---|---|
| Modality | Vision (microscopy images) |
| Task | Multiclass classification (4 classes) |
| Cancer type | Breast |
| Data size | total: 10.4 GB / data in use: 7.37 GB (18.9 MB per image) |
| Image dimension | 1536 x 2048 x 3 |
| Magnification (μm/px) | 20x (0.42) |
| File format | .tif images |
| Number of images | 408 (102 from each class) |
| Splits in use | one labeled split |

Organization

The archive ICIAR2018_BACH_Challenge.zip from Zenodo is organized as follows:

ICIAR2018_BACH_Challenge
├── Photos                    # All labeled patches used by eva
│   ├── Normal
│   │   ├── n032.tif
│   │   └── ...
│   ├── Benign
│   │   └── ...
│   ├── Invasive
│   │   └── ...
│   ├── InSitu
│   │   └── ...
├── WSI                       # WSIs, not in use
│   ├── ...
└── ...
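
The layout above maps naturally onto a folder-per-class loader. The following is a minimal sketch, not eva's implementation: it walks the four class folders under Photos and pairs each .tif file with a class index. The function names and the class-to-index order are assumptions made for illustration; the BACH dataset class may use a different mapping.

```python
from collections import Counter
from pathlib import Path

# Class folder names as listed in the directory tree above.
# The index order is an assumption for this sketch.
CLASSES = ("Normal", "Benign", "Invasive", "InSitu")


def list_labeled_images(photos_dir):
    """Return (path, class_index) pairs for every .tif in the class folders."""
    samples = []
    for class_index, class_name in enumerate(CLASSES):
        class_dir = Path(photos_dir) / class_name
        # Sort so the sample order is reproducible across file systems.
        for image_path in sorted(class_dir.glob("*.tif")):
            samples.append((image_path, class_index))
    return samples


def count_per_class(samples):
    """Count how many images were found for each class name."""
    counts = Counter(CLASSES[class_index] for _, class_index in samples)
    return {name: counts.get(name, 0) for name in CLASSES}
```

Calling `count_per_class(list_labeled_images(".../ICIAR2018_BACH_Challenge/Photos"))` after extraction is a quick way to check that every class folder was unpacked completely.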

Download and preprocessing

The BACH dataset class supports downloading the data during runtime by setting the init argument download=True.

[!NOTE] In the provided BACH config files, the download argument is set to false. To enable automatic download, open the config and set download: true.

The splits are created from the indices specified in the BACH dataset class. These indices were chosen to prevent data leakage between splits from images that belong to the same patient. Because the dataset is small and the patient-ID constraint leaves too little data for a three-way split, we create only a train and a validation split, and leave it to the user to submit predictions on the official test split to the BACH Challenge Leaderboard.
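
To illustrate the leakage constraint, here is a minimal sketch of a patient-aware split. It is not how eva builds its splits: eva uses fixed indices stored in the dataset class, whereas this example assigns whole patients to one side at random. The function name, the `val_fraction` and `seed` parameters, and the patient identifiers are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_patient(patient_ids, val_fraction=0.33, seed=0):
    """Split sample indices into train/val so that no patient is in both.

    patient_ids[i] is the (hypothetical) patient identifier of sample i.
    """
    # Group sample indices by patient.
    groups = defaultdict(list)
    for index, patient in enumerate(patient_ids):
        groups[patient].append(index)

    # Shuffle patients, not samples, with a fixed seed for reproducibility.
    patients = sorted(groups)
    random.Random(seed).shuffle(patients)

    target_val = val_fraction * len(patient_ids)
    train, val = [], []
    for patient in patients:
        # Fill validation with whole patients until the target size is
        # reached; every remaining patient goes to train.
        if len(val) < target_val:
            val.extend(groups[patient])
        else:
            train.extend(groups[patient])
    return sorted(train), sorted(val)


# Usage with made-up patient IDs: samples 0 and 1 share a patient, and so on.
patient_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5"]
train_indices, val_indices = split_by_patient(patient_ids)
```

Because patients are moved as whole groups, the resulting split sizes only approximate the requested fraction. With few patients the deviation can be large, which is part of why a three-way split is impractical for a dataset of this size.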

| Splits | Train | Validation |
|---|---|---|
| #Samples | 268 (67%) | 132 (33%) |

License

Attribution-NonCommercial-ShareAlike 4.0 International