CRC
The CRC-HE dataset consists of labeled patches (9 classes) from colorectal cancer (CRC) and normal tissue. We use the NCT-CRC-HE-100K dataset for training and the CRC-VAL-HE-7K dataset for validation.
NCT-CRC-HE-100K-NONORM consists of 100,000 images to which no color normalization was applied. CRC-VAL-HE-7K consists of 7,180 image patches from 50 patients, with no patient overlap with NCT-CRC-HE-100K-NONORM.
The tissue classes are: adipose (ADI), background (BACK), debris (DEB), lymphocytes (LYM), mucus (MUC), smooth muscle (MUS), normal colon mucosa (NORM), cancer-associated stroma (STR) and colorectal adenocarcinoma epithelium (TUM).
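For reference, the nine abbreviations can be collected in a small lookup table. This is a minimal sketch: it assumes integer labels follow the alphabetical order of the abbreviations (matching the "1st class", "2nd class" ordering of the folders in the archive), and the names `CRC_CLASSES` and `class_to_index` are illustrative, not part of eva's API.

```python
# Abbreviation -> full tissue class name, as listed above.
CRC_CLASSES = {
    "ADI": "adipose",
    "BACK": "background",
    "DEB": "debris",
    "LYM": "lymphocytes",
    "MUC": "mucus",
    "MUS": "smooth muscle",
    "NORM": "normal colon mucosa",
    "STR": "cancer-associated stroma",
    "TUM": "colorectal adenocarcinoma epithelium",
}

# Assumption: labels are assigned in alphabetical order of the abbreviations.
class_to_index = {abbr: i for i, abbr in enumerate(sorted(CRC_CLASSES))}
```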
Raw data
Key stats
| Property | Value |
|---|---|
| Modality | Vision (WSI patches) |
| Task | Multiclass classification (9 classes) |
| Cancer type | Colorectal |
| Data size | 11.7 GB (train), 800 MB (val) |
| Image dimension | 224 x 224 x 3 |
| Magnification (μm/px) | 20x (0.5) |
| File format | .tif images |
| Number of images | 107,180 (100k train, 7.2k val) |
| Splits in use | NCT-CRC-HE-100K (train), CRC-VAL-HE-7K (val) |
Splits
We use the splits according to the data sources:
- Train split: NCT-CRC-HE-100K
- Validation split: CRC-VAL-HE-7K
| Splits | Train | Validation |
|---|---|---|
| #Samples | 100,000 (93.3%) | 7,180 (6.7%) |
A test split is not provided. Because patient information for the training data is not available, dividing the training data into a train/val split (and using the provided validation split as a test split) is not possible without risking data leakage. eva therefore reports evaluation results for CRC-HE on the validation split.
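The split proportions in the table above follow directly from the two sample counts. As a quick sanity check (variable names are illustrative):

```python
n_train = 100_000  # NCT-CRC-HE-100K
n_val = 7_180      # CRC-VAL-HE-7K
n_total = n_train + n_val

# Percentage of all patches in each split, rounded to one decimal place.
train_pct = round(100 * n_train / n_total, 1)
val_pct = round(100 * n_val / n_total, 1)

print(f"total={n_total}, train={train_pct}%, val={val_pct}%")
# prints: total=107180, train=93.3%, val=6.7%
```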
Organization
The archives NCT-CRC-HE-100K.zip, NCT-CRC-HE-100K-NONORM.zip and CRC-VAL-HE-7K.zip from Zenodo are organized as follows:
NCT-CRC-HE-100K # All images used for training
├── ADI # All labeled patches belonging to the 1st class
│ ├── ADI-AAAFLCLY.tif
│ ├── ...
├── BACK # All labeled patches belonging to the 2nd class
│ ├── ...
└── ...
NCT-CRC-HE-100K-NONORM # All images used for training
├── ADI # All labeled patches belonging to the 1st class
│ ├── ADI-AAAFLCLY.tif
│ ├── ...
├── BACK # All labeled patches belonging to the 2nd class
│ ├── ...
└── ...
CRC-VAL-HE-7K # All images used for validation
├── ... # identical structure as for NCT-CRC-HE-100K-NONORM
└── ...
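Given this layout, where each class has its own folder of .tif files, the samples of one split can be enumerated with the standard library alone. The sketch below is not eva's implementation: `list_samples` is a hypothetical helper, and it assumes labels are assigned by sorting the class folder names.

```python
from pathlib import Path


def list_samples(split_dir):
    """Return (image_path, label_index) pairs for one extracted split folder."""
    root = Path(split_dir)
    # Each sub-folder is one tissue class; sort for a stable label order.
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    class_to_index = {name: i for i, name in enumerate(class_names)}

    samples = []
    for name in class_names:
        # Collect every .tif patch inside the class folder.
        for image_path in sorted((root / name).glob("*.tif")):
            samples.append((image_path, class_to_index[name]))
    return samples
```

Calling `list_samples` on the extracted NCT-CRC-HE-100K folder should then yield one entry per patch, with ADI patches labeled 0, BACK patches labeled 1, and so on.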
Download and preprocessing
The CRC dataset class supports downloading the data during runtime by setting the init argument download=True.
[!NOTE] In the provided CRC-config files the download argument is set to false. To enable automatic download, open the config and set download: true.
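The effect of a download flag can be illustrated with a generic pattern: use data that is already on disk, and fetch it only when explicitly allowed. The sketch below is an assumption about the general behavior, not eva's actual code; `prepare_data` and its `fetch` argument are hypothetical names.

```python
from pathlib import Path


def prepare_data(root, download=False, fetch=None):
    """Make sure the dataset folder exists, fetching it only when allowed.

    `fetch` is a hypothetical callable that downloads and extracts the
    archives into `root`; it stands in for whatever the dataset class does.
    """
    root = Path(root)
    if root.is_dir() and any(root.iterdir()):
        return root  # data already present, nothing to do
    if not download:
        raise FileNotFoundError(
            f"No data found in {root}; set download=True to fetch it."
        )
    if fetch is None:
        raise ValueError("download=True requires a fetch callable")
    root.mkdir(parents=True, exist_ok=True)
    fetch(root)
    return root
```

With `download=False` (the default in this sketch) a missing folder raises an error instead of silently starting a multi-gigabyte download.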