
Datasets

Reference information for the Dataset base class.

eva.core.data.Dataset

Bases: TorchDataset

Base dataset class.

prepare_data

Encapsulates all disk related tasks.

This method is preferred for downloading and preparing the data, for example generating manifest files. If implemented, it will be called via :class:`eva.core.data.datamodules.DataModule`, which ensures that it is called only within a single process, making it multi-process safe.

Source code in src/eva/core/data/datasets/base.py
def prepare_data(self) -> None:
    """Encapsulates all disk related tasks.

    This method is preferred for downloading and preparing the data, for
    example generating manifest files. If implemented, it will be called via
    :class:`eva.core.data.datamodules.DataModule`, which ensures that it is
    called only within a single process, making it multi-process safe.
    """

setup

Sets up the dataset.

This method is preferred for creating datasets or performing train/val/test splits. If implemented, it will be called via :class:`eva.core.data.datamodules.DataModule` at the beginning of fit (train + validate), validate, test, or predict, and it will be called from every process (i.e. GPU) across all the nodes in DDP.

Source code in src/eva/core/data/datasets/base.py
def setup(self) -> None:
    """Setups the dataset.

    This method is preferred for creating datasets or performing
    train/val/test splits. If implemented, it will be called via
    :class:`eva.core.data.datamodules.DataModule` at the beginning of fit
    (train + validate), validate, test, or predict and it will be called
    from every process (i.e. GPU) across all the nodes in DDP.
    """
    self.configure()
    self.validate()

configure

Configures the dataset.

This method is preferred for configuring the dataset: assigning values to attributes, performing splits, etc. It is called from :meth:`setup`, before :meth:`validate`.

Source code in src/eva/core/data/datasets/base.py
def configure(self):
    """Configures the dataset.

    This method is preferred for configuring the dataset: assigning
    values to attributes, performing splits, etc. It is called from
    :meth:`setup`, before :meth:`validate`.
    """

validate

Validates the dataset.

This method aims to check the integrity of the dataset and verify that it is configured properly. It is called from :meth:`setup`, after :meth:`configure`.

Source code in src/eva/core/data/datasets/base.py
def validate(self):
    """Validates the dataset.

    This method aims to check the integrity of the dataset and verify
    that it is configured properly. It is called from :meth:`setup`,
    after :meth:`configure`.
    """

teardown

Cleans up the data artifacts.

Used to clean up when the run is finished. If implemented, it will be called via :class:`eva.core.data.datamodules.DataModule` at the end of fit (train + validate), validate, test, or predict, and it will be called from every process (i.e. GPU) across all the nodes in DDP.

Source code in src/eva/core/data/datasets/base.py
def teardown(self) -> None:
    """Cleans up the data artifacts.

    Used to clean up when the run is finished. If implemented, it will
    be called via :class:`eva.core.data.datamodules.DataModule` at the end
    of fit (train + validate), validate, test, or predict, and it will be
    called from every process (i.e. GPU) across all the nodes in DDP.
    """

Embeddings datasets

eva.core.data.datasets.EmbeddingsClassificationDataset

Bases: EmbeddingsDataset[Tensor]

Embeddings dataset class for classification tasks.

Expects a manifest file listing the paths of .pt files that contain tensor embeddings of shape [embedding_dim] or [1, embedding_dim].

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `root` | `str` | Root directory of the dataset. | *required* |
| `manifest_file` | `str` | The path to the manifest file, relative to the `root` argument. | *required* |
| `split` | `Literal['train', 'val', 'test'] \| None` | The dataset split to use; the manifest file is filtered based on its `split` column. | `None` |
| `column_mapping` | `Dict[str, str]` | Defines the mapping between the variables and the manifest columns. It overrides the `default_column_mapping` with the provided values, so `column_mapping` only needs to contain the values which are altered or missing. | `default_column_mapping` |
| `embeddings_transforms` | `Callable \| None` | A function/transform that transforms the embedding. | `None` |
| `target_transforms` | `Callable \| None` | A function/transform that transforms the target. | `None` |
Source code in src/eva/core/data/datasets/embeddings.py
def __init__(
    self,
    root: str,
    manifest_file: str,
    split: Literal["train", "val", "test"] | None = None,
    column_mapping: Dict[str, str] = default_column_mapping,
    embeddings_transforms: Callable | None = None,
    target_transforms: Callable | None = None,
) -> None:
    """Initialize dataset.

    Expects a manifest file listing the paths of .pt files that contain
    tensor embeddings of shape [embedding_dim] or [1, embedding_dim].

    Args:
        root: Root directory of the dataset.
        manifest_file: The path to the manifest file, which is relative to
            the `root` argument.
        split: The dataset split to use. The manifest file will be
            filtered based on its `split` column.
        column_mapping: Defines the map between the variables and the manifest
            columns. It will overwrite the `default_column_mapping` with
            the provided values, so that `column_mapping` can contain only the
            values which are altered or missing.
        embeddings_transforms: A function/transform that transforms the embedding.
        target_transforms: A function/transform that transforms the target.
    """
    super().__init__()

    self._root = root
    self._manifest_file = manifest_file
    self._split = split
    self._column_mapping = default_column_mapping | column_mapping
    self._embeddings_transforms = embeddings_transforms
    self._target_transforms = target_transforms

    self._data: pd.DataFrame

    self._set_multiprocessing_start_method()
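The constructor's manifest handling can be sketched with the standard library: rows are filtered by the requested split, and `column_mapping` is merged over the defaults with the dict-union operator (as in `default_column_mapping | column_mapping` above). The column names and default mapping below are illustrative assumptions, not the library's actual `default_column_mapping`:

```python
import csv
import io

# Hypothetical defaults; the library's actual default_column_mapping may differ.
default_column_mapping = {"path": "embeddings", "target": "target", "split": "split"}

# Only the overridden keys need to be supplied; dict union keeps the rest.
column_mapping = default_column_mapping | {"target": "label"}

manifest_csv = """embeddings,label,split
emb/0001.pt,0,train
emb/0002.pt,1,train
emb/0003.pt,1,val
"""

rows = list(csv.DictReader(io.StringIO(manifest_csv)))
# Keep only the rows whose `split` column matches the requested split.
train_rows = [row for row in rows if row[column_mapping["split"]] == "train"]
paths = [row[column_mapping["path"]] for row in train_rows]
targets = [int(row[column_mapping["target"]]) for row in train_rows]
```

With `split=None`, no filtering would take place and all rows would be used.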

eva.core.data.datasets.MultiEmbeddingsClassificationDataset

Bases: EmbeddingsDataset[Tensor]

Dataset class where a sample corresponds to multiple embeddings.

Example use case: Slide level dataset where each slide has multiple patch embeddings.

Expects a manifest file listing the paths of .pt files containing tensor embeddings.

The manifest must have a `column_mapping["multi_id"]` column that contains the unique identifier of each group of embeddings. For oncology datasets, this would usually be the slide id. Each row in the manifest file points to a .pt file that can contain one or multiple embeddings (either as a list or stacked tensors). There can also be multiple rows for the same multi_id, in which case the embeddings from the different .pt files corresponding to that same multi_id will be stacked along the first dimension.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `root` | `str` | Root directory of the dataset. | *required* |
| `manifest_file` | `str` | The path to the manifest file, relative to the `root` argument. | *required* |
| `split` | `Literal['train', 'val', 'test']` | The dataset split to use; the manifest file is filtered based on its `split` column. | *required* |
| `column_mapping` | `Dict[str, str]` | Defines the mapping between the variables and the manifest columns. It overrides the `default_column_mapping` with the provided values, so `column_mapping` only needs to contain the values which are altered or missing. | `default_column_mapping` |
| `embeddings_transforms` | `Callable \| None` | A function/transform that transforms the embedding. | `None` |
| `target_transforms` | `Callable \| None` | A function/transform that transforms the target. | `None` |
Source code in src/eva/core/data/datasets/classification/multi_embeddings.py
def __init__(
    self,
    root: str,
    manifest_file: str,
    split: Literal["train", "val", "test"],
    column_mapping: Dict[str, str] = embeddings_base.default_column_mapping,
    embeddings_transforms: Callable | None = None,
    target_transforms: Callable | None = None,
):
    """Initialize dataset.

    Expects a manifest file listing the paths of `.pt` files containing tensor embeddings.

    The manifest must have a `column_mapping["multi_id"]` column that contains the
    unique identifier of each group of embeddings. For oncology datasets, this
    would usually be the slide id. Each row in the manifest file points to a .pt
    file that can contain
    one or multiple embeddings (either as a list or stacked tensors). There can also be
    multiple rows for the same `multi_id`, in which case the embeddings from the different
    .pt files corresponding to that same `multi_id` will be stacked along the first dimension.

    Args:
        root: Root directory of the dataset.
        manifest_file: The path to the manifest file, which is relative to
            the `root` argument.
        split: The dataset split to use. The manifest file will be
            filtered based on its `split` column.
        column_mapping: Defines the map between the variables and the manifest
            columns. It will overwrite the `default_column_mapping` with
            the provided values, so that `column_mapping` can contain only the
            values which are altered or missing.
        embeddings_transforms: A function/transform that transforms the embedding.
        target_transforms: A function/transform that transforms the target.
    """
    super().__init__(
        manifest_file=manifest_file,
        root=root,
        split=split,
        column_mapping=column_mapping,
        embeddings_transforms=embeddings_transforms,
        target_transforms=target_transforms,
    )

    self._multi_ids: List[int]
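The stacking behaviour described above, where multiple manifest rows sharing a `multi_id` are concatenated along the first dimension, can be sketched with plain lists standing in for the `.pt` tensors; this illustrates the grouping logic only, not the library's implementation:

```python
from collections import defaultdict

# Each manifest row: (multi_id, embeddings loaded from one .pt file).
# Plain nested lists stand in for torch tensors of shape [n, embedding_dim].
rows = [
    ("slide_A", [[0.1, 0.2]]),               # one embedding in the file
    ("slide_A", [[0.3, 0.4], [0.5, 0.6]]),   # two embeddings in one file
    ("slide_B", [[0.7, 0.8]]),
]

grouped: dict[str, list[list[float]]] = defaultdict(list)
for multi_id, embeddings in rows:
    # Embeddings from different files with the same multi_id are
    # concatenated ("stacked") along the first dimension.
    grouped[multi_id].extend(embeddings)
```

Here `slide_A` ends up with three embeddings drawn from two files, which is exactly the slide-level use case: one sample per slide, built from all of its patch embeddings.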