
Datasets

Reference information for the Dataset base class.

eva.core.data.Dataset

Bases: TorchDataset

Base dataset class.

prepare_data

Encapsulates all disk related tasks.

This method is preferred for downloading and preparing the data, for example generating manifest files. If implemented, it will be called via :class:`eva.core.data.datamodules.DataModule`, which ensures that it is called only within a single process, making it multi-process safe.

Source code in src/eva/core/data/datasets/base.py
def prepare_data(self) -> None:
    """Encapsulates all disk related tasks.

    This method is preferred for downloading and preparing the data, for
    example generating manifest files. If implemented, it will be called via
    :class:`eva.core.data.datamodules.DataModule`, which ensures that it is
    called only within a single process, making it multi-process safe.
    """

setup

Sets up the dataset.

This method is preferred for creating datasets or performing train/val/test splits. If implemented, it will be called via :class:`eva.core.data.datamodules.DataModule` at the beginning of fit (train + validate), validate, test, or predict, and it will be called from every process (i.e. GPU) across all the nodes in DDP.

Source code in src/eva/core/data/datasets/base.py
def setup(self) -> None:
    """Setups the dataset.

    This method is preferred for creating datasets or performing
    train/val/test splits. If implemented, it will be called via
    :class:`eva.core.data.datamodules.DataModule` at the beginning of fit
    (train + validate), validate, test, or predict and it will be called
    from every process (i.e. GPU) across all the nodes in DDP.
    """
    self.configure()
    self.validate()

configure

Configures the dataset.

This method is preferred for configuring the dataset: assigning values to attributes, performing splits, etc. It is called from :meth:`setup`, before :meth:`validate`.

Source code in src/eva/core/data/datasets/base.py
def configure(self):
    """Configures the dataset.

    This method is preferred for configuring the dataset: assigning
    values to attributes, performing splits, etc. It is called from
    :meth:`setup`, before :meth:`validate`.
    """

validate

Validates the dataset.

This method aims to check the integrity of the dataset and verify that it is configured properly. It is called from :meth:`setup`, after :meth:`configure`.

Source code in src/eva/core/data/datasets/base.py
def validate(self):
    """Validates the dataset.

    This method aims to check the integrity of the dataset and verify
    that it is configured properly. It is called from :meth:`setup`,
    after :meth:`configure`.
    """

teardown

Cleans up the data artifacts.

Used to clean up when the run is finished. If implemented, it will be called via :class:`eva.core.data.datamodules.DataModule` at the end of fit (train + validate), validate, test, or predict, and it will be called from every process (i.e. GPU) across all the nodes in DDP.

Source code in src/eva/core/data/datasets/base.py
def teardown(self) -> None:
    """Cleans up the data artifacts.

    Used to clean up when the run is finished. If implemented, it will
    be called via :class:`eva.core.data.datamodules.DataModule` at the end
    of fit (train + validate), validate, test, or predict, and it will be
    called from every process (i.e. GPU) across all the nodes in DDP.
    """

Embeddings datasets

eva.core.data.datasets.EmbeddingsClassificationDataset

Bases: EmbeddingsDataset[Tensor]

Embeddings dataset class for classification tasks.

Expects a manifest file listing the paths of .pt files that contain tensor embeddings of shape [embedding_dim] or [1, embedding_dim].

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `root` | `str` | Root directory of the dataset. | *required* |
| `manifest_file` | `str` | The path to the manifest file, relative to the `root` argument. | *required* |
| `split` | `Literal['train', 'val', 'test'] \| None` | The dataset split to use; the manifest file is filtered based on its `split` column. | `None` |
| `column_mapping` | `Dict[str, str]` | Defines the mapping between the variables and the manifest columns. It overrides the `default_column_mapping` with the provided values, so `column_mapping` only needs to contain the values which are altered or missing. | `default_column_mapping` |
| `embeddings_transforms` | `Callable \| None` | A function/transform that transforms the embedding. | `None` |
| `target_transforms` | `Callable \| None` | A function/transform that transforms the target. | `None` |
Source code in src/eva/core/data/datasets/embeddings.py
def __init__(
    self,
    root: str,
    manifest_file: str,
    split: Literal["train", "val", "test"] | None = None,
    column_mapping: Dict[str, str] = default_column_mapping,
    embeddings_transforms: Callable | None = None,
    target_transforms: Callable | None = None,
) -> None:
    """Initialize dataset.

    Expects a manifest file listing the paths of .pt files that contain
    tensor embeddings of shape [embedding_dim] or [1, embedding_dim].

    Args:
        root: Root directory of the dataset.
        manifest_file: The path to the manifest file, which is relative to
            the `root` argument.
        split: The dataset split to use. The manifest file will be
            filtered based on its `split` column.
        column_mapping: Defines the map between the variables and the manifest
            columns. It will overwrite the `default_column_mapping` with
            the provided values, so that `column_mapping` can contain only the
            values which are altered or missing.
        embeddings_transforms: A function/transform that transforms the embedding.
        target_transforms: A function/transform that transforms the target.
    """
    super().__init__()

    self._root = root
    self._manifest_file = manifest_file
    self._split = split
    self._column_mapping = default_column_mapping | column_mapping
    self._embeddings_transforms = embeddings_transforms
    self._target_transforms = target_transforms

    self._data: pd.DataFrame

    self._set_multiprocessing_start_method()
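The constructor's manifest handling can be sketched with the standard library: rows are filtered by the requested split, and `column_mapping` is merged over the defaults with the dict-union operator (as in `default_column_mapping | column_mapping` above). The column names and default mapping below are illustrative assumptions, not the library's actual `default_column_mapping`:

```python
import csv
import io

# Hypothetical defaults; the library's actual default_column_mapping may differ.
default_column_mapping = {"path": "embeddings", "target": "target", "split": "split"}

# Only the overridden keys need to be supplied; dict union keeps the rest.
column_mapping = default_column_mapping | {"target": "label"}

manifest_csv = """embeddings,label,split
emb/0001.pt,0,train
emb/0002.pt,1,train
emb/0003.pt,1,val
"""

rows = list(csv.DictReader(io.StringIO(manifest_csv)))
# Keep only the rows whose `split` column matches the requested split.
train_rows = [row for row in rows if row[column_mapping["split"]] == "train"]
paths = [row[column_mapping["path"]] for row in train_rows]
targets = [int(row[column_mapping["target"]]) for row in train_rows]
```

With `split=None`, no filtering would take place and all rows would be used.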

eva.core.data.datasets.MultiEmbeddingsClassificationDataset

Bases: EmbeddingsDataset[Tensor]

Dataset class where a sample corresponds to multiple embeddings.

Example use case: Slide level dataset where each slide has multiple patch embeddings.

Expects a manifest file listing the paths of .pt files containing tensor embeddings.

The manifest must have a `column_mapping["multi_id"]` column that contains the unique identifier of each group of embeddings. For oncology datasets, this would usually be the slide id. Each row in the manifest file points to a .pt file that can contain one or multiple embeddings (either as a list or stacked tensors). There can also be multiple rows for the same multi_id, in which case the embeddings from the different .pt files corresponding to that same multi_id will be stacked along the first dimension.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `root` | `str` | Root directory of the dataset. | *required* |
| `manifest_file` | `str` | The path to the manifest file, relative to the `root` argument. | *required* |
| `split` | `Literal['train', 'val', 'test']` | The dataset split to use; the manifest file is filtered based on its `split` column. | *required* |
| `column_mapping` | `Dict[str, str]` | Defines the mapping between the variables and the manifest columns. It overrides the `default_column_mapping` with the provided values, so `column_mapping` only needs to contain the values which are altered or missing. | `default_column_mapping` |
| `embeddings_transforms` | `Callable \| None` | A function/transform that transforms the embedding. | `None` |
| `target_transforms` | `Callable \| None` | A function/transform that transforms the target. | `None` |
Source code in src/eva/core/data/datasets/classification/multi_embeddings.py
def __init__(
    self,
    root: str,
    manifest_file: str,
    split: Literal["train", "val", "test"],
    column_mapping: Dict[str, str] = embeddings_base.default_column_mapping,
    embeddings_transforms: Callable | None = None,
    target_transforms: Callable | None = None,
):
    """Initialize dataset.

    Expects a manifest file listing the paths of `.pt` files containing tensor embeddings.

    The manifest must have a `column_mapping["multi_id"]` column that contains the
    unique identifier of each group of embeddings. For oncology datasets, this
    would usually be the slide id. Each row in the manifest file points to a .pt
    file that can contain
    one or multiple embeddings (either as a list or stacked tensors). There can also be
    multiple rows for the same `multi_id`, in which case the embeddings from the different
    .pt files corresponding to that same `multi_id` will be stacked along the first dimension.

    Args:
        root: Root directory of the dataset.
        manifest_file: The path to the manifest file, which is relative to
            the `root` argument.
        split: The dataset split to use. The manifest file will be
            filtered based on its `split` column.
        column_mapping: Defines the map between the variables and the manifest
            columns. It will overwrite the `default_column_mapping` with
            the provided values, so that `column_mapping` can contain only the
            values which are altered or missing.
        embeddings_transforms: A function/transform that transforms the embedding.
        target_transforms: A function/transform that transforms the target.
    """
    super().__init__(
        manifest_file=manifest_file,
        root=root,
        split=split,
        column_mapping=column_mapping,
        embeddings_transforms=embeddings_transforms,
        target_transforms=target_transforms,
    )

    self._multi_ids: List[int]
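The stacking behaviour described above, where multiple manifest rows sharing a `multi_id` are concatenated along the first dimension, can be sketched with plain lists standing in for the `.pt` tensors; this illustrates the grouping logic only, not the library's implementation:

```python
from collections import defaultdict

# Each manifest row: (multi_id, embeddings loaded from one .pt file).
# Plain nested lists stand in for torch tensors of shape [n, embedding_dim].
rows = [
    ("slide_A", [[0.1, 0.2]]),               # one embedding in the file
    ("slide_A", [[0.3, 0.4], [0.5, 0.6]]),   # two embeddings in one file
    ("slide_B", [[0.7, 0.8]]),
]

grouped: dict[str, list[list[float]]] = defaultdict(list)
for multi_id, embeddings in rows:
    # Embeddings from different files with the same multi_id are
    # concatenated ("stacked") along the first dimension.
    grouped[multi_id].extend(embeddings)
```

Here `slide_A` ends up with three embeddings drawn from two files, which is exactly the slide-level use case: one sample per slide, built from all of its patch embeddings.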