Datasets

Reference information for the Dataset base class.

eva.data.Dataset

Bases: TorchDataset

Base dataset class.

prepare_data

Encapsulates all disk related tasks.

This method is the preferred place for downloading and preparing the data, for example generating manifest files. If implemented, it is called via eva.core.data.datamodules.DataModule, which ensures that it runs within a single process only, making it safe to use in multi-process settings.

Source code in src/eva/core/data/datasets/base.py
def prepare_data(self) -> None:
    """Encapsulates all disk related tasks.

    This method is preferred for downloading and preparing the data, for
    example generate manifest files. If implemented, it will be called via
    :class:`eva.core.data.datamodules.DataModule`, which ensures that is called
    only within a single process, making it multi-processes safe.
    """
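
As a minimal sketch of the kind of disk work that belongs in prepare_data, the stand-in class below writes a manifest file once and skips the work when the file is already on disk. ManifestDataset, the CSV layout, and the file names are assumptions made for illustration; they are not part of eva, and a real dataset would subclass eva.data.Dataset instead.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List


class ManifestDataset:
    """Hypothetical stand-in for a Dataset subclass (not part of eva)."""

    def __init__(self, root: str, manifest_file: str = "manifest.csv") -> None:
        self._root = Path(root)
        self._manifest_path = self._root / manifest_file

    def prepare_data(self) -> None:
        """Writes the manifest file unless it is already on disk."""
        if self._manifest_path.exists():
            return  # Already prepared; repeated calls do no extra work.
        self._root.mkdir(parents=True, exist_ok=True)
        # Illustrative rows only; the column names are assumptions.
        rows = [
            {"embeddings": "sample_0.pt", "target": 0, "split": "train"},
            {"embeddings": "sample_1.pt", "target": 1, "split": "val"},
        ]
        with self._manifest_path.open("w", newline="") as handle:
            writer = csv.DictWriter(
                handle, fieldnames=["embeddings", "target", "split"]
            )
            writer.writeheader()
            writer.writerows(rows)


def demo_prepare_data() -> List[Dict[str, str]]:
    """Runs prepare_data twice in a temporary directory and reads the result."""
    with tempfile.TemporaryDirectory() as root:
        dataset = ManifestDataset(root)
        dataset.prepare_data()
        dataset.prepare_data()  # The second call returns early.
        with (Path(root) / "manifest.csv").open(newline="") as handle:
            return list(csv.DictReader(handle))
```

Making the method idempotent is a design choice of this sketch: it keeps repeated runs cheap, while the single-process guarantee described above comes from the data module, not from the dataset itself.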

setup

Sets up the dataset.

This method is the preferred place for creating datasets or performing train/val/test splits. If implemented, it is called via eva.core.data.datamodules.DataModule at the beginning of fit (train + validate), validate, test, or predict, and it is called from every process (i.e. GPU) across all the nodes in DDP.

Source code in src/eva/core/data/datasets/base.py
def setup(self) -> None:
    """Setups the dataset.

    This method is preferred for creating datasets or performing
    train/val/test splits. If implemented, it will be called via
    :class:`eva.core.data.datamodules.DataModule` at the beginning of fit
    (train + validate), validate, test, or predict and it will be called
    from every process (i.e. GPU) across all the nodes in DDP.
    """
    self.configure()
    self.validate()
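
The call order inside setup can be illustrated with a small stand-in. SketchDataset and RecordingDataset are hypothetical names that only mirror the hook layout shown in the listing above; they do not import or extend eva.

```python
from typing import List


class SketchDataset:
    """Hypothetical stand-in mirroring the hook layout of the base class."""

    def setup(self) -> None:
        # Same order as in the listing above: configure first, then validate.
        self.configure()
        self.validate()

    def configure(self) -> None:
        """No-op by default; subclasses override."""

    def validate(self) -> None:
        """No-op by default; subclasses override."""


class RecordingDataset(SketchDataset):
    """Records which hooks ran, and in which order."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def configure(self) -> None:
        self.calls.append("configure")

    def validate(self) -> None:
        self.calls.append("validate")


dataset = RecordingDataset()
dataset.setup()
print(dataset.calls)  # ['configure', 'validate']
```

A subclass that overrides only configure and validate therefore keeps this ordering without having to override setup.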

configure

Configures the dataset.

This method is the preferred place to configure the dataset: assign values to attributes, perform splits, etc. It is called from the setup method, before validate is called.

Source code in src/eva/core/data/datasets/base.py
def configure(self):
    """Configures the dataset.

    This method is preferred to configure the dataset; assign values
    to attributes, perform splits etc. This would be called from the
    method ::method::`setup`, before calling the ::method::`validate`.
    """

validate

Validates the dataset.

This method checks the integrity of the dataset and verifies that it is configured properly. It is called from the setup method, after configure has been called.

Source code in src/eva/core/data/datasets/base.py
def validate(self):
    """Validates the dataset.

    This method aims to check the integrity of the dataset and verify
    that is configured properly. This would be called from the method
    ::method::`setup`, after calling the ::method::`configure`.
    """
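
A minimal sketch of how configure and validate might divide the work, assuming a dataset that is built from in-memory manifest rows. SplitDataset, its constructor arguments, and the choice of ValueError are illustrative assumptions, not the behaviour of any eva class.

```python
from typing import Dict, List, Optional


class SplitDataset:
    """Hypothetical stand-in; a real dataset would subclass eva.data.Dataset."""

    def __init__(
        self, rows: List[Dict[str, str]], split: Optional[str] = None
    ) -> None:
        self._rows = rows
        self._split = split
        self._samples: List[Dict[str, str]] = []

    def setup(self) -> None:
        self.configure()
        self.validate()

    def configure(self) -> None:
        """Keeps only the rows of the requested split (all rows if None)."""
        self._samples = [
            row
            for row in self._rows
            if self._split is None or row["split"] == self._split
        ]

    def validate(self) -> None:
        """Fails early if configuration produced an unusable dataset."""
        if not self._samples:
            raise ValueError(f"No samples found for split {self._split!r}.")


rows = [
    {"embeddings": "sample_0.pt", "target": "0", "split": "train"},
    {"embeddings": "sample_1.pt", "target": "1", "split": "val"},
]
train = SplitDataset(rows, split="train")
train.setup()  # configure selects one row, validate passes.
```

Keeping the check in validate means a misconfigured split surfaces during setup rather than at the first sample access.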

teardown

Cleans up the data artifacts.

Used to clean up when the run is finished. If implemented, it is called via eva.core.data.datamodules.DataModule at the end of fit (train + validate), validate, test, or predict, and it is called from every process (i.e. GPU) across all the nodes in DDP.

Source code in src/eva/core/data/datasets/base.py
def teardown(self) -> None:
    """Cleans up the data artifacts.

    Used to clean-up when the run is finished. If implemented, it will
    be called via :class:`eva.core.data.datamodules.DataModule` at the end
    of fit (train + validate), validate, test, or predict and it will be
    called from every process (i.e. GPU) across all the nodes in DDP.
    """
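
One possible use of teardown is removing per-process scratch data that setup created. The sketch below assumes such a scratch directory exists; CachedDataset and its attribute names are hypothetical and not part of eva.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


class CachedDataset:
    """Hypothetical stand-in; a real dataset would subclass eva.data.Dataset."""

    def __init__(self) -> None:
        self._cache_dir: Optional[Path] = None

    def setup(self) -> None:
        """Creates a scratch directory for the current process."""
        self._cache_dir = Path(tempfile.mkdtemp(prefix="dataset_cache_"))

    def teardown(self) -> None:
        """Removes the scratch directory; safe to call more than once."""
        if self._cache_dir is not None:
            shutil.rmtree(self._cache_dir, ignore_errors=True)
            self._cache_dir = None
```

Because teardown runs in every process, the sketch cleans up only what the same process created in setup.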

Embeddings datasets

eva.core.data.datasets.EmbeddingsClassificationDataset

Bases: Dataset

Embeddings classification dataset.

Expects a manifest file listing the paths of .pt files that contain tensor embeddings of shape [embedding_dim] or [1, embedding_dim].

Parameters:

root (str, required): Root directory of the dataset.
manifest_file (str, required): The path to the manifest file, relative to the root argument.
split (str | None, default: None): The dataset split to use. The rows of the manifest file are selected by matching its split column against this value.
column_mapping (Dict[str, str], default: default_column_mapping): Defines the map between the variables and the manifest columns. It overwrites the default_column_mapping with the provided values, so column_mapping only needs to contain the entries that are altered or missing.
embeddings_transforms (Callable | None, default: None): A function/transform that transforms the embedding.
target_transforms (Callable | None, default: None): A function/transform that transforms the target.

Source code in src/eva/core/data/datasets/classification/embeddings.py
def __init__(
    self,
    root: str,
    manifest_file: str,
    split: str | None = None,
    column_mapping: Dict[str, str] = default_column_mapping,
    embeddings_transforms: Callable | None = None,
    target_transforms: Callable | None = None,
) -> None:
    """Initialize dataset.

    Expects a manifest file listing the paths of .pt files that contain
    tensor embeddings of shape [embedding_dim] or [1, embedding_dim].

    Args:
        root: Root directory of the dataset.
        manifest_file: The path to the manifest file, which is relative to
            the `root` argument.
        split: The dataset split to use. The `split` column of the manifest
            file will be splitted based on this value.
        column_mapping: Defines the map between the variables and the manifest
            columns. It will overwrite the `default_column_mapping` with
            the provided values, so that `column_mapping` can contain only the
            values which are altered or missing.
        embeddings_transforms: A function/transform that transforms the embedding.
        target_transforms: A function/transform that transforms the target.
    """
    super().__init__()

    self._root = root
    self._manifest_file = manifest_file
    self._split = split
    self._column_mapping = self.default_column_mapping | column_mapping
    self._embeddings_transforms = embeddings_transforms
    self._target_transforms = target_transforms

    self._data: pd.DataFrame

default_column_mapping: Dict[str, str] = {'data': 'embeddings', 'target': 'target', 'split': 'split'} class-attribute instance-attribute

The default column mapping of the variables to the manifest columns.
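
The way a partial column_mapping overrides the defaults follows from the dict union used in the constructor (self.default_column_mapping | column_mapping). A short sketch, where the "label" column name is a made-up example:

```python
from typing import Dict

default_column_mapping: Dict[str, str] = {
    "data": "embeddings",
    "target": "target",
    "split": "split",
}

# Hypothetical override: only the target column is named differently.
column_mapping = {"target": "label"}

# Dict union (Python 3.9+): the right-hand operand wins on duplicate keys,
# and the defaults themselves are left unmodified.
merged = default_column_mapping | column_mapping
print(merged)
# {'data': 'embeddings', 'target': 'label', 'split': 'split'}
```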

filename

Returns the filename of the index'th data sample.

Note that this is the file path relative to the root.

Parameters:

index (int, required): The index of the data sample to select.

Returns:

str: The filename of the index'th data sample.

Source code in src/eva/core/data/datasets/classification/embeddings.py
def filename(self, index: int) -> str:
    """Returns the filename of the `index`'th data sample.

    Note that this is the relative file path to the root.

    Args:
        index: The index of the data-sample to select.

    Returns:
        The filename of the `index`'th data sample.
    """
    return self._data.at[index, self._column_mapping["data"]]
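
The lookup above resolves the "data" key to a manifest column name and then reads that column at the given row. The sketch below reproduces the idea with a plain list of dictionaries standing in for the pandas DataFrame; the rows and paths are invented for illustration, and the list is indexed by position, whereas DataFrame.at is label-based.

```python
from typing import Dict, List

# Stand-in for the manifest table; the source keeps it in a pandas DataFrame.
manifest_rows: List[Dict[str, str]] = [
    {"embeddings": "train/sample_0.pt", "target": "0", "split": "train"},
    {"embeddings": "val/sample_1.pt", "target": "1", "split": "val"},
]
column_mapping = {"data": "embeddings", "target": "target", "split": "split"}


def filename(index: int) -> str:
    """Returns the path stored in the mapped data column of row `index`."""
    return manifest_rows[index][column_mapping["data"]]


print(filename(1))  # val/sample_1.pt
```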