eva

Datasets

Reference information for the Dataset base class.

`eva.data.Dataset`

Bases: TorchDataset

Base dataset class.

`prepare_data`

Encapsulates all disk related tasks.

Use this method to download and prepare the data, for example to generate manifest files. If implemented, it is called via `eva.core.data.datamodules.DataModule`, which ensures that it runs within a single process only, making it safe in multi-process settings.

Source code in src/eva/core/data/datasets/base.py

def prepare_data(self) -> None:
    """Encapsulates all disk related tasks.

    This method is preferred for downloading and preparing the data, for
    example generate manifest files. If implemented, it will be called via
    :class:`eva.core.data.datamodules.DataModule`, which ensures that is called
    only within a single process, making it multi-processes safe.
    """
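
As an illustration of the kind of disk-related work that fits in `prepare_data`, the sketch below writes a small manifest file once and skips the work on later calls. `ManifestDataset`, the file name `manifest.csv`, the CSV format and the sample rows are all hypothetical; in eva the class would subclass the `Dataset` base class rather than stand alone.

```python
import csv
import tempfile
from pathlib import Path


class ManifestDataset:
    """Hypothetical dataset; a stand-in that does not import eva."""

    def __init__(self, root: str) -> None:
        self._root = Path(root)

    def prepare_data(self) -> None:
        # Disk-only work: write the manifest once, skip if it already exists.
        manifest = self._root / "manifest.csv"
        if manifest.exists():
            return
        with manifest.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["embeddings", "target", "split"])
            writer.writerow(["sample_0.pt", 0, "train"])
            writer.writerow(["sample_1.pt", 1, "val"])


root = tempfile.mkdtemp()
dataset = ManifestDataset(root)
dataset.prepare_data()
dataset.prepare_data()  # the second call finds the file and returns early
manifest_lines = (Path(root) / "manifest.csv").read_text().splitlines()
```

Making the method idempotent is a design choice of this sketch, not something the base class enforces.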

`setup`

Sets up the dataset.

Use this method to create datasets or perform train/val/test splits. If implemented, it is called via `eva.core.data.datamodules.DataModule` at the beginning of fit (train + validate), validate, test, or predict, and it is called from every process (i.e. GPU) across all nodes in DDP.

Source code in src/eva/core/data/datasets/base.py

def setup(self) -> None:
    """Setups the dataset.

    This method is preferred for creating datasets or performing
    train/val/test splits. If implemented, it will be called via
    :class:`eva.core.data.datamodules.DataModule` at the beginning of fit
    (train + validate), validate, test, or predict and it will be called
    from every process (i.e. GPU) across all the nodes in DDP.
    """
    self.configure()
    self.validate()
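
The call order in the `setup` body above can be checked with a small stand-in. The `Dataset` class below only mirrors the `setup`, `configure` and `validate` hooks from the source listing; it is not the real `eva.data.Dataset` and it omits the `TorchDataset` base. `RecordingDataset` and its `calls` attribute are hypothetical.

```python
class Dataset:
    """Simplified stand-in for the base class; mirrors only the documented hooks."""

    def setup(self) -> None:
        self.configure()
        self.validate()

    def configure(self) -> None:
        """No-op by default."""

    def validate(self) -> None:
        """No-op by default."""


class RecordingDataset(Dataset):
    """Hypothetical subclass that records which hooks were called."""

    def __init__(self) -> None:
        self.calls = []

    def configure(self) -> None:
        self.calls.append("configure")

    def validate(self) -> None:
        self.calls.append("validate")


dataset = RecordingDataset()
dataset.setup()
print(dataset.calls)  # ['configure', 'validate']
```

A subclass that overrides only `configure` and `validate` therefore does not need to override `setup` itself.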

`configure`

Configures the dataset.

Use this method to configure the dataset: assign values to attributes, perform splits, etc. It is called from `setup`, before `validate`.

Source code in src/eva/core/data/datasets/base.py

def configure(self):
    """Configures the dataset.

    This method is preferred to configure the dataset; assign values
    to attributes, perform splits etc. This would be called from the
    method ::method::`setup`, before calling the ::method::`validate`.
    """

`validate`

Validates the dataset.

This method checks the integrity of the dataset and verifies that it is configured properly. It is called from `setup`, after `configure`.

Source code in src/eva/core/data/datasets/base.py

def validate(self):
    """Validates the dataset.

    This method aims to check the integrity of the dataset and verify
    that is configured properly. This would be called from the method
    ::method::`setup`, after calling the ::method::`configure`.
    """
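
A minimal sketch of how a `configure` and `validate` pair might divide the work: `configure` assigns attributes, and `validate` fails early if the result is unusable. `SplitDataset`, its constructor arguments and the error message are hypothetical, and the class is a stand-in that does not import eva.

```python
class SplitDataset:
    """Hypothetical dataset; in eva it would subclass the Dataset base class."""

    def __init__(self, samples: list, split: str) -> None:
        self._samples = samples  # list of (filename, split) pairs
        self._split = split
        self._selected = []

    def configure(self) -> None:
        # Assign attributes: keep only the filenames of the requested split.
        self._selected = [name for name, split in self._samples if split == self._split]

    def validate(self) -> None:
        # Integrity check: fail early if the configuration selected nothing.
        if not self._selected:
            raise ValueError(f"No samples found for split '{self._split}'.")

    def setup(self) -> None:
        # Same order as the base class: configure first, then validate.
        self.configure()
        self.validate()

    def __len__(self) -> int:
        return len(self._selected)


samples = [("a.pt", "train"), ("b.pt", "train"), ("c.pt", "val")]

train = SplitDataset(samples, split="train")
train.setup()
print(len(train))  # 2

try:
    SplitDataset(samples, split="holdout").setup()
except ValueError as error:
    print(error)  # No samples found for split 'holdout'.
```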

`teardown`

Cleans up the data artifacts.

Used to clean up when the run is finished. If implemented, it is called via `eva.core.data.datamodules.DataModule` at the end of fit (train + validate), validate, test, or predict, and it is called from every process (i.e. GPU) across all nodes in DDP.

Source code in src/eva/core/data/datasets/base.py

def teardown(self) -> None:
    """Cleans up the data artifacts.

    Used to clean-up when the run is finished. If implemented, it will
    be called via :class:`eva.core.data.datamodules.DataModule` at the end
    of fit (train + validate), validate, test, or predict and it will be
    called from every process (i.e. GPU) across all the nodes in DDP.
    """
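
One possible use of `teardown` is to remove temporary files created during `setup`. The sketch below assumes a hypothetical `CachedDataset` that keeps a scratch directory; the class, the `_cache_dir` attribute and the `index.txt` file are illustrative only and are not part of eva.

```python
import shutil
import tempfile
from pathlib import Path


class CachedDataset:
    """Hypothetical dataset; a stand-in that does not import eva."""

    def __init__(self) -> None:
        self._cache_dir = None

    def setup(self) -> None:
        # Create a per-process scratch directory with a small index file.
        self._cache_dir = Path(tempfile.mkdtemp())
        (self._cache_dir / "index.txt").write_text("0\n1\n")

    def teardown(self) -> None:
        # Remove the scratch directory; calling teardown twice is harmless.
        if self._cache_dir is not None:
            shutil.rmtree(self._cache_dir, ignore_errors=True)
            self._cache_dir = None


dataset = CachedDataset()
dataset.setup()
cache_dir = dataset._cache_dir
existed_after_setup = cache_dir.exists()
dataset.teardown()
exists_after_teardown = cache_dir.exists()
```

Because `teardown` runs in every process, the sketch keeps the scratch directory local to each instance instead of sharing one path.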

Embeddings datasets

`eva.core.data.datasets.EmbeddingsClassificationDataset`

Bases: Dataset

Embeddings classification dataset.

Expects a manifest file listing the paths of .pt files that contain tensor embeddings of shape [embedding_dim] or [1, embedding_dim].

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `root` | `str` | Root directory of the dataset. | required |
| `manifest_file` | `str` | The path to the manifest file, relative to the `root` argument. | required |
| `split` | `str \| None` | The dataset split to use. The manifest rows are selected by matching the `split` column against this value. | `None` |
| `column_mapping` | `Dict[str, str]` | Defines the mapping between the variables and the manifest columns. The provided values overwrite those in `default_column_mapping`, so `column_mapping` only needs to contain the entries that are altered or missing. | `default_column_mapping` |
| `embeddings_transforms` | `Callable \| None` | A function/transform that transforms the embedding. | `None` |
| `target_transforms` | `Callable \| None` | A function/transform that transforms the target. | `None` |
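
To make the `split` parameter concrete, here is a rough sketch of selecting manifest rows by split. It assumes the manifest is a CSV file with the default column names and that `split=None` keeps all rows; neither is confirmed by this page. The `select_rows` helper is hypothetical and uses the standard library rather than the dataset's own loading code.

```python
from __future__ import annotations

import csv
import io

# Hypothetical manifest contents; the CSV format is an assumption.
manifest_text = """embeddings,target,split
train/sample_0.pt,0,train
train/sample_1.pt,1,train
val/sample_2.pt,0,val
"""

column_mapping = {"data": "embeddings", "target": "target", "split": "split"}


def select_rows(text: str, split: str | None) -> list[dict[str, str]]:
    """Returns the manifest rows of the given split, or all rows if split is None."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if split is None:
        return rows
    return [row for row in rows if row[column_mapping["split"]] == split]


train_rows = select_rows(manifest_text, "train")
all_rows = select_rows(manifest_text, None)
print(len(train_rows), len(all_rows))  # 2 3
```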

Source code in src/eva/core/data/datasets/classification/embeddings.py

def __init__(
    self,
    root: str,
    manifest_file: str,
    split: str | None = None,
    column_mapping: Dict[str, str] = default_column_mapping,
    embeddings_transforms: Callable | None = None,
    target_transforms: Callable | None = None,
) -> None:
    """Initialize dataset.

    Expects a manifest file listing the paths of .pt files that contain
    tensor embeddings of shape [embedding_dim] or [1, embedding_dim].

    Args:
        root: Root directory of the dataset.
        manifest_file: The path to the manifest file, which is relative to
            the `root` argument.
        split: The dataset split to use. The `split` column of the manifest
            file will be splitted based on this value.
        column_mapping: Defines the map between the variables and the manifest
            columns. It will overwrite the `default_column_mapping` with
            the provided values, so that `column_mapping` can contain only the
            values which are altered or missing.
        embeddings_transforms: A function/transform that transforms the embedding.
        target_transforms: A function/transform that transforms the target.
    """
    super().__init__()

    self._root = root
    self._manifest_file = manifest_file
    self._split = split
    self._column_mapping = self.default_column_mapping | column_mapping
    self._embeddings_transforms = embeddings_transforms
    self._target_transforms = target_transforms

    self._data: pd.DataFrame

`default_column_mapping: Dict[str, str] = {'data': 'embeddings', 'target': 'target', 'split': 'split'}` `class-attribute` `instance-attribute`

The default column mapping of the variables to the manifest columns.
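
The constructor merges the two mappings with the dictionary union operator (`self.default_column_mapping | column_mapping`), so entries passed by the caller take precedence. A short sketch, where the manifest column name `path` is a hypothetical example:

```python
default_column_mapping = {"data": "embeddings", "target": "target", "split": "split"}

# Only the altered entry is supplied; "path" is a hypothetical manifest column.
column_mapping = {"data": "path"}

merged = default_column_mapping | column_mapping
print(merged)  # {'data': 'path', 'target': 'target', 'split': 'split'}
```

The `|` operator on dictionaries requires Python 3.9 or later and leaves `default_column_mapping` itself unchanged.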

`filename`

Returns the filename of the `index`'th data sample.

Note that this is the relative file path to the root.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `index` | `int` | The index of the data sample to select. | required |

Returns:

| Type | Description |
| --- | --- |
| `str` | The filename of the `index`'th data sample. |

Source code in src/eva/core/data/datasets/classification/embeddings.py

def filename(self, index: int) -> str:
    """Returns the filename of the `index`'th data sample.

    Note that this is the relative file path to the root.

    Args:
        index: The index of the data-sample to select.

    Returns:
        The filename of the `index`'th data sample.
    """
    return self._data.at[index, self._column_mapping["data"]]
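
Since the returned filename is relative to the root, a caller that needs the full location has to join the two. A minimal sketch with hypothetical values for `root` and for the string that `filename(0)` might return:

```python
from pathlib import PurePosixPath

root = "/data/embeddings"        # hypothetical `root` argument
filename = "train/sample_0.pt"   # hypothetical return value of `filename(0)`

# Join the root with the relative filename to get the full file path.
full_path = str(PurePosixPath(root) / filename)
print(full_path)  # /data/embeddings/train/sample_0.pt
```

`PurePosixPath` is used here only to keep the printed result the same on every platform; `os.path.join` would work equally well on the machine that holds the files.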