Networks

Reference information for the vision model Networks API.

eva.vision.models.networks.ABMIL

Bases: Module

ABMIL network for multiple instance learning classification tasks.

Takes an array of patch-level embeddings per slide as input. This implementation supports batched inputs of shape (batch_size, n_instances, input_size). For slides with fewer than n_instances patches, you can pad the input with pad_value so that the padded entries are masked during the forward pass.

The original implementation from [1] was used as a reference: https://github.com/AMLab-Amsterdam/AttentionDeepMIL/blob/master/model.py

Notes
  • use_bias: The paper didn't use bias in their formalism, but their published example code inadvertently does.
  • To prevent dot-product similarities from becoming nearly equal due to concentration of measure at large input embedding dimensionalities (>128), we added the option to project the input embeddings to a lower dimensionality.

[1] Maximilian Ilse, Jakub M. Tomczak, Max Welling, "Attention-based Deep Multiple Instance Learning", 2018 https://arxiv.org/abs/1802.04712

Parameters:

  • input_size (int, required): Input embedding dimension.
  • output_size (int, required): Number of classes.
  • projected_input_size (int | None, required): Size of the projected input. If None, no projection is performed.
  • hidden_size_attention (int, default 128): Hidden dimension of the attention network.
  • hidden_sizes_mlp (tuple, default (128, 64)): Dimensions of the hidden layers in the final MLP.
  • use_bias (bool, default True): Whether to use bias in the attention network.
  • dropout_input_embeddings (float, default 0.0): Dropout rate for the input embeddings.
  • dropout_attention (float, default 0.0): Dropout rate for the attention network and classifier.
  • dropout_mlp (float, default 0.0): Dropout rate for the final MLP network.
  • pad_value (int | float | None, default float('-inf')): Value indicating padding in the input tensor. If specified, entries with this value in the input tensor will be masked. If set to None, no masking is applied.
Source code in src/eva/vision/models/networks/abmil.py
def __init__(
    self,
    input_size: int,
    output_size: int,
    projected_input_size: int | None,
    hidden_size_attention: int = 128,
    hidden_sizes_mlp: tuple = (128, 64),
    use_bias: bool = True,
    dropout_input_embeddings: float = 0.0,
    dropout_attention: float = 0.0,
    dropout_mlp: float = 0.0,
    pad_value: int | float | None = float("-inf"),
) -> None:
    """Initializes the ABMIL network.

    Args:
        input_size: input embedding dimension
        output_size: number of classes
        projected_input_size: size of the projected input. if `None`, no projection is
            performed.
        hidden_size_attention: hidden dimension in attention network
        hidden_sizes_mlp: dimensions for hidden layers in last mlp
        use_bias: whether to use bias in the attention network
        dropout_input_embeddings: dropout rate for the input embeddings
        dropout_attention: dropout rate for the attention network and classifier
        dropout_mlp: dropout rate for the final MLP network
        pad_value: Value indicating padding in the input tensor. If specified, entries with
            this value in the input tensor will be masked. If set to `None`, no masking is applied.
    """
    super().__init__()

    self._pad_value = pad_value

    if projected_input_size:
        self.projector = nn.Sequential(
            nn.Linear(input_size, projected_input_size, bias=True),
            nn.Dropout(p=dropout_input_embeddings),
        )
        input_size = projected_input_size
    else:
        self.projector = nn.Dropout(p=dropout_input_embeddings)

    self.gated_attention = GatedAttention(
        input_dim=input_size,
        hidden_dim=hidden_size_attention,
        dropout=dropout_attention,
        n_classes=1,
        use_bias=use_bias,
    )

    self.classifier = MLP(
        input_size=input_size,
        output_size=output_size,
        hidden_layer_sizes=hidden_sizes_mlp,
        dropout=dropout_mlp,
        hidden_activation_fn=nn.ReLU,
    )

forward

Forward pass.

Parameters:

  • input_tensor (Tensor, required): Tensor with expected shape of (batch_size, n_instances, input_size).
Source code in src/eva/vision/models/networks/abmil.py
def forward(self, input_tensor: torch.Tensor) -> torch.Tensor:
    """Forward pass.

    Args:
        input_tensor: Tensor with expected shape of (batch_size, n_instances, input_size).
    """
    input_tensor, mask = self._mask_values(input_tensor, self._pad_value)

    # (batch_size, n_instances, input_size) -> (batch_size, n_instances, projected_input_size)
    input_tensor = self.projector(input_tensor)

    attention_logits = self.gated_attention(input_tensor)  # (batch_size, n_instances, 1)
    if mask is not None:
        # fill masked values with -inf, which will yield 0s after softmax
        attention_logits = attention_logits.masked_fill(mask, float("-inf"))

    attention_weights = nn.functional.softmax(attention_logits, dim=1)
    # (batch_size, n_instances, 1)

    attention_result = torch.matmul(torch.transpose(attention_weights, 1, 2), input_tensor)
    # (batch_size, 1, hidden_size_attention)

    attention_result = torch.squeeze(attention_result, 1)  # (batch_size, hidden_size_attention)

    return self.classifier(attention_result)  # (batch_size, output_size)
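
A minimal usage sketch (not part of the library documentation): the embedding dimension, class count, and bag sizes below are illustrative, and the padding relies on the documented pad_value behaviour, i.e. entries equal to pad_value are masked before the attention softmax.

import torch

from eva.vision.models.networks import ABMIL

# Two slides with different numbers of patch embeddings (768-dim here, illustrative).
bags = [torch.randn(500, 768), torch.randn(350, 768)]

# Pad both bags to the same length with the default pad value, float("-inf"),
# so that the padded instances are masked out by the network.
n_instances = max(bag.shape[0] for bag in bags)
batch = torch.full((len(bags), n_instances, 768), float("-inf"))
for i, bag in enumerate(bags):
    batch[i, : bag.shape[0]] = bag

model = ABMIL(input_size=768, output_size=2, projected_input_size=128)
logits = model(batch)  # (batch_size, output_size) == (2, 2)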

eva.vision.models.networks.decoders.Decoder

Bases: Module, ABC

Abstract base class for segmentation decoders.

forward abstractmethod

Forward pass of the decoder.

Source code in src/eva/vision/models/networks/decoders/segmentation/base.py
@abc.abstractmethod
def forward(self, decoder_inputs: DecoderInputs) -> torch.Tensor:
    """Forward pass of the decoder."""

eva.vision.models.networks.decoders.segmentation.Decoder2D

Bases: Decoder

Segmentation decoder for 2D applications.

Here the input nn layers will be directly applied to the features of shape (batch_size, hidden_size, n_patches_height, n_patches_width), where n_patches is image_size / patch_size. Note that n_patches is also known as grid_size.

Parameters:

  • layers (Module, required): The layers to be used as the decoder head.
  • combine_features (bool, default True): Whether to combine the features from different feature levels into one tensor before applying the decoder head.
Source code in src/eva/vision/models/networks/decoders/segmentation/decoder2d.py
def __init__(self, layers: nn.Module, combine_features: bool = True) -> None:
    """Initializes the based decoder head.

    Here the input nn layers will be directly applied to the
    features of shape (batch_size, hidden_size, n_patches_height,
    n_patches_width), where n_patches is image_size / patch_size.
    Note that n_patches is also known as grid_size.

    Args:
        layers: The layers to be used as the decoder head.
        combine_features: Whether to combine the features from different
            feature levels into one tensor before applying the decoder head.
    """
    super().__init__()

    self._layers = layers
    self._combine_features = combine_features

forward

Maps the patch embeddings to a segmentation mask of the image size.

Parameters:

  • decoder_inputs (DecoderInputs, required): Inputs required by the decoder.

Returns:

  • Tensor: Tensor containing scores for all of the classes with shape (batch_size, n_classes, image_height, image_width).

Source code in src/eva/vision/models/networks/decoders/segmentation/decoder2d.py
def forward(self, decoder_inputs: DecoderInputs) -> torch.Tensor:
    """Maps the patch embeddings to a segmentation mask of the image size.

    Args:
        decoder_inputs: Inputs required by the decoder.

    Returns:
        Tensor containing scores for all of the classes with shape
        (batch_size, n_classes, image_height, image_width).
    """
    features, image_size, _ = DecoderInputs(*decoder_inputs)
    if self._combine_features:
        features = self._forward_features(features)
    logits = self._forward_head(features)
    return self._upscale(logits, image_size)
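
A minimal construction sketch; the hidden size, grid size, and class count are illustrative. The head is applied to features of shape (batch_size, hidden_size, n_patches_height, n_patches_width), and Decoder2D.forward then upscales the resulting logits to the target image size.

import torch
from torch import nn

from eva.vision.models.networks.decoders.segmentation import Decoder2D

hidden_size, grid_size, n_classes = 768, 14, 8  # illustrative sizes

head = nn.Sequential(
    nn.Upsample(scale_factor=2),
    nn.Conv2d(hidden_size, n_classes, kernel_size=1),
)
decoder = Decoder2D(layers=head)

# Shape behaviour of the head alone on a dummy patch-feature map:
features = torch.randn(2, hidden_size, grid_size, grid_size)
print(head(features).shape)  # torch.Size([2, 8, 28, 28])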

eva.vision.models.networks.decoders.segmentation.ConvDecoder1x1

Bases: Decoder2D

A convolutional decoder with a single 1x1 convolutional layer.

Parameters:

  • in_features (int, required): The hidden dimension size of the embeddings.
  • num_classes (int, required): Number of output classes as channels.
Source code in src/eva/vision/models/networks/decoders/segmentation/semantic/common.py
def __init__(self, in_features: int, num_classes: int) -> None:
    """Initializes the decoder.

    Args:
        in_features: The hidden dimension size of the embeddings.
        num_classes: Number of output classes as channels.
    """
    super().__init__(
        layers=nn.Conv2d(
            in_channels=in_features,
            out_channels=num_classes,
            kernel_size=(1, 1),
        ),
    )
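
A one-line instantiation sketch (the sizes are illustrative). The 1x1 head keeps the patch-grid resolution, so upscaling to the image size is handled by the inherited Decoder2D forward pass.

from eva.vision.models.networks.decoders.segmentation import ConvDecoder1x1

decoder = ConvDecoder1x1(in_features=768, num_classes=8)  # illustrative sizes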

eva.vision.models.networks.decoders.segmentation.ConvDecoderMS

Bases: Decoder2D

A multi-stage convolutional decoder with upsampling and convolutional layers.

This decoder applies a series of upsampling and convolutional layers to transform the input features into output predictions with the desired spatial resolution.

This decoder is based on the +ms segmentation decoder from DINOv2 (https://arxiv.org/pdf/2304.07193).

Parameters:

  • in_features (int, required): The hidden dimension size of the embeddings.
  • num_classes (int, required): Number of output classes as channels.
Source code in src/eva/vision/models/networks/decoders/segmentation/semantic/common.py
def __init__(self, in_features: int, num_classes: int) -> None:
    """Initializes the decoder.

    Args:
        in_features: The hidden dimension size of the embeddings.
        num_classes: Number of output classes as channels.
    """
    super().__init__(
        layers=nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv2d(in_features, 64, kernel_size=(3, 3), padding=(1, 1)),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(64, num_classes, kernel_size=(3, 3), padding=(1, 1)),
        ),
    )
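
A shape walk-through of the head defined above, using an illustrative 14x14 patch grid: each Upsample doubles the spatial resolution, so the head alone maps (batch_size, in_features, 14, 14) to (batch_size, num_classes, 56, 56) before the inherited upscaling to the image size.

import torch
from torch import nn

# The same stack of layers that ConvDecoderMS(in_features=768, num_classes=8)
# builds in the source above, applied to a dummy feature map.
head = nn.Sequential(
    nn.Upsample(scale_factor=2),
    nn.Conv2d(768, 64, kernel_size=(3, 3), padding=(1, 1)),
    nn.Upsample(scale_factor=2),
    nn.Conv2d(64, 8, kernel_size=(3, 3), padding=(1, 1)),
)

features = torch.randn(2, 768, 14, 14)  # illustrative 14x14 patch grid
print(head(features).shape)  # torch.Size([2, 8, 56, 56])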

eva.vision.models.networks.decoders.segmentation.LinearDecoder

Bases: Decoder

Linear decoder.

Here the input nn layers will be applied to the features reshaped to (batch_size, patch_embeddings, hidden_size) from the input of shape (batch_size, hidden_size, height, width), and the output is then reshaped back to (batch_size, n_classes, height, width).

Parameters:

  • layers (Module, required): The linear layers to be used as the decoder head.
Source code in src/eva/vision/models/networks/decoders/segmentation/linear.py
def __init__(self, layers: nn.Module) -> None:
    """Initializes the linear based decoder head.

    Here the input nn layers will be applied to the reshaped
    features (batch_size, patch_embeddings, hidden_size) from
    the input (batch_size, hidden_size, height, width) and then
    unwrapped again to (batch_size, n_classes, height, width).

    Args:
        layers: The linear layers to be used as the decoder head.
    """
    super().__init__()

    self._layers = layers

forward

Maps the patch embeddings to a segmentation mask of the image size.

Parameters:

  • features (List[Tensor], required): List of multi-level image features of shape (batch_size, hidden_size, n_patches_height, n_patches_width).
  • image_size (Tuple[int, int], required): The target image size (height, width).

Returns:

  • Tensor: Tensor containing scores for all of the classes with shape (batch_size, n_classes, image_height, image_width).

Source code in src/eva/vision/models/networks/decoders/segmentation/linear.py
def forward(
    self,
    features: List[torch.Tensor],
    image_size: Tuple[int, int],
) -> torch.Tensor:
    """Maps the patch embeddings to a segmentation mask of the image size.

    Args:
        features: List of multi-level image features of shape (batch_size,
            hidden_size, n_patches_height, n_patches_width).
        image_size: The target image size (height, width).

    Returns:
        Tensor containing scores for all of the classes with shape
        (batch_size, n_classes, image_height, image_width).
    """
    patch_embeddings = self._forward_features(features)
    logits = self._forward_head(patch_embeddings)
    return self._cls_seg(logits, image_size)
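
A minimal usage sketch, assuming a single feature level is accepted; the sizes are illustrative and the expected output shape follows the documented return contract.

import torch
from torch import nn

from eva.vision.models.networks.decoders.segmentation import LinearDecoder

hidden_size, grid_size, n_classes = 768, 14, 8  # illustrative sizes
decoder = LinearDecoder(layers=nn.Linear(hidden_size, n_classes))

features = [torch.randn(2, hidden_size, grid_size, grid_size)]
masks = decoder(features, image_size=(224, 224))
print(masks.shape)  # expected: torch.Size([2, 8, 224, 224])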

eva.vision.models.networks.decoders.segmentation.SingleLinearDecoder

Bases: LinearDecoder

A simple linear decoder with a single fully connected layer.

Parameters:

  • in_features (int, required): The hidden dimension size of the embeddings.
  • num_classes (int, required): Number of output classes as channels.
Source code in src/eva/vision/models/networks/decoders/segmentation/semantic/common.py
def __init__(self, in_features: int, num_classes: int) -> None:
    """Initializes the decoder.

    Args:
        in_features: The hidden dimension size of the embeddings.
        num_classes: Number of output classes as channels.
    """
    super().__init__(
        layers=nn.Linear(
            in_features=in_features,
            out_features=num_classes,
        ),
    )
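
An instantiation sketch with illustrative sizes; per the source above, this is a LinearDecoder whose head is a single nn.Linear(in_features, num_classes).

from eva.vision.models.networks.decoders.segmentation import SingleLinearDecoder

decoder = SingleLinearDecoder(in_features=768, num_classes=8)  # illustrative sizes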

eva.vision.models.networks.decoders.segmentation.ConvDecoderWithImage

Bases: Decoder2D

A convolutional decoder that, in addition to the encoded features, also takes the input image as input.

In the first stage, the input features are upsampled and passed through a convolutional layer; in the second stage, the input image channels are concatenated with the upsampled features and passed through additional convolutional blocks in order to combine the image prior information with the encoded features. Lastly, a 1x1 convolution reduces the number of channels to the number of classes.

Parameters:

  • in_features (int, required): The hidden dimension size of the embeddings.
  • num_classes (int, required): Number of output classes as channels.
  • greyscale (bool, default False): Whether to convert input images to greyscale.
  • hidden_dims (List[int] | None, default None): List of hidden dimensions for the convolutional layers.
Source code in src/eva/vision/models/networks/decoders/segmentation/semantic/with_image.py
def __init__(
    self,
    in_features: int,
    num_classes: int,
    greyscale: bool = False,
    hidden_dims: List[int] | None = None,
) -> None:
    """Initializes the decoder.

    Args:
        in_features: The hidden dimension size of the embeddings.
        num_classes: Number of output classes as channels.
        greyscale: Whether to convert input images to greyscale.
        hidden_dims: List of hidden dimensions for the convolutional layers.
    """
    hidden_dims = hidden_dims or self._default_hidden_dims
    if len(hidden_dims) != 3:
        raise ValueError("Hidden dims must have 3 elements.")

    super().__init__(
        layers=nn.Sequential(
            nn.Upsample(scale_factor=2),
            Conv2dBnReLU(in_features, hidden_dims[0]),
        )
    )
    self.greyscale = greyscale

    additional_hidden_dims = 1 if greyscale else 3
    self.image_block = nn.Sequential(
        Conv2dBnReLU(hidden_dims[0] + additional_hidden_dims, hidden_dims[1]),
        Conv2dBnReLU(hidden_dims[1], hidden_dims[2]),
    )

    self.classifier = nn.Conv2d(hidden_dims[2], num_classes, kernel_size=1)
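
An instantiation sketch with illustrative values; per the check in the source above, hidden_dims must contain exactly three elements, and with greyscale=True the image contributes a single extra channel to the second-stage blocks.

from eva.vision.models.networks.decoders.segmentation import ConvDecoderWithImage

decoder = ConvDecoderWithImage(
    in_features=768,            # illustrative embedding size
    num_classes=8,              # illustrative number of classes
    greyscale=True,             # the image is concatenated as a single channel
    hidden_dims=[64, 32, 32],   # must have exactly three elements
)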