Wrappers

Reference information for the multimodal Wrappers API.

eva.multimodal.models.wrappers.VisionLanguageModel

Bases: BaseModel[TextImageBatch, ModelOutput]

Base class for multimodal models.

Classes that inherit from this should implement the following methods:

- load_model: Loads & instantiates the model.
- model_forward: Implements the forward pass of the model. For API models, this can be an API call.
- format_inputs: Preprocesses and converts the input batch into the format expected by the model_forward method.

Parameters:

- system_prompt (str | None): The system prompt to use for the model (optional). Required.
- output_transforms (Callable | None): Optional transforms to apply to the output of the model's forward pass. Default: None.
Source code in src/eva/multimodal/models/wrappers/base.py
def __init__(
    self, system_prompt: str | None, output_transforms: Callable | None = None
) -> None:
    """Creates a new model instance.

    Args:
        system_prompt: The system prompt to use for the model (optional).
        output_transforms: Optional transforms to apply to the output of
            the model's forward pass.
    """
    super().__init__(transforms=output_transforms)

    self.system_message = ModelSystemMessage(content=system_prompt) if system_prompt else None

forward

Forward pass of the model.

Source code in src/eva/multimodal/models/wrappers/base.py
@override
def forward(self, batch: TextImageBatch) -> ModelOutput:
    """Forward pass of the model."""
    inputs = self.format_inputs(batch)
    return super().forward(inputs)

format_inputs abstractmethod

Converts the inputs into the format expected by the model.

Source code in src/eva/multimodal/models/wrappers/base.py
@abc.abstractmethod
def format_inputs(self, batch: TextImageBatch) -> Any:
    """Converts the inputs into the format expected by the model."""
    raise NotImplementedError
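
A minimal, hypothetical subclass sketch wiring the three required methods together; the EchoModel name and its trivial bodies are illustrative assumptions, not part of eva.

from typing import Any

from typing_extensions import override

from eva.multimodal.models.wrappers import VisionLanguageModel


class EchoModel(VisionLanguageModel):
    """Toy subclass for illustration only."""

    def __init__(self, system_prompt: str | None = None) -> None:
        super().__init__(system_prompt=system_prompt)
        self.model = self.load_model()

    @override
    def load_model(self) -> Any:
        # A real subclass would instantiate a network or an API client here.
        return lambda inputs: inputs

    @override
    def format_inputs(self, batch: Any) -> Any:
        # Convert the TextImageBatch into whatever model_forward expects.
        return batch

    @override
    def model_forward(self, batch: Any) -> Any:
        # For API-backed models this could be a remote call; a real
        # implementation would return a ModelOutput.
        return self.model(batch)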

eva.multimodal.models.wrappers.ModelFromRegistry

Bases: BaseModel[TextImageBatch, List[str]]

Wrapper class for vision backbone models.

This class can be used to load backbones available in eva's model registry by name. New backbones can be registered with the @backbone_registry.register(model_name) decorator.

Parameters:

- model_name (str): The name of the model to load. Required.
- model_kwargs (Dict[str, Any] | None): The arguments used for instantiating the model. Default: None.
- model_extra_kwargs (Dict[str, Any] | None): Extra arguments used for instantiating the model. Default: None.
- transforms (Callable | None): The transforms to apply to the output tensor produced by the model. Default: None.
Source code in src/eva/multimodal/models/wrappers/from_registry.py
def __init__(
    self,
    model_name: str,
    model_kwargs: Dict[str, Any] | None = None,
    model_extra_kwargs: Dict[str, Any] | None = None,
    transforms: Callable | None = None,
) -> None:
    """Initializes the model.

    Args:
        model_name: The name of the model to load.
        model_kwargs: The arguments used for instantiating the model.
        model_extra_kwargs: Extra arguments used for instantiating the model.
        transforms: The transforms to apply to the output tensor
            produced by the model.
    """
    super().__init__(transforms=transforms)

    self._model_name = model_name
    self._model_kwargs = model_kwargs or {}
    self._model_extra_kwargs = model_extra_kwargs or {}

    self.model = self.load_model()
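
A hedged usage sketch: the registry name and keyword arguments below are placeholders; substitute a backbone that is actually registered in eva's model registry.

from eva.multimodal.models.wrappers import ModelFromRegistry

# "my_org/my_vlm" is a hypothetical registry entry, not a shipped backbone.
model = ModelFromRegistry(
    model_name="my_org/my_vlm",
    model_kwargs={"device": "cuda"},  # assumed kwarg, forwarded to the backbone
)

# Registering a new backbone (the import path of backbone_registry is an
# assumption; the decorator itself is documented above):
#
# @backbone_registry.register("my_org/my_vlm")
# def my_vlm(**kwargs): ...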

eva.multimodal.models.wrappers.HuggingFaceModel

Bases: VisionLanguageModel

Lightweight wrapper for Huggingface VLMs.

Parameters:

- model_name_or_path (str): The name or path of the model to use. Required.
- model_class (str): The class of the model to use. Required.
- model_kwargs (Dict[str, Any] | None): Additional model arguments. Default: None.
- system_prompt (str | None): System prompt to use. Default: None.
- processor_kwargs (Dict[str, Any] | None): Additional processor arguments. Default: None.
- generation_kwargs (Dict[str, Any] | None): Additional generation arguments. Default: None.
Source code in src/eva/multimodal/models/wrappers/huggingface.py
def __init__(
    self,
    model_name_or_path: str,
    model_class: str,
    model_kwargs: Dict[str, Any] | None = None,
    system_prompt: str | None = None,
    processor_kwargs: Dict[str, Any] | None = None,
    generation_kwargs: Dict[str, Any] | None = None,
):
    """Initialize the HuggingFace model wrapper.

    Args:
        model_name_or_path: The name or path of the model to use.
        model_class: The class of the model to use.
        model_kwargs: Additional model arguments.
        system_prompt: System prompt to use.
        processor_kwargs: Additional processor arguments.
        generation_kwargs: Additional generation arguments.
    """
    super().__init__(system_prompt=system_prompt)

    self.model_name_or_path = model_name_or_path
    self.model_kwargs = model_kwargs or {}
    self.base_model_class = model_class
    self.processor_kwargs = processor_kwargs or {}
    self.generation_kwargs = generation_kwargs or {}

    self.processor = self.load_processor()
    self.model = self.load_model()
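
A minimal instantiation sketch assuming a chat-capable HuggingFace VLM; the checkpoint, model class, and kwargs below are illustrative choices, not wrapper defaults.

from eva.multimodal.models.wrappers import HuggingFaceModel

model = HuggingFaceModel(
    model_name_or_path="llava-hf/llava-1.5-7b-hf",  # any chat VLM checkpoint
    model_class="LlavaForConditionalGeneration",    # resolved via getattr(transformers, ...)
    model_kwargs={"torch_dtype": "auto"},           # forwarded to from_pretrained
    system_prompt="You are a concise assistant.",
    generation_kwargs={"max_new_tokens": 64},       # forwarded to model.generate
)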

format_inputs

Formats inputs for HuggingFace models.

Parameters:

- batch (TextImageBatch | TextBatch): A batch of text and image inputs. Required.

Returns:

Dict[str, Tensor]: A dictionary produced by the provided processor, following a format like:

{
    "input_ids": ...,
    "attention_mask": ...,
    "pixel_values": ...
}

Source code in src/eva/multimodal/models/wrappers/huggingface.py
@override
def format_inputs(self, batch: TextImageBatch | TextBatch) -> Dict[str, torch.Tensor]:
    """Formats inputs for HuggingFace models.

    Args:
        batch: A batch of text and image inputs.

    Returns:
        A dictionary produced by the provided processor following a format like:
        {
            "input_ids": ...,
            "attention_mask": ...,
            "pixel_values": ...
        }
    """
    message_batch, image_batch, _, _ = unpack_batch(batch)
    with_images = image_batch is not None

    message_batch = language_message_utils.batch_insert_system_message(
        message_batch, self.system_message
    )
    message_batch = list(map(language_message_utils.combine_system_messages, message_batch))

    if self.processor.chat_template is not None:  # type: ignore
        templated_text = [
            self.processor.apply_chat_template(  # type: ignore
                message,
                add_generation_prompt=True,
                tokenize=False,
            )
            for message in map(
                functools.partial(
                    message_utils.format_huggingface_message,
                    with_images=with_images,
                ),
                message_batch,
            )
        ]
    else:
        raise NotImplementedError("Currently only chat models are supported.")

    processor_inputs = {
        "text": templated_text,
        "return_tensors": "pt",
        **self.processor_kwargs,
    }

    if with_images:
        processor_inputs["image"] = [[image] for image in image_batch]

    return self.processor(**processor_inputs).to(self.model.device)  # type: ignore

model_forward

Generates text output from the model; called by the generate method.

Parameters:

- batch (Dict[str, Tensor]): A dictionary containing the input data, which may include "text" (a list of messages formatted for the model) and "image" (a list of image tensors). Required.

Returns:

ModelOutput: A dictionary containing the processed input and the model's output.

Source code in src/eva/multimodal/models/wrappers/huggingface.py
@override
def model_forward(self, batch: Dict[str, torch.Tensor]) -> ModelOutput:
    """Generates text output from the model. Is called by the `generate` method.

    Args:
        batch: A dictionary containing the input data, which may include:
            - "text": List of messages formatted for the model.
            - "image": List of image tensors.

    Returns:
        A dictionary containing the processed input and the model's output.
    """
    output_ids = self.model.generate(**batch, **self.generation_kwargs)  # type: ignore

    return ModelOutput(
        generated_text=self._decode_output(output_ids, batch["input_ids"].shape[-1]),
        input_ids=batch.get("input_ids"),
        output_ids=output_ids,
        attention_mask=batch.get("attention_mask"),
    )
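
A brief usage sketch; `inputs` is assumed to come from format_inputs, and attribute-style access on ModelOutput is an assumption, not a documented guarantee.

inputs = model.format_inputs(batch)   # batch: TextImageBatch
output = model.model_forward(inputs)
# Assuming ModelOutput exposes its fields as attributes:
print(output.generated_text)          # one decoded continuation per batch element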

load_model

Sets up the model. Used for delayed model initialization.

Raises:

- ValueError: If the model class is not found in transformers, or if the model does not support generation.

Source code in src/eva/multimodal/models/wrappers/huggingface.py
@override
def load_model(self) -> nn.Module:
    """Setting up the model. Used for delayed model initialization.

    Raises:
        ValueError: If the model class is not found in transformers or if the model
            does not support gradient checkpointing but it is enabled.
    """
    logger.info(f"Configuring model: {self.model_name_or_path}")
    if hasattr(transformers, self.base_model_class):
        model_class = getattr(transformers, self.base_model_class)
    else:
        raise ValueError(f"Model class {self.base_model_class} not found in transformers")

    model = model_class.from_pretrained(self.model_name_or_path, **self.model_kwargs)

    if not hasattr(model, "generate"):
        raise ValueError(f"Model {self.model_name_or_path} does not support generation. ")

    return model

load_processor

Initialize the processor.

Source code in src/eva/multimodal/models/wrappers/huggingface.py
def load_processor(self) -> Callable:
    """Initialize the processor."""
    return transformers.AutoProcessor.from_pretrained(
        self.processor_kwargs.pop("model_name_or_path", self.model_name_or_path),
        **self.processor_kwargs,
    )
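
Because load_processor pops model_name_or_path from processor_kwargs, the processor can be loaded from a different checkpoint than the model; a sketch with placeholder checkpoint names:

model = HuggingFaceModel(
    model_name_or_path="llava-hf/llava-1.5-7b-hf",
    model_class="LlavaForConditionalGeneration",
    processor_kwargs={
        # Placeholder: load the processor from another checkpoint.
        "model_name_or_path": "llava-hf/llava-1.5-13b-hf",
    },
)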

eva.multimodal.models.wrappers.LiteLLMModel

Bases: VisionLanguageModel

Wrapper class for LiteLLM vision-language models.

Parameters:

- model_name (str): The name of the model to use. Required.
- model_kwargs (Dict[str, Any] | None): Additional keyword arguments to pass during generation (e.g., temperature, max_tokens). Default: None.
- system_prompt (str | None): The system prompt to use (optional). Default: None.
- log_level (int | None): Optional logging level for LiteLLM. Default: logging.INFO.
Source code in src/eva/multimodal/models/wrappers/litellm.py
def __init__(
    self,
    model_name: str,
    model_kwargs: Dict[str, Any] | None = None,
    system_prompt: str | None = None,
    log_level: int | None = logging.INFO,
):
    """Initialize the LiteLLM Wrapper.

    Args:
        model_name: The name of the model to use.
        model_kwargs: Additional keyword arguments to pass during
            generation (e.g., `temperature`, `max_tokens`).
        system_prompt: The system prompt to use (optional).
        log_level: Optional logging level for LiteLLM. Defaults to `logging.INFO`.
    """
    super().__init__(system_prompt=system_prompt)

    self.language_model = language_wrappers.LiteLLMModel(
        model_name=model_name,
        model_kwargs=model_kwargs,
        system_prompt=system_prompt,
        log_level=log_level,
    )
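
A closing usage sketch; "gpt-4o" stands in for any LiteLLM-supported model identifier, and the kwargs mirror the temperature/max_tokens example from the parameter description.

from eva.multimodal.models.wrappers import LiteLLMModel

model = LiteLLMModel(
    model_name="gpt-4o",  # placeholder for any LiteLLM-supported model
    model_kwargs={"temperature": 0.0, "max_tokens": 256},
    system_prompt="Answer with a single word.",
)
output = model(batch)  # batch: TextImageBatch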