Wrappers

Reference information for the multimodal Wrappers API.

eva.multimodal.models.wrappers.VisionLanguageModel

Bases: BaseModel[TextImageBatch, ModelOutput]

Base class for multimodal models.

Classes that inherit from this should implement the following methods:

- load_model: Loads & instantiates the model.
- model_forward: Implements the forward pass of the model. For API models, this can be an API call.
- format_inputs: Preprocesses and converts the input batch into the format expected by the model_forward method.

A minimal subclass sketch is shown after the format_inputs reference below.

Parameters:

- system_prompt (str | None, required): The system prompt to use for the model (optional).
- output_transforms (Callable | None, default None): Optional transforms to apply to the output of the model's forward pass.
Source code in src/eva/multimodal/models/wrappers/base.py
def __init__(
    self, system_prompt: str | None, output_transforms: Callable | None = None
) -> None:
    """Creates a new model instance.

    Args:
        system_prompt: The system prompt to use for the model (optional).
        output_transforms: Optional transforms to apply to the output of
            the model's forward pass.
    """
    super().__init__(transforms=output_transforms)

    self.system_message = ModelSystemMessage(content=system_prompt) if system_prompt else None

forward

Forward pass of the model.

Source code in src/eva/multimodal/models/wrappers/base.py
@override
def forward(self, batch: TextImageBatch) -> ModelOutput:
    """Forward pass of the model."""
    inputs = self.format_inputs(batch)
    return super().forward(inputs)

format_inputs abstractmethod

Converts the inputs into the format expected by the model.

Source code in src/eva/multimodal/models/wrappers/base.py
@abc.abstractmethod
def format_inputs(self, batch: TextImageBatch) -> Any:
    """Converts the inputs into the format expected by the model."""
    raise NotImplementedError
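
To make the contract concrete, here is a minimal subclass sketch. The class, the echo-style model_forward, and the batch handling are hypothetical and for illustration only; a real implementation would run an actual model or make an API call.

from typing import Any, List

from eva.multimodal.models.wrappers import VisionLanguageModel


class EchoModel(VisionLanguageModel):
    """Toy wrapper that echoes its text inputs back as output (hypothetical)."""

    def load_model(self) -> Any:
        # Nothing to load for this toy model; an API-backed model might
        # configure a client here instead.
        return None

    def format_inputs(self, batch) -> List[str]:
        # Convert the TextImageBatch into whatever model_forward expects.
        # The `batch.text` attribute access is an assumption for illustration.
        return [str(message) for message in batch.text]

    def model_forward(self, inputs: List[str]):
        # A real implementation would generate text and return a ModelOutput;
        # here we simply echo the formatted inputs.
        return inputs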

eva.multimodal.models.wrappers.ModelFromRegistry

Bases: BaseModel[TextImageBatch, List[str]]

Wrapper class for vision backbone models.

This class can be used to load backbones available in eva's model registry by name. New backbones can be registered using the @backbone_registry.register(model_name) decorator. A usage sketch follows the source code below.

Parameters:

- model_name (str, required): The name of the model to load.
- model_kwargs (Dict[str, Any] | None, default None): The arguments used for instantiating the model.
- model_extra_kwargs (Dict[str, Any] | None, default None): Extra arguments used for instantiating the model.
- transforms (Callable | None, default None): The transforms to apply to the output tensor produced by the model.
Source code in src/eva/multimodal/models/wrappers/from_registry.py
def __init__(
    self,
    model_name: str,
    model_kwargs: Dict[str, Any] | None = None,
    model_extra_kwargs: Dict[str, Any] | None = None,
    transforms: Callable | None = None,
) -> None:
    """Initializes the model.

    Args:
        model_name: The name of the model to load.
        model_kwargs: The arguments used for instantiating the model.
        model_extra_kwargs: Extra arguments used for instantiating the model.
        transforms: The transforms to apply to the output tensor
            produced by the model.
    """
    super().__init__(transforms=transforms)

    self._model_name = model_name
    self._model_kwargs = model_kwargs or {}
    self._model_extra_kwargs = model_extra_kwargs or {}

    self.model = self.load_model()
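
A hedged usage sketch follows. The registry import path, the builder function, and the model name "my_org/my_vlm" are assumptions for illustration; consult eva's model registry for the actual available names.

from eva.multimodal.models.wrappers import ModelFromRegistry

# Hypothetical registration of a new backbone, using the decorator
# described above (the registry import path is assumed):
#
#   @backbone_registry.register("my_org/my_vlm")
#   def my_vlm(**kwargs): ...

# Loading a registered backbone by name:
model = ModelFromRegistry(
    model_name="my_org/my_vlm",       # assumed registry name
    model_kwargs={"device": "cuda"},  # assumed kwargs
)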

eva.multimodal.models.wrappers.HuggingFaceModel

Bases: VisionLanguageModel

Wrapper class for HuggingFace vision-language models.

Parameters:

- model_name_or_path (str, required): The name or path of the model to use.
- model_class (str, required): The class of the model to use.
- model_kwargs (Dict[str, Any] | None, default None): Additional model arguments.
- system_prompt (str | None, default None): System prompt to use.
- processor_kwargs (Dict[str, Any] | None, default None): Additional processor arguments.
- generation_kwargs (Dict[str, Any] | None, default None): Additional generation arguments.
- image_key (str, default 'image'): The key used for image inputs in the chat template.
- image_position (Literal['before_text', 'after_text'], default 'after_text'): Position of the image in the input sequence.
- chat_template (str | None, default None): Optional chat template name to use with the processor. If None, will use the template stored in the checkpoint's processor config.
Source code in src/eva/multimodal/models/wrappers/huggingface.py
def __init__(
    self,
    model_name_or_path: str,
    model_class: str,
    model_kwargs: Dict[str, Any] | None = None,
    system_prompt: str | None = None,
    processor_kwargs: Dict[str, Any] | None = None,
    generation_kwargs: Dict[str, Any] | None = None,
    image_key: str = "image",
    image_position: Literal["before_text", "after_text"] = "after_text",
    chat_template: str | None = None,
):
    """Initialize the HuggingFace model wrapper.

    Args:
        model_name_or_path: The name or path of the model to use.
        model_class: The class of the model to use.
        model_kwargs: Additional model arguments.
        system_prompt: System prompt to use.
        processor_kwargs: Additional processor arguments.
        generation_kwargs: Additional generation arguments.
        image_key: The key used for image inputs in the chat template.
        image_position: Position of the image in the input sequence.
        chat_template: Optional chat template name to use with the processor. If None,
            will use the template stored in the checkpoint's processor config.
    """
    super().__init__(system_prompt=system_prompt)

    self.image_key = image_key
    self.image_position: Literal["before_text", "after_text"] = image_position
    self.model_name_or_path = model_name_or_path
    self.model_class = model_class
    self.model_kwargs = model_kwargs or {}
    self.processor_kwargs = processor_kwargs or {}
    self.generation_kwargs = self._default_generation_kwargs | (generation_kwargs or {})
    self.chat_template = chat_template

    self.model: language_wrappers.HuggingFaceModel
    self.processor: Callable
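
Below is a hedged instantiation sketch. The checkpoint name, model class string, and keyword arguments are illustrative assumptions; any HuggingFace vision-language checkpoint that ships a chat template should work analogously.

from eva.multimodal.models.wrappers import HuggingFaceModel

model = HuggingFaceModel(
    model_name_or_path="Qwen/Qwen2-VL-2B-Instruct",  # assumed checkpoint
    model_class="AutoModelForImageTextToText",       # assumed class name
    model_kwargs={"torch_dtype": "bfloat16"},        # assumed kwargs
    system_prompt="You are a helpful assistant.",
    generation_kwargs={"max_new_tokens": 128},
)
model.configure_model()  # lazily loads the model, see below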

configure_model

Use configure_model hook to load model in lazy fashion.

Source code in src/eva/multimodal/models/wrappers/huggingface.py
def configure_model(self) -> None:
    """Use configure_model hook to load model in lazy fashion."""
    if not hasattr(self, "model"):
        self.model = self.load_model()
        self.model.configure_model()
    if not hasattr(self, "processor"):
        self.processor = self.model.processor

format_inputs

Formats inputs for HuggingFace models.

Parameters:

- batch (TextImageBatch | TextBatch, required): A batch of text and image inputs.

Returns:

- Dict[str, Tensor]: A dictionary produced by the provided processor, following a format like:

    {
        "input_ids": ...,
        "attention_mask": ...,
        "pixel_values": ...
    }

Source code in src/eva/multimodal/models/wrappers/huggingface.py
@override
def format_inputs(self, batch: TextImageBatch | TextBatch) -> Dict[str, torch.Tensor]:
    """Formats inputs for HuggingFace models.

    Args:
        batch: A batch of text and image inputs.

    Returns:
        A dictionary produced by the provided processor following a format like:
        {
            "input_ids": ...,
            "attention_mask": ...,
            "pixel_values": ...
        }
    """
    message_batch, image_batch, _, _ = unpack_batch(batch)

    message_batch = language_message_utils.batch_insert_system_message(
        message_batch, self.system_message
    )
    message_batch = list(map(language_message_utils.combine_system_messages, message_batch))

    if image_batch is None:
        image_batch = [None] * len(message_batch)

    if self.processor.chat_template is not None:  # type: ignore
        templated_text = [
            self.processor.apply_chat_template(  # type: ignore
                message_utils.format_huggingface_message(
                    message,
                    images=images,
                    image_position=self.image_position,
                ),
                add_generation_prompt=True,
                tokenize=False,
            )
            for message, images in zip(message_batch, image_batch, strict=True)
        ]
    else:
        raise NotImplementedError("Currently only chat models are supported.")

    processor_inputs: Dict[str, Any] = {
        "text": templated_text,
        "return_tensors": "pt",
        **self.processor_kwargs,
    }

    if any(image_batch):
        processor_inputs[self.image_key] = image_batch

    return self.processor(**processor_inputs).to(self.model.model.device)  # type: ignore
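
For orientation, HuggingFace chat templates typically consume a conversation structure like the one below. This follows the common transformers convention; the exact structure produced by format_huggingface_message is an assumption.

# Assumed structure passed to processor.apply_chat_template; with
# image_position="after_text" the image entry follows the text entry.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image"},  # placeholder resolved by the processor
        ],
    },
]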

model_forward

Generates text output from the model.

Parameters:

- batch (Dict[str, Tensor], required): A dictionary containing the input data.

Returns:

- ModelOutput: The model output containing generated text.

Source code in src/eva/multimodal/models/wrappers/huggingface.py
@override
def model_forward(self, batch: Dict[str, torch.Tensor]) -> ModelOutput:
    """Generates text output from the model.

    Args:
        batch: A dictionary containing the input data.

    Returns:
        The model output containing generated text.
    """
    return self.model.model_forward(batch)

eva.multimodal.models.wrappers.LiteLLMModel

Bases: VisionLanguageModel

Wrapper class for LiteLLM vision-language models.

Parameters:

- model_name (str, required): The name of the model to use.
- model_kwargs (Dict[str, Any] | None, default None): Additional keyword arguments to pass during generation (e.g., temperature, max_tokens).
- system_prompt (str | None, default None): The system prompt to use (optional).
- image_position (Literal['before_text', 'after_text'], default 'after_text'): Position of image relative to text ("before_text" or "after_text").
- log_level (int | None, default INFO): Optional logging level for LiteLLM. Defaults to INFO.
Source code in src/eva/multimodal/models/wrappers/litellm.py
def __init__(
    self,
    model_name: str,
    model_kwargs: Dict[str, Any] | None = None,
    system_prompt: str | None = None,
    image_position: Literal["before_text", "after_text"] = "after_text",
    log_level: int | None = logging.INFO,
):
    """Initialize the LiteLLM Wrapper.

    Args:
        model_name: The name of the model to use.
        model_kwargs: Additional keyword arguments to pass during
            generation (e.g., `temperature`, `max_tokens`).
        system_prompt: The system prompt to use (optional).
        image_position: Position of image relative to text ("before_text" or "after_text").
        log_level: Optional logging level for LiteLLM. Defaults to INFO.
    """
    super().__init__(system_prompt=system_prompt)
    self.image_position: Literal["before_text", "after_text"] = image_position

    self.language_model = language_wrappers.LiteLLMModel(
        model_name=model_name,
        model_kwargs=model_kwargs,
        system_prompt=system_prompt,
        log_level=log_level,
    )
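
A hedged usage sketch: the model name follows LiteLLM's "provider/model" convention, and the value shown is illustrative only.

import logging

from eva.multimodal.models.wrappers import LiteLLMModel

model = LiteLLMModel(
    model_name="openai/gpt-4o",  # assumed model identifier
    model_kwargs={"temperature": 0.0, "max_tokens": 256},
    system_prompt="You are a concise visual assistant.",
    image_position="after_text",
    log_level=logging.WARNING,
)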