Wrappers

Reference information for the multimodal Wrappers API.

eva.multimodal.models.wrappers.VisionLanguageModel

Bases: BaseModel[TextImageBatch, ModelOutput]

Base class for multimodal models.

Classes that inherit from this should implement the following methods:

- load_model: Loads & instantiates the model.
- model_forward: Implements the forward pass of the model. For API models, this can be an API call.
- format_inputs: Preprocesses and converts the input batch into the format expected by the model_forward method.

A minimal subclass sketch is shown after the format_inputs reference below.

Parameters:

- system_prompt (str | None, required): The system prompt to use for the model (optional).
- output_transforms (Callable | None, default None): Optional transforms to apply to the output of the model's forward pass.
Source code in src/eva/multimodal/models/wrappers/base.py
def __init__(
    self, system_prompt: str | None, output_transforms: Callable | None = None
) -> None:
    """Creates a new model instance.

    Args:
        system_prompt: The system prompt to use for the model (optional).
        output_transforms: Optional transforms to apply to the output of
            the model's forward pass.
    """
    super().__init__(transforms=output_transforms)

    self.system_message = ModelSystemMessage(content=system_prompt) if system_prompt else None

forward

Forward pass of the model.

Source code in src/eva/multimodal/models/wrappers/base.py
@override
def forward(self, batch: TextImageBatch) -> ModelOutput:
    """Forward pass of the model."""
    inputs = self.format_inputs(batch)
    return super().forward(inputs)

format_inputs abstractmethod

Converts the inputs into the format expected by the model.

Source code in src/eva/multimodal/models/wrappers/base.py
@abc.abstractmethod
def format_inputs(self, batch: TextImageBatch) -> Any:
    """Converts the inputs into the format expected by the model."""
    raise NotImplementedError
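
To make the contract concrete, here is a minimal subclass sketch. The class, the echo-style model_forward, and the batch handling are hypothetical and for illustration only; a real implementation would run an actual model or make an API call.

from typing import Any, List

from eva.multimodal.models.wrappers import VisionLanguageModel


class EchoModel(VisionLanguageModel):
    """Toy wrapper that echoes its text inputs back as output (hypothetical)."""

    def load_model(self) -> Any:
        # Nothing to load for this toy model; an API-backed model might
        # configure a client here instead.
        return None

    def format_inputs(self, batch) -> List[str]:
        # Convert the TextImageBatch into whatever model_forward expects.
        # The `batch.text` attribute access is an assumption for illustration.
        return [str(message) for message in batch.text]

    def model_forward(self, inputs: List[str]):
        # A real implementation would generate text and return a ModelOutput;
        # here we simply echo the formatted inputs.
        return inputs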

eva.multimodal.models.wrappers.ModelFromRegistry

Bases: BaseModel[TextImageBatch, List[str]]

Wrapper class for vision backbone models.

This class can be used to load backbones available in eva's model registry by name. New backbones can be registered using the @backbone_registry.register(model_name) decorator. A usage sketch follows the source code below.

Parameters:

- model_name (str, required): The name of the model to load.
- model_kwargs (Dict[str, Any] | None, default None): The arguments used for instantiating the model.
- model_extra_kwargs (Dict[str, Any] | None, default None): Extra arguments used for instantiating the model.
- transforms (Callable | None, default None): The transforms to apply to the output tensor produced by the model.
Source code in src/eva/multimodal/models/wrappers/from_registry.py
def __init__(
    self,
    model_name: str,
    model_kwargs: Dict[str, Any] | None = None,
    model_extra_kwargs: Dict[str, Any] | None = None,
    transforms: Callable | None = None,
) -> None:
    """Initializes the model.

    Args:
        model_name: The name of the model to load.
        model_kwargs: The arguments used for instantiating the model.
        model_extra_kwargs: Extra arguments used for instantiating the model.
        transforms: The transforms to apply to the output tensor
            produced by the model.
    """
    super().__init__(transforms=transforms)

    self._model_name = model_name
    self._model_kwargs = model_kwargs or {}
    self._model_extra_kwargs = model_extra_kwargs or {}

    self.model = self.load_model()
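
A hedged usage sketch follows. The registry import path, the builder function, and the model name "my_org/my_vlm" are assumptions for illustration; consult eva's model registry for the actual available names.

from eva.multimodal.models.wrappers import ModelFromRegistry

# Hypothetical registration of a new backbone, using the decorator
# described above (the registry import path is assumed):
#
#   @backbone_registry.register("my_org/my_vlm")
#   def my_vlm(**kwargs): ...

# Loading a registered backbone by name:
model = ModelFromRegistry(
    model_name="my_org/my_vlm",       # assumed registry name
    model_kwargs={"device": "cuda"},  # assumed kwargs
)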

eva.multimodal.models.wrappers.HuggingFaceModel

Bases: VisionLanguageModel

Wrapper class for HuggingFace vision-language models.

Parameters:

- model_name_or_path (str, required): The name or path of the model to use.
- model_class (str, required): The class of the model to use.
- model_kwargs (Dict[str, Any] | None, default None): Additional model arguments.
- system_prompt (str | None, default None): System prompt to use.
- processor_kwargs (Dict[str, Any] | None, default None): Additional processor arguments.
- generation_kwargs (Dict[str, Any] | None, default None): Additional generation arguments.
- image_key (str, default 'image'): The key used for image inputs in the chat template.
- image_position (Literal['before_text', 'after_text'], default 'after_text'): Position of the image in the input sequence.
- chat_template (str | None, default None): Optional chat template name to use with the processor. If None, will use the template stored in the checkpoint's processor config.
Source code in src/eva/multimodal/models/wrappers/huggingface.py
def __init__(
    self,
    model_name_or_path: str,
    model_class: str,
    model_kwargs: Dict[str, Any] | None = None,
    system_prompt: str | None = None,
    processor_kwargs: Dict[str, Any] | None = None,
    generation_kwargs: Dict[str, Any] | None = None,
    image_key: str = "image",
    image_position: Literal["before_text", "after_text"] = "after_text",
    chat_template: str | None = None,
):
    """Initialize the HuggingFace model wrapper.

    Args:
        model_name_or_path: The name or path of the model to use.
        model_class: The class of the model to use.
        model_kwargs: Additional model arguments.
        system_prompt: System prompt to use.
        processor_kwargs: Additional processor arguments.
        generation_kwargs: Additional generation arguments.
        image_key: The key used for image inputs in the chat template.
        image_position: Position of the image in the input sequence.
        chat_template: Optional chat template name to use with the processor. If None,
            will use the template stored in the checkpoint's processor config.
    """
    super().__init__(system_prompt=system_prompt)

    self.image_key = image_key
    self.image_position: Literal["before_text", "after_text"] = image_position
    self.model_name_or_path = model_name_or_path
    self.model_class = model_class
    self.model_kwargs = model_kwargs or {}
    self.processor_kwargs = processor_kwargs or {}
    self.generation_kwargs = self._default_generation_kwargs | (generation_kwargs or {})
    self.chat_template = chat_template

    self.model: language_wrappers.HuggingFaceModel
    self.processor: Callable
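
Below is a hedged instantiation sketch. The checkpoint name, model class string, and keyword arguments are illustrative assumptions; any HuggingFace vision-language checkpoint that ships a chat template should work analogously.

from eva.multimodal.models.wrappers import HuggingFaceModel

model = HuggingFaceModel(
    model_name_or_path="Qwen/Qwen2-VL-2B-Instruct",  # assumed checkpoint
    model_class="AutoModelForImageTextToText",       # assumed class name
    model_kwargs={"torch_dtype": "bfloat16"},        # assumed kwargs
    system_prompt="You are a helpful assistant.",
    generation_kwargs={"max_new_tokens": 128},
)
model.configure_model()  # lazily loads the model, see below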

configure_model

Use configure_model hook to load model in lazy fashion.

Source code in src/eva/multimodal/models/wrappers/huggingface.py
def configure_model(self) -> None:
    """Use configure_model hook to load model in lazy fashion."""
    if not hasattr(self, "model"):
        self.model = self.load_model()
        self.model.configure_model()
    if not hasattr(self, "processor"):
        self.processor = self.model.processor

format_inputs

Formats inputs for HuggingFace models.

Parameters:

- batch (TextImageBatch | TextBatch, required): A batch of text and image inputs.

Returns:

- Dict[str, Tensor]: A dictionary produced by the provided processor, following a format like:

    {
        "input_ids": ...,
        "attention_mask": ...,
        "pixel_values": ...
    }

Source code in src/eva/multimodal/models/wrappers/huggingface.py
@override
def format_inputs(self, batch: TextImageBatch | TextBatch) -> Dict[str, torch.Tensor]:
    """Formats inputs for HuggingFace models.

    Args:
        batch: A batch of text and image inputs.

    Returns:
        A dictionary produced by the provided processor following a format like:
        {
            "input_ids": ...,
            "attention_mask": ...,
            "pixel_values": ...
        }
    """
    message_batch, image_batch, _, _ = unpack_batch(batch)

    message_batch = language_message_utils.batch_insert_system_message(
        message_batch, self.system_message
    )
    message_batch = list(map(language_message_utils.combine_system_messages, message_batch))

    if image_batch is None:
        image_batch = [None] * len(message_batch)

    if self.processor.chat_template is not None:  # type: ignore
        templated_text = [
            self.processor.apply_chat_template(  # type: ignore
                message_utils.format_huggingface_message(
                    message,
                    images=images,
                    image_position=self.image_position,
                ),
                add_generation_prompt=True,
                tokenize=False,
            )
            for message, images in zip(message_batch, image_batch, strict=True)
        ]
    else:
        raise NotImplementedError("Currently only chat models are supported.")

    processor_inputs: Dict[str, Any] = {
        "text": templated_text,
        "return_tensors": "pt",
        **self.processor_kwargs,
    }

    if any(image_batch):
        processor_inputs[self.image_key] = image_batch

    return self.processor(**processor_inputs).to(self.model.model.device)  # type: ignore
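
For orientation, HuggingFace chat templates typically consume a conversation structure like the one below. This follows the common transformers convention; the exact structure produced by format_huggingface_message is an assumption.

# Assumed structure passed to processor.apply_chat_template; with
# image_position="after_text" the image entry follows the text entry.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image"},  # placeholder resolved by the processor
        ],
    },
]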

model_forward

Generates text output from the model.

Parameters:

- batch (Dict[str, Tensor], required): A dictionary containing the input data.

Returns:

- ModelOutput: The model output containing generated text.

Source code in src/eva/multimodal/models/wrappers/huggingface.py
@override
def model_forward(self, batch: Dict[str, torch.Tensor]) -> ModelOutput:
    """Generates text output from the model.

    Args:
        batch: A dictionary containing the input data.

    Returns:
        The model output containing generated text.
    """
    return self.model.model_forward(batch)

eva.multimodal.models.wrappers.LiteLLMModel

Bases: VisionLanguageModel

Wrapper class for LiteLLM vision-language models.

Parameters:

- model_name (str, required): The name of the model to use.
- model_kwargs (Dict[str, Any] | None, default None): Additional keyword arguments to pass during generation (e.g., temperature, max_tokens).
- system_prompt (str | None, default None): The system prompt to use (optional).
- image_position (Literal['before_text', 'after_text'], default 'after_text'): Position of image relative to text ("before_text" or "after_text").
- log_level (int | None, default INFO): Optional logging level for LiteLLM. Defaults to INFO.
Source code in src/eva/multimodal/models/wrappers/litellm.py
def __init__(
    self,
    model_name: str,
    model_kwargs: Dict[str, Any] | None = None,
    system_prompt: str | None = None,
    image_position: Literal["before_text", "after_text"] = "after_text",
    log_level: int | None = logging.INFO,
):
    """Initialize the LiteLLM Wrapper.

    Args:
        model_name: The name of the model to use.
        model_kwargs: Additional keyword arguments to pass during
            generation (e.g., `temperature`, `max_tokens`).
        system_prompt: The system prompt to use (optional).
        image_position: Position of image relative to text ("before_text" or "after_text").
        log_level: Optional logging level for LiteLLM. Defaults to INFO.
    """
    super().__init__(system_prompt=system_prompt)
    self.image_position: Literal["before_text", "after_text"] = image_position

    self.language_model = language_wrappers.LiteLLMModel(
        model_name=model_name,
        model_kwargs=model_kwargs,
        system_prompt=system_prompt,
        log_level=log_level,
    )
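
A hedged usage sketch: the model name follows LiteLLM's "provider/model" convention, and the value shown is illustrative only.

import logging

from eva.multimodal.models.wrappers import LiteLLMModel

model = LiteLLMModel(
    model_name="openai/gpt-4o",  # assumed model identifier
    model_kwargs={"temperature": 0.0, "max_tokens": 256},
    system_prompt="You are a concise visual assistant.",
    image_position="after_text",
    log_level=logging.WARNING,
)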