Wrappers
Reference information for the multimodal Wrappers API.
eva.multimodal.models.wrappers.VisionLanguageModel
Bases: BaseModel[TextImageBatch, ModelOutput]
Base class for multimodal models.
Classes that inherit from this should implement the following methods:
- `load_model`: Loads & instantiates the model.
- `model_forward`: Implements the forward pass of the model. For API models, this can be an API call.
- `format_inputs`: Preprocesses and converts the input batch into the format expected by the `model_forward` method.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
system_prompt | str \| None | The system prompt to use for the model (optional). | required |
output_transforms | Callable \| None | Optional transforms to apply to the output of the model's forward pass. | None |
Source code in src/eva/multimodal/models/wrappers/base.py
forward
format_inputs (abstractmethod)
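A minimal subclassing sketch. The `EchoVLM` name, its stub model, and the method bodies are illustrative assumptions; only the three hook names come from this page:

```python
from eva.multimodal.models import wrappers


class EchoVLM(wrappers.VisionLanguageModel):
    """Toy wrapper that wires the three required hooks together (illustrative)."""

    def load_model(self) -> None:
        # A real wrapper would instantiate an nn.Module or an API client here;
        # this stub is an assumption for illustration only.
        self._model = lambda inputs: {"answers": ["..."] * len(inputs["text"])}

    def format_inputs(self, batch):
        # Convert the incoming batch into whatever model_forward expects;
        # the exact batch attributes are assumptions.
        return {"text": batch.text, "image": batch.image}

    def model_forward(self, inputs):
        # For API-backed models this could be an HTTP call instead.
        return self._model(inputs)
```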
eva.multimodal.models.wrappers.ModelFromRegistry
Bases: BaseModel[TextImageBatch, List[str]]
Wrapper class for vision backbone models.
This class can be used to load backbones available in eva's
model registry by name. New backbones can be registered using
the `@backbone_registry.register(model_name)` decorator.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
model_name | str | The name of the model to load. | required |
model_kwargs | Dict[str, Any] \| None | The arguments used for instantiating the model. | None |
model_extra_kwargs | Dict[str, Any] \| None | Extra arguments used for instantiating the model. | None |
transforms | Callable \| None | The transforms to apply to the output tensor produced by the model. | None |
Source code in src/eva/multimodal/models/wrappers/from_registry.py
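A hedged usage sketch; the registry key and kwargs below are placeholders, not real registry entries:

```python
from eva.multimodal.models.wrappers import ModelFromRegistry

# "my_org/my_vlm" is a placeholder registry key (assumed, not a real entry).
model = ModelFromRegistry(
    model_name="my_org/my_vlm",
    model_kwargs={"device_map": "auto"},  # assumed kwargs
)

# Registering a new backbone uses the decorator named on this page; the
# import path of backbone_registry is not documented here, so it is omitted:
# @backbone_registry.register("my_org/my_vlm")
# def my_vlm(**kwargs): ...
```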
eva.multimodal.models.wrappers.HuggingFaceModel
Bases: VisionLanguageModel
Lightweight wrapper for Huggingface VLMs.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
model_name_or_path | str | The name or path of the model to use. | required |
model_class | str | The class of the model to use. | required |
model_kwargs | Dict[str, Any] \| None | Additional model arguments. | None |
system_prompt | str \| None | System prompt to use. | None |
processor_kwargs | Dict[str, Any] \| None | Additional processor arguments. | None |
generation_kwargs | Dict[str, Any] \| None | Additional generation arguments. | None |
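A hedged instantiation sketch. The checkpoint name and kwargs are assumptions; `Blip2ForConditionalGeneration` is a real transformers class, but whether it is the right choice depends on the checkpoint:

```python
from eva.multimodal.models.wrappers import HuggingFaceModel

model = HuggingFaceModel(
    model_name_or_path="Salesforce/blip2-opt-2.7b",  # assumed checkpoint
    model_class="Blip2ForConditionalGeneration",     # looked up inside transformers
    generation_kwargs={"max_new_tokens": 64},        # assumed generation kwargs
    system_prompt="You are a concise assistant.",    # assumed prompt
)
```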
Source code in src/eva/multimodal/models/wrappers/huggingface.py
format_inputs
Formats inputs for HuggingFace models.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
batch | TextImageBatch \| TextBatch | A batch of text and image inputs. | required |
Returns:

Type | Description |
---|---|
Dict[str, Tensor] | A dictionary produced by the provided processor, following a format like: { "input_ids": ..., "attention_mask": ..., "pixel_values": ... } |
Source code in src/eva/multimodal/models/wrappers/huggingface.py
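To illustrate the shape of that dictionary, here is a rough sketch calling a transformers processor directly (the checkpoint is an assumption; exact keys vary per model):

```python
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")  # assumed checkpoint
image = Image.new("RGB", (224, 224))  # dummy image for illustration
inputs = processor(images=image, text="Describe the image.", return_tensors="pt")
print(sorted(inputs.keys()))  # e.g. ['attention_mask', 'input_ids', 'pixel_values']
```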
model_forward
Generates text output from the model. Called by the `generate` method.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
batch | Dict[str, Tensor] | A dictionary containing the input data, which may include "text" (list of messages formatted for the model) and "image" (list of image tensors). | required |
Returns:

Type | Description |
---|---|
ModelOutput | A dictionary containing the processed input and the model's output. |
Source code in src/eva/multimodal/models/wrappers/huggingface.py
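Internally, a HuggingFace-style generation pass looks roughly like the following; this uses the standard transformers API, with the wrapper's surrounding details left out as assumptions:

```python
from transformers import Blip2ForConditionalGeneration

# `inputs` and `processor` as in the format_inputs sketch above.
hf_model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
output_ids = hf_model.generate(**inputs, max_new_tokens=64)
texts = processor.batch_decode(output_ids, skip_special_tokens=True)
```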
load_model
Sets up the model. Used for delayed model initialization.
Raises:

Type | Description |
---|---|
ValueError | If the model class is not found in transformers, or if gradient checkpointing is enabled but the model does not support it. |
Source code in src/eva/multimodal/models/wrappers/huggingface.py
load_processor
Initialize the processor.
Source code in src/eva/multimodal/models/wrappers/huggingface.py
eva.multimodal.models.wrappers.LiteLLMModel
Bases: VisionLanguageModel
Wrapper class for LiteLLM vision-language models.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
model_name | str | The name of the model to use. | required |
model_kwargs | Dict[str, Any] \| None | Additional keyword arguments to pass during generation (e.g., …). | None |
system_prompt | str \| None | The system prompt to use (optional). | None |
log_level | int \| None | Optional logging level for LiteLLM. Defaults to WARNING. | INFO |
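A hedged usage sketch; the model id, kwargs, and prompt are assumptions, and any LiteLLM-routable model name should work in place of the one shown:

```python
from eva.multimodal.models.wrappers import LiteLLMModel

model = LiteLLMModel(
    model_name="openai/gpt-4o",          # any LiteLLM-routable model id (assumed)
    model_kwargs={"temperature": 0.0},   # forwarded at generation time (assumed)
    system_prompt="Answer with a single word.",
)
```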