PubMedQA Text Classification
This tutorial demonstrates how to evaluate large language models using eva's language module. For detailed information about the dataset, see the PubMedQA dataset documentation.
Before you start
If you haven't downloaded the config files yet, please download them from GitHub.
To enable automatic dataset download, set the environment variable DOWNLOAD_DATA="true"
when running eva. By default, all eva config files have download: false
to make sure users don't violate the license terms unintentionally. Additionally, you can set DATA_ROOT
to configure the location where the dataset will be downloaded to / loaded from during evaluation (the default is ./data/pubmedqa).
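For example, to enable the automatic download and store the data in a custom location:
export DOWNLOAD_DATA="true"
export DATA_ROOT="/path/to/pubmedqa"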
Model Options
eva supports three ways to run language models:
- LiteLLM: For API-based models (OpenAI, Anthropic, Together.ai, etc.) - requires API keys
- HuggingFace: For running models directly on your computer (typically models under 8B parameters)
- vLLM: For larger models that require cloud infrastructure (user must set up the environment)
Running PubMedQA Classification
1. Using LiteLLM (API-based models)
First, set up your API key for the provider you want to use:
# For OpenAI models
export OPENAI_API_KEY=your_openai_api_key
# For Anthropic models
export ANTHROPIC_API_KEY=your_anthropic_api_key
# For Together.ai models
export TOGETHER_API_KEY=your_together_api_key
Then run with provider-prefixed model names. For example:
# Anthropic Claude models
MODEL_NAME=anthropic/claude-3-7-sonnet-latest eva validate --config configs/language/pubmedqa.yaml
2. Using HuggingFace models (local execution)
For smaller models (typically under 8B parameters) that can run locally on your machine:
First, update the config to use the HuggingFace wrapper:
model:
  class_path: eva.language.models.TextModule
  init_args:
    model:
      class_path: eva.language.models.HuggingFaceTextModel
      init_args:
        model_name_or_path: meta-llama/Llama-3.2-1B-Instruct
Then run:
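eva validate --config configs/language/pubmedqa.yaml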
Note: If you encounter the error TypeError: argument of type 'NoneType' is not iterable
when using HuggingFace models, this is due to a compatibility issue between transformers 4.45+ and PyTorch 2.3.0. To fix this, downgrade transformers:
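For example, pinning transformers below 4.45 (the pin shown here is just one option; any release before 4.45 avoids the incompatibility):
pip install "transformers<4.45"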
3. Using vLLM (cloud/distributed execution)
For larger models that require specialized infrastructure, you'll need to:
- Set up a vLLM server in your cloud environment
- Update the config to use the vLLM wrapper:
model:
  class_path: eva.language.models.TextModule
  init_args:
    model:
      class_path: eva.language.models.VLLMTextModel
      init_args:
        model_name_or_path: meta-llama/Llama-2-70b-chat-hf
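Once the infrastructure and config are in place, the evaluation is launched with the same command as before:
eva validate --config configs/language/pubmedqa.yaml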
4. Basic evaluation with default configuration
The default PubMedQA config uses LiteLLM with Anthropic's Claude model. Run:
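eva validate --config configs/language/pubmedqa.yaml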
To enable automatic dataset download:
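DOWNLOAD_DATA="true" eva validate --config configs/language/pubmedqa.yaml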
This command will:
- Load the manually curated test set of 1000 question-abstract pairs (downloading it automatically if DOWNLOAD_DATA="true" is set)
- Use the default Claude model to classify each question-abstract pair
- Show evaluation results including accuracy, precision, recall, and F1 scores
5. Customizing batch size and workers
For better performance or to work within API rate limits, you can adjust the batch size and number of workers:
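For example, assuming the config reads these values from the BATCH_SIZE and N_DATA_WORKERS environment variables (this follows the common eva config pattern; check configs/language/pubmedqa.yaml for the exact keys it exposes):
# env var names assume the common eva pattern; adjust values to your rate limits
BATCH_SIZE=8 N_DATA_WORKERS=2 eva validate --config configs/language/pubmedqa.yaml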
Understanding the results
Once the evaluation is complete:
- Check the evaluation results in logs/<model-name>/pubmedqa/<session-id>/results.json
- The results will include metrics computed on the 1000 manually annotated test examples:
  - Accuracy: Overall classification accuracy across all three classes
  - Precision/Recall/F1: Per-class and macro-averaged metrics
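For a quick look at the metrics from the command line, you can pretty-print the JSON (substitute the placeholders with your actual model name and session ID):
python -m json.tool logs/<model-name>/pubmedqa/<session-id>/results.json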
Key configuration components
The PubMedQA config demonstrates several important concepts:
Text prompting:
prompt: "Instruction: You are an expert in biomedical research. Please carefully read the question and the relevant context and answer with yes, no, or maybe. Only answer with one of these three words."
Model configuration (LiteLLM):
model:
  class_path: eva.language.models.LiteLLMTextModel
  init_args:
    model_name_or_path: ${oc.env:MODEL_NAME, anthropic/claude-3-7-sonnet-latest}
Postprocessing:
postprocess:
  predictions_transforms:
    - class_path: eva.language.utils.str_to_int_tensor.CastStrToIntTensor
This converts the model's text output (yes/no/maybe) into integer tensors for evaluation using a regex-based mapping:
- "no" → 0
- "yes" → 1
- "maybe" → 2
Custom mappings are also supported by providing a mapping dictionary during initialization:
postprocess:
  predictions_transforms:
    - class_path: eva.language.utils.str_to_int_tensor.CastStrToIntTensor
      init_args:
        mapping:
          "positive|good": 1
          "negative|bad": 0
Advanced usage
Custom prompts
You can experiment with different prompting strategies by modifying the prompt in the config file. For example, you might try:
- Chain-of-thought prompting
- Few-shot examples
- Different output formats
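For instance, a chain-of-thought-style variant might look like the sketch below (the wording is only an illustration, not a prompt shipped with eva). Keep in mind that the final answer still needs to contain a bare yes, no, or maybe so the CastStrToIntTensor postprocessing can parse it:
# Illustrative chain-of-thought prompt (not the prompt shipped with eva)
prompt: "Instruction: You are an expert in biomedical research. Read the question and the relevant context, briefly reason step by step about the evidence, and then give your final answer on the last line as a single word: yes, no, or maybe."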
Model comparison
Run evaluations with multiple models to compare their performance on the 1000-question test set:
# Compare different API providers
MODEL_NAME=anthropic/claude-3-sonnet-20240229 eva validate --config configs/language/pubmedqa.yaml
MODEL_NAME=openai/gpt-4o eva validate --config configs/language/pubmedqa.yaml
# Compare model sizes within a provider
MODEL_NAME=anthropic/claude-3-haiku-20240307 eva validate --config configs/language/pubmedqa.yaml
MODEL_NAME=anthropic/claude-3-sonnet-20240229 eva validate --config configs/language/pubmedqa.yaml
Statistical evaluation with multiple runs
For a more robust performance estimate, you can run the evaluation multiple times and report the mean and standard deviation across runs:
# Run 5 evaluations with different random seeds
N_RUNS=5 eva validate --config configs/language/pubmedqa.yaml
This will automatically run the evaluation 5 times and compute statistics (mean and standard deviation) across all runs for more reliable performance estimates.
The results from each run will be stored separately, allowing you to compare performance across different models and configurations.