PubMedQA Text Classification
This tutorial demonstrates how to evaluate large language models using eva's language module. For detailed information about the dataset, see the PubMedQA dataset documentation.
Before you start
If you haven't downloaded the config files yet, please download them from GitHub.
To enable automatic dataset download, set the environment variable DOWNLOAD_DATA="true"
when running eva. By default, all eva config files have download: false
to make sure users don't violate the license terms unintentionally. Additionally, you can set DATA_ROOT
to configure the location where the dataset will be downloaded to / loaded from during evaluation (the default is ./data/pubmedqa).
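For example, to enable the automatic download and store the data in a custom location:
export DOWNLOAD_DATA="true"
export DATA_ROOT="/path/to/pubmedqa"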
Model Options
eva supports three ways to run language models:
- LiteLLM: For API-based models (OpenAI, Anthropic, Together.ai, etc.) - requires API keys
- HuggingFace: For running models directly on your computer (typically models under 8B parameters)
- vLLM: For larger models that require cloud infrastructure (user must set up the environment)
Running PubMedQA Classification
1. Using LiteLLM (API-based models)
First, set up your API key for the provider you want to use:
# For OpenAI models
export OPENAI_API_KEY=your_openai_api_key
# For Anthropic models
export ANTHROPIC_API_KEY=your_anthropic_api_key
# For Together.ai models
export TOGETHER_API_KEY=your_together_api_key
Then run with provider-prefixed model names. For example:
# Anthropic Claude models
MODEL_NAME=anthropic/claude-3-7-sonnet-latest eva validate --config configs/language/pubmedqa.yaml
2. Using HuggingFace models (local execution)
For smaller models (typically under 8B parameters) that can run locally on your machine:
First, update the config to use the HuggingFace wrapper:
model:
  class_path: eva.language.models.TextModule
  init_args:
    model:
      class_path: eva.language.models.HuggingFaceTextModel
      init_args:
        model_name_or_path: meta-llama/Llama-3.2-1B-Instruct
Then run:
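eva validate --config configs/language/pubmedqa.yaml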
Note: If you encounter the error TypeError: argument of type 'NoneType' is not iterable
when using HuggingFace models, this is due to a compatibility issue between transformers 4.45+ and PyTorch 2.3.0. To fix this, downgrade transformers:
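For example, pinning transformers below 4.45 (the pin shown here is just one option; any release before 4.45 avoids the incompatibility):
pip install "transformers<4.45"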
3. Using vLLM (cloud/distributed execution)
For larger models that require specialized infrastructure, you'll need to:
- Set up a vLLM server in your cloud environment
- Update the config to use the vLLM wrapper:
model:
  class_path: eva.language.models.TextModule
  init_args:
    model:
      class_path: eva.language.models.VLLMTextModel
      init_args:
        model_name_or_path: meta-llama/Llama-2-70b-chat-hf
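Once the infrastructure and config are in place, the evaluation is launched with the same command as before:
eva validate --config configs/language/pubmedqa.yaml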
4. Basic evaluation with default configuration
The default PubMedQA config uses LiteLLM with Anthropic's Claude model. Run:
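eva validate --config configs/language/pubmedqa.yaml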
To enable automatic dataset download:
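DOWNLOAD_DATA="true" eva validate --config configs/language/pubmedqa.yaml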
This command will:
- Load the manually curated test set of 1000 question-abstract pairs (downloading it automatically if DOWNLOAD_DATA="true" is set)
- Use the default Claude model to classify each question-abstract pair
- Show evaluation results including accuracy, precision, recall, and F1 scores
5. Customizing batch size and workers
For better performance or to work within API rate limits, you can adjust the batch size and number of workers:
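For example, assuming the config reads these values from the BATCH_SIZE and N_DATA_WORKERS environment variables (this follows the common eva config pattern; check configs/language/pubmedqa.yaml for the exact keys it exposes):
# env var names assume the common eva pattern; adjust values to your rate limits
BATCH_SIZE=8 N_DATA_WORKERS=2 eva validate --config configs/language/pubmedqa.yaml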
Understanding the results
Once the evaluation is complete:
- Check the evaluation results in logs/<model-name>/pubmedqa/<session-id>/results.json
- The results will include metrics computed on the 1000 manually annotated test examples:
  - Accuracy: Overall classification accuracy across all three classes
  - Precision/Recall/F1: Per-class and macro-averaged metrics
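For a quick look at the metrics from the command line, you can pretty-print the JSON (substitute the placeholders with your actual model name and session ID):
python -m json.tool logs/<model-name>/pubmedqa/<session-id>/results.json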
Key configuration components
The PubMedQA config demonstrates several important concepts:
Text prompting:
prompt: "Instruction: You are an expert in biomedical research. Please carefully read the question and the relevant context and answer with yes, no, or maybe. Only answer with one of these three words."
Model configuration (LiteLLM):
model:
  class_path: eva.language.models.LiteLLMTextModel
  init_args:
    model_name_or_path: ${oc.env:MODEL_NAME, anthropic/claude-3-7-sonnet-latest}
Postprocessing:
postprocess:
  predictions_transforms:
    - class_path: eva.language.utils.str_to_int_tensor.CastStrToIntTensor
This converts the model's text output (yes/no/maybe) into integer tensors for evaluation using a regex-based mapping:
- "no" → 0
- "yes" → 1
- "maybe" → 2
Custom mappings are also supported by providing a mapping dictionary during initialization:
postprocess:
  predictions_transforms:
    - class_path: eva.language.utils.str_to_int_tensor.CastStrToIntTensor
      init_args:
        mapping:
          "positive|good": 1
          "negative|bad": 0
Advanced usage
Custom prompts
You can experiment with different prompting strategies by modifying the prompt in the config file. For example, you might try:
- Chain-of-thought prompting
- Few-shot examples
- Different output formats
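For instance, a chain-of-thought-style variant might look like the sketch below (the wording is only an illustration, not a prompt shipped with eva). Keep in mind that the final answer still needs to contain a bare yes, no, or maybe so the CastStrToIntTensor postprocessing can parse it:
# Illustrative chain-of-thought prompt (not the prompt shipped with eva)
prompt: "Instruction: You are an expert in biomedical research. Read the question and the relevant context, briefly reason step by step about the evidence, and then give your final answer on the last line as a single word: yes, no, or maybe."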
Model comparison
Run evaluations with multiple models to compare their performance on the 1000-question test set:
# Compare different API providers
MODEL_NAME=anthropic/claude-3-sonnet-20240229 eva validate --config configs/language/pubmedqa.yaml
MODEL_NAME=openai/gpt-4o eva validate --config configs/language/pubmedqa.yaml
# Compare model sizes within a provider
MODEL_NAME=anthropic/claude-3-haiku-20240307 eva validate --config configs/language/pubmedqa.yaml
MODEL_NAME=anthropic/claude-3-sonnet-20240229 eva validate --config configs/language/pubmedqa.yaml
Statistical evaluation with multiple runs
For a more robust performance estimate, you can run the evaluation multiple times and report the mean and standard deviation across runs:
# Run 5 evaluations with different random seeds
N_RUNS=5 eva validate --config configs/language/pubmedqa.yaml
This will automatically run the evaluation 5 times and compute statistics (mean and standard deviation) across all runs for more reliable performance estimates.
The results from each run will be stored separately, allowing you to compare performance across different models and configurations.