Speculator Model Evaluation with GuideLLM
Evaluate speculator models using vLLM and GuideLLM, and extract acceptance length metrics.
Quick Start
1. Install dependencies:
./setup.sh
2. Run an evaluation with a pre-configured model:
# Llama-3.1-8B EAGLE3 on math_reasoning dataset
./run_evaluation.sh -c configs/llama-3.1-8b-eagle3.env
# Llama-3.3-70B EAGLE3 on math_reasoning dataset
./run_evaluation.sh -c configs/llama-3.3-70b-eagle3.env
# GPT-OSS-20B EAGLE3 on math_reasoning dataset
./run_evaluation.sh -c configs/gpt-oss-20b-eagle3.env
# Qwen3-8B EAGLE3 on math_reasoning dataset
./run_evaluation.sh -c configs/qwen3-8b-eagle3.env
# Qwen3-32B EAGLE3 on math_reasoning dataset
./run_evaluation.sh -c configs/qwen3-32b-eagle3.env
Or run with custom parameters:
./run_evaluation.sh \
-b "meta-llama/Llama-3.1-8B-Instruct" \
-s "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3" \
-d "emulated"
Results will be in a timestamped directory like eval_results_20251203_165432/.
Architecture
This framework uses vLLM's speculative decoding feature to evaluate speculator models. The evaluation setup consists of:
- Base Model: The main LLM that performs final token acceptance/rejection
- Speculator Model: A smaller, faster model that generates speculative tokens
- Speculative Decoding: The base model validates tokens proposed by the speculator, speeding up inference
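Concretely, scripts/vllm_serve.sh starts a vLLM OpenAI-compatible server with speculative decoding enabled. The sketch below is illustrative rather than the script's exact invocation: it assumes a recent vLLM release that accepts a JSON --speculative-config argument (older releases used separate speculative-decoding flags), with values taken from the Llama-3.1-8B config.
# Illustrative only: roughly what scripts/vllm_serve.sh runs for the Llama-3.1-8B EAGLE3 setup
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"method": "eagle3", "model": "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3", "num_speculative_tokens": 3}' \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.8 \
  --port 8000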
The framework itself is a set of modular scripts organized in a clean directory structure:
eval-guidellm/
├── run_evaluation.sh              # Main controller
├── configs/                       # Pre-configured evaluations
│   ├── llama-3.1-8b-eagle3.env   # Llama-3.1-8B
│   ├── llama-3.3-70b-eagle3.env  # Llama-3.3-70B
│   ├── gpt-oss-20b-eagle3.env    # GPT-OSS-20B
│   ├── qwen3-8b-eagle3.env       # Qwen3-8B
│   └── qwen3-32b-eagle3.env      # Qwen3-32B
├── scripts/                       # Utility scripts
│   ├── vllm_serve.sh
│   ├── vllm_stop.sh
│   ├── run_guidellm.sh
│   └── parse_logs.py
└── setup.sh                       # Install dependencies
Configuration
Pre-configured Models
The framework includes configs for common models:
# Llama-3.1-8B EAGLE3 on math_reasoning
./run_evaluation.sh -c configs/llama-3.1-8b-eagle3.env
# Llama-3.3-70B EAGLE3 on math_reasoning
./run_evaluation.sh -c configs/llama-3.3-70b-eagle3.env
# GPT-OSS-20B EAGLE3 on math_reasoning
./run_evaluation.sh -c configs/gpt-oss-20b-eagle3.env
# Qwen3-8B EAGLE3 on math_reasoning
./run_evaluation.sh -c configs/qwen3-8b-eagle3.env
# Qwen3-32B EAGLE3 on math_reasoning
./run_evaluation.sh -c configs/qwen3-32b-eagle3.env
Command Line Usage
./run_evaluation.sh -b BASE_MODEL -s SPECULATOR_MODEL -d DATASET [OPTIONS]
Required (unless supplied via a config file with -c):
-b BASE_MODEL Base model (e.g., "meta-llama/Llama-3.1-8B-Instruct")
-s SPECULATOR_MODEL Speculator model (e.g., "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3")
-d DATASET Dataset for benchmarking (see Dataset Options below)
Optional:
-c FILE Config file to use (e.g., configs/llama-3.1-8b-eagle3.env)
-o DIR Output directory (default: eval_results_TIMESTAMP)
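For example, to run a custom evaluation and write results to a fixed directory instead of the default timestamped one (the directory name here is just a placeholder):
./run_evaluation.sh \
  -b "meta-llama/Llama-3.1-8B-Instruct" \
  -s "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3" \
  -d "RedHatAI/speculator_benchmarks:math_reasoning.jsonl" \
  -o "llama31_8b_math_eval"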
Creating Custom Configs
Create a new config file in configs/:
# configs/my-model.env
# Model configuration
BASE_MODEL="my-org/my-base-model"
SPECULATOR_MODEL="my-org/my-speculator-model"
NUM_SPEC_TOKENS=3
METHOD="eagle3"
# Dataset configuration
DATASET="RedHatAI/speculator_benchmarks:math_reasoning.jsonl"
# vLLM server settings
TENSOR_PARALLEL_SIZE=2
GPU_MEMORY_UTILIZATION=0.8
PORT=8000
HEALTH_CHECK_TIMEOUT=300
# Sampling parameters
TEMPERATURE=0.6
TOP_P=0.95
TOP_K=20
# Output settings
OUTPUT_DIR="eval_results_$(date +%Y%m%d_%H%M%S)"
Then run:
./run_evaluation.sh -c configs/my-model.env
Configuration Options
| Option | Description | Default |
|---|---|---|
| BASE_MODEL | Base model path or HuggingFace ID | (required) |
| SPECULATOR_MODEL | Speculator model path or HuggingFace ID | (required) |
| NUM_SPEC_TOKENS | Number of speculative tokens to generate | 3 |
| METHOD | Speculative decoding method | eagle3 |
| DATASET | Dataset for benchmarking (emulated, HF dataset, or file path) | (required) |
| TENSOR_PARALLEL_SIZE | Number of GPUs for tensor parallelism | 2 |
| GPU_MEMORY_UTILIZATION | GPU memory fraction to use | 0.8 |
| PORT | Server port | 8000 |
| HEALTH_CHECK_TIMEOUT | Server startup timeout (seconds) | 300 |
| TEMPERATURE | Sampling temperature | 0.6 |
| TOP_P | Top-p (nucleus) sampling parameter | 0.95 |
| TOP_K | Top-k sampling parameter | 20 |
| OUTPUT_DIR | Output directory | eval_results_TIMESTAMP |
Dataset Options
The framework supports five types of dataset inputs:
1. Built-in dataset: emulated (included with GuideLLM)
   - Example: DATASET="emulated"
2. HuggingFace dataset (all files): org/dataset-name
   - Automatically downloaded using the HuggingFace CLI
   - Runs benchmarks on all .jsonl files in the dataset
   - Example: DATASET="RedHatAI/speculator_benchmarks"
3. HuggingFace dataset (specific file): org/dataset-name:filename.jsonl
   - Downloads the dataset and uses only the specified file
   - Use a colon (:) to separate the dataset from the filename
   - Example: DATASET="RedHatAI/speculator_benchmarks:math_reasoning.jsonl"
4. Local directory: path to a folder containing .jsonl files
   - Runs benchmarks on all .jsonl files in the directory
   - Results are saved with dataset-specific filenames
   - Example: DATASET="./my_datasets/"
5. Local file: path to a single .jsonl file
   - Runs the benchmark on that specific file
   - Example: DATASET="./my_data.jsonl"
Advanced Usage
Manual Workflow
For debugging or running multiple benchmarks against the same server:
# Terminal 1: Start server
./scripts/vllm_serve.sh \
-b "meta-llama/Llama-3.1-8B-Instruct" \
-s "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3" \
--num-spec-tokens 3 \
--method eagle3 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.8 \
--log-file server.log \
--pid-file server.pid
# Terminal 2: Run benchmarks
./scripts/run_guidellm.sh -d "dataset1.jsonl" --output-file results1.json
./scripts/run_guidellm.sh -d "dataset2.jsonl" --output-file results2.json
# Parse acceptance metrics
python scripts/parse_logs.py server.log -o acceptance_stats.txt
# Terminal 1: Stop server
./scripts/vllm_stop.sh --pid-file server.pid
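When driving the server manually, it can help to confirm it is up before launching benchmarks. Assuming the default PORT=8000, vLLM's OpenAI-compatible server exposes a health endpoint and a model listing:
# Returns HTTP 200 once the server is ready to accept requests
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health
# Lists the model(s) served by this vLLM instance
curl -s http://localhost:8000/v1/models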
Output Files
All results are saved in a timestamped output directory.
Single Dataset
eval_results_20251203_165432/
├── vllm_server.log # vLLM server output (used for parsing)
├── guidellm_output.log # GuideLLM benchmark progress
├── guidellm_results.json # GuideLLM performance metrics
└── acceptance_analysis.txt # Acceptance length statistics
Multiple Datasets (Directory or HuggingFace)
When using a directory or HuggingFace dataset with multiple .jsonl files:
eval_results_20251203_165432/
├── vllm_server.log # vLLM server output (all benchmarks)
├── guidellm_output_dataset1.log # Benchmark progress for dataset1
├── guidellm_output_dataset2.log # Benchmark progress for dataset2
├── guidellm_results_dataset1.json # Performance metrics for dataset1
├── guidellm_results_dataset2.json # Performance metrics for dataset2
└── acceptance_analysis.txt # Combined acceptance statistics
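To spot-check the outputs from the command line (jq is optional and not one of the framework's dependencies, and the exact GuideLLM JSON schema varies by version):
# Print the top-level keys of the GuideLLM results and the acceptance summary
jq 'keys' eval_results_20251203_165432/guidellm_results.json
cat eval_results_20251203_165432/acceptance_analysis.txt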
Acceptance Metrics
The acceptance_analysis.txt contains:
- Weighted acceptance rates: Per-position acceptance rates weighted by draft tokens
- Conditional acceptance rates: Probability of accepting position N given position N-1 was accepted
These metrics help evaluate the effectiveness of speculative decoding.
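For example (with illustrative numbers): if the weighted acceptance rates for draft positions 1-3 are 0.80, 0.60, and 0.40, the conditional acceptance rate at position 2 is 0.60 / 0.80 = 0.75, because a draft token can only be accepted if every earlier position was accepted. Under that same prefix-acceptance assumption, the expected acceptance length per verification step is roughly 1 + 0.80 + 0.60 + 0.40 = 2.8 tokens: the base model's own token plus the accepted draft tokens.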
Examples
Using Pre-configured Models
./run_evaluation.sh -c configs/llama-3.1-8b-eagle3.env
./run_evaluation.sh -c configs/llama-3.3-70b-eagle3.env
./run_evaluation.sh -c configs/gpt-oss-20b-eagle3.env
./run_evaluation.sh -c configs/qwen3-8b-eagle3.env
./run_evaluation.sh -c configs/qwen3-32b-eagle3.env
Quick Test with Emulated Dataset
./run_evaluation.sh \
-b "meta-llama/Llama-3.1-8B-Instruct" \
-s "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3" \
-d "emulated"
HuggingFace Dataset (Specific File)
./run_evaluation.sh \
-b "meta-llama/Llama-3.1-8B-Instruct" \
-s "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3" \
-d "RedHatAI/speculator_benchmarks:math_reasoning.jsonl"
HuggingFace Dataset (All Files)
./run_evaluation.sh \
-b "meta-llama/Llama-3.1-8B-Instruct" \
-s "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3" \
-d "RedHatAI/speculator_benchmarks"
Local File or Directory
# Single file
./run_evaluation.sh \
-b "meta-llama/Llama-3.1-8B-Instruct" \
-s "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3" \
-d "./my_data.jsonl"
# All .jsonl files in directory
./run_evaluation.sh \
-b "meta-llama/Llama-3.1-8B-Instruct" \
-s "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3" \
-d "./my_datasets/"
Troubleshooting
Server won't start:
# Check the vLLM server log in the output directory for the underlying error (out of memory, port in use, etc.)
tail -n 100 eval_results_*/vllm_server.log
# If startup is just slow, raise HEALTH_CHECK_TIMEOUT; if the GPU runs out of memory, lower GPU_MEMORY_UTILIZATION or adjust TENSOR_PARALLEL_SIZE
Dataset not found:
hf download DATASET --repo-type dataset # Test HF dataset download
./run_evaluation.sh -b BASE_MODEL -s SPECULATOR_MODEL -d emulated   # Use built-in dataset
Server cleanup:
# Stop a manually started or leftover server (see Manual Workflow above)
./scripts/vllm_stop.sh --pid-file server.pid
Dependencies
Required: Python 3.9+, vLLM, GuideLLM, HuggingFace CLI, curl
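A quick way to sanity-check the environment after running setup.sh (assuming pip-based installs; the HuggingFace CLI ships with the huggingface_hub package):
# Verify the Python version and the packages/tools the framework relies on
python3 --version
pip show vllm guidellm huggingface_hub
curl --version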