speculators.model
Base model classes for the Speculators library.
This module contains the base model classes for speculative decoding implementations in the Speculators library. These classes provide the foundation for creating speculator models that can perform speculative token generation with verifier models for accelerated inference.
The models extend Hugging Face's PreTrainedModel and GenerationMixin to maintain full compatibility with the transformers ecosystem while adding speculative decoding capabilities. They support automatic model registration and discovery, dynamic model loading based on configuration, and flexible verifier attachment.
Classes:

- SpeculatorModel – Abstract base class for all speculator models, providing transformers compatibility, automatic registry support, and speculative generation methods.

Functions:

- reload_and_populate_models – Automatically populates the model registry so registered SpeculatorModel subclasses can be discovered and instantiated.
SpeculatorModel
SpeculatorModel(
    config: SpeculatorModelConfig,
    verifier: str | PathLike | PreTrainedModel | None,
    verifier_attachment_mode: Literal["detached", "full", "train_only"] | None,
    **kwargs,
)
Bases: ClassRegistryMixin, PreTrainedModel, GenerationMixin
Abstract base class for all speculator models in the Speculators library.
This class provides the foundation for implementing speculative decoding models that can generate candidate tokens to be verified by a base verifier model. It combines the functionality of Hugging Face's PreTrainedModel and GenerationMixin with automatic model registration and discovery capabilities. All concrete speculator model implementations must inherit from this class, register with SpeculatorModel.register(NAME), and implement the abstract forward method.
Example:
# Load a speculator model with automatic class resolution
model = SpeculatorModel.from_pretrained("path/to/speculator")
# Optionally attach a new verifier model
verifier = AutoModel.from_pretrained("path/to/verifier")
model.attach_verifier(verifier)
# Generate with speculative decoding
outputs = model.generate(input_ids, max_length=100)
Initialize a SpeculatorModel instance.
Sets up the basic structure for a speculator model, including configuration storage and optional verifier model attachment. The verifier model is used during speculative decoding to validate the tokens proposed by the speculator.
If no verifier is provided during initialization, it must be attached later using the attach_verifier method before calling generate.
Parameters:

- config (SpeculatorModelConfig) – The configuration for the speculator model. Must be a SpeculatorModelConfig instance containing model hyperparameters and speculative decoding settings.
- verifier (str | PathLike | PreTrainedModel | None) – The verifier model to attach. This can be a path to a local model directory, a Hugging Face model identifier, or an instance of PreTrainedModel. If provided, the speculator will use this verifier for speculative decoding. If None, the speculator will load the verifier from the config if specified, or it must be attached later using the attach_verifier method.
- verifier_attachment_mode (Literal['detached', 'full', 'train_only'] | None) – Optional mode for how the verifier is attached to the speculator. If "detached", any verifier passed in or resolved from the config will not be attached. If "full", the verifier is fully integrated into the speculator's forward pass and generation methods. If "train_only", only the portions of the verifier needed for training are attached, allowing for better resource utilization during training. If None and a verifier is provided, it defaults to "full". If no verifier is provided and none is found in the config, this parameter is ignored. See the sketch after this list.
- kwargs – Additional keyword arguments passed to the parent PreTrainedModel constructor.
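A minimal sketch of the three attachment modes, assuming hypothetical speculator and verifier paths:
# "full" (the default when a verifier is given): wired into forward and generate
model = SpeculatorModel.from_pretrained(
    "path/to/speculator",
    verifier="path/to/verifier",
    verifier_attachment_mode="full",
)
# "train_only": attach only the verifier pieces needed for training
model = SpeculatorModel.from_pretrained(
    "path/to/speculator",
    verifier="path/to/verifier",
    verifier_attachment_mode="train_only",
)
# "detached": ignore any verifier from args/config; attach one before generate()
model = SpeculatorModel.from_pretrained(
    "path/to/speculator", verifier_attachment_mode="detached"
)
model.attach_verifier("path/to/verifier")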
Methods:

- attach_verifier – Attach a verifier model to the speculator for inference/training.
- detach_verifier – Removes the reference to the attached verifier model and frees up the associated memory.
- forward – Defines the forward pass computation for the speculator model.
- from_pretrained – Load a pretrained speculator model from the Hugging Face Hub or local directory.
- from_training_args – Create a model instance from training arguments.
- generate – Generate text using speculative decoding with the attached verifier model.
- get_trainer_kwargs – Get algorithm-specific kwargs for training and validation.
- registered_model_class_from_config – Looks up the appropriate speculator model class from the registry based on the configuration type.
- resolve_verifier – Resolves the verifier model from a given path or identifier.
- state_dict – Overrides the state_dict method from PyTorch so save pathways exclude the verifier's parameters.
- verify_training_compatible – Verify that a model instance is compatible with the training infrastructure.
attach_verifier
attach_verifier(
verifier: str | PathLike | PreTrainedModel,
mode: Literal["full", "train_only"] | None = None,
)
Attach a verifier model to the speculator for running inference/training. The verifier validates the candidate tokens generated by the speculator during the speculative decoding process, and it should be compatible with the speculator's configuration in terms of vocabulary, architecture, and tokenization.
Example:
# Load and attach a verifier
verifier = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf")
speculator.attach_verifier(verifier)
# Now ready for generation
outputs = speculator.generate(input_ids)
Parameters:

- verifier (str | PathLike | PreTrainedModel) – The verifier model to attach. This can be a path to a local model directory, a Hugging Face model identifier, or an instance of PreTrainedModel. If a path or identifier is provided, the model will be loaded automatically. If an instance is provided, it will be used directly.
- mode (Literal['full', 'train_only'] | None, default: None) – Optional mode for how the verifier is attached to the speculator. If "full", the verifier is fully integrated into the speculator's forward pass and generation methods. If "train_only", only the portions of the verifier needed for training are attached, allowing for better resource utilization during training. If None, defaults to "full".

Returns:

- The PreTrainedModel instance for the verifier that was attached.
detach_verifier
Removes the reference to the attached verifier model and frees up the associated memory. After calling this method, the speculator will not be able to perform forward passes or generation until a new verifier is attached.
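Example (a minimal sketch; the verifier path and inputs are hypothetical):
# Attach, generate, then detach to release the verifier's memory
speculator.attach_verifier("path/to/verifier")
outputs = speculator.generate(input_ids, max_length=100)
speculator.detach_verifier()
# generate() will fail until a new verifier is attached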
forward
Defines the forward pass computation for the speculator model.
This method must be implemented by all concrete speculator model subclasses. It defines how the model processes inputs to generate candidate tokens or logits specifically for training pipelines.
Use model.generate for generation tasks, which will handle speculative decoding with the attached verifier.
Parameters:

- args – Positional arguments for the forward pass, typically including input_ids and potentially attention_mask, position_ids, etc.
- kwargs – Keyword arguments for the forward pass, which may include various model-specific parameters and options.

Returns:

- Model outputs, typically including logits or candidate token sequences, depending on the specific speculator implementation.
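A minimal sketch of a concrete subclass, assuming hypothetical names (MySpeculator) and config fields (hidden_size, vocab_size); the body is illustrative only, not a real speculative decoding algorithm, and a real implementation would also register itself via SpeculatorModel.register(NAME):
import torch
from torch import nn
from speculators.model import SpeculatorModel

class MySpeculator(SpeculatorModel):
    def __init__(self, config, verifier=None, verifier_attachment_mode=None, **kwargs):
        super().__init__(config, verifier, verifier_attachment_mode, **kwargs)
        # Assumes the config exposes hidden_size and vocab_size
        self.head = nn.Linear(config.hidden_size, config.vocab_size)

    def forward(self, hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
        # Map verifier hidden states to candidate-token logits for training
        return self.head(hidden_states)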
from_pretrained classmethod
from_pretrained(
pretrained_model_name_or_path: str | PathLike | None,
*model_args,
verifier: str | PathLike | PreTrainedModel | None = None,
verifier_attachment_mode: Literal["detached", "full", "train_only"] | None = None,
config: PretrainedConfig | str | PathLike | None = None,
cache_dir: str | PathLike | None = None,
ignore_mismatched_sizes: bool = False,
force_download: bool = False,
local_files_only: bool = False,
token: str | bool | None = None,
revision: str = "main",
use_safetensors: bool | None = None,
weights_only: bool = True,
**kwargs,
) -> SpeculatorModel
Load a pretrained speculator model from the Hugging Face Hub or local directory.
This method automatically resolves the correct speculator model class based on the configuration type and loads the model with the appropriate weights. If called on the base SpeculatorModel class, it will automatically determine and instantiate the correct subclass based on the model configuration.
Example:
# Load with automatic class resolution
model = SpeculatorModel.from_pretrained("RedHatAI/speculator-llama-7b")
# Load from local directory
model = SpeculatorModel.from_pretrained("./my_speculator")
# Load with custom config
config = SpeculatorModelConfig.from_pretrained("RedHatAI/eagle-llama-7b")
model = SpeculatorModel.from_pretrained(
None, config=config, state_dict=state_dict
)
Parameters:

- pretrained_model_name_or_path (str | PathLike | None) – The model identifier on Hugging Face Hub, or path to a local directory containing the model files. Can be None if config is provided as a path.
- model_args – Additional positional arguments passed to the model constructor.
- verifier (str | PathLike | PreTrainedModel | None, default: None) – Optional verifier model to attach the speculator to. Can be a path to a local model directory, a Hugging Face model identifier, or an instance of PreTrainedModel. If provided, the speculator will use this verifier for speculative decoding. If None, the speculator will load the verifier from the config if specified, or it must be attached later using the attach_verifier method.
- verifier_attachment_mode (Literal['detached', 'full', 'train_only'] | None, default: None) – Optional mode for how the verifier is attached to the speculator. If "detached", any verifier passed in or resolved from the config will not be attached. If "full", the verifier is fully integrated into the speculator's forward pass and generation methods. If "train_only", only the portions of the verifier needed for training are attached, allowing for better resource utilization during training. If None and a verifier is provided, it defaults to "full". If no verifier is provided and none is found in the config, this parameter is ignored.
- config (PretrainedConfig | str | PathLike | None, default: None) – Optional configuration for the model. Can be a SpeculatorModelConfig instance, a path to a config file, or None to load from the model directory.
- cache_dir (str | PathLike | None, default: None) – Directory to cache downloaded files. If None, uses the default transformers cache directory.
- ignore_mismatched_sizes (bool, default: False) – Whether to ignore size mismatches when loading pretrained weights. Useful for loading models with different architectures.
- force_download (bool, default: False) – Whether to force re-download of model files even if they exist in the cache.
- local_files_only (bool, default: False) – Whether to avoid downloading files and only use local cached files. Raises an error if files are not found locally.
- token (str | bool | None, default: None) – Optional authentication token for accessing private models on Hugging Face Hub. Can be a string token or True to use the saved token.
- revision (str, default: 'main') – The specific model revision to load (branch name, tag, or commit hash). Defaults to "main".
- use_safetensors (bool | None, default: None) – Whether to use the safetensors format for loading weights. If None, automatically detects the available format.
- weights_only (bool, default: True) – Whether to only load model weights without optimizer states or other training artifacts.
- kwargs – Additional keyword arguments passed to the model constructor and loading process.
Returns:

- SpeculatorModel – A SpeculatorModel instance of the appropriate subclass, loaded with the pretrained weights and configuration.
from_training_args abstractmethod classmethod
Create model instance from training arguments.
This factory method is used by the training script to instantiate models from command-line arguments. Each algorithm must implement this to support the training infrastructure.
Args:
- verifier_config: Configuration from the verifier/base model.
- **kwargs: Training arguments as keyword arguments. Each algorithm extracts the parameters it needs.

Returns: Initialized model instance ready for training.
Example:
@classmethod
def from_training_args(cls, verifier_config, **kwargs):
config = MySpeculatorConfig(
transformer_layer_config=verifier_config,
num_layers=kwargs['num_layers'],
...
)
return cls(config=config, t2d=kwargs.get('t2d'), d2t=kwargs.get('d2t'))
generate
generate(
inputs: Tensor | None = None,
generation_config: GenerationConfig | None = None,
logits_processor: LogitsProcessorList | None = None,
stopping_criteria: StoppingCriteriaList | None = None,
prefix_allowed_tokens_fn: Callable[[int, Tensor], list[int]] | None = None,
synced_gpus: bool | None = None,
assistant_model: Optional[PreTrainedModel] = None,
streamer: Optional[BaseStreamer] = None,
negative_prompt_ids: Tensor | None = None,
negative_prompt_attention_mask: Tensor | None = None,
use_model_defaults: bool | None = None,
custom_generate: str | Callable[..., Any] | None = None,
**kwargs,
) -> GenerateOutput | torch.LongTensor
Generate text using speculative decoding with the attached verifier model. The method follows the standard transformers generation interface, making it compatible with existing generation workflows while adding speculative decoding capabilities that allow for faster generation.
Parameters:

- inputs (Tensor | None, default: None) – The input token IDs to generate from. Can be None if input_ids are provided in kwargs.
- generation_config (GenerationConfig | None, default: None) – Configuration for generation parameters like max_length, temperature, top_p, etc. If None, uses model defaults.
- logits_processor (LogitsProcessorList | None, default: None) – List of logits processors to apply during generation for tasks like repetition penalty, top-k filtering, etc.
- stopping_criteria (StoppingCriteriaList | None, default: None) – List of stopping criteria to determine when to stop generation (e.g., max length, end-of-sequence tokens).
- prefix_allowed_tokens_fn (Callable[[int, Tensor], list[int]] | None, default: None) – Function to constrain generation to allowed tokens based on the current prefix. Useful for structured generation.
- synced_gpus (bool | None, default: None) – Whether to synchronize GPUs during distributed generation. Relevant for multi-GPU setups.
- assistant_model (Optional[PreTrainedModel], default: None) – An assistant model to use for generation. This parameter maintains compatibility with transformers but may not be used in speculative decoding.
- streamer (Optional[BaseStreamer], default: None) – A streamer to output tokens as they are generated, enabling real-time streaming of the generation process.
- negative_prompt_ids (Tensor | None, default: None) – Token IDs for negative prompting to steer generation away from certain content.
- negative_prompt_attention_mask (Tensor | None, default: None) – Attention mask for negative prompt tokens to properly handle padding.
- use_model_defaults (bool | None, default: None) – Whether to use model-specific default generation parameters instead of transformers defaults.
- kwargs – Additional keyword arguments for generation, including input_ids, attention_mask, max_length, etc.

Returns:

- GenerateOutput | LongTensor – Generated token sequences as either a GenerateOutput object (containing additional metadata) or a LongTensor of token IDs.
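Example (a minimal sketch; assumes a verifier is already attached and input_ids is a prepared tensor):
from transformers import GenerationConfig

# Standard transformers generation parameters apply unchanged
gen_config = GenerationConfig(max_new_tokens=128, temperature=0.7, do_sample=True)
outputs = model.generate(input_ids, generation_config=gen_config)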
get_trainer_kwargs abstractmethod staticmethod
Get algorithm-specific kwargs for training and validation.
This method extracts algorithm-specific parameters from the training arguments and returns separate kwargs dictionaries for training and validation forward passes.
Args:
- **kwargs: Training arguments containing algorithm-specific parameters.

Returns: Tuple of (train_kwargs, val_kwargs) where:
- train_kwargs: Dict passed to model.forward() during training
- val_kwargs: Dict passed to model.forward() during validation
Example:
@staticmethod
def get_trainer_kwargs(**kwargs):
train_kwargs = {
"num_steps": kwargs["num_steps"],
"use_special_mode": True,
}
val_kwargs = {
"num_steps": kwargs["num_steps"],
"use_special_mode": False,
}
return train_kwargs, val_kwargs
registered_model_class_from_config classmethod
Looks up the appropriate speculator model class from the registry based on the configuration type. It matches the config class to the corresponding model class that was registered during auto-discovery or manual registration.
Parameters:

- config (SpeculatorModelConfig) – The configuration for which to find the registered model class. Must be an instance of a SpeculatorModelConfig subclass.

Returns:

- type[SpeculatorModel] – The registered model class that matches the configuration type.
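Example (a minimal sketch; the config path is hypothetical):
# Resolve the registered subclass from a saved config, then load it
config = SpeculatorModelConfig.from_pretrained("path/to/speculator")
model_cls = SpeculatorModel.registered_model_class_from_config(config)
model = model_cls.from_pretrained("path/to/speculator", config=config)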
resolve_verifier
Resolves the verifier model from a given path or identifier.
This method loads the verifier model from a specified path or identifier, ensuring it is compatible with the speculator's configuration. If the verifier is already attached, it returns the existing verifier instance.
Parameters:

- verifier (str | PathLike | PreTrainedModel) – The verifier model to resolve. Can be a path to a local model directory, a Hugging Face model identifier, or an instance of PreTrainedModel.

Returns:

- PreTrainedModel – The resolved PreTrainedModel instance for the verifier.
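Example (a minimal sketch; the verifier identifier is hypothetical):
# Resolve once, then reuse the loaded instance
verifier = speculator.resolve_verifier("path/to/verifier")
speculator.attach_verifier(verifier, mode="full")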
state_dict
Overrides the state_dict method from PyTorch to ensure that save pathways within Transformers PreTrainedModel do not include the verifier model's parameters. This is important to ensure that the speculator model can be saved and loaded without including the verifier's state, which is expected to be managed separately.
Parameters:

- destination (dict[str, Any], default: None) – Optional dictionary to store the state.
- prefix (str, default: '') – Optional prefix for parameter names.
- keep_vars (bool, default: False) – Whether to keep Variables in the state_dict.

Returns:

- A dictionary containing the state of the speculator model, excluding the verifier model's parameters. This dictionary can be used to save the model's state to disk or for further processing.
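Example (a minimal sketch of a manual save path; the output filename is hypothetical):
import torch

# The returned state excludes the attached verifier's parameters
state = speculator.state_dict()
torch.save(state, "speculator_weights.pt")
# The verifier's weights are managed and saved separately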
verify_training_compatible classmethod
Verify that a model instance is compatible with training infrastructure.
This method validates that the given model:
1. Is an instance of SpeculatorModel
2. Is registered in the SpeculatorModel registry
3. Has a layers attribute (required for FSDP wrapping)

Args:
- model: The model instance to verify

Raises:
- TypeError: If model is not a SpeculatorModel instance
- ValueError: If model's class is not in the registry
- AttributeError: If model doesn't have a layers attribute
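Example (a minimal sketch of a pre-training guard):
try:
    SpeculatorModel.verify_training_compatible(model)
except (TypeError, ValueError, AttributeError) as err:
    raise RuntimeError(f"Model is not training-compatible: {err}") from err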
reload_and_populate_models
Triggers the automatic discovery and registration of all SpeculatorModel subclasses found in the speculators.models package that have been registered with SpeculatorModel.register(NAME). This enables dynamic model loading and instantiation based on configuration types without requiring explicit imports.
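Example (a minimal sketch; the speculator path is hypothetical):
from speculators.model import SpeculatorModel, reload_and_populate_models

# Ensure all registered speculator subclasses are discoverable
reload_and_populate_models()
# from_pretrained can now resolve the correct subclass from the saved config
model = SpeculatorModel.from_pretrained("path/to/speculator")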