Speculative Decoding (SD) is a technique for accelerating token generation in which a faster, smaller draft model proposes future tokens that are then verified by a larger, more accurate target model. This approach significantly boosts generation speed while preserving the target model's output quality.
End-to-End Workflow of SD
Here’s the step-by-step operation of speculative decoding:
- User input is received by the model server.
- The target model generates the first output token using normal inference.
- The target model sends the input + first output token to the draft model.
- The draft model proposes N draft tokens (e.g., 4 tokens).
- The target model validates the draft tokens, checking each position in order (this verification can happen in a single target forward pass):
  - If a draft token matches what the target would have generated, it’s accepted.
  - If a draft token doesn’t match, it’s rejected, along with every draft token after it.
- Regardless of the validation outcome, the target model contributes one new token of its own: a correction at the first rejected position, or a bonus token after the last accepted one.
- The cycle continues: the accepted tokens plus the new token are fed again to the draft model.
- This loop ends when the full output sequence is complete, i.e., when an end-of-sequence token is produced or the length limit is reached. A minimal code sketch of the loop follows.
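To make the loop concrete, here is a minimal, illustrative sketch of greedy speculative decoding in PyTorch with Hugging Face-style causal LMs. The function name `speculative_generate`, the constant `N_DRAFT`, and the arguments are placeholders, not SambaStudio APIs; a production implementation would also reuse KV caches and, when sampling, use probabilistic acceptance instead of exact greedy matching.

```python
import torch

N_DRAFT = 4  # number of tokens the draft model proposes per cycle (illustrative)

@torch.no_grad()
def speculative_generate(target_model, draft_model, input_ids, max_new_tokens, eos_id):
    # Step 1: the target model generates the first output token with normal inference.
    seq = input_ids  # shape (1, prompt_len)
    first = target_model(seq).logits[:, -1, :].argmax(dim=-1, keepdim=True)
    seq = torch.cat([seq, first], dim=-1)
    produced = 1

    while produced < max_new_tokens and seq[0, -1].item() != eos_id:
        # Step 2: the draft model proposes N_DRAFT tokens autoregressively.
        draft_seq = seq
        for _ in range(N_DRAFT):
            tok = draft_model(draft_seq).logits[:, -1, :].argmax(dim=-1, keepdim=True)
            draft_seq = torch.cat([draft_seq, tok], dim=-1)
        draft_tokens = draft_seq[:, seq.shape[1]:]

        # Step 3: the target scores every draft position in ONE forward pass.
        logits = target_model(draft_seq).logits
        # The target's own prediction for each draft position (shifted by one).
        target_preds = logits[:, seq.shape[1] - 1 : -1, :].argmax(dim=-1)

        # Step 4: accept draft tokens up to the first mismatch; everything
        # after the first rejected token is discarded.
        n_accept = 0
        for ok in (draft_tokens[0] == target_preds[0]).tolist():
            if not ok:
                break
            n_accept += 1

        # Step 5: the target always contributes one token of its own -- a
        # correction at the first mismatch, or a bonus token if all matched.
        bonus = logits[:, seq.shape[1] - 1 + n_accept, :].argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, draft_tokens[:, :n_accept], bonus], dim=-1)
        produced += n_accept + 1

    return seq
```

Because the target scores all draft positions at once, each cycle costs roughly one target-model forward pass yet can emit up to N_DRAFT + 1 tokens, which is where the speedup comes from.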
Why Use Speculative Decoding?
- Faster inference without retraining the target model
- Efficient use of compute resources
- Works end-to-end with existing model architectures, with no changes to the target model
- Helps scale large model deployments
Compatibility Criteria (Required)
While it is technically possible to use a draft model whose tokenizer differs from the target model’s, doing so is not best practice and may lead to decoding inconsistencies. When configuring speculative decoding in SambaStudio, ensure the following:
- Maximum Sequence Length Match
  - The draft model must support the same maximum sequence length as the target model to prevent truncation or overflow issues.
- Smaller Model Size
  - Choose a draft model with a lower parameter count than the target model for faster inference and overall performance gains.
- Text Output Capability
  - The draft model must be able to generate coherent text, since its tokens feed into the target model’s validation step.
- Token ID Compatibility
  - Token IDs in the draft model’s tokenizer should match those of the target model, excluding special tokens.
- Tokenizer Class Match
  - Ensure that the `tokenizer_class` in the draft model’s `tokenizer_config.json` matches the one used by the target model. A quick programmatic check of these criteria is sketched after this list.
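One way to screen a candidate draft model against these criteria is to compare the two tokenizers directly. The sketch below uses the Hugging Face `transformers` API; the model paths are placeholders, and `model_max_length` is only a proxy for the sequence-length limit (the authoritative value lives in each model’s config, e.g. `max_position_embeddings`).

```python
from transformers import AutoTokenizer

# Placeholder paths: substitute your actual target and draft checkpoints.
target_tok = AutoTokenizer.from_pretrained("path/to/target-model")
draft_tok = AutoTokenizer.from_pretrained("path/to/draft-model")

# Tokenizer Class Match (mirrors tokenizer_class in tokenizer_config.json).
assert type(target_tok).__name__ == type(draft_tok).__name__, "tokenizer classes differ"

# Maximum Sequence Length Match (model_max_length is a heuristic proxy).
assert target_tok.model_max_length == draft_tok.model_max_length, "max lengths differ"

# Token ID Compatibility: non-special tokens should map to the same IDs.
special = set(target_tok.all_special_tokens) | set(draft_tok.all_special_tokens)
draft_vocab = draft_tok.get_vocab()
mismatched = [
    tok for tok, tid in target_tok.get_vocab().items()
    if tok not in special and draft_vocab.get(tok) != tid
]
print(f"{len(mismatched)} non-special tokens with mismatched IDs")
```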
Best Practices (Recommended)
While not mandatory, the following practices improve compatibility and performance:
- Use the Same Model Family
  - Prefer draft models from the same model series as the target model.
  - Example: If your target model is `Llama-3.1-405B-Instruct`, a compatible draft could be `Llama-3.1-8B-Instruct`.
- Match Fine-Tuning Dataset
  - If the target is fine-tuned, either:
    - Fine-tune a smaller model on the same data to use as the draft, or
    - Use distillation to train a smaller version that mimics the behavior of the fine-tuned target model (a minimal distillation sketch follows below).
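For the distillation route, a training step might look like the following minimal PyTorch sketch, assuming the draft and target share a tokenizer (per the criteria above) so their vocabulary dimensions line up. The function name `distillation_step`, the temperature value, and the surrounding training loop are illustrative, not a prescribed recipe.

```python
import torch
import torch.nn.functional as F

def distillation_step(target_model, draft_model, input_ids, optimizer, temperature=2.0):
    # The fine-tuned target acts as a frozen teacher.
    with torch.no_grad():
        teacher_logits = target_model(input_ids).logits

    student_logits = draft_model(input_ids).logits  # draft model being trained

    # Soften both next-token distributions and minimize KL(teacher || student).
    # Scaling by temperature**2 keeps gradient magnitudes comparable across temperatures.
    t = temperature
    loss = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The closer the draft’s next-token distribution tracks the fine-tuned target’s, the higher the acceptance rate during validation, and the greater the end-to-end speedup.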