Speculative Decoding (SD) is a technique for accelerating token generation in which a faster, smaller draft model proposes future tokens that are then verified by a larger, more accurate target model. This approach significantly boosts generation speed while preserving the target model's output quality.
End-to-End Workflow of SD
Here’s the step-by-step operation of speculative decoding:
- User input is received by the model server.
- The target model generates the first output token using normal inference.
- The target model sends the input + first output token to the draft model.
- The draft model proposes N draft tokens (e.g., 4 tokens).
- The target model validates the draft tokens, checking each position in order (this verification can happen in a single target forward pass):
  - If a draft token matches what the target would have generated, it’s accepted.
  - If a draft token doesn’t match, it’s rejected, along with every draft token after it.
- Regardless of the validation outcome, the target model contributes one new token of its own: a correction at the first rejected position, or a bonus token after the last accepted one.
- The cycle continues: the accepted tokens plus the new token are fed again to the draft model.
- This loop ends when the full output sequence is complete, i.e., when an end-of-sequence token is produced or the length limit is reached. A minimal code sketch of the loop follows.
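To make the loop concrete, here is a minimal, illustrative sketch of greedy speculative decoding in PyTorch with Hugging Face-style causal LMs. The function name `speculative_generate`, the constant `N_DRAFT`, and the arguments are placeholders, not SambaStudio APIs; a production implementation would also reuse KV caches and, when sampling, use probabilistic acceptance instead of exact greedy matching.

```python
import torch

N_DRAFT = 4  # number of tokens the draft model proposes per cycle (illustrative)

@torch.no_grad()
def speculative_generate(target_model, draft_model, input_ids, max_new_tokens, eos_id):
    # Step 1: the target model generates the first output token with normal inference.
    seq = input_ids  # shape (1, prompt_len)
    first = target_model(seq).logits[:, -1, :].argmax(dim=-1, keepdim=True)
    seq = torch.cat([seq, first], dim=-1)
    produced = 1

    while produced < max_new_tokens and seq[0, -1].item() != eos_id:
        # Step 2: the draft model proposes N_DRAFT tokens autoregressively.
        draft_seq = seq
        for _ in range(N_DRAFT):
            tok = draft_model(draft_seq).logits[:, -1, :].argmax(dim=-1, keepdim=True)
            draft_seq = torch.cat([draft_seq, tok], dim=-1)
        draft_tokens = draft_seq[:, seq.shape[1]:]

        # Step 3: the target scores every draft position in ONE forward pass.
        logits = target_model(draft_seq).logits
        # The target's own prediction for each draft position (shifted by one).
        target_preds = logits[:, seq.shape[1] - 1 : -1, :].argmax(dim=-1)

        # Step 4: accept draft tokens up to the first mismatch; everything
        # after the first rejected token is discarded.
        n_accept = 0
        for ok in (draft_tokens[0] == target_preds[0]).tolist():
            if not ok:
                break
            n_accept += 1

        # Step 5: the target always contributes one token of its own -- a
        # correction at the first mismatch, or a bonus token if all matched.
        bonus = logits[:, seq.shape[1] - 1 + n_accept, :].argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, draft_tokens[:, :n_accept], bonus], dim=-1)
        produced += n_accept + 1

    return seq
```

Because the target scores all draft positions at once, each cycle costs roughly one target-model forward pass yet can emit up to N_DRAFT + 1 tokens, which is where the speedup comes from.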
Why Use Speculative Decoding?
- Faster inference without retraining the target model
- Efficient use of compute resources
- Works end-to-end with existing model architectures, with no changes to the target model
- Helps scale large model deployments
Compatibility Criteria (Required)
While it is technically possible to use a draft model whose tokenizer differs from the target model’s, doing so is not best practice and may lead to decoding inconsistencies. When configuring speculative decoding in SambaStudio, ensure the following:
- Maximum Sequence Length Match
  - The draft model must support the same maximum sequence length as the target model to prevent truncation or overflow issues.
- Smaller Model Size
  - Choose a draft model with a lower parameter count than the target model for faster inference and overall performance gains.
- Text Output Capability
  - The draft model must be able to generate coherent text, since its tokens feed into the target model’s validation step.
- Token ID Compatibility
  - Token IDs in the draft model’s tokenizer should match those of the target model, excluding special tokens.
- Tokenizer Class Match
  - Ensure that the `tokenizer_class` in the draft model’s `tokenizer_config.json` matches the one used by the target model. A quick programmatic check of these criteria is sketched after this list.
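One way to screen a candidate draft model against these criteria is to compare the two tokenizers directly. The sketch below uses the Hugging Face `transformers` API; the model paths are placeholders, and `model_max_length` is only a proxy for the sequence-length limit (the authoritative value lives in each model’s config, e.g. `max_position_embeddings`).

```python
from transformers import AutoTokenizer

# Placeholder paths: substitute your actual target and draft checkpoints.
target_tok = AutoTokenizer.from_pretrained("path/to/target-model")
draft_tok = AutoTokenizer.from_pretrained("path/to/draft-model")

# Tokenizer Class Match (mirrors tokenizer_class in tokenizer_config.json).
assert type(target_tok).__name__ == type(draft_tok).__name__, "tokenizer classes differ"

# Maximum Sequence Length Match (model_max_length is a heuristic proxy).
assert target_tok.model_max_length == draft_tok.model_max_length, "max lengths differ"

# Token ID Compatibility: non-special tokens should map to the same IDs.
special = set(target_tok.all_special_tokens) | set(draft_tok.all_special_tokens)
draft_vocab = draft_tok.get_vocab()
mismatched = [
    tok for tok, tid in target_tok.get_vocab().items()
    if tok not in special and draft_vocab.get(tok) != tid
]
print(f"{len(mismatched)} non-special tokens with mismatched IDs")
```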
Best Practices (Recommended)
While not mandatory, the following practices improve compatibility and performance:
- Use the Same Model Family
  - Prefer draft models from the same model series as the target model.
  - Example: If your target model is `Llama-3.1-405B-Instruct`, a compatible draft could be `Llama-3.1-8B-Instruct`.
- Match Fine-Tuning Dataset
  - If the target is fine-tuned, either:
    - Fine-tune a smaller model on the same data to use as the draft, or
    - Use distillation to train a smaller version that mimics the behavior of the fine-tuned target model (a minimal distillation sketch follows below).
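For the distillation route, a training step might look like the following minimal PyTorch sketch, assuming the draft and target share a tokenizer (per the criteria above) so their vocabulary dimensions line up. The function name `distillation_step`, the temperature value, and the surrounding training loop are illustrative, not a prescribed recipe.

```python
import torch
import torch.nn.functional as F

def distillation_step(target_model, draft_model, input_ids, optimizer, temperature=2.0):
    # The fine-tuned target acts as a frozen teacher.
    with torch.no_grad():
        teacher_logits = target_model(input_ids).logits

    student_logits = draft_model(input_ids).logits  # draft model being trained

    # Soften both next-token distributions and minimize KL(teacher || student).
    # Scaling by temperature**2 keeps gradient magnitudes comparable across temperatures.
    t = temperature
    loss = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The closer the draft’s next-token distribution tracks the fine-tuned target’s, the higher the acceptance rate during validation, and the greater the end-to-end speedup.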