All about Speculative Decoding

Speculative Decoding (SD) is a token generation acceleration technique in which a smaller, faster draft model proposes future tokens that are then validated by a larger, more accurate target model. Because the target model only accepts tokens it would have generated itself, this approach significantly boosts generation speed while maintaining output quality.

End-to-End Workflow of SD

Here’s the step-by-step operation of speculative decoding (a minimal code sketch follows the list):

  • User input is received by the model server.
  • The target model generates the first output token using normal inference.
  • The input plus the first output token are passed to the draft model.
  • The draft model proposes N draft tokens (e.g., 4 tokens).
  • The target model verifies the draft tokens in a single forward pass, checking each position in order:
    • A draft token is accepted if it matches what the target model would have generated at that position.
    • At the first mismatch, that token and all draft tokens after it are rejected.
  • Regardless of how many drafts are accepted, the target model contributes one new token: its correction at the first mismatch, or a bonus token if every draft was accepted.
  • The cycle continues: the accepted tokens plus the new token are appended to the context and fed back to the draft model.
  • The loop ends when the output sequence is complete (e.g., an end-of-sequence token is generated or the length limit is reached).
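
To make the loop concrete, here is a minimal Python sketch of the greedy variant of the workflow above. It is illustrative only: the `next_token(context)` interface on both models is a hypothetical helper, not a real SambaStudio or library API, and verification is written sequentially for readability.

```python
# Illustrative sketch only; `next_token` is a hypothetical greedy-decoding
# helper that returns the next token ID for a given context.
def speculative_decode(target_model, draft_model, prompt_ids,
                       num_draft=4, eos_id=2, max_len=512):
    context = list(prompt_ids)
    # The target model generates the first output token normally.
    context.append(target_model.next_token(context))

    while context[-1] != eos_id and len(context) < max_len:
        # The draft model proposes `num_draft` candidate tokens.
        draft = []
        for _ in range(num_draft):
            draft.append(draft_model.next_token(context + draft))

        # Verify: accept draft tokens up to the first mismatch with what
        # the target model would have generated at each position.
        accepted, new_token = [], None
        for tok in draft:
            expected = target_model.next_token(context + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                new_token = expected  # target's correction replaces the reject
                break
        if new_token is None:
            # Every draft was accepted: the target adds one bonus token.
            new_token = target_model.next_token(context + accepted)

        # Accepted tokens plus one new target token extend the context.
        context.extend(accepted)
        context.append(new_token)

    return context
```

In a real deployment, the per-position target calls above collapse into one batched forward pass, so each iteration costs roughly one target decoding step yet can emit up to N + 1 tokens. That is where the speedup comes from.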

Why Use Speculative Decoding?

  • Faster inference without retraining the target model
  • Efficient use of compute resources
  • Works with existing model architectures; no changes to the target model are needed
  • Helps scale large model deployments

Compatibility Criteria (Required)

While it is technically possible to use a draft model whose tokenizer differs from the target model’s, doing so is not best practice and may lead to decoding inconsistencies. When configuring speculative decoding in SambaStudio, ensure the following (a verification sketch follows the list):

  • Maximum Sequence Length Match
    • The draft model must support the same maximum sequence length as the target model to prevent truncation or overflow issues.
  • Smaller Model Size
    • Choose a draft model with a significantly lower parameter count than the target model; drafting must be cheap relative to target inference for speculation to yield a net speedup.
  • Text Output Capability
    • The draft model must be able to generate coherent text, since its tokens feed into the target model’s validation step.
  • Token ID Compatibility
    • Token IDs in the draft model’s tokenizer should match those of the target model, excluding special tokens.
  • Tokenizer Class Match
    • Ensure that the tokenizer_class in the draft model’s tokenizer_config.json matches the one used by the target model.
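
The sketch below shows one way to sanity-check several of these criteria with the Hugging Face transformers library before deploying a pair. The model IDs are placeholders, and treating a draft context window at least as large as the target’s as sufficient is an assumption; adapt the checks to your own models.

```python
# Sketch of pre-deployment compatibility checks using Hugging Face
# `transformers`. Model IDs are placeholders; substitute your own pair.
from transformers import AutoConfig, AutoTokenizer

TARGET = "meta-llama/Llama-3.1-405B-Instruct"
DRAFT = "meta-llama/Llama-3.1-8B-Instruct"

target_tok = AutoTokenizer.from_pretrained(TARGET)
draft_tok = AutoTokenizer.from_pretrained(DRAFT)

# Tokenizer class match (mirrors tokenizer_class in tokenizer_config.json).
assert type(target_tok).__name__ == type(draft_tok).__name__, "tokenizer class mismatch"

# Token ID compatibility, excluding special tokens.
specials = set(target_tok.all_special_tokens) | set(draft_tok.all_special_tokens)
target_vocab = {t: i for t, i in target_tok.get_vocab().items() if t not in specials}
draft_vocab = {t: i for t, i in draft_tok.get_vocab().items() if t not in specials}
assert target_vocab == draft_vocab, "token IDs diverge between tokenizers"

# Maximum sequence length: the draft must cover the target's context window.
target_cfg = AutoConfig.from_pretrained(TARGET)
draft_cfg = AutoConfig.from_pretrained(DRAFT)
assert draft_cfg.max_position_embeddings >= target_cfg.max_position_embeddings
```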

Best Practices (Recommended)

While not mandatory, the following practices improve compatibility and performance:

  • Use the Same Model Family
    • Prefer draft models from the same model series as the target model.
      • Example: If your target model is Llama-3.1-405B-Instruct, a compatible draft could be Llama-3.1-8B-Instruct.
  • Match Fine-Tuning Dataset
    • If the target is fine-tuned, either:
      • Fine-tune a smaller model on the same data to use as the draft, or
      • Use distillation to train a smaller version that mimics the behavior of the fine-tuned target model.
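
For the distillation route, a common starting point (assumed here, not SambaStudio-specific) is to train the draft “student” to match the fine-tuned target “teacher” on next-token distributions. A minimal PyTorch sketch of that objective:

```python
# Hedged sketch of a standard distillation loss: the draft (student) is
# trained to match the fine-tuned target's (teacher) softened next-token
# distribution. Model, data, and optimizer setup are assumed.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened next-token distributions."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # batchmean KL, scaled by t^2 to keep gradient magnitudes comparable
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t
```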