SambaNova Text Generation

Overview:

SambaNova Cloud provides text generation through an OpenAI-compatible API. Supported modes include:

  • :white_check_mark: Non-Streaming (standard)
  • :counterclockwise_arrows_button: Streaming (token-by-token)
  • :thread: Async (non-blocking)

This guide explains how to use these capabilities internally.

:brick: Supported Models & API Info:

Models available:

  • Meta-Llama-3.1-8B-Instruct
  • Meta-Llama-3.1-70B-Instruct

API Details:

  • Base URL: https://api.sambanova.ai/v1
  • Authentication: API Key (ask DevOps or Admin)
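
A minimal sketch for keeping the key out of source code by reading it from an environment variable; the variable name SAMBANOVA_API_KEY below is our convention for this guide, not something mandated by the API:

import os
from openai import OpenAI

# SAMBANOVA_API_KEY is an assumed variable name; export it in your shell first.
client = OpenAI(
    base_url="https://api.sambanova.ai/v1",
    api_key=os.environ["SAMBANOVA_API_KEY"],
)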

:laptop: Sample Code – Basic Chat Completion:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.sambanova.ai/v1",  # SambaNova's OpenAI-compatible endpoint
    api_key="YOUR_API_KEY"  # replace with your key (ask DevOps or Admin)
)

response = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the benefits of prompt tuning."}
    ]
)

print(response.choices[0].message.content)
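
Because the endpoint is OpenAI-compatible, the usual sampling parameters apply. A sketch of the same call with temperature, top_p, and max_tokens set explicitly (the values here are illustrative, not recommendations):

response = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Explain the benefits of prompt tuning."}
    ],
    temperature=0.7,  # illustrative: higher values give more varied output
    top_p=0.9,        # illustrative: nucleus-sampling cutoff
    max_tokens=512,   # illustrative cap on generated tokens
)

print(response.choices[0].message.content)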

:globe_with_meridians: Streaming Generation Example:

stream = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You're a real-time assistant."},
        {"role": "user", "content": "Summarize this document in real-time."}
    ],
    stream=True
)

for chunk in stream:
    # delta.content can be None on some chunks (e.g., the final one), hence the `or ""` guard
    print(chunk.choices[0].delta.content or "", end="")
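
If you also need the full reply once streaming finishes, here is a variant of the loop above that accumulates the deltas while printing them. Use it instead of, not after, that loop, since a stream can only be consumed once:

full_text = []
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)  # show tokens as they arrive
    full_text.append(delta)
print()
reply = "".join(full_text)  # complete response text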

:thread: Async Usage:

from openai import AsyncOpenAI
import asyncio

client = AsyncOpenAI(
    base_url="https://api.sambanova.ai/v1",
    api_key="YOUR_API_KEY"
)

async def main():
    response = await client.chat.completions.create(
        model="Meta-Llama-3.1-8B-Instruct",
        messages=[
            {"role": "user", "content": "What's few-shot learning?"}
        ]
    )
    print(response.choices[0].message.content)

asyncio.run(main())
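
The two modes above compose: passing stream=True to the async client returns an async iterator. A minimal sketch combining async and streaming:

from openai import AsyncOpenAI
import asyncio

client = AsyncOpenAI(
    base_url="https://api.sambanova.ai/v1",
    api_key="YOUR_API_KEY"
)

async def main():
    stream = await client.chat.completions.create(
        model="Meta-Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "What's few-shot learning?"}],
        stream=True,
    )
    async for chunk in stream:  # chunks arrive as they are generated
        print(chunk.choices[0].delta.content or "", end="")
    print()

asyncio.run(main())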

:writing_hand: Prompt Engineering Tips:

  • Define the assistant’s role clearly.
  • Specify the output format (JSON, list, etc.); a sketch applying these tips follows the example prompt below.
  • Include examples or context if needed.
  • Use chain-of-thought prompts for reasoning.
  • Ensure prompt + history stays within the model’s token limit.

Example Prompt:

System: You are an expert in AI tuning.
User: Explain “prompt tuning” with a practical use-case in bullet points.
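
Putting the tips together, a sketch of a request that fixes the assistant's role, demands a specific output format, and includes a one-shot example. It reuses the synchronous client from the basic example; the example content is ours and purely illustrative:

response = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    messages=[
        # Define the assistant's role and output format up front.
        {"role": "system", "content": "You are an expert in AI tuning. Answer in bullet points only."},
        # One-shot example to anchor the expected format.
        {"role": "user", "content": "Explain 'fine-tuning' with a practical use-case in bullet points."},
        {"role": "assistant", "content": "- Fine-tuning adapts a pretrained model to a task.\n- Use-case: adapting a base LLM to a support-ticket triage dataset."},
        # The actual question, phrased like the example prompt above.
        {"role": "user", "content": "Explain 'prompt tuning' with a practical use-case in bullet points."},
    ]
)
print(response.choices[0].message.content)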

:bar_chart: Model Selection Guidelines:

Model | Use Case | Speed | Cost
Meta-Llama-3.1-8B-Instruct | General tasks; fast, low cost | :rocket: Fast | :heavy_dollar_sign: Low
Meta-Llama-3.1-70B-Instruct | High-depth, complex reasoning | :turtle: Slow | :money_with_wings: High