SambaNova Cloud Audio API Documentation
Introduction
We are launching our first speech reasoning model on SambaNova Cloud, extending our multi-modal AI capabilities beyond vision to advanced audio processing and understanding. The model is served through OpenAI-compatible endpoints that enable real-time reasoning, transcription, and translation.
Please note that this model is currently in Beta. We are actively adding model serving features to maximize its speech task capabilities, with a production release planned for January 2025. Your feedback during this Beta phase is important to us; please share your comments on our Community page or reach out to carol.zhu@sambanovasystems.com.
Underlying Models
Model ID | Model | Supported Language(s) | Description |
---|---|---|---|
qwen2-audio-7b-instruct | Qwen2-Audio Instruct | Multilingual | Instruction-tuned large audio language model. Built on Qwen-7B with a Whisper-large-v3 audio encoder (8.2B parameters). |
Key Features and Use Cases
Core Capabilities
- Transform Audio into Intelligence: Build GPT-4-like Voice Applications in Minutes
- Direct question-answering for any audio input
- Comprehensive audio processing including real-time conversation, transcription, translation, and analysis through a single unified model
Customization & Control
- System-Level Prompting: use the "Assistant Prompt" to customize model behavior for specific requirements (see the messages parameter in the Request Parameters section; a sketch follows this list), for example:
- Brand-specific formatting (e.g., “BrandName” vs “brandname”)
- Domain-specific terminology
- Response style and tone control
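For instance, an assistant prompt can enforce brand spelling and a concise tone using the same messages structure as the Reasoning API below. The prompt wording is illustrative only, and <base64_audio> is a placeholder for your encoded audio:

# Illustrative assistant prompt; <base64_audio> is a placeholder for your encoded audio.
messages = [
    {"role": "assistant",
     "content": "Always write the brand name as 'BrandName' and answer in one concise sentence."},
    {"role": "user", "content": [
        {"type": "audio_content",
         "audio_content": {"content": "data:audio/mp3;base64,<base64_audio>"}}
    ]},
    {"role": "user", "content": "Summarize this customer call."}
]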
Audio Processing
- Silence Detection: Intelligent identification of meaningful pauses and gaps in speech
- Noise Cancellation: Advanced noise filtering and clean audio processing
- Multilingual Processing: Support for multiple languages with automatic language detection
Analysis Capabilities
- Sentiment Analysis: Detect and analyze emotional content in speech
- Multi-Speaker Handling: Process conversations with multiple participants
- Mixed Audio Understanding: Comprehend speech, music, and environmental sounds
Performance Metrics
Speech Recognition Performance (WER%)
Lower is better
Language | Dataset | Qwen2-Audio | Whisper-large-v3 | Improvement |
---|---|---|---|---|
English | Common Voice 15 | 8.6% | 9.3% | +7.5% |
Chinese | Common Voice 15 | 6.9% | 12.8% | +46.1% |
*Metrics from published Qwen2-Audio paper benchmarks
API Reference
1. Audio Reasoning API
Enables advanced audio analysis with optional text instructions.
Endpoint
POST https://api.sambanova.ai/v1/audio/reasoning
Request Format
cURL
curl --location 'https://api.sambanova.ai/v1/audio/reasoning' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--data '{
"messages": [
{"role": "assistant", "content": "you are a helpful assistant"},
{"role": "user", "content":[
{
"type": "audio_content",
"audio_content": {
"content": "data:audio/mp3;base64,<base64_audio>"
}
}
]
},
{"role": "user", "content": "what is the audio about"}
],
"max_tokens": 1024,
"model": "Qwen2-Audio-7B-Instruct",
"temperature": 0.01,
"stream": true // Optional
}'
Python
import requests
import base64

def analyze_audio(audio_file_path, api_key):
    # Read the local audio file and encode it as base64 for inline upload.
    with open(audio_file_path, "rb") as audio_file:
        base64_audio = base64.b64encode(audio_file.read()).decode("utf-8")

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }
    data = {
        "messages": [
            {"role": "assistant", "content": "you are a helpful assistant"},
            {"role": "user", "content": [
                {
                    "type": "audio_content",
                    "audio_content": {
                        "content": f"data:audio/mp3;base64,{base64_audio}"
                    }
                }
            ]},
            {"role": "user", "content": "what is the audio about"}
        ],
        "model": "Qwen2-Audio-7B-Instruct",
        "max_tokens": 1024,
        "temperature": 0.01,
        "stream": False  # Optional; set to True for streaming (see Streaming Responses)
    }
    response = requests.post(
        "https://api.sambanova.ai/v1/audio/reasoning",
        headers=headers,
        json=data
    )
    return response.json()
Response Format
{
"choices": [{
"delta": {
"content": "The sound is that of ",
"role": "assistant"
},
"finish_reason": null,
"index": 0,
"logprobs": null
}],
"created": 1732317298,
"id": "211b9a22-58cf-4b90-94e9-1fed8d0d9d0a",
"model": "Qwen2-Audio-7B-Instruct",
}
2. Transcription API
Converts audio to text in the specified language.
Endpoint
POST https://api.sambanova.ai/v1/audio/transcriptions
Request Format
cURL
curl --location 'https://api.sambanova.ai/v1/audio/transcriptions' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--form 'model="Qwen2-Audio-7B-Instruct"' \
--form 'language="spanish"' \
--form 'response_format="json"' \
--form 'temperature="0.01"' \
--form 'file=@"/path/to/audio/file.mp3"' \
--form 'stream="true"'
Python
import requests

def transcribe_audio(audio_file_path, api_key, language="english"):
    headers = {
        "Authorization": f"Bearer {api_key}"
    }
    data = {
        'model': 'Qwen2-Audio-7B-Instruct',
        'language': language,
        'response_format': 'json',
        'temperature': 0.01,
        'stream': False  # Optional; streaming is Reasoning-only during Beta
    }
    # Open the file in binary mode and close it automatically after the upload.
    with open(audio_file_path, 'rb') as audio_file:
        files = {'file': audio_file}
        response = requests.post(
            "https://api.sambanova.ai/v1/audio/transcriptions",
            headers=headers,
            files=files,
            data=data
        )
    return response.json()
Response Format
JSON
{
"text": "It's a sound effect of a bell chiming, specifically a church bell."
}
Text
It's a sound effect of a bell chiming, specifically a church bell.
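A brief usage sketch of the helper above (the file path and API key are placeholders); with response_format set to "json", the transcription is returned in the text field shown above:

result = transcribe_audio("/path/to/audio/file.mp3", "YOUR_API_KEY", language="english")
print(result["text"])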
3. Translation API
Translates audio content into the specified language.
Endpoint
POST https://api.sambanova.ai/v1/audio/translations
Request Format
cURL
curl --location 'https://api.sambanova.ai/v1/audio/translations' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--form 'model="Qwen2-Audio-7B-Instruct"' \
--form 'language="spanish"' \
--form 'response_format="json"' \
--form 'temperature="0.01"' \
--form 'file=@"/path/to/audio/file.mp3"' \
--form 'stream="true"'
Python
import requests

def translate_audio(audio_file_path, api_key, target_language="spanish"):
    headers = {
        "Authorization": f"Bearer {api_key}"
    }
    data = {
        'model': 'Qwen2-Audio-7B-Instruct',
        'language': target_language,
        'response_format': 'json',
        'temperature': 0.01,
        'stream': False  # Optional; streaming is Reasoning-only during Beta
    }
    # Open the file in binary mode and close it automatically after the upload.
    with open(audio_file_path, 'rb') as audio_file:
        files = {'file': audio_file}
        response = requests.post(
            "https://api.sambanova.ai/v1/audio/translations",
            headers=headers,
            files=files,
            data=data
        )
    return response.json()
Response Format
JSON
{
"text": "Es un efecto de sonido de una campana sonando, especĂficamente una campana de iglesia."
}
Text
Es un efecto de sonido de una campana sonando, específicamente una campana de iglesia.
Request Parameters
Parameter | Type | Default | Description | Endpoints |
---|---|---|---|---|
model | string | Required | The ID of the model to use. Only Qwen2-Audio-7B-Instruct is currently available. | All |
messages | Message | Required | A list of messages containing role (user/system/assistant), type (text/audio_content), and audio_content (base64 audio content). | All |
response_format | string | "json" | The output format, either "json" or "text". | All |
temperature | number | 0 | Sampling temperature between 0 and 1. Higher values (e.g., 0.8) increase randomness, while lower values (e.g., 0.2) make output more focused. | All |
max_tokens | number | 1000 | The maximum number of tokens to generate. | All |
file | file | Required | Audio file in flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm format. Each single file must not exceed 30 seconds in duration. | All |
language | string | Optional | The target language for transcription or translation. | Transcription, Translation |
stream | boolean | false | Enables streaming responses. | All |
stream_options | object | Optional | Additional streaming configuration (e.g., {"include_usage": true}). | All |
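For example, a streaming request with usage reporting could combine stream and stream_options as sketched below (the messages variable stands for the same message list used in the Reasoning example above):

data = {
    "model": "Qwen2-Audio-7B-Instruct",
    "messages": messages,  # same structure as in the Reasoning example
    "max_tokens": 1024,
    "temperature": 0.01,
    "stream": True,  # enable streaming responses
    "stream_options": {"include_usage": True}  # ask for usage information in the stream
}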
Streaming Responses*
When streaming is enabled, the API returns a series of data chunks in the following format:
data: {"choices":[{"delta":{"content":"","role":"assistant"},"finish_reason":null,"index":0,"logprobs":null}],"created":1732317298,"id":"211b9a22-58cf-4b90-94e9-1fed8d0d9d0a","model":"Qwen2-Audio-7B-Instruct","object":"chat.completion.chunk","system_fingerprint":"fastcoe"}
data: {"choices":[{"delta":{"content":"The sound is that of ","role":"assistant"},"finish_reason":null,"index":0,"logprobs":null}],"created":1732317298,"id":"211b9a22-58cf-4b90-94e9-1fed8d0d9d0a","model":"Qwen2-Audio-7B-Instruct","object":"chat.completion.chunk","system_fingerprint":"fastcoe"}
*Please note that Streaming Responses are only available for the Reasoning endpoint for Beta launch. We will enable it for all three APIs during our upcoming release.
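Below is a minimal consumption sketch, assuming the stream arrives as the data: chunk lines shown above. The helper reuses the headers and payload from the Reasoning example and is illustrative rather than an official client; the [DONE] terminator check is a common SSE convention and may not apply here:

import json
import requests

def stream_reasoning(data, headers):
    # Stream the Reasoning response and print content deltas as they arrive.
    with requests.post(
        "https://api.sambanova.ai/v1/audio/reasoning",
        headers=headers,
        json=data,
        stream=True,
    ) as response:
        for line in response.iter_lines(decode_unicode=True):
            # Keep only SSE-style "data: {...}" lines.
            if not line or not line.startswith("data: "):
                continue
            payload = line[len("data: "):]
            if payload.strip() == "[DONE]":  # common SSE terminator; may not be sent here
                break
            chunk = json.loads(payload)
            delta = chunk["choices"][0]["delta"].get("content", "")
            print(delta, end="", flush=True)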