Speech Reasoning Function Release

SambaNova Cloud Audio API Documentation

Introduction

We are launching our first speech reasoning model on SambaNova Cloud, extending our multi-modal AI capabilities beyond vision to advanced audio processing and understanding. The OpenAI-compatible endpoints enable real-time reasoning, transcription, and translation.

*Please note that this model is currently in Beta. We are actively adding model-serving features to maximize its speech-task capabilities, with a production release planned for January 2025. Your feedback during this Beta phase is important to us; please leave your comments on our Community page or reach out to carol.zhu@sambanovasystems.com.

Underlying Models

| Model ID | Model | Supported Language(s) | Description |
|---|---|---|---|
| qwen2-audio-7b-instruct | Qwen2-Audio Instruct | Multilingual | Instruction-tuned large audio language model. Built on Qwen-7B with the Whisper-large-v3 audio encoder (8.2B parameters). |

Key Features and Use Cases

Core Capabilities

  • Transform Audio into Intelligence: Build GPT-4-like Voice Applications in Minutes
  • Direct question-answering for any audio input
  • Comprehensive audio processing including real-time conversation, transcription, translation, and analysis through a single unified model

Customization & Control

  • System-Level Prompting: Use an “Assistant Prompt”* to customize model behavior for specific requirements, such as:
      • Brand-specific formatting (e.g., “BrandName” vs. “brandname”)
      • Domain-specific terminology
      • Response style and tone control

*See the “messages” parameter in the Request Parameters section for more details.
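As an illustrative sketch (the helper name and the prompt text are our own; only the message shape follows the Audio Reasoning API below), an assistant prompt that pins brand spelling and tone can be assembled like this:

```python
# Sketch: build a messages list whose leading "Assistant Prompt" customizes
# behavior (brand spelling, tone). The prompt wording is illustrative; the
# message shape follows the Audio Reasoning API request format.
def build_messages(base64_audio, question):
    assistant_prompt = (
        "You are a support assistant for BrandName. "
        "Always write the product name as 'BrandName' (never 'brandname'), "
        "and answer in a concise, formal tone."
    )
    return [
        {"role": "assistant", "content": assistant_prompt},
        {"role": "user", "content": [{
            "type": "audio_content",
            "audio_content": {"content": f"data:audio/mp3;base64,{base64_audio}"},
        }]},
        {"role": "user", "content": question},
    ]
```

The returned list drops straight into the `"messages"` field of a reasoning request.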

Audio Processing

  • Silence Detection: Intelligent identification of meaningful pauses and gaps in speech
  • Noise Cancellation: Advanced noise filtering and clean audio processing
  • Multilingual Processing: Support for multiple languages with automatic language detection

Analysis Capabilities

  • Sentiment Analysis: Detect and analyze emotional content in speech
  • Multi-Speaker Handling: Process conversations with multiple participants
  • Mixed Audio Understanding: Comprehend speech, music, and environmental sounds

Performance Metrics

Speech Recognition Performance (WER%)

Lower is better

| Language | Dataset | Qwen2-Audio | Whisper-large-v3 | Improvement |
|---|---|---|---|---|
| English | Common Voice 15 | 8.6% | 9.3% | +7.5% |
| Chinese | Common Voice 15 | 6.9% | 12.8% | +46.1% |

*Metrics from published Qwen2-Audio paper benchmarks
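WER is the word-level edit distance between the hypothesis and the reference transcript, divided by the number of reference words. A minimal implementation of the metric (assuming a non-empty reference):

```python
# Word error rate: word-level Levenshtein distance / reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```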

API Reference

1. Audio Reasoning API

Enables advanced audio analysis with optional text instructions.

Endpoint

POST https://api.sambanova.ai/v1/audio/reasoning

Request Format

cURL
curl --location 'https://api.sambanova.ai/v1/audio/reasoning' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--data '{
    "messages": [
        {"role": "assistant", "content": "you are a helpful assistant"},  
        {"role": "user", "content":[
                {
                    "type": "audio_content",
                    "audio_content": {
                        "content": "data:audio/mp3;base64,<base64_audio>"
                    }
                }
            ]
        },
        {"role": "user", "content": "what is the audio about"}
    ],   
    "max_tokens": 1024,
    "model": "Qwen2-Audio-7B-Instruct",
    "temperature": 0.01,
    "stream": true // Optional
}'
Python
import requests
import base64

def analyze_audio(audio_file_path, api_key):
    with open(audio_file_path, "rb") as audio_file:
        base64_audio = base64.b64encode(audio_file.read()).decode('utf-8')
    
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }
    
    data = {
        "messages": [
            {"role": "assistant", "content": "you are a helpful assistant"},
            {"role": "user", "content": [
                {
                    "type": "audio_content",
                    "audio_content": {
                        "content": f"data:audio/mp3;base64,{base64_audio}"
                    }
                }
            ]},
            {"role": "user", "content": "what is the audio about"}
        ],
        "model": "Qwen2-Audio-7B-Instruct",
        "max_tokens": 1024,
        "temperature": 0.01,
        "stream": True  # Optional
    }
    
    response = requests.post(
        "https://api.sambanova.ai/v1/audio/reasoning",
        headers=headers,
        json=data
    )
    
    return response.json()
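The `data:audio/mp3;base64,...` string above hard-codes the MIME type. A small helper can build it for any of the supported formats; note the extension-to-MIME mapping below is our own reasonable guess for the formats listed under Request Parameters, not an official table:

```python
import base64
import pathlib

# Assumed extension-to-MIME mapping for the supported upload formats;
# verify against the API's actual behavior.
_AUDIO_MIME = {
    ".flac": "audio/flac", ".mp3": "audio/mp3", ".mp4": "audio/mp4",
    ".mpeg": "audio/mpeg", ".mpga": "audio/mpeg", ".m4a": "audio/m4a",
    ".ogg": "audio/ogg", ".wav": "audio/wav", ".webm": "audio/webm",
}

def audio_data_uri(path: str) -> str:
    """Build the "data:<mime>;base64,<payload>" string for audio_content."""
    ext = pathlib.Path(path).suffix.lower()
    mime = _AUDIO_MIME.get(ext)
    if mime is None:
        raise ValueError(f"unsupported audio format: {ext}")
    payload = base64.b64encode(pathlib.Path(path).read_bytes()).decode("utf-8")
    return f"data:{mime};base64,{payload}"
```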

Response Format

{
    "choices": [{
        "delta": {
            "content": "The sound is that of ",
            "role": "assistant"
        },
        "finish_reason": null,
        "index": 0,
        "logprobs": null
    }],
    "created": 1732317298,
    "id": "211b9a22-58cf-4b90-94e9-1fed8d0d9d0a",
    "model": "Qwen2-Audio-7B-Instruct",
}

2. Transcription API

Converts audio to text in the specified language.

Endpoint

POST https://api.sambanova.ai/v1/audio/transcriptions

Request Format

cURL
curl --location 'https://api.sambanova.ai/v1/audio/transcriptions' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--form 'model="Qwen2-Audio-7B-Instruct"' \
--form 'language="spanish"' \
--form 'response_format="json"' \
--form 'temperature="0.01"' \
--form 'file=@"/path/to/audio/file.mp3"'
Python
import requests

def transcribe_audio(audio_file_path, api_key, language="english"):
    headers = {
        "Authorization": f"Bearer {api_key}"
    }

    data = {
        'model': 'Qwen2-Audio-7B-Instruct',
        'language': language,
        'response_format': 'json',
        'temperature': 0.01
        # 'stream' is omitted here: streaming is Reasoning-only during Beta
    }

    # Context manager ensures the file handle is closed after the upload
    with open(audio_file_path, 'rb') as audio_file:
        response = requests.post(
            "https://api.sambanova.ai/v1/audio/transcriptions",
            headers=headers,
            files={'file': audio_file},
            data=data
        )

    return response.json()

Response Format

JSON
{
    "text": "It's a sound effect of a bell chiming, specifically a church bell."
}
Text
It's a sound effect of a bell chiming, specifically a church bell.

3. Translation API

Translates audio content into the specified language.

Endpoint

POST https://api.sambanova.ai/v1/audio/translations

Request Format

cURL
curl --location 'https://api.sambanova.ai/v1/audio/translations' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--form 'model="Qwen2-Audio-7B-Instruct"' \
--form 'language="spanish"' \
--form 'response_format="json"' \
--form 'temperature="0.01"' \
--form 'file=@"/path/to/audio/file.mp3"'
Python
import requests

def translate_audio(audio_file_path, api_key, target_language="spanish"):
    headers = {
        "Authorization": f"Bearer {api_key}"
    }

    data = {
        'model': 'Qwen2-Audio-7B-Instruct',
        'language': target_language,
        'response_format': 'json',
        'temperature': 0.01
        # 'stream' is omitted here: streaming is Reasoning-only during Beta
    }

    # Context manager ensures the file handle is closed after the upload
    with open(audio_file_path, 'rb') as audio_file:
        response = requests.post(
            "https://api.sambanova.ai/v1/audio/translations",
            headers=headers,
            files={'file': audio_file},
            data=data
        )

    return response.json()

Response Format

JSON
{
    "text": "Es un efecto de sonido de una campana sonando, especĂ­ficamente una campana de iglesia."
}
Text
Es un efecto de sonido de una campana sonando, especĂ­ficamente una campana de iglesia.

Request Parameters

| Parameter | Type | Default | Description | Endpoints |
|---|---|---|---|---|
| model | string | Required | The ID of the model to use. Only Qwen2-Audio-7B-Instruct is currently available. | All |
| messages | array | Required | A list of messages, each with a role (user/system/assistant) and content; audio is passed as a content part with type "audio_content" and base64-encoded audio_content. | Reasoning |
| response_format | string | "json" | The output format, either "json" or "text". | All |
| temperature | number | 0 | Sampling temperature between 0 and 1. Higher values (e.g., 0.8) increase randomness, while lower values (e.g., 0.2) make output more focused. | All |
| max_tokens | number | 1000 | The maximum number of tokens to generate. | All |
| file | file | Required | Audio file in flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm format. Each file must not exceed 30 seconds in duration. | Transcription, Translation |
| language | string | Optional | The target language for transcription or translation. | Transcription, Translation |
| stream | boolean | false | Enables streaming responses. | All |
| stream_options | object | Optional | Additional streaming configuration (e.g., {"include_usage": true}). | All |
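A pre-flight check for the 30-second per-file limit can be sketched with the standard library; this covers WAV input only, since the stdlib `wave` module cannot read mp3/ogg/etc., and the helper names are our own:

```python
import wave

MAX_SECONDS = 30.0  # per-file limit from the Request Parameters table

def wav_duration_seconds(path: str) -> float:
    """Duration of a WAV file, computed as frames / sample rate."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def check_duration(path: str) -> None:
    """Raise before uploading a WAV file that exceeds the API limit."""
    duration = wav_duration_seconds(path)
    if duration > MAX_SECONDS:
        raise ValueError(f"audio is {duration:.1f}s; limit is {MAX_SECONDS:.0f}s")
```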

Streaming Responses*

When streaming is enabled, the API returns a series of data chunks in the following format:

data: {"choices":[{"delta":{"content":"","role":"assistant"},"finish_reason":null,"index":0,"logprobs":null}],"created":1732317298,"id":"211b9a22-58cf-4b90-94e9-1fed8d0d9d0a","model":"Qwen2-Audio-7B-Instruct","object":"chat.completion.chunk","system_fingerprint":"fastcoe"}

data: {"choices":[{"delta":{"content":"The sound is that of ","role":"assistant"},"finish_reason":null,"index":0,"logprobs":null}],"created":1732317298,"id":"211b9a22-58cf-4b90-94e9-1fed8d0d9d0a","model":"Qwen2-Audio-7B-Instruct","object":"chat.completion.chunk","system_fingerprint":"fastcoe"}

*Please note that Streaming Responses are only available for the Reasoning endpoint for Beta launch. We will enable it for all three APIs during our upcoming release.
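A minimal consumer for these chunks can be sketched as follows. The chunk shape matches the sample payloads above; the "[DONE]" sentinel check is a common SSE convention that may not apply to this API, and the function name is our own:

```python
import json

def collect_stream_text(lines):
    """Reassemble assistant text from an iterable of "data: {...}" SSE lines."""
    parts = []
    for raw in lines:
        # Skip keep-alive blanks and anything that is not a data line
        if not raw or not raw.startswith("data:"):
            continue
        data = raw[len("data:"):].strip()
        if data == "[DONE]":  # common SSE end sentinel; may not be emitted here
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)
```

To feed `lines` from a live request, pair this with `requests.post(..., stream=True)` and `response.iter_lines(decode_unicode=True)`.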
