A Step-by-Step Guide to AI Inference Using Oumi and SambaNova

What is Oumi?

Oumi is an open-source AI development platform for training, fine-tuning, deploying, evaluating, and serving large language models (LLMs).
It supports models such as LLaMA, Mistral, OpenHermes, and TinyLlama, and connects to inference frameworks such as vLLM and SGLang.

Oumi is a Python framework designed to work with multiple language model providers (such as OpenAI, SambaNova, etc.) through a unified interface. You can define models, set parameters, and call inference in a structured way.

1. Basic Concepts You Need to Know

  • ModelParams: specifies which model you’re using (e.g., one of SambaNova’s LLMs).
  • GenerationParams: controls generation settings (e.g., temperature, max tokens).
  • RemoteParams: holds the API key and base URL for remote access.
  • InferenceEngine: the class that sends your prompt to the provider and returns the output.
  • Conversation: a wrapper that structures your prompt and the generated response.
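
To see how these pieces fit together before the step-by-step walkthrough, here is a minimal sketch. The constructor and field names used here (api_key, max_new_tokens, remote_params=...) are assumptions based on typical oumi usage, so verify them against your installed version:

from oumi.core.configs import GenerationParams, InferenceConfig, ModelParams, RemoteParams
from oumi.inference import SambanovaInferenceEngine

# ModelParams: which model to call
model_params = ModelParams(model_name="Llama-4-Maverick-17B-128E-Instruct")

# GenerationParams: how to generate (output length, sampling temperature, ...)
generation_params = GenerationParams(max_new_tokens=256, temperature=0.7)

# RemoteParams: how to reach the provider (API key, optionally a base URL)
remote_params = RemoteParams(api_key="<your-api-key>")

# The engine connects the model to the provider; InferenceConfig bundles the
# generation settings that are applied when you run inference
engine = SambanovaInferenceEngine(model_params, remote_params=remote_params)
inference_config = InferenceConfig(model=model_params, generation=generation_params)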

2. Requirements

  • Python 3.8+
  • The oumi package (installed via pip)
  • Have your SambaNova API key available

3. Step-by-Step Setup with SambaNova

Step 1: Install Oumi

Install it with pip (the leading ! is only needed when running inside a notebook):

!pip install "oumi[gpu]"

Step 2: Set Your API Key

You can either export it as an environment variable:

export SAMBANOVA_API_KEY=<your-api-key>

Or set it in your Python code (less secure):

import os
os.environ["SAMBANOVA_API_KEY"] = "<your-api-key>"

Step 3: Import Required Classes

from oumi.inference import SambanovaInferenceEngine
from oumi.core.configs import InferenceConfig, ModelParams
from oumi.core.types.conversation import Conversation, Message, Role

Step 4: Initialize the Inference Engine

# Initialize the inference engine with a SambaNova-hosted model
engine = SambanovaInferenceEngine(
    ModelParams(
        model_name="Llama-4-Maverick-17B-128E-Instruct",
    )
)

Step 5: Create a Conversation


# Create a conversation
from oumi.core.types.conversation import Conversation, Message, Role
conversation = Conversation(messages=[
    Message(role=Role.USER, content="What is quantum computing?")
])
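
Step 6: Run Inference

With the engine and conversation in place, you can run inference. The snippet below is a minimal sketch, assuming the engine exposes an infer() method that accepts a list of Conversation objects and returns them with the model’s reply appended; check the method name against your installed oumi version:

# Run inference on the conversation
results = engine.infer([conversation])

# The model's reply is appended as the last message of the returned conversation
print(results[0].messages[-1].content)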

You can also run multimodal models through the SambaNova inference engine.
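
As a rough illustration, a multimodal prompt might combine an image and a text question in one user message. The ContentItem and Type classes below are assumptions based on typical oumi usage, so verify them against your installed version:

# Hypothetical multimodal prompt: an image URL plus a text question
from oumi.core.types.conversation import ContentItem, Conversation, Message, Role, Type

mm_conversation = Conversation(messages=[
    Message(
        role=Role.USER,
        content=[
            ContentItem(type=Type.IMAGE_URL, content="https://example.com/cat.png"),
            ContentItem(type=Type.TEXT, content="Describe this image."),
        ],
    )
])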

Batch inference refers to processing multiple prompts or conversations at once to optimize performance (especially on GPU backends or inference servers).

Oumi supports this pattern where you can send a list of Conversation objects and get all responses together — ideal for evaluation, benchmarking, or serving multiple users.

conversations = [Conversation(...), Conversation(...)]
responses = engine.infer(conversations)
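
To read the generated replies, iterate over the returned conversations; assuming the return value mirrors single-conversation inference, each item ends with the model’s message:

for conv in responses:
    # The last message of each returned conversation holds the model's reply
    print(conv.messages[-1].content)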

Oumi also integrates with evaluation components (e.g., accuracy, BLEU, ROUGE) to measure:

  • Prompt quality
  • Generation coherence
  • Fine-tuning effectiveness

Happy learning …..! :grinning:
