What is Oumi?
Oumi is an open-source AI development platform for training, fine-tuning, deploying, evaluating, and serving large language models (LLMs).
It supports models such as LLaMA, Mistral, OpenHermes, and TinyLlama, and connects to inference frameworks such as vLLM and SGLang.
Oumi is a Python framework designed to work with multiple language model providers (such as OpenAI and SambaNova) through a unified interface: you define models, set parameters, and call inference in a structured way.
1. Basic Concepts You Need to Know
| Concept | Description |
|---|---|
| ModelParams | Specifies which model you’re using (e.g., SambaNova’s LLMs). |
| GenerationParams | Controls generation settings (e.g., temperature, max tokens). |
| RemoteParams | Contains API keys and base URLs for remote access. |
| InferenceEngine | The class that connects your prompt to the provider and returns the output. |
| Conversation | Wrapper that structures your prompt and the generated response. |
2. Requirements
- Python 3.8+
- Install oumi (if it’s a local package or available via pip)
- Have your SambaNova API key available
3. Step-by-Step Setup with SambaNova
Step 1: Install Oumi
Install it with pip (the [gpu] extra pulls in GPU dependencies):
!pip install "oumi[gpu]"
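To confirm the installation, you can check the installed package version. This is just a quick sanity check and assumes the distribution name is simply oumi:
# Quick sanity check: confirm oumi is importable and report its installed version.
from importlib.metadata import version

import oumi  # raises ImportError if the installation failed

print("oumi version:", version("oumi"))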
Step 2: Set Your API Key
You can either export it as an environment variable in your shell:
export SAMBANOVA_API_KEY=<your-api-key>
Or set it in your Python code (less secure):
import os

os.environ["SAMBANOVA_API_KEY"] = "<your-api-key>"
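Alternatively, Oumi’s RemoteParams (see the concepts table above) lets you pass credentials explicitly. The field names and endpoint URL below are assumptions and may differ across Oumi versions, so check the RemoteParams definition in your install:
from oumi.core.configs import RemoteParams

# Pass credentials explicitly instead of relying on the environment variable.
# Field names (api_key, api_url) and the endpoint URL are assumed for illustration.
remote_params = RemoteParams(
    api_key="<your-api-key>",
    api_url="https://api.sambanova.ai/v1/chat/completions",
)
If your version’s engine constructor accepts it, you can then pass remote_params=remote_params when creating the engine in Step 4.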
Step 3: Import Required Classes
from oumi.inference import SambanovaInferenceEngine
from oumi.core.configs import InferenceConfig, ModelParams
from oumi.core.types.conversation import Conversation, Message, Role
Step 4: Initialize the Inference Engine
# Initialize with a small, free model
engine = SambanovaInferenceEngine(
    ModelParams(
        model_name="Llama-4-Maverick-17B-128E-Instruct",
        model_kwargs={"device_map": "auto"},  # device_map applies to local model loading, not to remote APIs
    )
)
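Generation behavior (temperature, output length, and so on) is controlled with GenerationParams, which can be bundled into an InferenceConfig. The field names below (model, generation, max_new_tokens, temperature) are a sketch based on the concepts table; verify them against your Oumi version:
from oumi.core.configs import GenerationParams, InferenceConfig, ModelParams

# Sketch of an inference config that caps output length and sets sampling temperature.
inference_config = InferenceConfig(
    model=ModelParams(model_name="Llama-4-Maverick-17B-128E-Instruct"),
    generation=GenerationParams(
        max_new_tokens=256,  # upper bound on generated tokens
        temperature=0.7,     # higher values produce more varied output
    ),
)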
Step 5: Create a Conversation
# Create a conversation
from oumi.core.types.conversation import Conversation, Message, Role
conversation = Conversation(messages=[
Message(role=Role.USER, content="What is quantum computing?")
])
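With the engine and the conversation in place, you can run inference. The sketch below assumes the engine exposes an infer() method that accepts a list of Conversation objects and returns them with the assistant’s reply appended; the exact signature may vary between Oumi releases:
# Run inference on the conversation built above (engine comes from Step 4).
results = engine.infer([conversation])
# results = engine.infer([conversation], inference_config)  # if you built an InferenceConfig earlier

# Each returned Conversation ends with the generated assistant message.
print(results[0].messages[-1].content)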
You can also use multimodal models through the SambaNova inference engine.
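For example, a vision-capable model can take an image alongside text in a single user message. The sketch below assumes Oumi exposes ContentItem and Type in oumi.core.types.conversation for multimodal content; the class and enum names, and the image URL, are placeholders to verify against your version:
from oumi.core.types.conversation import ContentItem, Conversation, Message, Role, Type

# A user message whose content mixes an image and a text instruction.
multimodal_conversation = Conversation(messages=[
    Message(
        role=Role.USER,
        content=[
            ContentItem(type=Type.IMAGE_URL, content="https://example.com/diagram.png"),
            ContentItem(type=Type.TEXT, content="Describe what this diagram shows."),
        ],
    )
])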
Batch inference means processing multiple prompts or conversations at once to improve throughput (especially on GPU backends or inference servers). Oumi supports this pattern: you send a list of Conversation objects and get all of the responses back together, which is ideal for evaluation, benchmarking, or serving multiple users.
conversations = [Conversation(...), Conversation(...)]
# infer() accepts a list of conversations and returns one completed conversation per input.
responses = engine.infer(conversations)
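You can then iterate over the returned list; each element corresponds to one input conversation with the model’s reply appended:
# Print the assistant reply from each conversation in the batch.
for convo in responses:
    print(convo.messages[-1].content)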
Oumi also integrates with evaluation components (e.g., accuracy, BLEU, ROUGE) to measure:
- Prompt quality
- Generation coherence
- Fine-tuning effectiveness
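A typical evaluation run is driven by a configuration and launched through Oumi’s evaluate entry point. The sketch below assumes the programmatic evaluate() helper, an EvaluationConfig with a from_yaml() loader, and a config file named eval_config.yaml; all three are assumptions to check against your installed version:
from oumi import evaluate
from oumi.core.configs import EvaluationConfig

# Load an evaluation config (benchmarks, model, output location) from YAML and run it.
# The file name and the evaluate() helper are assumptions for illustration.
config = EvaluationConfig.from_yaml("eval_config.yaml")
evaluate(config)  # results are written out according to the config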