OpenAI-compatible inference API. Powered by IBM Granite and Google Gemma on dedicated hardware.
Authentication
All API requests require an API key passed via the Authorization header.
Generate keys at /account/keys after signing in.
Authorization: Bearer sr_your_key_here
Base URL
All endpoints are served from a single base URL. Use this as your OpenAI SDK base_url.
https://streamrift.ai/api/v1
POST /v1/chat/completions
Create a chat completion. Fully compatible with the OpenAI chat completions API format.
Supports streaming (SSE) and non-streaming responses.
curl https://streamrift.ai/api/v1/chat/completions \
  -H "Authorization: Bearer sr_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "streamrift-fast",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello"}
    ],
    "stream": true,
    "temperature": 0.7,
    "max_tokens": 2048
  }'
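For clients that do not shell out to curl, the same request can be assembled with the Python standard library. This is a minimal sketch rather than an official client: the helper name build_request is hypothetical, and the request is only constructed here, not sent. The example sets stream to false so that a single JSON body would come back.

```python
import json
import urllib.request

API_URL = "https://streamrift.ai/api/v1/chat/completions"


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


payload = {
    "model": "streamrift-fast",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"},
    ],
    "stream": False,  # ask for one JSON body instead of an SSE stream
    "temperature": 0.7,
    "max_tokens": 2048,
}
request = build_request("sr_your_key", payload)

# To actually send it (requires a valid key and network access):
# with urllib.request.urlopen(request) as response:
#     completion = json.load(response)
```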
Available Models
| model | base | context | speed | use |
|---|---|---|---|---|
| streamrift-fast | IBM Granite 3.3 (2.5B) | 16K | Fastest | Quick tasks, code completion, chat |
| streamrift-thinking | Google Gemma 4 (4B) | 8K–128K | Balanced | Reasoning, analysis, creative writing |
Request Parameters
| parameter | type | description |
|---|---|---|
| model * | string | "streamrift-fast" or "streamrift-thinking" |
| messages * | array | Array of {role, content} objects. Roles: "system", "user", "assistant" |
| stream | boolean | Enable SSE streaming. Default: true |
| temperature | number | 0.0–2.0. Default: 0.7 |
| top_p | number | Nucleus sampling. Default: 0.9 |
| max_tokens | number | Maximum tokens to generate. Capped by model context. |

\* marks a required parameter.
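The parameters above can be checked on the client before a request is sent. The following is a sketch only: build_payload is a hypothetical helper, the validation mirrors just the constraints listed in this section, and the server remains the authority on what it accepts.

```python
ALLOWED_MODELS = {"streamrift-fast", "streamrift-thinking"}
ALLOWED_ROLES = {"system", "user", "assistant"}


def build_payload(model, messages, stream=True, temperature=0.7,
                  top_p=0.9, max_tokens=None):
    """Assemble a request body using the documented defaults."""
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if not messages:
        raise ValueError("messages must contain at least one entry")
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {message.get('role')!r}")
        if "content" not in message:
            raise ValueError("each message needs a content field")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")

    payload = {
        "model": model,
        "messages": list(messages),
        "stream": stream,
        "temperature": temperature,
        "top_p": top_p,
    }
    # max_tokens is optional; leave it out unless the caller sets it.
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    return payload
```

Calling build_payload("streamrift-fast", [{"role": "user", "content": "Hello"}]) yields a body with stream set to true, matching the documented default.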
Response Format
Non-streaming responses return a standard chat completion object.
{
  "id": "sr-1713367200000",
  "object": "chat.completion",
  "model": "streamrift-fast",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 8,
    "total_tokens": 20
  }
}
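Reading the reply and the token usage out of such an object takes only a few lookups. The sketch below parses the sample response shown above; extract_reply is a hypothetical helper name.

```python
import json

SAMPLE_RESPONSE = """
{
  "id": "sr-1713367200000",
  "object": "chat.completion",
  "model": "streamrift-fast",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "Hello! How can I help you?"},
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 12, "completion_tokens": 8, "total_tokens": 20}
}
"""


def extract_reply(completion: dict) -> tuple:
    """Return (assistant text, total tokens) from a chat completion object."""
    first_choice = completion["choices"][0]
    text = first_choice["message"]["content"]
    total_tokens = completion["usage"]["total_tokens"]
    return text, total_tokens


reply, tokens_used = extract_reply(json.loads(SAMPLE_RESPONSE))
```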
Streaming (SSE)
When stream: true, responses are delivered as Server-Sent Events. Each event contains a delta with partial content.
data: {"id":"sr-1713367200001","object":"chat.completion.chunk","model":"streamrift-fast","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}
data: {"id":"sr-1713367200002","object":"chat.completion.chunk","model":"streamrift-fast","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":"stop"}]}
data: [DONE]
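A client reassembles the reply by concatenating each chunk's delta content until it sees the [DONE] sentinel. The sketch below works on an iterable of already-decoded text lines, so it can be fed from any HTTP library; iter_deltas is a hypothetical helper, and the sample lines are the events shown above.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by a stream of SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


sample_lines = [
    'data: {"id":"sr-1713367200001","object":"chat.completion.chunk","model":"streamrift-fast","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    "",
    'data: {"id":"sr-1713367200002","object":"chat.completion.chunk","model":"streamrift-fast","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]
full_text = "".join(iter_deltas(sample_lines))
```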
Rate Limits
Rate limits are per-account and determined by your plan. Exceeding the limit returns HTTP 429 with a Retry-After header.
| plan | requests | tokens | concurrency | price |
|---|---|---|---|---|
| Free | 10 rpm | 500K tokens/mo | 1 concurrent | $0 |
| Sovereign | 60 rpm | 10M tokens/mo | 3 concurrent | $20/mo |
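One way to handle a 429 is to wait for the interval given in Retry-After and try again. The sketch below assumes the header carries a number of seconds and falls back to exponential backoff when it is absent or unparseable; that fallback is a choice made here, not documented behavior. The send callable, which returns a (status, headers, body) tuple, and both function names are hypothetical.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], str]


def retry_delay(headers: Mapping[str, str], attempt: int,
                base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait: Retry-After if usable, else capped exponential backoff."""
    value = headers.get("Retry-After")  # real HTTP clients match this case-insensitively
    if value is not None:
        try:
            return max(0.0, float(value))
        except ValueError:
            pass  # not a plain number of seconds; use the fallback
    return min(cap, base * 2 ** attempt)


def call_with_retry(send: Callable[[], Response], max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send() until it returns something other than 429 or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429 or attempt == max_attempts - 1:
            break
        sleep(retry_delay(headers, attempt))
    return status, headers, body
```

Passing sleep as a parameter keeps the function easy to exercise without real delays.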
Error Codes
Errors return a typed envelope so clients can branch on error.type without parsing message text.
| status | type | description |
|---|---|---|
| 401 | auth_error | Missing or invalid API key. |
| 403 | api_disabled | API access requires a paid plan. Free tier is chat-UI only. |
| 403 | tier_not_available | The thinking tier is not available on the current plan. |
| 429 | rate_limit_exceeded | Requests-per-minute bucket drained. Check Retry-After header. |
| 429 | wallet_empty | Token wallet below the 500-token floor required to start a turn. Wallet refills at 1,000/24h up to a 10,000 ceiling. |
| 502 | fleet_error | Backend request failed mid-flight. |
| 503 | fleet_unavailable | No healthy backends in the pool. |
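Because two different conditions share status 429 and two share 403, branching on the error type is more reliable than branching on the status code. The sketch below assumes the envelope has the shape {"error": {"type": ..., "message": ...}}; that shape is a guess based on the error.type wording above and should be checked against a real response. The action labels and function names are illustrative only.

```python
import json
from typing import Optional

# Illustrative mapping from documented error types to a client-side action.
ACTIONS = {
    "auth_error": "check_api_key",
    "api_disabled": "upgrade_plan",
    "tier_not_available": "upgrade_plan",
    "rate_limit_exceeded": "wait_for_retry_after",
    "wallet_empty": "wait_for_refill",
    "fleet_error": "retry_with_backoff",
    "fleet_unavailable": "retry_with_backoff",
}


def error_type(body: str) -> Optional[str]:
    """Pull error.type out of a response body, or None if it is not there."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None
    error = payload.get("error") if isinstance(payload, dict) else None
    if isinstance(error, dict):
        return error.get("type")
    return None


def next_action(body: str) -> str:
    """Map an error body to an action; unknown or malformed errors map to 'raise'."""
    return ACTIONS.get(error_type(body), "raise")
```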
Data Privacy
StreamRift does not log, store, or retain the content of API requests or responses. Your prompts and completions exist only in transit between your client and the inference fleet.
We track only usage metadata for billing: token counts, timestamps, and model selection. See our Terms of Service for full details.
SDK Compatibility
StreamRift is compatible with any OpenAI SDK client. Change the base_url parameter and use your StreamRift API key.
# Python
from openai import OpenAI
client = OpenAI(base_url="https://streamrift.ai/api/v1", api_key="sr_...")
# Node.js
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "https://streamrift.ai/api/v1", apiKey: "sr_..." });
# curl
curl https://streamrift.ai/api/v1/chat/completions -H "Authorization: Bearer sr_..."
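Once a client is configured, a streamed completion can be collected as in the sketch below. It relies on the OpenAI Python SDK's chat.completions.create call with stream=True, which yields chunks exposing choices[0].delta.content; stream_reply is a hypothetical helper, and the client is passed in so the function itself has no dependency on how it was constructed.

```python
def stream_reply(client, prompt: str, model: str = "streamrift-fast") -> str:
    """Send one user message and return the concatenated streamed reply."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # tolerate chunks that carry no choices
        fragment = chunk.choices[0].delta.content
        if fragment:
            parts.append(fragment)
    return "".join(parts)


# Usage with the client shown above (requires the openai package and a valid key):
# from openai import OpenAI
# client = OpenAI(base_url="https://streamrift.ai/api/v1", api_key="sr_...")
# print(stream_reply(client, "Hello"))
```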