Inference Platform
Run IBM Granite and Google Gemma models on a fleet of dedicated machines. OpenAI-compatible API. No data retention. No training on your inputs.
A conversation interface with purpose-built assistants for wellness, legal process, creative writing, technical learning, and more. Conversations are ephemeral: nothing persists beyond your browser session.
OpenAI-compatible inference endpoint. Drop in as a replacement for any OpenAI SDK client: change your base_url and you're running on dedicated hardware with zero data retention.
2.5B-parameter model from IBM Research. Fast inference, efficient on 16GB hardware. Apache 2.0 licensed. Serves the streamrift-fast tier.
Google DeepMind's efficient 4B model. Strong reasoning and instruction following. Apache 2.0 licensed. Serves the streamrift-thinking tier.
Requests are load-balanced across dedicated machines with health checks, circuit breakers, and automatic failover. No shared tenancy.
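To make the terms above concrete, here is a minimal sketch of how a circuit breaker and failover loop can fit together. It is an illustration of the general pattern, not the platform's actual router: the class, the function, the thresholds, and the backend names are all hypothetical.

```python
import time


class CircuitBreaker:
    """Tracks consecutive failures for one backend (illustrative only)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # assumed threshold
        self.reset_after = reset_after    # assumed cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def available(self):
        # Closed circuit: always available.
        if self.opened_at is None:
            return True
        # Open circuit: allow a trial request once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def call_with_failover(backends, breakers, send):
    """Try each backend in order, skipping any whose circuit is open.

    `backends` is a list of hypothetical machine names, `breakers` maps each
    name to a CircuitBreaker, and `send(name)` performs the request and
    raises on failure.
    """
    last_error = None
    for name in backends:
        breaker = breakers[name]
        if not breaker.available():
            continue
        try:
            result = send(name)
        except Exception as exc:  # a real router would catch narrower errors
            breaker.record_failure()
            last_error = exc
            continue
        breaker.record_success()
        return result
    raise RuntimeError("no healthy backend available") from last_error
```

A backend that keeps failing is skipped until its cool-down passes, so one unhealthy machine does not slow down every request.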
Use the OpenAI SDK you already have. Change your base_url and API key. Streaming, function calling, and all standard parameters work out of the box.
from openai import OpenAI

client = OpenAI(
    base_url="https://streamrift.ai/api/v1",
    api_key="sr_your_key_here",
)
response = client.chat.completions.create(
    model="streamrift-fast",
    messages=[{"role": "user", "content": "Hello"}],
)

curl https://streamrift.ai/api/v1/chat/completions \
  -H "Authorization: Bearer sr_your_key" \
  -H "Content-Type: application/json" \
  -d '{"model":"streamrift-fast",
       "messages":[{"role":"user",
                    "content":"Hello"}]}'

Requests are proxied to the inference fleet and streamed back. No prompt, completion, or conversation content is written to disk or database at any point in the pipeline.
We run open-source models from IBM and Google as-is. Your inputs are never used for fine-tuning, RLHF, or any form of model improvement.
Chat state exists in your browser's memory only. Close the tab and it's gone. There is no server-side conversation history to retrieve, export, or subpoena.
We track token counts, timestamps, and model selection for billing and rate limiting. We never log the content of your requests or responses.
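As an illustration of what content-free metering can look like, the sketch below builds a billing record from an OpenAI-style response body while discarding the text. The `UsageRecord` type and `usage_record` function are hypothetical and not the platform's actual schema. The `usage` and `model` fields follow the OpenAI response format, which is assumed to apply here.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class UsageRecord:
    """Billing metadata only; there is deliberately no field for content."""
    timestamp: str
    model: str
    prompt_tokens: int
    completion_tokens: int


def usage_record(response_body, now=None):
    """Build a record from an OpenAI-style response dict, dropping all text."""
    usage = response_body.get("usage", {})
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    return UsageRecord(
        timestamp=stamp,
        model=response_body.get("model", "unknown"),
        prompt_tokens=int(usage.get("prompt_tokens", 0)),
        completion_tokens=int(usage.get("completion_tokens", 0)),
    )
```

Because the record type has no field that could hold a prompt or completion, the content cannot end up in a usage log by accident.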
Free tier included. Scale to production with higher rate limits and priority routing.