Inference Platform

Managed inference on dedicated hardware.

Run IBM Granite and Google Gemma models on a fleet of dedicated machines. OpenAI-compatible API. No data retention. No training on your inputs.

Start building Read the docs

Products

💬

StreamRift Chat

A conversation interface with purpose-built assistants for wellness, legal process, creative writing, technical learning, and more. Conversations are ephemeral — nothing persists beyond your browser session.

Clarity · Connie · Mirror · Muse · Ted · +2 more

Open Chat →

⚡

StreamRift API

OpenAI-compatible inference endpoint. Use it as a drop-in replacement with any OpenAI SDK client: change your base_url and you're running on dedicated hardware with zero data retention.

/v1/chat/completions · streaming · 128K context

API Documentation →

Infrastructure

128K

Context window

up to 128K tokens per request

0

Bytes logged

requests exist in transit only

100%

Open source

Apache 2.0 licensed models

IBM Granite 3.3

2.5B parameter model from IBM Research. Fast inference, efficient on 16GB hardware. Apache 2.0 licensed. Serves the streamrift-fast tier.

Google Gemma 4

Google DeepMind's efficient 4B model. Strong reasoning and instruction following. Apache 2.0 licensed. Serves the streamrift-thinking tier.

Fleet Architecture

Requests are load-balanced across dedicated machines with health checks, circuit breakers, and automatic failover. No shared tenancy.
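
To make the terms above concrete, here is a minimal sketch of round-robin load balancing with a per-machine circuit breaker and failover. It is an illustration only, not StreamRift's implementation: the `Backend` and `Fleet` classes, the `max_failures` and `cooldown` parameters, and the `send` callback are all hypothetical names chosen for this example.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Backend:
    """One inference machine plus its breaker state (hypothetical)."""
    name: str
    failures: int = 0                  # consecutive failures seen so far
    opened_at: Optional[float] = None  # time the breaker tripped, if it did


class Fleet:
    def __init__(self, backends: List[Backend], max_failures: int = 3,
                 cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self.backends = list(backends)
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self._next = 0  # round-robin cursor

    def _available(self, backend: Backend) -> bool:
        if backend.opened_at is None:
            return True
        # Half-open: allow a trial request once the cooldown has elapsed.
        return self.clock() - backend.opened_at >= self.cooldown

    def call(self, send: Callable[[str], str]) -> str:
        """Try machines round-robin, skip open breakers, fail over on error."""
        last_error: Optional[Exception] = None
        count = len(self.backends)
        start = self._next
        self._next = (self._next + 1) % count  # rotate the starting machine
        for offset in range(count):
            backend = self.backends[(start + offset) % count]
            if not self._available(backend):
                continue
            try:
                result = send(backend.name)
            except Exception as exc:
                backend.failures += 1
                if backend.failures >= self.max_failures:
                    backend.opened_at = self.clock()  # trip (or re-trip)
                last_error = exc
                continue  # automatic failover to the next machine
            backend.failures = 0
            backend.opened_at = None  # success closes the breaker
            return result
        raise RuntimeError("no healthy backend available") from last_error
```

In this sketch a failed request is retried on the next machine immediately, and a machine that keeps failing is skipped until its cooldown expires, at which point a single trial request decides whether it rejoins the rotation.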

Quick Start

One line to switch.

Use the OpenAI SDK you already have. Change your base_url and API key. Streaming, function calling, and all standard parameters work out of the box.

python · openai sdk

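from openai import OpenAI

# Point the standard OpenAI client at the StreamRift endpoint.
client = OpenAI(
    base_url="https://streamrift.ai/api/v1",
    api_key="sr_your_key_here"
)

response = client.chat.completions.create(
    model="streamrift-fast",
    messages=[{"role": "user", "content": "Hello"}]
)

# Print the assistant's reply.
print(response.choices[0].message.content)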
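
Streaming is mentioned above but not shown. The sketch below uses only the Python standard library and assumes the endpoint follows the usual OpenAI streaming convention: server-sent `data:` lines carrying JSON chunks, ended by `data: [DONE]`. The helper names `build_stream_request`, `iter_deltas`, and `stream_chat` are hypothetical, and the exact wire format should be checked against the API documentation.

```python
import json
import urllib.request
from typing import Iterable, Iterator

BASE_URL = "https://streamrift.ai/api/v1"  # same base URL as the example above


def build_stream_request(api_key: str, prompt: str,
                         model: str = "streamrift-fast") -> urllib.request.Request:
    """Build (but do not send) a chat completion request with stream=True."""
    body = {
        "model": model,
        "stream": True,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def iter_deltas(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # assumed end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


def stream_chat(api_key: str, prompt: str) -> Iterator[str]:
    """Send the request and yield the reply as it arrives (needs network)."""
    request = build_stream_request(api_key, prompt)
    with urllib.request.urlopen(request) as response:
        yield from iter_deltas(response)
```

A caller would loop over `stream_chat("sr_your_key_here", "Hello")` and print each fragment as it arrives. With the OpenAI SDK, passing `stream=True` to `client.chat.completions.create` and iterating over the result is the equivalent.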

curl · rest api

curl https://streamrift.ai/api/v1/chat/completions \
  -H "Authorization: Bearer sr_your_key" \
  -H "Content-Type: application/json" \
  -d '{"model":"streamrift-fast",
       "messages":[{"role":"user",
       "content":"Hello"}]}'

Full API reference →
Generate your API key →

Data Architecture

Zero data retention

Requests are proxied to the inference fleet and streamed back. No prompt, completion, or conversation content is written to disk or database at any point in the pipeline.

No training on inputs

We run open-source models from IBM and Google as-is. Your inputs are never used for fine-tuning, RLHF, or any form of model improvement.

Ephemeral conversations

Chat state exists in your browser's memory only. Close the tab and it's gone. There is no server-side conversation history to retrieve, export, or subpoena.
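
The same idea applies to API clients: chat completion requests are stateless, so the caller holds the history and resends it with every turn. The sketch below mirrors that in Python. The `Conversation` class is a hypothetical helper for illustration and says nothing about how the browser-based chat interface is actually built.

```python
from typing import Dict, List


class Conversation:
    """Chat history held only in this process's memory (hypothetical helper)."""

    def __init__(self, model: str, system_prompt: str = "") -> None:
        self.model = model
        self.messages: List[Dict[str, str]] = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def user_turn(self, text: str) -> dict:
        """Record a user message and return the full request body to send."""
        self.messages.append({"role": "user", "content": text})
        # Copy the list so later turns do not alter a body already built.
        return {"model": self.model, "messages": list(self.messages)}

    def assistant_turn(self, text: str) -> None:
        """Record the model's reply so the next request includes it."""
        self.messages.append({"role": "assistant", "content": text})
```

When the object is discarded, the history goes with it, because nothing in this sketch writes it anywhere else.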

Usage metadata only

We track token counts, timestamps, and model selection for billing and rate limiting. We never log the content of your requests or responses.
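
As an illustration of what a metadata-only record could look like, the sketch below reduces an OpenAI-style response to the three things listed above: token counts, a timestamp, and the model name. The `UsageRecord` type and `usage_record_from_response` function are hypothetical and do not describe StreamRift's actual billing schema. The `usage` and `model` fields follow the standard chat completion response shape.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UsageRecord:
    """Billing metadata only; deliberately has no field for message content."""
    timestamp: float
    model: str
    prompt_tokens: int
    completion_tokens: int


def usage_record_from_response(response: dict,
                               now: Optional[float] = None) -> UsageRecord:
    """Keep token counts, time, and model; drop everything else."""
    usage = response.get("usage") or {}
    return UsageRecord(
        timestamp=time.time() if now is None else now,
        model=response.get("model", "unknown"),
        prompt_tokens=int(usage.get("prompt_tokens", 0)),
        completion_tokens=int(usage.get("completion_tokens", 0)),
    )
```

Because the record type has no place to put prompt or completion text, content cannot end up in it by accident.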

Start building today.

Free tier included. Scale to production with higher rate limits and priority routing.

Create account View pricing
