Inference Platform

Managed inference on dedicated hardware.

Run IBM Granite and Google Gemma models on a fleet of dedicated machines. OpenAI-compatible API. No data retention. No training on your inputs.

Products

StreamRift Chat

A conversation interface with purpose-built assistants for wellness, legal process, creative writing, technical learning, and more. Conversations are ephemeral: nothing persists beyond your browser session.

Clarity, Connie, Mirror, Muse, Ted, +2 more
Open Chat →

StreamRift API

OpenAI-compatible inference endpoint. Drop in as a replacement for any OpenAI SDK client: change your base_url and you're running on dedicated hardware with zero data retention.

/v1/chat/completions · streaming · 128K context
API Documentation →
Infrastructure
128K context window: up to 128K tokens per request
0 bytes logged: requests exist in transit only
100% open source: Apache 2.0 licensed models

IBM Granite 3.3

2.5B parameter model from IBM Research. Fast inference, efficient on 16GB hardware. Apache 2.0 licensed. Serves the streamrift-fast tier.

Google Gemma 4

Google DeepMind's efficient 4B model. Strong reasoning and instruction following. Apache 2.0 licensed. Serves the streamrift-thinking tier.

Fleet Architecture

Requests are load-balanced across dedicated machines with health checks, circuit breakers, and automatic failover. No shared tenancy.
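
To make the circuit-breaker and failover idea concrete, here is a minimal Python sketch of the general pattern. It is not StreamRift's implementation: the class and function names, the failure threshold, and the cooldown are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Dict, Optional, Sequence


class CircuitBreaker:
    """Stops routing to a backend after repeated failures, then retries after a cooldown."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures  # hypothetical threshold
        self.cooldown_s = cooldown_s      # hypothetical cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allows_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # open: allow a trial request only once the cooldown has elapsed
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()  # (re)open and restart the cooldown


def send_with_failover(backends: Sequence[str],
                       breakers: Dict[str, CircuitBreaker],
                       send: Callable[[str], str]) -> str:
    """Try each backend in order, skipping any whose breaker is open."""
    last_error: Optional[Exception] = None
    for name in backends:
        breaker = breakers[name]
        if not breaker.allows_request():
            continue
        try:
            result = send(name)
        except Exception as exc:  # any failure counts against this backend
            breaker.record_failure()
            last_error = exc
            continue
        breaker.record_success()
        return result
    raise RuntimeError("no healthy backend available") from last_error
```

A failing machine is skipped once its breaker opens, so later requests go straight to healthy machines until the cooldown allows a trial request.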

Quick Start

One line to switch.

Use the OpenAI SDK you already have. Change your base_url and API key. Streaming, function calling, and all standard parameters work out of the box.

Python · OpenAI SDK
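from openai import OpenAI

# Point the standard OpenAI client at StreamRift instead of the default endpoint.
client = OpenAI(
    base_url="https://streamrift.ai/api/v1",
    api_key="sr_your_key_here",  # replace with your StreamRift API key
)

response = client.chat.completions.create(
    model="streamrift-fast",
    messages=[{"role": "user", "content": "Hello"}],
)

# Print the assistant's reply text.
print(response.choices[0].message.content)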
curl · REST API
curl https://streamrift.ai/api/v1/chat/completions \
  -H "Authorization: Bearer sr_your_key" \
  -H "Content-Type: application/json" \
  -d '{"model":"streamrift-fast",
       "messages":[{"role":"user",
       "content":"Hello"}]}'
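
Streaming is listed as supported, so here is a short sketch of consuming a streamed response through the OpenAI SDK. It assumes the endpoint returns chunks in the standard OpenAI streaming shape, which the page implies but does not spell out. The helper name stream_text is hypothetical.

```python
def stream_text(client, prompt: str, model: str = "streamrift-fast") -> str:
    """Request a streamed completion and join the text deltas as they arrive."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # ask for incremental chunks instead of one response
    )
    parts = []
    for chunk in stream:
        # Some chunks carry no choices or an empty delta (role-only or final chunk).
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)
```

With the client object from the Python snippet above, stream_text(client, "Hello") would return the full reply. To display text as it arrives, print each delta inside the loop instead of collecting it.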
Data Architecture

Zero data retention

Requests are proxied to the inference fleet and streamed back. No prompt, completion, or conversation content is written to disk or database at any point in the pipeline.

No training on inputs

We run open-source models from IBM and Google as-is. Your inputs are never used for fine-tuning, RLHF, or any form of model improvement.

Ephemeral conversations

Chat state exists in your browser's memory only. Close the tab and it's gone. There is no server-side conversation history to retrieve, export, or subpoena.
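
API clients can follow the same pattern. With no server-side history, the caller presumably keeps the message list and resends it on every turn, as is usual for chat completions endpoints. The sketch below holds that state in process memory only. The class name and its methods are hypothetical, and it expects a client like the one in Quick Start.

```python
class InMemoryConversation:
    """Keeps chat history in process memory only; nothing is written to disk."""

    def __init__(self, client, model: str = "streamrift-fast") -> None:
        self.client = client
        self.model = model
        self.messages = []  # the only copy of the conversation

    def ask(self, text: str) -> str:
        self.messages.append({"role": "user", "content": text})
        response = self.client.chat.completions.create(
            model=self.model,
            messages=list(self.messages),  # resend the full history each turn
        )
        reply = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": reply})
        return reply

    def reset(self) -> None:
        """Discard the conversation, the equivalent of closing the tab."""
        self.messages.clear()
```

Once the object is discarded or reset, the history is gone from the client side as well.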

Usage metadata only

We track token counts, timestamps, and model selection for billing and rate limiting. We never log the content of your requests or responses.
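
As an illustration of what a metadata-only record could look like, the sketch below derives one from an OpenAI-style response body. The UsageRecord schema is an assumption made for this example, not StreamRift's actual billing format. The usage and model fields follow the standard chat completions response shape.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class UsageRecord:
    """Hypothetical billing record: counts and identifiers, never message text."""
    timestamp: str
    model: str
    prompt_tokens: int
    completion_tokens: int


def usage_record_from(response: dict) -> UsageRecord:
    """Copy only the metadata fields; the choices (content) are never read."""
    usage = response["usage"]
    return UsageRecord(
        timestamp=datetime.now(timezone.utc).isoformat(),
        model=response["model"],
        prompt_tokens=usage["prompt_tokens"],
        completion_tokens=usage["completion_tokens"],
    )
```

Because the record type has no field for prompt or completion text, content cannot end up in it by accident.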

Start building today.

Free tier included. Scale to production with higher rate limits and priority routing.