Inference Endpoint

A hosted REST API that serves LLM predictions — accepting prompts as input and returning generated text. The bridge between a trained model and production applications.

An inference endpoint is the serving layer for a trained model. After training (or downloading) an LLM, you need infrastructure to accept requests, run the forward pass, and return outputs at scale. That infrastructure, whether it is Hugging Face Inference Endpoints, AWS SageMaker, your own vLLM deployment, or a managed service like OpenAI, is the inference endpoint.

Request Flow

  1. Client sends an HTTP POST with the prompt and model parameters (temperature, max_tokens)
  2. Endpoint tokenizes the prompt (Tokenization)
  3. Model runs the forward pass on GPU/CPU, generating output tokens one at a time
  4. Output tokens are decoded to text and either stream back (SSE) or return in one response
  5. Client receives the generated text
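
The client side of this flow (steps 1 and 5) can be sketched with the Python standard library. This is a minimal illustration, not any provider's actual API: the `/v1/completions` route, the payload field names, and the `{"text": ...}` response shape are assumptions chosen for the example, and `build_completion_request` and `extract_text` are hypothetical helpers.

```python
import json
import urllib.request


def build_completion_request(base_url, prompt, temperature=0.7, max_tokens=128):
    """Build (but do not send) the HTTP POST from step 1."""
    payload = {
        "prompt": prompt,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url + "/v1/completions",  # hypothetical route
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body):
    """Step 5: read generated text from a JSON body (assumed shape: {"text": ...})."""
    return json.loads(response_body)["text"]


req = build_completion_request("http://localhost:8000", "Hello", max_tokens=16)
# Sending it would look like: body = urllib.request.urlopen(req).read()
```

Tokenization, the forward pass, and decoding (steps 2 to 4) all happen on the server, so the client only ever deals with text and JSON.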

Key Metrics

  • Time to First Token (TTFT) — latency before streaming starts; affects perceived responsiveness
  • Tokens per Second (TPS) — throughput once streaming begins
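  • Requests per Second (RPS) — number of requests the endpoint completes per second; a measure of capacity under concurrent load
  • P99 latency — tail latency (99% of requests finish faster than this); critical for SLAs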
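
As a rough sketch of how these metrics could be computed from client-side measurements, assume you record the time a request was sent and the arrival time of each token. The function names are illustrative, and the percentile uses the simple nearest-rank definition; monitoring tools may interpolate differently.

```python
import math


def ttft(request_sent, token_times):
    """Time to First Token: delay between sending and the first token arriving."""
    return token_times[0] - request_sent


def tokens_per_second(token_times):
    """Streaming throughput, measured from the first token to the last."""
    if len(token_times) < 2:
        return 0.0
    duration = token_times[-1] - token_times[0]
    return (len(token_times) - 1) / duration


def percentile(values, pct):
    """Nearest-rank percentile, e.g. pct=99 for P99 latency."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# One request sent at t=0.0 whose tokens arrived at these times (seconds)
arrivals = [1.0, 1.5, 2.0, 2.5, 3.0]
first_token_delay = ttft(0.0, arrivals)
stream_rate = tokens_per_second(arrivals)
```

RPS is then the count of completed requests divided by the length of the measurement window, and P99 is `percentile(latencies, 99)` over the per-request latencies collected in that window.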

Managed vs Self-Hosted

Managed: OpenAI, Anthropic, Hugging Face Inference Endpoints — pay per token, zero infrastructure management, model choice limited to provider's catalog.

Self-hosted with vLLM: Full control, any model, predictable cost at scale, but requires GPU infrastructure, an on-call rotation, and ops work. As a rough rule of thumb, the economics start to favor self-hosting above ~$10k/month in model API spend.

Local with Ollama: Runs on developer hardware, with no per-token cost and no network latency, but limited throughput (typically one concurrent user).

Aggregated via OpenRouter: Multiple providers through one API. Convenient, but adds a small markup and is limited to providers in the catalog.
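
The managed versus self-hosted trade-off comes down to simple arithmetic: per-token pricing scales with usage, while self-hosting is mostly a fixed monthly cost. The sketch below shows the shape of that comparison. Every number in it (token volume, price per million tokens, GPU hourly rate, ops overhead) is a made-up placeholder, not a quote from any provider.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_api_cost(tokens_per_month, price_per_million_tokens):
    """Pay-per-token cost on a managed endpoint."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def monthly_self_hosted_cost(gpu_count, gpu_hourly_rate, ops_overhead):
    """Fixed cost of always-on GPUs plus an assumed monthly ops overhead."""
    return gpu_count * gpu_hourly_rate * HOURS_PER_MONTH + ops_overhead


def breakeven_tokens(fixed_monthly_cost, price_per_million_tokens):
    """Monthly token volume at which API spend equals the fixed cost."""
    return fixed_monthly_cost / price_per_million_tokens * 1_000_000


# Hypothetical inputs, for illustration only
api_spend = monthly_api_cost(2_000_000_000, price_per_million_tokens=5.0)
hosted_spend = monthly_self_hosted_cost(gpu_count=2, gpu_hourly_rate=2.0, ops_overhead=4000.0)
crossover = breakeven_tokens(hosted_spend, price_per_million_tokens=5.0)
```

The sketch ignores engineering time beyond the flat overhead figure and assumes the self-hosted GPUs can actually serve the required volume, so treat the crossover as a starting point for a real estimate rather than an answer.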

Streaming

Production UX almost always requires streaming: showing tokens as they arrive rather than waiting for the full response. Endpoints typically implement this via Server-Sent Events (SSE), and most SDKs handle stream parsing automatically. Streaming does not make generation faster, but it sharply reduces perceived latency for long responses because the user starts reading after the first token.
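
For a sense of what an SDK does under the hood, here is a minimal sketch of reading an SSE stream line by line. The JSON shape `{"token": ...}` and the `[DONE]` end marker are assumptions for the example (some providers use a similar sentinel; check your provider's documentation), and the parser is simplified: it treats each `data:` line as one event and ignores other SSE fields.

```python
import json


def iter_sse_data(lines):
    """Yield the payload of each `data:` line in an SSE line stream."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators, ":" comments, and other fields
        value = line[len("data:"):]
        if value.startswith(" "):
            value = value[1:]  # SSE strips one leading space after the colon
        yield value


def collect_text(lines, done_marker="[DONE]"):
    """Concatenate streamed tokens until the (assumed) end-of-stream marker."""
    parts = []
    for data in iter_sse_data(lines):
        if data == done_marker:
            break
        parts.append(json.loads(data)["token"])  # assumed payload shape
    return "".join(parts)


# A canned stream standing in for the lines of an HTTP response body
stream = [
    'data: {"token": "Hel"}\n',
    "\n",
    'data: {"token": "lo"}\n',
    "\n",
    ": keep-alive\n",
    "data: [DONE]\n",
    "\n",
]
text = collect_text(stream)
```

In a real UI you would render each token as it is yielded instead of joining them at the end.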

Context and Batching

Inference endpoints handle batching: grouping concurrent requests for efficient GPU utilization. vLLM's continuous batching admits new requests into the running batch as they arrive rather than waiting for a full batch to form and finish, which improves throughput and reduces average wait time.
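
The difference between the two policies can be seen in a toy simulation. This is a simplified model written for illustration, not vLLM's scheduler: time advances in decode steps, each request needs a fixed number of steps, and the "static" policy starts a batch only when the engine is idle and returns results when the longest request in the batch finishes.

```python
from collections import deque


def simulate(requests, slots, continuous):
    """Mean latency in decode steps; requests are (arrival_step, decode_steps)."""
    queue = deque(sorted(requests))  # not yet arrived
    waiting = deque()                # arrived, not yet scheduled
    running = []                     # [arrival_step, remaining_steps]
    latencies = []
    t = 0
    while queue or waiting or running:
        while queue and queue[0][0] <= t:
            waiting.append(queue.popleft())
        if continuous:
            # Continuous: fill any free slot as soon as a request is waiting
            while waiting and len(running) < slots:
                running.append(list(waiting.popleft()))
        elif not running and (len(waiting) >= slots or (waiting and not queue)):
            # Static: start only when idle and the batch is full (or arrivals ended)
            for _ in range(min(slots, len(waiting))):
                running.append(list(waiting.popleft()))
        t += 1  # one decode step for everything in the batch
        for r in running:
            r[1] -= 1
        if continuous:
            done = [r for r in running if r[1] <= 0]
            running = [r for r in running if r[1] > 0]
        else:
            finished = bool(running) and all(r[1] <= 0 for r in running)
            done, running = (running, []) if finished else ([], running)
        latencies.extend(t - arrival for arrival, _ in done)
    return sum(latencies) / len(latencies)


workload = [(0, 2), (0, 6), (1, 2), (3, 2)]
mean_continuous = simulate(workload, slots=2, continuous=True)
mean_static = simulate(workload, slots=2, continuous=False)
```

On this small workload the continuous policy gives a mean latency of 3.5 steps against 6.0 for the static one, because short requests free their slot for waiting requests instead of idling until the 6-step request completes.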

  • vLLM — open-source engine for building high-throughput inference endpoints
  • Ollama — local inference endpoint for development
  • OpenRouter — managed aggregator for multiple inference endpoints
  • Tokenization — the first processing step on every inference endpoint
  • Kubernetes — standard orchestration for self-hosted inference endpoint clusters
