By Sahil Kapoor - 22 Apr 2026

Inference Endpoint

An inference endpoint is the serving layer for a trained model. After you train (or download) an LLM, you need infrastructure that accepts requests, runs the forward pass, and returns outputs at scale. That infrastructure is the inference endpoint, whether it is Hugging Face Inference Endpoints, AWS SageMaker, your own vLLM deployment, or a managed API such as OpenAI's.

Request Flow

Client sends an HTTP POST with the prompt and model parameters (temperature, max_tokens)
Endpoint tokenizes the prompt (see Tokenization)
Model runs the forward pass on GPU/CPU, generating output tokens one at a time
Output tokens are decoded to text and either stream back (SSE) or return in one response
Client receives the generated text
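
The first step of the flow above can be sketched in a few lines of Python. This is a minimal illustration, not any provider's actual API: the URL, the `build_request` helper, and the JSON field names are assumptions chosen to mirror the parameters mentioned in the list, and real endpoints differ in path, schema, and authentication.

```python
import json
import urllib.request

# Hypothetical local endpoint; substitute your provider's real URL and schema.
ENDPOINT_URL = "http://localhost:8000/v1/completions"


def build_request(prompt, temperature=0.7, max_tokens=128, stream=False):
    """Build (but do not send) the HTTP POST a client would issue."""
    payload = {
        "prompt": prompt,
        "temperature": temperature,  # sampling randomness
        "max_tokens": max_tokens,    # upper bound on generated tokens
        "stream": stream,            # True asks for an SSE stream (assumed field)
    }
    return urllib.request.Request(
        ENDPOINT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request("Explain continuous batching in one sentence.")
body = json.loads(req.data)
print(req.get_method(), req.full_url)
print(body)
```

Sending the request is then a call to `urllib.request.urlopen(req)`, which needs a running server at the assumed URL.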

Key Metrics

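Time to First Token (TTFT): latency from sending the request until the first output token arrives; drives perceived responsiveness
Tokens per Second (TPS): output tokens generated per second once streaming begins
Requests per Second (RPS): request throughput, i.e. how many requests the endpoint completes per second under concurrent load
P99 latency: the latency that 99% of requests stay under (tail latency); critical for SLAs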
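
As a rough sketch, these metrics can be computed from timestamps recorded on the client. The function names are hypothetical, the percentile uses the simple nearest-rank definition (monitoring tools may interpolate differently), and the sample numbers are made up for illustration.

```python
import math


def time_to_first_token(request_sent, token_times):
    """Seconds between sending the request and receiving the first token."""
    return token_times[0] - request_sent


def tokens_per_second(token_times):
    """Decode throughput: tokens after the first, divided by the time they took."""
    if len(token_times) < 2:
        return 0.0
    return (len(token_times) - 1) / (token_times[-1] - token_times[0])


def requests_per_second(completed_requests, window_seconds):
    """Request throughput over a measurement window."""
    return completed_requests / window_seconds


def percentile(values, pct):
    """Nearest-rank percentile, e.g. pct=99 for P99."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative data: request sent at t=0.0 s, five tokens arrive afterwards.
token_times = [0.5, 0.75, 1.0, 1.25, 1.5]
latencies_ms = list(range(1, 101))  # 100 made-up request latencies

print(time_to_first_token(0.0, token_times))
print(tokens_per_second(token_times))
print(requests_per_second(120, 60))
print(percentile(latencies_ms, 99))
```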

Managed vs Self-Hosted

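Managed (OpenAI, Anthropic, Hugging Face Inference Endpoints): usage-based pricing, usually per token (dedicated offerings such as Hugging Face Inference Endpoints bill for compute time instead), with no infrastructure to manage, but model choice is limited to the provider's catalog.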

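Self-hosted with vLLM: full control, any open-weights model, and predictable cost at scale, but it requires GPU infrastructure, an on-call rotation, and ops work. As a rough rule of thumb, the economics start to favor self-hosting above ~$10k/month in model API spend; the actual break-even depends on GPU utilization and staffing costs.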
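
The break-even comparison can be sketched as simple arithmetic. Every number below is a placeholder invented for illustration (token volume, price per million tokens, GPU hourly rate, ops overhead), and the function names are hypothetical; plug in your own figures before drawing conclusions.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_api_cost(tokens_per_month, price_per_million_tokens):
    """Cost of a pay-per-token managed API."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def monthly_self_hosted_cost(gpu_count, gpu_hourly_rate, ops_overhead):
    """Cost of always-on GPUs plus a flat monthly ops/on-call overhead."""
    return gpu_count * gpu_hourly_rate * HOURS_PER_MONTH + ops_overhead


# Placeholder scenario, not real pricing.
api = monthly_api_cost(tokens_per_month=2_000_000_000, price_per_million_tokens=6.0)
hosted = monthly_self_hosted_cost(gpu_count=4, gpu_hourly_rate=2.5, ops_overhead=3000.0)

print(f"managed API: ${api:,.0f}/month")
print(f"self-hosted: ${hosted:,.0f}/month")
print("self-hosting is cheaper" if hosted < api else "managed API is cheaper")
```

Note that the self-hosted figure is fixed whether or not the GPUs are busy, which is why low utilization pushes the break-even point higher.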

Local with Ollama: runs on developer hardware with no per-token cost and no network latency, but throughput is limited (typically one concurrent user).

Aggregated via OpenRouter: multiple providers behind one API; convenient, but it adds a small markup and is limited to the providers in its catalog.

Streaming

Production UX almost always requires streaming: showing tokens as they arrive rather than waiting for the full response. Endpoints typically implement this via Server-Sent Events (SSE), and most provider SDKs handle stream parsing for you. Streaming does not shorten total generation time, but it cuts perceived latency for long responses because the user starts reading after the TTFT rather than after the last token.
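
For a sense of what an SDK does under the hood, here is a minimal SSE parsing sketch. It handles only `data:` fields and comment lines, which is a subset of the SSE format. The `token` JSON field and the `[DONE]` end marker are assumptions for this example; some OpenAI-style APIs use a similar sentinel, but check your provider's stream format.

```python
import json


def iter_sse_data(lines):
    """Yield the data payload of each SSE event from an iterable of text lines."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                 # a blank line terminates the event
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
        elif line.startswith(":"):     # comment line, often used as keep-alive
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):  # one leading space is not part of the value
                value = value[1:]
            data_parts.append(value)
    # An event not terminated by a blank line is dropped, as in the SSE spec.


def collect_text(lines):
    """Concatenate streamed tokens until the (assumed) [DONE] sentinel."""
    chunks = []
    for data in iter_sse_data(lines):
        if data == "[DONE]":
            break
        chunks.append(json.loads(data)["token"])  # hypothetical field name
    return "".join(chunks)


sample_stream = [
    'data: {"token": "Hel"}\n', "\n",
    ": keep-alive\n",
    'data: {"token": "lo"}\n', "\n",
    "data: [DONE]\n", "\n",
]
print(collect_text(sample_stream))  # prints "Hello"
```

In a real client the `lines` iterable would be the HTTP response body read line by line, and each chunk would be rendered as soon as it is parsed instead of being collected.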

Context and Batching

Inference endpoints handle batching: grouping concurrent requests for efficient GPU utilization. vLLM's continuous batching admits new requests into the running batch as they arrive, rather than waiting for a full batch to form or for the current batch to finish, which raises throughput and reduces average wait time.
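
The effect can be illustrated with a toy simulation. This is not vLLM's scheduler: it assumes every active request produces exactly one token per step, ignores prefill and memory limits, and models the static variant as "admit whatever has arrived, then run that batch to completion". The `simulate` function and the sample workload are hypothetical.

```python
from collections import deque


def simulate(requests, max_batch, continuous):
    """Return {request_index: finish_step} for a list of
    (arrival_step, tokens_to_generate) tuples."""
    pending = deque(sorted(enumerate(requests), key=lambda item: item[1][0]))
    active = {}   # request index -> tokens still to generate
    finish = {}
    step = 0
    while pending or active:
        # Continuous batching fills free slots every step;
        # the static variant only admits once the batch has drained.
        if continuous or not active:
            while pending and pending[0][1][0] <= step and len(active) < max_batch:
                rid, (_, tokens) = pending.popleft()
                active[rid] = tokens
        # One decode step: each active request emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                finish[rid] = step + 1
                del active[rid]
        step += 1
    return finish


# One long request and two short ones, with two batch slots.
workload = [(0, 8), (0, 2), (1, 2)]
static_finish = simulate(workload, max_batch=2, continuous=False)
continuous_finish = simulate(workload, max_batch=2, continuous=True)

print("static:    ", static_finish)
print("continuous:", continuous_finish)
```

In this workload the third request finishes at step 4 with continuous batching but at step 10 in the static variant, because it no longer has to wait behind the long request occupying the other slot.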

Related Terms

vLLM: open-source engine for building high-throughput inference endpoints
Ollama: local inference endpoint for development
OpenRouter: managed aggregator for multiple inference endpoints
Tokenization: the first processing step on every inference endpoint
Kubernetes: standard orchestration for self-hosted inference endpoint clusters
