Inference Endpoint
A hosted REST API that serves LLM predictions — accepting prompts as input and returning generated text. The bridge between a trained model and production applications.
An inference endpoint is the serving layer for a trained model. After training (or downloading) an LLM, you need infrastructure to accept requests, run the forward pass, and return outputs at scale. That infrastructure, whether it is Hugging Face Inference Endpoints, AWS SageMaker, your own vLLM deployment, or a managed service like OpenAI, is the inference endpoint.
Request Flow
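- Client sends an HTTP POST with the prompt and sampling parameters (temperature, max_tokens)
- Endpoint tokenizes the prompt (see Tokenization)
- Model runs forward passes on GPU/CPU, generating output tokens one at a time
- Output tokens are decoded to text and either stream back via Server-Sent Events (SSE) or return in a single response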
- Client receives generated text
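The client side of this flow can be sketched with the Python standard library alone. This is a minimal illustration, not any provider's actual API: the URL, the JSON field names (prompt, temperature, max_tokens), and the "text" field in the response are all assumptions, and real endpoints define their own schemas and authentication headers.

```python
import json
import urllib.request

# Hypothetical endpoint URL; substitute the address of a real deployment.
ENDPOINT_URL = "http://localhost:8000/generate"


def build_request(prompt: str, temperature: float = 0.7,
                  max_tokens: int = 128) -> urllib.request.Request:
    """Build the HTTP POST carrying the prompt and sampling parameters."""
    payload = {
        "prompt": prompt,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        ENDPOINT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **params) -> str:
    """Send the request and return the generated text.

    Assumes the endpoint answers with a JSON object that has a "text" field.
    """
    with urllib.request.urlopen(build_request(prompt, **params), timeout=30) as resp:
        return json.loads(resp.read())["text"]
```

Tokenization and the forward pass happen on the server, so the client only ever deals with text in and text out.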
Key Metrics
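- Time to First Token (TTFT): latency before streaming starts; affects perceived responsiveness
- Tokens per Second (TPS): generation throughput once streaming begins, measured per request or across the whole endpoint
- Requests per Second (RPS): how many requests the endpoint completes per second under load
- P99 latency: tail latency (99% of requests finish faster than this); critical for SLAs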
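As a rough sketch, the metrics above can be computed from timestamps recorded on the client. The function names and the nearest-rank percentile method are choices made for this example; monitoring tools may define TPS or percentiles slightly differently.

```python
import math


def ttft(request_sent: float, token_times: list[float]) -> float:
    """Time to First Token: delay between sending and the first token arriving."""
    return token_times[0] - request_sent


def tokens_per_second(token_times: list[float]) -> float:
    """Decode throughput: tokens after the first, divided by streaming duration."""
    if len(token_times) < 2:
        return 0.0
    return (len(token_times) - 1) / (token_times[-1] - token_times[0])


def requests_per_second(completed: int, window_seconds: float) -> float:
    """Request throughput over a measurement window."""
    return completed / window_seconds


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct=99 gives P99."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Token arrival times in seconds, relative to a request sent at t=0.
times = [0.5, 1.0, 1.5, 2.0, 2.5]
print(ttft(0.0, times), tokens_per_second(times))
```

P99 is computed over the end-to-end latencies of many requests, for example `percentile(latencies, 99)`.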
Managed vs Self-Hosted
Managed: OpenAI, Anthropic, Hugging Face Inference Endpoints. Pay per token (or per instance-hour for dedicated endpoints), zero infrastructure management, model choice limited to what the provider supports.
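Self-hosted with vLLM: Full control, any open-weight model, and predictable cost at scale, but it requires GPU infrastructure, an on-call rotation, and ops work. As a rough rule of thumb, the economics start to favor self-hosting above ~$10k/month in model API spend, though the break-even point depends on GPU utilization, model size, and staffing costs.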
Local with Ollama: Runs on developer hardware, so there is no per-token cost and no network latency, but throughput is limited (typically 1 concurrent user).
Aggregated via OpenRouter: Multiple providers through one API. Convenient, adds a small markup, and is limited to providers in the catalog.
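The managed versus self-hosted trade-off comes down to simple arithmetic once you have your own numbers. The sketch below shows the shape of that comparison. Every figure in it (token volume, price per million tokens, GPU hourly rate, ops overhead) is a hypothetical placeholder chosen for illustration, not a quoted price from any provider.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def monthly_api_cost(million_tokens: float, price_per_million: float) -> float:
    """Managed API: cost scales linearly with token volume."""
    return million_tokens * price_per_million


def monthly_self_hosted_cost(gpu_count: int, gpu_hourly_rate: float,
                             ops_overhead: float) -> float:
    """Self-hosted: GPUs are billed around the clock, plus a fixed ops cost."""
    return gpu_count * gpu_hourly_rate * HOURS_PER_MONTH + ops_overhead


# Hypothetical inputs; replace with measured usage and real quotes.
api = monthly_api_cost(million_tokens=2_000, price_per_million=5.0)
hosted = monthly_self_hosted_cost(gpu_count=4, gpu_hourly_rate=2.0,
                                  ops_overhead=3_000.0)
print(f"API: ${api:,.0f}/month, self-hosted: ${hosted:,.0f}/month")
```

The self-hosted figure is fixed regardless of traffic, so it only wins when the GPUs are kept busy. At low utilization the per-token API is usually cheaper.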
Streaming
Production UX almost always calls for streaming: showing tokens as they arrive rather than waiting for the full response. Endpoints typically implement this via Server-Sent Events (SSE), and most provider SDKs parse the stream for you. Streaming does not shorten total generation time, but it sharply reduces perceived latency for long responses because the user can start reading after the first token.
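For readers who want to see what an SDK does under the hood, here is a minimal SSE parser sketch. It handles only `data:` fields and comment lines, which is a subset of the SSE format. The JSON payload shape with a "token" field is an assumption for this example, and the `[DONE]` sentinel is a convention used by some OpenAI-style APIs rather than part of SSE itself.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete SSE event."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            # Comment line, often used as a keep-alive ping.
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)


def stream_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from a stream of hypothetical {"token": ...} events."""
    for payload in iter_sse_data(lines):
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        yield json.loads(payload)["token"]


# Simulated response body, one list item per line received.
body = ['data: {"token": "Hel"}', "", ": ping",
        'data: {"token": "lo"}', "", "data: [DONE]", ""]
print("".join(stream_tokens(body)))  # Hello
```

In a real client, `lines` would be the decoded lines of the HTTP response body, consumed as they arrive so each fragment can be rendered immediately.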
Context and Batching
Inference endpoints handle batching: grouping concurrent requests for efficient GPU utilization. vLLM's continuous batching schedules work at the level of individual decoding steps. New requests join the running batch as they arrive and finished sequences leave it immediately, rather than waiting for a full batch to form or complete. This improves throughput and reduces average wait time.
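The difference between the two scheduling policies can be illustrated with a toy simulation. This is a simplified model written for this entry, not vLLM's actual scheduler: it assumes every active sequence produces exactly one token per step, ignores prompt processing and memory limits, and models static batching as "admit a batch only when the GPU is idle, then run it to completion".

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival: int  # step at which the request arrives
    tokens: int   # number of output tokens to generate


def simulate(requests: list[Request], max_batch: int, continuous: bool) -> list[int]:
    """Return the step at which each request finishes, in input order."""
    pending = sorted(range(len(requests)), key=lambda i: requests[i].arrival)
    remaining: dict[int, int] = {}  # request index -> tokens still to generate
    finished = [0] * len(requests)
    step = 0
    while pending or remaining:
        # Continuous batching may admit at every step; static batching
        # admits only when the previous batch has fully drained.
        can_admit = continuous or not remaining
        while (can_admit and pending and len(remaining) < max_batch
               and requests[pending[0]].arrival <= step):
            i = pending.pop(0)
            remaining[i] = requests[i].tokens
        # One decoding step: every active sequence emits one token.
        step += 1
        for i in list(remaining):
            remaining[i] -= 1
            if remaining[i] == 0:
                finished[i] = step
                del remaining[i]
    return finished


requests = [Request(0, 8), Request(0, 2), Request(1, 2)]
print(simulate(requests, max_batch=2, continuous=False))  # [8, 2, 10]
print(simulate(requests, max_batch=2, continuous=True))   # [8, 2, 4]
```

In this example the third request finishes at step 4 instead of step 10 under continuous batching, because it takes over the slot freed by the short second request instead of waiting for the long first request to finish.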
Related Terms
- vLLM: open-source engine for building high-throughput inference endpoints
- Ollama: local inference endpoint for development
- OpenRouter: managed aggregator for multiple inference endpoints
- Tokenization: the first processing step on every inference endpoint
- Kubernetes: standard orchestration for self-hosted inference endpoint clusters