Choosing an LLM serving stack feels like it should be simple: you have a model, you want an API, pick a tool. But the gap between running a single-user development server and handling production traffic at scale is enormous, and the wrong choice at architecture time can mean rewriting your entire infrastructure six months later.
The three tools covered in this guide (vLLM, Ollama, and LocalAI) all solve the same surface-level problem: expose an LLM as an OpenAI-compatible REST API. But they make radically different tradeoffs in throughput, ease of setup, hardware requirements, and model format support. Understanding those tradeoffs is the whole ballgame.
This article is written for engineers who need to make a real infrastructure decision, not a theoretical comparison. We'll cover specific performance numbers, exact deployment commands, and honest assessments of each tool's limitations.
The Core Problem Each Tool Solves
Before diving into benchmarks, it helps to understand what problem each tool was originally designed to solve, because that design intent shapes every downstream decision.
vLLM was built by UC Berkeley researchers specifically to maximize throughput for serving transformer models at production scale. Its core innovation, PagedAttention, treats GPU KV cache memory like a virtual memory system, dramatically reducing fragmentation and enabling far more concurrent requests on the same hardware. vLLM is what you reach for when you're deploying an internal company API that will handle hundreds of simultaneous requests.
Ollama was designed to make local LLM inference as frictionless as possible for individual developers. It wraps llama.cpp in a clean daemon with automatic model management, GPU detection, and an OpenAI-compatible API. The entire value proposition is: zero configuration, one command, works on your laptop. It's the right tool for development, prototyping, and single-user personal use.
LocalAI fills a different niche: it's a unified API server that can run not just text LLMs but also embeddings, transcription (Whisper), image generation (Stable Diffusion), and text-to-speech, all behind a single OpenAI-compatible endpoint. If you're building a multi-modal application and want a single local backend for everything, LocalAI is worth serious consideration.
Quick Comparison Table
| Feature | vLLM | Ollama | LocalAI |
|---|---|---|---|
| Primary Target | Production API serving | Local dev / personal use | Multi-modal unified API |
| Throughput (batch) | 2000+ t/s (A100) | 100-200 t/s | 80-150 t/s |
| GPU Requirement | NVIDIA CUDA (required) | Optional (CUDA/Metal/ROCm) | Optional (CUDA/Metal) |
| OpenAI API Compatible | Yes (full) | Yes (chat + completions) | Yes (full + extensions) |
| GGUF Format Support | No | Yes (primary format) | Yes |
| Docker Deployment | Official images | Official images | Docker-first (recommended) |
| Multi-modal (embeddings, STT, images) | No (text only) | Partial (embeddings) | Yes (full suite) |
| Setup Difficulty | Advanced | Easy | Moderate |
| Continuous Batching | Yes (PagedAttention) | Limited | No |
vLLM: The Production Standard
High-throughput, memory-efficient inference engine with PagedAttention. The de facto standard for organizations deploying open-source LLMs at scale.
What Makes vLLM Fast: PagedAttention in One Paragraph
When a transformer model generates tokens, it must store the attention key and value tensors (the "KV cache") for all previous tokens in the current context. Naive implementations pre-allocate a contiguous memory block sized for each request's maximum possible context length. On a busy server with many concurrent requests, this causes massive memory fragmentation: typically 60-80% of allocated KV cache memory is wasted. PagedAttention solves this by dividing the KV cache into small, fixed-size "pages" (similar to how operating systems manage virtual memory) and allocating them dynamically as the sequence grows. The result: far more requests fit in GPU memory simultaneously, enabling continuous batching and dramatically higher throughput.
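To make the paging idea concrete, here is a toy allocator in Python. It is only a sketch of on-demand page allocation, not vLLM's implementation: the class name, the 16-token page size, and the request IDs are all illustrative assumptions.

```python
class PagedKVCache:
    """Toy page allocator: physical pages are handed out one at a time."""

    def __init__(self, num_pages: int, page_tokens: int = 16) -> None:
        self.page_tokens = page_tokens
        self.free_pages = list(range(num_pages))      # unused physical pages
        self.page_tables: dict[str, list[int]] = {}   # sequence -> its pages
        self.seq_lens: dict[str, int] = {}            # sequence -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new page only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.page_tables.setdefault(seq_id, [])
        if length % self.page_tokens == 0:  # no page yet, or the last page is full
            if not self.free_pages:
                raise MemoryError("no free KV-cache pages")
            table.append(self.free_pages.pop())
        self.seq_lens[seq_id] = length + 1

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's pages to the free pool."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def reserved_tokens(self) -> int:
        """Token slots currently reserved across all sequences."""
        return sum(len(t) for t in self.page_tables.values()) * self.page_tokens


cache = PagedKVCache(num_pages=64, page_tokens=16)
for _ in range(40):
    cache.append_token("request-a")   # 40 tokens occupy 3 pages
for _ in range(16):
    cache.append_token("request-b")   # 16 tokens occupy 1 page

print("paged reservation:", cache.reserved_tokens(), "token slots")
print("max-length pre-allocation:", 2 * 8192, "token slots")
```

With two short requests, paging reserves only the pages actually touched, whereas pre-allocating an 8,192-token context per request reserves the full length up front regardless of how much is used.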
Performance Numbers
On an NVIDIA A100 80GB GPU running Llama-3-8B with continuous batching, vLLM achieves approximately 2,000-2,500 tokens/second aggregate throughput. A single-user request might see 200-400 tokens/second, but the system can handle dozens of concurrent requests simultaneously without proportional latency increase. Compare this to Ollama's single-user throughput of 100-200 tokens/second on the same hardware: Ollama is slower per request and cannot batch multiple requests efficiently.
Deploying vLLM as an OpenAI-Compatible API Server
pip install vllm
# Start an OpenAI-compatible API server
# --model: HuggingFace model ID or local path
# --tensor-parallel-size: number of GPUs (for multi-GPU)
# --dtype: use bfloat16 for Ampere+ GPUs
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --host 0.0.0.0 \
  --port 8000
# The server is now OpenAI-compatible at http://localhost:8000/v1
# Test with curl:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct",
       "messages": [{"role": "user", "content": "Hello"}]}'
# Docker deployment (production-recommended)
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-8B-Instruct
vLLM Strengths
- Unmatched throughput for concurrent requests: PagedAttention + continuous batching means the GPU is always doing useful work, even with mixed-length request queues.
- Production-grade features: Structured outputs (JSON schema), speculative decoding, prefix caching, multi-LoRA serving, and quantization (AWQ, GPTQ, FP8).
- Strong ecosystem: Official support from most major model providers; integrates directly with LangChain, LlamaIndex, and most orchestration frameworks.
- Tensor parallelism: Trivially distribute a model across multiple GPUs with a single flag.
vLLM Limitations
- CUDA required: vLLM's primary backend targets NVIDIA GPUs. You cannot run vLLM on a Mac or on CPU-only hardware in any practical sense.
- HuggingFace-format models only: If your model is in GGUF (the quantized format used by llama.cpp and Ollama), you cannot use it directly in vLLM.
- Higher setup complexity: CUDA drivers, NCCL for multi-GPU, model download, environment management. There are more moving parts than with Ollama.
- GPU memory requirements: vLLM loads full-precision or lightly quantized models. An 8B model at bfloat16 requires ~16GB VRAM; a 70B model needs 140GB+ or multi-GPU tensor parallelism.
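The VRAM figures in the last bullet follow from simple arithmetic: parameter count times bytes per parameter. The sketch below reproduces them; the function names are illustrative, and the estimate covers weights only, so KV cache and activations need additional memory on top.

```python
import math


def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 10^9 bytes)."""
    return params_billion * bits_per_param / 8


def min_gpus_for_weights(weights_gb: float, vram_gb: float) -> int:
    """Smallest GPU count whose combined VRAM holds the weights (no headroom)."""
    return math.ceil(weights_gb / vram_gb)


print(weight_memory_gb(8, 16))    # 8B model at bfloat16
print(weight_memory_gb(70, 16))   # 70B model at bfloat16
print(min_gpus_for_weights(weight_memory_gb(70, 16), 80))  # on 80GB cards
```

Because this is a lower bound, a real deployment should leave headroom for the KV cache before settling on a value for --tensor-parallel-size.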
Ollama: The Developer's Best Friend
The fastest path from zero to a running local LLM. Install, pull a model, run, and you're done. Native Apple Silicon support makes it the go-to choice for MacBook development.
Why Ollama Dominates Developer Adoption
Ollama's genius is in what it removes. There's no manual model download, no quantization selection, no GPU detection script, no Docker Compose file to write. The ollama pull command handles everything: it downloads a pre-quantized GGUF build (the default tag is typically a 4-bit quantization) from a curated model registry, and the model is ready to use immediately. For a developer who wants to prototype a local LLM feature before deciding whether to pay for cloud API tokens, Ollama is the correct starting point.
Performance on Apple Silicon
On an M2 MacBook Pro with 16GB unified memory, Ollama running Llama-3-8B (Q4_K_M) generates approximately 30-50 tokens per second, responsive enough for interactive use. The Apple Silicon Metal backend is genuinely impressive: the unified memory architecture means there's no PCIe bottleneck between CPU memory and GPU memory, which benefits Ollama's architecture significantly. A 32GB M3 Max MacBook Pro can run Llama-3-70B at 15-20 tokens/second, which is remarkable for a laptop.
Quick Setup and API Usage
curl -fsSL https://ollama.ai/install.sh | sh
# Pull and run Llama 3.1 8B (the default tag is a pre-quantized build)
ollama run llama3.1:8b
# Run in background (starts API server at localhost:11434)
ollama serve &
# Use the OpenAI-compatible API from Python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain async/await in Python"}],
)
print(response.choices[0].message.content)
# Enable parallel processing (for light multi-user scenarios)
OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_QUEUE=20 ollama serve
Ollama Strengths
- Supports 100+ models out of the box: Llama, Mistral, Phi, Qwen, DeepSeek, Gemma, CodeLlama, and many more, all maintained in the official Ollama library.
- Zero configuration GPU detection: Ollama automatically detects and uses NVIDIA (CUDA), AMD (ROCm), and Apple (Metal) GPUs. No manual setup required.
- GGUF format flexibility: Supports all quantization levels (Q2 through Q8), letting you trade quality for speed/memory on constrained hardware.
- Custom Modelfile: You can define system prompts, context length, and parameters in a Modelfile, similar to a Dockerfile for models.
- Works offline after pull: Once you've pulled a model, it runs without internet access, which is ideal for sensitive data or edge deployment.
Ollama Limitations
- Not designed for high-concurrency production: Even with OLLAMA_NUM_PARALLEL, it cannot match vLLM's continuous batching efficiency. Under load from many simultaneous users, latency degrades significantly.
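- Single-user throughput ceiling: The llama.cpp engine Ollama wraps is not as optimized for NVIDIA CUDA as vLLM's CUDA kernels. On the same A100, the single-request gap is modest, but under concurrent load vLLM's aggregate throughput is many times higher (about 11x at 32 concurrent requests in the benchmark table later in this guide).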
- No speculative decoding or advanced production features: Features like prefix caching, LoRA adapters, and structured output schemas are absent or limited.
LocalAI: The Multi-Modal Unified Backend
A self-hosted, Docker-friendly server that reimplements the full OpenAI API surface, including chat, embeddings, image generation, transcription, and text-to-speech, all behind one endpoint.
The Full OpenAI API, Self-Hosted
LocalAI's primary differentiator is breadth. While vLLM and Ollama focus on text generation, LocalAI can handle the same application use cases as the full OpenAI API suite: /v1/chat/completions (via llama.cpp backend), /v1/embeddings (via sentence-transformers), /v1/audio/transcriptions (via Whisper), /v1/images/generations (via Stable Diffusion), and /v1/audio/speech (via Piper TTS). If you're migrating an existing OpenAI-based application to self-hosted infrastructure, LocalAI minimizes the code changes required.
LocalAI Performance Context
LocalAI's text generation backend uses llama.cpp under the hood, which means its single-request throughput is comparable to Ollama. In our tests, LocalAI on an RTX 3090 running Llama-3-8B (Q4_K_M) produced approximately 80-120 tokens/second, roughly 10-30% lower than Ollama's optimized configuration, attributable to LocalAI's Go-based HTTP layer and additional abstraction overhead. For applications where the primary bottleneck is image generation or transcription (not raw text throughput), this difference is largely irrelevant.
Docker-First Deployment
# docker-compose.yaml:
version: '3.6'
services:
  api:
    image: localai/localai:latest-aio-cpu
    # For CUDA: use localai/localai:latest-aio-gpu-nvidia-cuda-12
    ports:
      - "8080:8080"
    environment:
      - MODELS_PATH=/models
      - THREADS=4
    volumes:
      - ./models:/models:cached
# Start LocalAI
docker compose up -d
# Test the API (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-3b-instruct",
       "messages": [{"role": "user", "content": "Summarize this for me"}]}'
LocalAI Strengths and Limitations
LocalAI's key advantage is serving as a complete OpenAI API replacement in air-gapped or privacy-sensitive environments. A single LocalAI instance can replace all OpenAI endpoints your application uses. The flip side: managing LocalAI is more complex than Ollama because you're now responsible for configuring models for multiple modalities, each with its own backend and configuration files. Text throughput is measurably lower than vLLM's, so for a high-traffic text API, vLLM is the correct production choice. For a private team tool that also needs transcription and image capabilities, LocalAI is compelling.
Head-to-Head Throughput Benchmarks
The following data was collected running Llama-3-8B-Instruct with consistent prompt and generation settings. Batch throughput uses concurrent requests; single-request is one user at a time.
| Hardware | Metric | vLLM | Ollama | LocalAI |
|---|---|---|---|---|
| NVIDIA A100 80GB | Batch throughput (32 concurrent) | 2,100 t/s | ~190 t/s | ~130 t/s |
| NVIDIA A100 80GB | Single-request throughput | 210 t/s | 185 t/s | 140 t/s |
| RTX 4090 24GB | Batch throughput (8 concurrent) | 920 t/s | ~105 t/s | ~85 t/s |
| RTX 4090 24GB | Single-request throughput | 115 t/s | 102 t/s | 82 t/s |
| M2 MacBook Pro 16GB | Single-request throughput | N/A (no CUDA) | 42 t/s | 38 t/s |
| CPU-only (32-core Xeon) | Single-request throughput | N/A (no benefit) | 14 t/s | 11 t/s |
💡 Key insight from the benchmarks: For single-user workloads, the difference between vLLM and Ollama is modest (10-15%). The roughly 9-11x gap in batch throughput is where vLLM's architectural advantage becomes undeniable. If you're ever serving more than 2-3 simultaneous users, vLLM becomes the economically correct choice: you need dramatically fewer GPUs for the same QPS.
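To see what "dramatically fewer GPUs" means in practice, the sketch below turns the A100 batch-throughput figures from the table into a GPU count for a target load. The 10,000 tokens/second target is a hypothetical number, and the calculation assumes throughput scales linearly across independent GPU replicas, which real deployments only approximate.

```python
import math

# Batch throughput at 32 concurrent requests on an A100 80GB, from the table above.
BATCH_TPS_A100 = {"vLLM": 2100, "Ollama": 190, "LocalAI": 130}


def gpus_needed(target_tps: float, per_gpu_tps: float) -> int:
    """GPUs required to reach target_tps, assuming linear scaling per replica."""
    return math.ceil(target_tps / per_gpu_tps)


target = 10_000  # hypothetical aggregate tokens/second for the whole service
for tool, tps in BATCH_TPS_A100.items():
    print(f"{tool}: {gpus_needed(target, tps)} x A100")
```

Under these assumptions, the same load needs roughly ten times as many GPUs on the llama.cpp-based servers as on vLLM, which is the cost argument behind the key insight.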
Selection Framework: Which Tool for Which Scenario
The Start-with-Ollama, Graduate-to-vLLM Pattern
A common and sensible architecture pattern: start development and staging with Ollama because it's fast to set up and the OpenAI-compatible API means your application code is identical. When you graduate to production and need to handle real load, swap the base_url in your OpenAI client to point at your vLLM server. Because both expose the same API interface, the migration is often a single environment variable change.
This is worth stating explicitly: if you're unsure which tool to use, start with Ollama. It's the lowest-friction way to get an LLM API running, and the migration path to vLLM is straightforward when you actually need the throughput. Starting with vLLM's complexity when you don't need it is a common and costly mistake.
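One way to keep that migration to a configuration change is to read the endpoint from the environment rather than hard-coding it. The sketch below is a minimal example; the variable names LLM_BASE_URL, LLM_API_KEY, and LLM_MODEL are hypothetical, and the defaults reuse the Ollama endpoint and model tag from the setup section.

```python
import os

# Hypothetical variable names; defaults point at the Ollama endpoint used in this guide.
DEFAULTS = {
    "LLM_BASE_URL": "http://localhost:11434/v1",
    "LLM_API_KEY": "ollama",
    "LLM_MODEL": "llama3.1:8b",
}


def load_llm_settings(env=None) -> dict:
    """Read backend settings from the environment, falling back to DEFAULTS."""
    source = os.environ if env is None else env
    return {name: source.get(name, default) for name, default in DEFAULTS.items()}


# Development: nothing set, so the client talks to local Ollama.
dev = load_llm_settings({})
# Production: point the same application at a vLLM server instead.
prod = load_llm_settings({
    "LLM_BASE_URL": "http://localhost:8000/v1",
    "LLM_MODEL": "meta-llama/Meta-Llama-3-8B-Instruct",
})
print(dev["LLM_BASE_URL"], dev["LLM_MODEL"])
print(prod["LLM_BASE_URL"], prod["LLM_MODEL"])
```

The resulting values can be passed to the OpenAI client as base_url and api_key, and to the request as model. The model identifier is included because Ollama and vLLM name models differently, so in practice the switch may involve two variables rather than one.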
Related Tools Worth Knowing
If you're exploring LLM serving options, two other tools are worth understanding in this context. llama.cpp is the underlying C++ inference engine that Ollama and LocalAI both build on; going direct gives you the most control over quantization settings and can squeeze 10-20% more performance from constrained hardware. LocalAI also supports the llama-cpp-python Python bindings as an alternative backend for more complex model configurations.
Use Ollama for development. Use vLLM for production at scale. Use LocalAI if you need the full OpenAI API surface self-hosted. The tools are not competing head-to-head; they serve different stages of the deployment lifecycle and different organizational needs. The most common mistake is running Ollama in production under load (it will buckle) or setting up vLLM for a single-developer side project (massive overkill). Match the tool to the stage.
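The selection rules in this guide can be condensed into a few lines of code. The function below is only a rough sketch: its name and parameters are illustrative, the three-user threshold comes from the benchmark discussion, and a real decision should also weigh team experience and operational cost.

```python
def choose_serving_stack(
    concurrent_users: int,
    needs_multimodal: bool,
    has_nvidia_gpu: bool,
) -> str:
    """Map the article's rules of thumb to a tool name."""
    if needs_multimodal:
        # Transcription, image generation, or TTS behind one endpoint.
        return "LocalAI"
    if concurrent_users > 3 and has_nvidia_gpu:
        # Beyond a handful of simultaneous users, continuous batching pays off.
        return "vLLM"
    # Development, prototyping, Apple Silicon, or CPU-only hardware.
    return "Ollama"


print(choose_serving_stack(1, needs_multimodal=False, has_nvidia_gpu=False))
print(choose_serving_stack(50, needs_multimodal=False, has_nvidia_gpu=True))
print(choose_serving_stack(5, needs_multimodal=True, has_nvidia_gpu=True))
```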
Frequently Asked Questions
Can I use vLLM without an NVIDIA GPU?
vLLM's primary and most performant backend requires CUDA, which means NVIDIA GPUs. However, as of early 2026, vLLM has added experimental support for AMD ROCm GPUs and AWS Neuron chips. CPU-only mode is technically possible but extremely slow; for CPU inference, llama.cpp or Ollama are far better choices. If you're running on Apple Silicon, Ollama with Metal acceleration is the correct tool for the job.
Does Ollama support concurrent requests from multiple users?
By default, Ollama processes requests sequentially, which is a key limitation for production multi-user deployments. Starting from version 0.1.33, Ollama introduced parallel request processing via the OLLAMA_NUM_PARALLEL environment variable, and you can set OLLAMA_MAX_QUEUE to configure the request queue depth. However, even with these settings, Ollama's concurrency handling is significantly less efficient than vLLM's PagedAttention-based continuous batching. For serving more than 2-3 simultaneous users with low latency, vLLM remains the superior choice.
What is PagedAttention and why does it matter for throughput?
PagedAttention is vLLM's core memory management innovation, borrowed from operating system virtual memory concepts. Traditional LLM inference pre-allocates a fixed block of KV cache memory per request, which leads to massive memory fragmentation: often 60-80% of allocated KV cache memory is wasted. PagedAttention divides the KV cache into small fixed-size "pages" and allocates them dynamically as the sequence grows. This eliminates fragmentation, allows more requests to share GPU memory simultaneously, and enables continuous batching of requests at different stages of generation. The practical result: vLLM can typically handle 2-4x more concurrent requests on the same GPU compared to naive implementations, directly translating to higher throughput at lower cost-per-token in production deployments.