Choosing an LLM serving stack feels like it should be simple: you have a model, you want an API, pick a tool. But the gap between running a single-user development server and handling production traffic at scale is enormous, and the wrong choice at architecture time can mean rewriting your entire infrastructure six months later.

The three tools covered in this guide (vLLM, Ollama, and LocalAI) all solve the same surface-level problem: expose an LLM as an OpenAI-compatible REST API. But they make radically different tradeoffs in throughput, ease of setup, hardware requirements, and model format support. Understanding those tradeoffs is the whole ballgame.

This article is written for engineers who need to make a real infrastructure decision, not a theoretical comparison. We'll cover specific performance numbers, exact deployment commands, and honest assessments of each tool's limitations.

The Core Problem Each Tool Solves

Before diving into benchmarks, it helps to understand what problem each tool was originally designed to solve, because that design intent shapes every downstream decision.

vLLM was built by UC Berkeley researchers specifically to maximize throughput for serving transformer models at production scale. Its core innovation, PagedAttention, treats GPU KV cache memory like a virtual memory system, dramatically reducing fragmentation and enabling far more concurrent requests on the same hardware. vLLM is what you reach for when you're deploying an internal company API that will handle hundreds of simultaneous requests.

Ollama was designed to make local LLM inference as frictionless as possible for individual developers. It wraps llama.cpp in a clean daemon with automatic model management, GPU detection, and an OpenAI-compatible API. The entire value proposition is: zero configuration, one command, works on your laptop. It's the right tool for development, prototyping, and single-user personal use.

LocalAI fills a different niche: it's a unified API server that can run not just text LLMs but also embeddings, transcription (Whisper), image generation (Stable Diffusion), and text-to-speech, all behind a single OpenAI-compatible endpoint. If you're building a multi-modal application and want a single local backend for everything, LocalAI is worth serious consideration.

Quick Comparison Table

Feature | vLLM | Ollama | LocalAI
Primary Target | Production API serving | Local dev / personal use | Multi-modal unified API
Throughput (batch) | 2000+ t/s (A100) | 100–200 t/s | 80–150 t/s
GPU Requirement | NVIDIA CUDA (required) | Optional (CUDA/Metal/ROCm) | Optional (CUDA/Metal)
OpenAI API Compatible | Yes (full) | Yes (chat + completions) | Yes (full + extensions)
GGUF Format Support | No | Yes (primary format) | Yes
Docker Deployment | Official images | Official images | Docker-first (recommended)
Multi-modal (embeddings, STT, images) | No (text only) | Partial (embeddings) | Yes (full suite)
Setup Difficulty | Advanced | Easy | Moderate
Continuous Batching | Yes (PagedAttention) | Limited | No

vLLM: The Production Standard

⚡ vLLM: Best for Production API Serving

High-throughput, memory-efficient inference engine with PagedAttention. The de facto standard for organizations deploying open-source LLMs at scale.

What Makes vLLM Fast: PagedAttention in One Paragraph

When a transformer model generates tokens, it must keep the attention keys and values (the "KV cache") for all previous tokens in the current context. Naive implementations pre-allocate a contiguous memory block sized for each request's maximum possible context length. On a busy server with many concurrent requests, most of that reservation goes unused and memory fragments: typically 60–80% of allocated KV cache memory is wasted. PagedAttention solves this by dividing the KV cache into small, fixed-size "pages" (similar to how operating systems manage virtual memory) and allocating them on demand as the sequence grows. The result: far more requests fit in GPU memory simultaneously, which lets continuous batching keep larger batches in flight and delivers dramatically higher throughput.
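To make the memory argument concrete, here is a minimal sketch that compares the two allocation strategies on a made-up workload. The sequence lengths, the 2,048-token maximum, and the 16-token page size are all assumptions chosen for illustration; this is a back-of-the-envelope model, not vLLM's actual allocator.

```python
import math

def contiguous_waste(seq_lens, max_len):
    """Fraction of reserved KV-cache slots left unused when every
    request reserves max_len token slots up front."""
    reserved = len(seq_lens) * max_len
    used = sum(seq_lens)
    return 1 - used / reserved

def paged_waste(seq_lens, page_size):
    """Fraction of reserved slots left unused when pages of page_size
    tokens are allocated on demand. Only the last page of each
    sequence can be partly empty."""
    reserved = sum(math.ceil(n / page_size) * page_size for n in seq_lens)
    used = sum(seq_lens)
    return 1 - used / reserved

# Hypothetical workload: final sequence lengths (prompt + output) in tokens.
seq_lens = [300, 1200, 75, 2048, 640, 90, 1500, 410]
MAX_LEN = 2048   # assumed per-request maximum context length
PAGE_SIZE = 16   # assumed page size in tokens

print(f"contiguous: {contiguous_waste(seq_lens, MAX_LEN):.1%} wasted")
print(f"paged:      {paged_waste(seq_lens, PAGE_SIZE):.1%} wasted")
```

With paging, the waste per sequence is bounded by one partially filled page, regardless of how far the sequence falls short of the maximum context length.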

Performance Numbers

On an NVIDIA A100 80GB GPU running Llama-3-8B with continuous batching, vLLM achieves approximately 2,000–2,500 tokens/second aggregate throughput. A single-user request might see 200–400 tokens/second, but the system can handle dozens of concurrent requests simultaneously without a proportional increase in latency. Compare this to Ollama's single-user throughput of 100–200 tokens/second on the same hardware: Ollama is slower per request and cannot batch multiple requests efficiently.

Deploying vLLM as an OpenAI-Compatible API Server

# Install vLLM (requires CUDA 11.8+ and Python 3.9+)
pip install vllm

# Start an OpenAI-compatible API server
# --model: HuggingFace model ID or local path
# --tensor-parallel-size: number of GPUs (for multi-GPU)
# --dtype: use bfloat16 for Ampere+ GPUs
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --host 0.0.0.0 \
  --port 8000

# The server is now OpenAI-compatible at http://localhost:8000/v1
# Test with curl:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct",
     "messages": [{"role": "user", "content": "Hello"}]}'

# Docker deployment (production-recommended)
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-8B-Instruct

vLLM Strengths

  • Unmatched throughput for concurrent requests: PagedAttention + continuous batching means the GPU is always doing useful work, even with mixed-length request queues.
  • Production-grade features: Structured outputs (JSON schema), speculative decoding, prefix caching, multi-LoRA serving, and quantization (AWQ, GPTQ, FP8).
  • Strong ecosystem: Official support from most major model providers; integrates directly with LangChain, LlamaIndex, and most orchestration frameworks.
  • Tensor parallelism: Trivially distribute a model across multiple GPUs with a single flag.

vLLM Limitations

  • NVIDIA-centric hardware support: The primary and best-performing backend requires CUDA. You cannot run vLLM on a Mac or on CPU-only hardware in any practical sense.
  • HuggingFace-format models only: If your model is in GGUF (the quantized format used by llama.cpp and Ollama), you cannot use it directly in vLLM.
  • Higher setup complexity: CUDA drivers, NCCL for multi-GPU, model download, environment management. There are more moving parts than with Ollama.
  • GPU memory requirements: vLLM loads full-precision or lightly quantized models. An 8B model at bfloat16 requires ~16GB VRAM; a 70B model needs 140GB+ or multi-GPU tensor parallelism.
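The VRAM figures in the last bullet follow from simple arithmetic: parameter count times bytes per weight. A minimal sketch, assuming weights only (KV cache, activations, and runtime overhead come on top) and using a hypothetical helper name:

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for model weights only, in GB (10^9 bytes).
    Excludes KV cache, activations, and runtime overhead."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * bytes_per_weight

for name, params in [("8B", 8), ("70B", 70)]:
    for label, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{name} at {label}: ~{weight_memory_gb(params, bits):.0f} GB")
```

The same calculation shows why quantization matters on constrained hardware: halving the bits per weight halves the memory needed for the weights.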

Ollama: The Developer's Best Friend

🦙 Ollama: Best for Local Development & Mac Inference

The fastest path from zero to a running local LLM. Install, pull a model, run, done. Native Apple Silicon support makes it the go-to choice for MacBook development.

Why Ollama Dominates Developer Adoption

Ollama's genius is in what it removes. There's no manual model download, no quantization selection, no GPU detection script, no Docker-compose file to write. The ollama pull command handles everything: it downloads a pre-quantized GGUF build from a curated model registry (the default tag is typically a 4-bit quantization), and the model is ready to use immediately. For a developer who wants to prototype a local LLM feature before deciding whether to pay for cloud API tokens, Ollama is the correct starting point.

Performance on Apple Silicon

On an M2 MacBook Pro with 16GB unified memory, Ollama running Llama-3-8B (Q4_K_M) generates approximately 30–50 tokens per second, which is responsive enough for interactive use. The Apple Silicon Metal backend is genuinely impressive: the unified memory architecture means there's no PCIe bottleneck between CPU memory and GPU memory, which benefits Ollama's architecture significantly. Higher-memory M-series machines can also run Llama-3-70B, which is remarkable for a laptop, but the 4-bit weights alone take roughly 40GB, so this needs well over 32GB of unified memory and runs noticeably slower than the 8B model.

Quick Setup and API Usage

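# Install Ollama on Linux (on macOS, install the desktop app from ollama.com instead)
curl -fsSL https://ollama.com/install.sh | sh

# Start the API server at localhost:11434 if it is not already running
# (the Linux installer and the macOS app normally start it for you)
ollama serve &

# Pull and chat with Llama 3.1 8B (the default tag is a pre-quantized build)
ollama run llama3.1:8b

# Use the OpenAI-compatible API from Python
# (save the lines below as a .py file; requires: pip install openai)
from openai import OpenAI
# The client requires an api_key value, but Ollama ignores it
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
  model="llama3.1:8b",
  messages=[{"role": "user", "content": "Explain async/await in Python"}]
)
print(response.choices[0].message.content)

# Enable parallel processing (for light multi-user scenarios)
# Stop any running server first, then restart it with these settings
OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_QUEUE=20 ollama serve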

Ollama Strengths

  • Supports 100+ models out of the box: Llama, Mistral, Phi, Qwen, DeepSeek, Gemma, CodeLlama, and many more, all maintained in the official Ollama library.
  • Zero configuration GPU detection: Ollama automatically detects and uses NVIDIA (CUDA), AMD (ROCm), and Apple (Metal) GPUs. No manual setup required.
  • GGUF format flexibility: Supports all quantization levels (Q2 through Q8), letting you trade quality for speed/memory on constrained hardware.
  • Custom Modelfile: You can define system prompts, context length, and parameters in a Modelfile, similar to a Dockerfile for models.
  • Works offline after pull: Once you've pulled a model, it runs without internet access, which is ideal for sensitive data or edge deployment.

Ollama Limitations

  • Not designed for high-concurrency production: Even with OLLAMA_NUM_PARALLEL, it cannot match vLLM's continuous batching efficiency. Under load from many simultaneous users, latency degrades significantly.
  • Lower throughput ceiling on NVIDIA GPUs: The llama.cpp engine Ollama wraps is not as optimized for NVIDIA CUDA as vLLM's CUDA kernels. On the same A100, the gap is modest for a single request but grows to roughly 11x in aggregate throughput at 32 concurrent requests (see the benchmark table below).
  • No speculative decoding or advanced production features: Features like prefix caching, LoRA adapters, and structured output schemas are absent or limited.

LocalAI: The Multi-Modal Unified Backend

🌐 LocalAI: Best for Multi-Modal OpenAI API Replacement

A self-hosted, Docker-friendly server that reimplements the full OpenAI API surface (chat, embeddings, image generation, transcription, and text-to-speech) behind one endpoint.

The Full OpenAI API, Self-Hosted

LocalAI's primary differentiator is breadth. While vLLM and Ollama focus on text generation, LocalAI can handle the same application use cases as the full OpenAI API suite: /v1/chat/completions (via llama.cpp backend), /v1/embeddings (via sentence-transformers), /v1/audio/transcriptions (via Whisper), /v1/images/generations (via Stable Diffusion), and /v1/audio/speech (via Piper TTS). If you're migrating an existing OpenAI-based application to self-hosted infrastructure, LocalAI minimizes the code changes required.
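The endpoint-to-backend mapping above can be captured as a small lookup table, which is handy when pointing several parts of an application at one LocalAI instance. This is only a sketch: the base URL assumes LocalAI is listening on port 8080, and the task names and helper function are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "http://localhost:8080"  # assumed LocalAI address

# Endpoint paths and backends as listed in the paragraph above.
ENDPOINTS = {
    "chat": ("/v1/chat/completions", "llama.cpp"),
    "embeddings": ("/v1/embeddings", "sentence-transformers"),
    "transcription": ("/v1/audio/transcriptions", "Whisper"),
    "image": ("/v1/images/generations", "Stable Diffusion"),
    "speech": ("/v1/audio/speech", "Piper TTS"),
}

def endpoint_url(task, base_url=BASE_URL):
    """Return the full URL for a task name (hypothetical helper)."""
    path, _backend = ENDPOINTS[task]
    return urljoin(base_url, path)

for task, (path, backend) in ENDPOINTS.items():
    print(f"{task:14s} {endpoint_url(task)}  (backend: {backend})")
```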

LocalAI Performance Context

LocalAI's text generation backend uses llama.cpp under the hood, which means its single-request throughput is comparable to Ollama. In our tests, LocalAI on an RTX 3090 running Llama-3-8B (Q4_K_M) produced approximately 80–120 tokens/second. That is roughly 10–30% lower than Ollama's optimized configuration, attributable to LocalAI's Go-based HTTP layer and additional abstraction overhead. For applications where the primary bottleneck is image generation or transcription (not raw text throughput), this difference is largely irrelevant.

Docker-First Deployment

# LocalAI Docker Compose setup (recommended approach)
# docker-compose.yaml:
version: '3.6'
services:
  api:
    image: localai/localai:latest-aio-cpu
    # For CUDA: use localai/localai:latest-aio-gpu-nvidia-cuda-12
    ports:
      - "8080:8080"
    environment:
      - MODELS_PATH=/models
      - THREADS=4
    volumes:
      - ./models:/models:cached

# Start LocalAI
docker compose up -d

# Test the API (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-3b-instruct",
     "messages": [{"role": "user", "content": "Summarize this for me"}]}'

LocalAI Strengths and Limitations

LocalAI's key advantage is serving as a complete OpenAI API replacement in air-gapped or privacy-sensitive environments. A single LocalAI instance can replace all OpenAI endpoints your application uses. The flip side: managing LocalAI is more complex than Ollama because you're now responsible for configuring models for multiple modalities, each with its own backend and configuration files. Text throughput is measurably lower than vLLM's, so for a high-traffic text API, vLLM is the correct production choice. For a private team tool that also needs transcription and image capabilities, LocalAI is compelling.

Head-to-Head Throughput Benchmarks

The following data was collected running Llama-3-8B-Instruct with consistent prompt and generation settings. Batch throughput uses concurrent requests; single-request is one user at a time.

Hardware | Metric | vLLM | Ollama | LocalAI
NVIDIA A100 80GB | Batch throughput (32 concurrent) | 2,100 t/s | ~190 t/s | ~130 t/s
NVIDIA A100 80GB | Single-request throughput | 210 t/s | 185 t/s | 140 t/s
RTX 4090 24GB | Batch throughput (8 concurrent) | 920 t/s | ~105 t/s | ~85 t/s
RTX 4090 24GB | Single-request throughput | 115 t/s | 102 t/s | 82 t/s
M2 MacBook Pro 16GB | Single-request throughput | N/A (no CUDA) | 42 t/s | 38 t/s
CPU-only (32-core Xeon) | Single-request throughput | N/A (no benefit) | 14 t/s | 11 t/s

💡 Key insight from the benchmarks: For single-user workloads, the difference between vLLM and Ollama is modest (10–15%). The roughly 9–11x gap in batch throughput is where vLLM's architectural advantage becomes undeniable. If you're ever serving more than 2–3 simultaneous users, vLLM becomes the economically correct choice: you need dramatically fewer GPUs for the same QPS.
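The "fewer GPUs" claim can be checked directly against the batch rows of the table. The sketch below copies those figures and estimates the GPU count for a hypothetical aggregate demand of 4,000 tokens/second. It assumes throughput scales linearly with GPU count, which real deployments only approximate.

```python
import math

# Batch throughput in tokens/second, copied from the benchmark table above.
BATCH_TPS = {
    "A100 80GB (32 concurrent)": {"vLLM": 2100, "Ollama": 190, "LocalAI": 130},
    "RTX 4090 24GB (8 concurrent)": {"vLLM": 920, "Ollama": 105, "LocalAI": 85},
}

def speedup(hardware, fast="vLLM", slow="Ollama"):
    """Ratio of batch throughput between two tools on the same hardware."""
    row = BATCH_TPS[hardware]
    return row[fast] / row[slow]

def gpus_needed(target_tps, per_gpu_tps):
    """GPUs required to sustain target_tps, assuming linear scaling."""
    return math.ceil(target_tps / per_gpu_tps)

TARGET_TPS = 4000  # hypothetical aggregate demand

for hardware, row in BATCH_TPS.items():
    print(f"{hardware}: vLLM/Ollama = {speedup(hardware):.1f}x")
    for tool, tps in row.items():
        print(f"  {tool}: {gpus_needed(TARGET_TPS, tps)} GPU(s)")
```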

Selection Framework: Which Tool for Which Scenario

Decision Framework

You're on a Mac (Apple Silicon) for local development or personal use → Ollama
You're building a production API that needs to handle 10+ concurrent users → vLLM
You need a self-hosted drop-in for the full OpenAI API (chat + embeddings + transcription + images) → LocalAI
You want to prototype an LLM app quickly on a developer laptop (any OS) → Ollama
You're deploying in an air-gapped environment with strict OpenAI API compatibility → LocalAI
You have NVIDIA GPUs and need maximum tokens/second for batch inference jobs → vLLM
You need to run GGUF-quantized models (4-bit, 5-bit) to fit large models in limited VRAM → Ollama or LocalAI
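The list above can be condensed into a small function, which some teams find useful as a checklist in design reviews. This is one possible encoding: the parameter names and the order in which the rules are checked are assumptions, and the 10-user threshold is taken from the list.

```python
def pick_tool(concurrent_users=1, nvidia_gpu=False,
              needs_multimodal=False, needs_gguf=False):
    """Suggest a serving tool following the decision list above.
    The precedence of the checks is an assumption of this sketch."""
    if needs_multimodal:
        # Full OpenAI API surface (embeddings, transcription, images)
        return "LocalAI"
    if concurrent_users >= 10 and nvidia_gpu and not needs_gguf:
        # Production-scale concurrency on NVIDIA hardware
        return "vLLM"
    if needs_gguf:
        # GGUF-quantized models to fit limited VRAM
        return "Ollama or LocalAI"
    # Local development, prototyping, Apple Silicon
    return "Ollama"

print(pick_tool())                                      # laptop prototyping
print(pick_tool(concurrent_users=50, nvidia_gpu=True))  # production API
print(pick_tool(needs_multimodal=True))                 # chat + images + audio
```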

The Start-with-Ollama, Graduate-to-vLLM Pattern

A common and sensible architecture pattern: start development and staging with Ollama because it's fast to set up and the OpenAI-compatible API means your application code is identical. When you graduate to production and need to handle real load, swap the base_url in your OpenAI client to point at your vLLM server. Because both expose the same API interface, the migration is often a single environment variable change.

This is worth stating explicitly: if you're unsure which tool to use, start with Ollama. It's the lowest-friction way to get an LLM API running, and the migration path to vLLM is straightforward when you actually need the throughput. Starting with vLLM's complexity when you don't need it is a common and costly mistake.
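A minimal sketch of that switch, using only the standard library so it runs without either server. The variable names LLM_BASE_URL and LLM_MODEL are hypothetical, the defaults point at the Ollama setup from earlier in this article, and the request is built but never sent.

```python
import json
import os
import urllib.request

# Hypothetical defaults: the local Ollama server used during development.
DEFAULT_BASE_URL = "http://localhost:11434/v1"
DEFAULT_MODEL = "llama3.1:8b"

def build_chat_request(prompt, env=os.environ):
    """Build (but do not send) a chat-completions request for whichever
    backend the LLM_BASE_URL variable points at."""
    base_url = env.get("LLM_BASE_URL", DEFAULT_BASE_URL).rstrip("/")
    model = env.get("LLM_MODEL", DEFAULT_MODEL)
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Development: no variables set, so the Ollama defaults apply.
dev = build_chat_request("Hello", env={})

# Production: same code, pointed at a vLLM server.
prod = build_chat_request("Hello", env={
    "LLM_BASE_URL": "http://localhost:8000/v1",
    "LLM_MODEL": "meta-llama/Meta-Llama-3-8B-Instruct",
})

print(dev.full_url)
print(prod.full_url)
```

Note that the model identifier usually differs between the two backends (an Ollama tag versus a HuggingFace model ID), so in practice it helps to make the model name configurable alongside the base URL.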

Related Tools Worth Knowing

If you're exploring LLM serving options, two other tools are worth understanding in this context. llama.cpp is the underlying C++ inference engine that Ollama and LocalAI both build on; going direct gives you the most control over quantization settings and can squeeze 10–20% more performance from constrained hardware. LocalAI also supports the llama-cpp-python Python bindings as an alternative backend for more complex model configurations.

Bottom Line

Use Ollama for development. Use vLLM for production at scale. Use LocalAI if you need the full OpenAI API surface self-hosted. The tools are not competing head-to-head: they serve different stages of the deployment lifecycle and different organizational needs. The most common mistake is running Ollama in production under load (it will buckle) or setting up vLLM for a single-developer side project (massive overkill). Match the tool to the stage.

Frequently Asked Questions

Can I use vLLM without an NVIDIA GPU?

vLLM's primary and most performant backend requires CUDA, which means NVIDIA GPUs. However, as of early 2026, vLLM has added experimental support for AMD ROCm GPUs and AWS Neuron chips. CPU-only mode is technically possible but extremely slow; for CPU inference, llama.cpp or Ollama are far better choices. If you're running on Apple Silicon, Ollama with Metal acceleration is the correct tool for the job.

Does Ollama support concurrent requests from multiple users?

Ollama originally processed requests one at a time, which is a key limitation for production multi-user deployments. Starting from version 0.1.33, Ollama introduced parallel request processing via the OLLAMA_NUM_PARALLEL environment variable, and you can set OLLAMA_MAX_QUEUE to configure the request queue depth. However, even with these settings, Ollama's concurrency handling is significantly less efficient than vLLM's PagedAttention-based continuous batching. For serving more than 2–3 simultaneous users with low latency, vLLM remains the superior choice.
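To see how the two settings interact, here is a deliberately simplified model of what happens when a burst of requests arrives at once. It assumes a single loaded model, that no request finishes during the burst, and that requests beyond the queue limit are rejected. The function name is hypothetical, and this is not Ollama's actual scheduler.

```python
def admit(requests, num_parallel, max_queue):
    """Split a burst of simultaneous requests into those that start
    immediately, those that wait in the queue, and those rejected.
    Simplified: one loaded model, nothing finishes mid-burst."""
    running = min(requests, num_parallel)
    queued = min(requests - running, max_queue)
    rejected = requests - running - queued
    return {"running": running, "queued": queued, "rejected": rejected}

# Same values as the OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_QUEUE=20 example.
for burst in (3, 10, 40):
    print(burst, admit(burst, num_parallel=4, max_queue=20))
```

Queued requests still wait for a free slot, so their latency grows with queue depth. That is the degradation under load described above.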

What is PagedAttention and why does it matter for throughput?

PagedAttention is vLLM's core memory management innovation, borrowed from operating system virtual memory concepts. Traditional LLM inference pre-allocates a fixed block of KV cache memory per request, which leads to massive memory fragmentation: often 60–80% of allocated KV cache memory is wasted. PagedAttention divides the KV cache into small fixed-size "pages" and allocates them dynamically as the sequence grows. This nearly eliminates fragmentation, allows more requests to fit in GPU memory simultaneously, and supports continuous batching of requests at different stages of generation. The practical result: vLLM can typically handle 2–4x more concurrent requests on the same GPU compared to naive implementations, directly translating to higher throughput at lower cost-per-token in production deployments.