Moving from a working vLLM deployment to a production system that handles thousands of requests efficiently requires careful tuning. This guide covers the optimization techniques and architectural patterns we use at Sajima Solutions for high-throughput Qwen deployments.
Understanding vLLM’s Architecture
Before optimizing, understand what makes vLLM fast:
- PagedAttention: Manages KV cache memory like OS virtual memory—no fragmentation
- Continuous batching: Dynamically adds requests mid-generation
- CUDA kernels: Optimized attention and sampling operations
- Prefix caching: Reuses KV cache for repeated prefixes
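To make the PagedAttention idea concrete, here is a toy sketch (illustrative only, not vLLM's actual implementation): the KV cache is carved into fixed-size blocks, each sequence keeps a block table mapping logical token positions to physical blocks, and blocks return to a shared free pool the moment a sequence finishes, so memory never fragments the way contiguous per-sequence buffers do.

```python
# Toy sketch of paged KV-cache bookkeeping (not vLLM's implementation).
BLOCK_SIZE = 16  # tokens per block (vLLM's default block size)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: str, position: int) -> int:
        """Return the physical block for a token, allocating only on block boundaries."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:      # first token of a new block
            table.append(self.free_blocks.pop())
        return table[position // BLOCK_SIZE]

    def free(self, seq_id: str) -> None:
        """All blocks return to the pool as soon as the sequence finishes."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=8)
for pos in range(40):                       # a 40-token sequence
    cache.append_token("req-1", pos)
print(len(cache.block_tables["req-1"]))     # 3 blocks cover 40 tokens
cache.free("req-1")
print(len(cache.free_blocks))               # all 8 blocks free again
```

The real scheduler does this per layer and per head on the GPU, but the accounting model is the same: allocation in fixed blocks means any free block can serve any sequence.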
Baseline Configuration
Start with a sensible baseline for Qwen2.5-7B:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --port 8000 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --enable-prefix-caching
```
Then measure before tuning further:
```python
import time
import asyncio
import aiohttp

async def benchmark(url: str, num_requests: int, concurrency: int):
    """Measure throughput and latency."""
    payload = {
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Explain what an API is in two sentences."}],
        "max_tokens": 100,
    }
    semaphore = asyncio.Semaphore(concurrency)
    latencies = []

    async def single_request(session):
        async with semaphore:
            start = time.perf_counter()
            async with session.post(f"{url}/v1/chat/completions", json=payload) as resp:
                await resp.json()
            latencies.append(time.perf_counter() - start)

    async with aiohttp.ClientSession() as session:
        start = time.perf_counter()
        await asyncio.gather(*[single_request(session) for _ in range(num_requests)])
        total_time = time.perf_counter() - start

    latencies.sort()
    print(f"Throughput: {num_requests / total_time:.2f} req/s")
    print(f"P50 latency: {latencies[len(latencies) // 2] * 1000:.0f}ms")
    print(f"P99 latency: {latencies[int(len(latencies) * 0.99)] * 1000:.0f}ms")

asyncio.run(benchmark("http://localhost:8000", num_requests=500, concurrency=50))
```
Memory Optimization
KV Cache Sizing
The KV cache consumes most GPU memory after model weights. Calculate requirements:
```python
def kv_cache_memory_gb(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_seq_len: int,
    max_batch_size: int,
    dtype_bytes: int = 2,  # float16
) -> float:
    """Estimate KV cache memory in GB."""
    # 2 for K and V
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    total_bytes = bytes_per_token * max_seq_len * max_batch_size
    return total_bytes / (1024 ** 3)

# Qwen2.5-7B: 28 layers, 4 KV heads, 128 head dim
kv_mem = kv_cache_memory_gb(28, 4, 128, max_seq_len=8192, max_batch_size=64)
print(f"KV cache for 64 concurrent 8K context requests: {kv_mem:.1f} GB")
```
Choosing gpu-memory-utilization
```bash
# Conservative (leaves headroom for spikes)
--gpu-memory-utilization 0.85

# Aggressive (maximize throughput)
--gpu-memory-utilization 0.95

# For multi-tenant systems with variable load
--gpu-memory-utilization 0.90
```
Higher utilization means more concurrent requests, but leaves less room for variance. Start at 0.90 and adjust based on OOM frequency.
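As a rough sanity check on what each setting buys you, you can estimate the KV-cache budget that remains after model weights. The weight figure below (~15 GB for Qwen2.5-7B in float16) and the 80 GB card are illustrative assumptions, not measurements:

```python
# Rough KV-cache budget: vLLM reserves about total_vram * utilization
# for weights plus cache, so the cache gets what the weights leave over.
def kv_cache_budget_gb(total_vram_gb: float, utilization: float, weights_gb: float) -> float:
    return total_vram_gb * utilization - weights_gb

# Assumed: 80 GB A100, ~15 GB of float16 weights for a 7B model.
for util in (0.85, 0.90, 0.95):
    budget = kv_cache_budget_gb(80, util, 15)
    print(f"utilization={util}: ~{budget:.0f} GB for KV cache")
```

Combined with the per-token KV cost computed above, this tells you roughly how many concurrent full-context requests each setting can hold before preemption kicks in.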
Quantization for Memory Efficiency
AWQ 4-bit quantization cuts weight memory to roughly a quarter of float16 while largely maintaining quality:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-AWQ \
    --quantization awq \
    --gpu-memory-utilization 0.90 \
    --max-model-len 16384 \
    --port 8000
```
Memory savings allow longer context or more concurrent requests on the same hardware.
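The back-of-the-envelope arithmetic (illustrative, ignoring activation memory and quantization scale/zero-point overhead) shows where the savings come from:

```python
# Approximate weight memory for a 7B-parameter model.
params = 7e9
fp16_gb = params * 2 / 1024**3    # 2 bytes per parameter
awq_gb = params * 0.5 / 1024**3   # 4-bit ~= 0.5 bytes per parameter

print(f"fp16 weights: ~{fp16_gb:.1f} GB")
print(f"AWQ 4-bit:    ~{awq_gb:.1f} GB")
print(f"freed for KV cache: ~{fp16_gb - awq_gb:.1f} GB")
```

Roughly 10 GB freed on the same card translates directly into a larger KV-cache budget, which is why the AWQ example above can afford `--max-model-len 16384`.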
Throughput Optimization
Continuous Batching Tuning
```bash
# Maximum sequences to batch together
--max-num-seqs 256

# Maximum tokens in a batch (controls memory per batch)
--max-num-batched-tokens 32768
```
Higher values increase throughput but also latency. Profile your workload:
```bash
# Short outputs (chatbots, Q&A)
--max-num-seqs 256
--max-num-batched-tokens 16384

# Long outputs (content generation)
--max-num-seqs 64
--max-num-batched-tokens 32768
```
Prefix Caching for Repeated Prompts
If your system prompt is consistent across requests:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --enable-prefix-caching \
    --port 8000
```
Benchmark impact:
```bash
# Without prefix caching: system prompt processed every request
# With prefix caching: system prompt KV cache reused
# For a 500-token system prompt at 50 req/s:
# Savings: ~25,000 tokens/second of prefill computation
```
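The savings figure above is just the product of prompt length and request rate (workload numbers are illustrative):

```python
# Prefill work avoided by prefix caching = shared prefix length x request rate.
system_prompt_tokens = 500
requests_per_second = 50

prefill_saved = system_prompt_tokens * requests_per_second
print(f"Prefill tokens saved: {prefill_saved}/s")
```

At these rates the cached prefix spares the GPU as much prefill work per second as several full requests, which is why the feature matters most when every request shares a long system prompt.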
Chunked Prefill for Lower Latency
Prevent long prompts from blocking short ones:
```bash
--enable-chunked-prefill \
--max-num-batched-tokens 2048
```
This interleaves prefill (processing input) with decode (generating output), reducing latency variance.
Speculative Decoding
Use a smaller model to predict tokens, verified by the large model:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --speculative-model Qwen/Qwen2.5-0.5B-Instruct \
    --num-speculative-tokens 5 \
    --port 8000
```
Best for:
- Low-concurrency, latency-sensitive workloads
- When draft model predictions align well with the main model
Not recommended when:
- High concurrency (batching is more efficient)
- Diverse, unpredictable outputs
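A toy sketch of the verification step explains both recommendations (greedy acceptance only; vLLM's actual rejection sampling is more involved). The draft model proposes several tokens, the target model scores them all in a single forward pass, and proposals are accepted left to right until the first disagreement:

```python
# Toy greedy speculative-decoding acceptance (not vLLM's implementation).
def verify(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    """Accept draft tokens until they diverge, then take the target's correction."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # the target model's token replaces the bad guess
            break
    return accepted

# Draft guesses 5 tokens; the target agrees with the first 3.
print(verify([11, 12, 13, 99, 98], [11, 12, 13, 14, 15]))  # [11, 12, 13, 14]
```

One target forward pass yielded four tokens instead of one; but when outputs are unpredictable the draft diverges early, the extra work is wasted, and at high concurrency that work competes with batched decode for the same GPU.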
Multi-GPU Scaling
Tensor Parallelism (Single Node)
Split model across GPUs within one machine:
```bash
# 4x A100 for Qwen2.5-72B
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 \
    --port 8000
```
Pipeline Parallelism
For scaling across nodes, or when the model's attention heads don't divide evenly by the tensor-parallel size:
```bash
--pipeline-parallel-size 2
```
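The two forms compose: total GPUs = tensor-parallel size × pipeline-parallel size. A hypothetical layout for Qwen2.5-72B on eight GPUs might look like this (multi-node pipeline parallelism additionally requires a Ray cluster):

```shell
# Hypothetical: 8 GPUs = 4-way tensor parallel x 2-way pipeline parallel
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 2 \
    --port 8000
```

Prefer tensor parallelism within a node (it needs fast NVLink-class interconnect) and reach for pipeline parallelism only when a single node can't hold the tensor-parallel shards.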
Horizontal Scaling (Multiple Nodes)
Deploy multiple vLLM instances behind a load balancer:
```yaml
# docker-compose.yml
services:
  vllm-1:
    image: vllm/vllm-openai:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    command: >
      --model Qwen/Qwen2.5-7B-Instruct
      --port 8000
    ports:
      - "8001:8000"

  vllm-2:
    image: vllm/vllm-openai:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1']
              capabilities: [gpu]
    command: >
      --model Qwen/Qwen2.5-7B-Instruct
      --port 8000
    ports:
      - "8002:8000"

  nginx:
    image: nginx:alpine
    ports:
      - "8000:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - vllm-1
      - vllm-2
```
Load balancer configuration:
```nginx
upstream vllm_cluster {
    least_conn;
    server vllm-1:8000;
    server vllm-2:8000;
}

server {
    listen 80;

    location / {
        proxy_pass http://vllm_cluster;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;
    }
}
```
Production Architecture
Health Checks and Readiness
```python
import httpx
from fastapi import FastAPI, Response

app = FastAPI()
VLLM_URL = "http://localhost:8000"

@app.get("/health")
async def health():
    """Liveness probe—is the process running?"""
    return {"status": "ok"}

@app.get("/ready")
async def ready():
    """Readiness probe—can we serve requests?"""
    async with httpx.AsyncClient() as client:
        try:
            resp = await client.get(f"{VLLM_URL}/health", timeout=5.0)
            if resp.status_code == 200:
                return {"status": "ready"}
        except Exception:
            pass
    return Response(status_code=503)
```
Request Queuing and Backpressure
Protect the system from overload:
```python
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException

app = FastAPI()

MAX_CONCURRENT = 100  # requests processed at once
MAX_IN_FLIGHT = 200   # admitted requests (running + queued) before shedding load

semaphore = asyncio.Semaphore(MAX_CONCURRENT)
in_flight = 0

@asynccontextmanager
async def rate_limit():
    global in_flight
    # Shed load instead of queueing unboundedly once the backlog is too deep
    if in_flight >= MAX_IN_FLIGHT:
        raise HTTPException(503, "Server overloaded")
    in_flight += 1
    try:
        async with semaphore:
            yield
    finally:
        in_flight -= 1

@app.post("/v1/chat/completions")
async def chat_completions(request: ChatRequest):
    async with rate_limit():
        return await forward_to_vllm(request)
```
Timeout Handling
```python
import httpx
from fastapi import HTTPException

async def generate_with_timeout(payload: dict, timeout: float = 60.0):
    async with httpx.AsyncClient() as client:
        try:
            response = await client.post(
                "http://localhost:8000/v1/chat/completions",
                json=payload,
                timeout=timeout,
            )
            response.raise_for_status()
            return response.json()
        except httpx.TimeoutException:
            raise HTTPException(504, "Generation timed out")
        except httpx.HTTPStatusError as e:
            raise HTTPException(e.response.status_code, str(e))
```
Observability
Prometheus Metrics
vLLM exposes Prometheus metrics at /metrics on the serving port by default; the example below also disables per-request logging to cut log noise:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --disable-log-requests \
    --port 8000
```
Key metrics to monitor:
```yaml
# Prometheus alerts
groups:
  - name: vllm
    rules:
      - alert: HighGPUMemory
        expr: vllm_gpu_memory_usage_bytes / vllm_gpu_memory_total_bytes > 0.95
        for: 5m
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(vllm_request_latency_seconds_bucket[5m])) > 10
        for: 5m
      - alert: QueueBacklog
        expr: vllm_pending_requests > 100
        for: 2m
```
Structured Logging
```python
import structlog
from datetime import datetime

logger = structlog.get_logger()

async def log_request(request_id: str, payload: dict, response: dict, duration: float):
    logger.info(
        "llm_request",
        request_id=request_id,
        model=payload.get("model"),
        input_tokens=response.get("usage", {}).get("prompt_tokens"),
        output_tokens=response.get("usage", {}).get("completion_tokens"),
        duration_ms=duration * 1000,
        timestamp=datetime.utcnow().isoformat(),
    )
```
Kubernetes Deployment
Production-ready Kubernetes manifest:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-vllm
  labels:
    app: qwen-vllm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: qwen-vllm
  template:
    metadata:
      labels:
        app: qwen-vllm
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "Qwen/Qwen2.5-7B-Instruct"
            - "--gpu-memory-utilization"
            - "0.90"
            - "--max-model-len"
            - "8192"
            - "--enable-prefix-caching"
            - "--port"
            - "8000"
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "32Gi"
            requests:
              nvidia.com/gpu: 1
              memory: "24Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
          env:
            - name: VLLM_LOGGING_LEVEL
              value: "WARNING"
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
---
apiVersion: v1
kind: Service
metadata:
  name: qwen-vllm
spec:
  selector:
    app: qwen-vllm
  ports:
    - port: 80
      targetPort: 8000
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: qwen-vllm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: qwen-vllm
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_pending_requests
        target:
          type: AverageValue
          averageValue: "50"
```
Before going to production:
- Memory: Set gpu-memory-utilization based on load testing
- Context: Limit max-model-len to actual requirements
- Batching: Tune max-num-seqs for your concurrency patterns
- Caching: Enable prefix caching if system prompts are consistent
- Quantization: Use AWQ for memory-constrained deployments
- Monitoring: Configure Prometheus metrics and alerts
- Health checks: Implement proper readiness/liveness probes
- Timeouts: Set appropriate client and server timeouts
Conclusion
High-performance vLLM deployments require understanding the trade-offs between throughput, latency, and resource utilization. Key optimizations:
- Memory: Quantization and careful KV cache sizing
- Throughput: Continuous batching and prefix caching
- Latency: Chunked prefill and speculative decoding
- Scale: Tensor parallelism within nodes, load balancing across nodes
At Sajima Solutions, we architect and optimize AI infrastructure for production workloads. Contact us to deploy high-performance LLM services for your organization.