Moving from a working vLLM deployment to a production system that handles thousands of requests efficiently requires careful tuning. This guide covers the optimization techniques and architectural patterns we use at Sajima Solutions for high-throughput Qwen deployments.
Understanding vLLM’s Architecture
Before optimizing, understand what makes vLLM fast:
- PagedAttention: Manages KV cache memory like OS virtual memory—no fragmentation
- Continuous batching: Dynamically adds requests mid-generation
- CUDA kernels: Optimized attention and sampling operations
- Prefix caching: Reuses KV cache for repeated prefixes
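To make the PagedAttention idea concrete, here is a toy sketch (illustrative only, not vLLM's actual implementation): the KV cache is carved into fixed-size blocks, each sequence keeps a block table mapping logical token positions to physical blocks, and blocks return to a shared free pool the moment a sequence finishes, so memory never fragments the way contiguous per-sequence buffers do.

```python
# Toy sketch of paged KV-cache bookkeeping (not vLLM's implementation).
BLOCK_SIZE = 16  # tokens per block (vLLM's default block size)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: str, position: int) -> int:
        """Return the physical block for a token, allocating only on block boundaries."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:      # first token of a new block
            table.append(self.free_blocks.pop())
        return table[position // BLOCK_SIZE]

    def free(self, seq_id: str) -> None:
        """All blocks return to the pool as soon as the sequence finishes."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=8)
for pos in range(40):                       # a 40-token sequence
    cache.append_token("req-1", pos)
print(len(cache.block_tables["req-1"]))     # 3 blocks cover 40 tokens
cache.free("req-1")
print(len(cache.free_blocks))               # all 8 blocks free again
```

The real scheduler does this per layer and per head on the GPU, but the accounting model is the same: allocation in fixed blocks means any free block can serve any sequence.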
Baseline Configuration
Start with a sensible baseline for Qwen2.5-7B:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --port 8000 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --enable-prefix-caching
```
Then measure before tuning further:
```python
import time
import asyncio
import aiohttp

async def benchmark(url: str, num_requests: int, concurrency: int):
    """Measure throughput and latency."""
    payload = {
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Explain what an API is in two sentences."}],
        "max_tokens": 100,
    }
    semaphore = asyncio.Semaphore(concurrency)
    latencies = []

    async def single_request(session):
        async with semaphore:
            start = time.perf_counter()
            async with session.post(f"{url}/v1/chat/completions", json=payload) as resp:
                await resp.json()
            latencies.append(time.perf_counter() - start)

    async with aiohttp.ClientSession() as session:
        start = time.perf_counter()
        await asyncio.gather(*[single_request(session) for _ in range(num_requests)])
        total_time = time.perf_counter() - start

    latencies.sort()
    print(f"Throughput: {num_requests / total_time:.2f} req/s")
    print(f"P50 latency: {latencies[len(latencies) // 2] * 1000:.0f}ms")
    print(f"P99 latency: {latencies[int(len(latencies) * 0.99)] * 1000:.0f}ms")

asyncio.run(benchmark("http://localhost:8000", num_requests=500, concurrency=50))
```
Memory Optimization
KV Cache Sizing
The KV cache consumes most GPU memory after model weights. Calculate requirements:
```python
def kv_cache_memory_gb(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_seq_len: int,
    max_batch_size: int,
    dtype_bytes: int = 2,  # float16
) -> float:
    """Estimate KV cache memory in GB."""
    # 2 for K and V
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    total_bytes = bytes_per_token * max_seq_len * max_batch_size
    return total_bytes / (1024 ** 3)

# Qwen2.5-7B: 28 layers, 4 KV heads, 128 head dim
kv_mem = kv_cache_memory_gb(28, 4, 128, max_seq_len=8192, max_batch_size=64)
print(f"KV cache for 64 concurrent 8K context requests: {kv_mem:.1f} GB")
```
Choosing gpu-memory-utilization
```bash
# Conservative (leaves headroom for spikes)
--gpu-memory-utilization 0.85

# Aggressive (maximize throughput)
--gpu-memory-utilization 0.95

# For multi-tenant systems with variable load
--gpu-memory-utilization 0.90
```
Higher utilization means more concurrent requests, but leaves less room for variance. Start at 0.90 and adjust based on OOM frequency.
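As a rough sanity check on what each setting buys you, you can estimate the KV-cache budget that remains after model weights. The weight figure below (~15 GB for Qwen2.5-7B in float16) and the 80 GB card are illustrative assumptions, not measurements:

```python
# Rough KV-cache budget: vLLM reserves about total_vram * utilization
# for weights plus cache, so the cache gets what the weights leave over.
def kv_cache_budget_gb(total_vram_gb: float, utilization: float, weights_gb: float) -> float:
    return total_vram_gb * utilization - weights_gb

# Assumed: 80 GB A100, ~15 GB of float16 weights for a 7B model.
for util in (0.85, 0.90, 0.95):
    budget = kv_cache_budget_gb(80, util, 15)
    print(f"utilization={util}: ~{budget:.0f} GB for KV cache")
```

Combined with the per-token KV cost computed above, this tells you roughly how many concurrent full-context requests each setting can hold before preemption kicks in.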
Quantization for Memory Efficiency
AWQ 4-bit quantization cuts weight memory to roughly a quarter of float16 while largely maintaining quality:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-AWQ \
    --quantization awq \
    --gpu-memory-utilization 0.90 \
    --max-model-len 16384 \
    --port 8000
```
Memory savings allow longer context or more concurrent requests on the same hardware.
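The back-of-the-envelope arithmetic (illustrative, ignoring activation memory and quantization scale/zero-point overhead) shows where the savings come from:

```python
# Approximate weight memory for a 7B-parameter model.
params = 7e9
fp16_gb = params * 2 / 1024**3    # 2 bytes per parameter
awq_gb = params * 0.5 / 1024**3   # 4-bit ~= 0.5 bytes per parameter

print(f"fp16 weights: ~{fp16_gb:.1f} GB")
print(f"AWQ 4-bit:    ~{awq_gb:.1f} GB")
print(f"freed for KV cache: ~{fp16_gb - awq_gb:.1f} GB")
```

Roughly 10 GB freed on the same card translates directly into a larger KV-cache budget, which is why the AWQ example above can afford `--max-model-len 16384`.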
Throughput Optimization
Continuous Batching Tuning
```bash
# Maximum sequences to batch together
--max-num-seqs 256

# Maximum tokens in a batch (controls memory per batch)
--max-num-batched-tokens 32768
```
Higher values increase throughput but also latency. Profile your workload:
```bash
# Short outputs (chatbots, Q&A)
--max-num-seqs 256
--max-num-batched-tokens 16384

# Long outputs (content generation)
--max-num-seqs 64
--max-num-batched-tokens 32768
```
Prefix Caching for Repeated Prompts
If your system prompt is consistent across requests:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --enable-prefix-caching \
    --port 8000
```
Benchmark impact:
```bash
# Without prefix caching: system prompt processed every request
# With prefix caching: system prompt KV cache reused
# For a 500-token system prompt at 50 req/s:
# Savings: ~25,000 tokens/second of prefill computation
```
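The savings figure above is just the product of prompt length and request rate (workload numbers are illustrative):

```python
# Prefill work avoided by prefix caching = shared prefix length x request rate.
system_prompt_tokens = 500
requests_per_second = 50

prefill_saved = system_prompt_tokens * requests_per_second
print(f"Prefill tokens saved: {prefill_saved}/s")
```

At these rates the cached prefix spares the GPU as much prefill work per second as several full requests, which is why the feature matters most when every request shares a long system prompt.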
Chunked Prefill for Lower Latency
Prevent long prompts from blocking short ones:
```bash
--enable-chunked-prefill \
--max-num-batched-tokens 2048
```
This interleaves prefill (processing input) with decode (generating output), reducing latency variance.
Speculative Decoding
Use a smaller model to predict tokens, verified by the large model:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --speculative-model Qwen/Qwen2.5-0.5B-Instruct \
    --num-speculative-tokens 5 \
    --port 8000
```
Best for:
- Low-concurrency, latency-sensitive workloads
- When draft model predictions align well with the main model
Not recommended when:
- High concurrency (batching is more efficient)
- Diverse, unpredictable outputs
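A toy sketch of the verification step explains both recommendations (greedy acceptance only; vLLM's actual rejection sampling is more involved). The draft model proposes several tokens, the target model scores them all in a single forward pass, and proposals are accepted left to right until the first disagreement:

```python
# Toy greedy speculative-decoding acceptance (not vLLM's implementation).
def verify(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    """Accept draft tokens until they diverge, then take the target's correction."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # the target model's token replaces the bad guess
            break
    return accepted

# Draft guesses 5 tokens; the target agrees with the first 3.
print(verify([11, 12, 13, 99, 98], [11, 12, 13, 14, 15]))  # [11, 12, 13, 14]
```

One target forward pass yielded four tokens instead of one; but when outputs are unpredictable the draft diverges early, the extra work is wasted, and at high concurrency that work competes with batched decode for the same GPU.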
Multi-GPU Scaling
Tensor Parallelism (Single Node)
Split model across GPUs within one machine:
```bash
# 4x A100 for Qwen2.5-72B
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 \
    --port 8000
```
Pipeline Parallelism
For scaling across nodes, or when the model's attention heads don't divide evenly by the tensor-parallel size:
```bash
--pipeline-parallel-size 2
```
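The two forms compose: total GPUs = tensor-parallel size × pipeline-parallel size. A hypothetical layout for Qwen2.5-72B on eight GPUs might look like this (multi-node pipeline parallelism additionally requires a Ray cluster):

```shell
# Hypothetical: 8 GPUs = 4-way tensor parallel x 2-way pipeline parallel
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 2 \
    --port 8000
```

Prefer tensor parallelism within a node (it needs fast NVLink-class interconnect) and reach for pipeline parallelism only when a single node can't hold the tensor-parallel shards.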
Horizontal Scaling (Multiple Nodes)
Deploy multiple vLLM instances behind a load balancer:
```yaml
# docker-compose.yml
services:
  vllm-1:
    image: vllm/vllm-openai:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    command: >
      --model Qwen/Qwen2.5-7B-Instruct
      --port 8000
    ports:
      - "8001:8000"

  vllm-2:
    image: vllm/vllm-openai:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1']
              capabilities: [gpu]
    command: >
      --model Qwen/Qwen2.5-7B-Instruct
      --port 8000
    ports:
      - "8002:8000"

  nginx:
    image: nginx:alpine
    ports:
      - "8000:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - vllm-1
      - vllm-2
```
Load balancer configuration:
```nginx
upstream vllm_cluster {
    least_conn;
    server vllm-1:8000;
    server vllm-2:8000;
}

server {
    listen 80;

    location / {
        proxy_pass http://vllm_cluster;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;
    }
}
```
Production Architecture
Health Checks and Readiness
```python
import httpx
from fastapi import FastAPI, Response

app = FastAPI()
VLLM_URL = "http://localhost:8000"

@app.get("/health")
async def health():
    """Liveness probe—is the process running?"""
    return {"status": "ok"}

@app.get("/ready")
async def ready():
    """Readiness probe—can we serve requests?"""
    async with httpx.AsyncClient() as client:
        try:
            resp = await client.get(f"{VLLM_URL}/health", timeout=5.0)
            if resp.status_code == 200:
                return {"status": "ready"}
        except Exception:
            pass
    return Response(status_code=503)
```
Request Queuing and Backpressure
Protect the system from overload:
```python
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException

app = FastAPI()

MAX_CONCURRENT = 100  # requests processed at once
MAX_IN_FLIGHT = 200   # admitted requests (running + queued) before shedding load

semaphore = asyncio.Semaphore(MAX_CONCURRENT)
in_flight = 0

@asynccontextmanager
async def rate_limit():
    global in_flight
    # Shed load instead of queueing unboundedly once the backlog is too deep
    if in_flight >= MAX_IN_FLIGHT:
        raise HTTPException(503, "Server overloaded")
    in_flight += 1
    try:
        async with semaphore:
            yield
    finally:
        in_flight -= 1

@app.post("/v1/chat/completions")
async def chat_completions(request: ChatRequest):
    async with rate_limit():
        return await forward_to_vllm(request)
```
Timeout Handling
```python
import httpx
from fastapi import HTTPException

async def generate_with_timeout(payload: dict, timeout: float = 60.0):
    async with httpx.AsyncClient() as client:
        try:
            response = await client.post(
                "http://localhost:8000/v1/chat/completions",
                json=payload,
                timeout=timeout,
            )
            response.raise_for_status()
            return response.json()
        except httpx.TimeoutException:
            raise HTTPException(504, "Generation timed out")
        except httpx.HTTPStatusError as e:
            raise HTTPException(e.response.status_code, str(e))
```
Observability
Prometheus Metrics
vLLM exposes Prometheus metrics at /metrics on the serving port by default; the example below also disables per-request logging to cut log noise:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --disable-log-requests \
    --port 8000
```
Key metrics to monitor:
```yaml
# Prometheus alerts
groups:
  - name: vllm
    rules:
      - alert: HighGPUMemory
        expr: vllm_gpu_memory_usage_bytes / vllm_gpu_memory_total_bytes > 0.95
        for: 5m
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(vllm_request_latency_seconds_bucket[5m])) > 10
        for: 5m
      - alert: QueueBacklog
        expr: vllm_pending_requests > 100
        for: 2m
```
Structured Logging
```python
import structlog
from datetime import datetime

logger = structlog.get_logger()

async def log_request(request_id: str, payload: dict, response: dict, duration: float):
    logger.info(
        "llm_request",
        request_id=request_id,
        model=payload.get("model"),
        input_tokens=response.get("usage", {}).get("prompt_tokens"),
        output_tokens=response.get("usage", {}).get("completion_tokens"),
        duration_ms=duration * 1000,
        timestamp=datetime.utcnow().isoformat(),
    )
```
Kubernetes Deployment
Production-ready Kubernetes manifest:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-vllm
  labels:
    app: qwen-vllm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: qwen-vllm
  template:
    metadata:
      labels:
        app: qwen-vllm
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "Qwen/Qwen2.5-7B-Instruct"
            - "--gpu-memory-utilization"
            - "0.90"
            - "--max-model-len"
            - "8192"
            - "--enable-prefix-caching"
            - "--port"
            - "8000"
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "32Gi"
            requests:
              nvidia.com/gpu: 1
              memory: "24Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
          env:
            - name: VLLM_LOGGING_LEVEL
              value: "WARNING"
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
---
apiVersion: v1
kind: Service
metadata:
  name: qwen-vllm
spec:
  selector:
    app: qwen-vllm
  ports:
    - port: 80
      targetPort: 8000
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: qwen-vllm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: qwen-vllm
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_pending_requests
        target:
          type: AverageValue
          averageValue: "50"
```
Before going to production:
- Memory: Set gpu-memory-utilization based on load testing
- Context: Limit max-model-len to actual requirements
- Batching: Tune max-num-seqs for your concurrency patterns
- Caching: Enable prefix caching if system prompts are consistent
- Quantization: Use AWQ for memory-constrained deployments
- Monitoring: Configure Prometheus metrics and alerts
- Health checks: Implement proper readiness/liveness probes
- Timeouts: Set appropriate client and server timeouts
Conclusion
High-performance vLLM deployments require understanding the trade-offs between throughput, latency, and resource utilization. Key optimizations:
- Memory: Quantization and careful KV cache sizing
- Throughput: Continuous batching and prefix caching
- Latency: Chunked prefill and speculative decoding
- Scale: Tensor parallelism within nodes, load balancing across nodes
At Sajima Solutions, we architect and optimize AI infrastructure for production workloads. Contact us to deploy high-performance LLM services for your organization.