Enterprises across finance, legal, healthcare, and customer service are deploying AI assistants that understand their specific domain. The challenge isn’t building the AI—it’s deploying it at scale, keeping costs reasonable, and maintaining response times that users expect.

This guide shows how to deploy production AI systems that handle thousands of concurrent requests while keeping infrastructure costs manageable. We’ll use UAE-developed Falcon models as an example, but the patterns apply to any large language model.

Why Falcon?

Falcon models offer:

  • Open weights: Commercially usable (Falcon 180B uses a custom license)
  • Strong performance: Competitive with models 2-3x larger
  • Regional significance: Developed in the UAE
  • Multiple sizes: 7B, 40B, and 180B parameters

Why vLLM?

vLLM provides:

  • PagedAttention: Paged KV-cache management that cuts memory waste and fits more concurrent requests per GPU
  • Continuous batching: New requests join in-flight batches instead of waiting for a full batch to finish
  • Optimized CUDA kernels: Substantially faster than a naive Transformers generation loop
  • OpenAI-compatible API: Drop-in replacement for existing code

Installation

# Create virtual environment
python -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM
pip install vllm

# For Falcon specifically, you may need
pip install einops
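
Before moving on, it is worth a quick sanity check that vLLM imports cleanly and a GPU is visible. A minimal sketch (run inside the virtual environment):

# Verify the installation: vLLM importable, CUDA device detected
import torch
import vllm

print(f"vLLM version: {vllm.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")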

Serving Falcon-7B

Start the server:

python -m vllm.entrypoints.openai.api_server \
  --model tiiuae/falcon-7b-instruct \
  --trust-remote-code \
  --port 8000 \
  --tensor-parallel-size 1

For Falcon-40B (requires multiple GPUs):

python -m vllm.entrypoints.openai.api_server \
  --model tiiuae/falcon-40b-instruct \
  --trust-remote-code \
  --port 8000 \
  --tensor-parallel-size 4
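
Once either server is running, you can confirm the model is loaded by querying the /v1/models endpoint; a minimal check, assuming the default port used above:

import requests

# List the models the vLLM server has loaded
resp = requests.get("http://localhost:8000/v1/models")
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # e.g. tiiuae/falcon-7b-instruct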

Using the API

Chat Completions

import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "tiiuae/falcon-7b-instruct",
        "messages": [
            {
                "role": "user",
                "content": "Explain quantum computing in simple terms."
            }
        ],
        "max_tokens": 256,
        "temperature": 0.7
    }
)

result = response.json()
print(result["choices"][0]["message"]["content"])

Streaming Responses

import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "tiiuae/falcon-7b-instruct",
        "messages": [
            {"role": "user", "content": "Write a short poem about Dubai."}
        ],
        "max_tokens": 256,
        "stream": True
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        print(line.decode())
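
Each streamed line is a server-sent event prefixed with data:, and the stream ends with a data: [DONE] sentinel. Here is a sketch that extracts just the generated text; it assumes the same request as above:

import json
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "tiiuae/falcon-7b-instruct",
        "messages": [
            {"role": "user", "content": "Write a short poem about Dubai."}
        ],
        "max_tokens": 256,
        "stream": True
    },
    stream=True
)

for line in response.iter_lines():
    if not line:
        continue
    payload = line.decode()
    if not payload.startswith("data:"):
        continue
    data = payload[len("data:"):].strip()
    if data == "[DONE]":  # end-of-stream sentinel
        break
    chunk = json.loads(data)
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)
print()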

With OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="tiiuae/falcon-7b-instruct",
    messages=[
        {"role": "user", "content": "What are the benefits of cloud computing?"}
    ],
    max_tokens=256
)

print(response.choices[0].message.content)
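
The SDK can also stream against the same local endpoint; a brief sketch (the prompt is just an illustration):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Stream tokens as they are generated
stream = client.chat.completions.create(
    model="tiiuae/falcon-7b-instruct",
    messages=[
        {"role": "user", "content": "Summarize the benefits of edge computing."}
    ],
    max_tokens=256,
    stream=True
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:  # some chunks carry no text (e.g. the role-only first chunk)
        print(delta.content, end="", flush=True)
print()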

Falcon Prompt Format

If you send raw prompts (for example, via the /v1/completions endpoint), Falcon-instruct models expect this format:

def format_falcon_prompt(user_message: str) -> str:
    return f"User: {user_message}\nAssistant:"

# For multi-turn conversations
def format_conversation(messages: list) -> str:
    prompt = ""
    for msg in messages:
        if msg["role"] == "user":
            prompt += f"User: {msg['content']}\n"
        elif msg["role"] == "assistant":
            prompt += f"Assistant: {msg['content']}\n"
    prompt += "Assistant:"
    return prompt
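
These helpers are useful when you call the raw /v1/completions endpoint, which takes a plain prompt string rather than a messages list. A sketch using format_conversation from above; the stop sequence is an assumption to keep the model from starting a new turn:

import requests

# Build a Falcon-style prompt from a short conversation and send it to the
# completions endpoint, which accepts a pre-formatted prompt string.
messages = [
    {"role": "user", "content": "What is vLLM?"},
    {"role": "assistant", "content": "vLLM is a high-throughput LLM inference engine."},
    {"role": "user", "content": "How does it manage GPU memory?"}
]

prompt = format_conversation(messages)

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "tiiuae/falcon-7b-instruct",
        "prompt": prompt,
        "max_tokens": 256,
        "temperature": 0.7,
        "stop": ["\nUser:"]  # stop before the model begins another user turn
    }
)
print(response.json()["choices"][0]["text"])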

Optimizing Performance

Memory Optimization

# Cap GPU memory utilization and limit the context length to bound KV-cache growth
python -m vllm.entrypoints.openai.api_server \
  --model tiiuae/falcon-7b-instruct \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --max-model-len 2048
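
To see why capping the context length matters, a rough per-sequence KV-cache estimate helps. The layer and head numbers below are approximate Falcon-7B values (it uses multi-query attention, so a single KV head); verify them against the model's config.json:

# Rough KV-cache sizing: bytes per token = 2 (K and V) * layers * kv_heads * head_dim * dtype bytes
num_layers = 32       # approximate Falcon-7B value; check config.json
num_kv_heads = 1      # Falcon-7B uses multi-query attention
head_dim = 64
bytes_per_elem = 2    # fp16/bf16

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

for max_len in (2048, 4096, 8192):
    mb_per_seq = max_len * bytes_per_token / (1024 ** 2)
    print(f"max_model_len={max_len}: ~{mb_per_seq:.0f} MB of KV cache per full-length sequence")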

Batching Configuration

from vllm import LLM, SamplingParams

# Initialize with custom settings
llm = LLM(
    model="tiiuae/falcon-7b-instruct",
    trust_remote_code=True,
    max_num_batched_tokens=8192,
    max_num_seqs=256
)

# Generate for multiple prompts efficiently
prompts = [
    "Explain machine learning.",
    "What is blockchain?",
    "Describe cloud computing."
]

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=256
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Response: {output.outputs[0].text}\n")

Production Deployment

Docker Container

FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

WORKDIR /app

RUN apt-get update && apt-get install -y python3-pip
RUN pip3 install vllm einops

EXPOSE 8000

CMD ["python3", "-m", "vllm.entrypoints.openai.api_server", \
     "--model", "tiiuae/falcon-7b-instruct", \
     "--trust-remote-code", \
     "--port", "8000"]

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: falcon-vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: falcon-vllm
  template:
    metadata:
      labels:
        app: falcon-vllm
    spec:
      containers:
      - name: vllm
        image: myorg/falcon-vllm:latest
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8000
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
---
apiVersion: v1
kind: Service
metadata:
  name: falcon-vllm
spec:
  selector:
    app: falcon-vllm
  ports:
  - port: 8000
    targetPort: 8000

Benchmarking

import time
import requests
from concurrent.futures import ThreadPoolExecutor

def send_request():
    start = time.time()
    response = requests.post(
        "http://localhost:8000/v1/completions",
        json={
            "model": "tiiuae/falcon-7b-instruct",
            "prompt": "The future of AI is",
            "max_tokens": 100
        }
    )
    return time.time() - start

# Concurrent load test
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(send_request) for _ in range(100)]
    latencies = [f.result() for f in futures]

print(f"Average latency: {sum(latencies)/len(latencies):.2f}s")
print(f"P95 latency: {sorted(latencies)[95]:.2f}s")

Conclusion

Falcon + vLLM provides:

  • High-performance open-source LLM inference
  • UAE-developed model supporting regional AI initiatives
  • Production-ready deployment with familiar APIs

At Sajima Solutions, we deploy and fine-tune LLMs for organizations across the Gulf. Contact us to bring AI capabilities to your business.