Enterprises across finance, legal, healthcare, and customer service are deploying AI assistants that understand their specific domain. The challenge isn’t building the AI—it’s deploying it at scale, keeping costs reasonable, and maintaining response times that users expect.
This guide shows how to deploy production AI systems that handle thousands of concurrent requests while keeping infrastructure costs manageable. We’ll use UAE-developed Falcon models as an example, but the patterns apply to any large language model.
Why Falcon?
Falcon models offer:
- Open weights: Commercially usable (Falcon 180B uses a custom license)
- Strong performance: Competitive with models 2-3x larger
- Regional significance: Developed in the UAE
- Multiple sizes: 7B, 40B, and 180B parameters
Why vLLM?
vLLM provides:
- PagedAttention: Efficient memory management for longer contexts
- Continuous batching: Handle multiple requests efficiently
- CUDA optimizations: Faster than naive implementations
- OpenAI-compatible API: Drop-in replacement for existing code
Installation
```bash
# Create virtual environment
python -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM
pip install vllm

# For Falcon specifically, you may need
pip install einops
```
Serving Falcon-7B
Start the server:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model tiiuae/falcon-7b-instruct \
    --trust-remote-code \
    --port 8000 \
    --tensor-parallel-size 1
```
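Once the server is up, a quick way to confirm that the model is loaded and the OpenAI-compatible API is responding is to list the registered models:

```bash
# Should return a JSON list containing tiiuae/falcon-7b-instruct
curl http://localhost:8000/v1/models
```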
For Falcon-40B (requires multiple GPUs):
```bash
python -m vllm.entrypoints.openai.api_server \
    --model tiiuae/falcon-40b-instruct \
    --trust-remote-code \
    --port 8000 \
    --tensor-parallel-size 4
```
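The --tensor-parallel-size value should match the number of GPUs you want to shard the model across, so check what the machine actually exposes before launching:

```bash
# The GPU count reported here should be at least --tensor-parallel-size
nvidia-smi --list-gpus
```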
Using the API
Chat Completions
```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "tiiuae/falcon-7b-instruct",
        "messages": [
            {
                "role": "user",
                "content": "Explain quantum computing in simple terms."
            }
        ],
        "max_tokens": 256,
        "temperature": 0.7
    }
)

result = response.json()
print(result["choices"][0]["message"]["content"])
```
Streaming Responses
```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "tiiuae/falcon-7b-instruct",
        "messages": [
            {"role": "user", "content": "Write a short poem about Dubai."}
        ],
        "max_tokens": 256,
        "stream": True
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        print(line.decode())
```
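Each line of the raw stream is a server-sent event of the form data: {json chunk}, ending with a data: [DONE] sentinel. Here is a minimal sketch for printing only the generated text as it arrives; the chunk fields follow the OpenAI streaming schema that vLLM emits, so verify them against your vLLM version:

```python
import json
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "tiiuae/falcon-7b-instruct",
        "messages": [{"role": "user", "content": "Write a short poem about Dubai."}],
        "max_tokens": 256,
        "stream": True
    },
    stream=True
)

for line in response.iter_lines():
    if not line:
        continue
    event = line.decode()
    if not event.startswith("data: "):
        continue
    data = event[len("data: "):]
    if data == "[DONE]":  # sentinel marking the end of the stream
        break
    chunk = json.loads(data)
    # Each chunk carries an incremental piece of the assistant message
    delta = chunk["choices"][0]["delta"].get("content", "")
    print(delta, end="", flush=True)
print()
```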
With OpenAI SDK
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="tiiuae/falcon-7b-instruct",
    messages=[
        {"role": "user", "content": "What are the benefits of cloud computing?"}
    ],
    max_tokens=256
)

print(response.choices[0].message.content)
```
Prompt Formatting
Falcon-instruct models expect prompts in this format:
```python
def format_falcon_prompt(user_message: str) -> str:
    return f"User: {user_message}\nAssistant:"

# For multi-turn conversations
def format_conversation(messages: list) -> str:
    prompt = ""
    for msg in messages:
        if msg["role"] == "user":
            prompt += f"User: {msg['content']}\n"
        elif msg["role"] == "assistant":
            prompt += f"Assistant: {msg['content']}\n"
    prompt += "Assistant:"
    return prompt
```
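These helpers are for the raw /v1/completions endpoint, which passes your prompt through unchanged rather than applying a chat template. A small sketch of how they might be used, reusing format_conversation from above (the prompt content is just an example):

```python
import requests

messages = [
    {"role": "user", "content": "Summarize the benefits of solar energy."}
]

# Build a Falcon-style prompt and send it to the raw completions endpoint
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "tiiuae/falcon-7b-instruct",
        "prompt": format_conversation(messages),
        "max_tokens": 256,
        "temperature": 0.7
    }
)

print(response.json()["choices"][0]["text"])
```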
Memory Optimization
```bash
# Bound vLLM's GPU memory usage and cap the context length to fit smaller GPUs
python -m vllm.entrypoints.openai.api_server \
    --model tiiuae/falcon-7b-instruct \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --max-model-len 2048
```
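If that is still not enough, recent vLLM releases can also store the KV cache in FP8, which roughly halves its memory footprint. The flag below exists only in newer versions, so treat it as an option to verify against the version you installed:

```bash
# Quantize the KV cache to FP8 (newer vLLM versions)
python -m vllm.entrypoints.openai.api_server \
    --model tiiuae/falcon-7b-instruct \
    --trust-remote-code \
    --kv-cache-dtype fp8
```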
Batching Configuration
```python
from vllm import LLM, SamplingParams

# Initialize with custom batching limits
llm = LLM(
    model="tiiuae/falcon-7b-instruct",
    trust_remote_code=True,
    max_num_batched_tokens=8192,
    max_num_seqs=256
)

# Generate for multiple prompts efficiently
prompts = [
    "Explain machine learning.",
    "What is blockchain?",
    "Describe cloud computing."
]

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=256
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Response: {output.outputs[0].text}\n")
```
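As a rule of thumb, max_num_seqs caps how many requests can sit in a batch and max_num_batched_tokens caps the total tokens processed per scheduler step; raising them tends to improve throughput at the cost of per-request latency and GPU memory headroom, so tune both against your own traffic.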
Production Deployment
Docker Container
```dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

WORKDIR /app

RUN apt-get update && apt-get install -y python3-pip

RUN pip3 install vllm einops

EXPOSE 8000

CMD ["python3", "-m", "vllm.entrypoints.openai.api_server", \
     "--model", "tiiuae/falcon-7b-instruct", \
     "--trust-remote-code", \
     "--port", "8000"]
```
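To build and run the container locally (the image tag is just an example; --gpus all requires the NVIDIA Container Toolkit on the host):

```bash
docker build -t myorg/falcon-vllm:latest .
docker run --gpus all -p 8000:8000 myorg/falcon-vllm:latest
```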
Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: falcon-vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: falcon-vllm
  template:
    metadata:
      labels:
        app: falcon-vllm
    spec:
      containers:
        - name: vllm
          image: myorg/falcon-vllm:latest
          resources:
            limits:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8000
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
---
apiVersion: v1
kind: Service
metadata:
  name: falcon-vllm
spec:
  selector:
    app: falcon-vllm
  ports:
    - port: 8000
      targetPort: 8000
```
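Apply the manifests and port-forward the service to test it before putting an ingress or load balancer in front (the manifest file name is assumed):

```bash
kubectl apply -f falcon-vllm.yaml
kubectl port-forward svc/falcon-vllm 8000:8000
```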
Benchmarking
```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

def send_request():
    start = time.time()
    requests.post(
        "http://localhost:8000/v1/completions",
        json={
            "model": "tiiuae/falcon-7b-instruct",
            "prompt": "The future of AI is",
            "max_tokens": 100
        }
    )
    return time.time() - start

# Concurrent load test: 100 requests, 10 in flight at a time
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(send_request) for _ in range(100)]
    latencies = [f.result() for f in futures]

latencies.sort()
print(f"Average latency: {sum(latencies) / len(latencies):.2f}s")
print(f"P95 latency: {latencies[int(0.95 * len(latencies))]:.2f}s")
```
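Re-run the test at several concurrency levels and watch how the P95 latency grows; the point where it starts climbing sharply is roughly the throughput ceiling of a single replica, which is a useful input when sizing replicas in the Kubernetes deployment above.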
Conclusion
Falcon + vLLM provides:
- High-performance open-source LLM inference
- UAE-developed model supporting regional AI initiatives
- Production-ready deployment with familiar APIs
At Sajima Solutions, we deploy and fine-tune LLMs for organizations across the Gulf. Contact us to bring AI capabilities to your business.