Qwen2.5, developed by Alibaba Cloud, offers exceptional multilingual capabilities, with particularly strong Arabic, English, and Chinese support. This makes it ideal for applications serving the Gulf region's diverse linguistic landscape.
Why Qwen2.5?
- Multilingual excellence: Strong Arabic and English support
- Multiple sizes: 0.5B to 72B parameters
- Extended context: Up to 128K tokens
- Code understanding: Excellent for technical applications
- Open weights: Apache 2.0 license
Setting Up vLLM with Qwen
```bash
pip install vllm transformers

# Start the server
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --port 8000
```
For the larger 72B model:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 8 \
  --port 8000
```
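A quick back-of-envelope check of why the 72B model needs eight GPUs: at bf16 precision the weights alone take roughly two bytes per parameter, before any KV cache or activation overhead. The figures below are rough assumptions for intuition, not exact vLLM memory measurements.

```python
# Rough memory estimate for serving Qwen2.5-72B in bf16.
# Assumption: 2 bytes per parameter, ignoring KV cache and overhead.
PARAMS = 72e9
BYTES_PER_PARAM = 2  # bf16
num_gpus = 8

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9  # total weight memory in GB
per_gpu_gb = weights_gb / num_gpus           # per-GPU share under tensor parallelism

print(f"Total weights: {weights_gb:.0f} GB")          # ~144 GB
print(f"Per GPU across {num_gpus} GPUs: {per_gpu_gb:.0f} GB")  # ~18 GB
```

In practice each GPU also needs headroom for the KV cache, which is why vLLM still wants high-memory cards even after splitting the weights eight ways.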
Arabic Language Support
Qwen2.5 handles Arabic exceptionally well:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Arabic query: "Explain cloud computing to me in a simple way"
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "اشرح لي ما هي الحوسبة السحابية بطريقة بسيطة"
        }
    ],
    max_tokens=512
)
print(response.choices[0].message.content)
```
Mixed Language Conversations
```python
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant fluent in Arabic and English. "
                   "Respond in the same language as the user's query."
    },
    {
        "role": "user",
        "content": "What is machine learning? Then explain it in Arabic."
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=messages,
    max_tokens=1024
)
print(response.choices[0].message.content)
```
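The system prompt above leaves language matching entirely to the model. If you need deterministic routing, for example to pick an Arabic or English prompt template before the request is sent, a simple heuristic based on the Arabic Unicode block (U+0600 to U+06FF) is often enough. `detect_language` below is an illustrative helper, not part of any library:

```python
# Heuristic language routing: classify text as Arabic if most of its
# alphabetic characters fall in the Arabic Unicode block (U+0600-U+06FF).
# A sketch for illustration; use a proper language-ID library for production.
def detect_language(text: str) -> str:
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return "unknown"
    arabic = sum(1 for c in letters if "\u0600" <= c <= "\u06FF")
    return "ar" if arabic / len(letters) > 0.5 else "en"

print(detect_language("ما هي الحوسبة السحابية؟"))   # ar
print(detect_language("What is cloud computing?"))  # en
```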
Building a Translation Service
```python
from fastapi import FastAPI
from pydantic import BaseModel
from openai import OpenAI

app = FastAPI()
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

class TranslationRequest(BaseModel):
    text: str
    source_lang: str
    target_lang: str

@app.post("/translate")
async def translate(request: TranslationRequest):
    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[
            {
                "role": "system",
                "content": f"You are a professional translator. "
                           f"Translate from {request.source_lang} to {request.target_lang}. "
                           f"Only output the translation, nothing else."
            },
            {
                "role": "user",
                "content": request.text
            }
        ],
        max_tokens=2048,
        temperature=0.3
    )
    return {"translation": response.choices[0].message.content}
```
Long Context Applications
Qwen2.5 supports a context window of up to 128K tokens:
```python
# Document Q&A with long context
def answer_from_document(document: str, question: str) -> str:
    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[
            {
                "role": "system",
                "content": "Answer questions based on the provided document. "
                           "If the answer isn't in the document, say so."
            },
            {
                "role": "user",
                "content": f"Document:\n{document}\n\nQuestion: {question}"
            }
        ],
        max_tokens=1024
    )
    return response.choices[0].message.content

# Process a long legal document
with open("contract.txt", "r") as f:
    contract = f.read()  # Can be 100K+ tokens

answer = answer_from_document(
    contract,
    "What are the termination conditions?"
)
```
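Even with a 128K window, it is worth guarding against oversized inputs before the request fails server-side. The sketch below uses a rough four-characters-per-token heuristic, which is an assumption: the real count depends on Qwen's tokenizer, and Arabic text tokenizes differently from English, so treat the budget conservatively.

```python
# Guard against blowing the context window before sending a request.
# Assumes ~4 characters per token -- only an approximation of the
# actual Qwen tokenizer; leave headroom for the prompt and the answer.
def truncate_to_budget(document: str, max_tokens: int = 120_000,
                       chars_per_token: int = 4) -> str:
    max_chars = max_tokens * chars_per_token
    if len(document) <= max_chars:
        return document
    return document[:max_chars]

doc = "x" * 1_000_000
print(len(truncate_to_budget(doc)))  # 480000
```

For exact counts you can tokenize with the model's own tokenizer (e.g. via the `transformers` `AutoTokenizer` for `Qwen/Qwen2.5-7B-Instruct`) instead of the character heuristic.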
Function Calling
Qwen2.5 supports tool use:
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g., Dubai, Abu Dhabi"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "convert_currency",
            "description": "Convert between currencies",
            "parameters": {
                "type": "object",
                "properties": {
                    "amount": {"type": "number"},
                    "from_currency": {"type": "string"},
                    "to_currency": {"type": "string"}
                },
                "required": ["amount", "from_currency", "to_currency"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "What's the weather in Dubai and convert 1000 AED to USD"}
    ],
    tools=tools,
    tool_choice="auto"
)

# Process tool calls
for tool_call in response.choices[0].message.tool_calls:
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
```
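The loop above only prints the calls. In a real agent you would execute each tool and send the result back to the model as a `tool` message so it can compose a final answer. The sketch below shows that dispatch pattern with placeholder implementations of `get_weather` and `convert_currency`; the weather values and exchange rate are hypothetical, not live data.

```python
import json

# Placeholder tool implementations -- hypothetical values for illustration.
def get_weather(location: str) -> dict:
    return {"location": location, "temp_c": 38, "condition": "sunny"}

def convert_currency(amount: float, from_currency: str, to_currency: str) -> dict:
    rates = {("AED", "USD"): 0.27}  # illustrative rate, not live data
    rate = rates.get((from_currency, to_currency), 1.0)
    return {"amount": amount * rate, "currency": to_currency}

TOOLS = {"get_weather": get_weather, "convert_currency": convert_currency}

def dispatch(name: str, arguments: str) -> str:
    """Run one tool call and return its result as JSON for a `tool` message."""
    result = TOOLS[name](**json.loads(arguments))
    return json.dumps(result)

# Each result is appended to the conversation as:
#   {"role": "tool", "tool_call_id": tool_call.id, "content": dispatch(...)}
# before calling the model again for its final answer.
print(dispatch("convert_currency",
               '{"amount": 1000, "from_currency": "AED", "to_currency": "USD"}'))
```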
Optimizing for Production
Quantization for Efficiency
```bash
# AWQ quantization for faster inference
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct-AWQ \
  --quantization awq \
  --port 8000
```
Speculative Decoding
Use a smaller model to speed up generation:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --speculative-model Qwen/Qwen2.5-0.5B-Instruct \
  --num-speculative-tokens 5 \
  --port 8000
```
Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-vllm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: qwen-vllm
  template:
    metadata:
      labels:
        app: qwen-vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "Qwen/Qwen2.5-7B-Instruct"
        - "--port"
        - "8000"
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
          requests:
            memory: "24Gi"
        ports:
        - containerPort: 8000
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
---
apiVersion: v1
kind: Service
metadata:
  name: qwen-vllm
spec:
  type: LoadBalancer
  selector:
    app: qwen-vllm
  ports:
  - port: 80
    targetPort: 8000
```
Conclusion
Qwen2.5 with vLLM provides:
- Excellent multilingual support for Arabic, English, and more
- Long context for document processing
- Tool use for agentic applications
- Production-ready deployment options
At Sajima Solutions, we deploy multilingual AI solutions tailored for the Gulf region. Contact us to bring intelligent language capabilities to your applications.