Qwen2.5, developed by Alibaba Cloud, offers strong multilingual capabilities, with particularly deep coverage of Arabic, English, and Chinese. This makes it a natural fit for applications serving the Gulf region's diverse linguistic landscape.

Why Qwen2.5?

  • Multilingual excellence: Strong Arabic and English support
  • Multiple sizes: 0.5B to 72B parameters
  • Extended context: Up to 128K tokens
  • Code understanding: Excellent for technical applications
  • Open weights: Apache 2.0 license

Setting Up vLLM with Qwen

pip install vllm transformers

# Start the server
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --port 8000

For the larger 72B model:

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 8 \
  --port 8000
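Before provisioning hardware, a quick back-of-envelope check helps: at 2 bytes per parameter for fp16 weights (ignoring KV cache and runtime overhead), the 72B model needs roughly 144 GB for weights alone, which is why it is sharded across 8 GPUs above. A sketch of that arithmetic:

```python
def weight_memory_per_gpu_gb(params_billions: float, bytes_per_param: int = 2,
                             tensor_parallel: int = 1) -> float:
    """Weight-only memory estimate; KV cache and overhead come on top."""
    total_gb = params_billions * 1e9 * bytes_per_param / 1e9
    return total_gb / tensor_parallel

# Qwen2.5-72B in fp16 across 8 GPUs: ~18 GB of weights per GPU
print(weight_memory_per_gpu_gb(72, bytes_per_param=2, tensor_parallel=8))  # 18.0
# Qwen2.5-7B fits on a single large GPU: ~14 GB of weights
print(weight_memory_per_gpu_gb(7))  # 14.0
```

In practice you should leave a healthy margin on top of these numbers for the KV cache, which grows with batch size and context length.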

Arabic Language Support

Qwen2.5 handles Arabic exceptionally well:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Arabic query: "Explain cloud computing to me in a simple way"
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "اشرح لي ما هي الحوسبة السحابية بطريقة بسيطة"
        }
    ],
    max_tokens=512
)

print(response.choices[0].message.content)

Mixed Language Conversations

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant fluent in Arabic and English. "
                   "Respond in the same language as the user's query."
    },
    {
        "role": "user",
        "content": "What is machine learning? Then explain it in Arabic."
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=messages,
    max_tokens=1024
)

print(response.choices[0].message.content)
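Rather than relying on the system prompt alone, you can also route requests by detecting the query's script up front. A small illustrative helper (the function name and threshold are assumptions, not part of any library) checks whether most letters fall in the Arabic Unicode blocks:

```python
def is_mostly_arabic(text: str, threshold: float = 0.5) -> bool:
    """Return True if most letters fall in the Arabic Unicode blocks."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    arabic = sum('\u0600' <= ch <= '\u06FF' or '\u0750' <= ch <= '\u077F'
                 for ch in letters)
    return arabic / len(letters) >= threshold

print(is_mostly_arabic("ما هي الحوسبة السحابية؟"))   # True
print(is_mostly_arabic("What is cloud computing?"))  # False
```

The result can then drive which system prompt, or which target language, you hand to the model.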

Building a Translation Service

from fastapi import FastAPI
from pydantic import BaseModel
from openai import OpenAI

app = FastAPI()
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

class TranslationRequest(BaseModel):
    text: str
    source_lang: str
    target_lang: str

@app.post("/translate")
async def translate(request: TranslationRequest):
    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[
            {
                "role": "system",
                "content": f"You are a professional translator. "
                          f"Translate from {request.source_lang} to {request.target_lang}. "
                          f"Only output the translation, nothing else."
            },
            {
                "role": "user",
                "content": request.text
            }
        ],
        max_tokens=2048,
        temperature=0.3
    )
    
    return {"translation": response.choices[0].message.content}

Long Context Applications

Qwen2.5 supports up to 128K tokens of context (note that the instruct checkpoints ship with a 32K default; reaching 128K requires enabling the YaRN rope-scaling configuration described on the model card):

# Document Q&A with long context
def answer_from_document(document: str, question: str) -> str:
    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[
            {
                "role": "system",
                "content": "Answer questions based on the provided document. "
                          "If the answer isn't in the document, say so."
            },
            {
                "role": "user",
                "content": f"Document:\n{document}\n\nQuestion: {question}"
            }
        ],
        max_tokens=1024
    )
    
    return response.choices[0].message.content

# Process a long legal document
with open("contract.txt", "r", encoding="utf-8") as f:
    contract = f.read()  # Can be 100K+ tokens

answer = answer_from_document(
    contract,
    "What are the termination conditions?"
)
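Even 128K tokens will not cover every document. A simple fallback is to split the text into overlapping chunks and query each one; the sketch below is character-based (using the common rough rule of ~4 characters per token), not a tokenizer-accurate splitter:

```python
def chunk_text(text: str, chunk_chars: int = 200_000, overlap: int = 2_000):
    """Split text into overlapping character chunks for per-chunk Q&A."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_chars])
        if start + chunk_chars >= len(text):
            break
        start += chunk_chars - overlap  # step back so chunks overlap slightly

    return chunks

# A 450K-character document becomes three overlapping chunks
parts = chunk_text("x" * 450_000)
print(len(parts))  # 3
```

Each chunk can then be passed through answer_from_document and the per-chunk answers combined in a final summarization call.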

Function Calling

Qwen2.5 supports tool use. For vLLM to parse tool calls into structured objects, the server generally needs to be started with --enable-auto-tool-choice --tool-call-parser hermes:

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g., Dubai, Abu Dhabi"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "convert_currency",
            "description": "Convert between currencies",
            "parameters": {
                "type": "object",
                "properties": {
                    "amount": {"type": "number"},
                    "from_currency": {"type": "string"},
                    "to_currency": {"type": "string"}
                },
                "required": ["amount", "from_currency", "to_currency"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "What's the weather in Dubai and convert 1000 AED to USD"}
    ],
    tools=tools,
    tool_choice="auto"
)

# Process tool calls (tool_calls is None if the model answered directly)
for tool_call in response.choices[0].message.tool_calls or []:
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
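The model only emits the calls; your application still has to execute them. A minimal dispatch sketch follows, where both tool implementations are placeholders (a fixed exchange rate and canned weather, not real lookups):

```python
import json

def get_weather(location: str) -> dict:
    # Placeholder: a real implementation would call a weather API
    return {"location": location, "condition": "sunny", "temp_c": 38}

def convert_currency(amount: float, from_currency: str, to_currency: str) -> dict:
    # Placeholder fixed rate; a real implementation would fetch live rates
    rates = {("AED", "USD"): 0.2723}
    rate = rates[(from_currency, to_currency)]
    return {"amount": round(amount * rate, 2), "currency": to_currency}

TOOL_REGISTRY = {"get_weather": get_weather, "convert_currency": convert_currency}

def dispatch(name: str, arguments: str) -> dict:
    """Look up the tool by name and call it with the JSON-encoded arguments."""
    return TOOL_REGISTRY[name](**json.loads(arguments))

print(dispatch("convert_currency",
               '{"amount": 1000, "from_currency": "AED", "to_currency": "USD"}'))
# {'amount': 272.3, 'currency': 'USD'}
```

Each result is then appended to the conversation as a "tool" role message and the conversation is sent back to the model to produce the final answer.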

Optimizing for Production

Quantization for Efficiency

# AWQ 4-bit quantization for lower memory use (and often faster memory-bound inference)
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct-AWQ \
  --quantization awq \
  --port 8000
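The memory savings are easy to estimate with weight-only, back-of-envelope arithmetic: 4-bit AWQ stores weights at roughly a quarter of their fp16 size.

```python
def weight_gb(params_billions: float, bits_per_param: float) -> float:
    """Weight-only memory; KV cache and runtime overhead are extra."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

print(weight_gb(7, 16))  # fp16:      14.0 GB
print(weight_gb(7, 4))   # AWQ 4-bit:  3.5 GB
```

The freed memory goes to a larger KV cache, which in turn allows bigger batches and longer contexts on the same GPU.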

Speculative Decoding

Use a smaller model to speed up generation:

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --speculative-model Qwen/Qwen2.5-0.5B-Instruct \
  --num-speculative-tokens 5 \
  --port 8000
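To build intuition for why drafting helps: under the simplifying assumption that the target model accepts each drafted token independently with probability p, the expected number of tokens produced per verification step with k drafted tokens is (1 - p^(k+1)) / (1 - p). The values of p and k below are illustrative, not measured:

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per target-model verification step,
    assuming i.i.d. acceptance with probability p and k drafted tokens."""
    return (1 - p ** (k + 1)) / (1 - p)

# With a 70% acceptance rate and 5 drafted tokens, each verification step
# yields ~2.94 tokens instead of 1
print(round(expected_tokens_per_step(0.7, 5), 2))  # 2.94
```

The actual speedup is smaller than this ratio because the draft model's forward passes are not free, but for a 0.5B draft against a 7B target they are cheap.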

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-vllm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: qwen-vllm
  template:
    metadata:
      labels:
        app: qwen-vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "Qwen/Qwen2.5-7B-Instruct"
        - "--port"
        - "8000"
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
          requests:
            memory: "24Gi"
        ports:
        - containerPort: 8000
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
---
apiVersion: v1
kind: Service
metadata:
  name: qwen-vllm
spec:
  type: LoadBalancer
  selector:
    app: qwen-vllm
  ports:
  - port: 80
    targetPort: 8000

Conclusion

Qwen2.5 with vLLM provides:

  • Excellent multilingual support for Arabic, English, and more
  • Long context for document processing
  • Tool use for agentic applications
  • Production-ready deployment options

At Sajima Solutions, we deploy multilingual AI solutions tailored for the Gulf region. Contact us to bring intelligent language capabilities to your applications.