Qwen2.5, developed by Alibaba Cloud, offers strong multilingual capabilities, with particularly deep coverage of Arabic, English, and Chinese. This makes it a natural fit for applications serving the Gulf region's diverse linguistic landscape.

Why Qwen2.5?

  • Multilingual excellence: Strong Arabic and English support
  • Multiple sizes: 0.5B to 72B parameters
  • Extended context: Up to 128K tokens
  • Code understanding: Excellent for technical applications
  • Open weights: Apache 2.0 license

Setting Up vLLM with Qwen

pip install vllm transformers

# Start the server
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --port 8000

For the larger 72B model:

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 8 \
  --port 8000
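Before provisioning hardware, a quick back-of-envelope check helps: at 2 bytes per parameter for fp16 weights (ignoring KV cache and runtime overhead), the 72B model needs roughly 144 GB for weights alone, which is why it is sharded across 8 GPUs above. A sketch of that arithmetic:

```python
def weight_memory_per_gpu_gb(params_billions: float, bytes_per_param: int = 2,
                             tensor_parallel: int = 1) -> float:
    """Weight-only memory estimate; KV cache and overhead come on top."""
    total_gb = params_billions * 1e9 * bytes_per_param / 1e9
    return total_gb / tensor_parallel

# Qwen2.5-72B in fp16 across 8 GPUs: ~18 GB of weights per GPU
print(weight_memory_per_gpu_gb(72, bytes_per_param=2, tensor_parallel=8))  # 18.0
# Qwen2.5-7B fits on a single large GPU: ~14 GB of weights
print(weight_memory_per_gpu_gb(7))  # 14.0
```

In practice you should leave a healthy margin on top of these numbers for the KV cache, which grows with batch size and context length.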

Arabic Language Support

Qwen2.5 handles Arabic exceptionally well:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

# Arabic query: "Explain cloud computing to me in a simple way"
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "اشرح لي ما هي الحوسبة السحابية بطريقة بسيطة"
        }
    ],
    max_tokens=512
)

print(response.choices[0].message.content)

Mixed Language Conversations

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant fluent in Arabic and English. "
                   "Respond in the same language as the user's query."
    },
    {
        "role": "user",
        "content": "What is machine learning? Then explain it in Arabic."
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=messages,
    max_tokens=1024
)

print(response.choices[0].message.content)
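Rather than relying on the system prompt alone, you can also route requests by detecting the query's script up front. A small illustrative helper (the function name and threshold are assumptions, not part of any library) checks whether most letters fall in the Arabic Unicode blocks:

```python
def is_mostly_arabic(text: str, threshold: float = 0.5) -> bool:
    """Return True if most letters fall in the Arabic Unicode blocks."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    arabic = sum('\u0600' <= ch <= '\u06FF' or '\u0750' <= ch <= '\u077F'
                 for ch in letters)
    return arabic / len(letters) >= threshold

print(is_mostly_arabic("ما هي الحوسبة السحابية؟"))   # True
print(is_mostly_arabic("What is cloud computing?"))  # False
```

The result can then drive which system prompt, or which target language, you hand to the model.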

Building a Translation Service

from fastapi import FastAPI
from pydantic import BaseModel
from openai import OpenAI

app = FastAPI()
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

class TranslationRequest(BaseModel):
    text: str
    source_lang: str
    target_lang: str

@app.post("/translate")
async def translate(request: TranslationRequest):
    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[
            {
                "role": "system",
                "content": f"You are a professional translator. "
                          f"Translate from {request.source_lang} to {request.target_lang}. "
                          f"Only output the translation, nothing else."
            },
            {
                "role": "user",
                "content": request.text
            }
        ],
        max_tokens=2048,
        temperature=0.3
    )
    
    return {"translation": response.choices[0].message.content}

Long Context Applications

Qwen2.5 supports up to 128K tokens of context (note that the instruct checkpoints ship with a 32K default; reaching 128K requires enabling the YaRN rope-scaling configuration described on the model card):

# Document Q&A with long context
def answer_from_document(document: str, question: str) -> str:
    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[
            {
                "role": "system",
                "content": "Answer questions based on the provided document. "
                          "If the answer isn't in the document, say so."
            },
            {
                "role": "user",
                "content": f"Document:\n{document}\n\nQuestion: {question}"
            }
        ],
        max_tokens=1024
    )
    
    return response.choices[0].message.content

# Process a long legal document
with open("contract.txt", "r", encoding="utf-8") as f:
    contract = f.read()  # Can be 100K+ tokens

answer = answer_from_document(
    contract,
    "What are the termination conditions?"
)
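Even 128K tokens will not cover every document. A simple fallback is to split the text into overlapping chunks and query each one; the sketch below is character-based (using the common rough rule of ~4 characters per token), not a tokenizer-accurate splitter:

```python
def chunk_text(text: str, chunk_chars: int = 200_000, overlap: int = 2_000):
    """Split text into overlapping character chunks for per-chunk Q&A."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_chars])
        if start + chunk_chars >= len(text):
            break
        start += chunk_chars - overlap  # step back so chunks overlap slightly

    return chunks

# A 450K-character document becomes three overlapping chunks
parts = chunk_text("x" * 450_000)
print(len(parts))  # 3
```

Each chunk can then be passed through answer_from_document and the per-chunk answers combined in a final summarization call.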

Function Calling

Qwen2.5 supports tool use. For vLLM to parse tool calls into structured objects, the server generally needs to be started with --enable-auto-tool-choice --tool-call-parser hermes:

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g., Dubai, Abu Dhabi"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "convert_currency",
            "description": "Convert between currencies",
            "parameters": {
                "type": "object",
                "properties": {
                    "amount": {"type": "number"},
                    "from_currency": {"type": "string"},
                    "to_currency": {"type": "string"}
                },
                "required": ["amount", "from_currency", "to_currency"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "What's the weather in Dubai and convert 1000 AED to USD"}
    ],
    tools=tools,
    tool_choice="auto"
)

# Process tool calls (tool_calls is None if the model answered directly)
for tool_call in response.choices[0].message.tool_calls or []:
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
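The model only emits the calls; your application still has to execute them. A minimal dispatch sketch follows, where both tool implementations are placeholders (a fixed exchange rate and canned weather, not real lookups):

```python
import json

def get_weather(location: str) -> dict:
    # Placeholder: a real implementation would call a weather API
    return {"location": location, "condition": "sunny", "temp_c": 38}

def convert_currency(amount: float, from_currency: str, to_currency: str) -> dict:
    # Placeholder fixed rate; a real implementation would fetch live rates
    rates = {("AED", "USD"): 0.2723}
    rate = rates[(from_currency, to_currency)]
    return {"amount": round(amount * rate, 2), "currency": to_currency}

TOOL_REGISTRY = {"get_weather": get_weather, "convert_currency": convert_currency}

def dispatch(name: str, arguments: str) -> dict:
    """Look up the tool by name and call it with the JSON-encoded arguments."""
    return TOOL_REGISTRY[name](**json.loads(arguments))

print(dispatch("convert_currency",
               '{"amount": 1000, "from_currency": "AED", "to_currency": "USD"}'))
# {'amount': 272.3, 'currency': 'USD'}
```

Each result is then appended to the conversation as a "tool" role message and the conversation is sent back to the model to produce the final answer.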

Optimizing for Production

Quantization for Efficiency

# AWQ 4-bit quantization for lower memory use (and often faster memory-bound inference)
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct-AWQ \
  --quantization awq \
  --port 8000
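The memory savings are easy to estimate with weight-only, back-of-envelope arithmetic: 4-bit AWQ stores weights at roughly a quarter of their fp16 size.

```python
def weight_gb(params_billions: float, bits_per_param: float) -> float:
    """Weight-only memory; KV cache and runtime overhead are extra."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

print(weight_gb(7, 16))  # fp16:      14.0 GB
print(weight_gb(7, 4))   # AWQ 4-bit:  3.5 GB
```

The freed memory goes to a larger KV cache, which in turn allows bigger batches and longer contexts on the same GPU.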

Speculative Decoding

Use a smaller model to speed up generation:

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --speculative-model Qwen/Qwen2.5-0.5B-Instruct \
  --num-speculative-tokens 5 \
  --port 8000
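To build intuition for why drafting helps: under the simplifying assumption that the target model accepts each drafted token independently with probability p, the expected number of tokens produced per verification step with k drafted tokens is (1 - p^(k+1)) / (1 - p). The values of p and k below are illustrative, not measured:

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per target-model verification step,
    assuming i.i.d. acceptance with probability p and k drafted tokens."""
    return (1 - p ** (k + 1)) / (1 - p)

# With a 70% acceptance rate and 5 drafted tokens, each verification step
# yields ~2.94 tokens instead of 1
print(round(expected_tokens_per_step(0.7, 5), 2))  # 2.94
```

The actual speedup is smaller than this ratio because the draft model's forward passes are not free, but for a 0.5B draft against a 7B target they are cheap.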

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-vllm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: qwen-vllm
  template:
    metadata:
      labels:
        app: qwen-vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "Qwen/Qwen2.5-7B-Instruct"
        - "--port"
        - "8000"
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
          requests:
            memory: "24Gi"
        ports:
        - containerPort: 8000
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
---
apiVersion: v1
kind: Service
metadata:
  name: qwen-vllm
spec:
  type: LoadBalancer
  selector:
    app: qwen-vllm
  ports:
  - port: 80
    targetPort: 8000

Conclusion

Qwen2.5 with vLLM provides:

  • Excellent multilingual support for Arabic, English, and more
  • Long context for document processing
  • Tool use for agentic applications
  • Production-ready deployment options

At Sajima Solutions, we deploy multilingual AI solutions tailored for the Gulf region. Contact us to bring intelligent language capabilities to your applications.