Running AI models locally gives developers complete privacy, zero API costs, and the freedom to experiment without rate limits. Qwen—Alibaba’s open-weight model family—offers some of the best performance-per-parameter ratios available, making it ideal for local development workflows.

Why Run Qwen Locally?

  • Privacy: Your code never leaves your machine
  • Zero cost: No API fees or token limits
  • Offline access: Work anywhere without internet
  • Customization: Fine-tune for your specific needs
  • Low latency: No network round-trips

Quick Start with Ollama

The fastest way to get Qwen running locally:

# Install Ollama (Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull Qwen2.5 (3B is great for most dev machines)
ollama pull qwen2.5:3b

# Or the coding-focused variant
ollama pull qwen2.5-coder:7b

# Start chatting
ollama run qwen2.5:3b

For macOS:

brew install ollama
ollama serve &
ollama pull qwen2.5:3b

Choosing the Right Model Size

| Model            | VRAM Needed | Best For                          |
|------------------|-------------|-----------------------------------|
| qwen2.5:0.5b     | 1GB         | Quick queries, low-end hardware   |
| qwen2.5:3b       | 4GB         | General development, good balance |
| qwen2.5-coder:7b | 8GB         | Code generation, refactoring      |
| qwen2.5:14b      | 16GB        | Complex reasoning, longer context |
| qwen2.5:32b      | 24GB+       | Maximum capability                |
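Not sure which tier your machine can handle? On NVIDIA hardware you can check available VRAM directly (these query flags are standard `nvidia-smi` options):

```shell
# Show GPU name plus total and currently free VRAM in CSV form
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv
```

Compare the free memory against the table above, and leave a gigabyte or two of headroom for context.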

Using Qwen as a Coding Assistant

API Access from Your Code

Ollama exposes an OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but not validated
)

response = client.chat.completions.create(
    model="qwen2.5-coder:7b",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful coding assistant. Be concise."
        },
        {
            "role": "user",
            "content": "Write a Python function to parse ISO 8601 dates"
        }
    ],
    temperature=0.2  # Lower for more deterministic code
)

print(response.choices[0].message.content)
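The same endpoint also works from the shell, which is handy for quick smoke tests (this assumes Ollama is running on its default port):

```shell
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder:7b",
    "messages": [{"role": "user", "content": "Write a one-line Python hello world"}]
  }'
```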

Streaming Responses

For a better developer experience with long outputs:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

stream = client.chat.completions.create(
    model="qwen2.5-coder:7b",
    messages=[
        {"role": "user", "content": "Explain async/await in Python with examples"}
    ],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

IDE Integration

VS Code with Continue

Install the Continue extension and configure it for Ollama:

{
  "models": [
    {
      "title": "Qwen Coder",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen Fast",
    "provider": "ollama",
    "model": "qwen2.5:3b"
  }
}

Neovim with Avante or Codecompanion

For Neovim users, add to your config:

-- Using codecompanion.nvim
require("codecompanion").setup({
  adapters = {
    ollama = function()
      return require("codecompanion.adapters").extend("ollama", {
        schema = {
          model = { default = "qwen2.5-coder:7b" },
        },
      })
    end,
  },
  strategies = {
    chat = { adapter = "ollama" },
    inline = { adapter = "ollama" },
  },
})

Building a Local Code Review Tool

Create a simple script to review your git diffs:

#!/usr/bin/env python3
"""Review staged git changes with local Qwen."""

import subprocess
from openai import OpenAI

def get_staged_diff():
    result = subprocess.run(
        ["git", "diff", "--cached"],
        capture_output=True,
        text=True
    )
    return result.stdout

def review_code(diff: str) -> str:
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    
    response = client.chat.completions.create(
        model="qwen2.5-coder:7b",
        messages=[
            {
                "role": "system",
                "content": """You are a code reviewer. Analyze the diff and provide:
1. A brief summary of changes
2. Potential bugs or issues
3. Suggestions for improvement

Be concise and actionable."""
            },
            {
                "role": "user",
                "content": f"Review this diff:\n\n```diff\n{diff}\n```"
            }
        ],
        temperature=0.3
    )
    
    return response.choices[0].message.content

if __name__ == "__main__":
    diff = get_staged_diff()
    if not diff:
        print("No staged changes to review.")
    else:
        print(review_code(diff))

Save as git-review and add to your PATH for quick reviews before commits.
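Wiring it up might look like this (`~/.local/bin` and the `review` alias name are just examples; any directory on your PATH and any alias name work):

```shell
# Make the script executable and put it on your PATH
chmod +x git-review
mv git-review ~/.local/bin/

# Optionally expose it as a git subcommand
git config --global alias.review '!git-review'

# Then, before committing:
# git add -p && git review
```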

Running with llama.cpp for Maximum Performance

For even better performance, use llama.cpp directly with GGUF models:

# Clone and build llama.cpp (recent versions build with CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build -j

# Download a quantized Qwen model
# Visit huggingface.co/Qwen and look for GGUF versions

# Run the server (-ngl 99 offloads all layers to the GPU)
./build/bin/llama-server \
    -m qwen2.5-coder-7b-instruct-q4_k_m.gguf \
    -c 8192 \
    -ngl 99 \
    --port 8080
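llama-server also exposes an OpenAI-compatible /v1/chat/completions endpoint, so you can smoke-test it the same way (the port matches the `--port 8080` flag above; since only one model is loaded, the `model` field can be omitted):

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one word"}]}'
```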

Quantization Trade-offs

| Quantization | Size (7B) | Quality   | Speed   |
|--------------|-----------|-----------|---------|
| Q8_0         | ~7.5GB    | Best      | Slower  |
| Q6_K         | ~5.5GB    | Excellent | Good    |
| Q4_K_M       | ~4.0GB    | Great     | Fast    |
| Q4_0         | ~3.5GB    | Good      | Fastest |

For coding tasks, Q4_K_M offers the best balance of quality and speed.
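The sizes in the table roughly follow a simple rule: parameters × bits per weight, plus a few percent for quantization scales and metadata. A back-of-the-envelope estimator (the 5% overhead figure is an approximation, not a GGUF specification value):

```python
def approx_gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough GGUF file size: params * bits / 8, plus ~5% for scales and metadata."""
    return params_billion * bits_per_weight / 8 * 1.05

# A 7B model at ~4.5 effective bits per weight (roughly Q4_K_M) lands near 4 GB
print(f"{approx_gguf_size_gb(7, 4.5):.1f} GB")
```

This is useful for sanity-checking whether a given quantization will fit in VRAM before downloading it.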

Testing API Integrations Locally

Replace cloud APIs during development:

import os

from openai import OpenAI

# Switch between local and production
if os.getenv("USE_LOCAL_LLM"):
    client = OpenAI(
        base_url="http://localhost:11434/v1",
        api_key="ollama"
    )
    model = "qwen2.5-coder:7b"
else:
    client = OpenAI()  # Uses OPENAI_API_KEY
    model = "gpt-4"

# Your code works identically with both
response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": prompt}]
)
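Toggling then becomes a one-liner at invocation time (`app.py` is a placeholder for your entry point):

```shell
USE_LOCAL_LLM=1 python app.py   # local Qwen via Ollama
python app.py                   # production API
```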

Running in Docker

For reproducible environments:

FROM ollama/ollama:latest

# Pre-pull the model
RUN ollama serve & sleep 5 && ollama pull qwen2.5-coder:7b

EXPOSE 11434
CMD ["serve"]

Build and run:

docker build -t qwen-local .
docker run -d --gpus all -p 11434:11434 qwen-local

Performance Tips

1. Enable GPU Acceleration

Ensure Ollama uses your GPU:

# Check GPU detection
ollama ps

# For NVIDIA, ensure CUDA is available
nvidia-smi

2. Tune Context Length

Reduce context for faster responses when you don’t need it:

response = client.chat.completions.create(
    model="qwen2.5-coder:7b",
    messages=messages,
    max_tokens=500,  # Limit output
    # Ollama uses the model's default context
)

3. Use the Right Model for the Task

  • Quick completions: qwen2.5:0.5b or qwen2.5:3b
  • Code generation: qwen2.5-coder:7b
  • Complex reasoning: qwen2.5:14b or higher

Troubleshooting

Model Too Slow

# Check if GPU is being used
ollama ps

# Try a smaller quantization
ollama pull qwen2.5:3b-instruct-q4_0

Out of Memory

# Use a smaller model
ollama pull qwen2.5:1.5b

# Or reduce context via a Modelfile (a Modelfile needs a FROM line)
cat > Modelfile <<'EOF'
FROM qwen2.5:3b
PARAMETER num_ctx 2048
EOF
ollama create qwen-small -f Modelfile

Connection Refused

# Ensure Ollama is running
ollama serve

# Check the port
curl http://localhost:11434/api/tags

Conclusion

Local Qwen models provide:

  • Complete privacy for proprietary code
  • Zero ongoing costs after initial setup
  • Instant availability for offline development
  • Full control over model behavior and resources

The Qwen family’s excellent performance-to-size ratio makes it particularly suited for local development, where hardware constraints matter.

At Sajima Solutions, we help teams integrate AI into their development workflows—whether local, cloud, or hybrid. Contact us to optimize your AI-powered development environment.